Large Downloads in astroquery.mast#
For some programs stored in the MAST Archive, you may encounter issues when downloading data via the MAST Portal due to the large number of files. This applies particularly to JWST programs using Wide-Field Slitless Spectroscopy. It is preferable, and often necessary, to use an API to get this data instead. In this tutorial, we’ll use seemingly innocuous observations that expand into a considerable number of related files.
To that end, this notebook will demonstrate:
Searching the MAST Portal for observations using the astroquery.mast API
Retrieving associated data products, without causing a timeout error
Downloading the desired subset of data products
Table of Contents#
Imports
Search the MAST Archive
Retrieve Associated Products
Filter and Download Products
Further Reading
Imports#
In order to run this notebook, we need:
astroquery.mast to access the MAST Archive
astropy.table to hold the results of our queries, combine them, and then filter them for unique products
from astroquery.mast import Observations
from astropy.table import unique, vstack, Table
Search the MAST Archive#
The first step to downloading the data is finding the observations we’re interested in. This is easiest to do using query_criteria, which allows us to specify criteria such as RA/Dec, filters, exposure time, and any other fields listed here.
In this example, we use query_criteria to find NIRCam observations from JWST Program 1073. When querying for JWST data, using obs_collection = 'JWST' greatly increases the speed of the search by decreasing the number of potential matches. This applies to all missions available in MAST, including HST.
matched_obs = Observations.query_criteria(
    obs_collection='JWST',
    proposal_id='1073',
    instrument_name='NIRCAM/IMAGE',  # Be sure to specify the full "instrument/mode" configuration!
)
# This displays selected columns from the observation table, as a sanity check
columns = ['dataproduct_type', 'filters', 'calib_level', 't_exptime', 'proposal_pi', 'intentType', 'obsid','instrument_name']
matched_obs[columns].show_in_notebook(display_length=5)
WARNING: AstropyDeprecationWarning: show_in_notebook() is deprecated as of 6.1 and to create
interactive tables it is recommended to use dedicated tools like:
- https://github.com/bloomberg/ipydatagrid
- https://docs.bokeh.org/en/latest/docs/user_guide/interaction/widgets.html#datatable
- https://dash.plotly.com/datatable [warnings]
idx | dataproduct_type | filters | calib_level | t_exptime | proposal_pi | intentType | obsid | instrument_name |
---|---|---|---|---|---|---|---|---|
0 | image | F277W | 3 | 343.576 | Koekemoer, Anton M. | science | 75900624 | NIRCAM/IMAGE |
1 | image | F150W | 3 | 343.576 | Koekemoer, Anton M. | science | 75914186 | NIRCAM/IMAGE |
2 | image | F150W | 3 | 515.364 | Koekemoer, Anton M. | science | 76622656 | NIRCAM/IMAGE |
3 | image | F277W | 3 | 515.364 | Koekemoer, Anton M. | science | 76622650 | NIRCAM/IMAGE |
4 | image | F150W | 3 | 343.576 | Koekemoer, Anton M. | science | 118344942 | NIRCAM/IMAGE |
5 | image | F277W | 3 | 343.576 | Koekemoer, Anton M. | science | 75884422 | NIRCAM/IMAGE |
6 | image | F277W | 3 | 773.046 | Koekemoer, Anton M. | science | 83254500 | NIRCAM/IMAGE |
7 | image | F070W | 3 | 773.046 | Koekemoer, Anton M. | science | 83254519 | NIRCAM/IMAGE |
8 | image | F277W | 3 | 343.576 | Koekemoer, Anton M. | science | 83254380 | NIRCAM/IMAGE |
9 | image | F115W | 3 | 343.576 | Koekemoer, Anton M. | science | 83254391 | NIRCAM/IMAGE |
10 | image | F277W | 3 | 171.788 | Koekemoer, Anton M. | science | 83254329 | NIRCAM/IMAGE |
11 | image | F150W | 3 | 171.788 | Koekemoer, Anton M. | science | 230523784 | NIRCAM/IMAGE |
12 | image | F277W | 3 | 1374.304 | Koekemoer, Anton M. | science | 83254838 | NIRCAM/IMAGE |
13 | image | F150W | 3 | 687.152 | Koekemoer, Anton M. | science | 83254839 | NIRCAM/IMAGE |
14 | image | F115W | 3 | 687.152 | Koekemoer, Anton M. | science | 83254840 | NIRCAM/IMAGE |
The above search results in 15 observations. Keep this number in mind as we search for associated products.
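As a further sanity check, it can be useful to tally the observations by filter. The sketch below uses only the standard library. The list of filter names is copied by hand from the 15 rows displayed above; with a live query, passing matched_obs['filters'] to Counter should give the same tally, though that is an assumption not tested here.

```python
from collections import Counter

# Filter names copied by hand from the 15 rows displayed above
filters = [
    "F277W", "F150W", "F150W", "F277W", "F150W",
    "F277W", "F277W", "F070W", "F277W", "F115W",
    "F277W", "F150W", "F277W", "F150W", "F115W",
]

# Tally the observations per filter, most common first
filter_counts = Counter(filters)
for name, count in filter_counts.most_common():
    print(f"{name}: {count}")
```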
Retrieve Associated Products#
Each observation has associated data products. Which products are of interest to you depends on how you intend to use the data; more on this in the section below. For now, let’s retrieve all the products by requesting them in small “chunks”.
Note: It is wise to avoid requesting all of the products simultaneously. This is extremely likely to take an enormous amount of time, fail, or worse, do both, ultimately giving you a headache. MAST offers no medical advice, but we are decidedly anti-headache. Requesting products in groups of five offers the best balance between speed and reliability.
# Split the observations into "chunks" of size five
sz_chunk = 5
chunks = [matched_obs[i:i+sz_chunk] for i in range(0,len(matched_obs), sz_chunk)]
# Get the list of products for each chunk
t = [Observations.get_product_list(chunk) for chunk in chunks]
# Keep only the unique files
files = unique(vstack(t), keys='productFilename')
# How many files are there? How large are they?
print(f"There are {len(files)} unique files, which are {sum(files['size'])/10**9:.1f} GB in size.")
There are 6768 unique files, which are 299.4 GB in size.
Now the issue with requesting all of the products simultaneously is clear: there are more than 6,000 unique files associated with our 15 observations.
Running this search on the MAST Portal results in over 30,000 files since the Portal does not exclude duplicate results; that is nearing the limit of what the Portal can load. One of the advantages of using the API is avoiding this large number of duplicates.
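If a single chunk times out, the list comprehension used above fails as a whole. Here is a minimal sketch of one way to retry only the chunk that failed. The helpers chunked and fetch_in_chunks are hypothetical names, not part of astroquery, and flaky_fetch is a stand-in that simulates one timeout so the example runs without network access.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_chunks(items, fetch, size=5, retries=2, wait=0.0):
    """Call `fetch` on each chunk, retrying a failed chunk up to `retries` times."""
    results = []
    for chunk in chunked(items, size):
        for attempt in range(retries + 1):
            try:
                results.append(fetch(chunk))
                break  # this chunk succeeded, move on to the next one
            except Exception:  # deliberately broad, for illustration only
                if attempt == retries:
                    raise  # give up after the last allowed attempt
                time.sleep(wait)
    return results


# Stand-in for a flaky service: fails on its first call, then succeeds
calls = {"n": 0}


def flaky_fetch(chunk):
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("simulated timeout")
    return list(chunk)


demo = fetch_in_chunks(list(range(12)), flaky_fetch, size=5)
print(demo)
```

Assuming the table and function used in this notebook, the equivalent call would look like t = fetch_in_chunks(matched_obs, Observations.get_product_list, size=5, wait=10). Catching every Exception is a broad choice made for brevity; it is worth narrowing once you know which errors your session actually raises.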
Filter and Download Products#
If you are trying to download proprietary data, you will need to log in. This requires a MAST token, which you can create at the auth.mast website. If you have not set this as an environment variable, you will have to enter it in the login prompt below.
In this example, we are looking to download the uncalibrated products. We will select those below using the productSubGroupDescription field. You can find the other available product filters, including product type and file size, here. Examples are also included, but commented out, in the cell below.
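Before choosing a filter, it can help to see how the files and their total size split across product subgroups. The sketch below is one way to do that in plain Python. The helper size_by_group is a hypothetical name, and the sample rows are made up so the example runs on its own. It assumes each row can be indexed by column name and that the size column is populated for every row; rows of the files table from this notebook should satisfy the first assumption, but this has not been tested here.

```python
from collections import defaultdict


def size_by_group(rows, group_key="productSubGroupDescription", size_key="size"):
    """Return {group: (file_count, total_bytes)} for rows indexable by column name."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        entry = totals[str(row[group_key])]
        entry[0] += 1                   # one more file in this group
        entry[1] += int(row[size_key])  # add this file's size in bytes
    return {group: tuple(entry) for group, entry in totals.items()}


# Made-up rows standing in for the product table
sample = [
    {"productSubGroupDescription": "UNCAL", "size": 2_000_000_000},
    {"productSubGroupDescription": "CAL", "size": 500_000_000},
    {"productSubGroupDescription": "UNCAL", "size": 1_000_000_000},
]

summary = size_by_group(sample)
for group, (count, nbytes) in sorted(summary.items()):
    print(f"{group}: {count} files, {nbytes / 10**9:.1f} GB")
```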
An additional option we make use of is the curl_flag argument. Rather than downloading the data immediately, this method instead downloads a curl script. This is turned off by default, but is more robust than a direct download, and is highly recommended when downloading a large number of files. You can run this script using bash mastDownload_dddd.sh, changing dddd to reflect the actual name of your file.
# Un-comment below if downloading data during its exclusive access period.
# Observations.login()
manifest = Observations.download_products(
    files,
    productSubGroupDescription='UNCAL',
    curl_flag=True,
    # dataproduct_type='IMAGE',
    # calib_level=[2],
)
Downloading URL https://mast.stsci.edu/api/v0.1/Download/bundle.sh to ./mastDownload_20241021185444.sh ...
[Done]
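If several download scripts accumulate in the working directory, a small helper can pick out the most recent one. This is a sketch under one assumption: that every script name embeds a fixed-width timestamp like the one in the output above, so that sorting by name is the same as sorting by time. The function latest_download_script is a hypothetical name, and the demonstration uses empty, made-up files in a temporary directory.

```python
import tempfile
from pathlib import Path


def latest_download_script(directory="."):
    """Return the last mastDownload_*.sh by name in `directory`, or None."""
    scripts = sorted(Path(directory).glob("mastDownload_*.sh"))
    return scripts[-1] if scripts else None


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    no_script = latest_download_script(tmp)  # the directory is still empty
    for name in ("mastDownload_20241021185444.sh", "mastDownload_20230101000000.sh"):
        (Path(tmp) / name).touch()
    newest_name = latest_download_script(tmp).name

print(newest_name)
```

The returned path can then be passed to bash, for example with subprocess.run(["bash", str(script)], check=True).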
All of the code in this notebook is available as a ‘companion script’, for further convenience.
Further Reading#
For a full explanation of product levels and the processing pipeline, see Science Data Products.
About this Notebook#
Authors: Thomas Dutkiewicz, Dick Shaw
Keywords: Downloads, astroquery, MAST
Last Updated: Aug 2022
Next Review Date: Feb 2023