Large Downloads in astroquery.mast#


For some programs stored in the MAST Archive, you may encounter issues when downloading data via the MAST Portal due to a large number of files. This applies particularly to JWST programs using Wide-Field Slitless Spectroscopy. It is preferable, and often necessary, to use an API to get this data instead. In this tutorial, we’ll use seemingly innocuous observations that expand into a considerable number of related files.

To that end, this notebook will demonstrate:

  • Searching the MAST Portal for observations using the astroquery.mast API

  • Retrieving associated data products without causing a timeout error

  • Downloading the desired subset of data products

Table of Contents#

  • Imports

  • Search the MAST Archive

  • Retrieve Associated Products

  • Filter and Download Products

  • Further Reading

Imports#

In order to run this notebook, we need:

  • astroquery.mast to access the MAST Archive

  • astropy.table to hold the results of our queries, combine them, and then filter them for unique products

from astroquery.mast import Observations
from astropy.table import unique, vstack, Table

Search the MAST Archive#

The first step to downloading the data is finding the observations we’re interested in. This is easiest to do using query_criteria, which allows us to specify criteria such as RA/Dec, filters, exposure time, and any other fields listed here.

In this example, we use query_criteria to find NIRCam observations from JWST Program 1073. When querying for JWST data, using obs_collection = 'JWST' greatly increases the speed of the search by decreasing the number of potential matches. This applies to all missions available in MAST, including HST.

matched_obs = Observations.query_criteria(
    obs_collection='JWST',
    proposal_id='1073',
    # Be sure to specify the full "instrument/mode" configuration!
    instrument_name='NIRCAM/IMAGE',
)
# This displays selected columns from the observation table, as a sanity check
columns = ['dataproduct_type', 'filters', 'calib_level', 't_exptime',
           'proposal_pi', 'intentType', 'obsid', 'instrument_name']
matched_obs[columns].show_in_notebook(display_length=5)
WARNING: AstropyDeprecationWarning: show_in_notebook() is deprecated as of 6.1 and to create
         interactive tables it is recommended to use dedicated tools like:
         - https://github.com/bloomberg/ipydatagrid
         - https://docs.bokeh.org/en/latest/docs/user_guide/interaction/widgets.html#datatable
         - https://dash.plotly.com/datatable [warnings]
Table length=15
idx  dataproduct_type  filters  calib_level  t_exptime  proposal_pi          intentType  obsid      instrument_name
0    image             F277W    3            343.576    Koekemoer, Anton M.  science     75900624   NIRCAM/IMAGE
1    image             F150W    3            343.576    Koekemoer, Anton M.  science     75914186   NIRCAM/IMAGE
2    image             F150W    3            515.364    Koekemoer, Anton M.  science     76622656   NIRCAM/IMAGE
3    image             F277W    3            515.364    Koekemoer, Anton M.  science     76622650   NIRCAM/IMAGE
4    image             F150W    3            343.576    Koekemoer, Anton M.  science     118344942  NIRCAM/IMAGE
5    image             F277W    3            343.576    Koekemoer, Anton M.  science     75884422   NIRCAM/IMAGE
6    image             F277W    3            773.046    Koekemoer, Anton M.  science     83254500   NIRCAM/IMAGE
7    image             F070W    3            773.046    Koekemoer, Anton M.  science     83254519   NIRCAM/IMAGE
8    image             F277W    3            343.576    Koekemoer, Anton M.  science     83254380   NIRCAM/IMAGE
9    image             F115W    3            343.576    Koekemoer, Anton M.  science     83254391   NIRCAM/IMAGE
10   image             F277W    3            171.788    Koekemoer, Anton M.  science     83254329   NIRCAM/IMAGE
11   image             F150W    3            171.788    Koekemoer, Anton M.  science     230523784  NIRCAM/IMAGE
12   image             F277W    3            1374.304   Koekemoer, Anton M.  science     83254838   NIRCAM/IMAGE
13   image             F150W    3            687.152    Koekemoer, Anton M.  science     83254839   NIRCAM/IMAGE
14   image             F115W    3            687.152    Koekemoer, Anton M.  science     83254840   NIRCAM/IMAGE

The above search returns 15 observations. Keep this number in mind as we search for associated products.

Retrieve Associated Products#

Each observation has associated data products. Which products are of interest to you depends on how you intend to use the data; more on this in the section below. For now, let’s retrieve all the products by requesting them in small “chunks”.

Note: It is wise to avoid requesting all of the products simultaneously. This is extremely likely to take an enormous amount of time, fail, or worse, do both, ultimately giving you a headache. MAST offers no medical advice, but we are decidedly anti-headache. Requesting products in groups of five offers the best balance between speed and reliability.

# Split the observations into "chunks" of size five
sz_chunk = 5
chunks = [matched_obs[i:i+sz_chunk] for i in range(0,len(matched_obs), sz_chunk)]

# Get the list of products for each chunk
t = [Observations.get_product_list(chunk) for chunk in chunks]

# Keep only the unique files
files = unique(vstack(t), keys='productFilename')

# How many files are there? How large are they?
print(f"There are {len(files)} unique files, which are {sum(files['size'])/10**9:.1f} GB in size.")
There are 6768 unique files, which are 299.4 GB in size.

Now the issue with requesting all of the products simultaneously is clear: there are more than 6,000 unique files associated with our 15 observations.

Running this search on the MAST Portal results in over 30,000 files since the Portal does not exclude duplicate results; that is nearing the limit of what the Portal can load. One of the advantages of using the API is avoiding this large number of duplicates.
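The chunked retrieval above can be wrapped in a small helper that also retries a chunk if the request fails. This is only a sketch: chunked, fetch_in_chunks, and fake_fetch are hypothetical names introduced here, and fake_fetch stands in for Observations.get_product_list so that the example runs without contacting MAST.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_chunks(items, fetch, size=5, retries=2, wait=0.0):
    """Call `fetch` on each chunk, retrying a failed chunk up to `retries` times."""
    results = []
    for chunk in chunked(items, size):
        for attempt in range(retries + 1):
            try:
                results.append(fetch(chunk))
                break
            except Exception:
                # Give up only after the last allowed attempt has failed.
                if attempt == retries:
                    raise
                time.sleep(wait)
    return results


# Hypothetical stand-in for Observations.get_product_list (assumption:
# the real call accepts a slice of the observation table, as shown above).
def fake_fetch(chunk):
    return [f"product-for-{obs}" for obs in chunk]


observation_ids = list(range(15))  # 15 placeholders, matching our 15 observations
per_chunk = fetch_in_chunks(observation_ids, fake_fetch, size=5)
print(len(per_chunk), [len(c) for c in per_chunk])
```

With astroquery, the same helper could presumably be called as fetch_in_chunks(matched_obs, Observations.get_product_list), with the returned list passed to vstack as before. We have not tested how the retry interacts with MAST's own timeouts.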

Filter and Download Products#

If you are trying to download proprietary data, you will need to log in. This requires a MAST token, which you can create at the auth.mast website. If you have not set this as an environment variable, you will have to enter it in the login prompt below.
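To check ahead of time whether a token is available, a sketch along these lines may help. It assumes the environment variable is named MAST_API_TOKEN, which is the name astroquery documents; find_mast_token is a hypothetical helper introduced here, not part of astroquery.

```python
import os


def find_mast_token(environ=None):
    """Return the token from MAST_API_TOKEN, or None if it is unset or blank."""
    environ = os.environ if environ is None else environ
    # Assumption: MAST_API_TOKEN is the variable astroquery looks for.
    token = environ.get("MAST_API_TOKEN", "").strip()
    return token or None


token = find_mast_token()
if token is None:
    print("No token found; Observations.login() would prompt for one.")
else:
    print("Token found; it could be passed as Observations.login(token=token).")
```

The token itself is never printed, which keeps it out of saved notebook output.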

In this example, we are looking to download the uncalibrated products. We will select those below using the productSubGroupDescription field. You can find the other available product filters, including product type and file size, here. Examples are also included, but commented out, in the cell below.
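Before choosing a filter, it can help to see how many files, and how much data, each product subgroup contains. The sketch below tallies rows by productSubGroupDescription and sums the size column in GB. summarize_by_subgroup is a hypothetical helper, and the sample rows are invented stand-ins for the files table so that the example runs offline; rows from the real table should work the same way as long as their size values are not masked.

```python
from collections import defaultdict


def summarize_by_subgroup(rows):
    """Return {subgroup: (file count, total size in GB)} for product rows."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    for row in rows:
        group = str(row["productSubGroupDescription"])
        counts[group] += 1
        sizes[group] += int(row["size"])  # sizes are in bytes
    return {g: (counts[g], sizes[g] / 10**9) for g in sorted(counts)}


# Invented rows for illustration only; these are not real MAST products.
sample_rows = [
    {"productFilename": "a_uncal.fits",
     "productSubGroupDescription": "UNCAL", "size": 2 * 10**9},
    {"productFilename": "b_uncal.fits",
     "productSubGroupDescription": "UNCAL", "size": 1 * 10**9},
    {"productFilename": "a_rate.fits",
     "productSubGroupDescription": "RATE", "size": 5 * 10**8},
]

for group, (count, gigabytes) in summarize_by_subgroup(sample_rows).items():
    print(f"{group}: {count} files, {gigabytes:.1f} GB")
```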

An additional option we make use of is the curl_flag argument. Rather than downloading the data immediately, this instead downloads a curl script. It is turned off by default, but it is more robust than a direct download and is highly recommended when downloading a large number of files. You can run this script using bash mastDownload_dddd.sh, changing dddd to reflect the actual name of your file.

# Un-comment below if downloading data during its exclusive access period.
#Observations.login()

manifest = Observations.download_products(
    files,
    productSubGroupDescription='UNCAL',
    curl_flag=True,
    # dataproduct_type='IMAGE',
    # calib_level=[2],
)
Downloading URL https://mast.stsci.edu/api/v0.1/Download/bundle.sh to ./mastDownload_20241021185444.sh ...
 [Done]
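Because the script name includes a timestamp, it can be convenient to locate the most recent one programmatically rather than typing it out. The following is a minimal sketch using only the standard library; newest_download_script and run_download_script are hypothetical helpers, and the sketch assumes that bash is available and that the script was saved to the current directory, as in the output above.

```python
import glob
import os
import subprocess


def newest_download_script(directory="."):
    """Return the most recently modified mastDownload_*.sh in `directory`, or None."""
    pattern = os.path.join(directory, "mastDownload_*.sh")
    scripts = glob.glob(pattern)
    return max(scripts, key=os.path.getmtime) if scripts else None


def run_download_script(path):
    """Run the script with bash from its own directory; raise if it exits non-zero."""
    folder = os.path.dirname(os.path.abspath(path))
    subprocess.run(["bash", os.path.basename(path)], cwd=folder, check=True)


script = newest_download_script()
print(script if script else "No mastDownload_*.sh script found here.")
# run_download_script(script)  # un-comment to start the actual download
```

The download call is left commented out so that running the cell does not start a transfer of several hundred GB by accident.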

All of the code in this notebook is available as a ‘companion script’, for further convenience.

Further Reading#

About this Notebook#

Authors: Thomas Dutkiewicz, Dick Shaw
Keywords: Downloads, astroquery, MAST
Last Updated: Aug 2022
Next Review Date: Feb 2023

