Download Wide Field Slitless Spectroscopy (WFSS) Data#

This notebook uses the Python astroquery.mast Observations class of the MAST API to query for the data products of a specific program. We are looking for NIRISS imaging and WFSS files from the NGDEEP program (ID 2079). The observations use three NIRISS blocking filters (F115W, F150W, and F200W) with both the GR150R and GR150C grisms. A WFSS observation sequence typically consists of a direct image followed by a grism observation in the same blocking filter to help identify the sources in the field. In program 2079, the exposure sequence follows the pattern: direct image -> GR150R -> direct image -> GR150C -> direct image.

Use case: use MAST to download data products.
Data: JWST/NIRISS images and spectra from program 2079 observation 004.
Tools: astropy, astroquery, glob, matplotlib, numpy, os, pandas, (yaml)
Cross-instrument: all

Content

  • Imports

  • Querying for Observations

    • Search with Proposal ID

    • Search with Observation ID

  • Filter and Download Products

    • Filtering Data Before Downloading

    • Downloading Data

  • Inspect Downloaded Data

Authors: Camilla Pacifici (cpacifici@stsci.edu) & Rachel Plesha (rplesha@stsci.edu) & Jo Taylor (jotaylor@stsci.edu)
First Published: May 2024

This notebook was inspired by the JWebbinar session about MAST.

Imports#

from astropy.io import fits
from astroquery.mast import Observations
from matplotlib import pyplot as plt
import numpy as np
import os
import glob
import pandas as pd

Querying for Observations#

The Observations class in astroquery.mast is used to download JWST data. Use the get_metadata function to see the available search options and their descriptions.

Note that for JWST, the instrument names have a specific format. More information about that can be found at: https://outerspace.stsci.edu/display/MASTDOCS/JWST+Instrument+Names

Observations.get_metadata("observations")
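If you are looking for a particular search field, you can also filter the returned metadata table directly. Below is a quick sketch, assuming the table includes a 'Column Name' column (true at the time of writing):

# Sketch: look up the description of a single search field
meta = Observations.get_metadata("observations")
print(meta[meta['Column Name'] == 'instrument_name'])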

The two most common ways to download specific datasets are by using the proposal ID or by using the observation ID.

Search with Proposal ID#

# Select the proposal ID, instrument, and some useful keywords (filters in this case).
obs_table = Observations.query_criteria(obs_collection=["JWST"], 
                                        instrument_name=["NIRISS/IMAGE", "NIRISS/WFSS"],
                                        provenance_name=["CALJWST"], # Executed observations
                                        filters=['F115W', 'F150W', 'F200W'],
                                        proposal_id=[2079],
                                        )

print(len(obs_table), 'matching observations found')
# look at what was returned by this query for a selection of columns of interest
obs_table[['obs_collection', 'instrument_name', 'filters', 'target_name', 'obs_id', 's_ra', 's_dec', 't_exptime', 'proposal_id']]

Search with Observation ID#

Because of how JWST filenames are structured, searching on the observation ID (obs_id) lets us filter on both the proposal ID and the observation number at once. More information about the JWST file naming conventions can be found at: https://jwst-pipeline.readthedocs.io/en/latest/jwst/data_products/file_naming.html. For the purposes of this notebook series, we will use only one of the two observations (004) in program 2079.

Additionally, the search criteria accept wildcards. For example, instead of specifying both “NIRISS/IMAGE” and “NIRISS/WFSS”, we can specify “NIRISS*”, which matches both modes. The wildcard also works within the obs_id, so we do not have to list all of the individual observation IDs.

# Obtain a list to download from a specific list of observation IDs instead
obs_id_table = Observations.query_criteria(instrument_name=["NIRISS*"],
                                           provenance_name=["CALJWST"], # Executed observations
                                           obs_id=['jw02079-o004*'], # Searching for PID 2079 observation 004
                                           ) 

# this number will change with JWST pipeline and reference file updates
print(len(obs_id_table), 'matching observations found') # ~613 at the time of writing

Filter and Download Products#

If we try to download too many files at once, the API will time out. Instead, it is better to divide the observations into batches and download them one batch at a time.

batch_size = 5 # 5 observations at a time maximizes the download speed.

# Let's split up our list of observations, ``obs_id_table``, into batches according to our batch size.
obs_batches = [obs_id_table[i:i+batch_size] for i in range(0, len(obs_id_table), batch_size)]
print("How many batches?", len(obs_batches))

single_group = obs_batches[0] # Useful to inspect the observations obtained in one batch
print("Inspect the first batch to ensure that it matches what you expect to download:") 
single_group['obs_collection', 'instrument_name', 'filters', 'target_name', 'obs_id', 's_ra', 's_dec', 't_exptime', 'proposal_id']

Select the type of products needed; a sketch showing how these options translate into an API call follows the list. The various levels are:

  • uncalibrated files

    • productType=[“SCIENCE”]

    • productSubGroupDescription=[‘UNCAL’]

    • calib_level=[1]

  • rate images

    • productType=[“SCIENCE”]

    • productSubGroupDescription=[‘RATE’]

    • calib_level=[2]

  • level 2 associations for both spectroscopy and imaging

    • productType=[“INFO”]

    • productSubGroupDescription=[‘ASN’]

    • calib_level=[2]

  • level 3 associations for imaging

    • productType=[“INFO”]

    • productSubGroupDescription=[‘ASN’]

    • dataproduct_type=[“image”]

    • calib_level=[3]
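As a quick sketch of how these options map onto an Observations.filter_products call, here is the level 3 imaging association case applied to a single observation (the choice of obs_id_table[0] is purely illustrative):

# Sketch: filter one observation's products down to its level 3 imaging associations
example_products = Observations.get_product_list(obs_id_table[0])
example_image3_asn = Observations.filter_products(example_products,
                                                  productType=["INFO"],
                                                  productSubGroupDescription=["ASN"],
                                                  dataproduct_type=["image"],
                                                  calib_level=[3])
print(example_image3_asn['productFilename'])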

Filtering Data Before Downloading#

# creating a dictionary of the above information to use for inspection of the filtering function
file_dict = {'uncal': {'product_type': 'SCIENCE',
                       'productSubGroupDescription': 'UNCAL',
                       'calib_level': [1]},
             'rate': {'product_type': 'SCIENCE',
                      'productSubGroupDescription': 'RATE',
                      'calib_level': [2]},
             'level2_association': {'product_type': 'INFO',
                                    'productSubGroupDescription': 'ASN',
                                    'calib_level': [2]},
             'level3_association': {'product_type': 'INFO',
                                    'productSubGroupDescription': 'ASN',
                                    'calib_level': [3]},
             }
# Look at the files existing for each of these different levels
files_to_download = []
for index, batch_exposure in enumerate(single_group):
    
    print('*'*50)
    print(f"Exposure #{index+1} ({batch_exposure['obs_id']})")
    # pull out the product names from the list to filter
    products = Observations.get_product_list(batch_exposure)
    
    for filetype, query_dict in file_dict.items():
        print('File type:', filetype)
        filtered_products = Observations.filter_products(products,
                                                         productType=query_dict['product_type'],
                                                         productSubGroupDescription=query_dict['productSubGroupDescription'],
                                                         calib_level=query_dict['calib_level'],
                                                         )
        print(filtered_products['productFilename'])
        files_to_download.extend(filtered_products['productFilename'])
        print()
    print('*'*50)

From above, we can see that each exposure in the observation list (obs_id_table) has many associated files that need to be downloaded as well. This is why we download in batches.
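To make that concrete, here is a quick sketch that counts the products attached to just the first exposure (the exact number will vary with pipeline and reference file updates):

# Sketch: count every product associated with a single observation
products_one = Observations.get_product_list(single_group[0])
print(len(products_one), 'products associated with', single_group[0]['obs_id'])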

Downloading Data#

To actually download the products, provide Observations.download_products() with a list of the filtered products.

Typically, no adjustments to the detector1 pipeline are needed, so we can start from its outputs, the rate files, rather than from the uncal files. Because of this, we only need to download the rate and association files. If you need to rerun the detector1 pipeline, adjust productSubGroupDescription and calib_level in the Observations.filter_products call to download the uncal files instead; a sketch of that adjusted call follows.
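For reference, a sketch of the adjusted filter for the uncal case; it assumes products is a product list from Observations.get_product_list, as in the download loop below:

# Sketch: swap RATE for UNCAL and include calib_level=1 to fetch the raw files instead
filtered_uncal_products = Observations.filter_products(products,
                                                       productType=["SCIENCE", "INFO"],
                                                       productSubGroupDescription=["UNCAL", "ASN"],
                                                       calib_level=[1, 2, 3])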

If the data are proprietary, you may also need to set up your API token. NEVER commit your token to a public repository. One alternative is to store the token in a separate configuration file (e.g., config_file.yaml) that is readable only to you and contains the key ‘mast_token’ with your API token as its value.

To create a new API token, visit the following link: https://auth.mast.stsci.edu/token?suggested_name=Astroquery&suggested_scope=mast:exclusive_access
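A minimal sketch of the configuration-file approach, assuming a config_file.yaml (hypothetical name; keep it out of version control) sitting next to this notebook:

# Sketch: read a MAST API token from a private YAML file and log in
import yaml

with open('config_file.yaml') as f:
    mast_token = yaml.safe_load(f)['mast_token'] # assumed key name: 'mast_token'
Observations.login(token=mast_token)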

Note that astroquery version 0.4.7 or later is required to use flat=True when downloading the data. If you prefer to use an earlier version, remove that keyword from the call, download the data, and move all of the files from the downloaded subdirectories into the single directory defined by the download_dir variable.

# check that the version is at least 0.4.7. See the note above for more information
import astroquery
astroquery.__version__
download_dir = 'data'

# make sure the download directory exists; if not, write a new directory
if not os.path.exists(download_dir):
    os.mkdir(download_dir)
# Now let's get the products for each batch of observations, and filter down to only the products of interest.
for index, batch in enumerate(obs_batches):
    
    # Progress indicator...
    print('\n'+f'Batch #{index+1} / {len(obs_batches)}')
    
    # Make a list of the `obsid` identifiers from our Astropy table of observations to get products for.
    obsids = batch['obsid']
    print('Working with the following obsids:')
    for number, obs_text in zip(obsids, batch['obs_id']):
        print(f"{number} : {obs_text}")
    
    # Get list of products 
    products = Observations.get_product_list(obsids)
    
    # Filter the products to get only the products of interest
    filtered_products = Observations.filter_products(products, 
                                                     productType=["SCIENCE", "INFO"],
                                                     productSubGroupDescription=["RATE", "ASN"], # Not using "UNCAL" here since we can start with the rate files
                                                     calib_level=[2, 3] # not using 1 here since not getting the UNCAL files
                                                     )
    
    # Download products for these records.
    # The `flat=True` option stores all files in a single directory specified by `download_dir`.
    manifest = Observations.download_products(filtered_products,
                                              download_dir=download_dir,
                                              flat=True, # astroquery v0.4.7 or later only
                                              ) 
    print('Products downloaded:\n', filtered_products['productFilename'])

If you are continuing on with the WFSS notebook series, let’s double check that all of the files needed for the remaining notebooks have been downloaded. There should be 149 files.

downloaded_files = glob.glob(os.path.join(download_dir, '*.fits')) + glob.glob(os.path.join(download_dir, '*.json'))
print(len(downloaded_files), 'files downloaded to:', download_dir)
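As an optional sanity check, here is a sketch that tallies the downloads by the suffix of the JWST filename (e.g., rate.fits vs. asn.json):

# Sketch: tally the downloaded files by their filename suffix
from collections import Counter
suffix_counts = Counter(os.path.basename(f).split('_')[-1] for f in downloaded_files)
print(suffix_counts)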

Inspect Downloaded Data#

The code below collects useful header keywords from each downloaded rate file into a pandas dataframe to give you a better idea of what data are available. Additionally, you can use this dataframe to select the specific files that match the mode you would like to take a closer look at; a short selection example follows the dataframe display below.

ratefile_datadir = 'data/'

# first look for all of the rate files you have downloaded
rate_files = np.sort(glob.glob(os.path.join(ratefile_datadir, "*rate.fits")))

for file_num, ratefile in enumerate(rate_files):

    rate_hdr = fits.getheader(ratefile) # Primary header for each rate file

    # information we want to store that might be useful to us later for evaluating the data
    temp_hdr_dict = {"FILENAME": ratefile,
                     "TARG_RA": [rate_hdr["TARG_RA"]],
                     "TARG_DEC": [rate_hdr["TARG_DEC"]],
                     "FILTER": [rate_hdr["FILTER"]], # Grism; GR150R/GR150C
                     "PUPIL": [rate_hdr["PUPIL"]], # Filter used; F090W, F115W, F140M, F150W F158M, F200W
                     "EXPSTART": [rate_hdr['EXPSTART']], # Exposure start time (MJD)
                     "PATT_NUM": [rate_hdr["PATT_NUM"]], # Position number within dither pattern for WFSS
                     "NUMDTHPT": [rate_hdr["NUMDTHPT"]], # Total number of points in entire dither pattern
                     "XOFFSET": [rate_hdr["XOFFSET"]], # X offset from pattern starting position for NIRISS (arcsec)
                     "YOFFSET": [rate_hdr["YOFFSET"]], # Y offset from pattern starting position for NIRISS (arcsec)
                     }

    # Turn the dictionary into a pandas dataframe
    if file_num == 0:
        # if this is the first file, make an initial dataframe
        rate_df = pd.DataFrame(temp_hdr_dict)
    else:
        # otherwise, append to the dataframe for each file
        new_data_df = pd.DataFrame(temp_hdr_dict)

        # merge the two dataframes together to create a dataframe with all 
        rate_df = pd.concat([rate_df, new_data_df], ignore_index=True, axis=0)

rate_dfsort = rate_df.sort_values('EXPSTART', ignore_index=False)

# Save the dataframe to a file to read in later, if desired
outfile = './list_ngdeep_rate.csv'
rate_dfsort.to_csv(outfile, sep=',', index=False)
print('Saved:', outfile)

# Look at the resulting dataframe
rate_dfsort
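For example, here is a quick sketch that uses the dataframe to pick out only the GR150R dispersed images taken with the F115W blocking filter:

# Sketch: select the rows matching one grism/blocking-filter combination
gr150r_f115w = rate_df[(rate_df['FILTER'] == 'GR150R') & (rate_df['PUPIL'] == 'F115W')]
gr150r_f115w['FILENAME']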

In particular, let’s look at the observation sequence these rate files follow. We sorted the files above by exposure start time (EXPSTART), so they should already be in time order in the dataframe.

FILTER=CLEAR indicates a direct image, while FILTER=GR150R or FILTER=GR150C indicates a dispersed image. PUPIL is the blocking filter used. The first 14 exposures make up the first sequence set of direct image -> grism -> direct image -> grism. There are also multiple dither positions for both the dispersed images and the direct images. The multiple direct image dithers will be combined in image3, while the multiple dispersed images can be combined in spec3.

rate_df[['EXPSTART', 'FILTER', 'PUPIL', 'PATT_NUM', 'XOFFSET', 'YOFFSET']].head(14)

Shown below are the first 14 rate files to give an idea of the above sequence visually. Grid lines are shown as a visual guide for any dithers that are made.

# plot set up
fig = plt.figure(figsize=(20, 35))
cols = 3
rows = int(np.ceil(14 / cols))

# loop over the first 14 rate files and plot them
for plt_num, rf in enumerate(rate_dfsort[0:14]['FILENAME']):

    # determine where the subplot should be
    xpos = plt_num % cols
    ypos = plt_num // cols # integer division gives the row index

    # make the subplot
    ax = plt.subplot2grid((rows, cols), (ypos, xpos))

    # open the data and plot it
    with fits.open(rf) as hdu:
        data = hdu[1].data
        data[np.isnan(data)] = 0 # filling in nan data with 0s to help with the matplotlib color scale.
        
        ax.imshow(data, vmin=0, vmax=1.5, origin='lower')

        # adding in grid lines as a visual aid
        for gridline in [500, 1000, 1500]:
            ax.axhline(gridline, color='black', alpha=0.5)
            ax.axvline(gridline, color='black', alpha=0.5)

        ax.set_title(f"#{plt_num+1}: {hdu[0].header['FILTER']} {hdu[0].header['PUPIL']} Dither{hdu[0].header['PATT_NUM']}")