# Roman Research Nexus Data Discovery and Access in the Cloud 


***

## Kernel Information and Read-Only Status

To run this notebook, please select the "Roman Research Nexus" kernel at the top right of your window.

This notebook is read-only. You can run cells and make edits, but you must save changes to a different location. We recommend saving the notebook within your home directory, or to a new folder within your home (e.g. <span style="font-variant:small-caps;">file > save notebook as > my-nbs/nb.ipynb</span>). Note that a directory must exist before you attempt to add a notebook to it.

## Imports
Here we import the required packages for our data access examples:
- *asdf* for accessing ASDF files
- *astropy.io.fits* for accessing FITS files
- *astroquery.mast.Observations* for accessing, searching, and selecting data from other missions
- *s3fs* for streaming in data directly from the cloud
- *roman_datamodels* for opening Roman ASDF files. You can find additional information on how to work with ASDF files in the Working with ASDF notebook tutorial.

In [1]:
import asdf
from astropy.io import fits
from astroquery.mast import Observations
import s3fs
import roman_datamodels as rdm

***

## Introduction
This notebook provides examples of accessing data from the Research Nexus. Due to its survey nature, the Roman Space Telescope will produce large volumes of data that will need to be easily and quickly accessed to perform scientific tasks like creating catalogs, performing difference imaging, generating light curves, etc. Downloading all the required data would burden most users by requiring excessive data storage solutions (likely >10 TB).

This notebook demonstrates how to stream data from the cloud directly into memory, bypassing the need to download the data locally and use excess storage. This method of cloud-based data access is *HIGHLY* recommended. However, we understand that some use-cases will require downloading the data locally, so we provide an example at the end of the notebook.

During operations, each Roman data file will be given a Uniform Resource Identifier (URI), an analog to an online filepath that is similar to a URL, which points to where the data is hosted on the AWS cloud. Users will retrieve these URIs from one of several sources including MAST (see [Accessing WFI Data](https://roman-docs.stsci.edu/data-handbook-home/accessing-wfi-data) for more information) and will be able to use the URI to access the desired data from the cloud.

Here we examine how to access data from two types of sources:
- The STScI MAST server which hosts data for in-flight telescopes including Hubble, TESS, and JWST and will host Roman data in the future
- Simulated Roman Space Telescope data hosted in storage containers on the AWS cloud

### Defining terms
- *Cloud computing*: the practice of using a network of remote servers hosted on the internet to store, manage, and process data, rather than using a local server or a personal computer.
- *AWS*: Amazon Web Services (AWS) is the cloud computing platform provided by Amazon.
- *URI*: a Uniform Resource Identifier (URI) is a sequence of characters that identifies a name or a unique resource on the Internet. URLs for websites are a subclass of URIs.
- *AWS S3*: Amazon Simple Storage Service (S3) is a scalable and cost-effective object storage service on the AWS cloud platform. Storage containers within S3 are known as "buckets," so we often refer to these storage devices as "S3 buckets" or "S3 servers".
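
To make the URI definition concrete, the sketch below splits an S3 URI into its bucket name and object key using only the Python standard library. The helper name `split_s3_uri` is hypothetical and is not part of any package used in this notebook; the example URI is one of the HST files that appears later in this notebook.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an S3 URI into (bucket, key). Hypothetical helper for illustration."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # The network location is the bucket; the path without its leading slash is the object key
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri("s3://stpubdata/hst/public/ib6w/ib6wd4lrq/ib6wd4lrq_drz.fits")
print(bucket)
print(key)
```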

***

## Accessing MAST Data
In this section, we will go through the steps to retrieve archived MAST data from the cloud, including how to query the archive and stream the files directly from the cloud, as well as download them locally.

### Enabling Cloud Access
The most important step for accessing data from the cloud is to enable *astroquery* to retrieve URIs and other relevant cloud information. Even if we are working locally and plan to download the data files (not recommended for Roman data), we need to use this command to copy the file locations.

In [2]:
Observations.enable_cloud_dataset()

INFO: Using the S3 STScI public dataset [astroquery.mast.cloud]


### Querying MAST
Now we are ready to begin our query. This example is rather simple, but it is quick and easy to reproduce. We will be querying HST WFC3/IR data of M85. In practice, the science platform should primarily be used for analyzing and exploring Roman data products. However, due to the smaller file sizes, HST WFC3/IR data provides a nice example. The process is identical regardless of which space telescope is used.

In our query, we specify that we want HST data taken with WFC3/IR in the F160W filter. We also specify the proposal ID to easily get the data of interest. Once we have the desired observations, we gather the list of products that go into them. We then filter the products to keep only the level 3 science images associated with a specific project, which still leaves us with 60 data products.

In [3]:
# query MAST for matching observations
obs = Observations.query_criteria(obs_collection='HST',
                                  filters='F160W',
                                  instrument_name='WFC3/IR',
                                  proposal_id=['11360'],
                                  dataRights='PUBLIC')
# get the list of products (files)
products = Observations.get_product_list(obs)

# filter the products
filtered = Observations.filter_products(products,
                                        calib_level=[3], 
                                        productType=['SCIENCE'], 
                                        dataproduct_type=['image'], 
                                        project=['CALWF3'])
print('Filtered data products:\n', filtered, '\n')

# filter for just one product
single =  Observations.filter_products(filtered,
                                       obsID='24797441')
print('Single data product:\n', single, '\n')

Filtered data products:
  obsID   obs_collection dataproduct_type ... dataRights calib_level filters
-------- -------------- ---------------- ... ---------- ----------- -------
23831959            HST            image ...     PUBLIC           3   F160W
23831959            HST            image ...     PUBLIC           3   F160W
23831961            HST            image ...     PUBLIC           3   F160W
23831961            HST            image ...     PUBLIC           3   F160W
23831988            HST            image ...     PUBLIC           3   F160W
23831988            HST            image ...     PUBLIC           3   F160W
23831990            HST            image ...     PUBLIC           3   F160W
23831990            HST            image ...     PUBLIC           3   F160W
23832009            HST            image ...     PUBLIC           3   F160W
23832009            HST            image ...     PUBLIC           3   F160W
     ...            ...              ... ...        ...        

Now that we have our desired products, we can gather the URIs for each of the files which indicate their locations in the MAST AWS S3 servers.

In [4]:
uris = Observations.get_cloud_uris(filtered)
uris

INFO: 30 of 60 products were duplicates. Only returning 30 unique product(s). [astroquery.mast.utils]


['s3://stpubdata/hst/public/ib6w/ib6wd4lrq/ib6wd4lrq_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6wd4ltq/ib6wd4ltq_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6wd6f5q/ib6wd6f5q_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6wd6f7q/ib6wd6f7q_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6we1p9q/ib6we1p9q_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6we1paq/ib6we1paq_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6wr8kdq/ib6wr8kdq_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6wr8kfq/ib6wr8kfq_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6w11050/ib6w11050_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6w21050/ib6w21050_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6w37040/ib6w37040_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6w39010/ib6w39010_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6w41050/ib6w41050_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6w61070/ib6w61070_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6w71040/ib6w71040_drz.fits',
 's3://stpubdata/hst/public/ib6w/ib6w810

The `get_cloud_uris` method checks for duplicates in the provided products to minimize the data access volume. It is also important to note that `get_cloud_uris` will always return a list. Thus, we need to extract an individual URI string to access the file. Here we choose the first URI, but in practice you would select the URI associated with the desired file.

In [5]:
uri = uris[0]
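
As a minimal sketch of selecting a specific file rather than the first entry, the helper below searches a list of URIs for a file name fragment. The function name `find_uri` is hypothetical, and the short list is a stand-in built from three of the URIs printed above; in the notebook you would pass the full `uris` list instead.

```python
def find_uri(uris, name_fragment):
    """Return the single URI whose file name contains name_fragment (hypothetical helper)."""
    matches = [u for u in uris if name_fragment in u.rsplit("/", 1)[-1]]
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one match for {name_fragment!r}, found {len(matches)}"
        )
    return matches[0]


# Stand-in list using three of the URIs printed above
example_uris = [
    "s3://stpubdata/hst/public/ib6w/ib6wd4lrq/ib6wd4lrq_drz.fits",
    "s3://stpubdata/hst/public/ib6w/ib6wd4ltq/ib6wd4ltq_drz.fits",
    "s3://stpubdata/hst/public/ib6w/ib6w11050/ib6w11050_drz.fits",
]
selected = find_uri(example_uris, "ib6w11050")
print(selected)
```

Raising an error when zero or several URIs match makes an ambiguous fragment visible immediately instead of silently picking the wrong file.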

### Streaming files directly into memory
Here, we will use `fsspec` to directly access the data stored in the AWS S3 servers. Because the URI points to a FITS file, we can use `fits.open` to access the information in the file.

In [6]:
with fits.open(uri, 'readonly', fsspec_kwargs={"anon":True}) as HDUlist:
    HDUlist.info()
    sci = HDUlist[1].data
    
type(sci)

Filename: <class 's3fs.core.S3File'>
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU     770   ()      
  1  SCI           1 ImageHDU        90   (543, 484)   float32   
  2  WHT           1 ImageHDU        45   (543, 484)   float32   
  3  CTX           1 ImageHDU        40   (543, 484)   int32   
  4  HDRTAB        1 BinTableHDU    561   1R x 276C   [9A, 3A, K, D, D, D, D, D, D, D, D, D, D, D, D, K, 8A, 4A, 1A, 4A, D, D, D, D, D, 3A, D, D, D, D, D, D, D, D, D, D, D, D, K, K, D, 3A, D, D, D, D, K, K, 8A, 23A, 11A, 19A, 4A, D, D, K, K, D, D, D, D, 23A, D, D, D, D, K, K, D, 3A, 8A, L, D, D, D, 23A, 1A, D, D, D, D, D, D, 12A, 12A, 8A, 23A, D, D, 10A, 10A, D, D, D, 2A, 23A, 3A, 4A, 8A, 7A, D, K, D, 6A, 9A, D, D, D, 4A, 18A, 3A, K, 5A, D, D, D, 8A, D, 3A, D, D, D, 3A, 1A, D, 23A, D, D, D, 3A, L, 1A, 4A, D, 3A, 6A, D, D, D, D, D, 23A, D, D, D, D, D, 1A, K, K, K, K, 8A, 23A, K, K, 10A, 7A, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, 13A, D

numpy.ndarray
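
When only part of an image is needed, it can be wasteful to read the full array as done above. The sketch below is an addition to the original workflow: it assumes the `use_fsspec` keyword of `fits.open` and the `.section` attribute of image HDUs behave as described in the astropy documentation on cloud-hosted files, and the helper name `read_cutout` is hypothetical. It defaults to anonymous access, which is an assumption suited to public buckets.

```python
from astropy.io import fits


def read_cutout(path_or_uri, ext, rows, cols, fsspec_kwargs=None):
    """Read a rectangular sub-array of an image extension (hypothetical helper).

    Opening with use_fsspec=True and indexing .section asks astropy to read
    only the requested part of the array rather than the whole extension.
    """
    if fsspec_kwargs is None:
        fsspec_kwargs = {"anon": True}  # assumption: anonymous access to public data
    with fits.open(path_or_uri, use_fsspec=True, fsspec_kwargs=fsspec_kwargs) as hdul:
        # rows and cols are slice objects in NumPy (row, column) order
        return hdul[ext].section[rows, cols]
```

For the HST file above, a call such as `read_cutout(uri, 1, slice(100, 200), slice(150, 250))` would request a 100 x 100 pixel region of the `SCI` extension.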

***

## Streaming from the Roman Research Nexus S3 Bucket

Though Roman data will eventually be available through MAST, we currently offer a small set of simulated data available via an AWS Open Data Program S3 bucket. These files can be streamed in exactly the same way as the HST FITS file above. Additionally, we can browse the available files similarly to a Unix terminal. A full list of commands can be found in the `s3fs` documentation [here](https://s3fs.readthedocs.io/en/latest/api.html#).

We can view the files in the S3 bucket by performing a list command (`ls`) on the main Nexus directory:

In [7]:
fs = s3fs.S3FileSystem(anon=True)

asdf_dir_uri = 's3://stpubdata/roman/nexus/soc_simulations/tutorial_data/'
fs.ls(asdf_dir_uri)

['stpubdata/roman/nexus/soc_simulations/tutorial_data/._Roman_OpticalModel_v0.5.yml',
 'stpubdata/roman/nexus/soc_simulations/tutorial_data/._r9999901001001001001_0001_wfi01_f129_cal.asdf',
 'stpubdata/roman/nexus/soc_simulations/tutorial_data/._r9999901001001001001_0001_wfi01_f129_uncal.asdf',
 'stpubdata/roman/nexus/soc_simulations/tutorial_data/._r9999902002002002002_0001_wfi01_f213_cal.asdf',
 'stpubdata/roman/nexus/soc_simulations/tutorial_data/._r9999902002002002002_0001_wfi01_f213_uncal.asdf',
 'stpubdata/roman/nexus/soc_simulations/tutorial_data/._roman_sn1a_61524_f129.asdf',
 'stpubdata/roman/nexus/soc_simulations/tutorial_data/._roman_sn1a_61524_f213.asdf',
 'stpubdata/roman/nexus/soc_simulations/tutorial_data/._roman_sn1a_61529_f129.asdf',
 'stpubdata/roman/nexus/soc_simulations/tutorial_data/._roman_sn1a_61529_f213.asdf',
 'stpubdata/roman/nexus/soc_simulations/tutorial_data/._roman_sn1a_61534_f129.asdf',
 'stpubdata/roman/nexus/soc_simulations/tutorial_data/._roman_sn1a_61

The `fs.ls()` command allows us to list the contents of the URI. In the above example, the `s3://stpubdata/roman/nexus/soc_simulations/tutorial_data/` directory of the `stpubdata` bucket contains numerous files for the notebook tutorials.
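
The visible portion of the listing above consists of entries whose base names begin with `._`. Assuming these are metadata sidecar files rather than data products (an assumption, not something the listing itself states), a small filter like the sketch below can hide them. The helper name `visible_entries` is hypothetical, and the stand-in list mixes two entries copied from the output above with one made-up regular file name for contrast.

```python
import posixpath


def visible_entries(listing):
    """Drop entries whose base name starts with '._' (hypothetical helper)."""
    return [p for p in listing if not posixpath.basename(p).startswith("._")]


# Stand-in listing: two entries from the output above plus one made-up file name
example_listing = [
    "stpubdata/roman/nexus/soc_simulations/tutorial_data/._Roman_OpticalModel_v0.5.yml",
    "stpubdata/roman/nexus/soc_simulations/tutorial_data/._roman_sn1a_61524_f129.asdf",
    "stpubdata/roman/nexus/soc_simulations/tutorial_data/example_file.asdf",
]
print(visible_entries(example_listing))
```

In the notebook, the same function could be applied to the result of `fs.ls(asdf_dir_uri)`.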

In the next subsection, we will explore opening data files made using the Roman image simulator "Roman I-Sim." These simulations are saved in the same file formats as Roman data and are useful to help develop file ingestion pipelines. More Roman I-Sim simulated data will be made available in the future.

For more information on the available data products, please visit the [Simulated Data Products](../../../markdown/simulated-data.md) documentation.

### Opening Roman I-Sim Data

Diving into the S3 bucket, we find several different files:
- `*_uncal.asdf`: Level 1 (L1; uncalibrated ramp cube) files.
- `*_cal.asdf`: Level 2 (L2; calibrated rate image) files.
- `*_coadd.asdf`: Level 3 (L3; resampled image) files.
- Some additional miscellaneous files for various tutorials.
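
The suffix-to-level mapping in the list above can be written as a small lookup. This is an illustrative sketch only: the function name `data_level` is hypothetical, and the mapping covers just the three suffixes named in this notebook.

```python
# Suffix-to-level mapping taken from the list above
SUFFIX_TO_LEVEL = {
    "_uncal.asdf": "L1",
    "_cal.asdf": "L2",
    "_coadd.asdf": "L3",
}


def data_level(file_name):
    """Return 'L1', 'L2', or 'L3' based on the file name suffix, or None if unrecognized."""
    for suffix, level in SUFFIX_TO_LEVEL.items():
        if file_name.endswith(suffix):
            return level
    return None


for name in (
    "r9999901001001001001_0001_wfi01_f129_uncal.asdf",
    "r9999901001001001001_0001_wfi01_f129_cal.asdf",
    "my_roman_mosaic_coadd.asdf",
    "Roman_OpticalModel_v0.5.yml",
):
    print(name, "->", data_level(name))
```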

To learn how these files were generated, please see the [Roman I-Sim](../romanisim/romanisim.ipynb) tutorial notebook.

As you can see, Roman WFI data are stored in Advanced Scientific Data Format (ASDF) files. See the tutorial notebook [Working with ASDF](../working_with_asdf/working_with_asdf.ipynb) for more information.

Regarding file names, note that the first element (separated by underscores) of the file name string of L1 and L2 files denotes programmatic information (e.g., program ID, visit ID, etc.), while the second element gives the exposure number within the visit. Thus, `r0003201001001001004_0001_wfi02_f106_cal.asdf` and `r0003201001001001004_0002_wfi02_f106_cal.asdf` are exposure-level (L2) files from the same visit but represent exposures 1 and 2, respectively, of the detector WFI02. For simulated data, such as the files used for these tutorials, the first element of the file name string has been chosen simply as an example. Roman file names are elaborate because each one encodes the relevant information about the observation. For more information on the file naming conventions, please see the [Data Levels and Products](https://roman-docs.stsci.edu/data-handbook-home/wfi-data-format/data-levels-and-products) Roman documentation page.
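
The underscore-separated structure described above can be illustrated with a short parser. This is a sketch under stated assumptions: the first two fields follow the description in the text, while the labels for the remaining three fields (detector, optical element, product suffix) are inferred from the example file names and should be checked against the Data Levels and Products page. The names `ExposureName` and `parse_exposure_name` are hypothetical.

```python
from typing import NamedTuple


class ExposureName(NamedTuple):
    program_info: str      # first element: programmatic information (program ID, visit ID, etc.)
    exposure: int          # second element: exposure number within the visit
    detector: str          # inferred label, e.g. 'wfi02'
    optical_element: str   # inferred label, e.g. 'f106'
    suffix: str            # inferred label, e.g. 'cal' or 'uncal'


def parse_exposure_name(file_name):
    """Split an L1/L2 file name into its underscore-separated elements."""
    stem = file_name.rsplit(".", 1)[0]  # drop the '.asdf' extension
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"Unexpected number of elements in {file_name!r}")
    program_info, exposure, detector, optical_element, suffix = parts
    return ExposureName(program_info, int(exposure), detector, optical_element, suffix)


info = parse_exposure_name("r0003201001001001004_0002_wfi02_f106_cal.asdf")
print(info)
```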

**Note:** Archival file names for the mosaic images (L3 products) are still being finalized. The L3 file in the S3 bucket has a generic file name (`my_roman_mosaic_coadd.asdf`) that does not follow any naming convention.

Below, we use `roman_datamodels` to read the ASDF file corresponding to a dense region. To simplify the workflow, we are providing a URI to the data. Once the data are available through MAST during operations, users will need to retrieve the URIs using astroquery.

In [8]:
asdf_file_uri = asdf_dir_uri + 'r0003201001001001004_0001_wfi11_f106_cal.asdf'

with fs.open(asdf_file_uri, 'rb') as f:
    dm = rdm.open(f)
    dm.info()

root (AsdfObject)
├─asdf_library (Software)
│ ├─author (str): The ASDF Developers
│ ├─homepage (str): http://github.com/asdf-format/asdf
│ ├─name (str): asdf
│ └─version (str): 4.1.0
├─history (AsdfDictNode)
│ └─extensions (AsdfListNode) ...
└─roman (WfiImage) # Level 2 (L2) Calibrated Roman Wide Field Instrument (WFI) Rate Image.
  ├─meta (AsdfDictNode) ...
  ├─data (NDArrayType) # Science Data (DN/s) or (MJy/sr) ...
  ├─dq (NDArrayType) # Data Quality ...
  ├─err (NDArrayType) # Error (DN / s) or (MJy / sr) ...
  ├─var_poisson (NDArrayType) # Poisson Variance (DN^2/s^2) or (MJy^2/sr^2) ...
  ├─var_rnoise (NDArrayType) # Read Noise Variance (DN^2/s^2) or (MJy^2/sr^2) ...
  ├─var_flat (NDArrayType) # Flat Field Variance (DN^2/s^2) or (MJy^2/sr^2) ...
  ├─amp33 (NDArrayType) # Amplifier 33 Reference Pixel Data (DN) ...
  ├─border_ref_pix_left (NDArrayType) # Left-Edge Border Reference Pixels (DN) ...
  ├─border_ref_pix_right (NDArrayType) # Right-Edge Border Reference Pixels (DN) ...
  

### Opening OpenUniverse Simulated Data

The OpenUniverse data is hosted in its own S3 bucket (see the [OpenUniverse AWS Open Data](https://registry.opendata.aws/openuniverse2024/) website for more information). Additionally, IPAC has created two [OpenUniverse notebooks](https://irsa.ipac.caltech.edu/docs/notebooks/) showcasing how to interact with the original data and catalog files. In this notebook, we focus on how to access the files.

The simulations are natively saved as FITS files and are divided by survey (the Wide Area Survey (WAS) or the Time Domain Survey (TDS)), optical element, and HEALPix cell ([HEALPix](https://healpix.sourceforge.io) is a commonly used way to uniformly discretize the area of a sphere). Please see [Simulated Data Products](../../../markdown/simulated-data.md) for more information about specific OpenUniverse data products.

Below we provide an example of streaming a simulated "calibrated" image FITS file from an S3 bucket using an alternate method. Instead of initializing our own `S3FileSystem`, we pass the credentials (anonymous credentials in this case, as the data is public) directly to `fits.open`, allowing it to create the file system. This shorthand is convenient when the URI is explicitly provided, but it does not allow exploration of the S3 directory structure without initializing the `S3FileSystem` separately.

In [9]:
s3bucket = 's3://nasa-irsa-simulations/openuniverse2024/roman/preview/RomanWAS/images/simple_model'
band = 'F184'
hpix = '15297'
sensor = 11
s3fpath = s3bucket+f'/{band}/{hpix}/Roman_WAS_simple_model_{band}_{hpix}_{sensor}.fits.gz'

# open inside a context manager so the remote file handle is closed afterwards
with fits.open(s3fpath, fsspec_kwargs={'anon':True}) as fits_file:
    # info() prints the summary itself and returns None, so it is not wrapped in print()
    fits_file.info()

Filename: <class 's3fs.core.S3File'>
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU      63   ()      
  1  SCI           1 ImageHDU        68   (4088, 4088)   float64   
  2  ERR           1 ImageHDU        68   (4088, 4088)   float32   
  3  DQ            1 ImageHDU        70   (4088, 4088)   int32 (rescales to uint32)   
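
The URI in the cell above follows a regular pattern in band, HEALPix cell, and sensor. As a sketch, that pattern can be wrapped in a function so that several URIs can be built consistently. The function name `was_image_uri` is hypothetical, the pattern is copied from the cell above, and the function only builds strings; it does not check that a file exists for a given combination.

```python
# Prefix copied from the cell above
S3_PREFIX = "s3://nasa-irsa-simulations/openuniverse2024/roman/preview/RomanWAS/images/simple_model"


def was_image_uri(band, hpix, sensor, prefix=S3_PREFIX):
    """Build the URI of a simple-model WAS image (hypothetical helper, no existence check)."""
    return f"{prefix}/{band}/{hpix}/Roman_WAS_simple_model_{band}_{hpix}_{sensor}.fits.gz"


# Reproduces the URI used in the cell above
print(was_image_uri("F184", "15297", 11))
```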


***

## Downloading Files (not recommended)

Downloading Roman data products is not recommended due to their large file sizes and the high volume expected from the mission's survey nature. Instead, users are encouraged to adopt workflows that utilize the file streaming services described above for an optimal experience.

However, in specific science cases, downloading files may be necessary. To do so, you can use the URIs along with the `S3FileSystem.get` function (documentation available [here](https://s3fs.readthedocs.io/en/latest/api.html#s3fs.core.S3FileSystem.get)). The code snippet below demonstrates how to download data to your personal instance of the Research Nexus.

In [10]:
# commented out as this use case is not recommended and should only be needed in rare circumstances
# from pathlib import Path
# URI =  ## Set this to the URI string you want to download.
# local_file_path = Path('data/')
# local_file_path.mkdir(parents=True, exist_ok=True)
# fs = s3fs.S3FileSystem()
# fs.get(URI, local_file_path)
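
A variation on the snippet above, offered as a sketch rather than the recommended procedure, builds an explicit destination file name from the URI so that the download does not depend on how a directory target is resolved. The helper names `local_destination` and `download` are hypothetical, and `anon=True` is an assumption that suits public buckets; drop it if your session should use its own credentials.

```python
from pathlib import Path


def local_destination(uri, directory="data"):
    """Build the local file path for a remote URI (hypothetical helper)."""
    file_name = uri.rstrip("/").rsplit("/", 1)[-1]
    return Path(directory) / file_name


def download(uri, directory="data", anon=True):
    """Download one file from S3 and return its local path (hypothetical helper)."""
    import s3fs  # imported here so local_destination can be used without s3fs

    destination = local_destination(uri, directory)
    # Create the target directory if it does not exist yet
    destination.parent.mkdir(parents=True, exist_ok=True)
    fs = s3fs.S3FileSystem(anon=anon)
    fs.get(uri, str(destination))
    return destination


print(local_destination("s3://stpubdata/hst/public/ib6w/ib6wd4lrq/ib6wd4lrq_drz.fits"))
```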

***

## Additional Resources
Additional information can be found at the following links:

- [`s3fs` Documentation](https://s3fs.readthedocs.io/en/latest/api.html#)
- [Working with ASDF Notebook](../working_with_asdf/working_with_asdf.ipynb)
- [OpenUniverse AWS Open Data](https://registry.opendata.aws/openuniverse2024/)
- [OpenUniverse notebooks](https://irsa.ipac.caltech.edu/docs/notebooks/)
- [Simulated Data Products Document](../../../markdown/simulated-data.md)
- [MNRAS paper detailing OpenUniverse data simulation methods (Troxel et al. 2021)](https://ui.adsabs.harvard.edu/abs/2021MNRAS.501.2044T/abstract)
- [MNRAS paper detailing the previewed OpenUniverse data (Troxel et al. 2023)](https://ui.adsabs.harvard.edu/abs/2023MNRAS.522.2801T/abstract)

***

## About this Notebook
The data streaming information in this notebook largely builds on the TIKE data-access notebook by Thomas Dutkiewicz.

**Author:** Will C. Schultz, Tyler Desjardins \
**Updated On:** 2025-09-30

<table width="100%" style="border:none; border-collapse:collapse;">
  <tr style="border:none;">
    <td style="border:none; width:180px; white-space:nowrap;">
       <a href="#top" style="text-decoration:none; color:#0066cc;">↑ Top of page</a> 
    </td>
    <td style="border:none; text-align:center;">
       <img src="../../roman_logo.png" width="50">
    </td>
    <td style="border:none; text-align:right;">
       <img src="../../stsci_logo2.png" width="90">
    </td>
  </tr>
</table>