BlueFlux dataset integration options for GHG center #386

@abarciauskas-bgse

BlueFlux is an interesting dataset: it is only 4 files, but each file has a different variable and contains 24 years of data. This is somewhat the converse of what we normally see (each NetCDF file in a collection covering a different temporal extent and holding multiple variables).

Another challenge is that this dataset is managed by ORNL, to which we do not currently have role-based or requester-pays access.

There are a few options for integrating this dataset:

  1. Lowest effort: Get ORNL role-based access with titiler-cmr and then publish 4 STAC collections, one per file, each representing a different variable. Each file has a different variable and we have no way to query CMR for separate granules based on the variables they represent. @hrodmn I think titiler-cmr /timeseries/statistics would work here, but I would have to test it out to be sure. Note we would be using titiler-multidim here since we can't filter granules on variable via CMR. There is no timeseries support in the UI for titiler-multidim at this time.
  2. Second level of effort: put the data in the SMCE bucket, publish a collection with 4 assets or publish 4 collections with individual assets, and then use the zarr-timeseries.tsx layer, which can visualize a collection-level asset via titiler-multidim. We would probably modify the code slightly so that the way the collection-level asset is defined by its STAC reference is configurable in the veda-config MDX file. NOTE: Timeseries would still not work until we integrate titiler-multidim statistics endpoints into E+A.
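
To make the two layouts mentioned in option 2 concrete, here is a minimal sketch using plain Python dictionaries shaped like STAC collections. The variable names, bucket path, and collection IDs are hypothetical placeholders, not the real BlueFlux names, and the dictionaries omit most of the fields a valid STAC collection requires.

```python
# Hypothetical placeholders: the real BlueFlux variable names and the
# SMCE bucket path are not stated in this issue.
VARIABLES = ["variable_a", "variable_b", "variable_c", "variable_d"]
BUCKET = "s3://example-smce-bucket/blueflux"


def asset_for(variable: str) -> dict:
    """Build a collection-level asset entry pointing at one single-variable file."""
    return {
        "href": f"{BUCKET}/{variable}.nc",
        "type": "application/x-netcdf",
        "roles": ["data"],
    }


def one_collection_many_assets() -> dict:
    """Layout A: a single collection holding one asset per variable."""
    return {
        "type": "Collection",
        "id": "blueflux",
        "assets": {v: asset_for(v) for v in VARIABLES},
    }


def one_collection_per_variable() -> list[dict]:
    """Layout B: four collections, each with a single asset."""
    return [
        {"type": "Collection", "id": f"blueflux-{v}", "assets": {"data": asset_for(v)}}
        for v in VARIABLES
    ]
```

Both layouts point at the same four files; they differ only in how a client would look up the asset for a given variable.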

The following options are not feasible by July 15th and are both medium to high level of effort:

  3. Move the data into the SMCE bucket, and publish a collection with multiple items. Modify the existing E+A UI code to query for STAC items and send requests to titiler-multidim /tiles and /statistics for visualization + analysis.
  4. Create a virtual zarr icechunk store for this dataset, and publish a STAC collection with this as a zarr asset. Visualization would be supported via zarr-timeseries. Titiler-multidim would need to integrate reading icechunk stores, and the UI would still need to integrate titiler-multidim timeseries requests.

Additional details are in the "VEDA dataset integration planning - 2025/07/02" doc shared with relevant folks.

After discussion with various folks, it is clear there are many ways to integrate these datasets. However, we can't support all of these methods. It would be safer, more maintainable, and more user-friendly to maintain simpler, uniform APIs for data that do not embed too much complexity. In other words, we wish to progress towards uniform dataset access rather than conforming our APIs and services to the variations found in "quirky data".

This is why, in the medium to long term, we lean towards option 4, which would create a data cube style interface into this dataset, matching the data cube interface of the many other datasets that use the zarr data model via virtual or native zarr stores. However, we could implement (1) or (2) by July 15th.
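
As a rough illustration of what a data cube style interface means here, the sketch below combines single-variable series that share a time axis into one structure addressable by variable and time. It uses plain Python with made-up values and hypothetical variable names; a real implementation would rely on zarr tooling rather than dictionaries.

```python
def build_cube(time_axis: list[str], per_variable: dict[str, list[float]]) -> dict:
    """Combine single-variable series into one cube keyed by variable name.

    Every series must have exactly one value per entry in the shared time axis.
    """
    for name, values in per_variable.items():
        if len(values) != len(time_axis):
            raise ValueError(f"{name} has {len(values)} steps, expected {len(time_axis)}")
    return {
        "time": list(time_axis),
        "variables": {name: list(values) for name, values in per_variable.items()},
    }


def select(cube: dict, variable: str, time: str) -> float:
    """Look up one value by variable name and time label."""
    return cube["variables"][variable][cube["time"].index(time)]


# Made-up example data: two hypothetical variables over three time steps.
cube = build_cube(
    ["2000-01", "2000-02", "2000-03"],
    {"variable_a": [1.0, 2.0, 3.0], "variable_b": [10.0, 20.0, 30.0]},
)
print(select(cube, "variable_b", "2000-02"))  # prints 20.0
```

The point of the sketch is that a caller names a variable and a time and never needs to know which file a value came from.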

Should we move forward with option (1) or (2) @brianmfreitag @Jeanne-le-Roux @siddharth0248 or should we focus on the longer-term vision proposed in option (4)?

Also, please chime in @aboydnw @hanbyul-here @hrodmn @sharkinsspatial if you have additional thoughts.
