diff --git a/docs/.vuepress/config.js b/docs/.vuepress/config.js index 539ae1a15..23ec574ef 100644 --- a/docs/.vuepress/config.js +++ b/docs/.vuepress/config.js @@ -215,6 +215,17 @@ module.exports = { '/how-to/troubleshooting-kubo', '/how-to/webtransport', '/install/run-ipfs-inside-docker', + '/how-to/observe-peers', + '/how-to/peering-with-content-providers' + ] + }, + { + title: 'Scientific Data', + sidebarDepth: 1, + collapsable: true, + children: [ + '/how-to/scientific-data/landscape-guide', + '/how-to/scientific-data/publish-geospatial-zarr-data', ] }, { @@ -242,15 +253,6 @@ module.exports = { '/how-to/move-ipfs-installation/move-ipfs-installation', ] }, - { - title: 'Work with peers', - sidebarDepth: 1, - collapsable: true, - children: [ - '/how-to/observe-peers', - '/how-to/peering-with-content-providers' - ] - }, { title: 'Websites on IPFS', sidebarDepth: 1, @@ -417,6 +419,7 @@ module.exports = { children: [ ['/case-studies/arbol', 'Arbol'], ['/case-studies/audius', 'Audius'], + ['/case-studies/orcestra', 'ORCESTRA'], ['/case-studies/fleek', 'Fleek'], ['/case-studies/likecoin', 'LikeCoin'], ['/case-studies/morpheus', 'Morpheus.Network'], diff --git a/docs/case-studies/orcestra.md b/docs/case-studies/orcestra.md new file mode 100644 index 000000000..679d9456e --- /dev/null +++ b/docs/case-studies/orcestra.md @@ -0,0 +1,126 @@ +--- +title: 'Case study: ORCESTRA & IPFS' +description: Explore how a coordinated tropical atmospheric science campaign uses IPFS to share verifiable datasets across institutions worldwide. +--- + +# Case study: ORCESTRA + +::: callout +**"When our server infrastructure was stuck in customs in Barbados, we got a Raspberry Pi with a 1 TB SD card running a Kubo IPFS node. Within hours, 50 scientists on site were sharing data laptop-to-laptop. That's when we realized IPFS could be at the core of our data infrastructure."** +::: + +## Overview + +In this case study, you'll learn how [ORCESTRA](https://orcestra-campaign.org/), a coordinated international atmospheric science campaign, uses IPFS to ensure that scientific datasets are verifiable, openly accessible, and collaboratively distributed across institutions worldwide. + +## What is ORCESTRA + +[ORCESTRA](https://orcestra-campaign.org/) (Organized Convection and EarthCARE Studies over the Tropical Atlantic) is an international field campaign that launched in early 2024 to study tropical mesoscale convective systems: the storm systems that play a significant role in the Earth's weather and climate dynamics. + +The campaign brings together **over twenty scientific institutions** spanning Europe, North America, and Africa. Eight sub-campaigns (three airborne, one land-based, and four at sea) coordinate aircraft, ships, ground stations, and satellites to collect atmospheric measurements across the tropical Atlantic. + +ORCESTRA represents the kind of large-scale scientific collaboration where data infrastructure can make or break a mission: dozens of research groups generating terabytes of observational data that must be shared, verified, and preserved across institutional boundaries. + +### ORCESTRA by the numbers + + + +## The story + +Scientific campaigns like ORCESTRA face a fundamental infrastructure challenge: how do you enable real-time data sharing among researchers spread across aircraft, ships, and ground stations, often in remote locations with limited connectivity, while ensuring that every dataset remains verifiable and tamper-proof? 
Traditional approaches involve centralized servers and institutional data portals. But centralized infrastructure introduces single points of failure, and when teams are in the field, connectivity to distant data centers can't be taken for granted.

ORCESTRA's IPFS story began not with a grand architectural plan, but with a practical crisis. During one of the field campaigns in Barbados, the team's planned server infrastructure was delayed in customs. With roughly 50 scientists on site needing to share and collaborate on data, they needed a solution fast.

A Raspberry Pi with a 1 TB SD card became that solution. Running a [Kubo](https://github.com/ipfs/kubo) IPFS node, the device enabled local data sharing from laptop to laptop, provided temporary storage on the Pi node, and facilitated eventual transfer to a data center in Hamburg. No central server required; just content-addressed, peer-to-peer data sharing that worked.

This improvised setup revealed something important: IPFS wasn't just a workaround; it was a better fit for how field science actually works. As the campaign evolved and datasets grew, the team expanded IPFS from an emergency fix to the core of their data publishing infrastructure.

## How ORCESTRA works

ORCESTRA's eight sub-campaigns span sea, air, and land, collecting atmospheric measurements such as temperature, humidity, wind, radiation, aerosols, and cloud properties. This observational data is structured as multidimensional arrays and stored primarily in the [Zarr](https://zarr.dev/) format, a cloud-native format optimized for chunked, distributed access to large scientific datasets.

### From collection to publication

When researchers collect data in the field, it follows a path from raw measurements to published, citable datasets:

1. **Collection**: Instruments on aircraft, ships, and ground stations capture measurements
2. **Processing**: Raw data is quality-controlled, calibrated, and structured into Zarr datasets
3. **Publishing**: Processed datasets are added to IPFS, generating content identifiers (CIDs) that uniquely and verifiably identify each dataset
4. **Discovery**: Datasets are catalogued in a metadata-rich browser, making them findable by the wider community
5. **Retrieval**: Scientists worldwide access data through IPFS gateways or directly from peers

## How ORCESTRA uses IPFS

ORCESTRA uses IPFS to make scientific data openly accessible, verifiable, and resilient.

Raw data from the different sub-campaigns is processed at the Max Planck Institute for Meteorology into publishable datasets. These datasets are added to IPFS, producing content identifiers (CIDs) that correspond to the published data from each sub-campaign. Because each CID is derived from the content itself, anyone who retrieves the data can independently verify that they received exactly what was published, without needing to trust any specific server that served it.

The architecture involves several coordinated components:

### IPFS nodes for collaborative hosting

A team at the Max Planck Institute for Meteorology processes the data from the different teams into Zarr and publishes it to IPFS with a fleet of [Kubo](https://github.com/ipfs/kubo) nodes, ensuring some redundancy. The root CID of the complete dataset is published in [pinlist.yaml on GitHub](https://github.com/orcestra-campaign/ipfs_tools/blob/main/pinlist.yaml), whose Git history provides a full snapshot history of the growing dataset across all sub-campaigns.
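
Other institutions and individual researchers can become additional providers simply by pinning the CIDs from that pinlist on their own nodes. The sketch below is illustrative only: it assumes the pinlist is a YAML document whose entries are bare CIDs or mappings with a `cid` field, which may not match the file's actual schema, and it assumes a local Kubo daemon and the PyYAML package are available.

```python
import subprocess
import urllib.request

import yaml  # PyYAML

# Raw URL derived from the repository path above; treat it as an assumption.
PINLIST_URL = (
    "https://raw.githubusercontent.com/orcestra-campaign/ipfs_tools/main/pinlist.yaml"
)

with urllib.request.urlopen(PINLIST_URL) as response:
    pinlist = yaml.safe_load(response.read())

for entry in pinlist:
    # Accept either bare CID strings or mappings with a "cid" field.
    cid = entry["cid"] if isinstance(entry, dict) else entry
    # Recursively pin on the local Kubo node, making this machine a provider.
    subprocess.run(["ipfs", "pin", "add", cid], check=True)
```
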
+ +### A metadata-rich data browser + +The [ORCESTRA data browser](http://browser.orcestra-campaign.org/) provides a web interface for discovering and retrieving datasets. Built on top of [Climate and Forecast (CF) conventions](https://cfconventions.org/) metadata embedded in the Zarr datasets, the browser lets researchers search by variable, time range, sub-campaign, and other dimensions, then retrieve data directly via IPFS. + +The browser leverages Helia, the TypeScript implementation of IPFS. + +### Pinset tracking on GitHub + +The campaign maintains a "pinset" (a list of datasets and their CIDs) in a [GitHub repository](https://github.com/orcestra-campaign). This serves a dual purpose: it provides **CID discovery**, so researchers can find which CIDs correspond to which datasets, and **provenance tracking**, since the Git history records when datasets were published and updated. Other institutions can use this pinset to replicate the entire collection on their own IPFS nodes. + +### Python ecosystem integration + +Through [ipfsspec](https://github.com/fsspec/ipfsspec/), ORCESTRA data integrates seamlessly with the Python data science ecosystem. Scientists can open datasets directly with familiar tools: + +```python +import xarray as xr + +ds = xr.open_dataset( + "ipfs://bafybeif52irmuurpb27cujwpqhtbg5w6maw4d7zppg2lqgpew25gs5eczm", + engine="zarr" +) +``` + +ipfsspec implements the [fsspec](https://filesystem-spec.readthedocs.io/) interface, the same abstraction layer used by xarray, pandas, Dask for remote data access. This means IPFS retrieval works anywhere these tools expect a filesystem, with no special handling needed. + +## IPFS benefits + +ORCESTRA's adoption of IPFS demonstrates several properties that matter for open scientific data: + +### Data integrity without trust + +Every dataset on IPFS is identified by a CID derived from its content. When a researcher retrieves a dataset, they can verify it matches the published CID; there is no need to trust the server, the network, or any intermediary. For science built on reproducibility, this is foundational. + +### Resilient, decentralized access + +With multiple institutions hosting the same content-addressed data, there is no single point of failure. If one provider goes offline, others continue serving the same verified data. This resilience matters for long-lived scientific datasets that need to remain accessible beyond the lifetime of any single project or server. + +### Collaborative distribution + +IPFS enables a model where providing data is a collaborative effort. Any institution can pin ORCESTRA datasets on their own nodes, contributing bandwidth and availability without any coordination protocol beyond the shared pinset. The more institutions that participate, the more resilient and performant the system becomes. + +### Open, auditable data sharing + +Because every dataset, its CID, and its metadata are public, anyone can audit the data: verify its integrity, replicate it, or build upon it. This aligns with the principles of open science and the [FAIR data principles](https://www.go-fair.org/fair-principles/) (Findable, Accessible, Interoperable, Reusable) that increasingly govern publicly funded research. + +### Field-ready infrastructure + +The Barbados Raspberry Pi story illustrates a property of IPFS worth highlighting: it works at the edge. In field conditions with limited connectivity, IPFS enables local peer-to-peer sharing without dependence on remote infrastructure. 
Data collected locally can be shared immediately among nearby peers and synced to institutional servers when connectivity is available. + +## ORCESTRA & IPFS: the future + +As ORCESTRA's datasets continue to grow and are used by research groups worldwide, the IPFS-based infrastructure positions the project for long-term sustainability. Datasets published today remain verifiable and retrievable as long as any node in the network continues to provide them, whether that's an ORCESTRA server, a university research group, or an individual scientist's node. + +The campaign's approach also serves as a reference for other scientific communities. By demonstrating that content-addressed, peer-to-peer data sharing works at the scale of an international field campaign, ORCESTRA shows a practical path forward for scientific data infrastructure: one that prioritizes verifiability, openness, and collaboration over centralized control. + +_Note: Details in this case study are current as of early 2026. The ORCESTRA campaign and its data infrastructure continue to evolve._ \ No newline at end of file diff --git a/docs/how-to/scientific-data/landscape-guide.md b/docs/how-to/scientific-data/landscape-guide.md new file mode 100644 index 000000000..cab272d0b --- /dev/null +++ b/docs/how-to/scientific-data/landscape-guide.md @@ -0,0 +1,275 @@ +--- +title: Scientific Data and IPFS Landscape Guide +description: an overview of the problem space, available tools, and architectural patterns for publishing and working with scientific data using IPFS. +--- + +# Scientific Data and IPFS Landscape Guide + + + +Scientific data and IPFS are naturally aligned: research teams need to share large datasets across institutions, verify data integrity, and ensure resilient access. From sensor networks to global climate modeling efforts, scientific communities are using IPFS content addressing and peer-to-peer distribution to solve problems traditional infrastructure can't. + +In this guide, you'll find an overview of the problem space, available tools, and architectural patterns for publishing and working with scientific data using IPFS. + +## A Landscape in Flux + +Science advances through collaboration, yet the infrastructure for sharing scientific data has historically developed in silos. Different fields adopted different formats, metadata conventions, and distribution mechanisms. + +This fragmentation means there is no single "right way" to publish and share scientific data. Instead, this is an area of active innovation, with new tools and conventions emerging as communities identify common needs. Standards like [Zarr](https://zarr.dev) represent convergence points where different fields have found common ground. + +This guide surveys the landscape and available tooling, but the right approach for your project depends on your specific constraints: the size and structure of your data, your collaboration patterns, your existing infrastructure, and your community's conventions. The goal is to help you understand the options so you can make informed choices. + +## The Nature of Scientific Data + +Scientific data originates from a variety of sources. In the geospatial field, data is collected by sensors, measuring instruments, camera systems, and satellites. This data is commonly structured as multidimensional arrays (tensors), representing measurements across dimensions like time, latitude, longitude, and altitude. 
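
To make "multidimensional arrays" concrete, here is a minimal, self-contained sketch (assuming only `numpy`, `pandas`, and `xarray` are installed) of a toy dataset with time, latitude, and longitude dimensions; real campaign data has the same shape at much larger scale.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy 3-D temperature field: 4 time steps on a coarse lat/lon grid.
temperature = 15 + 8 * np.random.randn(4, 18, 36)

ds = xr.Dataset(
    data_vars={
        "temperature": (("time", "lat", "lon"), temperature, {"units": "degC"}),
    },
    coords={
        "time": pd.date_range("2026-01-01", periods=4, freq="6h"),
        "lat": np.linspace(-85, 85, 18),
        "lon": np.linspace(-180, 175, 36),
    },
)

print(ds)  # dimensions, coordinates, and variables are self-describing
```
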
Key characteristics of scientific data include:

- **Large scale**: Datasets often span terabytes to petabytes
- **Multidimensional**: Data is organized across multiple axes (e.g., time, space, wavelength)
- **Metadata-rich**: Extensive contextual information accompanies the raw measurements
- **Collaborative**: Research often involves multiple institutions and scientists sharing and building upon datasets

## The Importance of Open Data Access

As hinted above, open access to scientific data accelerates research, enables reproducibility, and maximizes the return on public investment in science. Organizations worldwide have recognized this, leading to mandates for open data sharing in publicly funded research.

However, open access alone is not sufficient. To truly deliver on its promise, data also needs to meet the following criteria:

- **Discoverable**: Researchers need to be able to find datasets relevant to the research they are conducting.
- **Interoperable**: Formats and metadata should enable cross-dataset analysis.
- **Verifiable**: Researchers need to know they are working with valid and authentic data (a corollary of reproducibility).

These criteria are by no means exhaustive; initiatives like [FAIR](https://www.go-fair.org/resources/faq/what-is-fair/) espouse similar criteria with some nuanced variations, and all of them emphasize the importance of open data access.

With that in mind, the next section looks at how these ideas come together with IPFS.

## The Benefits of IPFS for Scientific Data

IPFS addresses several pain points in scientific data distribution:

- **Data integrity**: Content addressing ensures data hasn't been corrupted or tampered with
- **Collaborative distribution**: Multiple institutions can provide the same datasets, improving availability and performance
- **Open access**: Data can be retrieved from anyone who has it, not just the original publisher
- **Resilience**: No single point of failure when multiple providers host data

To see how these ideas, which are central to IPFS's design, are applied by the scientific community, check out the [ORCESTRA case study](../../case-studies/orcestra.md), a campaign that uses IPFS to reap these benefits.

## Architectural Patterns

### CID-centric verifiable data management

With this pattern, IPFS provides content-addressed verifiability by addressing data with CIDs. The network is optional and can always be added later if publishing is desired. This approach is also agnostic about which format you use, be it UnixFS or [BDASL](https://dasl.ing/bdasl.html). Another useful property is that some operations don't require access to the data itself: for example, given two known CIDs, you can construct a new dataset composed of both without touching the underlying data.

This pattern has several variants:

- Data is stored as files and directories and managed on the original storage, i.e. directly on the filesystem, or on private networked storage mounted as a filesystem with the Server Message Block (SMB) protocol. To generate CIDs by merkleizing data sets, there are two approaches:
  - Using the `ipfs add --only-hash -r <folder>` command returns the CID for the folder. This uses Kubo only for the generation of the CID.
  - A variation of the previous approach is to use the experimental [ipfs filestore](https://github.com/ipfs/kubo/blob/master/docs/experimental-features.md#ipfs-filestore) and the `ipfs add --nocopy` command with Kubo, to both generate the CID and import files in a way that doesn't duplicate the data in Kubo's blockstore. This approach allows performing read operations on the original copy on disk, which may be necessary for querying. The main benefit over the previous approach is that the data can also be published easily.
- Data is stored on disk in a content-addressed format, either managed by an IPFS node that tracks and stores the chunks in a blockstore, or as CAR files (Content Addressable aRchives). With both of these approaches, the data is duplicated if the original copy is also kept. With CAR files you get all the benefits of verifiability in a storage-agnostic way, since CAR files can be stored anywhere, from local disk to cloud storage to pinning services.

Ultimately, the choice between these approaches to content-addressed data management comes down to the following questions:

- How important is duplication? This is largely a function of the volume of your data and the market cost of storage.
- How important is it to maintain a copy of the data in a content-addressed format? If no public publishing is expected and you only need integrity checks, you may choose not to store a full content-addressed replica and instead compute hashes on demand.
- Which libraries and programming languages will you use to interact with the data? For example, Python's xarray library, via fsspec, can read directly from a local IPFS gateway using [`ipfsspec`](https://github.com/fsspec/ipfsspec).

### Single Publisher

A single institution runs Kubo nodes to publish and provide data. Users retrieve via gateways or their own nodes.

### Collaborative Publishing

Multiple institutions coordinate to provide the same datasets:

- Permissionless: a single writer with any number of follower providers
- Coordination can happen out of band, for example via a shared pinset on GitHub. The original publisher must ensure their data is provided, but once it's added to the pinset, others can replicate it.

### Connecting to Existing Infrastructure

IPFS can complement existing data infrastructure:

- STAC catalogs can include IPFS CIDs alongside traditional URLs
- Data portals can offer IPFS as an alternative retrieval method
- CI/CD pipelines can automatically add new data to IPFS nodes

## Geospatial Format Evolution: From NetCDF to Zarr

The scientific community has long relied on formats like NetCDF, HDF5, and GeoTIFF for storing multidimensional array data (also referred to as tensors). While these formats served research well, they were designed for local filesystems and face challenges in the cloud and distributed environments that have become the norm over the last decades, a shift driven both by growing dataset sizes and by the advent of cloud and distributed systems capable of storing and processing larger volumes of data.

### Limitations of Traditional Formats

NetCDF and HDF5 interleave metadata with data, requiring large sequential reads to access metadata before reaching the data itself. This creates performance bottlenecks when accessing data over networks, whether that's cloud storage or a peer-to-peer network.
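
To make the contrast concrete before looking at Zarr itself, the minimal sketch below (assuming `xarray`, `zarr`, and `numpy` are installed) writes a small dataset to a Zarr store and lists the resulting files: metadata lives in small JSON documents while array values land in separately addressable chunk files. Exact file names differ between Zarr format versions.

```python
import pathlib

import numpy as np
import xarray as xr

# A toy 3-D variable: 8 time steps on a 1-degree global grid.
ds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), np.random.rand(8, 180, 360))}
)

# Chunk along time so each time step becomes an independently fetchable object.
ds.to_zarr(
    "example.zarr",
    mode="w",
    encoding={"temperature": {"chunks": (1, 180, 360)}},
)

# Metadata documents and chunk files are separate objects on disk.
for path in sorted(pathlib.Path("example.zarr").rglob("*"))[:10]:
    print(path)
```
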
### The Rise of Zarr

[Zarr](https://zarr.dev/) has emerged as a cloud-native format optimized for distributed storage:

- **Separation of metadata and data**: A `zarr.json` file at the root describes the dataset structure, enabling fast reads without scanning all data
- **Chunked by default**: Arrays are split into individually addressable chunks, allowing both partial and concurrent reads
- **Consolidated metadata**: All metadata can be consolidated into a single file for datasets with many arrays
- **Designed for network access patterns**: The chunked, metadata-first layout suits distributed storage, which tends to offer high throughput but also high latency

> Note: To learn more about Zarr, check out the following resources: [Introduction to the Zarr format by Copernicus Marine](https://help.marine.copernicus.eu/en/articles/10401542-introduction-to-the-zarr-format), [What is Cloud-Optimized Scientific Data?](https://tom-nicholas.com/blog/2025/cloud-optimized-scientific-data/).

Zarr has seen widespread adoption across scientific domains, for example:

- **[Copernicus Marine Service](https://marine.copernicus.eu/)**: Provides free and open marine data for policy implementation and scientific innovation
- **[CMIP6](https://wcrp-cmip.org/cmip-phases/cmip6/)**: The Coupled Model Intercomparison Project Phase 6 distributes climate model outputs in Zarr format via cloud platforms like Google Cloud
- **Open Microscopy Environment**: [OME-NGFF](https://ngff.openmicroscopy.org/) (Next-generation file format) builds on Zarr for bioimaging
- **OGC**: The Open Geospatial Consortium has standardized Zarr in its [Zarr Storage Specification](https://www.ogc.org/standards/zarr-storage-specification/)

### Zarr and IPFS

- In IPFS, the common format for representing files and directories is [UnixFS](https://specs.ipfs.tech/unixfs/), and much like Zarr, files are chunked to enable incremental verification.
- Chunking:
  - Both Zarr and IPFS chunk data, for different reasons, with some overlap. Zarr chunks for partial access, to reduce unnecessary data retrieval and enable concurrent retrieval. IPFS chunks for incremental verification and concurrent retrieval.
  - IPFS implementations enforce a block limit of 1 MiB.
- Optimizing Zarr chunk size is a nuanced topic and largely dependent on the access patterns of the data:
  - The established convention is to align Zarr chunk sizes with the IPFS maximum block size of 1 MiB whenever possible, so that each Zarr chunk fetched maps to a single IPFS block.
  - There are many resources that cover this in more detail:
    - https://zarr.readthedocs.io/en/stable/user-guide/performance/
    - https://element84.com/software-engineering/chunks-and-chunkability-tyranny-of-the-chunk/
    - https://eopf-toolkit.github.io/eopf-101/03_about_chunking/31_zarr_chunking_intro.html
    - https://esipfed.github.io/cloud-computing-cluster/optimization-practices.html
- There are a number of trade-offs to consider with UnixFS:
  - Storage overhead of around 0.5%-1% for the additional UnixFS metadata and protobuf framing.
  - If you also keep the original copy of the data alongside the UnixFS-encoded copy, storage requirements roughly double.
  - UnixFS is network-agnostic but integrates seamlessly with the web via IPFS gateways, including the service worker gateway, which allows for resilient multi-provider retrieval.
- IPFS is agnostic about metadata, and it's left to you to pick the conventions most applicable to your use case. Below are some of the conventions common within the ecosystem.
#### Metadata

Metadata in scientific datasets serves to make the data self-describing: what the values represent, what units they're in, when and where they were measured, how they were processed, and how the dimensions relate to each other.

[**Climate and Forecast (CF) Conventions**](https://cfconventions.org/) are a community-maintained specification that defines how to write metadata in netCDF files so that scientific datasets are self-describing and interoperable. The spec covers things like how to label coordinate axes, encode time, specify units, and name variables using a standardized vocabulary (the CF Standard Name Table), so that software can automatically interpret data from different producers without custom parsing. Originally developed for the climate and weather communities, CF has become the dominant metadata convention for gridded earth science data more broadly, and its ideas have influenced newer cloud-native specifications like **GeoZarr**.

The [**Attribute Convention for Data Discovery (ACDD)**](https://wiki.esipfed.org/Attribute_Convention_for_Data_Discovery_1-3) builds upon, and is compatible with, CF, adding global attributes aimed at dataset discovery.

[**GeoZarr**](https://github.com/zarr-developers/geozarr-spec) is a specification for storing geospatial raster/grid data in the Zarr format. It defines conventions for how to encode coordinate reference systems, spatial dimensions, and other geospatial metadata within Zarr stores. It's conceptually downstream of the ideas in the CF CDM (from the [netCDF ecosystem](https://docs.unidata.ucar.edu/netcdf-java/5.2/userguide/common_data_model_overview.html)), but designed for the Zarr ecosystem.

## Ecosystem Tooling

### Organizing Content-Addressed Data

#### UnixFS and CAR Files

UnixFS is the default format for representing files and directories in IPFS. It chunks large files for incremental verification and parallel retrieval.

[CAR (Content Addressed Archive)](https://ipld.io/specs/transport/car/carv1/) files package IPFS data for backup or storage at rest, containing blocks and their CIDs in a single file. They can be stored anywhere while still giving you all of the verification properties of content addressing.

#### Mutable File System (MFS)

MFS provides a familiar filesystem interface for organizing immutable content that is encoded with UnixFS. You can create directories, move files, and maintain a logical structure while the underlying data remains content-addressed.

- Since UnixFS-encoded data is inherently immutable, MFS provides an API to construct and mutate trees, producing a new root CID after every change.
- MFS helps you organize already-merkleized data (even if you don't have it locally!)
- You can produce new CIDs or add existing CIDs to a growing dataset

A short command sketch using Kubo's `ipfs files` API appears below, after the publishing tooling.

### Publishing

#### Kubo

[Kubo](https://github.com/ipfs/kubo) is the reference IPFS implementation. It handles:

- Adding files: encoding them with UnixFS and generating CIDs
- Providing: announcing content to the network via the DHT
- Serving content (blocks) to other nodes

#### IPFS Cluster

[IPFS Cluster](https://ipfscluster.io/) is a coordination layer on top of Kubo for multi-node deployments. It coordinates pinning across a set of Kubo nodes, ensuring data redundancy and availability, and supports the [Pinning Service API spec](https://ipfs.github.io/pinning-services-api-spec/).
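
As promised in the MFS section above, here is a minimal sketch of the MFS workflow using Kubo's `ipfs files` commands. The directory names and the `<datasetCID>` placeholder are hypothetical.

```bash
# Create a logical directory structure in MFS
ipfs files mkdir -p /datasets/2026

# Graft an already-merkleized dataset into the tree by CID.
# Only the root block is needed; the rest of the data is not copied.
ipfs files cp /ipfs/<datasetCID> /datasets/2026/halo-measurements

# Inspect the tree, then read off the new root CID of your MFS
ipfs files ls /datasets/2026
ipfs files stat /
```
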
#### Pinning Services

Third-party pinning services provide managed infrastructure for persistent storage, useful when you don't want to run your own nodes. See the [pinning services guide](../work-with-pinning-services.md) for more information.

### Retrieval

#### ipfsspec

[ipfsspec](https://github.com/fsspec/ipfsspec/) integrates IPFS with Python's filesystem specification, enabling direct use with tools like xarray:

```python
import xarray as xr

ds = xr.open_dataset(
    "ipfs://bafybeif52irmuurpb27cujwpqhtbg5w6maw4d7zppg2lqgpew25gs5eczm",
    engine="zarr"
)
```

### Discovery, Metadata, and Data Portals: From discovery all the way to retrieval

A typical journey looks like this: a scientist learns that a dataset exists (from a paper, a colleague, or a catalogue), finds the CID that identifies it, locates providers currently serving that CID, and finally retrieves and verifies the data. The first step is a metadata and cataloguing problem; the rest is handled by IPFS tooling.

"Discovery" is a loaded term that can refer to related, albeit distinct, concepts in IPFS. By discovery, we typically mean one of the following:

- **CID discovery**: how you find the CID for the data you are interested in. CID discovery is closely tied to **trust**: a CID lets you verify the integrity of a dataset, but you still need a trustworthy way to learn which CID to ask for. It can be:
  - Programmatic
  - Human-centric
- **Content discovery**: also commonly known as **content routing**, refers to finding providers (nodes serving the data) for a given CID, including their network addresses. By default, IPFS supports a number of content routing systems: the Amino DHT, IPNI, and Delegated Routing over HTTP as a common interface for interoperability.

### CID Discovery

When using content-addressed systems like IPFS, a new challenge emerges: how do users discover the Content Identifiers (CIDs) for datasets they want to access?

At a high level, there are a number of common approaches to this problem, which vary in whether they target human or programmatic discovery.

- Programmatic Discovery:
  - **DNSLink**: Maps DNS names to CIDs, allowing human-readable URLs that resolve to IPFS content. Update the DNS record when you publish new data.
  - **IPNS + DHT**: The InterPlanetary Name System provides mutable pointers to content using cryptographic keys. More self-certifying than DNS but with less tooling support.
  - **STAC**: SpatioTemporal Asset Catalogs can record IPFS CIDs or `ipfs://` URLs for assets; see the [STAC](#stac) section below.
- Human-Centric Discovery:
  - **Websites**: Websites documenting available datasets and their CIDs
  - **GitHub repositories**: Publishing CID lists and dataset metadata in version-controlled repositories
  - **Custom data portals/catalogues**: Purpose-built portals like the [ORCESTRA IPFS UI](https://github.com/orcestra-campaign/ipfsui) leverage CF metadata to make datasets searchable and link them to their CIDs

#### STAC

**STAC (SpatioTemporal Asset Catalog)** is a specification for cataloging and discovering geospatial data assets. Rather than defining how data is stored internally, STAC describes _what_ data exists, _where_ it is, and _what_ time span it covers. A STAC catalog might point to assets that are NetCDF files, Zarr stores, or any other format.

The [EASIER Data Initiative](https://easierdata.org/) has built [ipfs-stac](https://github.com/DecentralizedGeo/ipfs-stac), a Python library that helps with onboarding geospatial data to IPFS and interfacing with it. The library enables developers and researchers to leverage STAC APIs enriched with IPFS metadata to seamlessly fetch, pin, and explore data in a familiar manner.
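
To illustrate how the two can be combined, here is a simplified, hypothetical STAC item in which an asset carries an `ipfs://` href alongside a conventional HTTPS URL. The `alternate` field follows the STAC alternate-assets extension, and the field names, URLs, and media type are illustrative only; consult the STAC and ipfs-stac documentation for the exact conventions they expect.

```python
import json

# Hypothetical STAC item with an IPFS alternate href for its main asset.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-zarr-dataset",
    "geometry": None,
    "properties": {"datetime": "2026-01-23T00:00:00Z"},
    "assets": {
        "data": {
            "href": "https://example.org/datasets/example.zarr",
            "type": "application/vnd+zarr",
            # "alternate" follows the STAC alternate-assets extension.
            "alternate": {
                "ipfs": {
                    "href": "ipfs://bafybeif52irmuurpb27cujwpqhtbg5w6maw4d7zppg2lqgpew25gs5eczm"
                }
            },
        }
    },
    "links": [],
}

print(json.dumps(item, indent=2))
```
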
+ +### Content routing + +- **Public DHT**: All providers announce to the main IPFS network +- **Separate public DHT namespace (not amino, i.e.`/id/ipfs/`)**: a public but separate DHT +- **"Private" DHT**: A closed network with a shared key for institutional partners with public gateway access +- **central indexer**: with delegated routing, you can build a central indexer (or multiples) with routing endpoints for discovery + +### Collaboration + +#### Pinsets on GitHub + +Teams can maintain shared lists of CIDs to pin, enabling collaborative data preservation. When one institution adds data, others can pin it automatically through CI/CD pipelines or manual synchronization. + + + +## Next Steps + +- [Publishing Zarr Datasets with IPFS](./publish-geospatial-zarr-data.md) - A hands-on guide to publishing your first dataset +- [Kubo Configuration Reference](https://github.com/ipfs/kubo/blob/master/docs/config.md) +- [ipfsspec Documentation](https://github.com/fsspec/ipfsspec/) + +## Resources + +- [Zarr Format Documentation](https://zarr.dev/) +- [STAC Specification](https://stacspec.org/) +- [OME-NGFF Specification](https://ngff.openmicroscopy.org/) +- [Cloud-Optimized Scientific Data](https://tom-nicholas.com/blog/2025/cloud-optimized-scientific-data/) - Background on format design diff --git a/docs/how-to/scientific-data/publish-geospatial-zarr-data.md b/docs/how-to/scientific-data/publish-geospatial-zarr-data.md new file mode 100644 index 000000000..899315531 --- /dev/null +++ b/docs/how-to/scientific-data/publish-geospatial-zarr-data.md @@ -0,0 +1,232 @@ +--- +title: Publish Geospatial Zarr Data with IPFS +description: Learn how to publish geospatial datasets using IPFS and Zarr for decentralized distribution, data integrity, and open access. +--- + +# Publish Geospatial Zarr Data with IPFS + +In this guide, you will learn how to publish public geospatial data sets using IPFS, with a focus on the [Zarr](https://zarr.dev/) format. You'll learn how to leverage decentralized distribution with IPFS for better collaboration, data integrity, and open access. + +Note that while this guide focuses on Zarr, it's applicable to other data sets. + +By the end of this guide, you will publish a Zarr dataset to the IPFS network in a way that is retrievable directly within [Xarray](https://xarray.dev/). + +If you are interested in a real-world example following the patterns in this guide, check out the [ORCESTRA campaign](https://orcestra-campaign.org/intro.html). + +- [Why IPFS for Geospatial Data?](#why-ipfs-for-geospatial-data) +- [Prerequisites](#prerequisites) +- [Step 1: Prepare Your Zarr Data Set](#step-1-prepare-your-zarr-data-set) +- [Step 2: Add Your Data Set to IPFS](#step-2-add-your-data-set-to-ipfs) +- [Step 3: Organizing Your Data](#step-3-organizing-your-data) +- [Step 4: Verify Providing Status](#step-4-verify-providing-status) +- [Step 5: Content Discovery](#step-5-content-discovery) + - [Option A: Share the CID Directly](#option-a-share-the-cid-directly) + - [Option B: Use IPNS for Updatable References](#option-b-use-ipns-for-updatable-references) + - [Option C: Use DNSLink for Human-Readable URLs](#option-c-use-dnslink-for-human-readable-urls) +- [Accessing Published Data](#accessing-published-data) +- [Choosing Your Approach](#choosing-your-approach) +- [Reference](#reference) + +## Why IPFS for Geospatial Data? + +Geospatial data sets such as weather observations, satellite imagery, and sensor readings, are typically stored as multidimensional arrays, also commonly known as tensors. 
+ +As these data sets grow larger and more distributed, traditional formats like NetCDF and HDF5 show their limitations: metadata interleaved with data requires large sequential reads before you can access the data you need. + +**[Zarr](https://zarr.dev/)** is a modern format that addresses these limitations and is optimized for networked and distributed storage characterised by high throughput with high latency. Zarr complements the popular [Xarray](https://xarray.dev/) which provides the data structures and operations for analyzing the data sets. + +Some of the key properties of Zarr include: + +- **Separated metadata**: A data catalogue/index lets you understand data set structure before fetching any data, +- **Chunked by default**: Arrays split into small chunks let you download only the subset you need. +- **Consolidated metadata**: All metadata in a single `zarr.json` file speeds reads for multi-array data sets. + +> **Note:** For a more elaborate explanation on the underlying principles and motivation for Zarr, check out [this blog post](https://tom-nicholas.com/blog/2025/cloud-optimized-scientific-data/), by one of the Zarr contributors. + +**IPFS** complements Zarr with decentralized distribution: + +- **Content addressing**: Data is identified by what it contains using CIDs, not where it's stored +- **Built-in integrity**: Cryptographic hashes verify data hasn't been corrupted or tampered with +- **Participatory sharing**: Anyone can help distribute data sets they've downloaded +- **Open access**: No vendor lock-in or centralized infrastructure required + +This combination has proven effective in real-world campaigns like [Orcestra](https://orcestra-campaign.org/orcestra.html), where scientists collaborated with limited internet connectivity in the field while sharing data globally. + +## Prerequisites + +Before starting, ensure you have: + +- A Zarr data set ready for publishing +- Basic familiarity with the command line +- [Kubo](/install/command-line/) or [IPFS Desktop](/install/ipfs-desktop/) installed on a machine. + +:::callout +See the [NAT and port forwarding guide](../nat-configuration.md) for more information on how to configure port forwarding so that your IPFS node is publicly reachable, thus allowing reliable retrievability of data by other nodes. + +::: + +## Step 1: Prepare Your Zarr Data Set + +When preparing your Zarr data set for IPFS, aim for approximately 1 MiB chunks to align with IPFS's 1 MiB maximum block size. While this is not a strict requirement, using larger Zarr chunks will cause IPFS to split them into multiple blocks, potentially increasing retrieval latency. + +To calculate chunk dimensions for a target byte size, work backwards from your datatype: + +```python +import xarray as xr + +ds = xr.open_dataset(filename) +# Example: targeting ~1 MB chunks with float32 data +ds.to_zarr('output.zarr', encoding={ + 'var_name': {'chunks': (1, 512, 512)} +}) + +# Total size: 1 × 512 × 512 × 4 bytes (float32) = 1048576 bytes = 1 MiB per chunk +``` + +:::callout +Chunking in Zarr is a nuanced topic beyond the scope of this guide. 
For more information on optimizing chunk sizes, see:

- [Zarr performance guide](https://zarr.readthedocs.io/en/stable/user-guide/performance/)
- [Chunks and chunkability](https://element84.com/software-engineering/chunks-and-chunkability-tyranny-of-the-chunk/)
- [Zarr chunking introduction](https://eopf-toolkit.github.io/eopf-101/03_about_chunking/31_zarr_chunking_intro.html)
- [Cloud optimization practices](https://esipfed.github.io/cloud-computing-cluster/optimization-practices.html)

:::

## Step 2: Add Your Data Set to IPFS

Add your Zarr folder to IPFS using the `ipfs add` command:

```bash
ipfs add --recursive \
  --hidden \
  --raw-leaves \
  --chunker=size-1048576 \
  --cid-version=1 \
  --pin-name="halo-measurements-2026-01-23" \
  --quieter \
  ./my-dataset.zarr
```

This command:

1. **Merkleizes** the folder: converts files and directories into content-addressed blocks with UnixFS
1. **Pins** the data locally: prevents garbage collection from removing it
1. **Provides** the data: announces to the IPFS network that your node has it
1. **Outputs the root CID**: the identifier for your entire dataset

The `--quieter` flag outputs only the root CID, which identifies the complete dataset.

> **Note:** Check out the [lifecycle of data in IPFS](../../concepts/lifecycle.md) to learn more about how merkleizing, pinning, and providing work under the hood.

## Step 3: Organizing Your Data

Two options help manage multiple data sets on your node:

**Named pins** (`--pin-name`): Label data sets for easy identification in `ipfs pin ls`.

**[MFS (Mutable File System)](../../concepts/file-systems.md#mutable-file-system-mfs)**: MFS gives you an interface to organize content-addressed data under a familiar file system structure with folders and names, where the root of the MFS tree is itself a CID that changes every time you modify anything in the tree.

```bash
ipfs add ./halo-measurements-2026-01-23 --to-files=/datasets/halo-measurements-2026-01-23
```

## Step 4: Verify Providing Status

After adding, Kubo continuously announces your content to the network. Check the status:

```bash
ipfs provide stat
```

For detailed diagnostics, see the [provide system documentation](https://github.com/ipfs/kubo/blob/master/docs/provide-stats.md).

## Step 5: Content Discovery

Now that your data is available on the public network, the next step is making it discoverable to others. Choose a sharing approach based on your needs:

### Option A: Share the CID Directly

For one-off sharing, provide the CID directly:

```
ipfs://bafybeif52irmuurpb27cujwpqhtbg5w6maw4d7zppg2lqgpew25gs5eczm
```

### Option B: Use IPNS for Updatable References

If you want to share a stable identifier but be able to update the underlying dataset, create an [IPNS](https://docs.ipfs.tech/concepts/ipns/) name and share that instead. This is useful for datasets that get updated regularly: users can bookmark your IPNS name and always retrieve the latest version.

```bash
# Publish your dataset under your node's IPNS key
ipfs name publish /ipfs/<CID>

# Update to a new version later
ipfs name publish /ipfs/<new-CID>
```

IPNS is supported by all the retrieval methods in the [Accessing Published Data](#accessing-published-data) section below. Keep in mind that IPNS name resolution adds latency to the retrieval process.
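
If you publish several datasets from the same node, you may prefer a dedicated IPNS key per dataset instead of the node's default key. A minimal sketch; the key name and the `<CID>` / `<ipns-name>` values are placeholders:

```bash
# Generate a key for this dataset and publish under it
ipfs key gen halo-measurements
ipfs name publish --key=halo-measurements /ipfs/<CID>

# List keys to see the IPNS name (a libp2p-key CID) associated with each key
ipfs key list -l

# Anyone can resolve the IPNS name to the currently published CID
ipfs name resolve /ipns/<ipns-name>
```
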
+ +### Option C: Use DNSLink for Human-Readable URLs + +Link a DNS name to your CID by adding a TXT record: + +``` +_dnslink.data.example.org TXT "dnslink=/ipfs/" +``` + +Users can then access your data using one of the following methods: + +- With an IPFS gateway: `https://inbrowser.link/ipns/data.example.org` +- With Kubo: `ipfs cat /ipns/data.example.org/zarr.json` +- Using ipfsspec in Python as detailed below in [Python with ipfsspec](#python-with-ipfsspec), which also supports IPNS names, so you can use `ipns://data.example.org/zarr.json` directly. + +## Accessing Published Data + +Once published, users can access your Zarr datasets through multiple methods: + +### IPFS HTTP Gateways + +See the [retrieval guide](../../quickstart/retrieve.md). + +### Python with ipfsspec + +[ipfsspec](https://pypi.org/project/ipfsspec/) brings verified IPFS retrieval to the Python ecosystem by implementing the [fsspec](https://github.com/fsspec/filesystem_spec) interface, the same abstraction layer used by xarray, pandas, Dask, and Zarr for remote data access. + +```python +import xarray as xr + +# after the installation of ipfsspec, `ipfs://` urls are automatically recognized +ds = xr.open_dataset( + "ipfs://bafybeiesyutuduzqwvu4ydn7ktihjljicywxeth6wtgd5zi4ynxzqngx4m", + engine="zarr" +) +``` + +### JavaScript with Verified Fetch + +```javascript +import { verifiedFetch } from '@helia/verified-fetch' + +const response = await verifiedFetch('ipfs:///zarr.json') +``` + +## Choosing Your Approach + +Consider these factors when planning your publishing strategy: + +| Factor | Considerations | +| ------------------- | -------------------------------------------- | +| **Publishers** | Single node or multiple providers? | +| **Dataset size** | How large are individual datasets? | +| **Growth rate** | How frequently do you add new data? | +| **Content routing** | Public DHT, private DHT, or central indexer? | + +For most Geospatial use cases, start with a single Kubo node publishing to the public Amino DHT. Scale to multiple providers or private infrastructure as your needs grow. + +## Reference + +- [Kubo documentation](https://docs.ipfs.tech/install/command-line/) +- [Kubo configuration options](https://github.com/ipfs/kubo/blob/master/docs/config.md) +- [ipfsspec for Python](https://github.com/fsspec/ipfsspec/) +- [Cloud-Optimized Scientific Data (Zarr deep-dive)](https://tom-nicholas.com/blog/2025/cloud-optimized-scientific-data/) diff --git a/docs/quickstart/retrieve.md b/docs/quickstart/retrieve.md index 94a255d14..36fa25b6d 100644 --- a/docs/quickstart/retrieve.md +++ b/docs/quickstart/retrieve.md @@ -18,6 +18,7 @@ The CID you will retrieve is actually a folder containing a single image file. T - [IPFS retrieval methods](#ipfs-retrieval-methods) - [Verified vs. trusted CID retrieval](#verified-vs-trusted-cid-retrieval) - [Fetching the CID with Kubo](#fetching-the-cid-with-kubo) +- [Fetching the CID with Python and ipfsspec](#fetching-the-cid-with-python-and-ipfsspec) - [Fetching the CID with an IPFS Gateway](#fetching-the-cid-with-an-ipfs-gateway) - [Summary and next steps](#summary-and-next-steps) @@ -92,6 +93,37 @@ You may notice that there's a path following the CID, e.g. `bafybeicn7i3soqdgr7d ::: +## Fetching the CID with Python and ipfsspec + +[ipfsspec](https://github.com/fsspec/ipfsspec) is a read-only [fsspec](https://filesystem-spec.readthedocs.io/) implementation for IPFS. 
It performs **verified** retrieval by fetching [CAR files](../concepts/glossary.md#car) containing Merkle proofs, so you don't have to trust the gateway. It works without a local IPFS node.

1. Install `fsspec` and `ipfsspec`:

   ```bash
   pip install fsspec ipfsspec
   ```

1. Fetch the image using `fsspec.open`:

   ```python
   import fsspec

   cid = "ipfs://bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4/welcome-to-IPFS.jpg"

   with fsspec.open(cid, "rb") as f:
       image_data = f.read()
       print(f"Retrieved {len(image_data)} bytes")
   ```

You can also address the file directly by its own CID:

```python
with fsspec.open("ipfs://bafkreie7ohywtosou76tasm7j63yigtzxe7d5zqus4zu3j6oltvgtibeom", "rb") as f:
    image_data = f.read()
```

To determine which gateway to use, ipfsspec follows [IPIP-280](https://github.com/ipfs/specs/pull/280). You can point it at a different gateway by setting the `IPFS_GATEWAY` environment variable or by writing the gateway URL to `~/.ipfs/gateway`.

## Fetching the CID with an IPFS Gateway

To fetch the CID using an IPFS gateway is as simple as loading one of the following URLs: