Support sharding through config and raster_write_kwargs #1106

Open
melonora wants to merge 36 commits into scverse:main from melonora:support_sharding

@melonora
Collaborator

@melonora melonora commented Apr 14, 2026

This PR adds the following:

  • Passing kwargs for zarr.create_array directly as raster_write_kwargs to io functions such as .write and .write_element. This also makes it possible to write sharded arrays. Support for anndata sharding will be added in a follow-up PR.
  • Proper docstrings for the new raster_write_kwargs argument.
  • An extension of the current config to include raster_chunks and raster_shards. The config can now be stored in a default location or a custom location. Additionally, environment variables can be set to temporarily override the stored values.
  • Adding zarrs as a dependency and enabling its codec pipeline by default to allow for faster io when writing shards. Whether we should do this, or instead make it opt-in for advanced users, is a point for discussion.
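
The precedence described in the config bullet (built-in defaults, then a stored settings file, then environment variables) can be sketched roughly as follows. This is only an illustration under stated assumptions: the `DemoSettings` class, the `load_settings` helper, and the `SPATIALDATA_RASTER_CHUNKS` / `SPATIALDATA_RASTER_SHARDS` variable names are hypothetical and are not taken from this PR.

```python
import json
import os
import tempfile
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Optional


@dataclass
class DemoSettings:
    # Hypothetical stand-in for the PR's Settings class.
    raster_chunks: Optional[tuple[int, ...]] = None
    raster_shards: Optional[tuple[int, ...]] = None


def _parse_sizes(text: str) -> tuple[int, ...]:
    """Parse a comma-separated string such as "512,512"."""
    return tuple(int(part) for part in text.split(","))


def load_settings(path: Path, env: Optional[dict[str, str]] = None) -> DemoSettings:
    """Layer defaults < JSON file < environment variables."""
    env = dict(os.environ) if env is None else env
    settings = DemoSettings()
    names = {f.name for f in fields(DemoSettings)}

    # 1) Values from the settings file, if it exists.
    if path.exists():
        for key, value in json.loads(path.read_text(encoding="utf-8")).items():
            if key in names and value is not None:
                setattr(settings, key, tuple(value))

    # 2) Environment variables win, but only for the current process.
    for name in names:
        raw = env.get(f"SPATIALDATA_{name.upper()}")  # assumed naming scheme
        if raw:
            setattr(settings, name, _parse_sizes(raw))
    return settings


with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "settings.json"
    config_file.write_text(json.dumps({"raster_chunks": [512, 512]}), encoding="utf-8")
    from_file = load_settings(config_file, env={})
    overridden = load_settings(config_file, env={"SPATIALDATA_RASTER_CHUNKS": "256,256"})
```

In this sketch the file on disk is never modified by the environment variable, which matches the "temporarily override" wording above.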

Additional changes

  • The minimum supported version of dask is now 2026.3.0. Only this version exposes the zarr api in a way that avoids the risk of zarr format 2 being written into a zarr v3 group and vice versa. It also includes the setting that prevents the collapse of dask dataframe partitions after reading from parquet.

@LucaMarconato

@melonora
Collaborator Author

melonora commented Apr 14, 2026

Failing at the moment because the required ome-zarr version has not been released yet. You can test locally with ome-zarr-py from main.

Also, I still need to add support for zarrs to improve the speed of shard io.

@codecov

codecov Bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 90.59829% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.56%. Comparing base (d8bf265) to head (e65c0fb).
⚠️ Report is 4 commits behind head on main.

Files with missing lines            Patch %   Lines
src/spatialdata/config.py           87.50%    9 Missing ⚠️
src/spatialdata/_core/_utils.py     90.00%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1106      +/-   ##
==========================================
+ Coverage   92.06%   92.56%   +0.49%     
==========================================
  Files          51       51              
  Lines        7792     7868      +76     
==========================================
+ Hits         7174     7283     +109     
+ Misses        618      585      -33     
Files with missing lines                Coverage Δ
src/spatialdata/__init__.py             96.00% <100.00%> (+0.34%) ⬆️
src/spatialdata/_core/spatialdata.py    93.92% <100.00%> (+1.96%) ⬆️
src/spatialdata/_io/io_raster.py        95.67% <100.00%> (+2.45%) ⬆️
src/spatialdata/_core/_utils.py         97.05% <90.00%> (-2.95%) ⬇️
src/spatialdata/config.py               88.60% <87.50%> (-11.40%) ⬇️

... and 5 files with indirect coverage changes


The reason for only supporting these versions is that they provide the proper use of the zarr api inside dask and also
the possibility for setting the tune optimization. The latter is required to prevent errors due to collapsing dask partitions
when reading data back in from parquet.
@Mr-Milk

Mr-Milk commented Apr 15, 2026

Should we also allow the control of sharding for anndata?

@melonora
Collaborator Author

Yes, but not as part of this PR. I will adjust the config though to accommodate.

Comment thread src/spatialdata/_io/io_raster.py Outdated
Comment thread src/spatialdata/_io/io_raster.py

if isinstance(storage_options, dict):
    storage_options = {
        **{k.split("_")[1]: v for k, v in asdict(settings).items() if k in ("raster_chunks", "raster_shards")},
Member

I'd bundle this into a helper function since it is used three times.
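
One possible shape for such a helper is sketched below. The names `DemoSettings` and `with_raster_defaults` are hypothetical, and two behaviours are assumptions that the truncated snippet above does not confirm: unset (`None`) defaults are skipped, and explicit entries in `storage_options` take precedence over the config defaults.

```python
from dataclasses import asdict, dataclass
from typing import Any, Optional


@dataclass
class DemoSettings:
    # Hypothetical stand-in for spatialdata's Settings.
    raster_chunks: Optional[tuple[int, ...]] = None
    raster_shards: Optional[tuple[int, ...]] = None
    shapes_geometry_encoding: str = "WKB"


def with_raster_defaults(settings: DemoSettings, storage_options: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict of storage options with config defaults filled in.

    "raster_chunks" is exposed as "chunks" and "raster_shards" as "shards".
    Explicit entries in ``storage_options`` win over the defaults (assumption).
    """
    defaults = {
        key.removeprefix("raster_"): value
        for key, value in asdict(settings).items()
        if key in ("raster_chunks", "raster_shards") and value is not None
    }
    return {**defaults, **storage_options}


settings = DemoSettings(raster_chunks=(512, 512), raster_shards=(1024, 1024))
merged = with_raster_defaults(settings, {"chunks": (256, 256)})
untouched = with_raster_defaults(DemoSettings(), {})
```

Keeping the key mapping in one place would also make it easier to change later, for example if the settings are split per element type.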

Comment thread src/spatialdata/_io/io_raster.py
Comment on lines +134 to +136
import zarr

zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
Member

Summoning @ilan-gold and @flying-sheep to comment on this please.

Contributor

In general, I don't use global settings like this. I would only use it in a targeted way where you know it is what you want.

Member

I agree, @melonora I'd suggest removing this. We should still make it discoverable to the users somehow. We could write a small tutorial on how to enable this.

Contributor

You can also use it by default on certain methods. I'm not arguing against its use in a library, just in this way. See: https://github.com/scverse/anndata/blob/main/src/anndata/_io/zarr.py#L30-L46 which is only used for the top-level AnnData.write_zarr and anndata.read_zarr because those are "bulk operations" where I don't expect users to be very opinionated.
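
The difference between a global setting and a targeted one can be illustrated with a generic scoped override. This sketch deliberately uses a plain dictionary, `GLOBAL_CONFIG`, as a stand-in for a library-wide config object; it is not zarr's config API, and `scoped_config` and `bulk_write` are hypothetical names.

```python
from contextlib import contextmanager
from typing import Any, Iterator

# Stand-in for a library-wide config registry (an assumption; not zarr's object).
GLOBAL_CONFIG: dict[str, Any] = {"codec_pipeline.path": "default"}


@contextmanager
def scoped_config(overrides: dict[str, Any]) -> Iterator[None]:
    """Apply overrides for the duration of the block, then restore the old state."""
    previous = {key: GLOBAL_CONFIG.get(key) for key in overrides}
    missing = {key for key in overrides if key not in GLOBAL_CONFIG}
    GLOBAL_CONFIG.update(overrides)
    try:
        yield
    finally:
        for key, value in previous.items():
            if key in missing:
                GLOBAL_CONFIG.pop(key, None)  # key did not exist before the block
            else:
                GLOBAL_CONFIG[key] = value


def bulk_write() -> str:
    """Hypothetical bulk operation that opts into the faster pipeline."""
    with scoped_config({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}):
        # Only code running inside this block sees the override.
        return GLOBAL_CONFIG["codec_pipeline.path"]


inside = bulk_write()
after = GLOBAL_CONFIG["codec_pipeline.path"]
```

With this pattern the override is limited to the bulk operation, so users who configured a different pipeline elsewhere are not affected once the call returns.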

Collaborator Author

OK, I can remove it, but then should we maybe move this context manager to scverse-misc?

Member

@flying-sheep flying-sheep May 20, 2026

that’s a real trivial one, just copy it!

I’d add a comment explaining why the warning filter is added (@ilan-gold can you link the issue pls?), otherwise the warning filter is just cargo-culted and not kept for a reason.

@ilan-gold
Contributor

I'll loop in @ilan-gold and @flying-sheep. What's the plan for anndata on this?

I haven't made it a default dependency for anndata but it will be enabled by default if discovered.

Comment thread pyproject.toml
- "dask>=2025.12.0,<2026.1.2",
- "distributed<2026.1.2",
+ "dask>=2026.3.0",
+ "distributed>=2026.3.0",
Member

@melonora can we drop this dep altogether?

Comment thread pyproject.toml
"xarray>=2024.10.0",
"xarray-spatial>=0.3.5",
"zarr>=3.0.0",
"zarrs",
Member

Let's follow up on this.

Comment thread src/spatialdata/config.py


@dataclass
class Settings:
Member

General comment. Sounds like this file is a duplicate of the settings mechanism from scverse-misc: https://github.com/scverse/scverse-misc/blob/main/src/scverse_misc/_settings.py.

I think scverse-misc doesn't support reading settings from .json (they support .env files), but since it's based on Pydantic, it would be a quick extension.

I'll go through the PR and review. In a follow-up PR we should probably align to use scverse-misc.

@ilia-kats @flying-sheep, devs behind scverse-misc, may have extra comments.

Collaborator Author

Would this be something for this PR, or for a follow-up PR before the release? I believe settings were not stored at all before, and in general I think we need a separate PR to move towards using scverse-misc, which would also add the context manager there.

Comment thread src/spatialdata/config.py
from pathlib import Path
from typing import Literal

from platformdirs import user_config_dir
Member

platformdirs is a transitive dependency; it should be added to pyproject.toml.

Comment thread src/spatialdata/config.py
Comment on lines +33 to +38
raster_chunks
    The chunksize to use for chunking an array. Length of the tuple must match
    the number of dimensions.
raster_shards
    The default shard size (zarr v3) to use when storing arrays. Length of the tuple
    must match the number of dimensions.
Member

It feels a bit odd to have these two variables in a class meant for global settings since, if I understand correctly, for cyx data we would need len(raster_chunks) == 3, while for yx data we would need len(raster_chunks) == 2.
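
The mismatch can be made concrete with a small check that compares a configured chunk tuple against the dimensions of the element being written. This is a sketch only; `validate_chunks` is a hypothetical helper and not part of this PR.

```python
from typing import Optional, Sequence


def validate_chunks(chunks: Optional[Sequence[int]], dims: Sequence[str]) -> Optional[tuple[int, ...]]:
    """Return chunks as a tuple if their length matches ``dims``; otherwise raise."""
    if chunks is None:
        return None  # nothing configured, so the writer picks its own default
    if len(chunks) != len(dims):
        raise ValueError(
            f"Got {len(chunks)} chunk sizes {tuple(chunks)} for {len(dims)} dims {tuple(dims)}."
        )
    return tuple(chunks)


# A single global 2D default fits yx labels ...
labels_chunks = validate_chunks((512, 512), ("y", "x"))

# ... but the same value cannot be applied to a cyx image.
try:
    validate_chunks((512, 512), ("c", "y", "x"))
    message = ""
except ValueError as err:
    message = str(err)
```

A single global value can therefore only ever be valid for one dimensionality at a time, which is the concern raised above.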

Collaborator Author

Want to split these into image_shards/chunks and the label equivalents, or just remove them altogether? I would vote for the former.

Comment thread src/spatialdata/config.py
must match the number of dimensions.
"""

custom_config_path: Path | None = None
Member

I don't think the settings should contain this path. Feels out of place.

Collaborator Author

I believe that depends on whether certain settings are project-specific, which this allows for. Otherwise, global settings on an HPC system could be shared by more than one user even though there are project-specific needs.

Comment thread src/spatialdata/config.py
Comment on lines +41 to 42
custom_config_path: Path | None = None
shapes_geometry_encoding: Literal["WKB", "geoarrow"] = "WKB"
Member

It feels weird because I could have a situation like the following:

pathA/settings.json containing custom_config_path == pathB/settings.json
pathB/settings.json containing custom_config_path == pathA/settings.json
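
A minimal sketch of how such a loop could be detected when following `custom_config_path` links is shown below. The `resolve_config_chain` function and the file layout are hypothetical; the sketch only assumes that each settings file is JSON with an optional `custom_config_path` entry.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def resolve_config_chain(start: Path) -> list[Path]:
    """Follow custom_config_path links from ``start``; raise on a cycle."""
    chain: list[Path] = []
    seen: set[Path] = set()
    current: Optional[Path] = start.resolve()
    while current is not None:
        if current in seen:
            raise ValueError(f"Circular custom_config_path involving {current}")
        seen.add(current)
        chain.append(current)
        data = json.loads(current.read_text(encoding="utf-8"))
        target = data.get("custom_config_path")
        current = Path(target).resolve() if target else None
    return chain


def _write(path: Path, target: Optional[Path]) -> None:
    """Write a settings file that points at ``target`` (or at nothing)."""
    payload = {"custom_config_path": str(target) if target else None}
    path.write_text(json.dumps(payload), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path_a = Path(tmp) / "pathA_settings.json"
    path_b = Path(tmp) / "pathB_settings.json"
    path_c = Path(tmp) / "standalone_settings.json"
    _write(path_a, path_b)  # A points to B
    _write(path_b, path_a)  # B points back to A
    _write(path_c, None)    # no redirection

    single_chain = resolve_config_chain(path_c)
    try:
        resolve_config_chain(path_a)
        cycle_detected = False
    except ValueError:
        cycle_detected = True
```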

Comment thread src/spatialdata/config.py
Comment on lines +83 to +85
global_data["custom_config_path"] = str(target)
with global_path.open("w", encoding="utf-8") as f:
    json.dump(global_data, f, indent=2)
Member

This handles the circular path scenario that I described here: #1106 (comment)

Comment thread src/spatialdata/config.py
Comment thread src/spatialdata/config.py Outdated
Comment thread src/spatialdata/config.py
Comment on lines +12 to +14
def _config_path() -> Path:
    """Return the platform-appropriate path to the user config file."""
    return Path(user_config_dir(appname="spatialdata")) / "settings.json"
Member

Why not move this inside config_path()?

assert s.large_chunk_threshold_bytes == 1_000_000_000
assert s.raster_chunks == [512, 512]
assert s.raster_shards == [1024, 1024]
assert s.custom_config_path is None
Member

Fine. But since we called save(), shouldn't custom_config_path be equal to the path we are reading from?

This is a consequence of https://github.com/melonora/spatialdata/blob/f44221ab23d6b136084157203e1d0b466979318c/src/spatialdata/config.py#L65

path is None, so it gets assigned the value from _config_path(), without saving it in the in-memory value of custom_config_path.


s.reset()
s.save()
assert s.custom_config_path is None # This returns False
Member

@LucaMarconato LucaMarconato May 12, 2026

what's the comment about?

@LucaMarconato
Member

I haven't made it a default dependency for anndata but it will be enabled by default if discovered.

  • @melonora I would do the same as @ilan-gold is doing in anndata and therefore drop the dependency (we can add it as an optional dependency).
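
The "enabled by default if discovered" behaviour can be approximated with a standard-library check that looks for the package without importing it. This is a sketch and not anndata's actual implementation; `is_installed` and `optional_pipeline` are hypothetical names, and only the pipeline path string is taken from this PR.

```python
import importlib.util
from typing import Optional

ZARRS_PIPELINE = "zarrs.ZarrsCodecPipeline"  # path quoted earlier in this PR


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def optional_pipeline(module_name: str, pipeline_path: str) -> Optional[str]:
    """Return ``pipeline_path`` only when the optional package is discoverable."""
    return pipeline_path if is_installed(module_name) else None


# None when zarrs is absent, the pipeline path when it is installed.
chosen = optional_pipeline("zarrs", ZARRS_PIPELINE)
```

With zarrs declared as an optional dependency, a check like this would let the writer fall back to the default pipeline when the package is missing.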

assert arr.shards == write_shards

other_arr = zarr.open_group(path / zarr_subpath / other_name, mode="r")["s0"]
assert other_arr.chunks == base_chunks
Member

we could add an explicit check that shards are None (?) here.

assert other_arr.chunks == base_chunks


def test_write_raster_elements_sharding_chunking(tmp_path: Path) -> None:
Member

What are we testing here that was not tested before? That write_element() supports sharding? In that case I'd rename the test.

@LucaMarconato
Member

LucaMarconato commented May 12, 2026

Looks good in general. There are a few decisions to be made, but once they are made, most of the requested changes can be done in one shot by an agent.

Comment thread pyproject.toml
"networkx",
"numba>=0.55.0",
"numpy",
"ome_zarr>=0.14.0",
Member

Shouldn't we bump to ome_zarr>=0.16.0?


5 participants