Support sharding through config and raster_write_kwargs#1106
Conversation
- Added additional settings - Allow environment variables that overwrite config
|
Failing atm due to ome-zarr not yet being released. You can test locally with ome-zarr-py from main. Also, need to add support for zarrs to improve speed of shard io |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1106 +/- ##
==========================================
+ Coverage 92.06% 92.56% +0.49%
==========================================
Files 51 51
Lines 7792 7868 +76
==========================================
+ Hits 7174 7283 +109
+ Misses 618 585 -33
🚀 New features to boost your workflow:
|
The reason for only supporting these versions is that they provide the proper use of the zarr api inside dask and also the possibility for setting the tune optimization. The latter is required to prevent errors due to collapsing dask partitions when reading data back in from parquet.
|
Should we also allow the control of sharding for anndata? |
|
Yes, but not as part of this PR. I will adjust the config though to accommodate. |
|
|
||
| if isinstance(storage_options, dict): | ||
| storage_options = { | ||
| **{k.split("_")[1]: v for k, v in asdict(settings).items() if k in ("raster_chunks", "raster_shards")}, |
There was a problem hiding this comment.
I'd bundle this into an helper function since it is called 3 times.
| import zarr | ||
|
|
||
| zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}) |
There was a problem hiding this comment.
Summoning @ilan-gold and @flying-sheep to comment on this please.
There was a problem hiding this comment.
In general, I don't use global settings like this. I would only use it in a targeted way where you know it is what you want.
There was a problem hiding this comment.
I agree, @melonora I'd suggest removing this. We should still make it discoverable to the users somehow. We could write a small tutorial on how to enable this.
There was a problem hiding this comment.
You can also use it by default on certain methods. I'm not arguing against its use in a library, just in this way. See: https://github.com/scverse/anndata/blob/main/src/anndata/_io/zarr.py#L30-L46 which is only used for the top-level AnnData.write_zarr and anndata.read_zarr because those are "bulk operations" where I don't expect users to be very opinionated.
There was a problem hiding this comment.
ok can remove, but then maybe should we move this context manager to scverse-misc?
There was a problem hiding this comment.
that’s a real trivial one, just copy it!
I’d add a comment explaining why the warning filter is added (@ilan-gold can you link the issue pls?), otherwise the warning filter is just cargo-culted and not kept for a reason.
I haven't made it a default dependency for |
| "dask>=2025.12.0,<2026.1.2", | ||
| "distributed<2026.1.2", | ||
| "dask>=2026.3.0", | ||
| "distributed>=2026.3.0", |
| "xarray>=2024.10.0", | ||
| "xarray-spatial>=0.3.5", | ||
| "zarr>=3.0.0", | ||
| "zarrs", |
|
|
||
|
|
||
| @dataclass | ||
| class Settings: |
There was a problem hiding this comment.
General comment. Sounds like this file is a duplicate of the settings mechanism from scverse-misc: https://github.com/scverse/scverse-misc/blob/main/src/scverse_misc/_settings.py.
I think scverse-misc doesn't support reading settings from .json (they support .env files), but since it's based on Pydantic, it would be a quick extension.
I'll go trough the PR and review. In a follow-up PR we should probably align to use scverse-misc.
@ilia-kats @flying-sheep, devs behind scverse-misc, may have extra comments.
There was a problem hiding this comment.
would this be something for this PR or follow up PR before release? I believe before settings were not stored at all in any case and I think in general we need a PR to move towards using scverse-misc, also with adding the context manager there.
| from pathlib import Path | ||
| from typing import Literal | ||
|
|
||
| from platformdirs import user_config_dir |
There was a problem hiding this comment.
platformdirs is a transitive dependency; it should be added to pyproject.toml.
| raster_chunks | ||
| The chunksize to use for chunking an array. Length of the tuple must match | ||
| the number of dimensions. | ||
| raster_shards | ||
| The default shard size (zarr v3) to use when storing arrays. Length of the tuple | ||
| must match the number of dimensions. |
There was a problem hiding this comment.
Feels a bit weird to have these 2 variables in a class that is for global settings, since, if I am getting this right, we for cyx data we would need len(raster_chunks) == 3, while for yx data we would need len(raster_chunks) == 2.
There was a problem hiding this comment.
want to split in image_shards/chunks and label or just remove all together? I would maybe vote for the former.
| must match the number of dimensions. | ||
| """ | ||
|
|
||
| custom_config_path: Path | None = None |
There was a problem hiding this comment.
I don't think the settings should contain this path. Feels out of place.
There was a problem hiding this comment.
depends on whether certain settings are project specific I believe. This allows for that. Otherwise global settings on hpc could be shared by more than 1 user even though there are project specific needs.
| custom_config_path: Path | None = None | ||
| shapes_geometry_encoding: Literal["WKB", "geoarrow"] = "WKB" |
There was a problem hiding this comment.
It feels weird because I could have a situation like the following:
pathA/settings.json containing custom_config_path == pathB/settings.json
pathB/settings.json containing custom_config_path == pathA/settings.json
| global_data["custom_config_path"] = str(target) | ||
| with global_path.open("w", encoding="utf-8") as f: | ||
| json.dump(global_data, f, indent=2) |
There was a problem hiding this comment.
This handles the circular path scenario that I described here: #1106 (comment)
| def _config_path() -> Path: | ||
| """Return the platform-appropriate path to the user config file.""" | ||
| return Path(user_config_dir(appname="spatialdata")) / "settings.json" |
There was a problem hiding this comment.
why not moving this inside config_path()?
| assert s.large_chunk_threshold_bytes == 1_000_000_000 | ||
| assert s.raster_chunks == [512, 512] | ||
| assert s.raster_shards == [1024, 1024] | ||
| assert s.custom_config_path is None |
There was a problem hiding this comment.
Fine. But since we called save(), shouldn't custom_config_path be equal to the path we are reading from?
This is a consequence of https://github.com/melonora/spatialdata/blob/f44221ab23d6b136084157203e1d0b466979318c/src/spatialdata/config.py#L65
path is None, so it gets assigned the value from _config_path(), without saving it in the in-memory value of custom_config_path.
|
|
||
| s.reset() | ||
| s.save() | ||
| assert s.custom_config_path is None # This returns False |
There was a problem hiding this comment.
what's the comment about?
|
| assert arr.shards == write_shards | ||
|
|
||
| other_arr = zarr.open_group(path / zarr_subpath / other_name, mode="r")["s0"] | ||
| assert other_arr.chunks == base_chunks |
There was a problem hiding this comment.
we could add an explicit check that shards are None (?) here.
| assert other_arr.chunks == base_chunks | ||
|
|
||
|
|
||
| def test_write_raster_elements_sharding_chunking(tmp_path: Path) -> None: |
There was a problem hiding this comment.
What are we testing that it is not tested before? That write_element() supports sharding? In that case I'd rename the test.
|
Looks good in general. There are a few decisions to be made, but once done most of the requested changes can be one-shot by an agent. |
| "networkx", | ||
| "numba>=0.55.0", | ||
| "numpy", | ||
| "ome_zarr>=0.14.0", |
There was a problem hiding this comment.
Shouldn't we bump to ome_zarr>=0.16.0
chore: fix typo in docstrings
…to support_sharding
This PR adds the following:
raster_write_kwargsfor io functions like.writeand.write_element. This also adds the ability to write sharded arrays. Support for anndata sharding is to be added in a follow up PR.raster_write_kwargsargument.raster_chunksandraster_shards. The config can now be stored in a default location or a custom location. Additionally, environment variables can be set to temporarily override the values.Additional changes
@LucaMarconato