-
Notifications
You must be signed in to change notification settings - Fork 87
Support sharding through config and raster_write_kwargs #1106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
7039c00
63fbe9a
b7e98b9
558fe97
8eebdb0
790be0c
1a1c673
471f72c
cd48574
0187fe9
b629de0
c2b1375
bf5b910
49441fe
6334fa8
73ca72a
c6041bb
278606a
f6c0a37
36e2271
50cb6bb
6257096
5525fbf
44414c1
e974647
bd6249e
2600237
bbb4bb6
b95c710
03aafc9
f44221a
f77cff4
d614519
10ba46b
c036d49
e65c0fb
26cb40b
2642b74
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -131,6 +131,10 @@ | |
| "settings", | ||
| ] | ||
|
|
||
| import zarr | ||
|
|
||
| zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}) | ||
|
Comment on lines
+134
to
+136
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Summoning @ilan-gold and @flying-sheep to comment on this please.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In general, I don't use global settings like this. I would only use it in a targeted way where you know it is what you want.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree, @melonora I'd suggest removing this. We should still make it discoverable to the users somehow. We could write a small tutorial on how to enable this.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You can also use it by default on certain methods. I'm not arguing against its use in a library, just in this way. See: https://github.com/scverse/anndata/blob/main/src/anndata/_io/zarr.py#L30-L46 which is only used for the top-level
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ok can remove, but then maybe should we move this context manager to scverse-misc?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. that’s a real trivial one, just copy it! I’d add a comment explaining why the warning filter is added (@ilan-gold can you link the issue pls?), otherwise the warning filter is just cargo-culted and not kept for a reason.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The warning is there in case zarrs doesn't support the store type you passed in to |
||
|
|
||
|
|
||
| def __getattr__(name: str) -> Any: | ||
| if name in _submodules: | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,8 +1,10 @@ | ||
| from __future__ import annotations | ||
|
|
||
| from collections.abc import Iterable | ||
| from typing import Any | ||
|
|
||
| from anndata import AnnData | ||
| from ome_zarr.types import JSONDict | ||
|
|
||
| from spatialdata._core.spatialdata import SpatialData | ||
|
|
||
|
|
@@ -164,3 +166,36 @@ def get_unique_name(name: str, attr: str, is_dataframe_column: bool = False) -> | |
| setattr(sanitized, attr, new_dict) | ||
|
|
||
| return None if inplace else sanitized | ||
|
|
||
|
|
||
| def create_raster_element_kwargs( | ||
| raster_write_kwargs: dict[str, JSONDict | list[JSONDict]] | list[JSONDict], | ||
| element_name: str, | ||
| element_names: set[str], | ||
| ) -> dict[str, Any] | list[dict[str, Any]]: | ||
|
Comment on lines
+171
to
+175
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This function would strongly benefit from a docstring. In particular explaining:
In particular, if the input/output types are implicitly defined in a library, e.g. by Dask, I suggest to write that this is the format defined by X library.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see that the input type is defined in the docstring of
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yeah I did it there because this is private API and write is public API, but can add it to private API as well if prefered.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please reopen if you would prefer that.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would put some context at least. Right now if a contributor sees that function, it can be quite confusing. |
||
| element_raster_write_kwargs = None | ||
| if isinstance(raster_write_kwargs, dict) and (kwargs := raster_write_kwargs.get(element_name)): | ||
| element_raster_write_kwargs = kwargs | ||
|
|
||
| if not element_raster_write_kwargs: | ||
|
Comment on lines
+171
to
+180
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not just |
||
| if isinstance(raster_write_kwargs, dict): | ||
| for name in element_names: | ||
| raster_write_kwargs.pop(name, None) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems to be mutating the user's input. We should make a copy instead. |
||
| if not raster_write_kwargs: | ||
| element_raster_write_kwargs = {} | ||
| elif isinstance(raster_write_kwargs, dict) and not all( | ||
| isinstance(x, (dict, list)) for x in raster_write_kwargs.values() | ||
| ): | ||
| element_raster_write_kwargs = raster_write_kwargs | ||
| elif isinstance(raster_write_kwargs, list): | ||
| if not all(isinstance(x, dict) for x in raster_write_kwargs): | ||
| raise ValueError( | ||
| "If passing raster_write_kwargs as list, it is assumed to be the storage " | ||
| "options for each scale of a multiscale raster as a dictionary." | ||
| ) | ||
| element_raster_write_kwargs = raster_write_kwargs | ||
| else: | ||
| raise ValueError( | ||
| f"Type of raster_write_kwargs should be either dict or list, got {type(raster_write_kwargs)}." | ||
| ) | ||
| return element_raster_write_kwargs | ||
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
|
|
@@ -16,6 +16,7 @@ | |||
| from dask.dataframe import DataFrame as DaskDataFrame | ||||
| from dask.dataframe import Scalar, read_parquet | ||||
| from geopandas import GeoDataFrame | ||||
| from ome_zarr.types import JSONDict | ||||
| from shapely import MultiPolygon, Polygon | ||||
| from upath import UPath | ||||
| from xarray import DataArray, DataTree | ||||
|
|
@@ -1108,6 +1109,7 @@ def write( | |||
| update_sdata_path: bool = True, | ||||
| sdata_formats: SpatialDataFormatType | list[SpatialDataFormatType] | None = None, | ||||
| shapes_geometry_encoding: Literal["WKB", "geoarrow"] | None = None, | ||||
| raster_write_kwargs: dict[str, JSONDict | list[JSONDict]] | list[JSONDict] | None = None, | ||||
| ) -> None: | ||||
| """ | ||||
| Write the `SpatialData` object to a Zarr store. | ||||
|
|
@@ -1155,7 +1157,27 @@ def write( | |||
| shapes_geometry_encoding | ||||
| Whether to use the WKB or geoarrow encoding for GeoParquet. See :meth:`geopandas.GeoDataFrame.to_parquet` | ||||
| for details. If None, uses the value from :attr:`spatialdata.settings.shapes_geometry_encoding`. | ||||
| """ | ||||
| raster_write_kwargs | ||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. please replace with docstring parameter since this docstring is repeated later
|
||||
| Storage options for raster elements. These options are passed to the zarr storage backend for writing and | ||||
| can be provided in several formats: | ||||
|
|
||||
| 1. Single dictionary | ||||
| A dictionary containing all storage options applied globally. | ||||
| 2. Dictionary per raster element | ||||
| A dictionary where: | ||||
| - Keys = names of raster elements | ||||
| - Values = storage options for each element | ||||
| - For single-scale data: a dictionary | ||||
| - For multiscale data: a list of dictionaries (one per scale) | ||||
| 3. List of dictionaries (multiscale only) | ||||
| A list where each dictionary defines the storage options for one scale of a multiscale raster element. | ||||
|
|
||||
| Important Notes | ||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here I would strongly recommend adding an example for users so that they have a recommendation on what to write, at least for Zarr v3. The tests already contain this information.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you mean as part of the docstring or actual doc? I would provide a follow up PR with specific docs. |
||||
| - The available key–value pairs in these dictionaries depend on the Zarr format used for writing. | ||||
| - For a full list of supported storage options, refer to: | ||||
| https://zarr.readthedocs.io/en/stable/api/zarr/create/#zarr.create_array | ||||
| """ | ||||
| from spatialdata._core._utils import create_raster_element_kwargs | ||||
| from spatialdata._io._utils import _resolve_zarr_store | ||||
| from spatialdata._io.format import _parse_formats | ||||
|
|
||||
|
|
@@ -1173,6 +1195,13 @@ def write( | |||
| store.close() | ||||
|
|
||||
| for element_type, element_name, element in self.gen_elements(): | ||||
| element_raster_write_kwargs = None | ||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead of calling this in |
||||
| if element_type in ("images", "labels") and raster_write_kwargs: | ||||
| element_names = set(self.images.keys()).union(self.labels.keys()) | ||||
| element_raster_write_kwargs = create_raster_element_kwargs( | ||||
| raster_write_kwargs, element_name, element_names | ||||
| ) | ||||
|
|
||||
| self._write_element( | ||||
| element=element, | ||||
| zarr_container_path=file_path, | ||||
|
|
@@ -1181,6 +1210,7 @@ def write( | |||
| overwrite=False, | ||||
| parsed_formats=parsed, | ||||
| shapes_geometry_encoding=shapes_geometry_encoding, | ||||
| element_raster_write_kwargs=element_raster_write_kwargs, | ||||
| ) | ||||
|
|
||||
| if self.path != file_path and update_sdata_path: | ||||
|
|
@@ -1198,6 +1228,7 @@ def _write_element( | |||
| overwrite: bool, | ||||
| parsed_formats: dict[str, SpatialDataFormatType] | None = None, | ||||
| shapes_geometry_encoding: Literal["WKB", "geoarrow"] | None = None, | ||||
| element_raster_write_kwargs: JSONDict | list[JSONDict] | None = None, | ||||
| ) -> None: | ||||
| from spatialdata._io.io_zarr import _get_groups_for_element | ||||
|
|
||||
|
|
@@ -1236,13 +1267,15 @@ def _write_element( | |||
| group=element_group, | ||||
| name=element_name, | ||||
| element_format=parsed_formats["raster"], | ||||
| storage_options=element_raster_write_kwargs, | ||||
| ) | ||||
| elif element_type == "labels": | ||||
| write_labels( | ||||
| labels=element, | ||||
| group=root_group, | ||||
| name=element_name, | ||||
| element_format=parsed_formats["raster"], | ||||
| storage_options=element_raster_write_kwargs, | ||||
| ) | ||||
| elif element_type == "points": | ||||
| write_points( | ||||
|
|
@@ -1273,6 +1306,7 @@ def write_element( | |||
| overwrite: bool = False, | ||||
| sdata_formats: SpatialDataFormatType | list[SpatialDataFormatType] | None = None, | ||||
| shapes_geometry_encoding: Literal["WKB", "geoarrow"] | None = None, | ||||
| raster_write_kwargs: dict[str, JSONDict | list[JSONDict] | Any] | list[JSONDict] | None = None, | ||||
| ) -> None: | ||||
| """ | ||||
| Write a single element, or a list of elements, to the Zarr store used for backing. | ||||
|
|
@@ -1291,12 +1325,32 @@ def write_element( | |||
| shapes_geometry_encoding | ||||
| Whether to use the WKB or geoarrow encoding for GeoParquet. See :meth:`geopandas.GeoDataFrame.to_parquet` | ||||
| for details. If None, uses the value from :attr:`spatialdata.settings.shapes_geometry_encoding`. | ||||
| raster_write_kwargs | ||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Duplicated docstring. I would rather use the
|
||||
| Storage options for raster elements. These options are passed to the zarr storage backend for writing and | ||||
| can be provided in several formats: | ||||
|
|
||||
| 1. Single dictionary | ||||
| A dictionary containing all storage options applied globally. | ||||
| 2. Dictionary per raster element | ||||
| A dictionary where: | ||||
| - Keys = names of raster elements | ||||
| - Values = storage options for each element | ||||
| - For single-scale data: a dictionary | ||||
| - For multiscale data: a list of dictionaries (one per scale) | ||||
| 3. List of dictionaries (multiscale only) | ||||
| A list where each dictionary defines the storage options for one scale of a multiscale raster element. | ||||
|
|
||||
| Important Notes | ||||
| - The available key–value pairs in these dictionaries depend on the Zarr format used for writing. | ||||
| - For a full list of supported storage options, refer to: | ||||
| https://zarr.readthedocs.io/en/stable/api/zarr/create/#zarr.create_array | ||||
|
|
||||
| Notes | ||||
| ----- | ||||
| If you pass a list of names, the elements will be written one by one. If an error occurs during the writing of | ||||
| an element, the writing of the remaining elements will not be attempted. | ||||
| """ | ||||
| from spatialdata._core._utils import create_raster_element_kwargs | ||||
| from spatialdata._io.format import _parse_formats | ||||
|
|
||||
| parsed_formats = _parse_formats(formats=sdata_formats) | ||||
|
|
@@ -1309,6 +1363,7 @@ def write_element( | |||
| overwrite=overwrite, | ||||
| sdata_formats=sdata_formats, | ||||
| shapes_geometry_encoding=shapes_geometry_encoding, | ||||
| raster_write_kwargs=raster_write_kwargs, | ||||
| ) | ||||
| return | ||||
|
|
||||
|
|
@@ -1336,6 +1391,11 @@ def write_element( | |||
|
|
||||
| self._check_element_not_on_disk_with_different_type(element_type=element_type, element_name=element_name) | ||||
|
|
||||
| element_raster_write_kwargs = None | ||||
| if element_type in ("images", "labels") and raster_write_kwargs: | ||||
| element_names = set(self.images.keys()).union(self.labels.keys()) | ||||
| element_raster_write_kwargs = create_raster_element_kwargs(raster_write_kwargs, element_name, element_names) | ||||
|
|
||||
| self._write_element( | ||||
| element=element, | ||||
| zarr_container_path=self.path, | ||||
|
|
@@ -1344,6 +1404,7 @@ def write_element( | |||
| overwrite=overwrite, | ||||
| parsed_formats=parsed_formats, | ||||
| shapes_geometry_encoding=shapes_geometry_encoding, | ||||
| element_raster_write_kwargs=element_raster_write_kwargs, | ||||
| ) | ||||
| # After every write, metadata should be consolidated, otherwise this can lead to IO problems like when deleting. | ||||
| if self.has_consolidated_metadata(): | ||||
|
|
||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's follow up on this.