Conversation

@lucas-diedrich lucas-diedrich commented Feb 17, 2025

Description

This PR addresses the challenge that the currently implemented and planned image loaders require loading imaging data entirely into memory, typically as NumPy arrays. Given the large size of microscopy datasets, this is not always feasible.

To mitigate this issue, and as discussed with @LucaMarconato, this PR aims to introduce a generalizable approach for reading large microscopy files in chunks, enabling efficient handling of data that does not fit into memory.

Some related discussions.

Strategy

In this PR, we focus on .tiff images, as implemented in the _tiff_to_chunks function.

  1. Obtain a lazy representation of the image via a suitable reader function (here: tifffile.memmap).
  2. Pre-define chunks that fit into memory, based on the dimensions of the image (_compute_chunks).
  3. Load the small chunks via a custom reader function and wrap them as lazy dask.array tasks, so the memory-mapped data never has to be resident all at once (_read_chunks).
  4. Reassemble the chunks into a single dask.array (via dask.array.block).
  5. Parse the result with Image2DModel.
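The chunk pre-computation in step 2 can be sketched as follows. This is an illustrative stand-in, not the PR's actual `_compute_chunks`; the function name, signature, and the `(y, x, height, width)` ordering are assumptions for this example.

```python
# Illustrative sketch of step 2: pre-computing chunk coordinates.
# NOT the PR's actual _compute_chunks; names and the
# (y, x, height, width) ordering are assumptions for this example.
def compute_chunk_grid(dim_y, dim_x, chunk_y, chunk_x):
    """Return a nested list of (y, x, height, width) windows covering the image."""
    grid = []
    for y in range(0, dim_y, chunk_y):
        row = []
        for x in range(0, dim_x, chunk_x):
            # edge chunks are clipped to the image bounds
            row.append((y, x, min(chunk_y, dim_y - y), min(chunk_x, dim_x - x)))
        grid.append(row)
    return grid
```

For a 5x7 image with 3x4 chunks this yields a 2x2 grid whose bottom-right window is clipped to `(3, 4, 2, 3)`.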

The strategy is implemented in

  • src/spatialdata_io/readers/generic.py and
  • src/spatialdata_io/readers/_utils/_image.py
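Steps 3 and 4 can be illustrated with a self-contained sketch in which a plain NumPy array stands in for the `tifffile.memmap` handle and `numpy.block` plays the role of `dask.array.block`; the names and window layout here are invented for the example.

```python
import numpy as np

# A plain array stands in for the tifffile.memmap handle (illustrative only).
image = np.arange(6 * 8).reshape(6, 8)

def read_window(y, x, h, w):
    # custom reader function: crop one window from the lazy source
    return image[y:y + h, x:x + w]

# 2x2 grid of (y, x, height, width) windows covering the 6x8 image
grid = [[(0, 0, 3, 4), (0, 4, 3, 4)],
        [(3, 0, 3, 4), (3, 4, 3, 4)]]

# steps 3-4: read each chunk, then reassemble; with dask this would be
# dask.array.block over lazily evaluated reads instead of numpy.block
blocks = [[read_window(*win) for win in row] for row in grid]
assembled = np.block(blocks)
assert (assembled == image).all()
```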

Future extensions

The strategy can be implemented for any image type, as long as it is possible to

  1. implement a lazy image-data loader, and
  2. define a custom reader function.

We have implemented similar readers for openslide-compatible whole-slide images and the Carl Zeiss microscopy (CZI) format.
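The two requirements above can be captured abstractly. The sketch below is hypothetical (all names are invented for illustration): any format plugs in by providing a shape and a window-reading callable.

```python
# Hypothetical interface sketch: a format is supported once it provides
# (1) a lazy handle exposing the image shape and (2) a window reader.
# All names here are invented for illustration.
from typing import Callable

def iter_windows(shape: tuple[int, int],
                 read_window: Callable,
                 chunk: tuple[int, int] = (512, 512)):
    """Yield ((y, x, h, w), data) pairs, one per chunk."""
    dim_y, dim_x = shape
    for y in range(0, dim_y, chunk[0]):
        for x in range(0, dim_x, chunk[1]):
            h = min(chunk[0], dim_y - y)
            w = min(chunk[1], dim_x - x)
            yield (y, x, h, w), read_window(y, x, h, w)
```

For openslide, `read_window` would wrap `OpenSlide.read_region`; for CZI, the vendor reader's region API.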

@lucas-diedrich lucas-diedrich marked this pull request as draft February 17, 2025 17:03
codecov-commenter commented Feb 17, 2025

Codecov Report

❌ Patch coverage is 96.61017% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.96%. Comparing base (2ebfab0) to head (0c482ac).

Files with missing lines               | Patch % | Lines
src/spatialdata_io/readers/generic.py  | 95.00%  | 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #279      +/-   ##
==========================================
+ Coverage   55.16%   55.96%   +0.80%     
==========================================
  Files          26       27       +1     
  Lines        2844     2898      +54     
==========================================
+ Hits         1569     1622      +53     
- Misses       1275     1276       +1     
Files with missing lines                     | Coverage Δ
src/spatialdata_io/readers/_utils/_image.py  | 100.00% <100.00%> (ø)
src/spatialdata_io/readers/generic.py        | 91.42% <95.00%> (+5.71%) ⬆️
@lucas-diedrich lucas-diedrich marked this pull request as ready for review March 21, 2025 15:44
@melonora melonora left a comment

Thanks for your contribution! I have two minor suggestions. I also saw that you use the width-by-height convention. Personally, I don't have a strong opinion here, though we could also stick to array API conventions. @LucaMarconato WDYT? Pre-approving for now.

@melonora melonora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I had to change my review after rethinking memmap. It does not always work, for example when dealing with compressed TIFFs, as far as I am aware.

lucas-diedrich commented May 4, 2025

Thanks! I addressed your comments:

  • Naming and simplification: 46ed3e5, 99731fe
  • Read compressed images with dask_image.imread instead of tifffile: 9e057de

@lucas-diedrich
Hi @LucaMarconato - I just saw that you pushed some updates. Could you comment on the current implementation? Is this PR still interesting for you?

@LucaMarconato
@lucas-diedrich Yes, I am still interested in it, and I have time now to go through it. I added my "code review" as "TODO" items for myself. I'll check the code and push modifications if I see that minor fixes are needed; I'll comment on any larger changes, but the code looks good. I'll also benchmark performance with asv.

@LucaMarconato
There was a problem with the dimensions of the returned data. Image2DModel.parse() always returns a cyx image, but the reference variable in test_read_tiff() held a non-cyx image. Now it's fixed.

LucaMarconato commented Jan 13, 2026

@lucas-diedrich I'm done with edits, please double check the changes if you have time. The main comment, added in the notes in _read_chunks(), is the following:

    Notes
    -----
    As shown in `_compute_chunks()`, `coords` are in the form
    `(x, y, width, height)`. In that function, the inner list (dim = -1)
    iterates over `y` values, and the outer list (dim = -2) iterates over
    `x` values. In `_read_chunks()`, we use the more common `(y, x)`
    ordering: the inner list (dim = -1) iterates over `x` values, and the
    outer list (dim = -2) iterates over `y` values.

    This mismatch can be confusing. A straightforward fix is to standardize
    `coords` to `(y, x, height, width)` instead of `(x, y, width, height)`.

In summary, redefining coords as (y, x, height, width) would make the code more readable, but it also works as it is, so no strong opinion.
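The suggested standardization can be sketched as follows; this is an illustrative helper, not code from the PR. It transposes the nesting (so the inner list iterates over `x`) and reorders each tuple to `(y, x, height, width)`.

```python
# Illustrative helper (not from the PR): convert a coords grid from
# (x, y, width, height) with the inner list iterating over y, to
# (y, x, height, width) with the inner list iterating over x.
def standardize_coords(coords_xy):
    # zip(*...) transposes the nesting; the tuple unpacking reorders axes
    return [[(y, x, h, w) for (x, y, w, h) in col]
            for col in zip(*coords_xy)]
```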

I will now do the benchmark with asv.

4 participants