
Reduce dask graph size for large raster reprojection #1065

Merged
brendancol merged 3 commits into master from optimize/dask-graph-reduction on Mar 24, 2026
Conversation

@brendancol
Contributor

Summary

For a 30TB raster with millions of chunks, the previous _reproject_dask and _merge_dask functions created one dask.delayed + one da.from_delayed node per output chunk, then stitched them with da.block. The graph metadata alone could exceed available memory.

This PR makes three changes:

  • Replace delayed/from_delayed/block with da.map_blocks over a template array. This produces a single blockwise layer in the HighLevelGraph with O(1) metadata, regardless of chunk count
  • Add empty-chunk skipping: precompute the source footprint in target CRS and skip chunks that don't overlap, avoiding pyproj init and source data fetching for nodata-only chunks
  • Remove the 1GB graph-size streaming threshold since graph metadata is now constant-size
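The map_blocks-over-template pattern described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `fill_chunk` is a hypothetical placeholder for the per-chunk reprojection, and the shapes are arbitrary.

```python
import numpy as np
import dask.array as da

def fill_chunk(template_block):
    # Placeholder for reprojecting one output window; the real function
    # would fetch and warp the matching source window here.
    return np.full(template_block.shape, 1.0, dtype=template_block.dtype)

# The template defines the output shape, chunks, and dtype without
# holding any data; mapping over it yields one blockwise graph layer
# no matter how many chunks the output has.
template = da.empty((8, 8), chunks=(4, 4), dtype="float64")
out = template.map_blocks(fill_chunk, dtype="float64")
result = out.compute()
```

Because the template is never filled with real values, its allocation cost is graph metadata only; all actual work happens inside the chunk function.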

Test plan

  • 6 new tests in TestDaskGraphOptimization covering graph structure, numpy/dask parity, empty-chunk skipping, and helper functions
  • Full test_reproject.py suite passes (85/85)

Replace the O(N) delayed/from_delayed/block pattern in _reproject_dask
and _merge_dask with da.map_blocks over a template array. This produces
a single blockwise layer in the HighLevelGraph with O(1) metadata, so
graph construction no longer scales with chunk count.
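The constant-size-metadata claim can be checked directly on the HighLevelGraph: the number of layers stays the same as the chunk count grows, even though the task count does not. A minimal sketch (hypothetical `double` chunk function, arbitrary shapes):

```python
import dask.array as da

def double(block):
    return block * 2

# 100 chunks vs 10,000 chunks: the layer count is identical, because
# map_blocks contributes a single blockwise layer in both cases.
small = da.empty((100, 100), chunks=(10, 10)).map_blocks(double)
large = da.empty((1000, 1000), chunks=(10, 10)).map_blocks(double)
```

This is the kind of graph-structure property the new tests can assert without ever computing the arrays.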

Also adds empty-chunk skipping: precompute the source footprint in target
CRS and skip chunks that fall outside it entirely. This avoids pyproj
initialization and source data fetching for chunks that would just be
filled with nodata.
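The footprint test reduces to an axis-aligned bounding-box overlap check in the target CRS. The sketch below is illustrative only: `bbox_overlaps` and `reproject_chunk` are hypothetical names, and the overlapping branch is a placeholder for the real warp.

```python
import numpy as np

def bbox_overlaps(a, b):
    """Axis-aligned overlap test on (xmin, ymin, xmax, ymax) bounds."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return not (ax1 <= bx0 or ax0 >= bx1 or ay1 <= by0 or ay0 >= by1)

def reproject_chunk(template_block, chunk_bounds, footprint, nodata=-9999.0):
    if not bbox_overlaps(chunk_bounds, footprint):
        # No source data can land in this chunk: skip pyproj setup and
        # the source fetch, returning a nodata-filled block immediately.
        return np.full(template_block.shape, nodata, dtype=template_block.dtype)
    # Overlapping chunks would run the real reprojection; placeholder here.
    return np.zeros(template_block.shape, dtype=template_block.dtype)

footprint = (0.0, 0.0, 10.0, 10.0)  # source extent in target CRS
block = np.empty((4, 4), dtype="float64")
outside = reproject_chunk(block, (20.0, 20.0, 24.0, 24.0), footprint)
inside = reproject_chunk(block, (2.0, 2.0, 6.0, 6.0), footprint)
```

Since the footprint is computed once up front, the per-chunk cost of the skip path is a handful of float comparisons.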

The streaming fallback threshold (previously 1GB graph metadata) is
removed since graph metadata is now constant-size. Large in-memory
arrays always go through dask when available.
@github-actions bot added the `performance` label (PR touches performance-sensitive code) on Mar 24, 2026
da.map_blocks scans kwargs for dask Array instances and adds them as
whole-array positional dependencies. This meant every output block
depended on the full source array, causing MemoryError on distributed
schedulers when the source exceeded the worker memory limit.

Fix: bind the source dask array (and all other params) to the adapter
function via functools.partial before passing to map_blocks. Dask
doesn't scan partial internals for array dependencies, so each block
task only depends on the template block, not the entire source.

The chunk function still slices and computes just the needed source
window at execution time.
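The fix described above can be sketched like this. It is an assumption-laden illustration, not the PR's code: `warp_chunk` is a stand-in that merely doubles the matching source window, located via map_blocks' `block_info` metadata.

```python
import functools
import numpy as np
import dask.array as da

src = da.ones((40, 40), chunks=(20, 20))

def warp_chunk(template_block, source=None, block_info=None):
    # Slice only this block's window from the source at execution time.
    (i0, i1), (j0, j1) = block_info[0]["array-location"]
    return np.asarray(source[i0:i1, j0:j1]) * 2.0

template = da.empty_like(src)

# Passing source=src directly to map_blocks would register the whole
# source array as a dependency of every output task. Binding it with
# functools.partial hides it from dask's kwarg scan, so each task
# depends only on its own template block.
bound = functools.partial(warp_chunk, source=src)
out = template.map_blocks(bound, dtype=src.dtype)
result = out.compute(scheduler="synchronous")
```

The synchronous scheduler is used here because the chunk function computes a sliced dask array internally; on a distributed cluster the same binding keeps per-task dependencies small.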
The test creates many concurrent dask chunks that all call
pyproj.CRS.__init__ simultaneously via the threaded scheduler.
PROJ's C library is not fully thread-safe on macOS, causing SIGABRT.

Use dask synchronous scheduler for this test since we're verifying
empty-chunk correctness, not concurrent execution.
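As a general illustration of that scheduler choice (not the project's actual test), `scheduler="synchronous"` runs every task in the calling thread, so non-thread-safe native code such as PROJ's CRS construction is never entered concurrently:

```python
import dask.array as da

arr = da.ones((100,), chunks=10)

# scheduler="synchronous" executes all tasks serially in the calling
# thread; correctness is unchanged, only parallelism is given up.
total = float(arr.sum().compute(scheduler="synchronous"))
```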
@brendancol merged commit 7381cc2 into master on Mar 24, 2026
11 checks passed
