Reduce dask graph size for large raster reprojection#1065
Merged
brendancol merged 3 commits into master on Mar 24, 2026
Conversation
Replace the O(N) `delayed`/`from_delayed`/`block` pattern in `_reproject_dask` and `_merge_dask` with `da.map_blocks` over a template array. This produces a single `blockwise` layer in the HighLevelGraph with O(1) metadata, so graph construction no longer scales with chunk count.

Also adds empty-chunk skipping: precompute the source footprint in the target CRS and skip chunks that fall entirely outside it. This avoids pyproj initialization and source data fetching for chunks that would only be filled with nodata.

The streaming fallback threshold (previously 1GB of graph metadata) is removed, since graph metadata is now constant-size. Large in-memory arrays always go through dask when available.
da.map_blocks scans kwargs for dask Array instances and adds them as whole-array positional dependencies. This meant every output block depended on the full source array, causing MemoryError on distributed schedulers when the source exceeded the worker memory limit. Fix: bind the source dask array (and all other params) to the adapter function via functools.partial before passing to map_blocks. Dask doesn't scan partial internals for array dependencies, so each block task only depends on the template block, not the entire source. The chunk function still slices and computes just the needed source window at execution time.
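A sketch of the `functools.partial` fix, under assumed names (`adapter`, `src`; the real adapter signature is in the PR). `map_blocks` scans its own args and kwargs for dask arrays and adds them as whole-array dependencies, but it does not look inside a `partial`, so binding the source this way keeps each task's dependencies down to its template block:

```python
import functools
import numpy as np
import dask.array as da

source = da.random.random((1024, 1024), chunks=(256, 256))
template = da.empty((1024, 1024), chunks=(256, 256), dtype="float64")

def adapter(block, src=None, block_info=None):
    if block_info is None:  # meta-inference call
        return block
    # At execution time, slice and compute only the source window this
    # output block needs (here the window trivially mirrors the block;
    # a real reprojection derives it from the transform -- an assumption).
    (r0, r1), (c0, c1) = block_info[None]["array-location"]
    return src[r0:r1, c0:c1].compute(scheduler="synchronous")

# Binding src via partial hides it from map_blocks' dependency scan,
# so no task depends on the entire source array.
bound = functools.partial(adapter, src=source)
result = da.map_blocks(bound, template, dtype="float64")
```

Had `src=source` been passed as a `map_blocks` kwarg instead, every output task would depend on every source chunk, which is what blew past worker memory limits on distributed schedulers.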
The test creates many concurrent dask chunks that all call pyproj.CRS.__init__ simultaneously via the threaded scheduler. PROJ's C library is not fully thread-safe on macOS, causing SIGABRT. Use dask synchronous scheduler for this test since we're verifying empty-chunk correctness, not concurrent execution.
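Forcing the synchronous scheduler for one computation can be done with a `dask.config.set` context manager. A hedged sketch (the `per_chunk` stand-in replaces the real per-chunk `pyproj.CRS` construction from the test):

```python
import dask
import dask.array as da

def per_chunk(block):
    # In the real test, each chunk constructs a pyproj.CRS; PROJ's C
    # initialization is not fully thread-safe on macOS, so many threads
    # doing this at once can SIGABRT. numpy arithmetic stands in here.
    return block + 1.0

arr = da.ones((1024, 1024), chunks=(64, 64))

# Single-threaded execution for this computation only: the test verifies
# empty-chunk correctness, not concurrent execution.
with dask.config.set(scheduler="synchronous"):
    result = arr.map_blocks(per_chunk).compute()
```

Passing `scheduler="synchronous"` directly to `.compute()` works equally well when only one call needs it.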
Summary
For a 30TB raster with millions of chunks, the previous `_reproject_dask` and `_merge_dask` functions created one `dask.delayed` + one `da.from_delayed` node per output chunk, then stitched them with `da.block`. The graph metadata alone could exceed available memory. This PR makes three changes:

- Replace `delayed`/`from_delayed`/`block` with `da.map_blocks` over a template array. This produces a single `blockwise` layer in the HighLevelGraph with O(1) metadata, regardless of chunk count.
- Skip empty chunks: precompute the source footprint in the target CRS and skip chunks that fall entirely outside it, so they never trigger pyproj initialization or source data fetches.
- Remove the streaming fallback threshold (previously triggered at 1GB of graph metadata), since graph metadata is now constant-size; large in-memory arrays always go through dask when available.

Test plan
- New `TestDaskGraphOptimization` class covering graph structure, numpy/dask parity, empty-chunk skipping, and the helper functions.
- Full `test_reproject.py` suite passes (85/85).