Skip to content

Conversation

@Illviljan
Copy link
Contributor

@Illviljan Illviljan commented Dec 6, 2025

  • Closes cumsum drops index coordinates #6528
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

The non-flox version reduces chunksizes significantly:

x = xr.DataArray([1, 1, 1, 1, 1], name="x").chunk()
grp_idx = xr.DataArray([-1, 0, 0, -1, 1])
with xr.set_options(use_flox=False):
    print(x.groupby(grp_idx).cumsum())
<xarray.DataArray 'x' (dim_0: 5)> Size: 40B
dask.array<getitem, shape=(5,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>
Dimensions without coordinates: dim_0

With flox the chunksize is retained:

x = xr.DataArray([1, 1, 1, 1, 1], name="x").chunk()
grp_idx = xr.DataArray([-1, 0, 0, -1, 1])
with xr.set_options(use_flox=True):
    print(x.groupby(grp_idx).cumsum())
<xarray.DataArray 'x' (dim_0: 5)> Size: 40B
dask.array<_finalize_scan, shape=(5,), dtype=int64, chunksize=(5,), chunktype=numpy.ndarray>
Dimensions without coordinates: dim_0

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
# median isn't enabled yet, because it would break if a single group was present in multiple
# chunks. The non-flox code path will just rechunk every group to a single chunk and execute the median
method_is_not_flox_supported = method.name in ("median", "cumsum", "cumprod")
method_is_not_flox_supported = method.name in ("median", "cumprod")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI in a future PR, I'd like to use the new flox.is_supported_aggregation here. It's a little smarter about this dispatching. We'll also have to figure out what to do about median which currently auto-rechunks so it always works.

)
return f"""\
return self.reduce(
out = self.reduce(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this looked nice and it was clear there was tweak to fix cumsum/cumprod:

            out = self.reduce(
                duck_array_ops.cumsum,
                dim=dim,
                skipna=skipna,
                keep_attrs=keep_attrs,
                **kwargs,
            )
            return out.assign_coords(self._obj.coords)

Then for consistency and readability I followed that pattern on the others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we apply the assign_coords fix for Dataset.cumsum et al too?

assert_identical(expected.foo, actual)

@pytest.mark.parametrize(
"method, expected_array, use_flox, use_dask",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use_dask here should mean grouping a dask array by a numpy array. That will work always.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cumsum drops index coordinates

2 participants