Skip to content

Conversation

@eshort0401
Copy link

@eshort0401 eshort0401 commented Dec 11, 2025

Addresses a number of issues related to initializing DataTree objects with path like names. Not sure if @TomNicholas is still working on these, but thought I'd have a go.

Recursion Issue #9978

As discussed by @dawiedotcom in #9978, commands like

from xarray import DataTree
DataTree(children={'a/b': DataTree()})

currently cause infinite recursion. To fix, we amend the children setter to allow path like names, creating intermediate nodes using ._set_item. Note the discussion in #9378 proposes making path like names illegal, but I think allowing them is more natural. With this PR, the above command will now return

<xarray.DataTree>
Group: /
└── Group: /a
    └── Group: /a/b

DataTree.update Inconsistencies #9485

Fix the DataTree.update method by adapting the approach suggested by @TomNicholas in #9485. Specifically, rewrite the existing update method as _update_local_node, which assumes the names given in other are not path like, and all updates apply to given node. Then write a new update method which groups the items of other into those items specific to each node, then calls _update_local_node on each group.

  • xarray.Dataarray and xarray.Variable values in other are handled as before.
  • Now we also allow xarray.Dataset values in other, assuming the user's intent is to create a node at the given path with the given dataset, or replace the node's dataset if it already exists.
  • Add tests to ensure update and __setitem__ give identical results.

With this PR, commands like

import xarray as xr

dt = xr.DataTree(xr.Dataset(coords={"x": [0, 1, 2]}))
# Add a DataArray q to child node p
new_content = {"p/q": xr.DataArray(data=[10, 20, 30], dims=("x",))}
# Add a Variable r to child node p
new_content.update({"p/r": xr.Variable(data=2, dims=())})
# Create new nodes s, t, and assign a Dataset to t
ds = xr.Dataset({"a": ("x", [1, 2, 3]), "b": ("x", [4, 5, 6])})
new_content.update({"s/t": ds})
dt.update(new_content)
print(dt)

will return

<xarray.DataTree>
Group: /
│   Dimensions:  (x: 3)
│   Coordinates:
│     * x        (x) int64 24B 0 1 2
├── Group: /s
│   └── Group: /s/t
│           Dimensions:  (x: 3)
│           Data variables:
│               a        (x) int64 24B 1 2 3
│               b        (x) int64 24B 4 5 6
└── Group: /p
        Dimensions:  (x: 3)
        Data variables:
            q        (x) int64 24B 10 20 30
            r        int64 8B 2

Child Node Coordinate Assignment #9485

As discussed by @shoyer in #9485, allow for assignment to child node coordinates using path like names. The commands

import xarray as xr
tree = xr.DataTree(xr.Dataset(coords={'x': 0}), children={'child': xr.DataTree()})
tree.coords['/child/y'] = 2
print(tree)

now return

<xarray.DataTree>
Group: /
│   Dimensions:  ()
│   Coordinates:
│       x        int64 8B 0
└── Group: /child
        Dimensions:  ()
        Coordinates:
            y        int64 8B 2

General Comments

These fixes are accomplished by adding DataTree._get_target_object, which returns the object at the given path if it exists, or creates an empty node at that path using _set_item if it doesn't. We also add NodePath._get_components which returns self.parent and self.name provided self.name exists, or raises an error otherwise.

@welcome
Copy link

welcome bot commented Dec 11, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@github-actions github-actions bot added the topic-DataTree Related to the implementation of a DataTree class label Dec 11, 2025
Comment on lines +188 to +197
def handle_child(name, child):
"""
Decide whether to call _set_item or _set_parent. If the child name is a
path, we call ._set_item to make sure any intermediate nodes are also
created. When name is a path there will be nested calls of children due to
the setter decorator, and the fact ._set_item assigns to the children
attribute.
"""
if "/" in name:
# Path like node name. Create node at appropriate level of nesting.
Copy link
Member

@TomNicholas TomNicholas Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that this design change will silently break things somehow. It effectively allows violating an intended design invariant (that node names do not have slashes) then attempts to ensure that it cleans up after itself. I would much prefer to solve this in some way that never violates the invariant, even internally.

@TomNicholas
Copy link
Member

TomNicholas commented Dec 12, 2025

thought I'd have a go.

Awesome! Thank you for taking the initiative here!

Note the discussion in #9378 proposes making path like names illegal, but I think allowing them is more natural.

We did make them illegal in variable names, and intended to make them illegal in child names too (see #9490, which we apparently didn't get around to implementing yet). I think all sorts of things could break if we try to reverse that right now, because DataTree was never designed to handle paths in node names. It's also a bigger decision than I think is required to fix these bugs.

This PR changes a lot of behaviour - can it be broken up? I'm unclear whether you have implemented one change that addresses 3 issues, or 3 changes that address 3 issues. At the very least the coordinate feature seems easily separable. In general a larger number of smaller, more targeted PRs is preferable.

These fixes are accomplished by adding DataTree._get_target_object, which returns the object at the given path if it exists, or creates an empty node at that path using _set_item if it doesn't.

This seems useful, but also it sounds like this method both mutates state and has a return value, which is considered unpythonic (as @shoyer mentions in #9196). I'm worried that could lead to unintentional side-effects.

I think a safer approach here is a sequence of small, separate PRs, that address #9490, then #9978 (ideally without adding any new methods - there presumably must be a bug in an existing method that causes this), then finally #9498.

I really appreciate you diving in here!

@eshort0401
Copy link
Author

Thanks so much for the review @TomNicholas!

I think a safer approach here is a sequence of small, separate PRs, that address #9490, then #9978 (ideally without adding any new methods - there presumably must be a bug in an existing method that causes this), then finally #9498.

I agree, and also agree with your other points. I'll close this PR and re-open once I've revised!

@eshort0401 eshort0401 closed this Dec 13, 2025
@TomNicholas
Copy link
Member

Thank you so much!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

topic-DataTree Related to the implementation of a DataTree class

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Assigning to DataTree.coords should support full paths

2 participants