Allow writing StringDType variables to netCDF by kkollsga · Pull Request #11218 · pydata/xarray

kkollsga · 2026-03-08T22:37:57Z

Closes #11199 (Datasets concatenated along string dimension cannot write to netCDF)
Tests added
User visible changes documented in whats-new.rst
New functions/methods listed in api.rst (N/A — no new public API)

Summary

Recognizes numpy.dtypes.StringDType (kind "T") as a unicode string type in is_unicode_dtype, so the encoding pipeline and backend dtype selection handle it correctly.
Converts StringDType arrays to object arrays in netCDF4 and h5netcdf backend prepare_variable methods, since neither C library supports StringDType natively.
Null values from StringDType(na_object=None) are replaced with empty strings on write, matching existing behavior for object-dtype string arrays with missing values.
The scipy backend already works because EncodedStringCoder(allows_unicode=False) encodes strings to bytes via encode_string_array, which handles StringDType.

Test plan

test_is_unicode_dtype_stringdtype — unit test for is_unicode_dtype with StringDType
test_roundtrip_stringdtype_data — roundtrip test in DatasetIOBase, runs across all backends (netCDF4, h5netcdf, scipy, zarr)
Manual verification of null handling with StringDType(na_object=None)
Pre-commit (ruff, formatting) passes
mypy passes (no new errors)

🤖 Generated with Claude Code

kkollsga · 2026-03-09T07:31:30Z

The mypy and test failures look related to numpy 2.4.2's stricter type stubs (#11183, #11204).

jsignell

This looks really good @kkollsga! Just one comment about adding some more tests. Also does the scipy backend need this same handling?

xarray/tests/test_backends.py

Recognize numpy.dtypes.StringDType (kind "T") as a unicode string type in is_unicode_dtype, and convert StringDType arrays to object arrays before passing them to the netCDF4/h5netcdf backends, which don't support StringDType natively. Null values from StringDType(na_object=None) are replaced with empty strings on write.

Co-authored-by: Claude <noreply@anthropic.com>

- Handle StringDType null values in encode_string_array (scipy/nc3 path)
- Add roundtrip tests for StringDType with na_object=None and na_object=""
- Add unit test for encode_string_array with StringDType nulls

Co-Authored-By: Claude <noreply@anthropic.com>

kkollsga · 2026-03-14T11:36:50Z

Thanks for the review @jsignell! I looked into scipy and the situation is a bit different there. The scipy backend goes through EncodedStringCoder(allows_unicode=False), which converts strings to bytes before prepare_variable ever sees the data, so it doesn't need the same prepare_variable change. However, encode_string_array (called by that coder) would crash on null values, since None.encode("utf-8") raises AttributeError. So the fix needed to go in encode_string_array instead, using the same convert-to-object-and-replace-None pattern to keep null handling consistent across all paths.

I added these changes:

Null handling in encode_string_array for StringDType (covers scipy and nc3 format paths)
test_roundtrip_stringdtype_nulls — roundtrip with StringDType(na_object=None) containing a null, verifies it comes back as ""
test_roundtrip_stringdtype_with_na_object — roundtrip with StringDType(na_object="")
test_encode_string_array_stringdtype_nulls — unit test for the encode_string_array fix

All roundtrip tests are in DatasetIOBase so they run across every backend.

jsignell

It looks like this is getting close. I really like the new tests. Since tests are passing now on CI it does look like you have some failures in yours (https://github.com/pydata/xarray/actions/runs/23211060740/job/67459709862?pr=11218#step:10:42). I am suspicious about whether this string_array[string_array == None] = "" is doing quite the right thing. Does == None accurately capture the null value even if na_object is set?

kkollsga · 2026-03-17T21:10:47Z

I realized that unlike converting StringDType to numpy 'object' dtype, converting to fixed-width unicode (U) is supported by all backends natively, so I moved the null handling into EncodedStringCoder.encode() instead. This let me remove the per-backend conversions in netCDF4 and h5netcdf prepare_variable entirely. Hopefully the tests will pass now.

Convert StringDType to fixed-width unicode (U) in EncodedStringCoder.encode() instead of per-backend prepare_variable, fixing Zarr and CFEncodedDataStore.

Co-authored-by: Claude <noreply@anthropic.com>

jsignell

This looks great!

github-actions bot added topic-backends io labels Mar 8, 2026

kkollsga mentioned this pull request Mar 8, 2026

Datasets concatenated along string dimension cannot write to netCDF #11199

Closed

5 tasks

kkollsga mentioned this pull request Mar 9, 2026

2 tests failures #11183

Closed

2 tasks

jsignell reviewed Mar 13, 2026

xarray/tests/test_backends.py (review thread resolved)

kkollsga and others added 2 commits March 14, 2026 12:20

kkollsga force-pushed the fix-stringdtype-netcdf-11199 branch from 4f7280c to 79b4ae1 Compare March 14, 2026 11:36

kkollsga and others added 2 commits March 17, 2026 18:40

Merge branch 'main' into fix-stringdtype-netcdf-11199

f053f7f

Merge branch 'main' into fix-stringdtype-netcdf-11199

6e782be

jsignell requested changes Mar 17, 2026

kkollsga and others added 2 commits March 17, 2026 22:11

Move StringDType handling into shared encoder (pydata#11199)

3df16ac

Merge branch 'main' into fix-stringdtype-netcdf-11199

59b6453

jsignell approved these changes Mar 18, 2026

jsignell merged commit 6c36d81 into pydata:main Mar 18, 2026
40 checks passed

jsignell merged 6 commits into pydata:main from kkollsga:fix-stringdtype-netcdf-11199
