Skip to content

Allow writing StringDType variables to netCDF#11218

Merged
jsignell merged 6 commits intopydata:mainfrom
kkollsga:fix-stringdtype-netcdf-11199
Mar 18, 2026
Merged

Allow writing StringDType variables to netCDF#11218
jsignell merged 6 commits intopydata:mainfrom
kkollsga:fix-stringdtype-netcdf-11199

Conversation

@kkollsga
Copy link
Contributor

@kkollsga kkollsga commented Mar 8, 2026

Summary

  • Recognizes numpy.dtypes.StringDType (kind "T") as a unicode string type in is_unicode_dtype, so the encoding pipeline and backend dtype selection handle it correctly.
  • Converts StringDType arrays to object arrays in netCDF4 and h5netcdf backend prepare_variable methods, since neither C library supports StringDType natively.
  • Null values from StringDType(na_object=None) are replaced with empty strings on write, matching existing behavior for object-dtype string arrays with missing values.
  • The scipy backend already works because EncodedStringCoder(allows_unicode=False) encodes strings to bytes via encode_string_array, which handles StringDType.

Test plan

  • test_is_unicode_dtype_stringdtype — unit test for is_unicode_dtype with StringDType
  • test_roundtrip_stringdtype_data — roundtrip test in DatasetIOBase, runs across all backends (netCDF4, h5netcdf, scipy, zarr)
  • Manual verification of null handling with StringDType(na_object=None)
  • Pre-commit (ruff, formatting) passes
  • mypy passes (no new errors)

🤖 Generated with Claude Code

@kkollsga
Copy link
Contributor Author

kkollsga commented Mar 9, 2026

The mypy and test failures look related to numpy 2.4.2's stricter type stubs (#11183, #11204).

@kkollsga kkollsga mentioned this pull request Mar 9, 2026
2 tasks
Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good @kkollsga! Just one comment about adding some more tests. Also does the scipy backend need this same handling?

kkollsga and others added 2 commits March 14, 2026 12:20
Recognize numpy.dtypes.StringDType (kind "T") as a unicode string type
in is_unicode_dtype, and convert StringDType arrays to object arrays
before passing to netCDF4/h5netcdf backends which don't support
StringDType natively. Null values from StringDType(na_object=None) are
replaced with empty strings on write.

Co-authored-by: Claude <noreply@anthropic.com>
- Handle StringDType null values in encode_string_array (scipy/nc3 path)
- Add roundtrip tests for StringDType with na_object=None and na_object=""
- Add unit test for encode_string_array with StringDType nulls

Co-Authored-By: Claude <noreply@anthropic.com>
@kkollsga kkollsga force-pushed the fix-stringdtype-netcdf-11199 branch from 4f7280c to 79b4ae1 Compare March 14, 2026 11:36
@kkollsga
Copy link
Contributor Author

Thanks for the review @jsignell ! I looked into scipy and the situation is a bit different there. The scipy backend goes through EncodedStringCoder(allows_unicode=False) which converts strings to bytes before prepare_variable ever sees the data, so it doesn't need the same prepare_variable change. However, encode_string_array (called by that coder) would crash on null values since None.encode("utf-8") raises AttributeError. So the fix needed to go in encode_string_array instead, using the same convert-to-object-and-replace-None pattern to keep null handling consistent across all paths.

I added these changes:

  • Null handling in encode_string_array for StringDType (covers scipy and nc3 format paths)
  • test_roundtrip_stringdtype_nulls — roundtrip with StringDType(na_object=None) containing a null, verifies it comes back as ""
  • test_roundtrip_stringdtype_with_na_object — roundtrip with StringDType(na_object="")
  • test_encode_string_array_stringdtype_nulls — unit test for the encode_string_array fix

All roundtrip tests are in DatasetIOBase so they run across every backend.

Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is getting close. I really like the new tests. Since tests are passing now on CI it does look like you have some failures in yours (https://github.com/pydata/xarray/actions/runs/23211060740/job/67459709862?pr=11218#step:10:42). I am suspicious about whether this string_array[string_array == None] = "" is doing quite the right thing. Does == None accurately capture the null value even if na_object is set?

@kkollsga
Copy link
Contributor Author

I realized that unlike converting StringDType to numpy 'object' dtype, converting to fixed-width unicode (U) is supported by all backends natively, so I moved the null handling into EncodedStringCoder.encode() instead. This let me remove the per-backend conversions in netCDF4 and h5netcdf prepare_variable entirely. Hopefully the tests will pass now.

kkollsga and others added 2 commits March 17, 2026 22:11
Convert StringDType to fixed-width unicode (U) in EncodedStringCoder.encode()
instead of per-backend prepare_variable, fixing Zarr and CFEncodedDataStore.

Co-authored-by: Claude <noreply@anthropic.com>
Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great!

@jsignell jsignell merged commit 6c36d81 into pydata:main Mar 18, 2026
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Datasets concatenated along string dimension cannot write to netCDF

2 participants