Allow writing StringDType variables to netCDF #11218
Recognize numpy.dtypes.StringDType (kind "T") as a unicode string type in is_unicode_dtype, and convert StringDType arrays to object arrays before passing to netCDF4/h5netcdf backends which don't support StringDType natively. Null values from StringDType(na_object=None) are replaced with empty strings on write. Co-authored-by: Claude <noreply@anthropic.com>
- Handle StringDType null values in encode_string_array (scipy/nc3 path)
- Add roundtrip tests for StringDType with na_object=None and na_object=""
- Add unit test for encode_string_array with StringDType nulls

Co-Authored-By: Claude <noreply@anthropic.com>
4f7280c to 79b4ae1
Thanks for the review, @jsignell! I looked into scipy and the situation is a bit different there. The scipy backend goes through … I added these changes:
All roundtrip tests are in …
jsignell
left a comment
It looks like this is getting close. I really like the new tests. Since tests are passing now on CI it does look like you have some failures in yours (https://github.com/pydata/xarray/actions/runs/23211060740/job/67459709862?pr=11218#step:10:42). I am suspicious about whether this string_array[string_array == None] = "" is doing quite the right thing. Does == None accurately capture the null value even if na_object is set?
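One way to sidestep the question of what == None does for a given na_object is to build the null mask by identity against the sentinel. The sketch below is not the PR's code: null_mask and fill_nulls are hypothetical helpers, and it assumes the data has already been converted to an object array in which missing entries are the na_object itself.

```python
import numpy as np


def null_mask(obj_array, na_object=None):
    """Return a boolean mask marking entries that *are* the na_object sentinel.

    Hypothetical helper: uses an identity check (`is`) rather than `==`,
    so the result does not depend on how the sentinel implements equality.
    """
    flags = [item is na_object for item in obj_array.ravel()]
    return np.array(flags, dtype=bool).reshape(obj_array.shape)


def fill_nulls(obj_array, na_object=None, fill=""):
    """Return a copy of obj_array with sentinel entries replaced by `fill`."""
    out = obj_array.copy()
    out[null_mask(out, na_object)] = fill
    return out


# Assumed input: an object array where None marks a missing string.
values = np.array(["a", None, "c"], dtype=object)
filled = fill_nulls(values)
print(filled.tolist())
```

The input array is left untouched, and passing a different na_object (for example a NaN-like sentinel object) changes only the identity check.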
I realized that, unlike converting StringDType to NumPy's object dtype, converting to fixed-width unicode (…
Convert StringDType to fixed-width unicode (U) in EncodedStringCoder.encode() instead of per-backend prepare_variable, fixing Zarr and CFEncodedDataStore. Co-authored-by: Claude <noreply@anthropic.com>
Summary
- Recognize numpy.dtypes.StringDType (kind "T") as a unicode string type in is_unicode_dtype, so the encoding pipeline and backend dtype selection handle it correctly.
- Convert StringDType arrays to object arrays in the netCDF4 and h5netcdf backend prepare_variable methods, since neither C library supports StringDType natively.
- Null values from StringDType(na_object=None) are replaced with empty strings on write, matching existing behavior for object-dtype string arrays with missing values.
- EncodedStringCoder(allows_unicode=False) encodes strings to bytes via encode_string_array, which handles StringDType.

Test plan
- test_is_unicode_dtype_stringdtype: unit test for is_unicode_dtype with StringDType
- test_roundtrip_stringdtype_data: roundtrip test in DatasetIOBase, runs across all backends (netCDF4, h5netcdf, scipy, zarr)
- Roundtrip test for StringDType(na_object=None)

🤖 Generated with Claude Code
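The dtype check described in the summary can be illustrated with a minimal sketch. This is not xarray's actual is_unicode_dtype; is_unicode_like is a hypothetical stand-in that only inspects the dtype kind, treating fixed-width unicode ("U") and StringDType ("T") as unicode string types. StringDType requires NumPy 2.0 or later, so the sketch guards that part.

```python
import numpy as np


def is_unicode_like(dtype):
    """Hypothetical stand-in for a unicode-dtype check.

    Returns True for fixed-width unicode (kind "U") and for
    numpy.dtypes.StringDType (kind "T"); False for everything else.
    """
    return np.dtype(dtype).kind in ("U", "T")


print(is_unicode_like("U8"))    # fixed-width unicode
print(is_unicode_like("S8"))    # bytes, not unicode
print(is_unicode_like(object))  # object dtype is not covered by this sketch

# StringDType only exists on NumPy >= 2.0, so look it up defensively.
string_dtype_cls = getattr(getattr(np, "dtypes", None), "StringDType", None)
if string_dtype_cls is not None:
    print(is_unicode_like(string_dtype_cls()))
```

A kind-based check keeps the sketch independent of whether a given na_object was configured, since the kind is the same for every StringDType instance.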