Support non-UTF-8 encoded CSV files #20626
@alamb Hey, I think this would be a useful addition to DataFusion, as CSV files in non-UTF-8 encodings are quite common. Is there anything I can do to make review easier?
I just need to find time (or we need to find some other reviewers) See
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
make schema inference charset-aware
bump
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).
Which issue does this PR close?
Closes #20473
Rationale for this change
CSV is a ubiquitous file format, and many CSV files are encoded in Windows-1252 or other non-UTF-8 encodings. It would be useful to have the option to read them in DataFusion.
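To illustrate the problem, here is a minimal std-only sketch of why such files need a decoding step before CSV parsing. It is not the code in this PR: the PR uses `encoding_rs`, whereas this sketch uses ISO-8859-1 (Latin-1) as a stand-in because it can be decoded without any crate (Windows-1252 differs from it in the 0x80 to 0x9F range). The function names `decode_latin1` and `parse_latin1_line` are hypothetical and exist only for this example.

```rust
/// Decode ISO-8859-1 (Latin-1) bytes into a UTF-8 `String`.
/// Each Latin-1 byte maps to the Unicode code point with the same value,
/// so a plain `u8 -> char` cast is a complete decoder for this one encoding.
/// (Hypothetical helper, not part of DataFusion or encoding_rs.)
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Decode one Latin-1 encoded CSV line and split it on commas.
/// Quoting and escaping are deliberately ignored; this is an illustration,
/// not a CSV parser.
fn parse_latin1_line(raw: &[u8]) -> Vec<String> {
    decode_latin1(raw)
        // Drop the trailing line terminator, if any.
        .trim_end_matches(|c: char| c == '\n' || c == '\r')
        .split(',')
        .map(str::to_owned)
        .collect()
}
```

For the bytes `b"caf\xE9,1\n"` (the Latin-1 encoding of `café,1`), `std::str::from_utf8` returns an error because `0xE9` is not followed by valid UTF-8 continuation bytes, while `parse_latin1_line` yields the two fields `café` and `1`. A reader that assumes UTF-8 therefore cannot accept such a file without a transcoding step in front of it.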
What changes are included in this PR?
- Uses `encoding_rs` to do the actual decoding
- Refactors `CsvSource` somewhat to aid the implementation
- Changes `DecoderDeserializer::digest`, as it was misleading (call sites were ignoring it)
Are these changes tested?
I have added one unit test that attempts to read a SHIFT-JIS encoded CSV file. More tests are probably needed, but I may need some guidance on this. I'm also running into issues getting the test suite to run locally on my Windows machine.
Are there any user-facing changes?
Adds a new field to `CsvOptions`.