Support non-UTF-8 encoded CSV files #20626

Open
Rafferty97 wants to merge 10 commits into apache:main from Rafferty97:non-utf8-csv2

@Rafferty97
Contributor

Which issue does this PR close?

Closes #20473

Rationale for this change

CSV is a ubiquitous file format, and many CSV files are encoded in Windows-1252 or other non-UTF-8 encodings. It would be useful to have the option to read them in DataFusion.

What changes are included in this PR?

  • Adds a configuration option to the CSV reader for specifying the file's encoding
  • Adds an optional dependency on encoding_rs to do the actual decoding
  • Refactors CsvSource somewhat to aid the implementation
  • Removes the return value from DecoderDeserializer::digest, which was misleading (call sites were ignoring it)
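
As a rough illustration of what decoding before CSV parsing involves, here is a std-only sketch that converts Windows-1252 bytes to a UTF-8 string. It is not the code in this PR, which delegates decoding to encoding_rs; the function name `decode_windows_1252` and the lookup table are assumptions made for the example, with the table following the WHATWG mapping for bytes 0x80..=0x9F.

```rust
/// Code points for bytes 0x80..=0x9F in Windows-1252 (WHATWG mapping;
/// unassigned bytes map to the matching C1 control code point).
const WINDOWS_1252_HIGH: [u16; 32] = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

/// Hypothetical helper: decode Windows-1252 bytes into a UTF-8 `String`.
fn decode_windows_1252(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| match b {
            0x80..=0x9F => {
                let cp = WINDOWS_1252_HIGH[(b - 0x80) as usize] as u32;
                char::from_u32(cp).expect("table holds valid scalar values")
            }
            // ASCII and 0xA0..=0xFF coincide with the first 256 code points.
            _ => b as char,
        })
        .collect()
}
```

A row such as `b"caf\xE9,\x80 5\n"` is rejected by `String::from_utf8`, because the byte 0xE9 is not followed by valid continuation bytes, but it decodes cleanly with the helper above. Multi-byte encodings such as Shift-JIS additionally need a stateful decoder, since a character can straddle two input chunks.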

Are these changes tested?

I have added one unit test that attempts to read a Shift-JIS encoded CSV file. More tests are probably needed, but I may need some guidance on this. I'm also running into issues getting the test suite to run locally on my Windows machine.

Are there any user-facing changes?

Adds a new field, charset, to CsvOptions (and to CsvReadOptions).
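
For downstream code, the practical effect of a new public field depends on how the struct is constructed. The sketch below uses a hypothetical stand-in type, `DemoCsvOptions`, rather than the real CsvOptions; the field names other than `charset` and the `Option<String>` type are assumptions for illustration only.

```rust
/// Hypothetical stand-in for an options struct with public fields.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct DemoCsvOptions {
    pub has_header: bool,
    pub delimiter: u8,
    /// Newly added field in this sketch; `None` is taken to mean UTF-8.
    pub charset: Option<String>,
}

/// Builds options without naming every field.
///
/// An exhaustive literal such as
/// `DemoCsvOptions { has_header: true, delimiter: b',' }`
/// stops compiling once `charset` exists, whereas the struct update
/// syntax below picks up the default for any field added later.
pub fn comma_separated_with_header() -> DemoCsvOptions {
    DemoCsvOptions {
        has_header: true,
        delimiter: b',',
        ..Default::default()
    }
}
```

Callers that already build options through `Default` or builder methods are unaffected; only exhaustive struct literals need updating.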

@github-actions github-actions Bot added core Core DataFusion crate common Related to common crate proto Related to proto crate datasource Changes to the datasource crate documentation Improvements or additions to documentation labels Mar 1, 2026
@Rafferty97 Rafferty97 force-pushed the non-utf8-csv2 branch 4 times, most recently from 5bab23f to 09a2357 Compare March 5, 2026 00:24
@Rafferty97
Contributor Author

@alamb Hey, I think this would be a useful addition to DataFusion, as CSV files in non-UTF-8 encodings are quite common. I'm wondering if there's anything I can do to make review easier?

@alamb
Contributor

alamb commented Mar 18, 2026

> @alamb Hey, I think this would be a useful addition to DataFusion, as CSV files in non-UTF-8 encodings are quite common. I'm wondering if there's anything I can do to make review easier?

I just need to find time (or we need to find some other reviewers).

See

@github-actions

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions Bot added the Stale PR has not had any activity for some time label May 18, 2026
@Rafferty97
Contributor Author

bump

@github-actions

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v53.1.0 (current)
       Built [ 101.454s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.038s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  96.137s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.037s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.985s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field CsvReadOptions.charset in /home/runner/work/datafusion/datafusion/datafusion/core/src/datasource/file_format/options.rs:89
  field CsvReadOptions.charset in /home/runner/work/datafusion/datafusion/datafusion/core/src/datasource/file_format/options.rs:89
  field CsvReadOptions.charset in /home/runner/work/datafusion/datafusion/datafusion/core/src/datasource/file_format/options.rs:89

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [ 201.677s] datafusion
    Building datafusion-common v53.1.0 (current)
       Built [  31.585s] (current)
     Parsing datafusion-common v53.1.0 (current)
      Parsed [   0.059s] (current)
    Building datafusion-common v53.1.0 (baseline)
       Built [  31.638s] (baseline)
     Parsing datafusion-common v53.1.0 (baseline)
      Parsed [   0.063s] (baseline)
    Checking datafusion-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.948s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field CsvOptions.charset in /home/runner/work/datafusion/datafusion/datafusion/common/src/config.rs:3163

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  65.672s] datafusion-common
    Building datafusion-datasource v53.1.0 (current)
       Built [  34.621s] (current)
     Parsing datafusion-datasource v53.1.0 (current)
      Parsed [   0.032s] (current)
    Building datafusion-datasource v53.1.0 (baseline)
       Built [  34.662s] (baseline)
     Parsing datafusion-datasource v53.1.0 (baseline)
      Parsed [   0.034s] (baseline)
    Checking datafusion-datasource v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.360s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure trait_method_now_returns_unit: pub trait method now returns unit ---

Description:
A trait method that used to return a value now returns `()`.
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/trait_method_now_returns_unit.ron
Failed in:
  datafusion_datasource::decoder::BatchDeserializer::digest in /home/runner/work/datafusion/datafusion/datafusion/datasource/src/decoder.rs:42

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  71.038s] datafusion-datasource
    Building datafusion-datasource-csv v53.1.0 (current)
       Built [  34.942s] (current)
     Parsing datafusion-datasource-csv v53.1.0 (current)
      Parsed [   0.013s] (current)
    Building datafusion-datasource-csv v53.1.0 (baseline)
       Built [  34.928s] (baseline)
     Parsing datafusion-datasource-csv v53.1.0 (baseline)
      Parsed [   0.012s] (baseline)
    Checking datafusion-datasource-csv v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.146s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  71.735s] datafusion-datasource-csv
    Building datafusion-proto v53.1.0 (current)
       Built [  51.931s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.141s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  51.892s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.144s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   2.337s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field CsvOptions.charset in /home/runner/work/datafusion/datafusion/datafusion/proto/src/generated/datafusion_proto_common.rs:703
  field CsvOptions.charset in /home/runner/work/datafusion/datafusion/datafusion/proto/src/generated/datafusion_proto_common.rs:703

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [ 109.012s] datafusion-proto
    Building datafusion-proto-common v53.1.0 (current)
       Built [  20.047s] (current)
     Parsing datafusion-proto-common v53.1.0 (current)
      Parsed [   0.045s] (current)
    Building datafusion-proto-common v53.1.0 (baseline)
       Built [  20.161s] (baseline)
     Parsing datafusion-proto-common v53.1.0 (baseline)
      Parsed [   0.049s] (baseline)
    Checking datafusion-proto-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.516s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field CsvOptions.charset in /home/runner/work/datafusion/datafusion/datafusion/proto-common/src/generated/prost.rs:703
  field CsvOptions.charset in /home/runner/work/datafusion/datafusion/datafusion/proto-common/src/generated/prost.rs:703
  field CsvOptions.charset in /home/runner/work/datafusion/datafusion/datafusion/proto-common/src/generated/prost.rs:703

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  43.012s] datafusion-proto-common
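
The trait_method_now_returns_unit failure reported above for `digest` can be pictured with a small sketch. The trait below is a hypothetical stand-in, not the real BatchDeserializer, and the earlier `usize` return type shown in the comment is an assumption made for illustration.

```rust
/// Hypothetical trait, loosely shaped like a batch deserializer.
trait DemoDeserializer {
    /// Assumed earlier shape: `fn digest(&mut self, bytes: &[u8]) -> usize;`
    /// The shape after the change returns `()`.
    fn digest(&mut self, bytes: &[u8]);

    /// Number of bytes accepted so far.
    fn buffered(&self) -> usize;
}

struct ByteCounter {
    total: usize,
}

impl DemoDeserializer for ByteCounter {
    // An implementor still declaring `-> usize` here would no longer
    // match the trait, which is why the change is flagged as breaking.
    fn digest(&mut self, bytes: &[u8]) {
        self.total += bytes.len();
    }

    fn buffered(&self) -> usize {
        self.total
    }
}

/// Feeds every chunk and reports the total through a separate accessor.
fn feed(d: &mut impl DemoDeserializer, chunks: &[&[u8]]) -> usize {
    for chunk in chunks {
        // Callers that ignored the old return value, like this one,
        // compile unchanged; callers that used it as a number do not.
        d.digest(chunk);
    }
    d.buffered()
}
```

Call sites that discarded the result are source-compatible, while external implementors of the trait must update their signatures.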

@github-actions github-actions Bot added the auto detected api change Auto detected API change label May 18, 2026