fix: preserve CSV semantics for single-column files #4322

Open
PastelStorm wants to merge 2 commits into evoss/pdf-rendering-refactor from evoss/csv-single-column-parsing
@PastelStorm PastelStorm commented Apr 5, 2026

Keep single-column CSV parsing on the CSV code path so quoted commas, escaped quotes, and multiline cells remain decoded correctly when delimiter sniffing falls back.

Made-with: Cursor


Note

Medium Risk
Changes CSV delimiter sniffing and the partition_csv parsing path, which could alter how edge-case CSV/TSV inputs are interpreted (especially around truncation and quoting). Covered by new targeted tests for single-column, quoted fields, and fallback delimiter detection.

Overview
Fixes single-column CSV parsing by keeping those files on a CSV-aware code path: when delimiter detection yields None, partition_csv() now parses via csv.reader into a one-column DataFrame so quoted commas, escaped quotes, and multiline fields are decoded correctly.
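A minimal sketch of the idea, using only the standard library. The helper name `read_single_column` is hypothetical and is not taken from the PR; the PR itself is described as loading the parsed rows into a one-column DataFrame, which this sketch leaves out.

```python
import csv
import io


def read_single_column(text: str) -> list[str]:
    """Hypothetical helper: decode a delimiter-less CSV with csv.reader.

    Going through csv.reader (rather than splitting on newlines) keeps
    CSV quoting rules in effect even though there is only one column.
    """
    # newline="" lets csv.reader handle line endings inside quoted cells.
    reader = csv.reader(io.StringIO(text, newline=""))
    # Blank lines come back as empty rows; represent them as empty cells.
    return [row[0] if row else "" for row in reader]


sample = 'notes\n"hello, world"\n"a ""quote"""\n"line 1\nline 2"\n'
cells = read_single_column(sample)
```

Here the quoted comma stays inside one cell, the doubled quotes collapse to single quotes, and the multiline cell keeps its embedded newline.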

Improves delimiter detection robustness by expanding sniffable delimiters to include tabs, handling fixed-size samples without trailing newlines, and adding a fallback heuristic that tries common delimiters and accepts only consistent multi-column shapes. Adds comprehensive tests covering these cases and documents the fix in the changelog.
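The fallback heuristic could look roughly like the following. This is a sketch under assumptions: the function name `fallback_delimiter` and the candidate list are illustrative, not the PR's actual code.

```python
import csv
import io
from typing import Optional

# Assumed candidate set; the PR only says "common delimiters".
CANDIDATES = (",", ";", "\t", "|")


def fallback_delimiter(sample: str) -> Optional[str]:
    """Hypothetical fallback: accept a delimiter only if every non-empty
    row splits into the same number of columns, and that number is > 1."""
    for delim in CANDIDATES:
        reader = csv.reader(io.StringIO(sample, newline=""), delimiter=delim)
        widths = {len(row) for row in reader if row}
        if len(widths) == 1 and next(iter(widths)) > 1:
            return delim
    # No consistent multi-column shape: treat the input as single-column.
    return None
```

Returning `None` for inconsistent or one-column shapes is what routes the file to the single-column path described above, instead of guessing a delimiter that happens to appear inside a quoted cell.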

Reviewed by Cursor Bugbot for commit 06fd2f4.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Test assertions expect literal \n instead of space
    • Updated the multiline quoted-field assertions to expect whitespace-normalized output (line 1 line 2) in both element text and HTML.

Or push these changes by commenting:

@cursor push 9f629b4695
Preview (9f629b4695)
diff --git a/test_unstructured/partition/test_csv.py b/test_unstructured/partition/test_csv.py
--- a/test_unstructured/partition/test_csv.py
+++ b/test_unstructured/partition/test_csv.py
@@ -141,11 +141,11 @@
 
     elements = partition_csv(file=io.BytesIO(csv_data), include_header=True)
 
-    assert elements[0].text == 'notes hello, world a "quote" line 1\\nline 2'
+    assert elements[0].text == 'notes hello, world a "quote" line 1 line 2'
     assert elements[0].metadata.text_as_html is not None
     assert "<td>hello, world</td>" in elements[0].metadata.text_as_html
     assert '<td>a "quote"</td>' in elements[0].metadata.text_as_html
-    assert "<td>line 1\\nline 2</td>" in elements[0].metadata.text_as_html
+    assert "<td>line 1 line 2</td>" in elements[0].metadata.text_as_html
     assert '"hello, world"' not in elements[0].metadata.text_as_html
     assert '""quote""' not in elements[0].metadata.text_as_html

Reviewed by Cursor Bugbot for commit c2b87fd.


elements = partition_csv(file=io.BytesIO(csv_data), include_header=True)

assert elements[0].text == 'notes hello, world a "quote" line 1\\nline 2'
Test assertions expect literal \n instead of space

Medium Severity

The test assertions for the multiline quoted field expect a literal two-character \n (backslash + n) in both elements[0].text and text_as_html, but that's not what the production code produces. The csv.reader correctly decodes "line 1\nline 2" into a string with an actual newline character. Then HtmlTable.from_html_text() normalizes all whitespace via " ".join(e.text.split()), converting the newline to a plain space. The actual output text contains line 1 line 2 (with a space), not line 1\nline 2 (with literal backslash-n). These assertions will fail when run.
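The two steps in this explanation can be reproduced with the standard library alone. The sketch below does not call `HtmlTable.from_html_text()`; it applies the same `" ".join(text.split())` expression quoted above to a plain string, which is an assumption about what that normalization amounts to.

```python
import csv
import io

# A one-column CSV whose second row is a quoted cell spanning two lines.
raw = 'notes\n"line 1\nline 2"\n'

rows = list(csv.reader(io.StringIO(raw, newline="")))
cell = rows[1][0]

# csv.reader yields a real newline character, not backslash + n.
has_real_newline = "\n" in cell
has_literal_backslash_n = "\\n" in cell

# Whitespace normalization collapses the newline into a single space.
normalized = " ".join(cell.split())
```

After normalization the cell text reads `line 1 line 2`, which is why an assertion written with `'\\n'` cannot match.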

Additional Locations (1)
@PastelStorm PastelStorm force-pushed the evoss/csv-single-column-parsing branch from 236467a to 06fd2f4 Compare April 6, 2026 00:53
@cragwolfe
Contributor

2 participants