fix: preserve CSV semantics for single-column files by PastelStorm · Pull Request #4322 · Unstructured-IO/unstructured

PastelStorm · 2026-04-05T21:51:07Z

Keep single-column CSV parsing on the CSV code path so quoted commas, escaped quotes, and multiline cells remain decoded correctly when delimiter sniffing falls back.

Made-with: Cursor

Note

Medium Risk
Changes CSV delimiter sniffing and the partition_csv parsing path, which could alter how edge-case CSV/TSV inputs are interpreted (especially around truncation and quoting). Covered by new targeted tests for single-column, quoted fields, and fallback delimiter detection.

Overview
Fixes single-column CSV parsing by keeping those files on a CSV-aware code path: when delimiter detection yields None, partition_csv() now parses via csv.reader into a one-column DataFrame so quoted commas, escaped quotes, and multiline fields are decoded correctly.
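As a rough illustration of why the CSV-aware path matters, the sketch below decodes a one-column payload with the standard-library `csv.reader`. The helper name `read_single_column` is hypothetical and is not taken from the PR. The PR itself is described as loading the parsed rows into a one-column DataFrame, a step omitted here to keep the example stdlib-only. Keeping only the first cell of each row is a simplification made for this sketch.

```python
import csv
import io


def read_single_column(data: bytes, encoding: str = "utf-8") -> list[str]:
    """Decode a one-column CSV payload with csv.reader (hypothetical helper)."""
    # newline="" disables newline translation so csv.reader sees the raw
    # line endings, including those embedded inside quoted cells.
    text = io.StringIO(data.decode(encoding), newline="")
    # Simplification for this sketch: keep the first cell of each non-empty row.
    return [row[0] for row in csv.reader(text) if row]


csv_data = b'notes\n"hello, world"\n"a ""quote"""\n"line 1\nline 2"\n'
cells = read_single_column(csv_data)
```

Here the quoted comma stays inside one cell, the doubled quotes collapse to single quotes, and the multiline cell keeps a real newline character, which is the behavior the description says would be lost on a non-CSV path.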

Improves delimiter detection robustness by expanding sniffable delimiters to include tabs, handling fixed-size samples without trailing newlines, and adding a fallback heuristic that tries common delimiters and accepts only consistent multi-column shapes. Adds comprehensive tests covering these cases and documents the fix in the changelog.
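The fallback heuristic described above could look roughly like the following. This is a sketch under stated assumptions, not the PR's implementation: the function name `guess_delimiter`, the candidate list, and the two-row minimum are all illustrative choices, and the handling of samples truncated without a trailing newline is not covered.

```python
import csv
import io
from typing import Optional

# Assumed candidate list; the PR only says "common delimiters" and tabs.
CANDIDATES = (",", ";", "\t", "|")


def guess_delimiter(sample: str) -> Optional[str]:
    """Return a delimiter that yields a consistent multi-column shape, else None."""
    for delim in CANDIDATES:
        reader = csv.reader(io.StringIO(sample, newline=""), delimiter=delim)
        rows = [row for row in reader if row]
        widths = {len(row) for row in rows}
        # Accept only if every row has the same width and that width exceeds one.
        if len(rows) >= 2 and len(widths) == 1 and min(widths) > 1:
            return delim
    # No candidate produced a consistent multi-column table: treat as single-column.
    return None
```

Returning `None` for a sample such as `'notes\n"hello, world"\nplain\n'` is what would keep a single-column file with quoted commas from being misread as a two-column table.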

Reviewed by Cursor Bugbot for commit 06fd2f4. Bugbot is set up for automated code reviews on this repo.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

✅ Fixed: Test assertions expect literal \n instead of space
- Updated the multiline quoted-field assertions to expect whitespace-normalized output (line 1 line 2) in both element text and HTML.

Or push these changes by commenting:

@cursor push 9f629b4695

Preview (9f629b4695)

diff --git a/test_unstructured/partition/test_csv.py b/test_unstructured/partition/test_csv.py
--- a/test_unstructured/partition/test_csv.py
+++ b/test_unstructured/partition/test_csv.py
@@ -141,11 +141,11 @@
 
     elements = partition_csv(file=io.BytesIO(csv_data), include_header=True)
 
-    assert elements[0].text == 'notes hello, world a "quote" line 1\\nline 2'
+    assert elements[0].text == 'notes hello, world a "quote" line 1 line 2'
     assert elements[0].metadata.text_as_html is not None
     assert "<td>hello, world</td>" in elements[0].metadata.text_as_html
     assert '<td>a "quote"</td>' in elements[0].metadata.text_as_html
-    assert "<td>line 1\\nline 2</td>" in elements[0].metadata.text_as_html
+    assert "<td>line 1 line 2</td>" in elements[0].metadata.text_as_html
     assert '"hello, world"' not in elements[0].metadata.text_as_html
     assert '""quote""' not in elements[0].metadata.text_as_html


Reviewed by Cursor Bugbot for commit c2b87fd.

cursor · 2026-04-05T22:03:29Z

test_unstructured/partition/test_csv.py

+
+    elements = partition_csv(file=io.BytesIO(csv_data), include_header=True)
+
+    assert elements[0].text == 'notes hello, world a "quote" line 1\\nline 2'


Test assertions expect literal \n instead of space

Medium Severity

The test assertions for the multiline quoted field expect a literal two-character \n (backslash + n) in both elements[0].text and text_as_html, but that's not what the production code produces. The csv.reader correctly decodes "line 1\nline 2" into a string with an actual newline character. Then HtmlTable.from_html_text() normalizes all whitespace via " ".join(e.text.split()), converting the newline to a plain space. The actual output text contains line 1 line 2 (with a space), not line 1\nline 2 (with literal backslash-n). These assertions will fail when run.
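The difference between the two strings can be checked in isolation. The snippet below applies the same `" ".join(text.split())` idiom quoted in the comment to a decoded multiline cell. It does not call `HtmlTable.from_html_text()` itself, and the variable names are illustrative.

```python
# What csv.reader yields for the quoted multiline cell: a real newline character.
decoded_cell = "line 1\nline 2"

# The whitespace normalization quoted in the review comment.
normalized = " ".join(decoded_cell.split())

# What the original assertion expected: a literal backslash followed by "n".
expected_by_old_test = "line 1\\nline 2"

assert normalized == "line 1 line 2"
assert normalized != expected_by_old_test
assert len(expected_by_old_test) == len(decoded_cell) + 1
```

The last assertion shows that the literal form is one character longer than the decoded cell, so the old expectation could never match either the decoded or the normalized text.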

Additional Locations (1)

test_unstructured/partition/test_csv.py#L147-L148



cragwolfe · 2026-04-06T20:22:35Z

gpt-pro review: https://docs.google.com/document/d/18htWZDTCncavuoOozfLVCKjLx-fdSAfbzXUcS2xxwyw/edit?usp=sharing

cursor bot reviewed Apr 5, 2026


PastelStorm added 2 commits April 5, 2026 17:49

fix: preserve CSV semantics for single-column files

2cd1668

Keep single-column CSV parsing on the CSV code path so quoted commas, escaped quotes, and multiline cells remain decoded correctly when delimiter sniffing falls back. Made-with: Cursor

fixes

06fd2f4

PastelStorm force-pushed the evoss/csv-single-column-parsing branch from 236467a to 06fd2f4 Compare April 6, 2026 00:53

PastelStorm wants to merge 2 commits into evoss/pdf-rendering-refactor from evoss/csv-single-column-parsing
