fix: preserve CSV semantics for single-column files #4322
PastelStorm wants to merge 2 commits into evoss/pdf-rendering-refactor from
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Test assertions expect literal \n instead of space
- Updated the multiline quoted-field assertions to expect whitespace-normalized output (line 1 line 2) in both element text and HTML.
Or push these changes by commenting:
@cursor push 9f629b4695
Preview (9f629b4695)
diff --git a/test_unstructured/partition/test_csv.py b/test_unstructured/partition/test_csv.py
--- a/test_unstructured/partition/test_csv.py
+++ b/test_unstructured/partition/test_csv.py
@@ -141,11 +141,11 @@
     elements = partition_csv(file=io.BytesIO(csv_data), include_header=True)
-    assert elements[0].text == 'notes hello, world a "quote" line 1\\nline 2'
+    assert elements[0].text == 'notes hello, world a "quote" line 1 line 2'
     assert elements[0].metadata.text_as_html is not None
     assert "<td>hello, world</td>" in elements[0].metadata.text_as_html
     assert '<td>a "quote"</td>' in elements[0].metadata.text_as_html
-    assert "<td>line 1\\nline 2</td>" in elements[0].metadata.text_as_html
+    assert "<td>line 1 line 2</td>" in elements[0].metadata.text_as_html
     assert '"hello, world"' not in elements[0].metadata.text_as_html
     assert '""quote""' not in elements[0].metadata.text_as_html
Reviewed by Cursor Bugbot for commit c2b87fd.
    elements = partition_csv(file=io.BytesIO(csv_data), include_header=True)
    assert elements[0].text == 'notes hello, world a "quote" line 1\\nline 2'
Test assertions expect literal \n instead of space
Medium Severity
The test assertions for the multiline quoted field expect a literal two-character \n (backslash + n) in both elements[0].text and text_as_html, but that's not what the production code produces. The csv.reader correctly decodes "line 1\nline 2" into a string with an actual newline character. Then HtmlTable.from_html_text() normalizes all whitespace via " ".join(e.text.split()), converting the newline to a plain space. The actual output text contains line 1 line 2 (with a space), not line 1\nline 2 (with literal backslash-n). These assertions will fail when run.
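The behavior described in this comment can be reproduced with the standard library alone. The following is a minimal sketch, not the project's code: the `csv_data` payload is an assumption modeled on the assertions in the test, and the `" ".join(cell.split())` step stands in for the whitespace normalization that the comment attributes to `HtmlTable.from_html_text()`.

```python
import csv
import io

# Assumed test payload: one header plus three quoted cells. The last cell
# contains a real newline character inside the quotes, not backslash + n.
csv_data = b'notes\n"hello, world"\n"a ""quote"""\n"line 1\nline 2"\n'

# newline="" leaves line endings untouched so csv.reader can handle
# newlines embedded in quoted fields itself.
reader = csv.reader(io.StringIO(csv_data.decode("utf-8"), newline=""))
cells = [row[0] for row in reader]

# csv.reader decodes the quoting: the embedded newline survives as "\n".
multiline_cell = cells[3]

# Whitespace normalization in the style quoted in the comment above:
# any run of whitespace, including a newline, collapses to a single space.
normalized = [" ".join(cell.split()) for cell in cells]
text = " ".join(normalized)
```

Under these assumptions, `multiline_cell` holds a real newline, while `text` contains `line 1 line 2` separated by a space, which is why an assertion on a literal backslash-n cannot match.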
Keep single-column CSV parsing on the CSV code path so quoted commas, escaped quotes, and multiline cells remain decoded correctly when delimiter sniffing falls back.
Made-with: Cursor
236467a to 06fd2f4
Note
Medium Risk
Changes CSV delimiter sniffing and the partition_csv parsing path, which could alter how edge-case CSV/TSV inputs are interpreted (especially around truncation and quoting). Covered by new targeted tests for single-column, quoted fields, and fallback delimiter detection.
Overview
Fixes single-column CSV parsing by keeping those files on a CSV-aware code path: when delimiter detection yields None, partition_csv() now parses via csv.reader into a one-column DataFrame so quoted commas, escaped quotes, and multiline fields are decoded correctly.
Improves delimiter detection robustness by expanding sniffable delimiters to include tabs, handling fixed-size samples without trailing newlines, and adding a fallback heuristic that tries common delimiters and accepts only consistent multi-column shapes. Adds comprehensive tests covering these cases and documents the fix in the changelog.
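The two behaviors summarized above (a fallback heuristic that accepts only consistent multi-column shapes, and a CSV-aware single-column path) might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `guess_delimiter`, `read_rows`, and the `CANDIDATE_DELIMITERS` tuple are hypothetical names, the candidate list is a guess, and the sketch returns plain rows rather than building a DataFrame.

```python
import csv
import io
from typing import List, Optional

# Hypothetical candidate list; the PR's actual set of delimiters may differ.
CANDIDATE_DELIMITERS = (",", ";", "\t", "|")


def guess_delimiter(sample: str) -> Optional[str]:
    """Return the first candidate that splits every row into the same
    number of columns, provided that number is greater than one."""
    for delimiter in CANDIDATE_DELIMITERS:
        reader = csv.reader(io.StringIO(sample, newline=""), delimiter=delimiter)
        widths = {len(row) for row in reader if row}
        if len(widths) == 1 and next(iter(widths)) > 1:
            return delimiter
    return None  # no consistent multi-column shape: treat as single-column


def read_rows(text: str) -> List[List[str]]:
    """Parse with csv.reader even when no delimiter was detected, so
    quoting rules still apply to single-column input."""
    delimiter = guess_delimiter(text)
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter or ",")
    rows = [row for row in reader if row]
    if delimiter is None:
        # Assumption of this sketch: rejoin any stray unquoted commas so
        # each record stays in exactly one column.
        rows = [[",".join(row)] for row in rows]
    return rows


single_column = 'notes\n"hello, world"\n"a ""quote"""\n"line 1\nline 2"\n'
tab_separated = "a\tb\n1\t2"  # note: no trailing newline
```

Here `read_rows(single_column)` keeps `hello, world` in one cell because the comma is quoted, and `guess_delimiter(tab_separated)` settles on the tab even though the sample lacks a trailing newline. Requiring a single consistent width greater than one is what keeps a quoted comma in a one-column file from being mistaken for a separator.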
Reviewed by Cursor Bugbot for commit 06fd2f4. Bugbot is set up for automated code reviews on this repo.