Always use C engine for CSV parsing (fix #696)#697
Open
laughingman7743 wants to merge 3 commits intomasterfrom
Open
Always use C engine for CSV parsing (fix #696)#697laughingman7743 wants to merge 3 commits intomasterfrom
laughingman7743 wants to merge 3 commits intomasterfrom
Conversation
The Python engine selection for large files (>50MB) introduced in PR #594 was based on incorrect AI-generated claims that pandas' C parser has 32-bit integer limitations. This is factually wrong — pandas defaults to int64 and the C parser has no such limits. The original "signed integer is greater than maximum" error (Issue #414) was caused by an OpenSSL SSL_read() 2GB buffer limit in Python 3.8 (bpo-42853), which was fixed in Python 3.10 and is unrelated to CSV parsing. The forced Python engine causes significant performance degradation (up to 28% slower on pandas 2.3.3 per Issue #696 benchmarks). Changes: - _get_optimal_csv_engine() now always returns "c" - Remove misleading "C parser limitations" error handler - Update tests to expect C engine for all file sizes Closes #696 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… logic Delete _get_optimal_csv_engine (always returned "c") and _get_pyarrow_engine, inlining the pyarrow compatibility checks directly into _get_csv_engine. This reduces the method chain from 4 methods to 2 (keeping _get_available_engine which is shared with _get_parquet_engine). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consolidate multiple "return c" branches into a single fallthrough by expressing the pyarrow compatibility check as a positive condition (is_compatible) instead of multiple negative early returns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
WHAT
Simplify CSV engine selection logic by removing dead code and inlining pyarrow compatibility checks. Always use the C engine as the default (same as pandas' own default).
Changes
Removed:
_get_optimal_csv_engine()— always returned"c", no reason to exist_get_pyarrow_engine()— inlined into_get_csv_engine()test_get_optimal_csv_engine— tests a deleted method_get_optimal_csv_enginemocks fromtest_get_csv_engine_explicit_specificationSimplified:
_get_csv_engine()now contains all CSV engine selection logic directly (was dispatching through 3 helper methods)_get_csv_engine+_get_available_enginewhich is shared with_get_parquet_engine)Kept:
_get_available_engine()— also used by_get_parquet_engineLARGE_FILE_THRESHOLD_BYTES— still used by_auto_determine_chunksizefor automatic chunkingWHY
PR #594 introduced 4 methods for CSV engine selection based on incorrect AI-generated claims about pandas C parser int32 limitations. This is factually wrong:
int64for integer dtypesThe original "signed integer is greater than maximum" error (Issue #414) was actually caused by an OpenSSL
SSL_read()2GB buffer limit in Python 3.8 (bpo-42853), which was fixed in Python 3.10.The forced Python engine caused significant performance degradation as reported in Issue #696:
Now that the first commit fixed the engine selection to always use C, this second commit cleans up the unnecessary complexity left behind.
Closes #696