
Always use C engine for CSV parsing (fix #696)#697

Open
laughingman7743 wants to merge 3 commits into master from fix/csv-engine-always-use-c

Conversation

@laughingman7743
Member

@laughingman7743 laughingman7743 commented Mar 23, 2026

WHAT

Simplify CSV engine selection logic by removing dead code and inlining pyarrow compatibility checks. Always use the C engine as the default (same as pandas' own default).

Changes

Removed:

  • _get_optimal_csv_engine() — always returned "c", no reason to exist
  • _get_pyarrow_engine() — inlined into _get_csv_engine()
  • test_get_optimal_csv_engine — tests a deleted method
  • _get_optimal_csv_engine mocks from test_get_csv_engine_explicit_specification

Simplified:

  • _get_csv_engine() now contains all CSV engine selection logic directly (was dispatching through 3 helper methods)
  • Method chain reduced from 4 methods → 2 methods (_get_csv_engine + _get_available_engine which is shared with _get_parquet_engine)

Kept:

  • _get_available_engine() — also used by _get_parquet_engine
  • LARGE_FILE_THRESHOLD_BYTES — still used by _auto_determine_chunksize for automatic chunking
  • All pyarrow compatibility checks (chunksize, quoting, converters, file size, import availability)

WHY

PR #594 introduced 4 methods for CSV engine selection based on AI-generated claims about int32 limitations in pandas' C parser. Those claims are factually wrong:

  • pandas' C parser does not have int32 limitations
  • pandas defaults to int64 for integer dtypes
  • The C parser can handle files well over 4GB without issues
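This is easy to verify directly: values far above the int32 maximum parse cleanly with the C engine and come back as int64 (a minimal check, not part of the PR):

```python
import io

import pandas as pd

# 9,000,000,000 is well beyond the int32 maximum (2,147,483,647).
csv_data = "id,value\n1,9000000000\n2,-9000000000\n"
df = pd.read_csv(io.StringIO(csv_data), engine="c")

print(df["value"].dtype)       # int64 — no int32 limitation
print(int(df["value"].max()))  # 9000000000
```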

The original "signed integer is greater than maximum" error (Issue #414) was actually caused by an OpenSSL SSL_read() 2GB buffer limit in Python 3.8 (bpo-42853), which was fixed in Python 3.10.

The forced Python engine caused significant performance degradation as reported in Issue #696:

| pandas version | Python engine (before) | C engine (after) | Slowdown |
|----------------|------------------------|------------------|----------|
| 1.5.3          | 241s                   | 218s             | +10.5%   |
| 2.3.3          | 68.3s                  | 53.3s            | +28.1%   |
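A minimal way to reproduce the comparison locally (a synthetic micro-benchmark, not the Issue #696 workload; absolute timings depend on the machine):

```python
import io
import time

import pandas as pd

# Synthetic CSV with 100k rows of mixed numeric/string columns.
rows = "\n".join(f"{i},{i * 1.5},label_{i % 10}" for i in range(100_000))
csv_data = "a,b,c\n" + rows


def parse(engine: str):
    start = time.perf_counter()
    df = pd.read_csv(io.StringIO(csv_data), engine=engine)
    return df, time.perf_counter() - start


df_c, t_c = parse("c")
df_py, t_py = parse("python")

assert df_c.equals(df_py)  # both engines produce identical frames
print(f"c engine: {t_c:.3f}s, python engine: {t_py:.3f}s")
```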

Now that the first commit fixed the engine selection to always use C, this second commit cleans up the unnecessary complexity left behind.

Closes #696

The Python engine selection for large files (>50MB) introduced in PR #594 was
based on incorrect AI-generated claims that pandas' C parser has 32-bit integer
limitations. This is factually wrong — pandas defaults to int64 and the C parser
has no such limits.

The original "signed integer is greater than maximum" error (Issue #414) was
caused by an OpenSSL SSL_read() 2GB buffer limit in Python 3.8 (bpo-42853),
which was fixed in Python 3.10 and is unrelated to CSV parsing.

The forced Python engine causes significant performance degradation (up to 28%
slower on pandas 2.3.3 per Issue #696 benchmarks).

Changes:
- _get_optimal_csv_engine() now always returns "c"
- Remove misleading "C parser limitations" error handler
- Update tests to expect C engine for all file sizes

Closes #696

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
laughingman7743 and others added 2 commits March 24, 2026 08:59
… logic

Delete _get_optimal_csv_engine (always returned "c") and _get_pyarrow_engine,
inlining the pyarrow compatibility checks directly into _get_csv_engine.
This reduces the method chain from 4 methods to 2 (keeping _get_available_engine
which is shared with _get_parquet_engine).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consolidate multiple "return c" branches into a single fallthrough by
expressing the pyarrow compatibility check as a positive condition
(is_compatible) instead of multiple negative early returns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@laughingman7743 laughingman7743 marked this pull request as ready for review March 24, 2026 02:30


Development

Successfully merging this pull request may close these issues.

Slow parsing of large CSV files