
Always use C engine for CSV parsing (fix #696)#697

Open
laughingman7743 wants to merge 3 commits into master from fix/csv-engine-always-use-c

Conversation

@laughingman7743
Member

@laughingman7743 laughingman7743 commented Mar 23, 2026

WHAT

Simplify CSV engine selection logic by removing dead code and inlining pyarrow compatibility checks. Always use the C engine as the default (same as pandas' own default).

Changes

Removed:

  • _get_optimal_csv_engine() — always returned "c", no reason to exist
  • _get_pyarrow_engine() — inlined into _get_csv_engine()
  • test_get_optimal_csv_engine — tests a deleted method
  • _get_optimal_csv_engine mocks from test_get_csv_engine_explicit_specification

Simplified:

  • _get_csv_engine() now contains all CSV engine selection logic directly (was dispatching through 3 helper methods)
  • Method chain reduced from 4 methods → 2 methods (_get_csv_engine + _get_available_engine which is shared with _get_parquet_engine)

Kept:

  • _get_available_engine() — also used by _get_parquet_engine
  • LARGE_FILE_THRESHOLD_BYTES — still used by _auto_determine_chunksize for automatic chunking
  • All pyarrow compatibility checks (chunksize, quoting, converters, file size, import availability)

WHY

PR #594 introduced 4 methods for CSV engine selection based on AI-generated claims about int32 limitations in pandas' C parser. Those claims are factually wrong:

  • pandas' C parser does not have int32 limitations
  • pandas defaults to int64 for integer dtypes
  • The C parser can handle files well over 4GB without issues
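This is easy to verify directly: values far above the int32 maximum parse cleanly with the C engine and come back as int64 (a minimal check, not part of the PR):

```python
import io

import pandas as pd

# 9,000,000,000 is well beyond the int32 maximum (2,147,483,647).
csv_data = "id,value\n1,9000000000\n2,-9000000000\n"
df = pd.read_csv(io.StringIO(csv_data), engine="c")

print(df["value"].dtype)       # int64 — no int32 limitation
print(int(df["value"].max()))  # 9000000000
```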

The original "signed integer is greater than maximum" error (Issue #414) was actually caused by an OpenSSL SSL_read() 2GB buffer limit in Python 3.8 (bpo-42853), which was fixed in Python 3.10.

The forced Python engine caused significant performance degradation as reported in Issue #696:

| pandas version | Python engine (before) | C engine (after) | Slowdown |
|----------------|------------------------|------------------|----------|
| 1.5.3          | 241s                   | 218s             | +10.5%   |
| 2.3.3          | 68.3s                  | 53.3s            | +28.1%   |
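A minimal way to reproduce the comparison locally (a synthetic micro-benchmark, not the Issue #696 workload; absolute timings depend on the machine):

```python
import io
import time

import pandas as pd

# Synthetic CSV with 100k rows of mixed numeric/string columns.
rows = "\n".join(f"{i},{i * 1.5},label_{i % 10}" for i in range(100_000))
csv_data = "a,b,c\n" + rows


def parse(engine: str):
    start = time.perf_counter()
    df = pd.read_csv(io.StringIO(csv_data), engine=engine)
    return df, time.perf_counter() - start


df_c, t_c = parse("c")
df_py, t_py = parse("python")

assert df_c.equals(df_py)  # both engines produce identical frames
print(f"c engine: {t_c:.3f}s, python engine: {t_py:.3f}s")
```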

Now that the first commit fixed the engine selection to always use C, this second commit cleans up the unnecessary complexity left behind.

Closes #696

The Python engine selection for large files (>50MB) introduced in PR #594 was
based on incorrect AI-generated claims that pandas' C parser has 32-bit integer
limitations. This is factually wrong — pandas defaults to int64 and the C parser
has no such limits.

The original "signed integer is greater than maximum" error (Issue #414) was
caused by an OpenSSL SSL_read() 2GB buffer limit in Python 3.8 (bpo-42853),
which was fixed in Python 3.10 and is unrelated to CSV parsing.

The forced Python engine causes significant performance degradation (up to 28%
slower on pandas 2.3.3 per Issue #696 benchmarks).

Changes:
- _get_optimal_csv_engine() now always returns "c"
- Remove misleading "C parser limitations" error handler
- Update tests to expect C engine for all file sizes

Closes #696

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
laughingman7743 and others added 2 commits March 24, 2026 08:59
… logic

Delete _get_optimal_csv_engine (always returned "c") and _get_pyarrow_engine,
inlining the pyarrow compatibility checks directly into _get_csv_engine.
This reduces the method chain from 4 methods to 2 (keeping _get_available_engine
which is shared with _get_parquet_engine).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Consolidate multiple "return c" branches into a single fallthrough by
expressing the pyarrow compatibility check as a positive condition
(is_compatible) instead of multiple negative early returns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@laughingman7743 laughingman7743 marked this pull request as ready for review March 24, 2026 02:30


Development

Successfully merging this pull request may close these issues.

Slow parsing of large CSV files