
fix: chunk large DataFrames in register to avoid duckdb-rs panic #114

Merged

georgestagg merged 6 commits into posit-dev:main from cpsievert:fix/chunk-large-dataframe-register
Feb 18, 2026

Conversation

Collaborator

@cpsievert cpsievert commented Feb 11, 2026

Summary

DuckDBReader::register panics on any DataFrame with more than 2048 rows. This affects all callers — including the Python bindings (PyDuckDBReader.register) — making it impossible to register moderately-sized datasets.

Root cause

dataframe_to_arrow_params concatenates all Arrow IPC batches into a single RecordBatch, then passes it to duckdb-rs's arrow() table function. That function's ArrowVTab::func writes the entire RecordBatch into a single DuckDB DataChunk in one call — but DataChunk vectors have a fixed capacity of STANDARD_VECTOR_SIZE (2048). When the RecordBatch has more rows than that, FlatVector::copy hits assert!(data.len() <= self.capacity()) and the process aborts.
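To make the failure mode concrete, here is a minimal self-contained sketch (not the duckdb-rs source; `FlatVector` here is a mock) of a fixed-capacity vector whose `copy` asserts rather than spilling into a second chunk, which is the shape of the panic described above:

```rust
// Minimal mock of the failure mode: a vector with a fixed capacity of
// STANDARD_VECTOR_SIZE whose copy() asserts instead of spilling rows
// into a second chunk. Illustrative only; not duckdb-rs code.
const STANDARD_VECTOR_SIZE: usize = 2048; // DuckDB's fixed vector capacity

struct FlatVector {
    data: Vec<i64>,
    capacity: usize,
}

impl FlatVector {
    fn new() -> Self {
        FlatVector { data: Vec::new(), capacity: STANDARD_VECTOR_SIZE }
    }

    // Mirrors the failing assertion: more rows than the vector can hold
    // aborts the process rather than returning an error.
    fn copy(&mut self, rows: &[i64]) {
        assert!(rows.len() <= self.capacity, "data.len() <= self.capacity()");
        self.data.extend_from_slice(rows);
    }
}

fn main() {
    let mut v = FlatVector::new();
    v.copy(&vec![0i64; 2048]); // fits exactly
    // v.copy(&vec![0i64; 3000]); // would panic, like a 3000-row RecordBatch
    println!("copied {} rows", v.data.len());
}
```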

Fix

Chunk DataFrames larger than 2048 rows before passing them to dataframe_to_arrow_params. The first chunk creates the table (CREATE TEMP TABLE ... AS SELECT * FROM arrow(?, ?)), and subsequent chunks insert into it (INSERT INTO ... SELECT * FROM arrow(?, ?)). Small DataFrames (≤2048 rows) take the existing single-batch path with no overhead.
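The chunking strategy can be sketched as follows. This is an illustrative outline, not the actual crate API: `chunk_ranges` and `register_chunk_sql` are hypothetical helper names, and the real implementation slices Arrow batches rather than plain row ranges.

```rust
// Hypothetical sketch of the chunking described above. The first chunk
// creates the temp table; every later chunk inserts into it.
const STANDARD_VECTOR_SIZE: usize = 2048;

/// Split `n_rows` into (offset, len) chunks of at most 2048 rows each.
fn chunk_ranges(n_rows: usize) -> Vec<(usize, usize)> {
    (0..n_rows)
        .step_by(STANDARD_VECTOR_SIZE)
        .map(|off| (off, STANDARD_VECTOR_SIZE.min(n_rows - off)))
        .collect()
}

/// SQL for the i-th chunk: CREATE for the first, INSERT for the rest.
fn register_chunk_sql(table: &str, chunk_index: usize) -> String {
    if chunk_index == 0 {
        format!("CREATE TEMP TABLE {table} AS SELECT * FROM arrow(?, ?)")
    } else {
        format!("INSERT INTO {table} SELECT * FROM arrow(?, ?)")
    }
}

fn main() {
    // A 3000-row DataFrame becomes two chunks: 2048 rows, then 952.
    let ranges = chunk_ranges(3000);
    assert_eq!(ranges, vec![(0, 2048), (2048, 952)]);
    for (i, (off, len)) in ranges.iter().enumerate() {
        println!("rows {off}..{} -> {}", off + len, register_chunk_sql("df", i));
    }
}
```

A DataFrame of exactly 2048 rows yields a single chunk, matching the "no overhead" fast path mentioned above.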

This is a workaround for a bug in duckdb-rs's ArrowVTab implementation rather than a fundamental limitation of DuckDB itself.

Test plan

  • Added test_register_large_dataframe (3000 rows with int, float, and string columns — verifies row count and data integrity at chunk boundaries)
  • All existing register tests continue to pass

🤖 Generated with Claude Code

duckdb-rs's Arrow virtual table function writes an entire RecordBatch
into a single DuckDB DataChunk whose vectors have a fixed capacity of
STANDARD_VECTOR_SIZE (2048 rows). When a RecordBatch exceeds this,
FlatVector::copy panics with "assertion failed: data.len() <= self.capacity()".

Work around this by splitting DataFrames larger than 2048 rows into
chunks: the first chunk creates the table via CREATE TEMP TABLE ... AS
SELECT * FROM arrow(?, ?), and subsequent chunks use INSERT INTO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cpsievert cpsievert force-pushed the fix/chunk-large-dataframe-register branch from 6df1866 to 7c53e1c on February 11, 2026 at 23:56
@georgestagg
Collaborator

georgestagg commented Feb 12, 2026

Great spot, thanks!

This looks like it would work, but before we start managing the chunking ourselves, can we look into the appender API as an alternative way of loading the arrow datasets into duckdb? I think it's gated behind the appender-arrow feature, rather than vtab-arrow.

It looks like there was a fix in the appender API to support chunking for large arrow datasets, so if we can use that we'd get the chunking without having to manage it ourselves.

If it's no good, or it ends up being too complex a change to switch, we'll stick with this method.
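The zero-copy slicing the appender fix (duckdb-rs#530) relies on can be illustrated with a small stdlib mock. This is not duckdb-rs or Arrow code: `BatchSlice` stands in for an Arrow `RecordBatch`, whose `slice` likewise produces an (offset, len) view over shared buffers without copying the row data.

```rust
use std::sync::Arc;

// Mock of zero-copy batch slicing: each chunk is a (buffer, offset, len)
// view over shared storage, so splitting a large batch into 2048-row
// pieces never copies row data. Illustrative only.
const STANDARD_VECTOR_SIZE: usize = 2048;

#[derive(Clone)]
struct BatchSlice {
    buffer: Arc<Vec<i64>>, // shared backing storage, like an Arrow buffer
    offset: usize,
    len: usize,
}

impl BatchSlice {
    /// A sub-view of this slice; shares the buffer, copies nothing.
    fn slice(&self, offset: usize, len: usize) -> BatchSlice {
        assert!(offset + len <= self.len);
        BatchSlice {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset + offset,
            len,
        }
    }
}

/// Split a batch into views of at most STANDARD_VECTOR_SIZE rows.
fn split_for_append(batch: &BatchSlice) -> Vec<BatchSlice> {
    (0..batch.len)
        .step_by(STANDARD_VECTOR_SIZE)
        .map(|off| batch.slice(off, STANDARD_VECTOR_SIZE.min(batch.len - off)))
        .collect()
}

fn main() {
    let batch = BatchSlice { buffer: Arc::new(vec![0i64; 5000]), offset: 0, len: 5000 };
    let chunks = split_for_append(&batch);
    assert_eq!(chunks.len(), 3); // 2048 + 2048 + 904 rows
    // Every chunk still points at the original buffer: no copies were made.
    assert!(chunks.iter().all(|c| Arc::ptr_eq(&c.buffer, &batch.buffer)));
}
```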

Comment on lines +160 to +162

/// Map an Arrow data type to a DuckDB SQL type name
fn arrow_type_to_duckdb_sql(dt: &ArrowDataType) -> Result<&'static str> {
Collaborator


Oh it's a shame we have to do this mapping ourselves...

Bah, both solutions have drawbacks.

Which did you prefer between chunking the data ourselves or this appender API, if you have a preference?

Collaborator Author

@cpsievert cpsievert Feb 18, 2026


Yeah, I think I prefer the original approach in hindsight. 4cb2c4c reverts to that approach and 4f63f69 explains in more detail where the constant is coming from.

@cpsievert cpsievert force-pushed the fix/chunk-large-dataframe-register branch 3 times, most recently from 24c938c to d28e170, on February 18, 2026 at 00:56
cpsievert and others added 3 commits February 17, 2026 18:58
Replace manual 2048-row chunking with duckdb-rs's Appender API, which
handles chunking internally via zero-copy Arrow slicing (duckdb-rs#530).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document where STANDARD_VECTOR_SIZE comes from in DuckDB's C++ source
and explain the failure mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cpsievert cpsievert force-pushed the fix/chunk-large-dataframe-register branch from d28e170 to 4f63f69 on February 18, 2026 at 00:58
@georgestagg georgestagg merged commit 3b6b659 into posit-dev:main Feb 18, 2026
3 checks passed