
fix: chunk large DataFrames in register to avoid duckdb-rs panic #114

Merged

georgestagg merged 6 commits into posit-dev:main from cpsievert:fix/chunk-large-dataframe-register
Feb 18, 2026

Conversation

Collaborator

@cpsievert cpsievert commented Feb 11, 2026

Summary

DuckDBReader::register panics on any DataFrame with more than 2048 rows. This affects all callers — including the Python bindings (PyDuckDBReader.register) — making it impossible to register moderately-sized datasets.

Root cause

dataframe_to_arrow_params concatenates all Arrow IPC batches into a single RecordBatch, then passes it to duckdb-rs's arrow() table function. That function's ArrowVTab::func writes the entire RecordBatch into a single DuckDB DataChunk in one call — but DataChunk vectors have a fixed capacity of STANDARD_VECTOR_SIZE (2048). When the RecordBatch has more rows than that, FlatVector::copy hits assert!(data.len() <= self.capacity()) and the process aborts.
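To make the failure mode concrete, here is a minimal self-contained sketch (not the duckdb-rs source; `FlatVector` here is a mock) of a fixed-capacity vector whose `copy` asserts rather than spilling into a second chunk, which is the shape of the panic described above:

```rust
// Minimal mock of the failure mode: a vector with a fixed capacity of
// STANDARD_VECTOR_SIZE whose copy() asserts instead of spilling rows
// into a second chunk. Illustrative only; not duckdb-rs code.
const STANDARD_VECTOR_SIZE: usize = 2048; // DuckDB's fixed vector capacity

struct FlatVector {
    data: Vec<i64>,
    capacity: usize,
}

impl FlatVector {
    fn new() -> Self {
        FlatVector { data: Vec::new(), capacity: STANDARD_VECTOR_SIZE }
    }

    // Mirrors the failing assertion: more rows than the vector can hold
    // aborts the process rather than returning an error.
    fn copy(&mut self, rows: &[i64]) {
        assert!(rows.len() <= self.capacity, "data.len() <= self.capacity()");
        self.data.extend_from_slice(rows);
    }
}

fn main() {
    let mut v = FlatVector::new();
    v.copy(&vec![0i64; 2048]); // fits exactly
    // v.copy(&vec![0i64; 3000]); // would panic, like a 3000-row RecordBatch
    println!("copied {} rows", v.data.len());
}
```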

Fix

Chunk DataFrames larger than 2048 rows before passing them to dataframe_to_arrow_params. The first chunk creates the table (CREATE TEMP TABLE ... AS SELECT * FROM arrow(?, ?)), and subsequent chunks insert into it (INSERT INTO ... SELECT * FROM arrow(?, ?)). Small DataFrames (≤2048 rows) take the existing single-batch path with no overhead.
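The chunking strategy can be sketched as follows. This is an illustrative outline, not the actual crate API: `chunk_ranges` and `register_chunk_sql` are hypothetical helper names, and the real implementation slices Arrow batches rather than plain row ranges.

```rust
// Hypothetical sketch of the chunking described above. The first chunk
// creates the temp table; every later chunk inserts into it.
const STANDARD_VECTOR_SIZE: usize = 2048;

/// Split `n_rows` into (offset, len) chunks of at most 2048 rows each.
fn chunk_ranges(n_rows: usize) -> Vec<(usize, usize)> {
    (0..n_rows)
        .step_by(STANDARD_VECTOR_SIZE)
        .map(|off| (off, STANDARD_VECTOR_SIZE.min(n_rows - off)))
        .collect()
}

/// SQL for the i-th chunk: CREATE for the first, INSERT for the rest.
fn register_chunk_sql(table: &str, chunk_index: usize) -> String {
    if chunk_index == 0 {
        format!("CREATE TEMP TABLE {table} AS SELECT * FROM arrow(?, ?)")
    } else {
        format!("INSERT INTO {table} SELECT * FROM arrow(?, ?)")
    }
}

fn main() {
    // A 3000-row DataFrame becomes two chunks: 2048 rows, then 952.
    let ranges = chunk_ranges(3000);
    assert_eq!(ranges, vec![(0, 2048), (2048, 952)]);
    for (i, (off, len)) in ranges.iter().enumerate() {
        println!("rows {off}..{} -> {}", off + len, register_chunk_sql("df", i));
    }
}
```

A DataFrame of exactly 2048 rows yields a single chunk, matching the "no overhead" fast path mentioned above.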

This is a workaround for a bug in duckdb-rs's ArrowVTab implementation rather than a fundamental limitation of DuckDB itself.

Test plan

  • Added test_register_large_dataframe (3000 rows with int, float, and string columns — verifies row count and data integrity at chunk boundaries)
  • All existing register tests continue to pass

🤖 Generated with Claude Code

duckdb-rs's Arrow virtual table function writes an entire RecordBatch
into a single DuckDB DataChunk whose vectors have a fixed capacity of
STANDARD_VECTOR_SIZE (2048 rows). When a RecordBatch exceeds this,
FlatVector::copy panics with "assertion failed: data.len() <= self.capacity()".

Work around this by splitting DataFrames larger than 2048 rows into
chunks: the first chunk creates the table via CREATE TEMP TABLE ... AS
SELECT * FROM arrow(?, ?), and subsequent chunks use INSERT INTO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cpsievert cpsievert force-pushed the fix/chunk-large-dataframe-register branch from 6df1866 to 7c53e1c on February 11, 2026 at 23:56
@georgestagg
Collaborator

georgestagg commented Feb 12, 2026

Great spot, thanks!

This looks like it would work, but before we start managing the chunking ourselves, can we look into the appender API as an alternative way of loading the arrow datasets into duckdb? I think it's gated behind the appender-arrow feature, rather than vtab-arrow.

It looks like there was a fix in the appender API to support chunking for large arrow datasets, so if we can use that we'd get the chunking without having to manage it ourselves.

If it's no good, or it ends up being too complex a change to switch, we'll stick with this method.
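The zero-copy slicing the appender fix (duckdb-rs#530) relies on can be illustrated with a small stdlib mock. This is not duckdb-rs or Arrow code: `BatchSlice` stands in for an Arrow `RecordBatch`, whose `slice` likewise produces an (offset, len) view over shared buffers without copying the row data.

```rust
use std::sync::Arc;

// Mock of zero-copy batch slicing: each chunk is a (buffer, offset, len)
// view over shared storage, so splitting a large batch into 2048-row
// pieces never copies row data. Illustrative only.
const STANDARD_VECTOR_SIZE: usize = 2048;

#[derive(Clone)]
struct BatchSlice {
    buffer: Arc<Vec<i64>>, // shared backing storage, like an Arrow buffer
    offset: usize,
    len: usize,
}

impl BatchSlice {
    /// A sub-view of this slice; shares the buffer, copies nothing.
    fn slice(&self, offset: usize, len: usize) -> BatchSlice {
        assert!(offset + len <= self.len);
        BatchSlice {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset + offset,
            len,
        }
    }
}

/// Split a batch into views of at most STANDARD_VECTOR_SIZE rows.
fn split_for_append(batch: &BatchSlice) -> Vec<BatchSlice> {
    (0..batch.len)
        .step_by(STANDARD_VECTOR_SIZE)
        .map(|off| batch.slice(off, STANDARD_VECTOR_SIZE.min(batch.len - off)))
        .collect()
}

fn main() {
    let batch = BatchSlice { buffer: Arc::new(vec![0i64; 5000]), offset: 0, len: 5000 };
    let chunks = split_for_append(&batch);
    assert_eq!(chunks.len(), 3); // 2048 + 2048 + 904 rows
    // Every chunk still points at the original buffer: no copies were made.
    assert!(chunks.iter().all(|c| Arc::ptr_eq(&c.buffer, &batch.buffer)));
}
```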

Comment on lines +160 to +162

/// Map an Arrow data type to a DuckDB SQL type name
fn arrow_type_to_duckdb_sql(dt: &ArrowDataType) -> Result<&'static str> {
Collaborator


Oh it's a shame we have to do this mapping ourselves...

Bah, both solutions have drawbacks.

Which did you prefer between chunking the data ourselves or this appender API, if you have a preference?

Collaborator Author

@cpsievert cpsievert Feb 18, 2026


Yeah, I think I prefer the original approach in hindsight. 4cb2c4c reverts to that approach and 4f63f69 explains in more detail where the constant is coming from.

@cpsievert cpsievert force-pushed the fix/chunk-large-dataframe-register branch 3 times, most recently from 24c938c to d28e170, on February 18, 2026 at 00:56
cpsievert and others added 3 commits February 17, 2026 18:58
Replace manual 2048-row chunking with duckdb-rs's Appender API, which
handles chunking internally via zero-copy Arrow slicing (duckdb-rs#530).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document where STANDARD_VECTOR_SIZE comes from in DuckDB's C++ source
and explain the failure mode.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@cpsievert cpsievert force-pushed the fix/chunk-large-dataframe-register branch from d28e170 to 4f63f69 on February 18, 2026 at 00:58
@georgestagg georgestagg merged commit 3b6b659 into posit-dev:main Feb 18, 2026
3 checks passed