Fix polars lazy frame from con.execute() crashing on pushdown #484
Open
wckao wants to merge 1 commit into
Result-backed DuckDBPyRelation objects (the ones returned by
`con.execute(...)`) have a null `rel` member, so `project`, `filter`,
and `limit` silently return nullptr — which pybind11 turns into Python
`None`. The polars IO plugin captured the relation and called these
methods whenever polars pushed down `with_columns`, `predicate`, or
`n_rows`, producing:
AttributeError: 'NoneType' object has no attribute 'fetch_arrow_reader'
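The failure mode can be reproduced in miniature without DuckDB. The following is a minimal sketch, not the actual binding code: `FakeRelation`, `pushdown_fails`, and the method bodies are hypothetical stand-ins that copy only the null-guard behavior described above.

```python
class FakeRelation:
    """Hypothetical stand-in for a DuckDBPyRelation; not the real binding."""

    def __init__(self, rel=None):
        # `rel` is None for a result-backed object, mirroring the null member.
        self.rel = rel

    def project(self, *columns):
        # Mirrors a guard that returns nullptr, which Python sees as None.
        if self.rel is None:
            return None
        return self

    def to_arrow_reader(self):
        # Placeholder reader: an empty iterator is enough for the illustration.
        return iter(())


def pushdown_fails(relation):
    """Return the error text if chaining after a pushdown call fails, else None."""
    try:
        relation.project("a").to_arrow_reader()
    except AttributeError as exc:
        return str(exc)
    return None


# Result-backed stand-in: the chain breaks because project() returned None.
print(pushdown_fails(FakeRelation()))
# Relation-backed stand-in: the chain works, so no error text is returned.
print(pushdown_fails(FakeRelation(rel=object())))
```

The point of the sketch is that the error surfaces one call later than the actual problem, which is why the resulting `AttributeError` is hard to act on.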
Add a `_has_relation` property exposing whether the wrapped object is a
replayable Relation. The polars plugin branches on it: when False, it
streams raw Arrow batches via `to_arrow_reader()` and applies
projection/filter/limit per batch in polars. Memory stays bounded by
batch_size (verified at ~7 MB peak on a 10M-row scan).
`ToRecordBatch` now throws a clear `InvalidInputException` when the
relation has been consumed instead of silently returning `None`, so
re-collecting a one-shot lazy frame surfaces an actionable error.
Fixes duckdb/duckdb#20094
Problem
Calling `.pl(lazy=True)` on a connection result and then performing any polars operation that triggers pushdown raises an `AttributeError`. A minimal repro is given in the upstream issue.
Direct `duckdb.sql(...).pl(lazy=True)` works fine; only the `con.execute(...).pl(lazy=True)` path is broken.
Root cause
`con.execute(...)` stores its query result as a `DuckDBPyRelation` constructed via `DuckDBPyRelation(shared_ptr<DuckDBPyResult>)` (`pyrelation.cpp:66`), which leaves the `rel` member null. Every relational method (`Project`, `Filter`, `Limit`) is guarded by `if (!rel) return nullptr;` and pybind11 turns that `nullptr` into Python `None`. So when polars's IO plugin in `duckdb/polars_io.py` pushes down `with_columns` / `predicate` / `n_rows`, the chained `relation.project(...).filter(...).limit(...).to_arrow_reader(...)` collapses to `None.to_arrow_reader(...)` and raises `AttributeError`.
The existing `test_polars_lazy_from_conn` test only exercises a bare `.collect()` with no pushdown, so polars passes `with_columns=None` and the plugin goes straight to `relation.to_arrow_reader()`, which works on result-backed relations. Any non-trivial polars op (e.g. `pl.len()`, `.select`, `.filter`, `.head`) triggers the crash.
Fix
Expose `_has_relation` on `DuckDBPyRelation` (true iff `rel != nullptr`, i.e. the object wraps a replayable Relation). In `polars_io.duckdb_source`, branch on it:
- True: push `with_columns`, `predicate`, `n_rows` down into DuckDB via `project` / `filter` / `limit`, then stream batches. No behavior change.
- False: `_streaming_source_generator` opens the Arrow C stream via `relation.to_arrow_reader(batch_size)` and applies projection/filter/limit per batch in polars. No DuckDB-side pushdown is possible because the underlying result is a one-shot iterator that can't be re-planned.
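To illustrate the per-batch fallback, here is a minimal sketch under stated assumptions. It is not the code from `polars_io.py`: `stream_with_fallback` is a hypothetical name, plain dicts of lists stand in for Arrow record batches, a Python callable stands in for a polars predicate, and the order of operations (filter, then limit, then projection) is a choice made for the sketch.

```python
from typing import Callable, Dict, Iterator, List, Optional

# Stand-in for an Arrow record batch: column name -> list of values.
Batch = Dict[str, List]


def stream_with_fallback(
    batches: Iterator[Batch],
    with_columns: Optional[List[str]] = None,
    predicate: Optional[Callable[[dict], bool]] = None,
    n_rows: Optional[int] = None,
) -> Iterator[Batch]:
    """Apply filter, limit, and projection to each batch as it arrives."""
    remaining = n_rows
    for batch in batches:
        names = list(batch)
        # Turn the columnar batch into rows so the predicate can see all columns.
        rows = [dict(zip(names, values)) for values in zip(*batch.values())]
        if predicate is not None:
            rows = [row for row in rows if predicate(row)]
        if remaining is not None:
            rows = rows[:remaining]
            remaining -= len(rows)
        keep = with_columns if with_columns is not None else names
        yield {name: [row[name] for row in rows] for name in keep}
        if remaining == 0:
            # Limit reached: stop without pulling further batches.
            return


source = iter([
    {"a": [1, 2, 3], "b": ["x", "y", "z"]},
    {"a": [4, 5, 6], "b": ["u", "v", "w"]},
])
result = list(
    stream_with_fallback(
        source,
        with_columns=["a"],
        predicate=lambda row: row["a"] % 2 == 0,
        n_rows=2,
    )
)
print(result)  # [{'a': [2]}, {'a': [4]}]
```

Only one batch is held at a time, which is the property that keeps memory proportional to the batch size rather than to the full result.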
Memory stays bounded by `batch_size`: a 10M-row scan through the result-backed lazy path peaks at ~7 MB of Python heap (verified with `tracemalloc`), instead of ~80 MB if we had materialized.
Also: `DuckDBPyRelation::ToRecordBatch` previously returned `py::none()` when both `result` and `rel` were null (the post-consumption state of a one-shot relation). That made a second `.collect()` on a result-backed LazyFrame surface the same opaque `AttributeError`. It now throws `InvalidInputException` with an actionable message ("This result-backed relation has already been consumed and cannot be read again.").
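The consumed-once behavior can be sketched in plain Python. This is an illustration only: `OneShotReader` and `to_record_batches` are hypothetical names, and `ValueError` stands in for `InvalidInputException`.

```python
class OneShotReader:
    """Hypothetical one-shot source; not DuckDB's actual class."""

    def __init__(self, batches):
        self._batches = list(batches)
        self._consumed = False

    def to_record_batches(self):
        if self._consumed:
            # Fail loudly instead of returning None, so the caller sees the cause.
            # ValueError is a stand-in for InvalidInputException in this sketch.
            raise ValueError(
                "This result-backed relation has already been consumed "
                "and cannot be read again."
            )
        self._consumed = True
        return iter(self._batches)


reader = OneShotReader([[1, 2], [3]])
first = list(reader.to_record_batches())  # first read succeeds

try:
    reader.to_record_batches()  # second read is rejected
except ValueError as exc:
    second_error = str(exc)

print(first)
print(second_error)
```

Raising at the point of the second read ties the error to its cause, whereas a returned `None` only fails later, at whichever attribute access happens to come next.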
Tests
Five new tests in `tests/fast/arrow/test_polars.py` covering the result-backed lazy path:
- `test_polars_lazy_from_conn_select_len`: exact repro from the issue.
- `test_polars_lazy_from_conn_select_subset`: column projection pushdown.
- `test_polars_lazy_from_conn_filter`: predicate pushdown.
- `test_polars_lazy_from_conn_limit`: `n_rows` pushdown.
- `test_polars_lazy_from_conn_consumed_once`: second collect raises clearly.
All 6 `lazy_from_conn*` tests pass. Full `test_polars.py` runs at 99 passed / 2 skipped (3 pre-existing failures in my local venv are missing-optional-dep issues, unrelated). `test_polars_filter_pushdown.py` shows 562 passed / 8 pre-existing `timestamptz-lazyframe` failures (confirmed unrelated via `git stash`).
Fixes duckdb/duckdb#20094