Fix polars lazy frame from con.execute() crashing on pushdown#484

Open
wckao wants to merge 1 commit into
duckdb:mainfrom
wckao:fix/20094-polars-lazy-conn-result

@wckao wckao commented Jun 8, 2026

Problem

Calling .pl(lazy=True) on a connection result and then performing any polars
operation that triggers pushdown raises:

polars.exceptions.ComputeError: caught exception during execution of a Python
source, exception: AttributeError: 'NoneType' object has no attribute
'fetch_arrow_reader'

Minimal repro from the upstream issue:

import duckdb
import polars as pl

con = duckdb.connect()
lf = con.execute("SELECT 1 AS id, 'banana' AS fruit").pl(lazy=True)
print(lf.select(pl.len()).collect().item())  # crashes

Direct duckdb.sql(...).pl(lazy=True) works fine — only the
con.execute(...).pl(lazy=True) path is broken.

Root cause

con.execute(...) stores its query result as a DuckDBPyRelation constructed
via DuckDBPyRelation(shared_ptr<DuckDBPyResult>) (pyrelation.cpp:66), which
leaves the rel member null. Every relational method (Project, Filter,
Limit) is guarded by if (!rel) return nullptr; and pybind11 turns that
nullptr into Python None. So when polars's IO plugin in
duckdb/polars_io.py pushes down with_columns / predicate / n_rows, the
chained relation.project(...).filter(...).limit(...).to_arrow_reader(...)
collapses to None.to_arrow_reader(...) and raises AttributeError.
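
The collapse can be reproduced without DuckDB. The sketch below uses a hypothetical `FakeRelation` class (not the real binding) whose relational methods return `None` when there is no underlying relation, mimicking the null-pointer behavior described above. The chain here is unconditional, so the error names `filter` rather than the reader method; the failure mode is the same.

```python
class FakeRelation:
    """Hypothetical stand-in for DuckDBPyRelation; illustrates the failure mode only."""

    def __init__(self, has_rel: bool):
        self.has_rel = has_rel  # plays the role of `rel != nullptr`

    def project(self, *columns):
        # Mirrors `if (!rel) return nullptr;`, which surfaces as None in Python.
        return self if self.has_rel else None

    def filter(self, predicate):
        return self if self.has_rel else None

    def limit(self, n):
        return self if self.has_rel else None

    def to_arrow_reader(self, batch_size=1024):
        # Placeholder reader; available on both kinds of relation.
        return iter(())


def pushdown_chain(relation):
    # Same shape as the chained call described above.
    return relation.project("id").filter("id > 0").limit(1).to_arrow_reader()


def try_chain(relation):
    """Return "ok" if the chain succeeds, else the AttributeError message."""
    try:
        pushdown_chain(relation)
        return "ok"
    except AttributeError as exc:
        return str(exc)


replayable_outcome = try_chain(FakeRelation(has_rel=True))
result_backed_outcome = try_chain(FakeRelation(has_rel=False))
```

With `has_rel=False` the first `project` call already yields `None`, so every later call in the chain is made on `None`.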

The existing test_polars_lazy_from_conn test only exercises a bare
.collect() with no pushdown, so polars passes with_columns=None and the
plugin goes straight to relation.to_arrow_reader(), which works on
result-backed relations. Any non-trivial polars op (e.g. pl.len(), .select,
.filter, .head) triggers the crash.

Fix

Expose _has_relation on DuckDBPyRelation (true iff rel != nullptr, i.e.
the object wraps a replayable Relation). In polars_io.duckdb_source, branch
on it:

  • Replayable relation (existing path) — push with_columns, predicate,
    n_rows down into DuckDB via project/filter/limit, then stream batches.
    No behavior change.
  • Result-backed (one-shot) — new _streaming_source_generator opens the
    Arrow C stream via relation.to_arrow_reader(batch_size) and applies
    projection/filter/limit per batch in polars. No DuckDB-side pushdown is
    possible because the underlying result is a one-shot iterator that can't be
    re-planned.
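
The result-backed branch can be pictured with a small, library-free sketch. Lists of row dicts stand in for Arrow record batches, and `stream_one_shot` is a hypothetical name, not the PR's `_streaming_source_generator`. The order used here (filter, then project, then limit) is a choice made for the sketch so that the predicate can still see every column; the actual plugin may order these steps differently.

```python
from typing import Callable, Iterable, Iterator, Optional, Sequence

Batch = list  # a list of row dicts stands in for one Arrow record batch


def stream_one_shot(
    batches: Iterable[Batch],
    with_columns: Optional[Sequence[str]] = None,
    predicate: Optional[Callable[[dict], bool]] = None,
    n_rows: Optional[int] = None,
) -> Iterator[Batch]:
    """Apply filter, projection and limit batch by batch (hypothetical helper)."""
    remaining = n_rows
    for batch in batches:
        if remaining is not None and remaining <= 0:
            return  # limit satisfied: stop pulling from the one-shot source
        rows = batch
        if predicate is not None:
            rows = [row for row in rows if predicate(row)]
        if with_columns is not None:
            rows = [{name: row[name] for name in with_columns} for row in rows]
        if remaining is not None:
            rows = rows[:remaining]
            remaining -= len(rows)
        if rows:
            yield rows


def source_batches():
    # Pretend one-shot result: two batches of three rows each.
    fruits = ["apple", "banana", "cherry", "date", "elderberry", "fig"]
    rows = [{"id": i, "fruit": f} for i, f in enumerate(fruits, start=1)]
    yield rows[:3]
    yield rows[3:]


selected = list(
    stream_one_shot(
        source_batches(),
        with_columns=["fruit"],
        predicate=lambda row: row["id"] % 2 == 0,
        n_rows=2,
    )
)
```

Only one batch is held at a time, and the generator stops pulling from the source as soon as the row limit is met.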

Memory stays bounded by batch_size: a 10M-row scan through the
result-backed lazy path peaks at ~7 MB of Python heap (verified with
tracemalloc), instead of ~80 MB if we had materialized.
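
A measurement of this kind can be sketched with the standard library alone. The example below compares peak traced memory when batches are consumed one at a time against collecting them all first. The batch contents (lists of integers), sizes, and helper names are illustrative assumptions; it does not reproduce the figures quoted above.

```python
import tracemalloc


def make_batches(total_rows, batch_size):
    # Stand-in for a batch reader: yields lists of ints instead of Arrow batches.
    for start in range(0, total_rows, batch_size):
        yield list(range(start, min(start + batch_size, total_rows)))


def consume_streaming(total_rows, batch_size):
    # Each batch becomes garbage once the next one replaces it.
    count = 0
    for batch in make_batches(total_rows, batch_size):
        count += len(batch)
    return count


def consume_materialized(total_rows, batch_size):
    # Holds every batch in memory before counting.
    batches = list(make_batches(total_rows, batch_size))
    return sum(len(batch) for batch in batches)


def peak_bytes(fn, *args):
    """Run fn(*args) under tracemalloc and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


streamed_count, streamed_peak = peak_bytes(consume_streaming, 200_000, 2_000)
materialized_count, materialized_peak = peak_bytes(consume_materialized, 200_000, 2_000)
```

Both paths see the same rows; only the peak differs, because the streaming path never holds more than a couple of batches at once.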

Also: DuckDBPyRelation::ToRecordBatch previously returned py::none() when
both result and rel were null (the post-consumption state of a one-shot
relation). That made a second .collect() on a result-backed LazyFrame surface
the same opaque AttributeError. It now throws InvalidInputException with
an actionable message ("This result-backed relation has already been consumed
and cannot be read again.").
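
The intended behavior is easy to model in plain Python. In this sketch `OneShotResult` and `ConsumedError` are hypothetical stand-ins for the binding's relation and exception types; only the error message is taken from the description above.

```python
class ConsumedError(RuntimeError):
    """Hypothetical stand-in for the exception type raised by the binding."""


class OneShotResult:
    """Models a result that can be read exactly once."""

    MESSAGE = (
        "This result-backed relation has already been consumed "
        "and cannot be read again."
    )

    def __init__(self, batches):
        self._batches = iter(batches)
        self._consumed = False

    def to_record_batches(self):
        if self._consumed:
            # Fail loudly instead of returning None to the caller.
            raise ConsumedError(self.MESSAGE)
        self._consumed = True
        return list(self._batches)


result = OneShotResult([[1, 2], [3]])
first_read = result.to_record_batches()

try:
    result.to_record_batches()
    second_read_error = None
except ConsumedError as exc:
    second_read_error = str(exc)
```

Raising at the point of the second read keeps the error next to its cause, rather than letting a `None` travel further down the call chain.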

Tests

Five new tests in tests/fast/arrow/test_polars.py covering the
result-backed lazy path:

  • test_polars_lazy_from_conn_select_len — exact repro from the issue.
  • test_polars_lazy_from_conn_select_subset — column projection pushdown.
  • test_polars_lazy_from_conn_filter — predicate pushdown.
  • test_polars_lazy_from_conn_limit — n_rows pushdown.
  • test_polars_lazy_from_conn_consumed_once — second collect raises clearly.

All 6 lazy_from_conn* tests pass (the five new ones plus the existing
test_polars_lazy_from_conn). The full test_polars.py run reports 99 passed /
2 skipped, plus 3 pre-existing failures in my local venv caused by missing
optional dependencies, unrelated to this change. test_polars_filter_pushdown.py
reports 562 passed and 8 pre-existing timestamptz-lazyframe failures
(confirmed unrelated via git stash).

Fixes duckdb/duckdb#20094

Result-backed DuckDBPyRelation objects (the ones returned by
`con.execute(...)`) have a null `rel` member, so `project`, `filter`,
and `limit` silently return nullptr — which pybind11 turns into Python
`None`. The polars IO plugin captured the relation and called these
methods whenever polars pushed down `with_columns`, `predicate`, or
`n_rows`, producing:

    AttributeError: 'NoneType' object has no attribute 'fetch_arrow_reader'

Add a `_has_relation` property exposing whether the wrapped object is a
replayable Relation. The polars plugin branches on it: when False, it
streams raw Arrow batches via `to_arrow_reader()` and applies
projection/filter/limit per batch in polars. Memory stays bounded by
batch_size (verified at ~7 MB peak on a 10M-row scan).

`ToRecordBatch` now throws a clear `InvalidInputException` when the
relation has been consumed instead of silently returning `None`, so
re-collecting a one-shot lazy frame surfaces an actionable error.

Fixes duckdb/duckdb#20094
Development

Successfully merging this pull request may close these issues.

AttributeError when computing length of DuckDB-produced lazy Polars frame
