|
| 1 | +# Issue Triage: Progress indicators for long-running queries |
| 2 | + |
| 3 | +## 1. Issue Analysis |
| 4 | +- **Summary:** Users running long DataFusion queries through the Python bindings do not receive any execution feedback, which hurts UX in terminals and Jupyter notebooks. The community is already prototyping a CLI progress bar, and the feature request asks how similar feedback could be surfaced in Python, ideally via a user-supplied callable for custom visualizations. |
| 5 | +- **Actual vs. expected:** `DataFrame.collect()` simply delegates to the Rust binding without emitting progress information, so users see nothing until the entire query finishes. The expected behavior is a progress-reporting hook (e.g., a built-in progress bar or a callback) that surfaces intermediate execution status while the query is running. |
| 6 | + |
| 7 | +## 2. Codebase Scope |
| 8 | +- **Key modules:** |
| 9 | + - `python/datafusion/dataframe.py` exposes `DataFrame.collect()` and `collect_partitioned()`, both of which synchronously call into the Rust layer with no instrumentation.【F:python/datafusion/dataframe.py†L682-L712】 |
| 10 | + - `src/dataframe.rs` implements the PyO3 bindings for collection; it calls into DataFusion and converts the resulting `RecordBatch`es to PyArrow objects, again without emitting progress events.【F:src/dataframe.rs†L518-L545】 |
| 11 | + - `src/utils.rs` provides `wait_for_future`, which periodically checks for Python interrupts while awaiting Rust futures. This polling loop is a natural place to inject a cooperative progress hook; because the future is awaited while the GIL is released, the hook would need to reacquire the GIL before calling into Python.【F:src/utils.rs†L68-L93】 |
| 12 | +- **Dependencies:** Collection ultimately invokes `datafusion::dataframe::DataFrame::collect`, so any progress solution must either leverage core DataFusion task metrics or poll execution state via the runtime environment exposed through the bindings.【F:src/dataframe.rs†L521-L526】【F:src/utils.rs†L68-L93】 |
| 13 | +- **Recent activity:** No recent commits in this repository add progress reporting; the synchronous collection path noted above is the only existing behavior. |
| 14 | + |
| 15 | +## 3. Classification |
| 16 | +- **Type:** Feature |
| 17 | +- **Severity:** Minor (lack of progress updates degrades UX but does not break functionality) |
| 18 | +- **Scope:** Single (limited to query execution UX in Python bindings) |
| 19 | +- **Priority:** Medium (useful quality-of-life improvement; aligns with ongoing CLI work) |
| 20 | + |
| 21 | +## 4. Resolution Plan |
| 22 | +1. Investigate DataFusion core support for progress listeners (e.g., session task manager metrics used by the CLI PR) and document the Rust APIs available as of DataFusion 50. |
| 23 | +2. Prototype a Rust-side progress bridge that accepts an optional Python callback (probably stored as a `PyObject` guarded by the GIL) and invokes it from `wait_for_future` or a spawned task while `collect` executes. |
| 24 | +3. Extend the Python API (possibly via `SessionConfig` or an argument to `collect`/`show`) to register the callback or select a default progress indicator implementation for terminals vs. notebooks. |
| 25 | +4. Add user documentation and examples demonstrating both the default progress output and how to plug in a custom callable. |
| 26 | +5. Write integration tests (likely asynchronous or using mocked callbacks) to ensure callbacks fire and do not block execution, and verify the behavior manually in a notebook. |
| 27 | + |
| 28 | +## 5. Estimation of Fix Size |
| 29 | +Estimated effort: Medium — 2–3 days; Files: 4–6; Tests: 2–3; Risk: Medium (requires careful cross-language callback handling). Deployment: Requires release. |
| 30 | + |
| 31 | +## 6. Next Steps |
| 32 | +- **Recommendation:** Defer to future release after API design consensus (needs design alignment with ongoing CLI progress work and Python callback ergonomics). |
| 33 | + |
| 34 | +## 7. Fix Location |
| 35 | +- **Fix in this repo:** The Python bindings must expose and manage the progress hook locally, though it may depend on upstream DataFusion metrics infrastructure. |
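
The callback bridge described in step 2 of the Resolution Plan can be illustrated with a pure-Python sketch. This is not the bindings' API: `run_with_progress`, `on_progress`, and `poll_interval` are hypothetical names, a thread pool stands in for the Rust-side future, and the callback is assumed to receive only the elapsed time in seconds, since the metrics DataFusion could actually expose are still to be investigated. The sketch shows the polling pattern only: the caller's thread waits in short intervals and invokes the user-supplied callable between waits, much as `wait_for_future` checks for interrupts.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_progress(
    work: Callable[[], T],
    on_progress: Optional[Callable[[float], None]] = None,
    poll_interval: float = 0.05,
) -> T:
    """Run `work` on a worker thread and poll it from the calling thread.

    Hypothetical helper: `work` stands in for the blocking collect call,
    and `on_progress` receives the elapsed seconds at each poll.
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(work)
        while True:
            # Wait at most `poll_interval` seconds for the work to finish.
            done, _ = wait([future], timeout=poll_interval)
            if done:
                break
            if on_progress is not None:
                try:
                    on_progress(time.monotonic() - start)
                except Exception:
                    # Design choice in this sketch: a failing callback is
                    # disabled rather than allowed to abort the query.
                    on_progress = None
        # Returns the result, or re-raises any exception raised by `work`.
        return future.result()
```

A caller could pass `print` or a notebook widget updater as `on_progress`. Because the callback runs on the polling thread between waits, a slow callback delays only the next poll, not the work itself.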