Add Python bindings for accessing ExecutionMetrics #1381

Open
ShreyeshArangath wants to merge 2 commits into apache:main from ShreyeshArangath:feat/support-metrics

@ShreyeshArangath ShreyeshArangath commented Feb 15, 2026

Which issue does this PR close?

Closes #1379

Rationale for this change

Today, DataFusion Python only exposes execution metrics through formatted console output via explain(analyze=True). This makes it difficult to programmatically inspect execution behavior.

There is currently no structured Python API for accessing per-operator metrics such as output_rows, elapsed_compute, spill_count, and the other runtime metrics collected during execution.

This PR introduces APIs to surface the execution metrics, mirroring the Rust API in datafusion::physical_plan::metrics.

What changes are included in this PR?

  • Added plan caching to PyDataFrame so the physical plan used during execution is retained and available for metrics access.
  • Kept the existing metrics() method and added a collect_metrics() helper that walks the execution plan tree and gathers the metrics from every operator.
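
The tree walk described in the second bullet can be sketched roughly as follows. This is a standalone illustration only: `PlanNode` is a hypothetical stand-in for an execution plan node, the metrics are plain dictionaries, and the `collect_metrics` function here is an assumption about the traversal shape, not the code in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """Hypothetical stand-in for one operator in a physical plan."""
    name: str
    metrics: dict[str, int] = field(default_factory=dict)
    children: list["PlanNode"] = field(default_factory=list)


def collect_metrics(root: PlanNode) -> list[tuple[str, dict[str, int]]]:
    """Pre-order walk returning (operator name, metrics) for every node."""
    collected = []
    stack = [root]
    while stack:
        node = stack.pop()
        collected.append((node.name, node.metrics))
        # Push children reversed so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return collected


# Made-up three-operator plan with illustrative numbers.
scan = PlanNode("DataSourceExec", {"output_rows": 100})
filt = PlanNode("FilterExec", {"output_rows": 40}, [scan])
plan = PlanNode("CoalesceBatchesExec", {"output_rows": 40}, [filt])

for operator_name, metrics in collect_metrics(plan):
    print(f"{operator_name}: {metrics['output_rows']} rows")
```

Returning a flat list of pairs keeps the result easy to iterate over, at the cost of dropping the parent/child structure of the plan.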

Are there any user-facing changes?

Yes. Users can now access execution metrics programmatically:

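  from datafusion import SessionContext

  ctx = SessionContext()
  ctx.from_pydict({"x": [1, 2, 3]}, name="t")  # register a small example table

  df = ctx.sql("SELECT * FROM t WHERE x > 1")
  df.collect()  # run the query so metrics are populated
  plan = df.execution_plan()
  # collect_metrics() returns (operator name, metrics set) pairs
  metrics = plan.collect_metrics()
  for operator_name, metrics_set in metrics:
      print(f"{operator_name}: {metrics_set.output_rows} rows")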

@ShreyeshArangath ShreyeshArangath changed the title from "feat: add Python bindings for accessing ExecutionMetrics" to "Add Python bindings for accessing ExecutionMetrics" Feb 15, 2026
@ShreyeshArangath ShreyeshArangath marked this pull request as ready for review February 15, 2026 01:53

@timsaucer timsaucer (Member) left a comment

At a high level, I think this could bring a lot of value. Thank you for putting in the work!

From an implementation perspective, did you consider adding collect(), execute_stream(), and so forth to PyExecutionPlan instead of caching the prior execution plan? It seems like that would more closely mirror the upstream repo and simplify the code. I haven't spent much time going through the details of why you're caching the prior plan, so it's very possible I missed something.
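
One way to read this suggestion, sketched with a hypothetical stand-in class rather than the real PyExecutionPlan: if the plan object itself exposes collect(), the metrics live on the same object that was executed, and the DataFrame does not need to cache anything. The class name, constructor arguments, and dictionary-based metrics below are all assumptions made for illustration.

```python
class PlanSketch:
    """Hypothetical stand-in for PyExecutionPlan; not the real binding."""

    def __init__(self, rows, predicate):
        self._rows = rows
        self._predicate = predicate
        self._metrics = {}  # empty until the plan has been executed

    def collect(self):
        """Run the plan and record metrics on this same object."""
        out = [r for r in self._rows if self._predicate(r)]
        self._metrics = {"output_rows": len(out)}
        return out

    def metrics(self):
        """Return a copy of the metrics from the most recent execution."""
        return dict(self._metrics)


plan = PlanSketch(rows=[0, 1, 2, 3], predicate=lambda x: x > 1)
batches = plan.collect()
print(plan.metrics())
```

With this shape the caller holds the plan explicitly, so there is no question of which execution a set of metrics belongs to.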
