DataFrame API: allow aggregate functions in select() (#17874) by cj-zhukov · Pull Request #21021 · apache/datafusion

cj-zhukov · 2026-03-18T06:21:38Z

Which issue does this PR close?

Closes #17874 (DataFrame API: allow aggregate functions in select()).

Rationale for this change

DataFrame::select currently does not support aggregate expressions directly.
Users must explicitly call DataFrame::aggregate. This PR improves ergonomics by allowing aggregate expressions inside select, while preserving existing behavior and validation rules.

What changes are included in this PR?

Added support for using aggregate functions directly in DataFrame::select.
Updated snapshot tests to reflect the new logical plan: datafusion-cli/tests/snapshots/cli_explain_environment_overrides@explain_plan_environment_overrides.snap

Are these changes tested?

Existing tests continue to pass.
Added a new test for this use case: test_dataframe_api_aggregate_fn_in_select.

Are there any user-facing changes?

Yes (behavioral improvement, no API changes).
Users can now use aggregate functions directly in select() without explicitly calling aggregate().

ctx.table("aggregate_test_100")
    .await?
    .select(vec![
        approx_distinct(col("c9")).alias("count_c9"),
        approx_distinct(cast(col("c9"), arrow_schema::DataType::Utf8View))
            .alias("count_c9_str"),
    ])?
    .show()
    .await?;

Previously, this required:

ctx.table("aggregate_test_100")
    .await?
    .aggregate(
        vec![],
        vec![
            approx_distinct(col("c9")).alias("count_c9"),
            approx_distinct(cast(col("c9"), arrow_schema::DataType::Utf8View))
                .alias("count_c9_str"),
        ],
    )?
    .show()
    .await?;

Both syntaxes are now valid and can be used interchangeably.

There are no breaking changes to the public API.

martin-g · 2026-03-18T11:35:47Z

You need to update the Insta snapshot for the datafusion-cli crate

martin-g · 2026-03-18T11:37:06Z

...li/tests/snapshots/cli_explain_environment_overrides@explain_plan_environment_overrides.snap

 |              |     "Plan": {                           |
-|              |       "Expressions": [                  |
-|              |         "Int64(123)"                    |
-|              |       ],                                |


Hm, you modified the Insta snapshot, but it is failing now.
Maybe you have to revert the change?

That helped - thank you!

martin-g · 2026-03-18T13:13:09Z

datafusion/core/src/dataframe/mod.rs


-        let project_plan = LogicalPlanBuilder::from(plan).project(expr_list)?.build()?;
+        // Collect aggregate expressions
+        let aggr_exprs = find_aggregate_exprs(expressions.clone());


find_aggregate_exprs() deduplicates the expressions.
A test like:

let res = df.select(vec![
    count(col("c9")).alias("count_c9"),
    count(col("c9")).alias("count_c9_str"),
])?;

fails with:

failures:

---- dataframe::test_dataframe_api_aggregate_fn_in_select2 stdout ----
Error: SchemaError(FieldNotFound {
    field: Column { relation: None, name: "__agg_1" },
    valid_fields: [
        Column { relation: None, name: "__agg_0" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c1" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c2" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c3" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c4" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c5" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c6" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c7" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c8" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c9" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c10" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c11" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c12" },
        Column { relation: Some(Bare { table: "aggregate_test_100" }), name: "c13" },
    ]
}, Some(""))

__agg_1 is lost
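
To make the failure mode concrete, here is a small stand-alone model of it. It does not use DataFusion: expressions are plain strings, and the helper names (`dedup_aggregates`, `columns_by_select_position`, `columns_by_dedup_position`, `aggregate_output_columns`) are hypothetical. The only things taken from the review are the `__agg_{i}` naming and the observation that the collected aggregates are deduplicated. Under those assumptions, naming projection columns by their position in the select list can reference a column the aggregate node never produces, while naming them by position in the deduplicated list cannot.

```rust
/// Collect aggregate expressions, dropping duplicates and keeping first-seen order.
/// This only imitates the deduplicating behaviour described in the review.
fn dedup_aggregates(select_list: &[&str]) -> Vec<String> {
    let mut unique: Vec<String> = Vec::new();
    for expr in select_list {
        if !unique.iter().any(|seen| seen.as_str() == *expr) {
            unique.push(expr.to_string());
        }
    }
    unique
}

/// Column names the (modelled) aggregate node exposes: one per unique aggregate.
fn aggregate_output_columns(select_list: &[&str]) -> Vec<String> {
    (0..dedup_aggregates(select_list).len())
        .map(|i| format!("__agg_{}", i))
        .collect()
}

/// Projection references named by position in the select list.
/// With duplicates present, this can point at a column that does not exist.
fn columns_by_select_position(select_list: &[&str]) -> Vec<String> {
    (0..select_list.len())
        .map(|i| format!("__agg_{}", i))
        .collect()
}

/// Projection references named by position in the deduplicated list.
/// Duplicate select expressions then share one aggregate output column.
fn columns_by_dedup_position(select_list: &[&str]) -> Vec<String> {
    let unique = dedup_aggregates(select_list);
    select_list
        .iter()
        .map(|expr| {
            let idx = unique
                .iter()
                .position(|u| u.as_str() == *expr)
                .expect("every select expression was collected above");
            format!("__agg_{}", idx)
        })
        .collect()
}
```

For the select list `["count(c9)", "count(c9)"]`, the modelled aggregate node exposes only `__agg_0`; the position-based naming asks for `__agg_0` and `__agg_1`, whereas the dedup-based naming asks for `__agg_0` twice. Whether the actual fix should look like this or should avoid deduplication altogether is a design decision for the PR.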

martin-g · 2026-03-18T13:17:37Z

datafusion/core/src/dataframe/mod.rs

+        // Check if any expression is non-aggregate
+        let has_non_aggregate_expr = expressions
+            .clone()
+            .any(|expr| find_aggregate_exprs(std::iter::once(expr)).is_empty());


What about an aggregate expression combined with a non-aggregate one?
E.g.:

let res = df.select(vec![count(col("c9")).alias("count_c9") + lit(1)])?;

I'd expect 101, but it returns 100.
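
One way to picture the expected behaviour is a rewrite that swaps only the aggregate sub-expressions for column references and keeps the surrounding expression intact. The sketch below is a toy model, not DataFusion code: the `Expr` enum and the functions `collect_aggregates`, `rewrite`, and `eval` are hypothetical, aliases are left out, and the value 100 for `count(c9)` is simply the number quoted in the comment above.

```rust
/// A deliberately tiny expression tree (hypothetical, not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Count(String),             // count(<column>)
    Lit(i64),                  // integer literal
    Column(String),            // reference to an output column of the aggregate node
    Add(Box<Expr>, Box<Expr>), // left + right
}

/// Collect aggregate sub-expressions in traversal order, without duplicates.
fn collect_aggregates(expr: &Expr, out: &mut Vec<Expr>) {
    match expr {
        Expr::Count(_) => {
            if !out.contains(expr) {
                out.push(expr.clone());
            }
        }
        Expr::Add(left, right) => {
            collect_aggregates(left, out);
            collect_aggregates(right, out);
        }
        Expr::Lit(_) | Expr::Column(_) => {}
    }
}

/// Replace each aggregate with a reference to its output column and
/// leave every other node (here: `Add` and `Lit`) in place.
fn rewrite(expr: &Expr, aggs: &[Expr]) -> Expr {
    match expr {
        Expr::Count(_) => {
            let idx = aggs
                .iter()
                .position(|a| a == expr)
                .expect("aggregate was collected before rewriting");
            Expr::Column(format!("__agg_{}", idx))
        }
        Expr::Add(left, right) => Expr::Add(
            Box::new(rewrite(left, aggs)),
            Box::new(rewrite(right, aggs)),
        ),
        other => other.clone(),
    }
}

/// Evaluate a rewritten expression against already computed aggregate values.
/// Returns `None` for an unknown column or an aggregate that was not rewritten.
fn eval(expr: &Expr, agg_values: &[(String, i64)]) -> Option<i64> {
    match expr {
        Expr::Lit(v) => Some(*v),
        Expr::Column(name) => agg_values
            .iter()
            .find(|(n, _)| n == name)
            .map(|(_, v)| *v),
        Expr::Add(left, right) => Some(eval(left, agg_values)? + eval(right, agg_values)?),
        Expr::Count(_) => None,
    }
}
```

In this model, `count(c9) + 1` rewrites to `__agg_0 + 1`, and evaluating that with `__agg_0 = 100` gives 101. Replacing the whole select expression by `__agg_0` instead would drop the `+ 1`, which matches the symptom reported here, although the actual cause in the PR may differ.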

martin-g · 2026-03-18T13:19:01Z

datafusion/core/tests/dataframe/mod.rs

+    );
+
+    Ok(())
+}


Please add tests for some more complex queries, e.g. df.select([sum(col("a")) + count(col("b"))]) and something with (qualified and non-qualified) wildcards too.

martin-g · 2026-03-18T13:20:50Z

datafusion/core/src/dataframe/mod.rs

+        for (i, select_expr) in expr_list.into_iter().enumerate() {
+            match select_expr {
+                SelectExpr::Expression(expr) => {
+                    let column = Expr::Column(Column::from_name(format!("__agg_{i}")));


__agg_0 could collide with a real column. Is this how it is done elsewhere?
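
For illustration, one simple way to rule out such a collision is to check the generated name against the input schema's column names and add a suffix until it is free. This is only a sketch under that assumption: `unique_agg_name` is a hypothetical helper, the suffix scheme is invented for the example, and it does not answer the question of how DataFusion generates internal names elsewhere.

```rust
/// Return an `__agg_{index}` style name that does not appear in `existing`.
/// If the plain name is taken, try `__agg_{index}_1`, `__agg_{index}_2`, and so on.
fn unique_agg_name(existing: &[&str], index: usize) -> String {
    let mut candidate = format!("__agg_{}", index);
    let mut suffix = 0usize;
    while existing.iter().any(|name| *name == candidate.as_str()) {
        suffix += 1;
        candidate = format!("__agg_{}_{}", index, suffix);
    }
    candidate
}
```

With a schema that already contains a column called `__agg_0`, the helper returns `__agg_0_1` for index 0; with no clash it returns `__agg_0` unchanged. A caller generating several names would also need to add each generated name to `existing` before asking for the next one.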

DataFrame API: allow aggregate functions in select() (apache#17874)

ff0ada7

github-actions bot added the core Core DataFusion crate label Mar 18, 2026

use count instead of approx_distinct in test

1659fa7

martin-g reviewed Mar 18, 2026

Update CLI snapshot

f9f351e

martin-g reviewed Mar 18, 2026

cj-zhukov wants to merge 3 commits into apache:main from cj-zhukov:cj-zhukov/DataFrame-API-allow-aggregate-functions-in-select
