Skip to content

[fix](nereids) Fix pre-aggregation context leakage across join branches#63357

Open
felixwluo wants to merge 1 commit into
apache:masterfrom
felixwluo:de-ms
Open

[fix](nereids) Fix pre-aggregation context leakage across join branches#63357
felixwluo wants to merge 1 commit into
apache:masterfrom
felixwluo:de-ms

Conversation

@felixwluo
Copy link
Copy Markdown
Member

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

for agg_keys tables, Nereids may incorrectly set PREAGGREGATION: ON for a normal scan in a join when the other join branch contains an aggregate subquery.

example sql

select cw.term_id, cw.class_id, cw.original_num, cw.abs_week
from repro_preagg_join cw
join (
    select term_id, class_id, max(abs_week) as abs_week
    from repro_preagg_join
    where class_id in (10192497, 10192366)
    group by term_id, class_id
) mw on cw.class_id = mw.class_id
    and cw.abs_week = mw.abs_week;

root cause:

  1. If repro_preagg_join is an AGGREGATE KEY table and the same key exists in multiple rowsets, the left scan cw should merge rows before returning data. However, before this fix, the plan could mark the left scan as: "PREAGGREGATION: ON
    "
  2. This makes be read the underlying rows directly without merging, so duplicated physical rows from multiple rowsets can be returned and then joined, producing duplicated query results.
  3. The right aggregate subquery itself is not duplicated. The wrong result is caused by the left sibling scan being incorrectly affected by the aggregate context from the right branch.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@felixwluo
Copy link
Copy Markdown
Member Author

/review

@felixwluo felixwluo requested a review from starocean999 May 18, 2026 08:39
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking issues found in the actual PR diff.

Checkpoint conclusions:

  • Goal/test: The change targets pre-aggregation context leakage from an aggregate subquery into a sibling normal scan. The added retainAll(logicalAggregate.child().getInputRelations()) keeps aggregate metadata scoped to the aggregate child, and the regression test checks the expected mixed PREAGGREGATION OFF/ON plan shape.
  • Scope: The implementation is small and focused, touching only the context IDs at the aggregate boundary plus one regression case.
  • Concurrency/lifecycle/config/compatibility/persistence: Not applicable; this is planner rewrite state local to one optimization pass and adds no persistent or protocol state.
  • Parallel paths: The affected path is the Nereids SetPreAggStatus rewrite. Existing duplicate/unique/MoW early exits remain unchanged.
  • Tests: Regression coverage was added for the reported same-table join-with-aggregate-subquery pattern. I did not run the regression suite in this review environment.
  • Observability/performance: No new observability is needed. The added set intersection is small relative to planning work and is not on BE execution hot paths.
  • User focus: No additional user-provided focus points were supplied.

@felixwluo
Copy link
Copy Markdown
Member Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31250 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 2b67c6289a61fbbec7da547a2e204e7224b95d9c, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17599	3871	3907	3871
q2	q3	10786	1425	833	833
q4	4696	471	344	344
q5	7795	2257	2068	2068
q6	242	176	139	139
q7	927	765	640	640
q8	9413	1839	1683	1683
q9	5116	4943	4880	4880
q10	6407	2048	1766	1766
q11	429	271	250	250
q12	628	418	295	295
q13	18133	3345	2773	2773
q14	262	257	240	240
q15	q16	826	771	705	705
q17	1003	899	963	899
q18	6950	5907	5571	5571
q19	1331	1351	1121	1121
q20	513	576	292	292
q21	6274	2894	2576	2576
q22	526	431	304	304
Total cold run time: 99856 ms
Total hot run time: 31250 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4590	4630	4475	4475
q2	q3	4902	5195	4640	4640
q4	2121	2203	1411	1411
q5	4746	4869	4636	4636
q6	234	172	128	128
q7	1831	1714	1580	1580
q8	2395	2099	2046	2046
q9	7787	7629	7304	7304
q10	4514	4387	3974	3974
q11	520	384	352	352
q12	709	728	510	510
q13	3023	3330	2832	2832
q14	270	288	252	252
q15	q16	680	694	614	614
q17	1251	1234	1238	1234
q18	7259	6809	6667	6667
q19	1128	1089	1106	1089
q20	2208	2198	1936	1936
q21	5300	4602	4472	4472
q22	514	479	411	411
Total cold run time: 55982 ms
Total hot run time: 50563 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 170602 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2b67c6289a61fbbec7da547a2e204e7224b95d9c, data reload: false

query5	4324	655	505	505
query6	343	227	194	194
query7	4208	582	299	299
query8	330	250	231	231
query9	8830	3988	3967	3967
query10	436	343	298	298
query11	5835	2389	2176	2176
query12	185	127	124	124
query13	1295	624	436	436
query14	5996	5374	5061	5061
query14_1	4396	4404	4376	4376
query15	216	204	186	186
query16	1078	481	461	461
query17	1169	760	624	624
query18	2775	493	379	379
query19	225	212	166	166
query20	141	163	128	128
query21	221	138	116	116
query22	13610	13539	13280	13280
query23	17178	16433	15951	15951
query23_1	16141	16227	16140	16140
query24	7409	1743	1267	1267
query24_1	1294	1302	1300	1300
query25	551	478	400	400
query26	1311	319	167	167
query27	2726	549	354	354
query28	4340	1927	1949	1927
query29	960	606	501	501
query30	289	240	197	197
query31	1105	1062	948	948
query32	87	76	73	73
query33	536	347	289	289
query34	1187	1120	640	640
query35	765	790	685	685
query36	1318	1361	1190	1190
query37	154	110	93	93
query38	3215	3134	3055	3055
query39	968	924	893	893
query39_1	881	876	869	869
query40	224	147	123	123
query41	69	64	62	62
query42	112	110	108	108
query43	325	325	285	285
query44	
query45	214	199	191	191
query46	1055	1173	721	721
query47	2305	2405	2274	2274
query48	402	414	314	314
query49	632	511	392	392
query50	968	356	253	253
query51	4365	4293	4265	4265
query52	107	105	93	93
query53	248	280	205	205
query54	310	265	248	248
query55	93	95	86	86
query56	309	317	322	317
query57	1434	1407	1305	1305
query58	294	272	274	272
query59	1535	1628	1345	1345
query60	321	321	309	309
query61	166	155	155	155
query62	681	617	552	552
query63	240	199	210	199
query64	2352	795	657	657
query65	
query66	1654	467	360	360
query67	30145	30012	29927	29927
query68	
query69	471	336	306	306
query70	1009	990	1013	990
query71	302	272	276	272
query72	2990	2696	2609	2609
query73	842	753	432	432
query74	5068	4920	4790	4790
query75	2676	2611	2267	2267
query76	2269	1130	758	758
query77	421	419	330	330
query78	12139	12077	11706	11706
query79	1225	1005	692	692
query80	571	544	490	490
query81	456	281	246	246
query82	244	162	125	125
query83	278	287	252	252
query84	258	144	115	115
query85	961	622	543	543
query86	360	338	322	322
query87	3390	3408	3225	3225
query88	3500	2679	2665	2665
query89	441	387	345	345
query90	2193	191	174	174
query91	198	185	157	157
query92	82	82	75	75
query93	1371	1589	843	843
query94	572	371	353	353
query95	692	480	338	338
query96	988	842	349	349
query97	2707	2694	2547	2547
query98	240	231	235	231
query99	1131	1104	986	986
Total cold run time: 251630 ms
Total hot run time: 170602 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 0.67% (1/150) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants