[fix](iceberg) Avoid dict reads on mixed-encoding position delete files#61759

Open
suxiaogang223 wants to merge 1 commit into apache:master from suxiaogang223:codex/fix-cir19308-master

@suxiaogang223
Contributor

What problem does this PR solve?

Doris currently treats the file_path column of Iceberg Parquet position delete files as dictionary-encoded whenever the column chunk has a dictionary page. That check is too loose: Parquet allows mixed encodings within a single column chunk, so a chunk can contain both dictionary-encoded and plain-encoded data pages.

When that happens, Doris builds a ColumnDictI32 for file_path, but the plain decoder later calls insert_many_strings(), which fails with:

Method insert_many_strings is not supported for ColumnDictionary

This PR fixes the issue by using dictionary-backed decoding for Iceberg position delete file_path columns only when the Parquet column chunk is fully dictionary-encoded. Mixed-encoding chunks now fall back to normal string columns.
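
As a rough illustration of the stricter check described above, the sketch below decides whether a column chunk is fully dictionary-encoded from its metadata. It is not the code in this PR: ColumnChunkMeta, PageEncodingStats, is_dictionary_encoding and is_fully_dictionary_encoded are hypothetical names, and the structs are simplified stand-ins for the Parquet Thrift metadata (only the enum values follow the Parquet format definition). When encoding_stats is missing, the sketch assumes a conservative fallback that rejects any encoding other than the dictionary and level encodings.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the Parquet Thrift metadata types.
// The enum values mirror the Parquet format; the struct layout is illustrative.
enum class Encoding { PLAIN = 0, PLAIN_DICTIONARY = 2, RLE = 3, BIT_PACKED = 4, RLE_DICTIONARY = 8 };
enum class PageType { DATA_PAGE = 0, INDEX_PAGE = 1, DICTIONARY_PAGE = 2, DATA_PAGE_V2 = 3 };

struct PageEncodingStats {
    PageType page_type;
    Encoding encoding;
    int32_t count; // number of pages with this (page_type, encoding) pair
};

struct ColumnChunkMeta {
    bool has_dictionary_page = false;
    std::vector<Encoding> encodings;               // every encoding used in the chunk
    std::vector<PageEncodingStats> encoding_stats; // empty if the writer omitted it
};

inline bool is_dictionary_encoding(Encoding e) {
    return e == Encoding::PLAIN_DICTIONARY || e == Encoding::RLE_DICTIONARY;
}

// Returns true only when every data page in the chunk is dictionary-encoded.
bool is_fully_dictionary_encoded(const ColumnChunkMeta& meta) {
    // A dictionary page is necessary but, by itself, not sufficient.
    if (!meta.has_dictionary_page) {
        return false;
    }
    if (!meta.encoding_stats.empty()) {
        bool saw_data_page = false;
        for (const auto& stat : meta.encoding_stats) {
            const bool is_data_page = stat.page_type == PageType::DATA_PAGE ||
                                      stat.page_type == PageType::DATA_PAGE_V2;
            if (!is_data_page || stat.count <= 0) {
                continue; // dictionary/index pages do not affect the decision
            }
            saw_data_page = true;
            if (!is_dictionary_encoding(stat.encoding)) {
                return false; // at least one plain (or other) data page: mixed chunk
            }
        }
        return saw_data_page;
    }
    // Assumed conservative fallback without encoding_stats: allow only
    // dictionary encodings plus RLE/BIT_PACKED, which are used for levels.
    bool saw_dictionary = false;
    for (Encoding e : meta.encodings) {
        if (is_dictionary_encoding(e)) {
            saw_dictionary = true;
        } else if (e != Encoding::RLE && e != Encoding::BIT_PACKED) {
            return false;
        }
    }
    return saw_dictionary;
}
```

Under these assumptions, a reader would build the dictionary-backed column only when is_fully_dictionary_encoded returns true and otherwise use an ordinary string column, so the plain decoder never sees a dictionary column.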

It also adds BE unit coverage for:

  • fully dictionary-encoded parquet metadata
  • mixed dictionary/plain parquet metadata
  • parquet metadata without encoding_stats but with non-dictionary encodings

Release note

None

Check List

  • This issue was confirmed with code analysis and user logs
  • This change includes unit test coverage
  • Local unit tests were run in this environment

Testing

Local git diff --check passed.
BE unit test execution was not run locally because the current build directory on this machine does not include the doris_be_test target.

@Thearas
Contributor

Thearas commented Mar 26, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 26539 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c3d47013bd70168d85e883f56ecebd82938cfb78, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17617	4516	4288	4288
q2	q3	10649	796	507	507
q4	4675	361	252	252
q5	7588	1200	1007	1007
q6	182	181	151	151
q7	818	854	672	672
q8	10031	1466	1400	1400
q9	5448	4808	4756	4756
q10	6322	1952	1647	1647
q11	478	246	245	245
q12	744	573	466	466
q13	18033	2726	1971	1971
q14	238	239	212	212
q15	q16	738	737	671	671
q17	747	862	440	440
q18	5902	5291	5204	5204
q19	1229	986	628	628
q20	540	480	370	370
q21	4440	1832	1407	1407
q22	341	283	245	245
Total cold run time: 96760 ms
Total hot run time: 26539 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4482	4365	4363	4363
q2	q3	3848	4312	3757	3757
q4	855	1161	751	751
q5	4016	4362	4295	4295
q6	176	173	139	139
q7	1728	1635	1516	1516
q8	2421	2658	2494	2494
q9	7742	7563	7466	7466
q10	3707	3976	3627	3627
q11	534	439	431	431
q12	519	625	475	475
q13	2601	2937	2041	2041
q14	281	306	282	282
q15	q16	723	762	762	762
q17	1210	1454	1383	1383
q18	6991	6906	6566	6566
q19	955	985	995	985
q20	2103	2191	2006	2006
q21	3908	3484	3325	3325
q22	459	421	454	421
Total cold run time: 49259 ms
Total hot run time: 47085 ms

@doris-robot

TPC-DS: Total hot run time: 168664 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c3d47013bd70168d85e883f56ecebd82938cfb78, data reload: false

query5	4333	641	516	516
query6	338	221	201	201
query7	4223	473	270	270
query8	367	254	232	232
query9	8731	2769	2761	2761
query10	562	367	351	351
query11	6919	5115	4847	4847
query12	181	135	123	123
query13	1284	447	357	357
query14	5740	3724	3494	3494
query14_1	2813	2888	2857	2857
query15	203	193	177	177
query16	1016	465	452	452
query17	1127	736	614	614
query18	2459	450	358	358
query19	219	213	188	188
query20	137	126	127	126
query21	218	134	112	112
query22	13217	13407	13047	13047
query23	16203	15884	15960	15884
query23_1	16154	16295	16476	16295
query24	8273	1683	1310	1310
query24_1	1284	1291	1346	1291
query25	611	546	502	502
query26	1278	283	176	176
query27	3321	479	310	310
query28	4509	1841	1861	1841
query29	841	562	469	469
query30	293	226	189	189
query31	1009	947	869	869
query32	83	78	70	70
query33	498	330	301	301
query34	928	866	507	507
query35	632	680	598	598
query36	1080	1155	990	990
query37	132	102	86	86
query38	2921	2873	2854	2854
query39	846	826	804	804
query39_1	808	797	785	785
query40	230	162	155	155
query41	81	111	62	62
query42	254	253	247	247
query43	246	249	222	222
query44	
query45	188	187	179	179
query46	873	992	605	605
query47	2159	2133	2048	2048
query48	302	316	219	219
query49	630	459	395	395
query50	698	286	209	209
query51	4056	4086	4018	4018
query52	257	264	255	255
query53	294	341	286	286
query54	289	286	270	270
query55	89	84	84	84
query56	321	327	306	306
query57	1925	1741	1648	1648
query58	282	278	267	267
query59	2785	2949	2749	2749
query60	337	330	327	327
query61	157	148	152	148
query62	618	580	544	544
query63	320	279	272	272
query64	5074	1294	1026	1026
query65	
query66	1475	450	365	365
query67	24234	24216	24178	24178
query68	
query69	396	325	281	281
query70	935	989	941	941
query71	330	303	296	296
query72	2842	2711	2537	2537
query73	558	541	313	313
query74	9605	9543	9392	9392
query75	2862	2785	2448	2448
query76	2303	1027	689	689
query77	362	395	309	309
query78	10880	11134	10491	10491
query79	1118	767	589	589
query80	1334	631	560	560
query81	547	258	227	227
query82	1001	160	121	121
query83	331	262	241	241
query84	248	121	102	102
query85	949	491	450	450
query86	426	297	284	284
query87	3135	3109	2995	2995
query88	3564	2675	2669	2669
query89	433	372	340	340
query90	2024	177	169	169
query91	177	173	135	135
query92	78	75	72	72
query93	923	829	504	504
query94	645	312	260	260
query95	590	413	318	318
query96	652	516	227	227
query97	2483	2480	2421	2421
query98	235	231	225	225
query99	999	966	892	892
Total cold run time: 250639 ms
Total hot run time: 168664 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.87% (19912/37660)
Line Coverage 36.40% (186543/512515)
Region Coverage 32.67% (144719/442998)
Branch Coverage 33.87% (63458/187336)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100% (0/0) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.80% (26482/36885)
Line Coverage 54.67% (279354/510970)
Region Coverage 51.89% (232013/447123)
Branch Coverage 53.37% (100284/187902)
