Conversation

@liutang123
Contributor

@liutang123 liutang123 commented Jan 29, 2026

When a string column's data are all null, the dict page may be empty.
The error message is as follows:

[INTERNAL_ERROR]Read parquet file hdfs://xxx.0.parq failed, reason = [INVALID_ARGUMENT]ZSTD_decompressDCtx error: Unknown frame descriptor. cur path: [xxx](hdfs://xxx.0.parq)

When the dict page is empty, we don't need to decompress the dict page data; we can just cache empty data as the decompressed data.
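
The idea above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `load_dict_page`, `Decompressor`, and `strict_codec` are hypothetical names, and treating "uncompressed size is 0" as the emptiness check is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical codec signature: turns `input` into a buffer of
// `uncompressed_size` bytes and throws on malformed input.
using Decompressor = std::function<std::vector<uint8_t>(
        const std::vector<uint8_t>& input, size_t uncompressed_size)>;

// Hypothetical helper (not the actual Doris function): returns the
// decompressed dict page, skipping the codec when the page is empty.
std::vector<uint8_t> load_dict_page(const std::vector<uint8_t>& compressed,
                                    size_t uncompressed_size,
                                    const Decompressor& decompress) {
    assert(static_cast<bool>(decompress));
    if (uncompressed_size == 0) {
        // Assumption: an empty dict page is identified by a zero
        // uncompressed size. There is nothing to decode, so return
        // empty data instead of handing an empty frame to the codec.
        return {};
    }
    return decompress(compressed, uncompressed_size);
}

// Stand-in codec for the sketch: rejects empty input the way a real
// ZSTD decoder rejects a missing frame, otherwise returns zeroed bytes.
std::vector<uint8_t> strict_codec(const std::vector<uint8_t>& input,
                                  size_t uncompressed_size) {
    if (input.empty()) {
        throw std::invalid_argument("Unknown frame descriptor");
    }
    return std::vector<uint8_t>(uncompressed_size, 0);
}
```

With this guard, a call such as `load_dict_page({}, 0, strict_codec)` returns an empty buffer without ever reaching the codec, whereas without the guard the stand-in codec would throw on the empty input.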

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@liutang123 liutang123 force-pushed the fix-parquet-empty-dic-page-master branch 3 times, most recently from 4827ed9 to 3051f98 Compare January 29, 2026 16:23
@liutang123
Contributor Author

run buildall

@liutang123
Contributor Author

@kaka11chen Hello, do you have time to review this fix?

@doris-robot

TPC-H: Total hot run time: 32044 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3051f98cbf4a1e82624575ae97a91d3a10ca11d5, data reload: false

------ Round 1 ----------------------------------
q1	17652	5382	5080	5080
q2	2042	350	195	195
q3	10161	1325	792	792
q4	10214	831	314	314
q5	7537	2169	1889	1889
q6	202	183	149	149
q7	881	797	631	631
q8	9265	1408	1052	1052
q9	5152	4890	4848	4848
q10	6790	1946	1599	1599
q11	524	304	291	291
q12	331	382	222	222
q13	17761	4054	3266	3266
q14	233	235	212	212
q15	890	828	811	811
q16	699	677	623	623
q17	629	736	554	554
q18	6906	6627	6465	6465
q19	1236	1000	617	617
q20	398	354	223	223
q21	2661	1936	1935	1935
q22	363	318	276	276
Total cold run time: 102527 ms
Total hot run time: 32044 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5336	5285	5303	5285
q2	262	331	236	236
q3	2188	2669	2234	2234
q4	1363	1753	1277	1277
q5	4297	4188	4216	4188
q6	215	184	140	140
q7	2243	2182	1883	1883
q8	2642	2374	2416	2374
q9	7567	7507	7599	7507
q10	2797	3096	2693	2693
q11	547	480	477	477
q12	747	779	611	611
q13	3923	4353	3614	3614
q14	319	311	303	303
q15	864	809	811	809
q16	715	757	690	690
q17	1200	1261	1263	1261
q18	8187	7872	8001	7872
q19	914	863	920	863
q20	2097	2153	2012	2012
q21	4935	4526	4183	4183
q22	590	563	529	529
Total cold run time: 53948 ms
Total hot run time: 51041 ms

@doris-robot

ClickBench: Total hot run time: 28.34 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 3051f98cbf4a1e82624575ae97a91d3a10ca11d5, data reload: false

query1	0.06	0.04	0.04
query2	0.10	0.04	0.04
query3	0.25	0.08	0.08
query4	1.61	0.11	0.11
query5	0.28	0.24	0.25
query6	1.17	0.68	0.69
query7	0.03	0.02	0.02
query8	0.06	0.04	0.04
query9	0.57	0.50	0.49
query10	0.56	0.55	0.54
query11	0.14	0.09	0.10
query12	0.13	0.10	0.11
query13	0.63	0.61	0.60
query14	1.08	1.07	1.08
query15	0.87	0.85	0.87
query16	0.41	0.40	0.41
query17	1.13	1.10	1.09
query18	0.22	0.21	0.21
query19	2.07	2.01	1.99
query20	0.02	0.01	0.01
query21	15.44	0.29	0.15
query22	5.30	0.05	0.05
query23	16.06	0.29	0.10
query24	1.84	0.89	0.33
query25	0.06	0.06	0.06
query26	0.13	0.13	0.13
query27	0.09	0.06	0.05
query28	3.80	1.12	0.96
query29	12.58	3.88	3.17
query30	0.27	0.13	0.11
query31	2.81	0.66	0.41
query32	3.24	0.59	0.50
query33	3.24	3.22	3.24
query34	16.42	5.41	4.75
query35	4.83	4.78	4.78
query36	0.65	0.51	0.50
query37	0.12	0.07	0.07
query38	0.06	0.04	0.04
query39	0.05	0.03	0.03
query40	0.18	0.16	0.15
query41	0.08	0.04	0.03
query42	0.04	0.04	0.03
query43	0.05	0.04	0.03
Total cold run time: 98.73 s
Total hot run time: 28.34 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 53.33% (8/15) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.71% (19263/36542)
Line Coverage 36.11% (179013/495786)
Region Coverage 32.56% (138851/426413)
Branch Coverage 33.49% (60073/179357)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 53.33% (8/15) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.48% (25604/35820)
Line Coverage 54.09% (267508/494553)
Region Coverage 51.57% (222112/430731)
Branch Coverage 53.06% (95551/180073)

@liutang123 liutang123 force-pushed the fix-parquet-empty-dic-page-master branch from 3051f98 to c93b096 Compare January 30, 2026 17:02
@liutang123 liutang123 requested a review from kaka11chen January 30, 2026 17:02
liutang123 added 2 commits January 31, 2026 01:03
When a string column's data are all null, the dict page may be empty.
The error message is as follows:
[INTERNAL_ERROR]Read parquet file hdfs://HDFS82742/ydbi/original/server/tlbbgl/auction_zstd/dt=2024-12-13/084cadfc5200b4ad-c2b2568a00000045_1132749056_data.0.parq failed, reason = [INVALID_ARGUMENT]ZSTD_decompressDCtx error: Unknown frame descriptor. cur path: xxx
When the dict page is empty, we don't need to decompress the dict page data; we can just cache empty data as the decompressed data.
…ictI32::insert_many_dict_data` when dict data is empty.
@liutang123 liutang123 force-pushed the fix-parquet-empty-dic-page-master branch from c93b096 to 04583b0 Compare January 30, 2026 17:03
@liutang123
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 31846 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 04583b01f2e089eb575bf90bb7d2d671bdd3b80b, data reload: false

------ Round 1 ----------------------------------
q1	17622	5332	5063	5063
q2	2007	341	198	198
q3	10207	1313	745	745
q4	10222	905	326	326
q5	7527	2157	1918	1918
q6	202	184	151	151
q7	873	742	602	602
q8	9264	1404	1169	1169
q9	5265	4817	4833	4817
q10	6869	1947	1564	1564
q11	493	285	279	279
q12	354	373	228	228
q13	17777	4039	3224	3224
q14	230	238	215	215
q15	904	833	809	809
q16	676	668	618	618
q17	634	817	467	467
q18	6647	6319	6520	6319
q19	1236	984	625	625
q20	387	337	228	228
q21	2621	2028	2003	2003
q22	359	314	278	278
Total cold run time: 102376 ms
Total hot run time: 31846 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5323	5236	5245	5236
q2	242	337	255	255
q3	2174	2673	2271	2271
q4	1356	1739	1334	1334
q5	4326	4219	4268	4219
q6	214	182	140	140
q7	2178	2142	1883	1883
q8	2624	2533	2364	2364
q9	7383	7441	7309	7309
q10	2849	3082	2567	2567
q11	562	462	453	453
q12	676	784	624	624
q13	3884	4681	3462	3462
q14	300	309	297	297
q15	910	859	836	836
q16	662	711	685	685
q17	1113	1309	1322	1309
q18	8477	7690	7878	7690
q19	856	866	869	866
q20	2126	2146	1979	1979
q21	4773	4228	4077	4077
q22	593	547	507	507
Total cold run time: 53601 ms
Total hot run time: 50363 ms

@doris-robot

ClickBench: Total hot run time: 28.34 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 04583b01f2e089eb575bf90bb7d2d671bdd3b80b, data reload: false

query1	0.05	0.04	0.05
query2	0.09	0.04	0.06
query3	0.26	0.08	0.08
query4	1.62	0.11	0.11
query5	0.28	0.24	0.25
query6	1.17	0.68	0.67
query7	0.03	0.02	0.03
query8	0.04	0.03	0.04
query9	0.56	0.49	0.50
query10	0.54	0.54	0.55
query11	0.14	0.10	0.10
query12	0.13	0.10	0.11
query13	0.62	0.62	0.61
query14	1.06	1.05	1.09
query15	0.86	0.85	0.86
query16	0.40	0.37	0.40
query17	1.15	1.15	1.16
query18	0.22	0.21	0.22
query19	2.04	2.06	2.10
query20	0.02	0.01	0.01
query21	15.42	0.24	0.15
query22	5.13	0.05	0.04
query23	15.93	0.28	0.10
query24	2.32	0.33	0.23
query25	0.08	0.09	0.06
query26	0.14	0.14	0.13
query27	0.06	0.04	0.05
query28	3.95	1.17	0.96
query29	12.57	4.03	3.28
query30	0.27	0.13	0.11
query31	2.81	0.63	0.40
query32	3.24	0.59	0.49
query33	3.30	3.29	3.26
query34	15.99	5.43	4.70
query35	4.83	4.76	4.83
query36	0.64	0.50	0.49
query37	0.11	0.07	0.07
query38	0.07	0.05	0.04
query39	0.04	0.03	0.03
query40	0.19	0.15	0.16
query41	0.09	0.03	0.02
query42	0.04	0.03	0.03
query43	0.05	0.04	0.03
Total cold run time: 98.55 s
Total hot run time: 28.34 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 38.71% (12/31) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.52% (19287/36723)
Line Coverage 35.98% (179182/497986)
Region Coverage 32.39% (138988/429098)
Branch Coverage 33.34% (60143/180371)

3 participants