
[fix](serde) match STRUCT sub-fields by name when loading JSON#64011

Open
csun5285 wants to merge 1 commit into apache:master from csun5285:fix/OPENSOURCE-374-struct-field-align

csun5285 (Contributor) commented Jun 2, 2026

Stream Load into a STRUCT column reads each value as a string and converts it with DataTypeStructSerDe::from_string (CAST varchar -> struct). That path matched sub-fields by position, so JSON keys whose order differed from the DDL turned the whole struct column into NULL, and a row missing a field failed to load.
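
To make the failure mode concrete, here is a toy Python model of positional matching. It is not the Doris code path: the schema, the `cast_positional` helper, and the use of `None` to stand for SQL NULL are all assumptions made for illustration.

```python
import json

# Hypothetical schema for STRUCT<id INT, name STRING>; the callables
# stand in for per-field type conversion.
SCHEMA = [("id", int), ("name", str)]


def cast_positional(text):
    """Toy model: assign the i-th JSON value to the i-th schema field.

    Returns None (standing in for SQL NULL) when a value cannot be
    converted, and raises ValueError when the field count differs.
    """
    values = list(json.loads(text).values())  # key names are discarded
    if len(values) != len(SCHEMA):
        raise ValueError("field count mismatch")
    row = {}
    for (field, convert), value in zip(SCHEMA, values):
        try:
            row[field] = convert(value)
        except (TypeError, ValueError):
            return None  # one bad field nulls the whole struct
    return row


in_order = cast_positional('{"id": 1, "name": "a"}')
reordered = cast_positional('{"name": "a", "id": 1}')
```

In this model `in_order` is a populated row, while `reordered` is `None` because the string `"a"` lands in the integer slot. An input with only one key raises, which mirrors the row that failed to load.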

doc: apache/doris-website#3907

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None



csun5285 (Contributor, Author) commented Jun 2, 2026

run buildall

@hello-stephen (Contributor)

BE UT Coverage Report

Increment line coverage 100.00% (27/27) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.12% (21121/39024)
Line Coverage 37.68% (200687/532583)
Region Coverage 33.79% (157896/467260)
Branch Coverage 34.78% (68980/198358)

@hello-stephen (Contributor)

BE Regression && UT Coverage Report

Increment line coverage 100.00% (27/27) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.92% (28249/38214)
Line Coverage 57.85% (307288/531207)
Region Coverage 54.59% (257478/471681)
Branch Coverage 56.08% (111661/199104)

@hello-stephen (Contributor)

TPC-H: Total hot run time: 29321 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ee43d13e9a290dcf0f1eb388e2cddf18563418aa, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17642	4185	4084	4084
q2	
q3	10795	1426	834	834
q4	4682	484	344	344
q5	7543	900	594	594
q6	189	172	138	138
q7	771	826	660	660
q8	9342	1461	1498	1461
q9	5772	4552	4503	4503
q10	6785	1790	1542	1542
q11	435	267	251	251
q12	641	421	290	290
q13	18137	3383	2772	2772
q14	265	262	243	243
q15	
q16	827	796	698	698
q17	1004	998	957	957
q18	7135	5787	5552	5552
q19	1316	1345	1082	1082
q20	524	400	266	266
q21	6221	2787	2727	2727
q22	464	376	323	323
Total cold run time: 100490 ms
Total hot run time: 29321 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5094	4882	4929	4882
q2	
q3	4914	5193	4665	4665
q4	2142	2223	1399	1399
q5	4822	4921	4714	4714
q6	235	173	130	130
q7	1823	1884	1584	1584
q8	2561	2159	2170	2159
q9	8010	7766	7545	7545
q10	4749	4716	4202	4202
q11	546	399	368	368
q12	725	733	527	527
q13	3040	3328	2790	2790
q14	279	278	253	253
q15	
q16	687	696	610	610
q17	1304	1292	1282	1282
q18	7361	6824	6773	6773
q19	1118	1118	1129	1118
q20	2216	2229	1951	1951
q21	5404	4763	4511	4511
q22	532	456	412	412
Total cold run time: 57562 ms
Total hot run time: 51875 ms
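
The tables above carry no column headers. A quick check suggests how to read them: the first numeric column looks like the cold run, the next two like hot runs, and the last like the faster of the two hot runs, with the reported totals being column sums. The sketch below tests that reading against the Round 1 rows; the column interpretation is an inference from the numbers, not something the bot states, and `q2` and `q15` are left out because they report no timings.

```python
# Round 1 rows copied from the report above (q2 and q15 have no timings).
ROUND1 = """
q1 17642 4185 4084 4084
q3 10795 1426 834 834
q4 4682 484 344 344
q5 7543 900 594 594
q6 189 172 138 138
q7 771 826 660 660
q8 9342 1461 1498 1461
q9 5772 4552 4503 4503
q10 6785 1790 1542 1542
q11 435 267 251 251
q12 641 421 290 290
q13 18137 3383 2772 2772
q14 265 262 243 243
q16 827 796 698 698
q17 1004 998 957 957
q18 7135 5787 5552 5552
q19 1316 1345 1082 1082
q20 524 400 266 266
q21 6221 2787 2727 2727
q22 464 376 323 323
"""

# Each row: [query, cold, hot1, hot2, best] under the assumed reading.
rows = [line.split() for line in ROUND1.strip().splitlines()]

cold_total = sum(int(r[1]) for r in rows)
hot_total = sum(int(r[4]) for r in rows)
best_is_min_hot = all(int(r[4]) == min(int(r[2]), int(r[3])) for r in rows)
```

Under this reading `cold_total` and `hot_total` reproduce the reported 100490 ms and 29321 ms for Round 1.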

@hello-stephen (Contributor)

TPC-DS: Total hot run time: 170190 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ee43d13e9a290dcf0f1eb388e2cddf18563418aa, data reload: false

query5	4322	634	477	477
query6	453	219	191	191
query7	4988	573	313	313
query8	384	234	211	211
query9	8816	3968	4029	3968
query10	474	315	267	267
query11	5879	2372	2138	2138
query12	156	105	104	104
query13	1299	621	432	432
query14	6390	5388	5101	5101
query14_1	4455	4459	4430	4430
query15	211	201	179	179
query16	995	458	455	455
query17	962	728	610	610
query18	2451	502	361	361
query19	206	189	152	152
query20	123	112	107	107
query21	243	148	119	119
query22	13705	13658	13515	13515
query23	17400	16582	16234	16234
query23_1	16390	16475	16354	16354
query24	7617	1784	1317	1317
query24_1	1340	1330	1334	1330
query25	597	480	423	423
query26	1299	318	174	174
query27	2690	549	334	334
query28	4467	2014	2022	2014
query29	1102	650	499	499
query30	305	246	208	208
query31	1153	1076	979	979
query32	114	65	64	64
query33	543	330	268	268
query34	1175	1150	681	681
query35	800	785	695	695
query36	1376	1406	1269	1269
query37	164	107	95	95
query38	3225	3197	3123	3123
query39	933	918	903	903
query39_1	878	876	877	876
query40	219	120	101	101
query41	66	63	63	63
query42	99	95	94	94
query43	331	337	275	275
query44	
query45	197	188	189	188
query46	1104	1215	730	730
query47	2392	2371	2190	2190
query48	415	389	300	300
query49	620	470	362	362
query50	967	354	259	259
query51	4364	4342	4276	4276
query52	89	89	76	76
query53	238	267	202	202
query54	273	217	210	210
query55	79	78	71	71
query56	239	238	228	228
query57	1449	1407	1336	1336
query58	248	220	216	216
query59	1641	1722	1464	1464
query60	291	248	239	239
query61	165	167	153	153
query62	701	668	592	592
query63	237	192	185	185
query64	2532	780	601	601
query65	
query66	1807	462	358	358
query67	29881	29884	29727	29727
query68	
query69	435	310	261	261
query70	982	951	962	951
query71	307	225	217	217
query72	2991	2744	2434	2434
query73	846	787	429	429
query74	5152	4951	4791	4791
query75	2684	2595	2231	2231
query76	2323	1155	756	756
query77	348	397	304	304
query78	12288	12389	11938	11938
query79	1432	1054	784	784
query80	604	462	392	392
query81	454	286	249	249
query82	844	156	123	123
query83	349	274	249	249
query84	308	142	113	113
query85	875	525	433	433
query86	370	309	300	300
query87	3399	3313	3199	3199
query88	3607	2747	2706	2706
query89	461	374	331	331
query90	1964	191	191	191
query91	178	171	143	143
query92	72	64	57	57
query93	1573	1441	853	853
query94	531	365	312	312
query95	681	484	350	350
query96	1010	807	338	338
query97	2707	2694	2577	2577
query98	217	209	207	207
query99	1179	1174	1071	1071
Total cold run time: 252323 ms
Total hot run time: 170190 ms

Stream Load into a STRUCT column reads each value as a string and converts it
with DataTypeStructSerDe::from_string (CAST varchar -> struct). That path matched
sub-fields by position, so JSON keys whose order differed from the DDL turned the
whole struct column into NULL, and a row missing a field failed to load.

from_string now detects named mode by the delimiter structure and matches
sub-fields by name (case-insensitive), filling missing nullable fields with NULL.
Unknown/extra fields are ignored in non-strict mode (load), matching the simdjson
JSON reader that feeds STRUCT columns on JSON stream load; strict CAST instead
rejects an unknown field name. Positional input still requires an exact field
count, and struct-to-struct CAST stays positional.
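
The by-name rules in the paragraph above can be sketched in Python as follows. This is a minimal model, not the C++ implementation: `cast_by_name`, its `strict` flag, and the schema tuples are hypothetical names, named-mode detection is skipped by assuming the input is always a JSON object, and raising on a missing non-nullable field is an assumption the source does not spell out.

```python
import json

# Hypothetical schema: (field name, converter, nullable).
SCHEMA = [("id", int, True), ("name", str, True)]


def cast_by_name(text, strict=False):
    """Toy model of by-name matching for a named-mode struct string."""
    data = json.loads(text)
    # Case-insensitive lookup from input key to schema position.
    index_of = {name.lower(): i for i, (name, _, _) in enumerate(SCHEMA)}
    row = {name: None for name, _, _ in SCHEMA}  # missing fields stay NULL
    seen = set()
    for key, value in data.items():
        idx = index_of.get(key.lower())
        if idx is None:
            if strict:
                raise ValueError(f"unknown field: {key}")
            continue  # non-strict (load) mode ignores extra keys
        name, convert, _ = SCHEMA[idx]
        row[name] = convert(value)
        seen.add(name)
    for name, _, nullable in SCHEMA:
        # Assumption: a missing non-nullable field is an error.
        if name not in seen and not nullable:
            raise ValueError(f"missing non-nullable field: {name}")
    return row


reordered = cast_by_name('{"NAME": "a", "id": 1}')
missing = cast_by_name('{"id": 1}')
extra = cast_by_name('{"id": 1, "extra": true}')
```

Here the reordered, differently cased input yields the same row as in-order input, the missing `name` becomes `None`, and the extra key is dropped; calling `cast_by_name('{"id": 1, "extra": true}', strict=True)` raises instead.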

Add BE unit tests (DataTypeStructSerDeTest.FromStringByFieldName) covering by-name
matching, case-insensitivity, missing/unknown/extra fields, positional input and
the error paths; and a stream-load regression test (test_struct_field_align).
Update the existing struct expectations in test_stream_load,
test_stream_load_move_memtable and test_cast_struct to the by-name results.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

csun5285 force-pushed the fix/OPENSOURCE-374-struct-field-align branch from ee43d13 to d585cbe on June 3, 2026 11:53

csun5285 (Contributor, Author) commented Jun 3, 2026

run buildall

@hello-stephen (Contributor)

TPC-H: Total hot run time: 29666 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d585cbe88c5c82ded0901703a11bff8ef226a1b7, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17672	4190	4173	4173
q2	
q3	10801	1448	840	840
q4	4687	481	342	342
q5	7550	914	598	598
q6	189	186	142	142
q7	790	832	645	645
q8	9446	1596	1602	1596
q9	5978	4575	4570	4570
q10	6751	1828	1528	1528
q11	443	281	260	260
q12	642	441	304	304
q13	18119	3497	2745	2745
q14	269	263	237	237
q15	
q16	825	770	716	716
q17	995	980	1004	980
q18	7089	5790	5538	5538
q19	1807	1355	1110	1110
q20	543	415	269	269
q21	6381	2903	2745	2745
q22	473	385	328	328
Total cold run time: 101450 ms
Total hot run time: 29666 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5354	5006	5012	5006
q2	
q3	4905	5414	4760	4760
q4	2193	2228	1421	1421
q5	4832	5016	4701	4701
q6	232	181	131	131
q7	1881	1843	1561	1561
q8	2582	2247	2236	2236
q9	7926	7673	7537	7537
q10	4740	4675	4247	4247
q11	558	413	370	370
q12	745	744	525	525
q13	3066	3455	2772	2772
q14	285	290	264	264
q15	
q16	699	703	619	619
q17	1313	1287	1299	1287
q18	7824	6846	7071	6846
q19	1099	1103	1118	1103
q20	2235	2219	1957	1957
q21	5458	4738	4677	4677
q22	528	468	409	409
Total cold run time: 58455 ms
Total hot run time: 52429 ms

@hello-stephen (Contributor)

TPC-DS: Total hot run time: 170390 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d585cbe88c5c82ded0901703a11bff8ef226a1b7, data reload: false

query5	4328	652	481	481
query6	437	199	186	186
query7	4904	605	309	309
query8	373	231	212	212
query9	8761	4153	4142	4142
query10	453	330	259	259
query11	5865	2361	2160	2160
query12	152	104	98	98
query13	1311	615	452	452
query14	6382	5441	5122	5122
query14_1	4473	4417	4414	4414
query15	213	199	177	177
query16	1062	469	458	458
query17	1144	765	597	597
query18	2655	488	359	359
query19	225	192	152	152
query20	118	111	110	110
query21	227	145	121	121
query22	13699	13604	13402	13402
query23	17457	16524	16230	16230
query23_1	16382	16407	16365	16365
query24	7476	1777	1359	1359
query24_1	1310	1312	1327	1312
query25	560	454	394	394
query26	1301	343	165	165
query27	2640	573	320	320
query28	4461	2048	2027	2027
query29	1080	627	480	480
query30	313	237	202	202
query31	1123	1072	953	953
query32	110	61	61	61
query33	509	331	255	255
query34	1165	1214	648	648
query35	754	789	691	691
query36	1433	1417	1240	1240
query37	152	119	94	94
query38	3232	3149	3018	3018
query39	932	913	907	907
query39_1	887	868	862	862
query40	222	125	105	105
query41	71	67	62	62
query42	96	98	97	97
query43	332	331	279	279
query44	
query45	196	188	180	180
query46	1112	1259	761	761
query47	2348	2328	2244	2244
query48	418	413	286	286
query49	637	478	380	380
query50	962	377	265	265
query51	4441	4342	4209	4209
query52	93	98	78	78
query53	248	277	191	191
query54	282	234	213	213
query55	79	76	70	70
query56	251	223	220	220
query57	1443	1419	1331	1331
query58	247	215	213	213
query59	1605	1654	1439	1439
query60	290	258	238	238
query61	163	157	159	157
query62	693	674	587	587
query63	231	184	191	184
query64	2588	801	639	639
query65	
query66	1773	476	355	355
query67	29739	29725	29649	29649
query68	
query69	427	308	270	270
query70	978	944	947	944
query71	310	221	218	218
query72	3075	2762	2194	2194
query73	903	780	441	441
query74	5131	4987	4747	4747
query75	2699	2600	2257	2257
query76	2342	1160	781	781
query77	360	378	287	287
query78	12347	12492	11861	11861
query79	1283	1070	763	763
query80	533	501	405	405
query81	451	284	245	245
query82	245	159	119	119
query83	275	285	251	251
query84	260	143	116	116
query85	874	535	437	437
query86	325	301	286	286
query87	3393	3328	3166	3166
query88	3687	2792	2748	2748
query89	423	384	329	329
query90	2181	183	190	183
query91	180	163	143	143
query92	64	66	59	59
query93	1629	1464	876	876
query94	540	359	325	325
query95	685	383	359	359
query96	1035	820	353	353
query97	2710	2703	2590	2590
query98	221	207	206	206
query99	1188	1180	1050	1050
Total cold run time: 251849 ms
Total hot run time: 170390 ms

@hello-stephen (Contributor)

BE Regression && UT Coverage Report

Increment line coverage 100.00% (31/31) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.94% (27498/38226)
Line Coverage 55.48% (294380/530632)
Region Coverage 52.23% (245659/470333)
Branch Coverage 53.44% (106259/198828)

csun5285 added a commit to csun5285/doris-website that referenced this pull request Jun 4, 2026
Document the behavior change from apache/doris#64011: when casting a
string to STRUCT with field names (named mode), fields are now matched
by name (case-insensitive) instead of by position. The input field
order may differ from the schema, missing fields are filled with NULL,
and unknown fields are rejected in strict mode / ignored in non-strict
mode. Positional input (no field names) still requires an exact field
count. Updated EN/ZH for dev and 4.x.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>