@luwei16 luwei16 commented Jan 29, 2026

Previously, all compaction types (base, cumulative, full) shared a single sample_infos vector per tablet. When different compaction types ran concurrently on the same tablet, one compaction could resize sample_infos while another was accessing it, causing an out-of-bounds access and a crash.

Crash stack:

*** Aborted at 1769502009 (unix time) try "date -d @1769502009" if you are using GNU date ***
*** Current BE git commitID: 0c75960cd13 ***
*** SIGABRT unknown detail explain (@0x4c61) received by PID 19553 (TID 20096 OR 0x7b7f13caa640) from PID 19553; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/common/signal_handler.h:420
 1# 0x00007F82B398B520 in /lib/x86_64-linux-gnu/libc.so.6
 2# pthread_kill at ./nptl/pthread_kill.c:89
 3# raise at ../sysdeps/posix/raise.c:27
 4# abort at ./stdlib/abort.c:81
 5# 0x000055BA75135461 in /mnt/hdd01/ci/doris-deploy-branch-selectdb-doris-4.0-cloud/be/lib/doris_be
 6# std::vector >::operator[](unsigned long) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/stl_vector.h:1263
 7# doris::estimate_batch_size(int, std::shared_ptr, long) at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/olap/merger.cpp:416
 8# doris::Merger::vertical_merge_rowsets(std::shared_ptr, doris::ReaderType, doris::TabletSchema const&, std::vector, std::allocator > > const&, doris::RowsetWriter*, unsigned int, long, doris::Merger::Statistics*) at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/olap/merger.cpp:496
 9# doris::Compaction::merge_input_rowsets() at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/olap/compaction.cpp:210
10# doris::CloudCompactionMixin::execute_compact_impl(long) at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/olap/compaction.cpp:1490
11# doris::CloudCompactionMixin::execute_compact() at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/olap/compaction.cpp:1528
12# doris::CloudBaseCompaction::execute_compact() at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/cloud/cloud_base_compaction.cpp:296
13# doris::CloudStorageEngine::_submit_base_compaction_task(std::shared_ptr const&)::$_0::operator()() const at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/cloud/cloud_storage_engine.cpp:806
14# void std::__invoke_impl const&)::$_0&>(std::__invoke_other, doris::CloudStorageEngine::_submit_base_compaction_task(std::shared_ptr const&)::$_0&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:63
15# std::enable_if const&)::$_0&>, void>::type std::__invoke_r const&)::$_0&>(doris::CloudStorageEngine::_submit_base_compaction_task(std::shared_ptr const&)::$_0&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:119
16# std::_Function_handler const&)::$_0>::_M_invoke(std::_Any_data const&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
17# std::function::operator()() const at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:593
18# doris::FunctionRunnable::run() at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/util/threadpool.cpp:60
19# doris::ThreadPool::dispatch_thread() at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/util/threadpool.cpp:616
20# void std::__invoke_impl(std::__invoke_memfun_deref, void (doris::ThreadPool::*&)(), doris::ThreadPool*&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:76
21# std::__invoke_result::type std::__invoke(void (doris::ThreadPool::*&)(), doris::ThreadPool*&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:98
22# void std::_Bind::__call(std::tuple<>&&, std::_Index_tuple<0ul>) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/functional:515
23# void std::_Bind::operator()<, void>() at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/functional:600
24# void std::__invoke_impl&>(std::__invoke_other, std::_Bind&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:63
25# std::enable_if&>, void>::type std::__invoke_r&>(std::_Bind&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/invoke.h:119
26# std::_Function_handler >::_M_invoke(std::_Any_data const&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
27# std::function::operator()() const at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:593
28# doris::Thread::supervise_thread(void*) at /mnt/disk3/pipeline/repo/selectdb-core_branch-selectdb-doris-4.0/selectdb-core/be/src/util/thread.cpp:460
29# asan_thread_start(void*) in /mnt/hdd01/ci/doris-deploy-branch-selectdb-doris-4.0-cloud/be/lib/doris_be
30# start_thread at ./nptl/pthread_create.c:442
31# 0x00007F82B3A6F850 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:83

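Root cause:
Base, cumulative, and full compactions can run concurrently on the same tablet.
They all share a single sample_infos vector.
resize() and operator[] do not run in the same critical section, so one compaction's resize() can land between another compaction's resize() and its later operator[] call.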
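
The interleaving described above can be replayed on a single thread, which makes it deterministic. This is only a sketch under stated assumptions: `SampleInfo`, `SharedTablet`, `prepare`, `index_valid`, and the group counts 8 and 2 are hypothetical names and values chosen for illustration, not the actual code in merger.cpp. It checks the index instead of calling operator[] so that the example never performs the out-of-bounds access itself.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for one per-column-group sampling record.
struct SampleInfo {
    long bytes = 0;
    long rows = 0;
};

// Hypothetical tablet that keeps ONE vector for all compaction types,
// mirroring the layout described as the root cause.
struct SharedTablet {
    std::mutex mtx;
    std::vector<SampleInfo> sample_infos;

    // Critical section 1: size the vector for this compaction's column groups.
    void prepare(std::size_t group_count) {
        std::lock_guard<std::mutex> lock(mtx);
        sample_infos.resize(group_count);
    }

    // Critical section 2: report whether operator[] would still be in bounds.
    bool index_valid(std::size_t group_index) {
        std::lock_guard<std::mutex> lock(mtx);
        return group_index < sample_infos.size();
    }
};

// Replays one bad interleaving on a single thread so the outcome is
// deterministic. The group counts 8 and 2 are made up for illustration.
inline bool base_index_survives_interleaving() {
    SharedTablet tablet;
    tablet.prepare(8);              // base compaction sizes the vector
    assert(tablet.index_valid(5));  // group 5 is valid at this point
    tablet.prepare(2);              // another compaction type shrinks it
    return tablet.index_valid(5);   // base compaction now reads group 5
}
```

Each call takes the lock correctly, yet the index that was valid after the first `prepare` is no longer valid by the time it is used. That matches the stack above, where frame 6 is `operator[]` and frame 4 is `abort`.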

Fix:
Keep a separate sample_infos for each compaction type (cumu/base/full).
Give each type its own mutex and vector.
Add getter methods that select the correct sample_infos by ReaderType.
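
A minimal sketch of the shape of this fix, assuming a reduced `ReaderType` enum and hypothetical names (`SampleSlot`, `TabletSketch`, `sample_slot`, `resize_samples`, `sample_count`); the real member and getter names in the patch may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <stdexcept>
#include <vector>

// Hypothetical subset of reader types; the real enum has more values.
enum class ReaderType { BaseCompaction, CumulativeCompaction, FullCompaction };

// Hypothetical stand-in for one per-column-group sampling record.
struct SampleInfo {
    long bytes = 0;
    long rows = 0;
};

// One mutex plus one vector per compaction type.
struct SampleSlot {
    std::mutex mtx;
    std::vector<SampleInfo> infos;
};

class TabletSketch {
public:
    // Getter that selects the slot belonging to the given compaction type.
    SampleSlot& sample_slot(ReaderType type) {
        switch (type) {
        case ReaderType::BaseCompaction:
            return _base;
        case ReaderType::CumulativeCompaction:
            return _cumu;
        case ReaderType::FullCompaction:
            return _full;
        }
        throw std::invalid_argument("not a compaction reader type");
    }

private:
    SampleSlot _base;
    SampleSlot _cumu;
    SampleSlot _full;
};

// Resizes only the vector that belongs to `type`, under that type's mutex.
inline void resize_samples(TabletSketch& tablet, ReaderType type, std::size_t n) {
    SampleSlot& slot = tablet.sample_slot(type);
    std::lock_guard<std::mutex> lock(slot.mtx);
    slot.infos.resize(n);
}

// Returns the current size of the vector that belongs to `type`.
inline std::size_t sample_count(TabletSketch& tablet, ReaderType type) {
    SampleSlot& slot = tablet.sample_slot(type);
    std::lock_guard<std::mutex> lock(slot.mtx);
    return slot.infos.size();
}

inline void demo() {
    TabletSketch tablet;
    resize_samples(tablet, ReaderType::BaseCompaction, 8);
    resize_samples(tablet, ReaderType::CumulativeCompaction, 2);
    // The cumulative resize no longer affects the base compaction's vector.
    assert(sample_count(tablet, ReaderType::BaseCompaction) == 8);
}
```

This layout only helps if at most one compaction of each type runs on a tablet at a time, which the description implies but does not state outright.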

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@luwei16
Contributor Author

luwei16 commented Jan 29, 2026

run buildall

@github-actions bot added the "approved" label (indicates a PR has been approved by one committer) Jan 29, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@doris-robot

TPC-H: Total hot run time: 32791 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3fd8c52e2c90be6ad64e950a90b1b5478cfb41bc, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17672	5274	5095	5095
q2	2083	317	203	203
q3	10171	1368	738	738
q4	10223	901	314	314
q5	7676	2181	1975	1975
q6	229	184	152	152
q7	890	761	633	633
q8	9289	1428	1145	1145
q9	5559	4816	4978	4816
q10	6898	1977	1570	1570
q11	495	291	281	281
q12	369	376	226	226
q13	17811	4103	3191	3191
q14	242	237	223	223
q15	905	837	818	818
q16	669	689	615	615
q17	659	824	521	521
q18	6897	6763	7237	6763
q19	1841	1077	746	746
q20	451	363	248	248
q21	3105	2325	2231	2231
q22	389	317	287	287
Total cold run time: 104523 ms
Total hot run time: 32791 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5685	5500	5516	5500
q2	264	345	255	255
q3	2477	2897	2518	2518
q4	1411	1880	1400	1400
q5	4699	4454	4666	4454
q6	227	179	140	140
q7	2037	1980	1727	1727
q8	2578	2370	2482	2370
q9	7636	7651	7378	7378
q10	2910	3046	2430	2430
q11	542	459	435	435
q12	641	700	550	550
q13	3574	4050	3238	3238
q14	273	285	264	264
q15	833	813	784	784
q16	643	675	640	640
q17	1076	1328	1338	1328
q18	7611	7408	7444	7408
q19	875	848	826	826
q20	1979	2048	1889	1889
q21	4491	4158	4102	4102
q22	583	540	503	503
Total cold run time: 53045 ms
Total hot run time: 50139 ms

@doris-robot

ClickBench: Total hot run time: 28.29 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 3fd8c52e2c90be6ad64e950a90b1b5478cfb41bc, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.05	0.04	0.05
query2	0.10	0.04	0.04
query3	0.26	0.08	0.08
query4	1.60	0.12	0.11
query5	0.27	0.24	0.25
query6	1.16	0.67	0.67
query7	0.03	0.03	0.03
query8	0.05	0.04	0.03
query9	0.56	0.51	0.48
query10	0.54	0.55	0.54
query11	0.14	0.10	0.09
query12	0.14	0.10	0.10
query13	0.62	0.61	0.61
query14	1.05	1.05	1.06
query15	0.88	0.86	0.87
query16	0.38	0.40	0.39
query17	1.15	1.11	1.09
query18	0.22	0.22	0.20
query19	2.09	1.98	1.98
query20	0.02	0.02	0.02
query21	15.43	0.27	0.14
query22	5.17	0.06	0.05
query23	16.03	0.29	0.11
query24	1.02	0.53	0.35
query25	0.12	0.09	0.06
query26	0.14	0.13	0.14
query27	0.08	0.06	0.09
query28	4.23	1.16	0.98
query29	12.60	3.89	3.18
query30	0.28	0.13	0.12
query31	2.84	0.63	0.40
query32	3.24	0.59	0.49
query33	3.27	3.33	3.19
query34	16.03	5.38	4.74
query35	4.82	4.85	4.78
query36	0.65	0.51	0.49
query37	0.14	0.07	0.07
query38	0.06	0.04	0.03
query39	0.05	0.03	0.03
query40	0.19	0.16	0.15
query41	0.09	0.04	0.03
query42	0.04	0.03	0.03
query43	0.06	0.04	0.03
Total cold run time: 97.89 s
Total hot run time: 28.29 s

@doris-robot

BE UT Coverage Report

Increment line coverage 85.71% (36/42) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.72% (19265/36544)
Line Coverage 36.11% (179042/495801)
Region Coverage 32.56% (138834/426395)
Branch Coverage 33.50% (60086/179361)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 92.86% (39/42) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.48% (25606/35822)
Line Coverage 54.11% (267606/494568)
Region Coverage 51.77% (222979/430713)
Branch Coverage 53.11% (95645/180077)

@luwei16 luwei16 merged commit fd14556 into apache:master Jan 30, 2026
33 of 35 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 30, 2026
… shared sample_infos (#60376)

Labels

approved, dev/4.0.x, reviewed


5 participants