Conversation


@hujun260 hujun260 commented Dec 10, 2025

Summary

Add a reference count to the TCB to prevent it from being deleted.

To replace the big lock with smaller ones and reduce the large TCB-related locks: in many scenarios we only need to guarantee that the TCB won't be released, rather than holding a lock, which reduces the possibility of lock recursion.
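For illustration, the intended usage pattern looks roughly like the sketch below. The names use_tcb() and tcb_unref() are placeholders, and the assumption that nxsched_get_tcb() itself now takes a reference is inferred from the discussion below rather than confirmed here:

FAR struct tcb_s *tcb = nxsched_get_tcb(pid); /* lookup, now also takes a reference */
if (tcb != NULL)
  {
    /* The TCB cannot be freed in this window, even if the thread exits
     * concurrently, so no big lock needs to be held across it.
     */

    use_tcb(tcb);

    tcb_unref(tcb); /* placeholder: drop the reference; the last one frees the TCB */
  }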

Should be merged together with apache/nuttx-apps#3246.

Impact

TCB release.

Test target: nucleo-g431rb:nsh

After this patch:
size nuttx
   text    data     bss     dec     hex filename
 127964     904    8644  137512   21928 nuttx

Before this patch:
size nuttx
   text    data     bss     dec     hex filename
 127232     904    8612  136748   2162c nuttx

Flash size (text + data) increases by 732 bytes.

Testing

Tested on hardware:

esp32s3-devkit:nsh

user_main: scheduler lock test
sched_lock: Starting lowpri_thread at 97
sched_lock: Set lowpri_thread priority to 97
sched_lock: Starting highpri_thread at 98
sched_lock: Set highpri_thread priority to 98
sched_lock: Waiting...
sched_lock: PASSED No pre-emption occurred while scheduler was locked.
sched_lock: Starting lowpri_thread at 97
sched_lock: Set lowpri_thread priority to 97
sched_lock: Starting highpri_thread at 98
sched_lock: Set highpri_thread priority to 98
sched_lock: Waiting...
sched_lock: PASSED No pre-emption occurred while scheduler was locked.
sched_lock: Finished

End of test memory usage:
VARIABLE BEFORE AFTER
======== ======== ========
arena 5d8bc 5d8bc
ordblks 7 6
mxordblk 548a0 548a0
uordblks 5014 5014
fordblks 588a8 588a8

Final memory usage:
VARIABLE BEFORE AFTER
======== ======== ========
arena 5d8bc 5d8bc
ordblks 1 6
mxordblk 59238 548a0
uordblks 4684 5014
fordblks 59238 588a8
user_main: Exiting
ostest_main: Exiting with status 0
nsh> u
nsh: u: command not found
nsh>
nsh>
nsh>
nsh> uname -a
NuttX 12.11.0 ef91333e3ac-dirty Dec 10 2025 16:11:04 xtensa esp32s3-devkit
nsh>

@github-actions bot added labels Arch: risc-v, Area: Drivers, Area: File System, Area: Memory Management, Area: OS Components, Size: L on Dec 10, 2025
@hujun260 hujun260 force-pushed the apache_12 branch 3 times, most recently from f192057 to cae0ffd Compare December 11, 2025 06:54
@hujun260 hujun260 requested a review from acassis as a code owner December 11, 2025 06:54
@hartmannathan (Contributor)

Is it possible to make this PR optional? Perhaps it should affect only the bigger SMP-capable chips and not the smaller single-core ones?

@GUIDINGLI (Contributor)

@anchao @acassis
Patch Motivation & Proper Solution:
The main purpose of this patch is to address use-after-free issues that can occur when a thread unexpectedly exits. For example, in scenarios such as proc_open(), when nxsched_get_tcb obtains the TCB but the corresponding thread exits at a higher priority, or during calls to functions like nxclock_gettime(), sched_backtrace(), or mm_memdump(), a similar risk exists. These problems should be systematically addressed by introducing critical sections or by employing other synchronization strategies in those critical paths.

Not Just SMP:
These use-after-free problems can occur in both SMP and UP environments — this is not an SMP-only issue.

Limitations of Critical Sections:
Even with the addition of critical sections, these issues may not be fully resolved. For instance, consider the following example:


mm_memdump() 
{
    enter_critical_section();
    tcb = nxsched_get_tcb(pid);
    syslog();             // might wake up or schedule another thread
    name = get_task_name(tcb);
    leave_critical_section();
}

In this scenario, even with enter_critical_section(), there can still be a race if syslog() internally leads to context switches that affect the TCB’s lifetime.
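For illustration, a reference count closes this window. A minimal sketch, assuming a placeholder tcb_unref() helper rather than the patch's actual API:

mm_memdump()
{
    tcb = nxsched_get_tcb(pid);   /* lookup + take a reference */
    if (tcb != NULL)
      {
        syslog();                 /* may context-switch; the TCB stays valid */
        name = get_task_name(tcb);
        tcb_unref(tcb);           /* drop the reference; the TCB is freed here
                                   * only if the thread already exited */
      }
}

No critical section is needed just to keep the TCB alive, which is exactly the lock reduction this patch aims at.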

On Mainstream RTOS Practice:
The notion that "mainstream RTOS do not incorporate this feature, as the number of threads is predetermined in most scenarios" is not accurate. In a typical IoT system, creating and destroying threads (pthreads) is frequent and widespread—thread counts are rarely static. While in safety-critical automotive contexts thread counts might be fixed, such scenarios are the exception, not the rule, for consumer and general-purpose products.

On Performance and Size Impact:
Statements such as "it will exert an extremely detrimental impact on both size and performance" are not substantiated without concrete measurement or testing. According to the data provided by @hujun260, the actual performance impact is minimal; please carefully review the test report. Regarding code size: yes, adding additional safety checks will increase flash usage, and we will supplement the PR with precise flash size measurements.

If there is a better alternative to solve this problem, then this patch is unnecessary. Patches are welcome.


acassis commented Dec 16, 2025


@GUIDINGLI thank you for bringing more information to this discussion.
I think this commit will be beneficial to NuttX, but the data that @hujun260 brought is not helping much: the osperf results are inconclusive, see:

  1. Why did context-switch decrease 3.36x?
  2. Why did hpwork decrease as well?
  3. Why did pipe-rw show a 35% overhead?
  4. Why did semwait become almost 2x faster?

It seems osperf is not reliable for this kind of test; maybe we need to use SEGGER J-Link RTT or ORBTrace/Orbuculum to get more precise/reliable data.

@hartmannathan (Contributor)


@GUIDINGLI @hujun260 This explanation is extremely helpful for giving context and understanding this change. Please add it to the PR description and to the commit log so that it can be easily understood in the future. Thank you!

@hartmannathan hartmannathan left a comment

Please add @GUIDINGLI's explanation to the commit log and PR description because it helps to understand the changes.

@GUIDINGLI (Contributor)

@acassis @hartmannathan
Thanks for the feedback.
We will re-test the performance data, not only with osperf but also with rtos-benchmark, and will also provide the flash size data.


anchao commented Dec 16, 2025


You can try running this code. The nxsched_get_tcb() function incurs a 2.5-4% performance overhead, and this function is used throughout NuttX:

#include <nuttx/config.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/****************************************************************************
 * Public Functions
 ****************************************************************************/

/****************************************************************************
 * hello_main
 ****************************************************************************/
pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

static void timespec_diff(const struct timespec *start, const struct timespec *end, struct timespec *diff) {
    diff->tv_sec = end->tv_sec - start->tv_sec;
    diff->tv_nsec = end->tv_nsec - start->tv_nsec;
    
    if (diff->tv_nsec < 0) {
        diff->tv_sec--;
        diff->tv_nsec += 1000000000;
    }
}

static void timespec_add(struct timespec *total, const struct timespec *diff) {
    total->tv_sec += diff->tv_sec;
    total->tv_nsec += diff->tv_nsec;
    
    if (total->tv_nsec >= 1000000000) {
        total->tv_sec += total->tv_nsec / 1000000000;
        total->tv_nsec = total->tv_nsec % 1000000000;
    }
}

static void timespec_avg(const struct timespec *total, int count, struct timespec *avg) {
    uint64_t total_ns = (uint64_t)total->tv_sec * 1000000000 + total->tv_nsec;
    uint64_t avg_ns = total_ns / count;
    
    avg->tv_sec = avg_ns / 1000000000;
    avg->tv_nsec = avg_ns % 1000000000;
}

int main(int argc, char *argv[])                                                                               
{             
    struct timespec start;
    struct timespec end;                                                                                             
    struct timespec diff;
    struct timespec total = {0, 0};
    struct timespec avg;           
    int i, j = 0;
    const int loop_count = 10;
    
    while (j < loop_count)
    { 
        i = 0;
        j++;
        
        clock_gettime(CLOCK_BOOTTIME, &start);
        pthread_mutex_lock(&g_mutex);
        while (i < 1000 * 1000)
        { 
            i++;
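            /* The mutex is already held, so trylock fails with EBUSY; per
             * @anchao this path still performs the owner lookup via
             * nxsched_get_tcb(), which is the overhead being measured.
             */
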
            pthread_mutex_trylock(&g_mutex);
        }
        pthread_mutex_unlock(&g_mutex);
        
        clock_gettime(CLOCK_BOOTTIME, &end);
        
        timespec_diff(&start, &end, &diff);
        
        timespec_add(&total, &diff);
        
        timespec_avg(&total, j, &avg);
        
        printf("%d: diff = %lu.%09lu s | avg = %lu.%09lu s\n", 
               j, 
               (unsigned long)diff.tv_sec, (unsigned long)diff.tv_nsec,
               (unsigned long)avg.tv_sec, (unsigned long)avg.tv_nsec);
    }
    
    printf("\n===== result =====\n");
    printf("count: %d\n", loop_count);
    printf("total: %lu.%09lu s\n", (unsigned long)total.tv_sec, (unsigned long)total.tv_nsec);
    printf("avg: %lu.%09lu s\n", (unsigned long)avg.tv_sec, (unsigned long)avg.tv_nsec);
    
    return 0;
}

Additionally, I conducted the validation on the sim host PC; running this test on an MCU may result in even greater performance degradation:

Before this change:

nsh> hello
1: diff = 0.306540108 s | avg = 0.306540108 s
2: diff = 0.297250552 s | avg = 0.301895330 s
3: diff = 0.296764389 s | avg = 0.300185016 s
4: diff = 0.297345113 s | avg = 0.299475040 s
5: diff = 0.296678248 s | avg = 0.298915682 s
6: diff = 0.296436913 s | avg = 0.298502553 s
7: diff = 0.296414429 s | avg = 0.298204250 s
8: diff = 0.296148430 s | avg = 0.297947272 s
9: diff = 0.296353964 s | avg = 0.297770238 s
10: diff = 0.296554913 s | avg = 0.297648705 s

===== result =====
count: 10
total: 2.976487059 s
avg: 0.297648705 s

After this change:

nsh> hello
1: diff = 0.322894500 s | avg = 0.322894500 s
2: diff = 0.304714785 s | avg = 0.313804642 s
3: diff = 0.303939787 s | avg = 0.310516357 s
4: diff = 0.304068904 s | avg = 0.308904494 s
5: diff = 0.304900920 s | avg = 0.308103779 s
6: diff = 0.304780429 s | avg = 0.307549887 s
7: diff = 0.304122773 s | avg = 0.307060299 s
8: diff = 0.304266724 s | avg = 0.306711102 s
9: diff = 0.304566429 s | avg = 0.306472805 s
10: diff = 0.303873630 s | avg = 0.306212888 s

===== result =====
count: 10
total: 3.062128881 s
avg: 0.306212888 s


acassis commented Dec 16, 2025


Nice work @anchao !!!

Based on this test alone, I can see the overhead was about 3.26% => ((0.306212888 / 0.296554913) - 1) * 100

So we need to decide whether it makes sense to accept this overhead to get the benefit of this PR. Alternatively, as @hartmannathan pointed out, this feature could be made configurable (although I think NuttX already has enough configuration switches; we should remove some instead of adding more).

@GUIDINGLI (Contributor)

@anchao
Thank you for providing some measurement data.
As I mentioned before, I did not claim that this patch has no performance overhead—any robustness improvement may introduce some performance regression.
We are also actively measuring the data and looking for opportunities to minimize any performance impact.


anchao commented Dec 16, 2025


I think we are aligned on the same objective. NuttX is a real-time operating system, and our modifications should be more deterministic and faster, and should not impose any additional burden on developers familiar with NuttX.

acassis added a commit to apache/nuttx-apps that referenced this pull request Dec 16, 2025
During the discussion about the impact of adding a counter to the TCB,
Mr. anchao created a pthread mutex performance example:
apache/nuttx#17468 (comment)

Because this example can exercise (not ideally) the pthread mutex,
I decided to add it to apps/testing/sched; the program will be
called "pmutexp" (because "pmp" is not a good name).

Signed-off-by: Alan C. Assis <acassis@gmail.com>

hujun260 commented Dec 17, 2025


Test target: nucleo-g431rb:nsh

After this patch:
size nuttx
   text    data     bss     dec     hex filename
 127964     904    8644  137512   21928 nuttx

Before this patch:
size nuttx
   text    data     bss     dec     hex filename
 127232     904    8612  136748   2162c nuttx

Flash size (text + data) increases by 732 bytes.

@GUIDINGLI (Contributor)


As I said before:
If there is a better alternative to solve this problem, then this patch is unnecessary. Patches are welcome.


anchao commented Dec 18, 2025


Could we adopt @acassis's and @hartmannathan's suggestion and implement this feature as a configurable option?
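For illustration, such an option could compile the reference counting out entirely. A minimal sketch, assuming a hypothetical CONFIG_SCHED_TCB_REFCOUNT symbol and placeholder helper names:

#ifdef CONFIG_SCHED_TCB_REFCOUNT
#  define tcb_ref(t)    atomic_fetch_add(&(t)->refcount, 1)
#  define tcb_unref(t)  sched_tcb_unref(t)  /* placeholder: free on last ref */
#else
#  define tcb_ref(t)    /* refcounting compiled out: zero overhead */
#  define tcb_unref(t)
#endif

Small single-core targets could then keep the current size and performance, while builds that want the lifetime guarantee enable the option.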

@jerpelea jerpelea changed the title add a reference count to the TCB to prevent it from being deleted. sched: add a reference count to the TCB to prevent it from being deleted. Dec 19, 2025
@hujun260 (Contributor, Author)


I've added the nxsched_get_tcb_noref interface—its performance is the same as before—and I've also reimplemented nxsched_verify_pid.

Furthermore, I retested the scenario you mentioned, and there's no performance degradation at all.
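For reference, the resulting split would look roughly like this. A sketch inferred from the thread, with tcb_unref() as a placeholder name and the exact signatures unconfirmed:

/* Hot paths (e.g. the mutex-owner check) keep a zero-overhead lookup: */

tcb = nxsched_get_tcb_noref(pid);   /* no reference taken, same cost as before */

/* Paths that may block, log, or hold the TCB across a switch take a reference: */

tcb = nxsched_get_tcb(pid);         /* lookup + reference */
if (tcb != NULL)
  {
    /* ... use the TCB safely ... */

    tcb_unref(tcb);                 /* placeholder release helper */
  }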

@hujun260 (Contributor, Author)

I also ran rtos-benchmark, and the core scenarios are largely consistent.
[screenshot: rtos-benchmark results]

@hujun260 hujun260 force-pushed the apache_12 branch 4 times, most recently from 39057a1 to a4c127c Compare December 20, 2025 09:48
acassis previously approved these changes Dec 20, 2025

anchao commented Dec 21, 2025


It seems that not all call sites have been converted to nxsched_get_tcb_noref, which means system performance will still degrade. My test case is only intended to make you aware of the performance degradation caused by nxsched_get_tcb; the pthread_mutex case is just one example. If I provide other test cases, your patch will show many more performance bottlenecks requiring optimization. Could we address this issue from the perspective of the architectural implementation?

GUIDINGLI previously approved these changes Dec 22, 2025

@GUIDINGLI GUIDINGLI left a comment

LGTM

sched: add a reference count to the TCB to prevent it from being deleted.

To replace the big lock with smaller ones and reduce the large TCB-related
locks: in many scenarios we only need to guarantee that the TCB won't be
released, rather than holding a lock, which reduces the possibility of lock
recursion.

Signed-off-by: hujun5 <hujun5@xiaomi.com>
