Conversation


@hujun260 hujun260 commented Dec 10, 2025

Summary

Add a reference count to the TCB to prevent it from being deleted.

To replace the big lock with smaller ones and reduce the large TCB-related locks: in many scenarios we only need to guarantee that the TCB won't be released, rather than holding a lock, which reduces the possibility of lock recursion.
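For illustration, the intended usage pattern looks roughly like the sketch below. The names use_tcb() and tcb_unref() are placeholders, and the assumption that nxsched_get_tcb() itself now takes a reference is inferred from the discussion below rather than confirmed here:

FAR struct tcb_s *tcb = nxsched_get_tcb(pid); /* lookup, now also takes a reference */
if (tcb != NULL)
  {
    /* The TCB cannot be freed in this window, even if the thread exits
     * concurrently, so no big lock needs to be held across it.
     */

    use_tcb(tcb);

    tcb_unref(tcb); /* placeholder: drop the reference; the last one frees the TCB */
  }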

Should be merged together with apache/nuttx-apps#3246.

Impact

TCB release.

Test target: nucleo-g431rb:nsh

After this patch:
size nuttx
   text    data     bss     dec     hex filename
 127964     904    8644  137512   21928 nuttx

Before this patch:
size nuttx
   text    data     bss     dec     hex filename
 127232     904    8612  136748   2162c nuttx

Flash size (text + data) increases by 732 bytes.

Testing

Tested on hardware:

esp32s3-devkit:nsh

user_main: scheduler lock test
sched_lock: Starting lowpri_thread at 97
sched_lock: Set lowpri_thread priority to 97
sched_lock: Starting highpri_thread at 98
sched_lock: Set highpri_thread priority to 98
sched_lock: Waiting...
sched_lock: PASSED No pre-emption occurred while scheduler was locked.
sched_lock: Starting lowpri_thread at 97
sched_lock: Set lowpri_thread priority to 97
sched_lock: Starting highpri_thread at 98
sched_lock: Set highpri_thread priority to 98
sched_lock: Waiting...
sched_lock: PASSED No pre-emption occurred while scheduler was locked.
sched_lock: Finished

End of test memory usage:
VARIABLE BEFORE AFTER
======== ======== ========
arena 5d8bc 5d8bc
ordblks 7 6
mxordblk 548a0 548a0
uordblks 5014 5014
fordblks 588a8 588a8

Final memory usage:
VARIABLE BEFORE AFTER
======== ======== ========
arena 5d8bc 5d8bc
ordblks 1 6
mxordblk 59238 548a0
uordblks 4684 5014
fordblks 59238 588a8
user_main: Exiting
ostest_main: Exiting with status 0
nsh> u
nsh: u: command not found
nsh>
nsh>
nsh>
nsh> uname -a
NuttX 12.11.0 ef91333e3ac-dirty Dec 10 2025 16:11:04 xtensa esp32s3-devkit
nsh>

@github-actions bot added labels Arch: risc-v, Area: Drivers, Area: File System, Area: Memory Management, Area: OS Components, Size: L on Dec 10, 2025
@hujun260 hujun260 force-pushed the apache_12 branch 3 times, most recently from f192057 to cae0ffd Compare December 11, 2025 06:54
@hujun260 hujun260 requested a review from acassis as a code owner December 11, 2025 06:54
@hartmannathan (Contributor)

Is it possible to make this PR optional? Perhaps it should affect only the bigger SMP-capable chips and not the smaller single-core ones?

@GUIDINGLI (Contributor)

@anchao @acassis
Patch Motivation & Proper Solution:
The main purpose of this patch is to address use-after-free issues that can occur when a thread unexpectedly exits. For example, in scenarios such as proc_open(), when nxsched_get_tcb obtains the TCB but the corresponding thread exits at a higher priority, or during calls to functions like nxclock_gettime(), sched_backtrace(), or mm_memdump(), a similar risk exists. These problems should be systematically addressed by introducing critical sections or by employing other synchronization strategies in those critical paths.

Not Just SMP:
These use-after-free problems can occur in both SMP and UP environments — this is not an SMP-only issue.

Limitations of Critical Sections:
Even with the addition of critical sections, these issues may not be fully resolved. For instance, consider the following example:


mm_memdump() 
{
    enter_critical_section();
    tcb = nxsched_get_tcb(pid);
    syslog();             // might wake up or schedule another thread
    name = get_task_name(tcb);
    leave_critical_section();
}

In this scenario, even with enter_critical_section(), there can still be a race if syslog() internally leads to context switches that affect the TCB’s lifetime.
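For illustration, a reference count closes this window. A minimal sketch, assuming a placeholder tcb_unref() helper rather than the patch's actual API:

mm_memdump()
{
    tcb = nxsched_get_tcb(pid);   /* lookup + take a reference */
    if (tcb != NULL)
      {
        syslog();                 /* may context-switch; the TCB stays valid */
        name = get_task_name(tcb);
        tcb_unref(tcb);           /* drop the reference; the TCB is freed here
                                   * only if the thread already exited */
      }
}

No critical section is needed just to keep the TCB alive, which is exactly the lock reduction this patch aims at.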

On Mainstream RTOS Practice:
The notion that "mainstream RTOS do not incorporate this feature, as the number of threads is predetermined in most scenarios" is not accurate. In a typical IoT system, creating and destroying threads (pthreads) is frequent and widespread—thread counts are rarely static. While in safety-critical automotive contexts thread counts might be fixed, such scenarios are the exception, not the rule, for consumer and general-purpose products.

On Performance and Size Impact:
Statements such as "it will exert an extremely detrimental impact on both size and performance" are not substantiated without concrete measurement or testing. According to the data provided by @hujun260, the actual performance impact is minimal; please carefully review the test report. Regarding code size: yes, adding additional safety checks will increase flash usage, and we will supplement the PR with precise flash size measurements.

If there is a better alternative to solve this problem, then this patch is unnecessary. Patches are welcome.


acassis commented Dec 16, 2025


@GUIDINGLI thank you for bringing more information to this discussion.
I think this commit will be beneficial to NuttX, but the data that @hujun260 brought is not helping much: the osperf results are inconclusive, see:

  1. Why did context-switch decrease 3.36x?
  2. Why did hpwork decrease as well?
  3. Why did pipe-rw show a 35% overhead?
  4. Why did semwait become almost 2x faster?

It seems osperf is not reliable for this kind of test; maybe we need to use SEGGER J-Link RTT or ORBTrace/Orbuculum to get more precise/reliable data.

@hartmannathan (Contributor)


@GUIDINGLI @hujun260 This explanation is extremely helpful for giving context and understanding this change. Please add it to the PR description and to the commit log so that it can be easily understood in the future. Thank you!

@hartmannathan hartmannathan left a comment

Please add @GUIDINGLI's explanation to the commit log and PR description because it helps to understand the changes.

@GUIDINGLI (Contributor)

@acassis @hartmannathan
Thanks for the feedback.
We will re-test the performance data, not only with osperf but also with rtos-benchmark, and will also provide the flash size data.


anchao commented Dec 16, 2025


You can try running this code. The nxsched_get_tcb() function incurs a 2.5-4% performance overhead, and this function is used throughout NuttX:

#include <nuttx/config.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/****************************************************************************
 * Public Functions
 ****************************************************************************/

/****************************************************************************
 * hello_main
 ****************************************************************************/
pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

static void timespec_diff(const struct timespec *start, const struct timespec *end, struct timespec *diff) {
    diff->tv_sec = end->tv_sec - start->tv_sec;
    diff->tv_nsec = end->tv_nsec - start->tv_nsec;
    
    if (diff->tv_nsec < 0) {
        diff->tv_sec--;
        diff->tv_nsec += 1000000000;
    }
}

static void timespec_add(struct timespec *total, const struct timespec *diff) {
    total->tv_sec += diff->tv_sec;
    total->tv_nsec += diff->tv_nsec;
    
    if (total->tv_nsec >= 1000000000) {
        total->tv_sec += total->tv_nsec / 1000000000;
        total->tv_nsec = total->tv_nsec % 1000000000;
    }
}

static void timespec_avg(const struct timespec *total, int count, struct timespec *avg) {
    uint64_t total_ns = (uint64_t)total->tv_sec * 1000000000 + total->tv_nsec;
    uint64_t avg_ns = total_ns / count;
    
    avg->tv_sec = avg_ns / 1000000000;
    avg->tv_nsec = avg_ns % 1000000000;
}

int main(int argc, char *argv[])                                                                               
{             
    struct timespec start;
    struct timespec end;                                                                                             
    struct timespec diff;
    struct timespec total = {0, 0};
    struct timespec avg;           
    int i, j = 0;
    const int loop_count = 10;
    
    while (j < loop_count)
    { 
        i = 0;
        j++;
        
        clock_gettime(CLOCK_BOOTTIME, &start);
        pthread_mutex_lock(&g_mutex);
        while (i < 1000 * 1000)
        { 
            i++;
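            /* The mutex is already held, so trylock fails with EBUSY; per
             * @anchao this path still performs the owner lookup via
             * nxsched_get_tcb(), which is the overhead being measured.
             */
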
            pthread_mutex_trylock(&g_mutex);
        }
        pthread_mutex_unlock(&g_mutex);
        
        clock_gettime(CLOCK_BOOTTIME, &end);
        
        timespec_diff(&start, &end, &diff);
        
        timespec_add(&total, &diff);
        
        timespec_avg(&total, j, &avg);
        
        printf("%d: diff = %lu.%09lu s | avg = %lu.%09lu s\n", 
               j, 
               (unsigned long)diff.tv_sec, (unsigned long)diff.tv_nsec,
               (unsigned long)avg.tv_sec, (unsigned long)avg.tv_nsec);
    }
    
    printf("\n===== result =====\n");
    printf("count: %d\n", loop_count);
    printf("total: %lu.%09lu s\n", (unsigned long)total.tv_sec, (unsigned long)total.tv_nsec);
    printf("avg: %lu.%09lu s\n", (unsigned long)avg.tv_sec, (unsigned long)avg.tv_nsec);
    
    return 0;
}

Additionally, I conducted the validation on the sim host PC; running this test on an MCU may result in even greater performance degradation:

Before this change:

nsh> hello
1: diff = 0.306540108 s | avg = 0.306540108 s
2: diff = 0.297250552 s | avg = 0.301895330 s
3: diff = 0.296764389 s | avg = 0.300185016 s
4: diff = 0.297345113 s | avg = 0.299475040 s
5: diff = 0.296678248 s | avg = 0.298915682 s
6: diff = 0.296436913 s | avg = 0.298502553 s
7: diff = 0.296414429 s | avg = 0.298204250 s
8: diff = 0.296148430 s | avg = 0.297947272 s
9: diff = 0.296353964 s | avg = 0.297770238 s
10: diff = 0.296554913 s | avg = 0.297648705 s

===== result =====
count: 10
total: 2.976487059 s
avg: 0.297648705 s

After this change:

nsh> hello
1: diff = 0.322894500 s | avg = 0.322894500 s
2: diff = 0.304714785 s | avg = 0.313804642 s
3: diff = 0.303939787 s | avg = 0.310516357 s
4: diff = 0.304068904 s | avg = 0.308904494 s
5: diff = 0.304900920 s | avg = 0.308103779 s
6: diff = 0.304780429 s | avg = 0.307549887 s
7: diff = 0.304122773 s | avg = 0.307060299 s
8: diff = 0.304266724 s | avg = 0.306711102 s
9: diff = 0.304566429 s | avg = 0.306472805 s
10: diff = 0.303873630 s | avg = 0.306212888 s

===== result =====
count: 10
total: 3.062128881 s
avg: 0.306212888 s


acassis commented Dec 16, 2025


Nice work @anchao !!!

Based on this test alone, I can see the overhead was about 3.26% => ((0.306212888 / 0.296554913) - 1) * 100

So we need to decide whether it makes sense to accept this overhead to get the benefit of this PR. Alternatively, as @hartmannathan pointed out, this feature could be made configurable (although I think NuttX already has enough configuration switches; we should remove some instead of adding more).

@GUIDINGLI (Contributor)

@anchao
Thank you for providing some measurement data.
As I mentioned before, I did not claim that this patch has no performance overhead—any robustness improvement may introduce some performance regression.
We are also actively measuring the data and looking for opportunities to minimize any performance impact.


anchao commented Dec 16, 2025


I think we are aligned on the same objective. NuttX is a real-time operating system, and our modifications should be more deterministic and faster, and should not impose any additional burden on developers familiar with NuttX.

acassis added a commit to apache/nuttx-apps that referenced this pull request Dec 16, 2025
During the discussion about the impact of adding a counter to the TCB,
Mr. anchao created a pthread mutex performance example:
apache/nuttx#17468 (comment)

Because this example can exercise (not ideally) the pthread mutex,
I decided to add it to apps/testing/sched; the program will be
called "pmutexp" (because "pmp" is not a good name).

Signed-off-by: Alan C. Assis <acassis@gmail.com>

hujun260 commented Dec 17, 2025


Test target: nucleo-g431rb:nsh

After this patch:
size nuttx
   text    data     bss     dec     hex filename
 127964     904    8644  137512   21928 nuttx

Before this patch:
size nuttx
   text    data     bss     dec     hex filename
 127232     904    8612  136748   2162c nuttx

Flash size (text + data) increases by 732 bytes.

@GUIDINGLI (Contributor)


As I said before:
If there is a better alternative to solve this problem, then this patch is unnecessary. Patches are welcome.


anchao commented Dec 18, 2025


Could we adopt @acassis's and @hartmannathan's suggestion and implement this feature as a configurable option?
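For illustration, such an option could compile the reference counting out entirely. A minimal sketch, assuming a hypothetical CONFIG_SCHED_TCB_REFCOUNT symbol and placeholder helper names:

#ifdef CONFIG_SCHED_TCB_REFCOUNT
#  define tcb_ref(t)    atomic_fetch_add(&(t)->refcount, 1)
#  define tcb_unref(t)  sched_tcb_unref(t)  /* placeholder: free on last ref */
#else
#  define tcb_ref(t)    /* refcounting compiled out: zero overhead */
#  define tcb_unref(t)
#endif

Small single-core targets could then keep the current size and performance, while builds that want the lifetime guarantee enable the option.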

@jerpelea jerpelea changed the title add a reference count to the TCB to prevent it from being deleted. sched: add a reference count to the TCB to prevent it from being deleted. Dec 19, 2025
@hujun260 (Contributor, Author)


I've added the nxsched_get_tcb_noref interface—its performance is the same as before—and I've also reimplemented nxsched_verify_pid.

Furthermore, I retested the scenario you mentioned, and there's no performance degradation at all.
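For reference, the resulting split would look roughly like this. A sketch inferred from the thread, with tcb_unref() as a placeholder name and the exact signatures unconfirmed:

/* Hot paths (e.g. the mutex-owner check) keep a zero-overhead lookup: */

tcb = nxsched_get_tcb_noref(pid);   /* no reference taken, same cost as before */

/* Paths that may block, log, or hold the TCB across a switch take a reference: */

tcb = nxsched_get_tcb(pid);         /* lookup + reference */
if (tcb != NULL)
  {
    /* ... use the TCB safely ... */

    tcb_unref(tcb);                 /* placeholder release helper */
  }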

@hujun260 (Contributor, Author)

I also ran rtos-benchmark, and the core scenarios are largely consistent.
[screenshot: rtos-benchmark results]

@hujun260 hujun260 force-pushed the apache_12 branch 4 times, most recently from 39057a1 to a4c127c Compare December 20, 2025 09:48
acassis previously approved these changes Dec 20, 2025

anchao commented Dec 21, 2025


It seems that not all call sites have been converted to nxsched_get_tcb_noref, which means system performance will still degrade. My test case is only intended to make you aware of the performance degradation caused by nxsched_get_tcb; the pthread_mutex case is just one example. If I provide other test cases, your patch will show many more performance bottlenecks requiring optimization. Could we address this issue from the perspective of the architectural implementation?

GUIDINGLI previously approved these changes Dec 22, 2025

@GUIDINGLI GUIDINGLI left a comment

LGTM

sched: add a reference count to the TCB to prevent it from being deleted.

To replace the big lock with smaller ones and reduce the large TCB-related
locks: in many scenarios we only need to guarantee that the TCB won't be
released, rather than holding a lock, which reduces the possibility of lock
recursion.

Signed-off-by: hujun5 <hujun5@xiaomi.com>
