
Conversation

@jenshannoschwalm
Collaborator

When starting the dt GUI and calling `gtk_main()`, we must be sure we can safely dispatch a control job. We can do so only if all control threads are running.

We achieve this by adding a `control->running` atomic counter that is incremented whenever one of the threads' code has started. As we know the number of started threads, we can check and wait (for up to 1 sec) until all threads are up and running before leaving `dt_control_jobs_init()`.

The new function `gboolean dt_control_all_running()` compares running vs. requested threads and is used to check whether we can finally call `gtk_main()`.
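
In pseudo-code, the handshake looks roughly like this (the struct layout, helper names, and the explicit `control` parameter are illustrative assumptions; as noted above, the real `dt_control_all_running()` takes no arguments):

```c
#include <glib.h>

typedef struct dt_control_t
{
  gint running;     // incremented atomically by each worker thread on entry
  int num_threads;  // number of worker threads that were requested
} dt_control_t;

// first thing each worker thread does once its code actually starts executing
static void _announce_running(dt_control_t *control)
{
  g_atomic_int_inc(&control->running);
}

// TRUE once every requested thread has announced itself
gboolean dt_control_all_running(dt_control_t *control)
{
  return g_atomic_int_get(&control->running) >= control->num_threads;
}

// at the end of dt_control_jobs_init(): wait up to ~1 sec for the workers
static void _wait_for_workers(dt_control_t *control)
{
  for(int i = 0; i < 100 && !dt_control_all_running(control); i++)
    g_usleep(10000); // 10 ms per iteration, 100 iterations = 1 sec
}
```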

@wpferguson this would be my "proper" solution for the hypothesis that we have a race condition as the root cause of #19992. At least: if we dispatch a control job in either lighttable or darkroom, it could be that the worker for that job is not running yet, so the job ends up in nirvana. I have to confess I couldn't trigger it even a single time (but I am mostly on my slow notebook still).

Would you and the others discussing in #19992 be able to compile and test? (@victoryforce @gi-man)

@TurboGit the code is absolutely safe to me.

@jenshannoschwalm added the `bugfix` label (pull request fixing a bug) on Dec 26, 2025
@wpferguson
Member

Can't produce a hang right now, maybe the logo change?

@jenshannoschwalm
Collaborator Author

> Can't produce a hang right now, maybe the logo change?

Sorry - I don't understand.

@TurboGit added this to the 5.4.1 milestone on Dec 26, 2025
@wpferguson
Member

The darktable icon changed to the Santa hat.

@TurboGit merged commit 6d50199 into darktable-org:master on Dec 27, 2025
5 checks passed
@jenshannoschwalm deleted the safe_control_start branch on December 28, 2025 at 05:03
@dterrahe
Member

> When starting the dt GUI and calling `gtk_main()`, we must be sure we can safely dispatch a control job. We can do so only if all control threads are running.

Could a little explanation be added here for why this is so? It might help someone with more knowledge of concurrency (or, in the absence of that, maybe even me) to clean up some of this code in the future. There are several places that seem to check things that shouldn't have to be checked because, well, it can't hurt. But too many different mechanisms trying to regulate concurrency tend to get in each other's way, and if there's no clear understanding of how they all work and why they are needed, it will never be possible to clean this up and test it.

Naively, I would think that if we dispatch a job, it should be good enough if it eventually gets picked up. So as long as we know that a thread will be created ("we are in GUI mode"), we should be good. Why do we need to know the receiver is running? How would this help? Even if it is running, it might be doing something else at the moment; how would that be different from "still starting up"?

Asking because, reading #19992, I get the impression this didn't solve anything. So will it be reverted? Or will we just keep it around "just in case"? Until it causes problems and no one can figure out why it was there anyway, and then we'll implement some even more complicated workaround to keep everything "working", even though we don't know what that means and therefore can't test whether it actually still does what it's supposed to do?

@TurboGit @jenshannoschwalm @ralfbrown in case you don't get notifications for merged PRs anymore.

```diff
 #endif
   /* start the event loop */
-  if(dt_control_running())
+  if(dt_control_all_running())
```
Member

If not all jobs have really started yet by the time we get here, we just skip `gtk_main()` and immediately quit? In what situation does this actually solve anything?

Collaborator Author

We might reach this point without all threads running safely.

@jenshannoschwalm
Collaborator Author

jenshannoschwalm commented on Dec 29, 2025

I would keep it and even add #20025 :-)

It's not difficult at all. Together, both PRs ensure that the control job dispatcher is fully functional when `dt_control_jobs_init()` exits, or at least leave a note in the logs. Nothing more.

If we dispatch a job before that, we might not be sure what exactly happened, and we would want to avoid that.
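
As a sketch of the "note in the logs" idea (the actual message and mechanism in #20025 may differ; `g_warning` is used here just as a stand-in for darktable's logging):

```c
// after the bounded wait at the end of dt_control_jobs_init()
if(!dt_control_all_running())
  g_warning("control: not all worker threads are running yet");
```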

@dterrahe
Member

> ensure that the control job dispatcher is fully functional

I'm just really not sure what you mean by this. The threads have been created with `pthread_create` and no error was returned from that call, which means that each thread has either been scheduled to run or has already run at least once. Are you saying that what it does during that first run makes a difference? I don't see that it does (it sets a name that nothing depends on?). It will just pick up jobs as and when it is scheduled again.

The only thing that I see this PR doing is a more complicated version of checking that 1+1 = 2. It has to be, otherwise you might as well throw the whole thing out the window.

The other thing it obviously does is add more delays. So yes, that might "help" with race conditions, in the sense that it may make them less frequent. But that is worse, not better, because less frequent bugs are harder to reproduce and therefore harder to fix. So I really don't see the benefit of adding stuff like this that doesn't obviously solve anything. I know that I won't get anything out of arguing this (least of all a clear answer), but at least I wanted to document here that there's no problem being solved by this, so that whoever comes across this PR in the future doesn't have to deal with the argument that "it must have been good for something, otherwise it would not have been added, so now you can never ever remove it. Or touch it, because you can't test that you didn't break what it was supposed to be doing." Maintainability suffers.

On a separate note, a while back @ralfbrown and I were wondering why in `dt_control_shutdown` there's a loop

```c
for(int k = 0; k < s->num_threads-1; k++)
  err = dt_pthread_join(s->thread[k]);
```

which seems to leave the last thread hanging. Is this intentional? If so, could the reason be lightly documented there?

@jenshannoschwalm
Collaborator Author

@dterrahe

> a while back @ralfbrown and I were wondering why in `dt_control_shutdown` there's a loop

Clearly wrong code; I simply can't remember if I got notice of this in any discussion ... a PR has just been opened.
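
For reference, if the off-by-one is indeed unintentional, the presumably intended loop simply joins every worker; a sketch based only on the snippet quoted above:

```c
// join all worker threads, including the last one
for(int k = 0; k < s->num_threads; k++)
  err = dt_pthread_join(s->thread[k]);
```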

I fully accept that if I think or write code that is "crap" it can/should be called that, and of course such code should be reverted. And I think you know that I take all your criticism as constructive. And I am far from trying to prove 1+1 = 2 :-) ...

When I started trying to understand from the available reports (I could not reproduce even a single time on two machines) what's happening, all descriptions for me a) hinted at a race condition, leading b) to an unresponsive UI in `dt_gui_gtk_run()` / `gtk_main()`. My hypothesis was: some control/kicker job was "blocking" or executing "bad stuff". So the commit ensured that threads are only "counted as running" once they have actually reached the `dt_control_running()` loop, and only then can `dt_control_jobs_init()` exit. I don't know if that's absolutely crazy/bad or if it would count as "defensive". No other intentional delays.

Right now I think the issue in the gphoto thread is a better candidate for being the culprit of the hanging, and if we conclude "the above is not a problem at all" we can of course revert.

> But that is worse, not better, because less frequent bugs are harder to reproduce and therefore harder to fix.

Fully agreed, and nothing to discuss here; "HIT HARD AND FAST" was one of the first main FORTH lessons I got.

> I know that I won't get anything out of arguing this (least of all a clear answer)

That took a bit to swallow ...

@dterrahe
Member

> I am far from trying to prove 1+1 = 2 :-)

POSIX says that when the thread has been created without returning an error, it is live. It will run. There's no need to wait until it has run at least once, because you'll be in the same position then; it will be scheduled again to be allocated another time slot, but if you don't believe that is enough of a guarantee, you'll have to wait for the 2nd and 3rd etc. time slots as well.
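
A toy illustration of that point, with hypothetical names rather than darktable's actual queue code: a job enqueued before the worker thread has ever run is not lost; the worker finds it on its first time slice.

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cond = PTHREAD_COND_INITIALIZER;
static int pending = 0; // stand-in for a real job queue

static void *worker(void *arg)
{
  (void)arg;
  pthread_mutex_lock(&q_lock);
  while(pending == 0) // sleep only if there is nothing to do
    pthread_cond_wait(&q_cond, &q_lock);
  pending--; // "run" the job
  pthread_mutex_unlock(&q_lock);
  puts("job picked up");
  return NULL;
}

int main(void)
{
  // dispatch a job *before* the worker thread even exists
  pthread_mutex_lock(&q_lock);
  pending++;
  pthread_cond_signal(&q_cond);
  pthread_mutex_unlock(&q_lock);

  pthread_t t;
  if(pthread_create(&t, NULL, worker, NULL) == 0) // success: the thread will run
    pthread_join(t, NULL);
  return 0;
}
```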

> No other intentional delays.

Understood. I'm just saying that doing more "stuff" introduces a wait, which may make crashes less frequent but does not resolve them.

> That took a bit to swallow ...

I know this was unfair, especially to you, and I apologise. Sorry.

I come from a culture where it is acceptable to harshly challenge each other. In a weird way it is sort of a sign of respect: "I think you're doing something crazy here, but it is you, so it can't be. Please help me understand by explaining." But also, explaining an issue you're struggling with to someone else often helps form a better understanding, and sometimes a lightbulb moment. And if not, at least there's a bit of documentation, even if that is just doubt about whether this is correct or a real solution. That will help future archeologists, or people who want to continue the investigation now. None of us here have time to start from first principles when we want to help others resolve issues.

I still have my doubts regarding this patch for the situation where the event loop gets reached before `dt_control_all_running()` returns true. It looks to me like the program will then just terminate (without waiting for all the threads to come up). Maybe that is better than a hang, but not much better. The only situation where I can imagine this happening is if there are many cores but the system is heavily loaded, and the main thread gets an outsized allocation, making it reach `gtk_main()` before other threads have been allocated their first time slice. As said above, I don't think this is a problem at all, so there's no need to make this even more "robust" by waiting instead of exiting. I'm still in favor of reverting this.

@jenshannoschwalm
Collaborator Author

All good, I also don't mind harsh tones :-)

Let's continue working on understanding and fixing the issue in the other threads.

jenshannoschwalm added a commit to jenshannoschwalm/darktable that referenced this pull request on Jan 10, 2026
…0002'

As discussed there this was not a fix and added misleading code, so let's revert. Also removed the pthread naming for the jobs as it didn't work correctly, and we have the jobs logged anyway.
