Fix scheduler exception propagation to Python #2983

Anri-Lombard · 2026-01-11T07:14:14Z

Propagates exceptions from CPU scheduler worker threads to Python, preventing process crashes when operations like mx.linalg.solve encounter errors (e.g., singular matrices).

Closes #2888. Supersedes #2964.

This revision addresses @awni's feedback from the original PR:

On state management - the task wrapper in encoder.h now uses try-catch to guarantee notify_task_completion() always runs. This prevents the task counter from staying elevated and causing deadlocks. Memory cleanup is handled via RAII.

On timing - added check_cpu_exceptions() to check all CPU streams, and now checking exceptions after every wait_for_one() call in transforms.cpp. Exceptions surface during the evaluation loop, not just at synchronize().

@awni - please let me know if there are any areas I might have missed! 🙏

Exceptions thrown in CPU scheduler worker threads (e.g., when LAPACK detects a singular matrix) were not being caught, causing process crashes instead of raising Python exceptions. Added exception capture in StreamThread and re-throwing during synchronize() so errors propagate as RuntimeError to Python. Fixes ml-explore#2888

- Wrap task in try-catch to guarantee notify_task_completion runs - Add check_cpu_exceptions() to check all CPU streams - Check exceptions after wait_for_one() calls in transforms.cpp Fixes state management concern: task counter always decremented. Fixes timing concern: exceptions checked during eval, not just sync.

zcbenz · 2026-01-18T09:25:03Z

I think this does not really solve the most important thing mentioned in the original comment:

The challenge with this is not so much propagating the exception but making sure the MLX is in a sane state when the exception is raised. Otherwise we don't want to give the client to the opportunity to catch the exception because they may assume it's safe to continue.

We need to ensure MLX is exception-safe before working on surfacing all errors to users, especially when the solution could add runtime overhead and maintenance burden.

Anri-Lombard · 2026-01-18T13:04:10Z

Thanks for checking @zcbenz. I guess this needs more work.

Would a "poisoned state" mechanism address this? The idea:

Set a flag when any exception is captured
Check at eval entry - if poisoned, throw "MLX encountered an error. Call mx.reset() to continue"
mx.reset() clears the flag

zcbenz · 2026-01-18T23:04:44Z

Before calling eval_cpu/eval_gpu there are usually some prepare works like calling add_temporary and set_input_array, when eval_cpu throws the temporary state does not get cleanup up and the program would be in a stale state, so we would want to crash in this case, instead of surfacing the exception to user and continuing execution.

And this is just one of the many cases that MLX is not exception-safe, so I would say this is something we should only look into after we get everything else stable.

Anri-Lombard · 2026-01-19T18:15:36Z

Is exception-safety something you'd want prioritized, or should I close this for now?

awni · 2026-01-19T18:19:58Z

Is exception-safety something you'd want prioritized, or should I close this for now?

It's not a top priority for us (as in no-one is planning to work on it in the near term).. but I do think we should strive towards having it -- conditioning on a) not making code too complex and b) not making code too slow. So I think we would review and accept PRs that match those conditions but make the code more exception safe.

Anri-Lombard added 2 commits January 3, 2026 18:09

Anri-Lombard closed this Jan 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix scheduler exception propagation to Python #2983

Fix scheduler exception propagation to Python #2983

Anri-Lombard commented Jan 11, 2026

Uh oh!

zcbenz commented Jan 18, 2026

Uh oh!

Anri-Lombard commented Jan 18, 2026

Uh oh!

zcbenz commented Jan 18, 2026

Uh oh!

Anri-Lombard commented Jan 19, 2026

Uh oh!

awni commented Jan 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Fix scheduler exception propagation to Python #2983

Fix scheduler exception propagation to Python #2983

Conversation

Anri-Lombard commented Jan 11, 2026

Uh oh!

zcbenz commented Jan 18, 2026

Uh oh!

Anri-Lombard commented Jan 18, 2026

Uh oh!

zcbenz commented Jan 18, 2026

Uh oh!

Anri-Lombard commented Jan 19, 2026

Uh oh!

awni commented Jan 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants