Skip to content

Conversation

@Anri-Lombard
Copy link
Contributor

Propagates exceptions from CPU scheduler worker threads to Python, preventing process crashes when operations like mx.linalg.solve encounter errors (e.g., singular matrices).

Closes #2888. Supersedes #2964.

This revision addresses @awni's feedback from the original PR:

On state management - the task wrapper in encoder.h now uses try-catch to guarantee notify_task_completion() always runs. This prevents the task counter from staying elevated and causing deadlocks. Memory cleanup is handled via RAII.

On timing - added check_cpu_exceptions() to check all CPU streams, and now checking exceptions after every wait_for_one() call in transforms.cpp. Exceptions surface during the evaluation loop, not just at synchronize().

@awni - please let me know if there are any areas I might have missed! 🙏

Exceptions thrown in CPU scheduler worker threads (e.g., when LAPACK
detects a singular matrix) were not being caught, causing process
crashes instead of raising Python exceptions.

Added exception capture in StreamThread and re-throwing during
synchronize() so errors propagate as RuntimeError to Python.

Fixes ml-explore#2888
- Wrap task in try-catch to guarantee notify_task_completion runs
- Add check_cpu_exceptions() to check all CPU streams
- Check exceptions after wait_for_one() calls in transforms.cpp

Fixes state management concern: task counter always decremented.
Fixes timing concern: exceptions checked during eval, not just sync.
@zcbenz
Copy link
Collaborator

zcbenz commented Jan 18, 2026

I think this does not really solve the most important thing mentioned in the original comment:

The challenge with this is not so much propagating the exception but making sure the MLX is in a sane state when the exception is raised. Otherwise we don't want to give the client to the opportunity to catch the exception because they may assume it's safe to continue.

We need to ensure MLX is exception-safe before working on surfacing all errors to users, especially when the solution could add runtime overhead and maintenance burden.

@Anri-Lombard
Copy link
Contributor Author

Thanks for checking @zcbenz. I guess this needs more work.

Would a "poisoned state" mechanism address this? The idea:

  1. Set a flag when any exception is captured
  2. Check at eval entry - if poisoned, throw "MLX encountered an error. Call mx.reset() to continue"
  3. mx.reset() clears the flag

@zcbenz
Copy link
Collaborator

zcbenz commented Jan 18, 2026

Before calling eval_cpu/eval_gpu there are usually some prepare works like calling add_temporary and set_input_array, when eval_cpu throws the temporary state does not get cleanup up and the program would be in a stale state, so we would want to crash in this case, instead of surfacing the exception to user and continuing execution.

And this is just one of the many cases that MLX is not exception-safe, so I would say this is something we should only look into after we get everything else stable.

@Anri-Lombard
Copy link
Contributor Author

Is exception-safety something you'd want prioritized, or should I close this for now?

@awni
Copy link
Member

awni commented Jan 19, 2026

Is exception-safety something you'd want prioritized, or should I close this for now?

It's not a top priority for us (as in no-one is planning to work on it in the near term).. but I do think we should strive towards having it -- conditioning on a) not making code too complex and b) not making code too slow. So I think we would review and accept PRs that match those conditions but make the code more exception safe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Singular matrix

3 participants