Vouch request: brianwtaylor #395
brianwtaylor
started this conversation in
Vouch Request
What do you want to work on?
Fix #181: GPU setup failures shouldn't take down the whole cluster. I'd add a fallback path so the cluster still comes up without GPU workloads if something in the GPU chain breaks (missing runtime, device-plugin timeout, driver mismatch, etc.), with clear log output about what failed.
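To make the idea concrete, here is a rough sketch of the fallback shape I have in mind. It is not the project's actual code: `GpuSetupError`, `GpuSetupResult`, `setup_gpu_with_fallback`, and the two example steps are hypothetical names made up for illustration, and the real setup steps would be whatever the cluster bootstrap already runs.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

log = logging.getLogger("cluster.gpu")


class GpuSetupError(Exception):
    """Hypothetical error type raised by a GPU setup step that cannot complete."""


@dataclass
class GpuSetupResult:
    gpu_enabled: bool
    failed_step: Optional[str] = None
    reason: Optional[str] = None


def setup_gpu_with_fallback(
    steps: List[Tuple[str, Callable[[], None]]],
) -> GpuSetupResult:
    """Run GPU setup steps in order; on the first failure, log it and
    report a CPU-only result instead of propagating the error."""
    for name, step in steps:
        try:
            step()
        except GpuSetupError as exc:
            # Name the failing step so the user can see what broke.
            log.warning(
                "GPU setup step %r failed: %s; continuing without GPU workloads",
                name,
                exc,
            )
            return GpuSetupResult(gpu_enabled=False, failed_step=name, reason=str(exc))
    log.info("GPU setup completed; GPU workloads enabled")
    return GpuSetupResult(gpu_enabled=True)


# Hypothetical steps standing in for the real checks.
def check_container_runtime() -> None:
    return None  # pretend the GPU container runtime was found


def wait_for_device_plugin() -> None:
    raise GpuSetupError("device plugin not ready before timeout")


result = setup_gpu_with_fallback(
    [
        ("container-runtime", check_container_runtime),
        ("device-plugin", wait_for_device_plugin),
    ]
)
# The caller carries on with cluster startup either way and only
# schedules GPU workloads when result.gpu_enabled is True.
```

Only the dedicated error type is caught here, so an unexpected bug in a step would still surface rather than being silently downgraded to CPU-only mode. Whether that is the right boundary is open for discussion.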
I have two DGX Sparks (GB10, aarch64, cgroup v2) connected over QFD so I can test both success and failure paths. I've been contributing to NemoClaw — cgroup v2 preflight fix (PR #62, merged).
Why this change?
Too many moving parts in GPU setup for it to be a hard gate. Spark users especially hit edge cases with unified memory and aarch64 that discrete GPU hosts don't see. Cluster should still be usable.
Checklist