Description
Agent Diagnostic
GPU gateway fails on DGX Spark — missing cgroup controllers (kubepods)
Summary
The GPU-enabled gateway fails to start on a DGX Spark (Founders Edition) when NemoClaw runs openshell gateway start --name nemoclaw --gpu. The non-GPU gateway works perfectly — openshell sandbox create succeeds and I can connect to the sandbox without issues.
Environment
- Hardware: NVIDIA DGX Spark (Founders Edition), GB10 Grace Blackwell Superchip, 128GB unified memory
- Architecture: aarch64
- OS: DGX OS (Ubuntu-based)
- Kernel: (run `uname -r` and paste here)
- OpenShell version: 0.0.6
- Docker:
  - Cgroup Driver: systemd
  - Cgroup Version: 2
- NVIDIA Container Toolkit: 1.19.0
- NVIDIA GPU: 1 GPU detected, 124610 MB VRAM
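For completeness, the environment fields above can be captured with a few standard commands. This is a convenience sketch; the `docker` and `nvidia-ctk` lines are guarded so the script still runs on machines where those tools are absent:

```shell
# Collect the environment details reported above.
uname -r                          # kernel (paste into the Kernel field)
uname -m                          # architecture: aarch64 on DGX Spark
stat -fc %T /sys/fs/cgroup/ 2>/dev/null || true   # cgroup2fs => cgroup v2
command -v docker >/dev/null 2>&1 \
  && docker info --format 'cgroup driver={{.CgroupDriver}} version={{.CgroupVersion}}' \
  || true
command -v nvidia-ctk >/dev/null 2>&1 && nvidia-ctk --version || true
```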
Steps to Reproduce
1. Install OpenShell v0.0.6 via `curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/main/install.sh | sh`
2. Confirm the non-GPU gateway works: `openshell sandbox create` → succeeds, sandbox created and connectable
3. Stop the existing gateway: `openshell gateway stop`
4. Clone NemoClaw: `git clone https://github.com/NVIDIA/NemoClaw.git`
5. Run `cd NemoClaw && ./install.sh`
6. The installer reaches step [2/7] and runs `openshell gateway start --name nemoclaw --gpu`
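For convenience, the numbered steps above as a single copy/paste transcript. The commands are printed rather than executed here so the script is safe to run anywhere; drop the leading `echo`s to run them for real:

```shell
# Reproduction transcript; each line mirrors a numbered step above.
echo 'curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/main/install.sh | sh'
echo 'openshell sandbox create     # non-GPU path: succeeds'
echo 'openshell gateway stop'
echo 'git clone https://github.com/NVIDIA/NemoClaw.git'
echo 'cd NemoClaw && ./install.sh  # reaches step [2/7], which runs:'
echo 'openshell gateway start --name nemoclaw --gpu'
```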
Expected Behaviour
GPU-enabled gateway starts successfully.
Actual Behaviour
Gateway container starts but the K3s cluster inside it fails at kubelet startup with missing cgroup controllers:
```
Error: × K8s namespace not ready
  ╰─▶ gateway container is not running while waiting for namespace 'openshell': container exited
      (status=EXITED, exit_code=1)
```
Key error from container logs:
```
E0316 21:10:05.545118 118 cgroup_manager_linux.go:406] "Unhandled Error" err="cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubepods/pids.max: no such file or directory"
E0316 21:10:05.545156 118 kubelet.go:1745] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: error validating root container [kubepods] : cgroup [\"kubepods\"] has some missing controllers: cpu, cpuset, hugetlb, memory, pids"
```
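A quick way to see which controllers are available versus delegated. This reflects my reading of the failure, not a confirmed root cause: on cgroup v2, a child group such as `kubepods` can only enable controllers that its parent lists in `cgroup.subtree_control`:

```shell
# Print the controllers the root cgroup supports and the ones it delegates
# to children. Controllers missing from subtree_control would explain the
# "missing controllers" kubelet error above.
for f in cgroup.controllers cgroup.subtree_control; do
  printf '%s: ' "$f"
  cat "/sys/fs/cgroup/$f" 2>/dev/null || echo '(unavailable)'
done
```

Running the same two reads inside the gateway container (via `docker exec`) would show what the container sees compared with the host.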
Additional Context
- The DGX Spark uses cgroup v2 with the systemd driver (`stat -fc %T /sys/fs/cgroup/` returns `cgroup2fs`).
- The non-GPU gateway works fine on this system, so the issue appears specific to the `--gpu` gateway variant.
- The failing `openat2` on `/sys/fs/cgroup/kubepods/pids.max` suggests the `pids` controller was never enabled for the `kubepods` group; the kubelet inside the GPU gateway container may not be correctly configured for cgroup v2 controller delegation.
- NVIDIA Container Toolkit 1.19.0 is installed and functional.
- This was tested on the day of the NemoClaw GTC 2026 announcement (16 March 2026).
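If the missing-delegation reading above is right, one possible mitigation is to delegate the five controllers named in the kubelet error at the container's cgroup root. This is a hypothetical sketch, not a confirmed fix; it needs root inside the gateway container, and the write can be refused by the kernel's no-internal-processes rule:

```shell
#!/bin/sh
# Hypothetical mitigation sketch: on cgroup v2, a child group such as
# kubepods can only use controllers listed in its parent's
# cgroup.subtree_control. Try to delegate the five controllers named in
# the kubelet error; degrade to a message when delegation is not possible.
ctl=/sys/fs/cgroup/cgroup.subtree_control
want="+cpu +cpuset +hugetlb +memory +pids"
if [ -w "$ctl" ]; then
  if echo "$want" > "$ctl" 2>/dev/null; then
    echo "delegated: $want"
  else
    echo "delegation refused (processes still attached to the root group?)"
  fi
else
  echo "no write access to $ctl; run as root inside the gateway container"
fi
```

A cleaner long-term fix would presumably be for the GPU gateway image to configure its embedded K3s/kubelet for cgroup v2 delegation, but that is a guess about where the bug lives.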
Agent-First Checklist
- I pointed my agent at the repo and had it investigate this issue
- I loaded relevant skills (e.g., `debug-openshell-cluster`, `debug-inference`, `openshell-cli`)
- My agent could not resolve this — the diagnostic above explains why