bug: GPU gateway fails on DGX Spark - missing cgroup controller #374

@lupinrider

Description

Agent Diagnostic

GPU gateway fails on DGX Spark — missing cgroup controllers (kubepods)

Summary

The GPU-enabled gateway fails to start on a DGX Spark (Founders Edition) when NemoClaw runs openshell gateway start --name nemoclaw --gpu. The non-GPU gateway works perfectly — openshell sandbox create succeeds and I can connect to the sandbox without issues.

Environment

  • Hardware: NVIDIA DGX Spark (Founders Edition), GB10 Grace Blackwell Superchip, 128GB unified memory
  • Architecture: aarch64
  • OS: DGX OS (Ubuntu-based)
  • Kernel: (run uname -r and paste here)
  • OpenShell version: 0.0.6
  • Docker:
    • Cgroup Driver: systemd
    • Cgroup Version: 2
  • NVIDIA Container Toolkit: 1.19.0
  • NVIDIA GPU: 1 GPU detected, 124610 MB VRAM
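The environment details above can be gathered with standard commands (nothing OpenShell-specific; the `nvidia-smi` query flags are standard, but worth verifying on DGX OS):

```shell
# Collect the environment details listed above.
uname -r                                # kernel version (fills in the Kernel field)
uname -m                                # architecture, expect aarch64
stat -fc %T /sys/fs/cgroup/ 2>/dev/null || true   # cgroup2fs => cgroup v2
# GPU and Docker cgroup info, guarded in case the tools are absent
command -v nvidia-smi >/dev/null 2>&1 && \
  nvidia-smi --query-gpu=name,memory.total --format=csv,noheader || true
command -v docker >/dev/null 2>&1 && \
  docker info 2>/dev/null | grep -i cgroup || true
```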

Steps to Reproduce

  1. Install OpenShell v0.0.6 via curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/main/install.sh | sh
  2. Confirm non-GPU gateway works: openshell sandbox create → succeeds, sandbox created and connectable
  3. Stop the existing gateway: openshell gateway stop
  4. Clone NemoClaw: git clone https://github.com/NVIDIA/NemoClaw.git
  5. Run cd NemoClaw && ./install.sh
  6. Installer reaches step [2/7] and runs openshell gateway start --name nemoclaw --gpu
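The steps above as a single guarded script (a sketch: it only runs the live steps when `openshell` is already on PATH, and assumes the clone lands in the current directory):

```shell
#!/bin/sh
# Reproduction sketch for the GPU gateway failure. Step numbers match
# the list above; the install (step 1) is assumed to have been done.
if command -v openshell >/dev/null 2>&1; then
  openshell sandbox create                           # step 2: non-GPU path succeeds
  openshell gateway stop                             # step 3
  git clone https://github.com/NVIDIA/NemoClaw.git   # step 4
  cd NemoClaw && ./install.sh                        # steps 5-6: fails at [2/7]
else
  echo "openshell not on PATH; run the installer from step 1 first"
fi
```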

Expected Behaviour

GPU-enabled gateway starts successfully.

Actual Behaviour

Gateway container starts but the K3s cluster inside it fails at kubelet startup with missing cgroup controllers:

Error:   × K8s namespace not ready
  ╰─▶ gateway container is not running while waiting for namespace 'openshell': container exited
      (status=EXITED, exit_code=1)

Key error from container logs:

E0316 21:10:05.545118     118 cgroup_manager_linux.go:406] "Unhandled Error" err="cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubepods/pids.max: no such file or directory"
E0316 21:10:05.545156     118 kubelet.go:1745] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: error validating root container [kubepods] : cgroup [\"kubepods\"] has some missing controllers: cpu, cpuset, hugetlb, memory, pids"

Additional Context

  • The DGX Spark uses cgroup v2 with the systemd driver (stat -fc %T /sys/fs/cgroup/ returns cgroup2fs).
  • The non-GPU gateway works fine on this system, so the issue appears specific to the --gpu gateway variant.
  • The cgroup error mentions /sys/fs/cgroup/kubepods/pids.max. On cgroup v2, a controller's interface files (such as pids.max) only appear in a child cgroup when the controller is enabled in the parent's cgroup.subtree_control — so the missing file suggests the required controllers were never delegated to the kubepods subtree, and the kubelet inside the GPU gateway container may not be receiving proper cgroup v2 controller delegation.
  • NVIDIA Container Toolkit 1.19.0 is installed and functional.
  • This was tested on the day of the NemoClaw GTC 2026 announcement (16 March 2026).
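Given the missing-controllers error, a quick look at what cgroup v2 actually exposes on the host may help triage (these are standard cgroup v2 interface files; nothing here is specific to OpenShell):

```shell
#!/bin/sh
# On cgroup v2, a controller only gets interface files (e.g. pids.max)
# in a child cgroup if it is enabled in the parent's cgroup.subtree_control.
# The kubelet error suggests cpu/cpuset/hugetlb/memory/pids were not
# enabled for the kubepods subtree inside the GPU gateway container.
for f in /sys/fs/cgroup/cgroup.controllers /sys/fs/cgroup/cgroup.subtree_control; do
  if [ -r "$f" ]; then
    echo "$f: $(cat "$f")"
  else
    echo "$f: not present (host may not be on cgroup v2)"
  fi
done
```

If the needed controllers all appear in the host's cgroup.controllers but the kubelet still reports them missing, the delegation into the GPU gateway container is the likely culprit rather than the host configuration.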

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why

Labels

bug, compat