GPU-Health-eXpert
Updated Oct 30, 2025 - C
Enhanced GPU throttle diagnostic for DGX Spark (GB10): NVML direct telemetry, throttle cause decoder, PCIe link monitoring, baseline drift detection, timeline capture.
gpu thrashing
NVIDIA GPU Unified Memory diagnostic tool: architecture-aware, measurement-based, with PCIe/coherent transport detection.
NVIDIA GPU validation: PCIe transport, Unified Memory prefetch, SGEMM compute, drift detection.
ML research control plane — experiment lifecycle, model registry, cloud training launcher
Cycle-accurate UMA fault latency and bandwidth measurement for NVIDIA GPUs. C and PTX. No Python. Pascal (SM 6.0) through Blackwell GB10 (SM 12.1).