GPU-Native Persistent Actor Model Framework for Rust -- NVIDIA CUDA Focus
Transform GPU computing from batch-oriented kernel launches to a true actor-based paradigm where GPU kernels are long-lived, stateful actors that communicate via high-performance message passing. RingKernel focuses exclusively on NVIDIA CUDA, leveraging Hopper and future architectures for maximum persistent actor performance.
H100-verified persistent GPU actor framework.
- Persistent cooperative kernel execution (CUDA)
- Lock-free SPSC/MPSC queues (truly lock-free, no mutexes)
- GPU actor lifecycle (create/destroy/restart/supervise)
- Named actor registry with wildcard service discovery
- Credit-based backpressure and dead letter queue
- GPU memory pressure handling (budgets, mitigation)
- Dynamic scheduling (work stealing protocol)
- Hybrid Logical Clocks for causal ordering (30 ns/tick; see the sketch after this list)
- Thread Block Clusters with DSMEM messaging
- cluster.sync() for intra-GPC synchronization (2.98x faster than grid.sync())
- Green Contexts for SM partitioning
- TMA (Tensor Memory Accelerator) integration
- Async memory pool (116.9x faster than cuMemAlloc)
- NVTX profiling, Chrome trace export, memory tracking
- Cooperative groups with grid-wide synchronization
- Rust-to-CUDA transpiler (155+ intrinsics)
- Global, stencil, ring, and persistent FDTD kernel modes
- Unified IR (ringkernel-ir) with optimization passes (DCE, constant folding, algebraic simplification)
- Kernel checkpointing with state preservation
- Hot reload with rollback
- Graceful degradation (5 levels)
- Health monitoring with liveness/readiness probes
- Memory encryption (AES-256-GCM, ChaCha20)
- Audit logging with tamper-evident chains
- Kernel sandboxing with resource limits
- Compliance reporting (SOC2, GDPR, HIPAA, PCI-DSS)
- Actix, Axum, Tower, gRPC integrations with persistent GPU actors
- SSE and WebSocket handlers for real-time GPU events
- Arrow and Polars GPU operation support
- CLI scaffolding, codegen, and compatibility checking
- 8,698x faster than traditional cuLaunchKernel
- 3,005x faster than CUDA Graph replay
- 5.54M ops/s sustained throughput (CV 0.05%, 60 seconds)
- 0.628 µs cluster.sync() (2.98x vs grid.sync())
- 0.544 ns zero-copy serialization
- Paper-quality benchmarks with statistical analysis (95% CI, Cohen's d, Welch's t-test)
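The Hybrid Logical Clock entry above is the causal-ordering primitive behind the 30 ns/tick figure. Below is a minimal, self-contained Rust sketch of the standard HLC send/receive rules (Kulkarni et al.); it illustrates the algorithm only, not RingKernel's internal type, and every name in it (`Hlc`, `tick`, `receive`) is illustrative.

```rust
// Minimal sketch of a Hybrid Logical Clock: a (wall, logical) pair that never
// runs behind physical time and preserves causality across messages.
use std::time::{SystemTime, UNIX_EPOCH};

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Hlc {
    wall: u64,    // highest physical timestamp observed (ns)
    logical: u32, // tie-breaking logical counter
}

fn now_ns() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos() as u64
}

impl Hlc {
    fn new() -> Self {
        Hlc { wall: 0, logical: 0 }
    }

    /// Local or send event: never move backwards, bump the counter on ties.
    fn tick(&mut self) -> Hlc {
        let pt = now_ns();
        if pt > self.wall {
            self.wall = pt;
            self.logical = 0;
        } else {
            self.logical += 1;
        }
        *self
    }

    /// Receive event: merge the sender's stamp so the result dominates both.
    fn receive(&mut self, msg: Hlc) -> Hlc {
        let pt = now_ns();
        let max_wall = self.wall.max(msg.wall).max(pt);
        self.logical = if max_wall == self.wall && max_wall == msg.wall {
            self.logical.max(msg.logical) + 1
        } else if max_wall == self.wall {
            self.logical + 1
        } else if max_wall == msg.wall {
            msg.logical + 1
        } else {
            0
        };
        self.wall = max_wall;
        *self
    }
}

fn main() {
    let mut a = Hlc::new();
    let mut b = Hlc::new();
    let sent = a.tick();            // stamp carried on an outgoing message
    let received = b.receive(sent);
    assert!(received > sent);       // causality: the receive stamp dominates the send stamp
}
```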
2x H100 NVL-verified. See docs/benchmarks/v1.1-2x-h100-results.md.
- NVLink-aware actor placement (`PlacementHint::NvlinkPreferred` uses `NvlinkTopology::probe`)
- P2P direct memory access via `cuCtxEnablePeerAccess` + `cuMemcpyPeerAsync`
- Multi-GPU actor migration with 3-phase transfer (8.7x faster than host-stage at 16 MiB)
- Load balancing across GPU pool (LoadBalance + CommunicationAware rebalance strategies)
- PROV-O provenance header (8 relation kinds, chain walk, signature hook)
- Multi-tenant K2K isolation (per-tenant sub-brokers, audit sink, quota enforcement)
- Live introspection streaming (EWMA, drop-tolerant ring)
- Hot rule reload (`CompiledRule`, version-monotonic, quiescence under load)
- 6 TLA+ specs (hlc, k2k_delivery, migration, multi_gpu_k2k, tenant_isolation, actor_lifecycle)
- TLC model-checking pipeline (no counterexamples)
- HBM-tier direct measurement -- new `cluster_hbm_k2k` kernel, included as paper Exp 1 HBM tier
- Multi-GPU K2K sustained bandwidth micro-bench (paper Addendum 6b): 258 GB/s @ 16 MiB (~81% of 318 GB/s peak)
- Incremental/delta checkpoints -- `Checkpoint::delta_from` / `applied_with_delta` / `content_digest` (see the sketch after this list)
- Intra-block warp work stealing -- `warp_work_steal` kernel + audit tests
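The incremental/delta checkpoint item above names three methods; the sketch below shows the idea behind them on a toy page-map state. Only the method names `delta_from`, `applied_with_delta`, and `content_digest` come from the list; the `Checkpoint`/`Delta` layouts and signatures here are assumptions, not RingKernel's API.

```rust
// Illustrative delta checkpointing: ship only the pages that changed since a
// base checkpoint, then verify the rebuilt state with a content digest.
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Clone, Default)]
struct Checkpoint {
    pages: HashMap<u64, Vec<u8>>, // page index -> contents (stand-in for actor state)
}

struct Delta {
    changed: HashMap<u64, Vec<u8>>, // only pages that differ from the base
}

impl Checkpoint {
    /// Diff against an older checkpoint, keeping only the pages that changed.
    fn delta_from(&self, base: &Checkpoint) -> Delta {
        let changed = self
            .pages
            .iter()
            .filter(|(idx, data)| base.pages.get(*idx) != Some(*data))
            .map(|(idx, data)| (*idx, data.clone()))
            .collect();
        Delta { changed }
    }

    /// Rebuild the newer checkpoint by overlaying a delta onto the base.
    fn applied_with_delta(&self, delta: &Delta) -> Checkpoint {
        let mut pages = self.pages.clone();
        pages.extend(delta.changed.iter().map(|(k, v)| (*k, v.clone())));
        Checkpoint { pages }
    }

    /// Order-independent digest used to confirm a delta restored the right state.
    fn content_digest(&self) -> u64 {
        let mut keys: Vec<_> = self.pages.keys().collect();
        keys.sort();
        let mut h = std::collections::hash_map::DefaultHasher::new();
        for k in keys {
            k.hash(&mut h);
            self.pages[k].hash(&mut h);
        }
        h.finish()
    }
}

fn main() {
    let mut base = Checkpoint::default();
    base.pages.insert(0, vec![1, 2, 3]);
    base.pages.insert(1, vec![4, 5, 6]);

    let mut current = base.clone();
    current.pages.insert(1, vec![9, 9, 9]); // one page mutated since the base

    let delta = current.delta_from(&base);          // ships only page 1
    let restored = base.applied_with_delta(&delta); // base + delta == current
    assert_eq!(restored.content_digest(), current.content_digest());
}
```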
v1.2 groundwork is on main but not yet tagged -- see the [Unreleased] section of CHANGELOG.md for full detail.
- Intra-cluster DSMEM work stealing -- `cluster_dsmem_work_steal` CUDA kernel; blocks atomically share a DSMEM-hosted counter via `cluster.map_shared_rank`
- Cross-cluster HBM work stealing -- `grid_hbm_work_steal` CUDA kernel; completes the block -> cluster -> grid stealing hierarchy
- NVSHMEM symmetric-heap bindings -- opt-in `nvshmem` feature on `ringkernel-cuda`, `NvshmemHeap` RAII wrapper over the NVSHMEM host ABI; bootstrap (MPI / `nvshmrun` / unique-ID) is the caller's responsibility
- Blackwell / sm_100 capability queries -- `GpuArchitecture::supports_{cluster_launch_control,fp8,fp6,fp4,nvlink5,tee}`
- Post-Hopper codegen types -- `ScalarType::BF16`, `FP8E4M3`, `FP8E5M2`, `FP6E3M2`, `FP6E2M3`, `FP4E2M1` with per-type `min_compute_capability()`; CUDA / MSL / WGSL lowerings wired
- Rubin preset -- `GpuArchitecture::rubin()` placeholder (compute cap 12.x)
- SpscQueue cache-line padding + split stats -- head / tail / producer stats / consumer stats each on their own 128-byte line; `fetch_max` replaces the CAS loop in `update_max_depth` (see the sketch after this list)
- 2-thread SPSC throughput benchmark -- `tests/spsc_two_thread_throughput.rs` measures the actual concurrent shape (not single-thread round-trip latency)
- Workspace dep consolidation -- 8 crates migrated from `version + path` to `{ workspace = true }`; root `[workspace.dependencies]` has a `ringkernel` facade entry
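The SpscQueue item above combines two changes: pad each hot counter onto its own 128-byte line, and replace the compare-and-swap retry loop in `update_max_depth` with a single `fetch_max`. Here is a self-contained sketch of both ideas; the field names and layout are illustrative, not the real queue's.

```rust
// Illustrative stats block for an SPSC queue: padded counters to avoid false
// sharing, and a one-shot fetch_max for the depth high-water mark.
use std::sync::atomic::{AtomicU64, Ordering};

/// Gives each hot atomic its own 128-byte slot so producer-side and
/// consumer-side counters never share (prefetched pairs of) cache lines.
#[repr(align(128))]
struct Padded(AtomicU64);

struct SpscStats {
    enqueued: Padded,  // touched only by the producer
    dequeued: Padded,  // touched only by the consumer
    max_depth: Padded, // high-water mark, updated on enqueue
}

impl SpscStats {
    fn new() -> Self {
        SpscStats {
            enqueued: Padded(AtomicU64::new(0)),
            dequeued: Padded(AtomicU64::new(0)),
            max_depth: Padded(AtomicU64::new(0)),
        }
    }

    /// A single read-modify-write keeps the largest depth ever observed;
    /// no load / compare / compare_exchange retry loop needed.
    fn update_max_depth(&self, depth: u64) {
        self.max_depth.0.fetch_max(depth, Ordering::Relaxed);
    }

    fn record_enqueue(&self, depth_after: u64) {
        self.enqueued.0.fetch_add(1, Ordering::Relaxed);
        self.update_max_depth(depth_after);
    }

    fn record_dequeue(&self) {
        self.dequeued.0.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let stats = SpscStats::new();
    stats.record_enqueue(3);
    stats.record_enqueue(7);
    stats.record_dequeue();
    assert_eq!(stats.max_depth.0.load(Ordering::Relaxed), 7);
}
```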
Each of the following is a focused optimization pass worth a dedicated session; all are doable on the current 2x H100 NVL hardware (no B200 needed). Listed in rough priority order:
- Tier-aware K2K routing at send time -- broker picks SMEM / DSMEM / HBM based on sender/receiver co-location (same block / same cluster / other). Paper data shows a 2.2x to 2.7x latency ratio across tiers; wins scale with the fraction of traffic that stays inside a block or cluster (see the sketch after this list).
- DSMEM-hosted K2K routing table -- for intra-cluster traffic, move the per-actor route entries from HBM to DSMEM via `cluster.map_shared_rank`. Lookup latency drops from ~500 ns (HBM) to ~20 ns (SMEM/DSMEM) per message.
- Batched device-side dequeue in the persistent kernel -- current inner loop dequeues one message per iteration; batch 8-16 to amortize atomic coherence cost. Targeting the 20+ Mmsg/s queue goal.
- TMA-accelerated state capture for snapshot/restart -- Hopper's Tensor Memory Accelerator does async tiled copies with lower SM occupancy cost than `memcpy_async`. Exp 2 capture is 1.3 µs at 1 KiB; TMA should take it sub-microsecond.
- Persistent kernel occupancy tuning -- audit `min_blocks_per_sm` across every persistent kernel; under-occupancy on H100 adds latency, over-occupancy causes register spills. Use NCU profiles to tune.
- K2K route table layout: AoS vs SoA -- `K2KRouteEntry` is 72 bytes (> H100 L2 line pair). For broadcast-style routing (one sender, many destinations) SoA wins; for direct pair lookups AoS is fine. Measure before flipping.
- HLC tick amortization -- HLC tick is ~30 ns; at 5 Mops/s that's 15% of host overhead. Batch tick every N messages for workloads where inter-event causality is coarser than single-message granularity (opt-in per-actor).
- Adaptive actor placement on warm restart -- after a migration or restart, re-evaluate `PlacementHint::NvlinkPreferred` with the current traffic matrix rather than the launch-time hint. Works with the `rebalance(CommunicationAware)` strategy but runs per-actor, not bulk.
- Device-side `CompiledRule` cache -- hot rule reload currently stages to HBM then activates. Keeping a warm DSMEM-resident cached copy for the most-active rules cuts activation from ~100 ns to ~10 ns.
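For the tier-aware routing item at the top of this list, the decision itself is a cheap co-location check. The sketch below shows that decision in host-side Rust with illustrative types (`K2kTier`, `ActorLocation`, and `pick_tier` are assumptions, not RingKernel API); the latency classes in the comments echo the ~20 ns vs ~500 ns figures quoted above.

```rust
// Illustrative tier selection: pick the cheapest message path that both
// endpoints can reach, based on where the two actors are resident.
#[derive(Debug, PartialEq, Eq)]
enum K2kTier {
    Smem,  // same thread block: shared memory, ~20 ns class lookups
    Dsmem, // same thread-block cluster: distributed shared memory
    Hbm,   // anywhere else on the device: global-memory queue, ~500 ns class
}

#[derive(Clone, Copy)]
struct ActorLocation {
    device: u32,
    cluster: u32,
    block: u32,
}

/// Choose a tier from sender/receiver co-location; cross-device traffic is
/// handled elsewhere (NVLink P2P / NVSHMEM), so it returns None here.
fn pick_tier(sender: ActorLocation, receiver: ActorLocation) -> Option<K2kTier> {
    if sender.device != receiver.device {
        return None;
    }
    Some(if sender.cluster == receiver.cluster && sender.block == receiver.block {
        K2kTier::Smem
    } else if sender.cluster == receiver.cluster {
        K2kTier::Dsmem
    } else {
        K2kTier::Hbm
    })
}

fn main() {
    let a = ActorLocation { device: 0, cluster: 3, block: 7 };
    let b = ActorLocation { device: 0, cluster: 3, block: 9 };
    assert_eq!(pick_tier(a, b), Some(K2kTier::Dsmem)); // same cluster, different block
}
```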
- Blackwell runtime validation -- requires B200 hardware; codegen stubs compile, runtime paths untested
- 4- / 8-GPU linear scaling benchmarks -- bound by NC80adis having only 2 GPUs
- NVSHMEM end-to-end smoke -- requires dual-process bootstrap (mpirun or unique-ID); the wrapper handles post-bootstrap operations but the launch harness is still manual
- Sub-50 ns command injection on B200 -- needs B200 silicon
- 20+ Mmsg/s lock-free queue -- optimization path; current sustained is 5.10 Mops/s (single-thread latency test); two-thread concurrent is ~2 Mmsg/s and scaling is the goal
- Kafka consumer/producer with GPU-resident processing
- NATS persistent actor subscriptions
- Redis Streams GPU bridge
- gRPC streaming with persistent actor backends
- LLM provider bridge (OpenAI, Anthropic, local models) with GPU-resident tokenization
- Vector store / GPU-resident embedding index
- Candle model inference as persistent actors
- PyTorch interop for training loop GPU actors
- GPU profiler integration (Nsight Systems, Nsight Compute)
- Distributed tracing across multi-GPU actor systems
- Prometheus metrics exporter for persistent actor health
- Grafana dashboard templates
| Metric | v1.0 | v1.1 (Achieved) | v1.2 (groundwork on main) |
|---|---|---|---|
| Command latency | 55 ns (H100) | 23 ns mean / 30 ns p99 on all 5 lifecycle rules | <50 ns (B200 target, needs silicon) |
| Sustained throughput | 5.54 Mops/s | 5.10 Mops/s (CV 0.66%, flat over 4x60 s) | 20+ Mops/s (queue opt path) |
| NVLink P2P migration | n/a | 8.7x vs host-stage @ 16 MiB | 4/8-GPU scaling (hardware-bound) |
| Multi-GPU K2K bandwidth | n/a | 258 GB/s @ 16 MiB (81% of NV12 peak) | NVSHMEM symmetric heap (bootstrap wired) |
| Cross-tenant leaks | n/a | 0 across 13 isolation tests | same baseline |
| TLA+ specs verified | n/a | 6 / 6 (no counterexamples) | same, plus capability query tests |
| Test count | 1,496+ | 1,590 (stable Rust 1.95) | 1,617 |
| GPU architectures | Hopper (H100) | Hopper multi-GPU (NV12) | Hopper + Blackwell codegen (FP4/FP6/FP8) |
| Multi-GPU support | Single GPU | 2-8 GPUs | 2-16 GPUs |
See CONTRIBUTING.md for guidelines on contributing to the roadmap and implementation.
- P0: Critical path, blocking other features
- P1: High value, should be in next release
- P2: Nice to have, can be deferred