GPU-Native Persistent Actor Model Framework for Rust -- NVIDIA CUDA Focus
Transform GPU computing from batch-oriented kernel launches to a true actor-based paradigm where GPU kernels are long-lived, stateful actors that communicate via high-performance message passing. RingKernel focuses exclusively on NVIDIA CUDA, leveraging Hopper and future architectures for maximum persistent actor performance.
H100-verified persistent GPU actor framework.
- Persistent cooperative kernel execution (CUDA)
- Lock-free SPSC/MPSC queues (truly lock-free, no mutexes)
- GPU actor lifecycle (create/destroy/restart/supervise)
- Named actor registry with wildcard service discovery
- Credit-based backpressure and dead letter queue
- GPU memory pressure handling (budgets, mitigation)
- Dynamic scheduling (work stealing protocol)
- Hybrid Logical Clocks for causal ordering (30 ns/tick; see the sketch after this list)
- Thread Block Clusters with DSMEM messaging
- cluster.sync() for intra-GPC synchronization (2.98x faster than grid.sync())
- Green Contexts for SM partitioning
- TMA (Tensor Memory Accelerator) integration
- Async memory pool (116.9x faster than cuMemAlloc)
- NVTX profiling, Chrome trace export, memory tracking
- Cooperative groups with grid-wide synchronization
- Rust-to-CUDA transpiler (155+ intrinsics)
- Global, stencil, ring, and persistent FDTD kernel modes
- Unified IR (ringkernel-ir) with optimization passes (DCE, constant folding, algebraic simplification)
- Kernel checkpointing with state preservation
- Hot reload with rollback
- Graceful degradation (5 levels)
- Health monitoring with liveness/readiness probes
- Memory encryption (AES-256-GCM, ChaCha20)
- Audit logging with tamper-evident chains
- Kernel sandboxing with resource limits
- Compliance reporting (SOC2, GDPR, HIPAA, PCI-DSS)
- Actix, Axum, Tower, gRPC integrations with persistent GPU actors
- SSE and WebSocket handlers for real-time GPU events
- Arrow and Polars GPU operation support
- CLI scaffolding, codegen, and compatibility checking
- 8,698x faster than traditional cuLaunchKernel
- 3,005x faster than CUDA Graph replay
- 5.54M ops/s sustained throughput (CV 0.05%, 60 seconds)
- 0.628 µs cluster.sync() (2.98x vs grid.sync())
- 0.544 ns zero-copy serialization
- Paper-quality benchmarks with statistical analysis (95% CI, Cohen's d, Welch's t-test)
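The Hybrid Logical Clock entry above is the causal-ordering primitive behind the 30 ns/tick figure. Below is a minimal, self-contained Rust sketch of the standard HLC send/receive rules (Kulkarni et al.); it illustrates the algorithm only, not RingKernel's internal type, and every name in it (`Hlc`, `tick`, `receive`) is illustrative.

```rust
// Minimal sketch of a Hybrid Logical Clock: a (wall, logical) pair that never
// runs behind physical time and preserves causality across messages.
use std::time::{SystemTime, UNIX_EPOCH};

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Hlc {
    wall: u64,    // highest physical timestamp observed (ns)
    logical: u32, // tie-breaking logical counter
}

fn now_ns() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_nanos() as u64
}

impl Hlc {
    fn new() -> Self {
        Hlc { wall: 0, logical: 0 }
    }

    /// Local or send event: never move backwards, bump the counter on ties.
    fn tick(&mut self) -> Hlc {
        let pt = now_ns();
        if pt > self.wall {
            self.wall = pt;
            self.logical = 0;
        } else {
            self.logical += 1;
        }
        *self
    }

    /// Receive event: merge the sender's stamp so the result dominates both.
    fn receive(&mut self, msg: Hlc) -> Hlc {
        let pt = now_ns();
        let max_wall = self.wall.max(msg.wall).max(pt);
        self.logical = if max_wall == self.wall && max_wall == msg.wall {
            self.logical.max(msg.logical) + 1
        } else if max_wall == self.wall {
            self.logical + 1
        } else if max_wall == msg.wall {
            msg.logical + 1
        } else {
            0
        };
        self.wall = max_wall;
        *self
    }
}

fn main() {
    let mut a = Hlc::new();
    let mut b = Hlc::new();
    let sent = a.tick();            // stamp carried on an outgoing message
    let received = b.receive(sent);
    assert!(received > sent);       // causality: the receive stamp dominates the send stamp
}
```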
2x H100 NVL-verified. See docs/benchmarks/v1.1-2x-h100-results.md.
- NVLink-aware actor placement (`PlacementHint::NvlinkPreferred` uses `NvlinkTopology::probe`)
- P2P direct memory access via `cuCtxEnablePeerAccess` + `cuMemcpyPeerAsync`
- Multi-GPU actor migration with 3-phase transfer (8.7x faster than host-stage at 16 MiB)
- Load balancing across GPU pool (LoadBalance + CommunicationAware rebalance strategies)
- PROV-O provenance header (8 relation kinds, chain walk, signature hook)
- Multi-tenant K2K isolation (per-tenant sub-brokers, audit sink, quota enforcement)
- Live introspection streaming (EWMA, drop-tolerant ring)
- Hot rule reload (`CompiledRule`, version-monotonic, quiescence under load)
- 6 TLA+ specs (hlc, k2k_delivery, migration, multi_gpu_k2k, tenant_isolation, actor_lifecycle)
- TLC model-checking pipeline (no counterexamples)
- HBM-tier direct measurement -- new `cluster_hbm_k2k` kernel, included as paper Exp 1 HBM tier
- Multi-GPU K2K sustained bandwidth micro-bench (paper Addendum 6b): 258 GB/s @ 16 MiB (~81% of 318 GB/s peak)
- Incremental/delta checkpoints -- `Checkpoint::delta_from` / `applied_with_delta` / `content_digest` (see the sketch after this list)
- Intra-block warp work stealing -- `warp_work_steal` kernel + audit tests
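The incremental/delta checkpoint item above names three methods; the sketch below shows the idea behind them on a toy page-map state. Only the method names `delta_from`, `applied_with_delta`, and `content_digest` come from the list; the `Checkpoint`/`Delta` layouts and signatures here are assumptions, not RingKernel's API.

```rust
// Illustrative delta checkpointing: ship only the pages that changed since a
// base checkpoint, then verify the rebuilt state with a content digest.
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Clone, Default)]
struct Checkpoint {
    pages: HashMap<u64, Vec<u8>>, // page index -> contents (stand-in for actor state)
}

struct Delta {
    changed: HashMap<u64, Vec<u8>>, // only pages that differ from the base
}

impl Checkpoint {
    /// Diff against an older checkpoint, keeping only the pages that changed.
    fn delta_from(&self, base: &Checkpoint) -> Delta {
        let changed = self
            .pages
            .iter()
            .filter(|(idx, data)| base.pages.get(*idx) != Some(*data))
            .map(|(idx, data)| (*idx, data.clone()))
            .collect();
        Delta { changed }
    }

    /// Rebuild the newer checkpoint by overlaying a delta onto the base.
    fn applied_with_delta(&self, delta: &Delta) -> Checkpoint {
        let mut pages = self.pages.clone();
        pages.extend(delta.changed.iter().map(|(k, v)| (*k, v.clone())));
        Checkpoint { pages }
    }

    /// Order-independent digest used to confirm a delta restored the right state.
    fn content_digest(&self) -> u64 {
        let mut keys: Vec<_> = self.pages.keys().collect();
        keys.sort();
        let mut h = std::collections::hash_map::DefaultHasher::new();
        for k in keys {
            k.hash(&mut h);
            self.pages[k].hash(&mut h);
        }
        h.finish()
    }
}

fn main() {
    let mut base = Checkpoint::default();
    base.pages.insert(0, vec![1, 2, 3]);
    base.pages.insert(1, vec![4, 5, 6]);

    let mut current = base.clone();
    current.pages.insert(1, vec![9, 9, 9]); // one page mutated since the base

    let delta = current.delta_from(&base);          // ships only page 1
    let restored = base.applied_with_delta(&delta); // base + delta == current
    assert_eq!(restored.content_digest(), current.content_digest());
}
```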
v1.2 groundwork is on main but not yet tagged -- see the [Unreleased] section of CHANGELOG.md for full detail.
- Intra-cluster DSMEM work stealing -- `cluster_dsmem_work_steal` CUDA kernel; blocks atomically share a DSMEM-hosted counter via `cluster.map_shared_rank`
- Cross-cluster HBM work stealing -- `grid_hbm_work_steal` CUDA kernel; completes the block -> cluster -> grid stealing hierarchy
- NVSHMEM symmetric-heap bindings -- opt-in `nvshmem` feature on `ringkernel-cuda`, `NvshmemHeap` RAII wrapper over the NVSHMEM host ABI; bootstrap (MPI / `nvshmrun` / unique-ID) is the caller's responsibility
- Blackwell / sm_100 capability queries -- `GpuArchitecture::supports_{cluster_launch_control,fp8,fp6,fp4,nvlink5,tee}`
- Post-Hopper codegen types -- `ScalarType::BF16`, `FP8E4M3`, `FP8E5M2`, `FP6E3M2`, `FP6E2M3`, `FP4E2M1` with per-type `min_compute_capability()`; CUDA / MSL / WGSL lowerings wired
- Rubin preset -- `GpuArchitecture::rubin()` placeholder (compute cap 12.x)
- SpscQueue cache-line padding + split stats -- head / tail / producer stats / consumer stats each on their own 128-byte line; `fetch_max` replaces the CAS loop in `update_max_depth` (see the sketch after this list)
- 2-thread SPSC throughput benchmark -- `tests/spsc_two_thread_throughput.rs` measures the actual concurrent shape (not single-thread round-trip latency)
- Workspace dep consolidation -- 8 crates migrated from `version + path` to `{ workspace = true }`; root `[workspace.dependencies]` has a `ringkernel` facade entry
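The SpscQueue item above combines two changes: pad each hot counter onto its own 128-byte line, and replace the compare-and-swap retry loop in `update_max_depth` with a single `fetch_max`. Here is a self-contained sketch of both ideas; the field names and layout are illustrative, not the real queue's.

```rust
// Illustrative stats block for an SPSC queue: padded counters to avoid false
// sharing, and a one-shot fetch_max for the depth high-water mark.
use std::sync::atomic::{AtomicU64, Ordering};

/// Gives each hot atomic its own 128-byte slot so producer-side and
/// consumer-side counters never share (prefetched pairs of) cache lines.
#[repr(align(128))]
struct Padded(AtomicU64);

struct SpscStats {
    enqueued: Padded,  // touched only by the producer
    dequeued: Padded,  // touched only by the consumer
    max_depth: Padded, // high-water mark, updated on enqueue
}

impl SpscStats {
    fn new() -> Self {
        SpscStats {
            enqueued: Padded(AtomicU64::new(0)),
            dequeued: Padded(AtomicU64::new(0)),
            max_depth: Padded(AtomicU64::new(0)),
        }
    }

    /// A single read-modify-write keeps the largest depth ever observed;
    /// no load / compare / compare_exchange retry loop needed.
    fn update_max_depth(&self, depth: u64) {
        self.max_depth.0.fetch_max(depth, Ordering::Relaxed);
    }

    fn record_enqueue(&self, depth_after: u64) {
        self.enqueued.0.fetch_add(1, Ordering::Relaxed);
        self.update_max_depth(depth_after);
    }

    fn record_dequeue(&self) {
        self.dequeued.0.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let stats = SpscStats::new();
    stats.record_enqueue(3);
    stats.record_enqueue(7);
    stats.record_dequeue();
    assert_eq!(stats.max_depth.0.load(Ordering::Relaxed), 7);
}
```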
Each of the following is a focused optimization pass worth a dedicated session; all are doable on the current 2x H100 NVL hardware (no B200 needed). Listed in rough priority order:
- Tier-aware K2K routing at send time -- broker picks SMEM / DSMEM / HBM based on sender/receiver co-location (same block / same cluster / other). Paper data shows a 2.2x to 2.7x latency ratio across tiers; wins scale with the fraction of traffic that stays inside a block or cluster (see the sketch after this list).
- DSMEM-hosted K2K routing table -- for intra-cluster traffic, move the per-actor route entries from HBM to DSMEM via `cluster.map_shared_rank`. Lookup latency drops from ~500 ns (HBM) to ~20 ns (SMEM/DSMEM) per message.
- Batched device-side dequeue in the persistent kernel -- current inner loop dequeues one message per iteration; batch 8-16 to amortize atomic coherence cost. Targeting the 20+ Mmsg/s queue goal.
- TMA-accelerated state capture for snapshot/restart -- Hopper's Tensor Memory Accelerator does async tiled copies with lower SM occupancy cost than `memcpy_async`. Exp 2 capture is 1.3 µs at 1 KiB; TMA should take it sub-microsecond.
- Persistent kernel occupancy tuning -- audit `min_blocks_per_sm` across every persistent kernel; under-occupancy on H100 adds latency, over-occupancy causes register spills. Use NCU profiles to tune.
- K2K route table layout: AoS vs SoA -- `K2KRouteEntry` is 72 bytes (> H100 L2 line pair). For broadcast-style routing (one sender, many destinations) SoA wins; for direct pair lookups AoS is fine. Measure before flipping.
- HLC tick amortization -- HLC tick is ~30 ns; at 5 Mops/s that's 15% of host overhead. Batch tick every N messages for workloads where inter-event causality is coarser than single-message granularity (opt-in per-actor).
- Adaptive actor placement on warm restart -- after a migration or restart, re-evaluate `PlacementHint::NvlinkPreferred` with the current traffic matrix rather than the launch-time hint. Works with the `rebalance(CommunicationAware)` strategy but runs per-actor, not bulk.
- Device-side `CompiledRule` cache -- hot rule reload currently stages to HBM then activates. Keeping a warm DSMEM-resident cached copy for the most-active rules cuts activation from ~100 ns to ~10 ns.
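For the tier-aware routing item at the top of this list, the decision itself is a cheap co-location check. The sketch below shows that decision in host-side Rust with illustrative types (`K2kTier`, `ActorLocation`, and `pick_tier` are assumptions, not RingKernel API); the latency classes in the comments echo the ~20 ns vs ~500 ns figures quoted above.

```rust
// Illustrative tier selection: pick the cheapest message path that both
// endpoints can reach, based on where the two actors are resident.
#[derive(Debug, PartialEq, Eq)]
enum K2kTier {
    Smem,  // same thread block: shared memory, ~20 ns class lookups
    Dsmem, // same thread-block cluster: distributed shared memory
    Hbm,   // anywhere else on the device: global-memory queue, ~500 ns class
}

#[derive(Clone, Copy)]
struct ActorLocation {
    device: u32,
    cluster: u32,
    block: u32,
}

/// Choose a tier from sender/receiver co-location; cross-device traffic is
/// handled elsewhere (NVLink P2P / NVSHMEM), so it returns None here.
fn pick_tier(sender: ActorLocation, receiver: ActorLocation) -> Option<K2kTier> {
    if sender.device != receiver.device {
        return None;
    }
    Some(if sender.cluster == receiver.cluster && sender.block == receiver.block {
        K2kTier::Smem
    } else if sender.cluster == receiver.cluster {
        K2kTier::Dsmem
    } else {
        K2kTier::Hbm
    })
}

fn main() {
    let a = ActorLocation { device: 0, cluster: 3, block: 7 };
    let b = ActorLocation { device: 0, cluster: 3, block: 9 };
    assert_eq!(pick_tier(a, b), Some(K2kTier::Dsmem)); // same cluster, different block
}
```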
- Blackwell runtime validation -- requires B200 hardware; codegen stubs compile, runtime paths untested
- 4- / 8-GPU linear scaling benchmarks -- bound by NC80adis having only 2 GPUs
- NVSHMEM end-to-end smoke -- requires dual-process bootstrap (mpirun or unique-ID); the wrapper handles post-bootstrap operations but the launch harness is still manual
- Sub-50 ns command injection on B200 -- needs B200 silicon
- 20+ Mmsg/s lock-free queue -- optimization path; current sustained is 5.10 Mops/s (single-thread latency test); two-thread concurrent is ~2 Mmsg/s and scaling is the goal
- Kafka consumer/producer with GPU-resident processing
- NATS persistent actor subscriptions
- Redis Streams GPU bridge
- gRPC streaming with persistent actor backends
- LLM provider bridge (OpenAI, Anthropic, local models) with GPU-resident tokenization
- Vector store / GPU-resident embedding index
- Candle model inference as persistent actors
- PyTorch interop for training loop GPU actors
- GPU profiler integration (Nsight Systems, Nsight Compute)
- Distributed tracing across multi-GPU actor systems
- Prometheus metrics exporter for persistent actor health
- Grafana dashboard templates
| Metric | v1.0 | v1.1 (Achieved) | v1.2 (groundwork on main) |
|---|---|---|---|
| Command latency | 55 ns (H100) | 23 ns mean / 30 ns p99 on all 5 lifecycle rules | <50 ns (B200 target, needs silicon) |
| Sustained throughput | 5.54 Mops/s | 5.10 Mops/s (CV 0.66%, flat over 4x60 s) | 20+ Mops/s (queue opt path) |
| NVLink P2P migration | n/a | 8.7x vs host-stage @ 16 MiB | 4/8-GPU scaling (hardware-bound) |
| Multi-GPU K2K bandwidth | n/a | 258 GB/s @ 16 MiB (81% of NV12 peak) | NVSHMEM symmetric heap (bootstrap wired) |
| Cross-tenant leaks | n/a | 0 across 13 isolation tests | same baseline |
| TLA+ specs verified | n/a | 6 / 6 (no counterexamples) | same, plus capability query tests |
| Test count | 1,496+ | 1,590 (stable Rust 1.95) | 1,617 |
| GPU architectures | Hopper (H100) | Hopper multi-GPU (NV12) | Hopper + Blackwell codegen (FP4/FP6/FP8) |
| Multi-GPU support | Single GPU | 2-8 GPUs | 2-16 GPUs |
See CONTRIBUTING.md for guidelines on contributing to the roadmap and implementation.
- P0: Critical path, blocking other features
- P1: High value, should be in next release
- P2: Nice to have, can be deferred