
feat: CUDA IPC zero-copy GPU transport (TD↔SD input, output, ControlNet) #15

Open: forkni wants to merge 24 commits into SDTD_031_dev from feat/cuda-ipc-output



forkni commented May 18, 2026

Summary

Full zero-copy GPU transport between TouchDesigner and StreamDiffusion using CUDA IPC (cuda-link), replacing legacy shared-memory CPU copies in all three data directions:

  • SD→TD output (CUDAIPCExporter): ring-buffer IPC with CUDA graph memcpy, activation barrier, WDDM HW scheduling support
  • TD→SD input (CUDAIPCImporter): zero-copy GPU read of TD's render output; CPU cudaEventQuery sync (no GPU-stream entanglement)
  • TD→SD ControlNet (CUDAIPCImporter): same zero-copy path for canny/depth control image; activated via use_cuda_ipc_controlnet YAML key emitted by stream-start YAML emitter

ControlNet TRT 901 fix (core bug resolved in this PR)

cudaErrorStreamCaptureInvalidated (901) fired on every cold-start when controlnet_scale > 0 was saved in td_config.yaml. Full diagnosis trail:

| Attempt | What | Result |
|---|---|---|
| v0 | get_frame(stream=current_stream()) | cudaStreamWaitEvent on legacy → 901 |
| v1 | dedicated non-blocking import stream + wait_stream | re-coupled legacy → 901 |
| v2 | get_frame() with no stream= arg, CPU cudaEventQuery poll | fixes warm-activation; fails cold-start |
| Stage A | CUDALINK_USE_GRAPHS=0 | disproved: exporter graphs not involved |
| v3 | drain legacy stream before cudaStreamBeginCapture | disproved: problem is inside capture window |
| v4 | use_cuda_graph=False for CN engines in wrapper.py | verified ✓ |

Root cause: TRT's internal genericReformat::copyPackedRunKernel submits work to the legacy/NULL stream during execute_async_v3 inside the graph-capture window. wrapper.py:2208 had use_cuda_graph=True hard-coded for every CN engine. Setting it to False keeps TRT acceleration but skips the CUDA-graph wrapper — no capture window, no 901.

Key files changed

| File | Change |
|---|---|
| src/streamdiffusion/wrapper.py | use_cuda_graph=False for CN engines (v4 fix) |
| src/streamdiffusion/acceleration/tensorrt/utilities.py | defensive legacy-stream drain before cudaStreamBeginCapture |
| src/streamdiffusion/_compat/cuda_ipc/cuda_ipc_exporter.py | ThreadLocal capture mode (was Global) |
| src/streamdiffusion/_compat/cuda_ipc/cuda_graphs.py | docstring correction |
| Scripts/StreamDiffusionTD__Text__StreamDiffusionExt__td.py | YAML emitter: emit use_cuda_ipc_controlnet + cuda_ipc_control_shm_name |
| StreamDiffusionTD/td_manager.py | v2 runtime fix; gitignored, live via Scripts/ sync |

Test plan

  • Cold-start .toe with controlnet_scale: 0.577 and use_cuda_ipc_controlnet: true — confirm no 901, CN active from frame 1
  • Toggle CN via OSC (enable/disable, scale 0→0.5→0.8) — confirm no 901
  • Warm-activation path: start with CN scale=0, enable via OSC mid-stream — confirm still works
  • TD-side IPC Receiver: all 3 slots open, event=YES, stream_wait < 0.1 ms
  • Sustained 3+ min run: steady FPS ≥ 15, no [E] IExecutionContext::enqueueV3 errors
  • Output IPC: TD Receiver consuming SD diffusion output — copyCUDAMemory < 0.15 ms per frame

🤖 Generated with Claude Code

forkni and others added 24 commits May 16, 2026 08:36
kvo_cache_in_* tensors have ONNX dim 0 = 2 (hard-static K/V pair), not a
symbolic batch dim. The previous naïve _max_rows tile pumped sample to
2×_n_itr rows, causing modelopt's CalibrationDataProvider to compute
n_itr=2×_n_itr (sample's symbolic dim 0 resolves to 1) and split kvo into
chunks of shape (1,...) instead of (2,...) — ORT then rejected them with
"Got 1 Expected 2".

Fix: compute per-input target_rows = n_itr × resolved_dim0(name), mirroring
modelopt's symbolic→1/static-kept substitution, so every input splits into
exactly n_itr uniform chunks. Adds regression test in tests/quality/.
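The row-count rule above can be illustrated with a minimal pure-Python sketch. It is not the code from this commit: `resolved_dim0`, `target_rows`, `tile_rows`, and `split_uniform` are hypothetical helper names, the input shapes are made up, and plain lists stand in for tensors. It only shows why tiling each input to `n_itr × resolved_dim0` rows yields uniform chunks.

```python
def resolved_dim0(onnx_dim0):
    """Mirror the substitution described above: a symbolic dim 0
    (represented here as a str) resolves to 1, a static int is kept."""
    return onnx_dim0 if isinstance(onnx_dim0, int) else 1


def target_rows(n_itr, onnx_dim0):
    """Rows an input needs so that it splits into n_itr uniform chunks."""
    return n_itr * resolved_dim0(onnx_dim0)


def tile_rows(rows, target):
    """Repeat the available rows cyclically until there are `target` rows."""
    return [rows[i % len(rows)] for i in range(target)]


def split_uniform(rows, n_itr):
    """Split rows into n_itr equally sized chunks; fail loudly otherwise."""
    size, rem = divmod(len(rows), n_itr)
    if rem:
        raise ValueError("rows do not split evenly into n_itr chunks")
    return [rows[i * size:(i + 1) * size] for i in range(n_itr)]


# Hypothetical calibration inputs: one symbolic-batch input, one static K/V pair.
n_itr = 4
onnx_dim0 = {"sample": "batch", "kvo_cache_in_0": 2}
data = {"sample": ["s0"], "kvo_cache_in_0": ["k", "v"]}

chunks = {
    name: split_uniform(tile_rows(data[name], target_rows(n_itr, dim)), n_itr)
    for name, dim in onnx_dim0.items()
}
```

With these assumed inputs, `sample` is tiled to 4 rows and split into chunks of 1 row, while `kvo_cache_in_0` is tiled to 8 rows and split into chunks of 2 rows, so the K/V pair stays intact in every chunk.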

Fixes SDXL-Turbo + use_cached_attn=True + cfg_type=self + use_controlnet TD config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eam-start YAML

Resolves cudaErrorStreamCaptureInvalidated (901) on first CN TRT inference
when use_cuda_ipc_controlnet is active. Root cause and runtime fix live in
the dotsimulate/StreamDiffusionTD repo (StreamDiffusionTD/td_manager.py:
drop stream= arg from CUDAIPCImporter.get_frame, use CPU eager-sync via
_wait_for_slot to avoid pending GPU work on the legacy stream).

This commit covers the tracked-side changes:
- cuda_ipc_exporter: capture mode Global->ThreadLocal (defensive hardening)
- cuda_graphs: docstring correction for multi-engine processes
- _plans: add 2026-05-17 emitter session + 2026-05-18 capture-fix session

YAML emitter (use_cuda_ipc_controlnet + cuda_ipc_control_shm_name keys)
was applied 2026-05-17 to Scripts/StreamDiffusionTD__Text__StreamDiffusionExt__td.py
(outside this repo, lives in dotsimulate/StreamDiffusionTD).

Verified: 19-28 FPS sustained with CN canny SDXL-Turbo 512x512, OSC
enable/scale changes accepted, no 901, TD-side Receiver healthy.
…901)

ControlNet TRT engines fail cudaStreamEndCapture with 901
(cudaErrorStreamCaptureInvalidated) on cold start when controlnet_scale > 0
in td_config.yaml. Root cause: TRT's internal genericReformat::copyPackedRunKernel
submits work to the legacy/NULL stream during execute_async_v3 inside the
graph-capture window on the engine's (polygraphy blocking) stream.

wrapper.py:2208 hard-coded use_cuda_graph=True for every CN engine. Setting
it to False keeps TRT acceleration for CN but skips the CUDA-graph wrapper,
eliminating the capture-window conflict. Cost: ~hundreds of us per CN forward
on WDDM (no graph batch-submission); steady-state FPS 18-25 vs 19-28.

Also:
- utilities.py: defensive torch.cuda.current_stream().synchronize() before
  cudaStreamBeginCapture, gated on first capture per engine. Covers the broader
  polygraphy blocking-stream / legacy-stream race for future TRT engines.

Diagnosis trail: v0 (streamWaitEvent on legacy), v1 (wait_stream bridge),
v2 (CPU cudaEventQuery - fixes warm-activation), Stage A (CUDALINK_USE_GRAPHS=0
- disproved), v3 (drain legacy pre-capture - disproved), v4 (this commit).

Verified: cold-start with CN scale=0.577 + use_cuda_ipc_controlnet=true,
no 901, CN active from frame 1, steady FPS sustained.
…N smoothness

CUDALINK_WAIT_SPIN_US 200 -> 1000: absorbs CN-preprocessing variance
on the importer side without falling to blocking wait path. Eliminates
micro-stutter visible during CN-enabled SDXL-turbo runs on RTX 4090.
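A spin-then-block wait of the kind this setting tunes can be sketched as follows. This is an illustration under assumptions, not the cuda-link implementation: `wait_for_slot`, `is_ready`, `block`, and `now_us` are hypothetical names, `is_ready` stands in for a non-blocking readiness poll such as a `cudaEventQuery` check, and `FakeClock` exists only to make the demo deterministic.

```python
import time


def _now_us():
    """Monotonic clock in whole microseconds."""
    return time.perf_counter_ns() // 1000


def wait_for_slot(is_ready, spin_us, block, now_us=_now_us):
    """Poll is_ready() for up to spin_us microseconds, then fall back to
    the blocking wait. Returns which path completed the wait."""
    deadline = now_us() + spin_us
    while True:
        if is_ready():
            return "spin"
        if now_us() >= deadline:
            break
    block()  # spin budget exhausted: take the slower blocking path
    return "blocked"


class FakeClock:
    """Deterministic clock for the demo: every read advances 100 us."""

    def __init__(self):
        self.t = 0

    def __call__(self):
        self.t += 100
        return self.t


def simulate(spin_us, ready_at_us=600):
    """Run one wait against a frame that becomes ready at ready_at_us."""
    clock = FakeClock()
    return wait_for_slot(
        is_ready=lambda: clock.t >= ready_at_us,
        spin_us=spin_us,
        block=lambda: None,
        now_us=clock,
    )
```

In this simulation a frame that becomes ready after 600 µs misses a 200 µs spin budget and falls to the blocking path, while a 1000 µs budget absorbs it inside the spin loop.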

CUDALINK_BARRIER_STALE_NS 5s -> 0.2s: at ~16 FPS, the 5s upstream
default would let ~80 stale frames through the activation barrier
before rejection. 0.2s (~3 frames) is tight enough to catch a
genuinely stale publish without false-positive on healthy frames.
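The frame counts quoted above follow from simple arithmetic, sketched here. `frames_in_window` and `is_stale` are hypothetical helpers written for this illustration; the staleness comparison is an assumption about how such a threshold would typically be applied, not the library's actual check.

```python
NS_PER_S = 1_000_000_000


def frames_in_window(stale_ns, fps):
    """How many frames are published within one staleness window."""
    return stale_ns * fps / NS_PER_S


def is_stale(publish_ns, now_ns, stale_ns):
    """Assumed check: a publish is stale once it is older than stale_ns."""
    return now_ns - publish_ns > stale_ns


OLD_STALE_NS = 5 * NS_PER_S   # previous 5 s upstream default
NEW_STALE_NS = NS_PER_S // 5  # new 0.2 s value

print(frames_in_window(OLD_STALE_NS, 16))  # 80.0
print(frames_in_window(NEW_STALE_NS, 16))  # 3.2
```

At 16 FPS a healthy frame is about 62.5 ms old when consumed, well inside the 0.2 s window, whereas a publish that is one second old is rejected under the new value and would have passed under the old one.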

Applied to both _compat/cuda_ipc/ (library) and _compat/td_exporter/
(TD COMP mirror, auto-synced to .tox) halves in lockstep.
CUDALINK_TD_USE_GRAPHS default of False preserved.
…uild_engine

The direct mutation at lines 1147-1149 was immediately overwritten by the
full GPUBuildProfile dataclass rebuild at 1153-1172 — dead code. Drop the
first block; keep the dataclass rebuild as the single override path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- bump VENDORED_VERSION 1.4.1 -> 1.5.1 (upstream 2d44ef8)
- split cuda_ipc_exporter.py into exporter.py/importer.py/_cuda_adapters.py/_env.py/_profile.py; add activation_barrier.py; drop debug_utils.py
- migrate StreamDiffusionWrapper export path to Exporter.open(FrameSpec)/export(GpuFrame)->FrameOutcome/close() with env-driven ExportPolicy
- mirror td_exporter/ in lockstep (auto-syncs to .tox); retain CUDAIPCImporter as deprecation shim (removal v1.8)
P1: set TF32/cudnn/matmul precision flags at StreamDiffusion init
P2: gate per-frame GPU sync to every 16th frame (remove ~15/16 host stalls)
P3: GPU-native Canny in _process_tensor_core (eliminates D2H+cv2+H2D round-trip)
P4: GPU-side uint8 output conversion + pinned D2H in _send_output_frame
P5: upload CPU-SHM input as uint8, normalize on-GPU, use pipeline fast path
P6: fix IPC zero-copy -- get_frame() instead of get_frame_numpy()
P7: verified IPC output export already implemented

Grounding: CUDA Handbook Ch.5/6/11 + PMPP Ch.4/5/6/18
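The P2 gating can be pictured with a small counter-based sketch. `SyncGate` is a hypothetical name, and whether the real code syncs on the first or the last frame of each group of 16 is an assumption here; the caller would invoke its actual host sync (for example a stream synchronize) only when `should_sync()` returns true.

```python
class SyncGate:
    """Allow a host-side GPU sync on one frame out of every `every` frames."""

    def __init__(self, every=16):
        self.every = every
        self.frame = 0

    def should_sync(self):
        # Assumption: sync on frames 0, 16, 32, ... and skip the rest.
        due = self.frame % self.every == 0
        self.frame += 1
        return due


gate = SyncGate(every=16)
syncs = sum(gate.should_sync() for _ in range(160))
print(syncs)  # 10
```

Over 160 frames this performs 10 syncs instead of 160, which matches the "remove ~15/16 host stalls" figure in P2.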
- Add ControlNet CUDA-IPC consumer to both td_manager copies: ipc_control_importer
  state vars, _try_construct_ipc_control_importer(), throttled lazy-reconnect (1s),
  get_frame() -> (1,3,H,W) float16 [0,1] -> wrapper.update_control_image(). CPU-SHM
  fallback also given lazy-reconnect. Fixes no-conditioning-effect regression caused
  by missing consumer + startup race on control_memory.
- P3 Canny hardening: replace per-frame amax normalization with constant /4.0 divisor
  for stable frame-to-frame thresholds; replace expand() stride-0 view with repeat()
  for contiguous CHW output tensor.
- Add post-implementation corrections to cuda-perf-plan.md (file paths, td_manager
  untracked status, dead use_cuda_ipc_controlnet flag now backed by real consumer).
- Add new plan doc: docs/plans/2026-05-24-controlnet-ipc-consumer.md
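The throttled lazy-reconnect described in the first bullet can be sketched as below. This is not the td_manager code: `LazyReconnect`, `construct`, and `clock` are hypothetical names, and treating a missing shared-memory segment as an `OSError` is an assumption made for the illustration.

```python
import time


class LazyReconnect:
    """Retry a failing constructor at most once per min_interval_s,
    and cache the handle once construction succeeds."""

    def __init__(self, construct, min_interval_s=1.0, clock=time.monotonic):
        self._construct = construct
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_attempt = None
        self.handle = None

    def get(self):
        if self.handle is not None:
            return self.handle  # already connected: no further attempts
        now = self._clock()
        throttled = (
            self._last_attempt is not None
            and now - self._last_attempt < self._min_interval_s
        )
        if throttled:
            return None  # too soon since the last failed attempt
        self._last_attempt = now
        try:
            self.handle = self._construct()
        except OSError:
            # Assumption: the producer side is not up yet; retry later.
            self.handle = None
        return self.handle
```

Called once per frame, `get()` returns `None` until the producer exists, so the frame loop can fall back to running without a control image rather than paying a failed connection attempt on every frame.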
…efault off)

The per-frame _send_processed_controlnet_frame().cpu().numpy() D2H copy stalled the
CUDA stream every frame once controlnet_images[0] became non-None after the IPC
consumer fix. Preview is display-only; diffusion is unaffected.

Changes (Scripts/ auto-sync to running .tox; runtime td_manager.py is untracked):
- _send_back_processed_controlnet: early-return when send_controlnet_preview is false
- _initialize_memory_interfaces: skip control_processed_memory allocation when disabled
- StreamDiffusionExt emitter: emit send_controlnet_preview flag (default false),
  overridable via Sendcontrolnetpreview TD par
- docs/plans/2026-05-24-controlnet-preview-throttle.md: plan + diagnosis
inference_time_ema feeds only the similar-filter sleep heuristic. When the
filter is off the EMA has no reader, so gate start.record()/end.record()/
end.synchronize() behind if self.similar_image_filter — eliminates even the
residual 1-in-16 host stall on the default (filter-off) production path.