Add delta-compression support for weight sync from trainer to rollout
engines. The trainer holds a pinned-CPU baseline, computes deltas on
GPU, snapshots the new baseline back to CPU on a secondary stream, and
sparse-encodes deltas for broadcast.
- CLI flags in slime/utils/arguments.py:
--enable-delta-compression
--delta-compression-dtype {fp16, bf16, fp32}
--delta-compression-transport {dense, sparse_indices, sparse_bitmask}
--delta-compression-full-sync-interval N
--delta-compression-artifact-dir PATH
- DeltaCompressionTracker in delta_weight_update.py — owns baseline,
issues async D2H on a secondary CUDA stream, computes deltas.
- Shared helpers in common.py (ChunkUpdate, materialize_delta_transport,
dtype map).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
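The core encode step the tracker performs can be sketched in a few lines. This is a minimal numpy stand-in for the GPU tensors, not the PR's actual API; the function name and shapes are illustrative only:

```python
# Illustrative numpy sketch of the tracker's core idea: diff new weights
# against a retained baseline, then keep only the non-zero entries for
# broadcast. All names here are hypothetical, not the PR's real API.
import numpy as np

def sparse_encode_delta(baseline: np.ndarray, new_weights: np.ndarray):
    """Return (indices, values) for entries that changed since the baseline."""
    delta = (new_weights - baseline).ravel()
    indices = np.flatnonzero(delta)   # positions of non-zero deltas
    values = delta[indices]           # the delta values themselves
    return indices, values

baseline = np.zeros(8, dtype=np.float32)
new_weights = baseline.copy()
new_weights[[1, 5]] += np.float32(0.25)   # only 2 of 8 entries change

indices, values = sparse_encode_delta(baseline, new_weights)
print(indices.tolist(), values.tolist())  # → [1, 5] [0.25, 0.25]
```

In the real tracker the delta is computed on GPU and the baseline lives in pinned CPU memory; this sketch only shows the sparsification itself.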
Integrate DeltaCompressionTracker into both weight-update paths:
- update_weight_from_distributed.py (non-colocated): per-param gather
+ HF convert feeds into prepare_chunk, then sparse-encoded deltas
are bucketed into NCCL broadcasts. A sparse bucket flush threshold
(5 GB) keeps per-broadcast cost bounded.
- update_weight_from_tensor.py (colocated): same tracker, simpler
flush via load_format="flattened_bucket_delta".
- sglang_engine.py: update_weights_from_distributed accepts the new
sparse_metadata kwargs so the engine side can decode sparse payload.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
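The flush-threshold bucketing described above (5 GB in the PR) can be sketched as a simple accumulate-then-flush loop. Names and the broadcast callback are hypothetical; sizes are scaled down for illustration:

```python
# Illustrative sketch of the sparse-bucket flush policy: encoded chunks
# accumulate until adding the next one would exceed the threshold (5 GB in
# the PR; tiny numbers here), then the bucket goes out as one broadcast.
def bucket_chunks(chunks, flush_threshold_bytes, broadcast):
    bucket, bucket_bytes = [], 0
    for name, nbytes in chunks:
        if bucket and bucket_bytes + nbytes > flush_threshold_bytes:
            broadcast(bucket)             # one NCCL broadcast per bucket
            bucket, bucket_bytes = [], 0
        bucket.append(name)
        bucket_bytes += nbytes
    if bucket:                            # flush the tail
        broadcast(bucket)

sent = []
bucket_chunks([("a", 3), ("b", 3), ("c", 5), ("d", 2)], 8, sent.append)
print(sent)                               # → [['a', 'b'], ['c', 'd']]
```

Bounding each broadcast this way keeps per-call NCCL cost and peak staging memory predictable regardless of how sparse a given step's deltas are.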
Adds _apply_sparse_delta_weights_from_distributed to the SGLang ModelRunner.
Receiver decodes packed (indices, values) on-wire into persistent scratch
buffers and calls load_weights in batches capped at 512 MiB to amortize the
~2 ms/call name-resolution + MoE expert-remapping overhead.

Validated at GLM-4.7-355B (bf16, sparse_indices, 64 rollout GPUs):
- Step 0 full sync: 66-83s — weights_checker 200 OK on all engines
- Step 1 delta sync: 22.6s — weights_checker 200 OK (all tensors match)
- Step 2 forced full: 69.7s — weights_checker 200 OK

Batch-cap sweep justifying 512 MiB is inlined in the patch itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
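The receiver-side decode amounts to a scatter of the packed (indices, values) pairs onto the current weights. A minimal numpy sketch, with illustrative names (the real code operates on GPU scratch buffers and per-param slice ranges from the metadata list):

```python
# Illustrative sketch of the receiver-side decode: scatter the packed
# (indices, values) pairs onto the flat view of a weight tensor, in place.
import numpy as np

def apply_sparse_delta(weights: np.ndarray, indices: np.ndarray,
                       values: np.ndarray) -> None:
    """In place: weights.flat[indices] += values."""
    flat = weights.reshape(-1)   # view on a contiguous array, so writes stick
    flat[indices] += values      # scatter-add the delta

w = np.zeros((2, 3), dtype=np.float32)
apply_sparse_delta(w, np.array([1, 4]),
                   np.array([0.5, -0.5], dtype=np.float32))
print(w.tolist())                # → [[0.0, 0.5, 0.0], [0.0, -0.5, 0.0]]
```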
Inspiration / prior art
Summary
Add optional delta-compression for trainer → rollout-engine weight sync. When enabled, the trainer keeps a pinned-CPU baseline of each parameter, broadcasts only the sparse delta (bf16 packed indices + values) over NCCL, and the SGLang receiver decodes + applies the delta in large amortized batches.
At GLM-4.7-355B-A32B non-colocated (4 actor + 4 rollout nodes, 64 rollout H100s), this drops delta-sync time from ~50s (non-delta baseline, full 170 GB broadcast) to ~22.6s, a ~2.2× speedup.
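A quick self-consistency check on the headline numbers (the 170 GB dense and ~5.9 GB sparse figures come from the design section below):

```python
# Back-of-envelope check that the reported wire compression and sync speedup
# are self-consistent. Numbers are taken from the benchmark in this PR.
dense_gb, sparse_gb = 170, 5.9
baseline_s, delta_s = 50, 22.6

print(round(dense_gb / sparse_gb, 1))   # → 28.8  (~30x less data on the wire)
print(round(baseline_s / delta_s, 1))   # → 2.2   (~2.2x faster sync)
```

The speedup is well below the compression ratio because gather, convert, delta compute, and receiver-side apply are not compressed away; only the broadcast shrinks.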
For the colocated case we do not expect measurable gains: that path uses CUDA IPC (GPU-to-GPU memory sharing within the same process group), not NCCL broadcast, so there is no wire to compress, and the delta compute would be pure overhead. This PR leaves the colocated CUDA-IPC path unchanged in behavior: the tracker integration in update_weight_from_tensor.py is present for correctness, but no speedup is expected for colocated runs.

New CLI flags
- --enable-delta-compression — master switch
- --delta-compression-dtype {fp16, bf16, fp32} — delta dtype on the wire (default fp32)
- --delta-compression-transport {dense, sparse_indices, sparse_bitmask} — on-wire encoding (default dense; recommended sparse_indices)
- --delta-compression-full-sync-interval N — run a full sync every N successful delta syncs; step 0 is always full
- --delta-compression-artifact-dir PATH — optional async zstd artifact writer (off by default)

Design choices
1. Sparse-indices transport

Deltas between consecutive RL steps are extremely sparse (~2-3% density at 355B), so sending them dense would waste ~97% of the wire. We pack non-zero (index, value) pairs into two flat tensors and a small metadata list describing per-param slice ranges. Observed compression ratio: ~30× on the wire (170 GB dense → ~5.9 GB sparse).

2. Secondary CUDA stream for baseline D2H
The trainer must keep a pinned-CPU baseline up to date. If D2H runs on the default stream, it serializes with delta compute and the next chunk's gather. We route all baseline D2H through a dedicated _d2h_stream, recording a single event to gate the gather/convert on. Both full and delta paths share the same _snapshot_baseline_async helper (see commit "delta tracker: unify baseline D2H across full and delta paths" in history).

3. Inline commit — no GPU refs held
baseline_updates=[] in DeltaCompressionCommitState — gathered GPU tensors are freed as soon as prepare_chunk returns, so the next bucket's gather has headroom. Prior iterations that held refs in commit state caused allocator stalls at 355B scale.

4. Receiver batch cap = 512 MiB
SGLang's model.load_weights(batch) call costs ~2 ms/call in name resolution + MoE expert remapping, so the receiver batches decoded deltas until they exceed 512 MiB before flushing to load_weights. We swept the cap systematically at GLM-4.7-355B (H100, 64 rollout GPUs), measuring load_weights calls per sync: 512 MiB won decisively and is hardcoded, with the sweep rationale inlined in the patch comment.
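The amortization argument behind the cap can be made concrete with a rough cost model, using the ~2 ms/call figure quoted above and the ~5.9 GB sparse payload per sync (illustrative arithmetic only, not code from the PR):

```python
# Rough cost model for the receiver batch cap: with ~5.9 GB of sparse payload
# per delta sync, a 512 MiB cap means only ~12 load_weights calls (~24 ms of
# per-call overhead), versus thousands of calls if every tensor is loaded
# individually. Figures are the ones quoted in this PR description.
import math

payload_mib = 5.9 * 1024      # sparse payload per delta sync, in MiB
call_overhead_ms = 2.0        # name resolution + MoE expert remapping

calls_batched = math.ceil(payload_mib / 512)
print(calls_batched, calls_batched * call_overhead_ms)   # → 12 24.0
```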
5. bf16 deltas with fp32 baseline
The baseline stays fp32 on CPU (an accumulator), but the on-wire delta is bf16. Rounding drift is flushed periodically via --delta-compression-full-sync-interval — set to 10000 in practice (basically never), because the baseline is exact-refreshed each step anyway.

Experimental results
Primary benchmark — GLM-4.7-355B-A32B non-colocated (4+4 nodes, 64 H100 rollout GPUs)

Delta syncs ran with full_chunks=0, delta_chunks=598, transport=sparse_indices, batch_mib=512 (commits 1c4935bf and 7ed11a92, run #2). _snapshot_baseline_async moves D2H to the secondary stream and baseline_updates=[] frees gathered buffers, so D2H now overlaps with the NCCL broadcast and step 2 runs at pure-non-delta speed. Measured ~28s faster than pre-refactor on comparable step-0 runtimes (69.7s → 41.4s).

Correctness validation
Ran a validate config (check_weight_update_equal=True, full_sync_interval=2) to exercise full → delta → forced-full transitions against a snapshotted reference on each SGLang engine:
- Full sync: weights_checker 200 OK on every engine; [check_tensors] equal tensors listed every layer
- Delta sync: weights_checker 200 OK on every engine — bitwise/tolerance match
- Forced full sync: weights_checker 200 OK on every engine

Experiments conducted during development
- Transport comparison: sparse_indices wins — bitmask incurs unpack cost on the receiver
- check_weight_update_equal across full → delta → forced-full transitions

Colocated case
The colocated weight-sync path (update_weight_from_tensor.py) uses CUDA IPC: the sender registers GPU memory and shares an IPC handle; the colocated SGLang engine maps that memory directly into its own process. There is no broadcast over the network — the "wire" is a pointer, not bytes.

Consequences:
The tracker stays integrated in update_weight_from_tensor.py for API consistency and so that future mixed colocated+distributed setups get consistent baseline semantics, but we do not claim any colocated perf gain, and users running fully colocated should leave --enable-delta-compression off.

Test plan
- check_weight_update_equal / weights_checker: all three paths match the reference
- Syntax checks (py_compile)

🤖 Generated with Claude Code