fix(rollout): support non extra gpu placement when using rollout-external mode#1997
Open
shinytang6 wants to merge 1 commit into
Open
fix(rollout): support non extra gpu placement when using rollout-external mode#1997shinytang6 wants to merge 1 commit into
shinytang6 wants to merge 1 commit into
Conversation
62203b9 to
b39c012
Compare
…rnal mode - sglang_engine: register the external server with the router in _init_external (mirror _init_normal); without this the router stays empty and rollouts have nowhere to go. - sglang_engine: skip sanity-check on host / base_gpu_id / gpu_id_step, which naturally differ between slime's ServerArgs and the externally launched server. - placement_group / rollout: stop reserving rollout_num_gpus extra local GPU bundles for external rollout. The local actor is a thin HTTP wrapper that doesn't touch GPU memory, so it shares the actor PG (rollout_offset=0) instead of demanding spare local GPUs. - arguments: add early validation -- require --rollout-external-engine-addrs, reject --colocate / --debug-rollout-only, derive --rollout-num-gpus from the address count, force offload_rollout=False, and assert the engine count fits within the actor PG. Signed-off-by: shinytang6 <1074461480@qq.com>
b39c012 to
d39a9a3
Compare
Author
|
@zhuzilin could you please help take a look on this PR? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
rollout-external mode is already supported in slime — it lets the training job talk to a pre-launched SGLang server instead of spawning one. While
smoke-testing this path end-to-end (machine A hosts SGLang, machine B runs train.py) I hit a few bugs and one piece of waste that this PR cleans up.
What this PR fixes:
actor_num_gpus + rollout_num_gpusGPU bundles in the placement group, with rollout starting at rollout_offset =actor_num_gpus.rollout_num_gpusextra bundles on the training node is pure waste (and fails outright when the training node doesn't have that many spare GPUs). Reuse the actor PG and set rollout_offset = 0 in both create_placement_groups and _compute_rollout_offset.How to reproduce
Two machines. Machine A hosts the pre-launched SGLang server; machine B runs the slime training job and points at it.
Verify reachable from B: curl http://<SGLANG_HOST>:10091/get_server_info.
The only --rollout-external-specific bit is the SGLANG_ARGS block:
Full script: