fix(rollout): support non extra gpu placement when using rollout-external mode#1997

Open
shinytang6 wants to merge 1 commit into THUDM:main from shinytang6:feat/rollout-external
@shinytang6 shinytang6 commented May 30, 2026

Background

rollout-external mode is already supported in slime — it lets the training job talk to a pre-launched SGLang server instead of spawning one. While
smoke-testing this path end-to-end (machine A hosts SGLang, machine B runs train.py) I hit a few bugs and one piece of waste that this PR cleans up.

What this PR fixes:

  1. External engine never registers with the router.
  2. Sanity check rejects fields that must differ for an external server.
  3. Redundant local GPU reservation for the rollout group.
    • For non-colocate runs, slime reserves actor_num_gpus + rollout_num_gpus GPU bundles in the placement group, with rollout starting at rollout_offset = actor_num_gpus.
    • Under --rollout-external, however, the local rollout actor is a thin HTTP wrapper that doesn't touch local GPU memory, so reserving rollout_num_gpus extra bundles on the training node is pure waste (and fails outright when the training node doesn't have that many spare GPUs). This PR reuses the actor PG and sets rollout_offset = 0 in both create_placement_groups and _compute_rollout_offset.
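The bundle arithmetic in item 3 can be illustrated with a minimal sketch. `plan_placement` and `PlacementPlan` are hypothetical names made up for this example; they are not slime's actual create_placement_groups or _compute_rollout_offset, and the sketch only covers the non-colocate case described above.

```python
from dataclasses import dataclass


@dataclass
class PlacementPlan:
    num_bundles: int     # GPU bundles requested in the placement group
    rollout_offset: int  # index of the first bundle used by the rollout group


def plan_placement(actor_num_gpus: int, rollout_num_gpus: int,
                   rollout_external: bool = False) -> PlacementPlan:
    """Hypothetical helper mirroring the non-colocate logic described in this PR."""
    if rollout_external:
        # The local rollout actor is only an HTTP wrapper, so it shares the
        # actor bundles: no extra reservation, and the offset is 0.
        return PlacementPlan(num_bundles=actor_num_gpus, rollout_offset=0)
    # Locally hosted rollout: reserve extra bundles placed after the actor's.
    return PlacementPlan(num_bundles=actor_num_gpus + rollout_num_gpus,
                         rollout_offset=actor_num_gpus)


# Example: 2 actor GPUs and 1 rollout engine GPU.
local = plan_placement(actor_num_gpus=2, rollout_num_gpus=1)
external = plan_placement(actor_num_gpus=2, rollout_num_gpus=1, rollout_external=True)
print(local)     # 3 bundles, rollout starts at bundle 2
print(external)  # 2 bundles, rollout starts at bundle 0
```

With the external flag set, a training node that has exactly actor_num_gpus GPUs is enough, which is the case that previously failed.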

How to reproduce

Two machines. Machine A hosts the pre-launched SGLang server; machine B runs the slime training job and points at it.

  1. Machine A — launch the external SGLang server
python -m sglang.launch_server \
    --model-path /path/to/model \
    --host 0.0.0.0 \
    --port 10091

Verify reachable from B: curl http://<SGLANG_HOST>:10091/get_server_info.
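If you would rather script the reachability check than run curl by hand, a rough Python equivalent is sketched below. `server_info_url` and `fetch_server_info` are hypothetical helpers written for this example, and the sketch assumes the endpoint returns a JSON body; replace the address with your own `<SGLANG_HOST>:10091`.

```python
import json
import urllib.request


def server_info_url(addr: str) -> str:
    """Build the /get_server_info URL for a host:port address."""
    return f"http://{addr}/get_server_info"


def fetch_server_info(addr: str, timeout: float = 5.0) -> dict:
    """Query the external server; raises URLError/HTTPError if unreachable.

    Assumes the response body is JSON.
    """
    with urllib.request.urlopen(server_info_url(addr), timeout=timeout) as resp:
        return json.load(resp)


# Usage (run on machine B, needs network access to machine A):
#   info = fetch_server_info("<SGLANG_HOST>:10091")
#   print(sorted(info))
```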

  2. Machine B — minimal GRPO smoke test driving the external server

The only --rollout-external-specific bit is the SGLANG_ARGS block:

  SGLANG_ARGS=(
     --rollout-num-gpus-per-engine 1
     --sglang-mem-fraction-static 0.4
     --rollout-external
     --rollout-external-engine-addrs <SGLANG_HOST>:10091
  )

Full script:

  #!/bin/bash
  set -ex

  pkill -9 sglang 2>/dev/null || true
  ray stop --force 2>/dev/null || true
  # pkill accepts a single pattern, so kill ray and python separately
  pkill -9 ray 2>/dev/null || true
  pkill -9 python 2>/dev/null || true
  sleep 2

  export PYTHONUNBUFFERED=1
  SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
  source "${SCRIPT_DIR}/models/qwen2.5-0.5B.sh"

  CKPT_ARGS=(
     --hf-checkpoint /path/to/Qwen2.5-0.5B-Instruct/
     --ref-load     /path/to/Qwen2.5-0.5B-Instruct_torch_dist/
     --save /tmp/slime_smoke_save/
     --save-interval 9999
  )

  ROLLOUT_ARGS=(
     --prompt-data /path/to/dapo-math-17k.jsonl
     --input-key prompt --label-key label
     --apply-chat-template --rollout-shuffle
     --rm-type deepscaler
     --num-rollout 3 --rollout-batch-size 2
     --n-samples-per-prompt 2 --num-steps-per-rollout 1
     --global-batch-size 4
     --rollout-max-response-len 256 --rollout-temperature 1
  )

  PERF_ARGS=(
     --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1
     --context-parallel-size 1 --expert-model-parallel-size 1
     --expert-tensor-parallel-size 1
     --use-dynamic-batch-size --max-tokens-per-gpu 2048
  )

  GRPO_ARGS=(
     --advantage-estimator grpo --use-kl-loss --kl-loss-coef 0.00
     --kl-loss-type low_var_kl --entropy-coef 0.00
     --eps-clip 0.2 --eps-clip-high 0.28
  )

  OPTIMIZER_ARGS=(
     --optimizer adam --lr 1e-6 --lr-decay-style constant
     --weight-decay 0.1 --adam-beta1 0.9 --adam-beta2 0.98
  )

  SGLANG_ARGS=(
     --rollout-num-gpus-per-engine 1
     --sglang-mem-fraction-static 0.4
     --rollout-external
     --rollout-external-engine-addrs <SGLANG_HOST>:10091
  )

  MISC_ARGS=(
     --attention-dropout 0.0 --hidden-dropout 0.0
     --accumulate-allreduce-grads-in-fp32
     --attention-softmax-in-fp32
     --attention-backend flash
  )

  ray start --head --node-ip-address 127.0.0.1 --num-gpus 2 --disable-usage-stats

  ray job submit --address="http://127.0.0.1:8265" \
     --runtime-env-json='{
       "env_vars": {
          "PYTHONPATH": "/path/to/Megatron-LM/",
          "CUDA_DEVICE_MAX_CONNECTIONS": "1"
       }
     }' \
     -- python3 train.py \
     --actor-num-nodes 1 \
     --actor-num-gpus-per-node 2 \
     "${MODEL_ARGS[@]}" \
     "${CKPT_ARGS[@]}" "${ROLLOUT_ARGS[@]}" \
     "${OPTIMIZER_ARGS[@]}" "${GRPO_ARGS[@]}" \
     "${PERF_ARGS[@]}" "${SGLANG_ARGS[@]}" "${MISC_ARGS[@]}"

@shinytang6 shinytang6 force-pushed the feat/rollout-external branch 2 times, most recently from 62203b9 to b39c012 on May 30, 2026 13:39
…rnal mode

- sglang_engine: register the external server with the router in
  _init_external (mirror _init_normal); without this the router stays
  empty and rollouts have nowhere to go.
- sglang_engine: skip sanity-check on host / base_gpu_id / gpu_id_step,
  which naturally differ between slime's ServerArgs and the externally
  launched server.
- placement_group / rollout: stop reserving rollout_num_gpus extra local
  GPU bundles for external rollout. The local actor is a thin HTTP
  wrapper that doesn't touch GPU memory, so it shares the actor PG
  (rollout_offset=0) instead of demanding spare local GPUs.
- arguments: add early validation -- require
  --rollout-external-engine-addrs, reject --colocate /
  --debug-rollout-only, derive --rollout-num-gpus from the address
  count, force offload_rollout=False, and assert the engine count fits
  within the actor PG.

Signed-off-by: shinytang6 <1074461480@qq.com>
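The early validation listed in the last bullet of the commit message above can be sketched roughly as follows. This is an illustration, not the code in this PR: `validate_rollout_external` is a hypothetical name, the attribute names are assumed to follow the CLI flags, and deriving rollout_num_gpus as (number of addresses) x (rollout_num_gpus_per_engine) is an assumption about what "derive from the address count" means.

```python
from types import SimpleNamespace


def validate_rollout_external(args):
    """Hypothetical early validation for --rollout-external (illustrative only)."""
    if not args.rollout_external:
        return args

    addrs = args.rollout_external_engine_addrs
    if not addrs:
        raise ValueError("--rollout-external requires --rollout-external-engine-addrs")
    if args.colocate:
        raise ValueError("--rollout-external cannot be combined with --colocate")
    if args.debug_rollout_only:
        raise ValueError("--rollout-external cannot be combined with --debug-rollout-only")

    # Assumption: one engine per address, each counted as
    # rollout_num_gpus_per_engine GPUs.
    num_engines = len(addrs)
    args.rollout_num_gpus = num_engines * args.rollout_num_gpus_per_engine

    # The external server's memory is not managed from the training job.
    args.offload_rollout = False

    # The thin local wrappers share the actor placement group, so the engine
    # count must fit within the actor bundles.
    actor_gpus = args.actor_num_nodes * args.actor_num_gpus_per_node
    if num_engines > actor_gpus:
        raise ValueError(
            f"{num_engines} external engines do not fit in {actor_gpus} actor bundles"
        )
    return args


# Usage with the settings from the smoke test (one engine, two actor GPUs):
args = SimpleNamespace(
    rollout_external=True,
    rollout_external_engine_addrs=["<SGLANG_HOST>:10091"],
    colocate=False,
    debug_rollout_only=False,
    rollout_num_gpus_per_engine=1,
    rollout_num_gpus=None,
    offload_rollout=True,
    actor_num_nodes=1,
    actor_num_gpus_per_node=2,
)
validate_rollout_external(args)
```

Failing at argument-parsing time keeps misconfigurations from surfacing later as placement-group errors in Ray.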
@shinytang6 shinytang6 force-pushed the feat/rollout-external branch from b39c012 to d39a9a3 on May 30, 2026 13:41
@shinytang6 shinytang6 changed the title from "[feat] add --rollout-external mode for pre-launched inference engines" to "fix(rollout): support non extra gpu placement when using rollout-external mode" May 30, 2026
@shinytang6
Author

@zhuzilin could you please take a look at this PR?
