fix(rollout): support non extra gpu placement when using rollout-external mode by shinytang6 · Pull Request #1997 · THUDM/slime

shinytang6 · 2026-05-30T13:27:10Z

Background

rollout-external mode is already supported in slime — it lets the training job talk to a pre-launched SGLang server instead of spawning one. While
smoke-testing this path end-to-end (machine A hosts SGLang, machine B runs train.py) I hit a few bugs and one piece of waste that this PR cleans up.

What this PR fixes:

- External engine never registers with the router.
- Sanity check rejects fields that must differ for an external server.
- Redundant local GPU reservation for the rollout group.
  - For non-colocate runs, slime reserves actor_num_gpus + rollout_num_gpus GPU bundles in the placement group, with rollout starting at rollout_offset = actor_num_gpus.
  - Under --rollout-external, the local rollout actor is a thin HTTP wrapper that doesn't touch local GPU memory, so reserving rollout_num_gpus extra bundles on the training node is pure waste (and fails outright when the training node doesn't have that many spare GPUs). The fix reuses the actor PG and sets rollout_offset = 0 in both create_placement_groups and _compute_rollout_offset.
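The reservation rule above can be sketched in a few lines. This is only an illustration of the arithmetic, not slime's actual code: `BundlePlan` and `plan_bundles` are hypothetical names, and the colocate case is deliberately left out.

```python
from typing import NamedTuple


class BundlePlan(NamedTuple):
    """Hypothetical result type: how many GPU bundles to reserve and where rollout starts."""

    num_bundles: int
    rollout_offset: int


def plan_bundles(actor_num_gpus: int, rollout_num_gpus: int, rollout_external: bool) -> BundlePlan:
    """Sketch of the non-colocate reservation rule (illustrative, not slime's real function)."""
    if rollout_external:
        # The local rollout actor only forwards HTTP requests, so it can
        # reuse the actor bundles instead of asking for extra local GPUs.
        return BundlePlan(num_bundles=actor_num_gpus, rollout_offset=0)
    # Default non-colocate layout: rollout bundles are placed after the actor bundles.
    return BundlePlan(
        num_bundles=actor_num_gpus + rollout_num_gpus,
        rollout_offset=actor_num_gpus,
    )
```

With 2 actor GPUs and 1 rollout GPU, the default layout would ask for 3 bundles with rollout starting at index 2, while the external layout asks for only 2 bundles with rollout starting at index 0.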

How to reproduce

Two machines. Machine A hosts the pre-launched SGLang server; machine B runs the slime training job and points at it.

Machine A — launch the external SGLang server

python -m sglang.launch_server \
    --model-path /path/to/model \
    --host 0.0.0.0 \
    --port 10091

Verify reachable from B: curl http://<SGLANG_HOST>:10091/get_server_info.
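The same reachability check can be scripted from machine B, for example as a pre-flight step before submitting the job. This is a minimal sketch using only the standard library; the helper names `server_info_url` and `is_reachable` are made up for illustration, and the only endpoint assumed is the /get_server_info path used in the curl command above.

```python
import http.client
import urllib.error
import urllib.request


def server_info_url(host: str, port: int) -> str:
    """Build the URL that the curl check above queries."""
    return f"http://{host}:{port}/get_server_info"


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers the GET with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx response.
        return False
```

A caller would substitute the real host for <SGLANG_HOST>, e.g. `is_reachable(server_info_url(host, 10091))`, and abort early if it returns False.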

Machine B — minimal GRPO smoke test driving the external server

The only --rollout-external-specific bit is the SGLANG_ARGS block:

SGLANG_ARGS=(
   --rollout-num-gpus-per-engine 1
   --sglang-mem-fraction-static 0.4
   --rollout-external
   --rollout-external-engine-addrs <SGLANG_HOST>:10091
)

Full script:

  #!/bin/bash
  set -ex

  pkill -9 sglang 2>/dev/null || true
  ray stop --force 2>/dev/null || true
  # pkill accepts a single pattern, so kill ray and python separately
  pkill -9 ray 2>/dev/null || true
  pkill -9 python 2>/dev/null || true
  sleep 2

  export PYTHONUNBUFFERED=1
  SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
  source "${SCRIPT_DIR}/models/qwen2.5-0.5B.sh"

  CKPT_ARGS=(
     --hf-checkpoint /path/to/Qwen2.5-0.5B-Instruct/
     --ref-load     /path/to/Qwen2.5-0.5B-Instruct_torch_dist/
     --save /tmp/slime_smoke_save/
     --save-interval 9999
  )

  ROLLOUT_ARGS=(
     --prompt-data /path/to/dapo-math-17k.jsonl
     --input-key prompt --label-key label
     --apply-chat-template --rollout-shuffle
     --rm-type deepscaler
     --num-rollout 3 --rollout-batch-size 2
     --n-samples-per-prompt 2 --num-steps-per-rollout 1
     --global-batch-size 4
     --rollout-max-response-len 256 --rollout-temperature 1
  )

  PERF_ARGS=(
     --tensor-model-parallel-size 1 --pipeline-model-parallel-size 1
     --context-parallel-size 1 --expert-model-parallel-size 1
     --expert-tensor-parallel-size 1
     --use-dynamic-batch-size --max-tokens-per-gpu 2048
  )

  GRPO_ARGS=(
     --advantage-estimator grpo --use-kl-loss --kl-loss-coef 0.00
     --kl-loss-type low_var_kl --entropy-coef 0.00
     --eps-clip 0.2 --eps-clip-high 0.28
  )

  OPTIMIZER_ARGS=(
     --optimizer adam --lr 1e-6 --lr-decay-style constant
     --weight-decay 0.1 --adam-beta1 0.9 --adam-beta2 0.98
  )

  SGLANG_ARGS=(
     --rollout-num-gpus-per-engine 1
     --sglang-mem-fraction-static 0.4
     --rollout-external
     --rollout-external-engine-addrs <SGLANG_HOST>:10091
  )

  MISC_ARGS=(
     --attention-dropout 0.0 --hidden-dropout 0.0
     --accumulate-allreduce-grads-in-fp32
     --attention-softmax-in-fp32
     --attention-backend flash
  )

  ray start --head --node-ip-address 127.0.0.1 --num-gpus 2 --disable-usage-stats

  ray job submit --address="http://127.0.0.1:8265" \
     --runtime-env-json='{
       "env_vars": {
          "PYTHONPATH": "/path/to/Megatron-LM/",
          "CUDA_DEVICE_MAX_CONNECTIONS": "1"
       }
     }' \
     -- python3 train.py \
     --actor-num-nodes 1 \
     --actor-num-gpus-per-node 2 \
     "${MODEL_ARGS[@]}" \
     "${CKPT_ARGS[@]}" "${ROLLOUT_ARGS[@]}" \
     "${OPTIMIZER_ARGS[@]}" "${GRPO_ARGS[@]}" \
     "${PERF_ARGS[@]}" "${SGLANG_ARGS[@]}" "${MISC_ARGS[@]}"

…rnal mode

- sglang_engine: register the external server with the router in _init_external (mirror _init_normal); without this the router stays empty and rollouts have nowhere to go.
- sglang_engine: skip sanity-check on host / base_gpu_id / gpu_id_step, which naturally differ between slime's ServerArgs and the externally launched server.
- placement_group / rollout: stop reserving rollout_num_gpus extra local GPU bundles for external rollout. The local actor is a thin HTTP wrapper that doesn't touch GPU memory, so it shares the actor PG (rollout_offset=0) instead of demanding spare local GPUs.
- arguments: add early validation -- require --rollout-external-engine-addrs, reject --colocate / --debug-rollout-only, derive --rollout-num-gpus from the address count, force offload_rollout=False, and assert the engine count fits within the actor PG.

Signed-off-by: shinytang6 <1074461480@qq.com>
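The early-validation rules listed in the commit message could look roughly like the sketch below. It is not the PR's code: `validate_external_rollout` is a hypothetical helper, the attribute names are simply the argparse-style forms of the flags mentioned above, and two details are assumptions made for illustration: that --rollout-num-gpus is derived as address count times --rollout-num-gpus-per-engine, and that "fits within the actor PG" means the engine count may not exceed the number of actor GPU bundles.

```python
from argparse import Namespace


def validate_external_rollout(args: Namespace) -> Namespace:
    """Illustrative early validation for --rollout-external (hypothetical helper)."""
    if not args.rollout_external:
        return args

    addrs = args.rollout_external_engine_addrs
    if not addrs:
        raise ValueError("--rollout-external requires --rollout-external-engine-addrs")
    if args.colocate or args.debug_rollout_only:
        raise ValueError(
            "--rollout-external cannot be combined with --colocate or --debug-rollout-only"
        )

    # Assumption: one address is one engine spanning rollout_num_gpus_per_engine GPUs.
    args.rollout_num_gpus = len(addrs) * args.rollout_num_gpus_per_engine
    # The engine lives on another machine, so there is nothing local to offload.
    args.offload_rollout = False

    # Assumption: the thin local wrappers share the actor bundles, one per engine.
    actor_bundles = args.actor_num_nodes * args.actor_num_gpus_per_node
    if len(addrs) > actor_bundles:
        raise ValueError(
            f"{len(addrs)} external engines do not fit in {actor_bundles} actor bundles"
        )
    return args
```

Raising `ValueError` instead of using `assert` is a choice made for this sketch so that the checks survive `python -O`; the commit message itself only says the conditions are validated and asserted.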

shinytang6 · 2026-05-30T13:44:24Z

@zhuzilin could you please take a look at this PR?

shinytang6 force-pushed the feat/rollout-external branch 2 times, most recently from 62203b9 to b39c012 · May 30, 2026 13:39

shinytang6 force-pushed the feat/rollout-external branch from b39c012 to d39a9a3 · May 30, 2026 13:41

shinytang6 changed the title ~~[feat] add --rollout-external mode for pre-launched inference engines~~ fix(rollout): support non extra gpu placement when using rollout-external mode May 30, 2026

shinytang6 wants to merge 1 commit into THUDM:main from shinytang6:feat/rollout-external