diff --git a/blog/2026-06-01-hetero-epd.md b/blog/2026-06-01-hetero-epd.md new file mode 100644 index 000000000..7ae253827 --- /dev/null +++ b/blog/2026-06-01-hetero-epd.md @@ -0,0 +1,156 @@ +--- +title: "Heterogeneous CPU + GPU EPD Disaggregation to Boost VLM Serving" +author: "Intel & SGLang Team" +date: "May 29, 2026" +previewImg: /images/blog/hetero-epd/1.png +--- + +**TL;DR** + +We enabled heterogeneous Encode-Prefill-Decode (EPD) disaggregation via Dynamo and SGLang for Vision-Language Models (VLMs). By offloading vision encoding tasks to CPUs (the easiest-getting CPU resource is the CPU in head node), we achieved consistent performance improvements across metrics: TTFT (Time to First Token), TPOT (Time Per Output Token), and overall throughput under load. + +## Introduction + +The SGLang community has already demonstrated the necessity and benefits of EPD disaggregation to VLM serving ^{[1]}. It shows that EPD can significantly reduce TTFT in image-heavy scenarios where multi-images are fed into the service. With the observation that vision encoding is the primary computational bottleneck in image-heavy scenarios, we see offloading some vision encoding work to the head node CPU can help improve performance: + +- Vision encoder (CNN/ViT) is usually smaller than language model part, which makes modern CPUs equipped with advanced matrix accelerators (e.g. AMX in Intel Xeon CPUs) be able to help. +- Vision encoding only happens during prefill, which makes it easy to plug in a heterogenous worker, without the needing of continuous cross-worker state management + +## Device-Aware Weighted Router + +By collaborating with Dynamo community, we merged a new device-aware weighted router mode into Dynamic router to support heterogeneous dispatching(PR [#7215](https://github.com/ai-dynamo/dynamo/pull/7215)). It introduces a budget-based throttle between devices (specifically CPU vs. GPU). + +In a heterogeneous deployment environment where computing capabilities vary (e.g., a GPU vs. a CPU), the device-aware weighted router uses a Capability Ratio $R$ to define the relative throughput of the GPU against CPU. The router calculates an Allowed CPU In-flight Budget ($B_{cpu}$). This budget represents the maximum number of requests the CPU pool should handle to stay "in sync" with the current pressure on the GPU pool: + +$$B_{cpu} = \frac{I_{gpu}N_{cpu}}{RN_{gpu}}$$ + +Here, $I_{gpu}$ is total in-flight requests across all GPU instances, $N_{cpu}$ and $N_{gpu}$ is the count of GPU and CPU instances, respectively. + +The routing decision is straightforward, when $I_{cpu}$ (total in-flight requests across all CPU instances) is less than $B_{cpu}$, it means CPU pool is under-utilized relative to its normalized capacity, so route to CPU pool; else, route to GPU pool. + + + +*Figure 1. device-aware weighted router* + +## Experiment Setup + +### Use Case Configuration + +**Environment:** +- Intel(R) Xeon(R) 6747P CPUs (2 sockets, 2 NUMA nodes per socket, in total 4 NUMA nodes) $^{[2]}$ +- 5x L40S CUDA GPUs $^{[3]}$ + +**Model:** +- Qwen3-VL-8B-Instruct + +**Dataset:** +- ISL/OSL: 128/256 +- Image Resolution: 1080p +- Image count: 8 +- QPS Range: 1.0, 1.2, 1.5, 2.0 + +**Deployment Configurations:** +- 1E/4PD (Encoder: 1 GPU Encoder, PD: 4 GPU) +- (4 CPU + 1 GPU) E/4PD (Encoder: 4 CPU + 1 GPU, PD: 4 GPU), Capability Ratio $R$ is set to be 12 + +### Use Case Launch Scripts + +**Launch vision encoder instances:** + +```shell +# launch cuda encoder +CUDA_VISIBLE_DEVICES=0 numactl --cpunodebind=0 --membind=0 python -m dynamo.sglang --multimodal-encode-worker --model-path "$MODEL_NAME" --chat-template "$CHAT_TEMPLATE" --embedding-transfer-mode nixl-read & + +# launch cpu encoders, DYN_ENCODER_CUDA_TO_CPU_RATIO is 12 in this case +for node in 0 1 2 3; do 12 + case "$node" in + 0) cpus="$(printf "%s\n%s\n" "$(seq 0 2 46)" "$(seq 96 2 142)" | paste -sd, -)" ;; + 1) cpus="$(printf "%s\n%s\n" "$(seq 48 2 94)" "$(seq 144 2 190)" | paste -sd, -)" ;; + 2) cpus="$(printf "%s\n%s\n" "$(seq 1 2 47)" "$(seq 97 2 143)" | paste -sd, -)" ;; + 3) cpus="$(printf "%s\n%s\n" "$(seq 49 2 95)" "$(seq 145 2 191)" | paste -sd, -)" ;; + esac + + CUDA_VISIBLE_DEVICES="" \ + SGLANG_USE_CPU_ENGINE=1 \ + SGLANG_CPU_OMP_THREADS_BIND="$cpus" \ + numactl --cpunodebind="$node" --membind="$node" \ + python -m dynamo.sglang \ + --multimodal-encode-worker \ + --model-path "$MODEL_NAME" \ + --chat-template "$CHAT_TEMPLATE" \ + --embedding-transfer-mode nixl-read & +done +``` + +**Launch the PD instance:** + +```shell +# launch PD instances +for gpu in 2 3 4 5; do + if [[ "$gpu" -lt 4 ]]; then + numa_node=0 + else + numa_node=1 + fi + + CUDA_VISIBLE_DEVICES="$gpu" \ + numactl --cpunodebind="$numa_node" --membind="$numa_node" \ + python3 -m dynamo.sglang \ + --multimodal-worker \ + --model-path "$MODEL_NAME" \ + --page-size 16 \ + --tp 1 \ + --prefill-max-requests 1 \ + --log-level debug \ + --trust-remote-code \ + --skip-tokenizer-init \ + --disable-radix-cache \ + --embedding-transfer-mode nixl-read \ + --disaggregation-transfer-backend nixl & +done +``` + +**Launch router:** + +```shell +DYN_ENCODER_CUDA_TO_CPU_RATIO=12 python3 -m dynamo.frontend --router-mode device-aware-weighted +``` + +## Benchmark + +### Benchmark Script + +```shell +python -m sglang.bench_serving.py --model Qwen/Qwen3-VL-8B-Instruct --num-prompts 32 --dataset-name image --random-input-len 128 --random-output-len 256 --image-count 8 --image-resolution 1080p --host localhost --port 8000 --backend sglang-oai-chat --request-rate $QPS +``` + +### Benchmark Results + +#### P99 TTFT + + + +#### P99 TPOT + + + +#### Request Throughput + + + +**Key Findings:** + +- Heterogeneous CPU + GPU EPD disaggregation brings consistent better performance over pure GPU EPD disaggregation across all metrics (TTFT, TPOT, request throughput) under load (QPS between 1 and 2). +- ~1.2x-1.3x P99 TTFT and request throughput improvement can be observed, indicating CPU helps offloading the vision encoder burden of GPU under load. +- Significant ~1.3x-30x P99 TPOT reduction is achieved by relieving vision encoding traffic and mitigating the 2+ token generation queueing time. + +Heterogeneous CPU + GPU EPD disaggregation achieves an extra higher return on investment (ROI) in addition to the ROI brought by the pure GPU EPD disaggregation $^{[1]}$ almost for free. This is achieved by the system-level optimization which includes the AMX powered CPU into the solution space with a whole system view. + +## Reference +1. [EPD Disaggregation: Elastic Encoder Scaling for Vision-Language Models in SGLang](https://www.lmsys.org/blog/2026-01-12-epd/) +2. [Intel(R) Xeon(R) 6747P CPU](https://www.intel.com/content/www/us/en/products/sku/241825/intel-xeon-6747p-processor-288m-cache-2-70-ghz/specifications.html) +3. [NVIDIA L40S](https://www.nvidia.com/en-us/data-center/l40s/) diff --git a/public/images/blog/hetero-epd/1.png b/public/images/blog/hetero-epd/1.png new file mode 100644 index 000000000..20dacabb6 Binary files /dev/null and b/public/images/blog/hetero-epd/1.png differ diff --git a/public/images/blog/hetero-epd/2.png b/public/images/blog/hetero-epd/2.png new file mode 100644 index 000000000..d66f780a4 Binary files /dev/null and b/public/images/blog/hetero-epd/2.png differ diff --git a/public/images/blog/hetero-epd/3.png b/public/images/blog/hetero-epd/3.png new file mode 100644 index 000000000..9b66ef752 Binary files /dev/null and b/public/images/blog/hetero-epd/3.png differ diff --git a/public/images/blog/hetero-epd/4.png b/public/images/blog/hetero-epd/4.png new file mode 100644 index 000000000..9dcbe6f44 Binary files /dev/null and b/public/images/blog/hetero-epd/4.png differ