In recent months, the Qwen C-end Infrastructure Engineering Team and the AMD AI Framework Team have collaborated to implement extreme latency optimization solutions for Qwen3-235B and Qwen3-VL-235B on the AMD Instinct<sup>TM</sup> MI300X series GPU platform based on the SGLang framework. Remarkable breakthroughs have been achieved in terms of performance, precision, and stability.

* **For Qwen3-235B**: Compared with the baseline, the Time to First Token (TTFT) improved by 1.67× and the Time Per Output Token (TPOT) improved by 2.12×.
* **For Qwen3-VL-235B**: Compared with the baseline, the Time to First Token (TTFT) improved by 1.62× and the Time Per Output Token (TPOT) improved by 1.90×.

The AMD Instinct<sup>TM</sup> MI300X series GPUs are built on the CDNA<sup>TM</sup> 3 architecture, featuring 192 GB of HBM3 memory per card—sufficient to support inference for models with over 70 billion parameters. Combined with a 5.3 TB/s memory bandwidth, 256 MB Infinity Cache, and native Matrix Core support for FP8 and PTPC quantization, the platform delivers exceptional performance and cost-efficiency, making it an ideal choice for large-scale LLM cluster deployment.
For the Attention module, we integrate high-performance MHA and PagedAttention operators from AMD's AITER Library, which are customized for a specialized KV Cache layout. This layout aligns memory access patterns with the AMD CDNA<sup>TM</sup> 3 architecture, drastically improving the memory efficiency of PagedAttention. During the decode phase, no additional device-to-device (D2D) copies are required for layout conversion, eliminating redundant overhead (Figure 5). Compared to the standard KV Cache layout [num_blocks, num_kv_heads, head_dim, block_size], this optimization improves decode throughput by 15%–20% while reducing inference latency.
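
The exact AITER layout is not reproduced in this excerpt, but the effect of reordering KV Cache dimensions is easy to see in plain PyTorch. The sketch below is illustrative only: it contrasts the standard layout quoted above with a hypothetical token-major variant, showing how the dimension order decides whether a single token's head vector is contiguous in memory (and therefore coalescible on read).

```python
import torch

# Standard KV Cache layout from the text: [num_blocks, num_kv_heads, head_dim, block_size].
num_blocks, num_kv_heads, head_dim, block_size = 1024, 4, 128, 16
standard = torch.empty(num_blocks, num_kv_heads, head_dim, block_size)

# One token's head vector is strided: its head_dim elements sit block_size apart.
print(standard[0, 0, :, 5].is_contiguous())     # False

# Hypothetical token-major layout: [num_blocks, num_kv_heads, block_size, head_dim].
token_major = standard.permute(0, 1, 3, 2).contiguous()
print(token_major[0, 0, 5, :].is_contiguous())  # True -- unit-stride access per token
```
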
<p align="center" style="color:gray; text-align: center;"><em>Figure 6. Qwen3-VL-235B deployment in SGLang</em></p>
Compared to Qwen3‑235B, Qwen3‑VL‑235B introduces several new critical inference stages:

* Multimodal data format adaptation, preprocessing, and cross‑modal alignment
* ViT encoder execution, visual patch embedding, and cross‑modal feature fusion

These extensions lengthen the inference pipeline and involve complex cross‑modal data coordination and feature adaptation, significantly increasing per‑request latency. The full dataflow is shown in Figure 6. Relative to pure language LLMs, Qwen3‑VL’s major overhead comes from three sources:

* Host‑side multimodal preprocessing
* Multimodal data transfer
* GPU‑side ViT encoder computation

We designed targeted latency optimizations for each bottleneck.

In SGLang, the Tokenizer and Scheduler typically run in separate processes.
The ROCm<sup>TM</sup> backend supports **CUDA IPC**, enabling direct GPU-to-GPU data transfer without CPU intermediation. This eliminates redundant CPU-GPU copies and drastically reduces multimodal transfer latency, as shown in Figure 9. Additionally, we offload image hashing (Figure 6) to the GPU, further compressing this overhead.

<p align="center" style="color:gray; text-align: center;"><em>Figure 9. CUDA IPC on ROCm backend</em></p>
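
SGLang's internal transport is not shown in this excerpt, but the underlying mechanism can be sketched with stock PyTorch, which already moves CUDA tensors between processes via IPC handles when they are sent through a `torch.multiprocessing` queue; on ROCm the same API is backed by HIP IPC. The tensor shape and names below are illustrative.

```python
import torch
import torch.multiprocessing as mp

def consumer(queue):
    # The tensor arrives as an IPC handle and is mapped directly into this
    # process's address space -- no staging copy through host memory.
    pixels = queue.get()
    print(pixels.device, pixels.shape)

if __name__ == "__main__":
    mp.set_start_method("spawn")
    q = mp.Queue()
    # e.g. a preprocessed 960x1280 image tensor already resident on the GPU
    pixels = torch.randn(3, 1280, 960, device="cuda")
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    q.put(pixels)  # sends an IPC handle, not the data itself
    p.join()       # keep `pixels` alive until the consumer is done with it
```
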
The Vision Transformer (ViT) module performs visual feature encoding from images and videos. For high-resolution inputs, however, it becomes a severe compute-bound bottleneck due to patch-based tokenization:

* Inputs are split into fixed patches (e.g., 16×16)
* Sequence length grows quadratically with resolution
* Full self-attention has complexity O(N<sup>2</sup>)

A 960×1280 image generates approximately 4,800 tokens, resulting in over 23 million attention interactions. In extreme scenarios involving large batches of high-resolution images or long videos, token counts can surpass 1M, pushing attention complexity to O(10<sup>12</sup>). This leads to explosive memory consumption, extreme latency, and low hardware utilization.
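
The arithmetic above is easy to reproduce; a quick sketch, assuming the 16×16 patch size from the list above and no additional token merging:

```python
def vit_tokens(height, width, patch=16):
    return (height // patch) * (width // patch)

n = vit_tokens(960, 1280)  # 60 * 80 = 4,800 tokens per image
print(n, n * n)            # 4800 tokens -> 23,040,000 pairwise attention interactions
print(5 * n)               # the 5-image evaluation setting: 24,000 visual tokens
# At ~1M tokens (large batches or long videos), N^2 reaches O(10^12), as stated above.
```
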
We use the PTPC-FP8 quantization recipe, and the corresponding model weights are available.
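
The recipe itself is not detailed in this excerpt; the sketch below only illustrates the general PTPC idea (per-token scales for activations, per-channel scales for weights) with a plain dequantized matmul. It assumes a PyTorch build with `torch.float8_e4m3fn`; real kernels instead fuse the scales into the FP8 GEMM epilogue.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quant_rowwise(t):
    # One scale per row: per-token for activations [tokens, hidden],
    # per-output-channel for weights [out_features, in_features].
    scale = t.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / FP8_MAX
    return (t / scale).to(torch.float8_e4m3fn), scale

x, w = torch.randn(4, 64), torch.randn(128, 64)
xq, xs = quant_rowwise(x)  # per-token
wq, ws = quant_rowwise(w)  # per-channel
y = (xq.float() * xs) @ (wq.float() * ws).T
print((y - x @ w.T).abs().max())  # small residual quantization error
```
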
These optimizations target **low-latency inference scenarios**, with evaluation settings as follows:

* **Qwen3‑VL‑235B:** Single request, text ISL = 8000, 5 images (960×1280) per request, OSL = 500.
### 3.2 CUDA IPC Configuration
To enable **GPU-direct IPC** for efficient multimodal data transfer, users can set the following environment variables; the values can be adjusted to suit different scenarios.

```bash
export SGLANG_USE_CUDA_IPC_TRANSPORT=1
export SGLANG_VLM_CACHE_SIZE_MB=8192
```
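
With the variables exported, the server is launched as usual. The invocation below is only a sketch: the model path, TP degree, and port are assumptions, not taken from this post.

```bash
# Hypothetical launch; adjust the model path and flags to your deployment.
python3 -m sglang.launch_server \
  --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
  --tp 8 --port 30000
```
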
Experimental results show that, for 5 images of 960×1280 resolution, enabling CUDA IPC yields a **significant reduction in data transfer latency**, up to 2 seconds faster than gloo broadcast.

For Qwen3-VL-235B, the performance optimization results are shown in Figure 11.