Commit 78f5f46

Update the image layout to improve the readability (#318)
1 parent a8608d5 commit 78f5f46

3 files changed


File tree

blog/2026-02-11-Qwen-latency.md

Lines changed: 32 additions & 62 deletions
@@ -12,12 +12,8 @@ Qwen is a series of large-scale, high-performance Large Language Models (LLMs) d


In recent months, the Qwen C-end Infrastructure Engineering Team and the AMD AI Framework Team have collaborated to implement extreme latency optimization solutions for Qwen3-235B and Qwen3-VL-235B on the AMD Instinct<sup>TM</sup> MI300X series GPU platform based on the SGLang framework. Remarkable breakthroughs have been achieved in terms of performance, precision, and stability.
-
-
-- For Qwen3-235B: Compared with the baseline, the Time to First Token (TTFT) has been improved by 1.67×, and the Time Per Output Token (TPOT) has been improved by 2.12×.
-
-
-- For Qwen3-VL-235B: Compared with the baseline, the Time to First Token (TTFT) has been improved by 1.62×, and the Time Per Output Token (TPOT) has been improved by 1.90×.
+* **For Qwen3-235B**: Compared with the baseline, the Time to First Token (TTFT) has been improved by 1.67×, and the Time Per Output Token (TPOT) has been improved by 2.12×.
+* **For Qwen3-VL-235B**: Compared with the baseline, the Time to First Token (TTFT) has been improved by 1.62×, and the Time Per Output Token (TPOT) has been improved by 1.90×.


The AMD Instinct<sup>TM</sup> MI300X series GPUs are built on the CDNA<sup>TM</sup> 3 architecture, featuring 192 GB of HBM3 memory per card—sufficient to support inference for models with over 70 billion parameters. Combined with a 5.3 TB/s memory bandwidth, 256 MB Infinity Cache, and native Matrix Core support for FP8 and PTPC quantization, the platform delivers exceptional performance and cost-efficiency, making it an ideal choice for large-scale LLM cluster deployment.
@@ -49,7 +45,7 @@ The inference computation flow of Qwen3-235B is illustrated in Figure 2. The fol
#### 2.1.1 GEMM Quantization Strategy

<p align="center">
-<img src="/images/blog/qwen_amd_latency/PTPC.png" width="40%">
+<img src="/images/blog/qwen_amd_latency/PTPC.png" style="display: block; margin: 20px auto 0; width: 40%; max-width: 100%; height: auto;">
</p>
<p align="center" style="color:gray; text-align: center;"><em>Figure 3. PTPC-FP8: Per-Token-Activation, Per-Channel-Weight Quantization</em></p>
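
Figure 3 describes the PTPC-FP8 recipe: one FP8 scale per activation token and one per weight output channel. A minimal PyTorch sketch of that scaling math is shown below; it is only an illustration (float8_e4m3fn is assumed as the FP8 format here), not the AITER GEMM kernel used in the deployment.

```python
import torch

def ptpc_fp8_quantize_gemm(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Per-Token-Activation, Per-Channel-Weight FP8 GEMM (reference math only).

    x: [num_tokens, in_features] activations
    w: [out_features, in_features] weights
    """
    fp8 = torch.float8_e4m3fn          # assumed FP8 format for this sketch
    fp8_max = torch.finfo(fp8).max

    # One scale per token (row of x).
    x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / fp8_max
    x_fp8 = (x / x_scale).to(fp8)

    # One scale per output channel (row of w).
    w_scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / fp8_max
    w_fp8 = (w / w_scale).to(fp8)

    # A real kernel runs the GEMM in FP8; here we dequantize for clarity and
    # rescale by the outer product of the per-token and per-channel scales.
    return (x_fp8.float() @ w_fp8.float().t()) * (x_scale * w_scale.t())
```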

@@ -91,26 +87,21 @@ In low-concurrency, extreme-latency-critical scenarios, TP8 distributes model we
For the Attention module, we integrate high-performance MHA and PagedAttention operators from AMD’s AITER Library, which are customized for a specialized KV Cache layout. The layout is defined as:


-- k_cache: [num_blocks, num_kv_heads, head_dim // x, block_size, x]
-- v_cache: [num_blocks, num_kv_heads, block_size // X, head_dim, X]
+* k_cache: [num_blocks, num_kv_heads, head_dim // x, block_size, x]
+* v_cache: [num_blocks, num_kv_heads, block_size // X, head_dim, X]


This layout aligns memory access patterns with the AMD CDNA<sup>TM</sup> 3 architecture, drastically improving the memory efficiency of PagedAttention. During the decode phase, no additional device-to-device (D2D) copies are required for layout conversion, thus eliminating redundant overhead (Figure 5). Compared to the standard KV Cache layout [num_blocks, num_kv_heads, head_dim, block_size], this optimization improves decode throughput by 15%–20% while reducing inference latency.

<p align="center">
-<img src="/images/blog/qwen_amd_latency/K_Cache_Layout.png" width="30%">
+<img src="/images/blog/qwen_amd_latency/K_Cache_Layout.png" style="display: block; margin: 20px auto 0; width: 40%; max-width: 100%; height: auto;">
</p>
<p align="center" style="color:gray; text-align: center;"><em>Figure 5. K Cache Layout Distribution</em></p>


**(2) DataType Optimization**
-
-
-- In the **prefill** phase: per-tensor FP8 quantization is applied to query, key, and value activations for MHA.
-
-
-- In the **decode** phase: query uses BF16, while KV Cache remains stored in per-tensor FP8 (consistent with prefill).
-
+* In the **prefill** phase: per-tensor FP8 quantization is applied to query, key, and value activations for MHA.
+* In the **decode** phase: query uses BF16, while KV Cache remains stored in per-tensor FP8 (consistent with prefill).

This mixed precision configuration reduces HBM usage while maintaining accuracy and performance.
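
For concreteness, the sketch below allocates KV Cache buffers in the AITER-style layout listed in the hunk above, with the FP8 storage described under DataType Optimization. The packing factor x is not specified in the post; 16 FP8 elements per packed vector is assumed here purely for illustration, as are the other sizes.

```python
import torch

# Illustrative shapes; real values come from the model and server configuration.
num_blocks, num_kv_heads, head_dim, block_size = 1024, 8, 128, 16
x = 16  # assumed packing factor (elements per contiguous vector)

# AITER-friendly layout from the post, stored in per-tensor FP8:
k_cache = torch.empty(num_blocks, num_kv_heads, head_dim // x, block_size, x,
                      dtype=torch.float8_e4m3fn, device="cuda")
v_cache = torch.empty(num_blocks, num_kv_heads, block_size // x, head_dim, x,
                      dtype=torch.float8_e4m3fn, device="cuda")

# Standard layout the post compares against:
k_cache_std = torch.empty(num_blocks, num_kv_heads, head_dim, block_size,
                          dtype=torch.float8_e4m3fn, device="cuda")

# Per the DataType section: decode-phase queries stay in BF16 while the
# caches above remain in FP8.
```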

@@ -119,25 +110,19 @@ This mixed precision configuration reduces HBM usage while maintaining accuracy


For low-concurrency workloads, we have deeply optimized MoE operators in AITER across four key dimensions:
-
-- **Load Balancing**: Fine-grained task scheduling for Compute Units (CUs) during low-concurrency inference enables near-synchronized execution, eliminating idle cycles and maximizing hardware utilization.
-
-- **Compute Efficiency**: Hardware-aware loop tuning on the K dimension eliminates redundant operations and significantly improves throughput.
-
-- **Memory Efficiency**: Optimized atomic memory access patterns enhance L2 cache hit rates and alleviate memory bandwidth bottlenecks.
-
-- **Auto-tuning**: Following manual optimizations, automated tuning tools search for optimal operator configurations to further maximize performance.
+* **Load Balancing**: Fine-grained task scheduling for Compute Units (CUs) during low-concurrency inference enables near-synchronized execution, eliminating idle cycles and maximizing hardware utilization.
+* **Compute Efficiency**: Hardware-aware loop tuning on the K dimension eliminates redundant operations and significantly improves throughput.
+* **Memory Efficiency**: Optimized atomic memory access patterns enhance L2 cache hit rates and alleviate memory bandwidth bottlenecks.
+* **Auto-tuning**: Following manual optimizations, automated tuning tools search for optimal operator configurations to further maximize performance.

Notably, load balancing and fine-grained scheduling yield particularly strong performance gains during LLM decoding, ultimately **improving MoE module performance by 2×**.


#### 2.1.5 Kernel Fusion Optimization

We also fused several critical operators, including:
-
-- Module 2: QKNorm + RoPE
-
-- Modules 6 & 9: AllReduce + AddRMSNorm + per-token quant
+* Module 2: QKNorm + RoPE
+* Modules 6 & 9: AllReduce + AddRMSNorm + per-token quant

Operator fusion reduces frequent HBM access and further lowers end-to-end inference latency.
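
For reference, "Module 2: QKNorm + RoPE" covers the per-head RMS normalization of Q/K followed by rotary position embedding. A plain, unfused PyTorch version of that sequence is sketched below (one common RoPE formulation; the shapes and names are assumptions, and this is not the fused AITER kernel). Each step is a separate memory-bound pass over Q and K, which is the HBM traffic that fusion removes.

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Per-head RMSNorm over the last (head_dim) dimension (QKNorm).
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Rotary embedding, "split halves" formulation; cos/sin: [seq_len, head_dim // 2].
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)

def qknorm_rope_unfused(q, k, q_weight, k_weight, cos, sin):
    # Unfused baseline: four separate elementwise kernels, each round-tripping HBM.
    q = rms_norm(q, q_weight)
    k = rms_norm(k, k_weight)
    return apply_rope(q, cos, sin), apply_rope(k, cos, sin)
```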

@@ -150,24 +135,19 @@ Operator fusion reduces frequent HBM access and further lowers end-to-end inferenc
### 2.2 Optimization for Qwen3-VL-235B

<p align="center">
-<img src="/images/blog/qwen_amd_latency/qwenvl_deployment.png" width="40%">
+<img src="/images/blog/qwen_amd_latency/qwenvl_deployment.png" style="display: block; margin: 20px auto 0; width: 60%; max-width: 100%; height: auto;">
</p>
<p align="center" style="color:gray; text-align: center;"><em>Figure 6. Qwen3-VL-235B deployment in SGLang</em></p>


Compared to Qwen3‑235B, Qwen3‑VL‑235B introduces several new critical inference stages:
-
-- Multimodal data format adaptation, preprocessing, and cross‑modal alignment
-
-- ViT encoder execution, visual patch embedding, and cross‑modal feature fusion
+* Multimodal data format adaptation, preprocessing, and cross‑modal alignment
+* ViT encoder execution, visual patch embedding, and cross‑modal feature fusion

These extensions lengthen the inference pipeline and involve complex cross‑modal data coordination and feature adaptation, significantly increasing per‑request latency. The full dataflow is shown in Figure 6. Relative to pure language LLMs, Qwen3‑VL’s major overhead comes from three sources:
-
-- Host‑side multimodal preprocessing
-
-- Multimodal data transfer
-
-- GPU‑side ViT encoder computation
+* Host‑side multimodal preprocessing
+* Multimodal data transfer
+* GPU‑side ViT encoder computation


We designed targeted latency optimizations for each bottleneck.
@@ -200,7 +180,7 @@ In SGLang, the Tokenizer and Scheduler typically run in separate processes. Prep
The ROCm<sup>TM</sup> backend supports **CUDA IPC**, enabling direct GPU-to-GPU data transfer without CPU intermediation. This eliminates redundant CPU-GPU copies and drastically reduces multimodal transfer latency, as shown in Figure 9. Additionally, we offload image hashing (Figure 6) to the GPU, further compressing overhead.

<p align="center">
-<img src="/images/blog/qwen_amd_latency/cuda_ipc.png" width="40%">
+<img src="/images/blog/qwen_amd_latency/cuda_ipc.png" style="display: block; margin: 20px auto 0; width: 50%; max-width: 100%; height: auto;">
</p>
<p align="center" style="color:gray; text-align: center;"><em>Figure 9. CUDA IPC on ROCm backend
</em></p>
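
The transport described above hands GPU-resident multimodal buffers between processes through IPC handles instead of bouncing them through host memory. As a generic illustration of that mechanism (not SGLang's internal transport), PyTorch's multiprocessing already shares CUDA tensors between processes via such IPC handles, which also applies on the ROCm backend:

```python
import torch
import torch.multiprocessing as mp

def consumer(queue):
    # The tensor arrives as a reference to the producer's GPU memory
    # (shared via an IPC handle); the pixel data is not copied through the CPU.
    pixel_values = queue.get()
    print(pixel_values.device, pixel_values.shape)

if __name__ == "__main__":
    mp.set_start_method("spawn")  # required when sharing GPU tensors across processes
    queue = mp.Queue()
    proc = mp.Process(target=consumer, args=(queue,))
    proc.start()

    # Producer side: preprocessed image features already resident on the GPU.
    pixel_values = torch.randn(5, 3, 960, 1280, device="cuda")
    queue.put(pixel_values)
    proc.join()  # keep the producer (and its GPU allocation) alive until the consumer is done
```
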
@@ -210,13 +190,9 @@ The ROCm<sup>TM</sup> backend supports **CUDA IPC**, enabling direct GPU-to-GPU


The Vision Transformer (ViT) module performs visual feature encoding from images and videos. For high-resolution inputs, however, it becomes a severe compute-bound bottleneck due to patch-based tokenization:
-
-
-- Inputs are split into fixed patches (e.g., 16×16)
-
-- Sequence length grows quadratically with resolution
-
-- Full self-attention has complexity O(N<sup>2</sup>)
+* Inputs are split into fixed patches (e.g., 16×16)
+* Sequence length grows quadratically with resolution
+* Full self-attention has complexity O(N<sup>2</sup>)

A 960×1280 image generates approximately 4,800 tokens, resulting in over 23 million attention interactions. In extreme scenarios involving large batches of high-resolution images or long videos, token counts can surpass 1M, pushing attention complexity to O(10<sup>12</sup>). This leads to explosive memory consumption, extreme latency, and low hardware utilization.
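
A quick back-of-the-envelope check of those figures, using the 960×1280 resolution and the 16×16 patch size mentioned in the text:

```python
patch = 16                      # patch size from the text (16×16)
h, w = 960, 1280                # per-image resolution used in the benchmarks

tokens = (h // patch) * (w // patch)   # 60 * 80 = 4,800 visual tokens
pairs = tokens ** 2                    # 23,040,000 -> "over 23 million" interactions
print(tokens, pairs)

# With >1M total tokens (large batches / long video), full self-attention
# reaches on the order of 1e12 interactions.
print(1_000_000 ** 2)                  # 1,000,000,000,000
```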

@@ -245,22 +221,16 @@ We use PTPC-FP8 quantization recipe and the corresponding model weights are avai


These optimizations target **low-latency inference scenarios**, with evaluation settings as follows:
-
-
-- **Qwen3‑235B:** Single request, Input Sequence Length (ISL) = 8000, Output Sequence Length (OSL) = 500.
-
-
-- **Qwen3‑VL‑235B:** Single request, text ISL = 8000, 5 images (960×1280) per request, OSL = 500.
+* **Qwen3‑235B:** Single request, Input Sequence Length (ISL) = 8000, Output Sequence Length (OSL) = 500.
+* **Qwen3‑VL‑235B:** Single request, text ISL = 8000, 5 images (960×1280) per request, OSL = 500.


### 3.2 CUDA IPC Configuration


To enable **GPU direct IPC** for efficient multimodal data transfer, users can set the following environment variables. The variable values can be adjusted for different scenarios.
-
-- export SGLANG_USE_CUDA_IPC_TRANSPORT=1
-
-- export SGLANG_VLM_CACHE_SIZE_MB=8192
+* export SGLANG_USE_CUDA_IPC_TRANSPORT=1
+* export SGLANG_VLM_CACHE_SIZE_MB=8192

Experimental results show that, for 5 images of 960×1280 resolution, enabling CUDA IPC yields a **significant reduction in data transfer latency**, with a peak reduction of up to 2 seconds compared with gloo:broadcast.
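
Equivalent to the export commands above, the variables can also be set from Python, assuming they are set in the environment of the process that launches the SGLang server before it starts; the values below are the ones shown in the hunk and can be tuned per workload:

```python
import os

# Enable the GPU-direct IPC transport for multimodal data.
os.environ["SGLANG_USE_CUDA_IPC_TRANSPORT"] = "1"
# Multimodal (VLM) cache size in MB; adjust for the expected image/video load.
os.environ["SGLANG_VLM_CACHE_SIZE_MB"] = "8192"
```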

@@ -286,7 +256,7 @@ For Qwen3-VL-235B, the performance optimization results are shown in Figure 11.
</em></p>

## 4. References
-- [Qwen3: Think Deeper, Act Faster](https://qwen.ai/blog?id=qwen3)
-- [AITER](https://github.com/ROCm/aiter)
-- [SGLang Documentation](https://docs.sglang.io/)
-- [rocJPEG](https://github.com/ROCm/rocJPEG)
+* [Qwen3: Think Deeper, Act Faster](https://qwen.ai/blog?id=qwen3)
+* [AITER](https://github.com/ROCm/aiter)
+* [SGLang Documentation](https://docs.sglang.io/)
+* [rocJPEG](https://github.com/ROCm/rocJPEG)
2 binary image files changed (-274 KB and 589 KB; not rendered)
