Commit 8b6d465

Added Russian Roulette to RayTracingInVulkan for a more fair comparison
1 parent f682a59 commit 8b6d465

1 file changed

Lines changed: 83 additions & 35 deletions

RTIOW.html
@@ -69,10 +69,9 @@
6969

7070
<article>
7171
<div class="collapsible">
72-
<h1>CUDA Ray Tracing 3.6x Faster Than RTX: My CUDA Ray Tracing Journey</h1>
72+
<h1>CUDA Ray Tracing 2x Faster Than RTX: My CUDA Ray Tracing Journey</h1>
7373
<!-- <p><em>Note: This is a draft version. Final edits are still in progress. Feedback is welcome while final
7474
edits are underway.</em></p> -->
75-
7675
</div>
7776

7877
<img class="photo" src="images/RTIOW/2560x1440_50depth_3000samples_3400ms.png">
@@ -99,6 +98,23 @@ <h2>Introduction</h2>
9998
enthusiast, or just curious about real-world GPU optimization, I hope you'll find something useful
10099
here.
101100
</p>
101+
102+
<div class="gotcha-card pro-tip">
103+
<div class="gotcha-marker pro-tip-marker"></div>
104+
<div class="gotcha-content">
105+
<h4>Note</h4>
106+
<p>
107+
The original title claimed a 3.6x speedup, which was true at the time of writing,
108+
but after
109+
realizing
110+
I had forgotten to add Russian Roulette to RayTracingInVulkan, the performance difference shrank to
111+
2x.
112+
That is still very significant, and the comparison is fairer now.
113+
</p>
114+
</div>
115+
</div>
116+
117+
102118
<div class="perf-table-container">
103119
<div class="perf-table-container">
104120
<table class="perf-table">
@@ -120,10 +136,20 @@ <h2>Introduction</h2>
120136
<td class="spec-value">Vulkan</td>
121137
<td class="spec-value">RTX acceleration</td>
122138
<td class="spec-value">Procedural sphere tracing + triangle modes</td>
123-
<td class="spec-value fps-highlight">~33 ms</td>
124-
<td class="spec-value fps-highlight">~30 FPS</td>
139+
<td class="spec-value fps-highlight"><s>~33 ms</s>
140+
<br>
141+
~20 ms
142+
143+
</td>
144+
145+
<td class="spec-value fps-highlight">
146+
<s>~30 FPS</s>
147+
<br>
148+
~50 FPS
149+
</td>
125150
<td class="spec-value">
126151
<ul>
152+
<li>Added Russian Roulette for a fair comparison</li>
127153
<li>No acceleration structure compaction</li>
128154
<li>Using procedural AABBs per sphere</li>
129155
<li>Using ray tracing pipeline (no inline ray tracing)</li>
@@ -162,20 +188,26 @@ <h2>Introduction</h2>
162188
</p>
163189
<blockquote class="quote">
164190
“My suspicion is that procedural spheres are relatively cheap to compute (both the ray
165-
intersection and shading), leaving the compute units mostly idling while the RT units are fully
166-
utilized doing BVH traversal. Thus the performance in this case is entirely limited by the RT
191+
intersection and shading), leaving the compute units mostly idling while the RT units are
192+
fully
193+
utilized doing BVH traversal. Thus the performance in this case is entirely limited by the
194+
RT
167195
units.
168196
<br>
169197
<br>
170-
Interestingly, this article (and the Radeon RX 6900 XT results in RayTracingInVulkan procedural
198+
Interestingly, this article (and the Radeon RX 6900 XT results in RayTracingInVulkan
199+
procedural
171200
benchmarks, a GPU where the BVH traversal is handled by the compute units rather than its RT
172-
units) tend to support the idea that doing the entire BVH traversal using only the compute units
173-
is faster than delegating to the RT units. At least on the GeForce 3000 series and the Radeon RX
201+
units) tend to support the idea that doing the entire BVH traversal using only the compute
202+
units
203+
is faster than delegating to the RT units. At least on the GeForce 3000 series and the
204+
Radeon RX
174205
6000 series, that is.
175206
<br>
176207
<br>
177208

178-
In practice, the test scene is an unlikely scenario in gaming. In a modern AAA game, the compute
209+
In practice, the test scene is an unlikely scenario in gaming. In a modern AAA game, the
210+
compute
179211
cores will be actively used for shading and rendering the game, leaving little room on those
180212
units for doing the BVH traversal, while most (all?) of the ray intersections will be done
181213
against triangles (a task at which RT units excel, especially on later generation GPUs).”
@@ -189,7 +221,8 @@ <h2>Introduction</h2>
189221

190222

191223
<p>
192-
Supporting this theory, <strong>RayTracingInVulkan</strong> consistently benchmarks better on
224+
Supporting this theory, <strong>RayTracingInVulkan</strong> consistently benchmarks better
225+
on
193226
AMD
194227
cards, such as the Radeon RX 6900 XT, which perform BVH traversal using compute units rather
195228
than
@@ -205,7 +238,8 @@ <h2>Introduction</h2>
205238
with triangle geometry—not procedural primitives like spheres or AABBs:
206239
</p>
207240
<blockquote class="quote">
208-
“Use triangles over AABBs. RTX GPUs excel in accelerating traversal of AS created from triangle
241+
“Use triangles over AABBs. RTX GPUs excel in accelerating traversal of AS created from
242+
triangle
209243
geometry.”
210244
<br>
211245
<span class="quote-author">
@@ -214,8 +248,10 @@ <h2>Introduction</h2>
214248
</span>
215249
</blockquote>
216250
<p>
217-
Of course, this is a synthetic scenario. In a typical AAA game, compute cores are heavily loaded
218-
with shading and post-processing tasks, and most ray intersections are against triangles—a case
251+
Of course, this is a synthetic scenario. In a typical AAA game, compute cores are heavily
252+
loaded
253+
with shading and post-processing tasks, and most ray intersections are against triangles—a
254+
case
219255
where RT cores excel, especially on newer generations of GPUs.
220256
</p>
221257

@@ -224,7 +260,8 @@ <h2>Introduction</h2>
224260
hardware
225261
RT pipeline often incurs more overhead than inline ray tracing (Ray query). It tends to make
226262
heavy use of VRAM bandwidth by moving payload
227-
data around between shader stages. On the other hand, inline ray tracing can keep most of the
263+
data around between shader stages. On the other hand, inline ray tracing can keep most of
264+
the
228265
data
229266
in registers, which is exactly what's happening in my implementation. So you can consider my
230267
approach as <strong>inline ray tracing</strong>
@@ -236,21 +273,25 @@ <h2>Introduction</h2>
236273
into
237274
sample rates, shader complexity, geometry types, and hardware, the numbers hold up. In this
238275
article,
239-
I'll peel back the layers of how I squeezed 3.6x performance out through CUDA-level
276+
I'll peel back the layers of how I squeezed 2x performance out through CUDA-level
240277
optimizations,
241-
giving you an exciting taste of what's possible when you really dig deep into cache behavior,
278+
giving you an exciting taste of what's possible when you really dig deep into cache
279+
behavior,
242280
register pressure, and GPU optimization.
243281
</p>
244282

245283
<h3> Why CUDA?</h3>
246284
<p>
247285
As a graphics programmer, I'm constantly pushing the limits of what the GPU can do. But I
248286
realized
249-
that knowing just high-level shading languages or APIs like Vulkan or DirectX wasn't enough—I
287+
that knowing just high-level shading languages or APIs like Vulkan or DirectX wasn't
288+
enough—I
250289
needed
251-
to understand the machine itself. CUDA gave me the lowest-level, most explicit way to explore
290+
to understand the machine itself. CUDA gave me the lowest-level, most explicit way to
291+
explore
252292
how
253-
GPUs schedule threads, manage memory, and hit (or miss) performance targets. And with the help
293+
GPUs schedule threads, manage memory, and hit (or miss) performance targets. And with the
294+
help
254295
of
255296
<strong>Nsight Compute</strong>, I wasn't just reading theory—I was hands-on, exploring real
256297
bottlenecks, discovering how latency hiding works, learning about warp scheduling, cache
@@ -262,22 +303,26 @@ <h3> Why CUDA?</h3>
262303

263304
<p>And I didn't want to "just learn a language." I wanted to <strong>learn CUDA as a suite of
264305
tools</strong>, to
265-
really get under the hood of how GPU code runs, stalls, and gets optimized. So I asked myself:
306+
really get under the hood of how GPU code runs, stalls, and gets optimized. So I asked
307+
myself:
266308
what's the best way to do that for a graphics programmer?
267309
</p>
268310

269-
<p><strong>Answer:</strong> write a ray tracer from scratch in CUDA… and then squeeze it until it
311+
<p><strong>Answer:</strong> write a ray tracer from scratch in CUDA… and then squeeze it until
312+
it
270313
screams.</p>
271314

272315
<p>This article walks you through how I implemented a naive CUDA port of <em>Ray Tracing in One
273316
Weekend</em>
274317
that
275318
ran at <strong>2.5 seconds per frame</strong>, and optimized it down to <strong>9
276-
milliseconds</strong>. Along the way, I hit every wall I could—scoreboard stalls, branching
319+
milliseconds</strong>. Along the way, I hit every wall I could—scoreboard stalls,
320+
branching
277321
hell,
278322
memory layout issues—and learned how to knock each one down.</p>
279323

280-
<p>This isn't a language learning blog. It's an <strong>optimization story</strong>. A journey into
324+
<p>This isn't a language learning blog. It's an <strong>optimization story</strong>. A journey
325+
into
281326
how
282327
GPUs
283328
really work, and what it takes to make them fly.</p>
@@ -294,9 +339,11 @@ <h3> Why CUDA?</h3>
294339

295340
<h3>Specifications:</h3>
296341
<p>
297-
To give proper context to the performance numbers and optimizations discussed in this article,
342+
To give proper context to the performance numbers and optimizations discussed in this
343+
article,
298344
it's
299-
important to understand the hardware I tested on. These specs shaped not only what was possible,
345+
important to understand the hardware I tested on. These specs shaped not only what was
346+
possible,
300347
but
301348
also where the real bottlenecks and wins emerged during tuning.
302349
</p>
@@ -588,14 +635,14 @@ <h3>Before vs After</h3>
588635
<td><code>.cu</code> per class</td>
589636
<td>Poor</td>
590637
<td>High</td>
591-
<td>Fast</td>
638+
<td>Short</td>
592639
<td class="bad">Slow</td>
593640
</tr>
594641
<tr>
595642
<td><code>.cuh</code> header-only</td>
596643
<td>Excellent</td>
597644
<td>Minimal</td>
598-
<td>Longer</td>
645+
<td>Long</td>
599646
<td class="good">Fast</td>
600647
</tr>
601648
</tbody>
@@ -2696,13 +2743,14 @@ <h2 class="section-title">References</h2>
26962743
memory
26972744
coalescing.</code>.
26982745
</li>
2699-
<li></li>
2700-
<a href="https://developer.nvidia.com/blog/rtx-best-practices/" target="_blank">
2701-
RTX Best Practices — NVIDIA Developer Blog
2702-
</a>
2703-
<br>
2704-
NVIDIA's official guide to best practices for real-time ray tracing with RTX, including performance
2705-
tips and architectural insights.
2746+
<li>
2747+
<a href="https://developer.nvidia.com/blog/rtx-best-practices/" target="_blank">
2748+
RTX Best Practices — NVIDIA Developer Blog
2749+
</a>
2750+
<br>
2751+
NVIDIA's official guide to best practices for real-time ray tracing with RTX, including
2752+
performance
2753+
tips and architectural insights.
27062754
</li>
27072755
<li>
27082756
<a href="https://developer.nvidia.com/blog/accelerated-ray-tracing-cuda/"
