Add example that profiles parallel sum by FattiMei · Pull Request #774 · inducer/pyopencl

FattiMei · 2024-07-15T20:25:13Z

Description

This program profiles the parallel sum kernel on two arrays of increasing size, interpreting the data as char, int, float, and so on. The results are plotted with matplotlib; it is an external dependency, but a common one.

Rationale

When experimenting with GPU computing, it is useful to estimate an upper bound on performance. This PR offers an example of how you could use pyopencl events to profile a kernel (I suppose we are profiling only the execution time, without the data transfers).
The results show a powerful idea in parallel computing: using shorter types improves throughput.
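
For readers unfamiliar with event profiling, here is a minimal sketch of the idea, not the code from this PR. The kernel source, the helper names `elapsed_seconds` and `time_sum_kernel`, and the default type are assumptions made for illustration. It relies on the queue being created with profiling enabled and on the event timestamps being reported in nanoseconds.

```python
# Kernel source: element-wise sum over a type T injected with a #define.
SRC = """
__kernel void sum(__global const T *x, __global const T *y, __global T *z)
{
    int i = get_global_id(0);
    z[i] = x[i] + y[i];
}
"""


def elapsed_seconds(start_ns, end_ns):
    """Convert a pair of profiling timestamps (nanoseconds) to seconds."""
    return (end_ns - start_ns) / 1e9


def time_sum_kernel(n, ctype="float", dtype="float32"):
    """Return the kernel-only execution time, in seconds, for n additions."""
    # Imported lazily so that elapsed_seconds() is usable without OpenCL.
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    # Profiling must be enabled on the queue to read event.profile.
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    mf = cl.mem_flags
    x_host = np.ones(n, dtype=dtype)
    y_host = np.ones(n, dtype=dtype)
    # The host-to-device copies happen here, outside the timed event.
    x = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x_host)
    y = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=y_host)
    z = cl.Buffer(ctx, mf.WRITE_ONLY, x_host.nbytes)

    prg = cl.Program(ctx, f"#define T {ctype}\n" + SRC).build()
    event = prg.sum(queue, (n,), None, x, y, z)
    event.wait()  # timestamps are only meaningful once the kernel finished
    return elapsed_seconds(event.profile.start, event.profile.end)
```

Because the event belongs to the kernel launch alone, the measured interval covers execution on the device and excludes the buffer transfers.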

Possible corners

I assumed that the profiling times were in nanoseconds.
I wanted to work with fp16, but not all devices support it, so I commented it out. Maybe we could query at runtime whether fp16 arithmetic is supported.
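
One possible way to do the runtime query is to look for the `cl_khr_fp16` extension in the device's extension string. This is a sketch under that assumption, not code from the PR; the helper names `supports_fp16` and `fp16_devices` are hypothetical.

```python
def supports_fp16(extensions):
    """Return True if a space-separated extension string lists cl_khr_fp16."""
    return "cl_khr_fp16" in extensions.split()


def fp16_devices():
    """List the names of all visible devices that report cl_khr_fp16."""
    import pyopencl as cl  # lazy import: only needed for the actual query

    return [
        dev.name
        for platform in cl.get_platforms()
        for dev in platform.get_devices()
        if supports_fp16(dev.extensions)
    ]
```

With such a check, the `half` case could be added to the list of profiled types only when the selected device reports the extension. A kernel using `half` arithmetic would presumably also need `#pragma OPENCL EXTENSION cl_khr_fp16 : enable` in its source.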

inducer

Thanks for contributing this. Some comments below.

inducer · 2024-07-16T01:13:02Z

examples/demo_flops.py

+            FLOPS = 1e9 * sums / (event.profile.end - event.profile.start)
+            GFLOPS = FLOPS / 1e6
+
+            data[row, col] = GFLOPS


Arguably, this workload will be bandwidth-bound, so GB/s will be the more appropriate measure.

FattiMei · 2024-07-16

I made this choice because GPU performance is commonly reported in TFLOPS (a figure computed with similar workloads), and because it highlights the fact that the FLOPS naturally go up when working with smaller types.
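
To make the two metrics concrete, here is a small sketch of how each could be derived from the same pair of nanosecond timestamps. The function names are hypothetical, and the bandwidth figure assumes that `z = x + y` moves three arrays (two loads and one store) with no cache reuse.

```python
def gflops(n_ops, start_ns, end_ns):
    """Arithmetic throughput in GFLOP/s, i.e. operations per nanosecond."""
    return n_ops / (end_ns - start_ns)


def gigabytes_per_second(n_elements, itemsize, start_ns, end_ns):
    """Effective bandwidth in GB/s for z = x + y (two loads, one store)."""
    bytes_moved = 3 * n_elements * itemsize
    return bytes_moved / (end_ns - start_ns)
```

Operations per nanosecond are already GFLOP/s, so no further scaling is needed (starting from FLOP/s, the conversion would divide by 1e9). For a bandwidth-bound kernel, GB/s stays roughly constant across types while the operation rate scales inversely with `itemsize`, which would explain why shorter types show higher throughput.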

inducer · 2024-07-16T01:15:50Z

examples/demo_flops.py

+            header  = f'#define T {literal}\n'
+            kernel  = cl.Program(ctx, header + src).build().sum
+
+            event   = kernel(queue, (sums,), None, x, y, z)


It's generally good practice to do a few "warmup" rounds before timing, to better measure the steady-state rate.

FattiMei · 2024-07-16

There are problems with caches, however. In CPU runs I get unrealistically high GFLOPS for medium-sized arrays because they already live in the cache; the GPU doesn't seem to suffer from this problem.
With the new commits, however, one can choose to do no warm-up runs and only one hot run, so it's OK.
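
The trade-off discussed here can be expressed as two parameters. The sketch below is an illustration, not the code added in the later commits: `measure` and its arguments are hypothetical names, and `run_once` stands for any callable that launches the kernel and returns its elapsed time.

```python
import statistics


def measure(run_once, warmup=2, repeats=5):
    """Call run_once() `warmup` times untimed, then return the median of
    `repeats` timed calls. run_once must return an elapsed time."""
    for _ in range(warmup):
        run_once()  # result discarded: these calls only warm things up
    return statistics.median(run_once() for _ in range(repeats))
```

Calling `measure(run_once, warmup=0, repeats=1)` gives a single run with no warm-up, which avoids the cache effect described above, while the defaults approximate the steady-state rate.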

inducer · 2024-07-16T01:16:13Z

examples/demo_flops.py

Could you look over the CI failures?

Add example that profiles parallel sum

15d981a

inducer reviewed Jul 16, 2024

FattiMei and others added 4 commits July 16, 2024 07:12

refactor: comply with ruff requirements

dfef25a

Add warm-up runs and multiple measurements per run

df99f35

Add matplotlib dependency to examples ci

72b1ab4

Merge branch 'main' into main

766873e

FattiMei wants to merge 5 commits into inducer:main from FattiMei:main

2 participants
