Fix batch-and-skip benchmark exploit via per-call timing by nataliakokoromyti · Pull Request #104 · gpu-mode/reference-kernels

nataliakokoromyti · 2026-02-22T10:18:32Z

Summary

Fixes a benchmark exploit in eval_better_bench_grouped_gemm.py where a submission can batch all 15 custom_kernel() calls into a single GPU kernel launch and make 14/15 timed calls into no-ops (pure dict lookups returning cached results). This reports ~1/15th of the real per-call cost.

Why #102's fix is insufficient: The clone+shuffle approach in #102 breaks trivial id()-based caching, but a more sophisticated exploit uses a shape-matching fallback path that collects cloned data objects by problem shape and still batches them — the pointer-update path doesn't depend on stable id() values at all.

Changes

Clone data each timing iteration — prevents object-identity caching
Per-call CUDA events with GPU sync — each custom_kernel() call is individually timed with torch.cuda.synchronize() between calls, preventing work deferral across calls
Per-call correctness check in recheck mode — if a submission skips the kernel and returns uncomputed tensors, the correctness check fails immediately (fixes the indentation bug where only the last call was checked)
Local seed variable — avoids mutating test.args["seed"] across iterations
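
The per-call timing change can be sketched roughly as follows. This is a minimal, GPU-free illustration, not the code in eval_better_bench_grouped_gemm.py: `time_calls_individually`, `sync`, and `clock` are hypothetical names. In the real eval the clock would presumably be a pair of `torch.cuda.Event(enable_timing=True)` objects and `sync` would be `torch.cuda.synchronize()`; stdlib stand-ins are used here so the sketch runs anywhere.

```python
import copy
import time


def time_calls_individually(kernel, datasets, sync=lambda: None, clock=time.perf_counter):
    """Time each kernel call on its own, cloning inputs and syncing around every call.

    `sync` and `clock` are injectable stand-ins (assumptions) for a device
    synchronize and a device timer.
    """
    durations = []
    outputs = []
    for data in datasets:
        fresh = copy.deepcopy(data)  # new object identity on every iteration
        sync()                       # drain earlier work before starting the clock
        start = clock()
        out = kernel(fresh)
        sync()                       # wait for this call's work before stopping the clock
        durations.append(clock() - start)
        outputs.append(out)
    return durations, outputs


# Toy usage: the "kernel" is just sum() over a list.
durations, outputs = time_calls_individually(sum, [[1, 2], [3, 4]])
```

Because the clock starts and stops around a single call, work that a submission defers to a different call can no longer be averaged across the whole window.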

How the exploit works

The exploit:

Learning phase (first 15 calls): Records each data object's id(), tensors, and results
_build_superbatch(): Merges all 15 × 8 groups = 120 groups into a single kernel launch
Fast path: On subsequent iterations, only the first id() triggers the batched kernel; the other 14 return pre-cached results (zero GPU work)
Pointer-update fallback: When id() values change (e.g., after cloning), collects all 15 new objects by shape match, updates pointer tables, and still launches only once — defeating clone-based mitigations
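
A small GPU-free sketch of why cloning defeats an id()-keyed lookup but not a shape-keyed one. `GroupedGemmInput` is a hypothetical stand-in for one benchmark input, and the shapes are made up for illustration; nothing here is taken from an actual submission.

```python
import copy


class GroupedGemmInput:
    """Hypothetical stand-in for one benchmark input; only the shape matters here."""

    def __init__(self, shape):
        self.shape = shape
        self.values = [0.0] * 4


originals = [GroupedGemmInput((m, 128, 256)) for m in (64, 96, 128)]
known_ids = {id(p) for p in originals}
known_shapes = {p.shape for p in originals}

# The eval clones every input, so each clone is a distinct live object.
clones = [copy.deepcopy(p) for p in originals]

id_hits = sum(id(c) in known_ids for c in clones)           # identity lookup misses
shape_hits = sum(c.shape in known_shapes for c in clones)   # shape lookup still hits
print(id_hits, shape_hits)  # prints "0 3"
```

Cloning changes object identity only. Anything keyed on problem shape survives it, which is why the mitigation has to constrain when work may happen rather than which objects are passed in.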

Why this fix works

GPU sync between calls forces each call to either launch its kernel, which incurs a measurable cost, or skip the launch and return uncomputed results
Per-call correctness check catches deferred computation — if a call returns without launching a kernel, its output tensors contain garbage and fail verification
The only viable strategy for a submission is to actually compute the result for each call independently — which is exactly what a legitimate kernel does
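
The recheck indentation bug mentioned under Changes can be illustrated with a hypothetical reconstruction. The actual eval code is not shown in this thread, so `recheck_last_only`, `recheck_every_call`, and `make_lazy_kernel` are assumed names used only for this sketch.

```python
def recheck_last_only(kernel, inputs, reference):
    """Buggy shape: the comparison is dedented, so it runs once after the loop."""
    for data in inputs:
        out = kernel(data)
    return out == reference(data)  # only the final call is ever compared


def recheck_every_call(kernel, inputs, reference):
    """Fixed shape: every call's output is compared before the next call runs."""
    for data in inputs:
        out = kernel(data)
        if out != reference(data):
            return False
    return True


def make_lazy_kernel(n_calls):
    """Toy submission that only computes on the last call of each window."""
    state = {"seen": 0}

    def kernel(data):
        state["seen"] += 1
        if state["seen"] % n_calls == 0:
            return sum(data)  # real work
        return None           # stands in for an uncomputed output tensor

    return kernel


inputs = [[1, 2], [3, 4], [5, 6]]
passes_buggy_check = recheck_last_only(make_lazy_kernel(3), inputs, sum)
passes_fixed_check = recheck_every_call(make_lazy_kernel(3), inputs, sum)
```

In this toy setup the lazy kernel passes the last-call-only check and fails the per-call check on its first call.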

Test plan

Verify legitimate submissions produce same scores (per-call mean = batch mean for honest kernels)
Verify the known exploit kernel fails correctness in leaderboard mode
Check benchmark runtime overhead is acceptable (extra sync per call adds ~5μs × 15 = ~75μs per repeat)

* Add nvfp4_group_gemm problem to nvidia.yaml
  - Deadline: Feb 20, 2026
  - Runners: B200 and NVIDIA
* Fix eval.py to handle list values in test cases
  Bypass text serialization and parse YAML directly to properly handle list values for m, n, k in group GEMM test cases.

* Fix eval.py to properly parse list values in test cases
  - Updated regex to use [^\]]* instead of [^\]]+ to handle edge cases
  - Added underscores to key pattern [a-zA-Z_]+
  - Skip empty lines and empty parts when parsing
  - Use re.fullmatch directly instead of both re.match and re.fullmatch
  - Handle empty tuples/lists in value parsing
* Fix eval.py to use text parsing instead of YAML
  Kernelbot passes a text file with format like:
  m: [96, 128]; n: [128, 256]; k: [128, 512]; g: 2; seed: 1111
  Use get_test_cases() to parse this text format directly. Remove unused get_test_cases_from_yaml function.
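
For illustration, a minimal parser for the text format quoted above might look like this. It is an independent sketch, not the repository's get_test_cases(); the function name `parse_test_case` and the use of `ast.literal_eval` are assumptions.

```python
import ast
import re

# Keys may contain letters and underscores, matching the pattern named in the commit.
KEY_VALUE = re.compile(r"([a-zA-Z_]+)\s*:\s*(.+)")


def parse_test_case(line):
    """Parse 'key: value; key: value' text into a dict of ints and lists."""
    case = {}
    for part in line.split(";"):
        part = part.strip()
        if not part:  # skip empty parts, e.g. after a trailing ';'
            continue
        match = KEY_VALUE.fullmatch(part)
        if match is None:
            raise ValueError(f"malformed test case entry: {part!r}")
        key, raw = match.groups()
        case[key] = ast.literal_eval(raw)  # handles ints, lists, and empty lists
    return case


spec = parse_test_case("m: [96, 128]; n: [128, 256]; k: [128, 512]; g: 2; seed: 1111")
```

Splitting on `;` first is safe for this format because list elements are separated by commas.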
…ess checks

The current eval times all 15 custom_kernel() calls as a single batch and divides by 15. A malicious submission can exploit this by deferring all work to one call (batching 15 problems into a single kernel launch) and making the other 14 calls no-ops, reporting ~1/15th of the real per-call cost.

Cloning data alone (as proposed in gpu-mode#102) does not fully prevent this -- a shape-matching fallback path can still collect new data objects and batch them.

This fix:
- Clones data each timing iteration (prevents object-identity caching)
- Times each call individually with its own CUDA events and GPU sync (prevents amortization across calls)
- Checks correctness after each individual call in recheck/leaderboard mode (catches deferred-computation exploits that return uncomputed tensors)
- Uses a local seed variable instead of mutating test.args
- Fixes the recheck indentation bug where only the last call was checked

G-structure · 2026-02-22T15:39:41Z

Hey @nataliakokoromyti — this is awesome, thanks for writing it up so clearly. The explanation of why #102’s clone+shuffle isn’t enough (shape-match + pointer-update path) is exactly right.

One thing I noticed when I ran 208fd03 against the same known-good kernel we’ve been comparing with: the per-call sync approach ends up being really expensive in leaderboard mode. I’m seeing a geomean around 266,298 ns (~266 µs) — about +1040% vs the lighter-hardened baseline (~23 µs). At that point we’re mostly benchmarking torch.cuda.synchronize() + event overhead rather than the kernel itself, and it risks turning the leaderboard into “who is least harmed by eval overhead” instead of “who’s closest to speed-of-light math throughput.”

One nuance on semantics: the per-call correctness checks guarantee “correct when checked,” but they don’t fully enforce call independence as a contract. A clever submission can still coordinate across calls (batching/deferral) as long as it lands the writes before the check. So it’s a strong hammer, but it’s costly and still not quite the clean independence guarantee we want.

A direction that seems both cheaper and more targeted is output fingerprint auditing (we’ve been experimenting with this and it’s been working well):

After a probed call returns, compute a lightweight fingerprint of the output buffers (sampled indices + random weights; computed on GPU so it stays in stream order).
After the full call window (after the usual sync), recompute the fingerprint on the same buffers.
If it changed → the output mutated after return → direct signal of deferred / cross-call writes (“return handle now, fill later”).
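
The three steps above could be sketched on the CPU like this. It is only an illustration of the idea, not the f52ff4b implementation: the names `make_probe`, `fingerprint`, and `output_mutated_after_return` are hypothetical, plain Python lists stand in for output tensors, and a real audit would compute the fingerprint on the GPU in stream order.

```python
import random
import struct

MODULUS = (1 << 61) - 1  # a Mersenne prime, used to keep the hashes bounded


def float_bits(x):
    """Bit pattern of a float64, so the hash sees exact values rather than rounded ones."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def make_probe(n_elements, n_samples=64, seed=0):
    """Pick sampled indices plus two independent random weights per sample."""
    rng = random.Random(seed)
    indices = [rng.randrange(n_elements) for _ in range(n_samples)]
    weights = [(rng.randrange(1, 1 << 31), rng.randrange(1, 1 << 31))
               for _ in range(n_samples)]
    return indices, weights


def fingerprint(buffer, probe):
    """Two weighted sums over the sampled elements (the '2 hashes')."""
    indices, weights = probe
    h1 = h2 = 0
    for i, (w1, w2) in zip(indices, weights):
        bits = float_bits(buffer[i])
        h1 = (h1 + w1 * bits) % MODULUS
        h2 = (h2 + w2 * bits) % MODULUS
    return h1, h2


def output_mutated_after_return(buffer, probe, run_rest_of_window):
    at_return = fingerprint(buffer, probe)   # step 1: right after the probed call returns
    run_rest_of_window()                     # remaining calls plus the usual sync
    return fingerprint(buffer, probe) != at_return  # steps 2 and 3


probe = make_probe(n_elements=1024)
honest = [1.0] * 1024    # already final when the call returns
deferred = [0.0] * 1024  # handle returned now, filled later


def late_fill():
    deferred[:] = [1.0] * 1024


honest_flag = output_mutated_after_return(honest, probe, lambda: None)
deferred_flag = output_mutated_after_return(deferred, probe, late_fill)
```

In this toy run the honest buffer is not flagged and the late-filled buffer is.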

This hits the exploit mechanism directly (temporal integrity) instead of inferring cheating from timing skew. In our tests a fingerprint-based audit (f52ff4b style) caught the deferred-mutation exploit while keeping overhead much closer to baseline (~26.6 µs, ~+13.9%).

A couple notes / caveats on fingerprinting (worth us digging into together):

It assumes the call’s work lands on the current stream; if custom streams are allowed, we probably need to disallow them (common in these comps) or treat them as undefined for timing/integrity.
It’s probabilistic because we sample, but using 2 hashes + enough samples (e.g. 256–1024) makes collisions super unlikely.
We don’t need to fingerprint every call — probing 1–2 random positions per integrity repeat tends to be enough to catch “flush at end” patterns.
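
As a rough guide to how many samples are "enough", the chance that sampling misses a mutation entirely can be estimated as below. This assumes indices are drawn independently and uniformly, and it covers only sampling misses, not hash collisions; `miss_probability` is a hypothetical helper.

```python
def miss_probability(changed_fraction, n_samples):
    """Chance that none of n_samples uniform draws lands on a changed element."""
    return (1.0 - changed_fraction) ** n_samples


# A late write touching only 1% of the output, probed with 256 vs. 1024 samples.
p_256 = miss_probability(0.01, 256)
p_1024 = miss_probability(0.01, 1024)
```

A "flush at end" pattern rewrites essentially the whole output, so the changed fraction is close to 1 and even a small sample is very likely to see it.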

I also tried a couple follow-ups building on your approach:

20fb8c3: randomized window length/order + sparse probes (more heuristic)
f52ff4b: fingerprint audit that directly catches post-return output mutation

Diff is here (easy to cherry-pick pieces): https://gist.github.com/G-structure/f9de3df9b051f43c06422ffd7a21a8dd

Down to pair on integrating this in a way that keeps leaderboard scoring “real,” with the stronger checks happening only on integrity repeats.

ngc92 · 2026-02-22T15:50:33Z

Hi,
I've been working on a new benchmark+test implementation, unfortunately not in time for the competition:
https://github.com/ngc92/pygpubench

I think it does avoid most of the problems mentioned above, and it tries to minimize the overhead of the benchmarking framework by implementing the main loop in C++, calling the user function through nanobind.
This also avoids malicious users messing with benchmarking data using the inspect module in python.

Note that the checking kernel is started using PDL to minimize the attack window, and checks the entries in the result in randomized order.

nataliakokoromyti · 2026-02-23T03:23:24Z

thanks @G-structure and @ngc92 for your help and thoughtful responses. idk what the timeline for migrating to cpp is (great idea) but till then sth like this pr ^ could be beneficial.

Mark Saroufim and others added 30 commits February 22, 2025 14:34

convolution is bueno

168d345

Merge pull request gpu-mode#4 from gpu-mode/vectoradd

eec6afd

Vectoradd

prefixsum

7b8fb8c

remove useless comments

32e78bb

Make solution more robust to cheese solutions by adding scale and offset

2d6a1d6

Merge pull request gpu-mode#3 from gpu-mode/dev-siro

9db74b3

Vectorsum

fixed prefixsum

9fcb89f

Merge pull request gpu-mode#2 from gpu-mode/dev

a834338

Updates to Problems

improvements to allclose:

eb928f7

* torch.no_grad(), might get some memory freed earlier * rename parameters to reflect the asymmetry re relative error * don't try to stringify list of wrong locations; those could be millions in the worst case

simplified reference utility

519a8f0

reduce in fp64

1657e57

Merge pull request gpu-mode#6 from gpu-mode/ngc92/improvements

4df532f

improvements

histogram:

3479186

receive inputs directly in uint8 allow specifying contention in data distribution

Merge pull request gpu-mode#7 from gpu-mode/ngc92/histogram

d9f012e

histogram update

Change sizes on conv2d

99ec59f

Merge pull request gpu-mode#9 from gpu-mode/fix/smaller-conv2d

27f3179

Change sizes on conv2d

model solutions

f836aeb

use double for reference calculation

9467522

comment

090fbae

Merge pull request gpu-mode#10 from gpu-mode/ngc92/problems

dea77a1

model solutions

Feat: templates

efa2178

fixup

855df3d

Update README.md with Docs Links

4f42b29

Update submission_cuda_inline.py

3d8dcec

Merge pull request gpu-mode#14 from gpu-mode/msaroufim-patch-1

d0a7a1e

Update submission_cuda_inline.py

identity task for AMD competition

066d45a

rename

b514247

names aren't supported

294a335

fixups

9a9dba8

fp8 matmul

6dbc737

Mark Saroufim and others added 18 commits January 19, 2026 08:59

Add nvfp4_group_gemm problem to nvidia.yaml (gpu-mode#92)

250b004

- Deadline: Feb 20, 2026 - Runners: B200 and NVIDIA

Fix eval.py source path in task configuration

3f23047

Updated the source path for eval.py in task.yml.

patch utils.py to avoid [] as a valid submission

aeee2ba

Fix: better configuration for grouped gemm launch

66065c2

Fix: add better l2 cache clear

9ca8ea5

change fp4 init range (gpu-mode#96)

07f0321

change k's value to a multiple of 256 (gpu-mode#98)

db1c91e

* Add new problem nvfp4_gemm to nvidia.yaml * change k's value to a multiple of 256 for better perf (simplify some logic). * revert unnecessary change.

update pmppv2 dates

53801cc

add MLIRError, UNSERIALIZABLE_EXCEPTIONS tuple

62e4b61

Merge pull request gpu-mode#99 from djsaunde/unseriable-exceptions

6dac61f

add MLIRError, UNSERIALIZABLE_EXCEPTIONS tuple

Update competition list in README

efa5217

Added NVIDIA Blackwell NVFP4 competition to the competition list.

Update README.md

64e88da

remove unsued eval (gpu-mode#101)

2998db4

Update deadline for trimul problem

b98f123

Update nvfp4_group_gemm deadline to match Luma event end time

04c0b02

Changed from Feb 20 midnight to Feb 21 7:30 UTC.

nataliakokoromyti marked this pull request as draft February 22, 2026 10:20

G-structure force-pushed the fix/per-call-timing-anti-exploit branch from 208fd03 to 11fe446 Compare February 23, 2026 03:09

G-structure added 2 commits February 22, 2026 22:11

Improve grouped GEMM eval anti-cheat checks

3d389a2

Add fingerprint audit for deferred output mutation

340e48e

G-structure force-pushed the fix/per-call-timing-anti-exploit branch from 11fe446 to 340e48e Compare February 23, 2026 03:11

nataliakokoromyti marked this pull request as ready for review February 23, 2026 03:21

Ammaar-Alam mentioned this pull request Apr 8, 2026

Fix Princeton cross-entropy replay exploit via phase-specific inputs #142

Merged

msaroufim closed this Jun 15, 2026

msaroufim force-pushed the main branch from 248d962 to e224fc2 Compare June 15, 2026 04:57

nataliakokoromyti wants to merge 264 commits into gpu-mode:main from nataliakokoromyti:fix/per-call-timing-anti-exploit

13 participants
