Fix batch-and-skip benchmark exploit via per-call timing #104
nataliakokoromyti wants to merge 264 commits into
Updates to Problems
* torch.no_grad(), might get some memory freed earlier
* rename parameters to reflect the asymmetry re relative error
* don't try to stringify list of wrong locations; those could be millions in the worst case
receive inputs directly in uint8
allow specifying contention in data distribution
histogram update
Change sizes on conv2d
model solutions
Update submission_cuda_inline.py
- Deadline: Feb 20, 2026
- Runners: B200 and NVIDIA
* Add nvfp4_group_gemm problem to nvidia.yaml
  - Deadline: Feb 20, 2026
  - Runners: B200 and NVIDIA
* Fix eval.py to handle list values in test cases
  Bypass text serialization and parse YAML directly to properly handle list values for m, n, k in group GEMM test cases.
Updated the source path for eval.py in task.yml.
* Fix eval.py to properly parse list values in test cases
  - Updated regex to use [^\]]* instead of [^\]]+ to handle edge cases
  - Added underscores to key pattern [a-zA-Z_]+
  - Skip empty lines and empty parts when parsing
  - Use re.fullmatch directly instead of both re.match and re.fullmatch
  - Handle empty tuples/lists in value parsing
* Fix eval.py to use text parsing instead of YAML
  Kernelbot passes a text file with format like:
  m: [96, 128]; n: [128, 256]; k: [128, 512]; g: 2; seed: 1111
  Use get_test_cases() to parse this text format directly. Remove unused get_test_cases_from_yaml function.
* Add new problem nvfp4_gemm to nvidia.yaml
* change k's value to a multiple of 256 for better perf (simplify some logic).
* revert unnecessary change.
add MLIRError, UNSERIALIZABLE_EXCEPTIONS tuple
Added NVIDIA Blackwell NVFP4 competition to the competition list.
Changed from Feb 20 midnight to Feb 21 7:30 UTC.
…ess checks

The current eval times all 15 custom_kernel() calls as a single batch and divides by 15. A malicious submission can exploit this by deferring all work to one call (batching 15 problems into a single kernel launch) and making the other 14 calls no-ops, reporting ~1/15th of the real per-call cost.

Cloning data alone (as proposed in gpu-mode#102) does not fully prevent this: a shape-matching fallback path can still collect new data objects and batch them.

This fix:
- Clones data each timing iteration (prevents object-identity caching)
- Times each call individually with its own CUDA events and GPU sync (prevents amortization across calls)
- Checks correctness after each individual call in recheck/leaderboard mode (catches deferred-computation exploits that return uncomputed tensors)
- Uses a local seed variable instead of mutating test.args
- Fixes the recheck indentation bug where only the last call was checked
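The per-call loop described in this commit message can be sketched in plain Python. This is a minimal, CPU-only illustration and not the code from the PR: the function name time_calls_individually and its clock and sync parameters are hypothetical, introduced here so the loop runs without a GPU. In the real eval, per the commit message, the timer would be a pair of CUDA events per call and sync would be a GPU synchronization.

```python
import copy
import time
from typing import Any, Callable, List, Sequence


def time_calls_individually(
    kernel: Callable[[Any], Any],
    inputs: Sequence[Any],
    check: Callable[[Any, Any], bool],
    clock: Callable[[], float] = time.perf_counter,  # stand-in for CUDA event timing
    sync: Callable[[], None] = lambda: None,         # stand-in for a GPU sync
) -> List[float]:
    """Time every kernel call on its own and verify its output right away."""
    durations: List[float] = []
    for data in inputs:
        fresh = copy.deepcopy(data)  # new object per iteration: defeats id()-keyed caches
        sync()                       # drain earlier work so it is not billed to this call
        start = clock()
        out = kernel(fresh)
        sync()                       # wait for this call's work before stopping the timer
        durations.append(clock() - start)
        if not check(data, out):     # verify now, before a later call can patch the result
            raise AssertionError("output was wrong at the time of the check")
    return durations


def doubled(values: List[int]) -> List[int]:
    """Toy stand-in for an honest custom_kernel()."""
    return [v * 2 for v in values]


timings = time_calls_individually(
    doubled, [[1, 2], [3, 4]], lambda data, out: out == doubled(data)
)
```

The design point is that the timer, the sync, and the check all sit inside the loop body, so no call can borrow time or correctness from another call.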
Hey @nataliakokoromyti, this is awesome, thanks for writing it up so clearly. The explanation of why #102’s clone+shuffle isn’t enough (shape-match + pointer-update path) is exactly right.

One thing I noticed when I ran …

One nuance on semantics: the per-call correctness checks guarantee “correct when checked,” but they don’t fully enforce call independence as a contract. A clever submission can still coordinate across calls (batching/deferral) as long as it lands the writes before the check. So it’s a strong hammer, but it’s costly and still not quite the clean independence guarantee we want.

A direction that seems both cheaper and more targeted is output fingerprint auditing (we’ve been experimenting with this and it’s been working well):
This hits the exploit mechanism directly (temporal integrity) instead of inferring cheating from timing skew. In our tests a fingerprint-based audit ( …

A couple notes / caveats on fingerprinting (worth us digging into together):
I also tried a couple follow-ups building on your approach:
Diff is here (easy to cherry-pick pieces): https://gist.github.com/G-structure/f9de3df9b051f43c06422ffd7a21a8dd

Down to pair on integrating this in a way that keeps leaderboard scoring “real,” with the stronger checks happening only on integrity repeats.
Hi, I think it does avoid most of the problems mentioned above, and it tries to minimize the overhead of the benchmarking framework by implementing the main loop in C++, calling the user function through nanobind. Note that the checking kernel is started using PDL to minimize the attack window, and checks the entries in the result in randomized order.
thanks @G-structure and @ngc92 for your help and thoughtful responses. idk what the timeline for migrating to cpp is (great idea) but till then sth like this pr ^ could be beneficial.
Summary
Fixes a benchmark exploit in eval_better_bench_grouped_gemm.py where a submission can batch all 15 custom_kernel() calls into a single GPU kernel launch and make 14/15 timed calls into no-ops (pure dict lookups returning cached results). This reports ~1/15th of the real per-call cost.

Why #102's fix is insufficient: The clone+shuffle approach in #102 breaks trivial id()-based caching, but a more sophisticated exploit uses a shape-matching fallback path that collects cloned data objects by problem shape and still batches them. The pointer-update path doesn't depend on stable id() values at all.

Changes
- … custom_kernel() call is individually timed with torch.cuda.synchronize() between calls, preventing work deferral across calls
- … test.args["seed"] across iterations

How the exploit works
The exploit:
- … id(), tensors, and results
- _build_superbatch(): Merges all 15 × 8 groups = 120 groups into a single kernel launch
- … id() triggers the batched kernel; the other 14 return pre-cached results (zero GPU work)
- … id() values change (e.g., after cloning), collects all 15 new objects by shape match, updates pointer tables, and still launches only once, defeating clone-based mitigations

Why this fix works
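A toy model may help show why checking after each call matters. The sketch below is a simplified, CPU-only stand-in for the deferral variant of the exploit, not the actual submission: DeferringKernel, expected, and the two check functions are hypothetical names, and "doubling a list" stands in for the grouped GEMM. The cheat returns an empty placeholder on every call and only fills all placeholders once the last of the 15 calls arrives.

```python
from typing import List, Tuple

N_CALLS = 15  # custom_kernel() calls per timed batch, as in the PR description


class DeferringKernel:
    """Hypothetical cheat: return placeholders, fill them all on the final call."""

    def __init__(self, n_calls: int) -> None:
        self.n_calls = n_calls
        self.pending: List[Tuple[List[int], List[int]]] = []
        self.launches = 0

    def __call__(self, data: List[int]) -> List[int]:
        out: List[int] = []               # placeholder, nothing computed yet
        self.pending.append((data, out))
        if len(self.pending) == self.n_calls:
            self.launches += 1            # one "launch" covers every queued problem
            for queued, target in self.pending:
                target.extend(v * 2 for v in queued)
            self.pending.clear()
        return out


def expected(data: List[int]) -> List[int]:
    """Reference result for the toy problem."""
    return [v * 2 for v in data]


def check_after_batch(kernel, inputs) -> bool:
    """Old behaviour: run every call first, then compare all outputs."""
    outs = [kernel(d) for d in inputs]
    return all(o == expected(d) for d, o in zip(inputs, outs))


def check_after_each_call(kernel, inputs) -> bool:
    """New behaviour: compare each output before the next call is made."""
    return all(kernel(d) == expected(d) for d in inputs)


inputs = [[i, i + 1] for i in range(N_CALLS)]
batch_ok = check_after_batch(DeferringKernel(N_CALLS), inputs)
per_call_ok = check_after_each_call(DeferringKernel(N_CALLS), inputs)
print(batch_ok, per_call_ok)  # prints: True False
```

In this model the batch-level check passes because every placeholder has been filled by the time it is inspected, while the per-call check sees the first placeholder while it is still empty.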
Test plan