██████╗ ███████╗████████╗███████╗██████╗ ██████╗██████╗
██╔══██╗██╔════╝╚══██╔══╝██╔════╝██╔══██╗██╔════╝╚════██╗
██████╔╝█████╗ ██║ █████╗ ██████╔╝██║ █████╔╝
██╔═══╝ ██╔══╝ ██║ ██╔══╝ ██╔══██╗██║ ╚═══██╗
██║ ███████╗ ██║ ███████╗██║ ██║╚██████╗██████╔╝
╚═╝ ╚══════╝ ╚═╝ ╚══════╝╚═╝ ╚═╝ ╚═════╝╚═════╝
Self-taught systems programmer working at the GPU / driver / ML-runtime boundary. I build from-scratch on-device NPU inference across two vendors' silicon: AMD (Radeon 890M iGPU + XDNA 2 NPU) and MediaTek (MDLA / APU 650). I write PyTorch backends, patch kernel drivers, drive vendor NPU compilers directly, and file the upstream bug reports for hardware the software stack hasn't caught up to yet. I publish all of it.
Async / written-first collaborator. Comfortable in Rust and C++ down to the dispatcher, allocator, and SPIR-V level.
| Project | What it actually does |
|---|---|
| torch-vulkan | From-scratch PyTorch device backend (PrivateUse1 + Vulkan compute) for the Radeon 890M iGPU. 39 hand-written SPIR-V compute shaders, custom allocator, buffer pooling, Q4 matmul. C++17 / pybind11. Functional prototype. |
| amdxdna-strix-fix | Kernel driver patch: root-caused and fixed an SMU init-order bug in the in-tree amdxdna driver (Linux 6.14+) that left the Ryzen AI NPU dead on cold boot. Brought an otherwise-unusable NPU online. |
| mdla-cnn-engine | From-scratch INT8 inference engine for a MediaTek MDLA phone NPU (MT6886). Drives the on-device Neuron compiler directly, bypassing the gated NeuroPilot SDK. Four stock CNNs run end-to-end on the NPU through one pipeline, hardware-witnessed. Honestly scoped (CNN-only, stock models; the contribution is the path to the silicon). |
| recursive-routing-racer | Tri-processor dispatch runtime: routes ML workloads across CPU + iGPU (Vulkan) + NPU (XDNA) on Ryzen AI 300. REINFORCE-trained scheduler, SQLite-backed state, hardware monitoring. Working prototype, ~5,100 lines Python. |
| kv-compressor | KV-cache compression experiment with a documented negative result (FINDINGS.md): measured where the approach stops paying off, written up honestly rather than buried. |
| graphql-authz-fuzzer | GraphQL mutation authorization tester: schema introspection, probe generation, auth-gap classification. Standard-library only, has tests. |
| cube-memory | Research code + the public VSA negative-results preprint (/paper: 12-page PDF, LaTeX source, 6 figures). See Research below. |
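
To make the "REINFORCE-trained scheduler" entry for recursive-routing-racer concrete, here is a minimal sketch of the general technique: a softmax policy over three devices, updated with the REINFORCE gradient and a running-mean baseline. It is not the project's scheduler. The device labels, the `train_router` function, the latency numbers, and the reward scaling are all hypothetical and chosen only for illustration.

```python
import math
import random

DEVICES = ["cpu", "igpu", "npu"]  # hypothetical device labels


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_router(mean_latency_ms, steps=5000, lr=0.1, seed=0):
    """Learn device preferences from noisy latency samples with REINFORCE."""
    rng = random.Random(seed)
    scale = max(mean_latency_ms)   # keeps rewards roughly in [-1, 0]
    logits = [0.0] * len(DEVICES)
    baseline = 0.0                 # running mean reward, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(DEVICES)), weights=probs)[0]
        # Simulated measurement: lower latency means higher reward.
        latency = mean_latency_ms[action] + rng.gauss(0.0, 1.0)
        reward = -latency / scale
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(action) w.r.t. logit i is 1[i == action] - probs[i].
        for i in range(len(logits)):
            indicator = 1.0 if i == action else 0.0
            logits[i] += lr * advantage * (indicator - probs[i])
    return dict(zip(DEVICES, softmax(logits)))


if __name__ == "__main__":
    # Made-up mean latencies in milliseconds, not measurements.
    print(train_router([40.0, 20.0, 5.0]))
```

A real router would replace the simulated latency with measured runtimes and would likely condition the policy on workload features; this sketch treats the problem as a three-armed bandit to keep the update rule visible.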
| Project | What it actually does |
|---|---|
| recursive-routing-racer-rs | LLM inference engine from scratch in Rust: GGUF loading, BPE tokenizer, KV cache, Vulkan GPU dispatch, speculative decoding. Runs Phi-4 Mini at ~5.5 tok/s. Learning project. |
| pytorch-gfx1150 | PyTorch built from source for the Radeon 890M (gfx1150): build scripts, AOTriton workarounds, GCC 15 fixes, documented. |
| miopen-gfx1150 | MIOpen analysis for RDNA 3.5: whitelist patch, CK VGPR analysis, solver-availability matrix; 3-bug writeup. |
| unified-ml | HIP + Vulkan unified-memory strategy benchmarks on AMD APUs, plus a GGUF parser (712 lines, 5 quant formats + F32/F16). |
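
For readers unfamiliar with GGUF, the first step of any parser is reading the fixed-size file header. The sketch below shows that step only, assuming the publicly documented layout for GGUF version 2 and later (4-byte magic, 32-bit version, then 64-bit tensor and metadata counts, little-endian). The function name `parse_gguf_header` is hypothetical, and this is not code from unified-ml or recursive-routing-racer-rs.

```python
import struct

GGUF_MAGIC = b"GGUF"


def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (version 2 or later, little-endian)."""
    if len(data) < 8 or data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    (version,) = struct.unpack_from("<I", data, 4)
    if version < 2:
        # Version 1 stored its counts as 32-bit values; not handled here.
        raise ValueError("GGUF v1 is not handled in this sketch")
    if len(data) < 24:
        raise ValueError("truncated GGUF header")
    tensor_count, metadata_kv_count = struct.unpack_from("<QQ", data, 8)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


if __name__ == "__main__":
    # Synthetic header built in memory; the counts are arbitrary.
    sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
    print(parse_gguf_header(sample))
```

The metadata key/value pairs and tensor descriptors follow this header; decoding them (and the quantized tensor data) is where most of a full parser's line count goes.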
Filed reproducible upstream issues against PyTorch, ROCm, and AMD driver projects documenting where the software stack breaks on this silicon. Several triaged by maintainers; one closed after direct collaboration with an AMD engineer. (These are reported & triaged issues, not merged fixes.)
- PyTorch #178934, #178839: MIOpen Gemm solvers return `workspace_size=0` on gfx1150 (triaged, `has-workaround`)
- ROCm/rocm-libraries #6045, #6048: gfx1150 missing from CK whitelist; CK VGPR mismatch (in triage)
- ROCm/composable_kernel #3724: WMMA kernels fail on gfx1150
- amd/xdna-driver #1257: `aie2_smu_init` cold-boot precheck failure (closed after collaboration with AMD)
- amd/Triton-XDNA #33: `detect_npu_version()` doesn't recognize RyzenAI-npu4
"Two Negative Results for Vector Symbolic Architectures" — single-author 12-page preprint showing VSAs fail at FFN replacement (a rank bottleneck: VSA retrieval is rank ≤ top-k while FFN effective rank exceeds 2048) and at compositional image generation, with cross-scale validation on Qwen3-4B / 8B / 27B. Preprint, targeting the NeurIPS Negative Results track — not peer-reviewed. Read it: github.com/Peterc3-dev/cube-memory/tree/master/paper
- torch-vulkan — expanding op coverage on the Vulkan/SPIR-V PyTorch backend
- amdxdna NPU — driver debugging and bring-up on XDNA 2 (Strix Point)
- Cross-arch ML enablement on RDNA 3.5 + XDNA 2 — building and reporting upstream as gaps surface
Rust · C++17 · Python · GLSL/SPIR-V · Vulkan (Kompute) · HIP/ROCm · Linux kernel (driver debugging) · CachyOS/Arch · Kotlin / Android SDK · GraphQL · Tailscale
