Peterc3-dev/README.md
 ██████╗ ███████╗████████╗███████╗██████╗  ██████╗██████╗
 ██╔══██╗██╔════╝╚══██╔══╝██╔════╝██╔══██╗██╔════╝╚════██╗
 ██████╔╝█████╗     ██║   █████╗  ██████╔╝██║      █████╔╝
 ██╔═══╝ ██╔══╝     ██║   ██╔══╝  ██╔══██╗██║      ╚═══██╗
 ██║     ███████╗   ██║   ███████╗██║  ██║╚██████╗██████╔╝
 ╚═╝     ╚══════╝   ╚═╝   ╚══════╝╚═╝  ╚═╝ ╚═════╝╚═════╝

Self-taught systems programmer working at the GPU / driver / ML-runtime boundary. My focus is from-scratch, on-device NPU inference across two vendors' silicon: AMD (Radeon 890M iGPU + XDNA 2 NPU) and MediaTek (MDLA / APU 650). I build PyTorch backends, patch kernel drivers, drive vendor NPU compilers directly, and write the upstream bug reports for hardware the software stack hasn't caught up to yet. I publish all of it.

Async / written-first collaborator. Comfortable in Rust and C++ down to the dispatcher, allocator, and SPIR-V level.


Featured work

| Project | What it actually does |
| --- | --- |
| torch-vulkan | From-scratch PyTorch device backend (PrivateUse1 + Vulkan compute) for the Radeon 890M iGPU. 39 hand-written SPIR-V compute shaders, custom allocator, buffer pooling, Q4 matmul. C++17 / pybind11. Functional prototype. |
| amdxdna-strix-fix | Kernel driver patch: root-caused and fixed an SMU init-order bug in the in-tree amdxdna driver (Linux 6.14+) that left the Ryzen AI NPU dead on cold boot. Brought an otherwise-unusable NPU online. |
| mdla-cnn-engine | From-scratch INT8 inference engine for a MediaTek MDLA phone NPU (MT6886). Drives the on-device Neuron compiler directly, bypassing the gated NeuroPilot SDK. Four stock CNNs run end-to-end on the NPU through one pipeline, hardware-witnessed. Honestly scoped (CNN-only, stock models; the contribution is the path to the silicon). |
| recursive-routing-racer | Tri-processor dispatch runtime: routes ML workloads across CPU + iGPU (Vulkan) + NPU (XDNA) on Ryzen AI 300. REINFORCE-trained scheduler, SQLite-backed state, hardware monitoring. Working prototype, ~5,100 lines Python. |
| kv-compressor | KV-cache compression experiment with a documented negative result (FINDINGS.md): measured where the approach stops paying off, written up honestly rather than buried. |
| graphql-authz-fuzzer | GraphQL mutation authorization tester: schema introspection, probe generation, auth-gap classification. Standard-library only, has tests. |
| cube-memory | Research code + the public VSA negative-results preprint (/paper: 12-page PDF, LaTeX source, 6 figures). See Research below. |
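
For readers unfamiliar with what a "REINFORCE-trained scheduler" means, the idea can be sketched in a few lines. This is a minimal illustration, not code from recursive-routing-racer: the device labels, the simulated latencies, the learning rate, and the function names are all hypothetical, and a real scheduler would condition on workload features and measured hardware state rather than a fixed latency table.

```python
import math
import random

# Hypothetical device labels and made-up mean latencies (ms), for illustration only.
DEVICES = ["cpu", "igpu", "npu"]
SIM_LATENCY_MS = {"cpu": 30.0, "igpu": 12.0, "npu": 8.0}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_router(steps=3000, lr=0.02, seed=0):
    """Learn a softmax routing policy with REINFORCE and a running baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(DEVICES)
    baseline = 0.0  # running average reward, used to reduce gradient variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(DEVICES)), weights=probs)[0]
        # Simulated measurement: lower latency means higher reward.
        latency = rng.gauss(SIM_LATENCY_MS[DEVICES[action]], 1.0)
        reward = -latency
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # For a softmax policy, d/d(logit_i) log pi(action) = 1[i == action] - probs[i].
        for i in range(len(logits)):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_pi
    return dict(zip(DEVICES, softmax(logits)))


if __name__ == "__main__":
    for device, prob in train_router().items():
        print(f"{device}: {prob:.3f}")
```

The running baseline is a common variance-reduction choice; without it, every reward here is negative and early updates are noisier.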

More AMD / ML systems work

| Project | What it actually does |
| --- | --- |
| recursive-routing-racer-rs | LLM inference engine from scratch in Rust: GGUF loading, BPE tokenizer, KV cache, Vulkan GPU dispatch, speculative decoding. Runs Phi-4 Mini at ~5.5 tok/s. Learning project. |
| pytorch-gfx1150 | PyTorch built from source for the Radeon 890M (gfx1150): build scripts, AOTriton workarounds, GCC 15 fixes, documented. |
| miopen-gfx1150 | MIOpen analysis for RDNA 3.5: whitelist patch, CK VGPR analysis, solver-availability matrix; 3-bug writeup. |
| unified-ml | HIP + Vulkan unified-memory strategy benchmarks on AMD APUs, plus a GGUF parser (712 lines, 5 quant formats + F32/F16). |
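
As a rough idea of where a GGUF parser starts, the sketch below reads only the fixed-size file header (magic, version, tensor count, metadata key/value count) as laid out in GGUF versions 2 and 3. It is not the parser from unified-ml: the function and class names are hypothetical, and it does not touch metadata, tensor descriptors, or any quantization format.

```python
import io
import struct
from typing import BinaryIO, NamedTuple

GGUF_MAGIC = b"GGUF"
# Little-endian: 4-byte magic, uint32 version, uint64 tensor count, uint64 KV count.
HEADER_FORMAT = "<4sIQQ"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes


class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int


def read_gguf_header(stream: BinaryIO) -> GGUFHeader:
    """Parse the fixed GGUF header from a binary stream (versions 2 and 3 only)."""
    raw = stream.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError("file too short to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(HEADER_FORMAT, raw)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file (magic={magic!r})")
    if version < 2:
        # Version 1 stored the two counts as 32-bit values, so this layout does not apply.
        raise ValueError(f"unsupported GGUF version {version}")
    return GGUFHeader(version, tensor_count, kv_count)


if __name__ == "__main__":
    # Synthetic header bytes standing in for a real model file.
    fake_file = io.BytesIO(struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 2, 5))
    print(read_gguf_header(fake_file))
```

With a real model you would pass `open(path, "rb")` instead of the in-memory buffer; everything after these 24 bytes (metadata pairs, tensor infos, aligned tensor data) needs its own parsing.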

Upstream bug reports (RDNA 3.5 / Strix ML-enablement gaps)

Filed reproducible upstream issues against PyTorch, ROCm, and AMD driver projects documenting where the software stack breaks on this silicon. Several triaged by maintainers; one closed after direct collaboration with an AMD engineer. (These are reported & triaged issues, not merged fixes.)

  • PyTorch #178934, #178839: MIOpen Gemm solvers return workspace_size=0 on gfx1150 (triaged, has-workaround)
  • ROCm/rocm-libraries #6045, #6048: gfx1150 missing from CK whitelist; CK VGPR mismatch (in triage)
  • ROCm/composable_kernel #3724: WMMA kernels fail on gfx1150
  • amd/xdna-driver #1257: aie2_smu_init cold-boot precheck failure (closed after collaboration with AMD)
  • amd/Triton-XDNA #33: detect_npu_version() doesn't recognize RyzenAI-npu4

Research / writing

"Two Negative Results for Vector Symbolic Architectures" — single-author 12-page preprint showing VSAs fail at FFN replacement (a rank bottleneck: VSA retrieval is rank ≤ top-k while FFN effective rank exceeds 2048) and at compositional image generation, with cross-scale validation on Qwen3-4B / 8B / 27B. Preprint, targeting the NeurIPS Negative Results track — not peer-reviewed. Read it: github.com/Peterc3-dev/cube-memory/tree/master/paper


Currently exploring

  • torch-vulkan — expanding op coverage on the Vulkan/SPIR-V PyTorch backend
  • amdxdna NPU — driver debugging and bring-up on XDNA 2 (Strix Point)
  • Cross-arch ML enablement on RDNA 3.5 + XDNA 2 — building and reporting upstream as gaps surface

Tech

Rust · C++17 · Python · GLSL/SPIR-V · Vulkan (Kompute) · HIP/ROCm · Linux kernel (driver debugging) · CachyOS/Arch · Kotlin / Android SDK · GraphQL · Tailscale


Pinned

  1. agentspyboo (Rust)

    First Rust-based AI red team agent on AMD XDNA 2 NPU. Portable, air-gapped, zero-cloud autonomous pentesting.

  2. d-board (Kotlin)

    Ortholinear Android keyboard with a true grid layout for improved typing ergonomics.

  3. retune432-android (Kotlin)

    Android app to batch-convert audio from A440 to A432 Hz with metadata preservation.

  4. torch-vulkan (C++)

    Vulkan compute backend for PyTorch that runs on any GPU. PrivateUse1 dispatch, SPIR-V shaders, zero ROCm/CUDA dependency.

  5. omnivoice-gfx1150 (Python)

    Running k2-fsa/OmniVoice voice cloning TTS on the AMD Radeon 890M (gfx1150, Strix Point) integrated GPU: report, benchmark, and reproduction guide.

  6. recursive-routing-racer-rs (Rust)

    From-scratch Rust + Vulkan LLM inference engine for the AMD Radeon 890M: GGUF loading, BPE tokenizer, KV cache, speculative decoding, Q4 matmul. Runs Phi-4 Mini at ~5.5 tok/s. Learning project.