Summary
Build an end-to-end interactive demo: load GPT-2 weights, type a prompt, get generated text — with the attention computation running on a WarpForth-compiled GPU kernel.
What the user experience looks like
$ python demo/gpt2_generate.py --prompt "The meaning of life is"
Loading GPT-2 (124M) weights...
Compiling attention kernel with warpforthc...
Generating...
The meaning of life is to find your gift. The purpose of life is to give it away.
Architecture
┌─────────────────────────────────────────────────┐
│ Python orchestration (demo/gpt2_generate.py)    │
│                                                 │
│ 1. Load GPT-2 weights (HuggingFace)             │
│ 2. Tokenize input (tiktoken / transformers)     │
│ 3. For each generated token:                    │
│      For each of 12 layers:                     │
│        a. LayerNorm             ← PyTorch       │
│        b. QKV projection        ← PyTorch       │
│        c. Split into heads      ← PyTorch       │
│        d. Attention             ← WarpForth PTX │
│        e. Concat heads          ← PyTorch       │
│        f. Output projection     ← PyTorch       │
│        g. Residual + LayerNorm  ← PyTorch       │
│        h. MLP (2× matmul)       ← PyTorch       │
│        i. Residual              ← PyTorch       │
│      Final LayerNorm + logits   ← PyTorch       │
│      Sample next token                          │
│ 4. Decode and print                             │
└─────────────────────────────────────────────────┘
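A rough Python sketch of steps 3 and 4, assuming the weights have already been loaded and upcast. The container attributes (wte, wpe, blocks, ln_f_w, ln_f_b) and the transformer_block() helper are illustrative names, not an existing API; only the attention inside transformer_block() runs on the WarpForth kernel.

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=32):
    """Greedy decoding loop (steps 3-4). `model` is a plain container for the
    upcast f64 weights; transformer_block() performs substeps a-i, and only
    the attention inside it runs on the WarpForth-compiled kernel."""
    ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        x = model.wte[ids] + model.wpe[: len(ids)]       # token + position embeddings
        for block in model.blocks:                       # 12 transformer layers
            x = transformer_block(x, block)              # substeps a-i
        x = torch.nn.functional.layer_norm(
            x, (768,), model.ln_f_w, model.ln_f_b)       # final LayerNorm
        logits = x[-1] @ model.wte.T                     # tied embeddings -> vocab logits
        ids.append(int(torch.argmax(logits)))            # greedy sampling for simplicity
    return tokenizer.decode(ids)
```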
GPT-2 (124M) specs
- 12 layers, 12 attention heads
- hidden_dim = 768, head_dim = 64
- vocab = 50257, max_seq_len = 1024
- Weights ~500MB (float32), ~1GB (float64)
Key design decisions
f64 vs f32
WarpForth currently operates on f64 (double precision). GPT-2 weights are typically float32. Options:
- Upcast to f64: Simple: just convert the weights on load. 2× the memory, but GPT-2-small still fits easily in GPU memory. This is the pragmatic choice for a demo.
- Add f32 support to WarpForth: Better long-term but a large feature.
Recommendation: upcast to f64 for the demo.
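A minimal sketch of the upcast-on-load approach using the HuggingFace transformers checkpoint; variable names are illustrative:

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")      # 124M checkpoint, float32
weights = {name: t.to(torch.float64).cuda()          # upcast to f64 on the GPU (~1GB)
           for name, t in model.state_dict().items()}
```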
Multi-head attention
GPT-2 has 12 heads. Two options:
- Python loop: Split Q/K/V into 12 heads, launch the attention kernel 12 times, concatenate. Simple.
- Batched kernel: Launch with grid (seq_len, n_heads, 1) and handle head indexing in Forth. More efficient.
Recommendation: Start with the Python loop; it's simpler and the kernel stays clean.
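A sketch of the recommended Python loop, assuming an attention_kernel(q, k, v) wrapper around the PyCUDA launch from #45 (the wrapper name and signature are assumptions):

```python
import torch

def multi_head_attention(q, k, v, attention_kernel, n_heads=12, head_dim=64):
    """q, k, v: (seq_len, 768) f64 tensors; attention_kernel() wraps the
    PyCUDA launch of the WarpForth PTX attention kernel."""
    outputs = []
    for h in range(n_heads):
        cols = slice(h * head_dim, (h + 1) * head_dim)   # this head's 64 columns
        outputs.append(attention_kernel(q[:, cols].contiguous(),
                                        k[:, cols].contiguous(),
                                        v[:, cols].contiguous()))
    return torch.cat(outputs, dim=-1)                    # concat heads: (seq_len, 768)
```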
Sequence length
For interactive generation, the sequence length grows with each new token, so the attention kernel needs to handle a variable seq_len (passed as a scalar parameter). The maximum for GPT-2 is 1024.
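An illustrative PyCUDA launch showing seq_len passed as a scalar argument and the grid sized from the current sequence length; the kernel's exact argument list comes from #44/#45 and is assumed here:

```python
import numpy as np

def launch_attention(kernel, q_gpu, k_gpu, v_gpu, out_gpu, seq_len, head_dim=64):
    """seq_len changes on every decoding step, so it is passed as a scalar
    kernel argument and the grid is sized from it at launch time."""
    threads = 64
    kernel(q_gpu, k_gpu, v_gpu, out_gpu,
           np.int32(seq_len), np.int32(head_dim),
           block=(threads, 1, 1),
           grid=((seq_len + threads - 1) // threads, 1, 1))
```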
Files to create
- demo/gpt2_generate.py — Main script
- demo/attention.forth — Attention kernel (from #44 — Naive attention kernel in Forth)
- demo/warpforth.py — PyCUDA integration (from #45 — Python/PyCUDA integration for launching WarpForth kernels)
- demo/README.md — Setup and usage instructions
Dependencies (must be completed first)
- #42 — Float math intrinsics: FEXP, FSQRT, FLOG, FABS, FNEG, FMAX, FMIN
- #44 — Naive attention kernel in Forth
- #45 — Python/PyCUDA integration for launching WarpForth kernels
Requirements
- NVIDIA GPU with CUDA
- PyTorch (for everything except attention)
- PyCUDA (for launching WarpForth kernels)
- transformers or tiktoken (for the tokenizer)
- warpforthc binary on PATH
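A quick, optional environment check for the requirements above (not part of the demo itself):

```python
import shutil
import torch

assert torch.cuda.is_available(), "NVIDIA GPU with CUDA required"
assert shutil.which("warpforthc"), "warpforthc binary must be on PATH"
```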
Acceptance criteria
- Loads GPT-2-small (124M) weights from HuggingFace
- Generates coherent text from a prompt
- Attention runs on WarpForth-compiled PTX kernel
- Output matches pure-PyTorch GPT-2 inference within f64 tolerance (see the parity check sketched after this list)
- Interactive: accepts prompt from command line or stdin
- Works on a single consumer GPU (e.g., RTX 3060, 8GB VRAM)
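A sketch of the parity check, comparing against a pure-PyTorch HuggingFace reference run in f64; demo_logits() is a hypothetical hook that returns the logits computed with the WarpForth attention path:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
ref = GPT2LMHeadModel.from_pretrained("gpt2").double().eval()   # f64 reference model

prompt = "The meaning of life is"
ids = torch.tensor([tok.encode(prompt)])
with torch.no_grad():
    ref_logits = ref(ids).logits[0]                             # (seq_len, 50257)

# demo_logits() is a hypothetical hook returning the demo's logits for the prompt.
assert torch.allclose(demo_logits(prompt), ref_logits, atol=1e-6)
```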