GPT-2 interactive inference demo with WarpForth attention #46

@tetsuo-cpp

Description

Summary

Build an end-to-end interactive demo: load GPT-2 weights, type a prompt, get generated text — with the attention computation running on a WarpForth-compiled GPU kernel.

What the user experience looks like

$ python demo/gpt2_generate.py --prompt "The meaning of life is"
Loading GPT-2 (124M) weights...
Compiling attention kernel with warpforthc...
Generating...

The meaning of life is to find your gift. The purpose of life is to give it away.

Architecture

┌──────────────────────────────────────────────┐
│  Python orchestration (demo/gpt2_generate.py)│
│                                              │
│  1. Load GPT-2 weights (HuggingFace)         │
│  2. Tokenize input (tiktoken / transformers) │
│  3. For each generated token:                │
│     For each of 12 layers:                   │
│       a. LayerNorm            ← PyTorch      │
│       b. QKV projection       ← PyTorch      │
│       c. Split into heads     ← PyTorch      │
│       d. Attention            ← WarpForth PTX│
│       e. Concat heads         ← PyTorch      │
│       f. Output projection    ← PyTorch      │
│       g. Residual + LayerNorm ← PyTorch      │
│       h. MLP (2× matmul)      ← PyTorch      │
│       i. Residual             ← PyTorch      │
│     Final LayerNorm + logits  ← PyTorch      │
│     Sample next token                        │
│  4. Decode and print                         │
└──────────────────────────────────────────────┘
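
A minimal sketch of this loop, assuming the helper names used in the other sketches in this issue: load_gpt2_params (the f64 upcast sketched under "f64 vs f32") and multi_head_attention (the per-head loop sketched under "Multi-head attention"). Parameter keys follow HuggingFace's state_dict naming; greedy decoding and no KV cache, just to keep it short.

```python
# Sketch of steps 1-4 above. PyTorch handles everything except the attention
# itself, which goes through the WarpForth kernel inside multi_head_attention.
import tiktoken
import torch
import torch.nn.functional as F

N_LAYERS, HIDDEN = 12, 768

def transformer_block(x, p, i):
    pre = f"transformer.h.{i}."
    # a. LayerNorm, b. QKV projection, c. split into heads  (PyTorch)
    h = F.layer_norm(x, (HIDDEN,), p[pre + "ln_1.weight"], p[pre + "ln_1.bias"])
    qkv = h @ p[pre + "attn.c_attn.weight"] + p[pre + "attn.c_attn.bias"]
    q, k, v = qkv.split(HIDDEN, dim=-1)
    # d./e. attention on the WarpForth kernel, one launch per head, then concat
    attn = multi_head_attention(q, k, v)
    # f./g. output projection + residual, then pre-MLP LayerNorm
    x = x + attn @ p[pre + "attn.c_proj.weight"] + p[pre + "attn.c_proj.bias"]
    h = F.layer_norm(x, (HIDDEN,), p[pre + "ln_2.weight"], p[pre + "ln_2.bias"])
    # h./i. MLP (2 matmuls, tanh-approx GELU as in GPT-2) + residual
    h = F.gelu(h @ p[pre + "mlp.c_fc.weight"] + p[pre + "mlp.c_fc.bias"], approximate="tanh")
    return x + h @ p[pre + "mlp.c_proj.weight"] + p[pre + "mlp.c_proj.bias"]

def generate(prompt, max_new_tokens=32):
    p = load_gpt2_params()                                  # 1. load weights (f64, on GPU)
    enc = tiktoken.get_encoding("gpt2")
    ids = enc.encode(prompt)                                # 2. tokenize
    for _ in range(max_new_tokens):                         # 3. per-token loop (no KV cache)
        x = p["transformer.wte.weight"][ids] + p["transformer.wpe.weight"][: len(ids)]
        for i in range(N_LAYERS):
            x = transformer_block(x, p, i)
        x = F.layer_norm(x, (HIDDEN,), p["transformer.ln_f.weight"], p["transformer.ln_f.bias"])
        logits = x[-1] @ p["transformer.wte.weight"].T      # tied embeddings as output head
        ids.append(int(torch.argmax(logits)))               # greedy sampling keeps the sketch simple
    return enc.decode(ids)                                  # 4. decode
```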

GPT-2 (124M) specs

  • 12 layers, 12 attention heads
  • hidden_dim = 768, head_dim = 64
  • vocab = 50257, max_seq_len = 1024
  • Weights ~500MB (float32), ~1GB (float64)

Key design decisions

f64 vs f32

WarpForth currently operates on f64 (double precision). GPT-2 weights are typically float32. Options:

  • Upcast to f64: Simple, just convert weights on load. 2× memory but GPT-2-small fits easily in GPU memory. This is the pragmatic choice for a demo.
  • Add f32 support to WarpForth: Better long-term but a large feature.

Recommendation: upcast to f64 for the demo.
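
A sketch of the upcast-on-load path, assuming the demo pulls weights from HuggingFace (load_gpt2_params is just the helper name used in these sketches, not an existing API):

```python
import torch
from transformers import GPT2LMHeadModel

def load_gpt2_params(device="cuda"):
    # Load the float32 checkpoint (~500MB) and upcast every tensor to f64
    # (~1GB), since WarpForth currently computes in double precision.
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    return {
        name: t.detach().to(device=device, dtype=torch.float64)
        for name, t in model.state_dict().items()
    }
```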

Multi-head attention

GPT-2 has 12 heads. Two options:

  • Python loop: Split Q/K/V into 12 heads, launch the attention kernel 12 times, concatenate. Simple.
  • Batched kernel: Launch with grid (seq_len, n_heads, 1), handle head indexing in Forth. More efficient.

Recommendation: Start with the Python loop; it's simpler and the kernel stays clean.
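
A sketch of the Python-loop option, assuming warpforth_attention is a wrapper that launches the WarpForth kernel on one (seq_len, head_dim) head (a launch sketch follows under "Sequence length"):

```python
import torch

N_HEADS, HEAD_DIM = 12, 64

def multi_head_attention(q, k, v):
    # q, k, v: (seq_len, 768). Slice out each 64-wide head, run the attention
    # kernel once per head, and concatenate back to (seq_len, 768).
    heads = []
    for h in range(N_HEADS):
        cols = slice(h * HEAD_DIM, (h + 1) * HEAD_DIM)
        heads.append(warpforth_attention(q[:, cols], k[:, cols], v[:, cols]))
    return torch.cat(heads, dim=-1)
```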

Sequence length

For interactive generation, the sequence length grows with each generated token, so the attention kernel must accept a variable seq_len passed in as a scalar parameter (up to GPT-2's maximum of 1024).
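
A PyCUDA launch sketch for the variable-length case. The kernel entry-point name, argument order, and launch geometry are assumptions (whatever warpforthc actually emits would go here); the point is that seq_len flows in as an ordinary scalar argument on every launch.

```python
import numpy as np
import torch
import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
import pycuda.driver as drv

mod = drv.module_from_file("attention.ptx")   # PTX emitted by warpforthc
kernel = mod.get_function("attention")        # entry-point name is an assumption

def to_host(t):
    # Kernel arguments must be contiguous host arrays for drv.In/drv.Out.
    return np.ascontiguousarray(t.detach().cpu().numpy())

def warpforth_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) f64 tensors for one head; returns the same shape.
    seq_len, head_dim = q.shape
    out = np.empty((seq_len, head_dim), dtype=np.float64)
    kernel(
        drv.In(to_host(q)), drv.In(to_host(k)), drv.In(to_host(v)),
        drv.Out(out),
        np.int32(seq_len), np.int32(head_dim),          # seq_len passed as a scalar param
        block=(head_dim, 1, 1), grid=(seq_len, 1, 1),   # one block per query row (assumption)
    )
    return torch.from_numpy(out).to(q.device)
```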

Files to create

Dependencies (must be completed first)

Requirements

  • NVIDIA GPU with CUDA
  • PyTorch (for everything except attention)
  • PyCUDA (for launching WarpForth kernels)
  • transformers or tiktoken (for tokenizer)
  • warpforthc binary on PATH

Acceptance criteria

  • Loads GPT-2-small (124M) weights from HuggingFace
  • Generates coherent text from a prompt
  • Attention runs on WarpForth-compiled PTX kernel
  • Output matches pure-PyTorch GPT-2 inference within f64 tolerance (see the comparison sketch after this list)
  • Interactive: accepts prompt from command line or stdin
  • Works on a single consumer GPU (e.g., RTX 3060, 8GB VRAM)
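
One way to check the "matches pure-PyTorch inference" criterion; demo_forward is a hypothetical hook that runs the demo's forward pass (with WarpForth attention) and returns per-position logits, and atol=1e-4 is an assumed tolerance:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def check_against_pytorch(prompt="The meaning of life is", atol=1e-4):
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    ref = GPT2LMHeadModel.from_pretrained("gpt2").eval().double()
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        expected = ref(ids).logits[0]              # (seq_len, vocab), f64 reference
    actual = demo_forward(ids[0].tolist())         # demo pipeline, WarpForth attention
    diff = (actual.double().cpu() - expected).abs().max().item()
    assert diff < atol, f"max abs logit diff {diff:.2e} exceeds {atol}"
```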
