GPT-2 interactive inference demo with WarpForth attention #46

@tetsuo-cpp

Description

Summary

Build an end-to-end interactive demo: load GPT-2 weights, type a prompt, get generated text — with the attention computation running on a WarpForth-compiled GPU kernel.

What the user experience looks like

$ python demo/gpt2_generate.py --prompt "The meaning of life is"
Loading GPT-2 (124M) weights...
Compiling attention kernel with warpforthc...
Generating...

The meaning of life is to find your gift. The purpose of life is to give it away.

Architecture

┌──────────────────────────────────────────────┐
│  Python orchestration (demo/gpt2_generate.py)│
│                                              │
│  1. Load GPT-2 weights (HuggingFace)         │
│  2. Tokenize input (tiktoken / transformers) │
│  3. For each generated token:                │
│     For each of 12 layers:                   │
│       a. LayerNorm            ← PyTorch      │
│       b. QKV projection       ← PyTorch      │
│       c. Split into heads     ← PyTorch      │
│       d. Attention            ← WarpForth PTX│
│       e. Concat heads         ← PyTorch      │
│       f. Output projection    ← PyTorch      │
│       g. Residual + LayerNorm ← PyTorch      │
│       h. MLP (2× matmul)      ← PyTorch      │
│       i. Residual             ← PyTorch      │
│     Final LayerNorm + logits  ← PyTorch      │
│     Sample next token                        │
│  4. Decode and print                         │
└──────────────────────────────────────────────┘
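
A minimal sketch of this loop, assuming the helper names used in the other sketches in this issue: load_gpt2_params (the f64 upcast sketched under "f64 vs f32") and multi_head_attention (the per-head loop sketched under "Multi-head attention"). Parameter keys follow HuggingFace's state_dict naming; greedy decoding and no KV cache, just to keep it short.

```python
# Sketch of steps 1-4 above. PyTorch handles everything except the attention
# itself, which goes through the WarpForth kernel inside multi_head_attention.
import tiktoken
import torch
import torch.nn.functional as F

N_LAYERS, HIDDEN = 12, 768

def transformer_block(x, p, i):
    pre = f"transformer.h.{i}."
    # a. LayerNorm, b. QKV projection, c. split into heads  (PyTorch)
    h = F.layer_norm(x, (HIDDEN,), p[pre + "ln_1.weight"], p[pre + "ln_1.bias"])
    qkv = h @ p[pre + "attn.c_attn.weight"] + p[pre + "attn.c_attn.bias"]
    q, k, v = qkv.split(HIDDEN, dim=-1)
    # d./e. attention on the WarpForth kernel, one launch per head, then concat
    attn = multi_head_attention(q, k, v)
    # f./g. output projection + residual, then pre-MLP LayerNorm
    x = x + attn @ p[pre + "attn.c_proj.weight"] + p[pre + "attn.c_proj.bias"]
    h = F.layer_norm(x, (HIDDEN,), p[pre + "ln_2.weight"], p[pre + "ln_2.bias"])
    # h./i. MLP (2 matmuls, tanh-approx GELU as in GPT-2) + residual
    h = F.gelu(h @ p[pre + "mlp.c_fc.weight"] + p[pre + "mlp.c_fc.bias"], approximate="tanh")
    return x + h @ p[pre + "mlp.c_proj.weight"] + p[pre + "mlp.c_proj.bias"]

def generate(prompt, max_new_tokens=32):
    p = load_gpt2_params()                                  # 1. load weights (f64, on GPU)
    enc = tiktoken.get_encoding("gpt2")
    ids = enc.encode(prompt)                                # 2. tokenize
    for _ in range(max_new_tokens):                         # 3. per-token loop (no KV cache)
        x = p["transformer.wte.weight"][ids] + p["transformer.wpe.weight"][: len(ids)]
        for i in range(N_LAYERS):
            x = transformer_block(x, p, i)
        x = F.layer_norm(x, (HIDDEN,), p["transformer.ln_f.weight"], p["transformer.ln_f.bias"])
        logits = x[-1] @ p["transformer.wte.weight"].T      # tied embeddings as output head
        ids.append(int(torch.argmax(logits)))               # greedy sampling keeps the sketch simple
    return enc.decode(ids)                                  # 4. decode
```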

GPT-2 (124M) specs

  • 12 layers, 12 attention heads
  • hidden_dim = 768, head_dim = 64
  • vocab = 50257, max_seq_len = 1024
  • Weights ~500MB (float32), ~1GB (float64)

Key design decisions

f64 vs f32

WarpForth currently operates on f64 (double precision). GPT-2 weights are typically float32. Options:

  • Upcast to f64: Simple, just convert weights on load. 2× memory but GPT-2-small fits easily in GPU memory. This is the pragmatic choice for a demo.
  • Add f32 support to WarpForth: Better long-term but a large feature.

Recommendation: upcast to f64 for the demo.
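
A sketch of the upcast-on-load path, assuming the demo pulls weights from HuggingFace (load_gpt2_params is just the helper name used in these sketches, not an existing API):

```python
import torch
from transformers import GPT2LMHeadModel

def load_gpt2_params(device="cuda"):
    # Load the float32 checkpoint (~500MB) and upcast every tensor to f64
    # (~1GB), since WarpForth currently computes in double precision.
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    return {
        name: t.detach().to(device=device, dtype=torch.float64)
        for name, t in model.state_dict().items()
    }
```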

Multi-head attention

GPT-2 has 12 heads. Two options:

  • Python loop: Split Q/K/V into 12 heads, launch the attention kernel 12 times, concatenate. Simple.
  • Batched kernel: Launch with grid (seq_len, n_heads, 1), handle head indexing in Forth. More efficient.

Recommendation: Start with the Python loop; it's simpler and the kernel stays clean.
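
A sketch of the Python-loop option, assuming warpforth_attention is a wrapper that launches the WarpForth kernel on one (seq_len, head_dim) head (a launch sketch follows under "Sequence length"):

```python
import torch

N_HEADS, HEAD_DIM = 12, 64

def multi_head_attention(q, k, v):
    # q, k, v: (seq_len, 768). Slice out each 64-wide head, run the attention
    # kernel once per head, and concatenate back to (seq_len, 768).
    heads = []
    for h in range(N_HEADS):
        cols = slice(h * HEAD_DIM, (h + 1) * HEAD_DIM)
        heads.append(warpforth_attention(q[:, cols], k[:, cols], v[:, cols]))
    return torch.cat(heads, dim=-1)
```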

Sequence length

For interactive generation, the sequence length grows with each generated token, so the attention kernel must accept a variable seq_len passed in as a scalar parameter (up to GPT-2's maximum of 1024).
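
A PyCUDA launch sketch for the variable-length case. The kernel entry-point name, argument order, and launch geometry are assumptions (whatever warpforthc actually emits would go here); the point is that seq_len flows in as an ordinary scalar argument on every launch.

```python
import numpy as np
import torch
import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
import pycuda.driver as drv

mod = drv.module_from_file("attention.ptx")   # PTX emitted by warpforthc
kernel = mod.get_function("attention")        # entry-point name is an assumption

def to_host(t):
    # Kernel arguments must be contiguous host arrays for drv.In/drv.Out.
    return np.ascontiguousarray(t.detach().cpu().numpy())

def warpforth_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) f64 tensors for one head; returns the same shape.
    seq_len, head_dim = q.shape
    out = np.empty((seq_len, head_dim), dtype=np.float64)
    kernel(
        drv.In(to_host(q)), drv.In(to_host(k)), drv.In(to_host(v)),
        drv.Out(out),
        np.int32(seq_len), np.int32(head_dim),          # seq_len passed as a scalar param
        block=(head_dim, 1, 1), grid=(seq_len, 1, 1),   # one block per query row (assumption)
    )
    return torch.from_numpy(out).to(q.device)
```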

Files to create

Dependencies (must be completed first)

Requirements

  • NVIDIA GPU with CUDA
  • PyTorch (for everything except attention)
  • PyCUDA (for launching WarpForth kernels)
  • transformers or tiktoken (for tokenizer)
  • warpforthc binary on PATH

Acceptance criteria

  • Loads GPT-2-small (124M) weights from HuggingFace
  • Generates coherent text from a prompt
  • Attention runs on WarpForth-compiled PTX kernel
  • Output matches pure-PyTorch GPT-2 inference within f64 tolerance (see the comparison sketch after this list)
  • Interactive: accepts prompt from command line or stdin
  • Works on a single consumer GPU (e.g., RTX 3060, 8GB VRAM)
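
One way to check the "matches pure-PyTorch inference" criterion; demo_forward is a hypothetical hook that runs the demo's forward pass (with WarpForth attention) and returns per-position logits, and atol=1e-4 is an assumed tolerance:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def check_against_pytorch(prompt="The meaning of life is", atol=1e-4):
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    ref = GPT2LMHeadModel.from_pretrained("gpt2").eval().double()
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        expected = ref(ids).logits[0]              # (seq_len, vocab), f64 reference
    actual = demo_forward(ids[0].tolist())         # demo pipeline, WarpForth attention
    diff = (actual.double().cpu() - expected).abs().max().item()
    assert diff < atol, f"max abs logit diff {diff:.2e} exceeds {atol}"
```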
