## Summary
Create a Python module that loads WarpForth-compiled PTX kernels and launches them with NumPy/PyTorch tensors as arguments. This replaces `warpforth-runner` for real workloads.
## Motivation
The existing `warpforth-runner` is a standalone C++ tool designed for testing — it takes CSV values on the command line. For real ML workloads, we need to pass large tensors (millions of elements) directly from Python without serialization overhead.
## Design

### Core API
```python
from warpforth import WarpForthKernel

# Compile and load
kernel = WarpForthKernel("attention.forth")

# Launch with PyTorch tensors (zero-copy via .data_ptr())
kernel.launch(
    Q_gpu, K_gpu, V_gpu, O_gpu,  # GPU tensors
    seq_len, head_dim,           # scalar params
    grid=(seq_len, 1, 1),
    block=(64, 1, 1),
)
```
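The same entry point should also accept host-side NumPy arrays, which the module copies to the GPU before launch (see the acceptance criteria below). A hypothetical usage sketch of that path, mirroring the PyTorch example above:

```python
import numpy as np

# Host-side NumPy inputs; the module is expected to auto-copy them to the GPU
Q, K, V = (np.random.rand(seq_len, head_dim) for _ in range(3))
O = np.zeros((seq_len, head_dim))

kernel.launch(Q, K, V, O, seq_len, head_dim,
              grid=(seq_len, 1, 1), block=(64, 1, 1))
```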
### Implementation

- Use PyCUDA's `cuda.module_from_buffer()` to load PTX
- Accept both NumPy arrays (copy to GPU) and PyTorch CUDA tensors (zero-copy via `data_ptr()`)
- Subprocess call to `warpforthc` for compilation, or accept pre-compiled PTX
- Parse `\!` header directives from the Forth source to determine parameter types and order
- Map f64 arrays to `float64` device pointers, i64 arrays to `int64` device pointers
- Handle scalar params (pass by value, not pointer); a sketch of the whole argument-mapping path follows this list
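A minimal sketch of how these pieces could fit together, assuming PyCUDA and a pre-compiled PTX buffer. The names `_prepare_arg` and `launch` are illustrative, not part of any existing API:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a default CUDA context
import pycuda.driver as cuda


def _prepare_arg(obj):
    """Map one Python-side value onto a CUDA kernel argument."""
    if hasattr(obj, "data_ptr"):             # PyTorch CUDA tensor: zero-copy
        return np.uintp(obj.data_ptr())      # raw device pointer, passed by value
    if isinstance(obj, np.ndarray):          # NumPy array: copy host -> device
        dev = cuda.mem_alloc(obj.nbytes)
        cuda.memcpy_htod(dev, np.ascontiguousarray(obj))
        return dev                           # DeviceAllocation is a valid kernel arg
    if isinstance(obj, float):               # f64 scalar: bitcast to i64
        return np.float64(obj).view(np.int64)
    if isinstance(obj, int):                 # i64 scalar: pass by value
        return np.int64(obj)
    raise TypeError(f"unsupported argument type: {type(obj)}")


def launch(ptx: bytes, entry: str, *args, grid, block):
    """Load WarpForth-emitted PTX and launch a single kernel."""
    mod = cuda.module_from_buffer(ptx)
    func = mod.get_function(entry)
    func(*[_prepare_arg(a) for a in args], grid=grid, block=block)
```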
### Parameter mapping

| Forth declaration | Python input | CUDA argument |
|---|---|---|
| `\! param X f64[N]` | `torch.Tensor` (float64, CUDA) | Device pointer |
| `\! param X i64[N]` | `torch.Tensor` (int64, CUDA) | Device pointer |
| `\! param X f64` | `float` | Value (bitcast to i64) |
| `\! param X i64` | `int` | Value |
## Files to create
- `demo/warpforth.py` — The integration module
- `demo/requirements.txt` or `pyproject.toml` — Dependencies (pycuda, numpy, torch)
## Acceptance criteria
- Can load a WarpForth-compiled PTX kernel
- Can launch with PyTorch CUDA tensors (zero-copy)
- Can launch with NumPy arrays (auto-copy to GPU)
- Correctly handles both array and scalar parameters
- Works with the naive attention kernel from #44
## Dependencies
- #44 — Naive attention kernel in Forth (first consumer of this integration)