A neural network that recognizes handwritten digits, written entirely in x86-64 NASM assembly using raw Linux syscalls, SSE2/SSE4.1 and floating-point instructions.
Achieves 96.5% test accuracy on MNIST (9,652/10,000 correct), comparable to equivalent Python/NumPy implementations.
Includes a live drawing app: draw digits on a canvas and watch the NASM neural network classify them in real time (~0.7ms per inference).
Input (784) → Hidden (32, sigmoid) → Output (10, softmax)
- Training: Mini-batch SGD, batch size 32, learning rate 0.1, 30 epochs
- Loss: Cross-entropy with softmax output
- Weight init: Xavier scaling (hidden: ±0.0815, output: ±0.378)
- Data: MNIST IDX binary files parsed at runtime (no preprocessing tools)
- Precision: IEEE 754 double-precision (64-bit) throughout
- Linux x86-64 (tested on WSL2)
- NASM assembler (`sudo apt install nasm`)
- GNU ld linker (usually included with `binutils`)
- GCC (for the live app server)
- Python 3 (optional, for downloading MNIST data)

```sh
./scripts/download_mnist.sh
```

This downloads the four MNIST files (~55MB) into `data/`.

```sh
make clean && make
./build/mnistlearn
```

Training takes ~1–2 minutes (30 epochs). Output:
```
=== NASM Learn Complex — MNIST ===
Training 784-32-10 network...
Epoch 1/30 | Loss: 0.6105 | Accuracy: 85.26%
...
Epoch 30/30 | Loss: 0.0788 | Accuracy: 97.78%
Saving weights... done.
--- Test Results ---
Overall accuracy: 96.52% (9652/10000)
Per-digit accuracy:
0: 98.98%
1: 98.77%
2: 95.93%
...
```
Trained weights are saved to `data/weights.bin` (203,600 bytes).
```sh
cd app
make clean && make
make run
```

Open http://localhost:8080 in your browser. Draw a digit on the canvas and watch predictions update live as you draw.

The server links directly with the NASM neural network object files. The same assembly code that trained the network runs inference with zero overhead.
```
nasm-learn-complex/
├── include/
│   └── constants.inc     # Network dims, syscall numbers, training params
├── src/
│   ├── main.asm          # Entry point, training loop, weight save, test eval
│   ├── data/
│   │   └── mnist.asm     # MNIST IDX file parser, pixel normalization
│   ├── math/
│   │   ├── dot.asm       # Dot product (leaf function)
│   │   ├── exp.asm       # e^x via range reduction + Horner polynomial
│   │   ├── ln.asm        # ln(x) via IEEE 754 decomposition + Pade series
│   │   ├── sigmoid.asm   # Sigmoid and derivative
│   │   └── softmax.asm   # Numerically stable softmax
│   ├── nn/
│   │   ├── forward.asm   # Dense layer forward pass (sigmoid + linear)
│   │   ├── backward.asm  # Backpropagation (output softmax + hidden sigmoid)
│   │   └── update.asm    # SGD weight update, zero/scale buffer utilities
│   ├── io/
│   │   ├── print.asm     # Print routines (string, double, uint64, percent)
│   │   └── string.asm    # Double-to-ASCII conversion
│   └── timing/
│       └── clock.asm     # Wall-clock timing via clock_gettime
├── app/
│   ├── server.c          # HTTP inference server (links NASM .o files)
│   ├── index.html        # Drawing canvas + live prediction UI
│   └── Makefile          # Builds server with NASM objects
├── scripts/
│   └── download_mnist.sh # Downloads MNIST data files
├── data/                 # MNIST files + trained weights (gitignored)
├── docs/plans/           # Design documents
└── Makefile              # Builds the training program
```
Total: 2,680 lines of NASM assembly + 170 lines of C server + 300 lines of HTML/JS frontend.
- **Data loading:** `mnist.asm` opens MNIST IDX binary files with `sys_open`, reads them with `sys_read` in a loop, parses big-endian headers with `bswap`, normalizes pixel bytes to `[0, 1]` doubles, and one-hot encodes labels.
- **Forward pass:** For each sample, `forward.asm` computes `output = sigmoid(W·x + b)` for the hidden layer and raw linear output for the softmax layer. `softmax.asm` converts raw outputs to probabilities (numerically stable: subtracts the max before `exp`).
- **Loss:** Cross-entropy loss `−ln(softmax[true_class])`, computed via `ln.asm`, which decomposes IEEE 754 doubles into exponent + mantissa for a Padé approximation.
- **Backward pass:** `backward.asm` computes gradients. The softmax + cross-entropy gradient simplifies to `delta[i] = output[i] - target[i]`. Hidden-layer gradients use the sigmoid derivative `s(1−s)` and the chain rule.
- **Update:** After each mini-batch (32 samples), gradients are averaged and weights are updated: `w -= lr * grad`.
- **Shuffling:** A Fisher-Yates shuffle randomizes training order each epoch, using an LCG PRNG seeded from `rdtsc`.
The inference server (`app/server.c`) links directly with the same NASM `.o` files used in training. It:

- Loads trained weights from `data/weights.bin`
- Serves the HTML/JS drawing interface
- Accepts POST requests with 784 raw pixel bytes
- Normalizes pixels, calls `forward_sigmoid` → `forward_linear` → `softmax`
- Returns JSON probabilities
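The request path (normalize → forward → JSON) can be illustrated in C. The helper names and the JSON shape below are assumptions for illustration, not the server's actual output format:

```c
#include <stdio.h>
#include <string.h>

/* Convert the 784 raw pixel bytes from the POST body to [0, 1] doubles. */
static void normalize(const unsigned char *pixels, double *x, int n) {
    for (int i = 0; i < n; i++) x[i] = pixels[i] / 255.0;
}

/* Format 10 class probabilities as JSON (shape is an assumption). */
static int to_json(const double *p, char *buf, size_t cap) {
    size_t off = 0;
    off += snprintf(buf + off, cap - off, "{\"probs\":[");
    for (int i = 0; i < 10; i++)
        off += snprintf(buf + off, cap - off, "%s%.6f", i ? "," : "", p[i]);
    off += snprintf(buf + off, cap - off, "]}");
    return (int)off;
}
```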
The frontend draws on a 280×280 canvas, downsamples to 28×28 by averaging 10×10 blocks, centers the digit using its center of mass (matching MNIST preprocessing), and requests a fresh prediction on every stroke.
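The block-averaging downsample the frontend performs in JavaScript looks like this in C (function name and row-major buffer layout are illustrative):

```c
/* Downsample a 280x280 grayscale canvas to 28x28 by averaging each
   10x10 block of source pixels into one destination pixel. */
static void downsample(const unsigned char *src, unsigned char *dst) {
    for (int r = 0; r < 28; r++) {
        for (int c = 0; c < 28; c++) {
            unsigned sum = 0;
            for (int dr = 0; dr < 10; dr++)
                for (int dc = 0; dc < 10; dc++)
                    sum += src[(r * 10 + dr) * 280 + (c * 10 + dc)];
            dst[r * 28 + c] = (unsigned char)(sum / 100);
        }
    }
}
```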
All functions follow the SysV AMD64 ABI:

- Arguments: `rdi`, `rsi`, `rdx`, `rcx`, `r8`, `r9` (then stack)
- Callee-saved: `rbx`, `rbp`, `r12`–`r15`
- Return: `rax` (integer), `xmm0` (float)
- All XMM registers are caller-saved
- Stack aligned to 16 bytes before every `call`
`data/weights.bin` contains raw IEEE 754 doubles in this order:
| Buffer | Count | Bytes |
|---|---|---|
| w_hidden | 25,088 | 200,704 |
| b_hidden | 32 | 256 |
| w_output | 320 | 2,560 |
| b_output | 10 | 80 |
| Total | 25,450 | 203,600 |
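A minimal C loader matching this layout might look as follows. It assumes native little-endian doubles (what the training program writes on x86-64); the function name and argument order are illustrative:

```c
#include <stdio.h>

enum { N_IN = 784, N_HID = 32, N_OUT = 10 };

/* Buffer sizes (in doubles) implied by the table above. */
#define W_HIDDEN (N_IN * N_HID)   /* 25088 doubles */
#define B_HIDDEN (N_HID)          /*    32 doubles */
#define W_OUTPUT (N_HID * N_OUT)  /*   320 doubles */
#define B_OUTPUT (N_OUT)          /*    10 doubles */

/* Read the four buffers back-to-back, in file order. Returns 0 on success. */
static int load_weights(const char *path, double *wh, double *bh,
                        double *wo, double *bo) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int ok = fread(wh, sizeof(double), W_HIDDEN, f) == W_HIDDEN
          && fread(bh, sizeof(double), B_HIDDEN, f) == B_HIDDEN
          && fread(wo, sizeof(double), W_OUTPUT, f) == W_OUTPUT
          && fread(bo, sizeof(double), B_OUTPUT, f) == B_OUTPUT;
    fclose(f);
    return ok ? 0 : -1;
}
```

The four counts sum to 25,450 doubles, i.e. 203,600 bytes, matching the file size above.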
| Metric | Value |
|---|---|
| Training time | ~60–135s (30 epochs, hardware dependent) |
| Train accuracy | 97.8% |
| Test accuracy | 96.5% |
| Inference latency | ~0.7ms round-trip (including HTTP) |
| Weight file size | 199 KB |
| Binary size (training) | ~27 KB |
| Binary size (server) | ~25 KB |
Licensed under the Apache License, Version 2.0. See NOTICE for details.