A compact, Arduino-compatible RISC-V RV32I softcore for the Sipeed Tang Nano 9k FPGA. Fits in 1779 LUT4s and scores 6.08 CoreMark/s at 18 MHz.
The core uses a multicycle design with three sequential phases per instruction: Fetch, Decode, and Execute. This deliberately trades throughput for a minimal LUT footprint — no pipeline registers, no forwarding logic, no branch predictor.
Cycle 1 — FETCH
┌──────────────────────────────────────────────────────┐
│ Address ◄── PC │
│ Input_Data ──► fetch register │
│ PC ◄── PC + 4 │
└──────────────────────────────────────────────────────┘
Cycle 2 — DECODE
┌──────────────────────────────────────────────────────┐
│ Instruction fields latched (rs1, rs2, rd, funct3…) │
│ Immediate reconstructed and sign-extended │
│ Shift direction/type captured (if shift op) │
└──────────────────────────────────────────────────────┘
Cycle 3+ — EXECUTE
┌──────────────────────────────────────────────────────┐
│ ALU result computed combinationally │
│ Branch condition evaluated, PC updated if taken │
│ Memory load/store address driven │
│ Register file written │
│ │
│ Stalls: │
│ Shift ops — 1 extra cycle per bit shifted │
│ Unaligned — 1 extra cycle for cross-word access │
└──────────────────────────────────────────────────────┘
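The Decode phase above can be sketched in software. This is a minimal Python illustration of the standard RV32I field positions and of I-type and S-type immediate reconstruction. The function names are hypothetical and are not taken from the HDL in hdl/cpu/.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign_bit) - sign_bit


def decode_fields(instr: int) -> dict:
    """Split a 32-bit RV32I instruction word into its fixed-position fields."""
    return {
        "opcode": instr & 0x7F,
        "rd": (instr >> 7) & 0x1F,
        "funct3": (instr >> 12) & 0x7,
        "rs1": (instr >> 15) & 0x1F,
        "rs2": (instr >> 20) & 0x1F,
        "funct7": (instr >> 25) & 0x7F,
    }


def imm_i(instr: int) -> int:
    """I-type immediate: instr[31:20], sign-extended from 12 bits."""
    return sign_extend(instr >> 20, 12)


def imm_s(instr: int) -> int:
    """S-type immediate: instr[31:25] joined with instr[11:7], sign-extended."""
    raw = ((instr >> 25) << 5) | ((instr >> 7) & 0x1F)
    return sign_extend(raw, 12)
```

For example, `0xFFF00093` is `addi x1, x0, -1`, so `decode_fields` reports `rd` as 1 and `imm_i` returns -1.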
| Instruction class | CPI |
|---|---|
| ALU (ADD, AND, OR, XOR, SLT…) | 3 |
| Load / store (aligned) | 3 |
| Branch / jump | 3 |
| Load / store (unaligned) | 4 |
| Shift by N bits (SLL, SRL, SRA) | 3 + N |
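The table can be turned into a rough cycle estimator. The sketch below is a Python illustration only: the instruction class names and the trace format are assumptions made for this example, and the numbers come straight from the table.

```python
BASE_CYCLES = 3  # Fetch + Decode + one Execute cycle


def instruction_cycles(kind: str, shift_amount: int = 0, unaligned: bool = False) -> int:
    """Return the cycle count for one instruction, following the CPI table."""
    if kind == "shift":
        if not 0 <= shift_amount <= 31:
            raise ValueError("shift amount must be in 0..31")
        return BASE_CYCLES + shift_amount  # one extra cycle per bit shifted
    if kind in ("load", "store"):
        return BASE_CYCLES + (1 if unaligned else 0)  # one extra cycle if unaligned
    if kind in ("alu", "branch", "jump"):
        return BASE_CYCLES
    raise ValueError(f"unknown instruction class: {kind}")


def average_cpi(trace) -> float:
    """Average cycles per instruction over (kind, shift_amount, unaligned) tuples."""
    total = sum(instruction_cycles(*entry) for entry in trace)
    return total / len(trace)
```

A trace of one ALU op, one unaligned load, one 5-bit shift, and one branch costs 3 + 4 + 8 + 3 = 18 cycles, for an average CPI of 4.5.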
Shifts are implemented as an iterative one-bit-per-cycle shifter rather than a barrel shifter. A 31-bit shift takes 34 cycles; a 1-bit shift takes 4. This is the primary LUT-saving tradeoff in the design — a barrel shifter would cost significantly more fabric.
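A one-bit-per-cycle shifter can be modelled as a loop that moves the value by one position per iteration. The following is a behavioural Python sketch, not a transcription of the HDL; the function name and the operation strings are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def iterative_shift(value: int, amount: int, op: str):
    """Shift a 32-bit value one bit per step; return (result, stall_cycles)."""
    if op not in ("sll", "srl", "sra"):
        raise ValueError(f"unknown shift op: {op}")
    value &= MASK32
    amount &= 0x1F                # RV32I uses only the low five bits of the amount
    sign = value & 0x80000000     # bit replicated into the top by an arithmetic shift
    stall_cycles = 0
    for _ in range(amount):
        if op == "sll":
            value = (value << 1) & MASK32
        elif op == "srl":
            value >>= 1
        else:  # "sra"
            value = (value >> 1) | sign
        stall_cycles += 1         # each iteration stands for one extra Execute cycle
    return value, stall_cycles
```

The returned stall count is the N in the 3 + N figure: the same single-bit datapath is reused on every cycle, and a barrel shifter would instead need every shift distance built in hardware at once.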
Single shared port — the same bus carries instruction fetches during Fetch phase and data during Execute phase. No cache. Block RAM timing assumed: address presented this cycle, data valid next cycle (1-cycle latency).
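The one-cycle read latency can be illustrated with a small behavioural model. This Python sketch is an assumption-laden illustration: the class `SyncBlockRam` and its `tick` method are invented here and do not correspond to any module in hdl/.

```python
class SyncBlockRam:
    """Word-addressed memory whose read data appears one cycle after the address."""

    def __init__(self, words):
        self.words = list(words)
        self.data_out = None  # nothing has been read yet

    def tick(self, address: int):
        """Model one clock cycle with `address` on the bus.

        Returns the data visible during this cycle (the word requested in the
        previous cycle), then registers the new read for the next cycle.
        """
        visible = self.data_out
        self.data_out = self.words[address]
        return visible
```

With contents `[10, 20, 30]`, presenting addresses 0, 1, 2 on successive cycles yields `None`, 10, 20: each word arrives one cycle after its address.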
Unaligned accesses (e.g. a 32-bit load from a non-word-aligned address) are handled in hardware by issuing two sequential word reads and reassembling the result — no alignment exception is raised.
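The two-read reassembly can be sketched as follows. This Python model assumes little-endian byte order (as RISC-V specifies) and represents memory as a list of 32-bit words. The function name is hypothetical and the hardware may organize the steps differently.

```python
def load_word_unaligned(mem_words, addr: int):
    """Load a 32-bit little-endian word from any byte address.

    Returns (value, word_reads), where word_reads is 1 for an aligned
    access and 2 when the access crosses a word boundary.
    """
    index, offset = divmod(addr, 4)
    low = mem_words[index]
    if offset == 0:
        return low, 1
    high = mem_words[index + 1]            # second sequential word read
    combined = (high << 32) | low          # 64-bit window over both words
    return (combined >> (8 * offset)) & 0xFFFFFFFF, 2
```

With memory words `0x44332211` and `0x88776655`, a load from byte address 2 returns `0x66554433` after two word reads.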
RV32I base integer instruction set, except FENCE, ECALL, and EBREAK, which are not implemented.
No CSRs, no interrupts, no memory protection.
The CPU is wrapped in a minimal SoC targeting the Tang Nano 9k:
| Peripheral | Base Address | Length | Notes |
|---|---|---|---|
| UART | 0x20010 | 16B | Program upload + serial console |
| GPIO | 0x20000 | 12B | Exposed on Arduino-compatible headers |
| Boot ROM | 0x00000 | 2KB | UART bootloader, generated from bare_metal/boot |
| SRAM | 0x08000 | 32KB | On-chip block RAM |
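The memory map can be expressed as a small lookup, which is handy when checking linker scripts or debugging bus accesses. This is a Python sketch built only from the table above. The function name is hypothetical, and the real address decoder in hdl/soc/ may alias or partially decode regions in ways this model does not capture.

```python
# (name, base address, length in bytes), taken from the memory map table
MEMORY_MAP = [
    ("Boot ROM", 0x00000, 2 * 1024),
    ("SRAM", 0x08000, 32 * 1024),
    ("GPIO", 0x20000, 12),
    ("UART", 0x20010, 16),
]


def decode_address(addr: int):
    """Return (region name, offset within region), or None if unmapped."""
    for name, base, length in MEMORY_MAP:
        if base <= addr < base + length:
            return name, addr - base
    return None
```

For instance, `decode_address(0x0FFFF)` returns `("SRAM", 0x7FFF)`, the last byte of SRAM, while `0x10000` falls outside every region in the table.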
| Metric | Value |
|---|---|
| CoreMark score | 6.08 CoreMark/s |
| CoreMark/MHz | 0.338 CoreMark/s/MHz |
| Clock frequency | 18 MHz |
| LUT4 usage | 1779 on Sipeed Tang Nano 9k |
| Compiler | GCC 15.1.0 -Os |
The low LUT count is a direct consequence of the multicycle architecture and iterative shifter — no pipeline registers, no forwarding paths, no barrel shifter.
Full CoreMark output
CoreMark Size : 666
Total ticks : 18083
Total time (secs): 18
Iterations/Sec : 6
Iterations : 110
Compiler version : GCC15.1.0
Compiler flags : -Os
Memory location : STACK
seedcrc : 0xE9F5
[0]crclist : 0xE714
[0]crcmatrix : 0x1FD7
[0]crcstate : 0x8E3A
[0]crcfinal : 0x134
Correct operation validated.
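The headline figures can be reproduced from the log. The Python sketch below assumes 1000 ticks per second, which is inferred from the log reporting 18083 ticks for 18 seconds rather than stated anywhere in the output. The function name is hypothetical.

```python
TOTAL_TICKS = 18083
ITERATIONS = 110
CLOCK_MHZ = 18
TICKS_PER_SECOND = 1000  # assumption inferred from 18083 ticks ~ 18 s


def coremark_scores(iterations, ticks, clock_mhz, ticks_per_second=TICKS_PER_SECOND):
    """Return (iterations per second, iterations per second per MHz)."""
    seconds = ticks / ticks_per_second
    per_second = iterations / seconds
    return per_second, per_second / clock_mhz


score, score_per_mhz = coremark_scores(ITERATIONS, TOTAL_TICKS, CLOCK_MHZ)
print(f"{score:.2f} CoreMark/s, {score_per_mhz:.3f} CoreMark/s/MHz")
```

The log's own "Iterations/Sec : 6" line is the same quantity truncated to an integer.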
- RV32I base integer instruction set (all instructions except FENCE, ECALL, and EBREAK)
- Hardware unaligned memory access (no alignment trap)
- Arduino framework support — familiar APIs on a CPU you built
- UART bootloader — no JTAG probe required
- Verilator simulation — test on your PC before touching hardware
- ISA compliance tests in tests/
rv32i/
├── hdl/
│ ├── cpu/ # Core — FSM, ALU, shifter, register file, load/store logic
│ └── soc/ # SoC wrapper, memory map, peripheral glue
├── bare_metal/
│ ├── boot/ # UART bootloader
│ └── hal/ # Hardware abstraction layer and debug utilities
├── arduino/ # Arduino core and example sketches
├── gowin/ # Gowin IDE project files
└── tests/ # RV32I ISA compliance tests and testbenches
Note: This is an educational project — no CSRs, interrupts, or memory protection. Tested on Sipeed Tang Nano 9k; other FPGAs should work with minor SoC and pin mapping changes.
| Tool | Purpose |
|---|---|
| Gowin IDE | Synthesis and place-and-route |
| openFPGALoader | FPGA programming |
| riscv64-unknown-elf-gcc | RISC-V cross compiler |
| Verilator | RTL simulation |
| GHDL | VHDL to Verilog conversion |
| Python 3 + pyserial | Bootloader ROM generation and serial upload |
Skip this step if you want to use the pre-built bootloader.
Edit bare_metal/boot/Makefile to point at your toolchain:
GCC_PATH ?= /path/to/riscv/bin
GCC_PREFIX ?= riscv64-unknown-elf-
Optionally adjust F_CPU for a different clock than 18 MHz, then build:
make -C bare_metal/boot
Generate the ROM init file:
cd hdl/soc
python3 gen.py ../../bare_metal/boot/build/bootloader.hex
Open gowin/soc.gprj in Gowin IDE and run Synthesis → Place & Route.
cd gowin/impl/pnr
openFPGALoader -b tangnano9k -f soc.fs
The board enumerates as two serial ports. If the UART stops responding after flashing, connect to /dev/ttyUSB0 first, then switch to /dev/ttyUSB1.
ln -s ./arduino <ARDUINO_SKETCHBOOK>/hardware/rv32i/rv32i
Open Arduino IDE, select the RV32I board, set clock to 18 MHz, and start writing sketches.
This started as my first processor design, written in high school in April 2021. The
original version was bare-bones — base RV32I only, no wait states, no .bss zeroing
at boot (which caused mysterious crashes for longer than I'd like to admit).
The original goal was to out-perform PicoRV32 — the core that inspired the Fetch-Decode-Execute structure. I beat it. Against PicoRV32's slowest configuration, but a win is a win.
Coming back to it later with more experience was a useful exercise in seeing how much had changed. The startup code now works correctly and the core has been through a pass of much-needed polish.
MIT — see LICENSE.