EPSILON0-dev/RV32I

RV32I CPU Softcore

A compact, Arduino-compatible RISC-V RV32I softcore for the Sipeed Tang Nano 9k FPGA. Fits in 1779 LUT4s and scores 6.08 CoreMark/s at 18 MHz.


Architecture

The core uses a multicycle design with three sequential phases per instruction: Fetch, Decode, and Execute. This deliberately trades throughput for a minimal LUT footprint — no pipeline registers, no forwarding logic, no branch predictor.

  Cycle 1 — FETCH
  ┌──────────────────────────────────────────────────────┐
  │  Address ◄── PC                                      │
  │  Input_Data ──► fetch register                       │
  │  PC ◄── PC + 4                                       │
  └──────────────────────────────────────────────────────┘

  Cycle 2 — DECODE
  ┌──────────────────────────────────────────────────────┐
  │  Instruction fields latched (rs1, rs2, rd, funct3…)  │
  │  Immediate reconstructed and sign-extended           │
  │  Shift direction/type captured (if shift op)         │
  └──────────────────────────────────────────────────────┘

  Cycle 3+ — EXECUTE
  ┌──────────────────────────────────────────────────────┐
  │  ALU result computed combinationally                 │
  │  Branch condition evaluated, PC updated if taken     │
  │  Memory load/store address driven                    │
  │  Register file written                               │
  │                                                      │
  │  Stalls:                                             │
  │    Shift ops  — 1 extra cycle per bit shifted        │
  │    Unaligned  — 1 extra cycle for cross-word access  │
  └──────────────────────────────────────────────────────┘
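The "immediate reconstructed and sign-extended" step in Decode can be illustrated with a small software model. This is only a sketch based on the standard RV32I instruction formats; the function names `sign_extend` and `decode_imm` are hypothetical and do not come from the HDL.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_imm(inst):
    """Return the sign-extended immediate of a 32-bit RV32I instruction word.

    Returns None for opcodes without an immediate (e.g. R-type).
    """
    opcode = inst & 0x7F
    if opcode in (0x13, 0x03, 0x67):      # OP-IMM, LOAD, JALR: I-type
        return sign_extend(inst >> 20, 12)
    if opcode == 0x23:                    # STORE: S-type
        imm = ((inst >> 25) << 5) | ((inst >> 7) & 0x1F)
        return sign_extend(imm, 12)
    if opcode == 0x63:                    # BRANCH: B-type (bit 0 is always 0)
        imm = ((((inst >> 31) & 0x1) << 12)
               | (((inst >> 7) & 0x1) << 11)
               | (((inst >> 25) & 0x3F) << 5)
               | (((inst >> 8) & 0xF) << 1))
        return sign_extend(imm, 13)
    if opcode in (0x37, 0x17):            # LUI, AUIPC: U-type
        return sign_extend(inst & 0xFFFFF000, 32)
    if opcode == 0x6F:                    # JAL: J-type (bit 0 is always 0)
        imm = ((((inst >> 31) & 0x1) << 20)
               | (((inst >> 12) & 0xFF) << 12)
               | (((inst >> 20) & 0x1) << 11)
               | (((inst >> 21) & 0x3FF) << 1))
        return sign_extend(imm, 21)
    return None


# addi x1, x0, -1 encodes as 0xFFF00093
print(decode_imm(0xFFF00093))  # -1
```

The scattered bit positions in the B and J formats are why the diagram says the immediate is "reconstructed" rather than simply extracted.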

Cycles per instruction

| Instruction class | CPI |
|---|---|
| ALU (ADD, AND, OR, XOR, SLT…) | 3 |
| Load / store (aligned) | 3 |
| Branch / jump | 3 |
| Load / store (unaligned) | 4 |
| Shift by N bits (SLL, SRL, SRA) | 3 + N |

Shifts are implemented as an iterative one-bit-per-cycle shifter rather than a barrel shifter. A 31-bit shift takes 34 cycles; a 1-bit shift takes 4. This is the primary LUT-saving tradeoff in the design — a barrel shifter would cost significantly more fabric.
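A minimal software model of the one-bit-per-cycle shifter, assuming the 3 + N cycle cost stated above. The name `iterative_shift` is hypothetical and the code is a behavioural illustration, not a transcription of the HDL. Masking the shift amount to 5 bits follows the RV32I definition of register shifts.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def iterative_shift(value, shamt, op):
    """Shift a 32-bit value one bit per loop iteration.

    Returns (result, total_cycles), where total_cycles is the 3-cycle
    Fetch/Decode/Execute base cost plus one cycle per bit shifted.
    """
    value &= MASK32
    n = shamt & 0x1F              # RV32I uses only the low 5 bits
    cycles = 3
    for _ in range(n):
        if op == "SLL":
            value = (value << 1) & MASK32
        elif op == "SRL":
            value >>= 1
        elif op == "SRA":
            # Arithmetic shift: replicate the sign bit into the vacated MSB
            value = (value >> 1) | (value & SIGN_BIT)
        else:
            raise ValueError(f"unknown shift op: {op}")
        cycles += 1
    return value, cycles


print(iterative_shift(1, 31, "SLL"))  # (2147483648, 34)
```

Because the cost grows with the shift distance, code that shifts by large constants in a hot loop is where this tradeoff is most visible.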

Memory interface

The core has a single shared memory port: the same bus carries instruction fetches during the Fetch phase and data during the Execute phase. There is no cache. Block RAM timing is assumed, meaning an address presented in one cycle returns valid data in the next (1-cycle latency).

Unaligned accesses (e.g. a 32-bit load from a non-word-aligned address) are handled in hardware by issuing two sequential word reads and reassembling the result — no alignment exception is raised.
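The two-read reassembly can be sketched in software as follows. This assumes little-endian byte order (as in standard RV32I) and models only 32-bit loads; the function name `load_word_unaligned` and the word-list memory are hypothetical, and the actual HDL may sequence the reads differently.

```python
def load_word_unaligned(mem_words, addr):
    """Load a 32-bit little-endian word from byte address `addr`.

    mem_words is a list of aligned 32-bit words.
    Returns (value, number_of_word_reads).
    """
    index, offset = divmod(addr, 4)
    low = mem_words[index]
    if offset == 0:
        return low, 1                     # aligned: one read is enough
    high = mem_words[index + 1]           # second read, next aligned word
    shift = 8 * offset
    # Upper bytes of the low word become the low bytes of the result;
    # lower bytes of the high word fill in the rest.
    value = ((low >> shift) | (high << (32 - shift))) & 0xFFFFFFFF
    return value, 2


mem = [0x44332211, 0x88776655]
print(hex(load_word_unaligned(mem, 1)[0]))  # 0x55443322
```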

ISA coverage

Implements the RV32I base integer instruction set except FENCE, ECALL, and EBREAK. There are no CSRs, interrupts, or memory protection.


SoC

The CPU is wrapped in a minimal SoC targeting the Tang Nano 9k:

| Peripheral | Base Address | Length | Notes |
|---|---|---|---|
| UART | 0x20010 | 16B | Program upload + serial console |
| GPIO | 0x20000 | 12B | Exposed on Arduino-compatible headers |
| Boot ROM | 0x00000 | 2KB | UART bootloader, generated from bare_metal/boot |
| SRAM | 0x08000 | 32KB | On-chip block RAM |
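The memory map above can be expressed as a small address decoder. This is a sketch for illustration only: the `MEMORY_MAP` list and `decode` function are hypothetical names, and the code assumes each region spans exactly base to base + length with nothing mapped in between.

```python
# (name, base address, length in bytes), taken from the table above
MEMORY_MAP = [
    ("Boot ROM", 0x00000, 2 * 1024),
    ("SRAM",     0x08000, 32 * 1024),
    ("GPIO",     0x20000, 12),
    ("UART",     0x20010, 16),
]


def decode(addr):
    """Return (region_name, offset_within_region), or None if unmapped."""
    for name, base, length in MEMORY_MAP:
        if base <= addr < base + length:
            return name, addr - base
    return None


print(decode(0x20010))  # ('UART', 0)
print(decode(0x0FFFF))  # ('SRAM', 32767)
```

A decoder like this is handy in a simulator or test harness for catching stray accesses to unmapped addresses.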

Performance

| Metric | Value |
|---|---|
| CoreMark score | 6.08 CoreMark/s |
| CoreMark/MHz | 0.338 CoreMark/s/MHz |
| Clock frequency | 18 MHz |
| LUT4 usage | 1779 on Sipeed Tang Nano 9k |
| Compiler | GCC 15.1.0 -Os |

The low LUT count is a direct consequence of the multicycle architecture and iterative shifter — no pipeline registers, no forwarding paths, no barrel shifter.

Full CoreMark output
CoreMark Size    : 666
Total ticks      : 18083
Total time (secs): 18
Iterations/Sec   : 6
Iterations       : 110
Compiler version : GCC15.1.0
Compiler flags   : -Os
Memory location  : STACK
seedcrc          : 0xE9F5
[0]crclist       : 0xE714
[0]crcmatrix     : 0x1FD7
[0]crcstate      : 0x8E3A
[0]crcfinal      : 0x134
Correct operation validated.
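The headline figures can be re-derived from the log as a sanity check. The sketch below assumes a tick rate of 1000 ticks per second, which is inferred from 18083 ticks being reported as 18 seconds and is not stated in the log itself.

```python
CLOCK_HZ = 18_000_000     # clock frequency from the performance table
ITERATIONS = 110          # "Iterations" in the log
TOTAL_TICKS = 18083       # "Total ticks" in the log
TICKS_PER_SEC = 1000      # assumption: inferred from ticks vs. seconds

seconds = TOTAL_TICKS / TICKS_PER_SEC
score = ITERATIONS / seconds              # CoreMark iterations per second
per_mhz = score / (CLOCK_HZ / 1e6)        # normalised to clock frequency
cycles_per_iteration = CLOCK_HZ / score   # clock cycles per iteration

print(round(score, 2))    # 6.08
print(round(per_mhz, 3))  # 0.338
```

Under the same assumption, one CoreMark iteration costs roughly 2.96 million clock cycles.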

Features

  • RV32I base integer instruction set (FENCE, ECALL, and EBREAK not implemented)
  • Hardware unaligned memory access (no alignment trap)
  • Arduino framework support — familiar APIs on a CPU you built
  • UART bootloader — no JTAG probe required
  • Verilator simulation — test on your PC before touching hardware
  • ISA compliance tests in tests/

Project structure

rv32i/
├── hdl/
│   ├── cpu/        # Core — FSM, ALU, shifter, register file, load/store logic
│   └── soc/        # SoC wrapper, memory map, peripheral glue
├── bare_metal/
│   ├── boot/       # UART bootloader
│   └── hal/        # Hardware abstraction layer and debug utilities
├── arduino/        # Arduino core and example sketches
├── gowin/          # Gowin IDE project files
└── tests/          # RV32I ISA compliance tests and testbenches

Getting started

Note: This is an educational project — no CSRs, interrupts, or memory protection. Tested on Sipeed Tang Nano 9k; other FPGAs should work with minor SoC and pin mapping changes.

Dependencies

| Tool | Purpose |
|---|---|
| Gowin IDE | Synthesis and place-and-route |
| openFPGALoader | FPGA programming |
| riscv64-unknown-elf-gcc | RISC-V cross compiler |
| Verilator | RTL simulation |
| GHDL | VHDL to Verilog conversion |
| Python 3 + pyserial | Bootloader ROM generation and serial upload |

1. Build the bootloader

Skip this step if you want to use the pre-built bootloader.

Edit bare_metal/boot/Makefile to point at your toolchain:

GCC_PATH   ?= /path/to/riscv/bin
GCC_PREFIX ?= riscv64-unknown-elf-

Optionally adjust F_CPU for a different clock than 18 MHz, then build:

make -C bare_metal/boot

Generate the ROM init file:

cd hdl/soc
python3 gen.py ../../bare_metal/boot/build/bootloader.hex

2. Synthesize

Open gowin/soc.gprj in Gowin IDE and run Synthesis → Place & Route.

3. Flash

cd gowin/impl/pnr
openFPGALoader -b tangnano9k -f soc.fs

4. Connect

The board enumerates as two serial ports. If the UART stops responding after flashing, connect to /dev/ttyUSB0 first, then switch to /dev/ttyUSB1.

5. Arduino IDE setup

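From the repository root, link the Arduino core into your sketchbook. The link target must be an absolute path, otherwise the symlink resolves relative to the sketchbook directory and dangles:

mkdir -p <ARDUINO_SKETCHBOOK>/hardware/rv32i
ln -s "$(pwd)/arduino" <ARDUINO_SKETCHBOOK>/hardware/rv32i/rv32i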

Open Arduino IDE, select the RV32I board, set clock to 18 MHz, and start writing sketches.


Background

This started as my first processor design, written in high school in April 2021. The original version was bare-bones — base RV32I only, no wait states, no .bss zeroing at boot (which caused mysterious crashes for longer than I'd like to admit).

The original goal was to out-perform PicoRV32 — the core that inspired the Fetch-Decode-Execute structure. I beat it. Against PicoRV32's slowest configuration, but a win is a win.

Coming back to it later with more experience was a useful exercise in seeing how much had changed. The startup code now works correctly and the core has been through a pass of much-needed polish.


License

MIT — see LICENSE.

About

Tiny but mighty! This RISC-V RV32I CPU softcore is designed for the Sipeed Tang Nano 9k FPGA. It has a small footprint, is Arduino-compatible, and runs almost as fast as an Intel 486DX!
