Releases: vortexgpgpu/vortex

Release v2.3

25 Apr 03:15

This release includes the following major changes and fixes since v2.2:

Microarchitecture & RTL

  • New SimX dispatcher, GPR unit, dedicated operand-collector unit (opc_unit), and split operands stage.
  • Added configurable SIMD_WIDTH to reduce operand fan-out at large thread counts.
  • Added support for multiple operand collectors.
  • PC_BITS refactoring across the pipeline to support configurable program-counter widths.
  • New cache replacement-policy module (VX_cache_repl.sv).
  • Migrated FPU dependency from fpnew to cvfpu.

Tensor Core extension (TCU)

  • New tensor unit and TCU kernel API (vx_tensor.h).
  • Added Berkeley HardFloat (BHF) tensor cores supporting fp8/fp16/bf16/tf32/fp32 multiply-add.
  • Added VHF (Vortex HardFloat) tensor regression tests covering all TCU formats.
  • Fixed the FPGA TCU DSP path and corrected int4 handling.

Debug support

  • Added RISC-V Debug Module, JTAG DTM, and OpenOCD remote-bitbang TCP server in SimX for GDB debugging.
  • Extended Debug Module to support 64-bit (XLEN=64) register and memory access.
  • Fixed MISA CSR read so ELF binaries can be loaded via GDB.
  • Added comprehensive GDB + OpenOCD debug-mode guide (docs/debug_mode.md).

SST integration

  • SimX-SST integration (no-memory variant) merged with cleanup of memory-related changes (PR #298).
  • SST CI integration: added sst_install.sh, SST regression tests (hello, vecadd, fibonacci, conform), and CI plumbing.

OpenMPI support

  • Initial OpenMPI support for SimX, with mpi_vecadd regression test (PR #282).
  • Added MPI benchmarks suite (mpi_blocked_sgemm, mpi_conv3, mpi_diverge, mpi_dotproduct, mpi_neighbor_a2a_conv3, mpi_put_dotproduct, mpi_sgemm, mpi_vecadd) covering point-to-point, collective, RMA, and Cannon's algorithm.

OpenCL & runtime

  • Added OpenCL clEnqueueCopyBuffer support with regression test (PR #310).
  • Fixed vx_spawn_threads group-offset bug when groups don't divide evenly across cores (PR #321).
  • Renamed __assert macro to __vortex_assert to fix debug builds against modern glibc/GCC 15 (PR #326).
  • Fixed BSS data race across cores by moving zeroing from _start into the host-side kernel uploader (PR #338).
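
The group-offset fix above concerns launches where the number of groups is not a multiple of the core count. As a rough illustration of that case, the sketch below shows one common way to split groups so that each core owns a contiguous range and the leftover groups go to the lowest-numbered cores. It is not the actual vx_spawn_threads implementation; `GroupRange` and `groups_for_core` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: the contiguous range of groups owned by one core.
struct GroupRange {
  uint32_t offset; // index of the first group assigned to the core
  uint32_t count;  // number of groups assigned to the core
};

// Split `num_groups` across `num_cores` so that the ranges are contiguous,
// non-overlapping, and cover every group even when the division is uneven.
GroupRange groups_for_core(uint32_t num_groups, uint32_t num_cores, uint32_t core_id) {
  assert(num_cores > 0 && core_id < num_cores);
  uint32_t base = num_groups / num_cores;      // groups every core receives
  uint32_t remainder = num_groups % num_cores; // leftover groups
  // The first `remainder` cores take one extra group each.
  uint32_t count = base + (core_id < remainder ? 1 : 0);
  // The offset accounts for the extra groups taken by lower-numbered cores.
  uint32_t offset = core_id * base + (core_id < remainder ? core_id : remainder);
  return {offset, count};
}
```

With 10 groups on 4 cores, this assigns 3, 3, 2, and 2 groups starting at offsets 0, 3, 6, and 8.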

FPGA

  • Reorganized FPGA build trees: collapsed Altera quartus/ and Xilinx test/ directories into unified dut/ trees with consistent per-block Makefiles.
  • New consolidated FPGA setup documentation (docs/fpga_setup.md) replacing older Altera/Xilinx guides.
  • Fixed AXI burst mode (Fixed → Incr) for Vivado SmartConnect compatibility (PR #297).
  • Fixed Xilinx U50 platform configuration (PLATFORM_MERGED_MEMORY_INTERFACE) so demo/sgemm/vecadd pass on hardware (PR #330).
  • Fixed spurious endif in Xilinx XRT Makefile breaking Rocky Linux 9.2 / XRT 2023.2 builds (PR #272).

SimX fixes & infrastructure

  • SimX source tree restructured into submodule-friendly subdirectories (sst/, tcu/, vpu/, dtm/).
  • SimObject infrastructure: added immediate-event support to the SimX scheduler.
  • Fixed CSR read bug in SimX.
  • Fixed CTA dispatch bug and added a dedicated cta regression test.
  • Fixed SRAI instruction decode bug in SimX, added arith regression test (PR #320).
  • Fixed local-memory address aliasing in SimX where capacities >2KB silently overwrote lower addresses (PR #327).
  • Fixed RAM::copy memmove direction and added device-match check for copy_dev_to_dev.
  • Fixed io_addr regression-test memory-access violation around vx_perf_dump permissions (PR #267).
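
For readers unfamiliar with why copy direction matters in the RAM::copy fix above: when source and destination ranges overlap, copying in the wrong direction overwrites source bytes before they are read. The sketch below is a generic illustration of direction-aware copying, not the SimX code; `copy_overlapping` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

// Copy `size` bytes within one buffer, choosing the direction so that
// overlapping source bytes are read before they are overwritten.
void copy_overlapping(uint8_t* data, size_t dst, size_t src, size_t size) {
  if (dst < src) {
    // Destination is below the source: copy forward (low to high).
    for (size_t i = 0; i < size; ++i) {
      data[dst + i] = data[src + i];
    }
  } else if (dst > src) {
    // Destination is above the source: copy backward (high to low).
    for (size_t i = size; i > 0; --i) {
      data[dst + i - 1] = data[src + i - 1];
    }
  }
  // dst == src: nothing to do.
}
```

In production code, `std::memmove` already provides these semantics; the explicit loops only make the direction choice visible.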

CI, toolchain, & containers

  • Added Ubuntu 24.04 to the CI test matrix.
  • Updated Verilator install to 5.046 (Ubuntu Focal toolchain) (PR #328).
  • Apptainer container updates: added Boost libraries, environment modules, and updated documentation (PR #276).
  • Initial Apptainer-based CI pipeline (PR #289).

Tests & documentation

  • New regression tests: dotproduct, dropout, matmul, relu, sgemm2, sgemv, vecadd, madmax, plus an OpenCL BFS test.
  • Added cache-size configuration documentation and examples to README (PR #324).

Release v2.2

06 Aug 21:27

This release includes the following major changes and fixes:

  • New vx_spawn_threads kernel launch API supporting 3D task-partitioning.
  • The ../configure script can now be run without parameters to update the build directory during development.
  • Support for the ZICOND RISC-V extension for branchless conditionals.
  • OpenCL compiler migration from warp-level to thread-level scheduling.
  • Support for OpenCL's just-in-time compilation.
  • Support for OpenCL's 64-bit kernel.
  • Support for dynamic loading of driver-specific implementations in the Vortex runtime, which simplifies linking for Vortex applications.
  • Updated README instructions.
  • New Xilinx FPGA setup documentation.
  • Enabled full logic synthesis test using Yosys.
  • Added cache support for hierarchical flush.
  • Added cache support for write-back mode with configurable dirty bytes.
  • RTL scoreboard and operand speed optimization.
  • Support for Ramulator 2.0 with HBM memory configuration.
  • Migration to Verilator 5.0.
  • Migration to LLVM 18.0.
  • New Stencil3D regression test.
  • Fixed Xilinx FPGA synthesis for cores with more than 256 threads.
  • Updated CentOS 7.9 toolchain.
  • Migration from Travis CI to GitHub CI workflow.
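
As background on the ZICOND item above: the extension defines two conditional-zero instructions, czero.eqz and czero.nez, which compilers can combine into a branch-free select. The sketch below models their semantics in software for 32-bit operands. The function names are illustrative and the code is not taken from the Vortex sources.

```cpp
#include <cstdint>

// czero.eqz rd, rs1, rs2: rd = (rs2 == 0) ? 0 : rs1
uint32_t czero_eqz(uint32_t rs1, uint32_t rs2) {
  return rs2 == 0 ? 0 : rs1;
}

// czero.nez rd, rs1, rs2: rd = (rs2 != 0) ? 0 : rs1
uint32_t czero_nez(uint32_t rs1, uint32_t rs2) {
  return rs2 != 0 ? 0 : rs1;
}

// Equivalent of `cond ? a : b`, composed from two conditional-zero
// operations and an OR. Exactly one operand survives, so the OR picks it.
uint32_t branchless_select(uint32_t cond, uint32_t a, uint32_t b) {
  return czero_eqz(a, cond) | czero_nez(b, cond);
}
```

On a SIMT machine this pattern matters because a select expressed this way does not diverge threads the way a conditional branch can.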

Release v2.1

14 May 07:02

This release includes the following major changes and fixes:

  • new build configuration script to isolate the sources from the build directory
  • added spawn_taskgroups kernel API for running kernels that use local memory and barriers (see tests/regression/sgemm2x)
  • new runtime extension for relocatable kernel binary and arguments.
  • new runtime memory API additions: vx_mem_reserve, vx_mem_access, vx_mem_address
  • new runtime vx_check_occupancy API
  • added GPU driver option to test OpenCL tests on local GPU (e.g. blackbox.sh --driver=gpu --app=sgemm)
  • added OpenCL tests that use local memory (psum, sgemm2, sgemm3)
  • added vortex custom libc and librt libraries with control divergence instrumentation
  • added memory coalescing support
  • reduced CSR instructions pipeline stalls
  • optimized split/join h/w area overhead with new split_n, pred_n inverted predicate instructions.
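
To illustrate the general idea behind the memory coalescing item above: per-thread accesses that fall within the same memory line can be served by a single request. The sketch below is a software model of that idea only. The 64-byte line size and the `coalesce` function are assumptions made for illustration and do not describe the Vortex RTL.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Assumed line size in bytes; illustrative, not a Vortex configuration value.
constexpr uint64_t kLineSize = 64;

// Return the line-aligned addresses needed to serve all per-thread accesses.
// Threads that touch the same line share a single memory request.
std::vector<uint64_t> coalesce(const std::vector<uint64_t>& thread_addrs) {
  std::set<uint64_t> lines;
  for (uint64_t addr : thread_addrs) {
    lines.insert(addr & ~(kLineSize - 1)); // drop the offset within the line
  }
  return std::vector<uint64_t>(lines.begin(), lines.end());
}
```

Four threads reading consecutive 4-byte words produce one request under this model, while four threads striding by 64 bytes still produce four.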

Release v2.x

10 Nov 11:04

Merge branch 'develop'

Release v1.x

21 Oct 03:31

minor update

Release v0.2.3

28 Jul 03:40

External Interface Refactoring for Third Party Integration

This release includes major changes to Vortex’s external interface that simplify integration with third-party designs. These changes include (1) memory-mapped CSRs and (2) removal of the ebreak signal. To support memory-mapped CSRs, we first had to add support for non-cacheable memory, so that CSR write requests from the kernel bypass the cache subsystem and go directly to memory. Details about the individual features are described below.

New Features

  • Non-Cacheable Memory

A new module, VX_nc_bypass, was added to the cache top module to detect requests to I/O memory regions (defined in the configuration file VX_config.vh) and redirect them to memory, bypassing normal caching. This was implemented by extending the cache request tag interface with an I/O bypass flag that is computed inside the Load/Store Unit based on the address range. VX_nc_bypass handles bypassing in both directions for I/O addresses: core requests to memory and memory responses back to the core.
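
The address-range check described above can be modeled in a few lines. This is a behavioral sketch, not the RTL: the region bounds `kIoBase` and `kIoSize` are made-up values (the real ones are defined in VX_config.vh), and `is_io_addr` and `route_request` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical I/O region bounds; the real values live in VX_config.vh.
constexpr uint32_t kIoBase = 0xFF000000u;
constexpr uint32_t kIoSize = 0x00001000u;

// Bypass flag: true when the address falls inside the I/O region.
constexpr bool is_io_addr(uint32_t addr) {
  return addr >= kIoBase && (addr - kIoBase) < kIoSize;
}

enum class Route { Cache, Memory };

// Requests flagged as I/O skip the cache and go straight to memory;
// everything else takes the normal cached path.
constexpr Route route_request(uint32_t addr) {
  return is_io_addr(addr) ? Route::Memory : Route::Cache;
}
```

Computing the flag once, at the point where the address is known, means the cache only has to inspect a single tag bit rather than repeat the range comparison.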

  • Memory Mapped CSRs

Vortex’s original external interface had CSR request/response ports that allowed the host processor to read the contents of the CSRs. This interface was mainly used for gathering performance counters. This feature removes that external interface and instead implements performance-counter support via memory-mapped I/O. More specifically, we reserved a memory region for storing the performance counters and added a new stage to the application exit routine that dumps the counters to memory. The host application now reads the performance counters from a dedicated memory region instead of using a dedicated I/O bus.
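
The dump-then-read flow can be sketched as follows, using a byte vector to stand in for device memory. The offset `kPerfBase`, the counter count, and both function names are hypothetical and chosen for illustration; they do not reflect the actual Vortex memory map or runtime API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical layout: counters are stored as consecutive 64-bit words
// starting at a reserved offset in device memory.
constexpr size_t kPerfBase = 0x100;   // assumed offset, illustrative only
constexpr size_t kNumCounters = 2;    // e.g. cycles and instructions

// Device side: the exit routine dumps its counters into the reserved region.
void dump_counters(std::vector<uint8_t>& dev_mem,
                   const uint64_t (&counters)[kNumCounters]) {
  std::memcpy(dev_mem.data() + kPerfBase, counters, sizeof(counters));
}

// Host side: read one counter back from the same region.
uint64_t read_counter(const std::vector<uint8_t>& dev_mem, size_t index) {
  uint64_t value = 0;
  std::memcpy(&value, dev_mem.data() + kPerfBase + index * sizeof(uint64_t),
              sizeof(value));
  return value;
}
```

Because the host only performs ordinary memory reads, no dedicated CSR port is needed on the processor's external interface.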

  • Multi-Bank Memory Support

The original Vortex implementation used a single memory bank to handle all memory transactions. This feature extends the command processor (AFU) module to expose the memory banks to the Vortex processor. Our current FPGA devices are Intel Arria 10 and Stratix 10, which support 2 and 8 memory channels respectively.

  • OpenCL Debug Printf

This feature takes advantage of the new non-cacheable memory feature to support a debug printf interface for OpenCL applications. Most of the changes related to this feature were implemented in our POCL codebase (https://github.com/vortexgpgpu/pocl).

  • Memory Fence Support

This feature adds support for the RISC-V data fence extension. The work was completed last semester in our private repository and has now been ported to the public repository.

Changes & Improvements

  • Documentation

    • The public repository now includes a doc folder where we have the current documentation for the processor.
  • ebreak external interface cleanup

    • The Vortex public interface used to have an ebreak signal that was used in simulation to trap the returned exitcode of RISC-V unit tests. This change removes the signal from the external interface and instead uses an internal debug interface to retrieve the exitcode.
  • New regression tests

    • io_addr: non-cacheable memory test
    • diverge: branch divergence test
    • fence: fence feature test
    • mstress: memory stress test
    • printf: OpenCL printf test
    • sort: parallel sort benchmark
  • Tests folders reorganization

    • We reorganized all Vortex tests into one test location which includes OpenCL benchmark, driver tests, runtime tests.
  • Regression tests migration to travis.com

    • Vortex was using travis.org for the continuous integration tests, but that service was discontinued last month. This task migrates our regression tests to the new service, travis.com.

Bug Fixes

  • Shared Memory Bug
    • This was a synchronization bug in the dcache/shared memory arbiter.