Releases: vortexgpgpu/vortex
Release v2.3
This release includes the following major changes and fixes since v2.2:
Microarchitecture & RTL
- New SimX dispatcher, GPR unit, dedicated operand-collector unit (opc_unit), and split operands stage.
- Added configurable SIMD_WIDTH to reduce operand fan-out at large thread counts.
- Added support for multiple operand collectors.
- PC_BITS refactoring across the pipeline to support configurable program-counter widths.
- New cache replacement-policy module (VX_cache_repl.sv).
- Migrated FPU dependency from fpnew to cvfpu.
Tensor Core extension (TCU)
- New tensor unit and TCU kernel API (vx_tensor.h).
- Added Berkeley HardFloat (BHF) tensor cores supporting fp8/fp16/bf16/tf32/fp32 multiply-add.
- Added VHF (Vortex HardFloat) tensor regression tests covering all TCU formats.
- Fixed the FPGA TCU DSP path and corrected int4 handling.
Debug support
- Added RISC-V Debug Module, JTAG DTM, and OpenOCD remote-bitbang TCP server in SimX for GDB debugging.
- Extended Debug Module to support 64-bit (XLEN=64) register and memory access.
- Fixed MISA CSR read so ELF binaries can be loaded via GDB.
- Added comprehensive GDB + OpenOCD debug-mode guide (docs/debug_mode.md).
SST integration
- SimX-SST integration (no-memory variant) merged with cleanup of memory-related changes (PR #298).
- SST CI integration: added sst_install.sh, SST regression tests (hello, vecadd, fibonacci, conform), and CI plumbing.
OpenMPI support
- Initial OpenMPI support for SimX, with mpi_vecadd regression test (PR #282).
- Added MPI benchmarks suite (mpi_blocked_sgemm, mpi_conv3, mpi_diverge, mpi_dotproduct, mpi_neighbor_a2a_conv3, mpi_put_dotproduct, mpi_sgemm, mpi_vecadd) covering point-to-point, collective, RMA, and Cannon's algorithm.
OpenCL & runtime
- Added OpenCL clEnqueueCopyBuffer support with regression test (PR #310).
- Fixed vx_spawn_threads group-offset bug when groups don't divide evenly across cores (PR #321).
- Renamed __assert macro to __vortex_assert to fix debug builds against modern glibc/GCC 15 (PR #326).
- Fixed BSS data race across cores by moving zeroing from _start into the host-side kernel uploader (PR #338).
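The group-offset fix concerns the case where the number of groups is not a multiple of the number of cores. The following is a minimal sketch of one common way to split work in that situation, not the actual vx_spawn_threads implementation: cores whose id is below the remainder take one extra group, and each core's starting offset accounts for the extras handed out before it. The names GroupRange and partition_groups are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of the Vortex runtime.
struct GroupRange {
  uint32_t offset; // index of the first group assigned to this core
  uint32_t count;  // number of groups assigned to this core
};

// Split num_groups across num_cores as evenly as possible.
// Cores with core_id < remainder receive one extra group.
GroupRange partition_groups(uint32_t num_groups, uint32_t num_cores, uint32_t core_id) {
  assert(num_cores > 0 && core_id < num_cores);
  uint32_t base = num_groups / num_cores;
  uint32_t rem  = num_groups % num_cores;
  uint32_t count = base + (core_id < rem ? 1 : 0);
  // Each earlier core contributed `base` groups, plus one extra for every
  // earlier core whose id is below the remainder.
  uint32_t offset = core_id * base + (core_id < rem ? core_id : rem);
  return {offset, count};
}
```

With 10 groups on 4 cores, this assigns 3, 3, 2, and 2 groups at offsets 0, 3, 6, and 8, so no group is skipped or processed twice.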
FPGA
- Reorganized FPGA build trees: collapsed Altera quartus/ and Xilinx test/ directories into unified dut/ trees with consistent per-block Makefiles.
- New consolidated FPGA setup documentation (docs/fpga_setup.md) replacing older Altera/Xilinx guides.
- Fixed AXI burst mode (Fixed → Incr) for Vivado SmartConnect compatibility (PR #297).
- Fixed Xilinx U50 platform configuration (PLATFORM_MERGED_MEMORY_INTERFACE) so demo/sgemm/vecadd pass on hardware (PR #330).
- Fixed spurious endif in Xilinx XRT Makefile breaking Rocky Linux 9.2 / XRT 2023.2 builds (PR #272).
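For context on the AXI burst-mode fix: in the AXI protocol, the AxBURST field selects how the address advances between beats of a burst. The sketch below models only the FIXED and INCR types for a start address aligned to the beat size. It is an illustration of the protocol rule, not the Vortex RTL, and the function name beat_address is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// AxBURST encodings defined by the AXI protocol.
enum class AxiBurst : uint8_t { FIXED = 0, INCR = 1, WRAP = 2 };

// Address of beat `beat` (0-based) within a burst.
// size_log2 corresponds to AxSIZE: bytes per beat = 1 << size_log2.
// Assumes the start address is aligned to the beat size.
uint64_t beat_address(AxiBurst burst, uint64_t start, uint32_t size_log2, uint32_t beat) {
  assert(burst != AxiBurst::WRAP); // WRAP bursts are omitted from this sketch
  uint64_t bytes = uint64_t{1} << size_log2;
  assert(start % bytes == 0);
  // FIXED repeats the same address on every beat; INCR steps by the beat size.
  return burst == AxiBurst::FIXED ? start : start + beat * bytes;
}
```

A FIXED burst targets the same address on every beat, which suits FIFO-style peripherals, while an INCR burst walks through consecutive memory locations.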
SimX fixes & infrastructure
- SimX source tree restructured into submodule-friendly subdirectories (sst/, tcu/, vpu/, dtm/).
- SimObject infrastructure: added immediate-event support to the SimX scheduler.
- Fixed CSR read bug in SimX.
- Fixed CTA dispatch bug and added a dedicated cta regression test.
- Fixed SRAI instruction decode bug in SimX, added arith regression test (PR #320).
- Fixed local-memory address aliasing in SimX where capacities >2KB silently overwrote lower addresses (PR #327).
- Fixed RAM::copy memmove direction and added device-match check for copy_dev_to_dev.
- Fixed io_addr regression-test memory-access violation around vx_perf_dump permissions (PR #267).
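The RAM::copy fix relates to copy direction when source and destination ranges overlap. The sketch below shows the general memmove rule, assuming a single flat byte buffer: copy forward when the destination is below the source, and backward when it is above. It is not the SimX code, and copy_within is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copy `size` bytes from offset `src` to offset `dst` inside one buffer,
// choosing the direction so an overlapping source is never overwritten
// before it has been read.
void copy_within(std::vector<uint8_t>& mem, size_t dst, size_t src, size_t size) {
  assert(src + size <= mem.size() && dst + size <= mem.size());
  if (dst < src) {
    // Destination is lower: copy front to back.
    for (size_t i = 0; i < size; ++i) mem[dst + i] = mem[src + i];
  } else if (dst > src) {
    // Destination is higher: copy back to front.
    for (size_t i = size; i-- > 0;) mem[dst + i] = mem[src + i];
  }
  // dst == src: nothing to do.
}
```

Copying in the wrong direction for an overlapping range would propagate already-overwritten bytes instead of the original data.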
CI, toolchain, & containers
- Added Ubuntu 24.04 to the CI test matrix.
- Updated Verilator install to 5.046 (Ubuntu Focal toolchain) (PR #328).
- Apptainer container updates: added Boost libraries, environment modules, and updated documentation (PR #276).
- Initial Apptainer-based CI pipeline (PR #289).
Tests & documentation
- New regression tests: dotproduct, dropout, matmul, relu, sgemm2, sgemv, vecadd, madmax, plus an OpenCL BFS test.
- Added cache-size configuration documentation and examples to README (PR #324).
Release v2.2
This release includes the following major changes and fixes:
- New vx_spawn_threads kernel launch API supporting 3D task-partitioning.
- Running the ../configure script without parameters now updates the build directory during development.
- Support for the ZICOND RISC-V extension for branchless conditionals.
- OpenCL compiler migration from warp-level to thread-level scheduling.
- Support for OpenCL's just-in-time compilation.
- Support for OpenCL's 64-bit kernel.
- Support for dynamic loading of driver-specific implementations in the Vortex runtime, which simplifies linking for Vortex applications.
- Updated README instructions.
- New Xilinx FPGA setup documentation.
- Enabled full logic synthesis test using Yosys.
- Added cache support for hierarchical flush.
- Added cache support for write-back mode with configurable dirty bytes.
- RTL scoreboard and operand speed optimization.
- Support for Ramulator 2.0 with HBM memory configuration.
- Migration to Verilator 5.0.
- Migration to LLVM 18.0.
- New Stencil3D regression test.
- Fixed Xilinx FPGA synthesis for cores with more than 256 threads.
- Updated CentOS 7.9 toolchain.
- Migration from Travis CI to GitHub CI workflow.
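To illustrate what the ZICOND extension provides: it defines two instructions, czero.eqz and czero.nez, that conditionally zero a register. The sketch below models their semantics in C++ and shows how a pair of them composes into a select without a branch. The C++ functions are only a software model (the ternaries stand in for single hardware instructions), and zicond_select is a hypothetical helper name.

```cpp
#include <cstdint>

// czero.eqz rd, rs1, rs2: rd = (rs2 == 0) ? 0 : rs1
constexpr uint32_t czero_eqz(uint32_t rs1, uint32_t rs2) {
  return rs2 == 0 ? 0 : rs1;
}

// czero.nez rd, rs1, rs2: rd = (rs2 != 0) ? 0 : rs1
constexpr uint32_t czero_nez(uint32_t rs1, uint32_t rs2) {
  return rs2 != 0 ? 0 : rs1;
}

// Select built from the two instructions: cond ? a : b.
// Exactly one operand survives the zeroing, so OR-ing the results picks it.
constexpr uint32_t zicond_select(uint32_t cond, uint32_t a, uint32_t b) {
  return czero_eqz(a, cond) | czero_nez(b, cond);
}
```

On a SIMT machine, replacing a short conditional with this pattern avoids a divergent branch for simple selects.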
Release v2.1
This release includes the following major changes and fixes:
- new build configuration script to isolate the sources from the build directory
- added spawn_taskgroups kernel API for running kernels that use local memory and barriers (see tests/regression/sgemm2x)
- new runtime extension for relocatable kernel binary and arguments.
- new runtime memory API additions: vx_mem_reserve, vx_mem_access, vx_mem_address
- new runtime vx_check_occupancy API
- added GPU driver option to test OpenCL tests on local GPU (e.g. blackbox.sh --driver=gpu --app=sgemm)
- added OpenCL tests that use local memory (psum, sgemm2, sgemm3)
- added vortex custom libc and librt libraries with control divergence instrumentation
- added memory coalescing support
- reduced CSR instructions pipeline stalls
- optimized split/join h/w area overhead with new split_n, pred_n inverted predicate instructions.
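As background for the memory coalescing item, the general idea is to merge per-thread accesses that fall in the same memory line into a single request. The sketch below is a generic software model under the assumption that requests are merged at cache-line granularity. It does not describe the Vortex RTL, and coalesce is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Reduce a list of per-thread byte addresses to the sorted set of distinct
// line-aligned addresses that must actually be requested from memory.
std::vector<uint64_t> coalesce(const std::vector<uint64_t>& thread_addrs, uint64_t line_size) {
  // line_size must be a non-zero power of two for the mask below to work.
  assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
  std::vector<uint64_t> lines;
  lines.reserve(thread_addrs.size());
  for (uint64_t addr : thread_addrs) {
    lines.push_back(addr & ~(line_size - 1)); // round down to the line base
  }
  std::sort(lines.begin(), lines.end());
  lines.erase(std::unique(lines.begin(), lines.end()), lines.end());
  return lines;
}
```

Four threads reading consecutive 4-byte words from the same 64-byte line collapse into one request, while strided accesses that cross lines produce one request per line touched.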
Release v2.x
Merge branch 'develop'
Release v1.x
minor update
Release v0.2.3
External Interface Refactoring for Third Party Integration
This release includes major changes to Vortex's external interface that simplify integration with third-party designs. These changes include (1) memory-mapped CSRs and (2) removal of the ebreak signal. To support memory-mapped CSRs, we first had to add support for non-cacheable memory so that CSR write requests from the kernel bypass the cache subsystem and go directly to memory. Details about individual features are described below.
New Features
- Non-Cacheable Memory
A new module, VX_nc_bypass, was added to the cache top module to detect requests to I/O memory regions (defined in the configuration file VX_config.vh) and redirect those requests to memory, bypassing the normal caching operation. This was implemented by extending the cache request tag interface with an I/O bypass flag that is computed inside the Load/Store Unit based on the address range. VX_nc_bypass manages both core-request-to-memory bypassing and memory-response-to-core bypassing for I/O addresses.
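The bypass decision described above can be modeled in software roughly as follows. This is a minimal sketch, not the RTL: the region bounds IO_BASE_ADDR and IO_END_ADDR are placeholder values (the real ones are defined in VX_config.vh), and MemRequest, make_request, and route_request are hypothetical names.

```cpp
#include <cstdint>

// Placeholder I/O region bounds for illustration only; the actual region
// is defined in VX_config.vh.
constexpr uint64_t IO_BASE_ADDR = 0xFF000000ull;
constexpr uint64_t IO_END_ADDR  = 0xFF100000ull; // exclusive

struct MemRequest {
  uint64_t addr;
  bool is_write;
  bool io_bypass; // carried alongside the request tag
};

// Load/Store Unit side: compute the bypass flag from the address range.
MemRequest make_request(uint64_t addr, bool is_write) {
  bool bypass = (addr >= IO_BASE_ADDR) && (addr < IO_END_ADDR);
  return {addr, is_write, bypass};
}

enum class Route { Cache, Memory };

// Cache top-level side: flagged requests skip the cache and go to memory.
Route route_request(const MemRequest& req) {
  return req.io_bypass ? Route::Memory : Route::Cache;
}
```

Computing the flag once at the Load/Store Unit and carrying it with the request means the cache does not need to repeat the address comparison.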
- Memory Mapped CSRs
Vortex's original external interface had CSR request/response ports that allowed the host processor to read the contents of the CSR registers. This interface was mainly used for gathering performance counters. This feature removes that external interface from Vortex and instead implements performance-counter support via memory-mapped I/O. More specifically, we reserved a memory region for storing the performance counters and added a new stage to the application exit routine that dumps the performance counters to memory. The host application now reads the performance counters from a dedicated memory region instead of using a dedicated I/O bus.
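The dump-then-read flow described above can be sketched as follows, with device memory modeled as a plain byte buffer. The counter layout, the PERF_REGION_OFFSET value, and the function names are all hypothetical and chosen only for illustration. They do not reflect the actual Vortex memory map.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical counter layout and reserved-region offset.
struct PerfCounters {
  uint64_t cycles;
  uint64_t instrs;
};
constexpr size_t PERF_REGION_OFFSET = 0x100;

// Device side: the exit routine writes the counters into the reserved region.
void dump_perf_counters(std::vector<uint8_t>& dev_mem, const PerfCounters& c) {
  assert(PERF_REGION_OFFSET + sizeof(c) <= dev_mem.size());
  std::memcpy(dev_mem.data() + PERF_REGION_OFFSET, &c, sizeof(c));
}

// Host side: read the counters back from the same region after the kernel exits.
PerfCounters read_perf_counters(const std::vector<uint8_t>& dev_mem) {
  assert(PERF_REGION_OFFSET + sizeof(PerfCounters) <= dev_mem.size());
  PerfCounters c{};
  std::memcpy(&c, dev_mem.data() + PERF_REGION_OFFSET, sizeof(c));
  return c;
}
```

Because the counters travel through ordinary memory, the host needs no dedicated CSR port, only the ability to read device memory.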
- Multi-Bank Memory Support
The original Vortex implementation used a single memory bank to handle all memory transactions. This feature extends the command processor (AFU) module to expose the memory banks to the Vortex processor. Our current FPGA devices include Intel Arria 10 and Stratix 10, which support 2 and 8 memory channels, respectively.
- OpenCL Debug Printf
This feature takes advantage of the new non-cacheable memory feature to support a debug printf interface for OpenCL applications. Most of the changes related to this feature were implemented in our POCL codebase (https://github.com/vortexgpgpu/pocl).
- Memory Fence Support
This feature adds support for the RISC-V data fence extension. The work was completed last semester in our private repository and has now been ported to the public repository.
Changes & Improvements
- Documentation
  - The public repository now includes a doc folder that holds the current documentation for the processor.
- ebreak external interface cleanup
  - The Vortex public interface used to have an ebreak signal that was used in simulation to trap the returned exit code of RISC-V unit tests. This change removes the signal from the external interface and instead uses an internal debug interface to retrieve the exit code.
- New regression tests
  - io_addr: non-cacheable memory test
  - diverge: branch divergence test
  - fence: fence feature test
  - mstress: memory stress test
  - printf: OpenCL printf test
  - sort: parallel sort benchmark
- Tests folders reorganization
  - We reorganized all Vortex tests into one location, which includes the OpenCL benchmarks, driver tests, and runtime tests.
- Regression Tests Migration to travis.com
  - Vortex was using travis.org for its continuous integration tests, but that service was discontinued last month. This task migrates our regression tests to the new service, travis.com.
Bug Fixes
- Shared Memory Bug
  - This was a synchronization bug in the dcache/shared memory arbiter.