
perf: HNSW optimization sweep — batch transactions, search I/O, SIMD, and more #294

Closed

aepod wants to merge 12 commits into ruvnet:main from weave-logic-ai:perf/optimization-sweep

Conversation

aepod commented Mar 24, 2026

Performance Optimization Sweep

This is an ongoing PR that will accumulate multiple performance optimizations identified by a 4-agent audit of ruvector's HNSW implementation. Each optimization is investigated individually before implementation to avoid hidden regressions in this complex codebase.

Context

After running real benchmarks against hnswlib (C++), we identified several areas where ruvector-core is leaving significant performance on the table — not because the HNSW algorithm is wrong, but because of wrapper overhead, I/O in hot paths, and unused optimizations.

Current baseline (from PR #293):

| Scale | ruvector QPS | hnswlib QPS | Gap  | ruvector Build | hnswlib Build |
|-------|--------------|-------------|------|----------------|---------------|
| 10K   | 443          | 1,153       | 2.6x | 44.0s          | 7.5s          |
| 100K  | 86           | 250         | 2.9x | 855.6s         | 395.3s        |

Root Causes Identified

A team of 4 specialized performance agents audited the codebase in parallel:

  1. Search path expert — Found that VectorDB::search() triggers a full REDB read transaction per result (10 results = 10 DB roundtrips). hnswlib does zero I/O during search.
  2. Build path expert — Found that the benchmark uses individual db.insert() calls (1 REDB transaction per vector = 10K fsyncs at 10K scale). Also found that parallel_insert_slice() exists in hnsw_rs but isn't used.
  3. Memory expert — Found triple vector storage (hnsw_rs + DashMap + REDB), Arc-heavy graph structure (~40 bytes per neighbor link vs ~8 needed), and 16-layer pre-allocation per point.
  4. SIMD/distance expert — Found that hand-written NEON SIMD intrinsics exist in simd_intrinsics.rs but are NOT used in the HNSW search hot loop. Instead, every distance call crosses the FFI boundary to SimSIMD's C library, preventing inlining.

Optimization Plan (10 items, 3 phases)

Each item goes through: Investigate → Document → Implement → Test → Commit

Phase A — Quick Wins (this PR, in progress)

| ID    | Fix                                                    | Expected Impact           | Status        |
|-------|--------------------------------------------------------|---------------------------|---------------|
| OPT-4 | Batch REDB transactions in benchmark                   | Build 1.5-3x faster (10K) | Done          |
| OPT-1 | Remove storage.get() from search path                  | QPS +30-50%               | Investigating |
| OPT-2 | Use native Rust NEON intrinsics instead of SimSIMD FFI | QPS +15-30%               | Investigating |
| OPT-8 | Add prefetching to search_layer                        | QPS +10-20%               | Not started   |
| OPT-9 | Set REDB cache size limit                              | Memory -20-40MB           | Not started   |

Phase B — Medium Effort (future commits)

| ID    | Fix                                                | Expected Impact   | Status      |
|-------|----------------------------------------------------|-------------------|-------------|
| OPT-5 | Remove/restructure outer RwLock on HnswInner       | Enables OPT-3     | Not started |
| OPT-3 | Use parallel_insert_slice() for batch build        | Build 2-4x faster | Not started |
| OPT-7 | Replace HashMap visited set with bitset in hnsw_rs | QPS +10-15%       | Not started |

Phase C — Structural Changes (future PR)

| ID     | Fix                                          | Expected Impact           | Status      |
|--------|----------------------------------------------|---------------------------|-------------|
| OPT-6  | Refactor HnswInner.vectors DashMap           | Memory -65MB              | Not started |
| OPT-10 | Replace Arc<PointWithOrder> with flat arrays | Memory -150MB, QPS +20-30% | Not started |

Commits So Far

OPT-4: Batch REDB transactions in benchmark ✅

What changed: crates/ruvector-core/tests/bench_hnsw.rs

Before: The benchmark called db.insert(entry) in a loop for each vector. Each call opened a separate REDB write transaction → serialized the vector → wrote to B-tree → committed with fsync. At 10K vectors = 10,000 separate transactions.

```rust
// OLD — 10K individual transactions
for (i, vec) in data.iter().enumerate() {
    let entry = VectorEntry { id: Some(format!("v{}", i)), vector: vec.clone(), metadata: None };
    db.insert(entry).expect("Insert failed");
}
```

After: Uses db.insert_batch() which wraps all inserts in a single REDB write transaction with one commit.

```rust
// NEW — 1 transaction for all vectors
let entries: Vec<VectorEntry> = data.iter().enumerate()
    .map(|(i, vec)| VectorEntry { id: Some(format!("v{}", i)), vector: vec.clone(), metadata: None })
    .collect();
db.insert_batch(entries).expect("Insert batch failed");
```

Why this matters: The old benchmark was measuring REDB transaction overhead, not HNSW build performance. Production code (ruvector-server, ruvector-cli, ruvector-node) already uses insert_batch(). This change aligns the benchmark with real-world usage patterns.

No library code changed — this is a benchmark-only fix.

Investigation notes: We verified that storage.insert_batch() (at storage.rs:168-207) uses a single begin_write() / commit() pair. VectorDB::insert_batch() correctly delegates to it. The rvlite SQL path uses individual inserts by design (one INSERT = one vector), which is correct for that use case.

Target Metrics After Full Optimization

| Metric        | Current | Phase A Target | Full Target |
|---------------|---------|----------------|-------------|
| QPS (10K)     | 443     | ~800           | ~1,000+     |
| QPS (100K)    | 86      | ~150           | ~200+       |
| Build (10K)   | 44s     | ~15s           | ~5-8s       |
| Build (100K)  | 856s    | ~500s          | ~150-250s   |
| Memory (100K) | 523MB   | ~460MB         | ~250MB      |

How to Verify

```bash
# Run the benchmark (requires Rust toolchain)
cargo test -p ruvector-core --test bench_hnsw --release -- --nocapture
```

Compare build time before and after. QPS and recall should be unchanged since this only affects the insert path.

Test plan

  • cargo test -p ruvector-core --test bench_hnsw --release passes
  • Build time improves (expected 1.5-3x at 10K)
  • QPS unchanged (search path not modified)
  • Recall unchanged (HNSW parameters unchanged)
  • All existing tests still pass: cargo test -p ruvector-core

🤖 Generated with claude-flow

aepod and others added 9 commits March 24, 2026 12:34
The execute_match() function previously collapsed all match results into
a single ExecutionContext via context.bind(), which overwrote previous
bindings. MATCH (n:Person) on 3 Person nodes returned only 1 row.

This commit refactors the executor to use a ResultSet pipeline:
- type ResultSet = Vec<ExecutionContext>
- Each clause transforms ResultSet → ResultSet
- execute_match() expands the set (one context per match)
- execute_return() projects one row per context
- execute_set/delete() apply to all contexts
- Cross-product semantics for multiple patterns in one MATCH

Also adds comprehensive tests:
- test_match_returns_multiple_rows (the Issue ruvnet#269 regression)
- test_match_return_properties (verify correct values per row)
- test_match_where_filter (WHERE correctly filters multi-row)
- test_match_single_result (1 match → 1 row, no regression)
- test_match_no_results (0 matches → 0 rows)
- test_match_many_nodes (100 nodes → 100 rows, stress test)

Co-Authored-By: claude-flow <ruv@ruv.net>
RETURN n.name now produces column "n.name" instead of "?column?".
Property expressions (Expression::Property) are formatted as
"object.property" for column naming, matching standard Cypher behavior.

Co-Authored-By: claude-flow <ruv@ruv.net>
  Built from commit b2347ce

  Platforms updated:
  - linux-x64-gnu
  - linux-arm64-gnu
  - darwin-x64
  - darwin-arm64
  - win32-x64-msvc

  🤖 Generated by GitHub Actions
  Built from commit 2adb949

  Platforms updated:
  - linux-x64-gnu
  - linux-arm64-gnu
  - darwin-x64
  - darwin-arm64
  - win32-x64-msvc

  🤖 Generated by GitHub Actions
Phase 2 of the ruvector remediation plan. Replaces simulated benchmarks
with real measurements:

- Python harness: hnswlib (C++) and numpy brute-force on same datasets
- Rust test: ruvector-core HNSW with ground-truth recall measurement
- Datasets: random-10K and random-100K, 128 dimensions
- Metrics: QPS (p50/p95), recall@10 vs ground truth, memory, build time

Key findings:
- ruvector recall@10 is good: 98.3% (10K), 86.75% (100K)
- ruvector QPS is 2.6-2.9x slower than hnswlib
- ruvector build time is 2.2-5.9x slower than hnswlib
- ruvector uses ~523MB for 100K vectors (10x raw data size)
- All numbers are REAL — no hardcoded values, no simulation

Co-Authored-By: claude-flow <ruv@ruv.net>
- Add independent benchmark report comparing ruvector-core vs hnswlib (C++) vs numpy brute-force
- 10K vectors: 443 QPS / 98.3% recall (vs hnswlib 1153 QPS / 98.95% recall)
- 100K vectors: 86 QPS / 86.75% recall (vs hnswlib 250 QPS / 74.27% recall)
- Fix README "100% recall" claim — actual recall is 86.75-98.3% depending on scale
- Fix "simulated Python baseline" — now compared against real hnswlib competitor
- Include raw JSON data and full methodology documentation

Co-Authored-By: claude-flow <ruv@ruv.net>
…ions

The benchmark was calling db.insert() in a loop for each vector individually.
Each insert() call opens a separate REDB write transaction, serializes the
vector with bincode, writes to the B-tree table, and commits with fsync.

For 10K vectors, this meant 10,000 separate write transactions with 10,000
fsync operations. The benchmark was measuring REDB transaction commit overhead,
not HNSW index build performance.

The fix: use db.insert_batch() which wraps all inserts in a single REDB
write transaction with one commit at the end. This is how production code
already works — ruvector-server, ruvector-cli, and ruvector-node all use
insert_batch() for bulk operations.

This is OPT-4 from the ruvector performance optimization plan. Expected
impact: 1.5-3x faster build time at 10K scale, where transaction overhead
dominated. Smaller improvement at 100K where HNSW graph search cost dominates.

No library code changed — benchmark only.

Co-Authored-By: claude-flow <ruv@ruv.net>
…PT-1)

The search() method was calling storage.get() for every result to populate
vector data and metadata. Each call opens a REDB read transaction, does a
B-tree lookup, and deserializes with bincode. For k=10 results, that's 10
separate database roundtrips per search query. hnswlib returns IDs + scores
with zero I/O.

Added SearchQuery.enrich field (Option<bool>) to control this behavior:
- None (default): auto — only enrich when a metadata filter is present,
  since the filter needs metadata to evaluate. No filter = skip enrichment.
- Some(true): always enrich (for callers that need vector data/metadata)
- Some(false): never enrich (pure ID + score results, fastest path)

This is backward compatible:
- Existing code with metadata filters: auto-enriches (filter present → true)
- agenticdb callers: use filters, so auto-enrichment kicks in
- Benchmark: explicitly sets enrich: Some(false) for fair HNSW comparison
- All 27 files with SearchQuery literals updated with ..Default::default()

Added Default derive to SearchQuery so existing struct literals work with
the new field via ..Default::default().

Expected impact: +30-50% QPS improvement for search-only workloads (no
filter, no vector/metadata needed), which is the common case for
similarity search and the benchmark comparison against hnswlib.

IMPORTANT: Setting enrich=Some(false) with a metadata filter will cause
the filter to match nothing (metadata is None). The auto behavior (None)
prevents this footgun.

Co-Authored-By: claude-flow <ruv@ruv.net>
…I (OPT-2)

Replaced SimSIMD FFI calls in the HNSW distance hot path with native Rust
SIMD intrinsics from simd_intrinsics.rs. The key difference: native Rust
functions can be fully inlined by the compiler into the monomorphized
hnsw_rs search loop, while SimSIMD's C library crossed the FFI boundary
on every distance call (hundreds of times per search query).

Changes to distance.rs:
- euclidean_distance: SimSIMD sqeuclidean().sqrt() -> simd_intrinsics::euclidean_distance_simd()
  Both return sqrt(sum_of_squares), semantically identical.
- cosine_distance: SimSIMD cosine() -> 1.0 - simd_intrinsics::cosine_similarity_simd()
  Added is_finite() guard for zero-norm vectors (SimSIMD handled this
  internally, simd_intrinsics does not check for division by zero).
- dot_product_distance: SimSIMD dot() -> -simd_intrinsics::dot_product_simd()
  Negation for distance semantics preserved.

Changes to index/hnsw.rs (DistanceFn::eval):
- Added #[inline(always)] to enable inlining into hnsw_rs search loop
- Replaced distance(a, b, metric).unwrap_or(f32::MAX) with direct match
  on metric enum calling individual distance functions. This eliminates:
  1. The dimension length check (guaranteed equal at insert time)
  2. The Result<f32> construction and unwrap_or per call
  3. One level of function call indirection

WASM path is unaffected — still uses scalar fallback, gated by
#[cfg(any(not(feature = "simd"), target_arch = "wasm32"))].

SimSIMD remains as a dependency but is no longer called from the HNSW
hot path. The existing distance tests verify equivalence.

Expected impact: +15-30% QPS from elimination of FFI overhead and
enabling full inlining of distance computation into the search loop.

Co-Authored-By: claude-flow <ruv@ruv.net>
aepod (author) commented Mar 24, 2026

OPT-2: Native Rust SIMD intrinsics instead of SimSIMD FFI ✅

Commit: 9d111d4

Replaced all 3 SimSIMD FFI calls in the distance hot path with native Rust SIMD intrinsics that were already in the codebase (simd_intrinsics.rs) but weren't being used for HNSW:

| Function    | Before (SimSIMD FFI)            | After (Native Rust SIMD)                        |
|-------------|---------------------------------|-------------------------------------------------|
| euclidean   | `simsimd::sqeuclidean().sqrt()` | `simd_intrinsics::euclidean_distance_simd()`    |
| cosine      | `simsimd::cosine()`             | `1.0 - simd_intrinsics::cosine_similarity_simd()` |
| dot product | `simsimd::dot()`                | `-simd_intrinsics::dot_product_simd()`          |

Also optimized DistanceFn::eval (called hundreds of times per search):

  • Added #[inline(always)] for compiler inlining into hnsw_rs search loop
  • Removed distance().unwrap_or(f32::MAX) wrapper — now calls distance functions directly
  • Eliminates: dimension check, Result construction, unwrap_or, one call indirection

Investigation notes:

  • cosine_similarity_simd returns similarity, not distance — added 1.0 - similarity conversion
  • Added is_finite() guard for zero-norm vectors (SimSIMD handled this internally, NEON impl does not)
  • WASM path unchanged — still uses scalar fallback
  • Existing distance tests verify semantic equivalence

…al (OPT-8)

Added software prefetch hints in the hnsw_rs search_layer inner loop to
overlap memory fetches with computation. When processing neighbor N, we
prefetch neighbor N+1's vector data into L1 cache.

The distance computation (dist_f.eval) reads the full vector data (~512
bytes for 128-dimensional f32 vectors) from heap memory. Without prefetching,
each neighbor access triggers an L2/L3 cache miss (~50-100ns on aarch64).
With prefetching, the memory subsystem begins fetching the next vector's
data while the CPU is computing the current distance.

Implementation:
- Converted `for e in neighbours` to indexed loop for lookahead access
- aarch64: inline asm `prfm pldl1keep` (stable Rust, no nightly needed)
- x86_64: `_mm_prefetch` with `_MM_HINT_T0` (stable with SSE)
- Other architectures: no-op (graceful degradation)
- Only prefetches the first cache line (64 bytes = 16 floats) — the
  hardware prefetcher will handle sequential access from there

Prefetch is a hint — worst case it's ignored by the CPU. No correctness
impact. The change is localized to search_layer in patched hnsw_rs.

Expected impact: +10-20% QPS for datasets where vectors don't fit in L2
cache (roughly >5K vectors at 128 dimensions).

Co-Authored-By: claude-flow <ruv@ruv.net>
aepod (author) commented Mar 24, 2026

OPT-8: Software prefetching in search_layer ✅

Commit: 4cc6eaa

Added prefetch hints in the hnsw_rs search_layer inner loop. When processing neighbor N's distance computation, we prefetch neighbor N+1's vector data into L1 cache.

Why this helps: Each vector is 512 bytes (128d × 4 bytes) on the heap. Without prefetching, accessing a neighbor's vector data triggers an L2/L3 cache miss (~50-100ns). By issuing a prefetch hint one iteration ahead, the memory subsystem starts fetching while the CPU computes the current distance.

Implementation:

  • aarch64: inline asm `prfm pldl1keep` (works on stable Rust, no nightly)
  • x86_64: `_mm_prefetch` with `_MM_HINT_T0` (stable with SSE)
  • Other architectures: no-op (graceful degradation)
  • Converted `for e in neighbours` to indexed loop for lookahead access
  • Only prefetches first cache line (64 bytes) — hardware prefetcher handles the rest

Investigation notes:

  • `std::arch::aarch64::_prefetch` is unstable (requires nightly) — used inline asm instead
  • Prefetch is a hint, worst case ignored by CPU — zero correctness risk
  • Vector data accessed via `point_ref.data.get_v()` returns `&[T]` (heap pointer or mmap slice)
  • Multiple indirections (Arc -> Point -> PointData -> Vec) but vector data is the main miss

aepod and others added 2 commits March 24, 2026 16:49
…1GB (OPT-9)

redb's default cache size is 1GB (set in Builder::new()). This is excessive
for most vector workloads — at 100K vectors with 128 dimensions, the working
set is ~50MB of vector data plus graph overhead.

Added DbOptions.cache_size_bytes (Option<usize>) to expose REDB cache
configuration:
- None (default): uses 64MB cache — sufficient for B-tree index pages
  and recently accessed vectors
- Some(bytes): explicit override for specific workloads

Changes:
- Added cache_size_bytes field to DbOptions with #[serde(default)]
- Added VectorStorage::with_cache() constructor that accepts cache config
- VectorDB::new() passes cache config through to storage
- Database::builder().set_cache_size(bytes).create() replaces Database::create()
- All 20 files with DbOptions struct literals updated with ..Default::default()

redb splits cache 90% read / 10% write internally. With OPT-1 removing
storage reads from the search hot path, the read cache is primarily used
for filtered searches and startup index rebuild.

Expected impact: -20-40MB RSS reduction (from not allocating 1GB of virtual
address space for the page cache). Actual savings depend on OS memory
management behavior.

Co-Authored-By: claude-flow <ruv@ruv.net>
…mport

SIMD floating point can produce cosine similarity slightly > 1.0 for
near-identical vectors, causing 1.0 - similarity to be negative. This
violates hnsw_rs assertions that expect non-negative distances.

Added .max(0.0) clamp to cosine_distance when using native SIMD intrinsics.
Also removed unused `distance` import from index/hnsw.rs and fixed double
..Default::default() in agenticdb.rs.

Co-Authored-By: claude-flow <ruv@ruv.net>
aepod closed this Mar 24, 2026