
perf: HNSW optimization sweep — batch transactions, search I/O, SIMD, and more #294

Closed

aepod wants to merge 12 commits into ruvnet:main from weave-logic-ai:perf/optimization-sweep

Conversation

aepod commented Mar 24, 2026

Performance Optimization Sweep

This is an ongoing PR that will accumulate multiple performance optimizations identified by a 4-agent audit of ruvector's HNSW implementation. Each optimization is investigated individually before implementation to avoid hidden regressions in this complex codebase.

Context

After running real benchmarks against hnswlib (C++), we identified several areas where ruvector-core is leaving significant performance on the table — not because the HNSW algorithm is wrong, but because of wrapper overhead, I/O in hot paths, and unused optimizations.

Current baseline (from PR #293):

| Scale | ruvector QPS | hnswlib QPS | Gap  | ruvector Build | hnswlib Build |
|-------|--------------|-------------|------|----------------|---------------|
| 10K   | 443          | 1,153       | 2.6x | 44.0s          | 7.5s          |
| 100K  | 86           | 250         | 2.9x | 855.6s         | 395.3s        |

Root Causes Identified

A team of 4 specialized performance agents audited the codebase in parallel:

  1. Search path expert — Found that VectorDB::search() triggers a full REDB read transaction per result (10 results = 10 DB roundtrips). hnswlib does zero I/O during search.
  2. Build path expert — Found that the benchmark uses individual db.insert() calls (1 REDB transaction per vector = 10K fsyncs at 10K scale). Also found that parallel_insert_slice() exists in hnsw_rs but isn't used.
  3. Memory expert — Found triple vector storage (hnsw_rs + DashMap + REDB), Arc-heavy graph structure (~40 bytes per neighbor link vs ~8 needed), and 16-layer pre-allocation per point.
  4. SIMD/distance expert — Found that hand-written NEON SIMD intrinsics exist in simd_intrinsics.rs but are NOT used in the HNSW search hot loop. Instead, every distance call crosses the FFI boundary to SimSIMD's C library, preventing inlining.

Optimization Plan (10 items, 3 phases)

Each item goes through: Investigate → Document → Implement → Test → Commit

Phase A — Quick Wins (this PR, in progress)

| ID    | Fix                                                    | Expected Impact           | Status        |
|-------|--------------------------------------------------------|---------------------------|---------------|
| OPT-4 | Batch REDB transactions in benchmark                   | Build 1.5-3x faster (10K) | Done          |
| OPT-1 | Remove storage.get() from search path                  | QPS +30-50%               | Investigating |
| OPT-2 | Use native Rust NEON intrinsics instead of SimSIMD FFI | QPS +15-30%               | Investigating |
| OPT-8 | Add prefetching to search_layer                        | QPS +10-20%               | Not started   |
| OPT-9 | Set REDB cache size limit                              | Memory -20-40MB           | Not started   |

Phase B — Medium Effort (future commits)

| ID    | Fix                                                | Expected Impact   | Status      |
|-------|----------------------------------------------------|-------------------|-------------|
| OPT-5 | Remove/restructure outer RwLock on HnswInner       | Enables OPT-3     | Not started |
| OPT-3 | Use parallel_insert_slice() for batch build        | Build 2-4x faster | Not started |
| OPT-7 | Replace HashMap visited set with bitset in hnsw_rs | QPS +10-15%       | Not started |

Phase C — Structural Changes (future PR)

| ID     | Fix                                          | Expected Impact           | Status      |
|--------|----------------------------------------------|---------------------------|-------------|
| OPT-6  | Refactor HnswInner.vectors DashMap           | Memory -65MB              | Not started |
| OPT-10 | Replace Arc<PointWithOrder> with flat arrays | Memory -150MB, QPS +20-30% | Not started |

Commits So Far

OPT-4: Batch REDB transactions in benchmark ✅

What changed: crates/ruvector-core/tests/bench_hnsw.rs

Before: The benchmark called db.insert(entry) in a loop for each vector. Each call opened a separate REDB write transaction → serialized the vector → wrote to B-tree → committed with fsync. At 10K vectors = 10,000 separate transactions.

```rust
// OLD — 10K individual transactions
for (i, vec) in data.iter().enumerate() {
    let entry = VectorEntry { id: Some(format!("v{}", i)), vector: vec.clone(), metadata: None };
    db.insert(entry).expect("Insert failed");
}
```

After: Uses db.insert_batch() which wraps all inserts in a single REDB write transaction with one commit.

```rust
// NEW — 1 transaction for all vectors
let entries: Vec<VectorEntry> = data.iter().enumerate()
    .map(|(i, vec)| VectorEntry { id: Some(format!("v{}", i)), vector: vec.clone(), metadata: None })
    .collect();
db.insert_batch(entries).expect("Insert batch failed");
```

Why this matters: The old benchmark was measuring REDB transaction overhead, not HNSW build performance. Production code (ruvector-server, ruvector-cli, ruvector-node) already uses insert_batch(). This change aligns the benchmark with real-world usage patterns.

No library code changed — this is a benchmark-only fix.

Investigation notes: We verified that storage.insert_batch() (at storage.rs:168-207) uses a single begin_write() / commit() pair. VectorDB::insert_batch() correctly delegates to it. The rvlite SQL path uses individual inserts by design (one INSERT = one vector), which is correct for that use case.

Target Metrics After Full Optimization

| Metric        | Current | Phase A Target | Full Target |
|---------------|---------|----------------|-------------|
| QPS (10K)     | 443     | ~800           | ~1,000+     |
| QPS (100K)    | 86      | ~150           | ~200+       |
| Build (10K)   | 44s     | ~15s           | ~5-8s       |
| Build (100K)  | 856s    | ~500s          | ~150-250s   |
| Memory (100K) | 523MB   | ~460MB         | ~250MB      |

How to Verify

```bash
# Run the benchmark (requires Rust toolchain)
cargo test -p ruvector-core --test bench_hnsw --release -- --nocapture
```

Compare build time before and after. QPS and recall should be unchanged since this only affects the insert path.

Test plan

  • cargo test -p ruvector-core --test bench_hnsw --release passes
  • Build time improves (expected 1.5-3x at 10K)
  • QPS unchanged (search path not modified)
  • Recall unchanged (HNSW parameters unchanged)
  • All existing tests still pass: cargo test -p ruvector-core

🤖 Generated with claude-flow

aepod and others added 9 commits March 24, 2026 12:34
The execute_match() function previously collapsed all match results into
a single ExecutionContext via context.bind(), which overwrote previous
bindings. MATCH (n:Person) on 3 Person nodes returned only 1 row.

This commit refactors the executor to use a ResultSet pipeline:
- type ResultSet = Vec<ExecutionContext>
- Each clause transforms ResultSet → ResultSet
- execute_match() expands the set (one context per match)
- execute_return() projects one row per context
- execute_set/delete() apply to all contexts
- Cross-product semantics for multiple patterns in one MATCH

Also adds comprehensive tests:
- test_match_returns_multiple_rows (the Issue ruvnet#269 regression)
- test_match_return_properties (verify correct values per row)
- test_match_where_filter (WHERE correctly filters multi-row)
- test_match_single_result (1 match → 1 row, no regression)
- test_match_no_results (0 matches → 0 rows)
- test_match_many_nodes (100 nodes → 100 rows, stress test)

Co-Authored-By: claude-flow <ruv@ruv.net>
RETURN n.name now produces column "n.name" instead of "?column?".
Property expressions (Expression::Property) are formatted as
"object.property" for column naming, matching standard Cypher behavior.

Co-Authored-By: claude-flow <ruv@ruv.net>
  Built from commit b2347ce

  Platforms updated:
  - linux-x64-gnu
  - linux-arm64-gnu
  - darwin-x64
  - darwin-arm64
  - win32-x64-msvc

  🤖 Generated by GitHub Actions
  Built from commit 2adb949

  Platforms updated:
  - linux-x64-gnu
  - linux-arm64-gnu
  - darwin-x64
  - darwin-arm64
  - win32-x64-msvc

  🤖 Generated by GitHub Actions
Phase 2 of the ruvector remediation plan. Replaces simulated benchmarks
with real measurements:

- Python harness: hnswlib (C++) and numpy brute-force on same datasets
- Rust test: ruvector-core HNSW with ground-truth recall measurement
- Datasets: random-10K and random-100K, 128 dimensions
- Metrics: QPS (p50/p95), recall@10 vs ground truth, memory, build time

Key findings:
- ruvector recall@10 is good: 98.3% (10K), 86.75% (100K)
- ruvector QPS is 2.6-2.9x slower than hnswlib
- ruvector build time is 2.2-5.9x slower than hnswlib
- ruvector uses ~523MB for 100K vectors (10x raw data size)
- All numbers are REAL — no hardcoded values, no simulation

Co-Authored-By: claude-flow <ruv@ruv.net>
- Add independent benchmark report comparing ruvector-core vs hnswlib (C++) vs numpy brute-force
- 10K vectors: 443 QPS / 98.3% recall (vs hnswlib 1153 QPS / 98.95% recall)
- 100K vectors: 86 QPS / 86.75% recall (vs hnswlib 250 QPS / 74.27% recall)
- Fix README "100% recall" claim — actual recall is 86.75-98.3% depending on scale
- Fix "simulated Python baseline" — now compared against real hnswlib competitor
- Include raw JSON data and full methodology documentation

Co-Authored-By: claude-flow <ruv@ruv.net>
…ions

The benchmark was calling db.insert() in a loop for each vector individually.
Each insert() call opens a separate REDB write transaction, serializes the
vector with bincode, writes to the B-tree table, and commits with fsync.

For 10K vectors, this meant 10,000 separate write transactions with 10,000
fsync operations. The benchmark was measuring REDB transaction commit overhead,
not HNSW index build performance.

The fix: use db.insert_batch() which wraps all inserts in a single REDB
write transaction with one commit at the end. This is how production code
already works — ruvector-server, ruvector-cli, and ruvector-node all use
insert_batch() for bulk operations.

This is OPT-4 from the ruvector performance optimization plan. Expected
impact: 1.5-3x faster build time at 10K scale, where transaction overhead
dominated. Smaller improvement at 100K where HNSW graph search cost dominates.

No library code changed — benchmark only.

Co-Authored-By: claude-flow <ruv@ruv.net>
…PT-1)

The search() method was calling storage.get() for every result to populate
vector data and metadata. Each call opens a REDB read transaction, does a
B-tree lookup, and deserializes with bincode. For k=10 results, that's 10
separate database roundtrips per search query. hnswlib returns IDs + scores
with zero I/O.

Added SearchQuery.enrich field (Option<bool>) to control this behavior:
- None (default): auto — only enrich when a metadata filter is present,
  since the filter needs metadata to evaluate. No filter = skip enrichment.
- Some(true): always enrich (for callers that need vector data/metadata)
- Some(false): never enrich (pure ID + score results, fastest path)

This is backward compatible:
- Existing code with metadata filters: auto-enriches (filter present → true)
- agenticdb callers: use filters, so auto-enrichment kicks in
- Benchmark: explicitly sets enrich: Some(false) for fair HNSW comparison
- All 27 files with SearchQuery literals updated with ..Default::default()

Added Default derive to SearchQuery so existing struct literals work with
the new field via ..Default::default().

Expected impact: +30-50% QPS improvement for search-only workloads (no
filter, no vector/metadata needed), which is the common case for
similarity search and the benchmark comparison against hnswlib.

IMPORTANT: Setting enrich=Some(false) with a metadata filter will cause
the filter to match nothing (metadata is None). The auto behavior (None)
prevents this footgun.

Co-Authored-By: claude-flow <ruv@ruv.net>
…I (OPT-2)

Replaced SimSIMD FFI calls in the HNSW distance hot path with native Rust
SIMD intrinsics from simd_intrinsics.rs. The key difference: native Rust
functions can be fully inlined by the compiler into the monomorphized
hnsw_rs search loop, while SimSIMD's C library crossed the FFI boundary
on every distance call (hundreds of times per search query).

Changes to distance.rs:
- euclidean_distance: SimSIMD sqeuclidean().sqrt() -> simd_intrinsics::euclidean_distance_simd()
  Both return sqrt(sum_of_squares), semantically identical.
- cosine_distance: SimSIMD cosine() -> 1.0 - simd_intrinsics::cosine_similarity_simd()
  Added is_finite() guard for zero-norm vectors (SimSIMD handled this
  internally, simd_intrinsics does not check for division by zero).
- dot_product_distance: SimSIMD dot() -> -simd_intrinsics::dot_product_simd()
  Negation for distance semantics preserved.

Changes to index/hnsw.rs (DistanceFn::eval):
- Added #[inline(always)] to enable inlining into hnsw_rs search loop
- Replaced distance(a, b, metric).unwrap_or(f32::MAX) with direct match
  on metric enum calling individual distance functions. This eliminates:
  1. The dimension length check (guaranteed equal at insert time)
  2. The Result<f32> construction and unwrap_or per call
  3. One level of function call indirection

WASM path is unaffected — still uses scalar fallback, gated by
#[cfg(any(not(feature = "simd"), target_arch = "wasm32"))].

SimSIMD remains as a dependency but is no longer called from the HNSW
hot path. The existing distance tests verify equivalence.

Expected impact: +15-30% QPS from elimination of FFI overhead and
enabling full inlining of distance computation into the search loop.

Co-Authored-By: claude-flow <ruv@ruv.net>
aepod (author) commented Mar 24, 2026

OPT-2: Native Rust SIMD intrinsics instead of SimSIMD FFI ✅

Commit: 9d111d4

Replaced all 3 SimSIMD FFI calls in the distance hot path with native Rust SIMD intrinsics that were already in the codebase (simd_intrinsics.rs) but weren't being used for HNSW:

| Function    | Before (SimSIMD FFI)            | After (Native Rust SIMD)                        |
|-------------|---------------------------------|-------------------------------------------------|
| euclidean   | `simsimd::sqeuclidean().sqrt()` | `simd_intrinsics::euclidean_distance_simd()`    |
| cosine      | `simsimd::cosine()`             | `1.0 - simd_intrinsics::cosine_similarity_simd()` |
| dot product | `simsimd::dot()`                | `-simd_intrinsics::dot_product_simd()`          |

Also optimized DistanceFn::eval (called hundreds of times per search):

  • Added #[inline(always)] for compiler inlining into hnsw_rs search loop
  • Removed distance().unwrap_or(f32::MAX) wrapper — now calls distance functions directly
  • Eliminates: dimension check, Result construction, unwrap_or, one call indirection

Investigation notes:

  • cosine_similarity_simd returns similarity, not distance — added 1.0 - similarity conversion
  • Added is_finite() guard for zero-norm vectors (SimSIMD handled this internally, NEON impl does not)
  • WASM path unchanged — still uses scalar fallback
  • Existing distance tests verify semantic equivalence

…al (OPT-8)

Added software prefetch hints in the hnsw_rs search_layer inner loop to
overlap memory fetches with computation. When processing neighbor N, we
prefetch neighbor N+1's vector data into L1 cache.

The distance computation (dist_f.eval) reads the full vector data (~512
bytes for 128-dimensional f32 vectors) from heap memory. Without prefetching,
each neighbor access triggers an L2/L3 cache miss (~50-100ns on aarch64).
With prefetching, the memory subsystem begins fetching the next vector's
data while the CPU is computing the current distance.

Implementation:
- Converted `for e in neighbours` to indexed loop for lookahead access
- aarch64: inline asm `prfm pldl1keep` (stable Rust, no nightly needed)
- x86_64: `_mm_prefetch` with `_MM_HINT_T0` (stable with SSE)
- Other architectures: no-op (graceful degradation)
- Only prefetches the first cache line (64 bytes = 16 floats) — the
  hardware prefetcher will handle sequential access from there

Prefetch is a hint — worst case it's ignored by the CPU. No correctness
impact. The change is localized to search_layer in patched hnsw_rs.

Expected impact: +10-20% QPS for datasets where vectors don't fit in L2
cache (roughly >5K vectors at 128 dimensions).

Co-Authored-By: claude-flow <ruv@ruv.net>
aepod (author) commented Mar 24, 2026

OPT-8: Software prefetching in search_layer ✅

Commit: 4cc6eaa

Added prefetch hints in the hnsw_rs search_layer inner loop. When processing neighbor N's distance computation, we prefetch neighbor N+1's vector data into L1 cache.

Why this helps: Each vector is 512 bytes (128d × 4 bytes) on the heap. Without prefetching, accessing a neighbor's vector data triggers an L2/L3 cache miss (~50-100ns). By issuing a prefetch hint one iteration ahead, the memory subsystem starts fetching while the CPU computes the current distance.

Implementation:

  • aarch64: inline asm `prfm pldl1keep` (works on stable Rust, no nightly)
  • x86_64: `_mm_prefetch` with `_MM_HINT_T0` (stable with SSE)
  • Other architectures: no-op (graceful degradation)
  • Converted `for e in neighbours` to indexed loop for lookahead access
  • Only prefetches first cache line (64 bytes) — hardware prefetcher handles the rest

Investigation notes:

  • `std::arch::aarch64::_prefetch` is unstable (requires nightly) — used inline asm instead
  • Prefetch is a hint, worst case ignored by CPU — zero correctness risk
  • Vector data accessed via `point_ref.data.get_v()` returns `&[T]` (heap pointer or mmap slice)
  • Multiple indirections (Arc -> Point -> PointData -> Vec) but vector data is the main miss

aepod and others added 2 commits March 24, 2026 16:49
…1GB (OPT-9)

redb's default cache size is 1GB (set in Builder::new()). This is excessive
for most vector workloads — at 100K vectors with 128 dimensions, the working
set is ~50MB of vector data plus graph overhead.

Added DbOptions.cache_size_bytes (Option<usize>) to expose REDB cache
configuration:
- None (default): uses 64MB cache — sufficient for B-tree index pages
  and recently accessed vectors
- Some(bytes): explicit override for specific workloads

Changes:
- Added cache_size_bytes field to DbOptions with #[serde(default)]
- Added VectorStorage::with_cache() constructor that accepts cache config
- VectorDB::new() passes cache config through to storage
- Database::builder().set_cache_size(bytes).create() replaces Database::create()
- All 20 files with DbOptions struct literals updated with ..Default::default()

redb splits cache 90% read / 10% write internally. With OPT-1 removing
storage reads from the search hot path, the read cache is primarily used
for filtered searches and startup index rebuild.

Expected impact: -20-40MB RSS reduction (from not allocating 1GB of virtual
address space for the page cache). Actual savings depend on OS memory
management behavior.

Co-Authored-By: claude-flow <ruv@ruv.net>
…mport

SIMD floating point can produce cosine similarity slightly > 1.0 for
near-identical vectors, causing 1.0 - similarity to be negative. This
violates hnsw_rs assertions that expect non-negative distances.

Added .max(0.0) clamp to cosine_distance when using native SIMD intrinsics.
Also removed unused `distance` import from index/hnsw.rs and fixed double
..Default::default() in agenticdb.rs.

Co-Authored-By: claude-flow <ruv@ruv.net>
aepod closed this Mar 24, 2026