perf: HNSW optimization sweep — batch transactions, search I/O, SIMD, and more#294
aepod wants to merge 12 commits into ruvnet:main
Conversation
The execute_match() function previously collapsed all match results into a single ExecutionContext via context.bind(), which overwrote previous bindings. MATCH (n:Person) on 3 Person nodes returned only 1 row.

This commit refactors the executor to use a ResultSet pipeline:
- type ResultSet = Vec<ExecutionContext>
- Each clause transforms ResultSet → ResultSet
- execute_match() expands the set (one context per match)
- execute_return() projects one row per context
- execute_set/delete() apply to all contexts
- Cross-product semantics for multiple patterns in one MATCH

Also adds comprehensive tests:
- test_match_returns_multiple_rows (the Issue ruvnet#269 regression)
- test_match_return_properties (verifies correct values per row)
- test_match_where_filter (WHERE correctly filters multi-row results)
- test_match_single_result (1 match → 1 row, no regression)
- test_match_no_results (0 matches → 0 rows)
- test_match_many_nodes (100 nodes → 100 rows, stress test)

Co-Authored-By: claude-flow <ruv@ruv.net>
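The pipeline shape can be sketched with simplified types (illustrative only — the ExecutionContext and clause functions here are stand-ins, not the real executor):

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real executor types (illustrative only).
type ExecutionContext = HashMap<String, String>;
type ResultSet = Vec<ExecutionContext>;

// MATCH expands the set: one output context per (input context, matched node),
// instead of rebinding into a single context and overwriting previous matches.
fn execute_match(input: ResultSet, matched_nodes: &[&str]) -> ResultSet {
    input
        .iter()
        .flat_map(|ctx| {
            matched_nodes.iter().map(move |node| {
                let mut out = ctx.clone();
                out.insert("n".to_string(), node.to_string());
                out
            })
        })
        .collect()
}

// RETURN projects one row per context.
fn execute_return(input: &ResultSet, var: &str) -> Vec<String> {
    input.iter().filter_map(|ctx| ctx.get(var).cloned()).collect()
}

fn main() {
    // Start with one empty context, as a query executor would.
    let start: ResultSet = vec![ExecutionContext::new()];
    let matched = execute_match(start, &["alice", "bob", "carol"]);
    let rows = execute_return(&matched, "n");
    assert_eq!(rows.len(), 3); // 3 Person nodes -> 3 rows, not 1
    println!("{:?}", rows);
}
```

Chaining a second MATCH over the output set yields the cross-product semantics described above, since each input context fans out independently.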
RETURN n.name now produces column "n.name" instead of "?column?". Property expressions (Expression::Property) are formatted as "object.property" for column naming, matching standard Cypher behavior. Co-Authored-By: claude-flow <ruv@ruv.net>
Built from commit b2347ce. Platforms updated: linux-x64-gnu, linux-arm64-gnu, darwin-x64, darwin-arm64, win32-x64-msvc. 🤖 Generated by GitHub Actions
Built from commit 2adb949. Platforms updated: linux-x64-gnu, linux-arm64-gnu, darwin-x64, darwin-arm64, win32-x64-msvc. 🤖 Generated by GitHub Actions
Phase 2 of the ruvector remediation plan. Replaces simulated benchmarks with real measurements:
- Python harness: hnswlib (C++) and numpy brute-force on the same datasets
- Rust test: ruvector-core HNSW with ground-truth recall measurement
- Datasets: random-10K and random-100K, 128 dimensions
- Metrics: QPS (p50/p95), recall@10 vs ground truth, memory, build time

Key findings:
- ruvector recall@10 is good: 98.3% (10K), 86.75% (100K)
- ruvector QPS is 2.6-2.9x slower than hnswlib
- ruvector build time is 2.2-5.9x slower than hnswlib
- ruvector uses ~523MB for 100K vectors (10x the raw data size)
- All numbers are REAL — no hardcoded values, no simulation

Co-Authored-By: claude-flow <ruv@ruv.net>
- Add independent benchmark report comparing ruvector-core vs hnswlib (C++) vs numpy brute-force
- 10K vectors: 443 QPS / 98.3% recall (vs hnswlib 1153 QPS / 98.95% recall)
- 100K vectors: 86 QPS / 86.75% recall (vs hnswlib 250 QPS / 74.27% recall)
- Fix README "100% recall" claim — actual recall is 86.75-98.3% depending on scale
- Fix "simulated Python baseline" — now compared against a real hnswlib competitor
- Include raw JSON data and full methodology documentation

Co-Authored-By: claude-flow <ruv@ruv.net>
…ions

The benchmark was calling db.insert() in a loop for each vector individually. Each insert() call opens a separate REDB write transaction, serializes the vector with bincode, writes to the B-tree table, and commits with fsync. For 10K vectors, this meant 10,000 separate write transactions with 10,000 fsync operations. The benchmark was measuring REDB transaction commit overhead, not HNSW index build performance.

The fix: use db.insert_batch(), which wraps all inserts in a single REDB write transaction with one commit at the end. This is how production code already works — ruvector-server, ruvector-cli, and ruvector-node all use insert_batch() for bulk operations.

This is OPT-4 from the ruvector performance optimization plan. Expected impact: 1.5-3x faster build time at 10K scale, where transaction overhead dominated; smaller improvement at 100K, where HNSW graph search cost dominates.

No library code changed — benchmark only.

Co-Authored-By: claude-flow <ruv@ruv.net>
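The per-insert vs. batched-transaction difference can be sketched with a toy store that counts commits (hypothetical types — the real code uses REDB write transactions, not this struct):

```rust
// Toy model of transaction-commit overhead (hypothetical; the counter stands
// in for fsync'd REDB write transactions, not real I/O).
#[derive(Default)]
struct ToyStore {
    rows: Vec<Vec<f32>>,
    commits: usize, // each commit models one fsync'd write transaction
}

impl ToyStore {
    // Mirrors the old benchmark: one transaction + commit per vector.
    fn insert(&mut self, v: Vec<f32>) {
        self.rows.push(v);
        self.commits += 1;
    }

    // Mirrors the insert_batch() pattern: all writes in one transaction, one commit.
    fn insert_batch(&mut self, vs: Vec<Vec<f32>>) {
        self.rows.extend(vs);
        self.commits += 1;
    }
}

fn main() {
    let data: Vec<Vec<f32>> = (0..10_000).map(|i| vec![i as f32; 4]).collect();

    let mut per_insert = ToyStore::default();
    for v in data.clone() {
        per_insert.insert(v);
    }
    assert_eq!(per_insert.commits, 10_000); // 10K commits = 10K fsyncs

    let mut batched = ToyStore::default();
    batched.insert_batch(data);
    assert_eq!(batched.commits, 1); // one commit at the end

    println!("per-insert: {} commits, batched: {} commits", per_insert.commits, batched.commits);
}
```

With fsync costing on the order of milliseconds, the 10,000-to-1 commit ratio is why the old benchmark numbers reflected storage overhead rather than HNSW build cost.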
…PT-1)

The search() method was calling storage.get() for every result to populate vector data and metadata. Each call opens a REDB read transaction, does a B-tree lookup, and deserializes with bincode. For k=10 results, that's 10 separate database roundtrips per search query. hnswlib returns IDs + scores with zero I/O.

Added a SearchQuery.enrich field (Option<bool>) to control this behavior:
- None (default): auto — only enrich when a metadata filter is present, since the filter needs metadata to evaluate. No filter = skip enrichment.
- Some(true): always enrich (for callers that need vector data/metadata)
- Some(false): never enrich (pure ID + score results, fastest path)

This is backward compatible:
- Existing code with metadata filters: auto-enriches (filter present → true)
- agenticdb callers: use filters, so auto-enrichment kicks in
- Benchmark: explicitly sets enrich: Some(false) for a fair HNSW comparison
- All 27 files with SearchQuery literals updated with ..Default::default()

Added a Default derive to SearchQuery so existing struct literals work with the new field via ..Default::default().

Expected impact: +30-50% QPS for search-only workloads (no filter, no vector/metadata needed), which is the common case for similarity search and the benchmark comparison against hnswlib.

IMPORTANT: Setting enrich: Some(false) with a metadata filter will cause the filter to match nothing (metadata is None). The auto behavior (None) prevents this footgun.

Co-Authored-By: claude-flow <ruv@ruv.net>
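The auto-enrichment rule reduces to a single expression. A minimal sketch (the field names mirror the commit message, but this struct is a simplification, not the real SearchQuery):

```rust
// Simplified SearchQuery (illustrative; the real struct has more fields).
#[derive(Default)]
struct SearchQuery {
    filter: Option<String>, // stand-in for the metadata filter
    enrich: Option<bool>,   // None = auto, Some(x) = explicit override
}

impl SearchQuery {
    // Auto mode enriches only when a filter needs metadata to evaluate;
    // an explicit Some(true)/Some(false) always wins.
    fn should_enrich(&self) -> bool {
        self.enrich.unwrap_or(self.filter.is_some())
    }
}

fn main() {
    // No filter, default enrich: skip storage reads (fast path).
    assert!(!SearchQuery::default().should_enrich());

    // Filter present, default enrich: auto-enrich so the filter can evaluate.
    let filtered = SearchQuery { filter: Some("tag = 'a'".into()), ..Default::default() };
    assert!(filtered.should_enrich());

    // The documented footgun: explicit Some(false) plus a filter.
    let footgun = SearchQuery { filter: Some("tag = 'a'".into()), enrich: Some(false) };
    assert!(!footgun.should_enrich()); // filter will see metadata = None

    println!("ok");
}
```

The `..Default::default()` spread in the second literal is also how the 27 existing call sites stay source-compatible after the new field is added.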
…I (OPT-2)

Replaced SimSIMD FFI calls in the HNSW distance hot path with native Rust SIMD intrinsics from simd_intrinsics.rs. The key difference: native Rust functions can be fully inlined by the compiler into the monomorphized hnsw_rs search loop, while SimSIMD's C library crossed the FFI boundary on every distance call (hundreds of times per search query).

Changes to distance.rs:
- euclidean_distance: SimSIMD sqeuclidean().sqrt() → simd_intrinsics::euclidean_distance_simd(). Both return sqrt(sum_of_squares), semantically identical.
- cosine_distance: SimSIMD cosine() → 1.0 - simd_intrinsics::cosine_similarity_simd(). Added an is_finite() guard for zero-norm vectors (SimSIMD handled this internally; simd_intrinsics does not check for division by zero).
- dot_product_distance: SimSIMD dot() → -simd_intrinsics::dot_product_simd(). Negation for distance semantics preserved.

Changes to index/hnsw.rs (DistanceFn::eval):
- Added #[inline(always)] to enable inlining into the hnsw_rs search loop
- Replaced distance(a, b, metric).unwrap_or(f32::MAX) with a direct match on the metric enum calling the individual distance functions. This eliminates:
  1. The dimension length check (guaranteed equal at insert time)
  2. The Result<f32> construction and unwrap_or per call
  3. One level of function call indirection

The WASM path is unaffected — it still uses the scalar fallback, gated by #[cfg(any(not(feature = "simd"), target_arch = "wasm32"))]. SimSIMD remains a dependency but is no longer called from the HNSW hot path. The existing distance tests verify equivalence.

Expected impact: +15-30% QPS from eliminating FFI overhead and enabling full inlining of distance computation into the search loop.

Co-Authored-By: claude-flow <ruv@ruv.net>
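The new dispatch shape can be sketched in scalar Rust (the plain loops below stand in for the SIMD intrinsics; the enum and function names are illustrative, not the real API):

```rust
// Illustrative metric enum; scalar loops stand in for simd_intrinsics.
#[derive(Clone, Copy)]
enum Metric { Euclidean, Cosine, DotProduct }

#[inline(always)]
fn euclidean_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>().sqrt()
}

#[inline(always)]
fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    let sim = dot / (na * nb);
    // Guard zero-norm vectors (division by zero -> NaN), as the commit does.
    if sim.is_finite() { 1.0 - sim } else { 1.0 }
}

#[inline(always)]
fn dot_product_distance(a: &[f32], b: &[f32]) -> f32 {
    -a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>() // negate for distance semantics
}

// Direct match on the metric: no length check, no Result, no extra indirection,
// and #[inline(always)] lets the whole thing fold into the caller's search loop.
#[inline(always)]
fn eval(metric: Metric, a: &[f32], b: &[f32]) -> f32 {
    match metric {
        Metric::Euclidean => euclidean_distance(a, b),
        Metric::Cosine => cosine_distance(a, b),
        Metric::DotProduct => dot_product_distance(a, b),
    }
}

fn main() {
    let a = [1.0, 0.0];
    let b = [0.0, 1.0];
    assert_eq!(eval(Metric::Euclidean, &a, &b), 2.0f32.sqrt());
    assert!((eval(Metric::Cosine, &a, &b) - 1.0).abs() < 1e-6); // orthogonal vectors
    assert_eq!(eval(Metric::DotProduct, &a, &a), -1.0);
    println!("ok");
}
```

The point of the direct match is that each arm is a concrete, inlinable call; the previous `distance(a, b, metric)` wrapper forced a `Result<f32>` allocation on the stack and an `unwrap_or` branch per neighbor.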
OPT-2: Native Rust SIMD intrinsics instead of SimSIMD FFI ✅

Commit: 9d111d4

Replaced all 3 SimSIMD FFI calls in the distance hot path with native Rust SIMD intrinsics that were already in the codebase (simd_intrinsics.rs) but weren't being used for HNSW. Also optimized DistanceFn::eval in index/hnsw.rs with #[inline(always)] and a direct match on the metric enum (see the commit message above).
…al (OPT-8)

Added software prefetch hints in the hnsw_rs search_layer inner loop to overlap memory fetches with computation. When processing neighbor N, we prefetch neighbor N+1's vector data into L1 cache.

The distance computation (dist_f.eval) reads the full vector data (~512 bytes for 128-dimensional f32 vectors) from heap memory. Without prefetching, each neighbor access triggers an L2/L3 cache miss (~50-100ns on aarch64). With prefetching, the memory subsystem begins fetching the next vector's data while the CPU is computing the current distance.

Implementation:
- Converted `for e in neighbours` to an indexed loop for lookahead access
- aarch64: inline asm `prfm pldl1keep` (stable Rust, no nightly needed)
- x86_64: `_mm_prefetch` with `_MM_HINT_T0` (stable with SSE)
- Other architectures: no-op (graceful degradation)
- Only prefetches the first cache line (64 bytes = 16 floats) — the hardware prefetcher handles sequential access from there

Prefetch is a hint — worst case it's ignored by the CPU. No correctness impact. The change is localized to search_layer in the patched hnsw_rs.

Expected impact: +10-20% QPS for datasets where vectors don't fit in L2 cache (roughly >5K vectors at 128 dimensions).

Co-Authored-By: claude-flow <ruv@ruv.net>
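The lookahead pattern can be sketched as follows (the scan loop and vector layout are simplified stand-ins for the hnsw_rs neighbor loop; the prefetch instructions match those named in the commit):

```rust
// Architecture-gated L1 prefetch hint; no-op where unsupported.
#[inline(always)]
fn prefetch_l1(ptr: *const f32) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8);
    }
    #[cfg(target_arch = "aarch64")]
    unsafe {
        std::arch::asm!("prfm pldl1keep, [{0}]", in(reg) ptr, options(nostack, readonly));
    }
    #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
    {
        let _ = ptr; // graceful degradation on other architectures
    }
}

// Simplified neighbor scan: while computing the distance to vectors[i],
// hint the first cache line of vectors[i + 1] (indexed loop for lookahead).
fn nearest(query: &[f32], vectors: &[Vec<f32>]) -> usize {
    let mut best = (0usize, f32::MAX);
    for i in 0..vectors.len() {
        if i + 1 < vectors.len() {
            prefetch_l1(vectors[i + 1].as_ptr());
        }
        let d: f32 = query
            .iter()
            .zip(&vectors[i])
            .map(|(a, b)| (a - b) * (a - b))
            .sum();
        if d < best.1 {
            best = (i, d);
        }
    }
    best.0
}

fn main() {
    let vectors: Vec<Vec<f32>> = (0..100).map(|i| vec![i as f32; 128]).collect();
    assert_eq!(nearest(&vec![42.0; 128], &vectors), 42);
    println!("ok");
}
```

Because the hint carries no architectural side effects, results are bit-identical with and without it; only the memory-latency profile changes.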
OPT-8: Software prefetching in search_layer ✅

Commit: 4cc6eaa

Added prefetch hints in the hnsw_rs search_layer inner loop. While computing neighbor N's distance, we prefetch neighbor N+1's vector data into L1 cache.

Why this helps: Each vector is 512 bytes (128d × 4 bytes) on the heap. Without prefetching, accessing a neighbor's vector data triggers an L2/L3 cache miss (~50-100ns). By issuing a prefetch hint one iteration ahead, the memory subsystem starts fetching while the CPU computes the current distance.

Implementation: an indexed loop with one-iteration lookahead; `prfm pldl1keep` on aarch64, `_mm_prefetch` on x86_64, no-op elsewhere (see the commit message above).
…1GB (OPT-9)

redb's default cache size is 1GB (set in Builder::new()). This is excessive for most vector workloads — at 100K vectors with 128 dimensions, the working set is ~50MB of vector data plus graph overhead.

Added DbOptions.cache_size_bytes (Option<usize>) to expose the REDB cache configuration:
- None (default): uses a 64MB cache — sufficient for B-tree index pages and recently accessed vectors
- Some(bytes): explicit override for specific workloads

Changes:
- Added a cache_size_bytes field to DbOptions with #[serde(default)]
- Added a VectorStorage::with_cache() constructor that accepts the cache config
- VectorDB::new() passes the cache config through to storage
- Database::builder().set_cache_size(bytes).create() replaces Database::create()
- All 20 files with DbOptions struct literals updated with ..Default::default()

redb splits the cache 90% read / 10% write internally. With OPT-1 removing storage reads from the search hot path, the read cache is primarily used for filtered searches and startup index rebuild.

Expected impact: -20-40MB RSS reduction (from not allocating 1GB of virtual address space for the page cache). Actual savings depend on OS memory management behavior.

Co-Authored-By: claude-flow <ruv@ruv.net>
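The resolution logic is a one-liner. A sketch with simplified types (this DbOptions is a stand-in for the real struct; the 64MB default follows the commit message):

```rust
// Simplified DbOptions (illustrative; the real struct has many more fields).
#[derive(Default)]
struct DbOptions {
    cache_size_bytes: Option<usize>, // None = library default
}

// 64MB default, replacing redb's 1GB builder default.
const DEFAULT_CACHE_BYTES: usize = 64 * 1024 * 1024;

fn effective_cache_size(opts: &DbOptions) -> usize {
    opts.cache_size_bytes.unwrap_or(DEFAULT_CACHE_BYTES)
}

fn main() {
    // Default: 64MB, sized for B-tree index pages plus recently read vectors.
    assert_eq!(effective_cache_size(&DbOptions::default()), 64 * 1024 * 1024);

    // Explicit override, e.g. for large filtered-search workloads.
    let big = DbOptions { cache_size_bytes: Some(256 * 1024 * 1024) };
    assert_eq!(effective_cache_size(&big), 256 * 1024 * 1024);

    println!("ok");
}
```

The resolved value would then be handed to redb's builder (per the commit, via `Database::builder().set_cache_size(bytes).create()` instead of `Database::create()`).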
…mport

SIMD floating point can produce a cosine similarity slightly > 1.0 for near-identical vectors, causing 1.0 - similarity to be negative. This violates hnsw_rs assertions that expect non-negative distances. Added a .max(0.0) clamp to cosine_distance when using native SIMD intrinsics.

Also removed the unused `distance` import from index/hnsw.rs and fixed a doubled ..Default::default() in agenticdb.rs.

Co-Authored-By: claude-flow <ruv@ruv.net>
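A tiny demonstration of the failure mode, assuming a similarity that overshoots 1.0 by one ULP (the exact value is contrived for illustration; real overshoot comes from SIMD reduction order):

```rust
fn main() {
    // A SIMD reduction can return a similarity marginally above 1.0 for
    // near-identical vectors; this value is contrived to show the effect.
    let similarity: f32 = 1.0 + f32::EPSILON;

    let unclamped = 1.0 - similarity;
    assert!(unclamped < 0.0); // negative distance -> trips hnsw_rs assertions

    let clamped = (1.0 - similarity).max(0.0); // the .max(0.0) fix
    assert!(clamped >= 0.0);

    println!("unclamped: {unclamped}, clamped: {clamped}");
}
```

The clamp only fires in the pathological near-1.0 case, so distances for all other vector pairs are unaffected.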
Performance Optimization Sweep
This is an ongoing PR that will accumulate multiple performance optimizations identified by a 4-agent audit of ruvector's HNSW implementation. Each optimization is investigated individually before implementation to avoid hidden regressions in this complex codebase.
Context
After running real benchmarks against hnswlib (C++), we identified several areas where ruvector-core is leaving significant performance on the table — not because the HNSW algorithm is wrong, but because of wrapper overhead, I/O in hot paths, and unused optimizations.
Current baseline (from PR #293):
Root Causes Identified
A team of 4 specialized performance agents audited the codebase in parallel:
- VectorDB::search() triggers a full REDB read transaction per result (10 results = 10 DB roundtrips); hnswlib does zero I/O during search.
- Individual db.insert() calls (1 REDB transaction per vector = 10K fsyncs at 10K scale). Also found that parallel_insert_slice() exists in hnsw_rs but isn't used.
- Native SIMD intrinsics exist in simd_intrinsics.rs but are NOT used in the HNSW search hot loop: every distance call crosses the FFI boundary to SimSIMD's C library, preventing inlining.

Optimization Plan (10 items, 3 phases)
Each item goes through: Investigate → Document → Implement → Test → Commit
Phase A — Quick Wins (this PR, in progress)

- Remove storage.get() from the search path (OPT-1)
- Software prefetching in search_layer (OPT-8)
- Batch REDB transactions in the benchmark (OPT-4)
- Native SIMD intrinsics in the distance hot path (OPT-2)
- Reduce the REDB cache size (OPT-9)

Phase B — Medium Effort (future commits)

- Use parallel_insert_slice() for batch build

Phase C — Structural Changes (future PR)
Commits So Far
OPT-4: Batch REDB transactions in benchmark ✅
What changed:
crates/ruvector-core/tests/bench_hnsw.rs

Before: The benchmark called db.insert(entry) in a loop for each vector. Each call opened a separate REDB write transaction → serialized the vector → wrote to the B-tree → committed with fsync. At 10K vectors, that's 10,000 separate transactions.

After: Uses db.insert_batch(), which wraps all inserts in a single REDB write transaction with one commit.

Why this matters: The old benchmark was measuring REDB transaction overhead, not HNSW build performance. Production code (ruvector-server, ruvector-cli, ruvector-node) already uses insert_batch(). This change aligns the benchmark with real-world usage patterns.

No library code changed — this is a benchmark-only fix.

Investigation notes: We verified that storage.insert_batch() (at storage.rs:168-207) uses a single begin_write()/commit() pair. VectorDB::insert_batch() correctly delegates to it. The rvlite SQL path uses individual inserts by design (one INSERT = one vector), which is correct for that use case.

Target Metrics After Full Optimization
How to Verify
Compare build time before and after. QPS and recall should be unchanged since this only affects the insert path.
Test plan
- cargo test -p ruvector-core --test bench_hnsw --release passes
- cargo test -p ruvector-core

🤖 Generated with claude-flow