datafusion-vector-search-ext

A DataFusion extension that integrates USearch HNSW approximate nearest-neighbour (ANN) vector search as a first-class SQL operator.

Queries matching the ORDER BY distance_fn(col, query) LIMIT k pattern are transparently rewritten by an optimizer rule into a native USearch index call — no query rewrite needed from the caller. WHERE clause filters are handled adaptively: high-selectivity filters use USearch's in-graph predicate API; low-selectivity filters bypass HNSW entirely and scan the data directly.

DataFusion: 52.2 USearch: 2.24

Usage

Add the dependency

[dependencies]
datafusion-vector-search-ext = { git = "https://github.com/hotdata-dev/datafusion-vector-search-ext" }
usearch = "2.24.0"

Build or load a USearch index

USearchIndexConfig is the single source of truth for index parameters. Persist it alongside your .index file — USearch does not embed metadata inside the saved file.

use datafusion_vector_search_ext::USearchIndexConfig;
use usearch::MetricKind;

let cfg = USearchIndexConfig::new(768, MetricKind::L2sq);

// Build a new index:
let index = cfg.build_index()?;
index.reserve(n_rows)?;
for (key, vec) in rows {
    index.add(key, &vec)?;  // key must be a unique u64
}
index.save("my_table.index")?;

// Reload from disk:
let index = cfg.load_index("my_table.index")?;

Key mapping: USearch stores u64 keys only. Each key must map 1:1 to a row in your providers. A simple monotonic row-ID (0, 1, 2, ...) works well.

Register providers and set up the SessionContext

Registration requires two providers:

scan_provider (Arc<dyn TableProvider>) — used for WHERE evaluation during the pre-scan phase (scalar columns only).
lookup_provider (Arc<dyn PointLookupProvider>) — used for O(k) key-based row fetch after HNSW search. Does not need the vector column.

PointLookupProvider extends DataFusion's TableProvider with a single method:

#[async_trait]
impl PointLookupProvider for MyTable {
    async fn fetch_by_keys(
        &self,
        keys: &[u64],
        key_col: &str,
        projection: Option<&[usize]>,
    ) -> Result<Vec<RecordBatch>> {
        // Look up each key via your primary-key index.
        // Missing keys are silently omitted.
        todo!()
    }
}

For development and tests, use the bundled HashKeyProvider:

use datafusion_vector_search_ext::HashKeyProvider;
let provider = Arc::new(HashKeyProvider::try_new(schema, batches, "id")?);

Wire everything together:

use std::sync::Arc;
use datafusion::execution::context::SessionStateBuilder;
use datafusion::prelude::SessionContext;
use datafusion_vector_search_ext::{
    USearchRegistry, USearchQueryPlanner, register_all,
};
use usearch::{MetricKind, ScalarKind};

let registry = USearchRegistry::new();
registry.add(
    "my_table::vector",     // "<table>::<vector_col>"
    Arc::new(index),
    scan_provider.clone(),  // TableProvider (e.g. Parquet)
    lookup_provider.clone(),// PointLookupProvider (e.g. SQLite)
    "id",                   // key column name
    MetricKind::L2sq,
    ScalarKind::F32,
)?;
let registry = registry.into_arc();

let state = SessionStateBuilder::new()
    .with_default_features()
    .with_query_planner(Arc::new(USearchQueryPlanner::new(registry.clone())))
    .build();
let ctx = SessionContext::new_with_state(state);
register_all(&ctx, registry)?;

ctx.register_table("my_table", scan_provider)?;

Write queries

No special syntax — the extension transparently rewrites standard SQL:

SELECT id, title, l2_distance(vector, ARRAY[0.1, 0.2, ...]) AS dist
FROM my_table
ORDER BY dist ASC
LIMIT 10

Recognised patterns:

Distance function	Index metric	Order
`l2_distance`	`MetricKind::L2sq`	`ASC`
`cosine_distance`	`MetricKind::Cos`	`ASC`
`negative_dot_product`	`MetricKind::IP`	`ASC`

All three are lower-is-closer. ORDER BY ... DESC falls back to exact computation. Metric mismatch (e.g. cosine UDF on an L2 index) also falls back silently.

WHERE clause filtering

Scalar WHERE conditions are absorbed and handled adaptively:

SELECT id, l2_distance(vector, ARRAY[...]) AS dist
FROM my_table
WHERE category = 'nlp'
ORDER BY dist ASC
LIMIT 10

See Adaptive filtering for details on how the execution path is chosen.

UDTF path

vector_search_vector provides an explicit table function for ANN search, returning all table columns plus _distance:

SELECT id, title, _distance
FROM vector_search_vector('conn.schema.table', 'column', ARRAY[0.1, 0.2, ...], 10)
ORDER BY _distance ASC

Argument	Type	Description
table	string literal	Dot-separated table reference: `'conn.schema.table'`
column	string literal	Vector column with a registered index
query	`ARRAY[...]` literal	Query vector
k	integer	Number of nearest neighbors to return

The UDTF calls resolve() (sync, cache-only) on the registry — the index must already be loaded before the query is planned. It always calls index.search() directly — no filter absorption. Apply WHERE on the outer query to post-filter.

-- With filtering, aggregation, etc.
SELECT category, COUNT(*) AS cnt, AVG(_distance) AS avg_dist
FROM vector_search_vector('conn.schema.table', 'embedding', ARRAY[...], 50)
WHERE category = 'nlp'
GROUP BY category
ORDER BY avg_dist

Tuning

Pass USearchTableConfig to add_with_config for per-table tuning:

use datafusion_vector_search_ext::USearchTableConfig;

registry.add_with_config(
    "my_table::vector",
    Arc::new(index),
    scan_provider.clone(),
    lookup_provider.clone(),
    "id",
    MetricKind::L2sq,
    ScalarKind::F32,
    USearchTableConfig {
        expansion_search: 128,                    // ef_search (default: 64)
        brute_force_selectivity_threshold: 0.03,  // low-sel cutoff (default: 0.05)
    },
)?;

Parameter	Default	Notes
`expansion_search`	64	HNSW beam width at query time. Higher = better recall, slower. Set once at registration.
`brute_force_selectivity_threshold`	0.05	Below this fraction, bypass HNSW and scan directly.

USearchIndexConfig build parameters:

Parameter	Default	Notes
`connectivity` (M)	16	Graph degree. Higher = better recall, more memory.
`expansion_add` (ef_construction)	128	Must be >= 2xM. Higher = better graph, slower build.
`quantization`	`ScalarKind::F32`	`F16` halves memory with ~1% recall loss at high dims.

Architecture

Module structure

src/
  lib.rs       — public API, register_all()
  registry.rs  — USearchRegistry, RegisteredTable, USearchIndexConfig, USearchTableConfig
  node.rs      — USearchNode: custom logical plan leaf
  rule.rs      — USearchRule: optimizer rewrite rule
  planner.rs   — USearchExecPlanner, USearchExec: physical execution
  udf.rs       — l2_distance, cosine_distance, negative_dot_product scalar UDFs
  udtf.rs      — vector_search_vector table function
  lookup.rs    — PointLookupProvider trait + HashKeyProvider
  keys.rs      — DatasetLayout, pack_key/unpack_key key encoding

tests/
  optimizer_rule.rs  — rewrite rule matching/rejection tests
  execution.rs       — end-to-end execution tests (HNSW + index-get paths)

Optimizer rewrite

The rule (rule.rs) matches two logical plan shapes:

Sort(fetch=k, ORDER BY dist ASC)
  Projection([..., distance_fn(col, lit) AS dist, ...])
    TableScan(name)

Sort(fetch=k, ORDER BY dist ASC)
  Projection([..., distance_fn(col, lit) AS dist, ...])
    Filter(predicate)
      TableScan(name)

Preconditions: sort is ASC, distance UDF matches index metric, table is registered, query vector is a literal. When the rule fires, it replaces the inner nodes with a USearchNode leaf carrying: table name, vector column, query vector, k, distance type, and absorbed filter predicates. The Sort node is preserved above for final ordering.

Physical planning (planner.rs) translates USearchNode into USearchExec, a physical plan node that executes the actual search.

Adaptive filtering

When WHERE filters are present, the execution follows three possible paths:

Query arrives
  |
  +-- No WHERE clause
  |     -> USearch HNSW search -> lookup_provider fetch(k) -> result
  |
  +-- Has WHERE clause
        |
        +-- Pre-scan: scan_provider (_key + filter cols only, predicate pushdown)
        |     -> collect valid_keys, compute selectivity
        |
        +-- Low selectivity (<= threshold, default 5%)
        |     -> index.get(key) for each valid_key -> compute distances -> top-k
        |     -> lookup_provider fetch(k) -> result
        |
        +-- High selectivity (> threshold)
              -> HNSW filtered_search(valid_keys predicate)
              -> lookup_provider fetch(k) -> result

Pre-scan phase: Projects only _key and columns referenced by the WHERE clause (excludes all other columns for efficiency). Filter expressions are pushed down to the scan provider for Parquet-level pruning (row group statistics, bloom filters, page indexes). Collects the set of valid keys and computes selectivity = valid_keys.len() / index.size().

Low-selectivity path (index-get): When few rows pass the filter, HNSW graph traversal becomes expensive (it must explore ~k/selectivity nodes to find k passing candidates). Instead, vectors are retrieved directly from the USearch index via index.get(key) for each valid key, exact distances are computed, and a top-k heap selects the closest matches. Result rows are fetched from the lookup provider.

High-selectivity path (HNSW filtered): Passes valid keys as a predicate to index.filtered_search() — HNSW skips non-passing nodes during traversal. Result keys are fetched from the lookup provider.

Why 5%: The crossover is approximately sqrt(k * M / n). For k=10, M=16, n=240k this is ~2.6%. The 5% default provides a safety margin.

Distance metrics

All three distance functions are lower-is-closer:

SQL function	Index metric	Kernel
`l2_distance(a, b)`	`L2sq`	`sum((a_i - b_i)^2)`
`cosine_distance(a, b)`	`Cos`	`1 - dot(a,b) / (norm(a) * norm(b))`
`negative_dot_product(a, b)`	`IP`	`-(a . b)`

l2_distance returns squared L2 (no sqrt), matching USearch's MetricKind::L2sq. This ensures numeric consistency between the UDF, the rewritten index path, and the brute-force path.

Running tests

cargo test

Tests cover optimizer rule matching/rejection, end-to-end execution through both HNSW and index-get paths, registration validation, and provider error handling.

Limitations

Limitation	Notes
Stacked `Filter` nodes	Only one `Filter -> TableScan` layer is absorbed. `Filter -> Filter -> TableScan` falls back to exact execution. DataFusion typically combines multiple WHERE conditions into a single Filter, so this rarely occurs.
Runtime query vectors	The query vector must be a compile-time literal (`ARRAY[0.1, ...]`). Column references or subquery results are not rewritten. Use `vector_search_vector` for explicit ANN queries.
`ef_search` per-query	`expansion_search` is global to the index instance. Per-query adjustment is not supported.
No DELETE / compaction	USearch soft-deletes entries but requires a full rebuild to reclaim space.

Name		Name	Last commit message	Last commit date
Latest commit History 39 Commits
.github/workflows		.github/workflows
src		src
tests		tests
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
rust-toolchain.toml		rust-toolchain.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

datafusion-vector-search-ext

Usage

Add the dependency

Build or load a USearch index

Register providers and set up the SessionContext

Write queries

WHERE clause filtering

UDTF path

Tuning

Architecture

Module structure

Optimizer rewrite

Adaptive filtering

Distance metrics

Running tests

Limitations

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

datafusion-vector-search-ext

Usage

Add the dependency

Build or load a USearch index

Register providers and set up the SessionContext

Write queries

WHERE clause filtering

UDTF path

Tuning

Architecture

Module structure

Optimizer rewrite

Adaptive filtering

Distance metrics

Running tests

Limitations

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages