fix(rule): index ORDER BY-only distance with WHERE (trimmed projection shape) #26
Merged
A k-NN query whose distance appears only in ORDER BY (not the SELECT
list) silently fell back to brute-force whenever a WHERE clause was
present. With a Filter, DataFusion materializes the raw vector column
in an intermediate projection to feed the Sort, then trims it with an
outer projection:
  Projection: id                    <- real output (no vector)
    Sort: l2_distance(vector, lit)
      Projection: id, vector        <- vector only feeds the Sort
        Filter: ...
          TableScan
The Sort-anchored match judged producibility on the inner projection,
saw the vector column the node cannot produce, and declined. Without a
WHERE clause projection pushdown eats the intermediate projection, so
the passthrough arm fired and the gap went unnoticed.
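To make the producibility judgment concrete, here is a minimal toy sketch in plain Rust. It does not use DataFusion; `producible`, `NODE_SCHEMA`, and the two projection constants are hypothetical names invented for illustration, and the column set is an assumption based on the example plan above. It only shows why the same check declines on the inner projection but would pass on the outer one.

```rust
use std::collections::HashSet;

/// Toy stand-in for the producibility check (hypothetical helper, not the
/// rule's real code): a projection is producible only if every column it
/// references exists in the index node's schema.
fn producible(projection: &[&str], node_schema: &[&str]) -> bool {
    let available: HashSet<&str> = node_schema.iter().copied().collect();
    projection.iter().all(|col| available.contains(col))
}

/// Assumed columns the index node can emit in this toy model: the addressing
/// key and a payload column, but never the raw vector.
const NODE_SCHEMA: [&str; 2] = ["rowid", "id"];

/// Intermediate projection: carries `vector` only to feed the Sort.
const INNER_PROJECTION: [&str; 2] = ["id", "vector"];

/// Outer projection: the query's real output, with the vector trimmed away.
const OUTER_PROJECTION: [&str; 1] = ["id"];
```

Under these assumptions, `producible(&INNER_PROJECTION, &NODE_SCHEMA)` is false because of `vector`, while `producible(&OUTER_PROJECTION, &NODE_SCHEMA)` is true, which is the gap described above.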
Extend the Projection-anchored arm to recognize the trimmed shape and
judge producibility on the OUTER projection (the query's real output).
SELECT * / SELECT id, vector still fall back (#508 behavior preserved,
re-tested); the aliased-distance shape still rewrites via the Sort arm
(the new arm finds no distance in the outer exprs and declines).
New tests use a ducklake-style rowid Int64 addressing key, covering
the key-column-agnostic path alongside the parquet-style _key fixtures.
test_bare_orderby_with_where_rewrites fails on the unfixed rule.
Verified the trimmed-shape arm by inspection:
- Distance is extracted directly from sort.expr, so the new arm's producibility judgment correctly uses the outer projection (real output) while table/filter extraction uses inner.input.
- Aliased-distance, SELECT *, explicit-vector, and deeper-nesting shapes all land in a decline path (None); no valid-but-wrong rewrite is reachable, preserving the failure-mode asymmetry.
- New tests cover the regression and the preserved #508 fallbacks meaningfully.
LGTM. (Note: could not run cargo test in the review environment; relying on the PR's reported green suite.)
Problem
A k-NN query whose distance appears only in ORDER BY (not the SELECT list) silently falls back to brute-force whenever a WHERE clause is present, despite the README contract ("Queries matching the ORDER BY distance_fn(col, query) LIMIT k pattern are transparently rewritten") and despite the output not containing the vector column.
Root cause
With a Filter present, DataFusion materializes the raw vector column in an intermediate projection to feed the Sort, then trims it with an outer projection:
  Projection: id                    <- real output (no vector)
    Sort: l2_distance(vector, lit)
      Projection: id, vector        <- vector only feeds the Sort
        Filter: ...
          TableScan
The Sort-anchored match judges producibility on the inner projection, sees the vector column (which the node can never produce: meta.schema derives from the lookup provider, which stores no vectors), and declines via Projection::try_new(...).ok()?. This is mechanically the same fallback as the intended #508 behavior for SELECT *, but it is over-conservative here, since the outer projection discards the vector.
Without a WHERE clause, projection pushdown eliminates the intermediate projection, the passthrough arm fires, and the gap goes unnoticed, which is why test_bare_select_inline_distance_still_rewrites (no WHERE) never caught it.
Fix
Extend the Projection-anchored arm to recognize the trimmed shape Projection → Sort → Projection(inner) → [Filter →] TableScan and judge producibility on the outer projection (the query's real output). Deliberately narrow:
- SELECT * / SELECT id, vector with WHERE: still fall back (#508 preserved, re-tested)
- Aliased distance (SELECT …, dist AS d … ORDER BY d): the new arm finds no distance in the outer exprs, declines, and the Sort arm handles it exactly as before
- Deeper nesting: inner.input ∈ {TableScan, Filter} guard → fallback as before
- Extension inputs → no double-fire across optimizer passes
The new arm's firing condition is precisely the previously-dead zone; there is no existing rewrite behavior to displace. Failure-mode asymmetry is unchanged: an unenumerated plan shape lands in the decline path (correct, unindexed), never in silent wrong results.
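The guards listed above can be illustrated with a small self-contained model. This is a sketch under stated assumptions, not the rule's actual code: `Plan`, `SortKey`, `match_trimmed_shape`, and `trimmed` are hypothetical stand-ins rather than DataFusion types, and the aliased-distance shape is modeled (as an assumption) as a Sort keyed on a plain column reference instead of an inline distance call.

```rust
/// Toy logical plan; a stand-in for illustration, not DataFusion's LogicalPlan.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    TableScan,
    Filter(Box<Plan>),
    Projection { exprs: Vec<String>, input: Box<Plan> },
    Sort { key: SortKey, input: Box<Plan> },
    /// Stands for an already-rewritten index node.
    Extension,
}

#[derive(Debug, Clone, PartialEq)]
enum SortKey {
    /// ORDER BY distance_fn(column, literal)
    Distance { column: String },
    /// ORDER BY some_column (assumed form of the aliased-distance shape)
    Column(String),
}

/// Returns the outer projection's columns when the trimmed shape matches and
/// the node can produce them; returns None (decline, fall back) otherwise.
fn match_trimmed_shape(plan: &Plan, node_schema: &[&str]) -> Option<Vec<String>> {
    // Outer projection: the query's real output.
    let Plan::Projection { exprs: outer, input } = plan else { return None };
    // Directly below: a Sort keyed on an inline distance expression.
    let Plan::Sort { key: SortKey::Distance { .. }, input: sort_input } = input.as_ref()
    else {
        return None;
    };
    // Then the intermediate projection that only feeds the Sort.
    let Plan::Projection { input: inner_input, .. } = sort_input.as_ref() else {
        return None;
    };
    // Guard: inner.input must be a TableScan or a Filter; anything else declines.
    if !matches!(inner_input.as_ref(), Plan::TableScan | Plan::Filter(_)) {
        return None;
    }
    // Producibility is judged on the OUTER projection only.
    if !outer.iter().all(|c| node_schema.contains(&c.as_str())) {
        return None;
    }
    Some(outer.clone())
}

/// Builds Projection(outer) -> Sort(key) -> Projection(id, vector) -> inner_input.
fn trimmed(outer: &[&str], key: SortKey, inner_input: Plan) -> Plan {
    Plan::Projection {
        exprs: outer.iter().map(|s| s.to_string()).collect(),
        input: Box::new(Plan::Sort {
            key,
            input: Box::new(Plan::Projection {
                exprs: vec!["id".into(), "vector".into()],
                input: Box::new(inner_input),
            }),
        }),
    }
}
```

In this model only the bare ORDER BY shape over a TableScan or Filter returns `Some`; an outer projection that still contains `vector`, a Sort on a plain column, or an `Extension` input all return `None`, mirroring the decline-by-default asymmetry described above.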
Tests
New tests/orderby_distance_trimmed.rs (5 tests), using a ducklake-style rowid: Int64 addressing key (vs the parquet-style _key in the existing fixtures), so the key-column-agnostic path is covered too:
- test_bare_orderby_with_where_rewrites: fails on the unfixed rule, the regression guard
- SELECT * + WHERE fallback, explicit-vector + WHERE fallback, aliased-distance + WHERE still rewrites
Full suite green (79 existing + 5 new). Also verified end-to-end against a runtimedb local server on the ducklake storage backend: the production shape now plans USearchExec filtered=true for both an existing-column index and an embedding-backed index, with correct results; SELECT * fallback intact.