fix(rule): index ORDER BY-only distance with WHERE (trimmed projection shape) #26

Merged
anoop-narang merged 1 commit into main from fix/orderby-distance-not-projected
Jun 4, 2026
@anoop-narang
Collaborator

Problem

A k-NN query whose distance appears only in ORDER BY (not the SELECT list) silently falls back to brute-force whenever a WHERE clause is present — despite the README contract ("Queries matching the ORDER BY distance_fn(col, query) LIMIT k pattern are transparently rewritten") and despite the output not containing the vector column.

SELECT id FROM items WHERE label = 'x'
ORDER BY l2_distance(embedding, ARRAY[...]) ASC LIMIT 2   -- brute-forced before this PR

Root cause

With a Filter present, DataFusion materializes the raw vector column in an intermediate projection to feed the Sort, then trims it with an outer projection:

Projection: id                      ← real output (no vector)
  Sort: l2_distance(vector, lit)
    Projection: id, vector          ← vector exists only to feed the Sort
      Filter: label = 'x'
        TableScan

The Sort-anchored match judges producibility on the inner projection, sees the vector column (which the node can never produce — meta.schema derives from the lookup provider, which stores no vectors), and declines via Projection::try_new(...).ok()?. Mechanically the same fallback as the intended #508 behavior for SELECT * — but over-conservative here, since the outer projection discards the vector.

Without a WHERE clause, projection pushdown eliminates the intermediate projection, the passthrough arm fires, and the gap goes unnoticed — which is why test_bare_select_inline_distance_still_rewrites (no WHERE) never caught it.
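
To make the root cause concrete, here is a minimal sketch of the producibility judgment, not the rule's actual code. The function `producible` and the `node_schema` contents are hypothetical stand-ins; the only thing taken from the description above is that the node's schema carries no vector column. Checked against the inner projection the test fails; checked against the outer projection it passes.

```rust
/// Hypothetical stand-in for the producibility check: every requested column
/// must exist in the schema the index-backed node can produce.
fn producible(requested: &[&str], node_schema: &[&str]) -> bool {
    requested.iter().all(|col| node_schema.contains(col))
}

fn main() {
    // Assumed node schema: derived from the lookup provider, so no vector column.
    let node_schema = ["id", "label"];

    // Inner projection: carries the vector only to feed the Sort.
    let inner = ["id", "vector"];
    // Outer projection: the query's real output.
    let outer = ["id"];

    // Judging on the inner projection declines (the pre-fix behavior).
    assert!(!producible(&inner, &node_schema));
    // Judging on the outer projection accepts (the post-fix behavior).
    assert!(producible(&outer, &node_schema));

    println!("inner producible: {}", producible(&inner, &node_schema));
    println!("outer producible: {}", producible(&outer, &node_schema));
}
```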

Fix

Extend the Projection-anchored arm to recognize the trimmed shape Projection → Sort → Projection(inner) → [Filter →] TableScan and judge producibility on the outer projection (the query's real output). Deliberately narrow:

  • SELECT * / SELECT id, vector with WHERE: still fall back (#508 preserved, re-tested)
  • Aliased-distance shape (SELECT …, dist AS d … ORDER BY d): the new arm finds no distance in the outer exprs, declines, and the Sort arm handles it exactly as before
  • Deeper nesting (aggregates, joins, stacked projections): blocked by the inner.input ∈ {TableScan, Filter} guard → fallback as before
  • Already-rewritten plans: all arms decline on Extension inputs → no double-fire across optimizer passes

The new arm's firing condition is precisely the previously-dead zone — there is no existing rewrite behavior to displace. Failure-mode asymmetry is unchanged: an unenumerated plan shape lands in the decline path (correct, unindexed), never in silent wrong results.
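
The narrowness described above can be sketched with a toy plan type. This is an illustration under stated assumptions, not the rule's implementation: `Plan`, `match_trimmed`, and `trimmed` are hypothetical names, the toy is not DataFusion's `LogicalPlan`, the distance expression and the alias check are omitted, and the guard is modeled as "inner input is a TableScan, or a Filter directly over a TableScan".

```rust
// Toy logical plan: a hypothetical stand-in, not DataFusion's LogicalPlan.
#[allow(dead_code)]
#[derive(Debug)]
enum Plan {
    TableScan,
    Filter(Box<Plan>),
    Aggregate(Box<Plan>),
    Projection { cols: Vec<String>, input: Box<Plan> },
    Sort(Box<Plan>), // distance expression omitted in this sketch
    Extension,       // stands for an already-rewritten index node
}

/// Returns the outer output columns when the trimmed shape matches and the
/// outer projection is producible; returns None (decline) otherwise.
fn match_trimmed(plan: &Plan, node_schema: &[&str]) -> Option<Vec<String>> {
    let Plan::Projection { cols: outer, input } = plan else { return None; };
    let Plan::Sort(sort_input) = &**input else { return None; };
    let Plan::Projection { input: inner_input, .. } = &**sort_input else { return None; };

    // Guard: only TableScan or Filter-over-TableScan may sit under the inner projection.
    let guard_ok = match &**inner_input {
        Plan::TableScan => true,
        Plan::Filter(below) => matches!(&**below, Plan::TableScan),
        _ => false,
    };
    if !guard_ok {
        return None;
    }

    // Producibility is judged on the OUTER projection, the query's real output.
    if outer.iter().all(|c| node_schema.contains(&c.as_str())) {
        Some(outer.clone())
    } else {
        None
    }
}

/// Builds Projection(outer) -> Sort -> Projection(id, vector) -> inner_input.
fn trimmed(outer_cols: &[&str], inner_input: Plan) -> Plan {
    let inner = Plan::Projection {
        cols: vec!["id".to_string(), "vector".to_string()],
        input: Box::new(inner_input),
    };
    Plan::Projection {
        cols: outer_cols.iter().map(|c| c.to_string()).collect(),
        input: Box::new(Plan::Sort(Box::new(inner))),
    }
}

fn main() {
    let schema = ["id", "label"]; // assumed: the node stores no vectors

    // SELECT id ... WHERE ... ORDER BY distance: the arm fires.
    let with_where = trimmed(&["id"], Plan::Filter(Box::new(Plan::TableScan)));
    assert_eq!(match_trimmed(&with_where, &schema), Some(vec!["id".to_string()]));

    // SELECT id, vector ... WHERE ...: outer output needs the vector, decline.
    let explicit_vector = trimmed(&["id", "vector"], Plan::Filter(Box::new(Plan::TableScan)));
    assert_eq!(match_trimmed(&explicit_vector, &schema), None);

    // Deeper nesting is blocked by the guard.
    let nested = trimmed(&["id"], Plan::Aggregate(Box::new(Plan::TableScan)));
    assert_eq!(match_trimmed(&nested, &schema), None);

    // An already-rewritten plan is not matched again.
    assert_eq!(match_trimmed(&Plan::Extension, &schema), None);
}
```

Every branch that does not match the enumerated shape returns `None`, which mirrors the failure-mode asymmetry: an unrecognized plan stays unindexed rather than being rewritten incorrectly.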

Tests

New tests/orderby_distance_trimmed.rs (5 tests), using a ducklake-style rowid: Int64 addressing key (vs the parquet-style _key in the existing fixtures), so the key-column-agnostic path is covered too:

  • test_bare_orderby_with_where_rewrites: fails on the unfixed rule (the regression guard)
  • multi-column output variant, SELECT * + WHERE fallback, explicit-vector + WHERE fallback, aliased-distance + WHERE still rewrites

Full suite green (79 existing + 5 new). Also verified end-to-end against a runtimedb local server on the ducklake storage backend: the production shape now plans USearchExec filtered=true for both an existing-column index and an embedding-backed index, with correct results; SELECT * fallback intact.

A k-NN query whose distance appears only in ORDER BY (not the SELECT
list) silently fell back to brute-force whenever a WHERE clause was
present. With a Filter, DataFusion materializes the raw vector column
in an intermediate projection to feed the Sort, then trims it with an
outer projection:

  Projection: id                      <- real output (no vector)
    Sort: l2_distance(vector, lit)
      Projection: id, vector          <- vector only feeds the Sort
        Filter: ...
          TableScan

The Sort-anchored match judged producibility on the inner projection,
saw the vector column the node cannot produce, and declined. Without a
WHERE clause projection pushdown eats the intermediate projection, so
the passthrough arm fired and the gap went unnoticed.

Extend the Projection-anchored arm to recognize the trimmed shape and
judge producibility on the OUTER projection (the query's real output).
SELECT * / SELECT id, vector still fall back (#508 behavior preserved,
re-tested); the aliased-distance shape still rewrites via the Sort arm
(the new arm finds no distance in the outer exprs and declines).

New tests use a ducklake-style rowid Int64 addressing key, covering
the key-column-agnostic path alongside the parquet-style _key fixtures.
test_bare_orderby_with_where_rewrites fails on the unfixed rule.

@claude claude Bot left a comment

Verified the trimmed-shape arm by inspection:

  • Distance is extracted directly from sort.expr, so the new arm's producibility judgment correctly uses the outer projection (real output) while table/filter extraction uses inner.input.
  • Aliased-distance, SELECT *, explicit-vector, and deeper-nesting shapes all land in a decline path (None) — no valid-but-wrong rewrite is reachable, preserving the failure-mode asymmetry.
  • New tests cover the regression and the preserved #508 fallbacks meaningfully.

LGTM. (Note: could not run cargo test in the review environment; relying on the PR's reported green suite.)

@anoop-narang anoop-narang merged commit 3d8c7a2 into main Jun 4, 2026
6 checks passed
@anoop-narang anoop-narang deleted the fix/orderby-distance-not-projected branch June 4, 2026 12:03