Skip to content

Limit pushdown with filters + sort-aware reverse output#7

Draft
ch-sc wants to merge 2 commits into
0.65.0-atlasfrom
limit-pushdown
Draft

Limit pushdown with filters + sort-aware reverse output#7
ch-sc wants to merge 2 commits into
0.65.0-atlasfrom
limit-pushdown

Conversation

@ch-sc
Copy link
Copy Markdown
Collaborator

@ch-sc ch-sc commented May 19, 2026

Summary

Two related perf gaps surfaced while investigating reverse order scans:

  1. LIMIT is silently dropped when a filter is set. ScanBuilder bailed outright on filter + limit, and VortexOpener worked around it by gating with_limit on filter.is_none(). The result was that SELECT … WHERE x > k LIMIT n always scanned every file end-to-end.
  2. ORDER BY ts DESC against a table declared WITH ORDER (ts ASC) doesn't exploit reverse scanning. VortexSource::with_reversed already existed and ScanBuilder already supported reverse intra-file scans, but no FileSource::try_reverse_output hook was wired up. So DataFusion always inserted a full SortExec instead of letting the source absorb the reversal.

Together they meant that the canonical "latest N matching rows" query (e.g. SELECT * FROM t WHERE x > k ORDER BY ts DESC LIMIT 100 over time-partitioned files) read every file and every row. After the changes in this PR it typically reads only the last file, short-circuiting once the limit is satisfied.

SELECT * FROM t WHERE x > k ORDER BY ts DESC LIMIT 100 over time-partitioned
files now executes as:

  1. Planner sees ORDER BY ts DESC, table's output ordering is ts ASC. Calls VortexSource::try_pushdown_sort.
  2. We return Exact with reversed=true. DataFusion reverses file_groups (so the latest file comes first) and removes the SortExec.
  3. Planner pushes filter x > k via try_pushdown_filters (already in place).
  4. Planner pushes LIMIT 100 via with_fetch on DataSourceExec (already in place; passes through to base_config.limit).
  5. Opener pushes the limit unconditionally now (remove the filter.is_none() guard).
  6. Scan starts processing files in reverse order. First split's filter+projection produces, say, 60 rows. Atomic reserves 60, remaining = 40.
  7. Second split sees remaining = 40, reserves 40, exits early after enough rows. Done.
  8. Third+ splits / later files never start.

reversed is now surfaced in fmt_extra so plan rendering shows
reversed: true, which makes the pushdown observable in tests.

Edge cases.

  • Empty requested ordering or no declared orderings → Unsupported.
  • Multiple declared orderings → accept if any one matches in reverse.
  • Partial prefix matches → is_reverse is strict; we return Unsupported, the planner falls back to a full SortExec. Correct, just not optimal.
  • Currently-reversed source asked to reverse again → flip the flag back (reversed = !reversed).
  • Filter pushdown interactions — none. Sort pushdown is a separate optimizer stage; we just clone with a different flag.

Risks & limitations

Risk Mitigation
CAS livelock under heavy contention compare_exchange_weak allows spurious failures; bounded by physical concurrency. 32-thread stress test (reserve_up_to_concurrent_never_oversubscribes) exercises this.
Soft early-stop over-fetches on a slow split Outer LimitExec truncates. We pay extra IO, never wrong rows.
ORDER BY ts DESC, id ASC (multi-key with mixed direction) is_reverse is strict; we return Unsupported, planner falls back to SortExec. Correct, just not optimal.
DataFusion 52 leaves a SortExec even after reverse pushdown We return Inexact (Parquet does the same on 52). The TopK still benefits from fetch pushdown and from the source emitting locally-sorted streams.

API Changes

  • VortexSource::with_reversed(bool) -> Self
  • VortexSource::try_reverse_output(&self, &[PhysicalSortExpr], &EquivalenceProperties) -> Result<SortOrderPushdownResult<Arc<dyn FileSource>>>
  • ScanBuilder::with_reversed(bool) -> Self and ScanBuilder::reversed(&self) -> bool (already existed in the type, now stamped into the public-api lock).

public-api.lock files regenerated via cargo xtask public-api.

Testing

vortex-scan/src/tasks.rs (CAS unit tests):

  • reserve_up_to_drains_counter — sequential reservations drain to zero.
  • reserve_up_to_concurrent_never_oversubscribes — 32-thread stress proves the
    CAS loop never lets the counter underflow or hand out more than the initial
    budget.

vortex-file/src/tests.rs (filter + limit at the file level):

  • limit_with_filter_truncates_globally — multi-chunk scan with a filter and
    limit=7 returns exactly 7 rows. (Previously this combination would have
    bailed in ScanBuilder::prepare.)
  • limit_zero_with_filter_yields_no_rowslimit=0 short-circuits even when
    a filter is present.

vortex-datafusion/src/persistent/mod.rs (sort pushdown):

  • reverse_pushdown_swaps_file_orderWITH ORDER (ts ASC) + ORDER BY ts DESC LIMIT k produces the largest k values, the plan shows
    SortExec(TopK) and the source is marked reversed: true.
  • reverse_pushdown_rejected_for_unrelated_orderingWITH ORDER (a ASC),
    query asks ORDER BY b DESC. Plan keeps the Sort and source is not
    reversed.
  • reverse_pushdown_no_ordering_declared — no WITH ORDER → no reverse
    pushdown.
  • reverse_pushdown_then_filter — filter pushdown still works alongside
    reverse sort pushdown.
  • order_by_desc_limit_returns_largest_rows — end-to-end synergy: 5 files of
    100 rows each, filter + ORDER BY ts DESC LIMIT 10 returns ts [490..500]
    and the source is reversed in the plan.

ch-sc added 2 commits May 19, 2026 17:37
Signed-off-by: Christoph Schulze <christoph.schulze@polygon.io>
Signed-off-by: Christoph Schulze <christoph.schulze@polygon.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant