
pulley/cranelift: opcode fusion at call_indirect lazy-init dispatch tail#13446

Closed
matthargett wants to merge 23 commits into
bytecodealliance:mainfrom
rebeckerspecialties:claude/pulley-fusion-xband-brif-upstream


@matthargett matthargett commented May 22, 2026

TL;DR

Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the call_indirect lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a call_indirect{1,2,3,4} family mirroring direct-call call{1,2,3,4}. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 E-core, Apple Watch SE2, and M4 E-core. iPhone XS E-core is mixed, and the reason is visible in hand-rolled aarch64 assembly microbenchmarks: branch-prediction pressure is the cross-microarch variable.

Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and brings Pulley within 1.16× of WAMR on xmrsplayer on Apple Watch SE2, the closest cross-device result in our matrix.

Dependency

Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against main includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on is_eagerly_initialized_funcref_table (the predicate added in #13445), so it only fires when the table-mutability proof holds.

Stack

13 commits on top of #13445:

  • Phases 1–3: collapse band + brif + 2 xloads at the call_indirect lazy-init tail (5 Pulley dispatches → 2 per call_indirect site).
  • Phase 4: call_indirect{1,2,3,4} opcodes mirror direct-call call{1,2,3,4}. Inst::IndirectCall bundles first 4 integer ABI args into the call opcode instead of synthesising xmovs via regalloc reg_fixed_use.
  • Correctness: the fused handlers trap on null. This addresses a slow-path-aliasing review concern: sinking the continuation-block loads via sink_pure_inst broke the lazy-init slow path's rejoin, so trapping makes the handlers fail closed under the predicate.

Wallclock medians, N=10, phase-4 vs table-mutability-tracking baseline

BENCH_TARGET_MS=2000; .utility QoS on iOS / taskpolicy -b on M4.

| workload | iPhone 12 (A14) | iPhone XS (A12) | Watch SE2 (S8) |
| --- | --- | --- | --- |
| call_indirect | −2.48 % | +0.18 % | −0.50 % |
| vtable_bi | −7.45 % | +4.84 % | −4.26 % |
| vtable_poly4 | −8.61 % | −2.96 % | −4.76 % |
| vtable_poly6 | −5.32 % | +5.41 % | −4.64 % |
| xmrsplayer | +0.26 % | −6.04 % | +0.25 % |
| graphql (AS) | −0.13 % | −3.63 % | −0.74 % |
| graphql (Porffor) | +2.13 % | +0.21 % | +0.58 % |

PMU buckets (single 12 s xctrace CPU Counters per workload)

  • A14 Icestorm vtable suite: 20–38 % Processing, 4–18 % Discarded.
  • A14 Icestorm graphql + call_indirect: 9–13 % Processing, 17–41 % Discarded. The Discarded share looks like interpreter-loop mispredict pressure, but the signal is noisy and I couldn't pin it down.
  • M4 Sawtooth, every workload: 33–47 % Processing — back-end load-use latency on the dispatch tail dominates Sawtooth's wider issue width. The wide spread here is probably due to running the measurement and dev stack on the device itself.

The common bottleneck across A14 + M4 is Processing (back-end-bound). I can't measure these advanced CPU counters on iPhone XS (A12) or Apple Watch; there I can only get wallclock time and power draw.

Verification

  • 2237 / 2237 cranelift filetests
  • 13 / 13 craneliftpulley_call_* integration tests
  • 21 min cargo fuzz run differential --no-default-features with ALLOWED_ENGINES=pulley,wasmtime — 0 crashes / 0 divergences

Extra credit

Cross-device measurement harness I built for this bake-off across WASM runtimes: rebeckerspecialties/wasm-benchmark#1

I submitted two PRs to WAMR to support exceptions and relaxed SIMD, so that it can run more of the benchmarks and generally have functional parity without losing its performance edge.

@matthargett matthargett requested review from a team as code owners May 22, 2026 02:26
@matthargett matthargett requested review from cfallin and fitzgen and removed request for a team May 22, 2026 02:26
Add `ModuleTranslation::tables_mutated`, a `SecondaryMap<TableIndex, bool>`
populated during `ModuleEnvironment::translate` recording whether any
function in the module mutates a given table at runtime via `table.set`
/ `table.fill` / `table.copy` (as dest) / `table.grow` / `table.init`.
Imported tables are conservatively marked mutated. Active `elem`
segments at instantiation time are part of initial state, not mutations.

O(total opcodes) extra pass over each function body. Groundwork for
follow-on call_indirect optimizations gated on the predicate; nothing
consumes the bit in this commit.
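
The mutation-tracking pass described above can be sketched roughly as follows. This is a minimal model, not the code in the commit: the `Op` enum and the free function `tables_mutated` are hypothetical stand-ins (the real pass walks the module's operators and stores its result in a `SecondaryMap<TableIndex, bool>`), and a plain `Vec<bool>` indexed by table number is assumed here.

```rust
/// Hypothetical, simplified operator set used only for this sketch.
#[derive(Clone, Copy)]
pub enum Op {
    TableSet { table: usize },
    TableFill { table: usize },
    TableCopy { dst: usize, src: usize },
    TableGrow { table: usize },
    TableInit { table: usize },
    TableGet { table: usize },
    Other,
}

/// Returns one flag per table: `true` if the table must be treated as mutated.
/// `imported[i]` says whether table `i` is imported; imported tables start out
/// conservatively marked as mutated.
pub fn tables_mutated(imported: &[bool], bodies: &[Vec<Op>]) -> Vec<bool> {
    let mut mutated = imported.to_vec();
    // Single linear pass over every operator of every function body.
    for op in bodies.iter().flatten() {
        match *op {
            Op::TableSet { table }
            | Op::TableFill { table }
            | Op::TableGrow { table }
            | Op::TableInit { table } => mutated[table] = true,
            // Only the destination of a copy is written.
            Op::TableCopy { dst, .. } => mutated[dst] = true,
            // Reads and unrelated operators leave the flags alone.
            Op::TableGet { .. } | Op::Other => {}
        }
    }
    mutated
}
```

Reading a table (including being the source of `table.copy`) never sets the flag, which is why a module that only dispatches through a table keeps it eligible for the later elisions.
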
When `call_indirect` resolves to a constant index into a provably
immutable funcref table whose contents are statically known from
`elem` segments, rewrite the call to a direct `call F` at lowering
time. Skips all per-dispatch checks (bounds, null, sig) and replaces
the indirect jump with a direct branch.

Gated on `is_immutable_funcref_table(table_idx)` (= predicate from
the previous commit + statically-known table contents).
…ables

When a funcref table is provably immutable AND every entry in its
elem segments has the same function signature as the call_indirect's
type annotation, the runtime signature check is statically redundant
and is elided in `translate_call_indirect`.
When a funcref table is provably immutable AND none of its precomputed
elem-segment entries are null, the runtime null check after the
funcref load is statically redundant and is elided. Distinct from the
sig-check elision: this targets tables that mix sigs but never contain
null.
…wering

Add `ModuleTranslation::precomputed_funcref_table_contents`,
populated during finalization from active elem segments applied at
table-init time. Cranelift's lowering uses this to resolve constant
call_indirect indices and to check the "no null / uniform sig"
predicates the prior elisions depend on.
For provably non-growable funcref tables (`!tables_mutated` excludes
`table.grow`), the table size is fixed at instantiation and the
per-call_indirect bounds-check load can be replaced with a constant
fold using `precomputed_funcref_table_contents.len()`.
`crates/environ/tests/table_mutability.rs`: 12 cases covering the
mutation-tracking predicate across `table.set`/`fill`/`copy`/`grow`/
`init`, imported tables, multi-table modules, and active-elem-segment
behavior.
Three soundness corrections to the call_indirect elision chain:

1. `is_immutable_funcref_table` previously returned true when the
   table had no per-function `table.set` etc. uses but had a passive
   elem segment whose `elem.init` could land at runtime. Track the
   passive-segment dest tables and treat them as potentially mutated.
2. The constant-index direct-call rewrite assumed the resolved
   funcref's vmctx matched the caller's; correct it to load the
   callee's `vmctx` from the precomputed `VMFuncRef`.
3. Null-check elision must NOT fire when the precomputed table
   contains the tagged-null pattern (slot value `1`); add that case.

Disas filetests cover each scenario.
When `is_eagerly_initialized_funcref_table(table_idx)` (= immutable,
fully precomputed, no null, no tagged-null) holds:

- `Instance::initialize_tables` eagerly resolves and stores the full
  `VMFuncRef *` for every slot at instantiation, instead of leaving
  the tagged lazy-init bit set.
- Cranelift's `call_indirect` lowering tests the masked funcref
  (`band v, -2`) for null instead of testing the raw slot value; the
  brif's null branch is provably unreachable at runtime.

This is the predicate the Pulley fusion stack downstream is gated on.
The c1-8 attempt at fully eliding the lazy-init brif (egraph-folds
it to `trapz`) reshaped the Pulley dispatch sequence in a way that
*increased* Discarded-bucket pressure on iPhone 12 Icestorm by ~14 %
without a wallclock improvement at N=3. The c1-7 form (brif retained,
mask + tagged-pointer test) is the floor we keep.

Disas snapshot rewritten to the c1-7 form.
…ions

Active elem segments whose memory range extends past the destination
table's current size at instantiation behave as a runtime mutation:
the trailing entries get dropped, but only after they've been
considered for table-resize semantics. Treat the source table as
mutated when the module contains such a segment so the
call_indirect elisions don't over-fire.

Adds an integration test (`tests/all/leftover_elem_segment_soundness.rs`)
+ two disas filetests covering the leftover-segment shape.
Add `xband{32,64}_s8_br_if_{x32,x64,not_x32,not_x64}` ops: each one
computes `dst = src & sign_extend(mask)` unconditionally, then
conditionally branches by `offset` on the original `src` (or, for the
`_not` variants, on its zero/non-zero inverse).

Emitted by Cranelift at call_indirect lazy-init brif sites where the
funcref's init-bit mask and the brif's null-check both read the same
loaded value. Saves one match_loop dispatch per call_indirect site.
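
The semantics of one fused band+branch op can be sketched as a pure function. This is an illustration of the description above, not the interpreter handler: the function name and the `invert` flag (standing in for the `_not` variants) are hypothetical, and the branch offset is reduced to a "taken" boolean.

```rust
/// Sketch of a 64-bit `xband64_s8_br_if_*`-style op.
/// Returns `(dst, taken)`: the masked value and whether the branch is taken.
pub fn xband64_s8_br_if(src: u64, mask: i8, invert: bool) -> (u64, bool) {
    // The 8-bit mask is sign-extended to the operand width, so -2 becomes
    // 0xFFFF_FFFF_FFFF_FFFE and clears only the low tag bit.
    let dst = src & (mask as i64 as u64);
    // The condition is evaluated on the original `src`, not on `dst`.
    let taken = (src != 0) != invert;
    (dst, taken)
}
```

For a tagged pointer such as `0x1001` with mask `-2`, the sketch yields the untagged `0x1000` and takes the forward branch; with `invert` set it falls through instead.
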
Add `Lower::sink_pure_inst(ir::Inst)`: mark a side-effect-free CLIF
instruction as absorbed so its standalone lowering is skipped and a
later MachInst (e.g. a fused band+brif) can claim its result vreg
directly.

The reverse-iteration order in lower-block guarantees the terminator
that absorbs the inst lowers first, so the absorbed inst is still
present when the terminator looks it up.
Pattern-match `cond = band v, -2; brif cond, taken, not_taken` and
lower it to `MInst::BandBrIf` (forward + inverted variants), which
the emit side encodes as the `xband{32,64}_s8_br_if_*` pulley ops.

The `band -2` is sunk via `sink_pure_inst` so the fused MachInst
defs its result vreg. Gated on the band's mask being exactly `-2`
in the appropriate width — never fires on user-wasm `band v, -2`
shapes because the IR rewrite under `is_eagerly_initialized_funcref_table`
is the only producer of this exact pattern at a brif site.

Adds `tests/disas/pulley-call-indirect-band-brif-fusion.wat`.
Add `xfuncref_dispatch_{x64,not_x64,x32,not_x32}` ops: each one
takes a pre-masked funcref pointer, loads `wasm_call` and
callee `vmctx` from offsets `offset_code`/`offset_vmctx`, and
conditionally branches by `offset` on whether the pointer is null.

The branch direction (`x64` = branch on non-null, `not_x64` = branch
on null) is chosen by MachBuffer's fall-through optimization at emit
time.

Consumed by phase-2 Cranelift fusion (next commit).
Add `LowerBackend::pre_lower(ctx)`, called once before
`lower_clif_block` iteration begins. Backends override it to scan the
whole function and mark pure loads (or any `is_pure_for_egraph`-
satisfying inst) as absorbed via `sink_pure_inst`.

Required for the phase-2 funcref-dispatch fusion which absorbs the
two `VMFuncRef` field loads from the continuation block into the
brif's MachInst — a cross-block sink that can't be expressed in the
per-block reverse-iteration lowering order.
Phase-2 fusion. Matches the canonical call_indirect lazy-init shape:

    band v, -2 -> cond
    brif cond, continuation([cond]), null_block([])
    continuation(funcref_ptr):
      code  = load funcref_ptr + offset_code
      vmctx = load funcref_ptr + offset_vmctx

`pre_lower` sinks the two continuation-block loads; `lower_branch`
on the brif emits `MInst::FuncrefDispatch` (encoded as
`xfuncref_dispatch_*`). The band stays as a separate `xband_s8` op
because its result feeds the brif test and the continuation block
param.

Dispatch tail at the call_indirect lazy-init site shrinks from
5 Pulley dispatches to 3 (band, fused dispatch, call_indirect).
The phase-2 pattern matcher checked `Imm64(-2)` directly without
canonicalising to the band's value type. On pulley32, `iconst -2 :
i32` is stored as `Imm64(0xFFFFFFFE)` (the i64 representation of the
i32 value), so the literal `-2` compare failed and phase 2 never
fired on `arm64_32-apple-watchos`.

Replace the check with width-aware `is_minus_two_for(imm, ty)` that
matches both `Imm64(-2)` (i64 result) and `Imm64(0xFFFFFFFE)`
(i32 result). Adds `tests/disas/pulley-fusion-fires-32bit.wat`.
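
A width-aware check along these lines might look like the sketch below. The `Ty` enum is a hypothetical stand-in for the CLIF value type, the immediate is modelled as a raw `i64` payload, and accepting a sign-extended `-2` for the 32-bit case is an assumption of the sketch rather than something the commit message states.

```rust
/// Hypothetical stand-in for the integer types involved.
#[derive(Clone, Copy)]
pub enum Ty {
    I32,
    I64,
}

/// True if `imm` (raw 64-bit immediate payload) denotes -2 at width `ty`.
pub fn is_minus_two_for(imm: i64, ty: Ty) -> bool {
    match ty {
        // Compare only the low 32 bits, so the zero-extended form
        // 0xFFFF_FFFE is recognised as the i32 value -2.
        Ty::I32 => imm as u32 == 0xFFFF_FFFE,
        // At 64 bits the payload must be exactly -2.
        Ty::I64 => imm == -2,
    }
}
```

The key property is that `0xFFFF_FFFE` matches only at 32 bits; at 64 bits it is the positive number 4294967294 and must not trigger the fusion.
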
Adds 7 `tests/disas/pulley-fusion-*.wat` filetests:

- Gating (must-not-fire): user-wasm `band v, -2 + br_if`,
  mutable-table, table.set/fill/copy/grow, runtime sig check.
- Firing: pulley32 target, multi-site (two call_indirects in one
  function), return_call_indirect.

Test shapes drawn from known fusion-soundness bug classes in V8,
WAMR, wasm3, WasmEdge, Hermes, ChakraCore, Luau; citations in each
test's docstring.
Add `xband_funcref_dispatch_{x64,not_x64,x32,not_x32}`: same shape
as `xfuncref_dispatch_*` but consumes the UNMASKED funcref pointer
and writes the masked value to `dst_masked` so the brif's
block-call-arg copy to the continuation block param still has a
defined producer.

Cranelift's `try_fuse_funcref_dispatch` prefers phase 3 (also
absorb the standalone `xband_s8`) when the band has no other uses,
falling back to phase 2 (band stays standalone) otherwise.

Dispatch tail at the call_indirect lazy-init site shrinks to
2 Pulley dispatches (fused op + call_indirect).
Mirror of the direct-call `call{1,2,3,4}` family: each new op
combines `xmov xN, argN` ABI fixups with the indirect call. Reads
arg values before writing the ABI registers so the sequence is safe
when an argN aliases the corresponding ABI register.

`call_indirect1 dst, arg1`:
    x0 = state[arg1]
    lr = pc
    pc = state[dst]

Saves up to N Pulley dispatches per call_indirect site (one per
moved arg). In practice at least one — the callee vmctx ABI fixup.

Cranelift wiring in the next commit.
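
The read-before-write rule matters most when two arguments need to swap registers. A minimal sketch of a two-argument variant, assuming a hypothetical `State` struct with 32 integer registers and assuming `x0`/`x1` are the first two ABI argument registers (extrapolated from the `call_indirect1` pseudocode above):

```rust
/// Hypothetical interpreter state for this sketch only.
pub struct State {
    pub x: [u64; 32],
    pub pc: u64,
    pub lr: u64,
}

/// Sketch of `call_indirect2 dst, arg1, arg2`.
/// `return_pc` is the address of the instruction after the call.
pub fn call_indirect2(s: &mut State, dst: usize, arg1: usize, arg2: usize, return_pc: u64) {
    // Read the target and all argument values before any register is written,
    // so aliasing between sources and ABI registers cannot corrupt them.
    let (target, a1, a2) = (s.x[dst], s.x[arg1], s.x[arg2]);
    s.x[0] = a1;
    s.x[1] = a2;
    s.lr = return_pc;
    s.pc = target;
}
```

With `arg1 = x1` and `arg2 = x0` this performs a correct swap, whereas a naive sequence of two `xmov`s would overwrite `x0` before it is read.
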
Extend `Inst::IndirectCall`'s `info.dest` from `XReg` to
`PulleyCallIndirect { target, args: SmallVec<[XReg; 4]> }`, parallel
to `PulleyCall`. `gen_call_ind_info` pulls the first 0–4 integer
args from `uses` (where they were going through regalloc's
`reg_fixed_use`, synthesising an `xmov` each) into `args`, where
they flow as free reg uses and the emitted `call_indirect{1,2,3,4}`
opcode moves them at call time.

The emit side picks the narrowest op after the same "drop args
already in their ABI register" loop used by direct calls. Phase-3's
`xband_funcref_dispatch_*` writing `dst_vmctx` into a free register
+ `call_indirect1 dst_code, dst_vmctx` is the headline shrink (one
fewer Pulley dispatch per call_indirect on the eager-table fast
path).

Filetest snapshots updated for the new `dest` shape.
@matthargett matthargett force-pushed the claude/pulley-fusion-xband-brif-upstream branch from b21631f to 68d3e70 Compare May 22, 2026 03:26
@matthargett matthargett force-pushed the claude/pulley-fusion-xband-brif-upstream branch 2 times, most recently from 80b51f7 to 8a9e7cc Compare May 22, 2026 03:40
Codex review on the rebeckerspecialties wasmtime fork PR pointed out
that phase-2/3's continuation-block load absorption breaks the
lazy-init slow path's correctness: the slow path's libcall rejoins
`continuation_block` via a block param, and after absorption the
loads are gone — `call_indirect` would see uninitialized
`dst_code`/`dst_vmctx` if the slow path is ever reached.

Fusion is gated on `is_eagerly_initialized_funcref_table` so the
slow path is unreachable at runtime, but the previous handler's
`ControlFlow::Continue(())` on null was advertised as defence-in-
depth and was itself broken. Replace it with `done_trap` in the 8
affected handlers (4 forward + 4 `_not` variants across x64/x32 ×
xfuncref_dispatch/xband_funcref_dispatch). `offset` on the `_not`
variants becomes vestigial; kept for encoding-shape parity.
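
The fail-closed behaviour can be sketched as a function that either produces the three results of the fused op or traps. The memory model is hypothetical (a slice of 8-byte words addressed by byte offset), as are the `Trap` type and the function signature; the sketch also assumes the null test is applied to the masked pointer.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Trap {
    NullFuncref,
}

/// Sketch of an `xband_funcref_dispatch_*`-style handler.
/// `mem` models memory as 8-byte words; `src` is the unmasked funcref pointer
/// (a byte address, possibly with the low tag bit set).
/// Returns `(dst_masked, dst_code, dst_vmctx)` or traps on null.
pub fn xband_funcref_dispatch(
    mem: &[u64],
    src: u64,
    offset_code: u64,
    offset_vmctx: u64,
) -> Result<(u64, u64, u64), Trap> {
    let masked = src & !1; // equivalent to `band src, -2`
    if masked == 0 {
        // Null is unreachable under the gating predicate; trap instead of
        // continuing with undefined code/vmctx values.
        return Err(Trap::NullFuncref);
    }
    let word = |off: u64| mem[((masked + off) / 8) as usize];
    Ok((masked, word(offset_code), word(offset_vmctx)))
}
```

Returning an error on null means a violated predicate surfaces as a trap at the dispatch site rather than as a call through uninitialized registers.
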
@matthargett matthargett force-pushed the claude/pulley-fusion-xband-brif-upstream branch from 8a9e7cc to dce728e Compare May 22, 2026 04:08
@matthargett matthargett deleted the claude/pulley-fusion-xband-brif-upstream branch May 22, 2026 04:11
@matthargett

Reopened as #13447 (renamed branch). Same commits, same code; just a branch-name cleanup. CI is running there now.
