pulley/cranelift: opcode fusion at call_indirect lazy-init dispatch tail #13446
Closed
matthargett wants to merge 23 commits into
Add `ModuleTranslation::tables_mutated`, a `SecondaryMap<TableIndex, bool>` populated during `ModuleEnvironment::translate` recording whether any function in the module mutates a given table at runtime via `table.set` / `table.fill` / `table.copy` (as dest) / `table.grow` / `table.init`. Imported tables are conservatively marked mutated. Active `elem` segments at instantiation time are part of initial state, not mutations. O(total opcodes) extra pass over each function body. Groundwork for follow-on call_indirect optimizations gated on the predicate; nothing consumes the bit in this commit.
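A rough sketch of the mutation-tracking pass this commit describes, not the actual `ModuleEnvironment` code. `TableOp` and the free function `tables_mutated` are hypothetical stand-ins, and a plain `Vec<bool>` replaces `SecondaryMap<TableIndex, bool>`:

```rust
/// Hypothetical, simplified stand-in for the table-touching Wasm operators.
#[derive(Clone, Copy, Debug)]
pub enum TableOp {
    Get { table: u32 },
    Set { table: u32 },
    Fill { table: u32 },
    Copy { dst: u32, src: u32 },
    Grow { table: u32 },
    Init { table: u32 },
    CallIndirect { table: u32 },
}

/// One flag per table: `true` if the table may be mutated at runtime.
/// `imported` lists the indices of imported tables, which are marked
/// conservatively. Active elem segments are not modelled because they
/// count as initial state, not as mutations.
pub fn tables_mutated(num_tables: u32, imported: &[u32], bodies: &[Vec<TableOp>]) -> Vec<bool> {
    let mut mutated = vec![false; num_tables as usize];
    for &t in imported {
        mutated[t as usize] = true;
    }
    // Single pass over every opcode of every function body.
    for op in bodies.iter().flatten() {
        let written = match *op {
            TableOp::Set { table }
            | TableOp::Fill { table }
            | TableOp::Grow { table }
            | TableOp::Init { table } => Some(table),
            // Only the destination of `table.copy` is written.
            TableOp::Copy { dst, .. } => Some(dst),
            TableOp::Get { .. } | TableOp::CallIndirect { .. } => None,
        };
        if let Some(t) = written {
            mutated[t as usize] = true;
        }
    }
    mutated
}
```

In this sketch a table that is only read, or only used as the source of a `table.copy`, keeps its flag cleared.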
When `call_indirect` resolves to a constant index into a provably immutable funcref table whose contents are statically known from `elem` segments, rewrite the call to a direct `call F` at lowering time. Skips all per-dispatch checks (bounds, null, sig) and replaces the indirect jump with a direct branch. Gated on `is_immutable_funcref_table(table_idx)` (= predicate from the previous commit + statically-known table contents).
…ables When a funcref table is provably immutable AND every entry in its elem segments has the same function signature as the call_indirect's type annotation, the runtime signature check is statically redundant and is elided in `translate_call_indirect`.
When a funcref table is provably immutable AND none of its precomputed elem-segment entries are null, the runtime null check after the funcref load is statically redundant and is elided. Distinct from the sig-check elision: this targets tables that mix sigs but never contain null.
…wering Add `ModuleTranslation::precomputed_funcref_table_contents`, populated during finalization from active elem segments applied at table-init time. Cranelift's lowering uses this to resolve constant call_indirect indices and to check the "no null / uniform sig" predicates the prior elisions depend on.
For provably non-growable funcref tables (`!tables_mutated` excludes `table.grow`), the table size is fixed at instantiation and the per-call_indirect bounds-check load can be replaced with a constant fold using `precomputed_funcref_table_contents.len()`.
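The elisions in the preceding commits can be summarised as one decision function. This is a hedged sketch only: `Slot`, `Elisions`, and `plan_call_indirect` are invented names, and the real lowering consults `is_immutable_funcref_table` and `precomputed_funcref_table_contents` rather than a slice. The sketch also assumes that the direct-call rewrite requires the resolved entry to be in bounds, non-null, and of the expected signature.

```rust
/// Hypothetical precomputed slot: `None` is a null entry,
/// `Some((func_index, sig_index))` is a statically known function.
pub type Slot = Option<(u32, u32)>;

/// Which per-dispatch checks can be dropped for one `call_indirect` site.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Elisions {
    pub direct_callee: Option<u32>,
    pub skip_sig_check: bool,
    pub skip_null_check: bool,
    pub const_bound: Option<usize>,
}

pub fn plan_call_indirect(
    immutable: bool,
    contents: &[Slot],
    const_index: Option<usize>,
    call_sig: u32,
) -> Elisions {
    // Nothing can be elided unless the table is provably immutable.
    if !immutable {
        return Elisions::default();
    }
    // Constant index + known entry + matching signature => direct call.
    let direct_callee = const_index
        .and_then(|i| contents.get(i).copied().flatten())
        .filter(|&(_, sig)| sig == call_sig)
        .map(|(func, _)| func);
    Elisions {
        direct_callee,
        // Every non-null entry has the signature the call site expects.
        skip_sig_check: contents.iter().flatten().all(|&(_, sig)| sig == call_sig),
        // No entry is null.
        skip_null_check: contents.iter().all(|slot| slot.is_some()),
        // Non-growable table: the bound folds to a constant.
        const_bound: Some(contents.len()),
    }
}
```

Keeping the signature and null predicates separate mirrors the commit messages: a table that mixes signatures but never contains null still loses its null check.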
`crates/environ/tests/table_mutability.rs`: 12 cases covering the mutation-tracking predicate across `table.set`/`fill`/`copy`/`grow`/`init`, imported tables, multi-table modules, and active-elem-segment behavior.
Three soundness corrections to the call_indirect elision chain:
1. `is_immutable_funcref_table` previously returned true when the table had no per-function `table.set` etc. uses but had a passive elem segment whose `elem.init` could land at runtime. Track the passive-segment dest tables and treat them as potentially mutated.
2. The constant-index direct-call rewrite assumed the resolved funcref's vmctx matched the caller's; correct it to load the callee's `vmctx` from the precomputed `VMFuncRef`.
3. Null-check elision must NOT fire when the precomputed table contains the tagged-null pattern (slot value `1`); add that case.
Disas filetests cover each scenario.
When `is_eagerly_initialized_funcref_table(table_idx)` (= immutable, fully precomputed, no null, no tagged-null) holds:
- `Instance::initialize_tables` eagerly resolves and stores the full `VMFuncRef *` for every slot at instantiation, instead of leaving the tagged lazy-init bit set.
- Cranelift's `call_indirect` lowering tests the masked funcref (`band v, -2`) for null instead of testing the raw slot value; the brif's null branch is provably unreachable at runtime.
This is the predicate the Pulley fusion stack downstream is gated on.
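To make the masked test concrete, here is a small illustrative model of the slot encoding. It assumes that a raw value of `0` means "not yet initialised", that `1` is the tagged-null pattern mentioned earlier, and that funcref pointers are at least 2-byte aligned so bit 0 is free for the tag. `SlotState`, `classify`, and `takes_null_branch` are hypothetical names.

```rust
/// Hypothetical classification of a raw funcref table slot.
#[derive(Debug, PartialEq, Eq)]
pub enum SlotState {
    Uninitialized,
    TaggedNull,
    Funcref(usize),
}

/// Same bit pattern as the `-2` in `band v, -2`: clears only bit 0.
pub const FUNCREF_MASK: usize = !1;

pub fn classify(raw: usize) -> SlotState {
    match raw {
        0 => SlotState::Uninitialized,
        1 => SlotState::TaggedNull,
        // Works for tagged and untagged (eagerly stored) pointers alike.
        v => SlotState::Funcref(v & FUNCREF_MASK),
    }
}

/// The brif takes its null branch exactly when the masked value is zero.
pub fn takes_null_branch(raw: usize) -> bool {
    (raw & FUNCREF_MASK) == 0
}
```

Under the eager-initialisation predicate neither `0` nor `1` can appear in a slot, which is why the null branch is unreachable at runtime.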
The c1-8 attempt at fully eliding the lazy-init brif (egraph-folds it to `trapz`) reshaped the Pulley dispatch sequence in a way that *increased* Discarded-bucket pressure on iPhone 12 Icestorm by ~14 % without a wallclock improvement at N=3. The c1-7 form (brif retained, mask + tagged-pointer test) is the floor we keep. Disas snapshot rewritten to the c1-7 form.
…ions Active elem segments whose memory range extends past the destination table's current size at instantiation behave as a runtime mutation: the trailing entries get dropped, but only after they've been considered for table-resize semantics. Treat the source table as mutated when the module contains such a segment so the call_indirect elisions don't over-fire. Adds an integration test (`tests/all/leftover_elem_segment_soundness.rs`) + two disas filetests covering the leftover-segment shape.
Add `xband{32,64}_s8_br_if_{x32,x64,not_x32,not_x64}` ops: each one
computes `dst = src & sign_extend(mask)` unconditionally, then
conditionally branches by `offset` on the original `src` (or, for the
`_not` variants, on its zero/non-zero inverse).
Emitted by Cranelift at call_indirect lazy-init brif sites where the
funcref's init-bit mask and the brif's null-check both read the same
loaded value. Saves one match_loop dispatch per call_indirect site.
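The semantics stated in this commit message can be modelled as pure functions. This is a sketch, not the Pulley interpreter handlers: the function names and the `(dst, taken)` return shape are assumptions, and the `offset` operand is reduced to a "branch taken" flag.

```rust
/// 64-bit forward variant: returns `(dst, branch_taken)`.
pub fn xband64_s8_br_if(src: u64, mask: i8) -> (u64, bool) {
    // The 8-bit immediate is sign-extended, so `-2` becomes 0xFFFF_FFFF_FFFF_FFFE.
    let dst = src & (mask as i64 as u64);
    // As described above, the branch tests the original `src`.
    (dst, src != 0)
}

/// 64-bit `_not` variant: same AND, inverted branch condition.
pub fn xband64_s8_br_if_not(src: u64, mask: i8) -> (u64, bool) {
    let dst = src & (mask as i64 as u64);
    (dst, src == 0)
}

/// 32-bit forward variant, with the mask sign-extended to 32 bits.
pub fn xband32_s8_br_if(src: u32, mask: i8) -> (u32, bool) {
    let dst = src & (mask as i32 as u32);
    (dst, src != 0)
}
```

The AND result is produced whether or not the branch is taken, which is what lets one opcode replace the separate `band` and `br_if` dispatches.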
Add `Lower::sink_pure_inst(ir::Inst)`: mark a side-effect-free CLIF instruction as absorbed so its standalone lowering is skipped and a later MachInst (e.g. a fused band+brif) can claim its result vreg directly. The reverse-iteration order in lower-block guarantees the terminator that absorbs the inst lowers first, so the absorbed inst is still present when the terminator looks it up.
Pattern-match `cond = band v, -2; brif cond, taken, not_taken` and
lower it to `MInst::BandBrIf` (forward + inverted variants), which
the emit side encodes as the `xband{32,64}_s8_br_if_*` pulley ops.
The `band -2` is sunk via `sink_pure_inst` so the fused MachInst
defs its result vreg. Gated on the band's mask being exactly `-2`
in the appropriate width — never fires on user-wasm `band v, -2`
shapes because the IR rewrite under `is_eagerly_initialized_funcref_table`
is the only producer of this exact pattern at a brif site.
Adds `tests/disas/pulley-call-indirect-band-brif-fusion.wat`.
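A toy model of how reverse-order lowering and instruction sinking interact. Instructions are reduced to opcode names and `lower_block` is a hypothetical function; the real matcher additionally checks the mask value and that the `band` result feeds the `brif`.

```rust
use std::collections::HashSet;

/// Lowers one block in reverse order, fusing `band; brif` into `band_br_if`.
pub fn lower_block(insts: &[&str]) -> Vec<String> {
    let mut absorbed: HashSet<usize> = HashSet::new();
    let mut out = Vec::new();
    // The terminator is visited first, so it can claim the band before the
    // band's own standalone lowering would run.
    for i in (0..insts.len()).rev() {
        if absorbed.contains(&i) {
            continue; // sunk into a later instruction: emit nothing here
        }
        if insts[i] == "brif" && i > 0 && insts[i - 1] == "band" {
            absorbed.insert(i - 1);
            out.push("band_br_if".to_string());
        } else {
            out.push(insts[i].to_string());
        }
    }
    out.reverse(); // restore program order
    out
}
```

A `band` that is not directly followed by a `brif` is left alone in this model and lowers as a standalone instruction.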
Add `xfuncref_dispatch_{x64,not_x64,x32,not_x32}` ops: each one
takes a pre-masked funcref pointer, loads `wasm_call` and
callee `vmctx` from offsets `offset_code`/`offset_vmctx`, and
conditionally branches by `offset` on whether the pointer is null.
The branch direction (`x64` = branch on non-null, `not_x64` = branch
on null) is chosen by MachBuffer's fall-through optimization at emit
time.
Consumed by phase-2 Cranelift fusion (next commit).
Add `LowerBackend::pre_lower(ctx)`, called once before `lower_clif_block` iteration begins. Backends override it to scan the whole function and mark pure loads (or any `is_pure_for_egraph`-satisfying inst) as absorbed via `sink_pure_inst`. Required for the phase-2 funcref-dispatch fusion, which absorbs the two `VMFuncRef` field loads from the continuation block into the brif's MachInst: a cross-block sink that can't be expressed in the per-block reverse-iteration lowering order.
Phase-2 fusion. Matches the canonical call_indirect lazy-init shape:
    band v, -2 -> cond
    brif cond, continuation([cond]), null_block([])
  continuation(funcref_ptr):
    code = load funcref_ptr + offset_code
    vmctx = load funcref_ptr + offset_vmctx
`pre_lower` sinks the two continuation-block loads; `lower_branch`
on the brif emits `MInst::FuncrefDispatch` (encoded as
`xfuncref_dispatch_*`). The band stays as a separate `xband_s8` op
because its result feeds the brif test and the continuation block
param.
Dispatch tail at the call_indirect lazy-init site shrinks from
5 Pulley dispatches to 3 (band, fused dispatch, call_indirect).
The phase-2 pattern matcher checked `Imm64(-2)` directly without canonicalising to the band's value type. On pulley32, `iconst -2 : i32` is stored as `Imm64(0xFFFFFFFE)` (the i64 representation of the i32 value), so the literal `-2` compare failed and phase 2 never fired on `arm64_32-apple-watchos`. Replace the check with width-aware `is_minus_two_for(imm, ty)` that matches both `Imm64(-2)` (i64 result) and `Imm64(0xFFFFFFFE)` (i32 result). Adds `tests/disas/pulley-fusion-fires-32bit.wat`.
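A minimal sketch of what such a width-aware check could look like. `IntTy` is a hypothetical stand-in for the CLIF type, the immediate is passed as its raw `i64` bits, and comparing only the low 32 bits for `i32` is an assumption made here so that both the zero-extended and the sign-extended encodings are accepted.

```rust
/// Hypothetical stand-in for the band's CLIF value type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum IntTy {
    I32,
    I64,
}

/// True if `imm_bits` encodes `-2` at the width of `ty`.
pub fn is_minus_two_for(imm_bits: i64, ty: IntTy) -> bool {
    match ty {
        IntTy::I64 => imm_bits == -2,
        // Only the low 32 bits are meaningful for an i32 constant.
        IntTy::I32 => ((imm_bits as u64) & 0xFFFF_FFFF) == 0xFFFF_FFFE,
    }
}
```

With this shape, `0xFFFFFFFE` matches for an `i32` band but not for an `i64` one, where it is a large positive mask rather than `-2`.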
Adds 7 `tests/disas/pulley-fusion-*.wat` filetests:
- Gating (must-not-fire): user-wasm `band v, -2 + br_if`, mutable-table, table.set/fill/copy/grow, runtime sig check.
- Firing: pulley32 target, multi-site (two call_indirects in one function), return_call_indirect.
Test shapes drawn from known fusion-soundness bug classes in V8, WAMR, wasm3, WasmEdge, Hermes, ChakraCore, Luau; citations in each test's docstring.
Add `xband_funcref_dispatch_{x64,not_x64,x32,not_x32}`: same shape
as `xfuncref_dispatch_*` but consumes the UNMASKED funcref pointer
and writes the masked value to `dst_masked` so the brif's
block-call-arg copy to the continuation block param still has a
defined producer.
Cranelift's `try_fuse_funcref_dispatch` prefers phase 3 (also
absorb the standalone `xband_s8`) when the band has no other uses,
falling back to phase 2 (band stays standalone) otherwise.
Dispatch tail at the call_indirect lazy-init site shrinks to
2 Pulley dispatches (fused op + call_indirect).
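The data flow of the phase-3 op can be modelled over a toy memory. Everything here is illustrative: `Dispatch`, the word-addressed `mem` slice, and the function signature are assumptions, and what the interpreter does on the null path (branch or trap) is deliberately left to the caller.

```rust
/// Result of one fused dispatch step in this toy model.
#[derive(Debug, PartialEq, Eq)]
pub struct Dispatch {
    /// Masked funcref pointer, always written.
    pub dst_masked: u64,
    /// `(code, vmctx)` when the pointer is non-null, `None` otherwise.
    pub loaded: Option<(u64, u64)>,
}

/// `mem` is a toy memory of 8-byte words: byte address `a` lives at `mem[a / 8]`.
pub fn xband_funcref_dispatch(
    mem: &[u64],
    raw: u64,
    offset_code: u64,
    offset_vmctx: u64,
) -> Dispatch {
    // Clear the init bit, as the absorbed `band v, -2` would.
    let masked = raw & !1;
    if masked == 0 {
        return Dispatch { dst_masked: 0, loaded: None };
    }
    let word = |addr: u64| mem[(addr / 8) as usize];
    Dispatch {
        dst_masked: masked,
        loaded: Some((word(masked + offset_code), word(masked + offset_vmctx))),
    }
}
```

The phase-2 `xfuncref_dispatch_*` shape is the same minus the masking step, since it receives the pointer already masked.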
Mirror of the direct-call `call{1,2,3,4}` family: each new op
combines `xmov xN, argN` ABI fixups with the indirect call. Reads
arg values before writing the ABI registers so the sequence is safe
when an argN aliases the corresponding ABI register.
`call_indirect1 dst, arg1`:
    x0 = state[arg1]
    lr = pc
    pc = state[dst]
Saves up to N Pulley dispatches per call_indirect site (one per
moved arg). In practice at least one — the callee vmctx ABI fixup.
Cranelift wiring in the next commit.
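The read-before-write ordering is what makes aliasing safe. A hedged model of the idea, with a hypothetical `State` struct and a single `call_indirect_n` helper standing in for the four opcodes; reading the call target up front is an assumption of this sketch, made so that `dst` may also alias an ABI register:

```rust
/// Toy register file: `x[i]` is register `xi`.
pub struct State {
    pub x: [u64; 32],
    pub pc: u64,
    pub lr: u64,
}

/// Moves up to four args into x0..x3, then performs the indirect call.
/// `args[i]` is the register currently holding argument `i`.
pub fn call_indirect_n(state: &mut State, dst: usize, args: &[usize], return_pc: u64) {
    assert!(args.len() <= 4);
    // Read the target and every argument before writing any ABI register.
    let target = state.x[dst];
    let mut vals = [0u64; 4];
    for (i, &reg) in args.iter().enumerate() {
        vals[i] = state.x[reg];
    }
    // Only now write x0..x{N-1}.
    for (i, &val) in vals.iter().enumerate().take(args.len()) {
        state.x[i] = val;
    }
    state.lr = return_pc;
    state.pc = target;
}
```

Writing `x0` before reading `x1`'s source would corrupt a swap such as `args = [x1, x0]`; buffering the values first avoids that.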
Extend `Inst::IndirectCall`'s `info.dest` from `XReg` to
`PulleyCallIndirect { target, args: SmallVec<[XReg; 4]> }`, parallel
to `PulleyCall`. `gen_call_ind_info` pulls the first 0–4 integer
args from `uses` (where they were going through regalloc's
`reg_fixed_use`, synthesising an `xmov` each) into `args`, where
they flow as free reg uses and the emitted `call_indirect{1,2,3,4}`
opcode moves them at call time.
The emit side picks the narrowest op after the same "drop args
already in their ABI register" loop used by direct calls. Phase-3's
`xband_funcref_dispatch_*` writing `dst_vmctx` into a free register
+ `call_indirect1 dst_code, dst_vmctx` is the headline shrink (one
fewer Pulley dispatch per call_indirect on the eager-table fast
path).
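One plausible shape for the "drop args already in their ABI register" step, written as a standalone sketch. `trim_args` is a hypothetical name, registers are plain `u8` indices, and the assumption is that only trailing arguments can be dropped because `call_indirectN` moves arguments `0..N` positionally.

```rust
/// `args[i]` is the register holding argument `i`; its ABI register is `x{i}`.
/// Returns the prefix that still needs moving. Its length selects the opcode:
/// 0 means plain `call_indirect`, otherwise `call_indirect{len}`.
pub fn trim_args(args: &[u8]) -> &[u8] {
    let mut args = args;
    while let Some((&last, rest)) = args.split_last() {
        // The last remaining argument has index `rest.len()`.
        if usize::from(last) == rest.len() {
            args = rest; // already in place, no move needed
        } else {
            break;
        }
    }
    args
}
```

For example, `[3, 1]` trims to `[3]`: argument 1 already sits in `x1`, so only the move into `x0` remains and the narrower one-argument opcode suffices.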
Filetest snapshots updated for the new `dest` shape.
Force-pushed: b21631f to 68d3e70
Force-pushed: 80b51f7 to 8a9e7cc
Codex review on the rebeckerspecialties wasmtime fork PR pointed out that phase-2/3's continuation-block load absorption breaks the lazy-init slow path's correctness: the slow path's libcall rejoins `continuation_block` via a block param, and after absorption the loads are gone, so `call_indirect` would see uninitialized `dst_code`/`dst_vmctx` if the slow path were ever reached. Fusion is gated on `is_eagerly_initialized_funcref_table`, so the slow path is unreachable at runtime, but the previous handler's `ControlFlow::Continue(())` on null was advertised as defence-in-depth and was itself broken. Replace it with `done_trap` in the 8 affected handlers (4 forward + 4 `_not` variants across x64/x32 × xfuncref_dispatch/xband_funcref_dispatch). `offset` on the `_not` variants becomes vestigial; kept for encoding-shape parity.
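The fail-closed behaviour can be illustrated with the standard `ControlFlow` type. This is a sketch of the idea only: `Done` and `dispatch_or_trap` are invented names, not the interpreter's `done_trap` machinery.

```rust
use std::ops::ControlFlow;

/// Hypothetical reason for stopping the interpreter loop.
#[derive(Debug, PartialEq, Eq)]
pub enum Done {
    Trap,
}

/// Null funcref: stop with a trap instead of continuing with
/// uninitialised `dst_code` / `dst_vmctx`.
pub fn dispatch_or_trap(masked_funcref: u64) -> ControlFlow<Done, u64> {
    if masked_funcref == 0 {
        ControlFlow::Break(Done::Trap)
    } else {
        ControlFlow::Continue(masked_funcref)
    }
}
```

If the gating predicate were ever wrong, this shape surfaces the problem as a trap rather than as a call through garbage registers.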
Force-pushed: 8a9e7cc to dce728e
Reopened as #13447 (renamed branch). Same commits, same code; just a branch-name cleanup. CI is running there now.
TL;DR
Stacks on #13445 (per-table mutability tracking). Adds a fused dispatch op family at the `call_indirect` lazy-init brif site (similar in shape to WAMR's preprocessed-bytecode register IR) plus a `call_indirect{1,2,3,4}` family mirroring direct-call `call{1,2,3,4}`. Consistent 4–8 % wallclock wins on polymorphic vtable workloads across iPhone 12 E-core, Apple Watch SE2, and M4 E-core. iPhone XS E-core is mixed, and the reason is visible in hand-rolled aarch64 assembly microbenchmarks: branch-prediction pressure is the cross-microarch variable.
Closes ~10 % of the Pulley/WAMR wallclock gap on the iPhone 12 vtable suite (vtable_poly4 1.73× → 1.58×) and brings Pulley to within 1.16× of WAMR on `xmrsplayer` on Apple Watch SE2, the closest cross-device result in our matrix.
Dependency
Depends on #13445 landing first. Both PRs are stacked from the same fork branch; this PR's diff against `main` includes #13445's 11 commits at the bottom + 13 fusion commits on top. The fusion is gated on `is_eagerly_initialized_funcref_table` (the predicate added in #13445), so it only fires when the table-mutability proof holds.
Stack
13 commits on top of #13445:
- `band + brif + 2 xloads` at the call_indirect lazy-init tail (5 Pulley dispatches → 2 per call_indirect site).
- `call_indirect{1,2,3,4}` opcodes mirror direct-call `call{1,2,3,4}`.
- `Inst::IndirectCall` bundles first 4 integer ABI args into the call opcode instead of synthesising `xmov`s via regalloc `reg_fixed_use`.
- […] (`sink_pure_inst` of the continuation-block loads broke the lazy-init slow path's rejoin; trapping fails closed under the predicate).
Wallclock medians, N=10, phase-4 vs `table-mutability-tracking` baseline
`BENCH_TARGET_MS=2000`; `.utility` QoS on iOS / `taskpolicy -b` on M4.
PMU buckets (single 12 s xctrace `CPU Counters` per workload)
Common bottleneck across A14 + M4 is Processing (back-end-bound). I can't measure these advanced CPU counters on iPhone XS (A12) or Apple Watch; I can only get wall-clock time and power draw.
Verification
- `cranelift` `pulley_call_*` integration tests
- `cargo fuzz run differential --no-default-features` with `ALLOWED_ENGINES=pulley,wasmtime`: 0 crashes / 0 divergences
Extra credit
Cross-device measurement harness I built for this bake-off across WASM runtimes: rebeckerspecialties/wasm-benchmark#1
I submitted two PRs to WAMR to support exceptions and loose SIMD, so that it can run more of the benchmarks and generally have functional parity without losing its performance edge.