cranelift: per-table mutability tracking + call_indirect elisions on immutable funcref tables#13445
Conversation
Add `ModuleTranslation::tables_mutated`, a `SecondaryMap<TableIndex, bool>` populated during `ModuleEnvironment::translate` recording whether any function in the module mutates a given table at runtime via `table.set` / `table.fill` / `table.copy` (as dest) / `table.grow` / `table.init`. Imported tables are conservatively marked mutated. Active `elem` segments at instantiation time are part of initial state, not mutations. O(total opcodes) extra pass over each function body. Groundwork for follow-on call_indirect optimizations gated on the predicate; nothing consumes the bit in this commit.
When `call_indirect` resolves to a constant index into a provably immutable funcref table whose contents are statically known from `elem` segments, rewrite the call to a direct `call F` at lowering time. Skips all per-dispatch checks (bounds, null, sig) and replaces the indirect jump with a direct branch. Gated on `is_immutable_funcref_table(table_idx)` (= predicate from the previous commit + statically-known table contents).
…ables When a funcref table is provably immutable AND every entry in its elem segments has the same function signature as the call_indirect's type annotation, the runtime signature check is statically redundant and is elided in `translate_call_indirect`.
When a funcref table is provably immutable AND none of its precomputed elem-segment entries are null, the runtime null check after the funcref load is statically redundant and is elided. Distinct from the sig-check elision: this targets tables that mix sigs but never contain null.
…wering Add `ModuleTranslation::precomputed_funcref_table_contents`, populated during finalization from active elem segments applied at table-init time. Cranelift's lowering uses this to resolve constant call_indirect indices and to check the "no null / uniform sig" predicates the prior elisions depend on.
For provably non-growable funcref tables (`!tables_mutated` excludes `table.grow`), the table size is fixed at instantiation and the per-call_indirect bounds-check load can be replaced with a constant fold using `precomputed_funcref_table_contents.len()`.
`crates/environ/tests/table_mutability.rs`: 12 cases covering the mutation-tracking predicate across `table.set`/`fill`/`copy`/`grow`/ `init`, imported tables, multi-table modules, and active-elem-segment behavior.
Three soundness corrections to the call_indirect elision chain: 1. `is_immutable_funcref_table` previously returned true when the table had no per-function `table.set` etc. uses but had a passive elem segment whose `elem.init` could land at runtime. Track the passive-segment dest tables and treat them as potentially mutated. 2. The constant-index direct-call rewrite assumed the resolved funcref's vmctx matched the caller's; correct it to load the callee's `vmctx` from the precomputed `VMFuncRef`. 3. Null-check elision must NOT fire when the precomputed table contains the tagged-null pattern (slot value `1`); add that case. Disas filetests cover each scenario.
When `is_eagerly_initialized_funcref_table(table_idx)` (= immutable, fully precomputed, no null, no tagged-null) holds: - `Instance::initialize_tables` eagerly resolves and stores the full `VMFuncRef *` for every slot at instantiation, instead of leaving the tagged lazy-init bit set. - Cranelift's `call_indirect` lowering tests the masked funcref (`band v, -2`) for null instead of testing the raw slot value; the brif's null branch is provably unreachable at runtime. This is the predicate the Pulley fusion stack downstream is gated on.
The c1-8 attempt at fully eliding the lazy-init brif (egraph-folds it to `trapz`) reshaped the Pulley dispatch sequence in a way that *increased* Discarded-bucket pressure on iPhone 12 Icestorm by ~14 % without a wallclock improvement at N=3. The c1-7 form (brif retained, mask + tagged-pointer test) is the floor we keep. Disas snapshot rewritten to the c1-7 form.
213baca to
a40c9b3
Compare
|
@matthargett could I request that (per our AI policy) you rewrite the PR description here? In particular, there are a bunch of phrases here that seem to be a locally-evolved jargon and are not very comprehensible:
These are just examples; in general I'd like to see a description of the work that is aimed to actually communicate to a human, not dump a bunch of out-of-context details, especially before diving in to a 2k-line PR. (And, to double-check per our AI policy: have you reviewed this whole PR by hand before posting it?) |
I did edit the PR description, by my own typing hands, and it folds in the feedback given about both the verbosity and the AI policy in the chat.
sorry, this is a term we used in the code for my product called BugScan back in 2003. in that and this context, analyzing code to derive predicates that can be chained together for solving/elision purposes. here it's bits being set and not structs/objects, so "factory" wasn't a good metaphor choice. This PR does the full chain: it does the analysis, sets the bits, and then the cranelift modifications uses those bits to do some elision of opcodes based on the proof of analysis that the predicates. The reason there's a split between the PRs is that while there's some low-level CPU counters and profiler stats that improve just with this PR, but wallclock/e2e results on my devices didn't show an above-noise uplift.
the elision floor is when I wasn't getting any movement, even on CPU counter improvements, from trying to elide even more instructions. I'm not trying to use confusing metaphors on purpose, and I'm sorry my phrasing wasn't clearer.
the "discarded" metric comes from the XCode profiling tools, specificallt xctrace's template for CPU bottlenecks. it's how many branch prediction mis-predicts happen, and its relevant in a bunch of interpreter performance work I've done over the years (starting with Tcl on the DEC Alpha and PowerPC).
sorry, I really did try to make it more straightforward and not repeat things the diff already communicates.
I re-reviewed diffs inbetween each on-device benchmark pass, which took 30-40 minutes across the cross-section of physical devices I have at my house. I know I'm not the world's best programmer or communicator, even after a few decades of practice. If you feel like a video/audio chat would help, I'm up for it. If it's just too much overhead for you all, that's okay: I can keep a clean patch stack in my fork and we can revisit (or not) at your leisure. |
|
I did just notice that when cleaning up my fork's branch and splitting the PR that I lost some cleanup commits from my local clone. I'm fixing that now. |
…ions Active elem segments whose memory range extends past the destination table's current size at instantiation behave as a runtime mutation: the trailing entries get dropped, but only after they've been considered for table-resize semantics. Treat the source table as mutated when the module contains such a segment so the call_indirect elisions don't over-fire. Adds an integration test (`tests/all/leftover_elem_segment_soundness.rs`) + two disas filetests covering the leftover-segment shape.
a40c9b3 to
b2ab608
Compare
TL;DR
Adds a
ModuleTranslation::tables_mutatedbit (set during translation bytable.set/table.fill/table.copy-dest /table.grow/table.initopcodes, or any passive elem segment that could land at runtime, or any leftover-segment shape that crosses a runtime resize) and uses it to elide six redundant runtime checks oncall_indirectagainst provably-immutable funcref tables:call FVMFuncRef *directly at instantiation)Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.
Why
Each elision saves a small per-
call_indirectcost, but the combined predicate (is_eagerly_initialized_funcref_table) is what lets the downstream Pulley opcode-fusion stack (a follow-up PR) collapse the dispatch tail. Real-world graphql-js validation pipelines compile to ~98 call_indirect sites all dispatching through a single immutable funcref table — every site qualifies.Soundness
Three corners had to be tightened during development:
elem.initagainst a slot the predicate said was immutable would slip through).vmctxfrom the precomputedVMFuncRef, not the caller's.1, produced bytable.fill(null)on a tagged table; excluded by the immutability half of the predicate).tests/all/leftover_elem_segment_soundness.rs+ 4 disas filetests cover the soundness shapes.crates/environ/tests/table_mutability.rshas 12 cases for the predicate itself.Learnings / Caveats
An attempt at fully eliding the lazy-init brif (c1-8: egraph-folds it to
trapz) showed ~14 % branch mis-prediction increase on iPhone 12 E-core profiler across 3+ runs without a wallclock improvement. I kept the form from commits 1-7 (brif retained, mask + tagged-pointer test); commitdisable the c1-8 brif elision based on PMU evidencedocuments this.Tests
crates/environ/tests/table_mutability.rsintegrationtests/all/leftover_elem_segment_soundness.rsStacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.