cranelift: per-table mutability tracking + call_indirect elisions on immutable funcref tables by matthargett · Pull Request #13445 · bytecodealliance/wasmtime

matthargett · 2026-05-22T02:25:56Z

TL;DR

Adds a ModuleTranslation::tables_mutated bit (set during translation by table.set / table.fill / table.copy-dest / table.grow / table.init opcodes, or any passive elem segment that could land at runtime, or any leftover-segment shape that crosses a runtime resize) and uses it to elide six redundant runtime checks on call_indirect against provably-immutable funcref tables:

constant-index dispatches lower to direct call F
sig check is elided when every elem in the table shares the call_indirect's sig
null check is elided when no precomputed slot is null
bounds check + per-dispatch bound load are elided on non-growable tables
the lazy-init brif tests the masked funcref instead of the raw slot value (eager-init slots store the resolved VMFuncRef * directly at instantiation)

Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.

Why

Each elision saves a small per-call_indirect cost, but the combined predicate (is_eagerly_initialized_funcref_table) is what lets the downstream Pulley opcode-fusion stack (a follow-up PR) collapse the dispatch tail. Real-world graphql-js validation pipelines compile to ~98 call_indirect sites all dispatching through a single immutable funcref table — every site qualifies.

Soundness

Three corners had to be tightened during development:

Passive elem segments with dest tables are conservatively counted as mutations (else elem.init against a slot the predicate said was immutable would slip through).
The constant-index direct-call rewrite loads the callee's vmctx from the precomputed VMFuncRef, not the caller's.
Null-check elision skips the tagged-null pattern (slot value 1, produced by table.fill(null) on a tagged table; excluded by the immutability half of the predicate).

tests/all/leftover_elem_segment_soundness.rs + 4 disas filetests cover the soundness shapes. crates/environ/tests/table_mutability.rs has 12 cases for the predicate itself.

Learnings / Caveats

An attempt at fully eliding the lazy-init brif (c1-8: egraph-folds it to trapz) showed ~14 % branch mis-prediction increase on iPhone 12 E-core profiler across 3+ runs without a wallclock improvement. I kept the form from commits 1-7 (brif retained, mask + tagged-pointer test); commit disable the c1-8 brif elision based on PMU evidence documents this.

Tests

2237 / 2237 cranelift filetests
16 / 16 crates/environ/tests/table_mutability.rs integration
new tests/all/leftover_elem_segment_soundness.rs

Stacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.

Add `ModuleTranslation::tables_mutated`, a `SecondaryMap<TableIndex, bool>` populated during `ModuleEnvironment::translate` recording whether any function in the module mutates a given table at runtime via `table.set` / `table.fill` / `table.copy` (as dest) / `table.grow` / `table.init`. Imported tables are conservatively marked mutated. Active `elem` segments at instantiation time are part of initial state, not mutations. O(total opcodes) extra pass over each function body. Groundwork for follow-on call_indirect optimizations gated on the predicate; nothing consumes the bit in this commit.

When `call_indirect` resolves to a constant index into a provably immutable funcref table whose contents are statically known from `elem` segments, rewrite the call to a direct `call F` at lowering time. Skips all per-dispatch checks (bounds, null, sig) and replaces the indirect jump with a direct branch. Gated on `is_immutable_funcref_table(table_idx)` (= predicate from the previous commit + statically-known table contents).

…ables When a funcref table is provably immutable AND every entry in its elem segments has the same function signature as the call_indirect's type annotation, the runtime signature check is statically redundant and is elided in `translate_call_indirect`.

When a funcref table is provably immutable AND none of its precomputed elem-segment entries are null, the runtime null check after the funcref load is statically redundant and is elided. Distinct from the sig-check elision: this targets tables that mix sigs but never contain null.

…wering Add `ModuleTranslation::precomputed_funcref_table_contents`, populated during finalization from active elem segments applied at table-init time. Cranelift's lowering uses this to resolve constant call_indirect indices and to check the "no null / uniform sig" predicates the prior elisions depend on.

For provably non-growable funcref tables (`!tables_mutated` excludes `table.grow`), the table size is fixed at instantiation and the per-call_indirect bounds-check load can be replaced with a constant fold using `precomputed_funcref_table_contents.len()`.

`crates/environ/tests/table_mutability.rs`: 12 cases covering the mutation-tracking predicate across `table.set`/`fill`/`copy`/`grow`/ `init`, imported tables, multi-table modules, and active-elem-segment behavior.

Three soundness corrections to the call_indirect elision chain: 1. `is_immutable_funcref_table` previously returned true when the table had no per-function `table.set` etc. uses but had a passive elem segment whose `elem.init` could land at runtime. Track the passive-segment dest tables and treat them as potentially mutated. 2. The constant-index direct-call rewrite assumed the resolved funcref's vmctx matched the caller's; correct it to load the callee's `vmctx` from the precomputed `VMFuncRef`. 3. Null-check elision must NOT fire when the precomputed table contains the tagged-null pattern (slot value `1`); add that case. Disas filetests cover each scenario.

When `is_eagerly_initialized_funcref_table(table_idx)` (= immutable, fully precomputed, no null, no tagged-null) holds: - `Instance::initialize_tables` eagerly resolves and stores the full `VMFuncRef *` for every slot at instantiation, instead of leaving the tagged lazy-init bit set. - Cranelift's `call_indirect` lowering tests the masked funcref (`band v, -2`) for null instead of testing the raw slot value; the brif's null branch is provably unreachable at runtime. This is the predicate the Pulley fusion stack downstream is gated on.

The c1-8 attempt at fully eliding the lazy-init brif (egraph-folds it to `trapz`) reshaped the Pulley dispatch sequence in a way that *increased* Discarded-bucket pressure on iPhone 12 Icestorm by ~14 % without a wallclock improvement at N=3. The c1-7 form (brif retained, mask + tagged-pointer test) is the floor we keep. Disas snapshot rewritten to the c1-7 form.

cfallin · 2026-05-22T04:36:22Z

@matthargett could I request that (per our AI policy) you rewrite the PR description here? In particular, there are a bunch of phrases here that seem to be a locally-evolved jargon and are not very comprehensible:

"This is the predicate-factory PR." What is a predicate factory in this context? What does that mean? Are you just saying that this PR computes certain properties (predicates)? But then you say "Each elision saves a small per-call_indirect cost" -- so it sounds like this is not just computing predicates but using them?
"c1-7 vs c1-8 elision floor" -- what does this mean? What is c1-7? What is c1-8? (If I had to guess: these are Claude-shaped plan phase names?) And then what is an "elision floor" (other than, perhaps, a fancy new kind of home-construction product)? Please try to make sure the description doesn't have incomprehensible jargon -- define terms before you use them, unless they are standard (in our subfield) terms.
"...showed ~14 % Discarded-bucket increase" -- this reads like some sort of retro-encabulator description. What is the discarded bucket? Is it a bucket we have chosen to discard? Is it a bucket for things that are discarded? Why are they discarded? Who discarded them? Why is it bad that the discarded-bucket (... or the number of things within it) increases? There must be a whole experiment+measurement story here that's elided; unfortunately without that story it's hard to get any meaning out of this.

These are just examples; in general I'd like to see a description of the work that is aimed to actually communicate to a human, not dump a bunch of out-of-context details, especially before diving in to a 2k-line PR.

(And, to double-check per our AI policy: have you reviewed this whole PR by hand before posting it?)

matthargett · 2026-05-22T06:47:50Z

@matthargett could I request that (per our AI policy) you rewrite the PR description here? In particular, there are a bunch of phrases here that seem to be a locally-evolved jargon and are not very comprehensible:

I did edit the PR description, by my own typing hands, and it folds in the feedback given about both the verbosity and the AI policy in the chat.

* "This is the predicate-factory PR." What is a predicate factory in this context? What does that mean? Are you just saying that this PR computes certain properties (predicates)? But then you say "Each elision saves a small per-`call_indirect` cost" -- so it sounds like this is not just computing predicates but using them?

sorry, this is a term we used in the code for my product called BugScan back in 2003. in that and this context, analyzing code to derive predicates that can be chained together for solving/elision purposes. here it's bits being set and not structs/objects, so "factory" wasn't a good metaphor choice.

This PR does the full chain: it does the analysis, sets the bits, and then the cranelift modifications uses those bits to do some elision of opcodes based on the proof of analysis that the predicates.

The reason there's a split between the PRs is that while there's some low-level CPU counters and profiler stats that improve just with this PR, but wallclock/e2e results on my devices didn't show an above-noise uplift.

* "c1-7 vs c1-8 elision floor" -- what does this mean? What is c1-7? What is c1-8? (If I had to guess: these are Claude-shaped plan phase names?) And then what is an "elision floor" (other than, perhaps, a fancy new kind of home-construction product)? Please try to make sure the description doesn't have incomprehensible jargon -- define terms before you use them, unless they are standard (in our subfield) terms.

c here means commit, again sorry for the sourceforge-era shorthand. I tried to keep the commit stack clean so that each change and why it was necessary was discrete and easy reason about one after the other (and not duplicate this information in the PR description). if it was one big commit, then some of the complexity around making the changes resilient to the fuzzer might seem unrelated or unnecessary.

the elision floor is when I wasn't getting any movement, even on CPU counter improvements, from trying to elide even more instructions. I'm not trying to use confusing metaphors on purpose, and I'm sorry my phrasing wasn't clearer.

* "...showed ~14 % Discarded-bucket increase" -- this reads like some sort of retro-encabulator description. What is the discarded bucket? Is it a bucket we have chosen to discard? Is it a bucket for things that are discarded? Why are they discarded? Who discarded them? Why is it bad that the discarded-bucket (... or the number of things within it) increases? There must be a whole experiment+measurement story here that's elided; unfortunately without that story it's hard to get any meaning out of this.

the "discarded" metric comes from the XCode profiling tools, specificallt xctrace's template for CPU bottlenecks. it's how many branch prediction mis-predicts happen, and its relevant in a bunch of interpreter performance work I've done over the years (starting with Tcl on the DEC Alpha and PowerPC).

These are just examples; in general I'd like to see a description of the work that is aimed to actually communicate to a human, not dump a bunch of out-of-context details, especially before diving in to a 2k-line PR.

sorry, I really did try to make it more straightforward and not repeat things the diff already communicates.

(And, to double-check per our AI policy: have you reviewed this whole PR by hand before posting it?)

I re-reviewed diffs inbetween each on-device benchmark pass, which took 30-40 minutes across the cross-section of physical devices I have at my house.

I know I'm not the world's best programmer or communicator, even after a few decades of practice. If you feel like a video/audio chat would help, I'm up for it. If it's just too much overhead for you all, that's okay: I can keep a clean patch stack in my fork and we can revisit (or not) at your leisure.

matthargett · 2026-05-22T07:02:41Z

I did just notice that when cleaning up my fork's branch and splitting the PR that I lost some cleanup commits from my local clone. I'm fixing that now.

…ions Active elem segments whose memory range extends past the destination table's current size at instantiation behave as a runtime mutation: the trailing entries get dropped, but only after they've been considered for table-resize semantics. Treat the source table as mutated when the module contains such a segment so the call_indirect elisions don't over-fire. Adds an integration test (`tests/all/leftover_elem_segment_soundness.rs`) + two disas filetests covering the leftover-segment shape.

matthargett requested review from a team as code owners May 22, 2026 02:25

matthargett requested review from cfallin and removed request for a team May 22, 2026 02:25

matthargett mentioned this pull request May 22, 2026

pulley/cranelift: opcode fusion at call_indirect lazy-init dispatch tail #13446

Closed

matthargett added 10 commits May 21, 2026 19:52

environ: integration tests for analyze_table_mutability

5fdc275

`crates/environ/tests/table_mutability.rs`: 12 cases covering the mutation-tracking predicate across `table.set`/`fill`/`copy`/`grow`/ `init`, imported tables, multi-table modules, and active-elem-segment behavior.

matthargett force-pushed the table-mutability-tracking-upstream branch from 213baca to a40c9b3 Compare May 22, 2026 03:26

matthargett mentioned this pull request May 22, 2026

pulley/cranelift: opcode fusion at call_indirect lazy-init dispatch tail #13447

Open

matthargett force-pushed the table-mutability-tracking-upstream branch from a40c9b3 to b2ab608 Compare May 22, 2026 07:11

github-actions Bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cranelift: per-table mutability tracking + call_indirect elisions on immutable funcref tables#13445

cranelift: per-table mutability tracking + call_indirect elisions on immutable funcref tables#13445
matthargett wants to merge 11 commits into
bytecodealliance:mainfrom
rebeckerspecialties:table-mutability-tracking-upstream

matthargett commented May 22, 2026 •

edited

Loading

Uh oh!

cfallin commented May 22, 2026

Uh oh!

matthargett commented May 22, 2026

Uh oh!

matthargett commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

matthargett commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

Why

Soundness

Learnings / Caveats

Tests

Uh oh!

cfallin commented May 22, 2026

Uh oh!

matthargett commented May 22, 2026

Uh oh!

matthargett commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

matthargett commented May 22, 2026 •

edited

Loading