Skip to content

cranelift: per-table mutability tracking + call_indirect elisions on immutable funcref tables#13445

Open
matthargett wants to merge 11 commits into
bytecodealliance:mainfrom
rebeckerspecialties:table-mutability-tracking-upstream
Open

cranelift: per-table mutability tracking + call_indirect elisions on immutable funcref tables#13445
matthargett wants to merge 11 commits into
bytecodealliance:mainfrom
rebeckerspecialties:table-mutability-tracking-upstream

Conversation

@matthargett
Copy link
Copy Markdown
Contributor

@matthargett matthargett commented May 22, 2026

TL;DR

Adds a ModuleTranslation::tables_mutated bit (set during translation by table.set / table.fill / table.copy-dest / table.grow / table.init opcodes, or any passive elem segment that could land at runtime, or any leftover-segment shape that crosses a runtime resize) and uses it to elide six redundant runtime checks on call_indirect against provably-immutable funcref tables:

  • constant-index dispatches lower to direct call F
  • sig check is elided when every elem in the table shares the call_indirect's sig
  • null check is elided when no precomputed slot is null
  • bounds check + per-dispatch bound load are elided on non-growable tables
  • the lazy-init brif tests the masked funcref instead of the raw slot value (eager-init slots store the resolved VMFuncRef * directly at instantiation)

Each elision is gated on a stronger predicate than the previous; passive-segment + leftover-segment soundness corners are covered by integration tests.

Why

Each elision saves a small per-call_indirect cost, but the combined predicate (is_eagerly_initialized_funcref_table) is what lets the downstream Pulley opcode-fusion stack (a follow-up PR) collapse the dispatch tail. Real-world graphql-js validation pipelines compile to ~98 call_indirect sites all dispatching through a single immutable funcref table — every site qualifies.

Soundness

Three corners had to be tightened during development:

  • Passive elem segments with dest tables are conservatively counted as mutations (else elem.init against a slot the predicate said was immutable would slip through).
  • The constant-index direct-call rewrite loads the callee's vmctx from the precomputed VMFuncRef, not the caller's.
  • Null-check elision skips the tagged-null pattern (slot value 1, produced by table.fill(null) on a tagged table; excluded by the immutability half of the predicate).

tests/all/leftover_elem_segment_soundness.rs + 4 disas filetests cover the soundness shapes. crates/environ/tests/table_mutability.rs has 12 cases for the predicate itself.

Learnings / Caveats

An attempt at fully eliding the lazy-init brif (c1-8: egraph-folds it to trapz) showed ~14 % branch mis-prediction increase on iPhone 12 E-core profiler across 3+ runs without a wallclock improvement. I kept the form from commits 1-7 (brif retained, mask + tagged-pointer test); commit disable the c1-8 brif elision based on PMU evidence documents this.

Tests

  • 2237 / 2237 cranelift filetests
  • 16 / 16 crates/environ/tests/table_mutability.rs integration
  • new tests/all/leftover_elem_segment_soundness.rs

Stacks under a follow-up PR for Pulley opcode fusion at the call_indirect lazy-init site.

@matthargett matthargett requested review from a team as code owners May 22, 2026 02:25
@matthargett matthargett requested review from cfallin and removed request for a team May 22, 2026 02:25
Add `ModuleTranslation::tables_mutated`, a `SecondaryMap<TableIndex, bool>`
populated during `ModuleEnvironment::translate` recording whether any
function in the module mutates a given table at runtime via `table.set`
/ `table.fill` / `table.copy` (as dest) / `table.grow` / `table.init`.
Imported tables are conservatively marked mutated. Active `elem`
segments at instantiation time are part of initial state, not mutations.

O(total opcodes) extra pass over each function body. Groundwork for
follow-on call_indirect optimizations gated on the predicate; nothing
consumes the bit in this commit.
When `call_indirect` resolves to a constant index into a provably
immutable funcref table whose contents are statically known from
`elem` segments, rewrite the call to a direct `call F` at lowering
time. Skips all per-dispatch checks (bounds, null, sig) and replaces
the indirect jump with a direct branch.

Gated on `is_immutable_funcref_table(table_idx)` (= predicate from
the previous commit + statically-known table contents).
…ables

When a funcref table is provably immutable AND every entry in its
elem segments has the same function signature as the call_indirect's
type annotation, the runtime signature check is statically redundant
and is elided in `translate_call_indirect`.
When a funcref table is provably immutable AND none of its precomputed
elem-segment entries are null, the runtime null check after the
funcref load is statically redundant and is elided. Distinct from the
sig-check elision: this targets tables that mix sigs but never contain
null.
…wering

Add `ModuleTranslation::precomputed_funcref_table_contents`,
populated during finalization from active elem segments applied at
table-init time. Cranelift's lowering uses this to resolve constant
call_indirect indices and to check the "no null / uniform sig"
predicates the prior elisions depend on.
For provably non-growable funcref tables (`!tables_mutated` excludes
`table.grow`), the table size is fixed at instantiation and the
per-call_indirect bounds-check load can be replaced with a constant
fold using `precomputed_funcref_table_contents.len()`.
`crates/environ/tests/table_mutability.rs`: 12 cases covering the
mutation-tracking predicate across `table.set`/`fill`/`copy`/`grow`/
`init`, imported tables, multi-table modules, and active-elem-segment
behavior.
Three soundness corrections to the call_indirect elision chain:

1. `is_immutable_funcref_table` previously returned true when the
   table had no per-function `table.set` etc. uses but had a passive
   elem segment whose `elem.init` could land at runtime. Track the
   passive-segment dest tables and treat them as potentially mutated.
2. The constant-index direct-call rewrite assumed the resolved
   funcref's vmctx matched the caller's; correct it to load the
   callee's `vmctx` from the precomputed `VMFuncRef`.
3. Null-check elision must NOT fire when the precomputed table
   contains the tagged-null pattern (slot value `1`); add that case.

Disas filetests cover each scenario.
When `is_eagerly_initialized_funcref_table(table_idx)` (= immutable,
fully precomputed, no null, no tagged-null) holds:

- `Instance::initialize_tables` eagerly resolves and stores the full
  `VMFuncRef *` for every slot at instantiation, instead of leaving
  the tagged lazy-init bit set.
- Cranelift's `call_indirect` lowering tests the masked funcref
  (`band v, -2`) for null instead of testing the raw slot value; the
  brif's null branch is provably unreachable at runtime.

This is the predicate the Pulley fusion stack downstream is gated on.
The c1-8 attempt at fully eliding the lazy-init brif (egraph-folds
it to `trapz`) reshaped the Pulley dispatch sequence in a way that
*increased* Discarded-bucket pressure on iPhone 12 Icestorm by ~14 %
without a wallclock improvement at N=3. The c1-7 form (brif retained,
mask + tagged-pointer test) is the floor we keep.

Disas snapshot rewritten to the c1-7 form.
@cfallin
Copy link
Copy Markdown
Member

cfallin commented May 22, 2026

@matthargett could I request that (per our AI policy) you rewrite the PR description here? In particular, there are a bunch of phrases here that seem to be a locally-evolved jargon and are not very comprehensible:

  • "This is the predicate-factory PR." What is a predicate factory in this context? What does that mean? Are you just saying that this PR computes certain properties (predicates)? But then you say "Each elision saves a small per-call_indirect cost" -- so it sounds like this is not just computing predicates but using them?
  • "c1-7 vs c1-8 elision floor" -- what does this mean? What is c1-7? What is c1-8? (If I had to guess: these are Claude-shaped plan phase names?) And then what is an "elision floor" (other than, perhaps, a fancy new kind of home-construction product)? Please try to make sure the description doesn't have incomprehensible jargon -- define terms before you use them, unless they are standard (in our subfield) terms.
  • "...showed ~14 % Discarded-bucket increase" -- this reads like some sort of retro-encabulator description. What is the discarded bucket? Is it a bucket we have chosen to discard? Is it a bucket for things that are discarded? Why are they discarded? Who discarded them? Why is it bad that the discarded-bucket (... or the number of things within it) increases? There must be a whole experiment+measurement story here that's elided; unfortunately without that story it's hard to get any meaning out of this.

These are just examples; in general I'd like to see a description of the work that is aimed to actually communicate to a human, not dump a bunch of out-of-context details, especially before diving in to a 2k-line PR.

(And, to double-check per our AI policy: have you reviewed this whole PR by hand before posting it?)

@matthargett
Copy link
Copy Markdown
Contributor Author

@matthargett could I request that (per our AI policy) you rewrite the PR description here? In particular, there are a bunch of phrases here that seem to be a locally-evolved jargon and are not very comprehensible:

I did edit the PR description, by my own typing hands, and it folds in the feedback given about both the verbosity and the AI policy in the chat.

* "This is the predicate-factory PR." What is a predicate factory in this context? What does that mean? Are you just saying that this PR computes certain properties (predicates)? But then you say "Each elision saves a small per-`call_indirect` cost" -- so it sounds like this is not just computing predicates but using them?

sorry, this is a term we used in the code for my product called BugScan back in 2003. in that and this context, analyzing code to derive predicates that can be chained together for solving/elision purposes. here it's bits being set and not structs/objects, so "factory" wasn't a good metaphor choice.

This PR does the full chain: it does the analysis, sets the bits, and then the cranelift modifications uses those bits to do some elision of opcodes based on the proof of analysis that the predicates.

The reason there's a split between the PRs is that while there's some low-level CPU counters and profiler stats that improve just with this PR, but wallclock/e2e results on my devices didn't show an above-noise uplift.

* "c1-7 vs c1-8 elision floor" -- what does this mean? What is c1-7? What is c1-8? (If I had to guess: these are Claude-shaped plan phase names?) And then what is an "elision floor" (other than, perhaps, a fancy new kind of home-construction product)? Please try to make sure the description doesn't have incomprehensible jargon -- define terms before you use them, unless they are standard (in our subfield) terms.

c here means commit, again sorry for the sourceforge-era shorthand. I tried to keep the commit stack clean so that each change and why it was necessary was discrete and easy reason about one after the other (and not duplicate this information in the PR description). if it was one big commit, then some of the complexity around making the changes resilient to the fuzzer might seem unrelated or unnecessary.

the elision floor is when I wasn't getting any movement, even on CPU counter improvements, from trying to elide even more instructions. I'm not trying to use confusing metaphors on purpose, and I'm sorry my phrasing wasn't clearer.

* "...showed ~14 % Discarded-bucket increase" -- this reads like some sort of retro-encabulator description. What is the discarded bucket? Is it a bucket we have chosen to discard? Is it a bucket for things that are discarded? Why are they discarded? Who discarded them? Why is it bad that the discarded-bucket (... or the number of things within it) increases? There must be a whole experiment+measurement story here that's elided; unfortunately without that story it's hard to get any meaning out of this.

the "discarded" metric comes from the XCode profiling tools, specificallt xctrace's template for CPU bottlenecks. it's how many branch prediction mis-predicts happen, and its relevant in a bunch of interpreter performance work I've done over the years (starting with Tcl on the DEC Alpha and PowerPC).

These are just examples; in general I'd like to see a description of the work that is aimed to actually communicate to a human, not dump a bunch of out-of-context details, especially before diving in to a 2k-line PR.

sorry, I really did try to make it more straightforward and not repeat things the diff already communicates.

(And, to double-check per our AI policy: have you reviewed this whole PR by hand before posting it?)

I re-reviewed diffs inbetween each on-device benchmark pass, which took 30-40 minutes across the cross-section of physical devices I have at my house.

I know I'm not the world's best programmer or communicator, even after a few decades of practice. If you feel like a video/audio chat would help, I'm up for it. If it's just too much overhead for you all, that's okay: I can keep a clean patch stack in my fork and we can revisit (or not) at your leisure.

@matthargett
Copy link
Copy Markdown
Contributor Author

I did just notice that when cleaning up my fork's branch and splitting the PR that I lost some cleanup commits from my local clone. I'm fixing that now.

…ions

Active elem segments whose memory range extends past the destination
table's current size at instantiation behave as a runtime mutation:
the trailing entries get dropped, but only after they've been
considered for table-resize semantics. Treat the source table as
mutated when the module contains such a segment so the
call_indirect elisions don't over-fire.

Adds an integration test (`tests/all/leftover_elem_segment_soundness.rs`)
+ two disas filetests covering the leftover-segment shape.
@matthargett matthargett force-pushed the table-mutability-tracking-upstream branch from a40c9b3 to b2ab608 Compare May 22, 2026 07:11
@github-actions github-actions Bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

wasmtime:api Related to the API of the `wasmtime` crate itself

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants