cranelift: inline small constant-length array.copy #13460
Open
gfx wants to merge 2 commits into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an optimization to inline small, statically-sized array.copy operations in Cranelift to avoid the fixed overhead of the memory_copy libcall, and introduces tests to lock down both codegen shape and runtime semantics.
Changes:
- Inline-expand constant-length `array.copy` for small copies (<= 8 elements) into explicit loads+stores with `memmove` semantics.
- Add runtime GC tests covering element sizes/types, overlap semantics, and the 8/9-element threshold.
- Add a disassembly-based “shape” test to ensure the inline expansion remains stable.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/misc_testsuite/gc/array-copy-inline.wast | New runtime tests validating correctness across element sizes, threshold behavior, and overlap (memmove) semantics. |
| tests/disas/array-copy-inline.wat | New disassembly test to lock in the expected inline load-then-store codegen sequence. |
| crates/cranelift/src/func_environ.rs | Implements the inline expansion for small constant-length array.copy and a helper to detect constant lengths. |
When a scalar-element `array.copy` has a compile-time-constant length of at most 8, expand it inline as loads-then-stores instead of calling the `memory_copy` libcall. The libcall's fixed per-call cost (a wasm/host transition and an indirect call) dominates for tiny copies, and is especially visible in unoptimized builds where the libcall body itself is not optimized. Every element is loaded before any is stored so overlapping ranges keep memmove semantics. Dynamic or larger lengths, and tables, still use the libcall, whose `memmove` amortizes the overhead. The bound of 8 trades a little perf (the inline-vs-libcall crossover is ~16 elements) for bounded code size, capturing the largest wins, which cluster at <= 8 elements. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reword `emit_inline_array_copy`'s doc to describe a bitwise copy by element width (it also handles `v128` and copies `f32`/`f64` via integer types), and replace the `elem_size`/`n` parameter shadowing with `stride`/`count`. No codegen change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Small, statically-sized `array.copy`s of scalar elements currently go through the `memory_copy` libcall, just like every other size. For tiny copies the libcall's fixed per-call cost (a wasm↔host transition plus an indirect call) dominates the actual `memmove`, so `array.copy` is far slower than it should be for small fixed-size copies, to the point that a hand-written `array.get`/`array.set` loop can beat it.
This PR expands such copies inline, as a sequence of loads followed by stores, when the element type is scalar and the length is a compile-time constant of at most 8 elements.
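The condition described above can be sketched as a small decision function. This is only an illustration: `INLINE_MAX_ELEMS`, `CopyStrategy`, and `choose_strategy` are hypothetical names, not identifiers from `func_environ.rs`, and the sketch ignores the null and bounds checks that a real `array.copy` lowering still has to perform.

```rust
/// Hypothetical constant: the PR description gives 8 elements as the bound.
const INLINE_MAX_ELEMS: u64 = 8;

/// Hypothetical model of the two lowering paths.
#[derive(Debug, PartialEq)]
enum CopyStrategy {
    /// Expand into `count` loads followed by `count` stores.
    Inline { count: u64 },
    /// Fall back to the `memory_copy` libcall.
    Libcall,
}

/// `const_len` is `Some(n)` only when the length operand is a
/// compile-time constant; `elem_is_scalar` is false for non-scalar elements.
fn choose_strategy(elem_is_scalar: bool, const_len: Option<u64>) -> CopyStrategy {
    match const_len {
        Some(n) if elem_is_scalar && n <= INLINE_MAX_ELEMS => CopyStrategy::Inline { count: n },
        // Dynamic lengths, lengths above the bound, and non-scalar
        // elements all keep using the libcall.
        _ => CopyStrategy::Libcall,
    }
}
```

Under these assumptions, `choose_strategy(true, Some(8))` selects the inline path, while `choose_strategy(true, Some(9))` and `choose_strategy(true, None)` both select the libcall.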
Speed
`array.copy` of N `i32` elements, ns per copy, release build, lower is better:
The inline path also beats the manual element-by-element loop that was previously the faster workaround. The gap is larger still in unoptimized builds, where the libcall body itself is not optimized (~24 ns/call drops to a couple ns).
Methodology
ns per whole-array copy = (wall time at N iterations − wall time at 0 iterations) / N, taking the min of several repetitions, default config, x86_64. The 0-iteration baseline cancels out module compilation and array allocation. The "before" column is measured with a dynamic length (forcing the libcall) and the "after" column with a constant length (inline), at the same N.
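The formula above can be written out as a small helper. This is a sketch under one stated assumption: the "min of several repetitions" is applied to each wall-time measurement separately before subtracting. The function name `ns_per_copy` and its parameters are hypothetical and are not taken from the PR's benchmark harness.

```rust
/// Sketch of: ns per copy = (wall time at N iterations - wall time at 0 iterations) / N.
///
/// `with_iters_ns`: repeated wall-time samples (in ns) of a run doing `n` copies.
/// `baseline_ns`:   repeated wall-time samples (in ns) of a run doing 0 copies.
/// Returns `None` if `n` is zero or either sample set is empty.
fn ns_per_copy(with_iters_ns: &[u64], baseline_ns: &[u64], n: u64) -> Option<f64> {
    if n == 0 {
        return None;
    }
    // Assumption: take the minimum of each sample set to suppress noise.
    let with_min = *with_iters_ns.iter().min()?;
    let base_min = *baseline_ns.iter().min()?;
    // The baseline cancels fixed costs such as compilation and allocation.
    Some(with_min.saturating_sub(base_min) as f64 / n as f64)
}
```

With purely illustrative inputs (not measurements from this PR), `ns_per_copy(&[1_200, 1_100, 1_150], &[100, 120], 100)` yields `Some(10.0)`, since (1100 − 100) / 100 = 10.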
Correctness & scope
- Every element is loaded before any is stored, so overlapping ranges keep `array.copy`'s `memmove` semantics.
- Dynamic or larger lengths, and tables, still use the `memory_copy` libcall; there its `memmove` amortizes the call overhead.
Tests
- `tests/disas/array-copy-inline.wat` locks the codegen (the libcall is gone; loads-then-stores remain).
- `tests/misc_testsuite/gc/array-copy-inline.wast` checks correctness for every element size (`i8`/`i16`/`i32`/`i64`/`f32`/`f64`/`v128`), the threshold boundary (8 inline vs 9 libcall), and overlapping copies; the wast harness runs it across the DRC, null, and copying collectors.
🤖 Generated with Claude Code
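The overlap case that the tests exercise can be modeled outside Cranelift. The sketch below is an assumption-laden illustration, not the PR's code: it works on a plain `u32` slice, uses a temporary `Vec` to stand in for the loaded values (which the real expansion would hold as SSA values), and both function names are hypothetical. It contrasts the loads-then-stores order with an interleaved ascending copy, which does not give `memmove` semantics when the destination overlaps the source at a higher index.

```rust
/// Model of the inline expansion: load every source element first,
/// then store them all. Matches `memmove` for overlapping ranges.
fn copy_loads_then_stores(buf: &mut [u32], dst: usize, src: usize, count: usize) {
    // Phase 1: all loads (a Vec stands in for values held in registers).
    let loaded: Vec<u32> = buf[src..src + count].to_vec();
    // Phase 2: all stores.
    buf[dst..dst + count].copy_from_slice(&loaded);
}

/// Counter-example: interleaving each load with its store in ascending
/// order re-reads elements that were already overwritten when dst > src.
fn copy_interleaved_forward(buf: &mut [u32], dst: usize, src: usize, count: usize) {
    for i in 0..count {
        buf[dst + i] = buf[src + i];
    }
}
```

Copying 4 elements from index 0 to index 2 of `[1, 2, 3, 4, 5, 6]` gives `[1, 2, 1, 2, 3, 4]` with `copy_loads_then_stores`, the same result as the standard library's `copy_within(0..4, 2)`, whereas `copy_interleaved_forward` produces `[1, 2, 1, 2, 1, 2]`.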