cranelift: inline small constant-length `array.copy` by gfx · Pull Request #13460 · bytecodealliance/wasmtime

gfx · 2026-05-23T01:01:49Z

Small, statically sized array.copy operations on scalar elements currently go through the memory_copy libcall, just like every other size. For tiny copies the libcall's fixed per-call cost (a wasm↔host transition plus an indirect call) dominates the actual memmove, so array.copy is far slower than it should be for small fixed-size copies: slow enough that a hand-written array.get/array.set loop can beat it.

This PR expands such copies inline, as a sequence of loads followed by stores, when the element type is scalar and the length is a compile-time constant of at most 8 elements.
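As a rough illustration of the condition described above, the decision can be sketched as a small predicate. This is not the code in func_environ.rs: `MAX_INLINE_ELEMS`, `CopyStrategy`, and `choose_strategy` are hypothetical names, and treating a constant length of zero as trivially inline is an assumption of this sketch.

```rust
/// Hypothetical constant mirroring the "at most 8 elements" bound from the description.
const MAX_INLINE_ELEMS: u32 = 8;

/// Illustrative outcome of the decision; the real compiler has no such enum.
#[derive(Debug, PartialEq, Eq)]
enum CopyStrategy {
    /// Expand as `count` loads followed by `count` stores.
    Inline { count: u32 },
    /// Keep whatever path was used before this change.
    Fallback,
}

/// `is_scalar`: whether the array's element type is scalar.
/// `const_len`: `Some(n)` if the length is a compile-time constant, else `None`.
fn choose_strategy(is_scalar: bool, const_len: Option<u32>) -> CopyStrategy {
    match const_len {
        // Assumption: a constant length of 0 is treated as a trivially inline copy.
        Some(n) if is_scalar && n <= MAX_INLINE_ELEMS => CopyStrategy::Inline { count: n },
        // Dynamic lengths, larger lengths, and non-scalar elements are not inlined.
        _ => CopyStrategy::Fallback,
    }
}
```

With this sketch, `choose_strategy(true, Some(8))` selects the inline path while `choose_strategy(true, Some(9))` and `choose_strategy(true, None)` do not, matching the 8-versus-9 boundary the tests exercise.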

Speed

array.copy of N i32 elements, ns per copy, release build, lower is better:

N	before (libcall)	after (inline)	speedup
1	4.60	1.27	3.6×
2	3.36	1.27	2.6×
4	4.17	1.43	2.9×
8	4.88	2.00	2.4×

The inline path also beats the manual element-by-element loop that was previously the faster workaround. The gap is larger still in unoptimized builds, where the libcall body itself is not optimized (~24 ns/call drops to a couple ns).

Methodology

ns per whole-array copy = (wall time at K iterations − wall time at 0 iterations) / K, where K is the iteration count (not the element count N from the table), taking the min of several repetitions, default config, x86_64. The 0-iteration baseline cancels out module compilation and array allocation. The "before" column is measured with a dynamic length (forcing the libcall) and the "after" column with a constant length (inline), at the same element count N.
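The arithmetic behind that figure can be sketched as follows. This is not the benchmark harness used for the table: `ns_per_copy` and `min_of` are hypothetical helpers, and applying the minimum to each set of wall times separately (rather than to the differences) is an assumption about what "min of several repetitions" means.

```rust
/// Smallest value in a slice of wall times, or `None` if the slice is empty.
fn min_of(samples_ns: &[f64]) -> Option<f64> {
    samples_ns.iter().copied().reduce(f64::min)
}

/// Per-copy cost in nanoseconds.
///
/// `with_iters_ns`: wall times (ns) of repeated runs that perform `iters` copies.
/// `baseline_ns`: wall times (ns) of repeated runs that perform 0 copies, so they
/// contain only the fixed costs (module compilation, array allocation).
fn ns_per_copy(with_iters_ns: &[f64], baseline_ns: &[f64], iters: u64) -> Option<f64> {
    if iters == 0 {
        return None; // dividing by zero iterations is meaningless
    }
    let run = min_of(with_iters_ns)?;
    let base = min_of(baseline_ns)?;
    // Subtracting the baseline cancels the fixed costs; dividing gives ns per copy.
    Some((run - base) / iters as f64)
}
```

For example, with run times of 1300 ns and 1250 ns, baselines of 250 ns and 260 ns, and 500 iterations, the sketch yields (1250 − 250) / 500 = 2.0 ns per copy.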

Correctness & scope

Every element is loaded before any is stored, so overlapping source/destination ranges keep array.copy's memmove semantics.
Dynamic or larger lengths, and tables, still use the memory_copy libcall — there its memmove amortizes the call overhead.
The bound of 8 elements comes from measurement: the inline-vs-libcall crossover is around 16 elements, but the largest wins cluster at ≤ 8 while inline code size grows with the element count.
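To see why the load-everything-first ordering matters for overlapping ranges, here is a minimal model on a plain Rust slice. It is only an illustration, not Cranelift IR or the PR's code: both function names are hypothetical, and a single `i32` buffer stands in for a GC array that is both source and destination.

```rust
/// Models the inline expansion: every element is loaded before any is stored.
/// `count` is limited to 8 here to mirror the inline bound.
fn copy_loads_then_stores(buf: &mut [i32], dst: usize, src: usize, count: usize) {
    assert!(count <= 8, "this sketch only models the small inline case");
    let mut tmp = [0i32; 8];
    // Phase 1: all loads.
    tmp[..count].copy_from_slice(&buf[src..src + count]);
    // Phase 2: all stores.
    buf[dst..dst + count].copy_from_slice(&tmp[..count]);
}

/// For contrast: an interleaved forward copy, which re-reads elements it has
/// already overwritten when `dst > src` and the ranges overlap.
fn copy_interleaved_forward(buf: &mut [i32], dst: usize, src: usize, count: usize) {
    for i in 0..count {
        buf[dst + i] = buf[src + i];
    }
}
```

Copying 4 elements from index 0 to index 1 of `[1, 2, 3, 4, 5, 6]` gives `[1, 1, 2, 3, 4, 6]` with the loads-then-stores version, the same result as the standard library's `copy_within` (which has memmove semantics), whereas the interleaved forward loop smears the first element and produces `[1, 1, 1, 1, 1, 6]`.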

Tests

tests/disas/array-copy-inline.wat locks the codegen (the libcall is gone; loads-then-stores remain).
tests/misc_testsuite/gc/array-copy-inline.wast checks correctness for every element size (i8/i16/i32/i64/f32/f64/v128), the threshold boundary (8 inline vs 9 libcall), and overlapping copies; the wast harness runs it across the DRC, null, and copying collectors.

🤖 Generated with Claude Code

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an optimization to inline small, statically-sized array.copy operations in Cranelift to avoid the fixed overhead of the memory_copy libcall, and introduces tests to lock down both codegen shape and runtime semantics.

Changes:

Inline-expand constant-length array.copy for small copies (<= 8 elements) into explicit loads+stores with memmove semantics.
Add runtime GC tests covering element sizes/types, overlap semantics, and the 8/9-element threshold.
Add a disassembly-based “shape” test to ensure the inline expansion remains stable.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
tests/misc_testsuite/gc/array-copy-inline.wast	New runtime tests validating correctness across element sizes, threshold behavior, and overlap (`memmove`) semantics.
tests/disas/array-copy-inline.wat	New disassembly test to lock in the expected inline load-then-store codegen sequence.
crates/cranelift/src/func_environ.rs	Implements the inline expansion for small constant-length `array.copy` and a helper to detect constant lengths.

When a scalar-element `array.copy` has a compile-time-constant length of at most 8, expand it inline as loads-then-stores instead of calling the `memory_copy` libcall. The libcall's fixed per-call cost (a wasm/host transition and an indirect call) dominates for tiny copies, and is especially visible in unoptimized builds where the libcall body itself is not optimized.

Every element is loaded before any is stored so overlapping ranges keep memmove semantics. Dynamic or larger lengths, and tables, still use the libcall, whose `memmove` amortizes the overhead. The bound of 8 trades a little perf (the inline-vs-libcall crossover is ~16 elements) for bounded code size, capturing the largest wins, which cluster at <= 8 elements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reword `emit_inline_array_copy`'s doc to describe a bitwise copy by element width (it also handles `v128` and copies `f32`/`f64` via integer types), and replace the `elem_size`/`n` parameter shadowing with `stride`/`count`. No codegen change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gfx requested a review from a team as a code owner May 23, 2026 01:01

Copilot AI review requested due to automatic review settings May 23, 2026 01:01

gfx requested review from cfallin and removed request for a team May 23, 2026 01:01

Copilot AI reviewed May 23, 2026

Comment thread crates/cranelift/src/func_environ.rs Outdated

Comment thread crates/cranelift/src/func_environ.rs Outdated

gfx force-pushed the array-copy-inline branch from 7569784 to 1c48b7d Compare May 23, 2026 01:06

gfx wants to merge 2 commits into bytecodealliance:main from wado-lang:array-copy-inline
