
cranelift: inline small constant-length array.copy #13460

Open: gfx wants to merge 2 commits into bytecodealliance:main from wado-lang:array-copy-inline


@gfx gfx commented May 23, 2026

Small, statically sized array.copy operations on scalar elements currently go through the memory_copy libcall, just like every other size. For tiny copies the libcall's fixed per-call cost (a wasm↔host transition plus an indirect call) dominates the actual memmove, so array.copy is far slower than it should be for small fixed-size copies; a hand-written array.get/array.set loop can beat it.

This PR expands such copies inline, as a sequence of loads followed by stores, when the element type is scalar and the length is a compile-time constant of at most 8 elements.
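The eligibility rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions: `CopyLen`, `MAX_INLINE_ELEMS`, and `should_inline` are hypothetical names and are not taken from the PR's `func_environ.rs` changes.

```rust
/// Hypothetical stand-in for how a translator might see the length operand:
/// either a value known at compile time or one only known at run time.
#[derive(Clone, Copy, Debug, PartialEq)]
enum CopyLen {
    Const(u32),
    Dynamic,
}

/// The 8-element bound stated in the PR description.
const MAX_INLINE_ELEMS: u32 = 8;

/// Returns true when a copy would take the inline loads-then-stores path
/// under the rule in the description: scalar elements and a constant
/// length of at most 8. Everything else falls back to the libcall.
fn should_inline(elem_is_scalar: bool, len: CopyLen) -> bool {
    match len {
        CopyLen::Const(n) => elem_is_scalar && n <= MAX_INLINE_ELEMS,
        CopyLen::Dynamic => false,
    }
}
```

With this predicate, a constant length of 8 is inlined while 9 is not, which matches the 8-versus-9 boundary the tests exercise.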

Speed

array.copy of N i32 elements, ns per copy, release build, lower is better:

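| N | before (libcall), ns | after (inline), ns | speedup |
|---|----------------------|--------------------|---------|
| 1 | 4.60                 | 1.27               | 3.6×    |
| 2 | 3.36                 | 1.27               | 2.6×    |
| 4 | 4.17                 | 1.43               | 2.9×    |
| 8 | 4.88                 | 2.00               | 2.4×    |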

The inline path also beats the manual element-by-element loop that was previously the faster workaround. The gap is larger still in unoptimized builds, where the libcall body itself is not optimized (~24 ns per call drops to a couple of ns).

Methodology

ns per whole-array copy = (wall time at K iterations − wall time at 0 iterations) / K, where K is the iteration count, taking the min of several repetitions, with the default config on x86_64. The 0-iteration baseline cancels out module compilation and array allocation. The "before" column is measured with a dynamic length (forcing the libcall) and the "after" column with a constant length (inline), at the same element count N.
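As a rough sketch of the arithmetic in this methodology, the per-copy figure can be computed from raw wall-time samples as follows. The function names are hypothetical, and one detail is an assumption: the description does not say exactly where the minimum is taken, so this version takes the minimum of the loaded runs and of the 0-iteration baseline runs separately before subtracting.

```rust
/// Smallest sample in a slice, or None if the slice is empty.
fn min_of(samples: &[f64]) -> Option<f64> {
    samples.iter().copied().reduce(f64::min)
}

/// Nanoseconds per whole-array copy.
///
/// `loaded_ns`   : wall times (ns) of runs that perform `iters` copies
/// `baseline_ns` : wall times (ns) of runs that perform 0 copies, which
///                 still pay for module compilation and array allocation
/// `iters`       : number of copies performed in each loaded run
///
/// Assumption: the min is taken per series, then the baseline is subtracted.
fn ns_per_copy(loaded_ns: &[f64], baseline_ns: &[f64], iters: u64) -> Option<f64> {
    if iters == 0 {
        return None; // avoid dividing by zero
    }
    let loaded = min_of(loaded_ns)?;
    let baseline = min_of(baseline_ns)?;
    Some((loaded - baseline) / iters as f64)
}
```

Subtracting the baseline before dividing is what keeps one-time setup cost from being attributed to each copy.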

Correctness & scope

  • Every element is loaded before any is stored, so overlapping source/destination ranges keep array.copy's memmove semantics.
  • Dynamic or larger lengths, and tables, still use the memory_copy libcall — there its memmove amortizes the call overhead.
  • The bound of 8 elements comes from measurement: the inline-vs-libcall crossover is around 16 elements, but the largest wins cluster at ≤ 8 while inline code size grows with the element count.
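The first point, that loading every element before storing any preserves memmove semantics, can be illustrated with a plain Rust model. This is not the Cranelift code from the PR: it models only the same-array case on a `u32` slice, the function names are hypothetical, and the interleaved variant is included purely as a counter-example.

```rust
/// Model of the inline expansion: read all `count` source elements into
/// temporaries first, then write them to the destination.
/// `count` is capped at 8 to mirror the PR's bound.
fn copy_loads_then_stores(arr: &mut [u32], dst: usize, src: usize, count: usize) {
    assert!(count <= 8, "the inline path only covers up to 8 elements");
    let mut tmp = [0u32; 8];
    // Phase 1: all loads. Slicing panics if the range is out of bounds.
    tmp[..count].copy_from_slice(&arr[src..src + count]);
    // Phase 2: all stores.
    arr[dst..dst + count].copy_from_slice(&tmp[..count]);
}

/// Counter-example: a forward copy that interleaves each load with its
/// store. When dst > src and the ranges overlap, later loads observe
/// values that earlier stores already overwrote.
fn copy_interleaved_forward(arr: &mut [u32], dst: usize, src: usize, count: usize) {
    for i in 0..count {
        arr[dst + i] = arr[src + i];
    }
}
```

Copying 4 elements from index 0 to index 1 of `[1, 2, 3, 4, 5]` gives `[1, 1, 2, 3, 4]` with the loads-then-stores version, while the interleaved forward loop smears the first element and gives `[1, 1, 1, 1, 1]`.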

Tests

  • tests/disas/array-copy-inline.wat locks the codegen (the libcall is gone; loads-then-stores remain).
  • tests/misc_testsuite/gc/array-copy-inline.wast checks correctness for every element size (i8/i16/i32/i64/f32/f64/v128), the threshold boundary (8 inline vs 9 libcall), and overlapping copies; the wast harness runs it across the DRC, null, and copying collectors.

🤖 Generated with Claude Code

@gfx gfx requested a review from a team as a code owner May 23, 2026 01:01
Copilot AI review requested due to automatic review settings May 23, 2026 01:01
@gfx gfx requested review from cfallin and removed request for a team May 23, 2026 01:01

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds an optimization to inline small, statically-sized array.copy operations in Cranelift to avoid the fixed overhead of the memory_copy libcall, and introduces tests to lock down both codegen shape and runtime semantics.

Changes:

  • Inline-expand constant-length array.copy for small copies (<= 8 elements) into explicit loads+stores with memmove semantics.
  • Add runtime GC tests covering element sizes/types, overlap semantics, and the 8/9-element threshold.
  • Add a disassembly-based “shape” test to ensure the inline expansion remains stable.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
|------|-------------|
| tests/misc_testsuite/gc/array-copy-inline.wast | New runtime tests validating correctness across element sizes, threshold behavior, and overlap (memmove) semantics. |
| tests/disas/array-copy-inline.wat | New disassembly test to lock in the expected inline load-then-store codegen sequence. |
| crates/cranelift/src/func_environ.rs | Implements the inline expansion for small constant-length array.copy and a helper to detect constant lengths. |


Comment thread crates/cranelift/src/func_environ.rs Outdated
Comment thread crates/cranelift/src/func_environ.rs Outdated
When a scalar-element `array.copy` has a compile-time-constant length of at
most 8, expand it inline as loads-then-stores instead of calling the
`memory_copy` libcall. The libcall's fixed per-call cost (a wasm/host
transition and an indirect call) dominates for tiny copies, and is
especially visible in unoptimized builds where the libcall body itself is
not optimized.

Every element is loaded before any is stored so overlapping ranges keep
memmove semantics. Dynamic or larger lengths, and tables, still use the
libcall, whose `memmove` amortizes the overhead. The bound of 8 trades a
little perf (the inline-vs-libcall crossover is ~16 elements) for bounded
code size, capturing the largest wins, which cluster at <= 8 elements.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gfx gfx force-pushed the array-copy-inline branch from 7569784 to 1c48b7d Compare May 23, 2026 01:06
Reword `emit_inline_array_copy`'s doc to describe a bitwise copy by element
width (it also handles `v128` and copies `f32`/`f64` via integer types), and
replace the `elem_size`/`n` parameter shadowing with `stride`/`count`. No
codegen change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
