Skip to content

Rewrite Model with bitmask wave and AC-4 propagator (~5x speedup)#10

Open
pusewicz wants to merge 32 commits into
mainfrom
optimize/bitmask-wfc
Open

Rewrite Model with bitmask wave and AC-4 propagator (~5x speedup)#10
pusewicz wants to merge 32 commits into
mainfrom
optimize/bitmask-wfc

Conversation

@pusewicz
Copy link
Copy Markdown
Owner

Summary

  • Replace the per-cell Tile-array design with a flat bitmask wave, a precomputed propagator[dir][tile] table, and an AC-4 compatible counter stored as a byte buffer (String#setbyte/getbyte). Propagation is iterative on an explicit stack; entropy is updated incrementally with a noise tiebreak; the chosen tile is cached per cell so grid lookups are O(1). Contradictions restart.
  • The original Model and Cell are preserved as LegacyModel for apples-to-apples benchmarking. Tile now stores raw edge signatures so both implementations share it. The Window calls model.grid once per draw instead of receiving a grid from every iterate.
  • bin/benchmark extended to compare both models across grid sizes; the CI JSON contract (10 runs of 20×20, avg/min/max/P90) is preserved.

Results (Ruby 4.0.1 + YJIT, arm64-darwin)

Grid Legacy median New median Speedup
10×10 0.293s 0.093s 3.2×
20×20 1.700s 0.374s 4.5×
30×30 4.057s 0.855s 4.7×
50×50 2.420s
100×100 10.295s
150×150 ~26.9s

The new model holds ~800–950 observations/sec across grid sizes (legacy is ~200 obs/sec and does not practically reach 50×50+).

Test plan

  • bundle exec rake test — 2 runs, 16 assertions, 0 failures
  • CI=1 bundle exec bin/benchmark — produces the JSON the existing GitHub workflow consumes
  • RUBY_YJIT_ENABLE=1 bundle exec bin/benchmark — sweeps Legacy vs New across sizes
  • Open the Gosu window (bundle exec ruby bin/run) and verify rendering, P pause, R restart, A append row

Replace the per-cell Tile-array design with a flat bitmask wave, a
precomputed `propagator[dir][tile]` table, and an AC-4 compatible counter
stored as a byte buffer. Propagation is iterative on an explicit stack;
entropy is updated incrementally with a noise tiebreak; the chosen tile is
cached per cell so `grid` lookups are O(1). Contradictions restart.

The original `Model` and `Cell` are preserved verbatim as `LegacyModel`
for direct benchmark comparison; `Tile` now stores raw edge signatures so
both implementations can share it. The Window calls `model.grid` once per
draw rather than receiving a grid from every `iterate`.

Speedup (Ruby 4.0.1 + YJIT, arm64-darwin):
  20×20:  1.70s -> 0.37s (4.5x)
  30×30:  4.06s -> 0.86s (4.7x)
  New scales to 100×100 in ~10s and 150×150 in ~27s where the legacy
  implementation is impractical.

`bin/benchmark` now compares both models across grid sizes; CI JSON
contract preserved.
Copilot AI review requested due to automatic review settings May 26, 2026 08:00
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR rewrites the core Wave Function Collapse Model to use a flat bitmask-based wave representation and an AC-4-style propagation counter for significantly faster constraint propagation, while preserving the original implementation as LegacyModel for benchmarking comparisons.

Changes:

  • Replaced the per-cell tile arrays with a bitmask wave, precomputed propagator tables, and a byte-buffer compatible-counter propagation loop.
  • Preserved the old implementation as LegacyModel/LegacyCell for apples-to-apples benchmarking and adjusted consumers (Window, tests) to use the new Model API (grid, entropy_at).
  • Expanded bin/benchmark to compare Legacy vs New across sizes while keeping the CI JSON contract; added bin/profile for RubyProf profiling.

Reviewed changes

Copilot reviewed 7 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
lib/wave_function_collapse/model.rb New bitmask-wave + AC-4-style propagation implementation; exposes grid/entropy_at; supports row prepend.
lib/wave_function_collapse/legacy_model.rb Introduces preserved legacy implementation for benchmarking comparisons.
lib/wave_function_collapse/tile.rb Changes edge representation to store raw (frozen) edge signatures shared by both models.
lib/wave_function_collapse/window.rb Updates rendering loop to call model.grid and model.entropy_at instead of relying on iterate returning a grid.
test/test_model.rb Updates tests for the new Model API and float percent semantics.
lib/wave_function_collapse/cell.rb Removes the old Cell implementation (moved into LegacyCell).
lib/wave_function_collapse.rb Updates autoloads to include LegacyModel and drop Cell.
bin/benchmark Reworks benchmark driver: CI JSON mode preserved; adds legacy vs new sweeps and summary output.
bin/profile Adds RubyProf-based profiling script for the new model.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/wave_function_collapse/model.rb Outdated
Comment on lines +496 to +501
idx = (nc * t_max + tp) * 4 + opp_d
count = compatible.getbyte(idx) - 1
compatible.setbyte(idx, count)
if count == 0 && (wave[nc] & (1 << tp)) != 0
ban(nc, tp)
return if @contradiction
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4738796 — went with hoisting the existing wave[nc] & bit check above the byte read instead of a saturating decrement: verified empirically that String#setbyte wraps on negative values rather than raising RangeError, so the visible failure mode you described doesn't occur, but the silent underflow you flagged is real and the guard avoids it (and the wasted work).

Comment thread lib/wave_function_collapse/model.rb Outdated
Comment on lines +194 to +201
block = ::String.new(::String.new.b, capacity: t_max * 4)
block.force_encoding(::Encoding::BINARY)
t = 0
while t < t_max
block << propagator_lists[0][t].length.chr
block << propagator_lists[1][t].length.chr
block << propagator_lists[2][t].length.chr
block << propagator_lists[3][t].length.chr
Comment thread lib/wave_function_collapse/model.rb Outdated
Comment on lines +166 to +178
# Weights
weights = ::Array.new(t_max)
log_weights = ::Array.new(t_max)
weights_log_weights = ::Array.new(t_max)
sum_w = 0.0
sum_w_log_w = 0.0
t = 0
while t < t_max
w = tiles[t].probability.to_f
weights[t] = w
lw = ::Math.log(w)
log_weights[t] = lw
weights_log_weights[t] = w * lw
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 5cf00c0.

Comment thread test/test_model.rb Outdated
Comment on lines 17 to 19
assert_in_delta(100.0 / (320 * 240), model.percent, 0.01)
refute model.complete?
assert model.solve
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 10ee256.

Comment thread lib/wave_function_collapse/model.rb Outdated
Comment on lines +225 to +226
@chosen_tile = ::Array.new(n, -1)
@uncollapsed_count = n
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in f216999.

pusewicz added 4 commits May 26, 2026 12:07
The propagator inner loop decremented `compatible[nc][tp][d]` for every
neighbour tile `tp` even when `tp` had already been banned at `nc`.
`String#setbyte` wraps on underflow rather than raising, so this was
silent — but the count drifted from reality and the work was wasted,
since the `wave[nc] & bit` check below it would suppress the ban call
anyway.  Hoist that bit check above the byte read so banned tiles are
skipped entirely.
`log_weights` was allocated and populated alongside `weights_log_weights`
but never read after assignment — the per-iteration `lw` local was used
both for the array slot and the `w * lw` product, then thrown away.
Fold the multiplication into a single local and remove the array.
The old assertion used `assert_in_delta(100.0 / (320 * 240), …, 0.01)`,
which only passed because the delta was wider than the value it was
comparing against — masking the fact that the new Model reports 0.0%
progress until the first observation. Pin the expected value directly.
`@uncollapsed_count` was set to `n` unconditionally, so when `tiles.size`
was 1 — every cell already trivially collapsed — `complete?` stayed
false and `iterate until model.complete?` looped forever. Detect the
single-tile case during state setup, pre-fill `@chosen_tile` with the
sole tile index, and start with zero uncollapsed cells so `grid` returns
a fully populated result on the first call.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 9 changed files in this pull request and generated 2 comments.

Comment thread lib/wave_function_collapse/model.rb Outdated
Comment on lines +194 to +199
t = 0
while t < t_max
block << propagator_lists[0][t].length.chr
block << propagator_lists[1][t].length.chr
block << propagator_lists[2][t].length.chr
block << propagator_lists[3][t].length.chr
Comment on lines 34 to 48
def update
@labels = []
@map = @model.solve if @map.nil?
if @map.nil?
@model.solve
@map = @model.grid
end

return if @paused

unless @model.complete?
time_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
@map = @model.iterate
@model.iterate
@times << Process.clock_gettime(Process::CLOCK_MONOTONIC) - time_start
@map = @model.grid
end
pusewicz added 19 commits May 26, 2026 12:58
`1 << t` was being recomputed on every propagation inner iteration,
every observation, and every ban — and for tile indices ≥ 62 those
left-shifts allocated a fresh Bignum each time. Materialise the masks
once at construction (`@bit[t]`) and read them by index from then on.
`c % w`, `c / w`, the `cx + DX[d]` / `cy + DY[d]` recomputation, the
four bounds compares, and `ny * w + nx` all run on every direction
iteration of the propagation, border-patch, and rebuild loops — and
none of it depends on the wave state. Cache the resolved neighbour
index per `(cell, direction)` at construction in a flat
`@neighbours[c * 4 + d]` table (-1 for off-grid) and let the hot loops
do a single Array lookup instead.
When a cell's `remaining` count hits 1, the surviving tile was found by
walking the wave mask one bit at a time — average ~T/2 iterations,
which is ~94 for the current tileset and fires once per uncollapsed
cell. The wave is a single-bit value at that point, so
`mask.bit_length - 1` gives the index directly via a C-implemented
Integer primitive.
`build_initial_compatible` ran on every contradiction restart, rebuilding
the full N×T×4 byte buffer from the interior block and re-applying the
border sentinels with `setbyte`. The result is fully determined by the
tileset and grid dimensions, so build it once at construction
(`@initial_compatible`) and `dup` it into `@compatible` on each
restart — a flat memcpy beats the per-byte border pass, and the win
grows with the grid.
Replaces 4*t_max one-byte `.chr` String allocations in build_propagator
with a single pre-sized zero-filled buffer written via setbyte. One-time
setup cost only, but removes setup-phase allocation noise.
Tilesets share edge signatures across many tiles. A class-level cache
keyed by packed wang IDs collapses 4-per-tile array allocations down to
one per unique signature. Array#hash is value-based so build_propagator's
edge_id dedup keeps working unchanged.
Pre-allocate the seven per-cell state arrays once in build_initial_state
and reset them in place via Array#fill (and a tight while loop for the
noise refresh) in setup_wave_state. Contradiction restarts no longer
discard and reallocate ~7*n slots' worth of Array objects.
Allocate @compatible once in build_initial_state and reset it via
String#replace on every restart instead of dup'ing @initial_compatible
into a fresh n*t_max*4-byte String. Same idea applies to
rebuild_compatible_from_wave, which now clears via a frozen 0xFF
sentinel (built once) and writes directly into @compatible.
Replace each `arr[w, n-w] + Array.new(w, val)` (three array allocations
per per-cell array, seven arrays total) with an in-place shift_uniform!
helper. Copies low-to-high so each source index is read before its
destination is overwritten. Adds a 3x3 prepend test to guard the shift
direction.
Trigger GC.compact at the tail of initialize so the frozen long-lived
data (propagator, neighbours, bit table, compatible template + fill
sentinel, weights) settles into old gen and doesn't fragment the heap
as solves allocate and free young-gen objects. Skipped on tiny grids
where compaction cost outweighs the win.
When GC_STATS=1 is set, bin/benchmark wraps each run_once with a
GC.start + GC.stat snapshot and reports per-run total_allocated_objects,
malloc_increase_bytes, minor_gc_count, and major_gc_count alongside the
existing timing columns. The CI JSON contract is preserved exactly —
the ENV["CI"] branch is unchanged.
Dedicated allocation-only measurement: N cold solves (Model.new + solve
to completion) and a streaming sub-bench (prepend_empty_row + iterate
on a single model, repeated). Reports median/p95/min/max for
total_allocated_objects, malloc_increase_bytes, minor/major GC counts,
and new shape transitions per measure block. Optional JSON dump via
JSON=path/to/out.json.
`ban` was called from the inner propagation loop on every supporter
that hit zero — millions of times per large solve. Each call paid for
method dispatch and re-read several instance variables that were
already lifted to locals in `propagate` (wave, bit table, etc.), plus
the `wave[c] & bit == 0` guard that the caller had just verified.

Inline the fast-path body directly at the call site. `ban` itself
stays callable for `orphan_ban_pass` and `observe`, where the guard
is still load-bearing.

Measured (Ruby 4.0.1 + YJIT, arm64-darwin, 3 runs each, medians):

  20×20:   0.359s -> 0.209s  (~42% faster)
  30×30:   0.806s -> 0.467s
  50×50:   2.263s -> 1.303s
  75×75:   5.344s -> 3.146s
  100×100: 9.715s -> 5.927s

obs/sec lifted from ~850 to ~1500 across the table.
`weights_log_weights[t] = w * Math.log(w)` is NaN when w is 0, because
Math.log(0) is -Infinity and 0 * -Infinity is NaN under IEEE 754. That
NaN flows into `@initial_sum_w_log_w`, every cell's entropy, and the
comparison inside `find_lowest_entropy_cell` (NaN < x is false for any
x), so the solver silently picks nothing and `model.iterate until
model.complete?` spins forever.

The mathematical limit of `w * log(w)` as `w → 0` is 0, so use that
explicitly. Add a regression test that exercises a zero-probability
tile end-to-end.
`@compatible` packs supporter counts into one byte per `(cell, tile,
direction)` entry. If any propagator list contains more than 255
entries, `String#setbyte` writes `length & 0xff` at build time —
silently wrapping. The downstream `orphan_ban_pass` then sees a fresh
cell whose counts are already zero and bans every tile, which can
drive `@uncollapsed_count` to 0 and `@contradiction` to true at the
same time. `complete?` returned `true` while the wave was actually
broken — a public-API invariant violation.

Validate the propagator lists at construction and raise
`WaveFunctionCollapse::Error` with a clear message when a list exceeds
255. Also strengthen `complete?` to require `!@contradiction`, so any
future hole in the wrapping argument can't make the public signal lie.
`prepend_empty_row` calls `rebuild_compatible_from_wave`,
`orphan_ban_pass`, and `propagate`, any of which can set
`@contradiction = true`. The method never inspected the flag, so it
returned `true` even on failure and left the wave in a half-mutated
state. The next `iterate` would then enter `observe_and_propagate`,
see `@contradiction == true`, call `setup_wave_state`, and silently
wipe every previously-streamed row — the exact thing the streaming
contract was supposed to prevent.

Clear the flag and the propagation stacks on entry so a leftover state
from a prior failure can't poison the new pass, then check the flag
after `propagate`. On contradiction, restore the model to a clean
blank state via `setup_wave_state` and return `false` so callers know
the prepend failed; on success, return `true` as before.
`prepend_empty_row` filled the new row's `@chosen_tile` slots with the
generic `-1` sentinel, but `setup_wave_state` has a special branch
that pre-fills `0` when `t_max == 1` (every cell is born collapsed).
Without the matching branch in the prepend path, `complete?` reported
true while `grid` returned `nil` for every new cell — the inserted row
rendered as a blank strip.
`(a << 16) | (b << 8) | c` only encodes the triple losslessly when each
component fits in 8 bits, so two distinct edges — for example
`[1, 0, 256]` and `[1, 1, 0]` — collided to the same cache key. The
hash then handed out one frozen Array for both, the propagator's
edge-ID dedup treated incompatible edges as equal, and the algorithm
silently allowed illegal adjacencies for any Wang ID ≥ 256.

Use the triple itself as the key. Array#hash and #eql? are
content-based, so the cache still dedupes correctly and the lookup is
collision-free for any integer components.
pusewicz added 8 commits May 27, 2026 02:00
A tileset that orphan_ban_pass can't fully reject — i.e. one where
setup_wave_state leaves some cells uncollapsed but the wave is
unsolvable — would push observe_and_propagate into an unbounded
observe → contradiction → setup_wave_state loop. The caller's
`iterate until complete?` then hangs with no progress and no signal.

Cap the loop at MAX_RESTARTS=100 consecutive contradictions and raise
`WaveFunctionCollapse::Error` instead of looping. Solvable tilesets
finish well under the cap; broken inputs now fail visibly.
Renderer hot path was the bottleneck for bin/run:
- @model.grid allocated a fresh 2-D Array every update
- draw_map iterated 64x36 cells and called draw_text_rel per
  uncollapsed cell, each frame
- @times.sort ran on an unbounded array every frame for P90/P99
- iterate was capped at one call per update tick (~60Hz)

Cache the grid into a Gosu macro via record(); rebuild only when
Model#generation advances. Run iterate in a per-update time budget
(ITERATE_BUDGET) and drop update_interval so the framework no longer
sleeps between ticks. Throttle map redraws to ~30Hz via needs_redraw?
to keep the macro rebuild from contending with iterate.

Add Model#generation (bumped from observe_and_propagate,
setup_wave_state, and prepend_empty_row success paths) and
Model#tile_id_at so the window can read state without going through
the allocating #grid accessor. Pre-build the 256 entropy overlay
colors. Cap @times to a 240-entry rolling buffer. Add E key to toggle
the entropy overlay.
- Accept "WxH" in SIZES/LEGACY_SIZES/NEW_SIZES, not just square N
- Add a unified SIZES env var; each model also falls back to the
  other's *_SIZES when only one is set
- Print the model/grid/cells prefix before runs so it's visible which
  size is being measured rather than appearing only after it finishes
- Right-align header columns to match the %9.3fs data columns
The hot path was bottlenecked on Bignum arithmetic: with 188 tiles, the
wave mask was a 188-bit Bignum, so every `wave[c] & bit` and `wave[c] ^
bit` in propagate/ban/observe allocated a fresh Bignum. Profiling showed
Integer#& at 31% of total wall time and ~11% in GC.

Store the wave as @chunk_count parallel Fixnum arrays of 62-bit chunks
(WAVE_CHUNK_BITS = 62 keeps each chunk in tagged-integer land on MRI).
Precompute @propagator_chunks[d][t][ch] in the same layout so the inner
loop in propagate becomes a Fixnum AND of the wave chunk with the
propagator mask, iterated via `m & -m` / `m ^= lowest`. This walks
exactly the tiles that still need a supporter-count decrement and skips
already-banned tiles for free — no per-tile bit test, no Bignum
allocations.

Observe, ban, orphan_ban_pass, and prepend_empty_row updated to the same
chunked iteration. rebuild_compatible_from_wave sums popcounts per chunk
instead of operating on a full-width Bignum.

At 64x36 with YJIT: 1.193s -> 0.438s median (~2.7x), 0 minor GCs per
run, ~19k allocations per run.
Noise (random jitter for tie-breaking among equal entropies) is set once
at setup and never changes. The find_lowest_entropy_cell scan was
reading both arrays and adding them every iteration — O(n) cells per
observe, ~3.7M cell scans total on a 64x36 solve.

Store `entropy + noise` directly in @entropy_noise and update it
whenever the entropy itself changes (ban + inlined ban in propagate).
@noise stays around so the addition can be recomputed when the entropy
is updated, but find_lowest_entropy_cell now reads one array per cell
instead of two.

At 64x36 over 30 runs: median 0.426s, best 0.399s.
The disable was a workaround for the original solver's heavy Bignum
allocation churn during propagation — pausing GC traded memory growth
for smoother frames. With the chunked-Fixnum wave the hot path
allocates almost nothing (a 64x36 solve runs with zero minor GCs), so
GC has nothing to do during a solve and leaving it disabled just lets
the process bloat over long sessions.
Three independent wins, primarily helping the non-YJIT interpreted path:

- Compatible-count storage was a single interleaved String of bytes
  indexed `(c * t_max + t) * 4 + d`. The inner loop's setbyte/getbyte
  calls are ~30% slower than Array `[]`/`[]=` in interpreted Ruby (per
  microbench). Split into four parallel Arrays of Fixnums, one per
  direction, indexed by the flat `c * t_max + t`. The inner loop no
  longer needs the `<<2` and `+ opp_d` arithmetic.

- Pre-permute by `OPP` into `@compatible_per_opp` so the propagation
  loop reads `compatible_per_opp[d]` directly instead of
  `compatible_per_dir[OPP[d]]` — one Array#[] instead of two.

- Replace the `m ^= lowest` advance with `m -= lowest`. Both clear the
  lowest set bit (we just extracted it), but the subtract form is
  consistently faster in the interpreter on this branch.

The trade-off: ~14 MB of compatible storage instead of 1.7 MB (Fixnum
slot is 8 bytes vs 1 byte) — still well within L3 on modern hardware.

64x36 timings:
- Non-YJIT: 1.326s → ~1.11s median (16% faster)
- YJIT: 0.426s → ~0.418s median (within noise)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants