Rewrite Model with bitmask wave and AC-4 propagator (~5x speedup)#10
Rewrite Model with bitmask wave and AC-4 propagator (~5x speedup)#10pusewicz wants to merge 32 commits into
Conversation
Replace the per-cell Tile-array design with a flat bitmask wave, a precomputed `propagator[dir][tile]` table, and an AC-4 compatible counter stored as a byte buffer. Propagation is iterative on an explicit stack; entropy is updated incrementally with a noise tiebreak; the chosen tile is cached per cell so `grid` lookups are O(1). Contradictions restart. The original `Model` and `Cell` are preserved verbatim as `LegacyModel` for direct benchmark comparison; `Tile` now stores raw edge signatures so both implementations can share it. The Window calls `model.grid` once per draw rather than receiving a grid from every `iterate`. Speedup (Ruby 4.0.1 + YJIT, arm64-darwin): 20×20: 1.70s -> 0.37s (4.5x) 30×30: 4.06s -> 0.86s (4.7x) New scales to 100×100 in ~10s and 150×150 in ~27s where the legacy implementation is impractical. `bin/benchmark` now compares both models across grid sizes; CI JSON contract preserved.
There was a problem hiding this comment.
Pull request overview
This PR rewrites the core Wave Function Collapse Model to use a flat bitmask-based wave representation and an AC-4-style propagation counter for significantly faster constraint propagation, while preserving the original implementation as LegacyModel for benchmarking comparisons.
Changes:
- Replaced the per-cell tile arrays with a bitmask wave, precomputed propagator tables, and a byte-buffer compatible-counter propagation loop.
- Preserved the old implementation as
LegacyModel/LegacyCellfor apples-to-apples benchmarking and adjusted consumers (Window, tests) to use the newModelAPI (grid,entropy_at). - Expanded
bin/benchmarkto compare Legacy vs New across sizes while keeping the CI JSON contract; addedbin/profilefor RubyProf profiling.
Reviewed changes
Copilot reviewed 7 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
lib/wave_function_collapse/model.rb |
New bitmask-wave + AC-4-style propagation implementation; exposes grid/entropy_at; supports row prepend. |
lib/wave_function_collapse/legacy_model.rb |
Introduces preserved legacy implementation for benchmarking comparisons. |
lib/wave_function_collapse/tile.rb |
Changes edge representation to store raw (frozen) edge signatures shared by both models. |
lib/wave_function_collapse/window.rb |
Updates rendering loop to call model.grid and model.entropy_at instead of relying on iterate returning a grid. |
test/test_model.rb |
Updates tests for the new Model API and float percent semantics. |
lib/wave_function_collapse/cell.rb |
Removes the old Cell implementation (moved into LegacyCell). |
lib/wave_function_collapse.rb |
Updates autoloads to include LegacyModel and drop Cell. |
bin/benchmark |
Reworks benchmark driver: CI JSON mode preserved; adds legacy vs new sweeps and summary output. |
bin/profile |
Adds RubyProf-based profiling script for the new model. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| idx = (nc * t_max + tp) * 4 + opp_d | ||
| count = compatible.getbyte(idx) - 1 | ||
| compatible.setbyte(idx, count) | ||
| if count == 0 && (wave[nc] & (1 << tp)) != 0 | ||
| ban(nc, tp) | ||
| return if @contradiction |
There was a problem hiding this comment.
Addressed in 4738796 — went with hoisting the existing wave[nc] & bit check above the byte read instead of a saturating decrement: verified empirically that String#setbyte wraps on negative values rather than raising RangeError, so the visible failure mode you described doesn't occur, but the silent underflow you flagged is real and the guard avoids it (and the wasted work).
| block = ::String.new(::String.new.b, capacity: t_max * 4) | ||
| block.force_encoding(::Encoding::BINARY) | ||
| t = 0 | ||
| while t < t_max | ||
| block << propagator_lists[0][t].length.chr | ||
| block << propagator_lists[1][t].length.chr | ||
| block << propagator_lists[2][t].length.chr | ||
| block << propagator_lists[3][t].length.chr |
| # Weights | ||
| weights = ::Array.new(t_max) | ||
| log_weights = ::Array.new(t_max) | ||
| weights_log_weights = ::Array.new(t_max) | ||
| sum_w = 0.0 | ||
| sum_w_log_w = 0.0 | ||
| t = 0 | ||
| while t < t_max | ||
| w = tiles[t].probability.to_f | ||
| weights[t] = w | ||
| lw = ::Math.log(w) | ||
| log_weights[t] = lw | ||
| weights_log_weights[t] = w * lw |
| assert_in_delta(100.0 / (320 * 240), model.percent, 0.01) | ||
| refute model.complete? | ||
| assert model.solve |
| @chosen_tile = ::Array.new(n, -1) | ||
| @uncollapsed_count = n |
The propagator inner loop decremented `compatible[nc][tp][d]` for every neighbour tile `tp` even when `tp` had already been banned at `nc`. `String#setbyte` wraps on underflow rather than raising, so this was silent — but the count drifted from reality and the work was wasted, since the `wave[nc] & bit` check below it would suppress the ban call anyway. Hoist that bit check above the byte read so banned tiles are skipped entirely.
`log_weights` was allocated and populated alongside `weights_log_weights` but never read after assignment — the per-iteration `lw` local was used both for the array slot and the `w * lw` product, then thrown away. Fold the multiplication into a single local and remove the array.
The old assertion used `assert_in_delta(100.0 / (320 * 240), …, 0.01)`, which only passed because the delta was wider than the value it was comparing against — masking the fact that the new Model reports 0.0% progress until the first observation. Pin the expected value directly.
`@uncollapsed_count` was set to `n` unconditionally, so when `tiles.size` was 1 — every cell already trivially collapsed — `complete?` stayed false and `iterate until model.complete?` looped forever. Detect the single-tile case during state setup, pre-fill `@chosen_tile` with the sole tile index, and start with zero uncollapsed cells so `grid` returns a fully populated result on the first call.
| t = 0 | ||
| while t < t_max | ||
| block << propagator_lists[0][t].length.chr | ||
| block << propagator_lists[1][t].length.chr | ||
| block << propagator_lists[2][t].length.chr | ||
| block << propagator_lists[3][t].length.chr |
| def update | ||
| @labels = [] | ||
| @map = @model.solve if @map.nil? | ||
| if @map.nil? | ||
| @model.solve | ||
| @map = @model.grid | ||
| end | ||
|
|
||
| return if @paused | ||
|
|
||
| unless @model.complete? | ||
| time_start = Process.clock_gettime(Process::CLOCK_MONOTONIC) | ||
| @map = @model.iterate | ||
| @model.iterate | ||
| @times << Process.clock_gettime(Process::CLOCK_MONOTONIC) - time_start | ||
| @map = @model.grid | ||
| end |
`1 << t` was being recomputed on every propagation inner iteration, every observation, and every ban — and for tile indices ≥ 62 those left-shifts allocated a fresh Bignum each time. Materialise the masks once at construction (`@bit[t]`) and read them by index from then on.
`c % w`, `c / w`, the `cx + DX[d]` / `cy + DY[d]` recomputation, the four bounds compares, and `ny * w + nx` all run on every direction iteration of the propagation, border-patch, and rebuild loops — and none of it depends on the wave state. Cache the resolved neighbour index per `(cell, direction)` at construction in a flat `@neighbours[c * 4 + d]` table (-1 for off-grid) and let the hot loops do a single Array lookup instead.
When a cell's `remaining` count hits 1, the surviving tile was found by walking the wave mask one bit at a time — average ~T/2 iterations, which is ~94 for the current tileset and fires once per uncollapsed cell. The wave is a single-bit value at that point, so `mask.bit_length - 1` gives the index directly via a C-implemented Integer primitive.
`build_initial_compatible` ran on every contradiction restart, rebuilding the full N×T×4 byte buffer from the interior block and re-applying the border sentinels with `setbyte`. The result is fully determined by the tileset and grid dimensions, so build it once at construction (`@initial_compatible`) and `dup` it into `@compatible` on each restart — a flat memcpy beats the per-byte border pass, and the win grows with the grid.
Replaces 4*t_max one-byte `.chr` String allocations in build_propagator with a single pre-sized zero-filled buffer written via setbyte. One-time setup cost only, but removes setup-phase allocation noise.
Tilesets share edge signatures across many tiles. A class-level cache keyed by packed wang IDs collapses 4-per-tile array allocations down to one per unique signature. Array#hash is value-based so build_propagator's edge_id dedup keeps working unchanged.
Pre-allocate the seven per-cell state arrays once in build_initial_state and reset them in place via Array#fill (and a tight while loop for the noise refresh) in setup_wave_state. Contradiction restarts no longer discard and reallocate ~7*n slots' worth of Array objects.
Allocate @compatible once in build_initial_state and reset it via String#replace on every restart instead of dup'ing @initial_compatible into a fresh n*t_max*4-byte String. Same idea applies to rebuild_compatible_from_wave, which now clears via a frozen 0xFF sentinel (built once) and writes directly into @compatible.
Replace each `arr[w, n-w] + Array.new(w, val)` (three array allocations per per-cell array, seven arrays total) with an in-place shift_uniform! helper. Copies low-to-high so each source index is read before its destination is overwritten. Adds a 3x3 prepend test to guard the shift direction.
Trigger GC.compact at the tail of initialize so the frozen long-lived data (propagator, neighbours, bit table, compatible template + fill sentinel, weights) settles into old gen and doesn't fragment the heap as solves allocate and free young-gen objects. Skipped on tiny grids where compaction cost outweighs the win.
When GC_STATS=1 is set, bin/benchmark wraps each run_once with a GC.start + GC.stat snapshot and reports per-run total_allocated_objects, malloc_increase_bytes, minor_gc_count, and major_gc_count alongside the existing timing columns. The CI JSON contract is preserved exactly — the ENV["CI"] branch is unchanged.
Dedicated allocation-only measurement: N cold solves (Model.new + solve to completion) and a streaming sub-bench (prepend_empty_row + iterate on a single model, repeated). Reports median/p95/min/max for total_allocated_objects, malloc_increase_bytes, minor/major GC counts, and new shape transitions per measure block. Optional JSON dump via JSON=path/to/out.json.
`ban` was called from the inner propagation loop on every supporter that hit zero — millions of times per large solve. Each call paid for method dispatch and re-read several instance variables that were already lifted to locals in `propagate` (wave, bit table, etc.), plus the `wave[c] & bit == 0` guard that the caller had just verified. Inline the fast-path body directly at the call site. `ban` itself stays callable for `orphan_ban_pass` and `observe`, where the guard is still load-bearing. Measured (Ruby 4.0.1 + YJIT, arm64-darwin, 3 runs each, medians): 20×20: 0.359s -> 0.209s (~42% faster) 30×30: 0.806s -> 0.467s 50×50: 2.263s -> 1.303s 75×75: 5.344s -> 3.146s 100×100: 9.715s -> 5.927s obs/sec lifted from ~850 to ~1500 across the table.
`weights_log_weights[t] = w * Math.log(w)` is NaN when w is 0, because Math.log(0) is -Infinity and 0 * -Infinity is NaN under IEEE 754. That NaN flows into `@initial_sum_w_log_w`, every cell's entropy, and the comparison inside `find_lowest_entropy_cell` (NaN < x is false for any x), so the solver silently picks nothing and `model.iterate until model.complete?` spins forever. The mathematical limit of `w * log(w)` as `w → 0` is 0, so use that explicitly. Add a regression test that exercises a zero-probability tile end-to-end.
`@compatible` packs supporter counts into one byte per `(cell, tile, direction)` entry. If any propagator list contains more than 255 entries, `String#setbyte` writes `length & 0xff` at build time — silently wrapping. The downstream `orphan_ban_pass` then sees a fresh cell whose counts are already zero and bans every tile, which can drive `@uncollapsed_count` to 0 and `@contradiction` to true at the same time. `complete?` returned `true` while the wave was actually broken — a public-API invariant violation. Validate the propagator lists at construction and raise `WaveFunctionCollapse::Error` with a clear message when a list exceeds 255. Also strengthen `complete?` to require `!@contradiction`, so any future hole in the wrapping argument can't make the public signal lie.
`prepend_empty_row` calls `rebuild_compatible_from_wave`, `orphan_ban_pass`, and `propagate`, any of which can set `@contradiction = true`. The method never inspected the flag, so it returned `true` even on failure and left the wave in a half-mutated state. The next `iterate` would then enter `observe_and_propagate`, see `@contradiction == true`, call `setup_wave_state`, and silently wipe every previously-streamed row — the exact thing the streaming contract was supposed to prevent. Clear the flag and the propagation stacks on entry so a leftover state from a prior failure can't poison the new pass, then check the flag after `propagate`. On contradiction, restore the model to a clean blank state via `setup_wave_state` and return `false` so callers know the prepend failed; on success, return `true` as before.
`prepend_empty_row` filled the new row's `@chosen_tile` slots with the generic `-1` sentinel, but `setup_wave_state` has a special branch that pre-fills `0` when `t_max == 1` (every cell is born collapsed). Without the matching branch in the prepend path, `complete?` reported true while `grid` returned `nil` for every new cell — the inserted row rendered as a blank strip.
`(a << 16) | (b << 8) | c` only encodes the triple losslessly when each component fits in 8 bits, so two distinct edges — for example `[1, 0, 256]` and `[1, 1, 0]` — collided to the same cache key. The hash then handed out one frozen Array for both, the propagator's edge-ID dedup treated incompatible edges as equal, and the algorithm silently allowed illegal adjacencies for any Wang ID ≥ 256. Use the triple itself as the key. Array#hash and #eql? are content-based, so the cache still dedupes correctly and the lookup is collision-free for any integer components.
A tileset that orphan_ban_pass can't fully reject — i.e. one where setup_wave_state leaves some cells uncollapsed but the wave is unsolvable — would push observe_and_propagate into an unbounded observe → contradiction → setup_wave_state loop. The caller's `iterate until complete?` then hangs with no progress and no signal. Cap the loop at MAX_RESTARTS=100 consecutive contradictions and raise `WaveFunctionCollapse::Error` instead of looping. Solvable tilesets finish well under the cap; broken inputs now fail visibly.
Renderer hot path was the bottleneck for bin/run: - @model.grid allocated a fresh 2-D Array every update - draw_map iterated 64x36 cells and called draw_text_rel per uncollapsed cell, each frame - @times.sort ran on an unbounded array every frame for P90/P99 - iterate was capped at one call per update tick (~60Hz) Cache the grid into a Gosu macro via record(); rebuild only when Model#generation advances. Run iterate in a per-update time budget (ITERATE_BUDGET) and drop update_interval so the framework no longer sleeps between ticks. Throttle map redraws to ~30Hz via needs_redraw? to keep the macro rebuild from contending with iterate. Add Model#generation (bumped from observe_and_propagate, setup_wave_state, and prepend_empty_row success paths) and Model#tile_id_at so the window can read state without going through the allocating #grid accessor. Pre-build the 256 entropy overlay colors. Cap @times to a 240-entry rolling buffer. Add E key to toggle the entropy overlay.
- Accept "WxH" in SIZES/LEGACY_SIZES/NEW_SIZES, not just square N - Add a unified SIZES env var; each model also falls back to the other's *_SIZES when only one is set - Print the model/grid/cells prefix before runs so it's visible which size is being measured rather than appearing only after it finishes - Right-align header columns to match the %9.3fs data columns
The hot path was bottlenecked on Bignum arithmetic: with 188 tiles, the wave mask was a 188-bit Bignum, so every `wave[c] & bit` and `wave[c] ^ bit` in propagate/ban/observe allocated a fresh Bignum. Profiling showed Integer#& at 31% of total wall time and ~11% in GC. Store the wave as @chunk_count parallel Fixnum arrays of 62-bit chunks (WAVE_CHUNK_BITS = 62 keeps each chunk in tagged-integer land on MRI). Precompute @propagator_chunks[d][t][ch] in the same layout so the inner loop in propagate becomes a Fixnum AND of the wave chunk with the propagator mask, iterated via `m & -m` / `m ^= lowest`. This walks exactly the tiles that still need a supporter-count decrement and skips already-banned tiles for free — no per-tile bit test, no Bignum allocations. Observe, ban, orphan_ban_pass, and prepend_empty_row updated to the same chunked iteration. rebuild_compatible_from_wave sums popcounts per chunk instead of operating on a full-width Bignum. At 64x36 with YJIT: 1.193s -> 0.438s median (~2.7x), 0 minor GCs per run, ~19k allocations per run.
Noise (random jitter for tie-breaking among equal entropies) is set once at setup and never changes. The find_lowest_entropy_cell scan was reading both arrays and adding them every iteration — O(n) cells per observe, ~3.7M cell scans total on a 64x36 solve. Store `entropy + noise` directly in @entropy_noise and update it whenever the entropy itself changes (ban + inlined ban in propagate). @noise stays around so the addition can be recomputed when the entropy is updated, but find_lowest_entropy_cell now reads one array per cell instead of two. At 64x36 over 30 runs: median 0.426s, best 0.399s.
The disable was a workaround for the original solver's heavy Bignum allocation churn during propagation — pausing GC traded memory growth for smoother frames. With the chunked-Fixnum wave the hot path allocates almost nothing (a 64x36 solve runs with zero minor GCs), so GC has nothing to do during a solve and leaving it disabled just lets the process bloat over long sessions.
Three independent wins, primarily helping the non-YJIT interpreted path: - Compatible-count storage was a single interleaved String of bytes indexed `(c * t_max + t) * 4 + d`. The inner loop's setbyte/getbyte calls are ~30% slower than Array `[]`/`[]=` in interpreted Ruby (per microbench). Split into four parallel Arrays of Fixnums, one per direction, indexed by the flat `c * t_max + t`. The inner loop no longer needs the `<<2` and `+ opp_d` arithmetic. - Pre-permute by `OPP` into `@compatible_per_opp` so the propagation loop reads `compatible_per_opp[d]` directly instead of `compatible_per_dir[OPP[d]]` — one Array#[] instead of two. - Replace the `m ^= lowest` advance with `m -= lowest`. Both clear the lowest set bit (we just extracted it), but the subtract form is consistently faster in the interpreter on this branch. The trade-off: ~14 MB of compatible storage instead of 1.7 MB (Fixnum slot is 8 bytes vs 1 byte) — still well within L3 on modern hardware. 64x36 timings: - Non-YJIT: 1.326s → ~1.11s median (16% faster) - YJIT: 0.426s → ~0.418s median (within noise)
Summary
Tile-array design with a flat bitmask wave, a precomputedpropagator[dir][tile]table, and an AC-4 compatible counter stored as a byte buffer (String#setbyte/getbyte). Propagation is iterative on an explicit stack; entropy is updated incrementally with a noise tiebreak; the chosen tile is cached per cell sogridlookups are O(1). Contradictions restart.ModelandCellare preserved asLegacyModelfor apples-to-apples benchmarking.Tilenow stores raw edge signatures so both implementations share it. The Window callsmodel.gridonce per draw instead of receiving a grid from everyiterate.bin/benchmarkextended to compare both models across grid sizes; the CI JSON contract (10 runs of 20×20,avg/min/max/P90) is preserved.Results (Ruby 4.0.1 + YJIT, arm64-darwin)
The new model holds ~800–950 observations/sec across grid sizes (legacy is ~200 obs/sec and does not practically reach 50×50+).
Test plan
bundle exec rake test— 2 runs, 16 assertions, 0 failuresCI=1 bundle exec bin/benchmark— produces the JSON the existing GitHub workflow consumesRUBY_YJIT_ENABLE=1 bundle exec bin/benchmark— sweeps Legacy vs New across sizesbundle exec ruby bin/run) and verify rendering,Ppause,Rrestart,Aappend row