CLU-95: investigation notes + repro harness by antiguru · Pull Request #36844 · MaterializeInc/materialize

antiguru · 2026-06-01T09:55:14Z

What

Investigation/understanding phase for CLU-95 — the recurring bootstrap panic:

thread 'coordinator' panicked at src/compute-client/src/as_of_selection.rs:
failed to apply hard as-of constraint (id=u732, ..., reason="storage export u732 write frontier")

No product code changes. This PR adds:

CLU-95-CONTINUATION.md — full root-cause write-up: decoded panic, the runtime invariant input.since <= step_back(mv.upper) and where it's enforced, the read-only/0dt leased read-hold theory (lease expiry + the deliberately-disabled update_since in expire_leased_reader, database-issues#6885), the tension that makes naive repros fail, open questions, a prioritized repro plan, candidate fixes with risk notes, and a code map with file:line references.
test/clu-95-repro/mzcompose.py — a repro hunt (not yet a proven reproducer) with zdt-soak (leader + read-only follower, short persist_reader_lease_duration, MV chain + REFRESH MV, reboot follower under load, scan logs) and restart-soak (single-env ungraceful kill+restart, mirroring workload-replay's sanity_restart).
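The "scan logs" step of the harness can be pictured with a minimal sketch like the one below. It is not taken from test/clu-95-repro/mzcompose.py; the function name `find_as_of_panics` and the id regex are assumptions made for illustration, and only the marker substring comes from the panic message quoted above.

```python
import re

# Substring copied from the panic message quoted in this PR.
PANIC_MARKER = "failed to apply hard as-of constraint"
# Hypothetical pattern for extracting the collection id (e.g. "u732").
ID_PATTERN = re.compile(r"id=([a-z]\d+)")


def find_as_of_panics(log_text):
    """Return the collection ids named in as-of constraint panics.

    A line that contains the marker but no parsable id contributes None,
    so the caller still sees that a panic occurred.
    """
    ids = []
    for line in log_text.splitlines():
        if PANIC_MARKER in line:
            match = ID_PATTERN.search(line)
            ids.append(match.group(1) if match else None)
    return ids
```

A soak loop would call such a function on the collected service logs after every iteration and stop at the first non-empty result.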

TL;DR theory

u732 is a real, durably-written user MV; the panic means a storage input's read frontier (since) advanced ~41s past the MV's durable write frontier (upper). In a single read-write env the input is held by a persist critical handle (never expires), so this shouldn't happen. In read-only/0dt the follower holds inputs with persist leased handles, and on lease expiry update_since is disabled (#6885), so the leader's next compare_and_downgrade_since can jump the input since forward past a dependent MV's upper — and persist since never regresses, so the bad state is durable and every later bootstrap panics.
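The theory above can be made concrete with a toy model. This is a sketch under simplifying assumptions, not the code in as_of_selection.rs or persist: frontiers are plain integer millisecond timestamps, `step_back` is taken to be "one tick earlier, saturating at zero", and `ToyShard` is a hypothetical stand-in for a shard whose since never regresses.

```python
def step_back(upper):
    """Largest timestamp strictly before `upper`, saturating at 0."""
    return max(upper - 1, 0)


def hard_constraint_ok(input_since, mv_upper):
    """The invariant from the write-up: input.since <= step_back(mv.upper)."""
    return input_since <= step_back(mv_upper)


class ToyShard:
    """Hypothetical shard model: since only ever moves forward."""

    def __init__(self, since=0):
        self.since = since

    def downgrade_since(self, new_since):
        # Requests to move since backwards are ignored, so a bad
        # forward jump stays in place across restarts.
        self.since = max(self.since, new_since)
        return self.since


mv_upper = 100_000              # durable write frontier of the MV (ms)
storage_input = ToyShard(since=90_000)
healthy = hard_constraint_ok(storage_input.since, mv_upper)

# With no hold pinning the input, a downgrade jumps ~41s past the MV's upper.
storage_input.downgrade_since(mv_upper + 41_000)
broken = hard_constraint_ok(storage_input.since, mv_upper)

# Trying to repair the state by moving since back has no effect.
storage_input.downgrade_since(90_000)
still_broken = hard_constraint_ok(storage_input.since, mv_upper)
```

In this model `healthy` is true and both `broken` and `still_broken` are false, which mirrors the claim that every later bootstrap hits the same panic once the input's since has passed the MV's upper.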

How to run

bin/mzcompose --find clu-95-repro run zdt-soak --iterations 40 --lease-seconds 5
bin/mzcompose --find clu-95-repro run zdt-soak --drop-recreate
bin/mzcompose --find clu-95-repro run restart-soak --iterations 60

Notes / caveats

The harness is a hunt, not a guaranteed reproducer — see the "tension" section in the doc for why and which knobs to turn.
The new composition is intentionally not wired into any CI pipeline (a non-deterministic hunt shouldn't gate CI), so check-mzcompose-files.sh will flag it as unused.
Next highest-value step: confirm from the failing release-qualification logs whether the panicking process was in read-only/0dt mode or a plain restart — that decides which theory/fix to pursue.

https://claude.ai/code/session_01G3SvtMjZaSAzqW1dGropWn

Generated by Claude Code

Add a continuation write-up (CLU-95-CONTINUATION.md) capturing the root-cause theory for the 'failed to apply hard as-of constraint' bootstrap panic, and an mzcompose repro harness (test/clu-95-repro) that stresses the read-only/0dt leased-read-hold lifecycle and single-env ungraceful restarts. No product code changes; this is the understanding/repro phase.

Refactor `Instance::remove_replica`'s diagnostic loop (the "dropping per-replica read hold without equivalent global read hold" WARN added in PR #35937) into a pure helper `find_unprotected_replica_holds`, and add four unit tests that exercise the hold-asymmetry condition tracked under incidents-and-escalations#39. The tests are the first deterministic specification of the bug-class shape and pin down the regression contract for the eventual fix.

Also extend the CLU-95 repro harness with two new workflows targeting the build 1248 manifestation more directly:

* `cancelled-peek-reconnect`: slow-path SELECT (via mz_unsafe.mz_sleep) on an unmanaged cluster pinned to a standalone Clusterd, cancelled mid-render, then clusterd force-killed to provoke reconnect.
* `replica-removal-under-load`: writer cluster MV + concurrent dataflow churn on a separate compute cluster, then DROP CLUSTER REPLICA on the compute side to drive `Instance::remove_replica` under load.

Both workflows accumulate perturbations under one long-lived envd and then do a single ungraceful restart, mirroring the workload-replay sanity_restart sequence from build 1248. Neither reproduces the bootstrap panic over 30/40 iterations, but the harness now scans for the diagnostic WARN as a secondary signal.

CLU-95-CONTINUATION.md is rewritten to reflect the build 1248 services.log findings, rule out the leased-expiry framing for that build, and lay out the three-pronged fix direction: upstream hold-accounting fix (#39), bootstrap report-don't-panic safety net (the CLU-95-specific recovery), and render-time report-don't-panic (the moral successor to the now-canceled CLU-34).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
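The hold-asymmetry condition named in the commit message above can be illustrated with a toy model. The Rust helper's real signature and frontier types are not shown in this PR, so everything below is an assumption: frontiers are integers, holds are dictionaries keyed by collection id, and "protected" is read from the WARN text as "a global hold exists on the same collection at a frontier no later than the replica's".

```python
def find_unprotected_replica_holds(replica_holds, global_holds):
    """Return collection ids whose per-replica hold has no equivalent global hold.

    Both arguments map a collection id to an integer hold frontier. This is
    an illustrative model only, not the helper added in this PR.
    """
    unprotected = []
    for collection_id, replica_frontier in sorted(replica_holds.items()):
        global_frontier = global_holds.get(collection_id)
        # No global hold at all, or one that has already advanced past the
        # replica's frontier: dropping the replica hold frees those times.
        if global_frontier is None or global_frontier > replica_frontier:
            unprotected.append(collection_id)
    return unprotected
```

Keeping the check free of side effects is what makes it unit-testable: the caller can log the WARN for each returned id, and a test can assert on the returned list directly.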

claude and others added 3 commits June 1, 2026 09:26

test/clu-95-repro: loop over all workflows in default

d23f760


antiguru wants to merge 3 commits into main from claude/admiring-mendel-FNS5D


2 participants
