CLU-95: investigation notes + repro harness #36844

Draft
antiguru wants to merge 3 commits into main from claude/admiring-mendel-FNS5D

@antiguru antiguru commented Jun 1, 2026

What

Investigation/understanding phase for CLU-95 — the recurring bootstrap panic:

thread 'coordinator' panicked at src/compute-client/src/as_of_selection.rs:
failed to apply hard as-of constraint (id=u732, ..., reason="storage export u732 write frontier")

No product code changes. This PR adds:

  • CLU-95-CONTINUATION.md — full root-cause write-up: decoded panic, the runtime invariant input.since <= step_back(mv.upper) and where it's enforced, the read-only/0dt leased read-hold theory (lease expiry + the deliberately-disabled update_since in expire_leased_reader, database-issues#6885), the tension that makes naive repros fail, open questions, a prioritized repro plan, candidate fixes with risk notes, and a code map with file:line references.
  • test/clu-95-repro/mzcompose.py — a repro hunt (not yet a proven reproducer) with zdt-soak (leader + read-only follower, short persist_reader_lease_duration, MV chain + REFRESH MV, reboot follower under load, scan logs) and restart-soak (single-env ungraceful kill+restart, mirroring workload-replay's sanity_restart).
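The "scan logs" step in both workflows could look roughly like the sketch below. It is an illustration, not code from test/clu-95-repro/mzcompose.py: the function name `scan_for_panic` and the regular expression are assumptions, modeled only on the panic text quoted above, and the fields elided between `id=` and `reason=` are skipped rather than parsed.

```python
import re

# Pattern modeled on the panic message quoted in this PR. The fields between
# `id=` and `reason=` are elided in the source, so they are skipped lazily.
PANIC_RE = re.compile(
    r'failed to apply hard as-of constraint '
    r'\(id=(?P<id>[a-z]+\d+),.*?reason="(?P<reason>[^"]*)"'
)


def scan_for_panic(lines):
    """Return an (id, reason) pair for every line containing the as-of panic."""
    hits = []
    for line in lines:
        match = PANIC_RE.search(line)
        if match:
            hits.append((match.group("id"), match.group("reason")))
    return hits


if __name__ == "__main__":
    sample = [
        "environmentd: bootstrap starting",
        'failed to apply hard as-of constraint (id=u732, ..., '
        'reason="storage export u732 write frontier")',
    ]
    for coll_id, reason in scan_for_panic(sample):
        print(f"panic for {coll_id}: {reason}")
```

Returning the collection id and reason, rather than a boolean, makes it easier to tell whether a hit concerns the same kind of object (a storage export write frontier) as the original report.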

TL;DR theory

u732 is a real, durably-written user MV; the panic means a storage input's read frontier (since) advanced ~41s past the MV's durable write frontier (upper). In a single read-write env the input is held by a persist critical handle (never expires), so this shouldn't happen. In read-only/0dt the follower holds inputs with persist leased handles, and on lease expiry update_since is disabled (#6885), so the leader's next compare_and_downgrade_since can jump the input since forward past a dependent MV's upper — and persist since never regresses, so the bad state is durable and every later bootstrap panics.
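The invariant behind the panic can be illustrated with a toy model. This sketch assumes timestamps are plain integer milliseconds and that `step_back` saturates at zero; real frontiers are antichains, and `as_of_constraint_satisfiable` is a hypothetical name, not a function in as_of_selection.rs.

```python
def step_back(ts: int) -> int:
    """Largest timestamp strictly below `ts`, saturating at 0 (assumed)."""
    return max(ts - 1, 0)


def as_of_constraint_satisfiable(input_since: int, mv_upper: int) -> bool:
    """True if the input is still readable at the last time before the MV's upper."""
    return input_since <= step_back(mv_upper)


if __name__ == "__main__":
    mv_upper = 1_000_000  # MV's durable write frontier, in ms (made-up value)
    healthy_since = mv_upper - 1  # input still readable just below the upper
    bad_since = mv_upper + 41_000  # input since jumped ~41s past the upper

    print(as_of_constraint_satisfiable(healthy_since, mv_upper))  # True
    print(as_of_constraint_satisfiable(bad_since, mv_upper))  # False
```

Because a persist since never regresses, once the second case is written durably no later bootstrap can make the check pass again, which matches the "every later bootstrap panics" behavior described above.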

How to run

bin/mzcompose --find clu-95-repro run zdt-soak --iterations 40 --lease-seconds 5
bin/mzcompose --find clu-95-repro run zdt-soak --drop-recreate
bin/mzcompose --find clu-95-repro run restart-soak --iterations 60

Notes / caveats

  • The harness is a hunt, not a guaranteed reproducer — see the "tension" section in the doc for why and which knobs to turn.
  • The new composition is intentionally not wired into any CI pipeline (a non-deterministic hunt shouldn't gate CI), so check-mzcompose-files.sh will flag it as unused.
  • Next highest-value step: confirm from the failing release-qualification logs whether the panicking process was in read-only/0dt mode or a plain restart — that decides which theory/fix to pursue.

https://claude.ai/code/session_01G3SvtMjZaSAzqW1dGropWn


Generated by Claude Code

claude and others added 3 commits June 1, 2026 09:26
Add a continuation write-up (CLU-95-CONTINUATION.md) capturing the root-cause
theory for the 'failed to apply hard as-of constraint' bootstrap panic, and an
mzcompose repro harness (test/clu-95-repro) that stresses the read-only/0dt
leased-read-hold lifecycle and single-env ungraceful restarts.

No product code changes; this is the understanding/repro phase.

Refactor `Instance::remove_replica`'s diagnostic loop (the "dropping
per-replica read hold without equivalent global read hold" WARN added in
PR #35937) into a pure helper `find_unprotected_replica_holds`, and add
four unit tests that exercise the hold-asymmetry condition tracked under
incidents-and-escalations#39. The tests are the first deterministic
specification of the bug-class shape and pin down the regression contract
for the eventual fix.
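As a rough model of the hold-asymmetry condition, the following Python sketch mirrors the shape of such a pure helper. Only the name is borrowed from the commit; the signature, the integer frontiers, and the reading of "equivalent" as "a global hold at or below the per-replica hold" are assumptions and may differ from the Rust code.

```python
def find_unprotected_replica_holds(replica_holds, global_holds):
    """Return ids whose per-replica hold has no global hold at or below it.

    Both arguments map a collection id to an integer `since`. This is a toy
    model; the real helper works on Rust types.
    """
    unprotected = []
    for coll_id, replica_since in sorted(replica_holds.items()):
        global_since = global_holds.get(coll_id)
        # A missing global hold, or one already past the replica's hold,
        # leaves the replica's hold as the only protection.
        if global_since is None or global_since > replica_since:
            unprotected.append(coll_id)
    return unprotected


if __name__ == "__main__":
    replica = {"u1": 5, "u2": 5, "u3": 5}
    global_ = {"u1": 5, "u2": 7}  # u2's global hold is ahead, u3 has none
    print(find_unprotected_replica_holds(replica, global_))  # ['u2', 'u3']
```

Keeping the check free of side effects is what allows the condition to be tested without constructing a full `Instance`.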

Also extend the CLU-95 repro harness with two new workflows targeting
the build 1248 manifestation more directly:

* `cancelled-peek-reconnect` — slow-path SELECT (via mz_unsafe.mz_sleep)
  on an unmanaged cluster pinned to a standalone Clusterd, cancelled
  mid-render, then clusterd force-killed to provoke reconnect.
* `replica-removal-under-load` — writer cluster MV + concurrent
  dataflow churn on a separate compute cluster, then DROP CLUSTER
  REPLICA on the compute side to drive `Instance::remove_replica`
  under load.

Both workflows accumulate perturbations under one long-lived envd and
then do a single ungraceful restart, mirroring the workload-replay
sanity_restart sequence from build 1248. Neither reproduces the
bootstrap panic over 30/40 iterations, but the harness now scans for
the diagnostic WARN as a secondary signal.

CLU-95-CONTINUATION.md is rewritten to reflect the build 1248 services.log
findings, rule out the leased-expiry framing for that build, and lay out
the three-pronged fix direction: upstream hold-accounting fix (#39),
bootstrap report-don't-panic safety net (the CLU-95-specific recovery),
and render-time report-don't-panic (the moral successor to the now-canceled
CLU-34).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>