docs: chaos pod-kill vs engine-internal-crash FSFO asymmetry#129
weicao wants to merge 9 commits
Add engine-neutral methodology guide for chaos test matrix design, documenting why pod-kill and internal-process-kill exercise different failover code paths and must both be covered. Add Oracle 19c case appendix consolidating round-1 (pod-kill, no FSFO, standby SUSPEND) vs round-6 (SMON SIGKILL, FSFO fires + auto-reinstate) side-by-side evidence on the same o19p15v9 cluster.
Chaos round 9 (DMON SIGKILL on primary) discovered a 3rd failure class that the 2-axis taxonomy missed:
- B-instance (e.g. SMON / LGWR): PMON terminates the instance, watchdog restarts ctr, FSFO fires, role change + reinstate.
- B-broker (e.g. DMON / INSV / NSV*): engine internally spawns a new process, instance unaffected, no ctr restart, FSFO must NOT fire.

The B-broker class exposes a distinct design risk: if a failover decision-maker conflates "broker config query failed" with "primary unreachable", it will mis-trigger failover. The new test matrix row (#8) covers this explicitly.

Also extended the Oracle case appendix with rounds 7 (LGWR), 8 (observer+SMON race) and 9 (DMON) summaries + file map, and noted F38 (observer setup script has no timeout) as a related latent risk.
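The B-broker design risk can be illustrated with a minimal sketch of a decision function. The function name, its arguments, and the result strings are hypothetical and not taken from the addon; the only point is that a failed broker config query is never treated as evidence that the primary is unreachable.

```shell
# Hypothetical sketch: decide whether a failover may even be considered.
# Arguments (all assumptions for illustration):
#   $1 primary_reachable  1 if an independent probe reached the primary, else 0
#   $2 broker_query_ok    1 if the broker config query succeeded, else 0
#   $3 unreachable_secs   how long the primary has been unreachable
#   $4 threshold_secs     failover threshold in seconds
failover_decision() {
  primary_reachable=$1
  broker_query_ok=$2
  unreachable_secs=$3
  threshold_secs=$4

  if [ "$primary_reachable" -eq 1 ]; then
    if [ "$broker_query_ok" -eq 1 ]; then
      echo "healthy"
    else
      # B-broker case: the instance is up, only the broker path is degraded.
      echo "broker-degraded-no-failover"
    fi
    return 0
  fi

  # Only an unreachable primary can start the failover clock.
  if [ "$unreachable_secs" -ge "$threshold_secs" ]; then
    echo "failover-candidate"
  else
    echo "waiting"
  fi
}
```

For example, `failover_decision 1 0 0 30` prints `broker-degraded-no-failover`: the broker query failed, but the primary was reached, so no failover is considered.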
Round 10 killed INSV (broker support daemon) on the primary, paired with round 9's DMON (broker coordinator) kill, to verify the B-broker class generalizes beyond the master daemon. Result: INSV kill recovers in ~4 seconds with two alert log lines. No broker cleanup, no support-process cascade, no ERROR window. This contrasts with DMON kill's ~27s full broker re-init. Both share the defining property of the B-broker class: instance is not terminated and FSFO must not fire. So the 3-class taxonomy (A / B-instance / B-broker) stands. A finer master-vs-worker sub-classification is noted but deferred until operationally needed.
…dimension

Round 11 killed SMON on the active FSFO target standby (oracle-1). Outcome: FSFO did NOT fire (primary stayed healthy), but observer auto-shifted the active target ORCLCDB_1 -> ORCLCDB_2 after the 30s threshold elapsed. Same target-shift behavior as round 3 (pod-kill on FSFO target), only via the B-instance recovery path, which finishes ~3x faster (84s vs 4m25s).

Methodology guide now treats FSFO fire, role change, target shift, and broker config-status trajectory as four independent observable dimensions, so testers do not conflate "target shifted" with "failover fired".
…trix

Round 12 completes the B-instance position-dependency matrix. Killing SMON on the non-target standby produces the minimum-impact B-instance outcome: broker ERROR, single ctr restart, ~110s to broker SUCCESS. No target shift, no role change, no FSFO. Primary writes uninterrupted throughout.

Combined with rounds 6/7 (primary) and round 11 (active FSFO target), the same SMON kill produces three qualitatively different cluster-level outcomes depending on the target role. Methodology guide now documents this as the B-instance position dependency rule: each of the three positions must be tested separately.
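The position dependency rule lends itself to a small coverage check. The sketch below is illustrative only: the function name and the position labels are assumptions, not identifiers from the guide. It lists which of the three positions still lack a run for a given chaos class.

```shell
# Hypothetical sketch: report untested positions for one chaos class.
# Usage: missing_positions CLASS [TESTED_POSITION...]
missing_positions() {
  class=$1
  shift
  for position in primary active-fsfo-target non-target-standby; do
    found=0
    for tested in "$@"; do
      if [ "$tested" = "$position" ]; then
        found=1
      fi
    done
    # Print one line per position that has no recorded run yet.
    if [ "$found" -eq 0 ]; then
      echo "$class@$position"
    fi
  done
}
```

With only rounds 6/7 and 11 recorded, `missing_positions B-instance primary active-fsfo-target` prints `B-instance@non-target-standby`, which is the cell round 12 fills.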
Repo-level docs review: currently blocked. Blockers:
Commit bodies look clean for AI attribution. After these are fixed I can re-run file-level checks.
Round 13 = 3 consecutive SMON kills on primary, no manual reinstate between cycles. Outcomes split 1/3 self-recovery / 2/3 FSFO, a direct observation of FSFO bimodal distribution. Cluster self-stabilized through 3 cycles, reinstate time stable (~120-150s, no degradation).

Methodology guide gains:
- "B-instance primary FSFO probabilism" section: race between self-recovery total time and FSFO threshold, with watchdog tick phase as the dominant tie-breaker (engine-neutral)
- Engine adaptation checklist item 8: tune watchdog tick vs FSFO threshold relationship deliberately to pick one regime
- Burn-in methodology section: >=3 cycles, decoupled poll point, topology rotation, bimodal ratio observation
- Anti-pattern 8 (single-shot != verified) and 9 (do not poll from the member being killed; observed stale broker state for ~75s)

Oracle case appendix gains:
- Round 13 burn-in narrative + per-cycle table + bimodal observation
- Methodology self-disclosure on the polling-point bug found in cycle 3
- File map and round summary table updated

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
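Two of the burn-in rules (a decoupled poll point and a bimodal ratio over at least 3 cycles) can be sketched as small helpers. Everything here is an assumption for illustration: the function names, the outcome labels `self-recovery` and `fsfo`, and the member names in the usage note are not taken from the guide.

```shell
# Hypothetical sketch: pick a poll point that is not the member being killed.
# Usage: pick_poll_member VICTIM MEMBER...
pick_poll_member() {
  victim=$1
  shift
  for member in "$@"; do
    if [ "$member" != "$victim" ]; then
      echo "$member"
      return 0
    fi
  done
  return 1   # no safe poll point available
}

# Hypothetical sketch: tally per-cycle outcomes into a ratio line.
# Usage: tally_outcomes OUTCOME...   (each "self-recovery" or "fsfo")
tally_outcomes() {
  self=0
  fsfo=0
  for outcome in "$@"; do
    case "$outcome" in
      self-recovery) self=$((self + 1)) ;;
      fsfo)          fsfo=$((fsfo + 1)) ;;
      *) echo "unknown outcome: $outcome" >&2; return 2 ;;
    esac
  done
  total=$((self + fsfo))
  # A single shot is not a verification; require at least 3 cycles.
  if [ "$total" -lt 3 ]; then
    echo "insufficient: $total cycle(s), need >= 3"
    return 1
  fi
  echo "self-recovery=$self/$total fsfo=$fsfo/$total"
}
```

With round 13's split (the cycle order shown here is illustrative), `tally_outcomes self-recovery fsfo fsfo` prints `self-recovery=1/3 fsfo=2/3`, and `pick_poll_member oracle-0 oracle-0 oracle-1 oracle-2` prints `oracle-1`.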
Round 14 introduces B-listener as a fourth chaos sub-class beyond A / B-instance / B-broker. Killing the TNS listener leaves engine and broker daemon intact, so all existing TCP sessions (broker peer-to-peer, redo transport, observer-to-instance) keep working and the broker reports SUCCESS indefinitely, even six minutes after the listener is gone. The alert log writes nothing. Only new connections fail with ORA-12541. This is a true silent failure.

For the Oracle Addon specifically, runOracle.sh's watchdog only pgreps ora_pmon; tnslsnr is never checked, so a listener crash never triggers ctr restart and never produces operator-visible signal. Filed as F39, to be fixed in a separate engine PR rather than mixed with this docs PR.

Methodology guide gains:
- A fourth axis (B-listener) with explanation of why existing TCP sessions mask the failure
- Matrix row #11 and a stronger contract: must test #1 + #6 + #8 + #11
- Engine adaptation checklist item 9: listener / port-listener watchdog coverage
- Anti-pattern 10: "broker SUCCESS = cluster healthy" is wrong; broker SUCCESS only reflects already-established TCP sessions

Oracle case appendix gains:
- Round 14 narrative, F39 fix sketch, file map, summary table row

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
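Anti-pattern 10 suggests pairing the broker status with a fresh-connection probe before calling a cluster healthy. The following is a minimal sketch under stated assumptions: the function name and verdict strings are hypothetical, and the probe is passed in as an arbitrary command because the source does not specify which probe the test harness uses.

```shell
# Hypothetical sketch: combine broker status with a new-connection probe.
# Usage: cluster_verdict BROKER_STATUS PROBE_CMD [ARGS...]
# PROBE_CMD must open a brand-new connection and exit 0 on success.
cluster_verdict() {
  broker_status=$1
  shift
  if "$@" >/dev/null 2>&1; then
    new_conn=ok
  else
    new_conn=failed
  fi

  if [ "$broker_status" = "SUCCESS" ] && [ "$new_conn" = "ok" ]; then
    echo "healthy"
  elif [ "$broker_status" = "SUCCESS" ]; then
    # Existing sessions still work, so the broker cannot see the loss.
    echo "silent-failure: broker SUCCESS but new connections fail"
  else
    echo "broker-reported: $broker_status"
  fi
}
```

Using `false` as a stand-in for a probe that hits ORA-12541, `cluster_verdict SUCCESS false` reports the silent failure, while `cluster_verdict SUCCESS true` prints `healthy`. The probe should run from a member other than the one under test so that it cannot reuse an existing session.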
Round 15 lifts round 14's B-listener kill from a non-target standby to the primary. The silent failure pattern reproduces in full: broker SUCCESS for the entire 133s observation window, no role change, no target shift, no FSFO. The new evidence at this position is that the observer reports "Last Ping to Primary: 1 second ago" while any new `dgmgrl /@ORCLCDB_0` from another pod returns ORA-12541 immediately: observer and broker keep talking over already-open sessions while every new client connection is rejected.

The blast radius is qualitatively different. On a non-target standby (round 14) listener death has minimal cluster-level impact. On the primary (round 15) it is a production-grade application outage with no broker alarm and no auto-recovery. Manual `lsnrctl start` in the oracle ctr restored the cluster in seconds without bouncing anything.

That promotes F39 (runOracle.sh watchdog does not pgrep tnslsnr) to a Sev-1 candidate. The fix is cheap (`pgrep -x tnslsnr` in the same watchdog loop, try `lsnrctl start` once, exit for ctr restart on failure) and will be filed as a separate engine PR.

Case appendix updated with round 15 narrative and file map. Summary table now covers rounds 1-15.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
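The F39 fix described above can be sketched as a single watchdog helper. This is an illustration under assumptions, not the actual patch: the function name `check_listener` and the log messages are hypothetical, and the real `runOracle.sh` loop is not shown in this PR. Only `pgrep -x tnslsnr` and `lsnrctl start` come from the description.

```shell
# Hypothetical sketch of the listener check for a watchdog loop.
# Returns 0 if tnslsnr is running (possibly after one restart attempt),
# 1 if the caller should exit so the container gets restarted.
check_listener() {
  if pgrep -x tnslsnr >/dev/null 2>&1; then
    return 0
  fi

  echo "watchdog: tnslsnr not running, trying lsnrctl start once" >&2
  lsnrctl start >/dev/null 2>&1

  # Re-check the process instead of trusting the exit code of lsnrctl.
  if pgrep -x tnslsnr >/dev/null 2>&1; then
    return 0
  fi

  echo "watchdog: listener restart failed" >&2
  return 1
}
```

Inside a watchdog loop this would be called as `check_listener || exit 1`, so a listener that cannot be restarted falls through to the existing ctr restart path.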
…ckers
- Round 16 closes out the B-listener position-axis matrix (silent failure at all three positions: non-target / primary / active FSFO target; F39 is position-independent)
- Case appendix intro: added the 'Affected by version skew' field and a case scope line (Oracle 19c / KB 1.0.3-beta.x / single cluster / rounds 1-16 only)
- Methodology doc: changed version skew from 'not affected' to 'yes — engine version / HA mode / observer config / watchdog implementation / KB sidecar all change the observed results'
- Replaced the absolute case evidence paths /Users/wei/... with repo-relative references
- Replaced the dangling links (oracle-f35-sync-membership-grace-period-case.md / addon-multi-container-shared-file-symlink-write-guide.md) with a 'link to be backfilled once the corresponding doc lands' note, so that nothing dangles
- SKILL-INDEX.md: added both new docs (chaos pod-kill vs engine internal crash methodology + Oracle chaos case) to section 2, the full document list, and the Oracle case section
- Added round 16 to the case appendix summary list in the methodology doc
Thanks. All 6 blockers fixed in
Also pushed Round 16 (B-listener kill on active FSFO target, completes the position-axis matrix and confirms F39 is position-independent) in the same commit. The PR scope grew from 13 → 16 rounds; titles and summary text updated to match. Ready for re-check.
Second pass: the content blockers from my earlier review are largely fixed. I verified the updated PR body is clean, both new docs are indexed, the previous dangling links are gone, case scope/version-skew wording is present, and the absolute workstation paths have been replaced with repo-relative references.

One remaining blocking issue before merge:
Suggested self-check after rewriting:

    git log origin/main..HEAD --format='%H%n%B%n---END---' \
      | grep -Ei 'Generated with|Anthropic|Claude|Codex|OpenAI|AI agent|🤖|noreply@anthropic|noreply@openai|noreply@codex|^Co-authored-by:.*(Anthropic|Claude|Codex|OpenAI|AI agent|🤖|noreply|bot)'

Expected result: no output.

After that rewrite, I can do a final file-level pass and merge if the diff is unchanged.
Summary
- `addon-chaos-pod-kill-vs-engine-internal-crash-guide.md`: defines the chaos axes (K8s-layer pod kill vs engine-internal process kill, B class split into instance / broker / listener subtypes), explains why each axis exercises different failover code paths, and gives a position-axis matrix template (primary / active FSFO target / non-target standby), a burn-in methodology for probabilistic failover behavior, and an 11-row chaos test matrix.
- `cases/oracle/oracle-chaos-pod-kill-vs-smon-kill-fsfo-asymmetry-case.md`: rounds 1–16 on the same `o19p15v9` cluster, covering A pod-kill / B-instance SMON+LGWR / B-broker DMON+INSV / B-listener tnslsnr; concurrent A+B race (round 8); burn-in (round 13); position-axis matrices for B-instance and B-listener; full alert log + observer log + broker poll trajectories.

Why this matters
While running the Oracle chaos matrix on `o19p15v9` we found three things that any other addon team will likely face:
- `runOracle.sh` watchdog is blind to tnslsnr loss. Rounds 14 / 15 / 16 (tnslsnr SIGKILL on non-target standby / primary / active target) all show the same silent failure pattern: broker SUCCESS, observer happy on existing TCP, alert log silent, new client connections all ORA-12541, no auto-recovery. Position-independent. Fix submitted as PR apecloud/apecloud-addons#1320.

Review-blocker fixes (addressing earlier review comment)
- Added both new docs to `docs/SKILL-INDEX.md` (section 2 / full document list / Oracle case section).
- Added an `Affected by version skew` field to the case appendix intro and an explicit case-scope line (Oracle 19c / KB 1.0.3-beta.x / single cluster / rounds 1–16 only).
- Changed the methodology doc's version-skew answer to `yes — engine version, HA mode, observer/broker config, container watchdog, K8s restart path, KB sidecar behavior all change observed outcomes`.
- Replaced `/Users/wei/...` workstation paths in the case appendix with repo-relative archive references.

Test plan
- … `evidence-skyworth-oracle-19c/`).
- … `docs/SKILL-INDEX.md`.