docs: chaos pod-kill vs engine-internal-crash FSFO asymmetry#129

Open
weicao wants to merge 9 commits into main from feat/chaos-pod-kill-vs-internal-crash-fsfo-asymmetry

@weicao
Contributor

@weicao weicao commented May 15, 2026

Summary

  • Add an engine-neutral methodology guide, addon-chaos-pod-kill-vs-engine-internal-crash-guide.md. It defines the chaos axes (K8s-layer pod kill vs engine-internal process kill, with the B class split into instance / broker / listener subtypes), explains why each axis exercises different failover code paths, and provides a position-axis matrix template (primary / active FSFO target / non-target standby), a burn-in methodology for probabilistic failover behavior, and an 11-row chaos test matrix.
  • Add an Oracle 19c case appendix, cases/oracle/oracle-chaos-pod-kill-vs-smon-kill-fsfo-asymmetry-case.md, covering rounds 1–16 on the same o19p15v9 cluster: A pod-kill / B-instance SMON+LGWR / B-broker DMON+INSV / B-listener tnslsnr; concurrent A+B race (round 8); burn-in (round 13); position-axis matrices for B-instance and B-listener; full alert log + observer log + broker poll trajectories.

Why this matters

While running the Oracle chaos matrix on o19p15v9 we found three things that any other addon team will likely face:

  1. Pod-kill vs internal-process-kill behave completely differently in MaxPerformance + ASYNC mode. Pod-kill of primary → standby SUSPEND (data-loss safety guard) → FSFO suppressed → primary self-recovers. SMON SIGKILL of primary → standby stays healthy → FSFO fires → role failover + auto-reinstate. A chaos matrix that only does pod-kill never exercises the FSFO-fires code path.
  2. FSFO triggering in this configuration is probabilistic and bimodal. Burn-in (round 13, three back-to-back SMON kills on primary) shows 1 of 3 cycles self-recovering and 2 of 3 cycles firing FSFO. The watchdog tick phase relative to the FSFO threshold decides which path wins, so a single-shot SMON kill verification cannot conclude "SMON kill always triggers FSFO".
  3. F39: runOracle.sh watchdog is blind to tnslsnr loss. Rounds 14 / 15 / 16 (tnslsnr SIGKILL on non-target standby / primary / active target) all show the same silent failure pattern: broker SUCCESS, observer happy on existing TCP, alert log silent, new client connections all ORA-12541, no auto-recovery. Position-independent. Fix submitted as PR apecloud/apecloud-addons#1320.
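
The race in item 2 can be pictured with a toy model. This is only a sketch under assumptions that the PR does not state: the watchdog polls at a fixed interval, the crash is detected at the next tick, and recovery takes a fixed time after detection. `predict_path` and its parameters are hypothetical names, and every number other than the 30s FSFO threshold (mentioned in the round 11 commit note) is made up for illustration.

```shell
# Toy model of the self-recovery vs FSFO race (hypothetical, integer seconds).
# predict_path TICK PHASE RESTART THRESHOLD
#   TICK      watchdog polling interval (assumed fixed)
#   PHASE     seconds since the last watchdog tick when the kill lands
#   RESTART   seconds from detection until the primary is reachable again (assumed)
#   THRESHOLD FSFO threshold
predict_path() (
  tick=$1
  phase=$2
  restart=$3
  threshold=$4
  # The watchdog only notices the dead process at its next tick.
  detect_delay=$(( tick - phase ))
  recovery=$(( detect_delay + restart ))
  if [ "$recovery" -lt "$threshold" ]; then
    echo "self-recovery"
  else
    echo "fsfo-fires"
  fi
)

predict_path 10 9 25 30   # kill lands just before a tick
predict_path 10 1 25 30   # kill lands just after a tick
```

With these illustrative numbers the same kill flips from one regime to the other purely because of where it lands inside the tick interval, which is why a single run says little about the distribution.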

Review-blocker fixes (addressing earlier review comment)

  • Removed AI attribution from PR body.
  • Indexed both new files in docs/SKILL-INDEX.md (section 2 / full document list (文档全列表) / Oracle case section (Oracle 案例区)).
  • Replaced two dangling local links in the case appendix with deferred-link descriptions.
  • Added Affected by version skew field to the case appendix intro and an explicit case-scope line (Oracle 19c / KB 1.0.3-beta.x / single cluster / rounds 1–16 only).
  • Softened the methodology doc's version-skew claim to yes — engine version, HA mode, observer/broker config, container watchdog, K8s restart path, KB sidecar behavior all change observed outcomes.
  • Replaced absolute /Users/wei/... workstation paths in the case appendix with repo-relative archive references.

Test plan

  • Methodology doc written engine-neutral (Oracle only in case appendix).
  • Case appendix grounded in real evidence (alert log + observer log + broker polls archived in evidence-skyworth-oracle-19c/).
  • Cross-references between methodology and case file correct.
  • One-topic-per-doc principle respected — chaos axis design separate from F37 / multi-ctr symlink / reconfigure topics.
  • Both files indexed in docs/SKILL-INDEX.md.
  • No dangling local links.

Ava added 5 commits May 15, 2026 12:23
Add engine-neutral methodology guide for chaos test matrix design,
documenting why pod-kill and internal-process-kill exercise different
failover code paths and must both be covered.

Add Oracle 19c case appendix consolidating round-1 (pod-kill, no FSFO,
standby SUSPEND) vs round-6 (SMON SIGKILL, FSFO fires + auto-reinstate)
side-by-side evidence on the same o19p15v9 cluster.

Chaos round 9 (DMON SIGKILL on primary) discovered a 3rd failure
class that the 2-axis taxonomy missed:

- B-instance (e.g. SMON / LGWR): PMON terminates the instance,
  watchdog restarts ctr, FSFO fires, role change + reinstate.
- B-broker (e.g. DMON / INSV / NSV*): engine internally spawns a new
  process, instance unaffected, no ctr restart, FSFO must NOT fire.

The B-broker class exposes a distinct design risk: if a failover
decision-maker conflates "broker config query failed" with "primary
unreachable", it will mis-trigger failover. The new test matrix row
(#8) covers this explicitly.
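
The separation that row #8 tests can be sketched as follows, assuming a custom failover decision-maker with two independent probes. The function name, labels, and probe inputs are hypothetical; this does not describe how the Oracle observer is implemented.

```shell
# failover_decision INSTANCE_REACHABLE BROKER_QUERY_OK   (both "yes" or "no")
# Hypothetical decision helper: a failed broker query alone must never be
# treated as "primary unreachable".
failover_decision() (
  instance_reachable=$1   # result of a direct probe against the instance
  broker_query_ok=$2      # result of a broker configuration query
  if [ "$instance_reachable" != "yes" ]; then
    echo "candidate-for-failover"
  elif [ "$broker_query_ok" != "yes" ]; then
    # B-broker case: instance is up, broker daemon is being respawned.
    echo "broker-degraded-no-failover"
  else
    echo "healthy"
  fi
)

failover_decision yes no   # DMON / INSV kill: instance alive, broker query fails
```

Keeping the instance probe first means a broker outage can only ever downgrade the status, never escalate it to a failover candidate.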

Also extended the Oracle case appendix with rounds 7 (LGWR), 8
(observer+SMON race) and 9 (DMON) summaries + file map, and noted
F38 (observer setup script has no timeout) as a related latent risk.

Round 10 killed INSV (broker support daemon) on the primary, paired
with round 9's DMON (broker coordinator) kill, to verify the B-broker
class generalizes beyond the master daemon.

Result: INSV kill recovers in ~4 seconds with two alert log lines.
No broker cleanup, no support-process cascade, no ERROR window. This
contrasts with DMON kill's ~27s full broker re-init.

Both share the defining property of the B-broker class: instance is
not terminated and FSFO must not fire. So the 3-class taxonomy
(A / B-instance / B-broker) stands. A finer master-vs-worker
sub-classification is noted but deferred until operationally needed.
…dimension

Round 11 killed SMON on the active FSFO target standby (oracle-1).
Outcome: FSFO did NOT fire (primary stayed healthy), but observer
auto-shifted the active target ORCLCDB_1 -> ORCLCDB_2 after the
30s threshold elapsed. Same target-shift behavior as round 3
(pod-kill on FSFO target), only via the B-instance recovery path,
which finishes ~3x faster (84s vs 4m25s).

Methodology guide now treats FSFO fire, role change, target shift,
and broker config-status trajectory as four independent observable
dimensions, so testers do not conflate "target shifted" with
"failover fired".
…trix

Round 12 completes the B-instance position-dependency matrix.
Killing SMON on the non-target standby produces the minimum-impact
B-instance outcome: broker ERROR, single ctr restart, ~110s to
broker SUCCESS. No target shift, no role change, no FSFO. Primary
writes uninterrupted throughout.

Combined with rounds 6/7 (primary) and round 11 (active FSFO
target), the same SMON kill produces three qualitatively different
cluster-level outcomes depending on the target role. Methodology
guide now documents this as the B-instance position dependency
rule: each of the three positions must be tested separately.
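
A small sketch of how the position rule and the independent outcome dimensions could be recorded per round. All helper names and labels here are hypothetical and are not taken from the methodology guide; the broker trajectory is passed through as free text so that it stays a separate dimension.

```shell
# emit_position_cases KILL_TYPE
# Prints one test-case id per position so that no position is silently skipped.
emit_position_cases() (
  kill_type=$1
  for position in primary active-fsfo-target non-target-standby; do
    echo "${kill_type}@${position}"
  done
)

# classify_outcome FSFO_FIRED ROLE_CHANGED TARGET_SHIFTED BROKER_TRAJECTORY
# The first three arguments are "yes" or "no"; the fourth is free text.
classify_outcome() (
  fsfo_fired=$1
  role_changed=$2
  target_shifted=$3
  broker_trajectory=$4
  if [ "$fsfo_fired" = "yes" ]; then
    label="failover"
  elif [ "$role_changed" = "yes" ]; then
    label="role-change-without-fsfo"
  elif [ "$target_shifted" = "yes" ]; then
    label="target-shift-only"
  else
    label="no-topology-change"
  fi
  echo "$label broker=$broker_trajectory"
)

emit_position_cases SMON
# Round 12 as described above: no FSFO, no role change, no target shift.
classify_outcome no no no "ERROR->SUCCESS"
```

Under this labeling, round 11 (no FSFO, target shifted) would come out as `target-shift-only`, which keeps it visibly distinct from a failover.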
@weicao
Contributor Author

weicao commented May 15, 2026

Repo-level docs review: currently blocked.

Blockers:

  1. PR body has the generated-by footer: 🤖 Generated with [Claude Code].... Public PR bodies must not contain AI/tool/model attribution.

  2. New files are not indexed in docs/SKILL-INDEX.md:

    • docs/addon-chaos-pod-kill-vs-engine-internal-crash-guide.md
    • docs/cases/oracle/oracle-chaos-pod-kill-vs-smon-kill-fsfo-asymmetry-case.md
  3. Two local links are broken in the case appendix:

    • oracle-f35-sync-membership-grace-period-case.md
    • ../../addon-multi-container-shared-file-symlink-write-guide.md
      Either land/link the target docs in the same PR, or remove/defer those links.
  4. The case appendix intro is missing the standard fifth field, Affected by version skew. Also add an explicit case-scope line or equivalent wording: Oracle 19c / KB 1.0.3-beta.x / one cluster / observed rounds only.

  5. The methodology doc says Affected by version skew: 不受 KB 版本影响 ("not affected by KB version"). Please soften to yes or mostly method-stable, evidence version-sensitive: engine version, HA mode, observer/broker config, container watchdog, K8s restart path, and KB sidecar behavior all affect the observed timing and failover outcome.

  6. The case file includes an absolute local path under /Users/wei/.... Prefer an archive/reference identifier or a repo-relative evidence pointer; avoid personal workstation paths in public docs.

Commit bodies look clean for AI attribution. After these are fixed I can re-run file-level checks.

Ava and others added 4 commits May 15, 2026 13:26
Round 13 = 3 consecutive SMON kills on primary, no manual reinstate
between cycles. Outcomes split 1/3 self-recovery / 2/3 FSFO — direct
observation of FSFO bimodal distribution. Cluster self-stabilized
through 3 cycles, reinstate time stable (~120-150s, no degradation).

Methodology guide gains:
- "B-instance primary FSFO probabilism" section: race between
  self-recovery total time and FSFO threshold, with watchdog tick
  phase as the dominant tie-breaker (engine-neutral)
- Engine adaptation checklist item 8: tune watchdog tick vs FSFO
  threshold relationship deliberately to pick one regime
- Burn-in methodology section: >=3 cycles, decoupled poll point,
  topology rotation, bimodal ratio observation
- Anti-pattern 8 (single-shot != verified) and 9 (do not poll from
  the member being killed — observed stale broker state for ~75s)

Oracle case appendix gains:
- Round 13 burn-in narrative + per-cycle table + bimodal observation
- Methodology self-disclosure on the polling-point bug found in cycle 3
- File map and round summary table updated
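
Two of the burn-in rules above (at least 3 cycles, and never polling from the member being killed) can be sketched as small helpers. Both function names are hypothetical, and the member names other than oracle-1 are assumed for illustration.

```shell
# pick_poll_member TARGET MEMBER...
# Returns the first member that is not the kill target, so that broker state
# is never read from the member being killed.
pick_poll_member() (
  target=$1
  shift
  for member in "$@"; do
    if [ "$member" != "$target" ]; then
      echo "$member"
      exit 0
    fi
  done
  exit 1
)

# tally_burn_in OUTCOME...   (each OUTCOME is "self-recovery" or "fsfo")
# Refuses to report a ratio for fewer than 3 cycles.
tally_burn_in() (
  total=0
  self=0
  fsfo=0
  for outcome in "$@"; do
    total=$(( total + 1 ))
    case $outcome in
      self-recovery) self=$(( self + 1 )) ;;
      fsfo)          fsfo=$(( fsfo + 1 )) ;;
      *) echo "unknown outcome: $outcome" >&2; exit 2 ;;
    esac
  done
  if [ "$total" -lt 3 ]; then
    echo "inconclusive: only $total cycle(s), need >= 3"
    exit 1
  fi
  echo "self-recovery=$self/$total fsfo=$fsfo/$total"
)

pick_poll_member oracle-0 oracle-0 oracle-1 oracle-2
# Round 13 counts; the order of the arguments is illustrative only.
tally_burn_in self-recovery fsfo fsfo
```

The tally is order-independent, so it only captures the bimodal ratio; per-cycle timing still belongs in the per-cycle table.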

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round 14 introduces B-listener as a fourth chaos sub-class beyond
A / B-instance / B-broker. Killing the TNS listener leaves engine
and broker daemon intact, so all existing TCP sessions (broker
peer-to-peer, redo transport, observer-to-instance) keep working
and the broker reports SUCCESS indefinitely — even six minutes after
the listener is gone. The alert log writes nothing. Only new
connections fail with ORA-12541. This is a true silent failure.

For the Oracle Addon specifically, runOracle.sh's watchdog only
pgreps ora_pmon; tnslsnr is never checked, so a listener crash
never triggers ctr restart and never produces operator-visible
signal. Filed as F39, to be fixed in a separate engine PR rather
than mixed with this docs PR.

Methodology guide gains:
- A fourth axis (B-listener) with explanation of why existing
  TCP sessions mask the failure
- Matrix row #11 and a stronger contract: must test #1 + #6 + #8 + #11
- Engine adaptation checklist item 9: listener / port-listener
  watchdog coverage
- Anti-pattern 10: "broker SUCCESS = cluster healthy" is wrong;
  broker SUCCESS only reflects already-established TCP sessions

Oracle case appendix gains:
- Round 14 narrative, F39 fix sketch, file map, summary table row

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round 15 lifts round 14's B-listener kill from a non-target standby
to the primary. The silent failure pattern reproduces in full:
broker SUCCESS for the entire 133s observation window, no role
change, no target shift, no FSFO. The new evidence at this position
is that the observer reports "Last Ping to Primary: 1 second ago"
while any new dgmgrl /@ORCLCDB_0 from another pod returns ORA-12541
immediately — observer and broker keep talking over already-open
sessions while every new client connection is rejected.

The blast radius is qualitatively different. On a non-target standby
(round 14) listener death has minimal cluster-level impact. On the
primary (round 15) it is a production-grade application outage with
no broker alarm and no auto-recovery. Manual `lsnrctl start` in the
oracle ctr restored the cluster in seconds without bouncing
anything.

That promotes F39 (runOracle.sh watchdog does not pgrep tnslsnr) to
a Sev-1 candidate. The fix is cheap — pgrep -x tnslsnr in the same
watchdog loop, try lsnrctl start once, exit for ctr restart on
failure — and will be filed as a separate engine PR.
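
The fix outline above could look roughly like this. It is a sketch only, not the content of the separate engine PR: the function name, the `LISTENER_SETTLE_SECONDS` variable, and the settle delay are assumptions, and how it is wired into runOracle.sh is not shown in this PR.

```shell
# check_listener_once
# One watchdog iteration for the listener. Returns 0 if tnslsnr is running
# (or came back after a single restart attempt), non-zero if the caller
# should exit so that the container is restarted.
check_listener_once() {
  if pgrep -x tnslsnr >/dev/null 2>&1; then
    return 0
  fi
  echo "watchdog: tnslsnr not running, attempting one restart" >&2
  # Single restart attempt; its exit status is ignored on purpose because
  # the pgrep below is the source of truth.
  lsnrctl start >/dev/null 2>&1 || true
  # Hypothetical settle delay before re-checking (default 5 seconds).
  sleep "${LISTENER_SETTLE_SECONDS:-5}"
  pgrep -x tnslsnr >/dev/null 2>&1
}
```

Inside an existing watchdog loop this would be called as `check_listener_once || exit 1`, next to the current ora_pmon check, so that a listener that cannot be restarted still surfaces as a container restart.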

Case appendix updated with round 15 narrative and file map. Summary
table now covers rounds 1-15.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ckers

- Round 16 closes out the B-listener position-axis matrix (silent failure at
  all three positions: non-target / primary / active FSFO target; F39 is
  position-independent)
- Case appendix intro: add the 'Affected by version skew' field and a case-scope
  line (Oracle 19c / KB 1.0.3-beta.x / single cluster / rounds 1-16 only)
- Methodology doc: change version skew from 'not affected' to 'yes: engine
  version / HA mode / observer config / watchdog implementation / KB sidecar
  all change the observed results'
- Replace the absolute evidence paths /Users/wei/... in the case file with
  repo-relative references
- Replace the dangling links (oracle-f35-sync-membership-grace-period-case.md /
  addon-multi-container-shared-file-symlink-write-guide.md) with a
  'link to be back-filled once the target doc lands' note, so nothing dangles
- SKILL-INDEX.md: add both docs (chaos pod-kill vs engine internal crash
  methodology + Oracle chaos case) to section 2, the full document list
  (文档全列表), and the Oracle case section (Oracle 案例区)
- Add round 16 to the case-appendix summary list in the methodology doc
@weicao
Contributor Author

weicao commented May 15, 2026

Thanks. All 6 blockers fixed in 0ca57f9. Summary of what changed:

  1. PR body AI attribution — stripped. PR body now starts directly with ## Summary and ends with the test-plan checklist.
  2. SKILL-INDEX.md indexing — both files added:
    • Section 2 (Writing new smoke / chaos tests) gets the methodology doc.
    • The full document list (文档全列表) gets a detailed entry for the methodology doc.
    • Case materials / Oracle (案例材料 / Oracle) gets a detailed entry for the case file.
  3. Broken local links in case appendix — replaced with a "deferred-link" description that explains where the related findings live in local evidence archives and that links will be back-filled when the target docs land. No more dangling links.
  4. Case appendix intro — added Affected by version skew: yes — ... field (lists engine version, HA mode, observer config, KB sidecar, watchdog as factors that change outcomes) + explicit case-scope line: "Oracle 19c / KB 1.0.3-beta.x / one cluster (o19p15v9 on idc2) / rounds 1–16 only".
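  5. Methodology doc version-skew claim: softened from "not affected by KB version" (不受 KB 版本影响) to "yes: the methodology is stable across versions, but specific observations and timings depend on versions at several layers. Engine version, HA mode, observer / broker config, container watchdog implementation, K8s restart path, and KB sidecar behavior all change the FSFO trigger probability and recovery time. When citing a specific case, these envelope fields must be pinned down along with it."
  6. Absolute local paths: /Users/wei/... replaced with repo-relative archive references (evidence-skyworth-oracle-19c/...), plus a maintainer-contact note for evidence access.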

Also pushed Round 16 (B-listener kill on active FSFO target, completes the position-axis matrix and confirms F39 is position-independent) in the same commit. The PR scope grew from 13 → 16 rounds; titles and summary text updated to match.

Ready for re-check.

@weicao
Contributor Author

weicao commented May 15, 2026

Second pass: the content blockers from my earlier review are largely fixed. I verified the updated PR body is clean, both new docs are indexed, the previous dangling links are gone, case scope/version-skew wording is present, absolute /Users/wei/... evidence paths were replaced, git diff --check is clean, and repo local-link check only reports the existing kb-api-reference regex examples.

One remaining blocking issue before merge:

  1. Commit history still contains AI/tool co-author trailers.

    These three commits still include Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> in their commit bodies:

    • 60cc2873 (docs: add round 13 burn-in + FSFO probabilism methodology)
    • 4f6dd16d (docs: add round 14 (B-listener silent failure) + F39 watchdog gap)
    • d65528f1 (docs: add round 15 (B-listener kill on primary) — F39 to Sev-1)

    Please rewrite/squash the branch so every commit in origin/main..HEAD is clean for AI/tool attribution. The PR body is already clean; this is only about commit bodies.

Suggested self-check after rewriting:

git log origin/main..HEAD --format='%H%n%B%n---END---' \
  | grep -Ei 'Generated with|Anthropic|Claude|Codex|OpenAI|AI agent|🤖|noreply@anthropic|noreply@openai|noreply@codex|^Co-authored-by:.*(Anthropic|Claude|Codex|OpenAI|AI agent|🤖|noreply|bot)'

Expected result: no output.
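
If the check should gate a script or CI step rather than be read by eye, it can be wrapped in a filter that reads commit messages on stdin. This is a sketch: `has_ai_attribution` is a hypothetical name, and the pattern is a trimmed subset of the one above rather than a replacement for it.

```shell
# has_ai_attribution
# Reads commit messages on stdin; exits 0 if an attribution marker is found,
# non-zero otherwise (same convention as grep).
has_ai_attribution() {
  grep -Eiq 'Generated with|Anthropic|Claude|Codex|OpenAI|AI agent|noreply@(anthropic|openai|codex)'
}

# Example gate, meant to be run inside the repository:
#   if git log origin/main..HEAD --format=%B | has_ai_attribution; then
#     echo "attribution found in commit bodies" >&2
#     exit 1
#   fi
```

Because the function follows grep's exit-status convention, "no output" in the manual check corresponds to a non-zero status here.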

After that rewrite, I can do a final file-level pass and merge if the diff is unchanged.
