Skip to content

tests: stabilize TestUpdateMemberWhenRecovery retry path#10285

Open
okJiang wants to merge 1 commit intotikv:masterfrom
okJiang:codex/flaky-9994-update-member-recovery
Open

tests: stabilize TestUpdateMemberWhenRecovery retry path#10285
okJiang wants to merge 1 commit intotikv:masterfrom
okJiang:codex/flaky-9994-update-member-recovery

Conversation

@okJiang
Copy link
Member

@okJiang okJiang commented Mar 3, 2026

What problem does this PR solve?

Issue Number: ref #9994

TestUpdateMemberWhenRecovery is flaky. CI failures show transient errors right after TSO node restart:

  • ErrKeyspaceGroupModRevisionStale from FindGroupByKeyspaceID
  • followed by [PD:client:ErrClientCreateTSOStream]create TSO stream failed, retry timeout

This means the restarted TSO node is already serving, but keyspace-group watch state can still be stale for a short window.

Root-cause evidence chain

  • Issue CI run (job 57522640945) fails in this test with:
    • ErrKeyspaceGroupModRevisionStale (response mod revision 0 vs current 28)
    • then ErrClientCreateTSOStream ... retry timeout at tso_keyspace_group_test.go:737
  • Current test logic requires immediate success of the in-flight GetTS right after restart:
    • restart node -> WaitForPrimaryServing -> one-shot re.NoError(result.err)
  • WaitForPrimaryServing only guarantees server liveness/primary serving, not full keyspace-group metadata catch-up, so the one-shot assertion is timing-sensitive.

Historical analog

Pattern: flaky_stabilization + test_harness_alignment from flaky fix playbook (explicitly tolerate async readiness windows).

Closest corpus analog: #10203 (test: fix flaky test TestForwardTestSuite in next-gen) where test assertions were aligned with real readiness timing rather than assuming immediate steady state.

What is changed and how does it work?

  • Keep existing async in-flight GetTS scenario.
  • If the first in-flight result returns an error after node restart, retry GetTS with testutil.Eventually (bounded by 60s, 500ms tick), each retry using a short 10s context.
  • This preserves test intent (service should recover without legacy fallback), while removing the brittle “must succeed immediately after restart” assumption.

Risk and impact

  • Low risk, test-only change.
  • No production logic changes.
  • Slightly longer retry window only when transient recovery errors occur.

Verification commands and results

  • cd tests/integrations && make gotest GOTEST_ARGS='-tags without_dashboard ./mcs/keyspace -run TestKeyspaceGroupTestSuite/TestUpdateMemberWhenRecovery -count=5 -v'
    • PASS (ok github.com/tikv/pd/tests/integrations/mcs/keyspace 67.244s)

Check List

Tests

  • Integration test

Release note

None.

Summary by CodeRabbit

  • Tests
    • Improved TSO keyspace group test resilience by implementing retry logic with timeout for transient failures during node restart scenarios.

Signed-off-by: okjiang <819421878@qq.com>
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 3, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. labels Mar 3, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5c3e2dcd-8057-4618-a723-12aba4b3e18a

📥 Commits

Reviewing files that changed from the base of the PR and between 78b9e3d and aa84038.

📒 Files selected for processing (1)
  • tests/integrations/mcs/keyspace/tso_keyspace_group_test.go

📝 Walkthrough

Walkthrough

A test file for TSO keyspace group functionality is modified to handle transient failures gracefully. Instead of immediately failing when a GetTS call errors after node restart, the test now retries with timeout and waits for eventual success, accounting for temporary stale metadata issues.

Changes

Cohort / File(s) Summary
Test Resilience Enhancement
tests/integrations/mcs/keyspace/tso_keyspace_group_test.go
Replaces direct assertion on GetTS call with retry-and-wait pattern; adds timeout logic and explanatory comments about transient metadata staleness during TSO node recovery.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

lgtm, approved

Suggested reviewers

  • rleungx
  • bufferflies
  • lhy1024

Poem

🐰 Hops through transient troubles with grace,
Retries gently in time and in place,
Stale metadata fades like morning dew,
The test waits patient, as TSO nodes renew! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the test being stabilized and the specific change (retry path), accurately reflecting the main purpose of this test-only fix.
Description check ✅ Passed The description covers all required sections: problem statement with issue reference, root-cause analysis with CI evidence, detailed explanation of the change, risk assessment, verification commands, and appropriate test checklist and release notes.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 3, 2026
@okJiang
Copy link
Member Author

okJiang commented Mar 4, 2026

/retest

@okJiang okJiang marked this pull request as ready for review March 4, 2026 02:31
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 4, 2026
@codecov
Copy link

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.81%. Comparing base (78b9e3d) to head (aa84038).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10285      +/-   ##
==========================================
+ Coverage   78.77%   78.81%   +0.03%     
==========================================
  Files         525      525              
  Lines       70824    70824              
==========================================
+ Hits        55795    55819      +24     
+ Misses      11020    10997      -23     
+ Partials     4009     4008       -1     
Flag Coverage Δ
unittests 78.81% <ø> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JmPotato

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Mar 6, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 6, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-06 03:48:19.39880684 +0000 UTC m=+502743.976886035: ☑️ agreed by JmPotato.

@ti-chi-bot ti-chi-bot bot added the approved label Mar 6, 2026
@okJiang
Copy link
Member Author

okJiang commented Mar 6, 2026

/ok-to-retest

@okJiang
Copy link
Member Author

okJiang commented Mar 6, 2026

/retest

2 similar comments
@okJiang
Copy link
Member Author

okJiang commented Mar 6, 2026

/retest

@okJiang
Copy link
Member Author

okJiang commented Mar 9, 2026

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants