tests: stabilize TestUpdateMemberWhenRecovery retry path #10285
okJiang wants to merge 1 commit into tikv:master
Signed-off-by: okjiang <819421878@qq.com>
Walkthrough (CodeRabbit)
A test file for TSO keyspace group functionality is modified to handle transient failures gracefully. Instead of immediately failing when a GetTS call errors after node restart, the test now retries with a timeout and waits for eventual success, accounting for temporary stale metadata issues.
/retest |
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #10285 +/- ##
==========================================
+ Coverage 78.77% 78.81% +0.03%
==========================================
Files 525 525
Lines 70824 70824
==========================================
+ Hits 55795 55819 +24
+ Misses 11020 10997 -23
+ Partials 4009 4008 -1
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: JmPotato
/ok-to-retest |
What problem does this PR solve?
Issue Number: ref #9994
`TestUpdateMemberWhenRecovery` is flaky. CI failures show transient errors right after TSO node restart:
- `ErrKeyspaceGroupModRevisionStale` from `FindGroupByKeyspaceID`
- `[PD:client:ErrClientCreateTSOStream]create TSO stream failed, retry timeout`
This means the restarted TSO node is already serving, but keyspace-group watch state can still be stale for a short window.
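The staleness condition can be pictured with a small model. The sketch below is not PD's implementation: `checkModRevision` and `errModRevisionStale` are hypothetical names, and it assumes the check is a plain comparison between the mod revision carried by a response and the latest revision already known to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// errModRevisionStale is a hypothetical stand-in for an error such as
// ErrKeyspaceGroupModRevisionStale; it is not PD's actual error value.
var errModRevisionStale = errors.New("keyspace group mod revision is stale")

// checkModRevision models the assumed staleness check: a response whose mod
// revision is older than the revision the caller has already seen is rejected.
func checkModRevision(responseRev, currentRev int64) error {
	if responseRev < currentRev {
		return fmt.Errorf("%w: response mod revision %d vs current %d",
			errModRevisionStale, responseRev, currentRev)
	}
	return nil
}

func main() {
	// A freshly restarted node may answer before its watch state has caught up.
	fmt.Println(checkModRevision(0, 28))
	// Once the metadata has caught up, the same lookup passes.
	fmt.Println(checkModRevision(28, 28))
}
```

Under this model the error is transient by construction: it disappears as soon as the responding side catches up, which is why a single assertion made immediately after restart can fail.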
Root-cause evidence chain
- 57522640945) fails in this test with:
  - `ErrKeyspaceGroupModRevisionStale` (response mod revision `0` vs current `28`)
  - `ErrClientCreateTSOStream ... retry timeout` at `tso_keyspace_group_test.go:737`
- `GetTS` right after restart: `WaitForPrimaryServing` -> one-shot `re.NoError(result.err)`
- `WaitForPrimaryServing` only guarantees server liveness/primary serving, not full keyspace-group metadata catch-up, so the one-shot assertion is timing-sensitive.
Historical analog
Pattern: `flaky_stabilization` + `test_harness_alignment` from the flaky fix playbook (explicitly tolerate async readiness windows).
Closest corpus analog: #10203 (test: fix flaky test TestForwardTestSuite in next-gen), where test assertions were aligned with real readiness timing rather than assuming immediate steady state.
What is changed and how does it work?
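As a rough illustration of the retry approach taken in this PR, the sketch below polls a call until it succeeds or an overall bound elapses, giving each attempt its own short context. It is not the test's actual code: `eventually` is a simplified stand-in for a helper like `testutil.Eventually`, `flakyGetTS` is a hypothetical fake for `GetTS`, and the durations are scaled down from the PR's 60s bound, 500ms tick, and 10s per-attempt context.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errStale is a hypothetical transient error standing in for the failures
// seen right after a TSO node restart.
var errStale = errors.New("keyspace group mod revision is stale")

// eventually polls cond every tick until it returns true or waitFor elapses.
// Simplified stand-in for a test helper; not PD's testutil.Eventually.
func eventually(cond func() bool, waitFor, tick time.Duration) bool {
	deadline := time.Now().Add(waitFor)
	for {
		if cond() {
			return true
		}
		if time.Now().Add(tick).After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

// flakyGetTS returns a fake GetTS whose first `failures` calls return errStale.
func flakyGetTS(failures int) func(context.Context) (int64, error) {
	calls := 0
	return func(ctx context.Context) (int64, error) {
		if err := ctx.Err(); err != nil {
			return 0, err
		}
		calls++
		if calls <= failures {
			return 0, errStale
		}
		return int64(calls), nil
	}
}

// waitForTS retries getTS until it succeeds, with a fresh context per attempt
// so that one slow attempt cannot consume the whole retry budget.
func waitForTS(getTS func(context.Context) (int64, error),
	waitFor, tick, perAttempt time.Duration) (int64, bool) {
	var ts int64
	ok := eventually(func() bool {
		ctx, cancel := context.WithTimeout(context.Background(), perAttempt)
		defer cancel()
		got, err := getTS(ctx)
		if err != nil {
			return false // transient failure: try again on the next tick
		}
		ts = got
		return true
	}, waitFor, tick)
	return ts, ok
}

// demo runs the retry loop with scaled-down durations.
func demo(failures int) (int64, bool) {
	return waitForTS(flakyGetTS(failures),
		200*time.Millisecond, 10*time.Millisecond, 100*time.Millisecond)
}

func main() {
	ts, ok := demo(2)
	fmt.Println(ts, ok)
}
```

The overall bound keeps a genuinely broken node from hanging the test, while the per-attempt context keeps any single call from blocking past its own deadline.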
- `GetTS` scenario.
- `GetTS` is retried with `testutil.Eventually` (bounded by 60s, 500ms tick), each retry using a short 10s context.
Risk and impact
Verification commands and results
cd tests/integrations && make gotest GOTEST_ARGS='-tags without_dashboard ./mcs/keyspace -run TestKeyspaceGroupTestSuite/TestUpdateMemberWhenRecovery -count=5 -v'
ok github.com/tikv/pd/tests/integrations/mcs/keyspace 67.244s
Check List
Tests
Release note