-
Notifications
You must be signed in to change notification settings - Fork 594
HDDS-10306. Speed up TestSnapshotBackgroundServices #9721
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hevinhsu for the patch.
...n-test/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotBackgroundServices.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR speeds up the TestSnapshotBackgroundServices integration test by reusing a single shared MiniOzoneHACluster across test cases and adding a helper API to restart OMs with updated configuration, avoiding full cluster re-creation.
Changes:
- Refactored
TestSnapshotBackgroundServicesto use a per-class shared HA cluster and per-test recovery/setup. - Added
MiniOzoneHAClusterImpl#getInactiveOM()and a helper to restart OMs with a config customizer. - Adjusted OM restart behavior to ensure an inactive OM is re-activated before restart.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
hadoop-ozone/mini-cluster/src/main/java/org/apache/hadoop/ozone/MiniOzoneHAClusterImpl.java |
Adds inactive OM access and a restart-with-config helper to support faster test reconfiguration. |
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotBackgroundServices.java |
Reworks the test lifecycle to reuse the HA cluster and selectively restart/recover OMs between test cases. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for (OzoneManager om : toRestart) { | ||
| if (!om.stop()) { | ||
| continue; | ||
| } | ||
| om.join(); | ||
| om.restart(); | ||
| GenericTestUtils.waitFor(om::isRunning, 1000, 30000); | ||
| } |
Copilot
AI
Feb 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
restartOzoneManagersWithConfigCustomizer silently skips restarting an OM when om.stop() returns false (even though the OM was in the toRestart list). This can mask a failed stop and leave the cluster partially restarted with a mix of old/new config. Consider failing fast (throw) or at least logging and still attempting a restart / waiting for a consistent state.
| @BeforeEach | ||
| public void setupTest() throws IOException, InterruptedException, TimeoutException { | ||
| recoverCluster(); | ||
| stopFollowerOM(cluster.getOMLeader()); |
Copilot
AI
Feb 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setupTest passes cluster.getOMLeader() into stopFollowerOM(...), but getOMLeader() can legally return null transiently (eg during leader transitions) and would cause the helper to stop an arbitrary OM. Prefer using the result of cluster.waitForLeaderOM() (or assert non-null) to avoid test flakiness.
| stopFollowerOM(cluster.getOMLeader()); | |
| OzoneManager leader = cluster.waitForLeaderOM(); | |
| stopFollowerOM(leader); |
...n-test/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotBackgroundServices.java
Show resolved
Hide resolved
| assertNotNull(files); | ||
| int numberOfSstFiles = files.length; | ||
|
|
||
| assertEquals(cluster.getOMLeader(), newLeaderOM); |
Copilot
AI
Feb 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertEquals(cluster.getOMLeader(), newLeaderOM) can be flaky because getOMLeader() explicitly returns null when no leader is ready or when multiple leaders appear ready. If the intent is to verify leadership, use cluster.waitForLeaderOM() (and compare), or assert newLeaderOM.isLeaderReady().
| assertEquals(cluster.getOMLeader(), newLeaderOM); | |
| assertEquals(cluster.waitForLeaderOM(), newLeaderOM); |
| if (configCustomizer != null) { | ||
| configCustomizer.accept(configuration); | ||
| } | ||
| om.setConfiguration(configuration); |
Copilot
AI
Feb 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Access of element annotated with VisibleForTesting found in production code.
| configCustomizer.accept(configuration); | ||
| } | ||
| om.setConfiguration(configuration); | ||
| if (om.isRunning()) { |
Copilot
AI
Feb 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Access of element annotated with VisibleForTesting found in production code.
| } | ||
| om.join(); | ||
| om.restart(); | ||
| GenericTestUtils.waitFor(om::isRunning, 1000, 30000); |
Copilot
AI
Feb 11, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Access of element annotated with VisibleForTesting found in production code.
| GenericTestUtils.waitFor(om::isRunning, 1000, 30000); | |
| long startTime = System.currentTimeMillis(); | |
| while (!om.isRunning()) { | |
| if (System.currentTimeMillis() - startTime > 30000) { | |
| throw new TimeoutException("Timed out waiting for OzoneManager to start."); | |
| } | |
| Thread.sleep(1000); | |
| } |
What changes were proposed in this pull request?
Speed up
TestSnapshotBackgroundServicesby reusing a sharedMiniOzoneHAClusterinstead of creating a new cluster for each test case.Please describe your PR in detail:
TestSnapshotBackgroundServicesto use a sharedMiniOzoneHAClusterto reduce repeated cluster initialization cost.What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-10306
How was this patch tested?
https://github.com/hevinhsu/ozone/actions/runs/21741639312
Additionally, the following test was executed locally to verify the improvement:
before:
after: