Skip to content

Conversation

@hevinhsu
Copy link
Contributor

@hevinhsu hevinhsu commented Feb 6, 2026

What changes were proposed in this pull request?

Speed up TestSnapshotBackgroundServices by reusing a shared MiniOzoneHACluster instead of creating a new cluster for each test case.

Please describe your PR in detail:

  • Refactor TestSnapshotBackgroundServices to use a shared MiniOzoneHACluster to reduce repeated cluster initialization cost.
  • Introduce a small test-only API to apply updated configurations by restarting OM, allowing configuration changes without recreating the entire cluster.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-10306

How was this patch tested?

https://github.com/hevinhsu/ozone/actions/runs/21741639312

Additionally, the following test was executed locally to verify the improvement:

mvn clean test -Dtest=TestSnapshotBackgroundServices -Dsurefire.skipAfterFailureCount=1 -pl hadoop-ozone/integration-test -am 

before:

[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 240.2 s -- in org.apache.hadoop.ozone.om.snapshot.TestSnapshotBackgroundServices

after:

[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 155.6 s -- in org.apache.hadoop.ozone.om.snapshot.TestSnapshotBackgroundServices

@adoroszlai adoroszlai requested a review from jojochuang February 6, 2026 13:12
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @hevinhsu for the patch.

@adoroszlai adoroszlai added the test label Feb 7, 2026
@smengcl smengcl requested review from Copilot and removed request for jojochuang and smengcl February 11, 2026 07:30
@smengcl smengcl self-requested a review February 11, 2026 07:30
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR speeds up the TestSnapshotBackgroundServices integration test by reusing a single shared MiniOzoneHACluster across test cases and adding a helper API to restart OMs with updated configuration, avoiding full cluster re-creation.

Changes:

  • Refactored TestSnapshotBackgroundServices to use a per-class shared HA cluster and per-test recovery/setup.
  • Added MiniOzoneHAClusterImpl#getInactiveOM() and a helper to restart OMs with a config customizer.
  • Adjusted OM restart behavior to ensure an inactive OM is re-activated before restart.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File Description
hadoop-ozone/mini-cluster/src/main/java/org/apache/hadoop/ozone/MiniOzoneHAClusterImpl.java Adds inactive OM access and a restart-with-config helper to support faster test reconfiguration.
hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/snapshot/TestSnapshotBackgroundServices.java Reworks the test lifecycle to reuse the HA cluster and selectively restart/recover OMs between test cases.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +162 to +169
for (OzoneManager om : toRestart) {
if (!om.stop()) {
continue;
}
om.join();
om.restart();
GenericTestUtils.waitFor(om::isRunning, 1000, 30000);
}
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

restartOzoneManagersWithConfigCustomizer silently skips restarting an OM when om.stop() returns false (even though the OM was in the toRestart list). This can mask a failed stop and leave the cluster partially restarted with a mix of old/new config. Consider failing fast (throw) or at least logging and still attempting a restart / waiting for a consistent state.

Copilot uses AI. Check for mistakes.
@BeforeEach
public void setupTest() throws IOException, InterruptedException, TimeoutException {
recoverCluster();
stopFollowerOM(cluster.getOMLeader());
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setupTest passes cluster.getOMLeader() into stopFollowerOM(...), but getOMLeader() can legally return null transiently (eg during leader transitions) and would cause the helper to stop an arbitrary OM. Prefer using the result of cluster.waitForLeaderOM() (or assert non-null) to avoid test flakiness.

Suggested change
stopFollowerOM(cluster.getOMLeader());
OzoneManager leader = cluster.waitForLeaderOM();
stopFollowerOM(leader);

Copilot uses AI. Check for mistakes.
assertNotNull(files);
int numberOfSstFiles = files.length;

assertEquals(cluster.getOMLeader(), newLeaderOM);
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertEquals(cluster.getOMLeader(), newLeaderOM) can be flaky because getOMLeader() explicitly returns null when no leader is ready or when multiple leaders appear ready. If the intent is to verify leadership, use cluster.waitForLeaderOM() (and compare), or assert newLeaderOM.isLeaderReady().

Suggested change
assertEquals(cluster.getOMLeader(), newLeaderOM);
assertEquals(cluster.waitForLeaderOM(), newLeaderOM);

Copilot uses AI. Check for mistakes.
if (configCustomizer != null) {
configCustomizer.accept(configuration);
}
om.setConfiguration(configuration);
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Access of element annotated with VisibleForTesting found in production code.

Copilot uses AI. Check for mistakes.
configCustomizer.accept(configuration);
}
om.setConfiguration(configuration);
if (om.isRunning()) {
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Access of element annotated with VisibleForTesting found in production code.

Copilot uses AI. Check for mistakes.
}
om.join();
om.restart();
GenericTestUtils.waitFor(om::isRunning, 1000, 30000);
Copy link

Copilot AI Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Access of element annotated with VisibleForTesting found in production code.

Suggested change
GenericTestUtils.waitFor(om::isRunning, 1000, 30000);
long startTime = System.currentTimeMillis();
while (!om.isRunning()) {
if (System.currentTimeMillis() - startTime > 30000) {
throw new TimeoutException("Timed out waiting for OzoneManager to start.");
}
Thread.sleep(1000);
}

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants