[SPARK-56480][TESTS] Fix flaky SparkLauncherSuite due to race conditions in cleanup #55344
Closed
dbtsai wants to merge 1 commit intoapache:masterfrom
Closed
[SPARK-56480][TESTS] Fix flaky SparkLauncherSuite due to race conditions in cleanup #55344dbtsai wants to merge 1 commit intoapache:masterfrom
dbtsai wants to merge 1 commit intoapache:masterfrom
Conversation
…ons in cleanup ### What changes were proposed in this pull request? Three interrelated race conditions caused non-deterministic failures in `SparkLauncherSuite`: 1. **Assertion logic bug**: `BaseSuite.postChecks()` was calling `assertNull(server)` on a local variable captured before `server.close()`. Since `close()` sets the static `serverInstance` to null (not the local reference), this assertion always fails when the server was alive at cleanup time. Fix: re-fetch via `LauncherServer.getServer()` after closing. 2. **Missing `waitFor()` calls**: `testInProcessLauncherGetError` and `testSparkLauncherGetError` called `handle.kill()` in their `finally` blocks but did not wait for the handle to be fully disposed. This left a window where `postChecks()` could run while `AbstractAppHandle.dispose()` was still executing and `serverInstance` was still non-null. Fix: add `waitFor(handle)` after each `kill()`. Since `kill()` synchronously calls `dispose()`, the subsequent `waitFor` check returns almost immediately in practice. 3. **Insufficient timeout**: `waitForSparkContextShutdown()` used a 5-second timeout with 10 ms polling, which proved too tight under heavy CI load. Fix: extend to 30 seconds with 100 ms polling. ### Why are the changes needed? The tests were non-deterministically failing in CI due to the races above. ### Does this PR introduce _any_ user-facing change? No. Only test infrastructure is affected. ### How was this patch tested? Existing tests in `SparkLauncherSuite` and `InProcessLauncherSuite`.
viirya
approved these changes
Apr 16, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
The flakiness is caused by a combination of three issues:
Fix Applied
Why are the changes needed?
SparkLauncherSuitefails non-deterministically. All three issues are timing-dependentraces rather than logic bugs, making them hard to reproduce locally but common under CI load.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing tests in
SparkLauncherSuite. No new tests are needed — the fixes eliminateraces in the test infrastructure itself rather than changing production behavior.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.6