
[SPARK-56480][TESTS] Fix flaky SparkLauncherSuite due to race conditions in cleanup #55344

Closed
dbtsai wants to merge 1 commit into apache:master from dbtsai:flakydix

Conversation

@dbtsai
Member

@dbtsai dbtsai commented Apr 15, 2026

What changes were proposed in this pull request?

The flakiness is caused by a combination of three issues:

  1. Bug in BaseSuite.postChecks() assertion (primary cause): The assertion assertNull(server) checks a local variable that was assigned from LauncherServer.getServer(). After server.close() sets the static serverInstance to null, the local variable still holds the old non-null reference, causing the assertion to always fail if the server was alive at cleanup time.
  2. Missing handle disposal wait (race trigger): testInProcessLauncherGetError() and testSparkLauncherGetError() only call handle.kill() but do not wait for the handle to be fully disposed via waitFor(). This leaves a race window where the LauncherServer is still alive when postChecks() runs, triggering the buggy assertion.
  3. Insufficient SparkContext shutdown timeout (contributing factor): waitForSparkContextShutdown() had only a 5-second timeout with 10ms polling. Under CI load, SparkContext.stop() can take longer, causing additional flaky failures.
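The stale-reference pattern behind issue 1 can be reproduced in a few lines. The sketch below uses a hypothetical `StaticHolder` class (invented for illustration; not Spark's actual `LauncherServer`) that mirrors the same singleton shape: `close()` clears the static field, but a local variable captured earlier still holds the old reference, so an assertion on the local never observes the close.

```java
// Hypothetical minimal reproduction of issue 1. StaticHolder stands in for
// a LauncherServer-style singleton; it is not Spark's real class.
class StaticHolder {
    private static StaticHolder instance;

    static synchronized StaticHolder getOrCreate() {
        if (instance == null) {
            instance = new StaticHolder();
        }
        return instance;
    }

    static synchronized StaticHolder getInstance() {
        return instance;
    }

    void close() {
        // Clears only the static field; callers' local references are untouched.
        synchronized (StaticHolder.class) {
            if (instance == this) {
                instance = null;
            }
        }
    }
}

public class StaleReferenceDemo {
    public static void main(String[] args) {
        StaticHolder server = StaticHolder.getOrCreate(); // local captures the instance
        server.close();                                   // static field is now null

        System.out.println(server != null);                     // true: checking the local never sees the close
        System.out.println(StaticHolder.getInstance() == null); // true: re-fetching the static does
    }
}
```

This is why the fix re-fetches via `LauncherServer.getServer()` instead of asserting on the captured local.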

Fix Applied

  1. Changed assertNull(server) to assertNull(LauncherServer.getServer()) in BaseSuite.postChecks() to re-fetch the server reference after close.
  2. Added waitFor(handle) calls in testInProcessLauncherGetError() and testSparkLauncherGetError() finally blocks to ensure full handle disposal before postChecks().
  3. Increased waitForSparkContextShutdown() timeout from 5s to 30s and polling interval from 10ms to 100ms.
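Fix 3 is a standard deadline-based polling loop. The sketch below is a self-contained illustration of that shape (the `waitForShutdown` helper and its `stillRunning` predicate are hypothetical; Spark's `waitForSparkContextShutdown()` polls `SparkContext` state directly), with the widened 30 s timeout and 100 ms interval as parameters.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating fix 3's deadline-polling pattern:
// poll a condition every pollMs until it clears or timeoutMs elapses.
public class ShutdownWait {
    static boolean waitForShutdown(BooleanSupplier stillRunning,
                                   long timeoutMs, long pollMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (stillRunning.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false; // timed out while the condition still held
            }
            Thread.sleep(pollMs);
        }
        return true; // condition cleared within the timeout
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a "context" that stops ~300 ms from now.
        long stopAt = System.currentTimeMillis() + 300;
        boolean ok = waitForShutdown(() -> System.currentTimeMillis() < stopAt,
                                     30_000, 100);
        System.out.println(ok); // prints true: shutdown observed before the deadline
    }
}
```

A longer deadline with a coarser poll interval tolerates slow CI machines without busy-waiting, at the cost of up to one extra poll interval of latency after the condition clears.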

Why are the changes needed?

SparkLauncherSuite fails non-deterministically in CI. Issue 1 is a plain assertion bug, but it is only exposed by the timing-dependent races in issues 2 and 3, which makes the failures hard to reproduce locally yet common under CI load.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests in SparkLauncherSuite. No new tests are needed — the fixes eliminate
races in the test infrastructure itself rather than changing production behavior.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

…ons in cleanup

### What changes were proposed in this pull request?

Three interrelated race conditions caused non-deterministic failures in
`SparkLauncherSuite`:

1. **Assertion logic bug**: `BaseSuite.postChecks()` was calling
   `assertNull(server)` on a local variable captured before `server.close()`.
   Since `close()` sets the static `serverInstance` to null (not the local
   reference), the assertion always failed whenever the server was still alive
   at cleanup time. Fix: re-fetch via `LauncherServer.getServer()` after closing.

2. **Missing `waitFor()` calls**: `testInProcessLauncherGetError` and
   `testSparkLauncherGetError` called `handle.kill()` in their `finally` blocks
   but did not wait for the handle to be fully disposed. This left a window where
   `postChecks()` could run while `AbstractAppHandle.dispose()` was still
   executing and `serverInstance` was still non-null. Fix: add `waitFor(handle)`
   after each `kill()`. Since `kill()` synchronously calls `dispose()`, the
   subsequent `waitFor` check returns almost immediately in practice.

3. **Insufficient timeout**: `waitForSparkContextShutdown()` used a 5-second
   timeout with 10 ms polling, which proved too tight under heavy CI load.
   Fix: extend to 30 seconds with 100 ms polling.
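The kill-then-wait pattern from item 2 can be sketched as follows. The `Handle` class below is an invented stand-in (the real suite uses `SparkAppHandle` and `BaseSuite.waitFor`); it models disposal completing asynchronously so the race window is visible, and shows why a blocking wait after `kill()` closes it.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of fix 2: kill() in the finally block, then block
// until disposal finishes so later cleanup checks cannot race with it.
public class KillThenWait {
    static class Handle {
        final CountDownLatch disposed = new CountDownLatch(1);

        void kill() {
            // Model disposal completing on a background thread; in the real
            // suite this is the window postChecks() could race into.
            new Thread(disposed::countDown).start();
        }

        void waitFor() throws InterruptedException {
            disposed.await(); // block until disposal has fully finished
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Handle handle = new Handle();
        try {
            // ... test body would run here ...
        } finally {
            handle.kill();
            handle.waitFor(); // the added call: cleanup checks now run after disposal
        }
        System.out.println(handle.disposed.getCount() == 0); // prints true
    }
}
```

Since `kill()` in the real suite triggers `dispose()` synchronously, the added wait is nearly free in practice while guaranteeing ordering.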

### Why are the changes needed?

The tests were non-deterministically failing in CI due to the races above.

### Does this PR introduce _any_ user-facing change?

No. Only test infrastructure is affected.

### How was this patch tested?

Existing tests in `SparkLauncherSuite` and `InProcessLauncherSuite`.
@dbtsai dbtsai closed this in aa2e1af Apr 16, 2026