
Fix IdleConnectionEvictor thread leak in long-running services #1271

Open

samikshya-db wants to merge 2 commits into main from fix/1221-evictor-thread-leak

Conversation

@samikshya-db
Collaborator

Summary

Fixes #1221: IdleConnectionEvictor threads accumulate over time in long-running services even when all JDBC connections are properly closed.

Root cause (two related bugs in DatabricksDriverFeatureFlagsContextFactory):

  1. Ref-count overcounting: getInstance() incremented the ref-count on every call, but isTelemetryEnabled() calls it repeatedly for the same connection while DatabricksConnection.close() only calls removeInstance() once. This meant the feature-flags context (and its 15-minute scheduler) never shut down after the last connection closed.

  2. Stale connection context → orphaned HTTP client: The feature-flags context is shared per-host but HTTP clients are scoped per-connection UUID. The context stored the first connection's connectionContext. When that connection closed, its HTTP client was removed from the factory. On the next 15-min refresh, refreshAllFeatureFlags() called DatabricksHttpClientFactory.getClient(staleContext) — the UUID was gone, so computeIfAbsent silently created a brand-new DatabricksHttpClient (and a new IdleConnectionEvictor thread) that nobody would ever clean up.
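The second bug can be illustrated with a minimal sketch (class and method names here are hypothetical, not the driver's actual API): a per-UUID cache built on `computeIfAbsent` cannot distinguish "never existed" from "already closed", so a lookup with a stale key silently recreates an entry that nothing will ever remove.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the stale-key hazard: a per-connection-UUID client
// cache recreates an entry for a key that was already cleaned up.
public class StaleKeySketch {
    static final ConcurrentHashMap<String, Object> clients = new ConcurrentHashMap<>();

    static Object getClient(String connectionUuid) {
        // computeIfAbsent cannot tell "never existed" from "already closed"
        return clients.computeIfAbsent(connectionUuid, k -> new Object());
    }

    public static void main(String[] args) {
        getClient("conn-1");          // connection opens, client created
        clients.remove("conn-1");     // connection closes, client evicted
        getClient("conn-1");          // periodic refresh with a stale context:
                                      // a brand-new client is silently created
        System.out.println("orphaned entries: " + clients.size());
        // prints: orphaned entries: 1
    }
}
```

In the real driver each recreated client also spawns an IdleConnectionEvictor thread, which is what accumulates.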

Fix (mirrors the pattern already used correctly in TelemetryClientFactory):

  • FeatureFlagsContextHolder now tracks connection UUIDs (ConcurrentHashMap<String, IDatabricksConnectionContext>) instead of a bare ref-count.
  • getInstance() registers the UUID idempotently — no overcounting.
  • removeInstance() removes the UUID; when the set is empty the context shuts down.
  • When the "owner" connection closes but others remain, the stored connectionContext is atomically updated to another active connection so future HTTP-client lookups resolve to a live client.
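The UUID-based tracking described above can be sketched roughly as follows; names are illustrative stand-ins, not the driver's real classes:

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of UUID-based lifecycle tracking: registration is
// idempotent per connection UUID, and shutdown happens exactly when the
// last tracked UUID is removed.
public class FeatureFlagsHolderSketch {
    private final ConcurrentHashMap<String, String> activeConnections = new ConcurrentHashMap<>();
    private volatile boolean shutDown = false;

    // Idempotent: registering the same UUID any number of times counts once.
    void register(String connectionUuid, String connectionContext) {
        activeConnections.putIfAbsent(connectionUuid, connectionContext);
    }

    void remove(String connectionUuid) {
        activeConnections.remove(connectionUuid);
        if (activeConnections.isEmpty()) {
            shutDown = true; // stand-in for stopping the refresh scheduler
        }
    }

    boolean isShutDown() { return shutDown; }
}
```

With this shape, repeated `register` calls from the same connection are harmless, and a single `remove` per connection is sufficient.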

Test plan

  • DatabricksDriverFeatureFlagsContextFactoryTest — updated existing tests for new UUID-based semantics and added:
    • testMultipleCallsWithSameConnectionAreIdempotent — single connection, multiple getInstance calls, one removeInstance is sufficient
    • testMultipleConnectionsRequireAllToClose — two connections share context; closing one keeps it alive, closing both shuts it down
    • testConnectionContextUpdatedWhenOwnerConnectionCloses — verifies the stored connectionContext switches to another connection when the owner closes
  • All existing DatabricksDriverFeatureFlagsContextTest tests still pass
  • Full build with mvn test passes
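The owner-swap behavior the third test exercises can be sketched like this (hypothetical names, simplified to strings standing in for connection contexts): when the connection whose context is stored closes, ownership atomically falls over to another still-active connection.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch: the stored "owner" context is handed off to a
// remaining active connection when the owning connection closes.
public class OwnerSwapSketch {
    final ConcurrentHashMap<String, String> contexts = new ConcurrentHashMap<>();
    final AtomicReference<String> ownerContext = new AtomicReference<>();

    void register(String uuid, String context) {
        contexts.putIfAbsent(uuid, context);
        ownerContext.compareAndSet(null, context); // first registrant owns
    }

    void remove(String uuid, String context) {
        contexts.remove(uuid);
        // If the closing connection owned the stored context, hand ownership
        // to any remaining active connection so future lookups stay live.
        if (ownerContext.compareAndSet(context, null) && !contexts.isEmpty()) {
            ownerContext.compareAndSet(null, contexts.values().iterator().next());
        }
    }
}
```

A non-owner close leaves the stored context untouched; only the owner's close triggers the swap, and the last close leaves no owner, matching the shutdown path.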

…textFactory

The feature-flags context is shared per-host but HTTP clients are scoped
per-connection UUID. The old ref-count approach was broken in two ways:

1. Multiple getInstance() calls from the same connection incremented the
   ref-count each time (e.g. from isTelemetryEnabled()), but
   DatabricksConnection.close() only calls removeInstance() once. This
   meant the ref-count never reached zero after close, so the context (and
   its scheduler) never shut down.

2. When a connection closed and its HTTP client was removed from the
   factory, the feature-flags context still held a reference to that
   connection's context (UUID). On the next 15-minute refresh,
   refreshAllFeatureFlags() called getClient(staleContext), which
   re-created a DatabricksHttpClient (and a new IdleConnectionEvictor
   thread) for a UUID that nobody would ever clean up.

Fix: switch to UUID-based tracking (mirroring TelemetryClientFactory).
- getInstance() now idempotently registers the connection UUID.
- removeInstance() removes the UUID; when the set is empty the context is
  shut down.
- When the "owner" connection (whose context is stored in the feature-flags
  context) closes but other connections remain, the stored connectionContext
  is updated to another active connection so future HTTP-client lookups
  resolve correctly.

Closes #1221

Signed-off-by: Samikshya Chand <samikshya.chand@databricks.com>
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
@@ -70,6 +84,7 @@ public static void setFeatureFlagsContext(
contextMap.put(
key,
Collaborator

What if there is an existing holder for this key? Should we check and shut that one down?

Collaborator

@gopalldb left a comment


Added a comment on the existing-key case; PTAL, LG otherwise



Development

Successfully merging this pull request may close these issues.

[BUG] Connection evictor threads accumulate over time with Databricks JDBC driver 3.1.1 in long-running service
