[ES-1774740] Fix Thrift polling infinite loop on invalid operation handle #1273

Merged
gopalldb merged 6 commits into databricks:main from gopalldb:fix/thrift-polling-infinite-loop
Mar 13, 2026

gopalldb commented Mar 11, 2026

Summary

  • Fix infinite poll loop on invalid handle: checkOperationStatusForErrors() now checks TStatus.statusCode for INVALID_HANDLE_STATUS during polling. Previously only operationState was checked — if the server returned INVALID_HANDLE_STATUS without setting operationState (e.g. after a driver restart), the polling loop ran indefinitely.
  • Add timeout and sleep to metadata polling loop: fetchMetadataResults() previously had no timeout and no sleep between polls. It now uses a configurable MetadataOperationTimeout (default 300s) and sleeps between polls using the same interval as the SQL execution polling loop.
  • New connection property: MetadataOperationTimeout (seconds, default 300, 0 = no timeout) controls the metadata polling timeout.
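The timeout semantics above (seconds, default 300, 0 = no timeout) can be sketched as follows. This is an illustration only: the class and method names are hypothetical and are not taken from the driver, and rejecting negative values is an assumption, since the PR does not say how they are handled.

```java
import java.util.Properties;

/** Hypothetical helper, not the driver's code: models the documented timeout semantics. */
final class MetadataTimeoutSketch {
  static final String PROPERTY = "MetadataOperationTimeout"; // property name from this PR
  static final int DEFAULT_SECONDS = 300; // default stated in this PR

  /** Reads the timeout in seconds, using the default when the property is absent. */
  static int timeoutSeconds(Properties props) {
    String raw = props.getProperty(PROPERTY);
    if (raw == null || raw.trim().isEmpty()) {
      return DEFAULT_SECONDS;
    }
    int seconds = Integer.parseInt(raw.trim());
    if (seconds < 0) {
      // Assumption: the PR does not describe negative values, so reject them here.
      throw new IllegalArgumentException(PROPERTY + " must be >= 0");
    }
    return seconds;
  }

  /** A timeout of 0 means "no timeout", so the deadline never expires. */
  static boolean isExpired(int timeoutSeconds, long elapsedMillis) {
    return timeoutSeconds > 0 && elapsedMillis >= timeoutSeconds * 1000L;
  }
}
```

With the property unset, `timeoutSeconds` returns 300; with it set to `0`, `isExpired` returns false for any elapsed time.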

Context

ES-1774740: After a Databricks cluster restart, the JDBC driver entered an infinite poll loop against the invalid operation handle. The root cause was that GetOperationStatus returned INVALID_HANDLE_STATUS in TStatus.statusCode but did not set operationState, so shouldContinuePolling() kept returning true. With the default queryTimeout=0 (infinite), there was no safety net to break the loop.
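A simplified model of the failure and the fix, for readers who have not seen the polling code. The enums below are cut-down stand-ins for the Thrift `TStatusCode` and `TOperationState` types, the method bodies are assumptions based on the description above rather than the driver's actual implementation, and `IllegalStateException` stands in for whatever exception the driver throws.

```java
/** Simplified sketch of the polling checks; not the driver's actual code. */
final class PollingStatusSketch {
  // Stand-ins for the Thrift enums, reduced to the values needed here.
  enum StatusCode { SUCCESS_STATUS, STILL_EXECUTING_STATUS, ERROR_STATUS, INVALID_HANDLE_STATUS }
  enum OperationState { PENDING_STATE, RUNNING_STATE, FINISHED_STATE, ERROR_STATE }

  /**
   * Assumed shape of the loop condition: an unset (null) state is treated as
   * "still running", which is why an invalid handle used to poll forever.
   */
  static boolean shouldContinuePolling(OperationState state) {
    return state == null
        || state == OperationState.PENDING_STATE
        || state == OperationState.RUNNING_STATE;
  }

  /** The fix: inspect the status code as well, not only the operation state. */
  static void checkOperationStatusForErrors(StatusCode code, OperationState state) {
    if (code == StatusCode.INVALID_HANDLE_STATUS || code == StatusCode.ERROR_STATUS) {
      throw new IllegalStateException("Operation failed with status code " + code);
    }
    if (state == OperationState.ERROR_STATE) {
      throw new IllegalStateException("Operation failed with state " + state);
    }
  }
}
```

In this model, a response carrying `INVALID_HANDLE_STATUS` and a null state still satisfies `shouldContinuePolling`, but `checkOperationStatusForErrors` throws first, so the loop exits.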

The metadata polling loop (fetchMetadataResults) had a separate issue: no timeout handler and no sleep between polls, meaning it could hammer the server in a tight loop indefinitely.
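A minimal sketch of a bounded polling loop with a sleep between polls, in the spirit of the change to `fetchMetadataResults`. Everything here is hypothetical: the class, the `pollUntilDone` signature, and the exception type are illustrative and do not mirror the driver's code. The done-check and the cancel call are passed in as callbacks so the loop can be exercised without a server.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical bounded polling loop; not the driver's implementation. */
final class MetadataPollSketch {
  /** Unchecked stand-in for the driver's timeout error. */
  static final class PollTimeoutException extends RuntimeException {
    PollTimeoutException(String message) {
      super(message);
    }
  }

  /**
   * Polls until isDone returns true and returns the number of polls made.
   * A timeoutMillis of 0 disables the timeout. On timeout, the operation is
   * cancelled before the exception is thrown.
   */
  static int pollUntilDone(
      BooleanSupplier isDone, Runnable cancel, long timeoutMillis, long sleepMillis) {
    long startNanos = System.nanoTime();
    int polls = 0;
    while (true) {
      polls++;
      if (isDone.getAsBoolean()) {
        return polls;
      }
      long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000L;
      if (timeoutMillis > 0 && elapsedMillis >= timeoutMillis) {
        cancel.run();
        throw new PollTimeoutException("Metadata operation timed out after " + elapsedMillis + " ms");
      }
      try {
        Thread.sleep(sleepMillis); // avoid hammering the server in a tight loop
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        throw new IllegalStateException("Interrupted while polling", e);
      }
    }
  }
}
```

The timeout is checked after each unsuccessful poll, so an operation that completes on the final poll still returns normally.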

SEA mode is not affected — it uses HTTP status codes (e.g. 404) for invalid statements, which propagate as uncaught RuntimeException rather than causing an infinite loop.

Test plan

  • testPollingThrowsOnInvalidHandleStatus — SQL execution polling detects INVALID_HANDLE_STATUS and throws
  • testMetadataPollingThrowsOnInvalidHandleStatus — metadata polling detects INVALID_HANDLE_STATUS and throws
  • testMetadataPollingTimesOut — metadata polling respects timeout, cancels operation, and throws
  • testMetadataPollingWithSleepBetweenPolls — verifies sleep delay between metadata polls
  • All 47 tests in DatabricksThriftAccessorTest pass (43 existing + 4 new)

🤖 Generated with Claude Code

gopalldb and others added 2 commits March 11, 2026 17:00
…metadata polling timeout

When a Databricks server restarts while the JDBC driver is polling for operation
status, the operation handle becomes invalid. The server returns
INVALID_HANDLE_STATUS in TStatus but may not set operationState, causing
shouldContinuePolling() to return true indefinitely.

Fix 1: checkOperationStatusForErrors() now inspects TStatus.statusCode for
INVALID_HANDLE_STATUS (and ERROR_STATUS) during polling, not just operationState.

Fix 2: The metadata polling loop (fetchMetadataResults) previously had no timeout
and no sleep between polls. It now uses a configurable MetadataOperationTimeout
(default 300s) and sleeps between polls using the same interval as SQL execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
gopalldb and others added 2 commits March 11, 2026 21:37
  TRANSACTION_ROLLBACK_ERROR,
- RATE_LIMIT_EXCEEDED
+ RATE_LIMIT_EXCEEDED,
+ METADATA_OPERATION_TIMEOUT
Collaborator

we cannot push this until proto files are raised and merged

Collaborator Author

removed, switched to OPERATION_TIMEOUT_ERROR, will change later in separate PR to more specific error code

samikshya-db changed the title from "Fix Thrift polling infinite loop on invalid operation handle" to "[ES-1774740] Fix Thrift polling infinite loop on invalid operation handle" on Mar 12, 2026

vikrantpuppala left a comment

let's remove NO_CHANGELOG=true from PR description, thanks!

@sreekanth-db

Have we tested this manually by reproducing the issue? The test plan in the description only mentions unit tests.

gopalldb enabled auto-merge (squash) March 13, 2026 09:15
gopalldb merged commit c724642 into databricks:main Mar 13, 2026
14 of 15 checks passed