[ES-1774740] Fix Thrift polling infinite loop on invalid operation handle #1273
Merged
gopalldb merged 6 commits into databricks:main on Mar 13, 2026
…metadata polling timeout

When a Databricks server restarts while the JDBC driver is polling for operation status, the operation handle becomes invalid. The server returns INVALID_HANDLE_STATUS in TStatus but may not set operationState, causing shouldContinuePolling() to return true indefinitely.

Fix 1: checkOperationStatusForErrors() now inspects TStatus.statusCode for INVALID_HANDLE_STATUS (and ERROR_STATUS) during polling, not just operationState.

Fix 2: The metadata polling loop (fetchMetadataResults) previously had no timeout and no sleep between polls. It now uses a configurable MetadataOperationTimeout (default 300s) and sleeps between polls using the same interval as SQL execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Gopal Lal <gopal.lal@databricks.com>
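To make Fix 1 concrete, here is a minimal sketch of the idea, not the driver's actual code. The enums and the `StatusResponse` class are local stand-ins invented for illustration (the real driver works with Thrift-generated types), and the body of `shouldContinuePolling` is an assumption that merely reproduces the behaviour described above: an unset operation state keeps the loop going.

```java
/** Illustrative sketch only: every type here is a local stand-in, not a driver class. */
final class InvalidHandleSketch {
  private InvalidHandleSketch() {}

  // Hypothetical mirrors of the status codes and states named in the commit message.
  enum StatusCode { SUCCESS_STATUS, STILL_EXECUTING_STATUS, ERROR_STATUS, INVALID_HANDLE_STATUS }

  enum OperationState { PENDING_STATE, RUNNING_STATE, FINISHED_STATE, ERROR_STATE }

  /** Simplified view of a status response; operationState may be null. */
  static final class StatusResponse {
    final StatusCode statusCode;
    final OperationState operationState;

    StatusResponse(StatusCode statusCode, OperationState operationState) {
      this.statusCode = statusCode;
      this.operationState = operationState;
    }
  }

  /** Assumed pre-fix behaviour: an unset state looks like "still running". */
  static boolean shouldContinuePolling(OperationState state) {
    return state == null
        || state == OperationState.PENDING_STATE
        || state == OperationState.RUNNING_STATE;
  }

  /** Post-fix idea: fail on the status code first, then on the operation state. */
  static void checkOperationStatusForErrors(StatusResponse response) {
    if (response.statusCode == StatusCode.INVALID_HANDLE_STATUS
        || response.statusCode == StatusCode.ERROR_STATUS) {
      throw new IllegalStateException("Status check failed: " + response.statusCode);
    }
    if (response.operationState == OperationState.ERROR_STATE) {
      throw new IllegalStateException("Operation failed: " + response.operationState);
    }
  }
}
```

With a response of `INVALID_HANDLE_STATUS` and a null state, `shouldContinuePolling` alone would keep looping, while the status-code check throws on the first poll.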
      TRANSACTION_ROLLBACK_ERROR,
    - RATE_LIMIT_EXCEEDED
    + RATE_LIMIT_EXCEEDED,
    + METADATA_OPERATION_TIMEOUT
Collaborator
We cannot push this until the proto files are raised and merged.
Collaborator
Author
Removed. Switched to OPERATION_TIMEOUT_ERROR for now; will change to a more specific error code later in a separate PR.
vikrantpuppala
approved these changes
Mar 12, 2026
Collaborator
vikrantpuppala
left a comment
Let's remove NO_CHANGELOG=true from the PR description, thanks!
samikshya-db
approved these changes
Mar 12, 2026
sreekanth-db
approved these changes
Mar 12, 2026
Collaborator
Have we tested this manually by reproducing the issue? The test plan in the description only mentions unit tests.
Summary
- checkOperationStatusForErrors() now checks TStatus.statusCode for INVALID_HANDLE_STATUS during polling. Previously only operationState was checked: if the server returned INVALID_HANDLE_STATUS without setting operationState (e.g. after a driver restart), the polling loop ran indefinitely.
- fetchMetadataResults() previously had no timeout and no sleep between polls. It now uses a configurable MetadataOperationTimeout (default 300s) and sleeps between polls using the same interval as the SQL execution polling loop.
- MetadataOperationTimeout (seconds, default 300, 0 = no timeout) controls the metadata polling timeout.

Context
ES-1774740: After a Databricks cluster restart, the JDBC driver entered an infinite poll loop against the invalid operation handle. The root cause was that GetOperationStatus returned INVALID_HANDLE_STATUS in TStatus.statusCode but did not set operationState, so shouldContinuePolling() kept returning true. With the default queryTimeout=0 (infinite), there was no safety net to break the loop.

The metadata polling loop (fetchMetadataResults) had a separate issue: no timeout handler and no sleep between polls, meaning it could hammer the server in a tight loop indefinitely.

SEA mode is not affected: it uses HTTP status codes (e.g. 404) for invalid statements, which propagate as uncaught RuntimeException rather than causing an infinite loop.

Test plan
- testPollingThrowsOnInvalidHandleStatus: SQL execution polling detects INVALID_HANDLE_STATUS and throws
- testMetadataPollingThrowsOnInvalidHandleStatus: metadata polling detects INVALID_HANDLE_STATUS and throws
- testMetadataPollingTimesOut: metadata polling respects timeout, cancels operation, and throws
- testMetadataPollingWithSleepBetweenPolls: verifies sleep delay between metadata polls
- DatabricksThriftAccessorTest tests pass (43 existing + 4 new)

🤖 Generated with Claude Code
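The metadata polling change (a deadline plus a sleep between polls, with 0 meaning no timeout, and a cancel on timeout) could look roughly like the sketch below. This is an assumption-laden illustration rather than the code in this PR: `MetadataPollingSketch`, `pollUntilDone`, and `PollTimeoutException` are hypothetical names, the status check and the cancel call are passed in as callbacks, and the timeout is taken in milliseconds (the real setting is in seconds) so the sketch is quick to exercise.

```java
import java.util.function.BooleanSupplier;

/** Illustrative sketch only: a bounded polling loop with a sleep between polls. */
final class MetadataPollingSketch {
  private MetadataPollingSketch() {}

  /** Hypothetical stand-in for the driver's timeout error. */
  static final class PollTimeoutException extends RuntimeException {
    PollTimeoutException(String message) {
      super(message);
    }
  }

  /**
   * Polls until isDone reports completion and returns the number of polls made.
   * A timeoutMillis of 0 disables the deadline. On timeout the cancel callback
   * runs before the exception is thrown.
   */
  static int pollUntilDone(
      BooleanSupplier isDone, Runnable cancel, long timeoutMillis, long pollIntervalMillis) {
    long start = System.nanoTime();
    int polls = 0;
    while (true) {
      polls++;
      if (isDone.getAsBoolean()) {
        return polls;
      }
      long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
      if (timeoutMillis > 0 && elapsedMillis >= timeoutMillis) {
        cancel.run(); // best-effort cancel of the server-side operation
        throw new PollTimeoutException("Polling timed out after " + timeoutMillis + " ms");
      }
      try {
        Thread.sleep(pollIntervalMillis); // avoid hammering the server in a tight loop
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for callers
        throw new IllegalStateException("Interrupted while polling", e);
      }
    }
  }
}
```

Passing the status check and cancel action as callbacks keeps the loop testable without a server, which is in the spirit of the timeout and sleep tests listed in the test plan.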