Fix TIMEDOUT_STATE not recognized as error on interactive clusters #1

Open
samikshya-db wants to merge 8 commits into main from fix/timedout-state-interactive-cluster

Conversation

@samikshya-db

Description

Follow-up to databricks#1199 ([ES-1717770]). The previous fix covered the case where FetchResults returns an error status with sqlState=57KD0. However, the interactive cluster path was still broken.

Root cause: When using interactive clusters with enableDirectResults=true, the cluster can enforce its own server-side query timeout and return TIMEDOUT_STATE directly in directResults.operationStatus — before the client's polling loop ever starts. Because isErrorOperationState did not include TIMEDOUT_STATE, the driver:

  1. Did not throw in checkOperationStatusForErrors.
  2. Never started the polling loop, because shouldContinuePolling(TIMEDOUT_STATE) returned false, so TimeoutHandler never fired.
  3. Fell through to executeFetchRequest, where the server returned an error and the driver threw DatabricksHttpException instead of DatabricksTimeoutException.

The same gap also affects the polling path when GetOperationStatus returns TIMEDOUT_STATE during polling.

Fix:

  • Add TIMEDOUT_STATE to isErrorOperationState
  • Throw DatabricksTimeoutException for TIMEDOUT_STATE in checkOperationStatusForErrors regardless of whether sqlState is set (interactive clusters do not always populate it)
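
The two changes above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the driver's actual code: the `OperationState` enum, the two `Sketch*Exception` classes, and the set of states treated as errors besides TIMEDOUT_STATE are all assumptions made for the example. Only the names isErrorOperationState, checkOperationStatusForErrors, TIMEDOUT_STATE and the sqlState value 57KD0 come from this description.

```java
import java.util.EnumSet;
import java.util.Set;

class TimedOutStateSketch {
    // Hypothetical stand-in for the server's operation-state enum.
    enum OperationState {
        RUNNING_STATE, FINISHED_STATE, ERROR_STATE, CANCELED_STATE, CLOSED_STATE, TIMEDOUT_STATE
    }

    // Hypothetical stand-ins for the driver's timeout and generic exceptions.
    static class SketchTimeoutException extends RuntimeException {
        SketchTimeoutException(String message) { super(message); }
    }

    static class SketchOperationException extends RuntimeException {
        SketchOperationException(String message) { super(message); }
    }

    // Assumed error set; the point of the fix is that TIMEDOUT_STATE is now a member.
    private static final Set<OperationState> ERROR_STATES = EnumSet.of(
        OperationState.ERROR_STATE,
        OperationState.CANCELED_STATE,
        OperationState.CLOSED_STATE,
        OperationState.TIMEDOUT_STATE);

    static boolean isErrorOperationState(OperationState state) {
        return ERROR_STATES.contains(state);
    }

    // sqlState may be null: interactive clusters do not always populate it.
    static void checkOperationStatusForErrors(OperationState state, String sqlState) {
        if (!isErrorOperationState(state)) {
            return;
        }
        // TIMEDOUT_STATE maps to a timeout exception whether or not sqlState is set.
        if (state == OperationState.TIMEDOUT_STATE) {
            throw new SketchTimeoutException("Query timed out on the server");
        }
        // Case covered by the earlier fix: an error status carrying sqlState 57KD0.
        if ("57KD0".equals(sqlState)) {
            throw new SketchTimeoutException("Query timed out (sqlState 57KD0)");
        }
        throw new SketchOperationException("Operation failed in state " + state);
    }

    public static void main(String[] args) {
        try {
            checkOperationStatusForErrors(OperationState.TIMEDOUT_STATE, null);
        } catch (SketchTimeoutException e) {
            System.out.println("timeout surfaced: " + e.getMessage());
        }
    }
}
```

Checking the state before the sqlState keeps the timeout classification independent of an optional field, which is the property the interactive-cluster path needs.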

Testing

  • testTimedOutStateInDirectResultsThrowsTimeoutException — Pavan's exact repro: server returns TIMEDOUT_STATE in directResults before polling starts
  • testTimedOutStateDuringPollingThrowsTimeoutException — server returns TIMEDOUT_STATE during polling

Additional Notes

The original ES-1717770 verification test passed on both warehouse and all-purpose cluster because setQueryTimeout(1) with a long-running query caused the server to return RUNNING_STATE first (query still in-flight), entering the polling loop where TimeoutHandler fired correctly. Pavan's repro consistently hits the other path: the cluster's own timeout fires first, returning TIMEDOUT_STATE directly, bypassing the polling loop entirely.
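
To make the difference between the two paths concrete, here is a small simulation of which exception surfaces depending on the first state the server reports. It is a simplified model under stated assumptions, not driver code: the `State` enum, the `outcome` helper, and the rule that shouldContinuePolling is true only for RUNNING_STATE are hypothetical, and the RUNNING_STATE branch assumes the query outlives the client timeout, as in the original verification test.

```java
class TimeoutPathSketch {
    // Hypothetical subset of operation states, enough to model both paths.
    enum State { RUNNING_STATE, FINISHED_STATE, TIMEDOUT_STATE }

    // Assumption: only a still-running operation keeps the polling loop alive.
    static boolean shouldContinuePolling(State state) {
        return state == State.RUNNING_STATE;
    }

    // Returns the name of the exception the caller would see, or "results".
    // timedOutIsError models whether isErrorOperationState includes TIMEDOUT_STATE.
    static String outcome(State initialState, boolean timedOutIsError) {
        if (timedOutIsError && initialState == State.TIMEDOUT_STATE) {
            // After the fix: the status check throws before anything else runs.
            return "DatabricksTimeoutException";
        }
        if (shouldContinuePolling(initialState)) {
            // Polling path: the client-side TimeoutHandler fires while polling.
            return "DatabricksTimeoutException";
        }
        if (initialState == State.TIMEDOUT_STATE) {
            // Before the fix: no polling, so the fetch request fails with a generic error.
            return "DatabricksHttpException";
        }
        return "results";
    }

    public static void main(String[] args) {
        System.out.println("RUNNING first, before fix:  " + outcome(State.RUNNING_STATE, false));
        System.out.println("TIMEDOUT first, before fix: " + outcome(State.TIMEDOUT_STATE, false));
        System.out.println("TIMEDOUT first, after fix:  " + outcome(State.TIMEDOUT_STATE, true));
    }
}
```

In this model the original verification test only ever exercised the RUNNING_STATE row, which is why it passed before the fix.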

samikshya-db and others added 8 commits March 2, 2026 17:45
When using interactive clusters with enableDirectResults=true, the server
can return TIMEDOUT_STATE directly in directResults.operationStatus when
the cluster's own query timeout fires before the client's polling loop
starts. Because TIMEDOUT_STATE was not included in isErrorOperationState,
the driver silently fell through to executeFetchRequest and threw
DatabricksHttpException instead of DatabricksTimeoutException.

Fix isErrorOperationState to include TIMEDOUT_STATE, and update
checkOperationStatusForErrors to throw DatabricksTimeoutException for
TIMEDOUT_STATE regardless of whether sqlState is set, since interactive
clusters do not always populate the SQL state field.

Add tests covering:
- TIMEDOUT_STATE in directResults (server timeout fires before polling starts)
- TIMEDOUT_STATE returned during polling

Signed-off-by: Samikshya Chand <samikshya.chand@databricks.com>
Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
samikshya-db pushed a commit that referenced this pull request Apr 28, 2026
…icks#1428)

## Summary

Three CI workflows on `main` have been failing. This PR fixes the root
causes.

### 1. `Integration Tests Workflow - Main Branch` (failing 30+ runs in a
row since Apr 8)

- The cache step uses `path: ~/.m2` with a long-lived restore-key `${{
runner.os }}-m2`. It restores a stale cache from before the JFrog OIDC
migration, whose `~/.m2/settings.xml` (the github-server one written by
`actions/setup-java`) **overwrites** the JFrog mirror config that the
preceding "Configure maven" step just wrote.
- Maven then tries to resolve from `repo.maven.apache.org` directly,
which the protected runner cannot reach: `Could not transfer artifact
... from/to central (https://repo.maven.apache.org/maven2): Remote host
terminated the handshake`.
- **Fix**: narrow `path` to `~/.m2/repository` so `settings.xml` is left
alone (matches the pattern used by `warmMavenCache.yml`).

### 2. `Weekly bug catcher` — same root cause as #1

Same fix in `bugCatcher.yml`.

### 3. `Test JDBC Logging` (Windows jobs failing since Apr 27)

- The "Get JFrog OIDC token" step uses bash syntax (`if [ -z "$X" ]`,
`set -euo pipefail`) but has no `shell:` directive. On Windows runners
the default shell is `pwsh`, which fails parsing the bash `if` with
`Missing '(' after 'if'`. The "Configure maven" step has the same
problem (bash heredoc).
- **Fix**: pin both steps to `shell: bash`. Linux jobs were unaffected
because their default shell is already bash.

## Test plan

- [ ] After merge, confirm `Integration Tests Workflow - Main Branch`
passes on the next push to `main`
- [ ] After merge, confirm `Test JDBC Logging` Windows jobs pass on the
next push to `main`
- [ ] Next scheduled `Weekly bug catcher` run (Mondays 00:00 UTC) is
green

NO_CHANGELOG=true
OVERRIDE_FREEZE=true

This pull request and its description were written by Isaac.

---------

Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com>