Skip to content

Commit 269174e

Browse files
gopalldbclaudevikrantpuppalasreekanth-dbsamikshya-db
authored
Improve error messages for transient HTTP failures during Thrift polling (#1426)
## Summary Fixes unclear error messages when the cluster is temporarily unavailable (e.g. HTTP 502 Bad Gateway) during `GetOperationStatus` polling. Previously, errors showed `error: [null]` with no way for callers to identify them as transient/retryable. - **Enrich null error messages**: When the server returns `ERROR_STATUS` with null `errorMessage`, the driver now includes `errorCode`, `errorDetailsJson`, and `infoMessages` from `TStatus` instead of showing `error: [null]` - **Catch transient HTTP failures**: `TException` from `GetOperationStatus` in the polling loop (typically from 502/503 after retry exhaustion) is now caught and wrapped with a descriptive message explaining this is a transient communication failure - **Retryable SQL state**: Uses SQL state `08S01` (communication link failure) so callers can programmatically identify these as retryable errors ### Before ``` Operation status check failed with status code: [ERROR_STATUS] for statement [46a475dc-...], error: [null] ``` No SQL state, no way to identify as transient. ### After ``` Lost connection to server while polling statement [46a475dc-...]. This is typically a transient error (e.g. HTTP 502 Bad Gateway) indicating the cluster was temporarily unavailable. Cause: Retry failure. HTTP response code: 502, Error Message: Bad Gateway ``` SQL state: `08S01` (communication link failure). Fixes #1165 ## Test plan - [x] `DatabricksThriftAccessorTest` — 50 tests pass - [x] Updated `testExecute_setsStatementIdEvenIfStatusRequestFails` to verify `08S01` SQL state and descriptive message This pull request was AI-assisted by Isaac. --------- Signed-off-by: Gopal Lal <gopal.lal@databricks.com> Signed-off-by: Vikrant Puppala <vikrant.puppala@databricks.com> Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Vikrant Puppala <vikrant.puppala@databricks.com> Co-authored-by: Sreekanth <sreekanth.vadigi@databricks.com> Co-authored-by: Samikshya Chand <148681192+samikshya-db@users.noreply.github.com>
1 parent b2af201 commit 269174e

6 files changed

Lines changed: 115 additions & 18 deletions

File tree

NEXT_CHANGELOG.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -18,7 +18,7 @@
1818
- Fixed primitive types within complex types (ARRAY, MAP, STRUCT) not being correctly parsed when Arrow serialization uses alternate formats: TIMESTAMP/TIMESTAMP_NTZ as epoch microseconds or component arrays, and BINARY as base64-encoded strings.
1919
- Fixed `PARSE_SYNTAX_ERROR` for column names containing special characters (e.g., dots) when `EnableBatchedInserts` is enabled, by re-quoting column names with backticks in reconstructed multi-row INSERT statements.
2020
- Fixed Volume ingestion for SEA mode, which was broken due to statement being closed prematurely.
21-
- Fixed preserving error message during query results expiry and graceful handling of RuntimeException.
21+
- Fixed unclear `error: [null]` messages during transient HTTP failures (e.g. 502 Bad Gateway) in Thrift polling. Error messages now include server error details and use SQL state `08S01` (communication link failure) so callers can identify retryable errors. Also fixed `DatabricksError` (RuntimeException) from SDK client being unhandled in CloudFetch download paths.
2222
- Fixed escaped pattern characters in catalogName for `getSchemas`, as returned catalogName should be unescaped.
2323
- Fixed `getColumnClassName()` returning null for VARIANT columns in SEA mode by adding VARIANT to the type system.
2424
- Fixed `getColumns()` returning `DATA_TYPE=0` (NULL) for GEOMETRY/GEOGRAPHY columns in Thrift mode. Now returns `Types.VARCHAR` (12) when geospatial is disabled and `Types.OTHER` (1111) when enabled, consistent with SEA mode.

src/main/java/com/databricks/jdbc/common/DatabricksJdbcConstants.java

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -113,6 +113,9 @@ public final class DatabricksJdbcConstants {
113113
/** Standard SQL state for operation cancelled (SQLSTATE HY008). */
114114
public static final String OPERATION_CANCELLED_SQLSTATE = "HY008";
115115

116+
/** Standard SQL state for communication link failure (SQLSTATE 08S01). */
117+
public static final String COMMUNICATION_LINK_FAILURE_SQLSTATE = "08S01";
118+
116119
public static final int TEMPORARY_REDIRECT_STATUS_CODE = 307;
117120
public static final String REDACTED_TOKEN = "****";
118121
public static final String QUERY_TAGS = "query_tags";

src/main/java/com/databricks/jdbc/dbclient/impl/sqlexec/DatabricksSdkClient.java

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,7 @@
4141
import com.databricks.sdk.core.ApiClient;
4242
import com.databricks.sdk.core.DatabricksConfig;
4343
import com.databricks.sdk.core.DatabricksError;
44+
import com.databricks.sdk.core.DatabricksException;
4445
import com.databricks.sdk.core.http.Request;
4546
import com.databricks.sdk.service.sql.*;
4647
import com.google.common.annotations.VisibleForTesting;
@@ -488,11 +489,11 @@ public ChunkLinkFetchResult getResultChunks(
488489
req.withHeaders(getHeaders("getStatementResultN"));
489490
ResultData resultData = apiClient.execute(req, ResultData.class);
490491
return buildChunkLinkFetchResult(resultData.getExternalLinks());
491-
} catch (DatabricksError e) {
492+
} catch (DatabricksException e) {
492493
String errorMessage =
493494
String.format(
494-
"Error fetching result chunks for statement [%s] chunk [%d] (HTTP %d): %s",
495-
statementId, chunkIndex, e.getStatusCode(), e.getMessage());
495+
"Error fetching result chunks for statement [%s] chunk [%d]: %s",
496+
statementId, chunkIndex, e.getMessage());
496497
LOGGER.error(errorMessage, e);
497498
throw new DatabricksSQLException(errorMessage, e, DatabricksDriverErrorCode.SDK_CLIENT_ERROR);
498499
} catch (IOException e) {
@@ -548,11 +549,11 @@ public ResultData getResultChunksData(StatementId typedStatementId, long chunkIn
548549
Request req = new Request(Request.GET, path, apiClient.serialize(request));
549550
req.withHeaders(getHeaders("getStatementResultN"));
550551
return apiClient.execute(req, ResultData.class);
551-
} catch (DatabricksError e) {
552+
} catch (DatabricksException e) {
552553
String errorMessage =
553554
String.format(
554-
"Error fetching result data for statement [%s] chunk [%d] (HTTP %d): %s",
555-
statementId, chunkIndex, e.getStatusCode(), e.getMessage());
555+
"Error fetching result data for statement [%s] chunk [%d]: %s",
556+
statementId, chunkIndex, e.getMessage());
556557
LOGGER.error(errorMessage, e);
557558
throw new DatabricksSQLException(errorMessage, e, DatabricksDriverErrorCode.SDK_CLIENT_ERROR);
558559
} catch (IOException e) {

src/main/java/com/databricks/jdbc/dbclient/impl/thrift/DatabricksThriftAccessor.java

Lines changed: 64 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
11
package com.databricks.jdbc.dbclient.impl.thrift;
22

3+
import static com.databricks.jdbc.common.DatabricksJdbcConstants.COMMUNICATION_LINK_FAILURE_SQLSTATE;
34
import static com.databricks.jdbc.common.DatabricksJdbcConstants.OPERATION_CANCELLED_SQLSTATE;
45
import static com.databricks.jdbc.common.DatabricksJdbcConstants.QUERY_EXECUTION_TIMEOUT_SQLSTATE;
56
import static com.databricks.jdbc.common.EnvironmentVariables.*;
@@ -34,6 +35,7 @@
3435
import org.apache.thrift.TException;
3536
import org.apache.thrift.TFieldIdEnum;
3637
import org.apache.thrift.protocol.TBinaryProtocol;
38+
import org.apache.thrift.transport.TTransportException;
3739

3840
final class DatabricksThriftAccessor {
3941

@@ -329,8 +331,13 @@ private TGetOperationStatusResp pollTillOperationFinished(
329331
// Check for timeout before continuing
330332
timeoutHandler.checkTimeout();
331333

332-
// Polling for operation status
333-
statusResp = getOperationStatus(statusReq, statementId);
334+
// TTransportException means a transport-level failure (e.g. HTTP 502 Bad Gateway)
335+
// after retries were exhausted. Other TException subtypes propagate unchanged.
336+
try {
337+
statusResp = getOperationStatus(statusReq, statementId);
338+
} catch (TTransportException e) {
339+
throw buildTransportFailureException(statementId.toSQLExecStatementId(), e);
340+
}
334341
checkOperationStatusForErrors(statusResp, statementId.toSQLExecStatementId());
335342
// Save some time if sleep isn't required by breaking.
336343
if (!shouldContinuePolling(statusResp)) {
@@ -752,7 +759,11 @@ TFetchResultsResp fetchMetadataResults(TResp response, String contextDescription
752759
DatabricksDriverErrorCode.OPERATION_TIMEOUT_ERROR);
753760
while (shouldContinuePolling(statusResp)) {
754761
metadataTimeoutHandler.checkTimeout();
755-
statusResp = getThriftClient().GetOperationStatus(statusReq);
762+
try {
763+
statusResp = getThriftClient().GetOperationStatus(statusReq);
764+
} catch (TTransportException e) {
765+
throw buildTransportFailureException(statementId, e);
766+
}
756767
checkOperationStatusForErrors(statusResp, statementId);
757768
if (!shouldContinuePolling(statusResp)) {
758769
break;
@@ -822,11 +833,12 @@ private void checkOperationStatusForErrors(TGetOperationStatusResp statusResp, S
822833
// and the operation handle becomes invalid. Without this check, the polling loop would
823834
// continue indefinitely since operationState may not be set in the response.
824835
if (statusResp.isSetStatus() && isErrorStatusCode(statusResp.getStatus())) {
836+
String serverError = enrichErrorMessage(statusResp.getStatus());
825837
String errorMsg =
826838
String.format(
827839
"Operation status check failed with status code: [%s] for statement [%s], "
828840
+ "error: [%s]",
829-
statusResp.getStatus().getStatusCode(), statementId, statusResp.getErrorMessage());
841+
statusResp.getStatus().getStatusCode(), statementId, serverError);
830842
LOGGER.error(errorMsg);
831843
throw new DatabricksSQLException(
832844
errorMsg, statusResp.isSetSqlState() ? statusResp.getSqlState() : null);
@@ -838,10 +850,11 @@ private void checkOperationStatusForErrors(TGetOperationStatusResp statusResp, S
838850
}
839851

840852
if (statusResp.isSetOperationState() && isErrorOperationState(statusResp.getOperationState())) {
853+
String serverError = enrichErrorMessage(statusResp.getStatus());
841854
String errorMsg =
842855
String.format(
843856
"Operation failed with error: [%s] for statement [%s], with response [%s]",
844-
statusResp.getErrorMessage(), statementId, statusResp);
857+
serverError, statementId, statusResp);
845858
LOGGER.error(errorMsg);
846859

847860
String sqlState = statusResp.getSqlState();
@@ -855,6 +868,52 @@ private void checkOperationStatusForErrors(TGetOperationStatusResp statusResp, S
855868
}
856869
}
857870

871+
/**
872+
* Enriches a null or empty error message from TStatus by including errorCode, errorDetailsJson,
873+
* and infoMessages. Returns the original errorMessage if it is already present.
874+
*/
875+
private String enrichErrorMessage(TStatus status) {
876+
if (status == null) {
877+
return "no error details from server";
878+
}
879+
String errorMessage = status.getErrorMessage();
880+
if (errorMessage != null && !errorMessage.isEmpty()) {
881+
return errorMessage;
882+
}
883+
StringBuilder detail = new StringBuilder();
884+
if (status.isSetErrorCode()) {
885+
detail.append("errorCode=").append(status.getErrorCode());
886+
}
887+
if (status.isSetErrorDetailsJson()
888+
&& status.getErrorDetailsJson() != null
889+
&& !status.getErrorDetailsJson().isEmpty()) {
890+
if (detail.length() > 0) detail.append(", ");
891+
detail.append("details=").append(status.getErrorDetailsJson());
892+
}
893+
if (status.isSetInfoMessages() && status.getInfoMessages() != null) {
894+
if (detail.length() > 0) detail.append(", ");
895+
detail.append("infoMessages=").append(status.getInfoMessages());
896+
}
897+
return detail.length() > 0 ? detail.toString() : "no error details from server";
898+
}
899+
900+
/**
901+
* Builds a DatabricksSQLException for transport-level failures (e.g. HTTP 502 Bad Gateway) during
902+
* polling. Uses SQL state 08S01 (communication link failure) so callers can identify retryable
903+
* errors.
904+
*/
905+
private DatabricksSQLException buildTransportFailureException(
906+
String statementId, TTransportException e) {
907+
String errorMsg =
908+
String.format(
909+
"Lost connection to server while polling statement [%s] (%s). "
910+
+ "This is typically a transient error (e.g. HTTP 502 Bad Gateway) "
911+
+ "indicating the cluster was temporarily unavailable. Cause: %s",
912+
statementId, e.getClass().getSimpleName(), e.getMessage());
913+
LOGGER.error(errorMsg, e);
914+
return new DatabricksSQLException(errorMsg, e, COMMUNICATION_LINK_FAILURE_SQLSTATE);
915+
}
916+
858917
private boolean shouldContinuePolling(TGetOperationStatusResp statusResp) {
859918
return statusResp == null
860919
|| !statusResp.isSetOperationState()

src/test/java/com/databricks/jdbc/dbclient/impl/sqlexec/DatabricksSdkClientTest.java

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1120,7 +1120,6 @@ public void testGetResultChunks_DatabricksError_throwsSQLException() throws Exce
11201120
DatabricksSQLException.class,
11211121
() -> databricksSdkClient.getResultChunks(STATEMENT_ID, 0, 0));
11221122

1123-
assertTrue(exception.getMessage().contains("HTTP 404"));
11241123
assertTrue(exception.getMessage().contains("Results have expired"));
11251124
assertNotNull(exception.getCause());
11261125
}
@@ -1141,7 +1140,6 @@ public void testGetResultChunksData_DatabricksError_throwsSQLException() throws
11411140
DatabricksSQLException.class,
11421141
() -> databricksSdkClient.getResultChunksData(STATEMENT_ID, 0));
11431142

1144-
assertTrue(exception.getMessage().contains("HTTP 404"));
11451143
assertTrue(exception.getMessage().contains("Results have expired"));
11461144
assertNotNull(exception.getCause());
11471145
}

src/test/java/com/databricks/jdbc/dbclient/impl/thrift/DatabricksThriftAccessorTest.java

Lines changed: 40 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@
2626
import java.sql.Statement;
2727
import java.util.ArrayList;
2828
import org.apache.thrift.TException;
29+
import org.apache.thrift.transport.TTransportException;
2930
import org.junit.jupiter.api.AfterEach;
3031
import org.junit.jupiter.api.BeforeEach;
3132
import org.junit.jupiter.api.Test;
@@ -472,6 +473,38 @@ void testGetStatementResult_cancelled_throwsWithHY008() throws Exception {
472473
assertTrue(exception.getMessage().contains("was cancelled"));
473474
}
474475

476+
@Test
477+
void testPollingPath_errorStatusWithNullMessage_includesErrorCode() throws Exception {
478+
setup(false);
479+
TExecuteStatementReq request = new TExecuteStatementReq();
480+
TExecuteStatementResp executeResp =
481+
new TExecuteStatementResp()
482+
.setOperationHandle(tOperationHandle)
483+
.setStatus(new TStatus().setStatusCode(TStatusCode.SUCCESS_STATUS));
484+
when(thriftClient.ExecuteStatement(request)).thenReturn(executeResp);
485+
486+
// Server returns ERROR_STATUS with null errorMessage but populated errorCode
487+
TStatus errorStatus = new TStatus().setStatusCode(TStatusCode.ERROR_STATUS).setErrorCode(502);
488+
TGetOperationStatusResp errorResp =
489+
new TGetOperationStatusResp()
490+
.setStatus(errorStatus)
491+
.setOperationState(TOperationState.RUNNING_STATE);
492+
when(thriftClient.GetOperationStatus(operationStatusReq)).thenReturn(errorResp);
493+
494+
Statement statement = mock(Statement.class);
495+
when(parentStatement.getStatement()).thenReturn(statement);
496+
when(statement.getQueryTimeout()).thenReturn(0);
497+
498+
DatabricksSQLException exception =
499+
assertThrows(
500+
DatabricksSQLException.class,
501+
() -> accessor.execute(request, parentStatement, session, StatementType.SQL));
502+
503+
// Verify the enriched message includes errorCode instead of "error: [null]"
504+
assertTrue(exception.getMessage().contains("errorCode=502"));
505+
assertFalse(exception.getMessage().contains("error: [null]"));
506+
}
507+
475508
@Test
476509
void testListPrimaryKeys() throws TException, SQLException, DatabricksValidationException {
477510
setup(false);
@@ -778,7 +811,7 @@ void testExecute_setsStatementIdEvenIfStatusRequestFails()
778811
// Make execute statement succeed but get operation status fail
779812
when(thriftClient.ExecuteStatement(request)).thenReturn(tExecuteStatementResp);
780813
when(thriftClient.GetOperationStatus(any(TGetOperationStatusReq.class)))
781-
.thenThrow(new TException("Failed to get status"));
814+
.thenThrow(new TTransportException("Retry failure. HTTP response code: 502"));
782815
Statement statement = mock(Statement.class);
783816
when(parentStatement.getStatement()).thenReturn(statement);
784817
when(statement.getQueryTimeout()).thenReturn(0);
@@ -789,12 +822,15 @@ void testExecute_setsStatementIdEvenIfStatusRequestFails()
789822
try {
790823
accessor.execute(request, parentStatement, session, StatementType.SQL);
791824
fail("Expected exception due to GetOperationStatus failure");
792-
} catch (DatabricksHttpException e) {
825+
} catch (DatabricksSQLException e) {
793826
// Verify that statement ID was set on parent statement despite the failure
794827
verify(parentStatement).setStatementId(eq(expectedStatementId));
795828

796-
// Verify the error was from GetOperationStatus
797-
assertTrue(e.getMessage().contains("Failed to get status"));
829+
// Verify the error indicates a transient communication failure
830+
assertTrue(e.getMessage().contains("Lost connection to server while polling"));
831+
assertTrue(e.getMessage().contains("TTransportException"));
832+
assertTrue(e.getMessage().contains("502"));
833+
assertEquals("08S01", e.getSQLState());
798834
}
799835
}
800836

0 commit comments

Comments
 (0)