[ES-1884446] Classify transient/mis-categorized server errors with retryable SQL states#1430
Open
samikshya-db wants to merge 3 commits intodatabricks:mainfrom
Open
Conversation
…tates Several DatabricksSQLException categorizations were either missing a SQL state or assigning one that did not reflect a retryable transient failure, making it impossible for callers to identify retryable errors at runtime. Reclassify three patterns in checkOperationStatusForErrors via a new classifyTransientSqlState helper: - Unity Catalog unavailability (UC_CLIENT_EXCEPTION / "Failed to contact the Unity Catalog server", currently undocumented XXUCC) -> 08S01 (communication link failure, retryable). - Connection-acquisition / parquet read deadlines (PARQUET_FAILED_READ_FOOTER, "DEADLINE_EXCEEDED: acquiring connection", currently no SQL state) -> 08S01. - Server-side ConcurrentModificationException currently mis-mapped to 42000 (syntax/access violation) -> 40001 (serialization failure, retryable concurrency error). Adds unit tests for the classifier plus an end-to-end test verifying the SQL state is reassigned through the polling pipeline. Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
…ror sites Promotes classifyTransientSqlState to a shared SqlStateClassifier helper and calls it from every Thrift error site (executeAsync, checkResponseForErrors, verifySuccessStatus, and both branches of checkOperationStatusForErrors) so the same server failure surfaces with the same SQL state regardless of which response carries it. Tightens patterns to anchor on stable server-emitted tokens ([UC_CLIENT_EXCEPTION], [PARQUET_FAILED_READ_FOOTER], java.util.ConcurrentModificationException) instead of English prose, so user SQL string literals cannot trigger a false remap. Gates the ConcurrentModificationException -> 40001 remap on originalSqlState == 42000 to preserve unrelated 42xxx states (e.g. 42501 insufficient privilege). Logs the original -> remapped state at INFO at each remap site so support engineers can recover the original state from logs. Replaces helper-coupled unit tests with a dedicated SqlStateClassifierTest and adds end-to-end tests for the operation-state branch and the CME path in DatabricksThriftAccessorTest. Updates NEXT_CHANGELOG.md to call out (a) callers depending on XXUCC/42000 must update and (b) the failure-telemetry error-name bucket migration. Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
The shared SqlStateClassifier helper introduced in the previous commit was applied at every Thrift-side error site, but the SEA path (DatabricksSdkClient.handleFailedExecution) still passed the server's SQL state through unmodified — so the same UC outage / parquet deadline / ConcurrentModificationException patterns surfaced via SEA would still be mis-categorized. Apply the classifier in SEA's handleFailedExecution and log the original → remapped state at INFO when they differ, matching the convention used at the Thrift sites. Adds a SEA-path test that asserts an XXUCC-tagged UC failure surfaces with 08S01. Co-authored-by: Isaac Signed-off-by: samikshya-chand_data <samikshya.chand@databricks.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Several
DatabricksSQLExceptioncategorizations during Thrift operation polling were either missing a SQL state or assigning one that did not reflect a retryable transient failure, making it impossible for callers to identify retryable errors at runtime. This PR adds a small classifier that remaps three known patterns to standard, retryable SQL states.UC_CLIENT_EXCEPTION/Failed to contact the Unity Catalog server(Unity Catalog unavailability)XXUCC(undocumented)08S01communication link failurePARQUET_FAILED_READ_FOOTER/DEADLINE_EXCEEDED: acquiring connection08S01ConcurrentModificationException(server-side mutation race)42000(syntax/access violation — incorrect)40001serialization failureThe remap happens inside
checkOperationStatusForErrors(both the status-code branch and the operation-state branch), via a new package-privateclassifyTransientSqlState(errorMessage, originalSqlState)helper that returns the original state when no pattern matches. This follows the same pattern as the recently added08S01mapping forTTransportExceptionpolling failures (#1426).Test plan
DatabricksThriftAccessorTestunit tests for each of the four classifier branches (UC, ConcurrentModificationException, parquet/deadline, unrelated-passthrough).testExecute_remapsUnityCatalogErrorToCommunicationLinkFailure) verifying SQL state is reassigned throughaccessor.execute(...)polling./etc/hostsJFrog-migration leftover redirectingrepo.maven.apache.orgto localhost; relying on CI for full run).This pull request and its description were written by Isaac.