Implement getSchemas() for SEA#802
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR implements the getSchemas() method for the SEA client by retrieving catalogs and fetching their schemas in parallel, using a newly introduced JdbcThreadUtils for multi-threaded operations. Key changes include:
- Introducing JdbcThreadUtils with parallelMap and parallelFlatMap methods for concurrent task execution.
- Modifying DatabricksDatabaseMetaData#getSchemas() to fetch catalogs and schemas concurrently.
- Adding comprehensive tests for JdbcThreadUtils in JdbcThreadUtilsTest.java.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/test/java/com/databricks/jdbc/common/util/JdbcThreadUtilsTest.java | New tests covering multiple execution paths for the JdbcThreadUtils methods. |
| src/main/java/com/databricks/jdbc/model/telemetry/enums/DatabricksDriverErrorCode.java | Updated error code ordering with addition of OPERATION_TIMEOUT_ERROR. |
| src/main/java/com/databricks/jdbc/common/util/JdbcThreadUtils.java | Introduced parallelMap and parallelFlatMap; contains error handling for interruptions and execution errors. |
| src/main/java/com/databricks/jdbc/api/impl/DatabricksDatabaseMetaData.java | Updated getSchemas() to process catalog schemas in parallel using JdbcThreadUtils. |
Comments suppressed due to low confidence (2)
src/main/java/com/databricks/jdbc/common/util/JdbcThreadUtils.java:71
- The error code THREAD_INTERRUPTED_ERROR is referenced but not defined in DatabricksDriverErrorCode. Please add an appropriate definition or update to an existing error code.
throw new DatabricksSQLException("Parallel execution interrupted", e, DatabricksDriverErrorCode.THREAD_INTERRUPTED_ERROR);
src/main/java/com/databricks/jdbc/common/util/JdbcThreadUtils.java:80
- The error code INVALID_STATE is used but not defined in DatabricksDriverErrorCode. Please add it to the enum or replace it with an existing, appropriate error code.
throw new DatabricksSQLException("Error in parallel execution", e, DatabricksDriverErrorCode.INVALID_STATE);
There was a problem hiding this comment.
Pull Request Overview
This PR implements the getSchemas() method for the SEA client by fetching catalogs and concurrently retrieving schema information, while introducing the JdbcThreadUtils for parallel operations. Key changes include:
- Adding JdbcThreadUtils and associated unit tests for parallel processing.
- Enhancing error handling and telemetry error codes.
- Updating the getSchemas() method in DatabricksDatabaseMetaData to use parallel processing.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/main/java/com/databricks/jdbc/common/util/JdbcThreadUtils.java | Introduces parallelMap and parallelFlatMap methods with proper thread-context handling. |
| src/test/java/com/databricks/jdbc/common/util/JdbcThreadUtilsTest.java | Adds comprehensive unit tests for parallel execution, exception, and timeout scenarios. |
| src/main/java/com/databricks/jdbc/model/telemetry/enums/DatabricksDriverErrorCode.java | Inserts a new error code (OPERATION_TIMEOUT_ERROR) and reorders error codes. |
| src/main/java/com/databricks/jdbc/api/impl/DatabricksDatabaseMetaData.java | Updates getSchemas() to fetch catalogs and execute schema retrieval in parallel. |
Comments suppressed due to low confidence (1)
src/main/java/com/databricks/jdbc/common/util/JdbcThreadUtils.java:73
- The error code THREAD_INTERRUPTED_ERROR is referenced here but is not defined in the DatabricksDriverErrorCode enum. Consider adding the THREAD_INTERRUPTED_ERROR to the enum or using an existing error code that reflects an interruption.
throw new DatabricksSQLException("Parallel execution interrupted", e, DatabricksDriverErrorCode.THREAD_INTERRUPTED_ERROR);
There was a problem hiding this comment.
is there a basis for selecting this particular value for the timeout? Had the same concerns for the max threads variable, but that is configurable.
There was a problem hiding this comment.
I think i added based on if timeout is long enough and below is one data-point:
On the e2-dogfood environment, which has around 9,000 schemas, this approach took approximately 12 seconds—compared to 9 seconds with the existing driver—an acceptable difference.
There was a problem hiding this comment.
how are we testing parallel execute here?
There was a problem hiding this comment.
this particular test method just tests the parallel execute with a single item/task testParallelExecuteWithSingleItem. But there are other test methods that tests parallel execute with multiple items.
Testing methodology:
- parallel map is executed over a set of items
- for each item, the task is to create a new upper case string
String::toUpperCase - we assert the upper-case
Description
Testing
Additional Notes to the Reviewer
Currently afaik, it's not possible to set this configuration directly through the JDBC connection.We need to add a DBSQL config at https://github.com/databricks-eng/universe/blob/68a3f9bf4aa09b4d28f85229cbbb4c6e7bcef7e2/common/dbsql-config/src/SqlConfig.scala#L35 for the configuration which is WIP.