Implement multi-row INSERT batching for PreparedStatement #944
* Add INSERT statement detection with new INSERT_PATTERN regex
* Create InsertStatementParser utility for parsing INSERT statements
* Enhance DatabricksPreparedStatement.executeLargeBatch() to:
  - Detect compatible INSERT operations in batch
  - Combine multiple single-row INSERTs into multi-row INSERT
  - Generate optimized SQL like: INSERT INTO table VALUES (?), (?), (?)
  - Fall back to individual execution for non-INSERT statements
* Add comprehensive unit tests for all new functionality
* Maintain backward compatibility and proper JDBC error semantics

This addresses performance issues with Spark JDBC writes by reducing the number of database round-trips from N individual INSERTs to 1 multi-row INSERT statement.
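The combining step in the commit message above can be sketched roughly as follows. This is a minimal illustration and not the driver's InsertStatementParser: the class `MultiRowInsertSketch` and the method `buildMultiRowInsert` are hypothetical names, and the sketch assumes the table name and column count have already been extracted from the single-row statement.

```java
import java.util.Collections;

final class MultiRowInsertSketch {

  /**
   * Builds a multi-row INSERT with one "(?, ?, ...)" group per batched row.
   * Hypothetical helper for illustration; not part of the driver.
   */
  static String buildMultiRowInsert(String table, int columnCount, int rowCount) {
    if (columnCount < 1 || rowCount < 1) {
      throw new IllegalArgumentException("columnCount and rowCount must be positive");
    }
    // One placeholder group for a single row, e.g. "(?, ?)" for two columns.
    String row = "(" + String.join(", ", Collections.nCopies(columnCount, "?")) + ")";
    // Repeat the group once per row in the batch.
    return "INSERT INTO " + table + " VALUES "
        + String.join(", ", Collections.nCopies(rowCount, row));
  }

  public static void main(String[] args) {
    // prints "INSERT INTO table VALUES (?), (?), (?)"
    System.out.println(buildMultiRowInsert("table", 1, 3));
  }
}
```

With this shape, the parameters of each batched row would be bound in row-major order: all values of row 1, then all values of row 2, and so on.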
…ERT batching

Resolves issue where large batches exceeded Databricks' 256 parameter limit by implementing intelligent parameter chunking:

- Add MAX_QUERY_PARAMETERS constant (256) to DatabricksJdbcConstants
- Implement smart chunking logic: maxRowsPerChunk = 256 / columnCount
- Automatically split large batches into optimally-sized chunks
- Maintain multi-row INSERT performance benefits within parameter limits
- Add comprehensive tests covering chunking scenarios and edge cases
- Ensure minimum 1 row per chunk for very wide tables (>256 columns)

Example: 60 rows × 5 columns = 300 parameters (exceeds limit)
→ Automatically chunked into: 51 rows + 9 rows (255 + 45 parameters)
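The chunking rule above (maxRowsPerChunk = 256 / columnCount, with a floor of one row) can be sketched like this. The class `ParameterChunkingSketch` and the method `chunkSizes` are hypothetical names chosen for illustration; only the constant name MAX_QUERY_PARAMETERS and its value come from the commit message.

```java
import java.util.ArrayList;
import java.util.List;

final class ParameterChunkingSketch {
  // Value taken from the commit message; the sketch does not read the driver's constant.
  static final int MAX_QUERY_PARAMETERS = 256;

  /** Returns the number of rows placed in each chunk, in execution order. */
  static List<Integer> chunkSizes(int rowCount, int columnCount) {
    if (columnCount < 1) {
      throw new IllegalArgumentException("columnCount must be positive");
    }
    // Integer division; very wide tables (>256 columns) still get one row per chunk.
    int maxRowsPerChunk = Math.max(1, MAX_QUERY_PARAMETERS / columnCount);
    List<Integer> sizes = new ArrayList<>();
    for (int remaining = rowCount; remaining > 0; remaining -= maxRowsPerChunk) {
      sizes.add(Math.min(maxRowsPerChunk, remaining));
    }
    return sizes;
  }

  public static void main(String[] args) {
    // 60 rows x 5 columns: prints [51, 9], i.e. 255 + 45 parameters
    System.out.println(chunkSizes(60, 5));
  }
}
```

Note that a table wider than 256 columns would still exceed the limit with a single row; the sketch, like the commit message, only guarantees that each chunk holds at least one row.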
Pull Request Overview
This PR implements multi-row INSERT batching optimization for prepared statements to improve performance when executing large batches of INSERT operations. The implementation combines multiple single-row INSERT statements into fewer multi-row INSERT statements while respecting Databricks' 256 parameter limit.
- Adds a new InsertStatementParser utility for parsing INSERT statements and generating multi-row equivalents
- Optimizes executeBatch() and executeLargeBatch() to use multi-row INSERT when possible
- Includes parameter limit-aware chunking to handle large batches that exceed the 256 parameter maximum
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/main/java/com/databricks/jdbc/common/util/InsertStatementParser.java | New utility class for parsing INSERT statements and generating multi-row batched versions |
| src/main/java/com/databricks/jdbc/common/DatabricksJdbcConstants.java | Adds INSERT pattern constant and maximum query parameters limit |
| src/main/java/com/databricks/jdbc/api/impl/DatabricksStatement.java | Adds isInsertQuery() method to detect INSERT statements |
| src/main/java/com/databricks/jdbc/api/impl/DatabricksPreparedStatement.java | Implements multi-row INSERT batching logic with parameter chunking |
| src/test/java/com/databricks/jdbc/common/util/InsertStatementParserTest.java | Comprehensive tests for INSERT statement parsing and multi-row generation |
| src/test/java/com/databricks/jdbc/api/impl/DatabricksStatementTest.java | Tests for INSERT statement detection |
| src/test/java/com/databricks/jdbc/api/impl/DatabricksPreparedStatementTest.java | Updated tests to verify multi-row batching behavior and parameter chunking |
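
The INSERT detection listed for DatabricksStatement.java and DatabricksJdbcConstants.java might look roughly like the sketch below. The regex here is an assumption made for illustration: the actual INSERT_PATTERN is not shown on this page, and the class name `InsertDetectionSketch` is hypothetical.

```java
import java.util.regex.Pattern;

final class InsertDetectionSketch {
  // Hypothetical stand-in; the driver's real INSERT_PATTERN may differ.
  static final Pattern INSERT_PATTERN =
      Pattern.compile("^\\s*INSERT\\s+INTO\\s+", Pattern.CASE_INSENSITIVE);

  /** Returns true when the statement starts with INSERT INTO (leading whitespace allowed). */
  static boolean isInsertQuery(String sql) {
    return sql != null && INSERT_PATTERN.matcher(sql).find();
  }

  public static void main(String[] args) {
    System.out.println(isInsertQuery("insert into t values (?)"));
    System.out.println(isInsertQuery("SELECT * FROM t"));
  }
}
```

A prefix check like this does not account for leading SQL comments or statements such as INSERT OVERWRITE; whether the driver handles those is not stated here.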
jayantsing-db left a comment:
I have added some comments/suggestions.
… rollout

- Add EnableBatchedInserts connection property for controlled rollout
- Enhance Javadoc documentation with detailed INSERT compatibility examples
- Replace null returns with specific DatabricksParsingException for better debugging
- Eliminate redundant INSERT pattern validation for improved performance
- Consolidate parsing logic to reduce code duplication
- Add comprehensive input validation with clear error messages
@jayantsing-db I've committed new changes to address feedback. Can you please review again?
jayantsing-db left a comment:
Thanks for the changes. I request that the default value for the feature remains set to 0 for now to avoid any accidental disruptions. Apart from that, just a few minor comments. Please feel free to merge once those are addressed.
if (!INSERT_PATTERN.matcher(trimmedSql).find()) {
  throw new DatabricksParsingException(
      "SQL statement is not an INSERT operation: " + trimmedSql,
      DatabricksDriverErrorCode.INPUT_VALIDATION_ERROR);
}

// Then extract detailed information using our specific pattern
Matcher matcher = INSERT_DETAILS_PATTERN.matcher(trimmedSql);
Opinion: Consider reusing the matcher object.
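One way to read this suggestion is to let a single Matcher both validate the statement and supply the capture groups, so no throwaway matcher is created for the first check. The sketch below illustrates that idea under assumptions: the regex is a hypothetical stand-in for INSERT_DETAILS_PATTERN, `SingleMatcherSketch` and `parse` are made-up names, and IllegalArgumentException stands in for DatabricksParsingException to keep the example self-contained.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SingleMatcherSketch {
  // Hypothetical pattern: table name, optional column list, one VALUES group.
  static final Pattern INSERT_DETAILS_PATTERN = Pattern.compile(
      "^\\s*INSERT\\s+INTO\\s+(\\S+?)\\s*(?:\\(([^)]*)\\)\\s*)?VALUES\\s*\\(([^)]*)\\)\\s*$",
      Pattern.CASE_INSENSITIVE);

  /** Returns {table, columns, values}; columns is empty when no column list is given. */
  static String[] parse(String sql) {
    // One matcher serves as both the validity check and the source of groups.
    Matcher matcher = INSERT_DETAILS_PATTERN.matcher(sql.trim());
    if (!matcher.find()) {
      throw new IllegalArgumentException("SQL statement is not a supported INSERT: " + sql);
    }
    String columns = matcher.group(2) == null ? "" : matcher.group(2).trim();
    return new String[] {matcher.group(1), columns, matcher.group(3).trim()};
  }

  public static void main(String[] args) {
    String[] parts = parse("INSERT INTO users (id, name) VALUES (?, ?)");
    System.out.println(String.join(" | ", parts));
  }
}
```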
Gentle reminder (you may already be aware): please sign off the final commit to main. For more info, see https://github.com/databricks/databricks-jdbc/blob/main/CONTRIBUTING.md
- Changed ENABLE_BATCHED_INSERTS default value from "1" to "0" in DatabricksJdbcUrlParams
- Updated batch statement tests to explicitly enable EnableBatchedInserts=1 for proper testing
- Added lenient mocking to prevent unnecessary stubbing exceptions in test cases
- This ensures batched inserts are disabled by default while maintaining test coverage

Signed-off-by: josecsotomorales <josecsmorales@gmail.com>
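
The default-off behaviour described in this commit can be illustrated with a small sketch. Only the property name EnableBatchedInserts and the values "0" and "1" come from the commit message; the class `BatchedInsertsFlagSketch`, the method `isBatchedInsertsEnabled`, and the use of java.util.Properties as the lookup mechanism are assumptions, not the driver's actual code path.

```java
import java.util.Properties;

final class BatchedInsertsFlagSketch {
  static final String ENABLE_BATCHED_INSERTS = "EnableBatchedInserts";

  /** Treats the feature as enabled only when the property is explicitly set to "1". */
  static boolean isBatchedInsertsEnabled(Properties connectionProperties) {
    // Falls back to "0" (disabled) when the caller did not set the property.
    String value = connectionProperties.getProperty(ENABLE_BATCHED_INSERTS, "0");
    return "1".equals(value.trim());
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(isBatchedInsertsEnabled(props)); // property absent
    props.setProperty(ENABLE_BATCHED_INSERTS, "1");
    System.out.println(isBatchedInsertsEnabled(props)); // property set to "1"
  }
}
```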
@jayantsing-db, I've addressed all the requested changes and signed off on my commit. Even though the PR is approved, I'm unable to merge it due to the lack of permissions. Could you please merge it?
Hey @josecsotomorales, I just came across this post: https://qualytics.ai/blog/qualytics-databricks-partnership/. Curious whether the integration is using the OSS JDBC driver?
Hi @jayantsing-db, yep! We support two modes today:
- Standard Connector: uses the Databricks JDBC driver for broad compatibility across environments.
- Unity Catalog Mode: more Spark-native. We do direct Spark reads against Unity Catalog–managed tables, which avoids JDBC, integrates cleanly with UC permissions, and performs better at scale.

Thanks again for accepting our contributions; that helped a ton on our side! 🚀
Great, thanks and congratulations on the launch!
Linked issue: #867
This PR implements multi-row INSERT batching optimization for prepared statements to improve performance when executing large batches of INSERT operations. The implementation combines multiple single-row INSERT statements into fewer multi-row INSERT statements while respecting Databricks' 256 parameter limit.
- Adds a new InsertStatementParser utility for parsing INSERT statements and generating multi-row equivalents
- Optimizes executeBatch() and executeLargeBatch() to use multi-row INSERT when possible
- Includes parameter limit-aware chunking to handle large batches that exceed the 256 parameter maximum
Impact illustration (10k rows, 5 columns, 50 ms RTT):
• Before (single-row inserts): 10,000 statements → ~500s of RTT + server planning.
• After (batched): 197 statements (10k ÷ 51, rounded up) → ~9.9s of RTT.
• That’s about a 50× reduction in latency, not even counting server CPU savings.
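
The arithmetic behind this illustration can be checked with a short sketch. The names `BatchLatencySketch`, `statementCount`, and `roundTripMillis` are hypothetical, and the model counts only network round-trips, ignoring server planning and execution time. At 51 rows per statement, 10,000 rows need 197 statements when the last partial chunk is counted (196 × 51 = 9,996, leaving 4 rows).

```java
final class BatchLatencySketch {
  static final int MAX_QUERY_PARAMETERS = 256;

  /** Number of multi-row statements needed, rounding the last partial chunk up. */
  static long statementCount(long rows, int columns) {
    long maxRowsPerStatement = Math.max(1, MAX_QUERY_PARAMETERS / columns);
    return (rows + maxRowsPerStatement - 1) / maxRowsPerStatement; // ceiling division
  }

  /** Total round-trip time if each statement costs exactly one RTT. */
  static long roundTripMillis(long statements, long rttMillis) {
    return statements * rttMillis;
  }

  public static void main(String[] args) {
    long before = roundTripMillis(10_000, 50);                   // 500000 ms
    long after = roundTripMillis(statementCount(10_000, 5), 50); // 197 * 50 = 9850 ms
    System.out.println(before + " ms -> " + after + " ms");
  }
}
```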
Signed-off-by: josecsotomorales <josecsmorales@gmail.com>, Jayant Singh <jayant.singh@databricks.com>