Skip to content

Statement.execute() incorrectly returns true for DML containing UNION/INTERSECT/EXCEPT, breaking frameworks that rely on !execute() to detect DML #1418

@mikihiro-sonetaka

Description

@mikihiro-sonetaka

Summary

DatabricksStatement.execute(String sql) in databricks-jdbc 3.3.1 returns true (JDBC-spec "first result is a ResultSet") for DML statements whose body contains the set operators UNION, INTERSECT, or EXCEPT, even when those keywords appear inside a subquery or CTE of an INSERT / UPDATE / DELETE / MERGE statement. This violates JDBC spec and breaks frameworks that use !execute() as the DML detector, such as Slick (a Scala ORM).

Environment

  • databricks-jdbc: 3.3.1
  • Observed on both SEA and Thrift backends (UseThriftClient=1 does not change the behaviour — the decision is made client-side before any backend call)
  • Does not reproduce on Simba legacy driver 2.7.5

Root cause (from source)

src/main/java/com/databricks/jdbc/api/impl/DatabricksStatement.java:724-755 shouldReturnResultSet(String, List<String>) pre-parses the SQL string with the regex list in src/main/java/com/databricks/jdbc/common/DatabricksJdbcConstants.java:139-177:

public static final Pattern UNION_PATTERN =
    Pattern.compile("\\s+UNION\\s+", Pattern.CASE_INSENSITIVE);
public static final Pattern INTERSECT_PATTERN =
    Pattern.compile("\\s+INTERSECT\\s+", Pattern.CASE_INSENSITIVE);
public static final Pattern EXCEPT_PATTERN =
    Pattern.compile("\\s+EXCEPT\\s+", Pattern.CASE_INSENSITIVE);

These patterns are non-anchored and matched via find(). Any occurrence anywhere in the SQL — even deep inside a CTE or subquery of an outer DML — forces execute() to return true.

Minimal reproduction

INSERT INTO my_catalog.my_schema.target_table
SELECT *
FROM (
  SELECT col1, col2 FROM my_catalog.my_schema.source_table WHERE 1 = 0
  UNION ALL
  SELECT col1, col2 FROM my_catalog.my_schema.source_table WHERE 1 = 0
) t

Java test:

try (Statement st = conn.createStatement()) {
    boolean hasResultSet = st.execute(sql);
    System.out.println("hasResultSet = " + hasResultSet);
    System.out.println("updateCount  = " + st.getUpdateCount());
}

Observed:

hasResultSet = true
updateCount  = -1

Expected (per JDBC spec, since this is a DML):

hasResultSet = false
updateCount  = <number of rows affected>

Removing the UNION ALL (keeping one of the two subqueries) makes the same INSERT return hasResultSet = false as expected.

The same issue triggers on:

  • INSERT INTO t WITH cte AS (... UNION ...) SELECT ...
  • INSERT OVERWRITE DIRECTORY 's3://...' USING CSV ... SELECT ... INTERSECT SELECT ...
  • Any DML that uses SELECT * EXCEPT (col) column-exclusion syntax — \s+EXCEPT\s+ matches there too even though it is not a set operator in that position

Impact on Slick (sqlu)

Slick's StatementInvoker.results uses:

val doUpdate = !st.execute()
if (doUpdate) Left(st.getUpdateCount)
else Right(iterator over st.getResultSet)  // extracted with GetUpdateValue

where GetResult.GetUpdateValue unconditionally throws:

object GetUpdateValue extends GetResult[Int] {
  def apply(pr: PositionedResult) =
    throw new SlickException("Update statements should not return a ResultSet")
}

So every Slick sqlu against Databricks whose SQL body contains UNION / INTERSECT / EXCEPT throws on execute() = true, regardless of the actual shape of the returned ResultSet:

  • Non-empty ResultSet (e.g. INSERT INTO t ... UNION ... returning num_affected_rows metrics) → "Update statements should not return a ResultSet"
  • Empty ResultSet (e.g. INSERT OVERWRITE DIRECTORY 's3://...' ... INTERSECT ...) → NoSuchElementException("empty result set")

This is a regression from the Simba 2.x driver, which returned execute() = false for the same statements.

Why SQL-side workaround is often not feasible

  • INSERT OVERWRITE DIRECTORY combined with set operators has no equivalent rewrite that preserves both semantics and performance.
  • SELECT * EXCEPT (col) is a core Databricks SQL feature with no widely-used alternative.

Proposed fix

  1. Remove the non-anchored set-operator patterns from shouldReturnResultSet's match list. A top-level SELECT ... UNION ... or (SELECT ...) UNION (SELECT ...) is already covered by SELECT_PATTERN / FROM_PATTERN / WITH_PATTERN via anchored prefix match after comment stripping. The non-anchored patterns only matter in subquery / CTE positions, where they should not change the enclosing statement's classification.
  2. Alternatively, if prefix-based heuristics must be retained: either switch to a real SQL tokenizer that understands CTE / subquery boundaries, or anchor all three patterns to the statement start after prefix-stripping parentheses.
  3. Add INSERT_PATTERN (already defined in constants) to the negative-match path of shouldReturnResultSet so INSERT-prefixed DMLs cannot be promoted to ResultSet mode.
  4. Document the behaviour and provide a URL parameter (e.g. NonRowcountQueryPrefixes=-UNION,-INTERSECT,-EXCEPT) to let applications opt out of the legacy behaviour without code changes.

Related changelog

v3.2.1 "getResultSet() now returns null in case of DML statements to honour JDBC spec" — this fix only applies when shouldReturnResultSet == false, so it does not help the cases reported here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions