Fix race condition in thrift client by samikshya-db · Pull Request #999 · databricks/databricks-jdbc

samikshya-db · 2025-09-15T18:08:09Z

Description

This issue was uncovered while investigating the bug raised in #958 ([BUG] - query that projects various columns intermittently throws Cannot invoke "org.apache.hive.service.rpc.thrift.TGetFunctionsReq.isSetOperationId()" because "req$9" is null). Although that bug was resolved by PR #979 (add a workaround for getFunctions in Thrift), the cause is deeper than that: TCLIService.Client leaks state into the next call when a previous server call fails.
This PR fixes the underlying issue.
Internal doc on root-cause and analysis: https://docs.google.com/document/d/12kciNIqy-kL1e0HOQs9JOjJlb2zWeNFl02e9kO772PE/edit?tab=t.0

Testing

I reset the code to the state before the commit that fixed getFunctions, applied this change, and confirmed that the following code no longer throws an error:

for (int i = 0; i < 10; i++) {
  try {
    DatabaseMetaData metaData = con.getMetaData();
    metaData.getFunctions("main", "testschema", null).close(); // Metadata call

    Statement stmt = con.createStatement();
    stmt.execute("select * from samikshyachand.default.hello_table"); // SQL query
    stmt.close();

    Thread.sleep(100); // Add a small delay
  } catch (Exception e) {
    e.printStackTrace();
  }
}

Additional Notes to the Reviewer

gopalldb · 2025-09-16T11:07:11Z

+  private final String endpointUrl;
+  private final IDatabricksConnectionContext connectionContext;
  private TProtocolVersion serverProtocolVersion = JDBC_THRIFT_VERSION;
+  private ThreadLocal<TCLIService.Client> FAKE_SHARED_CLIENT;


nit: don't use all caps for non-final variables

gopalldb · 2025-09-16T11:07:48Z

-    if (!DriverUtil.isRunningAgainstFake()) {
-      // Create a new thrift client for each thread as client state is not thread safe. Note that
-      // the underlying protocol uses the same http client which is thread safe
-      this.thriftClient =


Do we no longer need this logic?

Samikshya explained this to me offline. @jayantsing-db, can you also take a look, since you worked on this previously?

jayantsing-db · 2025-09-25T10:32:19Z

    }
-
+    byte[] requestPayload;
+    synchronized (requestBuffer) {


If I understand correctly, a JDBC connection will now have a single Thrift client. So when a JDBC connection is used in different threads concurrently, they will all use the same Thrift client. If we don't put a sync mechanism in place, different threads using the same Thrift client will fail because:

1. Thread A can increment the seq number of the Thrift client from x to x + 1.

2. Thread B can receive the response for seq number x. Thread B will see that the seq number does not match and will throw an exception: out of sequence response.

This section of the code, with all good intent, tries to put a sync mechanism in place, but I am afraid it might not be sufficient.

For a single thread, the Thrift calls happen sequentially in this order:

TCLIService.Client.ExecuteStatement -> seqid++ -> transport.write -> transport.flush -> <wait for the HTTP response> -> transport.read (when reading, it checks the seqid against the one present in the message)

The current sync mechanism protects only transport.write and transport.read. Because the whole chain is not isolated/protected, things could fail in exciting ways.
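To make the concern concrete, here is a minimal sketch. It assumes a hypothetical `FakeClient` that models only the seqid bookkeeping described above; it is not the real `TCLIService.Client`, it performs no I/O, and the `SeqIdDemo` harness is likewise invented for illustration. It contrasts one lock held across the whole send-then-receive chain with two separate locks around each half:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a Thrift client; models only the seqid bookkeeping. */
class FakeClient {
  private int seqId = 0;

  /** Mimics the send half: bump the seqid and return the id stamped on the request. */
  int send() {
    return ++seqId;
  }

  /** Mimics the receive half: the response's seqid must match the client's current one. */
  void receive(int responseSeqId) {
    if (responseSeqId != seqId) {
      throw new IllegalStateException("out of sequence response");
    }
  }
}

class SeqIdDemo {
  /** Runs `threads` workers issuing `calls` calls each on ONE shared client; returns failures. */
  static int run(int threads, int calls, boolean lockWholeCall) {
    FakeClient client = new FakeClient();
    AtomicInteger failures = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.submit(() -> {
        for (int i = 0; i < calls; i++) {
          try {
            if (lockWholeCall) {
              synchronized (client) { // one lock covers send AND receive
                client.receive(client.send());
              }
            } else {
              int id;
              synchronized (client) { id = client.send(); }  // protected
              Thread.yield();                                 // stands in for the HTTP round trip
              synchronized (client) { client.receive(id); }  // protected, but seqid may have moved
            }
          } catch (IllegalStateException e) {
            failures.incrementAndGet();
          }
        }
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(30, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return failures.get();
  }

  public static void main(String[] args) {
    System.out.println("whole-call lock failures: " + run(8, 1000, true));
    System.out.println("split lock failures:      " + run(8, 1000, false));
  }
}
```

With the whole-call lock the failure count is always zero. With the split locks the count depends on thread scheduling, so a clean run does not show that the code is safe.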

Happy to chat if this does not make sense at all.

Ideally, JDBC is a synchronous protocol and client applications are advised not to use the same connection in different threads, but there is no strict enforcement. I also wrote a basic test on top of these changes that uses one JDBC connection across many threads and executes statements. I didn't encounter any errors, but I believe that, as with other threading issues, it was only by chance that the threads didn't step on each other.
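A stress test along these lines could be sketched as follows. `SharedConnectionStress`, `SqlTask`, `countFailures`, and `selectOne` are hypothetical names introduced for illustration, not code from this PR; the only real APIs used are `java.sql` and `java.util.concurrent`:

```java
import java.sql.Connection;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SharedConnectionStress {
  /** A unit of work that may throw, e.g. one statement execution. */
  interface SqlTask {
    void run() throws Exception;
  }

  /** Runs the task `iterations` times on each of `threads` threads; returns how many runs threw. */
  static int countFailures(int threads, int iterations, SqlTask task) {
    AtomicInteger failures = new AtomicInteger();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.submit(() -> {
        for (int i = 0; i < iterations; i++) {
          try {
            task.run();
          } catch (Exception e) {
            failures.incrementAndGet();
          }
        }
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(60, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return failures.get();
  }

  /** Example task: every thread executes a trivial query on the SAME connection. */
  static SqlTask selectOne(Connection con) {
    return () -> {
      try (Statement stmt = con.createStatement()) {
        stmt.execute("SELECT 1");
      }
    };
  }
}
```

Given an open connection `con`, a call such as `SharedConnectionStress.countFailures(16, 100, SharedConnectionStress.selectOne(con))` returns the number of failed executions. A result of zero only means no interleaving went wrong in that particular run.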

For this change, my goal was specifically to address the getFunctions issue raised in this repo. I may not be fully following the broader concern you mentioned — could you please share a concrete test case where the current fix fails? That would really help me understand it better.

jayantsing-db · 2025-09-25T10:48:21Z

I see that Samikshya has included a reproducer, and that these changes fix it while the earlier thread-local changes did not. Do we have an understanding of why the thread-local changes were failing and these are passing (in case there is any misunderstanding)? I can try to run the reproducer with the earlier changes and with these changes to understand more.

samikshya-db · 2025-09-25T11:54:13Z

@jayantsing-db

The failure path shows that the getFunctions call constructs a TGetFunctionsReq with functionName:null, and the server rejects it (Required field 'functionName' was not present!). This leaves the TCLIService.Client in a bad state, which then leaks into the subsequent ExecuteStatement call; hence the req$9 is null error shows up there as well.

The earlier thread-local changes didn’t help because they only isolated clients at the thread level, but the same TCLIService.Client object was still being reused within a connection and carrying over the bad state. With this PR, the client is properly reset between requests, so the reproducer consistently passes.
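The difference can be illustrated with a small sketch. It assumes a hypothetical `LeakyClient` whose `dirty` flag stands in for whatever internal state a rejected call leaves behind; the real `TCLIService.Client` internals are not modelled, and a fresh client per request is used here only as the simplest way to show a clean state between requests:

```java
import java.util.function.Supplier;

/** Hypothetical client whose internal state is corrupted when a call fails part-way. */
class LeakyClient {
  private boolean dirty = false; // stands in for state left behind by a failed call

  String call(String request) {
    if (dirty) {
      throw new IllegalStateException("stale state from a previous failed call");
    }
    if (request == null) {
      dirty = true; // the failure leaves the client half-way through a call
      throw new IllegalArgumentException("required field was not present");
    }
    return "ok:" + request;
  }
}

class StateLeakDemo {
  /** Issues a failing call followed by a valid one; returns true if the valid one succeeds. */
  static boolean secondCallSucceeds(Supplier<LeakyClient> clientForRequest) {
    try {
      clientForRequest.get().call(null); // e.g. a metadata request the server rejects
    } catch (IllegalArgumentException expected) {
      // the rejection itself is expected; the question is what it leaves behind
    }
    try {
      clientForRequest.get().call("select 1");
      return true;
    } catch (IllegalStateException leaked) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Thread-local: both calls on this thread get the SAME client instance.
    ThreadLocal<LeakyClient> perThread = ThreadLocal.withInitial(LeakyClient::new);
    System.out.println(secondCallSucceeds(perThread::get));   // false
    // Clean client per request: the failed call cannot affect the next one.
    System.out.println(secondCallSucceeds(LeakyClient::new)); // true
  }
}
```

A thread-local only separates threads from each other. Two consecutive calls on the same thread still share one instance, which matches the sequential reproducer in the description.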

samikshya-db · 2025-09-25T13:02:54Z

Hello Gopal, Jayant!
I am merging the PR, but let me know if you have any additional feedback; I will address it in follow-up PRs. Thanks for the review!

samikshya-db added 2 commits September 15, 2025 17:58

Fix race condition in thrift accessor

fc65c7c

Fix client issue

37e3b18

samikshya-db changed the title ~~Samikshya chand data/main~~ Fix race condition in thrift client Sep 15, 2025

samikshya-db and others added 3 commits September 16, 2025 00:18

update code to be better

fae8c12

Merge branch 'main' into samikshya-chand_data/main

6867875

Fix PR checks

d83d672

samikshya-db marked this pull request as ready for review September 16, 2025 06:22

samikshya-db and others added 2 commits September 16, 2025 14:44

Fix tests

fd547c4

Merge branch 'main' into samikshya-chand_data/main

6c9c2a0

samikshya-db requested a review from gopalldb September 16, 2025 09:18

Revert back to original thrift method

479fd66

gopalldb reviewed Sep 16, 2025

gopalldb requested a review from jayantsing-db September 16, 2025 18:12

samikshya-db requested a review from gopalldb September 22, 2025 11:05

Merge branch 'main' into samikshya-chand_data/main

0aecd99

gopalldb approved these changes Sep 25, 2025

jayantsing-db reviewed Sep 25, 2025

samikshya-db merged commit 165ab43 into databricks:main Sep 25, 2025
12 of 13 checks passed

samikshya-db deleted the samikshya-chand_data/main branch September 25, 2025 13:03

Fix race condition in thrift client #999: samikshya-db merged 9 commits into databricks:main from samikshya-db:samikshya-chand_data/main
