Skip to content

Fix race condition in thrift client #999

Merged
samikshya-db merged 9 commits into
databricks:mainfrom
samikshya-db:samikshya-chand_data/main
Sep 25, 2025
Merged

Fix race condition in thrift client #999
samikshya-db merged 9 commits into
databricks:mainfrom
samikshya-db:samikshya-chand_data/main

Conversation

@samikshya-db
Copy link
Copy Markdown
Collaborator

@samikshya-db samikshya-db commented Sep 15, 2025

Description

Testing

  • Reset code before this commit where we fixed getFunctions, and the following code no longer throws error
for (int i = 0; i < 10; i++) {
      try {
        DatabaseMetaData metaData = con.getMetaData();
        metaData.getFunctions("main", "testschema", null).close(); // Metadata call

        Statement stmt = con.createStatement();
        stmt.execute("select * from samikshyachand.default.hello_table"); // SQL query
        stmt.close();

        Thread.sleep(100); // Add a small delay
      } catch (Exception e) {
        e.printStackTrace();
      }
    }

Additional Notes to the Reviewer

@samikshya-db samikshya-db changed the title Samikshya chand data/main Fix race condition in thrift client Sep 15, 2025
@samikshya-db samikshya-db marked this pull request as ready for review September 16, 2025 06:22
private final String endpointUrl;
private final IDatabricksConnectionContext connectionContext;
private TProtocolVersion serverProtocolVersion = JDBC_THRIFT_VERSION;
private ThreadLocal<TCLIService.Client> FAKE_SHARED_CLIENT;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't use caps for non final variables

if (!DriverUtil.isRunningAgainstFake()) {
// Create a new thrift client for each thread as client state is not thread safe. Note that
// the underlying protocol uses the same http client which is thread safe
this.thriftClient =
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't need this logic any more?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Samikshya explained to me offline, @jayantsing-db can you also take a look as you worked on this previously.

}

byte[] requestPayload;
synchronized (requestBuffer) {
Copy link
Copy Markdown
Collaborator

@jayantsing-db jayantsing-db Sep 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i am understanding correctly, a JDBC connection will now have a single thrift client. So, when a JDBC connection is used in different threads concurrently, it will use the same thrift client. Now, if we don't put a sync mechanism, different threads using the same thrift client will fail because:

  • thread A can increment the seq number of thrift client from x to x + 1
  • thread B can receive the response for the seq number x. thread B will see that seq num is not matching and will throw exception: out of sequence response

This section of the code with all good intent and purposes tries to put a sync mechanism in place. But I am afraid, this might not be sufficient.

So the thrift calls happen in this order sequentially for a thread:

TCLIService.Client.ExecuteStatement -> seqid ++ -> transport.write -> transport.flush -> <wait for the http response> -> transport.read (when reading it checks the seqid with that present in message)

The current sync mechanism only protects transport.write and transport.read. And things could fail in exciting ways because the whole chain is not isolated/protected.

Happy to chat if this does not make sense at all.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, JDBC is a sync protocol and client app is not advised to use the same connection in different threads but there is no strict enforcement. Also i wrote a basic test on top of these changes that uses a JDBC connection across many threads and executes statements. I didn't encounter any errors (but i believe like other threading issues that's by chance that threads didn't step over each other)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this change, my goal was specifically to address the getFunctions issue raised in this repo. I may not be fully following the broader concern you mentioned — could you please share a concrete test case where the current fix fails? That would really help me understand it better.

@jayantsing-db
Copy link
Copy Markdown
Collaborator

jayantsing-db commented Sep 25, 2025

I see that Samikshya has included a reproducer and these changes fix the reproducer while the earlier thread-local changes were failing. Do we have a understanding of why thread-local changes were failing and these are passing (in case there is any misunderstanding)? I can try to run the reproducer with earlier changes and these changes to understand more.

@samikshya-db
Copy link
Copy Markdown
Collaborator Author

samikshya-db commented Sep 25, 2025

@jayantsing-db

The failure path shows that the getFunctions call is constructing a TGetFunctionsReq with functionName:null -> and the server rejects it (Required field 'functionName' was not present!) -> This leaves the TCLIService.Client in a bad state, which then leaks into the subsequent ExecuteStatement call — hence the req$9 is null error showing up there as well.

The earlier thread-local changes didn’t help because they only isolated clients at the thread level, but the same TCLIService.Client object was still being reused within a connection and carrying over the bad state. With this PR, the client is properly reset between requests, so the reproducer consistently passes.

@samikshya-db
Copy link
Copy Markdown
Collaborator Author

Hello Gopal, Jayant !
I am merging the 2 following PR, but let me know if you have any additional feedback - I will address in follow up PRs. Thanks for the review!

@samikshya-db samikshya-db merged commit 165ab43 into databricks:main Sep 25, 2025
12 of 13 checks passed
@samikshya-db samikshya-db deleted the samikshya-chand_data/main branch September 25, 2025 13:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants