Fix race condition in thrift client #999
Conversation
| private final String endpointUrl; | ||
| private final IDatabricksConnectionContext connectionContext; | ||
| private TProtocolVersion serverProtocolVersion = JDBC_THRIFT_VERSION; | ||
| private ThreadLocal<TCLIService.Client> FAKE_SHARED_CLIENT; |
There was a problem hiding this comment.
nit: don't use caps for non final variables
| if (!DriverUtil.isRunningAgainstFake()) { | ||
| // Create a new thrift client for each thread as client state is not thread safe. Note that | ||
| // the underlying protocol uses the same http client which is thread safe | ||
| this.thriftClient = |
There was a problem hiding this comment.
we don't need this logic any more?
There was a problem hiding this comment.
Samikshya explained to me offline, @jayantsing-db can you also take a look as you worked on this previously.
| } | ||
|
|
||
| byte[] requestPayload; | ||
| synchronized (requestBuffer) { |
There was a problem hiding this comment.
If i am understanding correctly, a JDBC connection will now have a single thrift client. So, when a JDBC connection is used in different threads concurrently, it will use the same thrift client. Now, if we don't put a sync mechanism, different threads using the same thrift client will fail because:
- thread A can increment the seq number of thrift client from x to x + 1
- thread B can receive the response for the seq number x. thread B will see that seq num is not matching and will throw exception:
out of sequence response
This section of the code with all good intent and purposes tries to put a sync mechanism in place. But I am afraid, this might not be sufficient.
So the thrift calls happen in this order sequentially for a thread:
TCLIService.Client.ExecuteStatement -> seqid ++ -> transport.write -> transport.flush -> <wait for the http response> -> transport.read (when reading it checks the seqid with that present in message)
The current sync mechanism only protects transport.write and transport.read. And things could fail in exciting ways because the whole chain is not isolated/protected.
Happy to chat if this does not make sense at all.
There was a problem hiding this comment.
Ideally, JDBC is a sync protocol and client app is not advised to use the same connection in different threads but there is no strict enforcement. Also i wrote a basic test on top of these changes that uses a JDBC connection across many threads and executes statements. I didn't encounter any errors (but i believe like other threading issues that's by chance that threads didn't step over each other)
There was a problem hiding this comment.
For this change, my goal was specifically to address the getFunctions issue raised in this repo. I may not be fully following the broader concern you mentioned — could you please share a concrete test case where the current fix fails? That would really help me understand it better.
|
I see that Samikshya has included a reproducer and these changes fix the reproducer while the earlier thread-local changes were failing. Do we have a understanding of why thread-local changes were failing and these are passing (in case there is any misunderstanding)? I can try to run the reproducer with earlier changes and these changes to understand more. |
|
The failure path shows that the getFunctions call is constructing a TGetFunctionsReq with functionName:null -> and the server rejects it (Required field 'functionName' was not present!) -> This leaves the TCLIService.Client in a bad state, which then leaks into the subsequent ExecuteStatement call — hence the req$9 is null error showing up there as well. The earlier thread-local changes didn’t help because they only isolated clients at the thread level, but the same TCLIService.Client object was still being reused within a connection and carrying over the bad state. With this PR, the client is properly reset between requests, so the reproducer consistently passes. |
|
Hello Gopal, Jayant ! |
Description
Testing
Additional Notes to the Reviewer