servlet: fix TomcatTransportTest detects write when not ready#12732
servlet: fix TomcatTransportTest detects write when not ready#12732themechbro wants to merge 2 commits intogrpc:masterfrom
Conversation
97ef4d7 to
9bf9ad1
Compare
d3a69ea to
b2cde71
Compare
I don't see how the kernel's handling of the TCP window could impact Tomcat here. And even if that is true, if |
You're right, Eric. Attributing the internal buffer state directly to TCP window shrinking was speculative on my part. I'll remove that from the description. The core issue I’m observing is a stale cache problem. Our By adding the live |
That's what it means to reduce the race window. The race is still there, but harder to hit.
The question is, "what is invalidating the cache?" If gRPC is doing a write, then that would invalidate the cache. So is gRPC not updating the cache properly in that case? Is there a race causing the cache to be wrong? But if it is "isReady() returned Looking at Tomcat, it seems it is buggy. It checks the connection window as part of isReady(), and other threads can definitely impact that. So it is impossible for us to fix this exception. (We could only reduce the race window.) (Actually, the window size calculation weird, because positive sizes seem to indicate available window. So a non-positive window seems it would mean "not ready." I think either dataLeft or the isWaiting checks may be carrying a lot of load.) |
Regarding the cache, gRPC updates Without this additional live |
That is invalid in the servlet API. That is a Tomcat bug. It is inherently racy. |
Agreed. It's a clear spec violation. Since it's a bug in a major container, should gRPC provide this mitigation for user resilience, or should we keep the implementation strictly spec-compliant? |
|
At the very least there should be a bug filed for Tomcat before we make a workaround. Given this appears to be a long-standing issue (based on the Tomcat code), I'm not really sure why there'd be urgency to workaround this in gRPC instead of waiting for Tomcat to get fixed. I found this in our tests, but apparently users aren't noticing. If we found there was some trigger to make this start happening suddenly, then the thought process would be different. If we do workaround it, we'd also want to make this workaround clearer/shorter, such that it is easier reason about and we can easily remove it when it is no longer necessary. This current apparently-AI-generated code is simply too complex. I do see it has to handle some ugly cases (so it isn't necessarily AI's fault), but it is simply too complex to maintain right now. |
Understood |
Fixes #12723
What was happening
In
AsyncServletOutputStreamWriter#runOrBuffer, the application thread relies on the cachedreadyAndDrainedstate to determine if it can write directly to theServletOutputStream.In highly concurrent scenarios (observed in
TomcatTransportTest), this cached state can become stale. The servlet container may have already transitioned to a 'not ready' state, but the corresponding callback has not yet updated gRPC's internal state. When the application thread attempts to write based on the staletruevalue, the container throws anIllegalStateException.The Fix
This PR updates the primary execution path in
runOrBufferto explicitly evaluateisReady.getAsBoolean()alongside the cached state.If the cached state is true but the container is actually not ready, the thread gracefully drops into the
elseblock, buffers the action into thewriteChain, and attempts to updatereadyAndDrainedtofalse.Handling Concurrent CAS Failures
By forcing threads into the
elseblock whenisReady()evaluates to false, we introduce a safe race condition: multiple threads might try to executewriteState.compareAndSet(true, false)simultaneously.To prevent the losing threads from crashing on the subsequent
checkState, an explicit check for the initial state (if (curState.readyAndDrained)) was added inside the CAS failure block. If a thread started astruebut failed the CAS, it simply means another concurrent thread already safely toggled the state tofalse. The thread performs a safe no-op and exits, allowing Tomcat'sonWritePossible()to eventually drain the queue.Passed local testing via:
./gradlew :grpc-servlet:tomcat9Test --tests "*TomcatTransportTest*" -PskipAndroid=true -PskipCodegen=true