Skip to content

servlet: fix TomcatTransportTest detects write when not ready#12732

Open
themechbro wants to merge 2 commits intogrpc:masterfrom
themechbro:fix/tomcat-async-write-ready
Open

servlet: fix TomcatTransportTest detects write when not ready#12732
themechbro wants to merge 2 commits intogrpc:masterfrom
themechbro:fix/tomcat-async-write-ready

Conversation

@themechbro
Copy link
Copy Markdown
Contributor

@themechbro themechbro commented Mar 26, 2026

Fixes #12723

What was happening

In AsyncServletOutputStreamWriter#runOrBuffer, the application thread relies on the cached readyAndDrained state to determine if it can write directly to the ServletOutputStream.

In highly concurrent scenarios (observed in TomcatTransportTest), this cached state can become stale. The servlet container may have already transitioned to a 'not ready' state, but the corresponding callback has not yet updated gRPC's internal state. When the application thread attempts to write based on the stale true value, the container throws an IllegalStateException.

The Fix

This PR updates the primary execution path in runOrBuffer to explicitly evaluate isReady.getAsBoolean() alongside the cached state.

If the cached state is true but the container is actually not ready, the thread gracefully drops into the else block, buffers the action into the writeChain, and attempts to update readyAndDrained to false.

Handling Concurrent CAS Failures

By forcing threads into the else block when isReady() evaluates to false, we introduce a safe race condition: multiple threads might try to execute writeState.compareAndSet(true, false) simultaneously.

To prevent the losing threads from crashing on the subsequent checkState, an explicit check for the initial state (if (curState.readyAndDrained)) was added inside the CAS failure block. If a thread started as true but failed the CAS, it simply means another concurrent thread already safely toggled the state to false. The thread performs a safe no-op and exits, allowing Tomcat's onWritePossible() to eventually drain the queue.

Passed local testing via:
./gradlew :grpc-servlet:tomcat9Test --tests "*TomcatTransportTest*" -PskipAndroid=true -PskipCodegen=true

@themechbro themechbro force-pushed the fix/tomcat-async-write-ready branch from 97ef4d7 to 9bf9ad1 Compare March 26, 2026 12:06
@themechbro themechbro force-pushed the fix/tomcat-async-write-ready branch from d3a69ea to b2cde71 Compare March 28, 2026 17:44
@ejona86
Copy link
Copy Markdown
Member

ejona86 commented Apr 8, 2026

Under certain asynchronous conditions (e.g., TCP window shrinking), Tomcat's internal buffer would fill up, but readyAndDrained was still true

I don't see how the kernel's handling of the TCP window could impact Tomcat here. And even if that is true, if ServletOutputStream.isReady() has returned true and we haven't done a write, then Tomcat should not be throwing a IllegalStateException. Otherwise that IllegalStateException would be inherently racy and would be impossible to avoid. Adding an additional ready check would reduce the race window, but wouldn't prevent the race.

@themechbro
Copy link
Copy Markdown
Contributor Author

Under certain asynchronous conditions (e.g., TCP window shrinking), Tomcat's internal buffer would fill up, but readyAndDrained was still true

I don't see how the kernel's handling of the TCP window could impact Tomcat here. And even if that is true, if ServletOutputStream.isReady() has returned true and we haven't done a write, then Tomcat should not be throwing a IllegalStateException. Otherwise that IllegalStateException would be inherently racy and would be impossible to avoid. Adding an additional ready check would reduce the race window, but wouldn't prevent the race.

You're right, Eric. Attributing the internal buffer state directly to TCP window shrinking was speculative on my part. I'll remove that from the description.

The core issue I’m observing is a stale cache problem. Our readyAndDrained bit is a cached state updated by container callbacks. In the failing TomcatTransportTest, the container has already transitioned to a 'not ready' state, but the gRPC application thread enters runOrBuffer before the next container callback has had a chance to flip our cached bit.

By adding the live isReady() check, we aren't just 'reducing a race window'—we are validating our stale cached state against the actual container state. While a theoretical TOCTOU race still exists between the check and the write, this live check successfully catches the cases where the container was already not ready before we even started the write operation.

@ejona86
Copy link
Copy Markdown
Member

ejona86 commented Apr 9, 2026

While a theoretical TOCTOU race still exists between the check and the write, this live check successfully catches the cases where the container was already not ready before we even started the write operation.

That's what it means to reduce the race window. The race is still there, but harder to hit.

The core issue I’m observing is a stale cache problem.

The question is, "what is invalidating the cache?" If gRPC is doing a write, then that would invalidate the cache. So is gRPC not updating the cache properly in that case? Is there a race causing the cache to be wrong?

But if it is "isReady() returned true, but at a future time it returns false, without any writes in-between" then that is a Tomcat bug. The API does not allow for that, as it is inherently racy.

Looking at Tomcat, it seems it is buggy. It checks the connection window as part of isReady(), and other threads can definitely impact that. So it is impossible for us to fix this exception. (We could only reduce the race window.)

(Actually, the window size calculation weird, because positive sizes seem to indicate available window. So a non-positive window seems it would mean "not ready." I think either dataLeft or the isWaiting checks may be carrying a lot of load.)

@themechbro
Copy link
Copy Markdown
Contributor Author

The question is, "what is invalidating the cache?" If gRPC is doing a write, then that would invalidate the cache. So is gRPC not updating the cache properly in that case? Is there a race causing the cache to be wrong?

Regarding the cache, gRPC updates readyAndDrained to false immediately after a write if isReady() returns false. However, since Tomcat can transition the stream to 'not ready' due to connection window changes influenced by other threads (independent of gRPC's write actions), our cached readyAndDrained bit remains true until the next time the application thread enters runOrBuffer.

Without this additional live isReady() check, the application thread proceeds with a write based on that stale true bit, triggering the IllegalStateException. Since the root cause is the container's violation of the expected Servlet API contract, this live check serves as a necessary defensive measure to catch that spontaneous state change and mitigate the race.

@ejona86
Copy link
Copy Markdown
Member

ejona86 commented Apr 9, 2026

However, since Tomcat can transition the stream to 'not ready' due to connection window changes influenced by other threads (independent of gRPC's write actions)

That is invalid in the servlet API. That is a Tomcat bug. It is inherently racy.

@themechbro
Copy link
Copy Markdown
Contributor Author

That is invalid in the servlet API. That is a Tomcat bug. It is inherently racy.

Agreed. It's a clear spec violation.

Since it's a bug in a major container, should gRPC provide this mitigation for user resilience, or should we keep the implementation strictly spec-compliant?

@ejona86
Copy link
Copy Markdown
Member

ejona86 commented Apr 9, 2026

At the very least there should be a bug filed for Tomcat before we make a workaround. Given this appears to be a long-standing issue (based on the Tomcat code), I'm not really sure why there'd be urgency to workaround this in gRPC instead of waiting for Tomcat to get fixed. I found this in our tests, but apparently users aren't noticing. If we found there was some trigger to make this start happening suddenly, then the thought process would be different.

If we do workaround it, we'd also want to make this workaround clearer/shorter, such that it is easier reason about and we can easily remove it when it is no longer necessary. This current apparently-AI-generated code is simply too complex. I do see it has to handle some ugly cases (so it isn't necessarily AI's fault), but it is simply too complex to maintain right now.

@themechbro
Copy link
Copy Markdown
Contributor Author

At the very least there should be a bug filed for Tomcat before we make a workaround. Given this appears to be a long-standing issue (based on the Tomcat code), I'm not really sure why there'd be urgency to workaround this in gRPC instead of waiting for Tomcat to get fixed. I found this in our tests, but apparently users aren't noticing. If we found there was some trigger to make this start happening suddenly, then the thought process would be different.

If we do workaround it, we'd also want to make this workaround clearer/shorter, such that it is easier reason about and we can easily remove it when it is no longer necessary. This current apparently-AI-generated code is simply too complex. I do see it has to handle some ugly cases (so it isn't necessarily AI's fault), but it is simply too complex to maintain right now.

Understood

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

servlet: TomcatTransportTest detects write when not ready

2 participants