servlet: fix TomcatTransportTest detects write when not ready by themechbro · Pull Request #12732 · grpc/grpc-java

themechbro · 2026-03-26T11:23:22Z

What was happening

In AsyncServletOutputStreamWriter#runOrBuffer, the application thread relies on the cached readyAndDrained state to determine if it can write directly to the ServletOutputStream.

In highly concurrent scenarios (observed in TomcatTransportTest), this cached state can become stale. The servlet container may have already transitioned to a 'not ready' state, but the corresponding callback has not yet updated gRPC's internal state. When the application thread attempts to write based on the stale true value, the container throws an IllegalStateException.

The Fix

This PR updates the primary execution path in runOrBuffer to explicitly evaluate isReady.getAsBoolean() alongside the cached state.

If the cached state is true but the container is actually not ready, the thread gracefully drops into the else block, buffers the action into the writeChain, and attempts to update readyAndDrained to false.

Handling Concurrent CAS Failures

By forcing threads into the else block when isReady() evaluates to false, we introduce a safe race condition: multiple threads might try to execute writeState.compareAndSet(true, false) simultaneously.

To prevent the losing threads from crashing on the subsequent checkState, an explicit check for the initial state (if (curState.readyAndDrained)) was added inside the CAS failure block. If a thread started as true but failed the CAS, it simply means another concurrent thread already safely toggled the state to false. The thread performs a safe no-op and exits, allowing Tomcat's onWritePossible() to eventually drain the queue.

Passed local testing via:
./gradlew :grpc-servlet:tomcat9Test --tests "*TomcatTransportTest*" -PskipAndroid=true -PskipCodegen=true

ejona86 · 2026-04-08T22:06:19Z

Under certain asynchronous conditions (e.g., TCP window shrinking), Tomcat's internal buffer would fill up, but readyAndDrained was still true

I don't see how the kernel's handling of the TCP window could impact Tomcat here. And even if that is true, if ServletOutputStream.isReady() has returned true and we haven't done a write, then Tomcat should not be throwing a IllegalStateException. Otherwise that IllegalStateException would be inherently racy and would be impossible to avoid. Adding an additional ready check would reduce the race window, but wouldn't prevent the race.

themechbro · 2026-04-09T09:18:54Z

Under certain asynchronous conditions (e.g., TCP window shrinking), Tomcat's internal buffer would fill up, but readyAndDrained was still true

I don't see how the kernel's handling of the TCP window could impact Tomcat here. And even if that is true, if ServletOutputStream.isReady() has returned true and we haven't done a write, then Tomcat should not be throwing a IllegalStateException. Otherwise that IllegalStateException would be inherently racy and would be impossible to avoid. Adding an additional ready check would reduce the race window, but wouldn't prevent the race.

You're right, Eric. Attributing the internal buffer state directly to TCP window shrinking was speculative on my part. I'll remove that from the description.

The core issue I’m observing is a stale cache problem. Our readyAndDrained bit is a cached state updated by container callbacks. In the failing TomcatTransportTest, the container has already transitioned to a 'not ready' state, but the gRPC application thread enters runOrBuffer before the next container callback has had a chance to flip our cached bit.

By adding the live isReady() check, we aren't just 'reducing a race window'—we are validating our stale cached state against the actual container state. While a theoretical TOCTOU race still exists between the check and the write, this live check successfully catches the cases where the container was already not ready before we even started the write operation.

ejona86 · 2026-04-09T16:39:36Z

While a theoretical TOCTOU race still exists between the check and the write, this live check successfully catches the cases where the container was already not ready before we even started the write operation.

That's what it means to reduce the race window. The race is still there, but harder to hit.

The core issue I’m observing is a stale cache problem.

The question is, "what is invalidating the cache?" If gRPC is doing a write, then that would invalidate the cache. So is gRPC not updating the cache properly in that case? Is there a race causing the cache to be wrong?

But if it is "isReady() returned true, but at a future time it returns false, without any writes in-between" then that is a Tomcat bug. The API does not allow for that, as it is inherently racy.

Looking at Tomcat, it seems it is buggy. It checks the connection window as part of isReady(), and other threads can definitely impact that. So it is impossible for us to fix this exception. (We could only reduce the race window.)

(Actually, the window size calculation weird, because positive sizes seem to indicate available window. So a non-positive window seems it would mean "not ready." I think either dataLeft or the isWaiting checks may be carrying a lot of load.)

themechbro · 2026-04-09T16:54:47Z

The question is, "what is invalidating the cache?" If gRPC is doing a write, then that would invalidate the cache. So is gRPC not updating the cache properly in that case? Is there a race causing the cache to be wrong?

Regarding the cache, gRPC updates readyAndDrained to false immediately after a write if isReady() returns false. However, since Tomcat can transition the stream to 'not ready' due to connection window changes influenced by other threads (independent of gRPC's write actions), our cached readyAndDrained bit remains true until the next time the application thread enters runOrBuffer.

Without this additional live isReady() check, the application thread proceeds with a write based on that stale true bit, triggering the IllegalStateException. Since the root cause is the container's violation of the expected Servlet API contract, this live check serves as a necessary defensive measure to catch that spontaneous state change and mitigate the race.

ejona86 · 2026-04-09T16:56:22Z

However, since Tomcat can transition the stream to 'not ready' due to connection window changes influenced by other threads (independent of gRPC's write actions)

That is invalid in the servlet API. That is a Tomcat bug. It is inherently racy.

themechbro · 2026-04-09T16:59:23Z

That is invalid in the servlet API. That is a Tomcat bug. It is inherently racy.

Agreed. It's a clear spec violation.

Since it's a bug in a major container, should gRPC provide this mitigation for user resilience, or should we keep the implementation strictly spec-compliant?

ejona86 · 2026-04-09T17:27:12Z

At the very least there should be a bug filed for Tomcat before we make a workaround. Given this appears to be a long-standing issue (based on the Tomcat code), I'm not really sure why there'd be urgency to workaround this in gRPC instead of waiting for Tomcat to get fixed. I found this in our tests, but apparently users aren't noticing. If we found there was some trigger to make this start happening suddenly, then the thought process would be different.

If we do workaround it, we'd also want to make this workaround clearer/shorter, such that it is easier reason about and we can easily remove it when it is no longer necessary. This current apparently-AI-generated code is simply too complex. I do see it has to handle some ugly cases (so it isn't necessarily AI's fault), but it is simply too complex to maintain right now.

themechbro · 2026-04-09T17:36:57Z

At the very least there should be a bug filed for Tomcat before we make a workaround. Given this appears to be a long-standing issue (based on the Tomcat code), I'm not really sure why there'd be urgency to workaround this in gRPC instead of waiting for Tomcat to get fixed. I found this in our tests, but apparently users aren't noticing. If we found there was some trigger to make this start happening suddenly, then the thought process would be different.

If we do workaround it, we'd also want to make this workaround clearer/shorter, such that it is easier reason about and we can easily remove it when it is no longer necessary. This current apparently-AI-generated code is simply too complex. I do see it has to handle some ugly cases (so it isn't necessarily AI's fault), but it is simply too complex to maintain right now.

Understood

servlet: fix TomcatTransportTest detects write when not ready

9bf9ad1

themechbro force-pushed the fix/tomcat-async-write-ready branch from 97ef4d7 to 9bf9ad1 Compare March 26, 2026 12:06

chore: trigger CI pipeline

b2cde71

themechbro force-pushed the fix/tomcat-async-write-ready branch from d3a69ea to b2cde71 Compare March 28, 2026 17:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

servlet: fix TomcatTransportTest detects write when not ready#12732

servlet: fix TomcatTransportTest detects write when not ready#12732
themechbro wants to merge 2 commits intogrpc:masterfrom
themechbro:fix/tomcat-async-write-ready

themechbro commented Mar 26, 2026 •

edited

Loading

Uh oh!

ejona86 commented Apr 8, 2026

Uh oh!

themechbro commented Apr 9, 2026

Uh oh!

ejona86 commented Apr 9, 2026 •

edited

Loading

Uh oh!

themechbro commented Apr 9, 2026

Uh oh!

ejona86 commented Apr 9, 2026

Uh oh!

themechbro commented Apr 9, 2026

Uh oh!

ejona86 commented Apr 9, 2026

Uh oh!

themechbro commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

themechbro commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What was happening

The Fix

Handling Concurrent CAS Failures

Uh oh!

ejona86 commented Apr 8, 2026

Uh oh!

themechbro commented Apr 9, 2026

Uh oh!

ejona86 commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

themechbro commented Apr 9, 2026

Uh oh!

ejona86 commented Apr 9, 2026

Uh oh!

themechbro commented Apr 9, 2026

Uh oh!

ejona86 commented Apr 9, 2026

Uh oh!

themechbro commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

themechbro commented Mar 26, 2026 •

edited

Loading

ejona86 commented Apr 9, 2026 •

edited

Loading