Skip to content

Commit c85d48c

Browse files
jeet1995Copilot
andcommitted
fix(cosmos): HTTP/2 PING handler must observe child-stream read activity, not just parent-pipeline frames
Follow-up to #49095. Root cause: Http2PingHandler is installed on the parent (TCP) channel and only updated lastReadActivityNanos from parent-pipeline channelRead -- which sees PING ACK / SETTINGS / GOAWAY but NOT HEADERS / DATA frames. With reactor-netty's Http2MultiplexHandler in the parent pipeline, all application H2 traffic is demuxed onto per-stream child channels, so a connection serving thousands of QPS still looked idle to the PING handler, causing unnecessary PINGs and measurable P95 overhead under load. Fix: - New package-private AttributeKey<AtomicLong> LAST_CHILD_STREAM_READ_ACTIVITY_NANOS seeded on the parent channel in handlerAdded. - Http2PingCloseRewrapHandler (already installed in every child stream's pipeline) overrides channelReadComplete to stamp the parent attribute via accumulateAndGet(now, Math::max) -- once per read cycle, monotonic. - Http2PingHandler.maybeSendPing now uses effectiveLastReadActivityNanos = max(field, child attr) for both the idle-threshold check AND the outstanding-PING early-return: if child-stream reads arrived after an outstanding PING, clear pingOutstandingSinceNanos and reset consecutiveFailures (defense against middleboxes that drop PING-ACK but pass application H2 frames). - Names use lastReadActivity / read activity -- both signals are inbound reads, never writes. Tests: two new unit tests in Http2PingHandlerTest covering the suppression path and the outstanding-PING reset path; both bypass real time via reflection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent c46043e commit c85d48c

4 files changed

Lines changed: 254 additions & 15 deletions

File tree

sdk/cosmos/azure-cosmos-tests/src/test/java/com/azure/cosmos/implementation/http/Http2PingHandlerTest.java

Lines changed: 98 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -16,6 +16,7 @@
1616
import java.lang.reflect.Method;
1717
import java.util.concurrent.ScheduledFuture;
1818
import java.util.concurrent.TimeUnit;
19+
import java.util.concurrent.atomic.AtomicLong;
1920

2021
import static org.assertj.core.api.Assertions.assertThat;
2122

@@ -25,7 +26,7 @@
2526
* Covers state transitions that do not require advancing real time -- the handler's
2627
* timeout / threshold logic reads {@code System.nanoTime()}, so the tests sidestep
2728
* the clock by pre-setting {@code pingOutstandingSinceNanos} (or
28-
* {@code lastActivityNanos}) via reflection:
29+
* {@code lastReadActivityNanos}) via reflection:
2930
* <ul>
3031
* <li>PING ACK with matching payload resets the failure counter (RFC 9113 §6.7 payload echo).</li>
3132
* <li>PING ACK with mismatched payload is ignored (late ACK after timeout cannot mask degradation).</li>
@@ -212,7 +213,7 @@ public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise)
212213

213214
// First tick: force idle -> PING send attempt -> write fails.
214215
// Expect consecutiveFailures=1, channel still open, task still scheduled.
215-
setField(handler, "lastActivityNanos", longAgo);
216+
setField(handler, "lastReadActivityNanos", longAgo);
216217
invokeMaybeSendPing(handler, ctx);
217218
channel.runPendingTasks(); // run the write + listener on the embedded event loop
218219

@@ -222,7 +223,7 @@ public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise)
222223
assertThat((Object) getField(handler, "pingTask")).isNotNull();
223224

224225
// Second tick: write fails again -> threshold reached -> task cancelled, channel closed.
225-
setField(handler, "lastActivityNanos", longAgo);
226+
setField(handler, "lastReadActivityNanos", longAgo);
226227
invokeMaybeSendPing(handler, ctx);
227228
channel.runPendingTasks();
228229

@@ -274,6 +275,100 @@ public void ackTimeout_incrementsConsecutiveFailuresAndClosesAtThreshold() throw
274275
channel.finishAndReleaseAll();
275276
}
276277

278+
@Test(groups = "unit")
279+
public void childStreamReadActivity_suppressesIdlePing() throws Exception {
280+
// Regression test for the v4.81.0-beta.1 fix: with parent-pipeline reads
281+
// stale (would breach the idle threshold) but child-stream reads recent
282+
// (via the LAST_CHILD_STREAM_READ_ACTIVITY_NANOS attribute updated by
283+
// Http2PingCloseRewrapHandler.channelReadComplete on each H2 child stream),
284+
// the handler must NOT send a PING.
285+
//
286+
// Without the fix (Http2PingHandler reading only lastReadActivityNanos), a
287+
// saturated connection serving thousands of QPS via H2 child streams would
288+
// still look "idle" to the handler -- because HEADERS/DATA frames are
289+
// demuxed by Http2MultiplexHandler and never surface on the parent's
290+
// channelRead -- and PINGs would fire on every interval, adding measurable
291+
// latency overhead under load (the symptom that motivated this PR).
292+
Http2PingHandler handler = new Http2PingHandler(1, 1, 3);
293+
EmbeddedChannel channel = new EmbeddedChannel(handler);
294+
ChannelHandlerContext ctx = channel.pipeline().context(handler);
295+
296+
// Parent-pipeline read-timestamp is stale (60s ago).
297+
long longAgo = System.nanoTime() - TimeUnit.SECONDS.toNanos(60);
298+
setField(handler, "lastReadActivityNanos", longAgo);
299+
300+
// Child-stream read-activity attribute is fresh: simulate Http2PingCloseRewrapHandler
301+
// having just stamped the parent attribute after demuxing an inbound HEADERS/DATA batch.
302+
AtomicLong childReadActivity = channel.attr(Http2PingHandler.LAST_CHILD_STREAM_READ_ACTIVITY_NANOS).get();
303+
assertThat(childReadActivity).as("handlerAdded should have seeded the child-read-activity attribute").isNotNull();
304+
childReadActivity.set(System.nanoTime());
305+
306+
invokeMaybeSendPing(handler, ctx);
307+
channel.runPendingTasks();
308+
309+
// PING must NOT have fired: effectiveLastReadActivityNanos = max(stale, fresh) = fresh.
310+
assertThat((int) getField(handler, "pingsSent"))
311+
.as("recent child-stream reads must suppress idle-PING")
312+
.isZero();
313+
assertThat((long) getField(handler, "pingOutstandingSinceNanos")).isZero();
314+
assertThat(channel.isOpen()).isTrue();
315+
316+
channel.finishAndReleaseAll();
317+
}
318+
319+
@Test(groups = "unit")
320+
public void childStreamReadActivityAfterOutstandingPing_resetsFailureState() throws Exception {
321+
// Regression test for the v4.81.0-beta.1 fix: when a PING is outstanding
322+
// (pingOutstandingSinceNanos != 0) AND child-stream reads flow AFTER the
323+
// PING was sent, the handler must clear the outstanding state and reset
324+
// consecutiveFailures -- the connection is demonstrably healthy (H2 codec is
325+
// still delivering application data), even though the PING ACK itself was
326+
// never observed on the parent pipeline.
327+
//
328+
// Without the fix, a middlebox that drops PING-ACK frames (but lets
329+
// application H2 frames through) would accumulate consecutiveFailures up to
330+
// the threshold and close a healthy, actively-serving connection.
331+
Http2PingHandler handler = new Http2PingHandler(1, 1, 3);
332+
EmbeddedChannel channel = new EmbeddedChannel(handler);
333+
ChannelHandlerContext ctx = channel.pipeline().context(handler);
334+
335+
// PING sent 5s ago is still outstanding; 2 prior consecutive failures recorded.
336+
long pingSentAt = System.nanoTime() - TimeUnit.SECONDS.toNanos(5);
337+
setField(handler, "pingsSent", 7);
338+
setField(handler, "pingOutstandingSinceNanos", pingSentAt);
339+
setField(handler, "consecutiveFailures", 2);
340+
341+
// Parent-pipeline read-timestamp is even OLDER than the outstanding PING so
342+
// it cannot itself trigger the early-return -- proving the reset is driven
343+
// by the child-stream attribute specifically.
344+
long parentLastRead = System.nanoTime() - TimeUnit.SECONDS.toNanos(10);
345+
setField(handler, "lastReadActivityNanos", parentLastRead);
346+
347+
// Child-stream read-activity stamped AFTER the PING was sent (Http2PingCloseRewrapHandler
348+
// saw inbound HEADERS/DATA on a child stream just now).
349+
AtomicLong childReadActivity = channel.attr(Http2PingHandler.LAST_CHILD_STREAM_READ_ACTIVITY_NANOS).get();
350+
assertThat(childReadActivity).isNotNull();
351+
childReadActivity.set(System.nanoTime());
352+
353+
invokeMaybeSendPing(handler, ctx);
354+
channel.runPendingTasks();
355+
356+
// Early-return cleared outstanding state and the failure counter; no new
357+
// PING was sent; channel is still healthy and open.
358+
assertThat((long) getField(handler, "pingOutstandingSinceNanos"))
359+
.as("child-stream reads after outstanding PING must clear pingOutstandingSinceNanos")
360+
.isZero();
361+
assertThat((int) getField(handler, "consecutiveFailures"))
362+
.as("child-stream reads after outstanding PING must reset consecutiveFailures")
363+
.isZero();
364+
assertThat((int) getField(handler, "pingsSent"))
365+
.as("early-return must NOT send a new PING")
366+
.isEqualTo(7);
367+
assertThat(channel.isOpen()).isTrue();
368+
369+
channel.finishAndReleaseAll();
370+
}
371+
277372
private static void invokeMaybeSendPing(Http2PingHandler handler, ChannelHandlerContext ctx) throws Exception {
278373
Method m = Http2PingHandler.class.getDeclaredMethod("maybeSendPing", ChannelHandlerContext.class);
279374
m.setAccessible(true);

sdk/cosmos/azure-cosmos/CHANGELOG.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@
1212
#### Bugs Fixed
1313
* Fixed region name normalization for preferred and excluded regions — non-canonical inputs (e.g., `"westus3"`, `"WEST US 3"`) are now mapped to the canonical form. Also fixed a case-sensitive exclude-region check in PPCB reevaluate logic. - See [PR 49090](https://github.com/Azure/azure-sdk-for-java/pull/49090)
1414
* Fixed `UnsupportedOperationException` when using `readManyByPartitionKeys` for empty pages. - See [PR 49311](https://github.com/Azure/azure-sdk-for-java/pull/49311)
15+
* Fixed the HTTP/2 PING keepalive handler so it observes actual HTTP/2 stream traffic (HEADERS / DATA) when deciding whether a connection is idle. The original handler (see [PR 49095](https://github.com/Azure/azure-sdk-for-java/pull/49095)) tracked only parent-pipeline frames (PING ACK, SETTINGS, GOAWAY) plus its own PING sends, because `Http2MultiplexHandler` routes stream frames to child channels. As a result, connections that were continuously serving thousands of requests per second appeared idle and were PINGed on every interval, adding measurable latency overhead under load. The handler now observes inbound child-stream activity via a parent-channel attribute updated by `Http2PingCloseRewrapHandler.channelReadComplete`, and an in-flight PING is cleared (rather than charged toward the close threshold) when child-stream traffic flows after the PING was sent — preventing spurious closes on healthy connections where PING-specific frames are dropped by a middlebox.
1516

1617
#### Other Changes
1718
* Added HTTP/2 PING keepalive (default ON) for Gateway service endpoints to detect silently-broken connections. - See [PR 49095](https://github.com/Azure/azure-sdk-for-java/pull/49095)

sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/Http2PingCloseRewrapHandler.java

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,9 @@
66
import io.netty.channel.ChannelHandler;
77
import io.netty.channel.ChannelHandlerContext;
88
import io.netty.channel.ChannelInboundHandlerAdapter;
9+
import io.netty.handler.codec.http2.Http2MultiplexHandler;
10+
11+
import java.util.concurrent.atomic.AtomicLong;
912

1013
/**
1114
* Per-request HTTP/2 child-stream handler that translates a parent-TCP-channel close
@@ -40,6 +43,46 @@ final class Http2PingCloseRewrapHandler extends ChannelInboundHandlerAdapter {
4043

4144
private Http2PingCloseRewrapHandler() {}
4245

46+
/**
47+
* Records inbound H2 stream reads on the parent channel so the PING handler
48+
* can recognize a busy connection as non-idle and skip the PING send.
49+
* <p>
50+
* {@code channelReadComplete} fires at the end of each inbound read cycle on
51+
* the child channel -- so this method stamps the parent attribute exactly once
52+
* per batch of HEADERS/DATA frames demuxed by {@link Http2MultiplexHandler},
53+
* not once per frame. Under normal load this is plenty of granularity to
54+
* suppress unnecessary PINGs without becoming a hot-path allocator.
55+
* <p>
56+
* Edge cases:
57+
* <ul>
58+
* <li><b>Backpressure / autoRead off</b>: {@link Http2MultiplexHandler} may
59+
* buffer frames in the child's inbound queue and defer dispatch. In that
60+
* case {@code channelReadComplete} may fire later than the wire-arrival
61+
* time, but it WILL fire when the read cycle ends. Worst case: one benign
62+
* extra PING during the brief buffering window.</li>
63+
* <li><b>Non-H2 channels</b>: this handler is only installed on H2 child
64+
* streams (where {@code ch.parent() != null}), so the parent reference
65+
* above is non-null in normal operation. Defensive null-check kept for
66+
* safety if the install topology ever changes.</li>
67+
* </ul>
68+
* Uses {@code accumulateAndGet(now, Math::max)} so concurrent updates from
69+
* sibling child streams remain monotonic (in practice child streams sharing a
70+
* parent run on the same event loop, so contention is nil; the monotonic
71+
* semantics document intent and future-proof against multi-event-loop
72+
* scenarios).
73+
*/
74+
@Override
75+
public void channelReadComplete(ChannelHandlerContext ctx) throws Exception {
76+
Channel parent = ctx.channel().parent();
77+
if (parent != null) {
78+
AtomicLong holder = parent.attr(Http2PingHandler.LAST_CHILD_STREAM_READ_ACTIVITY_NANOS).get();
79+
if (holder != null) {
80+
holder.accumulateAndGet(System.nanoTime(), Math::max);
81+
}
82+
}
83+
super.channelReadComplete(ctx);
84+
}
85+
4386
@Override
4487
public void channelInactive(ChannelHandlerContext ctx) throws Exception {
4588
Channel parent = ctx.channel().parent();

0 commit comments

Comments
 (0)