Gate HTTP/2 PING failure-based connection close exception rewraps to PING enablement flag.#49403
Conversation
…equest The previous .doOnRequest() chained onto the HTTP/2 HttpClient builder wrapped every outgoing request's Mono in a Reactor operator. That paid per-request costs even when the rewrap handler was already installed: * MonoPeekTerminal wrap + per-request subscriber allocation (InternalMonoOperator.subscribe +2.60% in alloc profile) * per-request BiConsumer lambda invocation (Http2Pool\$\...run +1.16% in alloc profile) * per-request channel pipeline String-key walk via pipeline.get(...) Long-run async-profiler diff vs v4.80.0 on D2 attributed the PR's -7.6% QPS / +14.5% mean / +166% tail-max regression primarily to a +10.2% allocation rate -- with the operators above dominating the delta. Move the install into the existing .doOnConnected() block, which already runs once per HTTP/2 child stream (the customHeaderCleaner install right above it relies on the same contract). The rewrap handler is @sharable, so installing once per stream is correct and amortizes the install + pipeline-walk away from the hot path. The ch.parent() != null guard is kept so the install only targets child streams, never the parent TCP channel (which uses Http2ParentChannelExceptionHandler instead). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…velyEnabled
Follow-up to the doOnConnected install move (previous commit on this PR).
The rewrap handler translates Http2MultiplexHandler's per-child-stream
channelInactive into a typed Http2PingTimeoutChannelClosedException when
the PING sender closes the parent TCP channel after consecutive PING-ACK
timeouts. If the PING sender is disabled there is no such signal to
translate, so the install adds per-child-stream pipeline-add cost and an
extra pipeline hop on every inbound frame for zero behavioral benefit.
Empirical motivation: long-run perf matrix on a 2 vCPU D2s_v5 VM
(point-read workload, Gateway+H2, ~98% CPU steady state, baseline =
v4.80.0):
baseline (v4.80.0) : 9010 QPS
ping-prefix (PR before doOnConnected fix) : 8301 QPS (-7.9%)
ping-postfix (doOnConnected install only) : 8170 QPS (-9.3%)
ping-off (postfix + HTTP2_PING_HEALTH=false): 8102 QPS (-10.1%)
Toggling the documented kill switch did not recover baseline because
the rewrap install was unconditional -- the kill switch only suppressed
the PING sender, not the per-child-stream rewrap install. async-profiler
CPU diffs ping-off-vs-baseline and ping-off-vs-postfix were essentially
identical, confirming the rewrap install dominates the residual cost.
Consolidating both PING-sender install and rewrap install under
Http2PingHandler.isPingHealthEffectivelyEnabled makes
HTTP2_PING_HEALTH_ENABLED=false a true revert-to-baseline.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR refines the Cosmos Gateway HTTP/2 PING-health feature’s Reactor Netty integration to reduce steady-state overhead and ensure the documented kill switch (COSMOS.HTTP2_PING_HEALTH_ENABLED=false) truly reverts behavior to the pre-feature baseline.
Changes:
- Move
Http2PingCloseRewrapHandlerinstallation from a per-request hook (.doOnRequest()) to a connection hook (.doOnConnected()), amortizing setup costs away from the per-call hot path. - Gate
Http2PingCloseRewrapHandlerinstallation onHttp2PingHandler.isPingHealthEffectivelyEnabled(http2Cfg)so disabling PING-health suppresses both the PING sender and the rewrap translation handler. - Document the perf regression fix in the Cosmos
CHANGELOG.md.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java | Reworks the handler install site and adds gating so PING-health overhead is avoided when disabled. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Adds an entry describing the HTTP/2 PING-handler perf regression fixes. |
Comments suppressed due to low confidence (1)
sdk/cosmos/azure-cosmos/src/main/java/com/azure/cosmos/implementation/http/ReactorNettyClient.java:231
- Now that the rewrap handler installation has moved from
.doOnRequest()to.doOnConnected(), there are still Javadocs inHttp2PingCloseRewrapHandlerandHttp2PingHandlerthat describe the old.doOnRequest(...)install site. Updating those references would keep the documentation accurate and reduce future confusion when debugging PING-close behavior.
// Install a @Sharable head-of-pipeline rewrap handler on each H2
// child-stream pipeline. When Http2PingHandler closes the parent
// (TCP) channel after consecutive PING-ACK timeouts or PING-send
// failures, the H2 multiplex codec propagates channelInactive to
// every child stream; the rewrap handler intercepts that and fires
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@sdkReviewAgent |
…NFIGURED) The rewrap handler install lived inside .doOnConnected(...), which reactor-netty fires only on the H2 parent TCP channel (State.CONFIGURED, parent==null). The install was gated on ch.parent()!=null, so it never ran on the parent and never reached a child stream -- the handler was installed 0 times and PING-timeout parent closes surfaced as bare PrematureCloseException instead of the typed Http2PingTimeoutChannelClosedException. Move the install to .observe((connection, state) -> ...) keyed on HttpClientState.STREAM_CONFIGURED, the per-child-stream lifecycle event. The handler is now added at the head of each H2 child-stream pipeline, where it intercepts channelInactive ahead of HttpClientOperations and rewraps the close into the typed exception so ClientRetryPolicy can suppress region mark-down. Keeps the PING-health gate, ch.parent()!=null defensive check, get(HANDLER_NAME)==null guard, and TOCTOU try/catch. Also fixes stale doc-comments referencing the old install hook. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The handler is a single @sharable instance installed per H2 child stream, not a per-request object. Reword the class-doc summary from 'Per-request' to 'Per-child-stream' for consistency with the rest of the doc. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Extract the inline .observe() install decision into a package-private static installHttp2PingCloseRewrapHandlerIfNeeded(...) so the per-child-stream gate is unit-testable without a live HTTP/2 connection. Add a TestNG unit test (group "unit") that drives the real method via an EmbeddedChannel and proves the disablement path (kill-switch off, non-positive PING interval, HTTP/2 disabled, non-STREAM_CONFIGURED state, parent-less channel) is a true revert-to-baseline that installs nothing -- plus positive install, head-of-pipeline position, idempotency, and state-before-predicate short-circuit ordering. Behavior-preserving refactor; no user-facing change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
/azp run java - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
P95 / P99 — HTTP/2 PING perf (PR vs harness-matched baseline) 45-min D2s_v5, GW HTTP/2, conc=80, CPU pinned ~98% on both. Baseline pins the benchmark pom to
|
|
✅ Review complete (25:17) No new comments — existing review coverage is sufficient. Steps: ✓ context, correctness, cross-sdk, design, history, past-prs, synthesis, test-coverage |
|
"Collection not available for read" failures in com.azure.cosmos.rx.OrderbyDocumentQueryTest.before_OrderbyDocumentQueryTest - collection creation is async - merging this PR and following up with an approach to wait until collections available for read across regions. |
Two corrections to how
Http2PingCloseRewrapHandler(from #49095) is installed.1. Install per H2 child stream via
.observe(STREAM_CONFIGURED). The handler must sit at the head of each child-stream pipeline to rewrap a parent PING-timeout close into the typedHttp2PingTimeoutChannelClosedExceptionas opposed to checking if the handler was installed on a per-request basis.2. Gate the install on
Http2PingHandler.isPingHealthEffectivelyEnabledso disabling PING-health reverts to baseline (no per-stream handler).