@codex review

Codex Review: Didn't find any major issues. 👍

ℹ️ About Codex in GitHub: Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing This PR (8449) and master.

✅ No regressions detected - check the details below

Full Metrics Comparison: FakeDbCommand, HttpMessageHandler

Comparison explanation: Execution-time benchmarks measure the whole time it takes to execute a program and are intended to measure the one-off costs. Cases where the execution-time results for the PR are worse than the latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

Duration charts

FakeDbCommand (.NET Framework 4.8)

```mermaid
gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.8)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8449) - mean (72ms) : 70, 75
    master - mean (74ms) : 70, 78
    section Bailout
    This PR (8449) - mean (79ms) : 75, 84
    master - mean (80ms) : 76, 83
    section CallTarget+Inlining+NGEN
    This PR (8449) - mean (1,082ms) : 1027, 1138
    master - mean (1,088ms) : 1028, 1148
```

FakeDbCommand (.NET Core 3.1)

```mermaid
gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8449) - mean (116ms) : 110, 123
    master - mean (116ms) : 111, 122
    section Bailout
    This PR (8449) - mean (118ms) : 112, 125
    master - mean (116ms) : 111, 121
    section CallTarget+Inlining+NGEN
    This PR (8449) - mean (801ms) : 775, 826
    master - mean (797ms) : 770, 824
```

FakeDbCommand (.NET 6)

```mermaid
gantt
    title Execution time (ms) FakeDbCommand (.NET 6)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8449) - mean (103ms) : 98, 107
    master - mean (102ms) : 98, 106
    section Bailout
    This PR (8449) - mean (105ms) : 99, 111
    master - mean (105ms) : 99, 111
    section CallTarget+Inlining+NGEN
    This PR (8449) - mean (945ms) : 907, 982
    master - mean (943ms) : 909, 978
```

FakeDbCommand (.NET 8)

```mermaid
gantt
    title Execution time (ms) FakeDbCommand (.NET 8)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8449) - mean (100ms) : 97, 103
    master - mean (101ms) : 97, 104
    section Bailout
    This PR (8449) - mean (104ms) : 98, 111
    master - mean (102ms) : 97, 108
    section CallTarget+Inlining+NGEN
    This PR (8449) - mean (832ms) : 779, 885
    master - mean (829ms) : 786, 873
```

HttpMessageHandler (.NET Framework 4.8)

```mermaid
gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.8)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8449) - mean (194ms) : 189, 199
    master - mean (193ms) : 189, 197
    section Bailout
    This PR (8449) - mean (197ms) : 194, 199
    master - mean (195ms) : 193, 197
    section CallTarget+Inlining+NGEN
    This PR (8449) - mean (1,157ms) : 1116, 1198
    master - mean (1,153ms) : 1112, 1194
```

HttpMessageHandler (.NET Core 3.1)

```mermaid
gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8449) - mean (278ms) : 273, 282
    master - mean (276ms) : 271, 281
    section Bailout
    This PR (8449) - mean (280ms) : 273, 288
    master - mean (276ms) : 273, 280
    section CallTarget+Inlining+NGEN
    This PR (8449) - mean (947ms) : 925, 968
    master - mean (933ms) : 912, 954
```

HttpMessageHandler (.NET 6)

```mermaid
gantt
    title Execution time (ms) HttpMessageHandler (.NET 6)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8449) - mean (273ms) : 268, 277
    master - mean (271ms) : 266, 276
    section Bailout
    This PR (8449) - mean (272ms) : 268, 276
    master - mean (270ms) : 267, 273
    section CallTarget+Inlining+NGEN
    This PR (8449) - mean (1,145ms) : 1106, 1185
    master - mean (1,140ms) : 1103, 1177
```

HttpMessageHandler (.NET 8)

```mermaid
gantt
    title Execution time (ms) HttpMessageHandler (.NET 8)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8449) - mean (269ms) : 265, 273
    master - mean (267ms) : 263, 270
    section Bailout
    This PR (8449) - mean (270ms) : 266, 273
    master - mean (267ms) : 264, 271
    section CallTarget+Inlining+NGEN
    This PR (8449) - mean (1,023ms) : 981, 1065
    master - mean (1,028ms) : 978, 1078
```
Benchmarks

Benchmark execution time: 2026-04-21 15:57:12

Comparing candidate commit e80e6ec in PR branch. Found 0 performance improvements and 0 performance regressions! Performance is the same for 27 metrics, 0 unstable metrics, 87 known flaky benchmarks.
Force-pushed b6d7dd5 to 96ecfea
```diff
 <type fullname="System.Net.DnsEndPoint" />
 <type fullname="System.Net.EndPoint" />
 <type fullname="System.Net.HttpStatusCode" />
-<type fullname="System.Net.HttpVersion" />
```
The removal of `System.Net.HttpVersion` appears to be due to the removal of `DefaultRequestVersion = HttpVersion.Version20` and `DefaultVersionPolicy = HttpVersionPolicy.RequestVersionOrHigher`.
```diff
-    DefaultRequestVersion = HttpVersion.Version20,
-    DefaultVersionPolicy = HttpVersionPolicy.RequestVersionOrHigher
 };
 return new HttpClient(handler);
```
I couldn't find any documentation / reason why it had to be HTTP/2, so I think this is fine, but feel free to request it to be reverted.

I think this could be because gRPC requires HTTP/2 🤔 But standard HTTP certainly shouldn't. So maybe we need to condition the client on that? (assuming this client is actually used in the gRPC code path)

Yeah, gRPC pins to HTTP/2 here. I've assumed that takes effect, as there aren't test failures there, so I think that is good enough.
Force-pushed b8990e6 to 3a29f10
## Summary of changes
Skips `SubmitsOtlpLogs` and `SubmitsOtlpMetrics`, as both have been significantly more flaky recently and in some PRs (unrelated to them) have been constantly failing and blocking builds.
## Reason for change
Same as above: the flakiness has been blocking unrelated builds.
## Implementation details
Marked as `Skip`.
## Test coverage
Less.
## Other details
I have been making progress, but haven't resolved this here: [Fix OTLP
HTTP/protobuf export failures and improve OTLP integration test
reliability](#8449). I haven't reproduced it locally, AND it seems to be getting
significantly worse recently; one theory is that our agents are under much
higher load, which is causing them to hit this.
andrewlock left a comment

Yikes, what a mess, thanks for tackling all this 😅
```csharp
// final flush. Pin the APM URL back to the in-process MockAgent.
if (useAgentHostBackup && agent is MockTracerAgent.TcpUdpAgent tcpAgent)
{
    SetEnvironmentVariable("DD_TRACE_AGENT_URL", $"http://127.0.0.1:{tcpAgent.Port}");
```

Wow, yeah good find
The vendored gRPC client already sets HTTP/2 per-request via `RequireHttp2`, so forcing `DefaultRequestVersion` to HTTP/2 on the shared `HttpClient` breaks HTTP/protobuf export against HTTP/1.1-only servers (e.g., dd-apm-test-agent on port 4318). Remove `System.Net.HttpVersion` from the trimming XML since it is no longer referenced.

Shut down the `OtlpExporter` in `OtlpSubmissionLogSink.DisposeAsync()` so the underlying `HttpClient` is disposed after the final flush.
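The disposal ordering that commit describes can be sketched as follows. This is a hypothetical Python stand-in for the C# sink, not the actual Datadog code; the `flush`/`shutdown` names are illustrative:

```python
import asyncio

class FakeExporter:
    """Stand-in for the OTLP exporter; records the order of operations."""
    def __init__(self):
        self.calls = []

    async def flush(self):
        self.calls.append("flush")

    def shutdown(self):
        # In the real fix, Shutdown() disposes the underlying HttpClient.
        self.calls.append("shutdown")

class OtlpSubmissionSink:
    """Sketch of the fixed DisposeAsync: flush buffered logs first, then
    shut down the exporter so its HTTP client is actually released
    (the original bug skipped the shutdown, leaking the client)."""
    def __init__(self, exporter):
        self._exporter = exporter

    async def dispose(self):
        await self._exporter.flush()   # final flush of buffered logs
        self._exporter.shutdown()      # then release the HttpClient

exporter = FakeExporter()
asyncio.run(OtlpSubmissionSink(exporter).dispose())
```

The key design point is the ordering: shutting down before the flush completes would drop the last batch, while never shutting down leaves the client open.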
Replace the fire-and-forget session clear with a retry helper that waits for the test-agent HTTP endpoint to be ready. Replace the single-GET data fetch with a polling helper (`WaitForTestAgentData`) that retries for up to 30 seconds, accounting for export-flush timing after process exit. Fix the incorrect env var `OTEL_LOG_EXPORT_INTERVAL` (it does not exist) to `OTEL_BLRP_SCHEDULE_DELAY`, set to 500ms so the OTel SDK gets multiple periodic exports before `LoggerProviderSdk.Dispose()` hits its 5s shutdown timeout; this is especially important for gRPC, where the first export warms the HTTP/2 connection. Set `OTEL_METRIC_EXPORT_INTERVAL` to 60000ms so only the shutdown flush fires, preventing duplicate metric batches from observable instruments that broke snapshot comparison. Remove `[Flaky]` attributes from `SubmitsOtlpMetrics` and `SubmitsOtlpLogs`.
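The polling approach can be sketched like this. This is a Python analog under stated assumptions, not the real C# `WaitForTestAgentData` helper; `fetch` and `is_ready` are hypothetical parameters:

```python
import time

def wait_for_test_agent_data(fetch, is_ready, timeout_s=30.0, interval_s=0.5,
                             clock=time.monotonic, sleep=time.sleep):
    """Poll fetch() until is_ready(result) is true, retrying for up to
    timeout_s. This replaces a single GET, which could race the export
    flush that happens after the traced process exits."""
    deadline = clock() + timeout_s
    while True:
        result = fetch()
        if is_ready(result):
            return result
        if clock() >= deadline:
            raise TimeoutError("test agent returned no data within the timeout")
        sleep(interval_s)
```

The `clock`/`sleep` parameters are injected so the helper is testable without real waiting; the real helper presumably just loops with a 500ms delay.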
```csharp
var otlpGeneralProtocolName = OtlpGeneralProtocol switch
{
    OtlpProtocol.Grpc => "grpc",
    OtlpProtocol.HttpProtobuf => "http/protobuf",
    _ => "grpc",
};
```
Unsure if these changes are needed TBH, will double check again.

Ah yeah, so the issue here was that in tests we only set `SetEnvironmentVariable("OTEL_EXPORTER_OTLP_PROTOCOL", protocol);` (protocol was `http/protobuf`), and we weren't setting the metric/logs-specific protocol, so when we ran the tests, instead of running with http/protobuf they actually ran gRPC.

Also, the OpenTelemetry spec says (right at the top):

> The following configuration options MUST be available to configure the OTLP exporter. Each configuration option MUST be overridable by a signal specific option.

https://opentelemetry.io/docs/specs/otel/protocol/exporter/#configuration-options

Can confirm, this new logic is correct 👍🏼
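The precedence the spec mandates can be sketched as follows. This is an illustrative Python helper, not the tracer's actual configuration code; `resolve_otlp_protocol` and its parameters are assumptions:

```python
def resolve_otlp_protocol(signal, env, default="grpc"):
    """Per the OTel exporter spec, the signal-specific variable
    (e.g. OTEL_EXPORTER_OTLP_LOGS_PROTOCOL) overrides the general
    OTEL_EXPORTER_OTLP_PROTOCOL, which in turn overrides the default."""
    valid = {"grpc", "http/protobuf", "http/json"}
    for key in (f"OTEL_EXPORTER_OTLP_{signal.upper()}_PROTOCOL",
                "OTEL_EXPORTER_OTLP_PROTOCOL"):
        value = env.get(key, "").strip().lower()
        if value in valid:
            return value
    return default
```

Note that `http/json` is accepted at the general level here; if the general fallback doesn't parse it, a user who sets only `OTEL_EXPORTER_OTLP_PROTOCOL=http/json` silently falls back to gRPC for that signal.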
No implementation honored `timeoutMilliseconds`; all paths just dispose the `HttpClient`. The real flush is bounded by the HTTP client timeout during the preceding `DisposeAsync`/`StopAsync`, so `Shutdown()` does not need to block. Drop the parameter and the now-unused `_timeoutMs` field in `MetricReader`.
Force-pushed fdde571 to 32d2ea4
@codex review

Codex Review: Didn't find any major issues. Already looking forward to the next diff.
link04 left a comment

Did notice some env var I missed adding as well! Thank you for looking into this, man.

zacharycmontoya left a comment

All these changes look necessary and correct. If the CI flakiness is reduced as a result, then let's merge this.
💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e80e6ecf96
```diff
 .WithKeys(ConfigurationKeys.OpenTelemetry.ExporterOtlpLogsProtocol)
 .GetAs(
-    defaultValue: new(OtlpProtocol.Grpc, "grpc"),
+    defaultValue: new(OtlpGeneralProtocol, otlpGeneralProtocolName),
```
**Propagate the general http/json protocol**

For users who set only `OTEL_EXPORTER_OTLP_PROTOCOL=http/json`, this new fallback still resolves logs to the default gRPC protocol, because `OtlpGeneralProtocol` does not parse `http/json`, even though the logs-specific converter below accepts it and the configuration docs list it as valid. In that configuration, logs will be sent to the gRPC default endpoint instead of the HTTP exporter unless `OTEL_EXPORTER_OTLP_LOGS_PROTOCOL` is also set explicitly.
## Summary of changes

This fixes some export failures occurring in OTLP Logs/Metrics exports that were caused both by bugs within the exporter code (forced HTTP/2 and an unclosed `_otlpExporter`) and by some test issues.

## Reason for change

I've been noticing on many of my CI runs that I kept running into the following error that requires a retry:

On looking further I found a few issues of note:

- `HttpClient`s for Logs/Metrics in our OTLP implementation forced HTTP/2, but the test agent that we use only supports an HTTP/1.1 aiohttp server. Note: this is technically a production change for a test issue; I'm fine reverting this and just leaving these tests as `[Flaky]`.
- `OtlpSubmissionLogSink.DisposeAsync()` never called `_otlpExporter.Shutdown()`, which appears to have left the `HttpClient` open.
- We used `OTEL_LOG_EXPORT_INTERVAL` instead of `OTEL_BLRP_SCHEDULE_DELAY` (the former doesn't seem to exist, ever or anymore).
- `OTEL_EXPORTER_OTLP_PROTOCOL` was not applied to the metrics/logs protocol(s) when they are not set -> https://opentelemetry.io/docs/specs/otel/protocol/exporter/#configuration-options

## Implementation details
It does seem that the test agent is a bit finicky to work with, and since we don't use it very much anywhere else, I opted for some polling strategies instead of single shots for getting request data.

- Changed `OTEL_LOG_EXPORT_INTERVAL` to `OTEL_BLRP_SCHEDULE_DELAY`
- Replaced the fire-and-forget `/test/session/clear` with a `ClearTestAgentSession` retry helper
- Added a `WaitForTestAgentData` polling helper (30s timeout, 500ms interval)
- Set `OTEL_METRIC_EXPORT_INTERVAL=60000` so only the shutdown flush fires (one clean batch)
- Set `OTEL_BLRP_SCHEDULE_DELAY=500`; the OpenTelemetry SDK has a hard 5s shutdown timeout that we hit, so this should be good enough to get our data out and flush it before it is timed out. I'll add this to the Other Details as well. This seems to be the flakiest bit left.
- Removed `[Flaky]` from `SubmitsOtlpMetrics` and `SubmitsOtlpLogs`
- Used the `OTEL_EXPORTER_OTLP_PROTOCOL` value to set the metric/log protocol when not set
- Removed the timeout parameter from `Shutdown`, as it wasn't used (and I attempted to use it)

## Test coverage
I'm going to run the CI a few times against this to see if it helps the tests run more quickly and pass.
## Other details
The test-agent's HTTP server on port 4318 is created in agent.py:

```python
otlp_http_app = make_otlp_http_app(agent)  # line 2436
otlp_http_runner = web.AppRunner(otlp_http_app)  # line 2470
otlp_http_site = web.TCPSite(otlp_http_runner, port=parsed_args.otlp_http_port)  # line 2490
```

This is aiohttp.web -- its TCPSite serves HTTP/1.1 only. aiohttp does not support HTTP/2. You can see it declared as a dependency in setup.py:

```python
install_requires=[
    "aiohttp",
    ...
]
```

There's no h2, hyper, hypercorn, or any other HTTP/2 library in the dependencies. For contrast, port 4317 (gRPC) uses grpc_aio.server() (line 2481), which speaks HTTP/2 natively -- that's why gRPC tests were never affected.

Source: https://github.com/DataDog/dd-apm-test-agent -- setup.py for dependencies, ddapm_test_agent/agent.py lines 2436-2490 for server setup.