[Tracing] Add experimental http/protobuf support for OTLP traces export#8645
Conversation
Empty skeleton files for the OTLP traces protobuf serializer and its tests, to be filled in via subsequent TDD cycles.
Implements the empty-buffer path: when no spans have been serialized, FinishBody is a no-op that returns 0 bytes written.
Implements SerializeSpans for the first-chunk path: opens the ResourceSpans envelope (resource + scope_spans), then writes each span with its trace_id/span_id/parent_span_id/name/kind/timestamps, attributes, events, links, status, and flags. FinishBody patches the length placeholders. Tests parse the binary output via Google.Protobuf.CodedInputStream through a minimal in-test decoder so we avoid taking a heavy generated- bindings dependency.
…r positions Mirror the JSON serializer's overflow behavior: return 0 from SerializeSpans when the chunk exceeds maxSize, and grow the caller's temporary buffer up-front via MessagePackBinary.EnsureCapacity so per-field writes cannot IndexOutOfRangeException. Store length-placeholder positions as offsets into the eventual destination _buffer (computed via spanBufferOffset), not as absolute offsets into the temporary serialization buffer. This removes implicit coupling to HeaderSize == 0 and temporaryBufferOffset == 0. Add a unit test that exercises the maxSize overflow path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ew Traces http/protobuf implementation to run on all runtimes. This also removes the '#if NETCOREAPP3_1_OR_GREATER' from the vendored OpenTelemetry .NET SDK ProtobufSerializer
Execution-Time Benchmarks Report ⏱️Execution-time results for samples comparing This PR (8645) and master. ✅ No regressions detected - check the details below Full Metrics ComparisonFakeDbCommand
HttpMessageHandler
Comparison explanationExecution-time benchmarks measure the whole time it takes to execute a program, and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:
Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph). Duration chartsFakeDbCommand (.NET Framework 4.8)gantt
title Execution time (ms) FakeDbCommand (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8645) - mean (73ms) : 71, 76
master - mean (73ms) : 70, 75
section Bailout
This PR (8645) - mean (77ms) : 75, 79
master - mean (78ms) : 74, 81
section CallTarget+Inlining+NGEN
This PR (8645) - mean (1,114ms) : 1053, 1174
master - mean (1,110ms) : 1067, 1153
FakeDbCommand (.NET Core 3.1)gantt
title Execution time (ms) FakeDbCommand (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8645) - mean (114ms) : 110, 117
master - mean (115ms) : 110, 119
section Bailout
This PR (8645) - mean (118ms) : 112, 125
master - mean (119ms) : 113, 125
section CallTarget+Inlining+NGEN
This PR (8645) - mean (797ms) : 771, 823
master - mean (796ms) : 771, 821
FakeDbCommand (.NET 6)gantt
title Execution time (ms) FakeDbCommand (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8645) - mean (102ms) : 99, 105
master - mean (101ms) : 98, 105
section Bailout
This PR (8645) - mean (102ms) : 100, 104
master - mean (106ms) : 101, 112
section CallTarget+Inlining+NGEN
This PR (8645) - mean (949ms) : 912, 985
master - mean (947ms) : 905, 989
FakeDbCommand (.NET 8)gantt
title Execution time (ms) FakeDbCommand (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8645) - mean (103ms) : 98, 109
master - mean (100ms) : 96, 104
section Bailout
This PR (8645) - mean (102ms) : 98, 105
master - mean (101ms) : 98, 103
section CallTarget+Inlining+NGEN
This PR (8645) - mean (826ms) : 777, 875
master - mean (829ms) : 789, 869
HttpMessageHandler (.NET Framework 4.8)gantt
title Execution time (ms) HttpMessageHandler (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8645) - mean (198ms) : 192, 204
master - mean (198ms) : 192, 205
section Bailout
This PR (8645) - mean (201ms) : 197, 205
master - mean (201ms) : 197, 205
section CallTarget+Inlining+NGEN
This PR (8645) - mean (1,190ms) : 1149, 1230
master - mean (1,187ms) : 1132, 1241
HttpMessageHandler (.NET Core 3.1)gantt
title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8645) - mean (283ms) : 273, 293
master - mean (284ms) : 276, 293
section Bailout
This PR (8645) - mean (285ms) : 278, 293
master - mean (284ms) : 278, 290
section CallTarget+Inlining+NGEN
This PR (8645) - mean (958ms) : 939, 977
master - mean (957ms) : 939, 975
HttpMessageHandler (.NET 6)gantt
title Execution time (ms) HttpMessageHandler (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8645) - mean (277ms) : 269, 284
master - mean (276ms) : 270, 281
section Bailout
This PR (8645) - mean (277ms) : 270, 283
master - mean (276ms) : 270, 282
section CallTarget+Inlining+NGEN
This PR (8645) - mean (1,150ms) : 1105, 1195
master - mean (1,156ms) : 1114, 1199
HttpMessageHandler (.NET 8)gantt
title Execution time (ms) HttpMessageHandler (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (8645) - mean (276ms) : 270, 283
master - mean (277ms) : 268, 285
section Bailout
This PR (8645) - mean (277ms) : 271, 282
master - mean (276ms) : 272, 280
section CallTarget+Inlining+NGEN
This PR (8645) - mean (1,032ms) : 996, 1069
master - mean (1,037ms) : 988, 1087
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BenchmarksBenchmark execution time: 2026-05-22 19:05:54 Comparing candidate commit a1da8e6 in PR branch Some scenarios are present only in baseline or only in candidate runs. If you didn't create or remove some scenarios in your branch, this maybe a sign of crashed benchmarks 💥💥💥 Scenarios present only in baseline:
Found 6 performance improvements and 10 performance regressions! Performance is the same for 43 metrics, 13 unstable metrics, 93 known flaky benchmarks, 33 flaky benchmarks without significant changes.
|
2f1cb24 to
eedee80
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0a3f4eb162
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Revert order of OtlpProtocol switch cases - Add comment to OtlpTracesProtobufSerializer to describe why we're manually serializing OTLP traces
…ot allocate each time resource attributes are written from the OtlpTracesProtobufSerializer
andrewlock
left a comment
There was a problem hiding this comment.
LGTM, mostly just some nits/questsions/suggestions
- Remove grpc and let it fall into the general error fallback case - Update the OtlpTracesProtobufSerializerTests to use the real protobuf definitions and produce generated C# files from the Grpc.Tools library
|
… and avoid allocating
- Remove unnecessary ! (null-forgiving) operators - Refactor the serialization of Span.Status to slightly improve perf - Add bounds checking to the OtlpTracesProtobufSerializer
…ring protobuf serialization
…rary buffer size rather than increase the capacity all at once.
|
@andrewlock and @vandonr I've addressed most feedback but I have two small follow-up items then I plan to merge. If you're still invested in the small suggestions, I'd appreciate if you could review my latest changes while I follow-up on the remaining feedback. Remaining feedback I'll be addressing:
|
…and it will allow us to unit test the OTLP protobuf serializer
…s more in-code documentation for WriteAnyValue and implements identical logic to the OpenTelemetry .NET protobuf serializer in terms of handling primitive types.
andrewlock
left a comment
There was a problem hiding this comment.
LGTM overall, thanks! Just one small question about failing to increase buffer size, and one incorrect comment. Though I think integration tests are currently failing
|
|
||
| return bytesWritten; | ||
| } | ||
| catch (Exception ex) when (ex is IndexOutOfRangeException or ArgumentException) |
There was a problem hiding this comment.
It's interesting, because it looks like there's two "failure" modes:
- TemporaryBuffer was resized to > maxSize, but otherwise we completed sucessfully
- Serializer just tried to overwrite the end, and threw.
Just seems strange that we can fail in those two different ways? 🤔 Why do the second one when the first one is an option 🤷♂️ Anyway, just pondering, I realize it's just the behaviour of the serializer! 😅
There was a problem hiding this comment.
Reasoning about this, the catch block below will catch when we hit the following issue:
- Serializer just tried to overwrite the end, and threw.
When we hit that, we resize the buffer to Math.Min(currentBufferSize * 2, maxSize). Then at that point our buffer should always throw if we exceed maxSize. But as a backup, we manually check the condition with if (bytesWritten > maxSize). So the first case is more of a backup that we haven't written more than what can be transferred once we hand control back to the SpanBuffer
There was a problem hiding this comment.
Added a more clear Math.Min(currentBufferSize * 2, maxSize) in c132645. Hope the explanation here helps too
| // range); the catch below treats that as overflow and returns 0. | ||
| // TODO: Follow the OpenTelemetry SDK approach to grow the buffer only as necessary. | ||
| MessagePackBinary.EnsureCapacity(ref bytes, temporaryBufferOffset, maxSize); | ||
| // Initialize the temporary buffer to the initial size deemed best by the OpenTelemetry .NET SDK |
There was a problem hiding this comment.
😂 "Not my fault, mate, it was like that when I got here, go ask the OTel team"
There was a problem hiding this comment.
In general there seems to be more bytes for OTLP payloads because there's more structure and more instrumentation scopes (spans grouped by the library that produced them), so I think it benefits us to start at a higher initial buffer size 😄
…Serializer.SerializeSpans to clarify curious behaviors
…it with proper serialization of array attributes (currently only on events). Also updated the http/json serializer to produce array attributes rather than a simple array.ToString()
Summary of changes
Adds
http/protobufsupport to the OTLP traces exporter introduced in #8211, which introduced the exporter andhttp/jsonsupport.This feature is enabled by setting
OTEL_TRACES_EXPORTER=otlpandOTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf(or the parentOTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf).Reason for change
When #8211 landed,
http/jsonwas the only functional OTLP traces protocol — setting any other protocol would fall back (with a log message) to the Datadog MessagePack encoding.http/protobufis the OTLP default protocol per the OpenTelemetry specification and is what most OTel collectors, Datadog Agent included, expect to receive, so we need to support it to be useful out-of-the-box for users adopting the OTLP traces export.Implementation details
Serializer
Adds
Datadog.Trace.OpenTelemetry.Traces.OtlpTracesProtobufSerializer, a newISpanBufferSerializerthat emits the OTLPExportTraceServiceRequestprotobuf payload directly intoSpanBuffer's output buffer:resource_spansandscope_spansenvelopes are opened on the firstSerializeSpanscall, span entries are appended per chunk, andFinishBodypatches the two length-prefix placeholders using absolute positions into the destination buffer.128) asOtlpTracesJsonSerializer, and re-uses theOtlpMapperlogic to map resource attributes and tags onto the serialized spans.OtlpTracesProtobufSerializeris stateful (it tracks the two length-placeholder positions), soAgentWriternow creates a fresh instance for the front and back buffers rather than sharing one — the previous sharedOtlpTracesJsonSerializerwas stateless and didn't need this.Vendored OpenTelemetry SDK
Builds on the protobuf primitives vendored from the OpenTelemetry .NET SDK in #7653 (
ProtobufSerializer,ProtobufWireType):ProtobufOtlpTraceFieldNumberConstantsandProtobufOtlpResourceFieldNumberConstants(split out so the OTLP trace serializer only depends on what it needs).#if NETCOREAPP3_1_OR_GREATERguard from the vendoredProtobufSerializer— now that we vendor aSpan<>polyfill the serializer compiles and runs on all supported runtimes (.NET Framework 4.6.1+).#if NET6_0_OR_GREATERguard around theSubmitsOtlpTracesintegration test (http/protobufandhttp/jsonboth run on .NET Framework now too).Wiring
ExporterSettings:OtlpProtocol.HttpProtobufnow resolves toTracesEncoding.OtlpProtobufinstead of falling back toDatadogV0_4.AgentWriter.GetSpanSerializer: selectsOtlpTracesProtobufSerializerwhenTracesEncoding == OtlpProtobuf.ApiOtlp.SendTracesAsync: choosesapplication/x-protobufvsapplication/jsonfor the requestContent-Typebased on_tracesEncoding. Adds theapplication/x-protobufMIME type constant toDatadog.Trace.Agent.Transports.MimeTypes.supported-configurations.yaml: documentshttp/protobufas a valid value forOTEL_EXPORTER_OTLP_TRACES_PROTOCOLand regeneratesConfigurationKeys.OpenTelemetry.g.csfor all TFMs.Test coverage
OtlpTracesProtobufSerializerTestsexercises the serializer end-to-end usingGoogle.Protobufand a little custom parsing. Test cases include:FinishBodyreturns 0)maxSizehonoring behaviorOpenTelemetrySdkTests.SubmitsOtlpTracesnow runs bothhttp/jsonandhttp/protobufcases for the Datadog SDK on all TFMs.http/protobufare normalized to thehttp/jsonshape via scrubbers that translate the test agent's protobuf rendering (snake_case field names, string-form enum values likeSPAN_KIND_SERVER) into the JSON serializer's shape (camelCase field names, integer enum values).SubmitsOtlpTraces_DD_http_json.verified.txtrenamed toSubmitsOtlpTraces_DD.verified.txt(one snapshot now covers both protocols).Other details
With this PR,
http/jsonandhttp/protobufare both functional;grpcstill falls back to Datadog MessagePack encoding with a startup warning.Follow-up work
grpcOTLP traces protocolTODO: Telemetry - Record OTLP Traces API submissionsinApiOtlp.SendTracesAsyncstill applies)OTEL_EXPORTER_OTLP_TRACES_TIMEOUTinApiOtlp(still unused)