Commit 014e5d0

andrewlock, claude, and lucaspimentel authored
Implement Client-Side Stats (CSS) 1.2.0 (#8420)
## Summary of changes

Updates the existing client-side-stats implementation to match version 1.2.0 [as defined in the RFC](https://datadoghq.atlassian.net/wiki/spaces/APM/pages/6378947571/Client-Side+Stats+v1.2.0) and as [implemented in the agent](https://github.com/DataDog/datadog-agent/blob/c1d67a906f4c594654600760da1eea4c8037471a/pkg/proto/datadog/trace/stats.proto#L83).

## Reason for change

Our implementation is severely lagging behind the latest implementation in the agent. This hasn't been a big deal, as it's not documented and not enabled by default, but we'd like to fix the implementation to make it usable.

## Implementation details

This was driven almost entirely by 🤖, by comparing our existing implementation to the RFC, and also taking the agent/go implementation as the canonical implementation.

> Implementing this in .NET highlighted a number of missing aspects in the RFC, which I've raised elsewhere, and aim to get incorporated into the RFC.

At a high level, the PR contains the following changes:

- Stats Wire Format & Serialization
  - **Added new aggregation dimensions**: `SpanKind`, `IsTraceRoot` (as Trilean), `HTTPMethod`, `HTTPEndpoint`, `GRPCStatusCode`, `ServiceSource`, `PeerTags` to both the aggregation key and the msgpack wire format
  - **Fixed `GRPCStatusCode` type**: Changed from `int` to `string` to match Go agent's wire format (agent was returning 400 Bad Request in system tests)
  - **gRPC status code extraction**: Checks 4 tag names in priority order (`rpc.grpc.status_code`, `grpc.code`, `rpc.grpc.status.code`, `grpc.status.code`)
  - **Stochastic rounding**: `Hits`, `Errors`, `Duration`, `TopLevelHits` accumulated as `double` (weighted by sampling rate) then rounded probabilistically to `int64` for serialization
  - **Duration weighting**: Durations are now multiplied by sampling weight (`1/rate`), matching Go agent behavior
  - **Default env**: Serializes `"unknown-env"` when environment is not configured
  - **`git_commit_sha`**: Added as optional field in stats payload
  - **`Service`**: Added as top-level field in stats payload
  - **Empty bucket suppression**: `HasHits()` check prevents sending payloads with zero-hit buckets (stale keys retained for sketch reuse)
- Bucket Timing
  - **10-second alignment**: Bucket `Start` timestamps aligned to 10-second boundaries (`ts - ts % 10_000_000_000`) matching Go tracer's `alignTs`
  - **Removed unused `StartTime`** property (only `Start` as aligned nanoseconds)
- Agent Discovery (`/info` Endpoint)
  - **`peer_tags`**: Parsed, sorted, deduplicated; used for peer tag extraction on client/producer/consumer spans
  - **`span_kinds_stats_computed`**: Parsed to override eligible span kinds
  - **`obfuscation_version`**: Parsed for obfuscation negotiation
  - **Trace filters**: Parsed `filter_tags`, `filter_tags_regex`, `ignore_resources` from `/info`
- Trace Filtering
  - **`TraceFilter` implementation**: Evaluates agent-configured filters (exact tags, regex tags, resource patterns) against root spans before stats computation
  - **Tag-only filters**: Handles filter entries that match on tag key presence without a specific value
- SQL Obfuscation
  - **Operator splitters**: Added `* / = < > ! & ^ % ~ ? @ : #` as token splitters (matching Go agent's `go-sqllexer` `isOperator()`) so queries like `WHERE id='1'` are properly obfuscated
  - **Whitespace normalization**: Post-processing pass adds spaces around comparison operators (`=`, `<`, `>`, `!`) adjacent to `?` placeholders, matching Go agent's normalizer output (e.g., `id='1'` → `id = ?`)
  - **Obfuscation gating**: Only runs when `obfuscation_version` is negotiated with agent; sends `Datadog-Obfuscation-Version` header
- Peer Tags
  - **IP quantization**: Peer tag values run through `IpAddressObfuscationUtil.QuantizePeerIpAddresses()` replacing non-allowed IPs with `"blocked-ip-address"`
  - **Base service handling**: Internal/missing-kind spans with non-default service name use `_dd.base_service:{serviceName}` as sole peer tag
  - **FNV-1a hashing**: Peer tags hashed with null-byte separators for aggregation key
- Other
  - **No retries for stats**: Stats sends are fire-and-forget (retry limit = 0)
  - **Synthetics detection**: Uses `StartsWith("synthetics")` prefix matching (not exact match)
  - **Mock agent updates**: Test mock agent returns `obfuscation_version` in `/info` response

## Test coverage

There's a _lot_ going on in this PR, because we were so far behind. I _could_ split this into implementing individual features, but there would be a lot of duplication between PRs, and it didn't seem like it would be that easy to track. At least with this big bang we can compare directly against the system tests etc.

The existing system tests for stats computation were checked, and made to pass (which identified a number of hidden expectations which will be added to the RFC). I'll create a PR to enable these in the system-tests repo.

## Other details

Part of a stack

- #8417
- #8418

There are still some _theoretical_ gaps between the go implementation and the .NET implementation, but I _think_ these are non-issues in _most_ cases:

| Area | Gap | Impact |
|------|-----|--------|
| gRPC status code normalization | Go agent normalizes string statuses (e.g., `"CANCELLED"` → `"1"`); .NET passes raw tag value | Stats mismatch if gRPC library uses string-form status codes |
| HTTP status code tags | Go agent checks both `http.status_code` and `http.response.status_code` (OTel convention); .NET only checks `http.status_code` | OTel spans using newer convention would get `0` in .NET |
| Duration precision truncation | Go agent uses float bit masking; .NET uses integer shifting; both target ~10 bits but may produce slightly different values | Minor histogram differences |
| SQL obfuscation | Go agent uses full tokenizer + normalizer; .NET uses character-level splitter with targeted normalization around comparison operators only | Complex SQL with unusual formatting may produce different resource strings |
| `HTTP_method` / `HTTP_endpoint` | .NET always populates from span tags; Go agent only populates from newer OTel pipeline paths | Creates different aggregation keys: Go groups all methods/routes together in default path |
| `_top_level` metric | Go checks both `_top_level` and `_dd.top_level`; .NET only checks `_dd.top_level` | Spans using older metric name would be missed by .NET |
| `span_derived_primary_tags` | Still in Go agent code but implemented in v1.3.0 RFC, which was reverted; removed from .NET | No current impact; may need to re-add if spec reverts back |

There is another big elephant in the room, which is perf. The peer tags, in particular, currently require a _bunch_ of allocation. I'd rather defer trying to fight against that to another PR if possible, unless anyone has some clear ideas 😄

Another aspect I'm not sure about is how this interacts with @zacharycmontoya's recent work to publish OTLP stats. I took a random guess and fought the refactoring, but need to verify it.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Lucas Pimentel <lucas.pimentel@datadoghq.com>
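Three of the mechanics named in the description (10-second bucket alignment, stochastic rounding, and FNV-1a hashing with null-byte separators) can be sketched in isolation. This is a minimal illustration, not the tracer's code: the class and method names are hypothetical, and the choice of the 64-bit FNV-1a variant with a separator byte written after every tag is an assumption, since the description does not state either detail.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustrative only: these names are hypothetical and do not come from the tracer.
public static class StatsSketch
{
    private const long BucketNs = 10_000_000_000L; // 10 seconds in nanoseconds

    // Align a Unix timestamp (nanoseconds) down to its 10-second boundary,
    // i.e. ts - ts % 10_000_000_000 as quoted in the description.
    public static long AlignTimestamp(long unixNs) => unixNs - (unixNs % BucketNs);

    // Keep the whole part of a weighted value and add one with a probability
    // equal to the fractional part, so the rounding is unbiased on average.
    public static long StochasticRound(double value, Random rng)
    {
        var whole = Math.Floor(value);
        var fraction = value - whole;
        return (long)whole + (rng.NextDouble() < fraction ? 1 : 0);
    }

    // 64-bit FNV-1a over the UTF-8 bytes of each tag. Assumption: a 0x00 byte
    // is mixed in after every tag so that ["ab", "c"] and ["a", "bc"] feed
    // different byte sequences into the hash.
    public static ulong HashPeerTags(IEnumerable<string> tags)
    {
        const ulong offsetBasis = 14695981039346656037UL;
        const ulong prime = 1099511628211UL;

        var hash = offsetBasis;
        foreach (var tag in tags)
        {
            foreach (var b in Encoding.UTF8.GetBytes(tag))
            {
                hash = unchecked((hash ^ b) * prime);
            }

            hash = unchecked((hash ^ 0UL) * prime); // null-byte separator
        }

        return hash;
    }
}
```

With these definitions, a whole-number input such as `3.0` always rounds to `3`, while `2.5` yields `2` or `3` with equal probability; over many buckets the expected total matches the weighted sum.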
1 parent 5925926 commit 014e5d0

33 files changed

Lines changed: 1638 additions & 233 deletions

tracer/missing-nullability-files.csv

Lines changed: 0 additions & 1 deletion
@@ -43,7 +43,6 @@ src/Datadog.Trace/Agent/IStreamFactory.cs
 src/Datadog.Trace/Agent/MovingAverageKeepRateCalculator.cs
 src/Datadog.Trace/Agent/NullStatsAggregator.cs
 src/Datadog.Trace/Agent/SpanBuffer.cs
-src/Datadog.Trace/Agent/StatsAggregationKey.cs
 src/Datadog.Trace/Agent/StatsAggregator.cs
 src/Datadog.Trace/Agent/StatsBucket.cs
 src/Datadog.Trace/Agent/StatsBuffer.cs

tracer/src/Datadog.Trace/Agent/Api.cs

Lines changed: 15 additions & 7 deletions
@@ -1,11 +1,12 @@
-// <copyright file="Api.cs" company="Datadog">
+// <copyright file="Api.cs" company="Datadog">
 // Unless explicitly stated otherwise all files in this repository are licensed under the Apache 2 License.
 // This product includes software developed at Datadog (https://www.datadoghq.com/). Copyright 2017 Datadog, Inc.
 // </copyright>
 
 using System;
 using System.Collections.Generic;
 using System.Diagnostics.CodeAnalysis;
+using System.Globalization;
 using System.IO;
 using System.Net.Sockets;
 using System.Threading;
@@ -98,13 +99,14 @@ public void ToggleTracerHealthMetrics(bool enabled)
 
         public Task<bool> Ping() => SendTracesAsync(EmptyPayload, 0, false, 0, 0);
 
-        public Task<bool> SendStatsAsync(StatsBuffer stats, long bucketDuration)
+        public Task<bool> SendStatsAsync(StatsBuffer stats, long bucketDuration, int tracerObfuscationVersion)
         {
             _log.Debug("Sending stats to the Datadog Agent.");
 
-            var state = new SendStatsState(stats, bucketDuration);
+            var state = new SendStatsState(stats, bucketDuration, tracerObfuscationVersion);
 
-            return SendWithRetry(_statsEndpoint, _sendStats, state);
+            // We are supposed to be fire and forget for these stats, with no retries
+            return SendWithRetry(_statsEndpoint, _sendStats, state, retryLimit: 0);
         }
 
         public Task<bool> SendTracesAsync(ArraySegment<byte> traces, int numberOfTraces, bool statsComputationEnabled, long numberOfDroppedP0Traces, long numberOfDroppedP0Spans, bool apmTracingEnabled = true)
@@ -138,10 +140,9 @@ internal bool LogPartialFlushWarningIfRequired(string agentVersion)
             return false;
         }
 
-        private async Task<bool> SendWithRetry<T>(Uri endpoint, SendCallback<T> callback, T state)
+        private async Task<bool> SendWithRetry<T>(Uri endpoint, SendCallback<T> callback, T state, int retryLimit = 5)
         {
             // retry up to 5 times with exponential back-off
-            var retryLimit = 5;
             var retryCount = 1;
             var sleepDuration = 100; // in milliseconds
 
@@ -218,6 +219,11 @@ private async Task<SendResult> SendStatsAsyncImpl(IApiRequest request, bool isFi
 
             request.AddContainerMetadataHeaders(_containerMetadata);
 
+            if (state.TracerObfuscationVersion > 0)
+            {
+                request.AddHeader("Datadog-Obfuscation-Version", state.TracerObfuscationVersion.ToString(CultureInfo.InvariantCulture));
+            }
+
             using var stream = new MemoryStream();
             state.Stats.Serialize(stream, state.BucketDuration);
 
@@ -441,11 +447,13 @@ private readonly struct SendStatsState
         {
             public readonly StatsBuffer Stats;
             public readonly long BucketDuration;
+            public readonly int TracerObfuscationVersion;
 
-            public SendStatsState(StatsBuffer stats, long bucketDuration)
+            public SendStatsState(StatsBuffer stats, long bucketDuration, int tracerObfuscationVersion)
             {
                 Stats = stats;
                 BucketDuration = bucketDuration;
+                TracerObfuscationVersion = tracerObfuscationVersion;
             }
         }
     }

tracer/src/Datadog.Trace/Agent/ApiOtlp.cs

Lines changed: 8 additions & 6 deletions
@@ -74,13 +74,14 @@ private enum SendResult
         // return true.
         public Task<bool> Ping() => Task.FromResult(true);
 
-        public Task<bool> SendStatsAsync(StatsBuffer stats, long bucketDuration)
+        public Task<bool> SendStatsAsync(StatsBuffer stats, long bucketDuration, int tracerObfuscationVersion)
         {
             _log.Debug("Sending trace stats to the OTLP Metrics endpoint.");
 
-            var state = new SendStatsState(stats, bucketDuration);
+            var state = new SendStatsState(stats, bucketDuration, tracerObfuscationVersion);
 
-            return SendWithRetry(_statsEndpoint, _sendStats, state);
+            // We are supposed to be fire and forget for these stats, with no retries
+            return SendWithRetry(_statsEndpoint, _sendStats, state, retryLimit: 0);
         }
 
         public Task<bool> SendTracesAsync(ArraySegment<byte> traces, int numberOfTraces, bool statsComputationEnabled, long numberOfDroppedP0Traces, long numberOfDroppedP0Spans, bool apmTracingEnabled = true)
@@ -92,10 +93,9 @@ public Task<bool> SendTracesAsync(ArraySegment<byte> traces, int numberOfTraces,
             return SendWithRetry(_tracesEndpoint, _sendTraces, state);
         }
 
-        private async Task<bool> SendWithRetry<T>(Uri endpoint, SendCallback<T> callback, T state)
+        private async Task<bool> SendWithRetry<T>(Uri endpoint, SendCallback<T> callback, T state, int retryLimit = 5)
         {
             // retry up to 5 times with exponential back-off
-            var retryLimit = 5;
             var retryCount = 1;
             var sleepDuration = 100; // in milliseconds
 
@@ -293,11 +293,13 @@ private readonly struct SendStatsState
         {
             public readonly StatsBuffer Stats;
             public readonly long BucketDuration;
+            public readonly int TracerObfuscationVersion;
 
-            public SendStatsState(StatsBuffer stats, long bucketDuration)
+            public SendStatsState(StatsBuffer stats, long bucketDuration, int tracerObfuscationVersion)
             {
                 Stats = stats;
                 BucketDuration = bucketDuration;
+                TracerObfuscationVersion = tracerObfuscationVersion;
             }
         }
     }

tracer/src/Datadog.Trace/Agent/ClientStatsPayload.cs

Lines changed: 2 additions & 2 deletions
@@ -24,8 +24,8 @@ public void UpdateDetails(MutableSettings settings)
             => Interlocked.Exchange(ref _settings, CreateSettings(settings));
 
         private static AppSettings CreateSettings(MutableSettings settings)
-            => new(settings.Environment, settings.ServiceVersion, settings.ProcessTags);
+            => new(settings.Environment, settings.ServiceVersion, settings.DefaultServiceName, settings.ProcessTags, settings.GitCommitSha);
 
-        internal sealed record AppSettings(string? Environment, string? Version, ProcessTags? ProcessTags);
+        internal sealed record AppSettings(string? Environment, string? Version, string DefaultServiceName, ProcessTags? ProcessTags, string? GitCommitSha);
     }
 }

tracer/src/Datadog.Trace/Agent/DiscoveryService/AgentConfiguration.cs

Lines changed: 15 additions & 1 deletion
@@ -5,6 +5,8 @@
 
 #nullable enable
 
+using System.Collections.Generic;
+
 namespace Datadog.Trace.Agent.DiscoveryService;
 
 internal sealed record AgentConfiguration
@@ -24,7 +26,10 @@ public AgentConfiguration(
         string? containerTagsHash,
         bool clientDropP0,
         bool spanMetaStructs,
-        bool? spanEvents)
+        bool? spanEvents,
+        List<string>? peerTags = null,
+        int obfuscationVersion = 0,
+        AgentTraceFilterConfig? traceFilterConfig = null)
     {
         ConfigurationEndpoint = configurationEndpoint;
         DebuggerEndpoint = debuggerEndpoint;
@@ -41,6 +46,9 @@ public AgentConfiguration(
         ClientDropP0s = clientDropP0;
         SpanMetaStructs = spanMetaStructs;
         SpanEvents = spanEvents ?? false;
+        PeerTags = peerTags;
+        ObfuscationVersion = obfuscationVersion;
+        TraceFilterConfig = traceFilterConfig ?? AgentTraceFilterConfig.Empty;
     }
 
     public string? ConfigurationEndpoint { get; }
@@ -84,4 +92,10 @@ public AgentConfiguration(
     public bool SpanMetaStructs { get; }
 
     public bool SpanEvents { get; }
+
+    public List<string>? PeerTags { get; }
+
+    public int ObfuscationVersion { get; }
+
+    public AgentTraceFilterConfig TraceFilterConfig { get; }
 }
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+// <copyright file="AgentTraceFilterConfig.cs" company="Datadog">
+// Unless explicitly stated otherwise all files in this repository are licensed under the Apache 2 License.
+// This product includes software developed at Datadog (https://www.datadoghq.com/). Copyright 2017 Datadog, Inc.
+// </copyright>
+
+#nullable enable
+
+using System.Collections.Generic;
+
+namespace Datadog.Trace.Agent.DiscoveryService;
+
+/// <summary>
+/// Trace-level filtering configuration received from the agent's /info endpoint.
+/// Filters are applied to the root span before stats computation.
+/// </summary>
+internal sealed record AgentTraceFilterConfig(
+    List<string>? FilterTagsRequire,
+    List<string>? FilterTagsReject,
+    List<string>? FilterTagsRegexRequire,
+    List<string>? FilterTagsRegexReject,
+    List<string>? IgnoreResources)
+{
+    public static readonly AgentTraceFilterConfig Empty = new(null, null, null, null, null);
+
+    public bool HasFilters =>
+        FilterTagsRequire is { Count: > 0 } ||
+        FilterTagsReject is { Count: > 0 } ||
+        FilterTagsRegexRequire is { Count: > 0 } ||
+        FilterTagsRegexReject is { Count: > 0 } ||
+        IgnoreResources is { Count: > 0 };
+}

tracer/src/Datadog.Trace/Agent/DiscoveryService/DiscoveryService.cs

Lines changed: 21 additions & 1 deletion
@@ -348,6 +348,23 @@ private async Task ProcessDiscoveryResponse(IApiResponse response)
             var clientDropP0 = jObject["client_drop_p0s"]?.Value<bool>() ?? false;
             var spanMetaStructs = jObject["span_meta_structs"]?.Value<bool>() ?? false;
             var spanEvents = jObject["span_events"]?.Value<bool>() ?? false;
+            var peerTags = (jObject["peer_tags"] as JArray)?.Values<string>().Where(x => !string.IsNullOrEmpty(x)).Distinct().OrderBy(x => x).ToList();
+            var obfuscationVersion = jObject["obfuscation_version"]?.Value<int>() ?? 0;
+
+            // Parse trace filter configuration
+            var filterTags = jObject["filter_tags"];
+            var filterTagsRegex = jObject["filter_tags_regex"];
+            var ignoreResources = (jObject["ignore_resources"] as JArray)?.Values<string>().Where(x => !string.IsNullOrEmpty(x)).ToList();
+            var filterTagsRequire = (filterTags?["require"] as JArray)?.Values<string>().Where(x => !string.IsNullOrEmpty(x)).ToList();
+            var filterTagsReject = (filterTags?["reject"] as JArray)?.Values<string>().Where(x => !string.IsNullOrEmpty(x)).ToList();
+            var filterTagsRegexRequire = (filterTagsRegex?["require"] as JArray)?.Values<string>().Where(x => !string.IsNullOrEmpty(x)).ToList();
+            var filterTagsRegexReject = (filterTagsRegex?["reject"] as JArray)?.Values<string>().Where(x => !string.IsNullOrEmpty(x)).ToList();
+
+            AgentTraceFilterConfig? traceFilterConfig = null;
+            if (ignoreResources is not null || filterTagsRequire is not null || filterTagsReject is not null || filterTagsRegexRequire is not null || filterTagsRegexReject is not null)
+            {
+                traceFilterConfig = new AgentTraceFilterConfig(filterTagsRequire!, filterTagsReject!, filterTagsRegexRequire!, filterTagsRegexReject!, ignoreResources!);
+            }
 
             var discoveredEndpoints = (jObject["endpoints"] as JArray)?.Values<string>().ToArray();
             string? configurationEndpoint = null;
@@ -436,7 +453,10 @@ private async Task ProcessDiscoveryResponse(IApiResponse response)
                 containerTagsHash: _serviceRemappingHash.ContainerTagsHash, // either the value just received, or the one we stored before (prevents overriding with null)
                 clientDropP0: clientDropP0,
                 spanMetaStructs: spanMetaStructs,
-                spanEvents: spanEvents);
+                spanEvents: spanEvents,
+                peerTags: peerTags!,
+                obfuscationVersion: obfuscationVersion,
+                traceFilterConfig: traceFilterConfig);
 
             // Save the hash, whether the details we care about changed or not
             _configurationHash = HexString.ToHexString(sha256.Hash);
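
The `peer_tags` handling in the first hunk (drop empty entries, deduplicate, sort) can be tried without a JSON payload. The sketch below is a hypothetical stand-alone helper, not code from this commit. It takes an already-parsed sequence of strings and, as a deliberate difference from the diff's `OrderBy(x => x)`, sorts with `StringComparer.Ordinal` so the order does not depend on the current culture; whether the tracer needs that is an assumption.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the commit.
public static class PeerTagSketch
{
    // Remove null/empty entries, deduplicate, and sort the tag names.
    public static List<string> Normalize(IEnumerable<string?> raw) =>
        raw.Where(x => !string.IsNullOrEmpty(x))
           .Select(x => x!)
           .Distinct()
           .OrderBy(x => x, StringComparer.Ordinal) // assumption: ordinal ordering
           .ToList();
}
```

For example, `Normalize(new[] { "peer.service", "", null, "db.instance", "peer.service" })` returns a list containing `db.instance` followed by `peer.service`.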

tracer/src/Datadog.Trace/Agent/IApi.cs

Lines changed: 1 addition & 1 deletion
@@ -16,6 +16,6 @@ internal interface IApi
 
         Task<bool> SendTracesAsync(ArraySegment<byte> traces, int numberOfTraces, bool statsComputationEnabled, long numberOfDroppedP0Traces, long numberOfDroppedP0Spans, bool apmTracingEnabled = true);
 
-        Task<bool> SendStatsAsync(StatsBuffer stats, long bucketDuration);
+        Task<bool> SendStatsAsync(StatsBuffer stats, long bucketDuration, int tracerObfuscationVersion);
     }
 }

tracer/src/Datadog.Trace/Agent/IStatsAggregator.cs

Lines changed: 3 additions & 0 deletions
@@ -4,6 +4,7 @@
 // </copyright>
 
 using System;
+using System.Collections.Generic;
 using System.Threading.Tasks;
 using Datadog.Trace.SourceGenerators;
 
@@ -50,5 +51,7 @@ internal interface IStatsAggregator
         SpanCollection ProcessTrace(in SpanCollection trace);
 
         Task DisposeAsync();
+
+        StatsAggregationKey BuildKey(Span span, out List<byte[]> utf8PeerTags);
     }
 }

tracer/src/Datadog.Trace/Agent/ManagedApi.cs

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
-// <copyright file="ManagedApi.cs" company="Datadog">
+// <copyright file="ManagedApi.cs" company="Datadog">
 // Unless explicitly stated otherwise all files in this repository are licensed under the Apache 2 License.
 // This product includes software developed at Datadog (https://www.datadoghq.com/). Copyright 2017 Datadog, Inc.
 // </copyright>
@@ -62,6 +62,6 @@ void UpdateApi(ExporterSettings exporterSettings, bool healthMetricsEnabled)
     public Task<bool> SendTracesAsync(ArraySegment<byte> traces, int numberOfTraces, bool statsComputationEnabled, long numberOfDroppedP0Traces, long numberOfDroppedP0Spans, bool apmTracingEnabled = true)
         => Volatile.Read(ref _api).SendTracesAsync(traces, numberOfTraces, statsComputationEnabled, numberOfDroppedP0Traces, numberOfDroppedP0Spans, apmTracingEnabled);
 
-    public Task<bool> SendStatsAsync(StatsBuffer stats, long bucketDuration)
-        => Volatile.Read(ref _api).SendStatsAsync(stats, bucketDuration);
+    public Task<bool> SendStatsAsync(StatsBuffer stats, long bucketDuration, int tracerObfuscationVersion)
+        => Volatile.Read(ref _api).SendStatsAsync(stats, bucketDuration, tracerObfuscationVersion);
 }
