**Execution-Time Benchmarks Report ⏱️**

Execution-time results for samples comparing this PR (8450) and master.

✅ No regressions detected - check the details below.

**Full Metrics Comparison**

- FakeDbCommand
- HttpMessageHandler

**Comparison explanation**

Execution-time benchmarks measure the whole time it takes to execute a program and are intended to measure one-off costs. Cases where the execution-time results for the PR are worse than the latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:
Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

**Duration charts**

FakeDbCommand (.NET Framework 4.8)

```mermaid
gantt
    title Execution time (ms) FakeDbCommand (.NET Framework 4.8)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8450) - mean (73ms) : 70, 77
    master - mean (75ms) : 70, 79
    section Bailout
    This PR (8450) - mean (78ms) : 76, 80
    master - mean (77ms) : 75, 80
    section CallTarget+Inlining+NGEN
    This PR (8450) - mean (1,086ms) : 1042, 1129
    master - mean (1,082ms) : 1029, 1135
```
FakeDbCommand (.NET Core 3.1)

```mermaid
gantt
    title Execution time (ms) FakeDbCommand (.NET Core 3.1)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8450) - mean (115ms) : 111, 120
    master - mean (116ms) : 111, 121
    section Bailout
    This PR (8450) - mean (120ms) : 113, 127
    master - mean (116ms) : 113, 119
    section CallTarget+Inlining+NGEN
    This PR (8450) - mean (777ms) : 749, 804
    master - mean (782ms) : 748, 816
```
FakeDbCommand (.NET 6)

```mermaid
gantt
    title Execution time (ms) FakeDbCommand (.NET 6)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8450) - mean (104ms) : 98, 110
    master - mean (102ms) : 97, 106
    section Bailout
    This PR (8450) - mean (102ms) : 100, 105
    master - mean (102ms) : 100, 105
    section CallTarget+Inlining+NGEN
    This PR (8450) - mean (946ms) : 910, 982
    master - mean (941ms) : 903, 979
```
FakeDbCommand (.NET 8)

```mermaid
gantt
    title Execution time (ms) FakeDbCommand (.NET 8)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8450) - mean (104ms) : 98, 109
    master - mean (105ms) : 100, 110
    section Bailout
    This PR (8450) - mean (105ms) : 100, 111
    master - mean (102ms) : 98, 106
    section CallTarget+Inlining+NGEN
    This PR (8450) - mean (822ms) : 788, 856
    master - mean (829ms) : 789, 870
```
HttpMessageHandler (.NET Framework 4.8)

```mermaid
gantt
    title Execution time (ms) HttpMessageHandler (.NET Framework 4.8)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8450) - mean (206ms) : 196, 216
    master - mean (205ms) : 194, 217
    section Bailout
    This PR (8450) - mean (211ms) : 199, 223
    master - mean (210ms) : 201, 219
    section CallTarget+Inlining+NGEN
    This PR (8450) - mean (1,224ms) : 1163, 1284
    master - mean (1,214ms) : 1156, 1273
```
HttpMessageHandler (.NET Core 3.1)

```mermaid
gantt
    title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8450) - mean (301ms) : 280, 322
    master - mean (296ms) : 281, 311
    section Bailout
    This PR (8450) - mean (298ms) : 283, 314
    master - mean (297ms) : 280, 315
    section CallTarget+Inlining+NGEN
    This PR (8450) - mean (981ms) : 952, 1010
    master - mean (967ms) : 934, 999
```
HttpMessageHandler (.NET 6)

```mermaid
gantt
    title Execution time (ms) HttpMessageHandler (.NET 6)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8450) - mean (292ms) : 276, 307
    master - mean (289ms) : 275, 304
    section Bailout
    This PR (8450) - mean (291ms) : 278, 304
    master - mean (295ms) : 275, 314
    section CallTarget+Inlining+NGEN
    This PR (8450) - mean (1,169ms) : 1125, 1214
    master - mean (1,165ms) : 1119, 1210
```
HttpMessageHandler (.NET 8)

```mermaid
gantt
    title Execution time (ms) HttpMessageHandler (.NET 8)
    dateFormat x
    axisFormat %Q
    todayMarker off
    section Baseline
    This PR (8450) - mean (290ms) : 270, 311
    master - mean (291ms) : 268, 314
    section Bailout
    This PR (8450) - mean (292ms) : 271, 312
    master - mean (289ms) : 275, 304
    section CallTarget+Inlining+NGEN
    This PR (8450) - mean (1,069ms) : 984, 1154
    master - mean (1,071ms) : 978, 1164
```
**Benchmarks**

Benchmark execution time: 2026-04-22 20:38:55. Comparing candidate commit 4082bfd in PR branch. Found 0 performance improvements and 0 performance regressions! Performance is the same for 27 metrics, 0 unstable metrics, 62 known flaky benchmarks, 25 flaky benchmarks without significant changes.
Force-pushed from b0dfa05 to ec5f36b.
```csharp
        return factory(key);
    }

    return cache.GetOrAdd(key, factory);
```
I think we could/should optimize this pattern. Currently:

- `cache.Count` is expensive - it takes a full lock on the dictionary internally
- `cache.TryGetValue` followed by `cache.GetOrAdd(key, factory)` is two lookups on failures

Given we don't need exactly `MaxEdgeTagCacheSize` items (and AFAICT, we never remove items), I think you could optimize this by moving the method call to the TagCache type directly, storing an additional count locally there, and using that to avoid the full lock. Roughly:
```csharp
private int _edgeCacheCount = 0;
private EdgeCache _cache; // hand-waving the generic issues

public string[] GetOrCreateEdgeTags<TKey>(TKey key, Func<TKey, string[]> factory)
    where TKey : notnull, IEquatable<TKey>
{
    if (_cache.TryGetValue(key, out var existing))
    {
        return existing;
    }

    if (Volatile.Read(ref _edgeCacheCount) >= MaxEdgeTagCacheSize)
    {
        // High-cardinality key space - bypass cache to prevent unbounded memory growth
        return factory(key);
    }

    Interlocked.Increment(ref _edgeCacheCount);
    return _cache.GetOrAdd(key, factory);
}
```

We still have the two lookups when the cache limit is exceeded, but we lose the expensive `Count` call at least.
```csharp
    return new PathwayContext(new PathwayHash(hash), pathwayStartNs, edgeStartNs);
}

#if NETCOREAPP3_1_OR_GREATER
```
After we merge #8476, we can open this up more broadly and make it the only implementation 🙂
**andrewlock** left a comment
Thanks for this! I think it looks like a good plan overall; there's just the conversion to `readonly record struct` to simplify things, and the question about whether we can optimize the failure cases to avoid calling the expensive `ConcurrentDictionary.Count` property.
Pushed fixes for all comments
**andrewlock** left a comment
LGTM, just one last cleanup we can do (reference equality is the default, so we don't need a custom comparer).

Just to sense-check the limits, we have:

- ~10 different cache key types currently
- each array probably being ballpark ~200 bytes (depends on dynamic data, so hard to say)
- we cache up to ~1000 distinct arrays

So this could raise the "static" memory usage by ~2 MB (10 × 200 × 1000 bytes) if I understand correctly. Given these paths are called many times, we expect a high hit ratio, and most of them are hot paths where this has a clear impact on throughput, so this looks like a great tradeoff overall to me 👍 Thanks!
```csharp
// Keyed by string[] identity (reference equality) — safe because TagCache holds strong
// references to the cached arrays (bounded by MaxEdgeTagCacheSize).
private readonly ConcurrentDictionary<string[], NodeHash> _nodeHashCache =
    new(NodeHashCacheKeyComparer.Instance);
```
The custom comparer is not required - object comparisons use reference equality by default
Suggested change:

```diff
  // Keyed by string[] identity (reference equality) — safe because TagCache holds strong
  // references to the cached arrays (bounded by MaxEdgeTagCacheSize).
  private readonly ConcurrentDictionary<string[], NodeHash> _nodeHashCache =
-     new(NodeHashCacheKeyComparer.Instance);
+     new();
```
```csharp
}

/// <summary>
/// Reference-equality comparer for string[] keys in <see cref="_nodeHashCache"/>.
/// Two string[] objects are considered equal only when they are the same instance,
/// which is always true for the cached arrays held by <see cref="TagCache{TKey, TValue}"/>.
/// </summary>
private sealed class NodeHashCacheKeyComparer : IEqualityComparer<string[]>
{
    internal static readonly NodeHashCacheKeyComparer Instance = new();

    public bool Equals(string[]? x, string[]? y) => ReferenceEquals(x, y);

    public int GetHashCode(string[] obj) => RuntimeHelpers.GetHashCode(obj);
}
```
This isn't necessary, equality uses reference equality by default
Suggested change:

```diff
  }
- /// <summary>
- /// Reference-equality comparer for string[] keys in <see cref="_nodeHashCache"/>.
- /// Two string[] objects are considered equal only when they are the same instance,
- /// which is always true for the cached arrays held by <see cref="TagCache{TKey, TValue}"/>.
- /// </summary>
- private sealed class NodeHashCacheKeyComparer : IEqualityComparer<string[]>
- {
-     internal static readonly NodeHashCacheKeyComparer Instance = new();
-     public bool Equals(string[]? x, string[]? y) => ReferenceEquals(x, y);
-     public int GetHashCode(string[] obj) => RuntimeHelpers.GetHashCode(obj);
- }
  }
```
(If you like, prove it to yourself with this! 😄)
```csharp
var random = new Random();
var dict = new ConcurrentDictionary<string[], int>();

var a = new string[] { "Hello", "World" };
var b = new string[] { "Hello", "World" };

Console.WriteLine(dict.GetOrAdd(a, key => random.Next())); // new entry for instance a
Console.WriteLine(dict.GetOrAdd(a, key => random.Next())); // same value: same reference
Console.WriteLine(dict.GetOrAdd(b, key => random.Next())); // new entry: b is a distinct instance
Console.WriteLine(dict.GetOrAdd(b, key => random.Next())); // same value as the line above
```
DSM Per-Message Overhead Optimizations
Summary of changes
- `EdgeTagCache<TKey>` and `BacklogTagCache<TKey>` - process-wide, per-type `ConcurrentDictionary` caches that intern edge-tag arrays and backlog-tag strings so they are only allocated once per unique key (topic/group/cluster combination).
- `NodeHashCacheEntry`/`NodeHashSnapshot` mechanism inside `DataStreamsManager` that memoizes the expensive `CalculateNodeHash` result per `(edgeTags[], nodeHashBase)` pair. Reads are lock-free via a volatile field; writes acquire a per-entry lock only on cache miss or base change.
- `PathwayContextEncoder.EncodeInto` and a `Span<byte>`-based `Decode` overload; `DataStreamsContextPropagator` uses `stackalloc` buffers on .NET Core 3.1+ to avoid intermediate `byte[]` heap allocations on every produce/consume.
- `DataStreamsAggregator` and `DataStreamsManager._nodeHashCache` now use reference-equality comparers backed by `RuntimeHelpers.GetHashCode`, which is safe because all keys are interned by the caches above.
- Replaced the `Thread.Sleep` polling loop in `DataStreamsWriter` with a `ManualResetEventSlim` that wakes immediately when the queue reaches 1,000 items or after a 500 ms timeout, eliminating unnecessary context switches.
- `readonly struct` cache keys (`ConsumeEdgeTagCacheKey`, `ProduceEdgeTagCacheKey`, `CommitBacklogTagCacheKey`, `ProduceBacklogTagCacheKey`) for Kafka; equivalent structs for AWS SQS/SNS/Kinesis, Azure Service Bus, IBM MQ, and RabbitMQ.
- The `Remove(TemporaryBase64PathwayContext)` header scan is now skipped when `KafkaCreateConsumerScopeEnabled=true` (the default), avoiding an O(n) scan on every message.
- `LastConsumePathway` guard removed: dropped the redundant `!= null` guard on the produce path that required an `AsyncLocal` read before the actual `AsyncLocal` read.

Reason for change
DSM instrumentation runs on the hot path of every instrumented message. Profiling revealed that the dominant allocations were:
- the `string[]` edge-tag array on every produce/consume call
- the `CalculateNodeHash` call (hashing over all edge tags) on every checkpoint
- `byte[]` arrays for pathway context Base64 encoding/decoding

These optimizations target p99 and throughput benchmarks for Kafka, SQS, SNS, RabbitMQ, IBM MQ, Azure Service Bus, and Kinesis instrumentation.
Implementation details
Caching strategy
`EdgeTagCache<TKey>` and `BacklogTagCache<TKey>` use the static-generic-class pattern (a `static class Foo<T>` with a static field) to give each integration its own dictionary instance without any runtime dispatch. The key type is a `readonly struct` implementing `IEquatable<TKey>`, which prevents boxing in `ConcurrentDictionary` lookups.

The caches are bounded at `MaxEdgeTagCacheSize = 1000` entries. Once that limit is reached, new keys are computed on the fly (no caching) to prevent unbounded memory growth from high-cardinality identifiers.
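A minimal sketch of the pattern described above (the type and constant names follow this description, but the counting logic and method shape are assumptions, not the actual implementation):

```csharp
// Sketch only: the static-generic-class pattern gives each closed generic type
// (EdgeTagCache<KafkaKey>, EdgeTagCache<SqsKey>, ...) its own dictionary field.
using System;
using System.Collections.Concurrent;
using System.Threading;

internal static class EdgeTagCache<TKey>
    where TKey : struct, IEquatable<TKey> // struct keys implementing IEquatable avoid boxing
{
    private const int MaxEdgeTagCacheSize = 1000;

    private static readonly ConcurrentDictionary<TKey, string[]> Cache = new();
    private static int _count; // approximate bound, avoids the expensive Count property

    public static string[] GetOrCreate(TKey key, Func<TKey, string[]> factory)
    {
        if (Cache.TryGetValue(key, out var cached))
        {
            return cached; // hot path: no allocation, no factory call
        }

        if (Volatile.Read(ref _count) >= MaxEdgeTagCacheSize)
        {
            // High-cardinality key space: compute on the fly instead of growing forever
            return factory(key);
        }

        Interlocked.Increment(ref _count);
        return Cache.GetOrAdd(key, factory);
    }
}
```

Repeated calls with the same key return the same interned array instance, which is what makes the reference-equality node-hash cache below safe.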
Node-hash caching

`_nodeHashCache` is keyed by `string[]` identity (not value equality) because the arrays themselves are interned by `EdgeTagCache<TKey>`. Each entry holds a volatile `NodeHashSnapshot` (`nodeHashBase` + `NodeHash`). On every checkpoint:
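Roughly, the snapshot mechanism might look like this (`NodeHashCacheEntry` and `NodeHashSnapshot` are names from this description; the hash type and method shape are guesses for illustration):

```csharp
// Sketch: lock-free volatile read on the hot path; a per-entry lock is taken
// only on first use or when the node-hash base changes.
using System;
using System.Threading;

internal sealed class NodeHashCacheEntry
{
    private sealed record NodeHashSnapshot(ulong NodeHashBase, ulong NodeHash);

    private readonly object _lock = new();
    private volatile NodeHashSnapshot? _snapshot;

    public ulong GetOrCalculate(ulong nodeHashBase, Func<ulong, ulong> calculate)
    {
        var snapshot = _snapshot; // volatile read, no lock
        if (snapshot is not null && snapshot.NodeHashBase == nodeHashBase)
        {
            return snapshot.NodeHash; // fast path: base unchanged, reuse memoized hash
        }

        lock (_lock) // slow path: miss, or the base changed
        {
            snapshot = _snapshot;
            if (snapshot is null || snapshot.NodeHashBase != nodeHashBase)
            {
                snapshot = new NodeHashSnapshot(nodeHashBase, calculate(nodeHashBase));
                _snapshot = snapshot; // publish the immutable snapshot
            }

            return snapshot.NodeHash;
        }
    }
}
```

Because the snapshot is immutable and published via a volatile field, readers never see a base/hash pair that is torn across two writes.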
Zero-allocation encode/decode

`PathwayContextEncoder.EncodeInto(PathwayContext, Span<byte>)` writes directly into a caller-supplied buffer. `DataStreamsContextPropagator` allocates `MaxEncodedSize` (26 bytes) and `MaxBase64EncodedSize` (36 bytes) buffers with `stackalloc` and uses `Base64.EncodeToUtf8`/`DecodeFromUtf8` in place. The only unavoidable allocation is the final `ToArray()` passed to `headers.Add`, because Kafka takes ownership of the byte array.

This path is guarded by `#if NETCOREAPP3_1_OR_GREATER`; .NET Framework falls back to the original heap-allocating path.
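The stackalloc + `Base64.EncodeToUtf8` shape can be sketched like this (the buffer sizes come from this description; `PathwayHeaderEncoding`/`ToBase64Header` are hypothetical names for illustration):

```csharp
// Sketch: encode a pathway context payload to Base64 without an intermediate
// heap byte[] - only the final header array handed to Kafka is allocated.
using System;
using System.Buffers.Text;

internal static class PathwayHeaderEncoding
{
    private const int MaxEncodedSize = 26;                                   // binary context bytes
    private const int MaxBase64EncodedSize = ((MaxEncodedSize + 2) / 3) * 4; // = 36

    public static byte[] ToBase64Header(ReadOnlySpan<byte> encodedContext)
    {
        Span<byte> base64 = stackalloc byte[MaxBase64EncodedSize];
        Base64.EncodeToUtf8(encodedContext, base64, out _, out var written);

        // The one unavoidable allocation: Kafka takes ownership of the header byte[].
        return base64[..written].ToArray();
    }
}
```

Note that 26 input bytes Base64-encode to ceil(26/3) × 4 = 36 output bytes, which is where the two constants relate.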
Drain signal

`DataStreamsWriter` previously slept 10 ms unconditionally between drain iterations, burning CPU and adding ~10 ms of latency per batch even under load. The new `ManualResetEventSlim` is signalled immediately when either queue exceeds `DrainThreshold` (1,000 items), capping worst-case latency at `DrainTimeoutMs` (500 ms) while eliminating idle wakeups.
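A simplified sketch of that wake-up logic (the constants follow this description; the real writer's queue and flush handling are more involved, and `DrainSignal` is a hypothetical name):

```csharp
// Sketch: replace a fixed Thread.Sleep(10) polling loop with an event that is
// set as soon as a full batch is queued, and otherwise times out after 500 ms.
using System;
using System.Threading;

internal sealed class DrainSignal
{
    private const int DrainThreshold = 1000;
    private const int DrainTimeoutMs = 500;

    private readonly ManualResetEventSlim _wake = new(initialState: false);
    private int _queued;

    public void OnItemEnqueued()
    {
        // Wake the writer immediately once a full batch is ready
        if (Interlocked.Increment(ref _queued) == DrainThreshold)
        {
            _wake.Set();
        }
    }

    public void WaitForNextDrain()
    {
        // Returns early when signalled; otherwise waits at most 500 ms
        _wake.Wait(DrainTimeoutMs);
        _wake.Reset();
        Interlocked.Exchange(ref _queued, 0);
    }
}
```

Compared with unconditional sleeping, the event-based wait removes idle wakeups entirely while keeping a hard upper bound on batch latency.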
Test coverage

- `DataStreamsManagerTests`: new unit tests verify that `GetOrCreateEdgeTags` and `GetOrCreateBacklogTags` return the same array/string reference on repeated calls with the same key, and distinct references for different keys. Tests cover Kafka produce/consume, RabbitMQ produce/consume, and generic key types.
- `PathwayContextEncoderTests`: existing encode/decode round-trip tests pass against the new `Span<byte>` overloads.

Other details
- The `MaxEdgeTagCacheSize` constant is `internal` to allow unit tests to verify the overflow/bypass behavior.
- .NET Framework code paths are unchanged - all `Span`-based optimizations are gated behind `#if NETCOREAPP3_1_OR_GREATER`.