[Debugger] Add memory pressure monitoring telemetry for Dynamic Instrumentation (observe-only)#7834
Conversation
d0b375f to
61a1154
Compare
61a1154 to
7d4161f
Compare
This comment has been minimized.
This comment has been minimized.
7d4161f to
82d7293
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds an observe-only memory pressure monitor for Dynamic Instrumentation (DI) and emits low-cardinality telemetry on high-pressure transitions (enter/exit), plus a one-shot “disabled” counter when signals are unavailable or the monitor errors.
Changes:
- Introduces
MemoryPressureMonitor+ supporting system/GC signal readers (incl. Windows interop) and wires it into DI via a dedicated refresh hook. - Adds new telemetry metric definitions (counts + shared distributions) and regenerates telemetry collector code for all TFMs.
- Adds focused unit/integration tests validating monitor behavior, transitions, and telemetry aggregation.
Reviewed changes
Copilot reviewed 14 out of 54 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tracer/src/Datadog.Trace/Debugger/RateLimiting/MemoryPressureMonitor.cs | New monitor implementation + transition/disable telemetry emission |
| tracer/src/Datadog.Trace/Debugger/RateLimiting/MemoryPressureConfig.cs | Configuration defaults and thresholds/hysteresis knobs |
| tracer/src/Datadog.Trace/Debugger/RateLimiting/SystemMemorySource.cs | Production memory/GC signal providers |
| tracer/src/Datadog.Trace/Debugger/RateLimiting/WindowsMemoryInfo.cs | Windows GlobalMemoryStatusEx interop helper |
| tracer/src/Datadog.Trace/Debugger/DynamicInstrumentation.cs | Owns/disposes monitor and exposes dedicated refresh hook |
| tracer/src/Datadog.Trace/Debugger/Expressions/ProbeProcessor.cs | Calls the dedicated memory-pressure refresh hook around capture-related activity |
| tracer/src/Datadog.Trace/Debugger/DebuggerFactory.cs | Constructs and injects the monitor into DI |
| tracer/src/Datadog.Trace/Debugger/Caching/DefaultMemoryChecker.cs | Reuses new Windows memory helper instead of local interop copy |
| tracer/src/Datadog.Trace/Telemetry/Metrics/MetricTags.cs | Adds low-cardinality tag enums for debugger memory-pressure metrics |
| tracer/src/Datadog.Trace/Telemetry/Metrics/Count.cs | Adds count metrics for transitions + disabled reasons |
| tracer/src/Datadog.Trace/Telemetry/Metrics/DistributionShared.cs | Adds shared distributions for severity + duration |
| tracer/src/Datadog.Trace/Generated/netstandard2.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/NullMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: null-collector stubs for new shared distributions |
| tracer/src/Datadog.Trace/Generated/netstandard2.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/NullMetricsTelemetryCollector_Count.g.cs | Regenerated: null-collector stubs for new counts |
| tracer/src/Datadog.Trace/Generated/netstandard2.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/MetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: buffers + recorders for new shared distributions |
| tracer/src/Datadog.Trace/Generated/netstandard2.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/MetricsTelemetryCollector_Count.g.cs | Regenerated: buffers + recorders for new counts |
| tracer/src/Datadog.Trace/Generated/netstandard2.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/IMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: interface methods for new shared distributions |
| tracer/src/Datadog.Trace/Generated/netstandard2.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/IMetricsTelemetryCollector_Count.g.cs | Regenerated: interface methods for new counts |
| tracer/src/Datadog.Trace/Generated/netstandard2.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/DistributionSharedExtensions.g.cs | Regenerated: names/common-flag/namespaces for new shared distributions |
| tracer/src/Datadog.Trace/Generated/netstandard2.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CountExtensions.g.cs | Regenerated: names/common-flag/namespaces for new counts |
| tracer/src/Datadog.Trace/Generated/netstandard2.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CiVisibilityMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: CI visibility collector distribution buffers updated |
| tracer/src/Datadog.Trace/Generated/netstandard2.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CiVisibilityMetricsTelemetryCollector_Count.g.cs | Regenerated: CI visibility collector count buffers updated |
| tracer/src/Datadog.Trace/Generated/netcoreapp3.1/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/NullMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: null-collector stubs for new shared distributions |
| tracer/src/Datadog.Trace/Generated/netcoreapp3.1/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/NullMetricsTelemetryCollector_Count.g.cs | Regenerated: null-collector stubs for new counts |
| tracer/src/Datadog.Trace/Generated/netcoreapp3.1/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/MetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: buffers + recorders for new shared distributions |
| tracer/src/Datadog.Trace/Generated/netcoreapp3.1/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/MetricsTelemetryCollector_Count.g.cs | Regenerated: buffers + recorders for new counts |
| tracer/src/Datadog.Trace/Generated/netcoreapp3.1/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/IMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: interface methods for new shared distributions |
| tracer/src/Datadog.Trace/Generated/netcoreapp3.1/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/IMetricsTelemetryCollector_Count.g.cs | Regenerated: interface methods for new counts |
| tracer/src/Datadog.Trace/Generated/netcoreapp3.1/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/DistributionSharedExtensions.g.cs | Regenerated: names/common-flag/namespaces for new shared distributions |
| tracer/src/Datadog.Trace/Generated/netcoreapp3.1/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CountExtensions.g.cs | Regenerated: names/common-flag/namespaces for new counts |
| tracer/src/Datadog.Trace/Generated/netcoreapp3.1/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CiVisibilityMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: CI visibility collector distribution buffers updated |
| tracer/src/Datadog.Trace/Generated/netcoreapp3.1/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CiVisibilityMetricsTelemetryCollector_Count.g.cs | Regenerated: CI visibility collector count buffers updated |
| tracer/src/Datadog.Trace/Generated/net6.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/NullMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: null-collector stubs for new shared distributions |
| tracer/src/Datadog.Trace/Generated/net6.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/NullMetricsTelemetryCollector_Count.g.cs | Regenerated: null-collector stubs for new counts |
| tracer/src/Datadog.Trace/Generated/net6.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/MetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: buffers + recorders for new shared distributions |
| tracer/src/Datadog.Trace/Generated/net6.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/MetricsTelemetryCollector_Count.g.cs | Regenerated: buffers + recorders for new counts |
| tracer/src/Datadog.Trace/Generated/net6.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/IMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: interface methods for new shared distributions |
| tracer/src/Datadog.Trace/Generated/net6.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/IMetricsTelemetryCollector_Count.g.cs | Regenerated: interface methods for new counts |
| tracer/src/Datadog.Trace/Generated/net6.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/DistributionSharedExtensions.g.cs | Regenerated: names/common-flag/namespaces for new shared distributions |
| tracer/src/Datadog.Trace/Generated/net6.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CountExtensions.g.cs | Regenerated: names/common-flag/namespaces for new counts |
| tracer/src/Datadog.Trace/Generated/net6.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CiVisibilityMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: CI visibility collector distribution buffers updated |
| tracer/src/Datadog.Trace/Generated/net6.0/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CiVisibilityMetricsTelemetryCollector_Count.g.cs | Regenerated: CI visibility collector count buffers updated |
| tracer/src/Datadog.Trace/Generated/net461/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/NullMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: null-collector stubs for new shared distributions |
| tracer/src/Datadog.Trace/Generated/net461/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/NullMetricsTelemetryCollector_Count.g.cs | Regenerated: null-collector stubs for new counts |
| tracer/src/Datadog.Trace/Generated/net461/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/MetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: buffers + recorders for new shared distributions |
| tracer/src/Datadog.Trace/Generated/net461/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/MetricsTelemetryCollector_Count.g.cs | Regenerated: buffers + recorders for new counts |
| tracer/src/Datadog.Trace/Generated/net461/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/IMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: interface methods for new shared distributions |
| tracer/src/Datadog.Trace/Generated/net461/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/IMetricsTelemetryCollector_Count.g.cs | Regenerated: interface methods for new counts |
| tracer/src/Datadog.Trace/Generated/net461/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/DistributionSharedExtensions.g.cs | Regenerated: names/common-flag/namespaces for new shared distributions |
| tracer/src/Datadog.Trace/Generated/net461/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CountExtensions.g.cs | Regenerated: names/common-flag/namespaces for new counts |
| tracer/src/Datadog.Trace/Generated/net461/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CiVisibilityMetricsTelemetryCollector_DistributionShared.g.cs | Regenerated: CI visibility collector distribution buffers updated |
| tracer/src/Datadog.Trace/Generated/net461/Datadog.Trace.SourceGenerators/TelemetryMetricGenerator/CiVisibilityMetricsTelemetryCollector_Count.g.cs | Regenerated: CI visibility collector count buffers updated |
| tracer/test/Datadog.Trace.Tests/Debugger/RateLimiting/MemoryPressureMonitorTests.cs | New unit tests for thresholds, hysteresis, disable paths, and concurrency behavior |
| tracer/test/Datadog.Trace.Tests/Debugger/DynamicInstrumentationTests.cs | Integration tests for refresh hook usage + monitor disposal |
| tracer/test/Datadog.Trace.Tests/Telemetry/Collectors/MetricsTelemetryCollectorTests.cs | Adds aggregation test for new memory-pressure metrics |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Execution-Time Benchmarks Report ⏱️Execution-time results for samples comparing This PR (7834) and master. ✅ No regressions detected - check the details below Full Metrics ComparisonFakeDbCommand
HttpMessageHandler
Comparison explanationExecution-time benchmarks measure the whole time it takes to execute a program, and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:
Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph). Duration chartsFakeDbCommand (.NET Framework 4.8)gantt
title Execution time (ms) FakeDbCommand (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (7834) - mean (76ms) : 71, 80
master - mean (76ms) : 72, 80
section Bailout
This PR (7834) - mean (79ms) : 74, 83
master - mean (79ms) : 75, 82
section CallTarget+Inlining+NGEN
This PR (7834) - mean (1,111ms) : 1059, 1164
master - mean (1,109ms) : 1054, 1164
FakeDbCommand (.NET Core 3.1)gantt
title Execution time (ms) FakeDbCommand (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (7834) - mean (116ms) : 110, 121
master - mean (115ms) : 108, 121
section Bailout
This PR (7834) - mean (114ms) : 111, 117
master - mean (115ms) : 112, 119
section CallTarget+Inlining+NGEN
This PR (7834) - mean (789ms) : 759, 818
master - mean (797ms) : 773, 821
FakeDbCommand (.NET 6)gantt
title Execution time (ms) FakeDbCommand (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (7834) - mean (105ms) : 98, 111
master - mean (101ms) : 98, 104
section Bailout
This PR (7834) - mean (103ms) : 99, 106
master - mean (107ms) : 102, 111
section CallTarget+Inlining+NGEN
This PR (7834) - mean (952ms) : 914, 990
master - mean (950ms) : 916, 983
FakeDbCommand (.NET 8)gantt
title Execution time (ms) FakeDbCommand (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (7834) - mean (102ms) : 96, 108
master - mean (100ms) : 98, 103
section Bailout
This PR (7834) - mean (104ms) : 99, 109
master - mean (101ms) : 99, 103
section CallTarget+Inlining+NGEN
This PR (7834) - mean (825ms) : 787, 863
master - mean (825ms) : 788, 862
HttpMessageHandler (.NET Framework 4.8)gantt
title Execution time (ms) HttpMessageHandler (.NET Framework 4.8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (7834) - mean (201ms) : 196, 205
master - mean (201ms) : 197, 205
section Bailout
This PR (7834) - mean (204ms) : 200, 208
master - mean (204ms) : 199, 209
section CallTarget+Inlining+NGEN
This PR (7834) - mean (1,213ms) : 1160, 1266
master - mean (1,214ms) : 1173, 1256
HttpMessageHandler (.NET Core 3.1)gantt
title Execution time (ms) HttpMessageHandler (.NET Core 3.1)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (7834) - mean (289ms) : 283, 295
master - mean (292ms) : 285, 299
section Bailout
This PR (7834) - mean (289ms) : 284, 294
master - mean (291ms) : 285, 297
section CallTarget+Inlining+NGEN
This PR (7834) - mean (966ms) : 944, 987
master - mean (968ms) : 947, 989
HttpMessageHandler (.NET 6)gantt
title Execution time (ms) HttpMessageHandler (.NET 6)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (7834) - mean (280ms) : 273, 288
master - mean (280ms) : 274, 287
section Bailout
This PR (7834) - mean (279ms) : 271, 286
master - mean (279ms) : 274, 284
section CallTarget+Inlining+NGEN
This PR (7834) - mean (1,165ms) : 1132, 1197
master - mean (1,163ms) : 1120, 1207
HttpMessageHandler (.NET 8)gantt
title Execution time (ms) HttpMessageHandler (.NET 8)
dateFormat x
axisFormat %Q
todayMarker off
section Baseline
This PR (7834) - mean (278ms) : 269, 286
master - mean (280ms) : 274, 286
section Bailout
This PR (7834) - mean (278ms) : 272, 285
master - mean (280ms) : 273, 286
section CallTarget+Inlining+NGEN
This PR (7834) - mean (1,043ms) : 997, 1088
master - mean (1,042ms) : 1000, 1083
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
BenchmarksBenchmark execution time: 2026-06-03 17:08:56 Comparing candidate commit e6b821c in PR branch Found 0 performance improvements and 2 performance regressions! Performance is the same for 70 metrics, 0 unstable metrics, 62 known flaky benchmarks, 64 flaky benchmarks without significant changes.
|
cc85899 to
f639b66
Compare
| AppDomain.CurrentDomain.AssemblyLoad += CheckUnboundProbes; | ||
| assemblyLoadSubscribed = true; | ||
| StartBackgroundProcess(); | ||
| Volatile.Write(ref _initializationState, 2); |
| // Refresh writes these fields through a single CAS-guarded writer; every cross-thread access | ||
| // goes through Volatile.Read/Volatile.Write (and Interlocked where a read-modify-write is needed). | ||
| // We deliberately use the Volatile.* helpers rather than the `volatile` keyword so the idiom is | ||
| // uniform across the int/long fields too (`long` cannot be marked `volatile`). |
There was a problem hiding this comment.
TIL: cannot mark a long field volatile! Now I understand why you are using it so scarcely while in Java we are using it more often :/
There was a problem hiding this comment.
that being said, why not using a sub object with all stats and using a volatile ref to update in one shot for the refresh?
That way you have only one place to manage the concurency.
example:
class stats
{
private long _lastGen2Count;
private long _lastRefreshMs;
private long _highPressureStartMs;
private bool _hasHighPressureStart;
private int _highStreak;
private int _lowStreak;
private bool _hasGen2Baseline;
}
void refresh()
{
Stats stats = new stats();
stats._lastGen2Count = ...;
// ...
stats._hasGen2Baseline = ...;
currentStats = stats; // currentStats is volatile field, assigned
}
then the reader just need to Volatile read currentStats to snapshot the values then use them locally
There was a problem hiding this comment.
But yes the downside of this is one alloc for stats at each call of refresh.
There was a problem hiding this comment.
You can use Interlocked.Exchange(ref _long) though, which is generally preferable anyway IMO, but snapshotting is probably still the better design IMO.
In terms of allocations, you can use a "front and back buffer object with atomic exchange" - that way you create 2 objects for the life of the process 🙂
There was a problem hiding this comment.
You can use Interlocked.Exchange(ref _long) though
Yes but for this case volatile is enough imo
but snapshotting is probably still the better design IMO.
In terms of allocations, you can use a "front and back buffer object with atomic exchange" - that way you create 2 objects for the life of the process 🙂
I like it. Currently, no one really uses it in production, but I would definitely do that in the future if these values become part of a decision-making process. Noted.
There was a problem hiding this comment.
On second thought, I need to get back to this PR anyway because of the distribution metrics limitation, so I might handle this while I'm at it.
There was a problem hiding this comment.
On third thought 😆
I tried the two-buffer snapshot approach locally, but I’m leaning toward keeping the current simpler publication model for now.
It does give a more coherent internal snapshot, but it also adds quite a bit of complexity/versioning, more synchronization.
I think it’s worth revisiting if/when this becomes decision-making state or we expose a single snapshot API, but for the current observe-only telemetry use, I’d prefer to keep the simpler volatile fields.
| // Only ever called from inside the CAS-guarded Refresh, so no synchronization is required. | ||
| private bool DisableCore(MetricTags.DebuggerMemoryPressureDisabledReason reason) |
There was a problem hiding this comment.
For my knowledge as probably I missed something: if it is only used in Refresh method why not put it like ComputeNextHigh or ToScaledInt as "inner method" of Referesh?
There was a problem hiding this comment.
The reasoning behind that is that, unlike ComputeNextHigh and ToScaledInt, which are purely local helpers and should remain local functions, DisableCore has meaning beyond the refresh algorithm, and I haven't fully settled on its future shape yet (since this PR is observe-only).
But you didn't miss anything. For this PR, it is currently only called from Refresh, as you noted.
Done 5c99511
andrewlock
left a comment
There was a problem hiding this comment.
As discussed, I have concerns that these metrics won't actually be usable, because we don't tag them with service identity. So you have no way to compare between enter/exit conditions? 🤔 However, I now realise that you're relying on distributions and going to try to do some statistical comparison between them? I'm not sure..
However, the count metrics are fine in general, but we literally don't support distribution metrics at the moment (intentionally, because they have perf problems), so we'll need to remove those.
The missing service tag isn't a blocker. We dont have to pair a given app's enter with its own exit. We want to be able to say: |
6946fbf to
5c99511
Compare
I changed the memory-pressure severity/duration metrics from distributions to bucketed count metrics. |
andrewlock
left a comment
There was a problem hiding this comment.
I haven't reviewed the Debugger code, just the shared metrics code. The important thing is you'll need to merge your changes into the telemetry intake before these new metrics will be accepted.
| #nullable enable | ||
| using System.Diagnostics.CodeAnalysis; | ||
| using Datadog.Trace.SourceGenerators; | ||
| using NS = Datadog.Trace.Telemetry.MetricNamespaceConstants; |
There was a problem hiding this comment.
| using NS = Datadog.Trace.Telemetry.MetricNamespaceConstants; |
There was a problem hiding this comment.
86f6bb3cd59a512bd61d8c4fe63ffa9e4d6f31ae
| ], | ||
| "metric_type": "count", | ||
| "data_type": "transitions", | ||
| "description": "The number of Dynamic Instrumentation memory-pressure state transitions, tagged by state and the signal that triggered entry", |
There was a problem hiding this comment.
It's best to include the actual state tags allowed here, e.g.
| "description": "The number of Dynamic Instrumentation memory-pressure state transitions, tagged by state and the signal that triggered entry", | |
| "description": "The number of Dynamic Instrumentation memory-pressure state transitions, tagged by state (`state:enter` or `state:exit`) and the signal that triggered entry (`trigger:none`, `trigger:memory`, `trigger:gen2`, `trigger:both`)", |
The same applies to all the other definitions here.
Additionally, these metrics need to be added to the telemetry intake, and deployed, otherwise they will be blocked. These metrics are common across all languages too, just FYI, so we should try to keep the tags applicable to a wide range of languages if possible (e.g. gen2 is very .NET specific, maybe we can use a more generic tag name?)
There was a problem hiding this comment.
86f6bb3cd59a512bd61d8c4fe63ffa9e4d6f31ae
|
|
||
| /// <summary> | ||
| /// Count of Dynamic Instrumentation high-memory-pressure periods bucketed by duration, recorded on exit. | ||
| /// </summary> | ||
| [TelemetryMetric<MetricTags.DebuggerMemoryPressureDurationBucket>("memory_pressure.duration_ms", isCommon: true, NS.LiveDebugger)] DebuggerMemoryPressureDurationMs, |
There was a problem hiding this comment.
Is this really what you want? The total duration of high-memory pressure periods? Maybe it is, just feels kind of strange to me 😅
| /// <summary> | ||
| /// Count of Dynamic Instrumentation memory-pressure transitions bucketed by Gen2 collections per second at the transition. | ||
| /// </summary> | ||
| [TelemetryMetric<MetricTags.DebuggerMemoryPressureState, MetricTags.DebuggerMemoryPressureGen2Bucket>("memory_pressure.gen2_per_sec", isCommon: true, NS.LiveDebugger)] DebuggerMemoryPressureGen2PerSec, |
There was a problem hiding this comment.
It seems a bit weird to have a count of a rate... I guess this is just trying to hack around current lack of distribution support? 🤔
| { | ||
| [Description("bucket:lt_70")] LessThan70, | ||
| [Description("bucket:70_80")] From70To80, | ||
| [Description("bucket:80_85")] From80To85, |
There was a problem hiding this comment.
What if it's 80%? Which bucket does it go in? Might be preferable to make that clear
There was a problem hiding this comment.
86f6bb3cd59a512bd61d8c4fe63ffa9e4d6f31ae
Co-authored-by: Cursor <cursoragent@cursor.com>
Remove the IGCInfoProvider/IHighResolutionClock/IMemoryPressureMonitor/ ISamplerScheduler abstractions and their System* implementations in favor of a single SystemMemorySource. Update the monitor, config, and debugger consumers accordingly, and adjust tests (dropping the fake clock/GC/ scheduler helpers).
Add debugger.memory_pressure.* count/distribution metrics and supporting metric tags (state/trigger/disabled-reason), plus regenerated telemetry collector source. Update collector tests.
3990b8f to
e6b821c
Compare
Summary of changes
Reason for change
Dynamic Instrumentation can perform memory-sensitive work when capturing variables, creating snapshots, serializing payloads, and enqueueing/uploading data.
Before using memory pressure to affect DI behavior, we need production data showing how often high pressure occurs, what drives it, how severe each signal is at entry, how long episodes last, where the signal is even available, and whether it is stable enough for future throttling decisions. This PR adds exactly that observability and nothing else.
Implementation details
The monitor detects high pressure from concrete runtime/system signals only:
GC.GetGCMemoryInfo()(MemoryLoadBytes / TotalAvailableMemoryBytes). This is container/cgroup-aware, and is system/container-wide load rather than process-private bytes - the right scope for "are we close to an OOM".GlobalMemoryStatusEx(dwMemoryLoad, clamped to[0,1]). This is machine-wide and not container/job-object aware.GC.CollectionCount(2)deltas over elapsed time. Identical and process-specific on every platform.Sampling is activity-driven rather than always-on. DI calls into the monitor around capture/snapshot/upload-relevant work, and the monitor refreshes only when the last sample is stale (default min interval
1s). If DI is idle, the monitor does not continuously sample memory.Time is read from
Environment.TickCount64on .NET Core 3.0+ (a monotonicStopwatch-based fallback is used on net461/netstandard2.0), so the hotRefreshIfStale()fast path is a couple of volatile reads with no clock abstraction.If neither memory nor GC signals are available on the host, or a provider throws, the monitor permanently disables itself (it never samples again), writes a single log line (debug for "no signals", error for an exception), and records a one-shot
disabledcounter tagged by reason so a silent zero in the transition metrics is explainable rather than ambiguous.Telemetry is emitted on transitions only (enter high pressure / exit high pressure):
state(enter/exit) andtrigger(memory/gen2/bothon enter;noneon exit) - i.e. which signal drove entry.state).No high-cardinality tags such as probe ID, service name, file path, exception type, or method name are added.
flowchart TD diStart["DynamicInstrumentation starts"] --> createMonitor["Create MemoryPressureMonitor"] createMonitor --> idleMonitor["No background sampling while idle"] probeActivity["DI capture/snapshot/upload activity"] --> refreshIfStale["RefreshIfStale (min interval)"] refreshIfStale --> signalsAvailable{"Memory or GC signal available?"} signalsAvailable -->|"No / provider error"| disable["Disable monitor (logged) + disabled{reason}"] signalsAvailable -->|"Yes"| sampleSignals["Sample memory load and Gen2/sec"] sampleSignals --> pressureDecision{"High pressure transition?"} pressureDecision -->|"ENTER"| enterTelemetry["transitions{state:enter, trigger:*} + severity"] pressureDecision -->|"EXIT"| exitTelemetry["transitions{state:exit, trigger:none} + severity + duration"] pressureDecision -->|"No transition"| noOp["No telemetry emission"] enterTelemetry --> observeOnly["No DI behavior change"] exitTelemetry --> observeOnly noOp --> observeOnly disable --> observeOnly diStart --> dispose["DynamicInstrumentation.Dispose"] dispose --> stopMonitor["Dispose monitor (lock-free)"]Risk is kept low by avoiding per-second telemetry gauges, background sampling while idle, probe-level cardinality and allocation on hot paths. Refreshes are guarded by a non-blocking single-writer CAS so concurrent DI activity does not run overlapping samples.
Metrics implementation
All metrics are instrumentation-telemetry metrics, defined as common metrics (
isCommon: true) under thelive_debuggernamespace. They surface in the backend asdd.instrumentation_telemetry_data.live_debugger.memory_pressure.*. Tags are fixed, low-cardinality enums.memory_pressure.transitionsstate=enter|exit,trigger=none|memory|gen2|bothtriggeris the entry cause onenter;noneonexit.memory_pressure.disabledreason=no_signals|errormemory_pressure.memory_usage_pctstate=enter|exit,bucket=lt_70|70_80|80_85|85_90|gte_90memory_pressure.gen2_per_secstate=enter|exit,bucket=lt_1|1_2|2_5|gte_5memory_pressure.duration_msbucket=lt_1s|1_5s|5_30s|gte_30sexit- count bucketed by length of the high-pressure episode.Runtime/platform segmentation (runtime name, OS, architecture, tracer version) is not added as per-metric tags; it comes from the telemetry payload's
application/hostmetadata and is applied at query time in the backend. This keeps series count bounded while still allowing per-runtime analysis.The severity and duration metrics use normal telemetry
countmetrics with a fixedbuckettag instead of distributions. Distribution telemetry is not supported for this internal telemetry path, and bucketed counts preserve the decision-making signal with bounded cardinality.The
triggertag lives only on the transition count, not on the severity bucket counts: iftrigger:memory, the memory bucket already captures how severe the memory signal was, and splitting every bucket by trigger would mostly restate the transition breakdown with extra series. The transition count'striggerplus thestate-tagged bucket counts answer "what drove it" and "how severe / how long".Dashboards and how we will use the data
The goal is a single dashboard that answers, per runtime/OS, whether DI should ever gate capture under memory pressure and, if so, on which signal and at what threshold.
reasonmemory_pressure.disabledtriggermemory_pressure.transitions{state:enter}memory_pressure.transitions{state:enter}memoryvsgen2vsbothmixmemory_pressure.memory_usage_pct{state:enter}memory_pressure.gen2_per_sec{state:enter}memory_pressure.duration_msEvery widget can be grouped by the dimensions the telemetry intake stamps onto each series from the payload's
application/hostmetadata.How the data is gathered:
memory_pressure.disabled(coverage) →memory_pressure.transitions{state:enter}bytrigger(frequency + cause) → severity distributions (thresholds) → duration (persistence).Scope: what this telemetry answers (and what it doesn't)
This is a fleet-aggregate, pre-decision dataset: it tells us whether high memory pressure occurs during DI work often, severely, and long enough to justify gating capture, and on which signal at what threshold.
It is deliberately not a tool for measuring the impact of a future gate.
Test coverage
Added/updated tests cover:
reason:no_signals, no further sampling.reason:error, no crash.triggerclassification:gen2when only GC drives entry,bothwhen memory and GC cross together,memory/noneon the enter/exit pair.GlobalMemoryStatusExinterop returns valid values.DynamicInstrumentation.state+trigger, thedisabledcounter, and the three bucketed count metrics aggregate into the expected series/tags.Other details
Initial thresholds are observation thresholds, not throttling thresholds:
0.85.0.80.2collections/sec.1collection/sec.These values are intentionally conservative starting points for data collection and can be tuned after production telemetry is available.
This PR intentionally does not include an explicit/manual pressure event signal. There is no concrete production caller or well-defined meaning for that signal yet. If a future DI path has a specific condition, such as snapshot capture exceeding a memory budget, it should add a named API and metric for that scenario.
Any future DI throttling or blocking based on memory pressure should be done in a separate PR after reviewing the collected telemetry.
Note
Incidental DynamicInstrumentation lifecycle hardening (not memory-pressure specific): Wiring the new monitor in as a disposable owned by DynamicInstrumentation, exposed pre-existing races in the start/subscribe/dispose paths. DynamicInstrumentation is disposed at runtime via RCM, not just at shutdown. Since the monitor's lifecycle is tied to those paths, we hardened them here.