Skip to content

[Metrics] Implement cumulative aggregation for async Long and Double counters in MetricPoint#7227

Open
nabutabu wants to merge 12 commits into
open-telemetry:mainfrom
nabutabu:otel-7215-metricpointaggregation
Open

[Metrics] Implement cumulative aggregation for async Long and Double counters in MetricPoint#7227
nabutabu wants to merge 12 commits into
open-telemetry:mainfrom
nabutabu:otel-7215-metricpointaggregation

Conversation

@nabutabu
Copy link
Copy Markdown
Contributor

Fixes #7215
Design discussion issue NA

Changes

Use InterlockedHelper.Add in Update calls and reset aggregation values in TakeSnapshot using Interlocked.Exchange(ref this.runningValue.AsLong, 0);

Merge requirement checklist

  • CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)
  • Unit tests added/updated
  • Appropriate CHANGELOG.md files updated for non-trivial changes
  • Changes in public API reviewed (if applicable)

@github-actions github-actions Bot added the pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package label Apr 30, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 30, 2026

Codecov Report

❌ Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 89.85%. Comparing base (f3e872f) to head (a265e27).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/OpenTelemetry/Metrics/AggregatorStore.cs 94.44% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #7227      +/-   ##
==========================================
- Coverage   89.86%   89.85%   -0.01%     
==========================================
  Files         276      276              
  Lines       14007    14023      +16     
==========================================
+ Hits        12588    12601      +13     
- Misses       1419     1422       +3     
Flag Coverage Δ
unittests-Project-Experimental 89.70% <95.45%> (-0.13%) ⬇️
unittests-Project-Stable 89.63% <95.45%> (-0.22%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...c/OpenTelemetry/Metrics/MetricPoint/MetricPoint.cs 94.80% <100.00%> (+0.02%) ⬆️
src/OpenTelemetry/Metrics/AggregatorStore.cs 88.29% <94.44%> (+2.01%) ⬆️

... and 6 files with indirect coverage changes

Comment thread src/OpenTelemetry/Metrics/MetricPoint/MetricPoint.cs

case AggregationType.LongSumIncomingCumulative:
{
Interlocked.Add(ref this.runningValue.AsLong, number);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct. If incoming is cumulative, then it is already the sum. We (i.e SDK) shouldn't add it.

Copy link
Copy Markdown
Member

@cijothomas cijothomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

… Double counters in MetricPoint"

This reverts commit 76d6a88.
@nabutabu
Copy link
Copy Markdown
Contributor Author

Fix was incorrect and the issue was that I did not take into account a couple of things which the failing tests also pointed out. Changing Update/UpdateWithExemplar was incorrect because each reader has its own AggregatorStore and so the same measurement stream feeds both views. Two measurements with different raw tags but the same filtered tags return the same index from FindMetricAggregatorsCustomTag. Then UpdateLongMetricPoint is called twice on that same index which then means that the second Exchange call overwrites the first.

Basically if a MetricPoint is in state CollectPending and is IsAsynchronous then the value sent by UpdateLongCustomTags is the existingRunningValue + newValue instead of just newValue.

Comment thread src/OpenTelemetry/Metrics/AggregatorStore.cs Outdated
Comment thread src/OpenTelemetry/Metrics/AggregatorStore.cs Outdated
@martincostello martincostello requested a review from cijothomas May 1, 2026 07:05
@nabutabu nabutabu marked this pull request as ready for review May 4, 2026 16:07
@nabutabu nabutabu requested a review from a team as a code owner May 4, 2026 16:07
ref var metricPoint = ref this.metricPoints[index];
if (metricPoint.MetricPointStatus == MetricPointStatus.CollectPending)
{
value += metricPoint.GetRunningValueLong();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of modifying the incoming value to achieve "spatial aggregation" is good - but it does create incorrect Exemplar values.

It is not that bad - as Exemplars have limited utility in Observable instruments anyway and Exemplars are disabled also by-default for Observable. We should see if the fix can keep Exemplars correct as well. (Exemplar should get the original unmodified value, but Metricpoint should get the updated one..)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Read more about Exemplars and this does complicate this issue a bit. Blurting out my understanding of what's happening for future me:

Exemplars are defined as: "access to the raw measurement value, time stamp when measurement was made, and trace context", which means each exemplar recording should be the raw reading instead of the cumulative value that the (non-exemplar) async cumulative instrument uses.

This is important because each exemplar metric recording is also associated with a trace. If exemplars stored aggregated values, we'd lose the connection to individual traces for each recording. The exemplar's purpose is to preserve the context of specific measurements that contribute to the aggregate, which this current change loses.

I'll come back when I have more on how we could fix this.

  • First thought is to add a parameter to UpdateWithExemplar in MetricPoint -> "number" would be the cumulative value and the new parameter is the measured value.
  • MetricPoint does already have access to its own running value, maybe we could just subtract that when calling UpdateExemplar, this would need to happen before Interlocked.Add is called on the running value. Also would this work for every kind of instrument?

}

[Theory(Skip = "https://github.com/open-telemetry/opentelemetry-specification/issues/1874")]
[Theory]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added just one test to prove that we don't do spatial-aggregation in this SDK. We need much more coverage when fixing the issue. Covering delta/cumulative, multiple cycles, various view combinations.

Copy link
Copy Markdown
Member

@cijothomas cijothomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current iteration looks very promising (thanks! This was indeed a hard problem!).
Requesting changes to address below comment
https://github.com/open-telemetry/opentelemetry-dotnet/pull/7227/changes#r3183554865

Also, we need to vastly improve test coverage when doing this fix- lot of potential edge cases. (I spotted only one about Exemplar value, but could be more)

@martincostello martincostello requested a review from cijothomas May 11, 2026 08:45
@nabutabu
Copy link
Copy Markdown
Contributor Author

@cijothomas I added as many tests as I could think of, but I am not sure if this is all that needs coverage. Would appreciate any guidance on any other nuances of the measurements that could potentially be affected!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unskip MetricApiTests.ObservableCounterSpatialAggregationTest

3 participants