Skip to content

[Metrics] Resolve accumulated TODOs in SDK core#7277

Open
nimanikoo wants to merge 11 commits into
open-telemetry:mainfrom
nimanikoo:fix/resolve-simple-todos-metrics-sdk
Open

[Metrics] Resolve accumulated TODOs in SDK core#7277
nimanikoo wants to merge 11 commits into
open-telemetry:mainfrom
nimanikoo:fix/resolve-simple-todos-metrics-sdk

Conversation

@nimanikoo
Copy link
Copy Markdown
Contributor

@nimanikoo nimanikoo commented May 9, 2026

Hey everyone! 👋 I was going through the codebase and spotted a few TODOs that looked straightforward to fix, so I went ahead and took care of them. Thanks for maintaining such a great project and for everything you do for the open source community 🙏

Changes

Clears four long-standing TODO comments in the metrics SDK. Each change is small and self-contained; they are grouped together because they all touch the same area and share the same low risk profile.

Exception safety for custom ExemplarReservoir
MetricPoint.UpdateExemplar called ExemplarReservoir.Offer with no protection. Any exception from a user-supplied reservoir would propagate up and corrupt the measurement path. The fix wraps the call in a try/catch and emits a new Warning-level EventSource event (#57, ExemplarReservoirException). To preserve AggressiveInlining on the hot path, the guard (if (offerExemplar)) stays inlined in UpdateExemplar while the offer + exception handling moves into a non-inlined OfferExemplarSafe helper.

Diagnostic log in MeasurementsCompleted
When MeasurementsCompleted receives a state that isn't a MetricState, it now logs a Verbose-level MeterProviderSdkEvent before returning — consistent with MeasurementRecordedLong and MeasurementRecordedDouble in the same file.

Log unknown instrument type in temporality preference
MonotonicDeltaTemporalityPreferenceFunc had a silent _ => AggregationTemporality.Delta default. It is now a block lambda that emits a Verbose log when an unrecognized instrument type hits that arm. Return behavior is unchanged.

TrimExcess() after AddMetricWithViews
The metrics list is pre-allocated to the maximum view count, but views can be skipped for several reasons (invalid name, existing metric, Drop config, limit exceeded). TrimExcess() is now called after the lock is released to reclaim unused capacity without extending lock hold time.

Merge requirement checklist

  • CONTRIBUTING guidelines followed (license requirements, nullable enabled, static analysis, etc.)
  • Unit tests added/updated
  • Appropriate CHANGELOG.md files updated for non-trivial changes
  • Changes in public API reviewed (if applicable)

Wrap ExemplarReservoir.Offer in try/catch so a buggy custom reservoir can't silently crash the measurement path. Surfaces the error as a new Warning event (open-telemetry#57).
Log a Verbose diagnostic in MeasurementsCompleted when the state object isn't a MetricState, consistent with every other callback in the file.
Log unknown instrument types that fall through to the default case in MonotonicDeltaTemporalityPreferenceFunc.
Call TrimExcess() after releasing the creation lock in AddMetricWithViews to reclaim wasted list capacity
Copilot AI review requested due to automatic review settings May 9, 2026 12:39
@nimanikoo nimanikoo requested a review from a team as a code owner May 9, 2026 12:39
@github-actions github-actions Bot added the pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package label May 9, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Resolves several long-standing TODOs in the metrics SDK core by improving exception safety around exemplars, adding missing verbose diagnostics, and reducing unnecessary allocations from over-provisioned metric lists.

Changes:

  • Wraps user-supplied ExemplarReservoir.Offer calls with exception handling and introduces a new Warning-level EventSource event for failures.
  • Adds verbose diagnostics for unexpected state in MeasurementsCompleted and for unrecognized instrument types in temporality preference selection.
  • Calls TrimExcess() on the metrics list created in AddMetricWithViews after releasing the lock to reclaim unused capacity.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/OpenTelemetry/Metrics/Reader/MetricReaderExt.cs Trims preallocated metrics list capacity after view processing completes.
src/OpenTelemetry/Metrics/Reader/MetricReader.cs Adds verbose log when temporality preference sees an unexpected instrument type.
src/OpenTelemetry/Metrics/MetricPoint/MetricPoint.cs Adds exception-safe exemplar offering via helper methods.
src/OpenTelemetry/Metrics/MeterProviderSdk.cs Adds verbose log when MeasurementsCompleted receives an unexpected state type.
src/OpenTelemetry/Internal/OpenTelemetrySdkEventSource.cs Adds new EventId 57 warning event for exemplar reservoir offer exceptions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/OpenTelemetry/Metrics/MetricPoint/MetricPoint.cs
Comment thread src/OpenTelemetry/Metrics/MetricPoint/MetricPoint.cs
Comment thread src/OpenTelemetry/Metrics/Reader/MetricReader.cs Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 9, 2026

Codecov Report

❌ Patch coverage is 89.47368% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.93%. Comparing base (b3841bc) to head (f55b873).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/OpenTelemetry/Metrics/Reader/MetricReader.cs 84.21% 3 Missing ⚠️
src/OpenTelemetry/Metrics/MeterProviderSdk.cs 0.00% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #7277      +/-   ##
==========================================
- Coverage   89.94%   89.93%   -0.02%     
==========================================
  Files         276      276              
  Lines       14007    14034      +27     
==========================================
+ Hits        12599    12621      +22     
- Misses       1408     1413       +5     
Flag Coverage Δ
unittests-Project-Experimental 89.75% <89.47%> (-0.03%) ⬇️
unittests-Project-Stable 89.65% <89.47%> (-0.24%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...nTelemetry/Internal/OpenTelemetrySdkEventSource.cs 64.44% <100.00%> (-0.68%) ⬇️
...c/OpenTelemetry/Metrics/MetricPoint/MetricPoint.cs 94.93% <100.00%> (+0.15%) ⬆️
...rc/OpenTelemetry/Metrics/Reader/MetricReaderExt.cs 92.12% <100.00%> (+0.06%) ⬆️
src/OpenTelemetry/Metrics/MeterProviderSdk.cs 94.19% <0.00%> (-0.40%) ⬇️
src/OpenTelemetry/Metrics/Reader/MetricReader.cs 91.08% <84.21%> (-0.87%) ⬇️

... and 8 files with indirect coverage changes

Copy link
Copy Markdown
Member

@martincostello martincostello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please review the Copilot review comments.

Comment thread src/OpenTelemetry/Metrics/Reader/MetricReader.cs
@nimanikoo nimanikoo requested a review from martincostello May 14, 2026 10:24
Comment thread src/OpenTelemetry/Metrics/MetricPoint/MetricPoint.cs
Comment thread test/OpenTelemetry.Tests/Metrics/MetricExemplarTests.cs Outdated
@rajkumar-rangaraj
Copy link
Copy Markdown
Member

copilot feedback

Code Review — PR #7277 [Metrics] Resolve accumulated TODOs in SDK core

Consensus findings (both Opus 4.7 + Opus 4.6 agreed) — real issues

  1. Per-measurement Warning log spam from OfferExemplarSafe 🟡 Medium
    A buggy user ExemplarReservoir.Offer will raise EventSource event remove options from end and make SpanContext specific class #57 (Warning) on every measurement — potentially millions per second on the hot path. Existing user-callback handlers in AggregatorStore.cs swallow silently for this exact reason. Suggested fix: log-once-per-reservoir-instance (track a bool on the wrapper), or drop the level to Verbose.

  2. TrimExcess() is wasted (and arguably harmful) 🟡 Medium
    Opus 4.7 gave the strongest evidence: the returned metrics list is immediately consumed by MetricState.BuildForMetricList which calls metrics.ToArray() — the list itself is then unreachable. TrimExcess just allocates a new backing array that's thrown away. Opus 4.6 separately noted the same allocation cost. Revert this change or remove the call. The original TODO's premise ("memory wasted until Meter disposed") was incorrect.

Where the models disagreed — I verified by reading the code

  1. MeasurementsCompleted severity inconsistency 🟡 Medium — Opus 4.6 correct
    Opus 4.7 said the PR "mirrors the surrounding pattern" at Verbose. It does not. I checked MeterProviderSdk.cs lines 205–222: the sibling methods MeasurementRecordedLong/Double use OpenTelemetrySdkEventSource.Log.MeasurementDropped(..., "SDK internal error occurred.", "Contact SDK owners.") at Warning level for the identical state is not MetricState check. The PR uses MeterProviderSdkEvent at Verbose, making a real SDK bug effectively invisible in production. Should match siblings (MeasurementDropped, Warning).

Unique to Opus 4.7

  1. Missing CHANGELOG entry 🟡 Medium
    The exception-safety change is user-observable: previously, an exception in a custom reservoir bubbled out of Counter.Add/Histogram.Record; now it's swallowed and logged. Needs an Unreleased bullet under src/OpenTelemetry/CHANGELOG.md referencing [Metrics] Resolve accumulated TODOs in SDK core #7277.

Plus useful verifications: event id 57 is unique; [MethodImpl(NoInlining)] is present (not relying on JIT); MonotonicDeltaTemporalityPreferenceFunc outer lambda remains static with no closure allocation; OpenTelemetrySdkEventSource is internal (no .publicApi/ update needed); the PR does include unit tests with ThrowingExemplarReservoir (correcting a wrong premise in my prompt).

Model comparison notes

  • Opus
    4.7 (118s): More thorough, did extra verifications, caught the CHANGELOG gap, but got the MeasurementsCompleted severity wrong (claimed it mirrored siblings — it doesn't).
  • Opus
    4.6 (90s): Faster, correctly flagged the severity inconsistency, but missed the CHANGELOG, didn't verify event id / NoInlining / closure-safety, and didn't notice the PR has tests.
  • Both independently converged on the two most important issues (log spam + wasted TrimExcess), which boosts confidence.

Recommended actions for the PR

  1. Add log-rate-limiting (or drop to Verbose) for ExemplarReservoirException.
  2. Remove the TrimExcess() call — it provides no benefit.
  3. Change MeasurementsCompleted log to use MeasurementDropped at Warning, matching siblings.
  4. Add a CHANGELOG entry describing the new exception-safety behavior.

nimanikoo added 3 commits May 19, 2026 15:17
…and logging

- Drop ExemplarReservoirException event from Warning to Verbose — a
  throwing reservoir fires on every measurement, making Warning level
  impractical on hot paths where exemplar capture is best-effort anyway

- Remove TrimExcess() from AddMetricWithViews — the caller immediately
  calls .ToArray() on the returned list, so the trim just allocates a
  backing array that is discarded on the next line

- Align MeasurementsCompleted error logging with its siblings
  MeasurementRecordedLong/Double, which use MeasurementDropped at
  Warning for the same state mismatch condition

- Add CHANGELOG entry for the exception-safety behaviour introduced
  by OfferExemplarSafe, which is a user-observable contract change
@nimanikoo
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review @rajkumar-rangaraj
all four points addressed:

ExemplarReservoirException log level
Dropped event #57 from Warning to Verbose. You're right that exemplar capture is best-effort and a misbehaving reservoir shouldn't pollute the warning stream on every measurement. The Verbose level keeps it observable for anyone actively debugging without becoming noise in production.

TrimExcess()
Removed. I missed that BuildForMetricList calls .ToArray() immediately, which makes the trim pointless it just allocates a backing array that's abandoned on the next line. Good catch.

MeasurementsCompleted severity
Fixed to use MeasurementDropped at Warning, same as MeasurementRecordedLong and MeasurementRecordedDouble. The original MeterProviderSdkEvent call was at Verbose, which would have made a real SDK-state bug completely invisible in production. Should have matched siblings from the start.

CHANGELOG
Added an entry under Unreleased. The swallow-instead-of-propagate behavior on ExemplarReservoir.Offer is user-visible anyone relying on the exception bubbling out of Counter.Add to detect a broken reservoir would be silently broken without this note.

f55b873

I'm really excited to see this merged and looking forward to contributing more to the project in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants