Skip to content

Change network.peer.address to opt-in on http.client.open_connections (related to #3279)#3759

Open
cijothomas wants to merge 2 commits into
open-telemetry:mainfrom
cijothomas:http-peer-address-opt-in
Open

Change network.peer.address to opt-in on http.client.open_connections (related to #3279)#3759
cijothomas wants to merge 2 commits into
open-telemetry:mainfrom
cijothomas:http-peer-address-opt-in

Conversation

@cijothomas
Copy link
Copy Markdown
Member

Towards #3279 (does not fully resolve it).

http.client.open_connections is a synchronous UpDownCounter, which is commonly exported with cumulative temporality. Combined with network.peer.address at recommended, this produces one persistent time series per remote IP for the lifetime of the process — even after connections close — leading to unbounded SDK memory growth, exhaustion of cardinality limits, and collapse of useful low-cardinality attributes (e.g. url.scheme) into the overflow bucket.

This PR moves network.peer.address from Recommended to Opt-In on http.client.open_connections.

A deeper fix will require a spec change. The proposed finish() / remove-series API (open-telemetry/opentelemetry-specification#4702) may also help address this in the future by letting instrumentations explicitly drop a series when the underlying entity (e.g. a connection) goes away.

/cc @open-telemetry/semconv-http-approvers @lmolkova @ManickaP @noahfalk

@github-actions
Copy link
Copy Markdown

This PR contains changes to area(s) that do not have an active SIG/project and will be auto-closed:

  • http

Such changes may be rejected or put on hold until a new SIG/project is established.

Please refer to the Semantic Convention Areas
document to see the current active SIGs and also to learn how to kick start a new one.

@github-actions github-actions Bot closed this May 29, 2026
@cijothomas
Copy link
Copy Markdown
Member Author

@open-telemetry/specs-semconv-maintainers

This PR was auto-closed by a bot because the HTTP SIG is marked inactive in AREAS.md. But that doc explicitly says "Bugs and bugfixes are welcome", so this PR should not have been auto-closed. (Its fixing an issue, not sure if it is a bug, in a in-development metric)

Also, this bot action doesn't seem very welcoming to contributors:

  • Contributing guide does not mention the auto-close anywhere..
  • AREAS.md doesn't warn that PRs touching inactive areas will be auto-closed either..

A contributor (especially a newcomer) following the docs would have no way to anticipate this. Would it make sense to either make this more obvious in contributing/areas.md docs to avoid the shock?

In the meantime, could a maintainer reopen this PR? Happy to address any review feedback for this one.

Comment thread model/http/metrics.yaml Outdated
@cijothomas
Copy link
Copy Markdown
Member Author

@lmolkova I kept the attribute as recommended in the connection.duration metric as it is an histogram, and the original issue only occurs when UpDownCounter is involved (due to its always-cumulative nature). Please confirm if this is acceptable. (I saw you approved, but want to call this one out specifically)

@lmolkova
Copy link
Copy Markdown
Member

lmolkova commented May 29, 2026

@cijothomas so if I understand correctly:

  • updowncounter: let's say we started connection to 1.2.3.4 and recorded a counter, 5 min later connection closes and was never re-established to that IP. 10 days later the process is still alive and we're still reporting 0. This is essentially a leak. Even if we had finish() we wouldn't necessarily know in the instrumentation if it's the time to call it - we don't know if there are other active connections to that IP. It'd need some background mechanism in SDK to drop stall series with value 0.
  • histogram: we won't keep an active time series for 1.2.3.4, we'll never report it after connection closes. While cardinality could be high over long period of time, within short periods it's usually fixed.

If it's correct than I think it makes sense to make it opt-in on the updowncounter and keep on histogram. I added a comment to #1854 tracking general metric guidance to potentially document it.

@cijothomas
Copy link
Copy Markdown
Member Author

@cijothomas so if I understand correctly:

  • updowncounter: let's say we started connection to 1.2.3.4 and recorded a counter, 5 min later connection closes and was never re-established to that IP. 10 days later the process is still alive and we're still reporting 0. This is essentially a leak. Even if we had finish() we wouldn't necessarily know in the instrumentation if it's the time to call it - we don't know if there are other active connections to that IP. It'd need some background mechanism in SDK to drop stall series with value 0.
  • histogram: we won't keep an active time series for 1.2.3.4, we'll never report it after connection closes. While cardinality could be high over long period of time, within short periods it's usually fixed.

If it's correct than I think it makes sense to make it opt-in on the updowncounter and keep on histogram. I added a comment to #1854 tracking general metric guidance to potentially document it.

Your explanation is correct. Histogram also would have this problem of never forgetting the 1.2.3.4 entry, if user opts for Cumulative. With delta, Histogram "forgets".

UpDownCounter is somewhat special - even when user says DELTA, UpDownCounter actually still does cumulative. i.e - a normal user has no easy way to make UpDownCounter behave in DELTA mode!
See https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk_exporters/otlp.md#additional-environment-variable-configuration

(it is possible for an advanced user to control this, but that might defeat the entire point of UpDownCounter. I guess these are spec problems, and should be addressed there.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants