General Availability for OpenTelemetry #3452

Open
tedsuo wants to merge 4 commits into open-telemetry:main from tedsuo:otel-ga

Contributor

@tedsuo tedsuo commented May 19, 2026

This meta-project is a re-organization of the work presented in the Stable-by-default OTEP. Since almost none of the work required relates to changes in the spec, I'm moving it from an OTEP to the community repo, since this is where we do the rest of our project planning.

The goal of this PR is to identify the remaining workstreams that are needed to complete delivery of the initial set of goals for OpenTelemetry: a telemetry system that delivers tracing, metrics, and logs from the most common software libraries and cloud infrastructures.

The term "stable by default" seemed to be a little confusing to some people. So we are trying a different term: "general availability" (GA). This is the term we used to use when describing OpenTelemetry as complete, so it seems appropriate to bring it back for this use case.

Currently, this is a first draft. If we agree that this scope of work is correct, we can merge this PR and move on to defining each individual workstream in more detail, as separate documents.

@tedsuo tedsuo added the area/project-proposal Submitting a filled out project template label May 19, 2026
Comment thread projects/otel-ga.md
# OpenTelemetry GA: Completing our initial scope of work

This project file identifies the remaining workstreams needed to complete delivery of the
initial scope of work for OpenTelemetry: A telemetry system that delivers tracing,
Member

The title "OpenTelemetry GA" is ambiguous and likely to confuse the community. Many language SDKs and instrumentations have been declared v1.0/stable for years. Do we mean OpenTelemetry as a whole here? Let's clarify to avoid this confusion.

Contributor Author

Clarify how? Can you suggest a new opening sentence?

Member

A wording suggestion:

This document identifies the remaining workstreams needed for the OpenTelemetry project as a whole to be considered generally available — i.e. an end-to-end platform where users can install, deploy, and operate tracing, metrics, and logs at scale using stable components.

Note: many individual components (language APIs, SDKs, and a growing set of instrumentation libraries) are already at v1.0 today. "Project GA" here is a higher-level milestone about the platform as a whole, not a per-component status.

Comment thread projects/otel-ga.md
technically still in beta, but are recommended to be used in production. This is confusing, as
OpenTelemetry also has components marked 0.X that are genuinely experimental and should not be
used in production. Additionally, some end user organizations have rules that prohibit them
from deploying 0.X software to production.
Member

Let's also acknowledge that there are many components that are declared 1.0 stable.

Contributor

Go attaches special meaning to version numbers (https://go.dev/doc/modules/version-numbers): moving from v0 to v1 carries stability requirements, and it will take some alignment to move past v1 to v2, etc.

A module developer should increment this number past v1 only when necessary because the version upgrade represents significant disruption for developers whose code uses function in the upgraded module. This disruption includes backward-incompatible changes to the public API, as well as the need for developers using the module to update the package path wherever they import packages from the module.

So the problem for the collector is we conflate GA and stability with API stability, which is a lot of small details.
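
To make the quoted Go rule concrete, here is a minimal sketch of how the major version shows up in the path that consumers import. The module path `example.com/collector` and the helper name `go_import_path` are hypothetical, and the sketch ignores pre-release tags and the `gopkg.in` naming variant.

```python
import re


def go_import_path(module_path: str, version: str) -> str:
    """Return the module path a consumer imports for a given release tag.

    Simplified model of Go's rule: v0 and v1 share the unsuffixed path,
    while v2 and later append a "/vN" suffix to the module path.
    """
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR.PATCH tag: {version!r}")
    major = int(match.group(1))
    if major < 2:
        return module_path
    return f"{module_path}/v{major}"


if __name__ == "__main__":
    # Hypothetical module path, used only for illustration.
    for tag in ("v0.98.0", "v1.3.0", "v2.0.0"):
        print(tag, "->", go_import_path("example.com/collector", tag))
```

Under this model, going from v0 to v1 leaves every import untouched, while going to v2 forces each consumer to edit their import paths. That is the disruption the quoted passage warns about.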

Comment thread projects/otel-ga.md
maintaining instrumentation. The SemConv Tooling SIG is in charge of this project.

* Weaver
* AI coding
Member

"AI coding" is a bit vague. Could we add a one-liner to clarify what we mean?

Contributor Author

Definitely need more details here, but I want the SemConv Tooling SIG to take a look at it before I put too many words in their mouth 🙂

Comment thread projects/otel-ga.md
* Move away from the “community contrib” model for critical instrumentation packages.
* Deploy the new SemConv tooling across all language ecosystems.
* Badges and other forms of recognition
* Native instrumentation push to move instrumentation out of OpenTelemetry.
Member

Native instrumentation: do we mean libraries/frameworks natively taking a dependency on the OpenTelemetry API and instrumenting themselves, without the need for us to make an instrumentation library package?

Contributor Author

Yes. There are two ways a community/contrib instrumentation package could gain more support. One is that a trusted group within OpenTelemetry maintains the package. The second is that the library itself includes the instrumentation natively, so there is no need for OpenTelemetry to maintain a separate package.

In both cases, we've identified a lack of tooling as a barrier. It's difficult to write instrumentation that matches the semantic conventions without making mistakes. So, once we have better tooling, we have an opportunity to try to upstream instrumentation. This would be preferable to maintaining a separate instrumentation package.

Member

I am not sure about framing tooling (or the lack of it) as the main barrier. If I'm a library owner deciding whether to take a direct dependency on the OTel API, my decision tree looks roughly like this:

  1. API stability & long-term support — Is the OTel API in my language stable, and is there a guarantee of support for at least ~3 years? I can't take a dependency on something that might force me to react to breaking changes every year.
  2. Performance — Does depending on the OTel API regress my library on the no-op path? Some cost when the SDK is enabled is acceptable; cost when telemetry is disabled is not, because it directly slows down every user of my library, whether they use OTel or not.
  3. Semantic convention stability — Is the semantic convention I'm being asked to emit stable? If it churns, I either ship breaking changes downstream or carry compatibility layers/opt-in-out flags forever.
  4. Tooling / validation — Then — is there tooling to validate that what I produce matches what I'm supposed to produce?

The current wording treats (4) as the primary blocker, but (1)–(3) come first IMHO for any library owner. I also don't recall library authors citing tooling as their blocker in prior discussions; the concerns I've consistently heard are stability and performance. If the SemConv Tooling SIG has data showing otherwise, it would be great to link it. (I only have anecdotal evidence, so I'm happy to correct my position once I learn more.)

For this workstream to actually deliver native instrumentation at scale, I think we need explicit commitments on (1)–(3) — API LTS guarantees, a no-op/hot-path performance commitment, and semconv stability — alongside the tooling work.
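
To illustrate the no-op path concern in point (2), here is a minimal sketch of an API whose default tracer hands back one shared, do-nothing span, so an instrumented library allocates nothing when telemetry is disabled. All class and function names, and the attribute key `example.path`, are hypothetical; this is not the actual OpenTelemetry API.

```python
class NoopSpan:
    """Span that records nothing; a single shared instance serves every call."""

    __slots__ = ()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions

    def set_attribute(self, key, value):
        return None


NOOP_SPAN = NoopSpan()


class NoopTracer:
    """Default tracer used when no SDK is installed."""

    def start_span(self, name):
        return NOOP_SPAN  # no per-call allocation


class RecordingSpan:
    """Span that keeps its attributes and reports itself when it ends."""

    def __init__(self, name, sink):
        self.name = name
        self.attributes = {}
        self._sink = sink

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self._sink.append(self)
        return False

    def set_attribute(self, key, value):
        self.attributes[key] = value


class RecordingTracer:
    """Stand-in for an enabled SDK: collects finished spans in a list."""

    def __init__(self):
        self.finished = []

    def start_span(self, name):
        return RecordingSpan(name, self.finished)


def handle_request(path, tracer=NoopTracer()):
    """Library code: identical whether or not telemetry is enabled."""
    with tracer.start_span("handle_request") as span:
        span.set_attribute("example.path", path)
        return f"ok:{path}"
```

With the default tracer, the cost left for a library owner is a couple of method calls per operation. Whether that is acceptable is the kind of commitment the comment above asks for.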

Comment thread projects/otel-ga.md
The following projects are seen as important for the long term success of OpenTelemetry, but
not actually necessary to deliver stable components that are deployable at scale.

### Performance / Benchmarking
Member

Performance is one of the key things we list in our mission, so listing it as out-of-scope reads a bit oddly. If this is a bandwidth call rather than a position change, could we say so explicitly?

Member

@martincostello and I had discussed some work to help with this effort; hopefully we can help solve the bandwidth problem to a certain extent.

Contributor Author

Yes, I'm only putting it as out of scope because I feel like we can't ask maintainers to focus on ten things at once. We want maintainers to focus on instrumentation, packaging, and declarative config because those things are necessary for OpenTelemetry to be stable and manageable at scale.

Improving performance is important, but it is not necessary for stability or deployability. So I want to say that at this time, improving performance is something optional that individual SIGs can work on, but not something we are trying to standardize across all implementations by producing a set of standard benchmarks or something like that.

In the same way that the SemConv Tooling SIG has been working on tools for managing instrumentation, a group could work on performance benchmarks and propose them to the community. In the past, this effort has always died because we have not found a useful way to define universal performance benchmarks, and SIGs have instead made better progress working on performance on a case-by-case basis, based on user complaints.

Member

Opened an OTEP to get cross-implementation performance tracked centrally: open-telemetry/opentelemetry-specification#5118. It shows prototypes as well.

Comment thread projects/otel-ga.md
are always learning. It's reasonable that we may want to revisit these designs, either to
incorporate new developments within the industry, or to address fundamental performance issues.

Let's finish shipping v1.0 first, before distracting ourselves with v2.0.
Member

Similar to my comment on the opening line: this also assumes OTel is not v1.0 yet.
https://opentelemetry.io/status/#language-apis--sdks paints a different picture. Let's clarify that we mean OTel as a whole, not pieces of it, to avoid this confusion.

Contributor Author

OTel as a whole is not "v1.0." That chart is too limited, as it does not include the work listed in this document: instrumentation stability, packaging, deployment, and management at scale are still not complete.

While it is possible to install an SDK that is v1.0, you cannot reasonably instrument a real world application using only stable components. Today, if we shipped a "stable" distribution of OpenTelemetry in any language, that contained only components that are v1.0 or greater, it would contain no instrumentation and thus be completely useless in a real world scenario. A stable SDK simply isn't enough.

Furthermore, there's currently no way for an organization to deploy that stable distribution at scale, except for a couple of languages in a single environment: Kubernetes. These two hurdles – stability and deployability – are huge gaps, which is why they were flagged as part of the due diligence process done for OpenTelemetry's graduation.
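
The "stable-only distribution" argument can be sketched as a simple filter over a component inventory. The inventory below is entirely made up for illustration (hypothetical names and version numbers, not the status of any real package); it only shows the shape of the problem when instrumentation packages sit at 0.x.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    kind: str      # "api", "sdk", "exporter", or "instrumentation"
    version: str   # MAJOR.MINOR.PATCH


def is_stable(component: Component) -> bool:
    """Treat a component as stable when its major version is 1 or higher."""
    return int(component.version.split(".")[0]) >= 1


def stable_distribution(components):
    """Keep only the components an organization with a 'no 0.x' rule could ship."""
    return [c for c in components if is_stable(c)]


# Hypothetical inventory: names and versions are illustrative assumptions.
INVENTORY = [
    Component("example-api", "api", "1.4.0"),
    Component("example-sdk", "sdk", "1.4.0"),
    Component("example-otlp-exporter", "exporter", "1.2.1"),
    Component("example-http-instrumentation", "instrumentation", "0.45.0"),
    Component("example-db-instrumentation", "instrumentation", "0.45.0"),
]

if __name__ == "__main__":
    for component in stable_distribution(INVENTORY):
        print(component.kind, component.name, component.version)
```

In this made-up inventory the filter keeps the API, SDK, and exporter but drops every instrumentation package, which is the gap described above.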

Comment thread projects/otel-ga.md

### Stability: Instrumentation

The biggest barrier to general availability is unstable instrumentation.
Member

Let's also list who owns this piece. I think we are moving to a model where the maintainers pick a set of key instrumentations and treat it like the core API/SDK? That was my understanding from the wording "move away from the community contrib model"!
(💯 agree with the change; this comment is about making the ownership clear.)

Contributor Author

I completely agree. Actually, figuring out who owns instrumentation is the biggest challenge in all of this work, in my opinion. Currently, no one owns it. And we have no one available to own it: it's unfair to assume that the current SDK maintainers can also take on the work of maintaining instrumentation, even with better tooling.

So, an important part of this workstream is figuring out a new model that offers some kind of reward for organizations that put in the work needed to maintain all of this instrumentation.

Member

Actually, figuring out who owns instrumentation is the biggest challenge in all of this work, in my opinion. Currently, no one owns it.

Sharing my observations:

In my opinion, it is fair to ask the language's core repo maintainers/approvers to find a list of libraries that they deem important for their ecosystem, and make themselves owners of them. For years, OTel .NET maintained instrumentation libraries in the core repo itself. It actually helps with validating things e2e also.
I am currently doing this in OTel Rust: we own the instrumentation library for the most important library ourselves (of course, there are other owners as well).

Rewarding organizations is a good idea, but we need a concrete plan.

Comment thread projects/otel-ga.md

As part of the [due diligence](https://github.com/cncf/toc/blob/main/projects/open-telemetry/otel-graduation-dd.md)
for OpenTelemetry's graduation, a scope of work was identified as required in order
for OpenTelemetry to be considered "stable" or "generally available."
Member

How do we define "stable" for this project?

Does “stable” mean “doesn’t crash at runtime” or “doesn’t introduce breaking changes between releases”? Or is it both?

It would be nice to explicitly define what sort of stability goals we have.

Contributor Author

My intention is that stable means both. Once something is v1.0:

  • it is ready for production, meaning it doesn't crash or cause harm.
  • it is supported, meaning we will fix bugs and issue security patches without having them mixed in with breaking changes.

Comment thread projects/otel-ga.md Outdated

### Stability: Collector v1.0

Managed by Collector SIG, the OpenTelemetry Collector needs to complete its roadmap for v1.0.
Member

Many Collector components' 1.0 releases have historically been blocked on stable semantic conventions, so that no breaking changes to names occur once a component is tagged 1.0. I recall that there was a recent discussion about making it easier to update semconv in 1.0 components. Is that still in effect for the GA concept this document describes? It would be good to include a link to that decision somewhere, since it affects a lot of components (collector or instrumentation).

Contributor Author

Yes, we decided that we were being overly cautious by including data stability with v1.0. While it should be a breaking change if we changed the data, there's nothing wrong with breaking data changes being issued as a v2.0. Combining the two concepts as a requirement for v1.0 left us with no way of indicating that the code is safe to run in production.

But you're right, I don't think that this decision has been recorded in the spec. I believe this is the doc that needs to be updated: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/telemetry-stability.md

Comment thread projects/otel-ga.md

* The need for pod attribution and other manual configuration requirements that interfere
with deploying OpenTelemetry at scale.
* All major languages supported.
Member

Is this bullet specifically referring to auto-instrumentation? If so we should be explicit.

Contributor Author

It is. The intention is that all languages that have an auto-instrumentation mechanism can be installed via both the Operator and system packaging.

Member

Suggested change:
- * All major languages supported.
+ * Auto-instrumentation supported for all major languages.

Do we have an existing definition of "major language," or is that distinction left up to the Operator/packaging SIG?

Contributor Author

We usually mean "Java, .NET, NodeJS, Python, Ruby, PHP," as those are the popular languages that also have an auto-installation path. Maybe there's a better name for it than "major languages," which could be seen as rude. Go could also be on the list if OBI-based auto-instrumentation gets to a good spot. I'm not sure what is required for other languages such as Erlang.

Comment thread projects/otel-ga.md
* Language distributions for SDKs, plugins, and instrumentation
* Declarative configuration for managing instrumentation and stability

### Deployment: Kubernetes Operator v1.0
Member

The Operator is the go-to k8s solution for many users, but many other users prefer to use the helm charts to install OpenTelemetry in Kubernetes. How do they fit into the GA picture?

Contributor Author

My understanding is that our current approach is for the OTel Helm Charts to just install the Operator. Am I off base in that assumption? Do we need a better solution than that?

Member

The opentelemetry-operator chart is one way to install the operator, and its 1.0 could be molded into this effort. But separately there is the opentelemetry-collector chart which installs the collector directly.

We have purposefully never taken a stance like "you should use the OpenTelemetry Operator to install a Collector in Kubernetes," because it's not accurate to always suggest the Operator.

Member

I think it would make sense to handle the Helm charts via the respective efforts, though:

  • opentelemetry-operator 1.0 with the Operator effort
  • opentelemetry-collector 1.0 with the Collector effort

But I think it is dangerous to claim for GA that only the Operator needs 1.0, as that would be taking a strong stance that the Operator is the Official OpenTelemetry Way to install a collector on Kubernetes.

If the GC/TC wants to take that stance, I think it's worth discussing further.

Contributor Author

Well, if the opentelemetry-collector helm chart only installs the Collector, that's pretty limiting, isn't it? I'm definitely not opposed to listing it. But I'd like to get a better understanding about whether or not we should be expanding our helm offerings beyond these two options.

Comment thread projects/otel-ga.md
Comment on lines +7 to +10
NOTE: This is a meta-project. It describes a set of workstreams at a high level, so that we can
agree upon the overall scope of work needed for OpenTelemetry to be considered GA or "generally
available." For that reason, it is missing some sections that would normally be in a project
file. In the future we should develop better road-mapping tools, but this is what we have today.
Contributor

Suggested change:
- NOTE: This is a meta-project. It describes a set of workstreams at a high level, so that we can
- agree upon the overall scope of work needed for OpenTelemetry to be considered GA or "generally
- available." For that reason, it is missing some sections that would normally be in a project
- file. In the future we should develop better road-mapping tools, but this is what we have today.
+ > [!NOTE]
+ > This is a meta-project. It describes a set of workstreams at a high level, so that we can
+ > agree upon the overall scope of work needed for OpenTelemetry to be considered GA or "generally
+ > available." For that reason, it is missing some sections that would normally be in a project
+ > file. In the future we should develop better road-mapping tools, but this is what we have today.

Comment thread projects/otel-ga.md
* Deployment: Packaging v1.0
* Deployment: Kubernetes Operator v1.0
* Security
* Roadmaps & Project Management
Member

Question: should OpenTelemetry's own self-observability be an explicit workstream in this GA plan?

If we're calling OTel GA — production-ready and deployable at scale — operators need to be able to answer basic questions like is my sdk/exporter dropping data? or is my collector silently failing to export? Today, in most languages and components, you can't easily tell:

  • The semantic conventions for OTel's self-instrumentation are still experimental.
  • Very few SDKs/components implement them end-to-end.
  • Silent data loss has come up repeatedly in issues, SIG discussions, and customer complaints.
  • Users often discover missing telemetry days later with no signal from OTel itself.

To operate OTel at scale, we need the system to tell you when it's unhealthy. Is this intentionally out of scope, covered implicitly under another workstream (Collector v1.0? Instrumentation?), or is it a gap worth calling out explicitly?

A concrete example I have always noticed:
The OTel SDK's batch processor (the default with the OTLP exporter) drops telemetry when the exporter cannot keep up, and there is no standard way for an operator to know about this. The fix is:

  1. Make the semantic conventions for internal telemetry/self-instrumentation stable.
  2. Make sure all SDKs implement them. (I think only Java and Go implement this. I opened a PR to add it to Rust recently.)
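
As a rough illustration of the failure mode described above, here is a minimal sketch of a bounded export queue that drops items when full and counts the drops so they could be reported as a self-observability metric. The class `BoundedExportQueue` and its `dropped` counter are hypothetical; this is not the batch processor of any OpenTelemetry SDK, and it uses no names from the semantic conventions.

```python
from collections import deque


class BoundedExportQueue:
    """Queue between producers and a (possibly slow) exporter.

    When the queue is full, new items are dropped rather than blocking
    the application, and every drop is counted so it can be surfaced.
    """

    def __init__(self, max_size: int):
        self._items = deque()
        self._max_size = max_size
        self.dropped = 0  # would be exposed as an internal metric

    def enqueue(self, item) -> bool:
        """Add an item; return False (and count a drop) if the queue is full."""
        if len(self._items) >= self._max_size:
            self.dropped += 1
            return False
        self._items.append(item)
        return True

    def drain(self, batch_size: int) -> list:
        """Remove and return up to batch_size items for export."""
        batch = []
        while self._items and len(batch) < batch_size:
            batch.append(self._items.popleft())
        return batch


if __name__ == "__main__":
    queue = BoundedExportQueue(max_size=3)
    for span_id in range(5):
        queue.enqueue(span_id)
    # Without the counter, the two rejected items would vanish silently.
    print("dropped:", queue.dropped)
```

The design choice is to drop rather than block, which protects the application at the price of data loss. The counter is what turns that loss from silent into visible.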

Contributor Author

I think that's reasonable. OpenTelemetry definitely isn't finished if we are still missing critical forms of self-observability. Possibly related is OpAMP management and health reporting for the SDKs.

Member

reyang commented May 20, 2026

I have a couple of meta comments:

  • I feel the use of "General Availability" is confusing and misleading. Take Kubernetes as an example: there is no such thing as "Kubernetes as a project has reached General Availability"; the GA term is used for specific features (https://kubernetes.io/search/?q=GA).
  • There are many components which are already at v1.0 or even v2.0. If we put something like "this means finalizing the v1.0 roadmap for every component...", it would surprise many users.
  • The term "GA" has been used several times:
    [image]

Comment thread projects/otel-ga.md Outdated
Co-authored-by: Pablo Baeyens <pbaeyens31+github@gmail.com>
Contributor Author

tedsuo commented May 26, 2026

@reyang I think you're correct about watching our language about v1.0 vs v2.0. I use v2.0 casually in one place and it isn't appropriate. I'll add some clarification.

As far as the term "General Availability" vs "Stable by Default" or something else... I don't know how much I want to bikeshed that 😅. But I will say, we have always used the term "General Availability" to mean exactly this: a component is stable, v1.0, ready for production and supported. "Stable" in the spec matches this meaning. The issue with most of the components in this roadmap is that they are "de facto" stable, meaning we tell people to run them in production but we haven't marked them as stable or issued a v1.0 for them.

So, given that no term is perfect, and given that we have always used GA in this manner, I'd like to stick with it. If people want to bikeshed and come up with a better term and get buy-in from the community, maybe Slack is a better place to do that. I promise to change it if it looks like there's consensus, but if you don't mind, I'd like to keep the comment threads here focused on the content.

Labels

area/project-proposal Submitting a filled out project template triage:tc-inbox

8 participants