174 changes: 174 additions & 0 deletions projects/otel-ga.md
@@ -0,0 +1,174 @@
# OpenTelemetry GA: Completing our initial scope of work

This project file identifies the remaining workstreams needed to complete delivery of the
initial scope of work for OpenTelemetry: A telemetry system that delivers tracing,
Member

The title "OpenTelemetry GA" is ambiguous and likely to confuse the community. Many language SDKs and instrumentations have been declared v1.0/stable for years. Do we mean OpenTelemetry as a whole here? Let's clarify to avoid this confusion.

Contributor Author

Clarify how? Can you suggest a new opening sentence?

Member

Some wording suggestion:

This document identifies the remaining workstreams needed for the OpenTelemetry project as a whole to be considered generally available — i.e. an end-to-end platform where users can install, deploy, and operate tracing, metrics, and logs at scale using stable components.

Note: many individual components (language APIs, SDKs, and a growing set of instrumentation libraries) are already at v1.0 today. "Project GA" here is a higher-level milestone about the platform as a whole, not a per-component status.

metrics, and logs from the most common software libraries and cloud infrastructures.

NOTE: This is a meta-project. It describes a set of workstreams at a high level, so that we can
agree upon the overall scope of work needed for OpenTelemetry to be considered GA or "generally
available." For that reason, it is missing some sections that would normally be in a project
file. In the future we should develop better road-mapping tools, but this is what we have today.
Comment on lines +7 to +10
Contributor

Suggested change
NOTE: This is a meta-project. It describes a set of workstreams at a high level, so that we can
agree upon the overall scope of work needed for OpenTelemetry to be considered GA or "generally
available." For that reason, it is missing some sections that would normally be in a project
file. In the future we should develop better road-mapping tools, but this is what we have today.
> [!NOTE]
> This is a meta-project. It describes a set of workstreams at a high level, so that we can
> agree upon the overall scope of work needed for OpenTelemetry to be considered GA or "generally
> available." For that reason, it is missing some sections that would normally be in a project
> file. In the future we should develop better road-mapping tools, but this is what we have today.


## Background and description

As part of the [due diligence](https://github.com/cncf/toc/blob/main/projects/open-telemetry/otel-graduation-dd.md)
for OpenTelemetry's graduation, a scope of work was identified as required in order
for OpenTelemetry to be considered "stable" or "generally available."
Member

How do we define "stable" for this project?

Does “stable” mean “doesn’t crash at runtime” or “doesn’t introduce breaking changes between releases”? Or is it both?

It would be nice to explicitly define what sort of stability goals we have.

Contributor Author

My intention is that stable means both. Once something is v1.0:

  • it is ready for production, meaning it doesn't crash or cause harm.
  • it is supported, meaning we will fix bugs and issue security patches without having them mixed in with breaking changes.


The feedback on the initial proposal was that it was too open-ended, so this project attempts
to redefine workstreams to be more specific as to which SIG is intended to work on them, and
the concrete set of deliverables needed for OpenTelemetry to be considered GA.

## Current challenges

### De facto stable

Today, many components are "de facto" stable, meaning that they are versioned as 0.X and are
technically still in beta, but are recommended to be used in production. This is confusing, as
OpenTelemetry also has components marked 0.X that are genuinely experimental and should not be
used in production. Additionally, some end user organizations have rules that prohibit them
from deploying 0.X software to production.
Member

Let's also acknowledge that there are many components that are already declared 1.0 stable.

Contributor

Go attaches special meaning to version numbers (https://go.dev/doc/modules/version-numbers): moving from v0 to v1 carries stability requirements, and it will take some alignment to move past v1 to v2 and beyond.

> A module developer should increment this number past v1 only when necessary because the version upgrade represents significant disruption for developers whose code uses function in the upgraded module. This disruption includes backward-incompatible changes to the public API, as well as the need for developers using the module to update the package path wherever they import packages from the module.

So the problem for the collector is we conflate GA and stability with API stability, which is a lot of small details.
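
To make the Go rule quoted above concrete, here is a minimal sketch of how the major version feeds into the module path. The function name and the `example.com/collector` path are hypothetical, and special cases such as `gopkg.in` paths are deliberately ignored.

```python
def go_module_path(base_path: str, major: int) -> str:
    """Return the module path Go expects for a given major version.

    v0 and v1 share the base path; v2 and later must append a "/vN"
    suffix, which is why every importer has to update its import paths.
    """
    if major < 0:
        raise ValueError("major version must be non-negative")
    if major <= 1:
        return base_path
    return f"{base_path}/v{major}"


# Hypothetical module path, used only for illustration.
print(go_module_path("example.com/collector", 1))
print(go_module_path("example.com/collector", 2))
```

This is the disruption the quote describes: going from v1 to v2 changes the path that every downstream import statement must use.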


As part of graduation, it was requested that we provide a mechanism for indicating to end
users which OpenTelemetry components are "production ready." Actually, we already have a
mechanism for indicating this – the version number of the component. Going forward, we need to
align "production readiness" with bumping a component to v1.0 or greater.

In practice, this means finalizing the v1.0 roadmap for every component necessary for
OpenTelemetry to be considered generally available.
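
As a rough illustration of treating the version number as the readiness signal, the sketch below lists which components in a required set are still below v1.0. The component names and versions are made up for the example, and the parsing assumes plain `MAJOR.MINOR.PATCH` strings with an optional leading `v`.

```python
def major_version(version: str) -> int:
    """Extract the major version from a string such as 'v0.98.0' or '1.4.2'."""
    return int(version.lstrip("v").split(".")[0])


def is_production_ready(version: str) -> bool:
    """Assumption from the text above: v1.0 or greater means production ready."""
    return major_version(version) >= 1


def ga_blockers(required_components: dict) -> list:
    """Return the names of required components that are still 0.x, sorted."""
    return sorted(
        name
        for name, version in required_components.items()
        if not is_production_ready(version)
    )


# Hypothetical components and versions, for illustration only.
required = {
    "component-a": "1.4.2",
    "component-b": "v0.98.0",
    "component-c": "0.3.1",
}
print(ga_blockers(required))
```

Under this reading, the GA roadmap is complete when the list of blockers for the agreed set of required components is empty.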

### Deploying at scale

While there is a way to install every individual component in OpenTelemetry, we do not
currently have tools that can install and manage all of OpenTelemetry at scale. We have the
beginnings of these tools for Kubernetes and Linux, but they are not complete, and can only
manage a portion of the OpenTelemetry components needed for a complete deployment.

### Future-proofing

We've made it this far, but there are several aspects of the project that need to be improved
in order to continue maintaining OpenTelemetry after GA. The specific areas that were identified
include security, project management, performance, and long term support.

## Workstreams

Based on the above challenges, the following workstreams need to be developed and managed as
a roadmap to GA that can be presented to the community.

* Stability: Collector v1.0
* Stability: SemConv Tooling for Instrumentation
* Stability: Instrumentation
* Deployment: OpAMP v1.0
* Deployment: Packaging v1.0
* Deployment: Kubernetes Operator v1.0
* Security
* Roadmaps & Project Management
Member

Question: should OpenTelemetry's own self-observability be an explicit workstream in this GA plan?

If we're calling OTel GA — production-ready and deployable at scale — operators need to be able to answer basic questions like is my sdk/exporter dropping data? or is my collector silently failing to export? Today, in most languages and components, you can't easily tell:

  • The semantic conventions for OTel's self-instrumentation are still experimental.
  • Very few SDKs/components implement them end-to-end.
  • Silent data loss has come up repeatedly in issues, SIG discussions, and customer complaints
  • Users often discover missing telemetry days later with no signal from OTel itself.

To operate OTel at scale, we need the system to tell you when it's unhealthy. Is this intentionally out of scope, covered implicitly under another workstream (Collector v1.0? Instrumentation?), or is it a gap worth calling out explicitly?

A concrete example I have always noticed:
The OTel SDK's batch processor (the default with the OTLP exporter) drops telemetry when the exporter cannot keep up, and there is no standard way for an operator to know about this. The fix is:

  1. Stabilize the semantic conventions for internal telemetry/self-instrumentation.
  2. Make sure all SDKs implement them. (I think only Java and Go implement this. I opened a PR to add it to Rust recently.)
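
To illustrate the kind of signal being asked for, here is a minimal sketch of a bounded queue that counts what it drops. It is not the OTel SDK's batch processor or its API: the class, its method names, and the drop-newest policy are assumptions made for the example.

```python
from collections import deque


class CountingBatchQueue:
    """Bounded queue that drops new items when full and counts the drops."""

    def __init__(self, max_size: int) -> None:
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        self._items = deque()
        self._max_size = max_size
        self.enqueued = 0  # items accepted into the queue
        self.dropped = 0   # items rejected because the queue was full

    def offer(self, item) -> bool:
        """Try to enqueue an item; return False and count a drop if full."""
        if len(self._items) >= self._max_size:
            self.dropped += 1
            return False
        self._items.append(item)
        self.enqueued += 1
        return True

    def drain(self, batch_size: int) -> list:
        """Remove and return up to batch_size items in FIFO order."""
        batch = []
        while self._items and len(batch) < batch_size:
            batch.append(self._items.popleft())
        return batch


queue = CountingBatchQueue(max_size=2)
for span in ["span-1", "span-2", "span-3"]:
    queue.offer(span)
print(queue.enqueued, queue.dropped)
```

Exposing counters like `dropped` as metrics with stable names is what would let an operator alert on silent data loss instead of discovering it days later.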

Contributor Author

I think that's reasonable. OpenTelemetry definitely isn't finished if we are still missing critical forms of self-observability. Possibly related is OpAMP management and health reporting for the SDKs.


### Stability: Collector v1.0

Managed by Collector SIG, the OpenTelemetry Collector needs to complete [its roadmap for v1.0](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/ga-roadmap.md). This includes marking core APIs as well as a minimal OTLP distro as 1.x.

The opentelemetry-collector-contrib repository contains over 200 components. [Initial adopter interview findings](https://docs.google.com/document/d/1SQMdfYpCiBfpxtWDwASXVIl-PIzD9X4vdDPXYUphAF0/edit?tab=t.0) revealed that, although many of these are considered not 'core' and are instead community-supported, end-users rely on them. General availability therefore will include [the stability of additional priority components identified in 'phase 1'](https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/44130).

Marking further components or distros as 1.x is explicitly out of scope.

Check warning on line 73 in projects/otel-ga.md

GitHub Actions / spelling-check

Unknown word (distros)

### Stability: SemConv Tooling for Instrumentation

In order to lower the cost of managing instrumentation at scale, and to better support native
instrumentation efforts, we need tooling that improves correctness while reducing the cost of
maintaining instrumentation. The SemConv Tooling SIG is in charge of this project.

* Weaver
* AI coding
Member

AI coding - bit vague - could we add a one-liner to clarify what we mean?

Contributor Author

Definitely need more details here, but I want the SemConv Tooling SIG to take a look at it before I put too many words in their mouth 🙂

* Test harnesses

### Stability: Instrumentation

The biggest barrier to general availability is unstable instrumentation.
Member

Let's also list who owns this piece. I think we are moving to a model where the maintainers pick a set of key instrumentations and treat them like the core API/SDK? That was my understanding from the wording "move away from community contrib" model!
(💯 agree with the change; this comment is about making the ownership clear.)

Contributor Author

I completely agree. Actually, figuring out who owns instrumentation is the biggest challenge in all of this work, in my opinion. Currently, no one owns it. And we have no one available to own it: it's unfair to assume that the current SDK maintainers can also take on the work of maintaining instrumentation, even with better tooling.

So, an important part of this workstream is figuring out a new model that offers some kind of reward for organizations that put in the work needed to maintain all of this instrumentation.

Member

> Actually, figuring out who owns instrumentation is the biggest challenge in all of this work, in my opinion. Currently, no one owns it.

Sharing my observations:

In my opinion, it is fair to ask each language's core repo maintainers/approvers to identify a list of libraries that they deem important for their ecosystem, and to make themselves owners of them. For years, OTel .NET maintained instrumentation libraries in the core repo itself. It actually helps with validating things e2e as well.
I am currently doing this in OTel Rust by owning the instrumentation library for the most important library ourselves (of course, there are other owners as well).

Rewarding organizations is a good idea, but we need a concrete plan.


* Move away from the “community contrib” model for critical instrumentation packages.
* Deploy the new SemConv tooling across all language ecosystems.
* Badges and other forms of recognition
* Native instrumentation push to move instrumentation out of OpenTelemetry.
Member

Native instrumentation - do we mean libraries/frameworks natively taking a dependency on the OpenTelemetry API and instrumenting themselves, without the need for us to make an instrumentation library package?

Contributor Author

Yes. There are two directions a community/contrib instrumentation package could gain more support. One is that a trusted group within OpenTelemetry maintains the package. The second is that the library itself includes the instrumentation natively, so there is no need for OpenTelemetry to maintain a separate package.

In both cases, we've identified a lack of tooling as a barrier. It's difficult to write instrumentation that matches the semantic conventions without making mistakes. So, once we have better tooling, we have an opportunity to try to upstream instrumentation. This would be preferable to maintaining a separate instrumentation package.

Member

I am not sure on framing tooling (or lack of it) as the main barrier. If I'm a library owner deciding whether to take a direct dependency on the OTel API, my decision tree looks roughly like this:

  1. API stability & long-term support — Is the OTel API in my language stable, and is there a guarantee of support for at least ~3 years? I can't take a dependency on something that might force me to react to breaking changes every year.
  2. Performance — Does depending on the OTel API regress my library on the no-op path? Some cost when the SDK is enabled is acceptable; cost when telemetry is disabled is not, because it directly slows down every user of my library, whether they use OTel or not.
  3. Semantic convention stability — Is the semantic convention I'm being asked to emit stable? If it churns, I either ship breaking changes downstream or carry compatibility layers/opt-in-out flags forever.
  4. Tooling / validation — Then — is there tooling to validate that what I produce matches what I'm supposed to produce?

The current wording treats (4) as the primary blocker, but (1)–(3) come first IMHO for any library owner. I also don't recall library authors citing tooling as their blocker in prior discussions; the concerns I've consistently heard are stability and performance. If the SemConv Tooling SIG has data showing otherwise, it would be great to link it. (I only have anecdotal evidence, so I'm happy to correct my position once I learn more.)

For this workstream to actually deliver native instrumentation at scale, I think we need explicit commitments on (1)–(3) — API LTS guarantees, a no-op/hot-path performance commitment, and semconv stability — alongside the tooling work.
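
As a rough sketch of the no-op path concern in (2), the example below shows the general pattern of an API that defaults to a shared no-op implementation until an SDK is installed. This is not the OpenTelemetry API: `NoOpTracer`, `set_tracer`, `get_tracer`, and `handle_request` are hypothetical names used only for illustration.

```python
class _NoOpSpan:
    """Span that does nothing; a single shared instance avoids allocations."""

    def set_attribute(self, key, value) -> None:
        return None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        return False  # never swallow exceptions from the instrumented code


_NOOP_SPAN = _NoOpSpan()


class NoOpTracer:
    """Default tracer used when no SDK has been installed."""

    def start_span(self, name: str) -> _NoOpSpan:
        return _NOOP_SPAN


_tracer = NoOpTracer()


def set_tracer(tracer) -> None:
    """Install a real tracer (what an SDK would do at startup)."""
    global _tracer
    _tracer = tracer


def get_tracer():
    return _tracer


def handle_request(payload: str) -> str:
    """Library code: instrumented once, nearly free when telemetry is off."""
    with get_tracer().start_span("handle_request") as span:
        span.set_attribute("payload.size", len(payload))
        return payload.upper()


print(handle_request("abc"))
```

With this shape, the cost a library pays when telemetry is disabled is a few method calls on a shared object, which is the kind of property a no-op performance commitment would need to pin down and measure per language.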


### Deployment: OpAMP v1.0

In order to manage OpenTelemetry at scale, we need a control plane. Therefore, OpAMP needs
to be stable and feature complete for its core set of management tasks.

The OpAMP SIG is in charge of this project.

### Deployment: Packaging v1.0

The Packaging SIG is in charge of this project.

* Official packages for Debian and RHEL
* Cross language definitions for distributions and versioning
* Language distributions for SDKs, plugins, and instrumentation
* Declarative configuration for managing instrumentation and stability

### Deployment: Kubernetes Operator v1.0
Member

The Operator is the go-to k8s solution for many users, but many other users prefer to use the helm charts to install OpenTelemetry in Kubernetes. How do they fit into the GA picture?

Contributor Author

My understanding is that our current approach is for the OTel Helm Charts to just install the Operator. Am I off base in that assumption? Do we need a better solution than that?

Member

The opentelemetry-operator chart is one way to install the operator, and its 1.0 could be molded into this effort. But separately there is the opentelemetry-collector chart which installs the collector directly.

We have purposefully never taken a stance like "you should use the opentelemetry operator to install a collector in kubernetes," because it's not accurate to always suggest the operator.

Member

I think it would make sense to handle the helm charts via the respective efforts tho:

  • opentelemetry-operator 1.0 with the Operator effort
  • opentelemetry-collector 1.0 with the Collector effort

But I think it is dangerous to claim for GA that only the Operator needs 1.0, as that would be taking a strong stance that the Operator is the Official OpenTelemetry Way to install a collector on Kubernetes.

If the GC/TC wants to take that stance, I think it's worth discussing further.

Contributor Author

Well, if the opentelemetry-collector helm chart only installs the Collector, that's pretty limiting, isn't it? I'm definitely not opposed to listing it. But I'd like to get a better understanding about whether or not we should be expanding our helm offerings beyond these two options.


The Kubernetes Operator needs several features in order to make OpenTelemetry generally
available on Kubernetes. Operator SIG is in charge of this project.

* Removing the need for pod attribution and other manual configuration requirements that interfere
with deploying OpenTelemetry at scale.
* All major languages supported.
Member

Is this bullet specifically referring to auto-instrumentation? If so we should be explicit.

Contributor Author

It is, the intention is that all languages that have an auto-instrumentation mechanism can be installed via both the Operator and via system packaging.

Member

Suggested change
* All major languages supported.
* Auto-instrumentation supported for all major languages.

Do we have an existing definition for major language or is that distinction left up to the operator/packaging sig?

Contributor Author

We usually mean "Java, .NET, NodeJS, Python, Ruby, PHP" as those are the popular languages that also have an auto-installation path. Maybe there's a better name for it than "major languages," which could be seen as rude. Go could also be on the list if OBI-based auto-instrumentation gets to a good spot. I'm not sure what is required for other languages such as Erlang.

* Works with the same distributions and configuration options as developed by the package
management SIG, so that end users only need to learn a single set of configuration patterns.

### Security

While project-wide security protocols can always be improved, the following tasks have been
identified as high priority. The Security SIG is in charge of this project.

* Better staffing of the Security SIG.
* Triage that can keep up with AI-powered CVE reporting noise.
* Oversight to ensure that security response protocols are being followed.
* OTel scanning tools to counter noisy and inaccurate scanning tools used by end users.

### Roadmaps & Project Management

While OpenTelemetry has been making many incremental improvements to its project management
tools and workflows, we have identified a need for larger changes to our project structure.

OpenTelemetry always had a de-facto roadmap – first traces, then metrics, then logs. That
roadmap is now complete. Once we have delivered this final set of work needed to make the
original goals generally available, the roadmap will be unwritten. How will we write it as
a community?

The Governance Committee is in charge of this project, but everyone's input is needed.

* More agency and responsibility for the roadmap in the hands of the maintainers.
* Better ways to socialize the projects that need input.
* Better ways to visualize the long term roadmap.

## Out of scope

There is additional work that we see as necessary for the success of the project. However, in
order to stay focused as a community, we must limit the number of large projects that we take
on at any one time – we cannot ask maintainers to have five or six "number one priorities."

The following projects are seen as important for the long term success of OpenTelemetry, but
not actually necessary to deliver stable components that are deployable at scale.

### Performance / Benchmarking
Member

Performance is one of the key things we list in our mission, so listing it as out-of-scope reads a bit oddly. If this is a bandwidth call rather than a position change, could we say so explicitly?

Member

@martincostello and I had discussed some work to help with this effort, hopefully we can help solve bandwidth problem to a certain extent.

Contributor Author

Yes I'm only putting it as out of scope because I feel like we can't ask maintainers to focus on ten things at once. We want maintainers to focus on instrumentation, packaging, and declarative config because those things are necessary for OpenTelemetry to be stable and manageable at scale.

Improving performance is important, but it is not necessary for stability or deployability. So I want to say that at this time, improving performance is something optional that individual SIGs can work on, but not something we are trying to standardize across all implementations by producing a set of standard benchmarks or something like that.

In the same way that the SemConv Tooling SIG has been working on tools for managing instrumentation, a group could work on performance benchmarks that they could then propose to the community. In the past, this effort has always died because we have not found a useful way to define universal performance benchmarks, and SIGs have instead made better progress working on performance on a case-by-case basis, based on user complaints.

Member

I opened an OTEP, open-telemetry/opentelemetry-specification#5118, to get cross-implementation performance tracked centrally. It shows prototypes as well.


OpenTelemetry maintainers are always encouraged to continue optimizing the performance
of the code that they are in charge of.

### Long term support

The primary goal of this initiative is to get all of the necessary components to v1.0.
OpenTelemetry does have compatibility and support requirements for various types of
components, such as APIs and plugin architectures. Once we have reached this milestone, we
can revisit our compatibility requirements and long term support guarantees.

### Better designs and architecture

The core specifications for tracing, metrics, and logs have all been completed. However, we
are always learning. It's reasonable that we may want to revisit these designs, either to
incorporate new developments within the industry, or to address fundamental performance issues.

Let's finish shipping v1.0 first, before distracting ourselves with v2.0.
Member

Similar to my comment on the opening line: this also assumes OTel is not v1.0 yet.
https://opentelemetry.io/status/#language-apis--sdks paints a different picture. Let's clarify that we mean OTel as a whole, not pieces of it, to avoid this confusion.

Contributor Author

OTel as a whole is not "v1.0." That chart is too limited, as it does not include the work listed in this document: instrumentation stability, packaging, deployment, and management at scale are still not completed.

While it is possible to install an SDK that is v1.0, you cannot reasonably instrument a real world application using only stable components. Today, if we shipped a "stable" distribution of OpenTelemetry in any language, that contained only components that are v1.0 or greater, it would contain no instrumentation and thus be completely useless in a real world scenario. A stable SDK simply isn't enough.

Furthermore, there's currently no way for an organization to deploy that stable distribution at scale, except for a couple of languages in a single environment: Kubernetes. These two hurdles – stability and deployability – are huge gaps, which is why they were flagged as part of the due diligence process done for OpenTelemetry's graduation.
