-
-
Notifications
You must be signed in to change notification settings - Fork 7
Starfish Tracing Model #83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 4 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,181 @@ | ||
| - Start Date: 2023-03-29 | ||
| - RFC Type: feature | ||
| - RFC PR: [#83](https://github.com/getsentry/rfcs/pull/83) | ||
| - RFC Status: draft | ||
|
|
||
| # Summary | ||
|
|
||
| This RFC proposes a new tracing model for Sentry's performance product to better | ||
| address future product improvements. The goal of this model is to allow | ||
| **storing entire traces without gaps**, support **dynamic sampling**, **indexing | ||
| of spans** and to **extract metrics from spans pre-sampling**. This is an | ||
| evolution of the current transaction based approach. | ||
|
|
||
| # Motivation | ||
|
|
||
| Today Sentry has a strong concept of a "transaction" which appears in multiple parts of the | ||
| product. It is both the transport of span data, the billable entity and the only indexed | ||
| part of the product experience. This means that spans that exist outside of the transaction | ||
| cannot be represented and it also means that spans within a transaction are not indexed itself. | ||
|
|
||
| The existing model has worked well for us to get started with evolving the Sentry errors | ||
| product to capture performance traces, but it has restricted out ability to evolve the product | ||
|
mitsuhiko marked this conversation as resolved.
Outdated
|
||
| forward. It has created some restrictions on the SDK technology side (from high | ||
| memory pressure, payload size limits) and also has promoted a separate of transaction to | ||
| span on the API layer which is untypical for tracing products. It also has meant that Sentry | ||
|
mitsuhiko marked this conversation as resolved.
Outdated
|
||
| has challenges with accepting traces coming directly from an OpenTelemetry exporter as the | ||
| transaction concept is not a concept that OpenTelemetry has. | ||
|
|
||
| We want to set a future direction that enables more flexible product choices and that we can | ||
| move towards from our existing tracing model. The goals are: | ||
|
|
||
| * Support capturing entire traces | ||
| * Have a data model story that allows us higher compatibility with Open Telemetry. Specifically | ||
| we want a model that would permit us to ingest Open Telemetry data right from an exporter | ||
| * Have a clear story for indexing and extracing metrics on a per-span level | ||
| * Unified spans and transactions from an SDK perspective | ||
| * Enable a path that allows clients to send 100% of spans outwards at least to a local aggregator, | ||
| preferrably a remote relay | ||
|
Comment on lines
+37
to
+38
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Which types of aggregation do you have in mind here ? |
||
| * To better and directly support dynamic sampling in the core tracing model | ||
|
|
||
| We want to lay out a better path forward that | ||
|
|
||
| * capture entire traces | ||
| * browser tabs = a trace | ||
| * index and extract metrics on a span level | ||
| * clients send 100% of metrics | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe this is in the TODO, but how does this proposal help with this "100% sending" or "memory-pressure/payload size" issue listed above? what about the current CPU overhead? It is helpful for me to remember our own sampling limits, even after dynamic sampling has been implemented, are often set extremely low for performance reasons https://github.com/getsentry/sentry/blob/cc47128efd5a00e66b2475acc318ff57860fdf94/src/sentry/utils/sdk.py#L35
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. even with a local relay I don't understand how we could ever have metrics sampled enough to be useful for SLAs / Alerts where performance of the process is not severely degraded in medium throughput and above processes. It seems to me the only way to accomplish this is for metrics collection directly in the process.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ah maybe this is what
is going to address? |
||
| * dynamic sampling narrows in on traces | ||
|
|
||
| # Terms | ||
|
|
||
| The new tracing model is an extension to our existing tracing model. As such we try go adhere | ||
|
mitsuhiko marked this conversation as resolved.
Outdated
|
||
| to some of the existing terms. Note that this document is intentionally glossing over some of | ||
| the details to better describe the desired end result. Individual RFCs will have to be written | ||
| to narrow down on specific schema definitions. | ||
|
|
||
| ## Session | ||
|
|
||
| A session is an optional concept when talking about user actors on the system. A | ||
| session outlives one or more traces and is exclusively used when talking about human | ||
| interactions with a system. | ||
|
|
||
| ## Trace | ||
|
|
||
| A trace has no end. It bundles spans together, some of those spans are organized | ||
| into segments (marked in red in the graph). The user experience does not center | ||
| around a trace, which really is an internal way to bundle things together but it | ||
| narrows down on segments within the trace. | ||
|
|
||
| ```mermaid | ||
| gantt | ||
| title Example Starfish Trace | ||
| dateFormat x | ||
| axisFormat %S.%L | ||
|
|
||
| section Frontend | ||
| /checkout :crit, 0, 1500ms | ||
| GET /api/session :150, 170ms | ||
| POST /api/analytics :190, 70ms | ||
| GET /api/checkout/state :200, 500ms | ||
| GET /api/checkout/cart :1100, 140ms | ||
| <App/> :1300, 180ms | ||
| POST /api/analytics :done, 1450, 70ms | ||
| GET /assistent/poll :done, 1450, 120ms | ||
| POST /api/analytics :done, 1580, 70ms | ||
|
|
||
| section API Service | ||
| /api/checkout/state :crit, 240, 440ms | ||
| cache.get session#58;[redacted] :360, 10ms | ||
| db.query select from session :370, 20ms | ||
| db.query select from user :390, 20ms | ||
| db.query select from checkout :410, 20ms | ||
| http.request GET http#58;//payments/poll :450, 210ms | ||
| thread.spawn refresh-checkout-cache :done, 670, 220ms | ||
|
|
||
| section Payment Service | ||
| /poll :crit, 470, 180ms | ||
| db.query select from payment :490, 30ms | ||
| db.query update payment :530, 60ms | ||
| ``` | ||
|
|
||
| ## Span | ||
|
AbhiPrasad marked this conversation as resolved.
|
||
|
|
||
| Spans are very similar to how the function today, but they get elevated to a more significant | ||
|
mitsuhiko marked this conversation as resolved.
Outdated
|
||
| level. They largely follow the general semantics in the wider tracing eco system. To drive | ||
| our product ideas we are going to ensure that the quality of the spans is high and that they | ||
| provide at least the following pieces of information: | ||
|
|
||
| * `op`: defines the core operation that the span represents (eg: `db.query`) | ||
| * `description`: the most significant description of what this operation is (eg: the database query). | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would heavily be in favor of re-naming this to For those unfamiliar we currently have an inconsistency in the sentry schema where transactions have a I think most users are familiar with name as an attribute (closer to otel and other vendors as well then), and it would require less delta for us to approach. We even alias span descriptions to names in the UI
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think if we want to get rid of
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we need to get rid of description, simply just call it |
||
| The description also gets processed and cleaned in the process. | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How do we expect to use this? @nikhars , if this is going to be a searchable field you may want to look into ngram indexes. |
||
| * `trace_id`: a span relates to one trace by ID | ||
|
mitsuhiko marked this conversation as resolved.
|
||
| * `parent_id`: a span optionally points back to a parent trace which could be from a different | ||
|
mitsuhiko marked this conversation as resolved.
Outdated
|
||
| service | ||
| * `segment_id`: a span that is part of a segment, always refers back to it. | ||
|
mitsuhiko marked this conversation as resolved.
Outdated
|
||
| * `is_segment`: when set to `true` this span is a segment. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If segment is meant to be either a transaction or an interaction, should we build it into the spec here? so |
||
| * `start_time` and `end_time` to give the span time information. | ||
| * `tags`: a key/value pair of arbitrary tags that are set per-span. Conceptionally however spans | ||
|
mitsuhiko marked this conversation as resolved.
Outdated
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently there are a couple ways to store arbitrary data on the existing schema: on transactions you have tags, contexts and extras (inherting from error events) on spans you have tags and data. As per our spec, tags are restricted to string keys -> string values, and have strict size limits., while data is restricted to string keys, but can have arbitrary values (with more liberal size limits). This has lead to most of the SDKs storing arbitrary values in span data, and the OpenTelemetry conversion setting span attributes (otels arbitrary span data) in the data field as well. Span data is now also used for semantic conventions, similar to how opentelemetry semantic conventions use otel span attributes. I would like to keep this behaviour. For example, the I would like an arbitrary place to continue to store data to still exist, so that means we either move to saying this is
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that we should be keeping this. I won't have time yet to focus on the span schema so if you want, maybe you can sketch this out directly on the RFC with what the schema should look like? |
||
| also inherit the tags of the segments they are contained in. | ||
| * `measurements`: are span specific metrics that are stored with the span. There is a natural | ||
| measurement of a span which is the `duration` which is automatically calculated from the difference | ||
| of the end to the start timestamp. | ||
|
|
||
| Spans always belong to a trace, but not all spans belong to a segment. A span not belonging to a | ||
| segment are referred to as "detached" spans, spans that belong to a segment are "attached" spans. | ||
| A span must only be attached to a segment if it belongs to the same process and service. Remote | ||
| spans must never be attached to a segment. | ||
|
mitsuhiko marked this conversation as resolved.
|
||
|
|
||
| ## Transaction | ||
|
|
||
| Previously transaction referred to the container that held spans. In the future a "transaction" | ||
| refers to the name of a specific type of segment level span that describes a meaningful activity | ||
| that starts with a request, and results in some meaningful response. By definition a span that | ||
| holds a transaction tag becomes a "segment". Other than that, it's just a regular span. | ||
|
|
||
| ## Segment | ||
|
|
||
| A segment is a special type of span that is the "logical" activity in a service. For instance a | ||
| segment can be the endpoint implementation of an API request, it could be a task that is processed | ||
| by a task worker, it could be the navigation a user performs in a UI or a screen transition. | ||
| Conceptionally segments fall into two categories: "transactions" which are quite mechanical and | ||
| clearly defined operations and "interactions" which are user triggered operations. The difference | ||
| is that an "interaction" has a human user as an actor in it, that might influence it, whereas a | ||
| "transaction" is unlikely to be interrupted once started. A user for instance is quite likely to | ||
| navigate again even before the previous interaction finished, whereas a task is more likely than no | ||
|
mitsuhiko marked this conversation as resolved.
Outdated
|
||
| to conclude, even if what triggered the task is no longer interested in it's result. | ||
|
mitsuhiko marked this conversation as resolved.
Outdated
|
||
|
|
||
| The primary user experience in the product can narrow down on certain segments and make the trace | ||
| explorable via that segment. | ||
|
|
||
| **Locality:** All the spans that are attached to a segment thus must be local to the service and process. It's | ||
| still possible for a span to relate to a child of a segment or a segment directly via the `parent_id`, | ||
| but the `segment_id` must not be set. | ||
|
|
||
| **Joining:** At the end of a segment an implicit join is taking place. Any span | ||
| *that has not concluded | ||
| yet will be detached from the segment. In the following example the `<App/>` | ||
| span is part of the segment `/checkout` still where as the HTTP request related | ||
| spans that did not finish when the `/checkout` segment ended are then detached: | ||
|
|
||
| ```mermaid | ||
| gantt | ||
| title Trace Showing Attached and Detached Spans | ||
| dateFormat x | ||
| axisFormat %S.%L | ||
|
|
||
| section Frontend | ||
| /checkout :crit, 0, 500ms | ||
| <App/> :300, 180ms | ||
| POST /api/analytics :done, 450, 70ms | ||
| GET /assistent/poll :done, 450, 120ms | ||
| POST /api/analytics :done, 580, 70ms | ||
| ``` | ||
|
|
||
| # Drawbacks | ||
|
|
||
| # TODO | ||
|
|
||
| * metrics extraction | ||
| * tag propagation | ||
| * transitional mapping | ||
|
|
||

Uh oh!
There was an error while loading. Please reload this page.