Commit 5d1508c

content: add "What is distributed tracing?" explainer (#17309)
* content: add "What is distributed tracing?" explainer

Adds a high-search-intent product-engineers explainer targeting the generic "distributed tracing" query, mirroring how Datadog's Knowledge Center and New Relic's concepts pages capture top-of-funnel search and funnel readers into the product docs.

The page also explicitly disambiguates distributed tracing (APM) from LLM tracing (AI observability), directly addressing the search-collision that prompted this work.

Author byline is a placeholder (ian-vanagas) and should be confirmed/reassigned before merge.

Generated-By: PostHog Code
Task-Id: 75c5f034-5325-4ced-8eb3-e5bcc57f6dec

* content: address Vale prose warnings

- Capitalize PostHog product names (Session Replay, Error Tracking, Logs, Product Analytics) in the "with PostHog" section, since they refer to the products
- Drop the colon from the "How distributed tracing works" heading
- Replace "de facto" and the unused "OTel" abbreviation with plain wording

Leaves the general-concept "logs"/"metrics" mentions lowercase (correct per Vale's own ProductNames rule) and the immutable Cloudinary image URL as-is.

Generated-By: PostHog Code
Task-Id: 75c5f034-5325-4ced-8eb3-e5bcc57f6dec

* Update contents/product-engineers/what-is-distributed-tracing.md

Co-authored-by: Natalia Amorim <natalia@posthog.com>

* Update what-is-distributed-tracing.md

* content(distributed-tracing): address review feedback

- Add daniel-visca to authors.json and use handle in frontmatter so byline renders properly.
- Add internal links per Natalia's suggestions: /docs/distributed-tracing for the opening sentence, /ai-observability for the LLM tracing aside, and link Logs / Error Tracking / Distributed tracing in the signals table.
- Add a WizardCTA + brief lead-in at the end of the PostHog section.
- Add an FAQ section covering the most common follow-on questions (tracing vs distributed tracing, vs logs/metrics, setup, vs LLM tracing, correlation with replays/errors).

Generated-By: PostHog Code
Task-Id: a323f66d-6d1a-440e-bfdf-4aadc9366d61

* content(authors): set Daniel Visca profile_id

Generated-By: PostHog Code
Task-Id: a323f66d-6d1a-440e-bfdf-4aadc9366d61

* content(authors): set Daniel Visca role to Product Engineer

Generated-By: PostHog Code
Task-Id: a323f66d-6d1a-440e-bfdf-4aadc9366d61

* content(authors): use LinkedIn link for Daniel Visca

Generated-By: PostHog Code
Task-Id: a323f66d-6d1a-440e-bfdf-4aadc9366d61

* content(authors): update Daniel Visca role

Generated-By: PostHog Code
Task-Id: a323f66d-6d1a-440e-bfdf-4aadc9366d61

---------

Co-authored-by: Natalia Amorim <natalia@posthog.com>
1 parent 43f7575 commit 5d1508c

2 files changed

Lines changed: 161 additions & 0 deletions

contents/product-engineers/what-is-distributed-tracing.md

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
1+
---
2+
title: 'What is distributed tracing? (A guide for engineers)'
3+
date: 2026-06-05
4+
author:
5+
- daniel-visca
6+
featuredImage: >-
7+
https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/blog/happy-hog.png
8+
featuredImageType: full
9+
tags:
10+
- Engineering
11+
- Product engineers
12+
crosspost:
13+
- Blog
14+
seo:
15+
metaTitle: 'What is distributed tracing? A guide for engineers'
16+
metaDescription: >-
17+
Here's how traces and spans work, what tracing shows you that logs and metrics can't, and how to set it up.
18+
---
19+
20+
Modern apps rarely live in one process. A single user action fans out across a web server, a handful of internal services, a queue or two, a database, and a few third-party APIs. When that request is slow or fails, the hard part isn't fixing the bug; it's finding *which* of those hops caused it.
21+
22+
**Distributed tracing** is how you find it. It follows a single request end to end, across every service it touches, and shows you exactly where the time went and where things broke.
23+
24+
This guide covers what distributed tracing is, how traces and spans work, what it shows you that logs and metrics can't, and how it differs from the other thing people call "tracing" these days: LLM tracing.
25+
26+
## What is distributed tracing?
27+
28+
[Distributed tracing](/docs/distributed-tracing) is a technique for tracking a request as it moves through a distributed system, recording the path it takes and the time it spends at each step. The result, a **trace**, is the complete story of one request: what called what, in what order, and how long each part took.
29+
30+
It's the third pillar of observability, alongside [logs](/docs/logs) (discrete events: "this happened at this time") and **metrics** (aggregates over time: "p99 latency was 800ms"). Logs and metrics tell you *that* something is wrong. Distributed tracing tells you *where*.
31+
32+
The word "distributed" matters. Tracing a request inside a single process is useful, but the real payoff comes when a request crosses service boundaries, queues, and network calls. Those are the exact places where it's hardest to reason about what happened. Distributed tracing stitches those separate hops back into one connected view.
33+
34+
## How distributed tracing works
35+
36+
A trace is made up of **spans**. Each span is a single unit of work, like an incoming HTTP request, a database query, or a call to a payment API.
37+
38+
Spans nest into a tree. The incoming request is the **root span**, and everything it triggers (every downstream call, query, and job) becomes a child span underneath it. Each span records:
39+
40+
- A **name** and **service**: what ran, and where
41+
- A **start time** and **duration**: when it happened, and how long it took
42+
- A **status**: whether it succeeded or failed
43+
- **Attributes**: any context you attach, like a user ID, a query, or a feature flag value
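To make the span fields concrete, here is a minimal sketch of a span record and the tree it forms. The `Span` class, its field names, the `render_waterfall` helper, and the sample data are all hypothetical illustrations, not the OpenTelemetry data model or any PostHog API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical span record mirroring the fields listed above."""
    span_id: str
    parent_id: str | None      # None marks the root span
    name: str
    service: str
    start_ms: int              # offset from the start of the trace
    duration_ms: int
    status: str = "ok"         # "ok" or "error"
    attributes: dict = field(default_factory=dict)


def render_waterfall(spans: list[Span]) -> list[str]:
    """Return one indented line per span, parents before their children."""
    children: dict[str | None, list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: str | None, depth: int) -> None:
        # Visit siblings in start order so the output reads like a waterfall.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            indent = "  " * depth
            lines.append(f"{indent}{span.service}: {span.name} ({span.duration_ms}ms, {span.status})")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Made-up example data: one checkout request with two downstream calls.
trace = [
    Span("a", None, "POST /api/checkout", "web", 0, 3200, attributes={"user_id": "u_123"}),
    Span("b", "a", "charge_card", "payments", 50, 300),
    Span("c", "a", "reserve_items", "inventory", 360, 2800),
]
waterfall = render_waterfall(trace)
print("\n".join(waterfall))
```

The parent pointer is all it takes to rebuild the tree: group spans by `parent_id`, then walk down from the root.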
44+
45+
Two ideas make this work across services:
46+
47+
1. **A shared trace ID.** Every span in a single request carries the same `trace_id`. That's what lets you reconstruct the whole request as a waterfall, even when its spans were emitted by five different services on five different machines.
48+
49+
2. **Context propagation.** When one service calls another, it passes the trace context (the trace ID and the current span ID) along with the request, usually in HTTP headers. The receiving service attaches its spans to that same trace. This is the "distributed" part, and it's what separates distributed tracing from simple in-process timing.
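As a rough illustration of both ideas, the sketch below passes trace context between two services in a `traceparent` header, the format defined by the W3C Trace Context specification (`version-traceid-spanid-flags`). The helper names (`inject`, `extract`, `new_trace_id`, `new_span_id`) are hypothetical, and the parsing is deliberately simplified: a real propagator also handles `tracestate`, future versions, and all-zero IDs.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex chars


def new_span_id() -> str:
    return secrets.token_hex(8)   # 8 random bytes -> 16 hex chars


def inject(headers: dict, trace_id: str, span_id: str) -> None:
    """Caller side: attach the current trace context to outgoing headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def extract(headers: dict):
    """Receiver side: return (trace_id, parent_span_id), or None if absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, _flags = match.groups()
    return trace_id, parent_span_id


# Service A starts a trace and calls service B.
trace_id = new_trace_id()
caller_span_id = new_span_id()
outgoing_headers: dict = {}
inject(outgoing_headers, trace_id, caller_span_id)

# Service B reads the header and attaches its own span to the same trace.
received_trace_id, parent_span_id = extract(outgoing_headers)
child_span = {
    "trace_id": received_trace_id,
    "span_id": new_span_id(),
    "parent_id": parent_span_id,
}
print(outgoing_headers["traceparent"])
```

In practice you rarely write this by hand, because instrumentation libraries inject and extract the header for you. The point is that nothing more than one small header has to cross the wire.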
50+
51+
The industry standard for all of this is [OpenTelemetry](https://opentelemetry.io/), a vendor-neutral set of APIs, SDKs, and a wire protocol (OTLP) for generating and exporting traces. Instrument with OpenTelemetry once, and you can send your traces to any compatible backend without rewriting your code.
52+
53+
## What distributed tracing shows you that logs and metrics can't
54+
55+
Each observability signal answers a different question:
56+
57+
| Signal | What it tells you | Example |
58+
|--------|-------------------|---------|
59+
| **Metrics** | The aggregate shape of behavior | "p99 checkout latency is 3.1s" |
60+
| [**Logs**](/docs/logs) | What happened at one point | "Inventory service returned 200 with 0 items" |
61+
| [**Errors**](/docs/error-tracking) | What broke | "TypeError: cannot read property 'price' of undefined" |
62+
| [**Distributed tracing**](/docs/distributed-tracing) | How the request flowed and where time went | "Checkout took 3.2s, and 2.8s of it was waiting on the inventory service" |
63+
64+
Metrics tell you something is slow. Logs tell you what happened at individual points, but you have to manually correlate them across services. Errors tell you something threw. Distributed tracing tells you **how the pieces connected, and where the time and failures actually came from**, across every service a request touched.
65+
66+
In a single process, you can often guess. Once a request fans out across services, queues, and third-party APIs, guessing stops working. Tracing replaces the guesswork with a map.
67+
68+
## When you need distributed tracing
69+
70+
### A request is slow, but you don't know which part
71+
72+
"The checkout endpoint is slow" normally sends you back to the code to add timers by hand. With tracing, you open the trace and read the waterfall top to bottom. The handler is fast, the payment call is fast, but one span sits at 2.8 seconds because the inventory service runs a separate query for every item in the cart instead of one query for the whole cart. You found the N+1 in seconds.
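Reading the waterfall is usually enough, but the same check can be sketched in code. The example below assumes spans exported as plain dictionaries with made-up field names and data. It finds the slowest child of the root span, then looks for the same span name repeated underneath it, which is the signature of an N+1 query.

```python
from collections import Counter

# Hypothetical exported spans for one checkout trace (a cart of four items).
spans = [
    {"id": "a", "parent_id": None, "name": "POST /api/checkout", "duration_ms": 3200},
    {"id": "b", "parent_id": "a", "name": "charge_card", "duration_ms": 300},
    {"id": "c", "parent_id": "a", "name": "reserve_items", "duration_ms": 2800},
] + [
    {"id": f"q{i}", "parent_id": "c", "name": "db.query load_item", "duration_ms": 700}
    for i in range(4)
]


def slowest_child(spans, parent_id):
    """Return the direct child of parent_id with the longest duration."""
    children = [s for s in spans if s["parent_id"] == parent_id]
    return max(children, key=lambda s: s["duration_ms"])


def repeated_children(spans, parent_id, min_repeats=3):
    """Return span names that occur at least min_repeats times under one parent."""
    counts = Counter(s["name"] for s in spans if s["parent_id"] == parent_id)
    return {name: count for name, count in counts.items() if count >= min_repeats}


bottleneck = slowest_child(spans, "a")
suspects = repeated_children(spans, bottleneck["id"])
print(bottleneck["name"], suspects)
```

The `min_repeats` threshold is an arbitrary choice for the sketch; the right value depends on what a normal request looks like in your system.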
73+
74+
### A failure crosses service boundaries
75+
76+
A user hits an error on the frontend, but the root cause is three services deep. With tracing, you follow the `trace_id` from the failed request down through each service it called and land on the span that actually failed: a downstream auth service returning 401 because a token expired mid-request.
77+
78+
### Latency only happens sometimes
79+
80+
Your endpoint is usually fast, but your p99 is terrible and you can't reproduce it. Averages hide the problem. With tracing, you filter to the slow traces and compare them against the fast ones. The slow traces all share one span: a cache miss that falls through to a cold database query.
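The slow-versus-fast comparison can be sketched as a set difference: which span names show up in every slow trace but in none of the fast ones? The trace summaries, field names, and the 1,000ms threshold below are assumptions made up for illustration.

```python
def spans_unique_to_slow(traces, threshold_ms):
    """Return span names present in every slow trace and absent from all fast ones."""
    slow = [t for t in traces if t["duration_ms"] >= threshold_ms]
    fast = [t for t in traces if t["duration_ms"] < threshold_ms]
    if not slow:
        return set()
    in_every_slow = set.intersection(*(set(t["span_names"]) for t in slow))
    in_any_fast = set().union(*(set(t["span_names"]) for t in fast))
    return in_every_slow - in_any_fast


# Hypothetical trace summaries for one endpoint.
traces = [
    {"trace_id": "t1", "duration_ms": 120, "span_names": ["GET /api/cart", "cache.get cart"]},
    {"trace_id": "t2", "duration_ms": 110, "span_names": ["GET /api/cart", "cache.get cart"]},
    {"trace_id": "t3", "duration_ms": 2400,
     "span_names": ["GET /api/cart", "cache.get cart", "db.query load_cart"]},
    {"trace_id": "t4", "duration_ms": 2600,
     "span_names": ["GET /api/cart", "cache.get cart", "db.query load_cart"]},
]

culprits = spans_unique_to_slow(traces, threshold_ms=1000)
print(culprits)
```

Here the only span unique to the slow traces is the database query that runs when the cache misses.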
81+
82+
### Async and background work disappears
83+
84+
A request kicks off a queue job that runs later, and there's no single stack trace spanning the gap. With context propagation, the job's spans attach to the trace that started them, so you see the whole flow even when it crosses processes and time.
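One way to picture this: the trace context rides along inside the job payload, and the worker reads it back when the job finally runs. The sketch below uses an in-memory `queue.Queue` as a stand-in for a real job queue, and the payload shape, function names, and span fields are hypothetical.

```python
import json
import queue
import secrets

jobs: queue.Queue = queue.Queue()  # stand-in for a real message queue


def enqueue_receipt_email(trace_id: str, span_id: str, order_id: str) -> None:
    """Request side: store the trace context next to the job's own data."""
    jobs.put(json.dumps({
        "order_id": order_id,
        "trace_context": {"trace_id": trace_id, "parent_span_id": span_id},
    }))


def run_next_job() -> dict:
    """Worker side: start a span that joins the trace that enqueued the job."""
    message = json.loads(jobs.get())
    context = message["trace_context"]
    return {
        "trace_id": context["trace_id"],
        "parent_id": context["parent_span_id"],
        "span_id": secrets.token_hex(8),
        "name": "job send_receipt_email",
    }


# The web request enqueues the job; the worker may pick it up much later.
request_trace_id = secrets.token_hex(16)
request_span_id = secrets.token_hex(8)
enqueue_receipt_email(request_trace_id, request_span_id, order_id="order_42")

job_span = run_next_job()
print(job_span["trace_id"] == request_trace_id)
```

Because the job's span carries the original `trace_id`, it lands in the same waterfall as the request that triggered it.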
85+
86+
## What good distributed tracing looks like
87+
88+
Useful tracing is about instrumenting the right boundaries, not every line of code.
89+
90+
1. **Trace the boundaries.** Wrap incoming requests, outgoing calls, and database queries. These are where time is spent and where things fail.
91+
2. **Give spans descriptive names.** `GET /api/checkout` and `db.query load_cart` tell you what ran at a glance. `handler` and `query` don't.
92+
3. **Add business context as attributes.** Attach the user ID, the plan, and the feature flag variant. When a trace is slow, you want to know *who* it was slow for.
93+
4. **Propagate context across services.** Pass trace context with every outgoing call so spans from different services join the same trace. This is what makes tracing *distributed*.
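The first three practices can be sketched with a small context manager that wraps a boundary, takes a descriptive name, and accepts business attributes. The `span` helper and `recorded` list below are hypothetical stand-ins. In a real service you would use your tracing library's equivalent, such as OpenTelemetry's `tracer.start_as_current_span`, which also tracks parent and child relationships that this sketch leaves out.

```python
import time
from contextlib import contextmanager

recorded = []  # stand-in for an exporter; finished spans are appended here


@contextmanager
def span(name, **attributes):
    """Time a block of work and record its name, attributes, and status."""
    record = {"name": name, "attributes": attributes, "status": "ok"}
    start = time.perf_counter()
    try:
        yield record
    except Exception:
        record["status"] = "error"
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        recorded.append(record)


def load_cart(user_id):
    # Boundary: a database query, named after what it actually loads.
    with span("db.query load_cart", user_id=user_id):
        return ["sku_1", "sku_2"]


def checkout(user_id, plan):
    # Boundary: the incoming request, with business context attached.
    with span("GET /api/checkout", user_id=user_id, plan=plan) as current:
        cart = load_cart(user_id)
        current["attributes"]["cart_size"] = len(cart)
        return cart


checkout("u_123", "pro")
print([r["name"] for r in recorded])
```

Spans are recorded as they finish, so the inner query appears before the request that wraps it.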
94+
95+
## Distributed tracing vs LLM tracing
96+
97+
If you've searched for "tracing" recently, you've probably seen the term used in two very different contexts, and it's worth being clear about the difference:
98+
99+
- **Distributed tracing** (the subject of this guide) is an APM technique. It traces a request across the services, databases, and APIs of a backend system to find latency and failures. The unit of work is a span; the standard is OpenTelemetry.
100+
101+
- **LLM tracing** (sometimes called [AI observability](/ai-observability)) traces a single call or conversation through an AI application: the prompts, model generations, tool calls, token usage, and cost. The goal is debugging and evaluating LLM behavior, not backend performance.
102+
103+
They share vocabulary ("trace" and "span") because LLM tracing borrowed the model from distributed tracing. But they answer different questions, for different audiences, with different tooling. When this guide says "tracing," it means distributed tracing for application performance. If you're debugging an AI agent or tracking LLM costs, you want [LLM observability](/docs/ai-observability) instead.
104+
105+
## Distributed tracing with PostHog
106+
107+
PostHog supports [distributed tracing](/docs/distributed-tracing) over OpenTelemetry. Because it's a standard OTLP receiver, there's no proprietary SDK: instrument your app with the OpenTelemetry libraries you already use, point your trace exporter at PostHog, and add your project token.
108+
109+
The advantage of tracing in PostHog is that your traces live in the same project as [Session Replay](/docs/session-replay), [Error Tracking](/docs/error-tracking), [Logs](/docs/logs), and [Product Analytics](/docs/product-analytics). So you don't just see that a request was slow; you can connect it to the user who experienced it and the business impact it had, all in one platform rather than in yet another observability vendor you have to run.
110+
111+
- **[Get started with distributed tracing](/docs/distributed-tracing/start-here)**: install an OpenTelemetry exporter and send your first spans
112+
- **[Why you need distributed tracing](/docs/distributed-tracing/basics)**: a deeper look at what tracing shows you that nothing else does
113+
114+
If you'd rather have an LLM walk you through setup, the PostHog Wizard can install the SDK, drop in your project token, and wire up your first exporter for you:
115+
116+
<WizardCTA />
117+
118+
## FAQ
119+
120+
<details>
121+
<summary>What's the difference between tracing and distributed tracing?</summary>
122+
123+
In practice the terms are used interchangeably. "Tracing" originally meant in-process instrumentation that times function calls within a single program. "Distributed tracing" is the same idea extended across service boundaries by propagating a shared trace context, so spans emitted by different services join the same trace. Today, when most people say "tracing" in the context of microservices, they mean distributed tracing.
124+
125+
</details>
126+
127+
<details>
128+
<summary>Do I need distributed tracing if I have logs and metrics?</summary>
129+
130+
Logs and metrics tell you *that* something is wrong; distributed tracing tells you *where*. If your system is a single service and a database, you can usually get by with [logs](/docs/logs) and a few metrics. Once a request fans out across multiple services, queues, or third-party APIs, correlating logs by hand stops scaling and you need traces to see the whole flow.
131+
132+
</details>
133+
134+
<details>
135+
<summary>How do I set up distributed tracing in PostHog?</summary>
136+
137+
PostHog ingests traces over OpenTelemetry's standard OTLP protocol, so there's no proprietary SDK to install. Instrument your app with the OpenTelemetry libraries for your language, point your trace exporter at PostHog, and add your project token. The [Get started guide](/docs/distributed-tracing/start-here) walks through it end to end.
138+
139+
</details>
140+
141+
<details>
142+
<summary>What's the difference between distributed tracing and LLM tracing in PostHog?</summary>
143+
144+
[Distributed tracing](/docs/distributed-tracing) traces a request across the services, databases, and APIs of a backend system. [LLM observability](/docs/ai-observability) traces a single call or conversation through an AI application, including prompts, model generations, tool calls, token usage, and cost. They share vocabulary ("trace" and "span") but solve different problems, and PostHog supports both as separate products.
145+
146+
</details>
147+
148+
<details>
149+
<summary>Can I correlate traces with session replays and error events?</summary>
150+
151+
Yes. Because PostHog traces live in the same project as [Session Replay](/docs/session-replay), [Error Tracking](/docs/error-tracking), [Logs](/docs/logs), and [Product Analytics](/docs/product-analytics), you can link a slow or failed trace to the user who experienced it and the session they were in. That's the main reason to keep tracing in the same platform as the rest of your observability data instead of running a separate APM vendor.
152+
153+
</details>

src/data/authors.json

Lines changed: 8 additions & 0 deletions
@@ -368,6 +368,14 @@
368368
"link_url": "https://www.linkedin.com/in/leon-daly-b5529a180/",
369369
"profile_id": 30833
370370
},
371+
{
372+
"handle": "daniel-visca",
373+
"name": "Daniel Visca",
374+
"role": "Lumberjack",
375+
"link_type": "linkedin",
376+
"link_url": "https://www.linkedin.com/in/daniel-visca",
377+
"profile_id": 43453
378+
},
371379
{
372380
"handle": "danilo-campos",
373381
"name": "Danilo Campos",
