
Commit a359c1e

PerfectSlayer and devflow.devflow-routing-intake authored
Add architecture and high-level design documentation (#10627)
feat(doc): Initial architecture documentation
feat(doc): Iterating on content
feat(doc): Add 2nd round of comments
feat(doc): Add 3rd round of comments
feat(doc): Iterate on content
feat(doc): Iterate on content
feat(doc): Iterate on content
feat(doc): Iterate on content
feat(doc): Iterate on content
feat(doc): Iterate on content
feat(doc): Iterate on content
feat(doc): Update agents.md to refer to the architecture file
feat(doc): Iterate on content
feat(doc): Iterate on content
feat(doc): Iterate on content

Co-authored-by: devflow.devflow-routing-intake <devflow.devflow-routing-intake@kubernetes.us1.ddbuild.io>
1 parent af8b844 commit a359c1e

2 files changed

Lines changed: 298 additions & 11 deletions


AGENTS.md

Lines changed: 6 additions & 11 deletions
@@ -7,26 +7,20 @@ It ships ~120 integrations (~200 instrumentations) for tracing, profiling, AppSe
 
 ## Project layout
 
+See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed module descriptions.
+
 ```
-dd-java-agent/ Main agent
-instrumentation/ All auto-instrumentations (one dir per framework)
-agent-bootstrap/ Bootstrap classloader classes
-agent-builder/ Agent build & bytecode weaving
-agent-tooling/ Shared tooling for instrumentations
-agent-{product}/ Product-specific modules (ci-visibility, iast, profiling, debugger, llmobs, aiguard, ...)
-appsec/ Application Security (WAF, threat detection)
+dd-java-agent/ Main agent (shadow jar, instrumentations, product modules)
 dd-trace-api/ Public API & configuration constants
 dd-trace-core/ Core tracing engine (spans, propagation, writer)
-dd-trace-ot/ OpenTracing compatibility layer
+dd-trace-ot/ Legacy OpenTracing compatibility library
 internal-api/ Internal shared API across modules
+components/ Shared low-level components (context, environment, json)
 products/ Sub-products (feature flagging, metrics)
 communication/ HTTP transport to Datadog Agent
-components/ Shared low-level components
 remote-config/ Remote configuration support
 telemetry/ Agent telemetry
 utils/ Shared utility modules (config, time, socket, test, etc.)
-metadata/ Supported configurations metadata & requirements
-benchmark/ Performance benchmarks
 dd-smoke-tests/ Smoke tests (real apps + agent)
 docs/ Developer documentation (see below)
 ```
@@ -35,6 +29,7 @@ docs/ Developer documentation (see below)
 
 | Topic | File |
 |---|---|
+| Architecture & design | [ARCHITECTURE.md](ARCHITECTURE.md) |
 | Building from source | [BUILDING.md](BUILDING.md) |
 | Contributing & PR guidelines | [CONTRIBUTING.md](CONTRIBUTING.md) |
 | How instrumentations work | [docs/how_instrumentations_work.md](docs/how_instrumentations_work.md) |

ARCHITECTURE.md

Lines changed: 292 additions & 0 deletions
@@ -0,0 +1,292 @@
1+
# Architecture of dd-trace-java
2+
3+
High-level architecture of the Datadog Java APM agent.
4+
Start here to orient yourself in the codebase.
5+
6+
## Bird's Eye View
7+
8+
dd-trace-java is a Java agent that auto-instruments JVM applications at runtime via bytecode manipulation.
9+
It attaches to a running JVM using the `-javaagent` flag, intercepts class loading, and rewrites method bytecode
10+
to inject tracing, security, profiling, and observability logic. No application code changes required.
11+
12+
Ships ~120 integrations (~200 instrumentations) covering major frameworks
13+
(Spring, Servlet, gRPC, JDBC, Kafka, etc.) and supports multiple Datadog products through a single jar:
14+
**Tracing**, **Profiling**, **Application Security (AppSec)**, **IAST**, **CI Visibility**,
15+
**Dynamic Instrumentation**, **LLM Observability**, **Crash Tracking**, **Data Streams**,
16+
**Feature Flagging**, and **USM**.
17+
18+
Communicates with a local Datadog Agent process (or directly with the Datadog intake APIs)
19+
to send collected telemetry.
20+
21+
## Startup Sequence
22+
23+
1. **`AgentBootstrap.premain()`** — JVM entry point. Runs on the application classloader
24+
with minimal logic: locates the agent jar, creates an isolated classloader, jumps to `Agent.start()`.
25+
Must remain tiny and side-effect-free.
26+
27+
2. **`Agent.start()`** — Runs on the bootstrap classloader. Creates the agent classloader,
28+
reads configuration, determines which products are enabled, starts each subsystem on dedicated threads.
29+
30+
3. **`AgentInstaller`** — Installs the ByteBuddy `ClassFileTransformer` that intercepts all class loading.
31+
Discovers all `InstrumenterModule` implementations via service loading, registers their
32+
type matchers and advice classes.
33+
34+
4. **Product subsystems start** — Each enabled product is started via its own `*System.start()` method,
35+
receiving shared communication objects.
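
The hand-off from the JVM to a class transformer can be pictured with the plain `java.lang.instrument` API. This is only a minimal sketch under stated assumptions: `StartupSketch` and `CountingTransformer` are hypothetical names, and the real agent installs a ByteBuddy-based transformer through `AgentInstaller` rather than a hand-written one.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

/** Hypothetical stand-in for an agent entry point; not the real AgentBootstrap. */
final class StartupSketch {
  /** The JVM calls premain before the application's main() when -javaagent is set. */
  public static void premain(String agentArgs, Instrumentation inst) {
    // Keep the entry point tiny: just register the transformer and return.
    inst.addTransformer(new CountingTransformer());
  }
}

/** Hypothetical transformer that observes class loads without changing any bytecode. */
final class CountingTransformer implements ClassFileTransformer {
  int seen = 0;

  @Override
  public byte[] transform(
      ClassLoader loader,
      String className,
      Class<?> classBeingRedefined,
      ProtectionDomain protectionDomain,
      byte[] classfileBuffer) {
    seen++;
    // Returning null tells the JVM to keep the original class bytes.
    return null;
  }
}
```

A transformer that returns `null` leaves the class untouched, which is the expected outcome for every class that no instrumentation matches.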
36+
37+
## Codemap
38+
39+
### `dd-java-agent/`
40+
41+
Main agent module. Produces the final shadow jar (`dd-java-agent.jar`) using a composite shadow jar
42+
strategy. Each product module builds its own shadow jar, embedded as a nested directory inside the
43+
main jar (`inst/`, `profiling/`, `appsec/`, `iast/`, `debugger/`, `ci-visibility/`, `llm-obs/`,
44+
`shared/`, `trace/`, etc.). A dedicated `sharedShadowJar` bundles common transitive dependencies
45+
(OkHttp, JCTools, LZ4, etc.) to avoid duplication across feature jars. All dependencies are relocated
46+
under `datadog.` prefixes to prevent classpath conflicts. Class files inside feature jars are renamed
47+
to `.classdata` to prevent unintended loading. See [`docs/how_to_work_with_gradle.md`](docs/how_to_work_with_gradle.md).
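
As a rough illustration of the `.classdata` renaming, the sketch below maps a class name to the resource path it might have inside the main jar. The exact path layout and the `ClassDataPathSketch` name are assumptions made for illustration; check the Gradle build for the real layout.

```java
/** Hypothetical helper; the path layout is an assumption based on the description above. */
final class ClassDataPathSketch {
  /**
   * Builds the resource path for a class stored under a feature directory,
   * e.g. section "inst" and class "com.example.Foo".
   */
  static String resourcePath(String section, String className) {
    // Dots become directory separators, and the suffix is .classdata instead of .class
    // so that ordinary classloaders scanning the jar never pick the file up by accident.
    return section + "/" + className.replace('.', '/') + ".classdata";
  }
}
```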
48+
49+
- **`src/`** — `AgentBootstrap` and `AgentJar`, the entry point loaded by `-javaagent`.
50+
Deliberately minimal.
51+
52+
- **`agent-bootstrap/`** — Classes on the bootstrap classloader: `Agent` (startup orchestrator),
53+
decorator base classes (`HttpServerDecorator`, `DatabaseClientDecorator`, etc.), and bootstrap-safe
54+
utilities. Visible to all classloaders, so instrumentation advice and helpers can use them directly.
55+
See [`docs/bootstrap_design_guidelines.md`](docs/bootstrap_design_guidelines.md).
56+
57+
- **`agent-builder/`** — ByteBuddy integration layer. Class transformer pipeline:
58+
`DDClassFileTransformer` intercepts every class load, `GlobalIgnoresMatcher` applies early
59+
filtering, `CombiningMatcher` evaluates instrumentation matchers, `SplittingTransformer`
60+
applies matched transformations. The `ignored_class_name.trie` is a compiled trie built at
61+
build time that short-circuits matcher evaluation for known non-transformable classes (JVM
62+
internals, agent infrastructure, monitoring libraries, large framework packages). When a class
63+
is unexpectedly not instrumented, check the trie first.
64+
65+
- **`agent-tooling/`** — Instrumentation framework. Key types:
66+
- `InstrumenterModule` — Base class for all instrumentation modules. Declares a target system
67+
(Tracing, AppSec, IAST, Profiling, CiVisibility, USM, etc.) and one or more instrumentations.
68+
- `Instrumenter` — Type matching interface: `ForSingleType`, `ForKnownTypes`,
69+
`ForTypeHierarchy`, `ForBootstrap`.
70+
- `muzzle/` — Build-time and runtime safety checks. Verifies that expected types and methods
71+
exist in the library version at runtime. If not, the instrumentation is silently skipped.
72+
See [`docs/how_instrumentations_work.md`](docs/how_instrumentations_work.md) and [`docs/add_new_instrumentation.md`](docs/add_new_instrumentation.md).
73+
74+
- **`instrumentation/`** — All auto-instrumentations, organized as `{framework}/{framework}-{minVersion}/`.
75+
Nearly 200 framework directories. Each follows the same pattern: an `InstrumenterModule` declares the
76+
target system and integration name, one or more `Instrumenter` implementations select target types
77+
via matchers, advice classes inject bytecode via `@Advice.OnMethodEnter`/`@Advice.OnMethodExit`,
78+
and decorator/helper classes contain the actual product logic. Instrumentations are discovered
79+
via `@AutoService(InstrumenterModule.class)` (Java SPI) and validated by Muzzle at build time.
80+
See [`docs/how_instrumentations_work.md`](docs/how_instrumentations_work.md) and [`docs/add_new_instrumentation.md`](docs/add_new_instrumentation.md) for details.
81+
82+
- **`appsec/`** — Application Security. Entry point: `AppSecSystem.start()`. Runs the Datadog WAF
83+
to detect and block attacks in real-time. Hooks into the gateway to intercept HTTP requests.
84+
85+
- **`agent-iast/`** — Interactive Application Security Testing. Entry point: `IastSystem.start()`.
86+
Performs taint tracking: marks user input as tainted, propagates taint through string operations,
87+
and reports when tainted data reaches dangerous sinks (SQL injection, XSS, command injection, etc.).
88+
89+
- **`agent-ci-visibility/`** — CI Visibility. Entry point: `CiVisibilitySystem.start()`.
90+
Instruments test frameworks (JUnit, TestNG, Gradle, Maven, Cucumber) to collect test results,
91+
code coverage, and performance metrics.
92+
93+
- **`agent-profiling/`** — Continuous Profiling. Entry point: `ProfilingAgent`.
94+
Collects CPU, memory, and wall-clock profiles using JFR or the Datadog native profiler (`ddprof`).
95+
Uploads profiles to the Datadog backend.
96+
97+
- **`agent-debugger/`** — Dynamic Instrumentation. Entry point: `DebuggerAgent`.
98+
Probes, snapshot capture, exception replay, code origin mapping.
99+
Driven by remote configuration.
100+
101+
- **`agent-llmobs/`** — LLM Observability. Entry point: `LLMObsSystem.start()`.
102+
Monitors LLM API calls (OpenAI, LangChain, etc.): token usage, model inference, evaluations.
103+
104+
- **`agent-crashtracking/`** — Crash Tracking. Detects JVM crashes and fatal exceptions,
105+
collects system metadata, and uploads crash reports to Datadog's error tracking intake.
106+
107+
- **`agent-otel/`** — OpenTelemetry compatibility shim. `OtelTracerProvider`, `OtelSpan`,
108+
`OtelContext` and other wrappers implement the OTel API by delegating to the Datadog tracer.
109+
Paired with instrumentations in `instrumentation/opentelemetry/` that intercept OTel API calls
110+
and redirect them to shim instances.
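
The early-exit role of the ignore trie described under `agent-builder/` can be sketched with a small prefix trie. This is a simplified illustration, not the format of `ignored_class_name.trie`: `IgnoreTrieSketch` and its methods are hypothetical, and the real trie is compiled at build time rather than populated at runtime.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical prefix trie over dotted class names. */
final class IgnoreTrieSketch {
  private static final class Node {
    final Map<Character, Node> children = new HashMap<>();
    boolean terminal; // true when an ignored prefix ends at this node
  }

  private final Node root = new Node();

  /** Registers a package or class-name prefix such as "java.". */
  void addIgnoredPrefix(String prefix) {
    Node node = root;
    for (int i = 0; i < prefix.length(); i++) {
      node = node.children.computeIfAbsent(prefix.charAt(i), c -> new Node());
    }
    node.terminal = true;
  }

  /** Returns true as soon as any registered prefix matches the start of className. */
  boolean isIgnored(String className) {
    Node node = root;
    for (int i = 0; i < className.length(); i++) {
      if (node.terminal) {
        return true; // a full prefix matched; no matcher needs to run
      }
      node = node.children.get(className.charAt(i));
      if (node == null) {
        return false; // diverged from every registered prefix
      }
    }
    return node.terminal;
  }
}
```

The lookup cost depends only on the length of the class name, not on the number of registered prefixes, which is why a check of this kind can run before the more expensive matchers.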
111+
112+
### `dd-trace-core/`
113+
114+
Core tracing engine. Grew organically and now also hosts product-specific features that depend on
115+
tight integration with span creation, interception, or serialization. New code should go in
116+
`products/` or `components/` instead. Core tracing types:
117+
118+
- `CoreTracer` — Tracer implementation. Creates spans, manages sampling, drives the writer pipeline.
119+
Implements `AgentTracer.TracerAPI`.
120+
- `DDSpan` / `DDSpanContext` — Concrete span and context implementations with Datadog-specific metadata.
121+
- `PendingTrace` — Collects all spans in a trace. Flushes to the writer when the root span finishes.
122+
- `scopemanager/` — `ContinuableScopeManager`, `ContinuableScope`, `ScopeContinuation`. Active span
123+
per thread, async context propagation via continuations.
124+
- `propagation/` — Trace context propagation codecs: Datadog, W3C TraceContext, B3, Haystack, X-Ray.
125+
- `common/writer/` — Writer pipeline. `DDAgentWriter` buffers traces and dispatches via
126+
`PayloadDispatcherImpl` to the Datadog Agent's `/v0.4/traces` endpoint. `DDIntakeWriter` for
127+
direct API submission. `TraceProcessingWorker` for async processing.
128+
- `common/sampling/` — Sampling logic: `RuleBasedTraceSampler`, `RateByServiceTraceSampler`,
129+
`SingleSpanSampler`. Supports both head-based and rule-based sampling.
130+
- `tagprocessor/` — Post-processing of span tags: peer service calculation, base service naming,
131+
query obfuscation, endpoint resolution.
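
The "collect spans, flush when the root finishes" behavior attributed to `PendingTrace` can be sketched as follows. This is a deliberately simplified model: `PendingTraceSketch` is hypothetical, spans are plain strings, and concerns such as late-arriving spans or partial flushes are not modeled at all.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical, single-threaded model of a trace buffer. */
final class PendingTraceSketch {
  private final List<String> finishedSpans = new ArrayList<>();
  private final Consumer<List<String>> writer;
  private boolean flushed;

  PendingTraceSketch(Consumer<List<String>> writer) {
    this.writer = writer;
  }

  /** Records a finished span and flushes the whole trace once the root span finishes. */
  void onSpanFinished(String spanName, boolean isRoot) {
    finishedSpans.add(spanName);
    if (isRoot && !flushed) {
      flushed = true;
      // Hand the writer a copy so later mutations cannot affect the flushed trace.
      writer.accept(new ArrayList<>(finishedSpans));
    }
  }
}
```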
132+
133+
Non-tracing code that also lives here due to organic growth:
134+
135+
- `datastreams/` — Data Streams Monitoring. Tracks message pipeline latency across Kafka, RabbitMQ, SQS, etc.
136+
- `civisibility/` — CI Visibility trace interceptors and protocol adapters. Hooks into the trace
137+
completion pipeline to filter and reformat test spans for the CI Test Cycle intake.
138+
- `lambda/` — AWS Lambda support. Coordinates span creation with the serverless extension,
139+
handling invocation start/end and trace context propagation.
140+
- `llmobs/` — LLM Observability span mapper. Serializes LLM-specific spans (messages, tool calls)
141+
to the dedicated LLM Obs intake format.
142+
143+
### `dd-trace-api/`
144+
145+
Public API. Types application developers may use directly: `Tracer`, `GlobalTracer`, `DDTags`,
146+
`DDSpanTypes`, `Trace` (annotation), `ConfigDefaults`. Also houses all configuration key constants
147+
by domain: `TracerConfig`, `GeneralConfig`, `AppSecConfig`, `ProfilingConfig`, `CiVisibilityConfig`,
148+
`IastConfig`, `DebuggerConfig`, etc.
149+
150+
### `internal-api/`
151+
152+
Internal shared API across all agent modules (not public). Like `dd-trace-core`, grew organically
153+
and now hosts interfaces for many products beyond tracing. New product APIs should go in
154+
`products/` or `components/`.
155+
156+
Core tracing abstractions:
157+
158+
- `AgentTracer` — Static tracer facade. Instrumentations call `AgentTracer.startSpan()`,
159+
`AgentTracer.activateSpan()`, etc.
160+
- `AgentSpan` / `AgentScope` / `AgentSpanContext` — Internal span/scope/context interfaces.
161+
- `AgentPropagation` — Context propagation interfaces (`Getter`, `Setter`) that instrumentations
162+
implement to inject/extract trace context from framework-specific carriers (HTTP headers, message
163+
properties, etc.).
164+
- `Config` / `InstrumenterConfig` — Master configuration class and instrumenter-specific config,
165+
centralizing settings for all products. `InstrumenterConfig` is separated from `Config` due to
166+
GraalVM native-image constraints: in native-image builds, all bytecode instrumentation must be
167+
applied at build time (ahead-of-time compilation), so configuration that controls instrumentation
168+
decisions (which classes to instrument, which integrations to enable, resolver behavior, field
169+
injection flags) must be frozen into the native image binary. Runtime-only settings (agent
170+
endpoints, service names, sampling rates) remain in `Config`.
171+
See [`docs/add_new_configurations.md`](docs/add_new_configurations.md).
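
The getter/setter style of propagation described for `AgentPropagation` can be sketched like this. The interface shapes below are assumptions for illustration and may not match the real signatures; the two header names are the standard Datadog propagation headers.

```java
/** Hypothetical sketch of carrier-agnostic inject/extract. */
final class PropagationSketch {
  /** Writes one key/value pair into a framework-specific carrier. */
  interface Setter<C> {
    void set(C carrier, String key, String value);
  }

  /** Reads one value from a framework-specific carrier, or null if absent. */
  interface Getter<C> {
    String get(C carrier, String key);
  }

  static final String TRACE_ID_HEADER = "x-datadog-trace-id";
  static final String PARENT_ID_HEADER = "x-datadog-parent-id";

  /** Injects trace and span ids as unsigned decimal strings. */
  static <C> void inject(long traceId, long spanId, C carrier, Setter<C> setter) {
    setter.set(carrier, TRACE_ID_HEADER, Long.toUnsignedString(traceId));
    setter.set(carrier, PARENT_ID_HEADER, Long.toUnsignedString(spanId));
  }

  /** Returns {traceId, parentId}, or null when either header is missing. */
  static <C> long[] extract(C carrier, Getter<C> getter) {
    String traceId = getter.get(carrier, TRACE_ID_HEADER);
    String parentId = getter.get(carrier, PARENT_ID_HEADER);
    if (traceId == null || parentId == null) {
      return null;
    }
    return new long[] {Long.parseUnsignedLong(traceId), Long.parseUnsignedLong(parentId)};
  }
}
```

Because the carrier type is generic, the same inject and extract logic can serve HTTP headers, message properties, or any other carrier for which an instrumentation supplies a getter and setter.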
172+
173+
Cross-product abstractions:
174+
175+
- `gateway/` — Instrumentation Gateway: event bus (`InstrumentationGateway`,
176+
`SubscriptionService`, `Events`, `CallbackProvider`, `RequestContext`) decoupling
177+
instrumentations from product modules. Primarily used by AppSec and IAST to hook into
178+
the HTTP request lifecycle without modifying instrumentations.
179+
- `cache/` — Shared caching primitives (`DDCache`, `FixedSizeCache`, `RadixTreeCache`) used
180+
throughout the agent.
181+
- `naming/` — Service and span operation naming schemas (v0, v1) for databases, messaging,
182+
cloud services, etc.
183+
- `telemetry/` — Multi-product telemetry collection interfaces (`MetricCollector`,
184+
`WafMetricCollector`, `LLMObsMetricCollector`, etc.).
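
The decoupling provided by the gateway can be sketched as a small event bus: instrumentations publish events by name, and product modules subscribe without the instrumentation knowing who listens. `GatewaySketch`, the string payloads, and the boolean "block" result are all assumptions for illustration; the real `InstrumentationGateway` API is typed and richer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical, single-threaded event bus. */
final class GatewaySketch {
  private final Map<String, List<Predicate<String>>> subscribers = new HashMap<>();

  /** Product side: register a callback that returns true to request blocking. */
  void subscribe(String eventName, Predicate<String> callback) {
    subscribers.computeIfAbsent(eventName, k -> new ArrayList<>()).add(callback);
  }

  /** Instrumentation side: publish an event; returns true if any subscriber asked to block. */
  boolean publish(String eventName, String payload) {
    boolean block = false;
    List<Predicate<String>> callbacks =
        subscribers.getOrDefault(eventName, Collections.emptyList());
    for (Predicate<String> callback : callbacks) {
      block |= callback.test(payload);
    }
    return block;
  }
}
```

With no subscribers, `publish` returns `false` and does no work beyond a map lookup, so an instrumentation can publish unconditionally whether or not a product is enabled.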
185+
186+
Product-specific APIs that also live here:
187+
188+
- `iast/` — IAST vulnerability detection interfaces: taint tracking (`Taintable`, `IastContext`),
189+
sink definitions for each vulnerability type (SQL injection, XSS, command injection, etc.),
190+
and call site instrumentation hooks. About 60 files.
191+
- `civisibility/` — CI Visibility interfaces: test identification, code coverage, build/test
192+
event handlers, and CI-specific telemetry metrics. About 95 files.
193+
- `datastreams/` — Data Streams Monitoring interfaces: pathway context, stats points,
194+
and schema registry integration.
195+
- `appsec/` — AppSec interfaces: HTTP client request/response payloads for WAF analysis,
196+
RASP call sites.
197+
- `profiling/` — Profiler integration: recording data, timing, and enablement interfaces.
198+
- `llmobs/` — LLM Observability context.
199+
200+
### `components/`
201+
202+
Low-level shared platform components. Not tied to any product, no external dependencies,
203+
bootstrap-safe:
204+
205+
- `context` — Immutable context propagation framework. Provides `Context`, `ContextKey`,
206+
and `Propagator` abstractions for storing and propagating key-value pairs across threads
207+
and carrier objects.
208+
- `environment` — JVM and OS detection utilities. `JavaVersion` for version parsing,
209+
`JavaVirtualMachine` for JVM implementation detection (OpenJDK, Graal, J9),
210+
`OperatingSystem` for OS/architecture detection, and `EnvironmentVariables`/`SystemProperties`
211+
for safe access and mocking.
212+
- `json` — Lightweight, dependency-free JSON serialization. `JsonWriter` for building JSON
213+
with a fluent API, `JsonReader` for streaming parsing.
214+
- `native-loader` — Platform-aware native library loading with pluggable strategies.
215+
`NativeLoader` handles OS/architecture detection, resource extraction from JARs,
216+
and temp file management.
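
The "immutable context" idea from the `context` component can be sketched as a copy-on-write map with typed keys. `ContextSketch` and its nested `Key` are hypothetical; the real `Context` and `ContextKey` types may differ in naming and in how values are stored.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical immutable key-value context. */
final class ContextSketch {
  /** Typed key; identity-based, so two keys with the same name stay distinct. */
  static final class Key<T> {
    final String name;

    Key(String name) {
      this.name = name;
    }
  }

  private static final ContextSketch ROOT = new ContextSketch(new HashMap<>());

  private final Map<Key<?>, Object> values;

  private ContextSketch(Map<Key<?>, Object> values) {
    this.values = values;
  }

  static ContextSketch root() {
    return ROOT;
  }

  /** Returns a new context with the extra entry; this instance is never modified. */
  <T> ContextSketch with(Key<T> key, T value) {
    Map<Key<?>, Object> copy = new HashMap<>(values);
    copy.put(key, value);
    return new ContextSketch(copy);
  }

  @SuppressWarnings("unchecked")
  <T> T get(Key<T> key) {
    return (T) values.get(key);
  }
}
```

Because `with` never mutates the receiver, a context captured on one thread can be handed to another without synchronization.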
217+
218+
### `products/`
219+
220+
Self-contained product modules following a layered submodule pattern:
221+
222+
- `{product}-api/` — Public API interfaces, zero dependencies.
223+
- `{product}-bootstrap/` — Data classes safe for the bootstrap classloader.
224+
- `{product}-lib/` — Core implementation (shadow jar, excludes shared dependencies).
225+
- `{product}-agent/` — Agent integration entry point (shadow jar).
226+
227+
Current products:
228+
229+
- `metrics/` — StatsD client and monitoring abstraction. Provides `Monitoring` interface with
230+
counters, timers, and histograms for internal agent metrics collection.
231+
- `feature-flagging/` — Server-side feature flag evaluation driven by remote configuration.
232+
Implements the OpenFeature SDK, handles the Unified Feature Control (UFC) protocol,
233+
and tracks flag exposure per user/session.
234+
235+
### `communication/`
236+
237+
HTTP transport to the Datadog Agent and intake APIs. `SharedCommunicationObjects` holds shared
238+
`OkHttpClient` instances (Unix domain socket and named pipe support), agent URL, feature discovery,
239+
and the configuration poller. All product modules receive this at startup.
240+
241+
### `remote-config/`
242+
243+
Remote configuration client. `DefaultConfigurationPoller` periodically polls the Datadog Agent
244+
for configuration updates (AppSec rules, debugger probes, sampling rates, feature flags).
245+
Uses TUF (The Update Framework) for signature validation.
246+
247+
### `telemetry/`
248+
249+
Agent telemetry. `TelemetrySystem` collects and reports which features are enabled,
250+
which integrations loaded, performance metrics, and product-specific counters.
251+
Each product registers periodic actions that collect domain-specific metrics.
252+
253+
### `utils/`
254+
255+
Shared utilities, each in its own submodule:
256+
257+
- `config-utils` — `ConfigProvider` for reading and merging configuration from environment variables,
258+
system properties, properties files, and CI environment.
259+
- `container-utils` — Parses container runtime information (Docker, Kubernetes, ECS).
260+
- `filesystem-utils` — Permission-safe file existence checks that handle `SecurityException`.
261+
- `flare-utils` — Tracer flare collection (`TracerFlareService`) that gathers diagnostics
262+
(logs, spans, system info) and sends them to Datadog for troubleshooting.
263+
- `queue-utils` — High-performance lock-free queues (`MpscArrayQueue`, `SpscArrayQueue`) for
264+
inter-thread communication and span buffering.
265+
- `socket-utils` — Socket factories (`UnixDomainSocketFactory`, `NamedPipeSocket`) for connecting
266+
to the local Datadog Agent via Unix sockets or named pipes.
267+
- `time-utils` — Time source abstractions (`TimeSource`, `ControllableTimeSource`) for testable
268+
time handling and delay parsing.
269+
- `version-utils` — Agent version string (`VersionInfo.VERSION`) read from packaged resources.
270+
- `test-utils` — Testing utilities: `@Flaky` annotation, log capture, GC control,
271+
forked test configuration.
272+
- `test-agent-utils` — Message decoders for parsing v04/v05 binary protocol frames in tests.
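
The value of a controllable time source, as in `time-utils`, is that time-dependent code can be tested without sleeping. The sketch below assumes a one-method interface; `TimeSourceSketch`, `ControllableTimeSketch`, and `ElapsedTimerSketch` are hypothetical names and do not mirror the real `TimeSource` API.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical minimal time abstraction. */
interface TimeSourceSketch {
  long nanoTime();
}

/** Test double whose clock only moves when advance() is called. */
final class ControllableTimeSketch implements TimeSourceSketch {
  private long nanos;

  @Override
  public long nanoTime() {
    return nanos;
  }

  void advance(long amount, TimeUnit unit) {
    nanos += unit.toNanos(amount);
  }
}

/** Example consumer: measures elapsed time through the abstraction only. */
final class ElapsedTimerSketch {
  private final TimeSourceSketch time;
  private final long startNanos;

  ElapsedTimerSketch(TimeSourceSketch time) {
    this.time = time;
    this.startNanos = time.nanoTime();
  }

  long elapsedMillis() {
    return TimeUnit.NANOSECONDS.toMillis(time.nanoTime() - startNanos);
  }
}
```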
273+
274+
### `dd-trace-ot/`
275+
276+
Legacy OpenTracing compatibility library. Publishes a standalone JAR artifact (`dd-trace-ot.jar`)
277+
that implements the `io.opentracing.Tracer` interface by wrapping the Datadog `CoreTracer`.
278+
This is a pure library for manual instrumentation only — there is no auto-instrumentation or
279+
bytecode advice.
280+
281+
### `dd-smoke-tests/`
282+
283+
End-to-end smoke tests. Each boots a real application with the agent jar and verifies traces, spans,
284+
and product behavior. Covers Spring Boot, Play, Vert.x, Quarkus, WildFly, and more.
285+
Core test hierarchy (Groovy/Spock):
286+
- `ProcessManager` — Base. Spawns forked JVM processes with the agent via `ProcessBuilder`,
287+
captures stdout to log files, tears down on cleanup. `assertNoErrorLogs()` scans logs for errors.
288+
- `AbstractSmokeTest` extends `ProcessManager` — Adds a mock Datadog Agent (`TestHttpServer`)
289+
receiving traces (v0.4/v0.5), telemetry, remote config, and EVP proxy requests. Polling helpers:
290+
`waitForTraceCount`, `waitForSpan`, `waitForTelemetryFlat`.
291+
- `AbstractServerSmokeTest` extends `AbstractSmokeTest` — For HTTP server apps. Adds port
292+
management, waits for server port to open, verifies expected trace output.
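
Polling helpers such as `waitForTraceCount` presumably follow a check-sleep-retry loop with a deadline. The sketch below shows that general pattern only; `PollingSketch.waitFor` is a hypothetical helper, written in Java rather than Groovy, and is not the implementation used by the smoke tests.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper in the spirit of the smoke-test wait methods. */
final class PollingSketch {
  /**
   * Re-evaluates the condition until it holds or the timeout elapses.
   * Returns true if the condition held, false on timeout or interruption.
   */
  static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long intervalMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      // Subtraction keeps the comparison correct even if nanoTime wraps around.
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(intervalMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
  }
}
```

The condition is always checked at least once, so a zero timeout still succeeds when the expected state is already present.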
