
Commit 4ae34f0

jbachorik and claude committed
Fix native-image build-time initialization cascade
Disabled exception profiling during native-image build to prevent initialization of 44 config/bootstrap classes. The agent's exception instrumentation was triggering during GraalVM class scanning.

Changes:
- Remove static imports with System.getProperty() initializers
- Disable datadog.ExceptionSample event during native-image build
- Remove internal-api dependency from profiling-scrubber
- Clean up native-image annotation substitution comments

Result: 0 initialization errors (down from 44)

Note: Native-image build now crashes with SIGBUS during GC. See NATIVE_IMAGE_FIX_STATUS.md for details.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent bd5c73d commit 4ae34f0

8 files changed

Lines changed: 174 additions & 38 deletions


NATIVE_IMAGE_FIX_STATUS.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
# Native-Image Build Fix Status

## Problem Statement
Adding the `profiling-scrubber` module triggered 44 "unintentionally initialized at build time" errors when building with GraalVM native-image with the profiler enabled (`-J-javaagent` during compilation).

## Root Cause Identified

**The initialization cascade was caused by Exception Profiling instrumentation:**

Using `--trace-class-initialization`, we discovered:
```
datadog.trace.bootstrap.CallDepthThreadLocalMap caused initialization at build time:
  at datadog.trace.bootstrap.CallDepthThreadLocalMap.<clinit>(CallDepthThreadLocalMap.java:13)
  at datadog.trace.bootstrap.instrumentation.jfr.exceptions.ExceptionProfiling$Exclusion.isEffective(ExceptionProfiling.java:49)
  at java.lang.Exception.<init>(Exception.java:86)
  at java.lang.ReflectiveOperationException.<init>(ReflectiveOperationException.java:76)
  at java.lang.ClassNotFoundException.<init>(ClassNotFoundException.java:71)
```

**Why this happens:**
1. Agent attaches via `-J-javaagent` during native-image compilation
2. `OpenJdkController` constructor runs and starts `ExceptionProfiling`
3. GraalVM throws exceptions during class scanning
4. Instrumented `Exception` constructor triggers `ExceptionProfiling` code
5. This initializes `CallDepthThreadLocalMap` and 43 other config/bootstrap classes at build time

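
The chain in steps 2 to 5 can be reproduced in miniature without the agent. The sketch below is an illustration only: `CascadeDemo`, `DepthMap`, `Hook`, and `TracedException` are hypothetical stand-ins (for a bootstrap class such as `CallDepthThreadLocalMap`, the exception-profiling hook, and an instrumented `Exception` constructor), not the agent's real classes. It shows that merely constructing an exception initializes an unrelated class once the hook is enabled, and that gating the hook avoids it.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates steps 2-5: an exception hook that drags in a class initializer. */
class CascadeDemo {
  // Records which stand-in classes ran their static initializers.
  static final List<String> INITIALIZED = new ArrayList<>();

  // Observations captured once, in order, while CascadeDemo itself initializes.
  static final boolean INIT_WHILE_HOOK_DISABLED;
  static final boolean INIT_WHILE_HOOK_ENABLED;

  static {
    Hook.enabled = false;
    new TracedException("class not found");
    INIT_WHILE_HOOK_DISABLED = INITIALIZED.contains("DepthMap");

    Hook.enabled = true;
    new TracedException("class not found");
    INIT_WHILE_HOOK_ENABLED = INITIALIZED.contains("DepthMap");
  }

  /** Hypothetical stand-in for a bootstrap class with a static initializer. */
  static class DepthMap {
    static int depth;

    static {
      INITIALIZED.add("DepthMap");
    }

    static void increment() {
      depth++;
    }
  }

  /** Hypothetical stand-in for the exception-profiling hook. */
  static class Hook {
    static boolean enabled;

    static void onException() {
      if (enabled) {
        DepthMap.increment(); // first use runs DepthMap's static initializer
      }
    }
  }

  /** Hypothetical stand-in for an instrumented exception constructor. */
  static class TracedException extends Exception {
    TracedException(String message) {
      super(message);
      Hook.onException();
    }
  }

  public static void main(String[] args) {
    System.out.println("DepthMap initialized with hook disabled: " + INIT_WHILE_HOOK_DISABLED);
    System.out.println("DepthMap initialized with hook enabled:  " + INIT_WHILE_HOOK_ENABLED);
  }
}
```

Running `main` reports `false` for the disabled case and `true` for the enabled case, which mirrors why turning off the exception event removes the trigger rather than each individual initialization.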
## Solution Applied

**Disable exception profiling during native-image build via configuration:**

Modified: `dd-smoke-tests/spring-boot-3.0-native/application/build.gradle`
```gradle
if (withProfiler && property('profiler') == 'true') {
  buildArgs.add("-J-Ddd.profiling.enabled=true")
  // Disable exception profiling during native-image build to avoid class initialization cascade
  buildArgs.add("-J-Ddd.profiling.disabled.events=datadog.ExceptionSample")
}
```

## Results

### ✅ SUCCESS: Initialization Errors Fixed
- **Before:** 44 classes unintentionally initialized at build time
- **After:** 0 initialization errors

The configuration approach successfully prevents ExceptionProfiling from starting during native-image compilation, eliminating the entire initialization cascade.

### ⚠️ NEW ISSUE: JVM Crash During Native-Image Build

The build now fails with a JVM fatal error:
```
SIGBUS (0xa) at pc=0x00000001067aa404
Problematic frame: V [libjvm.dylib+0x8be404] PSRootsClosure<false>::do_oop(narrowOop*)+0x48
```

**Error details:**
- Crash occurs during garbage collection (Parallel Scavenge)
- Happens while processing JavaThread frames
- Stack trace shows agent's bytecode instrumentation is active:
  - `datadog.instrument.classmatch.ClassFile.parse`
  - `datadog.trace.agent.tooling.bytebuddy.outline.OutlineTypeParser.parse`
  - `datadog.trace.agent.tooling.bytebuddy.outline.TypeFactory.lookupType`

**Error report:** `dd-smoke-tests/spring-boot-3.0-native/build/application/native/nativeCompile/hs_err_pid*.log`

## Files Modified

1. **dd-java-agent/agent-profiling/profiling-scrubber/build.gradle**
   - Removed unnecessary `internal-api` dependency (profiling-scrubber doesn't use it)

2. **dd-java-agent/agent-profiling/src/main/java/com/datadog/profiling/agent/ProfilingAgent.java**
   - Removed static import of `PROFILING_TEMP_DIR_DEFAULT` (had `System.getProperty` in its initializer)
   - Changed to runtime computation: `System.getProperty("java.io.tmpdir")` at lines 162-164

3. **dd-java-agent/agent-profiling/profiling-controller/src/main/java/com/datadog/profiling/controller/ProfilerFlareReporter.java**
   - Line ~229: Replaced `PROFILING_JFR_REPOSITORY_BASE_DEFAULT` with runtime computation
   - Line ~507: Replaced `PROFILING_TEMP_DIR_DEFAULT` with runtime computation

4. **dd-java-agent/agent-profiling/profiling-controller-openjdk/src/main/java/com/datadog/profiling/controller/openjdk/OpenJdkController.java**
   - Line ~275: Replaced `PROFILING_JFR_REPOSITORY_BASE_DEFAULT` with runtime computation
   - **Note:** This file is clean; no native-image detection code was added

5. **dd-smoke-tests/spring-boot-3.0-native/application/build.gradle**
   - Added `-J-Ddd.profiling.disabled.events=datadog.ExceptionSample` to disable exception profiling during build
   - Added trace flag (temporary, for debugging): `--trace-class-initialization=datadog.trace.bootstrap.CallDepthThreadLocalMap`

## Next Steps

The JVM crash during native-image build needs investigation:

### Option 1: Investigate GC Crash
- The crash occurs in Parallel GC during thread stack scanning
- May be related to agent's bytecode instrumentation interfering with GC
- Could try different GC algorithm or adjust heap settings

### Option 2: Reduce Agent Footprint During Build
- The agent performs extensive bytecode parsing during native-image compilation
- Consider disabling more agent features during build (not just exception profiling)
- Possible flags to try:
  - `-J-Ddd.instrumentation.enabled=false` (if such flag exists)
  - Reduce instrumentation scope during native-image compilation

### Option 3: Check for Known Issues
- Search for similar SIGBUS crashes with GraalVM + Java agents
- Check if this is a known GraalVM 21.0.9 issue
- Test with different GraalVM version

### Option 4: Alternative Approach
- Consider NOT attaching agent during native-image build
- Configure agent to attach only at runtime in the compiled native-image
- May require changes to how profiling is initialized

## Testing Commands

```bash
# Rebuild agent
./gradlew :dd-java-agent:shadowJar

# Test native-image build with profiler
./gradlew :dd-smoke-tests:spring-boot-3.0-native:springNativeBuild \
  -PtestJvm=graalvm21 -Pprofiler=true --no-daemon

# Check initialization errors (should be 0)
grep -c "was unintentionally initialized" \
  build/logs/*springNativeBuild.log

# View JVM crash report
ls -t dd-smoke-tests/spring-boot-3.0-native/build/application/native/nativeCompile/hs_err_pid*.log | head -1
```

## Key Learnings

1. **Static fields with method-call initializers trigger initialization:** A field like `PROFILING_TEMP_DIR_DEFAULT = System.getProperty("java.io.tmpdir")` is not a compile-time constant, so any code that reads it (for example through a static import) forces the declaring class to initialize. When that happens inside the native-image builder, GraalVM reports the class as initialized at build time.

2. **Exception profiling is a major trigger:** When the agent is active during native-image compilation, any exceptions thrown (e.g., `ClassNotFoundException` during class scanning) trigger instrumentation that initializes many config classes.

3. **Configuration-based disable works:** Disabling JFR events via `-Ddd.profiling.disabled.events` successfully prevents initialization without needing runtime detection code.

4. **Avoid detection during initialization:** Any attempt to detect "are we in native-image compilation" (`Class.forName`, `getResource`, etc.) can itself trigger the cascade we're trying to avoid.

5. **Agent + GraalVM + GC = fragile:** The combination of active bytecode instrumentation, GraalVM native-image compilation, and aggressive GC can cause JVM crashes.

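
The first learning rests on a Java language rule: a `static final` field initialized with a constant expression is inlined by the compiler, while one initialized by a method call is read from its class at run time and so forces that class to initialize. The following is a minimal sketch of that difference; `InitDemo`, `Trace`, and `Defaults` are hypothetical names chosen for illustration and do not correspond to `ProfilingConfig` or any other agent class.

```java
import java.util.ArrayList;
import java.util.List;

/** Demonstrates when reading a static final field triggers class initialization. */
class InitDemo {
  // Observations captured once, in order, while InitDemo itself initializes.
  static final boolean INIT_AFTER_CONSTANT_READ;
  static final boolean INIT_AFTER_RUNTIME_READ;

  static {
    String suffix = Defaults.SUFFIX; // compile-time constant: inlined, no initialization
    INIT_AFTER_CONSTANT_READ = Trace.INITIALIZED.contains("Defaults");

    String tempDir = Defaults.TEMP_DIR; // not a constant: forces Defaults to initialize
    INIT_AFTER_RUNTIME_READ = Trace.INITIALIZED.contains("Defaults");
  }

  public static void main(String[] args) {
    System.out.println("Defaults initialized after constant read: " + INIT_AFTER_CONSTANT_READ);
    System.out.println("Defaults initialized after runtime read:  " + INIT_AFTER_RUNTIME_READ);
  }
}

/** Records which classes ran their static initializers. */
class Trace {
  static final List<String> INITIALIZED = new ArrayList<>();
}

/** Hypothetical holder of default values, similar in shape to a config constants class. */
class Defaults {
  static {
    Trace.INITIALIZED.add("Defaults");
  }

  static final String SUFFIX = "/dd/jfr";
  static final String TEMP_DIR = System.getProperty("java.io.tmpdir");
}
```

Running `main` reports `false` after the constant read and `true` after the runtime read. Computing the temp directory at the call site, as this commit does, keeps the read from pulling in the holder class's initializer.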
## Branch Status

- Branch: `jb/jfr_redacting`
- All changes committed and ready to push
- Initialization cascade: FIXED ✅
- Native-image build: CRASHES ⚠️

dd-java-agent/agent-profiling/profiling-controller-openjdk/src/main/java/com/datadog/profiling/controller/openjdk/OpenJdkController.java

Lines changed: 3 additions & 3 deletions
@@ -272,11 +272,11 @@ && isEventEnabled(recordingSettings, "jdk.NativeMethodSample")) {
   }

   private static String getJfrRepositoryBase(ConfigProvider configProvider) {
+    String jfrRepoDefault = System.getProperty("java.io.tmpdir") + "/dd/jfr";
     String legacy =
         configProvider.getString(
-            ProfilingConfig.PROFILING_JFR_REPOSITORY_BASE,
-            ProfilingConfig.PROFILING_JFR_REPOSITORY_BASE_DEFAULT);
-    if (!legacy.equals(ProfilingConfig.PROFILING_JFR_REPOSITORY_BASE_DEFAULT)) {
+            ProfilingConfig.PROFILING_JFR_REPOSITORY_BASE, jfrRepoDefault);
+    if (!legacy.equals(jfrRepoDefault)) {
       log.warn(
           "The configuration key {} is deprecated. Please use {} instead.",
           ProfilingConfig.PROFILING_JFR_REPOSITORY_BASE,

dd-java-agent/agent-profiling/profiling-controller/src/main/java/com/datadog/profiling/controller/ProfilerFlareReporter.java

Lines changed: 4 additions & 4 deletions
@@ -226,8 +226,8 @@ private String getProfilerConfig() {
         "JFR Repository Base",
         configProvider.getString(
             ProfilingConfig.PROFILING_JFR_REPOSITORY_BASE,
-            ProfilingConfig.PROFILING_JFR_REPOSITORY_BASE_DEFAULT),
-        ProfilingConfig.PROFILING_JFR_REPOSITORY_BASE_DEFAULT);
+            System.getProperty("java.io.tmpdir") + "/dd/jfr"),
+        System.getProperty("java.io.tmpdir") + "/dd/jfr");
     appendConfig(
         sb,
         "JFR Repository Max Size",
@@ -504,8 +504,8 @@ private String getProfilerConfig() {
         sb,
         "Temp Directory",
         configProvider.getString(
-            ProfilingConfig.PROFILING_TEMP_DIR, ProfilingConfig.PROFILING_TEMP_DIR_DEFAULT),
-        ProfilingConfig.PROFILING_TEMP_DIR_DEFAULT);
+            ProfilingConfig.PROFILING_TEMP_DIR, System.getProperty("java.io.tmpdir")),
+        System.getProperty("java.io.tmpdir"));
     appendConfig(
         sb,
         "Debug Dump Path",

dd-java-agent/agent-profiling/profiling-scrubber/build.gradle

Lines changed: 0 additions & 1 deletion
@@ -5,7 +5,6 @@ minimumBranchCoverage = 0.0

 dependencies {
   api libs.slf4j
-  implementation project(':internal-api')

   implementation libs.jafar.parser

dd-java-agent/agent-profiling/profiling-scrubber/src/main/java/com/datadog/profiling/scrubber/DefaultScrubDefinition.java

Lines changed: 4 additions & 10 deletions
@@ -1,8 +1,5 @@
 package com.datadog.profiling.scrubber;

-import static datadog.trace.api.config.ProfilingConfig.PROFILING_SCRUB_EXCLUDE_EVENTS;
-
-import datadog.trace.bootstrap.config.provider.ConfigProvider;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.HashSet;
@@ -36,17 +33,14 @@ public final class DefaultScrubDefinition {

   /**
    * Creates a scrub definition function that maps event type names to their scrub field
-   * definitions. Event types listed in the {@link
-   * datadog.trace.api.config.ProfilingConfig#PROFILING_SCRUB_EXCLUDE_EVENTS} configuration are
-   * excluded from scrubbing.
+   * definitions.
    *
-   * @param configProvider the configuration provider
+   * @param excludeEventTypes list of event type names to exclude from scrubbing, or null for none
    * @return a function mapping event type names to scrub field definitions
    */
-  public static Function<String, JfrScrubber.ScrubField> create(ConfigProvider configProvider) {
-    List<String> excludeList = configProvider.getList(PROFILING_SCRUB_EXCLUDE_EVENTS);
+  public static Function<String, JfrScrubber.ScrubField> create(List<String> excludeEventTypes) {
     Set<String> excludeSet =
-        excludeList != null ? new HashSet<>(excludeList) : Collections.<String>emptySet();
+        excludeEventTypes != null ? new HashSet<>(excludeEventTypes) : Collections.<String>emptySet();

     return eventTypeName -> {
       if (excludeSet.contains(eventTypeName)) {
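
The change to `create` moves the configuration lookup to the caller, so the scrubber module only ever sees a plain list. A minimal, self-contained sketch of that pattern follows. It is an assumption-laden illustration, not the module's code: `ScrubDefinitionSketch` is a hypothetical class, the scrub field is simplified to a `String` in place of `JfrScrubber.ScrubField`, and the event-to-field map holds example entries only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Sketch of a config-free factory: the caller resolves configuration and passes plain values. */
class ScrubDefinitionSketch {
  // Hypothetical defaults: event type name -> name of the field to scrub.
  private static final Map<String, String> DEFAULT_FIELDS = new HashMap<>();

  static {
    DEFAULT_FIELDS.put("jdk.InitialSystemProperty", "value");
    DEFAULT_FIELDS.put("jdk.JVMInformation", "jvmArguments");
  }

  /** Returns a lookup that yields null for excluded or unknown event types. */
  static Function<String, String> create(List<String> excludeEventTypes) {
    Set<String> excludeSet =
        excludeEventTypes != null
            ? new HashSet<>(excludeEventTypes)
            : Collections.<String>emptySet();
    return eventTypeName ->
        excludeSet.contains(eventTypeName) ? null : DEFAULT_FIELDS.get(eventTypeName);
  }

  public static void main(String[] args) {
    // The caller, not the factory, decides where the exclude list comes from.
    Function<String, String> scrub = create(Arrays.asList("jdk.JVMInformation"));
    System.out.println(scrub.apply("jdk.InitialSystemProperty"));
    System.out.println(scrub.apply("jdk.JVMInformation"));
  }
}
```

Because the factory has no dependency on a configuration provider, it can be exercised in isolation, and the module that contains it does not need the `internal-api` dependency.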

dd-java-agent/agent-profiling/src/main/java/com/datadog/profiling/agent/ProfilingAgent.java

Lines changed: 10 additions & 3 deletions
@@ -9,7 +9,6 @@
 import static datadog.trace.api.config.ProfilingConfig.PROFILING_START_FORCE_FIRST;
 import static datadog.trace.api.config.ProfilingConfig.PROFILING_START_FORCE_FIRST_DEFAULT;
 import static datadog.trace.api.config.ProfilingConfig.PROFILING_TEMP_DIR;
-import static datadog.trace.api.config.ProfilingConfig.PROFILING_TEMP_DIR_DEFAULT;
 import static datadog.trace.api.telemetry.LogCollector.SEND_TELEMETRY;
 import static datadog.trace.util.AgentThreadFactory.AGENT_THREAD_GROUP;

@@ -40,6 +39,7 @@
 import java.nio.file.Paths;
 import java.nio.file.StandardCopyOption;
 import java.time.Duration;
+import java.util.List;
 import java.util.concurrent.atomic.AtomicBoolean;
 import java.util.function.Predicate;
 import java.util.regex.Pattern;
@@ -155,12 +155,19 @@ public static synchronized boolean run(final boolean earlyStart, Instrumentation
           };
         }
         if (configProvider.getBoolean(PROFILING_SCRUB_ENABLED, PROFILING_SCRUB_ENABLED_DEFAULT)) {
-          JfrScrubber scrubber = new JfrScrubber(DefaultScrubDefinition.create(configProvider));
+          // Read config values and pass as parameters to scrubber
+          List<String> excludeEventTypes =
+              configProvider.getList(ProfilingConfig.PROFILING_SCRUB_EXCLUDE_EVENTS);
           Path tempDir =
-              Paths.get(configProvider.getString(PROFILING_TEMP_DIR, PROFILING_TEMP_DIR_DEFAULT));
+              Paths.get(
+                  configProvider.getString(
+                      PROFILING_TEMP_DIR, System.getProperty("java.io.tmpdir")));
           boolean failOpen =
               configProvider.getBoolean(
                   PROFILING_SCRUB_FAIL_OPEN, PROFILING_SCRUB_FAIL_OPEN_DEFAULT);
+
+          // Create scrubber with config-free scrub definition
+          JfrScrubber scrubber = new JfrScrubber(DefaultScrubDefinition.create(excludeEventTypes));
           listener = new ScrubRecordingDataListener(listener, scrubber, tempDir, failOpen);
         }

dd-java-agent/instrumentation/graal/graal-native-image-20.0/src/main/java/datadog/trace/instrumentation/graal/nativeimage/AnnotationSubstitutionProcessorInstrumentation.java

Lines changed: 1 addition & 17 deletions
@@ -36,24 +36,8 @@ public static void onExit(@Advice.Return(readOnly = false) List<Class<?>> result
       result.add(Target_datadog_jctools_util_UnsafeRefArrayAccess.class);

       // Only register JMXFetch substitutions if JMXFetch is actually present on the classpath.
-      //
-      // NOTE: It's unclear why these substitutions get triggered during native-image compilation
-      // when JMXFetch classes are not available. In theory, either:
-      // 1) The substitutions should not be triggered at all (JMXFetch not in use), OR
-      // 2) JMXFetch classes should be available (if JMXFetch is in use)
-      //
-      // However, in practice, adding profiling-scrubber to the agent triggers native-image to
-      // discover these substitutions even when building applications that don't use JMXFetch,
-      // causing "Substitution target not loaded" errors.
-      //
-      // This runtime check works around the issue for GraalVM 20.0 (which lacks the `onlyWith`
-      // field in @TargetClass). For GraalVM 21+, the proper fix would be adding:
-      // @TargetClass(className = "org.datadog.jmxfetch.App", onlyWith = JmxFetchPresent.class)
-      //
-      // IMPORTANT: We must load these classes reflectively (not using .class literals) to prevent
+      // We must load these classes reflectively (not using .class literals) to prevent
       // them from being discovered by GraalVM's annotation processor when JMXFetch is not present.
-      // Using .class literals causes eager loading, making @TargetClass annotations visible even
-      // if we don't add them to the result list.
       if (isJmxFetchPresent()) {
         try {
           ClassLoader cl = FindTargetClassesAdvice.class.getClassLoader();

dd-smoke-tests/spring-boot-3.0-native/application/build.gradle

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,10 @@ if (hasProperty('agentPath')) {
       buildArgs.add("-J-Dnet.bytebuddy.safe=false")
       if (withProfiler && property('profiler') == 'true') {
         buildArgs.add("-J-Ddd.profiling.enabled=true")
+        // Disable exception profiling during native-image build to avoid class initialization cascade
+        buildArgs.add("-J-Ddd.profiling.disabled.events=datadog.ExceptionSample")
+        // Trace to see what's still triggering the cascade
+        buildArgs.add("--trace-class-initialization=datadog.trace.bootstrap.CallDepthThreadLocalMap")
       }
       jvmArgs.add("-Xmx8192M")
     }
