Commit ef95d7c

author
Yuriy Bezsonov
committed
feat(perf-platform): dual-mode profiling and version diff lanes
Rework the perf-platform implementation around dual-mode async-profiler (CPU + wall via -e cpu --wall 10ms) and add a six-lane analyzer that runs CPU top-N, wall top-N, CPU diff, wall diff, JFR, and thread dump in parallel against a target workload.

Collector
- Replace the rolling JFR ring approach with a rotating async-profiler session (--loop 15s -o jfr) that produces one JFR file per cycle and pushes each completed file to Pyroscope's /ingest endpoint. Pyroscope splits each push into process_cpu and wall profile types.
- Drop the separate jcmd-based JFR.start path. On-demand JFR uses asprof dump on the same running session.
- Introduce versionFrom() resolution: explicit perf-profile/version label wins, otherwise fall back to the image tag.

Analyzer
- PyroscopeTool grows topFunctions(service, profileType, from, to, n) and diff(service, profileType, baseline, current, from, to, n). Both accept either 'cpu' or 'wall' to pick which Pyroscope profile type to query. Diff ranks frames by total-time share (frame plus descendants) so application callers that walk into a new hotspot or contended primitive show up directly.
- AnalysisService runs six lanes via virtual threads: pre-fetched Pyroscope top-N for both profile types, version diffs for both profile types, JFR via the collector, and thread dump via the collector. AnalysisContext carries the four pre-fetched Markdown blocks.
- AiService renders both diffs first, then both top-N tables, with a system prompt instructing the model to lead with what changed and cross-check CPU vs wall diffs.
- Add ProfileRatioExporter and PyroscopeVersionService: an internal Prometheus gauge perf_profile_cpu_ratio whose value is the latest version's CPU total over the prior version's, intended as the alert source for canary regression detection. Grafana OSS has no recording-rule engine, so this lives in the analyzer.

Infrastructure
- CDK PerfPlatform construct provisions pyroscope-eks-pod-role for Pyroscope's S3 backend and perf-analyzer-eks-pod-role for the analyzer's Bedrock + S3 + workload-source-code access.
- perf-platform.sh creates pod identity associations, installs the Grafana alert rule and contact point that calls the analyzer's /api/v1/grafana-webhook, and applies RBAC for the analyzer/collector ServiceAccounts.
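
The versionFrom() rule in the Collector notes (explicit label wins, otherwise the image tag) could look roughly like the sketch below. The collector source is not part of this excerpt, so the class name, the method signature, and the image-tag parsing are assumptions made for illustration; only the perf-profile/version label key comes from the commit message.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of the versionFrom() rule; not the collector's actual code.
final class VersionResolver {
    // Label key taken from the commit message.
    static final String VERSION_LABEL = "perf-profile/version";

    /** Explicit label wins; otherwise fall back to the container image tag. */
    static Optional<String> versionFrom(Map<String, String> labels, String image) {
        String explicit = labels == null ? null : labels.get(VERSION_LABEL);
        if (explicit != null && !explicit.isBlank()) {
            return Optional.of(explicit.trim());
        }
        return imageTag(image);
    }

    /** Extracts the tag from a reference such as "registry:5000/team/app:1.4.2". */
    static Optional<String> imageTag(String image) {
        if (image == null || image.isBlank()) {
            return Optional.empty();
        }
        // Drop a digest suffix ("@sha256:...") before looking for the tag.
        int at = image.indexOf('@');
        String ref = at >= 0 ? image.substring(0, at) : image;
        // A colon that appears before the last slash is a registry port, not a tag.
        int lastSlash = ref.lastIndexOf('/');
        int lastColon = ref.lastIndexOf(':');
        if (lastColon > lastSlash && lastColon < ref.length() - 1) {
            return Optional.of(ref.substring(lastColon + 1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(versionFrom(Map.of(VERSION_LABEL, "v2"), "repo/app:1.0"));
        System.out.println(versionFrom(Map.of(), "registry:5000/team/app:1.4.2"));
        System.out.println(versionFrom(Map.of(), "registry:5000/team/app"));
    }
}
```

Returning an empty Optional for an untagged image leaves the caller to decide what "no version" means; per the AnalysisContext comments in this commit, a null current version simply disables the diff lanes.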
1 parent 27228b3 commit ef95d7c

16 files changed

Lines changed: 1205 additions & 324 deletions

apps/perf-analyzer/src/main/java/com/example/perf/analyzer/AiService.java

Lines changed: 100 additions & 12 deletions
@@ -29,12 +29,38 @@ public class AiService {
2929
private static final ObjectMapper MAPPER = new ObjectMapper();
3030

3131
private static final String SYSTEM_PROMPT = """
32-
You are a Java performance engineer. You receive a pre-aggregated
33-
Pyroscope top functions table, a JFR summary covering GC pauses,
34-
JIT compilation, deoptimization, monitor contention, safepoints
35-
and JVM configuration, and a thread dump. All were captured around
36-
the time an alert fired or a developer triggered an on-demand
37-
analysis. Analyze the data and report what you find.
32+
You are a Java performance engineer. You receive:
33+
- per-frame Pyroscope diffs between the current service version
34+
and the prior one (when two versions are present), shown in two
35+
views: CPU profile and wall-clock profile. The diffs rank frames
36+
by total-time share (frame plus its descendants), so an
37+
application caller that walks into a new hotspot or a new
38+
contended primitive will appear directly in the diff even when
39+
the actual time is burned in a JVM intrinsic, glibc lock primitive
40+
or other low-level frame underneath it. CPU diff surfaces new
41+
on-CPU work; wall diff surfaces new contention or I/O waits.
42+
- pre-aggregated Pyroscope top-functions tables for the current
43+
version, again in CPU and wall views.
44+
- a JFR summary covering GC pauses, JIT compilation, deoptimization,
45+
monitor contention, safepoints and JVM configuration.
46+
- a thread dump.
47+
All were captured around the time an alert fired or a developer
48+
triggered an on-demand analysis.
49+
50+
When diffs are present, treat the frames with the largest positive
51+
deltas as the primary suspects — they are what changed. Pay particular
52+
attention to *application* (project package) frames in the diff: they
53+
identify the caller in the user's code and are usually the right
54+
anchor for a finding even when an underlying low-level frame
55+
(HashMap.merge, futex, arraycopy, JIT helper) shows a similar delta —
56+
those are typically the *mechanism*, not the cause. Cross-check the
57+
CPU and wall diffs: a real regression usually shows clearly in at
58+
least one and weakly in the other (CPU-heavy regression: big in CPU,
59+
small in wall; contention regression: big in wall, small in CPU).
60+
Frames that are large in absolute terms but small in any diff are
61+
long-standing workload characteristics and are usually not the cause
62+
of a regression, though they may still be worth flagging separately.
63+
Analyze the data and report what you find.
3864
""";
3965

4066
private final ChatClient chatClient;
@@ -87,15 +113,74 @@ String buildPrompt(AnalysisService.AnalysisContext ctx, boolean sourceCodeAvaila
87113
sb.append("- platform: **").append(r.platform().name().toLowerCase().replace('_', '-')).append("**\n");
88114
sb.append("- target: **").append(r.target()).append("**\n");
89115
sb.append("- trigger: ").append(r.source()).append("\n");
116+
if (ctx.currentVersion() != null) {
117+
sb.append("- current version: ").append(ctx.currentVersion()).append("\n");
118+
}
119+
if (ctx.baselineVersion() != null) {
120+
sb.append("- prior version (baseline for diff): ").append(ctx.baselineVersion()).append("\n");
121+
}
90122
if (r.reason() != null && !r.reason().isBlank()) {
91123
sb.append("- reason: ").append(r.reason()).append("\n");
92124
}
93125
sb.append("- analysisId: ").append(ctx.analysisId()).append("\n\n");
94126

127+
// Diffs are the most decision-useful pieces when they exist — they
128+
// tell the model what is *new* in the current version. Wall and CPU
129+
// diffs answer different questions: wall surfaces new contention or
130+
// I/O waits; CPU surfaces new actual computation. Present both first
131+
// (when available) so regression findings lead the report.
132+
boolean diffWallPresent = ctx.pyroscopeDiffWallMarkdown() != null
133+
&& !ctx.pyroscopeDiffWallMarkdown().isBlank();
134+
boolean diffCpuPresent = ctx.pyroscopeDiffCpuMarkdown() != null
135+
&& !ctx.pyroscopeDiffCpuMarkdown().isBlank();
136+
137+
if (diffWallPresent || diffCpuPresent) {
138+
sb.append("## Pyroscope version diffs (pre-fetched)\n\n")
139+
.append("Two views of what changed between baseline and current. Both diffs rank by ")
140+
.append("**total-time share** (frame plus descendants), so application callers that walk ")
141+
.append("into a new hotspot or contended primitive show up directly. ")
142+
.append("**CPU diff** is what to read first for new computation — frames with positive ")
143+
.append("Δ pp on the CPU diff are new on-CPU work. **Wall diff** is what to read first ")
144+
.append("for new contention or I/O waits — frames with positive Δ pp on the wall diff are ")
145+
.append("new waiting time. A regression often shows up sharply in one and weakly in the ")
146+
.append("other; cross-check both. When an application frame and a low-level frame both ")
147+
.append("rise together, the application frame is the cause and the low-level frame is the ")
148+
.append("mechanism.\n\n");
149+
150+
sb.append("### CPU diff (process_cpu)\n\n");
151+
sb.append(diffCpuPresent
152+
? ctx.pyroscopeDiffCpuMarkdown()
153+
: "_CPU diff unavailable._");
154+
sb.append("\n\n");
155+
156+
sb.append("### Wall diff (wall)\n\n");
157+
sb.append(diffWallPresent
158+
? ctx.pyroscopeDiffWallMarkdown()
159+
: "_Wall diff unavailable._");
160+
sb.append("\n\n");
161+
} else if (ctx.currentVersion() != null) {
162+
sb.append("## Pyroscope version diffs (pre-fetched)\n\n")
163+
.append("_No prior version is visible in Pyroscope for this service; diffs are not available. ")
164+
.append("Rely on the top-functions snapshots and JFR/thread-dump signals below._\n\n");
165+
}
166+
167+
// Top-functions snapshots in both lenses. Same window, two lenses on
168+
// the same workload state. Useful both as supporting context for the
169+
// diff above and as the primary signal when no diff is available.
95170
sb.append("## Pyroscope top functions (pre-fetched)\n\n")
96-
.append(ctx.pyroscopeTopFunctionsMarkdown() == null
97-
? "_Pyroscope data unavailable._\n"
98-
: ctx.pyroscopeTopFunctionsMarkdown())
171+
.append("Two lenses on the same window for the current version. Use **CPU** to find what is ")
172+
.append("computing right now; use **wall** to find what is waiting right now.\n\n");
173+
174+
sb.append("### CPU (process_cpu)\n\n")
175+
.append(ctx.pyroscopeCpuTopFunctionsMarkdown() == null
176+
? "_Pyroscope CPU data unavailable._\n"
177+
: ctx.pyroscopeCpuTopFunctionsMarkdown())
178+
.append("\n\n");
179+
180+
sb.append("### Wall (wall)\n\n")
181+
.append(ctx.pyroscopeWallTopFunctionsMarkdown() == null
182+
? "_Pyroscope wall data unavailable._\n"
183+
: ctx.pyroscopeWallTopFunctionsMarkdown())
99184
.append("\n\n");
100185

101186
sb.append("## JFR summary\n\n")
@@ -117,9 +202,12 @@ String buildPrompt(AnalysisService.AnalysisContext ctx, boolean sourceCodeAvaila
117202
One-line verdict: Healthy / Degraded / Critical.
118203
119204
## Findings
120-
Correlate the Pyroscope ranked functions with JFR events and
121-
thread states. Flag contention, resource pressure, configuration
122-
issues, or patterns that suggest a problem. Cite specific methods,
205+
Correlate the Pyroscope ranked functions (CPU and wall) with JFR
206+
events and thread states. When diffs are available, lead with what
207+
changed; cross-check CPU vs wall diffs to distinguish a new
208+
CPU-heavy code path from a new contention or I/O bottleneck. Flag
209+
resource pressure, configuration issues, or patterns that suggest
210+
a problem. Cite specific methods,
123211
thread names, and numbers.
124212
125213
## Recommendations
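
The "total-time share (frame plus descendants)" ranking that the system prompt relies on can be illustrated with a small sketch. PyroscopeTool.diff is not shown in this excerpt, so everything below is an assumption for illustration: the folded-stack input format ("root;child;leaf" mapped to a sample count), the class and record names, and the percentage-point delta. It is not the tool's actual implementation.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: rank frames by change in total-time share between two versions.
final class TotalShareDiff {
    record FrameDelta(String frame, double baselinePct, double currentPct) {
        double deltaPp() { return currentPct - baselinePct; }
    }

    /** Share of all samples whose stack contains the frame (frame plus descendants). */
    static Map<String, Double> totalSharePct(Map<String, Long> foldedStacks) {
        long all = 0;
        Map<String, Long> totals = new HashMap<>();
        for (var e : foldedStacks.entrySet()) {
            all += e.getValue();
            // De-duplicate frames within a stack so recursion is counted once.
            for (String frame : new LinkedHashSet<>(Arrays.asList(e.getKey().split(";")))) {
                totals.merge(frame, e.getValue(), Long::sum);
            }
        }
        Map<String, Double> pct = new HashMap<>();
        if (all == 0) {
            return pct;
        }
        for (var e : totals.entrySet()) {
            pct.put(e.getKey(), 100.0 * e.getValue() / all);
        }
        return pct;
    }

    /** Top-n frames by positive change in share, in percentage points. */
    static List<FrameDelta> diff(Map<String, Long> baseline, Map<String, Long> current, int n) {
        Map<String, Double> b = totalSharePct(baseline);
        Map<String, Double> c = totalSharePct(current);
        Set<String> frames = new HashSet<>(b.keySet());
        frames.addAll(c.keySet());
        return frames.stream()
                .map(f -> new FrameDelta(f, b.getOrDefault(f, 0.0), c.getOrDefault(f, 0.0)))
                .sorted(Comparator.comparingDouble((FrameDelta d) -> d.deltaPp()).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        var baseline = Map.of("main;handle;parse", 80L, "main;handle;render", 20L);
        var current = Map.of("main;handle;parse", 50L, "main;handle;render", 20L,
                "main;handle;audit;HashMap.merge", 30L);
        diff(baseline, current, 3).forEach(System.out::println);
    }
}
```

In the sample data, the new application frame audit and the low-level HashMap.merge beneath it rise by the same amount. That is the situation the prompt describes: the application frame is the anchor for the finding, and the low-level frame is the mechanism.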

apps/perf-analyzer/src/main/java/com/example/perf/analyzer/AnalysisService.java

Lines changed: 90 additions & 14 deletions
@@ -46,14 +46,15 @@ public class AnalysisService {
4646
private static final DateTimeFormatter TS =
4747
DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss").withZone(ZoneOffset.UTC);
4848
private static final ObjectMapper MAPPER = new ObjectMapper();
49-
private static final int COLLECTOR_PORT = 8080;
49+
private static final int COLLECTOR_PORT = 8090;
5050

5151
private final CoreV1Api k8s;
5252
private final EcsClient ecs;
5353
private final S3Repository s3;
5454
private final JfrParser jfrParser;
5555
private final AiService ai;
5656
private final PyroscopeTool pyroscope;
57+
private final PyroscopeVersionService versions;
5758
private final String collectorNamespace;
5859
private final String collectorPodLabel;
5960

@@ -67,6 +68,7 @@ public AnalysisService(
6768
JfrParser jfrParser,
6869
AiService ai,
6970
PyroscopeTool pyroscope,
71+
PyroscopeVersionService versions,
7072
@Value("${perf.analyzer.collector.namespace:monitoring}") String collectorNamespace,
7173
@Value("${perf.analyzer.collector.pod-label:app=perf-collector}") String collectorPodLabel
7274
) {
@@ -76,6 +78,7 @@ public AnalysisService(
7678
this.jfrParser = jfrParser;
7779
this.ai = ai;
7880
this.pyroscope = pyroscope;
81+
this.versions = versions;
7982
this.collectorNamespace = collectorNamespace;
8083
this.collectorPodLabel = collectorPodLabel;
8184
}
@@ -127,31 +130,79 @@ private void run(AnalysisRequest request, String analysisId, String prefix) {
127130
var jfrDumpUri = s3.profilingDumpUri(request, analysisId, "jfr");
128131
var threadDumpUri = s3.profilingDumpUri(request, analysisId, "json");
129132

133+
// Pre-resolve the version pair so the diff lane and the prompt both
134+
// know the exact labels we're comparing. Same 5-minute window as the
135+
// exporter — if the exporter is firing an alert for this service,
136+
// we want the analyzer looking at the same window it alerted on.
137+
var pyroService = pyroscopeServiceName(request);
138+
var versionPair = versions.selectCurrentAndBaseline(pyroService, 300L);
139+
140+
var windowEnd = Instant.now();
141+
var windowStart = windowEnd.minus(Duration.ofMinutes(5));
142+
130143
var jfrLane = CompletableFuture.supplyAsync(
131144
() -> captureJfr(request, analysisId, collectorUrl, jfrDumpUri), laneExecutor);
132145
var threadLane = CompletableFuture.supplyAsync(
133146
() -> captureThreadDump(request, analysisId, collectorUrl, threadDumpUri), laneExecutor);
134-
var pyroLane = CompletableFuture.supplyAsync(
135-
() -> pyroscope.topFunctions(
136-
request.service(),
137-
Instant.now().minus(Duration.ofMinutes(5)).toString(),
138-
Instant.now().toString(),
139-
20),
147+
148+
// Top-functions lanes — one per profile type. Wall surfaces "where is
149+
// time being spent including waits"; CPU surfaces "where is the CPU
150+
// actually doing work". Both views go to the model so it can decide
151+
// which lens applies to the question at hand.
152+
var pyroWallLane = CompletableFuture.supplyAsync(
153+
() -> pyroscope.topFunctions(pyroService, "wall",
154+
windowStart.toString(), windowEnd.toString(), 20),
140155
laneExecutor);
156+
var pyroCpuLane = CompletableFuture.supplyAsync(
157+
() -> pyroscope.topFunctions(pyroService, "cpu",
158+
windowStart.toString(), windowEnd.toString(), 20),
159+
laneExecutor);
160+
161+
// Diff lanes — conditional on a baseline existing. We run BOTH
162+
// profile-type diffs because a regression can show up differently
163+
// in each lens: a new contention point shows in wall, a new hot
164+
// computation shows in CPU. CompletableFuture.completedFuture(null)
165+
// for the skipped case keeps the prompt-builder branch simple.
166+
CompletableFuture<String> diffWallLane = versionPair.hasBaseline()
167+
? CompletableFuture.supplyAsync(
168+
() -> pyroscope.diff(pyroService, "wall",
169+
versionPair.baseline(), versionPair.current(),
170+
windowStart.toString(), windowEnd.toString(), 20),
171+
laneExecutor)
172+
: CompletableFuture.completedFuture(null);
173+
CompletableFuture<String> diffCpuLane = versionPair.hasBaseline()
174+
? CompletableFuture.supplyAsync(
175+
() -> pyroscope.diff(pyroService, "cpu",
176+
versionPair.baseline(), versionPair.current(),
177+
windowStart.toString(), windowEnd.toString(), 20),
178+
laneExecutor)
179+
: CompletableFuture.completedFuture(null);
141180

142181
String jfrMarkdown = null;
143182
String threadDumpText = null;
144-
String pyroscopeMarkdown = null;
183+
String pyroscopeWallMarkdown = null;
184+
String pyroscopeCpuMarkdown = null;
185+
String pyroscopeDiffWallMarkdown = null;
186+
String pyroscopeDiffCpuMarkdown = null;
145187

146188
try { jfrMarkdown = jfrLane.get(3, TimeUnit.MINUTES); }
147189
catch (Exception e) { logger.warn("JFR lane failed: {}", e.getMessage()); }
148190
try { threadDumpText = threadLane.get(60, TimeUnit.SECONDS); }
149191
catch (Exception e) { logger.warn("Thread dump lane failed: {}", e.getMessage()); }
150-
try { pyroscopeMarkdown = pyroLane.get(30, TimeUnit.SECONDS); }
151-
catch (Exception e) { logger.warn("Pyroscope lane failed: {}", e.getMessage()); }
192+
try { pyroscopeWallMarkdown = pyroWallLane.get(30, TimeUnit.SECONDS); }
193+
catch (Exception e) { logger.warn("Pyroscope wall lane failed: {}", e.getMessage()); }
194+
try { pyroscopeCpuMarkdown = pyroCpuLane.get(30, TimeUnit.SECONDS); }
195+
catch (Exception e) { logger.warn("Pyroscope cpu lane failed: {}", e.getMessage()); }
196+
try { pyroscopeDiffWallMarkdown = diffWallLane.get(30, TimeUnit.SECONDS); }
197+
catch (Exception e) { logger.warn("Pyroscope wall-diff lane failed: {}", e.getMessage()); }
198+
try { pyroscopeDiffCpuMarkdown = diffCpuLane.get(30, TimeUnit.SECONDS); }
199+
catch (Exception e) { logger.warn("Pyroscope cpu-diff lane failed: {}", e.getMessage()); }
152200

153201
var ctx = new AnalysisContext(
154-
request, analysisId, jfrMarkdown, threadDumpText, pyroscopeMarkdown,
202+
request, analysisId, jfrMarkdown, threadDumpText,
203+
pyroscopeWallMarkdown, pyroscopeCpuMarkdown,
204+
pyroscopeDiffWallMarkdown, pyroscopeDiffCpuMarkdown,
205+
versionPair.current(), versionPair.baseline(),
155206
metadata.githubRepo(), metadata.githubPath());
156207

157208
String analysisMd;
@@ -160,7 +211,7 @@ private void run(AnalysisRequest request, String analysisId, String prefix) {
160211
} catch (Exception e) {
161212
writePartialFailure(request, analysisId, prefix,
162213
"Bedrock analysis failed: " + e.getMessage(),
163-
jfrMarkdown, threadDumpText, pyroscopeMarkdown);
214+
jfrMarkdown, threadDumpText, pyroscopeWallMarkdown);
164215
return;
165216
}
166217

@@ -354,9 +405,29 @@ private static String shortRandom() {
354405
return Long.toHexString(Double.doubleToLongBits(Math.random())).substring(0, 6);
355406
}
356407

408+
/**
409+
* Pyroscope service name derived from the request platform. The collector
410+
* publishes samples as {@code <workload>-eks} / {@code <workload>-ecs} so
411+
* the two runtimes have separate entries in Profiles Drilldown. Analyzer
412+
* queries must use the same suffix or Pyroscope returns no data.
413+
*/
414+
private static String pyroscopeServiceName(AnalysisRequest request) {
415+
var suffix = request.platform() == Platform.ECS_FARGATE ? "ecs" : "eks";
416+
return request.service() + "-" + suffix;
417+
}
418+
357419
// === Domain types ===
358420

359-
public enum Platform { EKS, ECS_FARGATE }
421+
public enum Platform {
422+
EKS, ECS_FARGATE;
423+
424+
/** Case-insensitive deserialization: accepts "eks"/"EKS" and "ecs-fargate"/"ECS_FARGATE". */
425+
@com.fasterxml.jackson.annotation.JsonCreator
426+
public static Platform fromJson(String value) {
427+
if (value == null) return null;
428+
return Platform.valueOf(value.trim().toUpperCase(java.util.Locale.ROOT).replace('-', '_'));
429+
}
430+
}
360431

361432
public enum TriggerSource { ON_DEMAND, GRAFANA_WEBHOOK }
362433

@@ -378,7 +449,12 @@ public record AnalysisContext(
378449
String analysisId,
379450
String jfrSummaryMarkdown,
380451
String threadDumpText,
381-
String pyroscopeTopFunctionsMarkdown,
452+
String pyroscopeWallTopFunctionsMarkdown, // wall-clock top-N
453+
String pyroscopeCpuTopFunctionsMarkdown, // on-CPU top-N
454+
String pyroscopeDiffWallMarkdown, // null if only one version visible
455+
String pyroscopeDiffCpuMarkdown, // null if only one version visible
456+
String currentVersion, // null if no version labels present
457+
String baselineVersion, // null if only one version visible
382458
String githubRepo, // e.g. "aws-samples/java-on-aws", null if not configured
383459
String githubPath // e.g. "apps/unicorn-store-spring"
384460
) {}
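
The lane pattern used in run() (start each lane asynchronously, skip the diff lanes with an already-completed null when there is no baseline, then wait on each lane with its own timeout) can be reduced to the sketch below. The helper names are hypothetical. The commit message says the lanes run on virtual threads; the sketch substitutes a cached thread pool so that it also compiles on JDKs without virtual threads.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch of the lane pattern; not the AnalysisService code itself.
final class LaneSketch {
    /** Starts the lane only when enabled; otherwise returns an already-completed null. */
    static CompletableFuture<String> optionalLane(boolean enabled, Supplier<String> work, Executor executor) {
        return enabled
                ? CompletableFuture.supplyAsync(work, executor)
                : CompletableFuture.completedFuture(null);
    }

    /** Waits for one lane; a timeout or failure yields null so the other lanes still count. */
    static String awaitLane(String name, CompletableFuture<String> lane, long timeout, TimeUnit unit) {
        try {
            return lane.get(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException | TimeoutException e) {
            // toString() keeps the exception type visible even when there is no message.
            System.err.println(name + " lane failed: " + e);
            return null;
        }
    }

    public static void main(String[] args) {
        // Stand-in executor (assumption); the commit uses virtual threads.
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            boolean hasBaseline = false; // no prior version visible, so the diff lane is skipped
            var topLane = optionalLane(true, () -> "| frame | share |", executor);
            var diffLane = optionalLane(hasBaseline, () -> "| frame | delta |", executor);
            System.out.println("top-N: " + awaitLane("top-N", topLane, 30, TimeUnit.SECONDS));
            System.out.println("diff:  " + awaitLane("diff", diffLane, 30, TimeUnit.SECONDS));
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Because every lane resolves to either its Markdown or null, the prompt builder only needs null checks. The sketch logs the exception itself rather than getMessage(), since a TimeoutException raised by get typically carries no message.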
