Skip to content

Commit e6daf23

Browse files
author
Yuriy Bezsonov
committed
refactor(perf-platform): drop version diff and ratio exporter
Remove the canary-regression detection path entirely. The diff lane, ratio gauge, and version-aware webhook handling fought the timing constraints of a workshop session — JIT warm-up on a freshly started canary produces ±10pp deltas that drown anything else, and there is no way to wait it out in 90 minutes. Replace with a four-lane single-version analysis that runs equally well at zero load, light load, and saturating load. Analyzer - Drop ProfileRatioExporter and PyroscopeVersionService entirely. No more in-process Prometheus gauge, no more version-pair selection. - PyroscopeTool keeps a single @tool method, topFunctions(service, profileType, from, to, limit). The diff() tool method, totalsForVersion helper, and total-time delta formatter are gone. - AnalysisService runs four parallel lanes (CPU top, wall top, JFR, thread dump). AnalysisContext drops the diff and version fields. Window stays 5 minutes, single static value at the top of the file. - AiService prompt reframed around two-lens correlation rather than diffs. System prompt: CPU is what is computing, wall is what is waiting; cross-check both. The application-frame-vs-low-level-frame guidance carries over. - AnalyzeController.toRequest still recovers workload + platform from the {service_name}-{eks|ecs} convention so any Grafana alert label pattern carrying that name works without code changes. Collector - AsyncProfilerAttach javadoc rewritten to match the implementation: CPU + wall via -e cpu --wall 10ms, JFR rotation via --loop 15s, Pyroscope splits the JFR file into process_cpu and wall profile types automatically. No code change to the attach itself; the previous wall-only experiment has been reverted. Verified end-to-end on the live cluster: stripped analyzer rebuilt, redeployed, and a single on-demand /api/v1/analyze against the unicorn-store-spring pod produced a 60-second report that correctly identified the latent blocking .get() in UnicornService.publishUnicorn- Event with a source-code citation.
1 parent ef95d7c commit e6daf23

7 files changed

Lines changed: 76 additions & 642 deletions

File tree

apps/perf-analyzer/src/main/java/com/example/perf/analyzer/AiService.java

Lines changed: 24 additions & 83 deletions
Original file line numberDiff line numberDiff line change
@@ -30,49 +30,39 @@ public class AiService {
3030

3131
private static final String SYSTEM_PROMPT = """
3232
You are a Java performance engineer. You receive:
33-
- per-frame Pyroscope diffs between the current service version
34-
and the prior one (when two versions are present), shown in two
35-
views: CPU profile and wall-clock profile. The diffs rank frames
36-
by total-time share (frame plus its descendants), so an
37-
application caller that walks into a new hotspot or a new
38-
contended primitive will appear directly in the diff even when
39-
the actual time is burned in a JVM intrinsic, glibc lock primitive
40-
or other low-level frame underneath it. CPU diff surfaces new
41-
on-CPU work; wall diff surfaces new contention or I/O waits.
42-
- pre-aggregated Pyroscope top-functions tables for the current
43-
version, again in CPU and wall views.
33+
- pre-aggregated Pyroscope top-functions tables for the target
34+
service in two views: CPU profile (process_cpu) and wall-clock
35+
profile (wall). CPU shows what is actually computing on a core;
36+
wall shows where every sampled thread spends time, regardless of
37+
state. CPU is the right lens for finding new computation; wall
38+
is the right lens for finding contention, blocked I/O, and
39+
downstream waits.
4440
- a JFR summary covering GC pauses, JIT compilation, deoptimization,
4541
monitor contention, safepoints and JVM configuration.
4642
- a thread dump.
4743
All were captured around the time an alert fired or a developer
4844
triggered an on-demand analysis.
4945
50-
When diffs are present, treat the frames with the largest positive
51-
deltas as the primary suspects — they are what changed. Pay particular
52-
attention to *application* (project package) frames in the diff: they
53-
identify the caller in the user's code and are usually the right
54-
anchor for a finding even when an underlying low-level frame
55-
(HashMap.merge, futex, arraycopy, JIT helper) shows a similar delta —
56-
those are typically the *mechanism*, not the cause. Cross-check the
57-
CPU and wall diffs: a real regression usually shows clearly in at
58-
least one and weakly in the other (CPU-heavy regression: big in CPU,
59-
small in wall; contention regression: big in wall, small in CPU).
60-
Frames that are large in absolute terms but small in any diff are
61-
long-standing workload characteristics and are usually not the cause
62-
of a regression, though they may still be worth flagging separately.
63-
Analyze the data and report what you find.
46+
Cross-check across signals: a method that appears hot in the wall
47+
profile and shows blocked or waiting threads in the dump is the
48+
same observation in two lenses. A method that appears hot in CPU
49+
but quiet in wall is real on-CPU work, not waiting. Pay particular
50+
attention to *application* (project package) frames — they identify
51+
the caller in the user's code and are usually the right anchor for
52+
a finding even when the actual time is burned in a low-level frame
53+
underneath (HashMap, futex, JIT helper, glibc lock primitive); the
54+
low-level frame is the mechanism, the application frame is the
55+
cause. Analyze the data and report what you find.
6456
""";
6557

6658
private final ChatClient chatClient;
67-
private final PyroscopeTool pyroscopeTool;
6859
private final String githubToken;
6960

7061
public AiService(
7162
ChatClient.Builder chatClientBuilder,
7263
PyroscopeTool pyroscopeTool,
7364
@Value("${GITHUB_TOKEN:}") String githubToken
7465
) {
75-
this.pyroscopeTool = pyroscopeTool;
7666
this.githubToken = githubToken;
7767
this.chatClient = chatClientBuilder
7868
.defaultSystem(SYSTEM_PROMPT)
@@ -113,63 +103,16 @@ String buildPrompt(AnalysisService.AnalysisContext ctx, boolean sourceCodeAvaila
113103
sb.append("- platform: **").append(r.platform().name().toLowerCase().replace('_', '-')).append("**\n");
114104
sb.append("- target: **").append(r.target()).append("**\n");
115105
sb.append("- trigger: ").append(r.source()).append("\n");
116-
if (ctx.currentVersion() != null) {
117-
sb.append("- current version: ").append(ctx.currentVersion()).append("\n");
118-
}
119-
if (ctx.baselineVersion() != null) {
120-
sb.append("- prior version (baseline for diff): ").append(ctx.baselineVersion()).append("\n");
121-
}
122106
if (r.reason() != null && !r.reason().isBlank()) {
123107
sb.append("- reason: ").append(r.reason()).append("\n");
124108
}
125109
sb.append("- analysisId: ").append(ctx.analysisId()).append("\n\n");
126110

127-
// Diffs are the most decision-useful pieces when they exist — they
128-
// tell the model what is *new* in the current version. Wall and CPU
129-
// diffs answer different questions: wall surfaces new contention or
130-
// I/O waits; CPU surfaces new actual computation. Present both first
131-
// (when available) so regression findings lead the report.
132-
boolean diffWallPresent = ctx.pyroscopeDiffWallMarkdown() != null
133-
&& !ctx.pyroscopeDiffWallMarkdown().isBlank();
134-
boolean diffCpuPresent = ctx.pyroscopeDiffCpuMarkdown() != null
135-
&& !ctx.pyroscopeDiffCpuMarkdown().isBlank();
136-
137-
if (diffWallPresent || diffCpuPresent) {
138-
sb.append("## Pyroscope version diffs (pre-fetched)\n\n")
139-
.append("Two views of what changed between baseline and current. Both diffs rank by ")
140-
.append("**total-time share** (frame plus descendants), so application callers that walk ")
141-
.append("into a new hotspot or contended primitive show up directly. ")
142-
.append("**CPU diff** is what to read first for new computation — frames with positive ")
143-
.append("Δ pp on the CPU diff are new on-CPU work. **Wall diff** is what to read first ")
144-
.append("for new contention or I/O waits — frames with positive Δ pp on the wall diff are ")
145-
.append("new waiting time. A regression often shows up sharply in one and weakly in the ")
146-
.append("other; cross-check both. When an application frame and a low-level frame both ")
147-
.append("rise together, the application frame is the cause and the low-level frame is the ")
148-
.append("mechanism.\n\n");
149-
150-
sb.append("### CPU diff (process_cpu)\n\n");
151-
sb.append(diffCpuPresent
152-
? ctx.pyroscopeDiffCpuMarkdown()
153-
: "_CPU diff unavailable._");
154-
sb.append("\n\n");
155-
156-
sb.append("### Wall diff (wall)\n\n");
157-
sb.append(diffWallPresent
158-
? ctx.pyroscopeDiffWallMarkdown()
159-
: "_Wall diff unavailable._");
160-
sb.append("\n\n");
161-
} else if (ctx.currentVersion() != null) {
162-
sb.append("## Pyroscope version diffs (pre-fetched)\n\n")
163-
.append("_No prior version is visible in Pyroscope for this service; diffs are not available. ")
164-
.append("Rely on the top-functions snapshots and JFR/thread-dump signals below._\n\n");
165-
}
166-
167111
// Top-functions snapshots in both lenses. Same window, two lenses on
168-
// the same workload state. Useful both as supporting context for the
169-
// diff above and as the primary signal when no diff is available.
112+
// the same workload state.
170113
sb.append("## Pyroscope top functions (pre-fetched)\n\n")
171-
.append("Two lenses on the same window for the current version. Use **CPU** to find what is ")
172-
.append("computing right now; use **wall** to find what is waiting right now.\n\n");
114+
.append("Two lenses on the same window. Use **CPU** to find what is computing right now; ")
115+
.append("use **wall** to find what is waiting on locks, I/O, or downstream calls right now.\n\n");
173116

174117
sb.append("### CPU (process_cpu)\n\n")
175118
.append(ctx.pyroscopeCpuTopFunctionsMarkdown() == null
@@ -203,12 +146,10 @@ String buildPrompt(AnalysisService.AnalysisContext ctx, boolean sourceCodeAvaila
203146
204147
## Findings
205148
Correlate the Pyroscope ranked functions (CPU and wall) with JFR
206-
events and thread states. When diffs are available, lead with what
207-
changed; cross-check CPU vs wall diffs to distinguish a new
208-
CPU-heavy code path from a new contention or I/O bottleneck. Flag
209-
resource pressure, configuration issues, or patterns that suggest
210-
a problem. Cite specific methods,
211-
thread names, and numbers.
149+
events and thread states. Distinguish on-CPU work (CPU view) from
150+
blocking waits and contention (wall view). Flag resource pressure,
151+
configuration issues, or patterns that suggest a problem. Cite
152+
specific methods, thread names, and numbers.
212153
213154
## Recommendations
214155
Prioritized — most impactful first. Only include actionable items

apps/perf-analyzer/src/main/java/com/example/perf/analyzer/AnalysisService.java

Lines changed: 16 additions & 54 deletions
Original file line numberDiff line numberDiff line change
@@ -28,9 +28,13 @@
2828

2929
/**
3030
* Virtual-thread orchestrator. Each submitted analysis runs on its own
31-
* virtual thread, launching three parallel sub-lanes for JFR, thread dump,
32-
* and Pyroscope top functions. Results are stitched together, handed to
33-
* AiService, and the Markdown report lands in S3 alongside the raw artifacts.
31+
* virtual thread, launching four parallel sub-lanes:
32+
* - Pyroscope CPU top functions (current 5-minute window)
33+
* - Pyroscope wall top functions (current 5-minute window)
34+
* - JFR snapshot via the collector
35+
* - Thread dump via the collector
36+
* Results are stitched together, handed to AiService, and the Markdown
37+
* report lands in S3 alongside the raw artifacts.
3438
*
3539
* Collector locating (pod -> node -> DaemonSet pod IP on EKS; task ARN ->
3640
* ENI IP on ECS) lives here too. It's a single-responsibility concern —
@@ -47,14 +51,14 @@ public class AnalysisService {
4751
DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss").withZone(ZoneOffset.UTC);
4852
private static final ObjectMapper MAPPER = new ObjectMapper();
4953
private static final int COLLECTOR_PORT = 8090;
54+
private static final Duration WINDOW = Duration.ofMinutes(5);
5055

5156
private final CoreV1Api k8s;
5257
private final EcsClient ecs;
5358
private final S3Repository s3;
5459
private final JfrParser jfrParser;
5560
private final AiService ai;
5661
private final PyroscopeTool pyroscope;
57-
private final PyroscopeVersionService versions;
5862
private final String collectorNamespace;
5963
private final String collectorPodLabel;
6064

@@ -68,7 +72,6 @@ public AnalysisService(
6872
JfrParser jfrParser,
6973
AiService ai,
7074
PyroscopeTool pyroscope,
71-
PyroscopeVersionService versions,
7275
@Value("${perf.analyzer.collector.namespace:monitoring}") String collectorNamespace,
7376
@Value("${perf.analyzer.collector.pod-label:app=perf-collector}") String collectorPodLabel
7477
) {
@@ -78,7 +81,6 @@ public AnalysisService(
7881
this.jfrParser = jfrParser;
7982
this.ai = ai;
8083
this.pyroscope = pyroscope;
81-
this.versions = versions;
8284
this.collectorNamespace = collectorNamespace;
8385
this.collectorPodLabel = collectorPodLabel;
8486
}
@@ -130,25 +132,17 @@ private void run(AnalysisRequest request, String analysisId, String prefix) {
130132
var jfrDumpUri = s3.profilingDumpUri(request, analysisId, "jfr");
131133
var threadDumpUri = s3.profilingDumpUri(request, analysisId, "json");
132134

133-
// Pre-resolve the version pair so the diff lane and the prompt both
134-
// know the exact labels we're comparing. Same 5-minute window as the
135-
// exporter — if the exporter is firing an alert for this service,
136-
// we want the analyzer looking at the same window it alerted on.
137135
var pyroService = pyroscopeServiceName(request);
138-
var versionPair = versions.selectCurrentAndBaseline(pyroService, 300L);
139-
140-
var windowStart = Instant.now().minus(Duration.ofMinutes(5));
141136
var windowEnd = Instant.now();
137+
var windowStart = windowEnd.minus(WINDOW);
142138

139+
// Four parallel lanes. Pyroscope lanes pre-fetch ranked top-N tables
140+
// for both profile types so the prompt has them ready; collector
141+
// lanes capture JFR + thread dump for the same workload.
143142
var jfrLane = CompletableFuture.supplyAsync(
144143
() -> captureJfr(request, analysisId, collectorUrl, jfrDumpUri), laneExecutor);
145144
var threadLane = CompletableFuture.supplyAsync(
146145
() -> captureThreadDump(request, analysisId, collectorUrl, threadDumpUri), laneExecutor);
147-
148-
// Top-functions lanes — one per profile type. Wall surfaces "where is
149-
// time being spent including waits"; CPU surfaces "where is the CPU
150-
// actually doing work". Both views go to the model so it can decide
151-
// which lens applies to the question at hand.
152146
var pyroWallLane = CompletableFuture.supplyAsync(
153147
() -> pyroscope.topFunctions(pyroService, "wall",
154148
windowStart.toString(), windowEnd.toString(), 20),
@@ -158,32 +152,10 @@ private void run(AnalysisRequest request, String analysisId, String prefix) {
158152
windowStart.toString(), windowEnd.toString(), 20),
159153
laneExecutor);
160154

161-
// Diff lanes — conditional on a baseline existing. We run BOTH
162-
// profile-type diffs because a regression can show up differently
163-
// in each lens: a new contention point shows in wall, a new hot
164-
// computation shows in CPU. CompletableFuture.completedFuture(null)
165-
// for the skipped case keeps the prompt-builder branch simple.
166-
CompletableFuture<String> diffWallLane = versionPair.hasBaseline()
167-
? CompletableFuture.supplyAsync(
168-
() -> pyroscope.diff(pyroService, "wall",
169-
versionPair.baseline(), versionPair.current(),
170-
windowStart.toString(), windowEnd.toString(), 20),
171-
laneExecutor)
172-
: CompletableFuture.completedFuture(null);
173-
CompletableFuture<String> diffCpuLane = versionPair.hasBaseline()
174-
? CompletableFuture.supplyAsync(
175-
() -> pyroscope.diff(pyroService, "cpu",
176-
versionPair.baseline(), versionPair.current(),
177-
windowStart.toString(), windowEnd.toString(), 20),
178-
laneExecutor)
179-
: CompletableFuture.completedFuture(null);
180-
181155
String jfrMarkdown = null;
182156
String threadDumpText = null;
183157
String pyroscopeWallMarkdown = null;
184158
String pyroscopeCpuMarkdown = null;
185-
String pyroscopeDiffWallMarkdown = null;
186-
String pyroscopeDiffCpuMarkdown = null;
187159

188160
try { jfrMarkdown = jfrLane.get(3, TimeUnit.MINUTES); }
189161
catch (Exception e) { logger.warn("JFR lane failed: {}", e.getMessage()); }
@@ -193,16 +165,10 @@ private void run(AnalysisRequest request, String analysisId, String prefix) {
193165
catch (Exception e) { logger.warn("Pyroscope wall lane failed: {}", e.getMessage()); }
194166
try { pyroscopeCpuMarkdown = pyroCpuLane.get(30, TimeUnit.SECONDS); }
195167
catch (Exception e) { logger.warn("Pyroscope cpu lane failed: {}", e.getMessage()); }
196-
try { pyroscopeDiffWallMarkdown = diffWallLane.get(30, TimeUnit.SECONDS); }
197-
catch (Exception e) { logger.warn("Pyroscope wall-diff lane failed: {}", e.getMessage()); }
198-
try { pyroscopeDiffCpuMarkdown = diffCpuLane.get(30, TimeUnit.SECONDS); }
199-
catch (Exception e) { logger.warn("Pyroscope cpu-diff lane failed: {}", e.getMessage()); }
200168

201169
var ctx = new AnalysisContext(
202170
request, analysisId, jfrMarkdown, threadDumpText,
203171
pyroscopeWallMarkdown, pyroscopeCpuMarkdown,
204-
pyroscopeDiffWallMarkdown, pyroscopeDiffCpuMarkdown,
205-
versionPair.current(), versionPair.baseline(),
206172
metadata.githubRepo(), metadata.githubPath());
207173

208174
String analysisMd;
@@ -449,13 +415,9 @@ public record AnalysisContext(
449415
String analysisId,
450416
String jfrSummaryMarkdown,
451417
String threadDumpText,
452-
String pyroscopeWallTopFunctionsMarkdown, // wall-clock top-N
453-
String pyroscopeCpuTopFunctionsMarkdown, // on-CPU top-N
454-
String pyroscopeDiffWallMarkdown, // null if only one version visible
455-
String pyroscopeDiffCpuMarkdown, // null if only one version visible
456-
String currentVersion, // null if no version labels present
457-
String baselineVersion, // null if only one version visible
458-
String githubRepo, // e.g. "aws-samples/java-on-aws", null if not configured
459-
String githubPath // e.g. "apps/unicorn-store-spring"
418+
String pyroscopeWallTopFunctionsMarkdown,
419+
String pyroscopeCpuTopFunctionsMarkdown,
420+
String githubRepo,
421+
String githubPath
460422
) {}
461423
}

apps/perf-analyzer/src/main/java/com/example/perf/analyzer/AnalyzeController.java

Lines changed: 9 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -82,11 +82,14 @@ private static void validateTarget(AnalyzeRequest r) {
8282
}
8383

8484
/**
85-
* Derive the analysis request from Grafana alert labels. The exporter emits
86-
* {@code perf_profile_cpu_ratio{service_name="unicorn-store-spring-eks",version="v2"}} —
87-
* we strip the {@code -eks}/{@code -ecs} suffix to recover the workload name and
88-
* derive the platform from the suffix, then resolve a current pod/task by the
89-
* {@code perf-profile/service} label or tag.
85+
* Derive the analysis request from Grafana alert labels. Whatever the
86+
* alert source, we expect either:
87+
* - a {@code service_name} label that matches what the collector
88+
* publishes (e.g. {@code unicorn-store-spring-eks}), with the
89+
* {@code -eks}/{@code -ecs} suffix telling us the platform; or
90+
* - a {@code service} label plus an explicit {@code platform} label.
91+
* For EKS we resolve a current pod by the {@code perf-profile/service}
92+
* label; for ECS the alert must carry an explicit {@code task} label.
9093
*/
9194
private AnalysisService.AnalysisRequest toRequest(Map<String, String> labels) {
9295
if (labels == null) return null;
@@ -123,16 +126,14 @@ private AnalysisService.AnalysisRequest toRequest(Map<String, String> labels) {
123126
return null;
124127
}
125128
} else {
126-
// ECS target resolution via tags isn't wired here yet — fall back to
127-
// whatever the alert labels supply (set manually in curl-based tests).
128129
task = labels.get("task");
129130
if (task == null || task.isBlank()) {
130131
logger.warn("No task in labels for workload={} on ECS", workload);
131132
return null;
132133
}
133134
}
134135

135-
var reason = "Grafana alert: " + labels.getOrDefault("alertname", "PerfProfileRegression");
136+
var reason = "Grafana alert: " + labels.getOrDefault("alertname", "unknown");
136137
return new AnalysisService.AnalysisRequest(
137138
workload, platform, pod, task, reason,
138139
AnalysisService.TriggerSource.GRAFANA_WEBHOOK);

0 commit comments

Comments
 (0)