Commit 37bcc39

fix(telemetry): synchronous send + version-alignment CI (#1706) (#136)
Java SDK had the same telemetry-delivery bug as Python (#1692) and Go (#1693): CompletableFuture.runAsync(lambda) without an explicit executor submits to ForkJoinPool.commonPool(), whose threads are daemon by default since Java 8. When the JVM's main thread exits, those daemon threads are killed mid-flight, and the OkHttpClient.newCall().execute() inside the lambda is abandoned. CLI Java binaries, AWS Lambda Java handlers, serverless cold-starts, and quickstart scripts silently drop telemetry pings with no error visible to the caller.

Per the 2026-04-24 SDK-telemetry investigation, this is a likely major contributor to the 0 external-confirmed Java records we've observed to date.

## Fix (TelemetryReporter.java)

- Replaced CompletableFuture.runAsync(...) with synchronous execution. Blocks the caller briefly: ~350ms warm / ~1.3s cold on a reachable checkpoint, bounded at TIMEOUT_SECONDS on an unreachable one. Acceptable for a control-plane SDK's construction path, matching the Go SDK pattern shipped in #1693.
- Shared monotonic deadline across /health probe and checkpoint POST. Previously detectPlatformVersion (2s timeout) and the checkpoint POST (3s timeout) had independent timeouts that could stack to ~5s. detectPlatformVersion now takes a budgetMs parameter derived from the shared deadline; POST uses whatever is left.
- Both operations skip when remaining budget is below MIN_BUDGET_MS (100ms) to avoid issuing calls that are guaranteed to time out.

## Regression test (TelemetryReporterShortLivedTest.java)

Verifies the core invariant: when sendPing returns, the HTTP round-trip has already completed (not still-pending on a dying daemon thread). Uses a WireMock server with a fixed-delay response; measures elapsed time to confirm sendPing actually blocked. Verified:

- FAIL at 0.070s when reverted to CompletableFuture.runAsync
- PASS at 0.971s with the fix

So a future regression to fire-and-forget is caught by CI, not by missing telemetry in production.

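The daemon-thread behaviour named as the root cause can be checked on any given JVM. The following is a minimal sketch and not code from this commit: `DaemonProbe` is a hypothetical class name, and it uses only the JDK. One caveat: per the `CompletableFuture` Javadoc, when the common pool does not support a parallelism of at least two, some JDK versions run each task on a newly created thread instead, so the probe may report `false` on very small machines.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical probe: reports which kind of thread runs a no-executor runAsync task. */
final class DaemonProbe {

  /** Runs a trivial task via runAsync and returns whether it executed on a daemon thread. */
  static boolean ranOnDaemonThread() {
    AtomicBoolean daemon = new AtomicBoolean();
    // join() waits for the task, so the flag is set before we read it.
    CompletableFuture.runAsync(() -> daemon.set(Thread.currentThread().isDaemon())).join();
    return daemon.get();
  }

  public static void main(String[] args) {
    System.out.println("common pool parallelism: " + ForkJoinPool.getCommonPoolParallelism());
    System.out.println("runAsync ran on a daemon thread: " + ranOnDaemonThread());
  }
}
```

A daemon thread does not keep the JVM alive, which is why work handed to it can be cut off when `main` returns without waiting.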
## CHANGELOG

Added [Unreleased] section with terse one-line entries per the telemetry-CHANGELOG-minimal rule: delivery fix + shared-deadline bound + the alignment-check addition below.

## Version alignment CI (.github/scripts + .github/workflows)

Mirrors the pattern just added in axonflow-sdk-go PR #130 and already present in the platform repo. Script compares pom.xml <version> against the first released '## [X.Y.Z]' section in CHANGELOG.md; CI runs on any PR or push to main that touches either file.

Prevents the drift pattern where a release ships to Maven Central but the repo's pom stays behind (and the inverse: pom bumped but CHANGELOG still shows the prior version as latest released).

Verified the script locally: PASS on current state (pom 5.7.0 == CHANGELOG 5.7.0); FAIL when pom is manually mismatched to 5.8.0.

Full test suite: 1,200 tests, 0 failures.

Closes getaxonflow/axonflow-enterprise#1706.
1 parent 5720c3d commit 37bcc39

5 files changed

Lines changed: 279 additions & 34 deletions

.github/scripts/validate-version-alignment.sh

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
+#!/usr/bin/env bash
+# Validates that the Java SDK's declared version (pom.xml) matches the
+# latest released version in CHANGELOG.md. Patterned on the AxonFlow
+# platform's and Go SDK's script of the same name.
+#
+# Purpose: keep the repo's manifest (pom.xml) in lock-step with the
+# CHANGELOG so the state on `main` always matches the most recent
+# published tag. Prevents two drift patterns:
+# - repo says 5.7.0 but 5.7.1 has already shipped to Maven Central
+# - repo says 5.7.1 but CHANGELOG still shows 5.7.0 as latest released
+#
+# Run locally:
+# ./.github/scripts/validate-version-alignment.sh
+#
+# CI: runs on every PR and push to main that touches CHANGELOG.md or
+# pom.xml (see .github/workflows/validate-version-alignment.yml).
+
+set -euo pipefail
+
+ERRORS=0
+
+# Latest RELEASED version = first `## [x.y.z]` line that isn't the
+# Keep-a-Changelog "[Unreleased]" placeholder. The Unreleased section
+# accumulates in-flight changes between tags and must not be used as
+# the expected-version target — the manifest only gets bumped when we
+# actually cut a tag.
+LATEST_VERSION=$(grep -m1 -E '^## \[[0-9]' CHANGELOG.md | sed 's/## \[\(.*\)\].*/\1/' | sed 's/^v//')
+
+if [ -z "${LATEST_VERSION:-}" ]; then
+    echo "❌ Could not extract a released version (## [X.Y.Z]) from CHANGELOG.md"
+    exit 1
+fi
+
+echo "📋 Latest CHANGELOG version: $LATEST_VERSION"
+echo ""
+
+# Check pom.xml <version>. We match the first top-level <version> (the
+# project's own declaration), not any <version> inside <dependencies>
+# or <plugins>. The project version is the first match in Maven's
+# standard layout.
+echo "🔧 Checking pom.xml..."
+POM_VERSION=$(grep -m1 -E '^\s*<version>[0-9]+\.[0-9]+\.[0-9]+.*</version>' pom.xml \
+    | sed -E 's|.*<version>(.*)</version>.*|\1|' || true)
+
+if [ -z "${POM_VERSION:-}" ]; then
+    echo " ❌ pom.xml — could not extract <version> element"
+    ERRORS=$((ERRORS + 1))
+elif [ "$POM_VERSION" != "$LATEST_VERSION" ]; then
+    echo " ❌ pom.xml — <version> is \"$POM_VERSION\", expected \"$LATEST_VERSION\""
+    ERRORS=$((ERRORS + 1))
+else
+    echo " ✅ pom.xml — $POM_VERSION"
+fi
+
+echo ""
+
+if [ "$ERRORS" -gt 0 ]; then
+    echo "❌ Found $ERRORS version misalignment(s)."
+    echo ""
+    echo "Fix: bump the stale file to match CHANGELOG v$LATEST_VERSION."
+    echo "Or, if CHANGELOG is behind a tag you already pushed, add the"
+    echo "missing '## [${POM_VERSION:-X.Y.Z}] - YYYY-MM-DD' section."
+    exit 1
+fi
+
+echo "✅ All version declarations match CHANGELOG v$LATEST_VERSION."
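
For readers more at home in Java than in grep and sed, the comparison the script performs can be sketched as follows. This is an illustrative rendering under stated assumptions, not part of the commit: `VersionAlignment` and its members are hypothetical names, the regular expressions approximate the script's patterns rather than reproduce them exactly, and the sample `pom.xml` text in `main` is made up.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical Java rendering of the alignment check; not part of this commit. */
final class VersionAlignment {
  // First "## [<digit>..." heading. "## [Unreleased]" is skipped because it
  // does not start with a digit, mirroring the grep pattern in the script.
  static final Pattern CHANGELOG_VERSION =
      Pattern.compile("^## \\[([0-9][^\\]]*)\\]", Pattern.MULTILINE);

  // First <version> element that starts a line (after optional whitespace)
  // and holds a numeric x.y.z version.
  static final Pattern POM_VERSION =
      Pattern.compile(
          "^\\s*<version>([0-9]+\\.[0-9]+\\.[0-9]+[^<]*)</version>", Pattern.MULTILINE);

  /** Returns capture group 1 of the first match, if any. */
  static Optional<String> firstMatch(Pattern pattern, String text) {
    Matcher m = pattern.matcher(text);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  /** True only when both versions can be extracted and are identical. */
  static boolean aligned(String changelog, String pom) {
    Optional<String> released = firstMatch(CHANGELOG_VERSION, changelog);
    Optional<String> declared = firstMatch(POM_VERSION, pom);
    return released.isPresent() && released.equals(declared);
  }

  public static void main(String[] args) {
    String changelog = "## [Unreleased]\n\n### Fixed\n\n## [5.7.0] - 2026-04-22\n";
    // Made-up sample manifest for illustration only.
    String pom = "<project>\n  <modelVersion>4.0.0</modelVersion>\n"
        + "  <version>5.7.0</version>\n</project>\n";
    System.out.println("aligned: " + aligned(changelog, pom));
  }
}
```

As in the script, the check relies on the project's own `<version>` being the first one that appears on a line of its own.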

.github/workflows/validate-version-alignment.yml

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+name: Version Alignment Check
+
+# Blocks merges that would leave pom.xml's <version> out of sync with
+# CHANGELOG.md's most recent released section. The invariant on main:
+# <version> in pom.xml == first `## [X.Y.Z]` section in CHANGELOG
+# When it's time to release, a single release-prep PR renames
+# `[Unreleased]` → `[X.Y.Z] - YYYY-MM-DD` AND bumps pom.xml in the
+# same commit, so this gate always sees them together.
+#
+# See .github/scripts/validate-version-alignment.sh for the script.
+
+on:
+  pull_request:
+    branches: [main]
+    paths:
+      - 'CHANGELOG.md'
+      - 'pom.xml'
+      - '.github/scripts/validate-version-alignment.sh'
+      - '.github/workflows/validate-version-alignment.yml'
+  push:
+    branches: [main]
+    paths:
+      - 'CHANGELOG.md'
+      - 'pom.xml'
+      - '.github/scripts/validate-version-alignment.sh'
+      - '.github/workflows/validate-version-alignment.yml'
+
+permissions:
+  contents: read
+
+env:
+  DO_NOT_TRACK: '1'
+
+jobs:
+  validate-versions:
+    name: Validate Version Alignment
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Check version alignment
+        run: ./.github/scripts/validate-version-alignment.sh

CHANGELOG.md

Lines changed: 11 additions & 0 deletions
@@ -5,6 +5,17 @@ All notable changes to the AxonFlow Java SDK will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [Unreleased]
+
+### Fixed
+
+- Telemetry pings now deliver reliably from short-lived JVMs (CLI, serverless, cold-starts). `AxonFlow` construction blocks briefly while the ping is sent synchronously (bounded by the telemetry timeout).
+- Telemetry path is bounded at `TIMEOUT_SECONDS` (3s) total; the `/health` probe and checkpoint POST share a single monotonic deadline instead of stacking independent timeouts.
+
+### Added
+
+- **Version alignment check** (`.github/workflows/validate-version-alignment.yml`). CI now fails any PR or push to `main` where `pom.xml`'s `<version>` drifts from the first released `## [X.Y.Z]` section in `CHANGELOG.md`. Matches the pattern in the platform repo and the Go SDK.
+
 ## [5.7.0] - 2026-04-22
 
 ### Added

src/main/java/com/getaxonflow/sdk/telemetry/TelemetryReporter.java

Lines changed: 71 additions & 34 deletions
@@ -23,7 +23,6 @@
 import java.net.URI;
 import java.net.URISyntaxException;
 import java.util.UUID;
-import java.util.concurrent.CompletableFuture;
 import java.util.concurrent.TimeUnit;
 import java.util.regex.Matcher;
 import java.util.regex.Pattern;
@@ -58,6 +57,14 @@ public class TelemetryReporter {
 
   static final String DEFAULT_ENDPOINT = "https://checkpoint.getaxonflow.com/v1/ping";
   private static final int TIMEOUT_SECONDS = 3;
+  /**
+   * Minimum remaining HTTP budget (milliseconds). Below this, skip the operation rather than issue
+   * a request that is almost guaranteed to time out before any useful work completes. Keeps the
+   * telemetry path from making "essentially zero budget" calls when the shared deadline is nearly
+   * spent.
+   */
+  private static final long MIN_BUDGET_MS = 100L;
+
   private static final MediaType JSON = MediaType.get("application/json; charset=utf-8");
 
   /**
@@ -108,36 +115,62 @@ static void sendPing(
     String endpoint =
         (checkpointUrl != null && !checkpointUrl.isEmpty()) ? checkpointUrl : DEFAULT_ENDPOINT;
 
-    final String finalSdkEndpoint = sdkEndpoint;
-    final String endpointType = classifyEndpoint(finalSdkEndpoint);
-    CompletableFuture.runAsync(
-        () -> {
-          try {
-            String platformVersion = detectPlatformVersion(finalSdkEndpoint);
-            String payload = buildPayload(mode, platformVersion, endpointType);
-
-            OkHttpClient client =
-                new OkHttpClient.Builder()
-                    .connectTimeout(TIMEOUT_SECONDS, TimeUnit.SECONDS)
-                    .readTimeout(TIMEOUT_SECONDS, TimeUnit.SECONDS)
-                    .writeTimeout(TIMEOUT_SECONDS, TimeUnit.SECONDS)
-                    .build();
-
-            RequestBody body = RequestBody.create(payload, JSON);
-            Request request = new Request.Builder().url(endpoint).post(body).build();
-
-            try (Response response = client.newCall(request).execute()) {
-              if (debug) {
-                logger.debug("Telemetry ping sent, status={}", response.code());
-              }
-            }
-          } catch (Exception e) {
-            // Silent failure - telemetry must never disrupt SDK operation
-            if (debug) {
-              logger.debug("Telemetry ping failed (silent): {}", e.getMessage());
-            }
-          }
-        });
+    String endpointType = classifyEndpoint(sdkEndpoint);
+
+    // Send synchronously with a single bounded deadline shared across the
+    // /health probe and the checkpoint POST. A CompletableFuture.runAsync
+    // here would default to ForkJoinPool.commonPool (daemon threads) which
+    // die at JVM exit, silently dropping the ping for short-lived JVMs (CLI
+    // binaries, Lambda handlers, serverless cold-starts, quickstart scripts).
+    // See axonflow-enterprise#1706.
+    //
+    // Blocks the caller briefly (~350ms warm / ~1.3s cold on a reachable
+    // checkpoint; bounded at TIMEOUT_SECONDS on an unreachable one). That is
+    // acceptable for a control-plane SDK's construction path, and matches
+    // the pattern shipped for the Go SDK in axonflow-enterprise#1693.
+    try {
+      long deadlineMs =
+          System.nanoTime() / 1_000_000L + TimeUnit.SECONDS.toMillis(TIMEOUT_SECONDS);
+
+      // Health probe gets up to 1s of the remaining budget, so the POST
+      // always has room even when the probe fully consumes its slice.
+      long healthBudgetMs =
+          Math.min(
+              TimeUnit.SECONDS.toMillis(1),
+              Math.max(0L, deadlineMs - System.nanoTime() / 1_000_000L));
+      String platformVersion =
+          (sdkEndpoint != null && !sdkEndpoint.isEmpty() && healthBudgetMs > MIN_BUDGET_MS)
+              ? detectPlatformVersion(sdkEndpoint, healthBudgetMs)
+              : null;
+
+      String payload = buildPayload(mode, platformVersion, endpointType);
+
+      long postBudgetMs = Math.max(0L, deadlineMs - System.nanoTime() / 1_000_000L);
+      if (postBudgetMs < MIN_BUDGET_MS) {
+        return;
+      }
+
+      OkHttpClient client =
+          new OkHttpClient.Builder()
+              .connectTimeout(postBudgetMs, TimeUnit.MILLISECONDS)
+              .readTimeout(postBudgetMs, TimeUnit.MILLISECONDS)
+              .writeTimeout(postBudgetMs, TimeUnit.MILLISECONDS)
+              .build();
+
+      RequestBody body = RequestBody.create(payload, JSON);
+      Request request = new Request.Builder().url(endpoint).post(body).build();
+
+      try (Response response = client.newCall(request).execute()) {
+        if (debug) {
+          logger.debug("Telemetry ping sent, status={}", response.code());
+        }
+      }
+    } catch (Exception e) {
+      // Silent failure - telemetry must never disrupt SDK operation
+      if (debug) {
+        logger.debug("Telemetry ping failed (silent): {}", e.getMessage());
+      }
+    }
   }
 
   /**
@@ -418,16 +451,20 @@ private static String padHextet(String h) {
 
   /**
    * Detect platform version by calling the agent's /health endpoint. Returns null on any failure.
+   *
+   * <p>The {@code budgetMs} parameter is derived from the shared telemetry deadline so the health
+   * probe and the checkpoint POST don't stack into a larger combined budget. See
+   * axonflow-enterprise#1706.
    */
-  static String detectPlatformVersion(String sdkEndpoint) {
+  static String detectPlatformVersion(String sdkEndpoint, long budgetMs) {
     if (sdkEndpoint == null || sdkEndpoint.isEmpty()) {
       return null;
     }
     try {
       OkHttpClient client =
           new OkHttpClient.Builder()
-              .connectTimeout(2, TimeUnit.SECONDS)
-              .readTimeout(2, TimeUnit.SECONDS)
+              .connectTimeout(budgetMs, TimeUnit.MILLISECONDS)
+              .readTimeout(budgetMs, TimeUnit.MILLISECONDS)
               .build();
 
       Request request = new Request.Builder().url(sdkEndpoint + "/health").get().build();
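
The shared-deadline arithmetic in this file can be isolated from the HTTP calls. The sketch below is an illustration under stated assumptions and not the SDK's code: `SharedDeadline`, `budgetMs`, and `tooSmall` are hypothetical names, and the clock is injected as a `LongSupplier` so the behaviour can be exercised without real waiting. The 3s total, 1s probe cap, and 100ms floor are the values used in the diff.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical helper: one monotonic deadline shared by several sequential steps. */
final class SharedDeadline {
  static final long MIN_BUDGET_MS = 100L;

  private final LongSupplier nanoClock;
  private final long deadlineNanos;

  SharedDeadline(long totalMs, LongSupplier nanoClock) {
    this.nanoClock = nanoClock;
    this.deadlineNanos = nanoClock.getAsLong() + TimeUnit.MILLISECONDS.toNanos(totalMs);
  }

  /** Milliseconds left before the deadline, never negative. */
  long remainingMs() {
    long leftNanos = deadlineNanos - nanoClock.getAsLong();
    return Math.max(0L, TimeUnit.NANOSECONDS.toMillis(leftNanos));
  }

  /** Budget for one step: whatever remains, capped at capMs. */
  long budgetMs(long capMs) {
    return Math.min(capMs, remainingMs());
  }

  /** True when too little time is left for a request to plausibly finish. */
  static boolean tooSmall(long budgetMs) {
    return budgetMs < MIN_BUDGET_MS;
  }

  public static void main(String[] args) {
    long[] fakeNowNanos = {0L}; // hand-advanced fake clock
    SharedDeadline deadline = new SharedDeadline(3_000, () -> fakeNowNanos[0]);

    long health = deadline.budgetMs(1_000); // probe slice, capped at 1s
    fakeNowNanos[0] += TimeUnit.MILLISECONDS.toNanos(health); // probe used its whole slice
    long post = deadline.remainingMs(); // POST gets what is left

    System.out.println("health=" + health + "ms post=" + post + "ms");
    // prints "health=1000ms post=2000ms"
  }
}
```

In production code the clock would be `System::nanoTime`. Note also that OkHttp's connect, read, and write timeouts each apply to their own phase of a call; if a strict whole-call bound is wanted, `OkHttpClient.Builder.callTimeout` is one option to consider.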

TelemetryReporterShortLivedTest.java

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+/*
+ * Copyright 2026 AxonFlow
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package com.getaxonflow.sdk.telemetry;
+
+import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
+import static com.github.tomakehurst.wiremock.client.WireMock.post;
+import static com.github.tomakehurst.wiremock.client.WireMock.postRequestedFor;
+import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;
+import static org.assertj.core.api.Assertions.assertThat;
+
+import com.github.tomakehurst.wiremock.junit5.WireMockRuntimeInfo;
+import com.github.tomakehurst.wiremock.junit5.WireMockTest;
+import org.junit.jupiter.api.DisplayName;
+import org.junit.jupiter.api.Test;
+
+/**
+ * Regression test for axonflow-enterprise#1706: the telemetry ping must be delivered synchronously
+ * before {@code sendPing} returns, so that short-lived JVMs (CLI binaries, AWS Lambda handlers,
+ * serverless cold-starts, quickstart scripts) don't drop the ping on JVM exit.
+ *
+ * <p>Root cause of the original bug: {@code CompletableFuture.runAsync(lambda)} submits to {@code
+ * ForkJoinPool.commonPool()}, whose threads are daemon by default since Java 8. When the main
+ * thread exits, the daemon pool is killed mid-flight and the in-flight HTTP POST is abandoned —
+ * silently, with no error visible to the caller.
+ *
+ * <p>The key invariant: once {@code sendPing} returns, the HTTP round-trip must have already
+ * completed (or timed out cleanly). We verify this by pointing the reporter at a WireMock server
+ * that responds with a fixed delay; if {@code sendPing} is synchronous, it blocks until the
+ * response, and elapsed time reflects the delay. If anyone regresses the code back to {@code
+ * runAsync}, the call returns immediately and this test fails.
+ */
+@WireMockTest
+@DisplayName("TelemetryReporter — short-lived-process regression")
+class TelemetryReporterShortLivedTest {
+
+  @Test
+  @DisplayName("sendPing blocks until the HTTP round-trip completes (no fire-and-forget drop)")
+  void sendPingBlocksUntilRoundTripCompletes(WireMockRuntimeInfo info) {
+    // Mock checkpoint returns 200 only after a 200ms delay. If sendPing is
+    // synchronous, the caller blocks for at least that long. If sendPing
+    // regresses to fire-and-forget (daemon-thread), the caller returns
+    // almost immediately (<50ms) and the assertion below fails.
+    info.getWireMock()
+        .register(
+            post(urlEqualTo("/v1/ping"))
+                .willReturn(aResponse().withStatus(200).withFixedDelay(200).withBody("{}")));
+
+    String checkpointUrl = info.getHttpBaseUrl() + "/v1/ping";
+
+    long startNs = System.nanoTime();
+    TelemetryReporter.sendPing(
+        "production",
+        "", // empty SDK endpoint: skip /health probe so we measure only the POST
+        Boolean.TRUE,
+        false,
+        false,
+        null, // DO_NOT_TRACK
+        null, // AXONFLOW_TELEMETRY
+        checkpointUrl);
+    long elapsedMs = (System.nanoTime() - startNs) / 1_000_000L;
+
+    // The fixed-delay mock forces a ~200ms round-trip. A synchronous implementation
+    // must have waited for it; a fire-and-forget implementation would have returned
+    // in well under 50ms. 150ms is the floor: generous slack for JVM scheduling /
+    // network setup on slow CI machines without losing the regression signal.
+    assertThat(elapsedMs)
+        .as(
+            "sendPing should have blocked long enough for the 200ms fixed-delay response "
+                + "to complete. An elapsed time under ~150ms strongly suggests a regression "
+                + "to the CompletableFuture.runAsync fire-and-forget pattern, which would "
+                + "drop the ping on JVM exit in short-lived processes (see #1706).")
+        .isGreaterThanOrEqualTo(150L);
+
+    // And: the ping actually landed on the mock.
+    info.getWireMock().verifyThat(postRequestedFor(urlEqualTo("/v1/ping")));
+  }
+}
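
The timing signal this test relies on can be reproduced without WireMock or the SDK. The following is a minimal JDK-only sketch, not the project's test: `BlockingCheck`, `slowRoundTrip`, and `elapsedMs` are hypothetical names, and the slow HTTP round-trip is replaced by a 200ms sleep purely as a stand-in.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical demo: how long a caller is blocked by a synchronous vs. fire-and-forget call. */
final class BlockingCheck {

  /** Stand-in for a slow HTTP round-trip; sleeps instead of doing I/O. */
  static void slowRoundTrip(long delayMs) {
    try {
      Thread.sleep(delayMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Returns how long the calling thread was blocked by the call, in milliseconds. */
  static long elapsedMs(Runnable call) {
    long startNs = System.nanoTime();
    call.run();
    return (System.nanoTime() - startNs) / 1_000_000L;
  }

  public static void main(String[] args) {
    // Synchronous: the caller waits for the full delay.
    long sync = elapsedMs(() -> slowRoundTrip(200));
    // Fire-and-forget: the caller only pays for scheduling the task.
    long async = elapsedMs(() -> CompletableFuture.runAsync(() -> slowRoundTrip(200)));

    System.out.println("synchronous caller blocked for " + sync + "ms");
    System.out.println("fire-and-forget caller blocked for " + async + "ms");
  }
}
```

A floor of 150ms against a 200ms delay, as used in the test above, separates the two cases while leaving slack for timer granularity and scheduling jitter.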
