fix(telemetry): synchronous send + version-alignment CI #136

Merged
saurabhjain1592 merged 1 commit into main from fix/telemetry-sync-bounded
Apr 24, 2026

@saurabhjain1592


Closes getaxonflow/axonflow-enterprise#1706

Summary

Java SDK had the same telemetry-delivery bug as Python (#1692) and Go (#1693): `CompletableFuture.runAsync(lambda)` without an explicit executor submits to `ForkJoinPool.commonPool()`, whose worker threads are daemon threads. Once the last non-daemon thread (typically `main`) exits, the JVM shuts down without waiting for them, so the in-flight HTTP POST is abandoned. CLI binaries, AWS Lambda handlers, serverless cold-starts, and quickstart scripts silently drop their telemetry.
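
For illustration, a minimal sketch (not SDK code) that observes the daemon flag on a common-pool worker. It submits to `ForkJoinPool.commonPool()` directly, because `runAsync` only delegates to the common pool when the pool supports a parallelism level of at least two. The class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical probe, not part of the SDK.
final class CommonPoolDaemonProbe {

    // Runs a task on the common pool and reports whether the thread that
    // executed it is a daemon thread.
    static boolean commonPoolWorkerIsDaemon() {
        AtomicBoolean daemon = new AtomicBoolean();
        CountDownLatch ran = new CountDownLatch(1);
        ForkJoinPool.commonPool().execute(() -> {
            daemon.set(Thread.currentThread().isDaemon());
            ran.countDown();
        });
        try {
            // Wait on a latch so the task is executed by a pool worker,
            // never by the calling thread.
            if (!ran.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("task did not run within 5s");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return daemon.get();
    }

    public static void main(String[] args) {
        System.out.println("common-pool worker is daemon: " + commonPoolWorkerIsDaemon());
    }
}
```

A daemon worker does not keep the JVM alive, which is why a ping started this way can be lost when a short-lived process finishes `main`.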

Missed yesterday because I tested Python/Go/TS but not Java; surfaced today when reviewing why TS/Java showed near-zero external pings.

Fix

`TelemetryReporter.sendPing()`:

  • Replaced `CompletableFuture.runAsync(...)` with synchronous execution. `AxonFlow` construction blocks briefly (~350ms warm, ~1.3s cold; bounded at `TIMEOUT_SECONDS`) while the ping is sent.
  • `detectPlatformVersion` now takes a `budgetMs` parameter. The `/health` probe and checkpoint POST share a single monotonic deadline. Previously they had independent timeouts (2s and 3s respectively) that could stack to ~5s worst case, which defeated the "bounded at TIMEOUT_SECONDS" invariant.

Matches the shape shipped for Go SDK in #128 (axonflow-enterprise#1693).
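
A rough sketch of the shared-deadline idea, under stated assumptions: this is not the code in `TelemetryReporter.java`. `SharedDeadline`, `runBounded`, and the injectable clock are hypothetical names introduced for illustration. Only the 100ms `MIN_BUDGET_MS` threshold is taken from the commit message.

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical helper: one monotonic deadline shared by several steps.
final class SharedDeadline {
    // Threshold quoted in the commit message; below it a step is skipped.
    static final long MIN_BUDGET_MS = 100;

    private final LongSupplier nanoClock;
    private final long deadlineNanos;

    // The clock is injectable so the behaviour can be tested without sleeping.
    SharedDeadline(long totalBudgetMs, LongSupplier nanoClock) {
        this.nanoClock = nanoClock;
        this.deadlineNanos = nanoClock.getAsLong() + totalBudgetMs * 1_000_000L;
    }

    static SharedDeadline startingNow(long totalBudgetMs) {
        return new SharedDeadline(totalBudgetMs, System::nanoTime);
    }

    // Milliseconds left before the deadline, never negative.
    long remainingMs() {
        return Math.max(0L, (deadlineNanos - nanoClock.getAsLong()) / 1_000_000L);
    }

    boolean hasBudget() {
        return remainingMs() >= MIN_BUDGET_MS;
    }

    // Runs the probe, then the post, each with whatever budget is left.
    // Returns how many of the two steps were actually attempted.
    static int runBounded(SharedDeadline deadline, LongConsumer probe, LongConsumer post) {
        int attempted = 0;
        if (deadline.hasBudget()) {
            probe.accept(deadline.remainingMs());
            attempted++;
        }
        if (deadline.hasBudget()) {
            post.accept(deadline.remainingMs());
            attempted++;
        }
        return attempted;
    }
}
```

Because both steps read the same deadline, a slow probe shrinks the budget handed to the POST instead of adding a second full timeout on top.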

Regression test

`TelemetryReporterShortLivedTest.java` verifies the core invariant: when `sendPing` returns, the HTTP round-trip has already completed — not still pending on a dying daemon thread. Uses a WireMock server with a 200ms fixed-delay response; asserts elapsed time >= 150ms (sync must have blocked). Under revert to `runAsync`: FAIL at 0.070s (returns immediately). With fix: PASS at 0.971s (blocks for the delay).
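
The timing invariant can be sketched without WireMock by letting a 200ms sleep stand in for the delayed response. This is an illustration only, not the contents of `TelemetryReporterShortLivedTest.java`. `BlockingSendCheck` and its methods are hypothetical names, and the 200ms and 150ms values mirror the ones quoted above.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the regression test's timing check.
final class BlockingSendCheck {
    static final long SERVER_DELAY_MS = 200; // simulated response delay
    static final long MIN_BLOCK_MS = 150;    // lower bound a blocking send must reach

    // Simulates an HTTP round-trip that takes SERVER_DELAY_MS to complete.
    static void slowRoundTrip() {
        try {
            Thread.sleep(SERVER_DELAY_MS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Shape of the fix: the caller waits for the round-trip.
    static void sendSync() {
        slowRoundTrip();
    }

    // Shape of the bug: fire-and-forget, returns before the round-trip ends.
    static void sendAsync() {
        CompletableFuture.runAsync(BlockingSendCheck::slowRoundTrip);
    }

    static long elapsedMs(Runnable send) {
        long start = System.nanoTime();
        send.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }

    // True when the send blocked long enough to have covered the delay.
    static boolean blocked(Runnable send) {
        return elapsedMs(send) >= MIN_BLOCK_MS;
    }
}
```

The threshold sits below the simulated delay so that timer jitter does not cause false failures, while a fire-and-forget send still lands well under it.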

Also included: version alignment CI

New `.github/scripts/validate-version-alignment.sh` + `.github/workflows/validate-version-alignment.yml`.

Mirrors the pattern in the platform repo and the Go SDK's PR #130. Compares the `pom.xml` `<version>` against the first released `## [X.Y.Z]` section in `CHANGELOG.md`; CI runs on any PR or push to main that touches either file. Prevents both drift patterns: pom behind CHANGELOG (manifest didn't get bumped) and CHANGELOG behind pom (tag shipped without CHANGELOG entry).

Verified locally:

  • PASS on current state (pom 5.7.0 == CHANGELOG 5.7.0)
  • FAIL when pom is manually mismatched to 5.8.0 with actionable error message
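
The comparison itself is small. Below is a sketch of the same check in Java. It is not a port of `validate-version-alignment.sh`, and `VersionAlignment` and its method names are hypothetical. It assumes the project version is a direct child of `<project>` and that released sections use a `## [X.Y.Z]` heading.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration of the alignment check.
final class VersionAlignment {
    // Matches the first "## [X.Y.Z]" heading; "## [Unreleased]" does not match.
    private static final Pattern RELEASED =
        Pattern.compile("^## \\[(\\d+\\.\\d+\\.\\d+)\\]", Pattern.MULTILINE);

    // Reads the <version> that is a direct child of <project>, so a
    // <parent><version> or a dependency version is not picked up by mistake.
    static Optional<String> pomVersion(String pomXml) {
        try {
            Node project = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(pomXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            NodeList children = project.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && "version".equals(child.getNodeName())) {
                    return Optional.of(child.getTextContent().trim());
                }
            }
            return Optional.empty();
        } catch (Exception e) {
            throw new IllegalArgumentException("unparseable pom.xml", e);
        }
    }

    static Optional<String> latestReleased(String changelog) {
        Matcher m = RELEASED.matcher(changelog);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Aligned only when both versions are present and equal.
    static boolean aligned(String pomXml, String changelog) {
        Optional<String> pom = pomVersion(pomXml);
        return pom.isPresent() && pom.equals(latestReleased(changelog));
    }
}
```

Treating a missing version on either side as a failure keeps the check from passing silently when one of the files changes shape.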

Test plan

  • Full test suite: `mvn test` — 1,200 tests, 0 failures
  • New regression test fails under revert, passes with fix
  • Version alignment script works in both directions
  • JSON-valid changes; no lint issues

Post-merge

CHANGELOG has `[Unreleased]` entries. Bundle into the next Java SDK release (v5.7.1 patch or later minor) per the "commit per published version" rule.

Java SDK had the same telemetry-delivery bug as Python (#1692) and Go
(#1693): CompletableFuture.runAsync(lambda) without an explicit executor
submits to ForkJoinPool.commonPool(), whose worker threads are daemon
threads. Once the last non-daemon thread (typically main) exits, the
JVM shuts down without waiting for them, so the
OkHttpClient.newCall().execute() inside the lambda is abandoned. CLI
Java binaries, AWS Lambda Java
handlers, serverless cold-starts, and quickstart scripts silently drop
telemetry pings with no error visible to the caller. Per the 2026-04-24
SDK-telemetry investigation, this is a likely major contributor to the
0 external-confirmed Java records we've observed to date.

## Fix (TelemetryReporter.java)

- Replaced CompletableFuture.runAsync(...) with synchronous execution.
  Blocks the caller briefly: ~350ms warm / ~1.3s cold on a reachable
  checkpoint, bounded at TIMEOUT_SECONDS on an unreachable one.
  Acceptable for a control-plane SDK's construction path, matching the
  Go SDK pattern shipped in #1693.
- Shared monotonic deadline across /health probe and checkpoint POST.
  Previously detectPlatformVersion (2s timeout) and the checkpoint POST
  (3s timeout) had independent timeouts that could stack to ~5s.
  detectPlatformVersion now takes a budgetMs parameter derived from the
  shared deadline; POST uses whatever is left.
- Both operations skip when remaining budget is below MIN_BUDGET_MS
  (100ms) to avoid issuing calls that are guaranteed to time out.

## Regression test (TelemetryReporterShortLivedTest.java)

Verifies the core invariant: when sendPing returns, the HTTP round-trip
has already completed (not still-pending on a dying daemon thread). Uses
a WireMock server with a fixed-delay response; measures elapsed time to
confirm sendPing actually blocked. Verified:

- FAIL at 0.070s when reverted to CompletableFuture.runAsync
- PASS at 0.971s with the fix

So a future regression to fire-and-forget is caught by CI, not by
missing telemetry in production.

## CHANGELOG

Added [Unreleased] section with terse one-line entries per the
telemetry-CHANGELOG-minimal rule: delivery fix + shared-deadline bound
+ the alignment-check addition below.

## Version alignment CI (.github/scripts + .github/workflows)

Mirrors the pattern just added in axonflow-sdk-go PR #130 and already
present in the platform repo. Script compares pom.xml <version> against
the first released '## [X.Y.Z]' section in CHANGELOG.md; CI runs on any
PR or push to main that touches either file. Prevents the drift pattern
where a release ships to Maven Central but the repo's pom stays behind
(and the inverse: pom bumped but CHANGELOG still shows the prior
version as latest released).

Verified the script locally: PASS on current state (pom 5.7.0 ==
CHANGELOG 5.7.0); FAIL when pom is manually mismatched to 5.8.0.

Full test suite: 1,200 tests, 0 failures.

Closes getaxonflow/axonflow-enterprise#1706.
@saurabhjain1592 saurabhjain1592 merged commit 37bcc39 into main Apr 24, 2026
10 checks passed
@saurabhjain1592 saurabhjain1592 deleted the fix/telemetry-sync-bounded branch April 24, 2026 22:50