Skip to content

Commit 6727439

Browse files
Merge pull request #6 from MarketDataApp/06_retry_policies
Retry policy (§9 partial) + `executeSync` cancellation handling
2 parents 0050824 + 98ac66e commit 6727439

7 files changed

Lines changed: 925 additions & 21 deletions

File tree

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
77

88
## [Unreleased]
99

10+
### Fixed
11+
- Default retry attempts corrected from 3 to 4 (one initial + three retries) to
12+
match SDK requirements §9.3 ("max 3 retries, yielding 4 total attempts").
13+
1014
### Added
1115
- Project scaffold per ADRs 001–007: Gradle Kotlin DSL build, JDK 17 toolchain,
1216
`integrationTest` source set, Spotless + JaCoCo, Vanniktech Maven Publish.

CLAUDE.md

Lines changed: 7 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -65,15 +65,16 @@ The Java SDK must also satisfy the canonical, cross-language [SDK Requirements](
6565
- §4 configuration cascade — `Configuration.resolve(...)` does explicit → `MARKETDATA_*` env var → `.env` in CWD → default. Env var names live in `EnvVars` (package-private, in the SDK root package). The 4-arg constructor's parameters feed step 1; the no-arg constructor skips it and starts at step 2.
6666
- §5 demo mode + `validateOnStartup` parameter on the 4-arg constructor (defaults to `true` via the no-arg constructor); token redaction via `Tokens.redact` (matches the spec example `***…***YKT0`).
6767
- §6 sealed `MarketDataException` hierarchy with the 7 canonical subtypes and full support context (`requestId`, `requestUrl`, `statusCode`, `timestamp`, `exceptionType`) + `getSupportInfo()`.
68-
- §10 timeouts: `REQUEST_TIMEOUT = 99s` and `CONNECT_TIMEOUT = 2s` exposed as constants on `MarketDataClient`. Connect timeout is wired into the `HttpClient`; the per-request 99 s timeout is a constant ready to be applied to `HttpRequest.Builder#timeout` when the request layer lands.
69-
- §12 concurrency: `Semaphore(50)` field on `MarketDataClient` (wiring of acquire/release lands with the request layer).
68+
- §10 timeouts: `REQUEST_TIMEOUT = 99s` and `CONNECT_TIMEOUT = 2s` exposed as constants on `MarketDataClient`. Connect timeout is wired into the `HttpClient`; the per-request 99 s timeout is applied via `HttpRequest.Builder#timeout` in `HttpTransport.buildRequest`.
69+
- §12 concurrency: 50-permit `AsyncSemaphore` on `HttpTransport` with acquire/release wired around every dispatch. The custom semaphore replaces `java.util.concurrent.Semaphore` so `executeAsync` never parks the caller's thread on a full pool (ADR-007).
70+
- §9 retry/backoff: `RetryPolicy` (4 total attempts = 1 initial + 3 retries, exponential 1s→30s per §9.3) wired into `HttpTransport.executeAsync` via a per-attempt loop using `CompletableFuture.delayedExecutor` (no scheduled threads). Network errors and HTTP 501–599 retry; 500 and 4xx do not.
7071
- §15 packaging: SemVer, MIT `LICENSE`, `CHANGELOG.md` in Keep a Changelog format, version auto-detected via JAR manifest (`Implementation-Version`).
7172
- §16 security: tokens never logged verbatim (use `Tokens.redact`); TLS validated by default (`HttpClient` does not expose a skip-verify option).
7273
- ADR-002 CI: split into four workflows.
7374
- `.github/workflows/pull-request.yml` — runs on PR `opened`/`synchronize`/`reopened` (no pre-PR push trigger by design). JDK 17 only. Runs `./gradlew build` (unit tests + Spotless + JaCoCo) and uploads coverage to Codecov. **Does not** run integration tests — those are handled by the on-demand workflow below.
7475
- `.github/workflows/main.yml` — runs only on `push` to `main`. Two jobs: `verify` does the full forward-compat matrix `{17, 21, 25}` for unit tests via `-PtestJdk=N`; `integration-tests` does a parallel matrix `{17, 21, 25}` against the live API. Both are mandatory for the merge to be considered successful. The JDK 17 matrix entry of `verify` also uploads coverage to Codecov as the new baseline that PRs compare against. `integration-tests` fails the build if `MARKETDATA_TOKEN` secret is absent (it is required on main).
7576
- `.github/workflows/pr-matrix-on-demand.yml` — manually triggered on a PR by commenting `/run-all-jdks`, `/jdk-matrix`, or `/test-all`. Runs the **unit-test** matrix on JDK 21 and 25 (17 already ran via `pull-request.yml`). Gated to write/maintain/admin commenters. Reacts 👀 to the trigger comment and posts a result summary.
76-
- `.github/workflows/pr-integration-on-demand.yml` — manually triggered on a PR by commenting `integrationtest` (JDK 17 only) or `integrationtestfull` (matrix `{17, 21, 25}`). Runs the **integration-test** suite against the live API. Same write+ permission gate as the matrix-on-demand workflow. Aggregates the matrix outcome into a single required check named **"Integration tests pass"** so branch protection can require it uniformly regardless of which command was used. Branch-protection rules on `main` should list this check as required for merge.
77+
- `.github/workflows/pr-integration-on-demand.yml` — manually triggered on a PR by commenting `/integrationtest` (JDK 17 only) or `/integrationtestfull` (matrix `{17, 21, 25}`) on the **first line** of the comment body. Runs the **integration-test** suite against the live API. Same write+ permission gate as the matrix-on-demand workflow. The first-line + exact-match constraint prevents accidental triggers from quoted replies (`> /integrationtest`) or prose that mentions the command. Aggregates the matrix outcome into a single required check named **"Integration tests pass"** so branch protection can require it uniformly regardless of which command was used. Branch-protection rules on `main` should list this check as required for merge.
7778
- All four `issue_comment`-driven workflows execute from the default branch's copy of their YAML, not the PR's. Feature-branch edits to these workflows take effect only after merge to main.
7879
- `-PtestJdk=N` is wired to **all** `Test` tasks (`test` and `integrationTest`) via `tasks.withType<Test>().configureEach { javaLauncher.set(...) }` in `build.gradle.kts`, so the matrix flag works uniformly across unit and integration tests.
7980
- Coverage ratchet lives in `codecov.yml`: project status with `target: auto, threshold: 5%` (cannot drop >5 pp vs base branch) plus a patch-coverage requirement of 70 % on new code. Requires a `CODECOV_TOKEN` repo secret — without it the upload step fails because workflows pass `fail_ci_if_error: true`.
@@ -84,19 +85,14 @@ The Java SDK must also satisfy the canonical, cross-language [SDK Requirements](
8485
- §5 actual `/user/` startup validation call (the `validateOnStartup` flag is the seam; the call itself comes with the request layer).
8586
- §7 honoring `MARKETDATA_LOGGING_LEVEL` and the spec's exact `{timestamp} - {logger_name} - {level} - {message}` format. Currently the SDK uses `java.util.logging` with default formatting; consumers can attach their own handler.
8687
- §8 rate-limit header parsing, pre-flight check, request-scoped attachment.
87-
- §9 retry/backoff policy and `/status/` cache workflow.
88-
- §12 acquire/release of the concurrency semaphore around dispatched requests.
88+
- §9 `/status/` cache workflow and `Retry-After` header override (retry/backoff itself lives in `RetryPolicy` and is wired; what is missing is the `/status/` pre-check before retrying 501–599 and respecting the server-specified `Retry-After` over the calculated exponential backoff).
8989
- §13 100% coverage threshold via JaCoCo `violationRules`; deferred until there is functional code worth the threshold.
9090

9191
When picking up new work, check this list before reaching for the SDK requirements doc — most foundational rules are already encoded in code; missing pieces are deferred deliberately, not by accident.
9292

93-
**Known latent gaps to revisit when retry/timeout lands:**
94-
- `HttpTransport.executeSync` only catches `CompletionException` from `.join()`, not `CancellationException`. Today the latter is unreachable — the user can't cancel a future they never see (the future is local to `executeSync`), no internal code cancels it, and `dispatch`'s `handle((response, error) -> ...)` translates every upstream error (including a hypothetical `CancellationException` from `sendAsync`) into `CompletionException(NetworkError)`. The gap becomes real once we add:
95-
- `dispatched.orTimeout(99s)` / `completeOnTimeout` to enforce the §10 timeout strictly (these produce `CancellationException` on the downstream future).
96-
- A retry coordinator (§9) that cancels in-flight futures when aborting a retry chain.
97-
- A bump to JDK 21+ where `HttpClient.close()` cancels in-flight futures.
98-
When any of those land, extend the catch in `executeSync` (or fold it into `asRuntime`) so cancellations don't escape as raw `RuntimeException` to sync callers. Tracked as Issue #2 of the 2026-05-11 review (`REVIEW-2026-05-11-markets-status.md`).
93+
**Known latent gaps:**
9994
- `HttpTransport.buildUri` URL-encodes query-param values with `URLEncoder.encode(..., UTF_8)`, which is form-encoding semantics: spaces become `+`, not `%20`. Fine for today's typed params (dates, numerics) but a future endpoint that takes an arbitrary string (e.g. `symbol="BRK A"`) would round-trip differently against an RFC-3986-strict server. Switch to a path/query-segment-aware encoder when the first such param lands. Tracked as Issue #10 of the 2026-05-11 review.
95+
- `Retry-After` server header is parsed and respected by neither `RetryPolicy` nor `HttpTransport`. Today every retry uses the calculated exponential backoff (`min(1s × 2^N, 30s)`). Implementing the override needs the response headers to reach `RetryPolicy.backoffDelay`, which today only sees the attempt index — most natural path is to surface a `Duration` on `ServerError` (or thread it through a separate channel) when 5xx responses carry the header. Follow-up of the §9 work.
10096

10197
## Acceptance checklist
10298

src/main/java/com/marketdata/sdk/HttpTransport.java

Lines changed: 94 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@
2020
import java.util.concurrent.CancellationException;
2121
import java.util.concurrent.CompletableFuture;
2222
import java.util.concurrent.CompletionException;
23+
import java.util.concurrent.TimeUnit;
2324
import java.util.concurrent.atomic.AtomicReference;
2425
import org.jspecify.annotations.Nullable;
2526

@@ -55,6 +56,7 @@ final class HttpTransport implements AutoCloseable {
5556
private final HttpClient httpClient;
5657
private final ObjectMapper jsonMapper;
5758
private final AsyncSemaphore concurrencyPermits;
59+
private final RetryPolicy retryPolicy;
5860
private final AtomicReference<@Nullable RateLimits> latestRateLimits = new AtomicReference<>();
5961

6062
private final String baseUrl;
@@ -63,7 +65,7 @@ final class HttpTransport implements AutoCloseable {
6365
private final @Nullable String token;
6466

6567
HttpTransport(String baseUrl, String apiVersion, String userAgent, @Nullable String token) {
66-
this(baseUrl, apiVersion, userAgent, token, defaultHttpClient());
68+
this(baseUrl, apiVersion, userAgent, token, defaultHttpClient(), RetryPolicy.defaults());
6769
}
6870

6971
// Package-private constructor used by tests to inject a stubbed HttpClient
@@ -74,13 +76,25 @@ final class HttpTransport implements AutoCloseable {
7476
String userAgent,
7577
@Nullable String token,
7678
HttpClient httpClient) {
79+
this(baseUrl, apiVersion, userAgent, token, httpClient, RetryPolicy.defaults());
80+
}
81+
82+
// Package-private constructor used by retry tests to drive sub-millisecond backoffs.
83+
HttpTransport(
84+
String baseUrl,
85+
String apiVersion,
86+
String userAgent,
87+
@Nullable String token,
88+
HttpClient httpClient,
89+
RetryPolicy retryPolicy) {
7790
this.baseUrl = baseUrl;
7891
this.apiVersion = apiVersion;
7992
this.userAgent = userAgent;
8093
this.token = token;
8194
this.concurrencyPermits = new AsyncSemaphore(CONCURRENCY_LIMIT);
8295
this.jsonMapper = buildJsonMapper();
8396
this.httpClient = httpClient;
97+
this.retryPolicy = retryPolicy;
8498
}
8599

86100
private static HttpClient defaultHttpClient() {
@@ -102,14 +116,80 @@ private static HttpClient defaultHttpClient() {
102116
}
103117

104118
/**
105-
* Async-first request execution.
106-
*
107-
* <p>Acquires a concurrency permit, fires the request, parses rate-limit headers, decodes the
108-
* body when the status is 200/203/404 (the API returns 404 with {@code {"s":"no_data"}} as a
109-
* sentinel — see SDK requirements §9.1), and translates other status codes to the appropriate
110-
* {@link MarketDataException} subtype.
119+
* Async-first request execution with retry. Orchestrates one or more attempts according to {@link
120+
* RetryPolicy}: retries 501–599 and IOException-shaped {@link NetworkError}s with exponential
121+
* backoff, surfaces every other failure immediately. Cancellation of the returned future bails
122+
* out of any pending backoff and propagates to the current in-flight attempt.
111123
*/
112124
<T> CompletableFuture<T> executeAsync(RequestSpec spec, Class<T> responseType) {
125+
CompletableFuture<T> result = new CompletableFuture<>();
126+
// One cascade-cancel handler installed once: whichever attempt is currently in flight is
127+
// tracked in `currentDispatched`; cancelling `result` cancels that. Previous attempts in
128+
// the chain are already done by the time the next one updates the reference, so this
129+
// avoids accumulating a handler per attempt.
130+
AtomicReference<@Nullable CompletableFuture<T>> currentDispatched = new AtomicReference<>();
131+
result.whenComplete(
132+
(r, t) -> {
133+
if (t instanceof CancellationException) {
134+
CompletableFuture<T> inFlight = currentDispatched.get();
135+
if (inFlight != null && !inFlight.isDone()) {
136+
inFlight.cancel(false);
137+
}
138+
}
139+
});
140+
attempt(spec, responseType, 0, result, currentDispatched);
141+
return result;
142+
}
143+
144+
private <T> void attempt(
145+
RequestSpec spec,
146+
Class<T> responseType,
147+
int attemptIdx,
148+
CompletableFuture<T> result,
149+
AtomicReference<@Nullable CompletableFuture<T>> currentDispatched) {
150+
if (result.isDone()) {
151+
// Caller cancelled (or completed exceptionally from a previous attempt's whenComplete).
152+
// Don't burn another HTTP request.
153+
return;
154+
}
155+
CompletableFuture<T> dispatched = executeOnce(spec, responseType);
156+
currentDispatched.set(dispatched);
157+
158+
// If the caller cancelled `result` between attempts (during a backoff window), the handler
159+
// installed in executeAsync has fired but `currentDispatched` was either null or pointing
160+
// to the previous (already-done) attempt — so the new one was never cancelled. Check here
161+
// and propagate immediately.
162+
if (result.isCancelled() && !dispatched.isDone()) {
163+
dispatched.cancel(false);
164+
return;
165+
}
166+
167+
dispatched.whenComplete(
168+
(value, error) -> {
169+
if (result.isDone()) {
170+
return;
171+
}
172+
if (error == null) {
173+
result.complete(value);
174+
return;
175+
}
176+
Throwable cause = unwrap(error);
177+
if (retryPolicy.shouldRetry(cause, attemptIdx)) {
178+
long delayMs = retryPolicy.backoffDelay(attemptIdx).toMillis();
179+
CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS)
180+
.execute(
181+
() -> attempt(spec, responseType, attemptIdx + 1, result, currentDispatched));
182+
} else {
183+
result.completeExceptionally(cause);
184+
}
185+
});
186+
}
187+
188+
/**
189+
* Single-shot dispatch — one HTTP request, one permit lease, one response decode. Public retry
190+
* orchestration lives in {@link #executeAsync}.
191+
*/
192+
private <T> CompletableFuture<T> executeOnce(RequestSpec spec, Class<T> responseType) {
113193
URI uri = buildUri(spec);
114194
HttpRequest request = buildRequest(uri);
115195

@@ -186,12 +266,19 @@ private <T> CompletableFuture<T> dispatch(URI uri, HttpRequest request, Class<T>
186266
/**
187267
* Sync wrapper around {@link #executeAsync}. Per ADR-006, calls {@code .join()} and unwraps
188268
* {@link CompletionException} so callers see the underlying {@link MarketDataException} directly.
269+
*
270+
* <p>{@link CancellationException} can in principle escape {@code .join()} as a sibling of {@link
271+
* CompletionException} (not nested), so it's caught explicitly. Today no internal code cancels
272+
* the future {@code executeSync} owns, but covering it keeps the contract honest if a future
273+
* change (timeout watchdog, retry coordinator) starts cancelling internally.
189274
*/
190275
<T> T executeSync(RequestSpec spec, Class<T> responseType) {
191276
try {
192277
return executeAsync(spec, responseType).join();
193278
} catch (CompletionException e) {
194279
throw asRuntime(e.getCause());
280+
} catch (CancellationException e) {
281+
throw asRuntime(e);
195282
}
196283
}
197284

0 commit comments

Comments
 (0)