Skip to content

feat(relayenv): re-anchor mechanism#720

Draft
aaron-zeisler wants to merge 7 commits into
feat/concurrent-keysfrom
aaronz/SDK-2542/re-anchor-mechanism
Draft

feat(relayenv): re-anchor mechanism#720
aaron-zeisler wants to merge 7 commits into
feat/concurrent-keysfrom
aaronz/SDK-2542/re-anchor-mechanism

Conversation

@aaron-zeisler

@aaron-zeisler aaron-zeisler commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements T2.c — the re-anchor mechanism (SDK-2542) for concurrent SDK keys. When sdkKey.value changes, relay now swaps its single upstream SDK client to the new anchor synchronously, preserving downstream connections and rolling back cleanly on failure.

Jira: SDK-2542 · Epic: SDK-2453 · Milestone M3.

Targets feat/concurrent-keys (not v8). Builds on the T0 PoC findings and the store-handover spike.

Background

The design (.agent-docs/concurrent-keys/phase1-design.md §7) and the T0 PoC established that re-anchoring is not a transparent side-effect of today's code: flipping the anchor pointer before the new client initializes leaves GetClient() nil mid-swap (PoC H6) and strands the env on init failure (H7); the in-memory store is rebuilt empty when the new client constructs (H1/H5). This PR closes those gaps.

Changes (one commit per logical step)

  1. test(store) — bring over the three store-handover real-client spike tests (one rewritten as TestRealClient_HandoverPreservesUnderlyingStore).
  2. feat(store)SSERelayDataStoreAdapter.Build reuses its existing wrapper on a second client construction (store handover → no empty-store window). The wrapper is refcounted: only the final Close tears down the underlying store, so the retiring client's Close can't pull the store out from under the new anchor (design §7 lifecycle caveat).
  3. feat(credential)Reconcile no longer flips primarySdkKey; it returns a ReconcileResult (AnchorChange + MobilePrimaryRepoint), and a new CommitAnchor moves the pointer. The new anchor is stripped from additions in Case A so addCredential's async startSDKClient can't race the synchronous build.
  4. feat(relayenv) — synchronous re-anchor with Case A (build new client + store handover + wait Initialized → commit → ReplaceCredential → big-segment stub; rollback on init failure) and Case B (reuse the existing client → commit → ReplaceCredential). Order: add → re-anchor → remove, serialized by a dedicated reconcileMu.
  5. test(relayenv) — integration tests: Case A success, Case A init-failure rollback, Case B reuse (no new client), mobile-primary repoint.
  6. docs — document Case B in design §7.
  7. style — gofmt + godox lint.

Acceptance criteria coverage

Criterion Where
Case A success (build, init, flip, store handover, old client retained) TestReanchorSync_CaseA_BuildsNewClientAndMovesAnchor
Case A init failure → rollback (anchor stays, old client serves, structured error) TestReanchorSync_CaseA_InitFailureRollsBack
Case B reuse (no new Build, anchor flips) TestReanchorSync_CaseB_ReusesExistingClient
Store handover (Build reuse; Close doesn't tear down shared store) TestRealClient_HandoverPreservesUnderlyingStore, TestReanchorPoC_H5_StoreSurvivesReAnchor, H1 sub-test
Mobile primary switch repoints event dispatcher; non-primary does not TestReanchorSync_MobilePrimaryRepoint_AlreadyAcceptedKey + PR #712's gate
No orphan clients (Case A doesn't trip #716 guard; Case B spawns nothing) Case A / Case B tests

PoC tests for H1 and H5 were inverted from "asserts broken behavior" to "asserts the fix" and retained as regressions.

Out of scope

  • T2.d (SDK-2543) — big-segment sync re-wire. A clean call site (reanchorBigSegmentSync) is defined as a no-op (matches today's behavior, no regression); T2.d fills it in.
  • T2.e — handler fan-out.

Coordination note for PR #719

#719 renamed a test to assert pre-M3 behavior on the legacy UpdateCredential path. T2.c changes only the ReconcileCredentials path, so that legacy test (TestReAnchoringToKeyStillInGraceClosesStaleClient on this branch) is unaffected and still passes. The Reconcile-path successor asserting the new event/metrics repoint behavior is TestReanchorSync_CaseB_ReusesExistingClient. If #719 lands with a ReconcileCredentials-based variant, fold its assertions into the Case B test.

Three tests (TestRealClient_CloseInvokesWrapperClose,
TestRealClient_ReadsAfterCloseAreStillFunctional,
TestRealClient_HandoverScenarioToday) exercise the real ld.LDClient
against SSERelayDataStoreAdapter / streamUpdatesStoreWrapper.

The T0 PoC could not validate the wrapper's Close behavior against
the real client (it used a fake). These confirm the design's
lifecycle caveat: ld.LDClient.Close() propagates through the wrapper
to the underlying store. With store handover (next commit), the
adapter — not the retiring client's wrapper — must own that
lifecycle.
SSERelayDataStoreAdapter.Build now hands its existing wrapper to a
second SDK client (re-anchor handover) instead of constructing a
fresh wrapper around a fresh underlying store. The new anchor's
client therefore reads populated, initialized data immediately — no
empty-store window during re-anchor.

The wrapper is refcounted: each handover bumps the count via
acquire(); each client's Close decrements; only the final Close
tears down the underlying store. This honors design §7's
store-lifecycle caveat — Close must not tear down the shared store
while another client still holds it — without leaking the store on
final env teardown.

Two PoC tests that asserted the old broken behavior (H1 in-memory
sub-test, H5 in-memory wipe) are inverted to assert the post-fix
invariant. The spike test TestRealClient_HandoverScenarioToday is
renamed and rewritten to TestRealClient_HandoverPreservesUnderlyingStore
covering the full refcounted lifecycle.
…mitAnchor

Adds the API surface for the synchronous re-anchor sequence (design §7).

- Reconcile no longer flips r.primarySdkKey when the anchor changes.
  Callers receive a ReconcileResult carrying an AnchorChange struct
  (PreviousAnchor, NewAnchor, NewAnchorPreviouslyAccepted) and drive
  the flip via the new CommitAnchor(key) method once they are ready.
- ReconcileResult also exposes MobilePrimaryRepoint when the primary
  mobile key changed to a key that was already in the accepted set.
  In that case the key is not in additions and addCredential's gate
  won't fire, so the caller must invoke eventDispatcher.ReplaceCredential
  itself.
- env_context's reconcileCredentials applies the anchor change
  immediately after Reconcile to preserve pre-T2.c behavior. The full
  synchronous sequence (build new client first, then commit) lands in
  the next commit. The MobilePrimaryRepoint case is handled now —
  this is the fix for the gap left by PR #712's gate (already-accepted
  mobile key becoming primary skipped event-dispatcher repointing).

Two rotator tests that previously relied on Reconcile flipping the
anchor in place now call CommitAnchor explicitly with the returned
ReconcileResult.AnchorChange.
Implements the synchronous re-anchor sequence described in design §7.
reconcileCredentials now interleaves add → re-anchor → remove and owns
the new anchor's setup, replacing the prior commit's immediate
CommitAnchor.

The rotator strips the new anchor from additions in Case A so the
async startSDKClient invocation in addCredential cannot race the
synchronous client build. reanchor branches on whether a client
already exists for the new anchor:

  Case A (no client): install peripherals if the key is brand new
  (envStreams, handlers, connection mapping); construct the SDK
  client synchronously via c.sdkClientFactory with store handover
  (Build returns the existing wrapper, so the new client reads
  populated initialized data); on err or !Initialized, roll back —
  do not CommitAnchor, log a structured error, leave the previous
  anchor authoritative; on success, install the client, rebuild the
  evaluator, then CommitAnchor + ReplaceCredential + big-segment stub.

  Case B (client exists, e.g. a former anchor in grace): reuse the
  client and skip the build; CommitAnchor + ReplaceCredential +
  big-segment stub.

A new reconcileMu serializes concurrent reconciles without blocking
other env operations (GetClient / GetStore / addCredential continue
to run during the synchronous build).

Rotator-test updates: tests that previously expected the anchor in
the additions list now expect it stripped, and explicitly invoke
CommitAnchor mirroring what env_context's reconcileCredentials does.

T2.d (SDK-2543) will fill in reanchorBigSegmentSync; until then it is
a no-op (matching today's behavior — no regression on re-anchor).
Regression coverage for the synchronous re-anchor sequence (design §7):

- Case A success: re-anchor to a brand-new key builds a fresh client,
  hands over the store (same wrapper, still initialized, data intact),
  commits the anchor, and retains the old client during its grace
  period.
- Case A init failure: a failing new-client build rolls back — anchor
  pointer stays on the old key, no broken client installed, old client
  preserved, initErr surfaced.
- Case B: re-anchoring onto a key whose client still exists (a former
  anchor in its grace period) reuses the client and builds nothing new
  (clientCh stays empty), while the anchor flips.
- Mobile-primary repoint: switching the primary mobile key to an
  already-accepted mobile key repoints the primary without spawning an
  SDK client — the gap PR #712's gate left, now driven via
  ReconcileResult.MobilePrimaryRepoint.

All four pass under -race.
§7 previously described only Case A (re-anchor onto a brand-new key).
Add the Case B path — re-anchoring onto a key that already has a live
client (typically a former anchor still in its grace period): no
Build, no store handover, just the flip + ReplaceCredential + big-
segment re-wire that Case A also performs after init. The two paths
converge after the flip; the caller branches on c.clients[newAnchor].

Also clarify failure handling (rollback is Case A only; structured
Errorf log, no dedicated alarm infra), the Reconcile/additions
stripping interaction, and update the consolidated T2.c/T2.d
requirements table to reflect the case split and the synchronous
ReplaceCredential call site.
gofmt the rotator's removeCredentialFromList helper, and reword the
reanchorBigSegmentSync stub comment to avoid the godox TODO keyword
(godox is enabled in .golangci.yml; run.tests is false so only
production code is checked). No behavior change.
@aaron-zeisler aaron-zeisler changed the title feat(relayenv): re-anchor mechanism (T2.c, SDK-2542) feat(relayenv): re-anchor mechanism Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant