feat(relayenv): re-anchor mechanism#720
Draft
aaron-zeisler wants to merge 7 commits into
Draft
Conversation
Three tests (TestRealClient_CloseInvokesWrapperClose, TestRealClient_ReadsAfterCloseAreStillFunctional, TestRealClient_HandoverScenarioToday) exercise the real ld.LDClient against SSERelayDataStoreAdapter / streamUpdatesStoreWrapper. The T0 PoC could not validate the wrapper's Close behavior against the real client (it used a fake). These confirm the design's lifecycle caveat: ld.LDClient.Close() propagates through the wrapper to the underlying store. With store handover (next commit), the adapter — not the retiring client's wrapper — must own that lifecycle.
SSERelayDataStoreAdapter.Build now hands its existing wrapper to a second SDK client (re-anchor handover) instead of constructing a fresh wrapper around a fresh underlying store. The new anchor's client therefore reads populated, initialized data immediately — no empty-store window during re-anchor. The wrapper is refcounted: each handover bumps the count via acquire(); each client's Close decrements; only the final Close tears down the underlying store. This honors design §7's store-lifecycle caveat — Close must not tear down the shared store while another client still holds it — without leaking the store on final env teardown. Two PoC tests that asserted the old broken behavior (H1 in-memory sub-test, H5 in-memory wipe) are inverted to assert the post-fix invariant. The spike test TestRealClient_HandoverScenarioToday is renamed and rewritten to TestRealClient_HandoverPreservesUnderlyingStore covering the full refcounted lifecycle.
…mitAnchor Adds the API surface for the synchronous re-anchor sequence (design §7). - Reconcile no longer flips r.primarySdkKey when the anchor changes. Callers receive a ReconcileResult carrying an AnchorChange struct (PreviousAnchor, NewAnchor, NewAnchorPreviouslyAccepted) and drive the flip via the new CommitAnchor(key) method once they are ready. - ReconcileResult also exposes MobilePrimaryRepoint when the primary mobile key changed to a key that was already in the accepted set. In that case the key is not in additions and addCredential's gate won't fire, so the caller must invoke eventDispatcher.ReplaceCredential itself. - env_context's reconcileCredentials applies the anchor change immediately after Reconcile to preserve pre-T2.c behavior. The full synchronous sequence (build new client first, then commit) lands in the next commit. The MobilePrimaryRepoint case is handled now — this is the fix for the gap left by PR #712's gate (already-accepted mobile key becoming primary skipped event-dispatcher repointing). Two rotator tests that previously relied on Reconcile flipping the anchor in place now call CommitAnchor explicitly with the returned ReconcileResult.AnchorChange.
Implements the synchronous re-anchor sequence described in design §7. reconcileCredentials now interleaves add → re-anchor → remove and owns the new anchor's setup, replacing the prior commit's immediate CommitAnchor. The rotator strips the new anchor from additions in Case A so the async startSDKClient invocation in addCredential cannot race the synchronous client build. reanchor branches on whether a client already exists for the new anchor: Case A (no client): install peripherals if the key is brand new (envStreams, handlers, connection mapping); construct the SDK client synchronously via c.sdkClientFactory with store handover (Build returns the existing wrapper, so the new client reads populated initialized data); on err or !Initialized, roll back — do not CommitAnchor, log a structured error, leave the previous anchor authoritative; on success, install the client, rebuild the evaluator, then CommitAnchor + ReplaceCredential + big-segment stub. Case B (client exists, e.g. a former anchor in grace): reuse the client and skip the build; CommitAnchor + ReplaceCredential + big-segment stub. A new reconcileMu serializes concurrent reconciles without blocking other env operations (GetClient / GetStore / addCredential continue to run during the synchronous build). Rotator-test updates: tests that previously expected the anchor in the additions list now expect it stripped, and explicitly invoke CommitAnchor mirroring what env_context's reconcileCredentials does. T2.d (SDK-2543) will fill in reanchorBigSegmentSync; until then it is a no-op (matching today's behavior — no regression on re-anchor).
Regression coverage for the synchronous re-anchor sequence (design §7): - Case A success: re-anchor to a brand-new key builds a fresh client, hands over the store (same wrapper, still initialized, data intact), commits the anchor, and retains the old client during its grace period. - Case A init failure: a failing new-client build rolls back — anchor pointer stays on the old key, no broken client installed, old client preserved, initErr surfaced. - Case B: re-anchoring onto a key whose client still exists (a former anchor in its grace period) reuses the client and builds nothing new (clientCh stays empty), while the anchor flips. - Mobile-primary repoint: switching the primary mobile key to an already-accepted mobile key repoints the primary without spawning an SDK client — the gap PR #712's gate left, now driven via ReconcileResult.MobilePrimaryRepoint. All four pass under -race.
§7 previously described only Case A (re-anchor onto a brand-new key). Add the Case B path — re-anchoring onto a key that already has a live client (typically a former anchor still in its grace period): no Build, no store handover, just the flip + ReplaceCredential + big- segment re-wire that Case A also performs after init. The two paths converge after the flip; the caller branches on c.clients[newAnchor]. Also clarify failure handling (rollback is Case A only; structured Errorf log, no dedicated alarm infra), the Reconcile/additions stripping interaction, and update the consolidated T2.c/T2.d requirements table to reflect the case split and the synchronous ReplaceCredential call site.
gofmt the rotator's removeCredentialFromList helper, and reword the reanchorBigSegmentSync stub comment to avoid the godox TODO keyword (godox is enabled in .golangci.yml; run.tests is false so only production code is checked). No behavior change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements T2.c — the re-anchor mechanism (SDK-2542) for concurrent SDK keys. When
sdkKey.valuechanges, relay now swaps its single upstream SDK client to the new anchor synchronously, preserving downstream connections and rolling back cleanly on failure.Jira: SDK-2542 · Epic: SDK-2453 · Milestone M3.
Targets
feat/concurrent-keys(not v8). Builds on the T0 PoC findings and the store-handover spike.Background
The design (
.agent-docs/concurrent-keys/phase1-design.md§7) and the T0 PoC established that re-anchoring is not a transparent side-effect of today's code: flipping the anchor pointer before the new client initializes leavesGetClient()nil mid-swap (PoC H6) and strands the env on init failure (H7); the in-memory store is rebuilt empty when the new client constructs (H1/H5). This PR closes those gaps.Changes (one commit per logical step)
test(store)— bring over the three store-handover real-client spike tests (one rewritten asTestRealClient_HandoverPreservesUnderlyingStore).feat(store)—SSERelayDataStoreAdapter.Buildreuses its existing wrapper on a second client construction (store handover → no empty-store window). The wrapper is refcounted: only the finalClosetears down the underlying store, so the retiring client'sClosecan't pull the store out from under the new anchor (design §7 lifecycle caveat).feat(credential)—Reconcileno longer flipsprimarySdkKey; it returns aReconcileResult(AnchorChange+MobilePrimaryRepoint), and a newCommitAnchormoves the pointer. The new anchor is stripped fromadditionsin Case A soaddCredential's asyncstartSDKClientcan't race the synchronous build.feat(relayenv)— synchronous re-anchor with Case A (build new client + store handover + waitInitialized→ commit →ReplaceCredential→ big-segment stub; rollback on init failure) and Case B (reuse the existing client → commit →ReplaceCredential). Order: add → re-anchor → remove, serialized by a dedicatedreconcileMu.test(relayenv)— integration tests: Case A success, Case A init-failure rollback, Case B reuse (no new client), mobile-primary repoint.docs— document Case B in design §7.style— gofmt + godox lint.Acceptance criteria coverage
TestReanchorSync_CaseA_BuildsNewClientAndMovesAnchorTestReanchorSync_CaseA_InitFailureRollsBackTestReanchorSync_CaseB_ReusesExistingClientTestRealClient_HandoverPreservesUnderlyingStore,TestReanchorPoC_H5_StoreSurvivesReAnchor, H1 sub-testTestReanchorSync_MobilePrimaryRepoint_AlreadyAcceptedKey+ PR #712's gatePoC tests for H1 and H5 were inverted from "asserts broken behavior" to "asserts the fix" and retained as regressions.
Out of scope
reanchorBigSegmentSync) is defined as a no-op (matches today's behavior, no regression); T2.d fills it in.Coordination note for PR #719
#719 renamed a test to assert pre-M3 behavior on the legacy
UpdateCredentialpath. T2.c changes only theReconcileCredentialspath, so that legacy test (TestReAnchoringToKeyStillInGraceClosesStaleClienton this branch) is unaffected and still passes. The Reconcile-path successor asserting the new event/metrics repoint behavior isTestReanchorSync_CaseB_ReusesExistingClient. If #719 lands with aReconcileCredentials-based variant, fold its assertions into the Case B test.