feat(LLMO-4176): add brandalf flag override and decision matrix to onboarding mode resolution#2171
Conversation
Captures the design for blocking actions that would create or extend organizations containing a mix of v1 and v2 (Brandalf) LLMO sites. Two coordinated guards (onboarding-time and feature-flag flip) sharing a single helper module, plus a read-only monitoring script for existing inconsistent orgs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Replace the two-guard design (onboarding consistency check + flag-flip guard) with a single rule inside resolveLlmoOnboardingMode: legacy customers (any site created before the Brandalf GA cutoff of 2026-04-01) default to v1; new customers default to v2; the brandalf feature flag remains an explicit override. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
This PR will trigger a minor release when merged. |
…cutoff Simplify the rule further: mode is now decided purely from whether the org has any site created before the cutoff. The brandalf feature flag is no longer an input to resolveLlmoOnboardingMode. The cutoff is a Unix epoch milliseconds value supplied via the LLMO_BRANDALF_GA_CUTOFF_MS environment variable, so it can be tuned without a full redeployment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The simplified rule has only two outcomes (v1 for legacy customers, v2 for everyone else), so there is no longer a configurable default mode. Remove LLMO_ONBOARDING_DEFAULT_VERSION and normalizeLlmoOnboardingMode from the new resolveLlmoOnboardingMode and note that the env var should be cleaned up from Lambda config. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Restore the env-driven default but layer the legacy-customer protection on top of it: when the default resolves to v2 (the normal state), an org with any pre-cutoff site is still forced onto v1; when the default is explicitly set to v1, that wins globally and short-circuits the per-org check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…createdAt Rewrites resolveLlmoOnboardingMode with a two-branch rule: - If LLMO_ONBOARDING_DEFAULT_VERSION=v1 → v1 (kill switch, no DB call) - If org has any site created before LLMO_BRANDALF_GA_CUTOFF_MS → v1 - Otherwise → v2 The cutoff defaults to 2026-04-01T00:00:00Z (1743465600000 ms) and can be tuned via the LLMO_BRANDALF_GA_CUTOFF_MS env var without a full redeploy. The brandalf feature flag is no longer read during mode resolution (still written after successful v2 onboarding for DRS). Adds 36 unit tests covering all branches, fallback on DB error, kill switch short-circuit, and custom cutoff env var. Adds IT tests that validate hasPreBrandalfSites against a real PostgREST DB using seed sites with explicit created_at timestamps. Fixes the existing v1-mode performLlmoOnboarding test to use the kill switch instead of the brandalf flag. TEMPORARY: hasPreBrandalfSites, resolveBrandalfCutoffMs, and the related seed data should be removed once all v1 customers are migrated. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The postgrest upsert stub returned a plain resolved value, but upsertFeatureFlag calls .upsert().select().single(). Extend the stub to be chainable so v2-path tests reach the success handler. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The hard-coded default LLMO_BRANDALF_GA_CUTOFF_MS_DEFAULT was 1743465600000, which is 2025-04-01T00:00:00Z — one year earlier than the documented intent (2026-04-01T00:00:00Z). With the wrong value, any v1 customer onboarded between 2025-04-01 and 2026-04-01 would slip past the legacy guard and be misclassified as v2, defeating the safeguard this PR was meant to provide. Switch to Date.UTC(2026, 3, 1) so the value is self-documenting, and add an explicit pin in the unit test so a future regression on this constant fails loudly instead of silently passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…helpers
Six fixes from the deep review of the v1/v2 safeguard:
1. Cross-org re-parenting bypass (release blocker). performLlmoOnboarding
resolved the onboarding mode BEFORE createOrFindSite, but
createOrFindSite silently re-parents an existing site found via the
global Site.findByBaseURL lookup. A legacy (pre-cutoff) site moved
into a brand-new org would be classified v2 and instantly create the
mixed v1/v2 state this safeguard exists to prevent. Now:
- createOrFindSite persists the re-parent immediately (await save())
so resolveLlmoOnboardingMode's Site.allByOrganizationId query sees
the move.
- performLlmoOnboarding resolves the mode AFTER createOrFindSite.
Regression guarded by both an updated unit test on createOrFindSite
(asserts save() is now called in the re-parent branch) and a
calledAfter() ordering check on the full happy-path test.
2. Mixed-state diagnostic. When hasPreBrandalfSites returns true AND
the org already has brandalf=true, log.error so SRE notices: this
means a previously-migrated v2 org is being forced back to v1 and
needs manual remediation. Best-effort — failure to read the flag
only warns, never blocks onboarding.
3. Drop normalizeLlmoOnboardingMode. Dead code in src/, AND its default
('v1' for unknown input) is the OPPOSITE of resolveLlmoOnboardingMode's
default ('v2'). Anyone re-introducing it later would silently flip
the global default. Footgun removed.
4. Surface data-quality issues. hasPreBrandalfSites now log.warn's when
a site has missing/null/unparseable createdAt instead of silently
skipping. Combined with the lookup-error fall-through to v2, the
function was previously biased toward v2 in every failure mode —
exactly wrong for a safeguard whose purpose is preventing v1→v2 drift.
5. Re-use readBrandalfFlagOverride. Previously dead in src/; now used
by the mixed-state diagnostic in #2.
6. Doc sync. v1-v2-onboarding-consistency-safeguard.md no longer
references the wrong 1743465600000 cutoff, the deleted
normalizeLlmoOnboardingMode helper, or the false claim that
readBrandalfFlagOverride has callers outside this file. Updated
pseudocode matches the real implementation, including the new
call ordering note.
Test coverage: 38 unit tests for llmo-onboarding-mode (5 new mixed-state
cases + 3 updated to assert the new createdAt warns), 79 unit tests for
llmo-onboarding controller (createOrFindSite save assertion flipped,
calledAfter() regression guard added).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI build job failed on global 100% branch coverage — the \`s.getId?.() ?? '<unknown>'\` fallback in the new createdAt warn paths was unreachable from the existing tests because every makeSite() helper provides a getId stub. Add two cases that pass partially-hydrated site objects (no getId) for both the null-createdAt and unparseable-createdAt branches, restoring 100% on src/support/llmo-onboarding-mode.js. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The mode-resolution test fixtures added in 89d21f8 (ORG_LEGACY_LLMO + ORG_NEW_LLMO + their two sites) bumped the global counts that several baseline IT tests assert on, which has been failing ci/it-postgres on this branch since the fixtures landed: - GET /organizations 3 → 5 - GET /sites (excluding ORG_1) 2 → 4 - GET /sites/by-delivery-type/aem_edge 3 → 5 (also uppercase + mixed) Same root cause for the compression wrapper test "does not compress when no Accept-Encoding header": /sites response now exceeds the wrapper's 1024-byte minSize threshold, and Node 24's undici auto-adds Accept-Encoding regardless of whether the IT http-client sets it. The "Accept-Encoding: identity" test directly above already covers the same wrapper short-circuit, so the redundant case is removed with an explanatory comment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ertion The LLMO mode-resolution test sites (SITE_LEGACY_LLMO, SITE_NEW_LLMO) are seeded with fixed historical / future created_at values to straddle the Brandalf GA cutoff (2026-04-01). expectSiteListDto's expectISOTimestamp asserts createdAt is within the last hour, which intentionally fails for these fixtures. Filter them out of the DTO check on the GET /sites baseline test — the count and id assertions still verify they're present in the response. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The brand_sites_type_check constraint (added in #2149) only allows 'base' and 'localized'. All brand URL callsites were still passing type: 'url', causing upsertBrand to fail silently and leaving v2 onboarded sites without an active brand — which in turn caused the DRS brand-presence scheduler to 422. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
End-to-end test results — dev environmentEnvironment: spacecat-dev / drs-dev / audit-worker-dev Test 1 — Legacy org stays on v1 (site cutoff protection)Org: AE-ASSETS-ENG
Test: Onboarded Observed: Both onboarded as v1 — ✅ Legacy org correctly protected from v2. Test 2 — New org goes to v2 (happy path)Setup: Set Vault warm container caveat: Even after writing to Vault, warm containers kept the old (missing) secret due to metadata baseline caching (60s metadata check, baseline set on cold start without re-fetch). Fixed by publishing Lambda version v5608 with a dummy env var change to force cold start. Test: Onboarded Observed: Onboarded as v2 — ✅ v2 path executed correctly after Vault secret propagated. Test 3 — v2 onboarding end-to-end: petco.comTest: Onboarded Observed: v2 config initialized, Root cause: Fix: Committed as part of this PR — all 4 callsites ( Test 4 — Kill switch (
|
…s URLs Update 4 test expectations that still asserted type: 'url' after the fix in brand URL callsites. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Behaviour & environment variable referenceHow the safeguard works
Environment variables
Both variables are optional. The system works out of the box with the hardcoded default cutoff of What each mode doesv1 path (
v2 path (
Mixed-state detectionIf an org has This is a monitoring signal. Onboarding itself succeeds (v1 path). Remediation requires manually auditing which sites in the org are pre/post-cutoff. Operational notes
Dev test results (2026-04-13)
🤖 Generated with Claude Code |
Related PRsaudit-worker fix (required for v1 onboarding to work end-to-end): Why both PRs are needed
Without the audit-worker fix, mixed-state orgs (brandalf flag = true, but pre-Brandalf sites present) will always fail brand-presence schedule creation with DRS 422 — even though api-service correctly chose the v1 onboarding path. |
…tions - Add PR links (api-service #2171, audit-worker #2380) - Document the mixed-state DRS 422 root cause found during dev testing - Add audit-worker companion fix section with decision table - Mark resolved open questions (cutoff default, inclusivity, env var rollout) - Update implementation order with completion status and note that both PRs must be deployed together Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add brandalf flag as highest-priority check in resolveLlmoOnboardingMode: if brandalf=true, return v2 (except row 1 remediation) - Row 1 remediation: kill switch + pre-cutoff sites + brandalf=true → revert flag to false, log warning, return v1 - Fix DRS prompt generation job to use overrideBaseURL when available, matching the Brandalf job pattern - Update design doc with full 8-row decision matrix - Rewrite unit tests to cover all matrix scenarios Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Hi Iryna!! Notes from Cursor: What works well
Issues and suggestions (by severity)1. Misleading warning when revert fails (row 1)After a failed Suggestion: Only log “reverted” when upsert succeeds, or split messages (“attempting revert” / “revert succeeded” / “revert failed — flag may still be true”). 2.
|
Move the "Reverted brandalf flag" warn log inside the try block so it only fires when upsertFeatureFlag actually succeeds. On failure, the error message now includes "Flag may still be true" to prevent misleading incident response. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-customer-analysis (#2380) ## Summary - **Root cause**: For mixed-state orgs (`brandalf=true` feature flag but pre-Brandalf sites that force v1 onboarding via `hasPreBrandalfSites`), the `llmo-customer-analysis` handler was calling `isBrandalfEnabled` independently of the actual onboarding path. It found no v2 customer config brand (none created during v1 onboarding) and called DRS without `brand_id`. DRS rejects this with `422 - brand_id is required for v2 (brandalf-enabled) sites`. - **Fix**: Short-circuit `isBrandalfEnabled` when `auditContext.onboardingMode === 'v1'`. The `onboardingMode` is set by `spacecat-api-service` based on `resolveLlmoOnboardingMode` and propagated through `drs-prompt-generation` → `llmo-customer-analysis`. When it's explicitly `'v1'`, we skip v2 brand resolution and create the BP schedule without `brand_id`. - **Test**: New unit test covers the mixed-state scenario, asserting the feature-flags API is never called and the schedule payload omits `brand_id`/`spacecat_org_id`. ## Behaviour | onboardingMode | brandalf flag | Outcome | |---|---|---| | `'v2'` | true | Resolve brand from DB, include `brand_id` in schedule | | `'v1'` (explicit) | true | **Skip brandalf check**, create schedule without `brand_id` | | not set | true | Fall back to `isBrandalfEnabled` (backward compat) | | not set | false | No brand resolution, schedule without `brand_id` | ## Related - spacecat-api-service PR: adobe/spacecat-api-service#2171 (LLMO-4176 Brandalf GA safeguard) ## Test plan - [ ] Unit tests pass (`npm test`) - [ ] Deploy to dev and re-run v1 onboarding for nordstrom.com (site `4483e763-a918-431d-8b7a-ec8d4f3bf9d9`) — confirm brand-presence schedule created successfully in CloudWatch logs 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
# [1.441.0](v1.440.2...v1.441.0) (2026-04-15) ### Features * **LLMO-4176:** add brandalf flag override and decision matrix to onboarding mode resolution ([#2171](#2171)) ([977e4e9](977e4e9))
|
🎉 This PR is included in version 1.441.0 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
Summary
Implements LLMO-4176 (epic LLMO-4054).
Rewrites
resolveLlmoOnboardingModeto follow an 8-row decision matrix based on 3 inputs:Priority order: Brandalf flag (highest) → kill switch → legacy site check → default v2.
Temporary safeguard — will be removed once all v1 customers are migrated to v2.
Related PRs
baseSiteIdoverbrand_sitesjoinWhat changed
src/support/llmo-onboarding-mode.js— RewritesresolveLlmoOnboardingModewith 4-step logic:true: check for Row 1 condition (kill switch + pre-cutoff → revert flag to false, return v1), otherwise return v2LLMO_ONBOARDING_DEFAULT_VERSION=v1→ v1New exports (all TEMPORARY):
LLMO_BRANDALF_GA_CUTOFF_MS_DEFAULT— epoch-ms constant (2026-04-01T00:00:00Z)resolveBrandalfCutoffMs(context)— reads cutoff fromLLMO_BRANDALF_GA_CUTOFF_MSenv varhasPreBrandalfSites(organizationId, context)— returnstrueif org has any site created before cutoffsrc/controllers/llmo/llmo-onboarding.js— Fixes DRS baseUrl bug (NASCAR issue): passesoverrideBaseURLto DRS prompt generation job when available.Files changed
src/support/llmo-onboarding-mode.jssrc/controllers/llmo/llmo-onboarding.jsoverrideBaseURL)src/support/customer-config-mapper.jstest/support/llmo-onboarding-mode.test.jstest/controllers/llmo/llmo-onboarding.test.jstest/support/slack/actions/onboard-llmo-modal.test.jstest/it/shared/tests/llmo-onboarding.jshasPreBrandalfSites+resolveLlmoOnboardingModetest/it/postgres/llmo-onboarding.test.jstest/it/postgres/seed-data/organizations.jstest/it/postgres/seed-data/sites.jscreated_attest/it/shared/seed-ids.jsdocs/llmo-brandalf-apis/v1-v2-onboarding-consistency-safeguard.mdTest plan
Unit tests (automated — run in CI)
Integration tests (automated — require Docker)
Covers:
hasPreBrandalfSites→truefor org with pre-cutoff sitehasPreBrandalfSites→falsefor org with post-cutoff sitehasPreBrandalfSites→falsefor org with no sitesresolveLlmoOnboardingMode→v1for legacy orgresolveLlmoOnboardingMode→v2for new orgresolveLlmoOnboardingMode→v1when kill switch activeE2E validation (manual — completed on dev)
Validated on dev environment (
spacecat.experiencecloud.live/api/ci) with two test orgs. Results documented indocs/llmo-brandalf-apis/e2e-test-results.md.🤖 Generated with Claude Code