test(rabbitmq): bootstrap global CouchDB views in TestMain #4753

Open. Crash-- wants to merge 1 commit into master from investigate/rabbitmq-cache-flake

Crash-- commented May 10, 2026

Background — how we ran into this

While debugging CI failures on #4716 (S3 VFS backend), the test (1.25.x, 3.2.3) job kept failing with this signature, deterministically across reruns:

--- FAIL: TestSyncCreatedOrgContact/CreatesExternalContactsForOtherMembers
    org_contacts_test.go:28:
        Error Trace:    pkg/rabbitmq/rabbitmq_test.go:1070
                        pkg/rabbitmq/org_contacts_test.go:28
        Received unexpected error:
            CouchDB(not_found): missing
--- FAIL: TestSyncCreatedOrgContact/SkipsExistingExternalContact
    ...
--- FAIL: TestSyncDeletedOrgContact/...

The same source built green on test (1.26.x, 3.3.3) in the same workflow run, and crucially had been green on test (1.25.x, 3.2.3) on the immediately preceding commit (only diff: a single trailing newline removal in an unrelated file). Reruns on fresh GitHub-hosted runners reproduced the failure identically. After purging the PR's actions/setup-go caches and re-running, both jobs went green.

That ruled out a normal flake and pointed at cache-shaped state.

Root cause

pkg/rabbitmq tests call lifecycle.Create / instance.Get directly. They never go through testutils.NewSetup → GetTestInstance → stack.Start, so they never trigger couchdb.InitGlobalDB, which is the only function in the codebase that creates the design doc backing the domain-and-aliases view on the global instances DB.

$ grep -rn "InitGlobalDB" pkg/ model/
pkg/couchdb/index.go      (definition)
model/stack/main.go       (only caller)

So the implicit assumption these tests have been making is:

  1. go test ./... runs packages in lexicographic order, so model/move, model/sharing, etc. run before pkg/rabbitmq.
  2. Those earlier packages do go through NewSetup/GetTestInstance, which calls stack.Start and therefore InitGlobalDB.
  3. The design doc lives in the shared CouchDB service for the run, so it survives across test binaries and is there when pkg/rabbitmq queries it.

actions/setup-go@v5 with cache: true caches ~/.cache/go-build, which contains go test's result cache. When the cache is warm and a package's inputs haven't changed, Go prints ok ... (cached) for it without re-running the test binary, so the side effect of creating the design doc never happens. The CouchDB service in the new run is fresh (its container is per-job), so the design doc is genuinely absent.
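The dependency chain described above can be modeled in a few lines. This is only a toy simulation, not the real go test scheduler: the pkg type, runAll, and the package list are hypothetical names chosen for illustration, and the model assumes packages run strictly in order against one shared, initially empty CouchDB.

```go
package main

import "fmt"

// pkg models one test package: its name and whether running it
// creates the design doc as a side effect (hypothetical model).
type pkg struct {
	name        string
	createsView bool
}

// runAll walks the packages in order. A package listed in cached is
// treated as passing without being executed, so its side effects are
// skipped. It returns the names of the packages that failed.
func runAll(pkgs []pkg, cached, needsView map[string]bool) []string {
	viewExists := false // fresh CouchDB container: no design doc yet
	var failed []string
	for _, p := range pkgs {
		if cached[p.name] {
			continue // "(cached)": the test binary never runs
		}
		if p.createsView {
			viewExists = true
		}
		if needsView[p.name] && !viewExists {
			failed = append(failed, p.name)
		}
	}
	return failed
}

func main() {
	pkgs := []pkg{
		{name: "model/move", createsView: true},
		{name: "model/sharing", createsView: true},
		{name: "pkg/rabbitmq"},
	}
	needsView := map[string]bool{"pkg/rabbitmq": true}

	cold := runAll(pkgs, map[string]bool{}, needsView)
	warm := runAll(pkgs, map[string]bool{"model/move": true, "model/sharing": true}, needsView)

	fmt.Println("cold cache failures:", cold)
	fmt.Println("warm cache failures:", warm)
}
```

With a cold cache nothing fails in this model; once both model/* packages are cache-skipped, pkg/rabbitmq is the only package left to run and it fails because nobody created the view.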

Then in instance.Service.Get:

err := couchdb.ExecView(prefixer.GlobalPrefixer, couchdb.DomainAndAliasesView, ...)
if couchdb.IsNoDatabaseError(err) {
    return nil, ErrNotFound
}
if err != nil {
    return nil, err   // <-- propagates "not_found: missing" from the missing design doc
}

IsNoDatabaseError only matches Reason == "no_db_file" or "Database does not exist.". A missing design doc comes back as 404 not_found "missing", which falls through to the raw-error branch. lifecycle.Create does not map that error to ErrNotFound; it propagates it as-is, so the second lifecycle.Create call in the test (bob := createInstanceInOrg(...), the first one after target := ...) fails.

This explains every detail of the observed symptom:

  • Failure is on the second createInstanceInOrg call: the first creates the global DB lazily via CreateDoc (a case IsNoDatabaseError does handle); the second finds the DB in place but the design doc missing.
  • It's deterministic across reruns: same cache → same skipped packages → same missing side effect.
  • 1.25 vs 1.26 split: each Go version has its own cache key (setup-go-...-go-1.25.9-... vs ...-go-1.26.3-...), and the two caches were in different states.
  • Cache purge fixes it: forces model/* to actually re-run, recreating the design doc.
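The "second call fails" symptom can be reproduced with a small in-memory model. This is a sketch under stated assumptions, not the stack's code: couchError, fakeCouch, get, and create are hypothetical stand-ins, and only the matching rule of isNoDatabaseError and the branch structure of get follow what is quoted above.

```go
package main

import (
	"errors"
	"fmt"
)

// couchError is a hypothetical stand-in for the CouchDB error type.
type couchError struct {
	Name   string
	Reason string
}

func (e *couchError) Error() string {
	return fmt.Sprintf("CouchDB(%s): %s", e.Name, e.Reason)
}

var errNotFound = errors.New("instance not found")

// isNoDatabaseError mirrors the rule described above: it only matches
// the two "database is missing" reasons.
func isNoDatabaseError(err error) bool {
	var ce *couchError
	if !errors.As(err, &ce) {
		return false
	}
	return ce.Reason == "no_db_file" || ce.Reason == "Database does not exist."
}

// fakeCouch models the global instances DB and its design doc.
type fakeCouch struct {
	dbExists   bool
	viewExists bool
}

func (c *fakeCouch) execView() error {
	if !c.dbExists {
		return &couchError{Name: "not_found", Reason: "Database does not exist."}
	}
	if !c.viewExists {
		return &couchError{Name: "not_found", Reason: "missing"}
	}
	return nil
}

// get follows the branch structure of the instance lookup quoted above.
// The sketch stores no instances, so a working view always yields errNotFound.
func get(c *fakeCouch) error {
	err := c.execView()
	if isNoDatabaseError(err) {
		return errNotFound
	}
	if err != nil {
		return err // a missing design doc ends up here
	}
	return errNotFound
}

// create checks that the domain is free, then writes the document,
// which creates the database lazily (assumed simplification).
func create(c *fakeCouch) error {
	if err := get(c); !errors.Is(err, errNotFound) {
		return err
	}
	c.dbExists = true
	return nil
}

func main() {
	c := &fakeCouch{}
	fmt.Println("first create: ", create(c))
	fmt.Println("second create:", create(c))
}
```

In this model the first create returns nil and the second returns CouchDB(not_found): missing, the same message as in the failing test output.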

Fix in this PR

Add a TestMain in pkg/rabbitmq that bootstraps the global views the same way production does, removing the cross-package implicit dependency for this package:

func TestMain(m *testing.M) {
    if err := loadTestConfigForMain(); err != nil { log.Fatalf(...) }
    if _, err := couchdb.CheckStatus(ctx); err == nil {
        if err := couchdb.InitGlobalDB(ctx); err != nil { log.Printf(...) }
    }
    os.Exit(m.Run())
}

loadTestConfigForMain is a small in-file replica of config.UseTestFile's setup, since UseTestFile requires a *testing.T we don't have inside TestMain. CouchDB unreachability is tolerated (logged) so this TestMain doesn't hard-fail in environments where CouchDB is intentionally absent — individual tests that need it will still fail through testutils.NeedCouchdb(t) as before.
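The tolerant bootstrap policy can be isolated from the testing package to make it easy to reason about. The following is a minimal sketch, assuming the two CouchDB calls are passed in as plain functions; bootstrapGlobalViews, errDown, and the log messages are hypothetical and do not appear in the PR, and logging the unreachable case is an assumption based on the description above.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errDown is a sample error used to simulate an unreachable CouchDB.
var errDown = errors.New("connection refused")

// bootstrapGlobalViews runs initViews only when checkStatus succeeds.
// Neither failure is fatal: both are logged so that tests which do not
// need CouchDB can still run. It reports whether the views were created.
func bootstrapGlobalViews(checkStatus, initViews func() error, logf func(format string, args ...any)) bool {
	if err := checkStatus(); err != nil {
		logf("CouchDB unreachable, skipping global view bootstrap: %v", err)
		return false
	}
	if err := initViews(); err != nil {
		logf("could not bootstrap global views: %v", err)
		return false
	}
	return true
}

func main() {
	up := func() error { return nil }
	down := func() error { return errDown }
	initOK := func() error { return nil }

	fmt.Println("reachable:  ", bootstrapGlobalViews(up, initOK, log.Printf))
	fmt.Println("unreachable:", bootstrapGlobalViews(down, initOK, log.Printf))
}
```

In a real TestMain the two function arguments would wrap couchdb.CheckStatus and couchdb.InitGlobalDB, and m.Run() would be called regardless of the returned value.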

Why not the broader fix

A more robust change would live in model/instance/service.go:

-   if couchdb.IsNoDatabaseError(err) {
+   if couchdb.IsNotFoundError(err) {
        return nil, ErrNotFound
    }

IsNotFoundError is a strict superset of IsNoDatabaseError and would treat a missing design doc as "no instance with that domain", which is what every caller of instance.Get already wants. That removes the implicit dependency for every test package, not just this one.
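To make the superset relation concrete, here is a hedged comparison of the two predicates over a few sample errors. couchError is a hypothetical stand-in, and isNotFoundError is assumed to match any error named not_found; the real couchdb.IsNotFoundError may be implemented differently.

```go
package main

import (
	"errors"
	"fmt"
)

// couchError is a hypothetical stand-in for the CouchDB error type.
type couchError struct {
	Name   string
	Reason string
}

func (e *couchError) Error() string { return "CouchDB(" + e.Name + "): " + e.Reason }

// isNoDatabaseError matches only the two "database is missing" reasons.
func isNoDatabaseError(err error) bool {
	var ce *couchError
	if !errors.As(err, &ce) {
		return false
	}
	return ce.Reason == "no_db_file" || ce.Reason == "Database does not exist."
}

// isNotFoundError is ASSUMED to match every error named not_found.
func isNotFoundError(err error) bool {
	var ce *couchError
	return errors.As(err, &ce) && ce.Name == "not_found"
}

func main() {
	cases := []struct {
		label string
		err   error
	}{
		{"database absent", &couchError{Name: "not_found", Reason: "Database does not exist."}},
		{"design doc absent", &couchError{Name: "not_found", Reason: "missing"}},
		{"unrelated error", &couchError{Name: "conflict", Reason: "Document update conflict."}},
	}
	for _, c := range cases {
		fmt.Printf("%-18s noDB=%-5t notFound=%t\n",
			c.label, isNoDatabaseError(c.err), isNotFoundError(c.err))
	}
}
```

The "design doc absent" row is the only one where the predicates disagree. It is also the row that a design doc dropped or renamed by a deployment would fall into, which is why the wider predicate can mask a real misconfiguration.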

We deliberately don't ship that change here: the repercussions on other Get callers haven't been fully audited (e.g. it would silently hide a deployment-time view rename), and we wanted the immediate CI fix decoupled from a behavior-shift in core instance lookup. The TestMain carries a TODO pointing at this follow-up.

Test plan

  • CI on this branch: pkg/rabbitmq tests pass on both test (1.25.x, 3.2.3) and test (1.26.x, 3.3.3) from a clean runner.
  • Re-run the same CI job to verify it stays green even when model/* is cache-skipped (which is what cached reruns simulate).
  • Confirm connection_test.go / publisher_test.go (which don't need CouchDB) still run unaffected when CouchDB is absent locally.

🤖 Generated with Claude Code

Tests in this package call lifecycle.Create / instance.Get directly
without going through testutils.NewSetup → GetTestInstance →
stack.Start, so they never trigger couchdb.InitGlobalDB themselves.

They've historically relied on the side effect of earlier model/* test
packages bootstrapping the global DB and on the design doc persisting
in the shared CouchDB service across test binaries. Go's test result
cache (persisted via actions/setup-go cache) can let those packages be
skipped, breaking the implicit dependency. The CI flake on PR #4716
manifested as TestSyncCreatedOrgContact failing with
"CouchDB(not_found): missing" because instance.Service.Get queried
_design/domain-and-aliases on a global instances DB where that design
doc had never been created.

A more robust fix would be to make instance.Service.Get treat any
CouchDB "not_found" as ErrNotFound (it currently only handles
no_db_file / "Database does not exist."). That would remove the
implicit dependency for every package, not just this one. The
repercussions on other Get callers haven't been fully audited yet, so
this localized bootstrap stays in place until the broader change is
vetted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Crash-- requested a review from a team as a code owner May 10, 2026 19:19
Crash-- assigned Copilot and unassigned Copilot May 11, 2026