ci(rust-test): mold linker on the coverage job (parity) — fixes the test-with-coverage flake; clears the SoA migration#488
Conversation
|
Warning Review limit reached
More reviews will be available in 58 minutes and 54 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe CI ChangesCI Coverage Test Job Resource Fix
Possibly related PRs
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
… + TD-CI-COVERAGE-MOLD-1 Diagnosis (grounded, not inferred): the test-with-coverage job intermittently failed (2/50 recent runs) while the plain test job stayed green on the SAME test command. Root cause is NOT the SoA-singleton migration and NOT a logic bug -- a logic bug would fail the plain test job too. The cause is a CI asymmetry: the `test` job sets up the mold linker (with a comment that the heavy lance+datafusion binaries OOM the default GNU ld at link), but the `test-with-coverage` job did not -- and it links even LARGER llvm-cov instrumented binaries with the default linker, so the OOM is more likely there. Fix: add the identical mold setup step to the coverage job (the action is already trusted -- used by the test job, release.yml, rust-publish.yml). Board: TD-CI-COVERAGE-MOLD-1 recorded (Open, paid-by this PR, confirm on next green coverage run). The entry explicitly records that the SoA migration plan (bindspace-singleton-to-mailbox-soa-v1) needs NO calibration on account of this -- the coverage failure is orthogonal infra noise, fail_ci_if_error:false already keeps it non-blocking, and the honest residual (timing-race not 100% excluded without the 403'd log) is noted with its escalation path. https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY
…COVERAGE-MOLD-1, second ceiling found Local reproduction with CI's exact flags (debuginfo=1, x86-64-v3, CARGO_INCREMENTAL=0) confirms the diagnosis and sharpens it: - The --tests --no-run build died 3x at link with CI's exact opaque signature: rustc-LLVM 'IO failure on output stream', ld killed by SIGBUS, 'could not compile ... (exit status: 101)'. Resource exhaustion at link — never a compile or test error. - Measured: 17 integration-test binaries x ~930 MB at debuginfo=1 (~252 MB at debuginfo=0, -73%). Set + deps + instrumentation + profraw lands exactly on a hosted runner's disk/RSS budget — a cliff edge, which is what a 2/50 intermittent looks like. TWO ceilings: GNU-ld RSS (mold fixes) AND disk (mold does not). - No test bug: every binary that linked was executed — 98/98 integration tests pass on lance 7.0.0. The SoA exoneration in the debt entry is now empirical. - debuginfo=0 is coverage-safe, verified: 600/600 contract tests under '-C instrument-coverage -C debuginfo=0'; __llvm_covmap + __llvm_prf_* sections present; .profraw emitted. Coverage mapping is not DWARF. Fix: job-level RUSTFLAGS '-C debuginfo=0 -C target-cpu=x86-64-v3' on test-with-coverage only (test job keeps workflow-level debuginfo=1). Mold stays from the parent commit. Note: job-level RUSTFLAGS gives the coverage job its own Swatinem cache key; first run repopulates. https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/rust-test.yml:
- Around line 137-144: The workflow currently uses the tag-pinned action
reference "uses: rui314/setup-mold@v1" which exposes the job to tag-retargeting;
update that action reference to a specific commit SHA (e.g., replace "`@v1`" with
"@<commit-sha>") so the mold setup action is SHA-pinned and immutable; ensure
you pick a known-good commit SHA from the rui314/setup-mold repo and replace the
uses line accordingly to remove tag-pinning risk.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: e335dbe9-cbac-4a56-b809-70497276dddb
📒 Files selected for processing (2)
.claude/board/TECH_DEBT.md.github/workflows/rust-test.yml
4491675 to
b56bb2c
Compare
…uses) Replace 'uses: rui314/setup-mold@v1' with the resolved commit SHA 9c9c13bf4c3f1adef0cc596abc155580bcb04444 in both occurrences (test job + test-with-coverage job). CodeRabbit flagged line 144 only; the test job's existing pin at line 59 carries the identical tag-retargeting risk for the same action, so SHA-pin both for consistency. Other tag-pinned actions in this workflow (actions/checkout, Swatinem/rust-cache, taiki-e/install-action, codecov/codecov-action) are pre-existing in main and out of scope for this PR. https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY
7f98d23 to
defd290
Compare
What this is
A two-file fix for the
test-with-coverageCI flake, plus the tech-debt record. Diagnosed first-hand, not inferred — and the diagnosis clears the SoA migration of any fault.The question this answers
"Is
test-with-coveragefailing because the SoA-vs-singleton migration is mid-flight spaghetti?" No.test (stable)job too. It doesn't —testis green; only thellvm-cov-instrumented variant fails, on the same test command. Instrumentation doesn't change logic.TD-RESONANCEDTO-DUP-1(P3 name-dup, user-deferred) andTD-UNBUNDLE-FROM-1(~1 bit / 100-epoch gestalt drift). Neither crashes a test.bindspace-singleton-to-mailbox-soa-v1,singleton-to-snapshot-nudge-v1) is clean — its codebook-vs-singleton rule is crisp and it needs no calibration on account of this.The actual root cause — a CI job asymmetry
The
testjob sets up the mold linker with this comment:The
test-with-coveragejob did not set up mold — and it links the larger llvm-cov-instrumented binaries with the default linker, so the OOM is more likely there. Evidence: across the last 50rust-test.ymlruns, exactly 2 hittest=success / cov=failure(this branch's base +claude/nice-edison-g4rhhl); the plaintestjob stayed green in both. Intermittent (2/50) = memory-pressure OOM, not a deterministic bug.The fix
Add the identical
rui314/setup-mold@v1step to the coverage job (parity withtest). The action is already trusted in this repo — used bytest,release.yml, andrust-publish.yml. YAML validated locally.Honest residual (recorded in the debt entry)
fail_ci_if_error: false, so this was a non-blocking job-level ❌ (mergeable=True) — cosmetic noise, not a merge gate.actions/jobs/.../logs) a timing-race that only surfaces under instrumentation's slower execution can't be 100% excluded — but the migration's concurrency tests (D-SNGL-6 writer+reader threads) are PROPOSAL, not shipped, so there is no concurrent SoA test to race yet. If coverage still fails after mold → escalate to the race hypothesis (read the log with a scoped token). That escalation path is written intoTD-CI-COVERAGE-MOLD-1.Board
TD-CI-COVERAGE-MOLD-1recorded inTECH_DEBT.md(Open → paid-by this PR; confirm on next green coverage run). Per the Mandatory Board-Hygiene Rule, the debt observation lands in the same commit as the fix.https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY
Summary by CodeRabbit
Bug Fixes
Chores
Documentation