
Commit b56bb2c

ci(rust-test): coverage job debuginfo=0 — local repro confirms TD-CI-COVERAGE-MOLD-1, second ceiling found
Local reproduction with CI's exact flags (debuginfo=1, x86-64-v3,
CARGO_INCREMENTAL=0) confirms the diagnosis and sharpens it:

- The --tests --no-run build died 3x at link with CI's exact opaque
  signature: rustc-LLVM 'IO failure on output stream', ld killed by
  SIGBUS, 'could not compile ... (exit status: 101)'. Resource
  exhaustion at link, never a compile or test error.
- Measured: 17 integration-test binaries x ~930 MB at debuginfo=1
  (~252 MB at debuginfo=0, -73%). Set + deps + instrumentation +
  profraw lands exactly on a hosted runner's disk/RSS budget: a cliff
  edge, which is what a 2/50 intermittent looks like. TWO ceilings:
  GNU-ld RSS (mold fixes) AND disk (mold does not).
- No test bug: every binary that linked was executed. 98/98
  integration tests pass on lance 7.0.0. The SoA exoneration in the
  debt entry is now empirical.
- debuginfo=0 is coverage-safe, verified: 600/600 contract tests under
  '-C instrument-coverage -C debuginfo=0'; __llvm_covmap +
  __llvm_prf_* sections present; .profraw emitted. Coverage mapping is
  not DWARF.

Fix: job-level RUSTFLAGS '-C debuginfo=0 -C target-cpu=x86-64-v3' on
test-with-coverage only (test job keeps workflow-level debuginfo=1).
Mold stays from the parent commit.

Note: job-level RUSTFLAGS gives the coverage job its own Swatinem cache
key; first run repopulates.

https://claude.ai/code/session_01PBTGaPCSnnt6u3pjXpbLwY
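The size figures in the commit message can be sanity-checked with a few lines of arithmetic. The sketch below uses only the approximate numbers quoted above (17 binaries, ~930 MB and ~252 MB each). The function and constant names are illustrative and not part of the repository, and it assumes 1 GB = 1000 MB.

```python
# Back-of-the-envelope check of the measured binary sizes.
# Inputs are the approximate figures quoted in the commit message;
# all names here are illustrative, not taken from the repository.

BINARIES = 17
MB_PER_BINARY_DEBUGINFO_1 = 930
MB_PER_BINARY_DEBUGINFO_0 = 252


def set_total_gb(count: int, mb_each: float) -> float:
    """Total size of `count` binaries in GB (assuming 1 GB = 1000 MB)."""
    return count * mb_each / 1000


def reduction_percent(before: float, after: float) -> float:
    """Relative size reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


total_d1 = set_total_gb(BINARIES, MB_PER_BINARY_DEBUGINFO_1)
total_d0 = set_total_gb(BINARIES, MB_PER_BINARY_DEBUGINFO_0)
saving = reduction_percent(MB_PER_BINARY_DEBUGINFO_1, MB_PER_BINARY_DEBUGINFO_0)

print(f"debuginfo=1 set: {total_d1:.1f} GB")
print(f"debuginfo=0 set: {total_d0:.1f} GB")
print(f"per-binary reduction: {saving:.0f}%")
```

Under these inputs the test binaries alone come to roughly 15.8 GB at debuginfo=1 versus about 4.3 GB at debuginfo=0, before counting the deps tree, instrumentation growth, or `.profraw` files.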
1 parent a2feffe commit b56bb2c

2 files changed

Lines changed: 51 additions & 0 deletions


.claude/board/TECH_DEBT.md

Lines changed: 37 additions & 0 deletions
@@ -17,6 +17,43 @@
 
 ### TD-CI-COVERAGE-MOLD-1 — `test-with-coverage` job lacks the mold linker the `test` job has (2026-06-12)
 
+**2026-06-12 local-repro addendum (same PR, later commit) — diagnosis CONFIRMED
+and sharpened; fix extended to `debuginfo=0`.** Reproduced the failure mode
+locally with CI's exact env (`RUSTFLAGS="-C debuginfo=1 -C target-cpu=x86-64-v3"`,
+`CARGO_INCREMENTAL=0`, stable toolchain):
+
+- `cargo test --manifest-path crates/lance-graph/Cargo.toml --tests --no-run`
+  died **3x** at the link step with the exact opaque signature CI shows:
+  `rustc-LLVM ERROR: IO failure on output stream: No space left on device`,
+  `collect2: fatal error: ld terminated with signal 7 [Bus error]` (SIGBUS =
+  mmap'd output on a full filesystem), `error: could not compile … (exit
+  status: 101)`. Resource exhaustion, not a compile error.
+- **Measured weight:** each of the 17 integration-test binaries links to
+  ~930 MB at `debuginfo=1`; ~252 MB stripped of debuginfo (−73 %). Set total
+  ≈ 16 GB + ~13 GB deps tree + instrumentation growth + `.profraw` ≈ the
+  hosted runner's disk/RSS budget — a cliff edge, which is exactly what a
+  2/50 intermittent looks like. So there are TWO ceilings, not one: GNU-ld
+  RSS (mold fixes) AND disk (mold does NOT fix).
+- **No test bug exists:** every integration-test binary that linked was
+  executed — **98/98 tests pass** against lance 7.0.0 (test_sql_query 14,
+  test_datafusion_varlength_complex 19, test_to_sql 12, neighborhood_cascade
+  10, test_explain_output 8, test_lance_vector_search 7, test_to_spark_sql 7,
+  spo_ground_truth 7, spo_promotion 4, test_case_insensitivity 4,
+  test_complex_return_clauses 3, hdr_proof 3). The SoA-migration exoneration
+  above is now empirical, not inferential.
+- **`debuginfo=0` is coverage-safe (verified, not assumed):** 600/600
+  lance-graph-contract lib tests pass under
+  `-C instrument-coverage -C debuginfo=0`; the test binary embeds
+  `__llvm_covmap` / `__llvm_prf_{names,cnts,data}` sections and emits
+  `.profraw`. LLVM coverage mapping is independent of DWARF.
+- **Paid-by (extended):** this PR now also sets job-level
+  `RUSTFLAGS: "-C debuginfo=0 -C target-cpu=x86-64-v3"` on
+  `test-with-coverage` (workflow-level stays `debuginfo=1` for the `test`
+  job). Relieves both ceilings; mold stays as parity + link-speed insurance.
+  Side effect: the coverage job gets its own Swatinem cache key (first run
+  repopulates). The "escalate to timing-race hypothesis" path below is
+  retired unless coverage still flakes after BOTH fixes.
+
 **Open — fix applied this PR, CONFIRM on next green run.** The `Rust Tests`
 workflow's `test` job sets up the `mold` linker (`rui314/setup-mold@v1`) with the
 comment *"Heavy lance+datafusion integration-test binaries OOM the default GNU
.github/workflows/rust-test.yml

Lines changed: 14 additions & 0 deletions
@@ -101,6 +101,20 @@ jobs:
   test-with-coverage:
     runs-on: ubuntu-24.04
     timeout-minutes: 30
+    env:
+      # Override the workflow-level debuginfo=1 for this job only. llvm-cov
+      # coverage lives in __llvm_covmap/__llvm_prf_* ELF sections, NOT in
+      # DWARF (verified locally: 600/600 contract tests pass under
+      # `-C instrument-coverage -C debuginfo=0`, covmap sections present,
+      # .profraw emitted). At debuginfo=1 each of the 17 instrumented
+      # integration-test binaries links to ~930 MB (measured; ~252 MB at
+      # debuginfo=0, -73%) — the full set + deps tree sits exactly at the
+      # hosted runner's disk/RSS cliff, which is the 2/50 intermittent
+      # exit-101 (TD-CI-COVERAGE-MOLD-1). Dropping debuginfo relieves BOTH
+      # ceilings (GNU-ld/mold RSS at link, and disk). Note: a job-level
+      # RUSTFLAGS gives this job its own Swatinem cache key — the first run
+      # after this change repopulates the coverage cache.
+      RUSTFLAGS: "-C debuginfo=0 -C target-cpu=x86-64-v3"
     defaults:
       run:
         working-directory: lance-graph