Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 66 additions & 0 deletions .claude/board/TECH_DEBT.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,72 @@

## Open Debt

### TD-CI-COVERAGE-MOLD-1 — `test-with-coverage` job lacks the mold linker the `test` job has (2026-06-12)

**2026-06-12 local-repro addendum (same PR, later commit) — diagnosis CONFIRMED
and sharpened; fix extended to `debuginfo=0`.** Reproduced the failure mode
locally with CI's exact env (`RUSTFLAGS="-C debuginfo=1 -C target-cpu=x86-64-v3"`,
`CARGO_INCREMENTAL=0`, stable toolchain):

- `cargo test --manifest-path crates/lance-graph/Cargo.toml --tests --no-run`
died **3×** at the link step with the exact opaque signature CI shows:
`rustc-LLVM ERROR: IO failure on output stream: No space left on device`,
`collect2: fatal error: ld terminated with signal 7 [Bus error]` (SIGBUS =
mmap'd output on a full filesystem), `error: could not compile … (exit
status: 101)`. Resource exhaustion, not a compile error.
- **Measured weight:** each of the 17 integration-test binaries links to
~930 MB at `debuginfo=1`; ~252 MB stripped of debuginfo (−73 %). Set total
≈ 16 GB + ~13 GB deps tree + instrumentation growth + `.profraw` ≈ the
hosted runner's disk/RSS budget — a cliff edge, which is exactly what a
2/50 intermittent looks like. So there are TWO ceilings, not one: GNU-ld
RSS (mold fixes) AND disk (mold does NOT fix).
- **No test bug exists:** every integration-test binary that linked was
executed — **98/98 tests pass** against lance 7.0.0 (test_sql_query 14,
test_datafusion_varlength_complex 19, test_to_sql 12, neighborhood_cascade
10, test_explain_output 8, test_lance_vector_search 7, test_to_spark_sql 7,
spo_ground_truth 7, spo_promotion 4, test_case_insensitivity 4,
test_complex_return_clauses 3, hdr_proof 3). The SoA-migration exoneration
above is now empirical, not inferential.
- **`debuginfo=0` is coverage-safe (verified, not assumed):** 600/600
lance-graph-contract lib tests pass under
`-C instrument-coverage -C debuginfo=0`; the test binary embeds
`__llvm_covmap` / `__llvm_prf_{names,cnts,data}` sections and emits
`.profraw`. LLVM coverage mapping is independent of DWARF.
- **Paid-by (extended):** this PR now also sets job-level
`RUSTFLAGS: "-C debuginfo=0 -C target-cpu=x86-64-v3"` on
`test-with-coverage` (workflow-level stays `debuginfo=1` for the `test`
job). Relieves both ceilings; mold stays as parity + link-speed insurance.
Side effect: the coverage job gets its own Swatinem cache key (first run
repopulates). The "escalate to timing-race hypothesis" path below is
retired unless coverage still flakes after BOTH fixes.

**Open — fix applied this PR, CONFIRM on next green run.** The `Rust Tests`
workflow's `test` job sets up the `mold` linker (`rui314/setup-mold@v1`) with the
comment *"Heavy lance+datafusion integration-test binaries OOM the default GNU
`ld` at the link step (intermittent)."* The sibling `test-with-coverage` job did
**not** set up mold, and links the **even larger llvm-cov-instrumented** binaries
with the default linker — so the OOM is *more* likely there. Symptom: across the
last 50 `rust-test.yml` runs, exactly 2 hit `test=success / cov=failure`
(`claude/probe-mantissa-fill` a32cb177 and `claude/nice-edison-g4rhhl` 12c5ea35);
the plain `test` job stayed green in both. **This is NOT a logic/test failure and
NOT a side-effect of the SoA-singleton migration** (`bindspace-singleton-to-mailbox-soa-v1`
et al.): a migration bug would fail the plain `test` job too — it doesn't, and the
two SoA debts (TD-RESONANCEDTO-DUP-1 P3/deferred, TD-UNBUNDLE-FROM-1 ~1-bit/100-epoch
drift) crash nothing. Confidence: HIGH on the infra cause; the workflow's own
comment names this exact OOM, mold is missing only on the coverage job, and the
failure is intermittent (= memory pressure, not a deterministic bug).
**Residual (honest):** the codecov upload step already sets `fail_ci_if_error:
false`, so the noise is a job-level ❌ that does NOT block merge (`mergeable=True`);
and without the CI log (token 403) a timing-sensitive race surfacing only under
instrumentation's slower execution cannot be *100%* excluded — but the migration's
concurrency tests (D-SNGL-6 writer+reader threads) are PROPOSAL, not shipped, so
there is no concurrent SoA test to race yet. **Paid by:** this PR adds the mold
step to the coverage job (parity with `test`). **Confirm** by a green
`test-with-coverage` run; if it still fails after mold, escalate to the
timing-race hypothesis (read the actual `cargo llvm-cov` log with a scoped token).
Cross-ref: `.github/workflows/rust-test.yml` (test job mold step vs coverage job);
`bindspace-singleton-to-mailbox-soa-v1` (the migration this is NOT).

### TD-UNBUNDLE-FROM-1 — `unbundle_from` is NOT the inverse of `bundle_into` (2026-06-07)

**Open.** `crates/lance-graph-planner/src/cache/kv_bundle.rs` — `unbundle_from`
Expand Down
24 changes: 23 additions & 1 deletion .github/workflows/rust-test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,7 @@ jobs:
# Heavy lance+datafusion integration-test binaries OOM the default GNU `ld`
# at the `cargo test --no-run` link step (intermittent). mold links them
# fast + low-memory (already used by release.yml / rust-publish.yml).
uses: rui314/setup-mold@v1
uses: rui314/setup-mold@9c9c13bf4c3f1adef0cc596abc155580bcb04444 # v1
- uses: Swatinem/rust-cache@v2
with:
shared-key: "lance-graph-deps"
Expand Down Expand Up @@ -101,6 +101,20 @@ jobs:
test-with-coverage:
runs-on: ubuntu-24.04
timeout-minutes: 30
env:
# Override the workflow-level debuginfo=1 for this job only. llvm-cov
# coverage lives in __llvm_covmap/__llvm_prf_* ELF sections, NOT in
# DWARF (verified locally: 600/600 contract tests pass under
# `-C instrument-coverage -C debuginfo=0`, covmap sections present,
# .profraw emitted). At debuginfo=1 each of the 17 instrumented
# integration-test binaries links to ~930 MB (measured; ~252 MB at
# debuginfo=0, -73%) — the full set + deps tree sits exactly at the
# hosted runner's disk/RSS cliff, which is the 2/50 intermittent
# exit-101 (TD-CI-COVERAGE-MOLD-1). Dropping debuginfo relieves BOTH
# ceilings (GNU-ld/mold RSS at link, and disk). Note: a job-level
# RUSTFLAGS gives this job its own Swatinem cache key — the first run
# after this change repopulates the coverage cache.
RUSTFLAGS: "-C debuginfo=0 -C target-cpu=x86-64-v3"
defaults:
run:
working-directory: lance-graph
Expand All @@ -120,6 +134,14 @@ jobs:
run: |
rustup toolchain install stable
rustup default stable
- name: Setup mold linker
# Parity with the `test` job above (TD-CI-COVERAGE-MOLD-1): the heavy
# lance+datafusion test binaries OOM the default GNU `ld` at link
# (intermittent) — and llvm-cov INSTRUMENTED binaries are larger, so
# the OOM is MORE likely here than in the plain `test` job that already
# has mold. Without this step the coverage job flaked while `test`
# stayed green (2/50 runs). mold links them fast + low-memory.
uses: rui314/setup-mold@9c9c13bf4c3f1adef0cc596abc155580bcb04444 # v1
- uses: Swatinem/rust-cache@v2
with:
shared-key: "lance-graph-deps"
Expand Down
Loading