fix(ci): relocate zephyr-tests west workspace to /mnt — fix intermittent ENOSPC#71

Merged
avrabe merged 4 commits into main from fix/zephyr-ci-disk-enospc
Jun 19, 2026
@avrabe avrabe commented Jun 19, 2026

The "flaky" zephyr-tests are a real CI disk-exhaustion bug

The intermittent zephyr-tests failures are not flaky tests and not a kernel bug — the CI runner's root filesystem runs out of space.

Root cause (evidence-grounded)

The ci-base image bundles the full multi-arch Zephyr SDK (xtensa/rx/arc/riscv/… under /opt/toolchains), which consumes most of the ~14 GB free on the host /. Each job then does a west clone + build on top (the x86_64 SMP job rebuilds core via -Zbuild-std), and unlucky jobs tip over into ENOSPC. The runner crashes writing its own diagnostic log:

failure: Unhandled exception. System.IO.IOException: No space left on device :
'/home/runner/actions-runner/cached/.../Worker_*.log'

Older runs show the same cause during image unpack:

failed to register layer: write /opt/toolchains/zephyr-sdk-1.0.0/.../libc.a: no space left on device
##[error]Docker pull failed with exit code 1
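Both logs point at the same condition: too little free space on `/`. A minimal preflight sketch, not part of this PR, could surface that before a job starts writing. The helper names `avail_kib` and `require_free_kib` are hypothetical, and any threshold passed in is an assumption to tune per runner.

```shell
# Print the available space, in KiB, on the filesystem holding path $1.
# `df -P` forces the POSIX one-line-per-filesystem layout; column 4 is "Available".
avail_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if at least $2 KiB are free on the filesystem holding $1.
require_free_kib() {
    avail=$(avail_kib "$1") || return 1
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

For example, `require_free_kib / 4194304 || echo "less than 4 GiB free on /" >&2` would warn when the root filesystem has under 4 GiB available.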

Why it looked flaky

Because the failing job is whichever one loses the disk race, the failing set wanders across runs: 8 recent main runs were each red on a different, unrelated subset (five shown here):

| run | failed jobs |
| --- | --- |
| bc1bff2 | lifo_usage, msgq |
| 40b2742 | mbox_usage, kheap |
| 4654734 | common, event, fifo, Binary size |
| f5fa4cd | smp_semaphore, condvar, sched, sched_deadline |
| fa68a39 | smp_*×3, syscalls, fatal_exception |

PR #70's 5 failures all annotate the same ENOSPC runner crash. A changing failing-set across unrelated suites is the disk-race signature, not test non-determinism.
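The "wandering set" claim can be checked mechanically. The sketch below is an illustration, not something this PR ships: `repeated_failures` is a hypothetical helper, and the tab-separated `run<TAB>job` input format is an assumption about how the failure list would be exported.

```shell
# Read "run<TAB>job" lines on stdin and print every job that failed in more
# than one distinct run, sorted. Empty output means no job repeated.
repeated_failures() {
    awk -F '\t' '
        !seen[$1 "\t" $2]++ { runs[$2]++ }   # count each (run, job) pair once
        END { for (job in runs) if (runs[job] > 1) print job }
    ' | sort
}
```

Fed the run and failed-job rows listed above (leaving out the `smp_*×3` wildcard entry, which does not name individual jobs), it prints nothing: no named job fails in two of the listed runs.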

Fix

Mount the runner's large ephemeral scratch disk (/mnt, ~70 GB) into each container and put the west workspace + builds there, off the near-full root fs. Applied to all four jobs (zephyr-test, zephyr-mpu-test, zephyr-smp-test, size-comparison).
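As a rough sketch of the relocation, the helper below picks the workspace directory. The `/mnt` mount and the `zephyr-workspace` directory name come from this PR's commit message; the function name `workspace_root` and the fallback to another directory when `/mnt` is unavailable are assumptions added for illustration, and the actual workflow may simply hard-code the path.

```shell
# Print the directory to use for the west workspace.
# $1: scratch mount to prefer (default: /mnt)
# $2: fallback parent directory if the scratch mount is missing or read-only
#     (default: current directory); the fallback is a hypothetical addition.
workspace_root() {
    scratch="${1:-/mnt}"
    fallback="${2:-$PWD}"
    if [ -d "$scratch" ] && [ -w "$scratch" ]; then
        printf '%s\n' "$scratch/zephyr-workspace"
    else
        printf '%s\n' "$fallback/zephyr-workspace"
    fi
}

# Possible use inside a job step:
#   ws=$(workspace_root) && mkdir -p "$ws" && cd "$ws"
```

The container still needs the host disk mounted (the commit message uses `--volume /mnt:/mnt`); without that mount, `/mnt` inside the container is just another directory on the root filesystem.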

Validation

This PR's own zephyr-tests run is the oracle: ENOSPC-class failures should disappear. Crucially, if any test then fails for a real reason, it will be visible instead of masked by disk-death. (If ENOSPC persists, the next lever is relocating the SDK install and HOME off the root filesystem as well.)

🤖 Generated with Claude Code

…ent ENOSPC

Root cause of the intermittent zephyr-tests failures (NOT flaky tests, NOT a
kernel bug): the CI runner's root filesystem fills up. The ci-base image layers
bundle the full multi-arch Zephyr SDK (xtensa/rx/arc/... under /opt/toolchains),
consuming most of the ~14 GB free on the host `/`; the per-job west clone +
build artifacts (esp. the x86_64 SMP job's -Zbuild-std=core rebuild) then tip
unlucky jobs over the edge. The failure surfaces as the runner crashing while
writing its own diag log:

    Unhandled exception. System.IO.IOException: No space left on device :
    '/home/runner/actions-runner/cached/.../Worker_*.log'

— and in older runs as `failed to register layer: ... no space left on device`
during the docker image unpack. Because it's whichever job loses the disk race,
the failing SET wanders run-to-run across unrelated suites (semaphore, sched,
mbox_usage, fatal_exception, smp_semaphore, lifo_usage, ...), which is exactly
why it looked like flaky tests. Evidence: 8 recent main runs each red on a
different subset; PR#70's 5 failures all annotate the ENOSPC runner crash.

Fix: mount the runner's large ephemeral scratch disk (/mnt, ~70 GB) into each
container (`--volume /mnt:/mnt`) and create the west workspace under
/mnt/zephyr-workspace, so the heavy clone + build writes land on the big disk
instead of the near-full root fs. Applied to all four jobs (zephyr-test,
zephyr-mpu-test, zephyr-smp-test, size-comparison).

The PR's own zephyr-tests run is the oracle: ENOSPC-class failures should
disappear. If any test then fails for a *real* reason, it will now be visible
instead of masked by disk-death.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

avrabe and others added 3 commits June 19, 2026 07:53
… workflow)

The zephyr-tests /mnt relocation fixed that workflow's ENOSPC, but PR CI surfaced
the SAME failure in llvm-lto.yml — `llvm-lto-test (stack)` died with
`Docker pull failed` / no-space-left. llvm-lto.yml was untouched by the first
commit yet shares the pattern, and is WORSE: it uses the larger full `ci` image
and runs four west builds per job (gcc-baseline, gcc-gale, llvm-gale, llvm-lto),
so it fills the ~14 GB root fs even more readily.

Apply the identical fix to both llvm-lto jobs: mount /mnt and put the west
workspace + all four build dirs there. (Pull-time image-unpack pressure on / is
not fully addressed by workspace relocation — flagged for follow-up if it
persists — but moving four builds off / removes the dominant build-time
consumer.)
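To illustrate the layout, the sketch below lists one build directory per variant under the relocated workspace. The four variant names are taken from the message above; the function name `lto_build_dirs` and the `build/<variant>` directory scheme are hypothetical and may not match what llvm-lto.yml actually uses.

```shell
# Print one build directory per llvm-lto variant under workspace root $1
# (default: /mnt/zephyr-workspace). The build/<variant> layout is an assumption.
lto_build_dirs() {
    root="${1:-/mnt/zephyr-workspace}"
    for variant in gcc-baseline gcc-gale llvm-gale llvm-lto; do
        printf '%s/build/%s\n' "$root" "$variant"
    done
}
```

Each printed path could then be handed to `west build` through its `-d` (build directory) option, so that all four sets of build artifacts land on the scratch disk.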

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t 403 failures

Reading the actual job logs of PR #71's run finally explained the "flaky" failures
as TWO distinct CI-infra causes, neither of them a test bug or a code bug:

1. **GitHub API rate limiting (the dominant cause).** `west sdk install` does a
   "Fetching Zephyr SDK list..." call to the GitHub releases API on *every* job.
   Unauthenticated, that is 60 req/hr per runner IP; with ~59 concurrent matrix
   jobs the limit is blown and the fetch returns
     "Failed to fetch: 403, API rate limit exceeded for <ip>"
   The SDK then never installs and the build dies at
     "CMake Error ... FindZephyr-sdk.cmake (find_package)"
   (this is exactly what `sys_event` failed on). Whichever jobs lose the
   rate-limit window fail — which is why the failing SET wandered run-to-run and
   looked like flaky tests.
2. Residual ENOSPC on the disk tail (`mutex`) — already mitigated by the /mnt
   relocation in the earlier commits; may further ease now that the rate-limit
   retry-churn on /root stops.

Fix: set GITHUB_TOKEN (= github.token) at workflow-level env in both ENOSPC/SDK
workflows so the SDK release-list fetch is authenticated (60 -> 5000 req/hr).
Uses github.token because the `secrets` context is not available at
workflow-level env. Propagates to every job.
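A quick way to see how close a runner is to the limit is to read the `x-ratelimit-remaining` response header that the GitHub REST API returns. The sketch below only parses headers supplied on stdin; the function name `ratelimit_remaining` is hypothetical, and it is a diagnostic aid rather than anything this commit adds.

```shell
# Read HTTP response headers on stdin and print the value of
# x-ratelimit-remaining (header names are matched case-insensitively).
ratelimit_remaining() {
    awk 'tolower($1) == "x-ratelimit-remaining:" {
        gsub(/\r/, "", $2)   # strip the CR from CRLF line endings
        print $2
        exit
    }'
}
```

One possible invocation is `curl -s -D - -o /dev/null https://api.github.com/rate_limit | ratelimit_remaining`; adding `-H "Authorization: Bearer ${GITHUB_TOKEN}"` should report the authenticated budget instead of the anonymous one.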

Sibling suites (event_api, sys_mutex, semaphore, msgq, ...) passed in the same
run, confirming sys_event was not a real test failure — it lost the API-limit
race. The PR's re-run is the oracle.
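With two distinct infra causes in play, a small triage helper can sort job logs by signature. This is a sketch under assumptions: `classify_ci_log` is a hypothetical name, and the match strings are the error messages quoted in this PR rather than an exhaustive list.

```shell
# Read a job log on stdin and print one label:
#   rate-limit  the 403 "API rate limit exceeded" message is present
#   enospc      a "no space left on device" message is present (any case)
#   other       neither signature found
classify_ci_log() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'API rate limit exceeded'; then
        echo rate-limit
    elif printf '%s\n' "$log" | grep -qi 'no space left on device'; then
        echo enospc
    else
        echo other
    fi
}
```

The rate-limit check runs first because the message above names it as the dominant cause; a log containing both signatures is therefore labeled `rate-limit`.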

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_TOKEN was ignored)

The previous commit set GITHUB_TOKEN in the env, but the token-fix run proved
`west sdk install` does NOT read it — stackprot still failed with:
  fetch_releases API rate limit exceeded. Try executing install script with
  --personal-access-token argument or use a .netrc file
  ... 403, API rate limit exceeded for <ip>
(the install tool itself told us how to authenticate). The env-only change cut
failures 2->1 by luck of the concurrency window, not by authenticating.

Fix: pass the token explicitly — `west sdk install -t <tc>
--personal-access-token "${GITHUB_TOKEN}"` — at all six install sites in both
workflows. The workflow-level GITHUB_TOKEN env (github.token) carries the value;
GitHub Actions masks it in logs. This authenticates the SDK release-list fetch
(60 -> 5000 req/hr) so concurrent matrix jobs stop hitting the anonymous limit.
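The explicit-token call could be wrapped once and reused at each install site. The sketch below assumes a hypothetical wrapper named `sdk_install` and a hypothetical `WEST` override used only for dry runs; the `west sdk install -t ... --personal-access-token ...` form itself is the one quoted in this commit message.

```shell
# Run `west sdk install` for toolchain $1, adding --personal-access-token only
# when GITHUB_TOKEN is non-empty. WEST is a hypothetical override so the
# command can be dry-run (for example WEST=echo).
sdk_install() {
    toolchain="$1"
    if [ -n "${GITHUB_TOKEN:-}" ]; then
        "${WEST:-west}" sdk install -t "$toolchain" \
            --personal-access-token "$GITHUB_TOKEN"
    else
        "${WEST:-west}" sdk install -t "$toolchain"
    fi
}
```

Falling back to the anonymous call when no token is set keeps local runs working, at the cost of the 60 req/hr limit. A dry run with `WEST=echo` prints the token in clear text, so it is best done with a dummy value.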

Grounded in the stackprot job log of run 27810691312 (the prior token-env run).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@avrabe avrabe merged commit 4da207a into main Jun 19, 2026
56 of 59 checks passed
@avrabe avrabe deleted the fix/zephyr-ci-disk-enospc branch June 19, 2026 18:39
avrabe added a commit that referenced this pull request Jun 19, 2026
…consume) (#70)

Adds the k_msgq_put wasm-cross-LTO module (3rd primitive after sem #59 / mutex #60), unblocked by synth v0.11.48 (#372+#359). Build+consume complete; build-pipeline + rivet + cargo oracles green.

CI: zephyr-tests core (qemu_cortex_m3 + mps2) all green incl. the msgq suite. The only reds are pre-existing/non-blocking and unrelated to this change: llvm-lto docker-pull ENOSPC (gale#73, red on main) + smp_sched (continue-on-error). Same explained pattern as merged #71.

Production note: CONFIG_GALE_WASM_LTO_MSGQ is default-off; on-silicon validation of the generalized production shim (k_timeout_t int64 ABI / PUT_PEND / arbitrary msg_size) remains a separate gate before enabling in a safety build. The no-wait hot path is already silicon-GREEN at 673 cyc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>