fix(ci): relocate zephyr-tests west workspace to /mnt — fix intermittent ENOSPC #71
Merged
…ent ENOSPC
Root cause of the intermittent zephyr-tests failures (NOT flaky tests, NOT a
kernel bug): the CI runner's root filesystem fills up. The ci-base image layers
bundle the full multi-arch Zephyr SDK (xtensa/rx/arc/... under /opt/toolchains),
consuming most of the ~14 GB free on the host `/`; the per-job west clone +
build artifacts (esp. the x86_64 SMP job's -Zbuild-std=core rebuild) then tip
unlucky jobs over the edge. The failure surfaces as the runner crashing while
writing its own diag log:
Unhandled exception. System.IO.IOException: No space left on device :
'/home/runner/actions-runner/cached/.../Worker_*.log'
— and in older runs as `failed to register layer: ... no space left on device`
during the docker image unpack. Because it's whichever job loses the disk race,
the failing SET wanders run-to-run across unrelated suites (semaphore, sched,
mbox_usage, fatal_exception, smp_semaphore, lifo_usage, ...), which is exactly
why it looked like flaky tests. Evidence: 8 recent main runs each red on a
different subset; PR#70's 5 failures all annotate the ENOSPC runner crash.
Fix: mount the runner's large ephemeral scratch disk (/mnt, ~70 GB) into each
container (`--volume /mnt:/mnt`) and create the west workspace under
/mnt/zephyr-workspace, so the heavy clone + build writes land on the big disk
instead of the near-full root fs. Applied to all four jobs (zephyr-test,
zephyr-mpu-test, zephyr-smp-test, size-comparison).
The PR's own zephyr-tests run is the oracle: ENOSPC-class failures should
disappear. If any test then fails for a *real* reason, it will now be visible
instead of masked by disk-death.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
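A minimal sketch of the workspace choice described in this commit, assuming the container was started with `--volume /mnt:/mnt`. The helper names `avail_kb` and `pick_workspace_root`, and the fallback to a directory under `$HOME`, are illustrative assumptions and not part of the actual workflow.

```shell
#!/bin/sh
# Sketch only: decide where the west workspace should live.
# avail_kb and pick_workspace_root are hypothetical helper names.

# Print the free space, in KiB, of the filesystem that holds path $1.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Print <big>/zephyr-workspace when <big> (default /mnt) is a writable
# directory with more free space than /, otherwise fall back to a
# directory under $HOME (the fallback is an assumption of this sketch).
pick_workspace_root() {
    big=${1:-/mnt}
    if [ -d "$big" ] && [ -w "$big" ] &&
       [ "$(avail_kb "$big")" -gt "$(avail_kb /)" ]; then
        echo "$big/zephyr-workspace"
    else
        echo "${HOME:-/tmp}/zephyr-workspace"
    fi
}

ws=$(pick_workspace_root)
echo "west workspace: $ws"
```

Comparing free space rather than hard-coding `/mnt` keeps the same script usable on a runner without the scratch disk, at the cost of silently landing back on the root fs there.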
Codecov Report: ✅ All modified and coverable lines are covered by tests.
… workflow)
The zephyr-tests /mnt relocation fixed that workflow's ENOSPC, but PR CI surfaced the SAME failure in llvm-lto.yml: `llvm-lto-test (stack)` died with `Docker pull failed` / no-space-left. llvm-lto.yml was untouched by the first commit yet shares the pattern, and is WORSE: it uses the larger full `ci` image and runs four west builds per job (gcc-baseline, gcc-gale, llvm-gale, llvm-lto), so it fills the ~14 GB root fs even more readily.
Apply the identical fix to both llvm-lto jobs: mount /mnt and put the west workspace + all four build dirs there. (Pull-time image-unpack pressure on / is not fully addressed by workspace relocation, and is flagged for follow-up if it persists, but moving four builds off / removes the dominant build-time consumer.)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
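One way the four build trees could be laid out under the relocated workspace is sketched here. The variant names come from the commit message; `WS_ROOT`, `build_dirs` and the `build-<variant>` directory naming are hypothetical and may differ from what llvm-lto.yml actually uses.

```shell
#!/bin/sh
# Sketch only: keep all four build trees under the /mnt workspace.
# WS_ROOT and the build-<variant> names are hypothetical.
WS_ROOT=${WS_ROOT:-/mnt/zephyr-workspace}

# Print one build directory per variant, rooted at $1.
build_dirs() {
    for variant in gcc-baseline gcc-gale llvm-gale llvm-lto; do
        echo "$1/build-$variant"
    done
}

# List the directories; nothing is created here.
build_dirs "$WS_ROOT"
```

Each printed path would then be passed to `west build -d <dir>`, so that no build output lands on the root fs.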
…t 403 failures
Reading the actual job logs of PR #71's run finally explained the "flaky" failures as TWO distinct CI-infra causes, neither a test nor a code bug:
1. **GitHub API rate limiting (the dominant cause).** `west sdk install` does a "Fetching Zephyr SDK list..." call to the GitHub releases API on *every* job. Unauthenticated, that is 60 req/hr per runner IP; with ~59 concurrent matrix jobs the limit is blown and the fetch returns "Failed to fetch: 403, API rate limit exceeded for <ip>". The SDK then never installs and the build dies at "CMake Error ... FindZephyr-sdk.cmake (find_package)" (this is exactly what `sys_event` failed on). Whichever jobs lose the rate-limit window fail, which is why the failing SET wandered run-to-run and looked like flaky tests.
2. Residual ENOSPC on the disk tail (`mutex`), already mitigated by the /mnt relocation in the earlier commits; may further ease now that the rate-limit retry-churn on /root stops.
Fix: set GITHUB_TOKEN (= github.token) at workflow-level env in both ENOSPC/SDK workflows so the SDK release-list fetch is authenticated (60 -> 1,000 req/hr per repository, the limit GitHub documents for the Actions GITHUB_TOKEN; 5,000 req/hr applies to personal access tokens). Uses github.token because the `secrets` context is not available at workflow-level env. Propagates to every job.
Sibling suites (event_api, sys_mutex, semaphore, msgq, ...) passed in the same run, confirming sys_event was not a real test failure: it lost the API-limit race. The PR's re-run is the oracle.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
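The two infra signatures named above can be told apart mechanically from a job log. The following is a rough sketch, assuming the log has been saved to a file. `classify_log` is a hypothetical helper, and the match strings are taken from the log excerpts quoted in these commit messages.

```shell
#!/bin/sh
# Sketch only: sort a failed job log into one of the two infra classes.
# classify_log is a hypothetical helper name.

# classify_log FILE: print enospc, rate-limit, or other.
classify_log() {
    if grep -qi 'no space left on device' "$1"; then
        echo enospc
    elif grep -q 'API rate limit exceeded' "$1"; then
        echo rate-limit
    else
        echo other
    fi
}

# Demo with a line quoted in the commit message.
log=$(mktemp)
printf '%s\n' 'Failed to fetch: 403, API rate limit exceeded for <ip>' > "$log"
classify_log "$log"   # prints "rate-limit"
rm -f "$log"
```

A log that matches neither pattern is reported as `other`, which is where a genuine test failure would end up.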
…_TOKEN was ignored)
The previous commit set GITHUB_TOKEN in the env, but the token-fix run proved
`west sdk install` does NOT read it — stackprot still failed with:
fetch_releases API rate limit exceeded. Try executing install script with
--personal-access-token argument or use a .netrc file
... 403, API rate limit exceeded for <ip>
(the install tool itself told us how to authenticate). The env-only change cut
failures 2->1 by luck of the concurrency window, not by authenticating.
Fix: pass the token explicitly — `west sdk install -t <tc>
--personal-access-token "${GITHUB_TOKEN}"` — at all six install sites in both
workflows. The workflow-level GITHUB_TOKEN env (github.token) carries the value;
GitHub Actions masks it in logs. This authenticates the SDK release-list fetch
(60 -> 1,000 req/hr per repository for the Actions GITHUB_TOKEN; 5,000 applies to
personal access tokens) so concurrent matrix jobs stop hitting the anonymous limit.
Grounded in the stackprot job log of run 27810691312 (the prior token-env run).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
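A hedged sketch of what one install site might look like once the token is passed explicitly. `install_sdk` and the `DRY_RUN` switch are hypothetical additions for illustration, and `arm-zephyr-eabi` in the usage comment is only an example toolchain name; the `-t` and `--personal-access-token` options are the ones quoted in the commit message.

```shell
#!/bin/sh
# Sketch only: pass the token to `west sdk install` explicitly, since
# the tool does not read GITHUB_TOKEN from the environment on its own.
# install_sdk and DRY_RUN are hypothetical names.

# install_sdk TOOLCHAIN: install one toolchain. With DRY_RUN=1 the
# command is printed with the token redacted instead of being run.
install_sdk() {
    tc=$1
    if [ -n "${GITHUB_TOKEN:-}" ]; then
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "west sdk install -t $tc --personal-access-token ***"
        else
            west sdk install -t "$tc" --personal-access-token "$GITHUB_TOKEN"
        fi
    else
        echo "GITHUB_TOKEN is empty; the release-list fetch is anonymous" >&2
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "west sdk install -t $tc"
        else
            west sdk install -t "$tc"
        fi
    fi
}

# Example (not executed here): DRY_RUN=1 install_sdk arm-zephyr-eabi
```

Warning on an empty token makes an anonymous fetch visible in the job log instead of surfacing later as a 403.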
avrabe added a commit that referenced this pull request on Jun 19, 2026
…consume) (#70)
Adds the k_msgq_put wasm-cross-LTO module (3rd primitive after sem #59 / mutex #60), unblocked by synth v0.11.48 (#372+#359). Build+consume complete; build-pipeline + rivet + cargo oracles green.
CI: zephyr-tests core (qemu_cortex_m3 + mps2) all green incl. the msgq suite. The only reds are pre-existing/non-blocking and unrelated to this change: llvm-lto docker-pull ENOSPC (gale#73, red on main) + smp_sched (continue-on-error). Same explained pattern as merged #71.
Production note: CONFIG_GALE_WASM_LTO_MSGQ is default-off; on-silicon validation of the generalized production shim (k_timeout_t int64 ABI / PUT_PEND / arbitrary msg_size) remains a separate gate before enabling in a safety build. The no-wait hot path is already silicon-GREEN at 673 cyc.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "flaky" zephyr-tests are a real CI disk-exhaustion bug
The intermittent zephyr-tests failures are not flaky tests and not a kernel bug — the CI runner's root filesystem runs out of space.
Root cause (evidence-grounded)
The `ci-base` image bundles the full multi-arch Zephyr SDK (xtensa/rx/arc/riscv/… under `/opt/toolchains`), which consumes most of the ~14 GB free on the host `/`. Each job then does a west clone + build on top (the x86_64 SMP job rebuilds `core` via `-Zbuild-std`), and unlucky jobs tip over into ENOSPC. The runner crashes while writing its own diagnostic log (`System.IO.IOException: No space left on device`). Older runs show the same cause during image unpack (`failed to register layer: ... no space left on device`).
Why it looked flaky
Because it's whichever job loses the disk race, the failing set wanders across runs: 8 recent `main` runs were each red on a different unrelated subset. PR #70's 5 failures all annotate the same ENOSPC runner crash. A changing failing set across unrelated suites is the disk-race signature, not test non-determinism.
Fix
Mount the runner's large ephemeral scratch disk (`/mnt`, ~70 GB) into each container and put the west workspace + builds there, off the near-full root fs. Applied to all four jobs (`zephyr-test`, `zephyr-mpu-test`, `zephyr-smp-test`, `size-comparison`).
Validation
This PR's own zephyr-tests run is the oracle: ENOSPC-class failures should disappear. Crucially, if any test then fails for a real reason, it will be visible instead of masked by disk-death. (If ENOSPC persists, the next lever is relocating the SDK install / `HOME` off `/` too.)
🤖 Generated with Claude Code