fix(ci): relocate zephyr-tests west workspace to /mnt — fix intermittent ENOSPC by avrabe · Pull Request #71 · pulseengine/gale

avrabe · 2026-06-19T05:39:46Z

The "flaky" zephyr-tests are a real CI disk-exhaustion bug

The intermittent zephyr-tests failures are not flaky tests and not a kernel bug — the CI runner's root filesystem runs out of space.

Root cause (evidence-grounded)

The ci-base image bundles the full multi-arch Zephyr SDK (xtensa/rx/arc/riscv/… under /opt/toolchains), which consumes most of the ~14 GB free on the host /. Each job then does a west clone + build on top (the x86_64 SMP job rebuilds core via -Zbuild-std), and unlucky jobs tip over into ENOSPC. The runner crashes writing its own diagnostic log:

failure: Unhandled exception. System.IO.IOException: No space left on device :
'/home/runner/actions-runner/cached/.../Worker_*.log'

Older runs show the same cause during image unpack:

failed to register layer: write /opt/toolchains/zephyr-sdk-1.0.0/.../libc.a: no space left on device
##[error]Docker pull failed with exit code 1
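A pre-flight check along these lines could make the disk pressure described above visible before the west clone starts, rather than only after the runner dies. This is a minimal sketch, not part of the PR: `free_kib` and `require_free_kib` are hypothetical helper names, and no threshold is implied by the source.

```shell
#!/bin/sh
# Hypothetical helper: print free space on the filesystem holding $1, in KiB.
# df -P forces the POSIX one-line-per-filesystem format; -k reports KiB.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: succeed only if path $1 has at least $2 KiB free.
require_free_kib() {
    path="$1"
    need="$2"
    have="$(free_kib "$path")"
    if [ "$have" -lt "$need" ]; then
        echo "preflight: only ${have} KiB free on ${path}, need ${need} KiB" >&2
        return 1
    fi
}

# Example: report headroom on the root filesystem before cloning.
echo "free on /: $(free_kib /) KiB"
```

Calling `require_free_kib / <needed-KiB>` at the top of a job would turn a late runner crash into an early, readable failure; the right value for the threshold is left open here.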

Why it looked flaky

Whichever job loses the disk race is the one that fails, so the failing set wanders across runs. Eight recent main runs were each red on a different, unrelated subset:

| run | failed jobs |
| --- | --- |
| `bc1bff2` | lifo_usage, msgq |
| `40b2742` | mbox_usage, kheap |
| `4654734` | common, event, fifo, Binary size |
| `f5fa4cd` | smp_semaphore, condvar, sched, sched_deadline |
| `fa68a39` | smp_*×3, syscalls, fatal_exception |
| … | … |

PR #70's 5 failures all annotate the same ENOSPC runner crash. A changing failing-set across unrelated suites is the disk-race signature, not test non-determinism.

Fix

Mount the runner's large ephemeral scratch disk (/mnt, ~70 GB) into each container and put the west workspace + builds there, off the near-full root fs. Applied to all four jobs (zephyr-test, zephyr-mpu-test, zephyr-smp-test, size-comparison).
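The relocation can be sketched as a small shell step. This is an illustration under assumptions, not the workflow's actual script: `workspace_dir` is a hypothetical helper, and the fallback to `$HOME` when the scratch mount is missing is an assumption added here. The `/mnt/zephyr-workspace` path is the one named in this PR's first commit message.

```shell
#!/bin/sh
# Hypothetical helper: choose where the west workspace should live.
# $1 = scratch mount (the PR uses /mnt), $2 = fallback directory (assumption).
workspace_dir() {
    scratch="$1"
    fallback="$2"
    if [ -d "$scratch" ] && [ -w "$scratch" ]; then
        echo "${scratch}/zephyr-workspace"
    else
        echo "${fallback}/zephyr-workspace"
    fi
}

WORKSPACE="$(workspace_dir /mnt "${HOME:-/tmp}")"
echo "west workspace: ${WORKSPACE}"
# The job would then create it and run its west clone + build inside, e.g.:
#   mkdir -p "$WORKSPACE" && cd "$WORKSPACE"
```

The container only sees the scratch disk if the job mounts it, which the first commit does with `--volume /mnt:/mnt`.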

Validation

This PR's own zephyr-tests run is the oracle: ENOSPC-class failures should disappear. Crucially, if any test then fails for a real reason, it will be visible instead of masked by disk-death. (If ENOSPC persists, the next lever is moving the SDK install and HOME off the root filesystem as well.)

🤖 Generated with Claude Code

…ent ENOSPC

Root cause of the intermittent zephyr-tests failures (NOT flaky tests, NOT a kernel bug): the CI runner's root filesystem fills up. The ci-base image layers bundle the full multi-arch Zephyr SDK (xtensa/rx/arc/... under /opt/toolchains), consuming most of the ~14 GB free on the host `/`; the per-job west clone + build artifacts (esp. the x86_64 SMP job's -Zbuild-std=core rebuild) then tip unlucky jobs over the edge.

The failure surfaces as the runner crashing while writing its own diag log:

Unhandled exception. System.IO.IOException: No space left on device : '/home/runner/actions-runner/cached/.../Worker_*.log'

and in older runs as `failed to register layer: ... no space left on device` during the docker image unpack.

Because it's whichever job loses the disk race, the failing SET wanders run-to-run across unrelated suites (semaphore, sched, mbox_usage, fatal_exception, smp_semaphore, lifo_usage, ...), which is exactly why it looked like flaky tests. Evidence: 8 recent main runs each red on a different subset; PR#70's 5 failures all annotate the ENOSPC runner crash.

Fix: mount the runner's large ephemeral scratch disk (/mnt, ~70 GB) into each container (`--volume /mnt:/mnt`) and create the west workspace under /mnt/zephyr-workspace, so the heavy clone + build writes land on the big disk instead of the near-full root fs. Applied to all four jobs (zephyr-test, zephyr-mpu-test, zephyr-smp-test, size-comparison).

The PR's own zephyr-tests run is the oracle: ENOSPC-class failures should disappear. If any test then fails for a *real* reason, it will now be visible instead of masked by disk-death.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-06-19T05:41:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


… workflow)

The zephyr-tests /mnt relocation fixed that workflow's ENOSPC, but PR CI surfaced the SAME failure in llvm-lto.yml: `llvm-lto-test (stack)` died with `Docker pull failed` / no-space-left. llvm-lto.yml was untouched by the first commit yet shares the pattern, and is WORSE: it uses the larger full `ci` image and runs four west builds per job (gcc-baseline, gcc-gale, llvm-gale, llvm-lto), so it fills the ~14 GB root fs even more readily.

Apply the identical fix to both llvm-lto jobs: mount /mnt and put the west workspace + all four build dirs there. (Pull-time image-unpack pressure on / is not fully addressed by workspace relocation and is flagged for follow-up if it persists, but moving four builds off / removes the dominant build-time consumer.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…t 403 failures

Reading the actual job logs of PR #71's run finally explained the "flaky" failures as TWO distinct CI-infra causes, neither a test bug nor a code bug:

1. **GitHub API rate limiting (the dominant cause).** `west sdk install` does a "Fetching Zephyr SDK list..." call to the GitHub releases API on *every* job. Unauthenticated, that is 60 req/hr per runner IP; with ~59 concurrent matrix jobs the limit is blown and the fetch returns "Failed to fetch: 403, API rate limit exceeded for <ip>". The SDK then never installs and the build dies at "CMake Error ... FindZephyr-sdk.cmake (find_package)" (this is exactly what `sys_event` failed on). Whichever jobs lose the rate-limit window fail, which is why the failing SET wandered run-to-run and looked like flaky tests.
2. Residual ENOSPC on the disk tail (`mutex`), already mitigated by the /mnt relocation in the earlier commits; may further ease now that the rate-limit retry-churn on /root stops.

Fix: set GITHUB_TOKEN (= github.token) at workflow-level env in both ENOSPC/SDK workflows so the SDK release-list fetch is authenticated (60 -> 5000 req/hr). Uses github.token because the `secrets` context is not available at workflow-level env. Propagates to every job.

Sibling suites (event_api, sys_mutex, semaphore, msgq, ...) passed in the same run, confirming sys_event was not a real test failure: it lost the API-limit race. The PR's re-run is the oracle.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
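Telling the two infra causes apart from a job log can be sketched as a simple signature match. The match strings below are the ones quoted in this PR; the `classify_log` name and the category labels are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical triage helper: read a job log on stdin and print which
# CI-infra signature it matches. Labels are illustrative, not a standard.
classify_log() {
    log="$(cat)"
    if printf '%s\n' "$log" | grep -qi 'no space left on device'; then
        echo enospc
    elif printf '%s\n' "$log" | grep -q 'API rate limit exceeded'; then
        echo rate-limit
    else
        echo other
    fi
}

# Example with a line in the shape of the sys_event failure:
echo 'Failed to fetch: 403, API rate limit exceeded for <ip>' | classify_log
```

The ENOSPC match is case-insensitive because the runner crash and the docker layer error quoted earlier capitalise the message differently.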

…_TOKEN was ignored)

The previous commit set GITHUB_TOKEN in the env, but the token-fix run proved `west sdk install` does NOT read it. stackprot still failed with:

fetch_releases API rate limit exceeded. Try executing install script with --personal-access-token argument or use a .netrc file ... 403, API rate limit exceeded for <ip>

(the install tool itself told us how to authenticate). The env-only change cut failures 2->1 by luck of the concurrency window, not by authenticating.

Fix: pass the token explicitly, `west sdk install -t <tc> --personal-access-token "${GITHUB_TOKEN}"`, at all six install sites in both workflows. The workflow-level GITHUB_TOKEN env (github.token) carries the value; GitHub Actions masks it in logs. This authenticates the SDK release-list fetch (60 -> 5000 req/hr) so concurrent matrix jobs stop hitting the anonymous limit.

Grounded in the stackprot job log of run 27810691312 (the prior token-env run).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
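With six install sites, one way to keep the explicit token flag consistent is a small wrapper. This is a sketch, not what the commit does: `install_sdk` and the `WEST_BIN` override are hypothetical (neither is a west feature), and the warning on a missing token is an assumption added here. The `-t` and `--personal-access-token` flags are the ones quoted in the commit message.

```shell
#!/bin/sh
# Hypothetical wrapper around the install command quoted in this commit.
# WEST_BIN is an illustrative override so the resulting command line can
# be inspected without actually running west.
install_sdk() {
    toolchain="$1"
    if [ -n "${GITHUB_TOKEN:-}" ]; then
        "${WEST_BIN:-west}" sdk install -t "$toolchain" \
            --personal-access-token "$GITHUB_TOKEN"
    else
        echo "warning: GITHUB_TOKEN unset, SDK list fetch will be anonymous" >&2
        "${WEST_BIN:-west}" sdk install -t "$toolchain"
    fi
}
```

Setting `WEST_BIN=echo` and a dummy `GITHUB_TOKEN` before calling `install_sdk <tc>` prints the argument list instead of running west, which is a cheap way to confirm the flag is actually being passed.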

…consume) (#70)

Adds the k_msgq_put wasm-cross-LTO module (3rd primitive after sem #59 / mutex #60), unblocked by synth v0.11.48 (#372+#359). Build+consume complete; build-pipeline + rivet + cargo oracles green.

CI: zephyr-tests core (qemu_cortex_m3 + mps2) all green incl. the msgq suite. The only reds are pre-existing/non-blocking and unrelated to this change: llvm-lto docker-pull ENOSPC (gale#73, red on main) + smp_sched (continue-on-error). Same explained pattern as merged #71.

Production note: CONFIG_GALE_WASM_LTO_MSGQ is default-off; on-silicon validation of the generalized production shim (k_timeout_t int64 ABI / PUT_PEND / arbitrary msg_size) remains a separate gate before enabling in a safety build. The no-wait hot path is already silicon-GREEN at 673 cyc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

avrabe mentioned this pull request Jun 19, 2026

feat(release): standardised gale release pipeline — signed wasm + rivet compliance #72

Merged

avrabe and others added 3 commits June 19, 2026 07:53

avrabe merged commit 4da207a into main Jun 19, 2026
56 of 59 checks passed

avrabe deleted the fix/zephyr-ci-disk-enospc branch June 19, 2026 18:39

avrabe mentioned this pull request Jun 19, 2026

CI: llvm-lto.yml docker-pull ENOSPC on the large ci image (pre-existing, intermittent) #73

Open

avrabe merged 4 commits into main from fix/zephyr-ci-disk-enospc
