Nightly CI is flaky on master — package download 404s break ~30% of scheduled runs #2036

@widgetii

Problem

Over the past 30 days (2026-04-03 → 2026-05-03), build.yml failed in 10 of 35 scheduled runs on master (~29%). 7 of those 10 failures (70%) share a single root cause: a transient HTTP 404 during package download aborts the whole job.

Data

| Date (UTC) | Run | Failed boards | Root cause |
|---|---|---|---|
| 2026-05-02 | 25264150594 | hi3516cv300_lite, hi3518ev200_ultimate | mbedtls-openipc 404 from codeload.github.com |
| 2026-05-01 | 25237141812 | hi3516dv100_lite | end-of-build "Server Error" (GHA infra) |
| 2026-04-29 | 25139121700 | t30_lite, gk7605v100_lite | ingenic-opensdk-HEAD 404, majestic 404 |
| 2026-04-28 | 25082907818 | hi3516ev300_ultimate | majestic 404 + lame-3.100 from sourceforge 404 |
| 2026-04-27 | 25024579089 | hi3516cv100_lite, hi3516dv200_lite, hi3516dv300_lite | motors-HEAD 404, jsonfilter@<sha> 404 (×2) |
| 2026-04-26 | 24969305329 | gk7205v200_ultimate | toolchain-external from miniupnp.free.fr 404 |
| 2026-04-19 | 24641267971 | gk7605v100_lite | end-of-build "Server Error" (GHA infra) |
| 2026-04-06 | 24055456241 | hi3516ev300_neo | one-off kernel build error (drivers/net/mdio); already self-resolved |
| 2026-04-05 | 24012348732 | hi3516ev200_lite, t20_lite | ipctool-HEAD 404, opus 404 from downloads.xiph.org |
| 2026-04-05 | 23999110802 | ssc333_ultimate | mbedtls-openipc 404 from codeload.github.com |

Pattern summary

  • ~70% (7/10 runs): transient HTTP 404 on package download. Worst offenders: mbedtls-openipc (4×), majestic (2×), jsonfilter (2×), and HEAD-pinned packages (motors, ipctool, ingenic-opensdk).
  • ~20% (2/10 runs): GHA "Server Error" at end of build, a runner-infrastructure flake.
  • ~10% (1/10 runs): one-off real bug, the hi3516ev300_neo kernel break on 2026-04-06. That board has been green every night since, so no action is needed.

Root cause

Buildroot's support/download/wget helper does a single-attempt fetch with no retries. Any transient 404 (codeload.github.com, downloads.xiph.org, sourceforge mirrors, openipc S3, etc.) kills the matrix entry, which marks the whole build workflow failed (matrix is fail-fast: false, but failure of any cell still fails the run).

Compounding factors:

  • No BR2_PRIMARY_SITE / BR2_BACKUP_SITE configured in general/openipc.fragment — no fallback mirror.
  • The GHA workflow caches /tmp/ccache but not output/dl/, so every run re-downloads every tarball.
  • Several packages pin VERSION = HEAD (motors, ipctool, ingenic-opensdk), which makes their tarball URLs unstable across runs and unfit for content-addressable caching.
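
The first and third factors can be checked locally with a few lines of shell. This is a minimal sketch, not something that exists in the repo: the function names are hypothetical, and it assumes packages follow the Buildroot convention of a `<PKG>_VERSION = ...` assignment in a `.mk` file. The package directory is left as a parameter because its path is not stated here.

```shell
#!/bin/sh
# Hypothetical audit helpers; names and layout assumptions are illustrative.

# has_mirror_config FRAGMENT
# Succeeds if FRAGMENT sets a non-empty BR2_PRIMARY_SITE or BR2_BACKUP_SITE.
has_mirror_config() {
    grep -Eq '^BR2_(PRIMARY|BACKUP)_SITE="[^"]+"' "$1"
}

# list_head_pinned DIR
# Prints every .mk file under DIR that assigns <PKG>_VERSION = HEAD.
list_head_pinned() {
    find "$1" -name '*.mk' -exec \
        grep -lE '^[A-Z0-9_]+_VERSION[[:space:]]*=[[:space:]]*HEAD[[:space:]]*$' {} +
}
```

For example, `has_mirror_config general/openipc.fragment || echo "no fallback mirror"` would report the first factor, and `list_head_pinned <package dir>` would list candidates for pinning.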

Proposed short-term workarounds (target: green nightly within 1 week)

  1. Wrap make BOARD=... in a 3-attempt retry loop in both .github/workflows/build.yml and .github/workflows/build-one.yml. Buildroot resumes from per-package stamp files, so retries are cheap.
  2. Cache output/dl/ in CI as a shared (non-per-platform) cache keyed by month, with a dl- restore-key for month rollover.
  3. Set `BR2_BACKUP_SITE="https://sources.buildroot.net"` in general/openipc.fragment for free fallback on upstream-Buildroot packages (opus, lame, miniupnp, etc.).
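
Items 1 and 2 could look roughly like the shell below. It is a sketch under assumptions, not the content of the upcoming PR: `retry`, `dl_cache_key`, and the `RETRY_DELAY` variable are hypothetical names, and the 30-second default pause is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical helpers for the workflow's build step.

# retry MAX_ATTEMPTS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on success, otherwise the exit status of the last attempt.
retry() {
    max_attempts=$1
    shift
    attempt=1
    while :; do
        status=0
        "$@" || status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "retry: giving up after $attempt attempts (exit $status)" >&2
            return "$status"
        fi
        echo "retry: attempt $attempt/$max_attempts failed (exit $status), retrying" >&2
        attempt=$((attempt + 1))
        # RETRY_DELAY is an assumed knob: seconds to wait between attempts.
        sleep "${RETRY_DELAY:-30}"
    done
}

# dl_cache_key
# Prints a cache key that changes once per calendar month (UTC),
# for example dl-2026-05. The shared "dl-" prefix lets a restore-key
# match last month's cache right after rollover.
dl_cache_key() {
    printf 'dl-%s\n' "$(date -u +%Y-%m)"
}
```

In a workflow step this would be used as `retry 3 make BOARD="$BOARD"`, with the key exported through `echo "key=$(dl_cache_key)" >> "$GITHUB_OUTPUT"` and passed to the cache step alongside a `dl-` restore-key. Because each retry reruns the same `make` invocation, it relies on Buildroot's stamp files to skip packages that already built.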

A PR with these changes will follow and link this issue.

Follow-ups (long-term)

  • Pin HEAD-versioned packages (motors, ipctool, ingenic-opensdk) to specific commits so URLs are stable and cacheable.
  • Self-host tarballs for chronically flaky upstreams (sourceforge mirrors, xiph) by uploading to OpenIPC's S3 / GitHub releases and overriding *_SITE.
  • Investigate the end-of-build "Server Error" pattern — likely log-streaming infra flake; the retry wrapper covers it but worth confirming with GHA support if it persists past the workaround PR.
  • Consider whether the matrix's "any cell fails → workflow red" behavior is the right signal, or whether transient categories should be reported but not failing.
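
For the first follow-up, the commit to pin can be looked up with `git ls-remote`. The sketch below is illustrative only: `resolve_head` and `pin_line` are hypothetical names, the repository URL is supplied by the caller, and the output assumes the Buildroot `<PKG>_VERSION = ...` convention.

```shell
#!/bin/sh
# Hypothetical helpers for replacing VERSION = HEAD with a fixed commit.

# resolve_head URL
# Prints the commit SHA that the remote's HEAD currently points to.
resolve_head() {
    git ls-remote "$1" HEAD | awk '$2 == "HEAD" { print $1; exit }'
}

# pin_line PKG URL
# Prints a version assignment for PKG pinned to the remote's current HEAD.
# Fails without output if the SHA could not be resolved.
pin_line() {
    sha=$(resolve_head "$2") || return 1
    [ -n "$sha" ] || return 1
    printf '%s_VERSION = %s\n' "$1" "$sha"
}
```

Calling `pin_line MOTORS <repo URL>` prints a line that can replace the existing `MOTORS_VERSION = HEAD` assignment. Once the version is a fixed SHA, the tarball URL stays the same between runs, which is what makes a cached output/dl/ entry reusable.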

Definition of done

  • Three consecutive nightly build.yml runs on master finish with conclusion: success.
  • All four follow-up items above are either resolved or split out into their own tracking issues.
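
The first criterion can be checked mechanically. The helper below is a sketch with a hypothetical name, `last_n_green`; it reads one run conclusion per line on stdin and assumes the newest run comes first.

```shell
#!/bin/sh
# last_n_green N
# Reads run conclusions from stdin, one per line, newest first (assumed).
# Succeeds only if at least N lines are present and the first N all
# equal "success".
last_n_green() {
    awk -v n="$1" '
        NR <= n && $0 != "success" { bad = 1 }
        END {
            if (bad || NR < n) exit 1
            exit 0
        }
    '
}
```

One possible input source is the GitHub CLI, for example `gh run list --workflow build.yml --branch master --event schedule --limit 3 --json conclusion --jq '.[].conclusion' | last_n_green 3`; the ordering of that output should be confirmed before relying on the result.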

Metadata

Labels: ci (Continuous Integration: workflows, build flakiness, infra)
