OLG-nos-server shasta proposal #1
Open
mateusz-bajorski-shasta wants to merge 26 commits into
Top-level orchestration:
- Makefile invokes the Dockerized wrapper per TARGET
- dock-run.sh builds the toolchain image (docker/Dockerfile) and runs build.sh inside with the host UID and gitconfig mounted
- build.sh handles target-specific patches and the eve build invocation
- cache-setup.sh symlinks a per-revision dist cache
- .gitignore excludes the populated eve/ checkout and build artifacts
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
config.yml pins lf-edge/eve at bf6e6e259f18a3d87efa0493fbbbddcdd31e4359 and lists the patch folders to apply.
setup.py exposes three modes:
- --setup: fresh clone + reset to pinned revision + git am all patches + symlink profiles/ from repo into eve/profiles/
- --rebase: fetch + reset + reapply patches (for upstream bumps)
- --update: regenerate patches/ via git format-patch from the eve checkout
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
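The three setup.py modes are mutually exclusive, which could be expressed roughly as below. This is a minimal sketch, not the actual setup.py: the `build_parser` and `plan` names and the step descriptions are assumptions made for illustration; only the flag names and the order of operations come from the commit message.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI: exactly one of --setup / --rebase / --update."""
    parser = argparse.ArgumentParser(description="sketch of a setup.py-style CLI")
    mode = parser.add_mutually_exclusive_group(required=True)
    # All three flags write to the same destination, so the caller reads args.mode.
    for name in ("setup", "rebase", "update"):
        mode.add_argument(f"--{name}", action="store_const", dest="mode", const=name)
    return parser


def plan(mode: str) -> list[str]:
    """Return the ordered steps for a mode, as described in the commit message."""
    steps = {
        "setup": ["clone", "reset to pinned revision", "git am patches", "symlink profiles"],
        "rebase": ["fetch", "reset to pinned revision", "git am patches"],
        "update": ["git format-patch into patches/"],
    }
    return steps[mode]
```

Sharing one `dest` across the group keeps the dispatch to a single lookup and lets argparse reject combinations such as `--setup --update`.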
Series covers:
- pkg/lps: Local Profile Server package
- pkg/pillar/cmd/gen-loc-config: LOC config generator with VyOS support, cloud-init, EnforceNetworkInterfaceOrder, deterministic LMP NIC naming
- zedagent: LOC bootstrap fixes, pre-provisioned UUID handling, airgap TLS handling, parseEdgeNodeInfo nil panic, UUID mismatch fix
- wait: bypass controller onboarding when /config/device-uuid is present
- downloader: localhost/loopback datastore handling
- dpcmanager: exclude PhyIoUsageDedicated ports from last-resort DPC
- Makefile / build.sh: rootfs sizing, loc config generation, ms-01 entry
Extracted via 'git format-patch bf6e6e259..origin/lps -o patches/' with profile-only commits dropped (profile content lives on main under profiles/, not in patches).
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
profiles/ms-01.yml: MS-01 mini PC with VyOS in PCIe passthrough
profiles/qemu.yml: QEMU testing with switch network instance
Profile YAMLs ship as repo content rather than inside patches; setup.py symlinks eve/profiles -> repo/profiles so edits take effect without regenerating patches.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Covers native and Dockerized builds, the branching model (main on patches-on-upstream, lps as legacy), and the repository structure.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Tooling so AI agents can contribute patches to this series safely:
- CLAUDE.md at repo root documents the patches-on-upstream model and the
edit-in-eve / --update / verify loop. Auto-loaded in any session.
- tools/patches.py library parses the series headers; CLI wrappers expose
"which patches touch path X" and a patches/README.md generator.
- tools/check-commit-msg.sh enforces "<scope>: <subject>" on patch
commits, installable into eve/.git/hooks via tools/install-eve-hooks.sh
- tools/verify.sh fresh-applies all patches in a sandbox dir and checks
patches/README.md is current and no patch touches profiles/. Wired into
.github/workflows/verify.yml on PRs to main.
- .claude/commands/olg-patch-{add,edit}.md and olg-rebase-upstream.md
encode the canonical loops for agents to follow.
setup.py: fix precedence so -d/--directory beats config.yml output_dir
(needed by verify.sh to use a sandbox dir).
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
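The "<scope>: <subject>" rule for patch commits could be checked along these lines. This is only a sketch in Python; the real tools/check-commit-msg.sh is a shell script whose exact rules are not shown here, so the regular expression and the `check_subject` name are assumptions.

```python
import re

# Assumed shape: a non-empty scope without ':' (it may contain '/', '+' or spaces,
# as in "pkg/lps + gen-loc-config"), then ": ", then a non-empty subject.
SUBJECT_RE = re.compile(r"^(?P<scope>[^:\s][^:]*): (?P<subject>\S.*)$")


def check_subject(message: str) -> bool:
    """Return True when the first line of a commit message is '<scope>: <subject>'."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    return SUBJECT_RE.match(first_line) is not None
```

Only the first line is inspected, so the body and the Signed-off-by trailer do not affect the result.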
build.sh:
Picks the eve make target per profile:
qemu -> make ... live (live.raw + live.qcow2)
* -> make ... installer (installer.raw)
README:
Adds image-kind-per-target table and the `cd eve && make run-live`
entry point for booting the qemu live image. The qemu profile is
meant to boot under QEMU directly without going through the
installer flow.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
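The per-profile target choice in build.sh amounts to a two-way mapping. The following Python sketch mirrors it for illustration; the function names are hypothetical and the real logic lives in the shell script.

```python
def eve_make_target(profile: str) -> str:
    """qemu builds a live image; every other profile builds an installer."""
    return "live" if profile == "qemu" else "installer"


def expected_artifacts(profile: str) -> list[str]:
    """Image files the chosen target is expected to produce, per the commit message."""
    if eve_make_target(profile) == "live":
        return ["live.raw", "live.qcow2"]
    return ["installer.raw"]
```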
Boots eve/dist/amd64/snapshot/installer.raw under qemu-system-x86_64 with the serial console captured to .qemu-runs/<ts>/serial.log. Exits 0 if known 'EVE is alive' markers appear in the log within the timeout, 1 otherwise. Used for fast 'did it boot at all?' smoke tests before flashing real hardware.
.gitignore: add .qemu-runs/ for the per-run log/disk output dir.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
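The "marker within timeout" exit-code contract of the smoke test can be sketched as a polling loop over the serial log. This is an illustration in Python, not the boot script itself; `wait_for_markers` is a hypothetical name, and the marker strings are passed in by the caller because the commit message does not list them.

```python
import time
from pathlib import Path


def wait_for_markers(log_path: Path, markers: list[str],
                     timeout_s: float, poll_s: float = 0.5) -> int:
    """Return 0 once any marker appears in the log, 1 if the timeout expires first."""
    deadline = time.monotonic() + timeout_s
    while True:
        # The log may not exist yet while QEMU is still starting up.
        if log_path.exists():
            text = log_path.read_text(errors="replace")
            if any(marker in text for marker in markers):
                return 0
        if time.monotonic() >= deadline:
            return 1
        time.sleep(poll_s)
```

Re-reading the whole file on each poll is wasteful for large logs but keeps the sketch short; a real script might tail the file instead.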
Adds operational sections that weren't in the original doc, from the ms-01 and qemu end-to-end runs:
- Host prerequisites (qemu, swtpm, ovmf, socat, python3-yaml, git-lfs) with a pointer to the no-install hard rule.
- Hidden build input: pkg/lps/images/*.qcow2 is git-lfs and not materialized by setup.py --setup; surface the foot-gun and where to source the VyOS qcow2 before booting.
- Image kind per profile table (qemu -> live, others -> installer) and a note that dock-run.sh can't drive full builds (no docker socket).
- Booting under QEMU: make run-live invocation, headless serial-capture pattern, .qemu-runs/<ts>/ convention.
- Inspecting a running EVE: SSH-in cheat sheet (EVE host + jumphost to inner guest VM), eve CLI commands, DomainStatus/DomainMetric/cidata pubsub paths, sample Monitor grep alternation.
- Cache / dist layout: eve/dist -> ../cache/<rev>/dist symlink explained and how to force a clean rebuild without nuking the cache.
- Known qemu-profile gaps (VyOS LAN bridge link-down, ssh_key not propagated to VyOS cloud-init) as explicit TODOs.
- Clarifies that two patches matching grep 'profiles/' carry intentional string mentions of the path (not diff headers), so the "no patch touches profiles/" rule is about diff headers, not raw grep.
- File map gains cache-setup.sh symlink note, .qemu-runs/ entry, and tools/qemu/boot.sh listing.
- Hard rule added: never autonomously install host packages, ask first.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
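The difference between "a patch mentions profiles/" and "a patch touches profiles/" can be made concrete by looking only at `diff --git` header lines. The sketch below is an assumption about how such a check might look, not the code in tools/patches.py; it also ignores paths that git quotes because they contain spaces.

```python
import re

# Matches the header git format-patch writes for each changed file.
DIFF_HEADER_RE = re.compile(r"^diff --git a/(\S+) b/(\S+)$", re.MULTILINE)


def touched_paths(patch_text: str) -> set[str]:
    """Collect old and new paths from diff headers only, ignoring patch bodies."""
    paths: set[str] = set()
    for old, new in DIFF_HEADER_RE.findall(patch_text):
        paths.update((old, new))
    return paths


def touches_prefix(patch_text: str, prefix: str = "profiles/") -> bool:
    """True only when a diff header names a path under the prefix."""
    return any(path.startswith(prefix) for path in touched_paths(patch_text))
```

A patch whose added lines merely contain the string `profiles/` passes this check, whereas a raw grep would flag it.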
Scaffolding to produce the VyOS qcow2 that the EVE LPS package needs to bundle. Mirrors the eve patches-on-upstream shape but for vyos-build:
- services/vyos/config.yml: pins vyos-build repo/branch/(revision), docker image, build command, and output naming.
- services/vyos/patches/0001-add-qcow2-cloudinit-build-flavor.patch: adds a custom build flavor to vyos-build that emits a qcow2 image with cloud-init, VPP/DPDK/XDP, and admin tools.
- build-vyos.sh: top-level orchestrator. Clones vyos-build/, resets to the pinned revision, applies the patches, runs the build inside a --privileged docker container, then (with --install) drops the qcow2 into eve/pkg/lps/images/ so the next 'make pkg/lps' bundles it.
services/ (vs tools/) is the right home for runtime VM components.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Match the freshly-built VyOS qcow2 from the qcow2-cloudinit build flavor produced by tools/vyos/patches/0001-add-qcow2-cloudinit-build-flavor.patch.
Old image: 744620032 bytes, sha 2a61b736...
New image: 805044224 bytes, sha 8eadff0c...
LPS validates these against the bundled file before serving, so the profile must move with each rebuild of the qcow2.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
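Because the profile's size and sha256 must match the bundled file, it can help to compute both in one pass before editing the profile. The following is a sketch with hypothetical helper names; it does not reproduce how LPS itself performs the validation.

```python
import hashlib
from pathlib import Path


def image_fingerprint(path: Path, chunk_size: int = 1 << 20) -> tuple[str, int]:
    """Return (sha256 hex digest, size in bytes), streaming the file in chunks."""
    digest = hashlib.sha256()
    size = 0
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def matches_profile(path: Path, expected_sha256: str, expected_size: int) -> bool:
    """True when both the digest and the byte count agree with the profile values."""
    sha256, size = image_fingerprint(path)
    return sha256 == expected_sha256.lower() and size == expected_size
```

Streaming in chunks avoids loading a multi-hundred-megabyte qcow2 into memory.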
Two clarifications from the qcow2 rebuild + reboot session:
- The qcow2 swap round-trip: every new VyOS image build produces a different artifact, so profiles/qemu.yml (image.sha256, image.size) and eve/pkg/lps/images/ must move together before rebuilding pkg/lps and the EVE image. Documented as a step-by-step recipe under the hidden-build-input section.
- `eve app console` is a wrapper around `tio`, not xenconsole. It insists on a real tty for stdin (non-interactive attaches fail with "Saving current stdin settings failed") and the detach escape is Ctrl-t q, not Ctrl-]. The agent inspection section now spells this out.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Wire hw-id binding for every VIF (not just LMP) and enforce EnforceNetworkInterfaceOrder so the guest kernel ifname order matches the cidata template regardless of driver probe order.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Add a subsection under "Cache / dist layout" describing the
linuxkit-builder anonymous-volume gotcha:
- The buildkit cache for EVE pkg builds lives in
/var/lib/buildkit inside the `linuxkit-builder` container, mounted
from an anonymous Docker volume.
- After a kill (Ctrl-C, OOM, disk-full) the container is left in
Exited state and the volume can hold 40-50 GB of half-written
layers.
- That volume is invisible to `docker buildx du` (separate buildkit
instance) and is not reclaimed by `docker system prune` /
`docker volume prune -f` while the container references it.
- Recovery: `docker rm -v linuxkit-builder && docker volume prune -f
&& docker image prune -af`.
Discovered while debugging the cloud-init Configuration error under
task #27; without this note the disk pressure during build iteration
was hard to diagnose for anyone not already familiar with the
container's layout.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Two cidata-related patches that together resolve the symptom seen on
every fresh boot of the bundled small-vyos-ssh-nat image:
vyos-config[NNNN]: Configuration error
Patch 0023 (gen-loc-config: drop hw-id cidata directives):
Stops emitting `set interfaces ethernet <iface> hw-id '<mac>'` from
the generated cidata. With EnforceNetworkInterfaceOrder=true and
VyOS's `net.ifnames=0 biosdevname=0` cmdline already in place, hw-id
added nothing useful and risked drift if the VM was re-MAC'd.
Patch 0024 (gen-loc-config: delete baked-in ethernet on LAN bridge members):
Identifies the actual root cause of the Configuration error: the
stock small-vyos-ssh-nat.qcow2 image ships with `interfaces ethernet
eth0 { address "dhcp"; hw-id ...; mtu ... }` already declared. When
the cidata then makes eth0 a member of bridge br0, VyOS's
verify_address (configverify.py:227) rejects the candidate config:
Cannot assign address to interface "eth0" as it is a member of
bridge "br0"!
The fix emits `delete interfaces ethernet <iface>` for each LAN
bridge member before any of the bridge `set` lines, wiping the
conflicting baked-in nodes. `delete` on a non-existent path is a
non-fatal warning in VyOS so this remains safe if a future image
drop ships without the offending node.
The patch also corrects the comment block that previously blamed
hw-id for the bridge-member commit failure.
Verified by rebuilding pkg/pillar (no FORCE_BUILD), regenerating the
live image, and booting EVE three times — once from scratch, once with
the same persistent VyOS volume, once with the VyOS volume wiped and
re-seeded so cidata applies on a true first boot. All three boots:
- VyOS reaches RUNNING; no "Configuration error" in /persist/newlog
- br0 = 192.168.33.1/24; admin / yourpassword logs in
- interface order is byte-identical across boots
(eth0 02:16:3e:f0:c1:9b @ PCI 03:00.0,
eth1 02:16:3e:7a:2c:58 @ PCI 04:00.0)
Refs olg debug task #27.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
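The ordering that patch 0024 relies on (every `delete` before any bridge `set`) can be illustrated as follows. This is a Python sketch, whereas gen-loc-config lives in the Go pillar tree; the function name and the exact `set` lines are assumptions, and only the `delete interfaces ethernet <iface>` form is taken from the commit message.

```python
def bridge_commands(bridge: str, members: list[str], address: str) -> list[str]:
    """Emit VyOS config commands with all deletes ahead of the bridge set lines."""
    # Wipe any baked-in ethernet node first so a leftover 'address dhcp'
    # cannot conflict with bridge membership at commit time.
    deletes = [f"delete interfaces ethernet {iface}" for iface in members]
    sets = [f"set interfaces bridge {bridge} address '{address}'"]
    sets += [f"set interfaces bridge {bridge} member interface {iface}" for iface in members]
    return deletes + sets
```

For example, `bridge_commands("br0", ["eth0"], "192.168.33.1/24")` yields the delete for eth0 followed by the two bridge lines.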
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Add full PCIe-passthrough support to the qemu profile:
- Pillar config: SystemAdapterList with MgmtUplink (Uplink=true with DHCP), LMP (Uplink=false), each switch NI port (Uplink=false); PhysicalIO entries with appropriate PhyIoMemberUsage.
- gen-loc-config: emit deterministic ethN naming via cloud-boothook MAC-based rename, and emit `mac` directive alongside `hw-id` so VyOS's vyos-interface-rescan / interfaces_ethernet.apply() pipeline doesn't corrupt the MAC on a stale udev cache miss.
- profiles/qemu.yml: declare two e1000e passthrough interfaces at 0000:01:00.0 / 0000:02:00.0 with MACs 52:54:00:e1:00:00 / :01.
- eve/Makefile: optional QEMU_OPTS_e1000e block, gated by `E1000E=1`, that wires two e1000e devices behind dedicated pcie-root-ports with matching MACs.
- CLAUDE.md: document that `make E1000E=1 run-live` is required for the qemu profile; without it, VyOS goes BROKEN at launch with AdaptersFailed=true / driver_override failure.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Two unrelated changes bundled here because they're both about keeping the qemu profile in sync with the current working state:
1. Image sha256 + size updated to match the freshly-built small-vyos-ssh-nat.qcow2 produced by the current build-vyos.sh pipeline (kernel 6.18.31, vyos-build pin 11a3b4bd). LPS validates both fields against what it serves; on a mismatch it refuses to serve the image.
2. LAN subnet changed from 192.168.1.0/24 to 192.168.33.0/24. The choice of /24 matters: the outer QEMU's eth0 user-net is on 192.168.1.0/24 (where EVE gets the mgmt 192.168.1.10 and the QEMU NAT host is 192.168.1.2). If the inner VyOS LAN bridge also lives on 192.168.1.0/24, one of two things goes wrong: (a) inner eth0 collides with the outer NAT, or (b) when cloud-init's vyos_config_commands fails to commit (the original task #27 bug), VyOS's fallback config uses DHCP on eth0 and grabs the outer 192.168.1.11, silently masking the cidata failure. Picking a disjoint /24 (192.168.33.0/24) keeps the layers cleanly separated and makes failures obvious.
Both changes are inline edits to profiles/qemu.yml; no patches/ regeneration needed (profiles/ lives on main, never inside a patch).
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
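The subnet collision described above is easy to detect mechanically. A minimal sketch using the standard `ipaddress` module follows; the layer names in the usage note are illustrative labels, not keys from profiles/qemu.yml.

```python
import ipaddress


def overlapping_pairs(subnets: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named subnets whose address ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]
```

With `{"outer-slirp": "192.168.1.0/24", "vyos-lan": "192.168.1.0/24"}` the function reports the pair; moving `vyos-lan` to `192.168.33.0/24` returns an empty list.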
Drops the eth0 type:switch interface that previously let VyOS leech off EVE's slirp NAT (single-DHCP-client-per-slirp meant only one of EVE or VyOS could egress at a time). Adds a third PCIe entry, pt_e1k_wan, backed by patches/0023's new outer-QEMU e1000e+slirp on 192.168.5.0/24, and points vyos.wan at it (eth3 inside the VM). EVE keeps mgmt_uplink: eth0 untouched, so EVE retains its own 192.168.1.10 slirp + 2222 hostfwd for ssh.
Now both EVE and VyOS have their own NAT'd internet via independent slirp instances, unblocking end-to-end testing of "VyOS shares internet to EVE".
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
After verifying the two-bridge CPlane/LAN design end-to-end under
the qemu profile, this rolls the work into the patch series:
0024 pkg/lps: bring up LMP bridge with deterministic DHCP client-id
Absorbs four fixups from in-flight iteration: bind
/run/zedrouter into the LPS container with raw/admin caps so
udhcpc can run; require the LMP bridge to be an actual Linux
bridge AND match the bn<N> name convention so we skip
nireconciler's transient eth1-bridge intermediate state; outer
retry loop so udhcpc tolerates kea coming up after cc_vyos_userdata.
0025 gen-loc-config: split LMP cplane from data-plane LAN bridges
Absorbs two fixups: bridge names constrained to br[0-9]+ (VyOS
rejects brcp/brlan at commit); 'masquerade' must be quoted +
'duid' (not 'identifier') for static-mapping + emit runcmd
with the load/commit/save vbash incantation that activates
the staged config.boot (paired with the vyos-cloud-init patch
in olg-eve/vyos-builder-patches/).
0026 pkg/lps: gitignore the bundled VyOS qcow2 so linuxkit doesn't
tag dirty.
0027 Makefile: drop slirp from LMP eth1, use isolated mcast socket.
Removes slirp's built-in DHCP server from the LMP link so
EVE's udhcpc only sees VyOS's reply.
0028 pkg/lps + gen-loc-config: switch LMP client-id to RFC 4361
DUID-EN. kea-dhcp4 only matches duid reservations against
option-61 values prefixed with 0xFF + IAID; the old '0x00 +
olg-eve' cid was a plain client-id and was unmatched by the
reservation. Both sides moved to the new wire format in lock-step.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
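The wire-format change in 0028 can be illustrated by building both option-61 payloads. This sketch follows the RFC 4361 layout (type 0xFF, 4-byte IAID, then the DUID) and the DUID-EN layout (2-byte type 2, 4-byte enterprise number, opaque identifier). The IAID, enterprise number, and identifier values used by pkg/lps are not given in the commit message, so callers supply them; the function names are hypothetical.

```python
import struct


def duid_en(enterprise_number: int, identifier: bytes) -> bytes:
    """DUID-EN: 2-byte type (2), 4-byte enterprise number, opaque identifier."""
    return struct.pack("!HI", 2, enterprise_number) + identifier


def rfc4361_client_id(iaid: int, duid: bytes) -> bytes:
    """Option-61 payload per RFC 4361: type 0xFF, 4-byte IAID, then the DUID."""
    return bytes([0xFF]) + struct.pack("!I", iaid) + duid


def legacy_client_id(name: bytes) -> bytes:
    """The old plain client-id form described above: a 0x00 type byte plus a name."""
    return bytes([0x00]) + name
```

The leading byte is what distinguishes the two forms on the wire: 0xFF marks an IAID + DUID identifier, while 0x00 marks an opaque string.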
qemu.yml: already on the new layout. Recording the resource bump
(3 vCPU / 1.5 GB) and patched qcow2 sha256/size on main so other
fresh clones get the same defaults.
ms-01.yml: was still on the single-bridge schema. Updated to match
qemu's two-bridge layout:
- mgmt_uplink: eth2 (same physical port as lmp; EVE rides VyOS via
the CPlane bridge for outbound — no dedicated mgmt NIC on ms-01)
- cplane block: same 192.168.33.0/24 as qemu so the RFC 4361 DUID-EN
client-id baked into pkg/lps doesn't need per-profile bytes
- Removed eth0 (the LMP VIF) from vyos.lan — it belongs to br0,
not br1
- Bumped to the patched VyOS qcow2 (9c366e944206...) so runcmd is
enabled and the cidata's commit fires.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
LPS' waitForLMPBridge previously matched bridges by a hardcoded bn<N> regex. That works on QEMU (where nireconciler renames the LMP bridge to bn1) but silently rejects hardware setups (ms-01) where the bridge keeps the port name (e.g. eth2), leaving the LMP bridge with no DHCP lease.
New gate: NI semantics (Activated == true, ChangeInProgress == 0, BridgeIfindex assigned, /sys/class/net/<BridgeName>/bridge exists). ChangeInProgress covers the transient eth1-as-bridge race the bn<N> regex was working around.
Verified on QEMU: EVE bn1 got 192.168.33.2 from VyOS' kea; VyOS reachable at 192.168.33.1. ms-01 verification pending.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
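The new gate is a conjunction of four conditions, none of which depends on the bridge's name. The sketch below restates it in Python for illustration; the real check is in the Go LPS code, the dataclass is a simplified stand-in for the NI status, and treating "BridgeIfindex assigned" as "greater than zero" is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class NetworkInstanceStatus:
    """Simplified stand-in for the fields the gate reads."""
    activated: bool
    change_in_progress: int
    bridge_ifindex: int
    bridge_name: str


def lmp_bridge_ready(status: NetworkInstanceStatus,
                     sysfs_root: Path = Path("/sys/class/net")) -> bool:
    """True when the NI is settled and its bridge is a real Linux bridge."""
    return (
        status.activated
        and status.change_in_progress == 0
        and status.bridge_ifindex > 0
        and bool(status.bridge_name)
        # A 'bridge' directory in sysfs exists only for actual bridge devices.
        and (sysfs_root / status.bridge_name / "bridge").is_dir()
    )
```

Because the name is never pattern-matched, a bridge called `eth2` on hardware and one called `bn1` under QEMU pass the same test.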
The MS-01's two 2.5G ports are an Intel i226-V (PCI 0000:57:00.0, no AMT) and an i226-LM (PCI 0000:58:00.0, AMT/vPro). LMP was on the i226-V, putting OOB management on a NIC that can't carry AMT traffic.
Swap LMP ↔ WAN passthrough at the profile level:
- lmp / mgmt_uplink: eth2 → eth3 (i226-LM, AMT-capable)
- WAN passthrough: eth3 → eth2 (i226-V at 0000:57:00.0/group16)
Cable convention follows: LMP cable plugs into the i226-LM port, WAN cable plugs into the i226-V port. Header comments and the inside-VyOS port table updated to match.
Verified on a built ms-01 installer.raw: loc-config.bin binds vyos-lmp to eth3 and the WAN passthrough to PCI 57:00.0 / group16.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Real fix for cable-less first boot. Fix G (lmp-keep-up dummy on the LMP bridge) operates downstream of the gate; Fix H removes the gate itself by short-circuiting the uplink-based connectivity test in dpcmanager.verifyDPC when /config/server is localhost.
With this in place, domainmgr's 'Waiting for AssignableAdapters, DPC with management ports' loop releases on first DPC verify, zedrouter creates the LMP bridge, VyOS launches, and Fix G's dummy keeps the bridge oper-up while real cables come and go.
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Adds a 'Flashing EVE to hardware' section between QEMU and the
running-EVE introspection guidance, covering:
- which build artifacts matter (installer.raw vs installer/rootfs.img)
- safe USB flash with dd (device id, unmount, bs/conv flags)
- what partition layout the installer lays down (2 GB IMGA + 2 GB IMGB
after patch 0030)
- the still-unscripted SSH dd-to-IMGB upgrade flow for Fix F
- first-boot timing expectations on ms-01 with the current patch
series (Fix A + G + H + IMGB-2GB + profile swap)
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
No description provided.