OLG-nos-server shasta proposal#1

Open: mateusz-bajorski-shasta wants to merge 26 commits into main from shasta-proposal

No description provided.

Top-level orchestration:
- Makefile invokes the Dockerized wrapper per TARGET
- dock-run.sh builds the toolchain image (docker/Dockerfile) and runs
  build.sh inside with the host UID and gitconfig mounted
- build.sh handles target-specific patches and the eve build invocation
- cache-setup.sh symlinks a per-revision dist cache
- .gitignore excludes the populated eve/ checkout and build artifacts

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
config.yml pins lf-edge/eve at bf6e6e259f18a3d87efa0493fbbbddcdd31e4359
and lists the patch folders to apply.

setup.py exposes three modes:
- --setup: fresh clone + reset to pinned revision + git am all patches +
  symlink profiles/ from repo into eve/profiles/
- --rebase: fetch + reset + reapply patches (for upstream bumps)
- --update: regenerate patches/ via git format-patch from the eve checkout
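
A minimal sketch of how the three modes could be exposed as mutually exclusive flags. The parser layout and help strings are assumptions for illustration; the actual setup.py may be structured differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI: exactly one of --setup / --rebase / --update."""
    parser = argparse.ArgumentParser(prog="setup.py")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--setup", action="store_true",
                      help="fresh clone, reset to pinned revision, apply patches")
    mode.add_argument("--rebase", action="store_true",
                      help="fetch, reset, reapply patches")
    mode.add_argument("--update", action="store_true",
                      help="regenerate patches/ from the eve checkout")
    return parser


def selected_mode(argv: list[str]) -> str:
    """Return the name of the single mode flag present in argv."""
    args = build_parser().parse_args(argv)
    for name in ("setup", "rebase", "update"):
        if getattr(args, name):
            return name
    raise AssertionError("unreachable: the group is required")
```

Making the group required means that passing two modes, or none, is rejected by argparse before any git operation runs.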

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Series covers:
- pkg/lps: Local Profile Server package
- pkg/pillar/cmd/gen-loc-config: LOC config generator with VyOS support,
  cloud-init, EnforceNetworkInterfaceOrder, deterministic LMP NIC naming
- zedagent: LOC bootstrap fixes, pre-provisioned UUID handling, airgap
  TLS handling, parseEdgeNodeInfo nil panic, UUID mismatch fix
- wait: bypass controller onboarding when /config/device-uuid is present
- downloader: localhost/loopback datastore handling
- dpcmanager: exclude PhyIoUsageDedicated ports from last-resort DPC
- Makefile / build.sh: rootfs sizing, loc config generation, ms-01 entry

Extracted via 'git format-patch bf6e6e259..origin/lps -o patches/' with
profile-only commits dropped (profile content lives on main under
profiles/, not in patches).

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
profiles/ms-01.yml — MS-01 mini PC with VyOS in PCIe passthrough
profiles/qemu.yml  — QEMU testing with switch network instance

Profile YAMLs ship as repo content rather than inside patches; setup.py
symlinks eve/profiles -> repo/profiles so edits take effect without
regenerating patches.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Covers native and Dockerized builds, the branching model (main on
patches-on-upstream, lps as legacy), and the repository structure.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Tooling so AI agents can contribute patches to this series safely:

- CLAUDE.md at repo root documents the patches-on-upstream model and the
  edit-in-eve / --update / verify loop. Auto-loaded in any session.
- tools/patches.py library parses the series headers; CLI wrappers expose
  "which patches touch path X" and a patches/README.md generator.
- tools/check-commit-msg.sh enforces "<scope>: <subject>" on patch
  commits, installable into eve/.git/hooks via tools/install-eve-hooks.sh
- tools/verify.sh fresh-applies all patches in a sandbox dir and checks
  patches/README.md is current and no patch touches profiles/. Wired into
  .github/workflows/verify.yml on PRs to main.
- .claude/commands/olg-patch-{add,edit}.md and olg-rebase-upstream.md
  encode the canonical loops for agents to follow.
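
The "<scope>: <subject>" rule can be sketched as a first-line check. This is a Python illustration, not the contents of tools/check-commit-msg.sh; the set of characters allowed in a scope is an assumption.

```python
import re

# Assumed shape: a scope made of letters, digits and / . _ + - or spaces,
# then a colon, one space, and a non-empty subject.
SUBJECT_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9/._+ -]*: \S.*$")


def subject_ok(message: str) -> bool:
    """Check only the first line of the commit message."""
    lines = message.splitlines()
    return bool(lines) and SUBJECT_RE.match(lines[0]) is not None
```

For example, `subject_ok("gen-loc-config: drop hw-id cidata directives")` passes, while a subject without a scope prefix does not.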

setup.py: fix precedence so -d/--directory beats config.yml output_dir
(needed by verify.sh to use a sandbox dir).
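
The intended precedence can be sketched as follows. The function name and the `eve` fallback are hypothetical; only the ordering (command line first, then config.yml's output_dir) comes from the description above.

```python
from typing import Optional

DEFAULT_DIR = "eve"  # assumed fallback when neither source sets a directory


def resolve_output_dir(cli_dir: Optional[str], config: dict) -> str:
    """-d/--directory wins; config.yml's output_dir is only a fallback."""
    if cli_dir:
        return cli_dir
    return config.get("output_dir") or DEFAULT_DIR
```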

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
build.sh:
  Picks the eve make target per profile:
    qemu  -> make ... live       (live.raw + live.qcow2)
    *     -> make ... installer  (installer.raw)
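
The same selection, mirrored in Python for illustration (build.sh itself is a shell script; the function names here are hypothetical):

```python
def eve_make_target(profile: str) -> str:
    """qemu builds a live image; every other profile builds an installer."""
    return "live" if profile == "qemu" else "installer"


def expected_artifacts(profile: str) -> list[str]:
    """File names taken from the mapping above."""
    if eve_make_target(profile) == "live":
        return ["live.raw", "live.qcow2"]
    return ["installer.raw"]
```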

README:
  Adds image-kind-per-target table and the `cd eve && make run-live`
  entry point for booting the qemu live image. The qemu profile is
  meant to boot under QEMU directly without going through the
  installer flow.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Boots eve/dist/amd64/snapshot/installer.raw under qemu-system-x86_64
with the serial console captured to .qemu-runs/<ts>/serial.log. Exits 0
if known 'EVE is alive' markers appear in the log within the timeout,
1 otherwise. Used for fast 'did it boot at all?' smoke tests before
flashing real hardware.
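
The pass/fail logic can be sketched as a poll over the serial log. This is a Python illustration rather than the boot script itself; the marker string, timeout, and poll interval are placeholders, since the message does not list the actual markers.

```python
import time
from pathlib import Path

# Placeholder: the real script matches a list of known markers.
ALIVE_MARKERS = ("EVE is alive",)


def wait_for_markers(log_path, markers=ALIVE_MARKERS,
                     timeout_s=300.0, poll_s=1.0) -> int:
    """Return 0 once any marker appears in the log, 1 if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    path = Path(log_path)
    while True:
        if path.exists():
            text = path.read_text(errors="replace")
            if any(marker in text for marker in markers):
                return 0
        if time.monotonic() >= deadline:
            return 1
        time.sleep(poll_s)
```

Rereading the whole file on every poll is wasteful for large logs but keeps the sketch short; a serial log from a single boot is small enough for this to be acceptable.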

.gitignore: add .qemu-runs/ for the per-run log/disk output dir.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Adds operational sections that weren't in the original doc, from the
ms-01 and qemu end-to-end runs:

- Host prerequisites (qemu, swtpm, ovmf, socat, python3-yaml, git-lfs)
  with a pointer to the no-install hard rule.
- Hidden build input: pkg/lps/images/*.qcow2 is git-lfs and not
  materialized by setup.py --setup; surface the foot-gun and where to
  source the VyOS qcow2 before booting.
- Image kind per profile table (qemu -> live, others -> installer) and a
  note that dock-run.sh can't drive full builds (no docker socket).
- Booting under QEMU: make run-live invocation, headless serial-capture
  pattern, .qemu-runs/<ts>/ convention.
- Inspecting a running EVE: SSH-in cheat sheet (EVE host + jumphost to
  inner guest VM), eve CLI commands, DomainStatus/DomainMetric/cidata
  pubsub paths, sample Monitor grep alternation.
- Cache / dist layout: eve/dist -> ../cache/<rev>/dist symlink explained
  and how to force a clean rebuild without nuking the cache.
- Known qemu-profile gaps (VyOS LAN bridge link-down, ssh_key not
  propagated to VyOS cloud-init) as explicit TODOs.
- Clarifies that two patches matching grep 'profiles/' carry intentional
  string mentions of the path (not diff headers), so the "no patch
  touches profiles/" rule is about diff headers not raw grep.
- File map gains cache-setup.sh symlink note, .qemu-runs/ entry, and
  tools/qemu/boot.sh listing.
- Hard rule added: never autonomously install host packages, ask first.
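
The diff-header reading of the "no patch touches profiles/" rule can be sketched as below. This is an illustration, not the code in tools/verify.sh or tools/patches.py, and it assumes unquoted paths (git quotes paths that contain spaces or special characters).

```python
import re

# Only "diff --git a/<path> b/<path>" lines count as touching a path.
DIFF_HEADER_RE = re.compile(r"^diff --git a/(\S+) b/(\S+)$")


def touched_paths(patch_text: str) -> set[str]:
    """Collect old and new paths from the diff headers of one patch."""
    paths: set[str] = set()
    for line in patch_text.splitlines():
        match = DIFF_HEADER_RE.match(line)
        if match:
            paths.update(match.groups())
    return paths


def touches_profiles(patch_text: str) -> bool:
    return any(p.startswith("profiles/") for p in touched_paths(patch_text))
```

A patch whose body merely mentions `profiles/qemu.yml` in a comment or string is not flagged, because added and removed lines start with `+` or `-` and never match the header pattern.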

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Scaffolding to produce the VyOS qcow2 that the EVE LPS package needs to
bundle. Mirrors the eve patches-on-upstream shape but for vyos-build:

- services/vyos/config.yml — pins vyos-build repo/branch/(revision), docker
  image, build command, and output naming.
- services/vyos/patches/0001-add-qcow2-cloudinit-build-flavor.patch — adds
  a custom build flavor to vyos-build that emits a qcow2 image with
  cloud-init, VPP/DPDK/XDP, and admin tools.
- build-vyos.sh — top-level orchestrator. Clones vyos-build/, resets to
  the pinned revision, applies the patches, runs the build inside a
  --privileged docker container, then (with --install) drops the qcow2
  into eve/pkg/lps/images/ so the next 'make pkg/lps' bundles it.

services/ (vs tools/) is the right home for runtime VM components.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Match the freshly-built VyOS qcow2 from the qcow2-cloudinit build
flavor produced by services/vyos/patches/0001-add-qcow2-cloudinit-build-flavor.patch.

Old image: 744620032 bytes, sha 2a61b736...
New image: 805044224 bytes, sha 8eadff0c...

LPS validates these against the bundled file before serving, so the
profile must move with each rebuild of the qcow2.
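
A small sketch for recomputing both values after a rebuild, so the profile can be updated in the same change. The helper names are hypothetical; the check LPS performs internally may differ.

```python
import hashlib


def image_fingerprint(path) -> tuple[str, int]:
    """Return (sha256 hex digest, size in bytes), reading in 1 MiB chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def matches_profile(path, expected_sha256: str, expected_size: int) -> bool:
    """True only if both the digest and the size agree with the profile."""
    sha, size = image_fingerprint(path)
    return sha == expected_sha256.lower() and size == expected_size
```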

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Two clarifications from the qcow2 rebuild + reboot session:

- The qcow2 swap round-trip: every new VyOS image build produces a
  different artifact, so profiles/qemu.yml (image.sha256, image.size)
  and eve/pkg/lps/images/ must move together before rebuilding pkg/lps
  and the EVE image. Documented as a step-by-step recipe under the
  hidden-build-input section.

- `eve app console` is a wrapper around `tio`, not xenconsole. It
  insists on a real tty for stdin (non-interactive attaches fail with
  "Saving current stdin settings failed") and the detach escape is
  Ctrl-t q, not Ctrl-]. The agent inspection section now spells this
  out.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Wire hw-id binding for every VIF (not just LMP) and enable
EnforceNetworkInterfaceOrder so the guest kernel ifname order
matches the cidata template regardless of driver probe order.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Add a subsection under "Cache / dist layout" describing the
linuxkit-builder anonymous-volume gotcha:

  - The buildkit cache for EVE pkg builds lives in
    /var/lib/buildkit inside the `linuxkit-builder` container, mounted
    from an anonymous Docker volume.
  - After a kill (Ctrl-C, OOM, disk-full) the container is left in
    Exited state and the volume can hold 40-50 GB of half-written
    layers.
  - That volume is invisible to `docker buildx du` (separate buildkit
    instance) and is not reclaimed by `docker system prune` /
    `docker volume prune -f` while the container references it.
  - Recovery: `docker rm -v linuxkit-builder && docker volume prune -f
    && docker image prune -af`.

Discovered while debugging the cloud-init Configuration error under
task #27; without this note the disk pressure during build iteration
was hard to diagnose for anyone not already familiar with the
container's layout.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Two cidata-related patches that together resolve the symptom seen on
every fresh boot of the bundled small-vyos-ssh-nat image:

  vyos-config[NNNN]: Configuration error

Patch 0023 (gen-loc-config: drop hw-id cidata directives):
  Stops emitting `set interfaces ethernet <iface> hw-id '<mac>'` from
  the generated cidata. With EnforceNetworkInterfaceOrder=true and
  VyOS's `net.ifnames=0 biosdevname=0` cmdline already in place, hw-id
  added nothing useful and risked drift if the VM was re-MAC'd.

Patch 0024 (gen-loc-config: delete baked-in ethernet on LAN bridge members):
  Identifies the actual root cause of the Configuration error: the
  stock small-vyos-ssh-nat.qcow2 image ships with `interfaces ethernet
  eth0 { address "dhcp"; hw-id ...; mtu ... }` already declared. When
  the cidata then makes eth0 a member of bridge br0, VyOS's
  verify_address (configverify.py:227) rejects the candidate config:

    Cannot assign address to interface "eth0" as it is a member of
    bridge "br0"!

  The fix emits `delete interfaces ethernet <iface>` for each LAN
  bridge member before any of the bridge `set` lines, wiping the
  conflicting baked-in nodes. `delete` on a non-existent path is a
  non-fatal warning in VyOS so this remains safe if a future image
  drop ships without the offending node.

  The patch also corrects the comment block that previously blamed
  hw-id for the bridge-member commit failure.
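
  The ordering can be sketched as below. This is a Python illustration, not the gen-loc-config code. The `delete` line is from the patch description; the exact `set` lines are an assumption about the generated cidata.

```python
def bridge_commands(bridge: str, address: str, members: list[str]) -> list[str]:
    """Emit every delete before any set that references the same interface."""
    cmds = [f"delete interfaces ethernet {iface}" for iface in members]
    # Assumed set lines for the bridge address and membership.
    cmds.append(f"set interfaces bridge {bridge} address '{address}'")
    cmds += [
        f"set interfaces bridge {bridge} member interface {iface}"
        for iface in members
    ]
    return cmds
```

  With the deletes first, the baked-in `address "dhcp"` node is gone before eth0 becomes a bridge member, so verify_address has nothing to reject.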

Verified by rebuilding pkg/pillar (no FORCE_BUILD), regenerating the
live image, and booting EVE three times — once from scratch, once with
the same persistent VyOS volume, once with the VyOS volume wiped and
re-seeded so cidata applies on a true first boot. All three boots:
- VyOS reaches RUNNING; no "Configuration error" in /persist/newlog
- br0 = 192.168.33.1/24; admin / yourpassword logs in
- interface order is byte-identical across boots
  (eth0 02:16:3e:f0:c1:9b @ PCI 03:00.0,
   eth1 02:16:3e:7a:2c:58 @ PCI 04:00.0)

Refs olg debug task #27.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Add full PCIe-passthrough support to the qemu profile:

- Pillar config: SystemAdapterList with MgmtUplink (Uplink=true with
  DHCP), LMP (Uplink=false), each switch NI port (Uplink=false);
  PhysicalIO entries with appropriate PhyIoMemberUsage.
- gen-loc-config: emit deterministic ethN naming via cloud-boothook
  MAC-based rename, and emit `mac` directive alongside `hw-id` so
  VyOS's vyos-interface-rescan / interfaces_ethernet.apply() pipeline
  doesn't corrupt the MAC on a stale udev cache miss.
- profiles/qemu.yml: declare two e1000e passthrough interfaces at
  0000:01:00.0 / 0000:02:00.0 with MACs 52:54:00:e1:00:00 / :01.
- eve/Makefile: optional QEMU_OPTS_e1000e block, gated by `E1000E=1`,
  that wires two e1000e devices behind dedicated pcie-root-ports
  with matching MACs.
- CLAUDE.md: document that `make E1000E=1 run-live` is required for
  the qemu profile - without it, VyOS goes BROKEN at launch with
  AdaptersFailed=true / driver_override failure.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Two unrelated changes bundled here because they're both about keeping
the qemu profile in sync with the current working state:

1. Image sha256 + size updated to match the freshly-built
   small-vyos-ssh-nat.qcow2 produced by the current build-vyos.sh
   pipeline (kernel 6.18.31, vyos-build pin 11a3b4bd). LPS validates
   both fields against what it serves; a mismatch refuses the serve.

2. LAN subnet changed from 192.168.1.0/24 → 192.168.33.0/24. The subnet
   choice matters: the outer QEMU's eth0 user-net is on 192.168.1.0/24
   (where EVE gets the mgmt 192.168.1.10 and the QEMU NAT host is
   192.168.1.2). If the inner VyOS LAN bridge also lives on
   192.168.1.0/24, one of two things goes wrong: (a) inner eth0 collides
   with the outer NAT, or (b) when cloud-init's vyos_config_commands
   fails to commit (the original task #27 bug), VyOS's fallback config
   uses DHCP on eth0 and grabs the outer 192.168.1.11, silently
   masking the cidata failure. Picking a disjoint /24 (192.168.33.0/24)
   keeps the layers cleanly separated and makes failures obvious.
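
A quick way to check that the layers stay disjoint, using only the standard library. The layer names are labels chosen for this sketch.

```python
import ipaddress


def overlapping_pairs(subnets: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named subnets whose address ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]
```

With the old layout, `{"outer-slirp": "192.168.1.0/24", "vyos-lan": "192.168.1.0/24"}` reports one overlapping pair; with the LAN moved to 192.168.33.0/24 the list is empty.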

Both changes are inline edits to profiles/qemu.yml; no patches/
regeneration needed (profiles/ lives on main, never inside a patch).

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Drops the eth0 type:switch interface that previously let VyOS leech
EVE's slirp NAT (single-DHCP-client-per-slirp meant only one of EVE or
VyOS could egress at a time). Adds a third PCIe entry pt_e1k_wan
backed by patches/0023's new outer-QEMU e1000e+slirp on
192.168.5.0/24, and points vyos.wan at it (eth3 inside the VM).

EVE keeps mgmt_uplink: eth0 untouched, so EVE retains its own
192.168.1.10 slirp + 2222 hostfwd for ssh. Now both EVE and VyOS
have their own NAT'd internet via independent slirp instances,
unblocking end-to-end testing of "VyOS shares internet to EVE".

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
After verifying the two-bridge CPlane/LAN design end-to-end under
the qemu profile, this rolls the work into the patch series:

  0024  pkg/lps: bring up LMP bridge with deterministic DHCP client-id
        Absorbs four fixups from in-flight iteration: bind
        /run/zedrouter into the LPS container with raw/admin caps so
        udhcpc can run; require the LMP bridge to be an actual Linux
        bridge AND match the bn<N> name convention so we skip
        nireconciler's transient eth1-bridge intermediate state; outer
        retry loop so udhcpc tolerates kea coming up after cc_vyos_userdata.

  0025  gen-loc-config: split LMP cplane from data-plane LAN bridges
        Absorbs two fixups: bridge names constrained to br[0-9]+ (VyOS
        rejects brcp/brlan at commit); 'masquerade' must be quoted +
        'duid' (not 'identifier') for static-mapping + emit runcmd
        with the load/commit/save vbash incantation that activates
        the staged config.boot (paired with the vyos-cloud-init patch
        in olg-eve/vyos-builder-patches/).

  0026  pkg/lps: gitignore the bundled VyOS qcow2 so linuxkit doesn't
        tag dirty.

  0027  Makefile: drop slirp from LMP eth1, use isolated mcast socket.
        Removes slirp's built-in DHCP server from the LMP link so
        EVE's udhcpc only sees VyOS's reply.

  0028  pkg/lps + gen-loc-config: switch LMP client-id to RFC 4361
        DUID-EN. kea-dhcp4 only matches duid reservations against
        option-61 values prefixed with 0xFF + IAID; the old '0x00 +
        olg-eve' cid was a plain client-id and was unmatched by the
        reservation. Both sides moved to the new wire format in lock-step.
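
The wire format can be sketched as below, following RFC 4361 (option-61 type 255, a 4-byte IAID, then the DUID) with a DUID-EN body (2-byte type 2, 4-byte enterprise number, opaque identifier). The IAID, enterprise number, and identifier in the usage note are placeholders, not the bytes baked into pkg/lps.

```python
import struct


def duid_en_client_id(iaid: int, enterprise_number: int,
                      identifier: bytes) -> bytes:
    """Build the DHCPv4 option-61 payload for an RFC 4361 DUID-EN client-id."""
    # DUID-EN: 2-byte type (2), 4-byte enterprise number, then the identifier.
    duid = struct.pack("!HI", 2, enterprise_number) + identifier
    # RFC 4361: type octet 0xFF, 4-byte IAID, then the DUID.
    return bytes([0xFF]) + struct.pack("!I", iaid) + duid
```

Calling `duid_en_client_id(1, 99999, b"olg-eve")` yields a payload whose first byte is 0xFF, which a plain client-id such as the old `0x00 + olg-eve` form lacks.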

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
qemu.yml: already on the new layout. Recording the resource bump
(3 vCPU / 1.5 GB) and patched qcow2 sha256/size on main so other
fresh clones get the same defaults.

ms-01.yml: was still on the single-bridge schema. Updated to match
qemu's two-bridge layout:
  - mgmt_uplink: eth2 (same physical port as lmp; EVE rides VyOS via
    the CPlane bridge for outbound — no dedicated mgmt NIC on ms-01)
  - cplane block: same 192.168.33.0/24 as qemu so the RFC 4361 DUID-EN
    client-id baked into pkg/lps doesn't need per-profile bytes
  - Removed eth0 (the LMP VIF) from vyos.lan — it belongs to br0,
    not br1
  - Bumped to the patched VyOS qcow2 (9c366e944206...) so runcmd is
    enabled and the cidata's commit fires.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
LPS' waitForLMPBridge previously matched bridges by a hardcoded bn<N>
regex. That works on QEMU (where nireconciler renames the LMP bridge
to bn1) but silently rejects hardware setups (ms-01) where the bridge
keeps the port name (e.g. eth2), leaving the LMP bridge with no DHCP
lease.

New gate: NI semantics (Activated == true, ChangeInProgress == 0,
BridgeIfindex assigned, /sys/class/net/<BridgeName>/bridge exists).
ChangeInProgress covers the transient eth1-as-bridge race the bn<N>
regex was working around.
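
The new gate can be sketched as a single predicate. This is a Python illustration; `NIStatus` is a hypothetical stand-in for the network-instance status LPS reads, and "assigned" is assumed to mean a non-zero ifindex.

```python
import os
from dataclasses import dataclass


@dataclass
class NIStatus:
    """Hypothetical stand-in for the fields the gate inspects."""
    activated: bool
    change_in_progress: int
    bridge_ifindex: int
    bridge_name: str


def lmp_bridge_ready(ni: NIStatus, sysfs_root: str = "/sys/class/net") -> bool:
    """All four conditions must hold; the bridge name itself is not inspected."""
    return (
        ni.activated
        and ni.change_in_progress == 0
        and ni.bridge_ifindex > 0  # assumption: 0 means "not yet assigned"
        and os.path.isdir(os.path.join(sysfs_root, ni.bridge_name, "bridge"))
    )
```

Because the name is never matched against a pattern, a bridge called eth2 on ms-01 passes the same gate as bn1 on QEMU.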

Verified on QEMU: EVE bn1 got 192.168.33.2 from VyOS' kea; VyOS
reachable at 192.168.33.1. ms-01 verification pending.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
The MS-01's two 2.5G ports are an Intel i226-V (PCI 0000:57:00.0,
no AMT) and an i226-LM (PCI 0000:58:00.0, AMT/vPro). LMP was on the
i226-V, putting OOB management on a NIC that can't carry AMT
traffic.

Swap LMP ↔ WAN passthrough at the profile level:
  - lmp / mgmt_uplink: eth2 → eth3  (i226-LM, AMT-capable)
  - WAN passthrough:   eth3 → eth2  (i226-V at 0000:57:00.0/group16)

Cable convention follows: LMP cable plugs into the i226-LM port,
WAN cable plugs into the i226-V port. Header comments and the
inside-VyOS port table updated to match.

Built ms-01 installer.raw verified — loc-config.bin binds vyos-lmp
to eth3 and the WAN passthrough to PCI 57:00.0 / group16.

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Real fix for cable-less first boot. Fix G (lmp-keep-up dummy on
the LMP bridge) operates downstream of the gate; Fix H removes
the gate itself by short-circuiting the uplink-based connectivity
test in dpcmanager.verifyDPC when /config/server is localhost.

With this in place, domainmgr's 'Waiting for AssignableAdapters,
DPC with management ports' loop releases on first DPC verify,
zedrouter creates the LMP bridge, VyOS launches, and Fix G's
dummy keeps the bridge oper-up while real cables come and go.
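
The localhost test that drives the short-circuit can be sketched as below. This is a Python illustration, not the dpcmanager code; treating loopback IP addresses and an optional port suffix the same as the literal `localhost` is an assumption.

```python
import ipaddress


def server_is_local(server: str) -> bool:
    """True if the configured controller address points at this host."""
    host = server.strip()
    if host.startswith("["):            # bracketed IPv6 with port, e.g. [::1]:443
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:          # host:port
        host = host.split(":", 1)[0]
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:                  # a non-local hostname
        return False
```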

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>
Adds a 'Flashing EVE to hardware' section between QEMU and the
running-EVE introspection guidance, covering:

  - which build artifacts matter (installer.raw vs installer/rootfs.img)
  - safe USB flash with dd (device id, unmount, bs/conv flags)
  - what partition layout the installer lays down (2 GB IMGA + 2 GB IMGB
    after patch 0030)
  - the still-unscripted SSH dd-to-IMGB upgrade flow for Fix F
  - first-boot timing expectations on ms-01 with the current patch
    series (Fix A + G + H + IMGB-2GB + profile swap)

Signed-off-by: Mateusz Bajorski <mbajorski@shasta.cloud>