Commit efc8cb5

refactor: replace galactic-agent with galactic-router (controller-runtime)

Replace the gRPC-based galactic-agent DaemonSet with a controller-runtime based galactic-router. Key changes:
- Remove internal/agent, internal/bootstrap, internal/gobgp packages
- Add internal/controller with BGPRouter, BGPPeer, BGPAdvertisement, BGPPolicy, Secret, Node reconcilers
- Add internal/reconcile for CRD-to-DesiredRouter translation
- Add internal/runtime with RuntimeFactory pattern (GoBGP tenant, FRR fabric stub)
- Add internal/model for internal BGP types and internal/hash for change detection
- Update deployment manifests, Dockerfile, containerlab config, and docs
- Switch health probes to gRPC on port 5000; remove HTTP health and webhook ports
- GoBGP starts lazily on first BGPRouter reconcile (listenPort=-1, outbound-only)
- Hash-based no-op suppression prevents redundant GoBGP Apply calls

1 parent 9a26bb8 commit efc8cb5

58 files changed

Lines changed: 3157 additions & 1436 deletions

.devcontainer/galactic/README.md

Lines changed: 1 addition & 2 deletions
@@ -62,8 +62,7 @@ The devcontainer includes the following extensions:

 ### Forwarded Ports
 - **8080** - Metrics endpoint
-- **8081** - Health check endpoint
-- **9443** - Webhook server
+- **5000** - gRPC health endpoint (liveness/readiness probes)

 ## Capabilities

.devcontainer/galactic/devcontainer.json

Lines changed: 3 additions & 7 deletions
@@ -107,18 +107,14 @@
 }
 }
 },
-"forwardPorts": [8080, 8081, 9443],
+"forwardPorts": [8080, 5000],
 "portsAttributes": {
 "8080": {
 "label": "Metrics",
 "onAutoForward": "silent"
 },
-"8081": {
-"label": "Health",
-"onAutoForward": "silent"
-},
-"9443": {
-"label": "Webhook",
+"5000": {
+"label": "gRPC Health",
 "onAutoForward": "silent"
 }
 },

AGENTS.md

Lines changed: 18 additions & 14 deletions
@@ -6,33 +6,36 @@ See [ARCHITECTURE.md](ARCHITECTURE.md) for a full architecture reference includi

 ## Purpose & Architecture

-Galactic is the SRv6 data plane for multi-cloud VPC networking. It consists of a DaemonSet agent (`internal/agent/`) that manages kernel SRv6 routes and VRFs per node, and a CNI plugin (`internal/cni/`) that wires containers into VPC networks. VPC and VPCAttachment CRD management lives in a separate operator project; Galactic receives pre-populated identifiers through the CNI config and acts on them. BGP is used as the control plane for distributing SRv6 routes between agents.
+Galactic is the SRv6 data plane for multi-cloud VPC networking. It consists of a controller-runtime reconciler (`cmd/galactic-router/`) that watches Cosmos BGP CRDs and drives an embedded GoBGP server per node, and a CNI plugin (`internal/cni/`) that wires containers into VPC networks. VPC and VPCAttachment CRD management lives in a separate operator project; Galactic receives pre-populated identifiers through the CNI config and acts on them.

-**Data flow:** CNI invoked with pre-populated VPC/VPCAttachment identifiers → gRPC registers endpoint with agent → agent manages SRv6 ingress routes locally → BGP distributes SRv6 routes between agents.
+**Data flow:** CNI invoked with pre-populated VPC/VPCAttachment identifiers → CNI creates kernel SRv6 state (VRF, veth, ingress route) and writes a `BGPAdvertisement` CRD → `galactic-router` reconciles the CRD → GoBGP advertises the EVPN path → BGP distributes routes between nodes.

 **Non-obvious decisions:**
 - VPC identifiers are 48-bit hex; VPCAttachment identifiers are 16-bit hex. These are embedded into IPv6 SRv6 endpoint addresses for deterministic route lookups. Both are supplied by an external operator via the CNI config.
 - Identifiers are also Base62-encoded for interface naming (VRF: `vrfX-Y`, veth host side: `galX-Y`) to keep kernel interface name length within limits.
-- `galactic-cni` is a pure CNI plugin; `main()` calls `cni.RunPlugin()` directly with no CLI layer. `galactic-agent` uses flag parsing for its configuration flags.
+- `galactic-cni` is a pure CNI plugin; `main()` calls `cni.RunPlugin()` directly with no CLI layer. `galactic-router` uses environment variables (`NODE_NAME`, `ROUTER_ROLE`) for its configuration.
 - The Kubernetes operator, VPC/VPCAttachment CRDs, and webhook code have been removed from this repository. They live in a separate companion operator project.
+- GoBGP starts lazily on the first `BGPRouter` reconcile (`listenPort=-1`, outbound-only). ASN or RouterID changes trigger a full `Reconfigure`.
+- Liveness and readiness probes use the gRPC health protocol on port 5000. There is no HTTP health endpoint.

 ## Tech Stack

-- **Go 1.26** — agent and CNI plugin
+- **Go 1.26** — router and CNI plugin
+- **controller-runtime** — BGPRouter/BGPPeer/BGPAdvertisement/BGPPolicy reconcilers
+- **Cosmos BGP API** (`bgp.miloapis.com/v1alpha1`) — BGPRouter, BGPPeer, BGPAdvertisement, BGPPolicy CRDs
 - **Multus CNI** — multi-network for pods; NAD generation is handled by the external operator
-- **gRPC + protobuf** — CNI-to-agent local communication
 - **SRv6 + netlink** — kernel-level routing; `github.com/vishvananda/netlink`
-- **BGP** — control plane for SRv6 route distribution between agents (in progress)
+- **GoBGP v4** — embedded BGP server for the tenant role

 ## Development Workflow

 ```
-task build # produces bin/galactic
+task build # produces bin/galactic-cni and bin/galactic-router
 task test # runs test:unit then test:e2e
 task test:unit # unit tests with race detection
 task test:e2e # Kind cluster lifecycle test
 task lint # golangci-lint; lint-fix applies safe auto-fixes
-task run-agent # run agent (requires root / CAP_NET_ADMIN)
+task run-router # run galactic-router (requires root / CAP_NET_ADMIN)
 ```

 **Before every PR:** `task lint test`.
@@ -47,13 +50,14 @@ Summary:

 ## Deployments

-- **`deploy/galactic-agent/`** — Kustomize manifests for the agent DaemonSet, RBAC, and ServiceAccount. Apply with `kubectl apply -k deploy/galactic-agent/`.
-- **`deploy/containerlab/`** — ContainerLab topology (`gvpc.clab.yaml`) for three Kind clusters (iad, sjc, infra) wired over an IPv6 SRv6 transit mesh. FRR runs as a hostNetwork DaemonSet on each worker for eBGP underlay; GoBGP handles L3VPN type-5 routes over iBGP to the infra route reflector. See `deploy/containerlab/README.md` and `deploy/containerlab/Taskfile.yaml` for bring-up commands.
+- **`deploy/galactic-router/`** — Kustomize manifests for the router DaemonSet, RBAC, and ServiceAccount. Apply with `kubectl apply -k deploy/galactic-router/`.
+- **`deploy/containerlab/`** — ContainerLab topology (`gvpc.clab.yaml`) for three Kind clusters (iad, sjc, infra) wired over an IPv6 SRv6 transit mesh. FRR runs as a hostNetwork DaemonSet on each worker for eBGP underlay; `galactic-router` (tenant role) handles EVPN path distribution over iBGP. See `deploy/containerlab/README.md` and `deploy/containerlab/Taskfile.yaml` for bring-up commands.

 ## New Developer Entry Points

 1. Run `task build` to verify toolchain; run `task test` to confirm unit tests pass.
-2. Read `internal/cni/cni.go` (cmdAdd/cmdDel) to understand the container attach path.
-3. Read `internal/plumbing/intf/intf.go` to understand SRv6 endpoint encoding, interface naming, and base62↔hex conversion.
-4. Read `internal/plumbing/srv6/srv6.go` to understand kernel SRv6 ingress route management.
-5. Explore `internal/plumbing/` for shared kernel and network primitives (VRF, sysctl, interface naming, SRv6).
+2. Read `internal/cni/cni.go` (cmdAdd/cmdDel) to understand the container attach path and how `BGPAdvertisement` CRDs are created.
+3. Read `internal/reconcile/reconcile.go` to understand how Cosmos CRDs are translated into a `DesiredRouter`.
+4. Read `internal/runtime/gobgp/runtime.go` to understand how `DesiredRouter` is applied to GoBGP.
+5. Read `internal/plumbing/intf/intf.go` to understand SRv6 endpoint encoding, interface naming, and base62↔hex conversion.
+6. Explore `internal/plumbing/` for shared kernel and network primitives (VRF, sysctl, interface naming, SRv6).

ARCHITECTURE.md

Lines changed: 48 additions & 34 deletions
@@ -2,25 +2,26 @@

 > Galactic is the SRv6 data plane for multi-cloud VPC networking, deployed as two
 > binaries on each Kubernetes node: a CNI plugin that attaches containers to VPC
-> networks, and an agent that manages kernel SRv6 routes and distributes EVPN
-> (L2VPN/EVPN AFI/SAFI) paths via an embedded GoBGP server.
+> networks, and a router that reconciles Cosmos BGP CRDs and drives an embedded
+> GoBGP server to distribute EVPN (L2VPN/EVPN AFI/SAFI) paths between nodes.

-_Last updated: 2026-06-14_
+_Last updated: 2026-06-19_

 ---

 ## Overview

 Galactic implements VPC isolation and cross-cluster reachability using Linux SRv6.
 When a pod is attached to a VPC, the CNI plugin creates the required kernel state
-(VRF, veth pair, SRv6 ingress route) and injects EVPN paths into the
-node-local GoBGP daemon. GoBGP distributes those paths to a BGP route reflector,
-enabling pods on different nodes or clusters to reach each other via
-SRv6-encapsulated traffic.
+(VRF, veth pair, SRv6 ingress route) and writes a `BGPAdvertisement` CRD.
+`galactic-router` watches that CRD and injects the EVPN path into the node-local
+GoBGP server. GoBGP distributes the path to a BGP route reflector, enabling pods
+on different nodes or clusters to reach each other via SRv6-encapsulated traffic.

 VPC and VPCAttachment CRDs are owned by a separate companion operator
 (`go.miloapis.com/cosmos`). Galactic receives pre-populated identifiers through the
-CNI config and acts on them without running its own CRD controllers.
+CNI config and acts on them. `galactic-router` reconciles BGP CRDs from the same
+cosmos API group directly — no gRPC sidecar, no provider CRD lifecycle.

 ### SRv6 SID encoding

@@ -40,26 +41,32 @@ enabling automatic cross-node path import without explicit RT configuration.
 ```
 galactic/
 ├── cmd/
-│ ├── galactic-cni/ # CNI binary
-│ └── galactic-agent/ # Agent binary
+│ ├── galactic-cni/ # CNI binary
+│ └── galactic-router/ # Router binary (controller-runtime reconciler)
 ├── internal/
-│ ├── agent/ # Agent run loop; wires GoBGP, health, metrics, bootstrap
-│ ├── bootstrap/ # BGPProvider CR lifecycle (create on start, delete on stop)
-│ ├── cni/ # CNI cmdAdd / cmdDel
-│ │ ├── route/ # Host-side static routes via netlink
-│ │ └── veth/ # veth pair management
-│ ├── gobgp/ # Embedded GoBGP server lifecycle
-│ ├── metrics/ # Prometheus metrics (galactic_agent_*)
-│ └── plumbing/ # Low-level kernel and network primitives
-│ ├── intf/ # Interface naming, base62↔hex encoding, SRv6 endpoint encode/decode
-│ ├── srv6/ # SRv6 ingress route add/del (END.DT46)
-│ ├── sysctl/ # Interface sysctl helpers
-│ └── vrf/ # Linux VRF create/delete/lookup
+│ ├── controller/ # controller-runtime reconcilers (BGPRouter, BGPPeer,
+│ │ # BGPAdvertisement, BGPPolicy, Secret, Node)
+│ ├── reconcile/ # CRD → DesiredRouter translation (node/role checks,
+│ │ # secret resolution, IPv6 next-hop from Node)
+│ ├── runtime/ # RouterRuntime interface + RuntimeManager
+│ │ ├── gobgp/ # GoBGP RouterRuntime (tenant role)
+│ │ └── frr/ # FRR RouterRuntime stub (fabric role, Phase 2)
+│ ├── model/ # DesiredRouter and family; re-exports cosmos enums
+│ ├── hash/ # SHA-256 change detection over DesiredRouter
+│ ├── metrics/ # Prometheus metrics (galactic_router_*)
+│ ├── cni/ # CNI cmdAdd / cmdDel
+│ │ ├── route/ # Host-side static routes via netlink
+│ │ └── veth/ # veth pair management
+│ └── plumbing/ # Low-level kernel and network primitives
+│ ├── intf/ # Interface naming, base62↔hex encoding, SRv6 endpoint encode/decode
+│ ├── srv6/ # SRv6 ingress route add/del (END.DT46)
+│ ├── sysctl/ # Interface sysctl helpers
+│ └── vrf/ # Linux VRF create/delete/lookup
 ├── deploy/
-│ ├── galactic-agent/ # Kustomize: DaemonSet, RBAC, ServiceAccount
-│ └── containerlab/ # ContainerLab lab topology and scripts
+│ ├── galactic-router/ # Kustomize: DaemonSet, RBAC, ServiceAccount
+│ └── containerlab/ # ContainerLab lab topology and scripts
 └── containers/
-└── galactic/ # Production Dockerfile (builds galactic CNI binary)
+└── galactic/ # Production Dockerfile
 ```

 ---
@@ -68,18 +75,21 @@ galactic/

 See [docs/cni-sequence.md](docs/cni-sequence.md) for the full CNI ADD/DEL sequence diagram.

-See [docs/agent-startup.md](docs/agent-startup.md) for the agent startup sequence diagram.
+See [docs/agent-startup.md](docs/agent-startup.md) for the router startup sequence diagram.

 ---

 ## Components

 | Component | Binary | Role |
 |-----------|--------|------|
-| `internal/agent` | `galactic-agent` | Run loop; wires GoBGP, health, metrics, bootstrap |
-| `internal/bootstrap` | `galactic-agent` | BGPProvider CR lifecycle |
-| `internal/gobgp` | `galactic-agent` | Embedded GoBGP server |
-| `internal/metrics` | `galactic-agent` | Prometheus metrics |
+| `internal/controller` | `galactic-router` | controller-runtime reconcilers; field index registration; CRD status helpers |
+| `internal/reconcile` | `galactic-router` | CRD → DesiredRouter translation |
+| `internal/runtime/gobgp` | `galactic-router` | Embedded GoBGP server (tenant role) |
+| `internal/runtime/frr` | `galactic-router` | FRR stub (fabric role, Phase 2) |
+| `internal/model` | `galactic-router` | Internal BGP model types |
+| `internal/hash` | `galactic-router` | Change detection |
+| `internal/metrics` | `galactic-router` | Prometheus metrics |
 | `internal/cni` | `galactic-cni` | CNI cmdAdd / cmdDel |
 | `internal/plumbing/intf` | both | Interface naming, base62↔hex encoding, SRv6 endpoint encode/decode |
 | `internal/plumbing/srv6` | both | SRv6 ingress route add/del (END.DT46) |
@@ -92,12 +102,16 @@ See [docs/agent-startup.md](docs/agent-startup.md) for the agent startup sequenc

 - **Identifiers in the SID.** VPC (48-bit) and VPCAttachment (16-bit) identifiers are packed into the low 64 bits of the SRv6 SID, making forwarding state fully self-describing without a lookup table.
 - **Base62 interface names.** Kernel interface names are Base62-encoded to stay within the 15-character limit (`vrfX-Y`, `galX-Y`). The hex form is used for BGP and SRv6; base62 for kernel interfaces.
-- **GoBGP embedded, not sidecar.** GoBGP runs in-process so the agent owns its lifecycle and can gate readiness on BGP availability. Peer and policy config is applied by the cosmos operator via `BGPProvider` / `BGPInstance` / `BGPPeer` CRDs. The provider advertises `L2VPN/EVPN` (AFI=25, SAFI=70) as its sole address family capability.
-- **CNI binary auto-detects mode.** The `galactic-cni` binary runs as both the CNI plugin (when `CNI_COMMAND` is set) and a CLI tool. This avoids shipping two separate binaries on the node.
+- **GoBGP embedded, lazy-started.** GoBGP runs in-process and starts only when the first `BGPRouter` is reconciled (`listenPort=-1`, outbound-only). ASN or RouterID changes trigger a full `Reconfigure` (fresh `BgpServer`; `StopBgp` is not called because it permanently terminates the v4 Serve loop).
+- **CRD-driven config, no sidecar gRPC.** `galactic-router` watches cosmos BGP CRDs directly via controller-runtime. The CNI writes a `BGPAdvertisement` CRD; the router reconciler picks it up. No in-node gRPC calls.
+- **Hash-based no-op suppression.** SHA-256 over the sorted `DesiredRouter` prevents redundant GoBGP Apply calls on every CRD event touch.
+- **RuntimeFactory pattern.** `ROUTER_ROLE=tenant` selects GoBGP; `ROUTER_ROLE=fabric` selects FRR (Phase 2 stub). The binary is selected at startup; no controller changes are needed for Phase 2.
+- **gRPC health on :5000.** Liveness and readiness probes use the gRPC health protocol (`google.golang.org/grpc/health`) on port 5000. No HTTP health endpoint.

 ---

 ## Known Constraints

-- **GoBGP RIB is ephemeral.** All BGP state is in-process memory. On restart, sessions and paths must be re-established. The cosmos operator is responsible for re-applying config.
-- **No kernel-path unit tests.** `internal/cni`, `internal/plumbing/srv6`, and `internal/plumbing/vrf` require `CAP_NET_ADMIN` and a real kernel. `internal/plumbing/intf` is fully unit-testable (pure functions only). Coverage comes from the e2e suite (`task ci:e2etest`), which only runs on `main` and release tags.
+- **GoBGP RIB is ephemeral.** All BGP state is in-process memory. On restart, sessions and paths must be re-established from CRD state; controller-runtime's reconcile loop handles this automatically.
+- **EVPN Type 5 deferred.** `BGPAdvertisement` does not carry a Route Distinguisher field in the current cosmos API. `galactic-router` returns `ErrMissingRouteDistinguisher` for l2vpn/evpn advertisements and sets `Accepted=False` on the CRD.
+- **No kernel-path unit tests.** `internal/cni`, `internal/plumbing/srv6`, and `internal/plumbing/vrf` require `CAP_NET_ADMIN` and a real kernel. `internal/plumbing/intf` is fully unit-testable (pure functions only). Coverage comes from the e2e suite (`task test:e2e`).
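
Reviewer sketch (not part of the commit): the "Hash-based no-op suppression" decision above describes hashing a sorted `DesiredRouter` with SHA-256 so that repeated CRD events do not trigger redundant Apply calls. The program below shows the general technique using only the standard library. The `DesiredRouter` fields, the `digest` helper, and the `Applier` type are hypothetical stand-ins; the real types in `internal/model` and `internal/hash` are not visible in this diff and may canonicalize differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// DesiredRouter is a reduced, hypothetical stand-in for the internal model type.
type DesiredRouter struct {
	ASN      uint32   `json:"asn"`
	RouterID string   `json:"routerID"`
	Peers    []string `json:"peers"`
}

// digest returns a SHA-256 hex digest over a canonical form of d. Peers are
// sorted on a copy so that list order does not change the hash and the
// caller's slice is left untouched.
func digest(d DesiredRouter) (string, error) {
	peers := append([]string(nil), d.Peers...)
	sort.Strings(peers)
	d.Peers = peers
	raw, err := json.Marshal(d)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// Applier remembers the digest of the last applied state and skips identical ones.
type Applier struct {
	last  string
	apply func(DesiredRouter) error
}

// Apply reports whether the underlying apply function was invoked.
func (a *Applier) Apply(d DesiredRouter) (bool, error) {
	h, err := digest(d)
	if err != nil {
		return false, err
	}
	if h == a.last {
		return false, nil // unchanged: suppress the call
	}
	if err := a.apply(d); err != nil {
		return false, err // digest is not recorded, so the next event retries
	}
	a.last = h
	return true, nil
}

func main() {
	calls := 0
	a := &Applier{apply: func(DesiredRouter) error { calls++; return nil }}
	a.Apply(DesiredRouter{ASN: 65000, RouterID: "10.0.0.1", Peers: []string{"b", "a"}})
	a.Apply(DesiredRouter{ASN: 65000, RouterID: "10.0.0.1", Peers: []string{"a", "b"}})
	fmt.Println("apply calls:", calls)
}
```

Recording the digest only after a successful apply means a failed call is retried on the next reconcile rather than being silently suppressed; whether the commit does the same is not shown here.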

CONVENTIONS.md

Lines changed: 7 additions & 6 deletions
@@ -10,13 +10,14 @@ This document defines the coding standards, naming rules, error handling pattern

 - Module: `go.datum.net/galactic`
 - `cmd/galactic-cni/main.go` — CNI plugin entry point; calls `cni.RunPlugin()` directly
-- `cmd/galactic-agent/main.go` — agent entry point; parses flags and calls `agent.Run()`
-- `internal/plumbing/` — low-level kernel and network primitives shared between agent and CNI (`intf`, `srv6`, `sysctl`, `vrf`)
-- `internal/agent/` — agent entry point and gRPC server
+- `cmd/galactic-router/main.go` — router entry point; reads `NODE_NAME` and `ROUTER_ROLE` env vars, starts controller-runtime manager
+- `internal/plumbing/` — low-level kernel and network primitives shared between router and CNI (`intf`, `srv6`, `sysctl`, `vrf`)
+- `internal/controller/` — controller-runtime reconcilers (BGPRouter, BGPPeer, BGPAdvertisement, BGPPolicy, Secret, Node); also contains field index registration (`indexer.go`) and CRD status helpers (`status.go`)
+- `internal/reconcile/` — CRD → DesiredRouter translation
+- `internal/runtime/` — RouterRuntime interface; `gobgp/` (tenant) and `frr/` (fabric stub)
 - `internal/cni/` — CNI plugin (cmdAdd / cmdDel implementation)
-- `internal/cmd/version/` — ldflags variables (Version, GitCommit, etc.) set at build time
-- `internal/gobgp/` — embedded GoBGP server lifecycle
-- `internal/bootstrap/` — agent startup sequencing (BGPProvider resource management)
+- `internal/model/` — internal BGP model types
+- `internal/hash/` — SHA-256 change detection over DesiredRouter
 - `internal/metrics/` — Prometheus metrics registration

 Place new code in `internal/` unless it must be imported by an external caller. Prefer creating a focused sub-package over adding to an existing large one.

Taskfile.yaml

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ tasks:
 -X go.datum.net/galactic/internal/metadata.BuildDate={{.BUILD_DATE}}
 cmds:
 - go build -ldflags "{{.LDFLAGS}}" -o bin/galactic-cni cmd/galactic-cni/main.go
-- go build -ldflags "{{.LDFLAGS}}" -o bin/galactic-agent cmd/galactic-agent/main.go
+- go build -ldflags "{{.LDFLAGS}}" -o bin/galactic-router cmd/galactic-router/main.go

 docker-build:
 desc: Build container image

cmd/galactic-agent/main.go

Lines changed: 0 additions & 45 deletions
This file was deleted.
