* feat: add RHEL/Rocky Linux 9 container OS support
Add OSType protobuf enum (UBUNTU_2404, ROCKY_9, RHEL_9) and --os-type
CLI flag to support Rocky Linux 9 (dev/test) and RHEL 9 (production)
as container OS alternatives to Ubuntu 24.04.
- New internal/ostype package: maps OSType to Incus images and OS family
- New internal/ospkg package: abstracts package management (apt vs dnf,
adduser vs useradd, sudo vs wheel group, ssh vs sshd service)
- Dual stack definitions: RHEL variants for all stacks in stacks.yaml
- Refactored container/manager.go to use OS-aware provisioning
- OS type stored as container label for subsequent operations
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
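
The apt/dnf split described in the commit above could be modeled roughly as follows. This is a minimal sketch, not the actual internal/ospkg code: the names Family, PackageOps, OpsFor, and InstallCommand are hypothetical, and the exact command flags are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Family is a hypothetical OS-family tag; the real internal/ostype types may differ.
type Family int

const (
	Debian Family = iota // Ubuntu 24.04
	RHEL                 // Rocky Linux 9 and RHEL 9
)

// PackageOps collects the per-family differences listed in the commit message:
// apt vs dnf, adduser vs useradd, sudo vs wheel group, ssh vs sshd service.
type PackageOps struct {
	InstallCmd string // package install command prefix (flags are an assumption)
	AddUserCmd string // user creation tool
	AdminGroup string // group that grants sudo rights
	SSHService string // systemd unit name of the SSH daemon
}

// OpsFor returns the command set for an OS family, defaulting to Debian.
func OpsFor(f Family) PackageOps {
	switch f {
	case RHEL:
		return PackageOps{"dnf install -y", "useradd", "wheel", "sshd"}
	default:
		return PackageOps{"apt-get install -y", "adduser", "sudo", "ssh"}
	}
}

// InstallCommand renders the install command line for the given packages.
func InstallCommand(f Family, pkgs ...string) string {
	return fmt.Sprintf("%s %s", OpsFor(f).InstallCmd, strings.Join(pkgs, " "))
}
```

With this shape, provisioning code can call InstallCommand(RHEL, "git", "curl") and never branch on the distribution itself.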
* feat: add backend heartbeat metrics and Grafana dashboard
Push containarium_backend_healthy metric (1=up, 0=down) per backend to
VictoriaMetrics every 30s. Add status-history panel to the Grafana
containarium-overview dashboard showing backend heartbeat timeline.
- New containarium.backend.healthy OTel gauge in metrics collector
- FetchPeerHealth() on PeerMetricsFetcherAdapter reports peer health
- Grafana dashboard: "Backend Heartbeat" row with status-history panel
- Enriched /v1/backends endpoint with live peer system info
- Dashboard auto-updates on daemon restart (updateGrafanaDashboard)
- PostgreSQL: wait up to 2min at startup for availability; add
Restart=on-failure systemd override inside core-postgres container
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
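
For illustration, the heartbeat samples could be rendered as shown below. The commit uses an OTel gauge; this sketch swaps in the Prometheus text exposition format (which VictoriaMetrics can also ingest) to stay within the standard library. The function name HeartbeatLines and the label name backend_id are assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// HeartbeatLines renders one containarium_backend_healthy sample per backend
// in Prometheus text exposition format (1 = up, 0 = down).
// The "backend_id" label name is an assumption, not taken from the commit.
func HeartbeatLines(health map[string]bool) string {
	ids := make([]string, 0, len(health))
	for id := range health {
		ids = append(ids, id)
	}
	sort.Strings(ids) // stable output order regardless of map iteration

	var b strings.Builder
	for _, id := range ids {
		v := 0
		if health[id] {
			v = 1
		}
		fmt.Fprintf(&b, "containarium_backend_healthy{backend_id=%q} %d\n", id, v)
	}
	return b.String()
}
```

A caller would invoke this from a 30-second time.Ticker loop and push the result to the metrics endpoint. Emitting 0 for a down backend, rather than omitting the sample, is what lets a status-history panel show outages as explicit states.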
* fix: harden peer stability and SSH reliability
Peer node stability:
- ClamAV: wait for freshclam DB download before starting clamd; bump
security container memory from 3GB to 4GB to prevent OOM kills
- Conntrack: increase event channel buffers from 256 to 8192; rate-limit
"channel full" warning to once per 30s to stop log flooding
- Tunnel: increase yamux keepalive from 15s to 60s on both client and
server so tunnel survives CPU-heavy workloads (builds, GPU training)
- Peers: add --sentinel-url for auto-update; fix restart policies
SSH reliability:
- Fix ForwardCreateContainer to use camelCase field names (sshKeys, not
ssh_keys) matching gRPC-gateway protojson. This was silently dropping
SSH keys when creating containers on peers.
- Reject unknown JSON fields in gRPC-gateway (DiscardUnknown: false) so
field name mismatches fail loudly instead of silently
- Validate SSH public keys at API boundary and pre-write to reject
placeholder strings like "YOUR_KEY"
- Fix jump server account unlock: use 'usermod -p *' instead of
'passwd -d' which left accounts locked on Ubuntu 24.04
- Raise sshpiper failtoban threshold from 3 to 20 (ssh-agent tries
multiple keys per connection, each counted as a failure)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
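
The silent-drop failure mode fixed above can be reproduced with the standard library alone. This is an analogue using encoding/json, not the gRPC-gateway or protojson configuration itself; CreateRequest, DecodeLenient, and DecodeStrict are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
)

// CreateRequest mirrors only the field discussed above; the real request has more.
type CreateRequest struct {
	SSHKeys []string `json:"sshKeys"`
}

// DecodeLenient ignores unknown fields, so a body using "ssh_keys"
// decodes without error and leaves SSHKeys empty.
func DecodeLenient(body []byte) (CreateRequest, error) {
	var req CreateRequest
	err := json.Unmarshal(body, &req)
	return req, err
}

// DecodeStrict rejects unknown fields, so the same body fails loudly.
func DecodeStrict(body []byte) (CreateRequest, error) {
	var req CreateRequest
	dec := json.NewDecoder(bytes.NewReader(body))
	dec.DisallowUnknownFields()
	err := dec.Decode(&req)
	return req, err
}
```

Given the body {"ssh_keys":["..."]}, DecodeLenient returns a request with no keys and a nil error, while DecodeStrict returns an error naming the unknown field.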
---------
Co-authored-by: hsinhoyeh <yhh92u@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
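
The SSH public key validation mentioned in the third commit could look roughly like this. It is a structural check only, written against the standard library; ValidatePublicKey is a hypothetical name, and production code would more likely use ssh.ParseAuthorizedKey from golang.org/x/crypto/ssh. Lines that carry authorized_keys options before the key type are not handled.

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"errors"
	"strings"
)

// ValidatePublicKey checks that a line has the shape "<type> <base64> [comment]"
// and that the algorithm name embedded in the key blob matches <type>.
// It does not verify the key material itself.
func ValidatePublicKey(line string) error {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return errors.New(`expected "<type> <base64> [comment]"`)
	}
	keyType, blob := fields[0], fields[1]

	raw, err := base64.StdEncoding.DecodeString(blob)
	if err != nil {
		return errors.New("key data is not valid base64")
	}
	// The blob starts with a big-endian uint32 length, then the algorithm name.
	if len(raw) < 4 {
		return errors.New("key data too short")
	}
	n := binary.BigEndian.Uint32(raw[:4])
	if uint64(n) > uint64(len(raw)-4) {
		return errors.New("key data truncated")
	}
	if string(raw[4:4+n]) != keyType {
		return errors.New("key type does not match key data")
	}
	return nil
}
```

Placeholder strings such as "YOUR_KEY" fail at the first check, and "ssh-ed25519 YOUR_KEY" fails at base64 decoding, so neither reaches authorized_keys.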
api/swagger/containarium.swagger.json (20 additions, 0 deletions)

@@ -3850,6 +3850,10 @@
         "backendId": {
           "type": "string",
           "title": "Backend ID this container runs on (e.g., \"gcp-spot\", \"tunnel-fts-5900x-gpu\")"
+        },
+        "osType": {
+          "$ref": "#/definitions/OSType",
+          "title": "Operating system type of the container"
         }
       },
       "title": "Container represents a complete container instance"
@@ -4025,6 +4029,10 @@
         "backendId": {
           "type": "string",
           "title": "Target backend ID for creation (empty = primary backend)"
+        },
+        "osType": {
+          "$ref": "#/definitions/OSType",
+          "title": "Operating system type (takes precedence over image when set)"
         }
       },
       "title": "CreateContainerRequest is the request to create a new container"
@@ -5222,6 +5230,18 @@
       },
       "title": "NetworkTopology represents the complete network visualization"
     },
+    "OSType": {
+      "type": "string",
+      "enum": [
+        "OS_TYPE_UNSPECIFIED",
+        "OS_TYPE_UBUNTU_2404",
+        "OS_TYPE_ROCKY_9",
+        "OS_TYPE_RHEL_9"
+      ],
+      "default": "OS_TYPE_UNSPECIFIED",
+      "description": "- OS_TYPE_UNSPECIFIED: Unspecified OS type (defaults to Ubuntu 24.04)\n - OS_TYPE_UBUNTU_2404: Ubuntu 24.04 LTS\n - OS_TYPE_ROCKY_9: Rocky Linux 9 (RHEL 9 rebuild, for dev/test)\n - OS_TYPE_RHEL_9: Red Hat Enterprise Linux 9 (for production, requires subscription)",
+      "title": "OSType represents the operating system type for a container"
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Minute) // Container creation can take time (includes ultra-aggressive retry logic for google_guest_agent)
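
The osType semantics in the swagger diff (the field takes precedence over image when set, and OS_TYPE_UNSPECIFIED defaults to Ubuntu 24.04) might resolve to an image as sketched below. ResolveImage and the image aliases are hypothetical; in particular the RHEL 9 alias is a placeholder, and honoring a caller-supplied image when osType is unspecified is an assumption.

```go
package main

import "fmt"

// OSType mirrors the enum values in the swagger definition.
type OSType string

const (
	Unspecified OSType = "OS_TYPE_UNSPECIFIED"
	Ubuntu2404  OSType = "OS_TYPE_UBUNTU_2404"
	Rocky9      OSType = "OS_TYPE_ROCKY_9"
	RHEL9       OSType = "OS_TYPE_RHEL_9"
)

// imageFor maps OS types to image aliases. These aliases are assumptions;
// the RHEL 9 entry is a placeholder for a locally provided image.
var imageFor = map[OSType]string{
	Ubuntu2404: "images:ubuntu/24.04",
	Rocky9:     "images:rockylinux/9",
	RHEL9:      "local:rhel/9",
}

// ResolveImage picks the image for a create request.
// A set osType wins over image; otherwise image is used if given;
// otherwise the Ubuntu 24.04 default applies.
func ResolveImage(osType OSType, image string) (string, error) {
	if osType != "" && osType != Unspecified {
		img, ok := imageFor[osType]
		if !ok {
			return "", fmt.Errorf("unknown OS type %q", osType)
		}
		return img, nil
	}
	if image != "" {
		return image, nil
	}
	return imageFor[Ubuntu2404], nil
}
```

For example, ResolveImage(Rocky9, "images:debian/12") yields the Rocky Linux image, and a request with neither field set falls back to Ubuntu 24.04.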