resource_id derivation via gopsutil HostIDWithContext is unreliable across cloned VMs, shared containers, minimal images, and early boot

## Summary

On Linux, the management-plane `resource_id` is derived from a fallback chain centered on `/etc/machine-id`, with `product_uuid` tried first and `boot_id` as the last resort. `resource_id` is the primary identity NGINX One uses to correlate metrics, config, and telemetry with a host. Duplicate IDs therefore cause data from several hosts to collide onto a single record, and churning IDs orphan records on every restart.

This issue enumerates the ways the current implementation produces incorrect, colliding, or churning resource IDs.

## Where the code lives

- `internal/grpc/grpc.go:90` — `info.ResourceID(ctx)` is called when constructing the management-plane gRPC client.
- `pkg/host/info.go:140` — `ResourceID()`: if `IsContainer()` → `containerID()`, else → `hostID()`.
- `pkg/host/info.go:200` — `hostID()` calls `i.exec.HostID(ctx)` and hashes the result with `uuid.NewMD5(uuid.Nil, ...)`.
- `pkg/host/exec/exec.go:76` — `(*Exec).HostID()` delegates to `github.com/shirou/gopsutil/v4/host.HostIDWithContext`.
- `go.mod:46` — `github.com/shirou/gopsutil/v4 v4.26.3`.

gopsutil v4's Linux chain (`host/host_linux.go:33-63`) is:

1. `/sys/class/dmi/id/product_uuid` (SMBIOS)
2. `/etc/machine-id` (must be exactly 32 hex chars on the first line)
3. `/proc/sys/kernel/random/boot_id` — gopsutil's own comment: *"Not stable between reboot, but better than nothing"*
4. On total failure, returns `("", nil)` — empty string with **no error**.
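The fallback order above can be sketched as follows. This is a simplified paraphrase for discussion, not gopsutil's code: `readFirstLine`, `fakeReader`, and `hostIDChain` are hypothetical names, the machine-id check here tests length only, and the real implementation also normalizes the values it returns.

```go
package main

import (
	"errors"
	"fmt"
)

// readFirstLine is a hypothetical stand-in for reading the first line of a
// file, so the chain can be exercised without touching the real filesystem.
type readFirstLine func(path string) (string, error)

// fakeReader serves file contents from a map; any other path is unreadable.
func fakeReader(files map[string]string) readFirstLine {
	return func(path string) (string, error) {
		v, ok := files[path]
		if !ok {
			return "", errors.New("unreadable: " + path)
		}
		return v, nil
	}
}

// hostIDChain walks the three sources in order and reports which one won.
// When every source fails it returns empty strings and a nil error,
// mirroring step 4 of the list above.
func hostIDChain(read readFirstLine) (id, source string, err error) {
	if v, rerr := read("/sys/class/dmi/id/product_uuid"); rerr == nil && v != "" {
		return v, "product_uuid", nil
	}
	// Length check only; anything that is not 32 characters falls through.
	if v, rerr := read("/etc/machine-id"); rerr == nil && len(v) == 32 {
		return v, "machine-id", nil
	}
	if v, rerr := read("/proc/sys/kernel/random/boot_id"); rerr == nil && v != "" {
		return v, "boot_id", nil
	}
	return "", "", nil
}

func main() {
	// An unprivileged agent on a minimal image: only boot_id is readable.
	minimal := fakeReader(map[string]string{
		"/proc/sys/kernel/random/boot_id": "example-boot-id",
	})
	id, source, err := hostIDChain(minimal)
	fmt.Printf("id=%q source=%q err=%v\n", id, source, err)

	// Nothing is readable: the caller sees an empty ID and no error.
	id, source, err = hostIDChain(fakeReader(nil))
	fmt.Printf("id=%q source=%q err=%v\n", id, source, err)
}
```

Returning the winning source alongside the ID is a choice made for this sketch; it makes it visible when an agent has silently landed on `boot_id`.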

## Problems

### 1. Cloned VMs and unprepared golden templates

systemd requires `/etc/machine-id` to be absent or empty in VM templates so first boot regenerates it ([systemd docs](https://www.freedesktop.org/software/systemd/man/machine-id.html)). This preparation step is skipped often enough to be a well-known class of bug: [Proxmox](https://forum.proxmox.com/threads/issues-cloning-a-new-vm-machine-id-stays-the-same.156227/), [VMware Photon](https://vmware.github.io/photon/assets/files/html/3.0/photon_admin/clearing-the-machine-id-of-a-cloned-instance-for-dhcp.html), [Uyuni #3478](https://github.com/uyuni-project/uyuni/issues/3478). When multiple VMs share a machine-id, they hash to the same UUID and report identical `resource_id`.

gopsutil tries `product_uuid` first, which mitigates this *only* for root-privileged agents on hardware/hypervisors that expose a unique SMBIOS UUID. In practice:

- `product_uuid` typically requires root. Unprivileged agents hit `ReadLines` → permission denied → fall through to `machine-id`.
- Some hypervisors (including VMware template clones and certain Proxmox configs) propagate the same SMBIOS UUID to clones anyway.

### 2. `uninitialized` sentinel during early boot produces unstable churn

During early boot, systemd writes the literal string `uninitialized\n` to `/etc/machine-id` and overmounts it until first-boot provisioning completes. gopsutil's 32-hex-char length check (`host_linux.go:49`) incidentally rejects the sentinel, but the chain then falls through to `boot_id`, which **changes on every reboot**. An agent started in this window derives its `resource_id` from `boot_id`, so the ID changes on the next boot and the prior record is orphaned.
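One possible direction is to treat the sentinel as its own error so the caller can wait and retry instead of falling through to `boot_id`. The following is a minimal sketch under that assumption; `parseMachineID` and both error values are hypothetical names, not part of the agent or gopsutil.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

var (
	// errMachineIDUninitialized marks the early-boot sentinel; a caller
	// could retry later rather than pick a less stable source.
	errMachineIDUninitialized = errors.New("machine-id is not yet initialized")
	// errMachineIDMalformed covers any other content that is not a valid ID.
	errMachineIDMalformed = errors.New("machine-id is not 32 lowercase hex characters")
)

// parseMachineID validates the first line of a machine-id file's content.
func parseMachineID(content string) (string, error) {
	line, _, _ := strings.Cut(content, "\n")
	line = strings.TrimSpace(line)
	if line == "uninitialized" {
		return "", errMachineIDUninitialized
	}
	if len(line) != 32 {
		return "", errMachineIDMalformed
	}
	for _, c := range line {
		isHex := (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')
		if !isHex {
			return "", errMachineIDMalformed
		}
	}
	return line, nil
}

func main() {
	samples := []string{
		"uninitialized\n",
		"0123456789abcdef0123456789abcdef\n",
		"not-a-machine-id\n",
	}
	for _, content := range samples {
		id, err := parseMachineID(content)
		fmt.Printf("content=%q id=%q err=%v\n", content, id, err)
	}
}
```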

### 3. Missing `/etc/machine-id` on minimal images produces per-restart churn

On busybox / distroless / some Alpine configurations, `/etc/machine-id` is absent. `product_uuid` often requires root. The chain falls to `boot_id`. Every container restart ⇒ new `resource_id` ⇒ orphaned records in NGINX One. In default Docker configurations the host `/etc/machine-id` is also not mounted in ([denisbrodbeck/machineid#5](https://github.com/denisbrodbeck/machineid/issues/5)).

### 4. All-sources-fail returns empty string, which hashes to a fixed UUID for every host

If every source fails, `HostIDWithContext` returns `("", nil)`. `pkg/host/info.go:206` then computes `uuid.NewMD5(uuid.Nil, []byte(""))`, which is a **deterministic UUID produced by every host in this state** — a fleet-wide collision, reported with no error.
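The collision can be reproduced with the standard library alone. The sketch below re-implements the version-3 (MD5, name-based) UUID construction from RFC 4122 rather than importing the `uuid` package, so that it has no dependencies; `uuidV3`, `formatUUID`, and `hashHostID` are hypothetical names, and the empty-input guard in `hashHostID` is one possible mitigation, not existing agent behavior.

```go
package main

import (
	"crypto/md5"
	"errors"
	"fmt"
)

var errEmptyHostID = errors.New("host ID is empty; refusing to hash it")

// uuidV3 builds a version-3 UUID: MD5 over namespace followed by name, then
// the version and variant bits are overwritten as RFC 4122 prescribes.
func uuidV3(namespace [16]byte, name []byte) [16]byte {
	h := md5.New()
	h.Write(namespace[:])
	h.Write(name)
	var u [16]byte
	copy(u[:], h.Sum(nil))
	u[6] = (u[6] & 0x0f) | 0x30 // version 3
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	return u
}

// formatUUID renders the canonical 8-4-4-4-12 hex form.
func formatUUID(u [16]byte) string {
	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

// hashHostID hashes a host ID under the nil namespace, but rejects empty
// input so that "no identity" cannot turn into a shared identity.
func hashHostID(hostID string) (string, error) {
	if hostID == "" {
		return "", errEmptyHostID
	}
	var nilNamespace [16]byte
	return formatUUID(uuidV3(nilNamespace, []byte(hostID))), nil
}

func main() {
	var nilNamespace [16]byte
	// Two unrelated hosts that both ended up with an empty host ID.
	hostA := formatUUID(uuidV3(nilNamespace, []byte("")))
	hostB := formatUUID(uuidV3(nilNamespace, []byte("")))
	fmt.Println("host A:", hostA)
	fmt.Println("host B:", hostB)
	fmt.Println("collide:", hostA == hostB)

	_, err := hashHostID("")
	fmt.Println("guarded:", err)
}
```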

### 5. Error swallowed at the gRPC boundary

`internal/grpc/grpc.go:90-93` logs a warning on error from `ResourceID()` and proceeds with `resourceID = ""`. The empty string is then passed into `DialOptions(...)`. Agents that cannot derive an identity silently continue with an empty `resource_id` rather than refusing to start or surfacing the failure to the operator.
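A stricter alternative would be to treat both an error and an empty result as fatal to client construction. The sketch below assumes a simplified derivation signature without the `context.Context` parameter the real `ResourceID(ctx)` takes; `requireResourceID`, `deriveFunc`, and `errNoResourceID` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// deriveFunc is a simplified, hypothetical stand-in for the ID derivation.
type deriveFunc func() (string, error)

var errNoResourceID = errors.New("resource ID could not be derived")

// requireResourceID returns an error instead of an empty ID, so the caller
// cannot proceed to dial with resource_id="".
func requireResourceID(derive deriveFunc) (string, error) {
	id, err := derive()
	if err != nil {
		return "", fmt.Errorf("%w: %v", errNoResourceID, err)
	}
	if id == "" {
		return "", errNoResourceID
	}
	return id, nil
}

func main() {
	failing := func() (string, error) { return "", errors.New("mountinfo matched no pattern") }
	empty := func() (string, error) { return "", nil }
	working := func() (string, error) { return "example-id", nil }

	for _, derive := range []deriveFunc{failing, empty, working} {
		id, err := requireResourceID(derive)
		fmt.Printf("id=%q err=%v\n", id, err)
	}
}
```

Whether the agent should refuse to start or retry with backoff is a policy decision this sketch leaves open; the point is only that the empty value never reaches the dial options.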

### 6. LXC containers sharing host machine-id

LXC containers frequently inherit the host's `/etc/machine-id` unless the administrator explicitly clears it ([Proxmox forum](https://forum.proxmox.com/threads/bug-machine-id-etc-machine-id-not-unique-in-lxc-containers.89708/)). `IsContainer()` (`pkg/host/info.go:114`) checks `/.dockerenv`, `/run/.containerenv`, `/var/run/secrets/kubernetes.io/serviceaccount`, cgroup references for `kubepods|docker|containerd|ecs|fargate`, and `ECS_CONTAINER_METADATA_URI_V4`. **None of these checks detects LXC.** LXC instances therefore take the `hostID()` path and land on the shared machine-id.
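One additional signal that could be considered is the `container=` variable in PID 1's environment. This sketch assumes the runtime follows the systemd container interface convention of setting that variable (LXC is generally understood to set `container=lxc`), and that the agent can read `/proc/1/environ`, which typically requires root. `containerTypeFromEnviron` is a hypothetical helper, and the environment block in `main` is synthetic.

```go
package main

import (
	"bytes"
	"fmt"
)

// containerTypeFromEnviron scans a NUL-separated environment block (the
// format of /proc/<pid>/environ) for a "container=<type>" entry.
func containerTypeFromEnviron(environ []byte) (string, bool) {
	prefix := []byte("container=")
	for _, kv := range bytes.Split(environ, []byte{0}) {
		if bytes.HasPrefix(kv, prefix) {
			return string(bytes.TrimPrefix(kv, prefix)), true
		}
	}
	return "", false
}

func main() {
	// Synthetic stand-in for the contents of /proc/1/environ.
	environ := []byte("PATH=/usr/bin:/bin\x00container=lxc\x00TERM=linux\x00")
	kind, ok := containerTypeFromEnviron(environ)
	fmt.Printf("container type=%q found=%v\n", kind, ok)
}
```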

### 7. Bind-mounted host `/etc/machine-id` inside containers

Red Hat has been shipping RHEL containers with the host's `/etc/machine-id` bind-mounted in since 2015 ([Bugzilla #1286787](https://bugzilla.redhat.com/show_bug.cgi?id=1286787)). systemd-nspawn and some LXC configurations copy the host's machine-id into the container filesystem. If `IsContainer()` returns false for such a container, every container on that host reports the host's `resource_id`.

When `IsContainer()` does return true but `containerIDFromMountInfo` fails, `ResourceID()` correctly returns an error rather than falling back to `hostID()` — so the agent does not leak the host ID into containers through this path specifically. However, per problem 5, the error is swallowed upstream anyway, leaving the agent with `resource_id=""`.
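A bind mount over `/etc/machine-id` is visible in `/proc/self/mountinfo`, where the fifth whitespace-separated field of each line is the mount point. The sketch below checks for that; `isMountPoint` is a hypothetical helper and the sample line is synthetic. It only catches the bind-mount case, not a machine-id that was copied into the container filesystem.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// isMountPoint reports whether path appears as a mount point in text that
// uses the /proc/self/mountinfo line format.
func isMountPoint(mountinfo, path string) bool {
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Index 4 is the mount point: ID, parent ID, major:minor, root, mount point.
		if len(fields) > 4 && fields[4] == path {
			return true
		}
	}
	return false
}

func main() {
	// Synthetic mountinfo excerpt resembling a container with the host's
	// machine-id bind-mounted read-only.
	sample := "" +
		"100 90 0:50 / / rw,relatime - overlay overlay rw\n" +
		"120 100 8:1 /etc/machine-id /etc/machine-id ro,relatime - ext4 /dev/sda1 rw\n"
	fmt.Println("machine-id is a mount point:", isMountPoint(sample, "/etc/machine-id"))
}
```

A positive result means the file is supplied from outside the root filesystem, which is a reason to distrust it as a per-instance identity.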

### 8. `containerIDFromMountInfo` regex coverage

`pkg/host/info.go:305` matches against five regex patterns. Unusual runtime configurations, some cgroup v2 paths, and custom CRI shims can produce mountinfo that matches none of them. On failure, `containerID()` errors, and the agent ends up with an empty `resource_id` per problem 5.

### 9. No `/var/lib/dbus/machine-id` fallback

gopsutil v4 does **not** consult `/var/lib/dbus/machine-id`, which on some systems is the only present machine-id file (e.g. non-systemd installations where the dbus variant was populated by a package but `/etc/machine-id` was not).

### 10. No operator override

No config knob, environment variable, or file-based override exists to force a specific `resource_id`. An operator who knows their fleet has duplicate machine-ids and cannot fix the source has no escape hatch.
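An escape hatch could be as small as an environment variable consulted before derivation. In the sketch below the variable name `NGINX_AGENT_RESOURCE_ID` and the helper `resourceIDWithOverride` are hypothetical; nothing with these names exists in the agent today.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// overrideEnvVar is a hypothetical variable name chosen for illustration.
const overrideEnvVar = "NGINX_AGENT_RESOURCE_ID"

// resourceIDWithOverride prefers an operator-supplied ID and falls back to
// derivation only when the override is unset or blank.
func resourceIDWithOverride(getenv func(string) string, derive func() (string, error)) (string, error) {
	if v := strings.TrimSpace(getenv(overrideEnvVar)); v != "" {
		return v, nil
	}
	return derive()
}

func main() {
	// A map-backed getenv keeps the example independent of the real environment;
	// production code would pass os.Getenv.
	env := map[string]string{overrideEnvVar: "operator-chosen-id"}
	getenv := func(key string) string { return env[key] }
	derive := func() (string, error) { return "", errors.New("no identity source available") }

	id, err := resourceIDWithOverride(getenv, derive)
	fmt.Printf("with override:    id=%q err=%v\n", id, err)

	delete(env, overrideEnvVar)
	id, err = resourceIDWithOverride(getenv, derive)
	fmt.Printf("without override: id=%q err=%v\n", id, err)
}
```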

## Impact

Any two agents whose chain resolves to the same bytes — whether via cloned VM machine-ids, shared SMBIOS UUIDs, LXC templates, bind-mounted host files, or all-sources-empty — will report identical `resource_id` to NGINX One with no warning logged. Any agent that lands on `boot_id` (early boot, missing machine-id on minimal images) will churn its `resource_id` on every reboot/restart, orphaning records. Any agent whose `ResourceID()` errors out will silently continue with an empty `resource_id` due to the warning-only handling in `internal/grpc/grpc.go`.

## References

- [systemd: machine-id(5)](https://www.freedesktop.org/software/systemd/man/machine-id.html)
- [gopsutil v4 host_linux.go](https://github.com/shirou/gopsutil/blob/v4.26.3/host/host_linux.go)
- [Red Hat Bugzilla #1286787 — docker should create /etc/machine-id](https://bugzilla.redhat.com/show_bug.cgi?id=1286787)
- [Proxmox: machine-id not unique in LXC containers](https://forum.proxmox.com/threads/bug-machine-id-etc-machine-id-not-unique-in-lxc-containers.89708/)
- [Proxmox: VM clone machine-id stays the same](https://forum.proxmox.com/threads/issues-cloning-a-new-vm-machine-id-stays-the-same.156227/)
- [Uyuni #3478 — duplicate machine_id from cloned VMs](https://github.com/uyuni-project/uyuni/issues/3478)
- [denisbrodbeck/machineid #5 — no machine-id in docker](https://github.com/denisbrodbeck/machineid/issues/5)
- [VMware Photon: clearing the machine-id of a cloned instance](https://vmware.github.io/photon/assets/files/html/3.0/photon_admin/clearing-the-machine-id-of-a-cloned-instance-for-dhcp.html)
