resource_id derivation via gopsutil HostIDWithContext is unreliable across cloned VMs, shared containers, minimal images, and early boot #1633

@dekobon

Summary

The management-plane resource_id is derived on Linux from a chain ultimately rooted in /etc/machine-id (plus product_uuid and boot_id). resource_id is the primary identity used by NGINX One to correlate metrics, config, and telemetry with a host, so duplicates cause data to collide onto a single record and churning IDs orphan records on every restart.

This issue enumerates the ways the current implementation produces incorrect, colliding, or churning resource IDs.

Where the code lives

  • internal/grpc/grpc.go:90: info.ResourceID(ctx) is called when constructing the management-plane gRPC client.
  • pkg/host/info.go:140: ResourceID(): if IsContainer() → containerID(), else → hostID().
  • pkg/host/info.go:200: hostID() calls i.exec.HostID(ctx) and hashes the result with uuid.NewMD5(uuid.Nil, ...).
  • pkg/host/exec/exec.go:76: (*Exec).HostID() delegates to github.com/shirou/gopsutil/v4/host.HostIDWithContext.
  • go.mod:46: github.com/shirou/gopsutil/v4 v4.26.3.

gopsutil v4's Linux chain (host/host_linux.go:33-63) is:

  1. /sys/class/dmi/id/product_uuid (SMBIOS)
  2. /etc/machine-id (must be exactly 32 hex chars on the first line)
  3. /proc/sys/kernel/random/boot_id — gopsutil's own comment: "Not stable between reboot, but better than nothing"
  4. On total failure, returns ("", nil) — empty string with no error.
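
The fallthrough order above can be approximated in a few lines. This is a minimal sketch built only from the chain as described in this issue, not gopsutil's actual source: resolveHostID, readFirstLine, and fakeReader are hypothetical names introduced here so the behavior can be exercised without touching the real filesystem, and output formatting details are omitted.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// readFirstLine abstracts file access so the chain can be driven with fake
// data. Hypothetical helper type, not part of gopsutil.
type readFirstLine func(path string) (string, error)

// fakeReader builds a readFirstLine backed by an in-memory map; paths missing
// from the map behave like unreadable files.
func fakeReader(files map[string]string) readFirstLine {
	return func(path string) (string, error) {
		v, ok := files[path]
		if !ok {
			return "", errors.New("unreadable: " + path)
		}
		return v, nil
	}
}

// isHex32 reports whether s is exactly 32 hexadecimal characters.
func isHex32(s string) bool {
	if len(s) != 32 {
		return false
	}
	for _, c := range s {
		if !strings.ContainsRune("0123456789abcdefABCDEF", c) {
			return false
		}
	}
	return true
}

// resolveHostID walks the sources in the order listed above and returns the
// first usable value together with the path it came from.
func resolveHostID(read readFirstLine) (id, source string) {
	const (
		productUUID = "/sys/class/dmi/id/product_uuid"
		machineID   = "/etc/machine-id"
		bootID      = "/proc/sys/kernel/random/boot_id"
	)
	if v, err := read(productUUID); err == nil && v != "" {
		return v, productUUID
	}
	if v, err := read(machineID); err == nil && isHex32(v) {
		return v, machineID
	}
	if v, err := read(bootID); err == nil && v != "" {
		return v, bootID
	}
	// Mirrors step 4: no value, and no error for the caller to act on.
	return "", ""
}

func main() {
	// Unprivileged agent on a minimal image: product_uuid is unreadable and
	// machine-id is absent, so only boot_id remains.
	read := fakeReader(map[string]string{
		"/proc/sys/kernel/random/boot_id": "0b1c2d3e-0000-4000-8000-123456789abc",
	})
	id, source := resolveHostID(read)
	fmt.Println(id, source)
}
```

Returning the source path alongside the value is a design choice of this sketch: it lets a caller tell a stable identity apart from one that came from boot_id.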

Problems

1. Cloned VMs and unprepared golden templates

systemd requires /etc/machine-id to be absent or empty in VM templates so first boot regenerates it (systemd docs). This preparation step is skipped often enough to be a well-known class of bug: Proxmox, VMware Photon, Uyuni #3478. When multiple VMs share a machine-id, they hash to the same UUID and report identical resource_id.

gopsutil tries product_uuid first, which mitigates this only for root-privileged agents on hardware/hypervisors that expose a unique SMBIOS UUID. In practice:

  • product_uuid typically requires root. Unprivileged agents hit ReadLines → permission denied → fall through to machine-id.
  • Some hypervisors (including VMware template clones and certain Proxmox configs) propagate the same SMBIOS UUID to clones anyway.

2. uninitialized sentinel during early boot produces churn

During early boot, systemd writes the literal string uninitialized\n to /etc/machine-id and overmounts it until first-boot provisioning completes. gopsutil's 32-hex-char length check (host_linux.go:49) incidentally rejects the sentinel, but the chain then falls through to boot_id, which changes on every reboot. An agent started in this window derives its resource_id from boot_id, so it reports a different resource_id after the next boot and orphans the prior record.
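
One way to avoid the silent fallthrough would be to recognize the sentinel explicitly, so the caller can wait or retry instead of dropping to boot_id. A minimal sketch, assuming the caller has already read and trimmed the first line of /etc/machine-id; classifyMachineID and the state names are hypothetical and exist in neither gopsutil nor the agent.

```go
package main

import "fmt"

// machineIDState describes what the first line of /etc/machine-id contains.
type machineIDState int

const (
	machineIDValid         machineIDState = iota // 32 lowercase hex characters
	machineIDUninitialized                       // systemd early-boot sentinel
	machineIDMalformed                           // anything else, including empty
)

// classifyMachineID separates the early-boot sentinel from other invalid
// content, so "not ready yet" is not treated the same as "absent".
func classifyMachineID(firstLine string) machineIDState {
	if firstLine == "uninitialized" {
		return machineIDUninitialized
	}
	if len(firstLine) != 32 {
		return machineIDMalformed
	}
	for i := 0; i < len(firstLine); i++ {
		c := firstLine[i]
		isDigit := c >= '0' && c <= '9'
		isLowerHex := c >= 'a' && c <= 'f'
		if !isDigit && !isLowerHex {
			return machineIDMalformed
		}
	}
	return machineIDValid
}

func main() {
	fmt.Println(classifyMachineID("uninitialized") == machineIDUninitialized)
	fmt.Println(classifyMachineID("0123456789abcdef0123456789abcdef") == machineIDValid)
	fmt.Println(classifyMachineID("") == machineIDMalformed)
}
```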

3. Missing /etc/machine-id on minimal images produces per-restart churn

On busybox / distroless / some Alpine configurations, /etc/machine-id is absent. product_uuid often requires root. The chain falls to boot_id. Every container restart ⇒ new resource_id ⇒ orphaned records in NGINX One. In default Docker configurations the host /etc/machine-id is also not mounted in (denisbrodbeck/machineid#5).

4. All-sources-fail returns empty string, which hashes to a fixed UUID for every host

If every source fails, HostIDWithContext returns ("", nil). pkg/host/info.go:206 then computes uuid.NewMD5(uuid.Nil, []byte("")), which is a deterministic UUID produced by every host in this state — a fleet-wide collision, reported with no error.
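
The collision can be reproduced without the agent. The sketch below re-implements a name-based MD5 (version 3) UUID with the standard library only, on the assumption that this matches what uuid.NewMD5(uuid.Nil, ...) computes; md5UUID is a hypothetical stand-in, not the google/uuid API.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// md5UUID builds a version-3 (MD5, name-based) UUID string from a namespace
// and a name, following RFC 4122. With an all-zero namespace it is intended
// to correspond to uuid.NewMD5(uuid.Nil, name).
func md5UUID(namespace [16]byte, name []byte) string {
	h := md5.New()
	h.Write(namespace[:])
	h.Write(name)

	var u [16]byte
	copy(u[:], h.Sum(nil))
	u[6] = (u[6] & 0x0f) | 0x30 // set version 3
	u[8] = (u[8] & 0x3f) | 0x80 // set RFC 4122 variant

	return fmt.Sprintf("%x-%x-%x-%x-%x", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}

func main() {
	var nilNamespace [16]byte // all zeros, like uuid.Nil

	hostA := md5UUID(nilNamespace, []byte("")) // host A: every source failed
	hostB := md5UUID(nilNamespace, []byte("")) // host B: every source failed

	fmt.Println(hostA)
	fmt.Println(hostA == hostB) // true: unrelated hosts share one identity
}
```

Because the input is the same empty byte slice on every affected host, the output cannot differ between them; only a non-empty, host-specific input changes the result.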

5. Error swallowed at the gRPC boundary

internal/grpc/grpc.go:90-93 logs a warning on error from ResourceID() and proceeds with resourceID = "". The empty string is then passed into DialOptions(...). Agents that cannot derive an identity silently continue with an empty resource_id rather than refusing to start or surfacing the failure to the operator.
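
A stricter boundary could treat both an error and an empty value as fatal. The following is a sketch of that idea under stated assumptions, not the agent's code: requireResourceID and errNoResourceID are hypothetical names, and the derive callback stands in for ResourceID().

```go
package main

import (
	"errors"
	"fmt"
)

// errNoResourceID signals that no usable identity could be derived.
var errNoResourceID = errors.New("resource_id could not be derived")

// requireResourceID wraps a derivation function and never returns an empty
// identity without an error, so the caller must handle the failure.
func requireResourceID(derive func() (string, error)) (string, error) {
	id, err := derive()
	if err != nil {
		return "", fmt.Errorf("%w: %v", errNoResourceID, err)
	}
	if id == "" {
		return "", errNoResourceID
	}
	return id, nil
}

func main() {
	// Simulates the all-sources-fail case: empty value, nil error.
	_, err := requireResourceID(func() (string, error) { return "", nil })
	fmt.Println(errors.Is(err, errNoResourceID)) // true
}
```

Whether the caller then exits or retries is a policy decision; the point of the sketch is only that the empty string never reaches DialOptions(...) unnoticed.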

6. LXC containers sharing host machine-id

LXC containers frequently inherit the host's /etc/machine-id unless the administrator explicitly clears it (Proxmox forum). IsContainer() (pkg/host/info.go:114) checks /.dockerenv, /run/.containerenv, /var/run/secrets/kubernetes.io/serviceaccount, cgroup references for kubepods|docker|containerd|ecs|fargate, and ECS_CONTAINER_METADATA_URI_V4. LXC is not covered by any of these checks. LXC instances therefore take the hostID() path and land on the shared machine-id.
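
One possible additional signal is the container variable in the environment of PID 1, which LXC conventionally sets to lxc. The sketch below only parses an environment block in the NUL-separated format of /proc/<pid>/environ; containerFromEnviron is a hypothetical helper, and the assumption that the agent can read /proc/1/environ (which normally requires privileges) is not verified here.

```go
package main

import (
	"bytes"
	"fmt"
)

// containerFromEnviron scans a NUL-separated environment block and returns
// the value of the "container" variable, if present.
func containerFromEnviron(environ []byte) (string, bool) {
	prefix := []byte("container=")
	for _, kv := range bytes.Split(environ, []byte{0}) {
		if bytes.HasPrefix(kv, prefix) {
			return string(kv[len(prefix):]), true
		}
	}
	return "", false
}

func main() {
	// Sample data standing in for the contents of /proc/1/environ.
	sample := []byte("PATH=/usr/bin\x00container=lxc\x00TERM=linux\x00")
	value, found := containerFromEnviron(sample)
	fmt.Println(value, found) // lxc true
}
```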

7. Bind-mounted host /etc/machine-id inside containers

Red Hat has been shipping RHEL containers with the host's /etc/machine-id bind-mounted in since 2015 (Bugzilla #1286787). systemd-nspawn and some LXC configurations copy the host's machine-id into the container filesystem. If IsContainer() returns false for such a container, every container on that host reports the host's resource_id.

When IsContainer() does return true but containerIDFromMountInfo fails, ResourceID() correctly returns an error rather than falling back to hostID() — so the agent does not leak the host ID into containers through this path specifically. However, per problem 5, the error is swallowed upstream anyway, leaving the agent with resource_id="".

8. containerIDFromMountInfo regex coverage

pkg/host/info.go:305 matches against five regex patterns. Unusual runtime configurations, some cgroup v2 paths, and custom CRI shims can produce mountinfo that matches none of them. On failure, containerID() errors, and the agent ends up with an empty resource_id per problem 5.

9. No /var/lib/dbus/machine-id fallback

gopsutil v4 does not consult /var/lib/dbus/machine-id, which on some systems is the only present machine-id file (e.g. non-systemd installations where the dbus variant was populated by a package but /etc/machine-id was not).

10. No operator override

No config knob, environment variable, or file-based override exists to force a specific resource_id. An operator who knows their fleet has duplicate machine-ids and cannot fix the source has no escape hatch.
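
An escape hatch could be as small as an environment variable consulted before any derivation. A minimal sketch, assuming a variable named NGINX_AGENT_RESOURCE_ID; that name, resourceIDWithOverride, and the getenv parameter are all hypothetical and do not exist in the agent today.

```go
package main

import (
	"fmt"
	"strings"
)

// overrideEnvVar is a hypothetical variable name chosen for illustration.
const overrideEnvVar = "NGINX_AGENT_RESOURCE_ID"

// resourceIDWithOverride returns the operator-supplied value when one is set
// and falls back to the normal derivation otherwise. getenv is injected so
// the lookup can be tested; real code could pass os.Getenv.
func resourceIDWithOverride(getenv func(string) string, derive func() (string, error)) (string, error) {
	if v := strings.TrimSpace(getenv(overrideEnvVar)); v != "" {
		return v, nil
	}
	return derive()
}

func main() {
	env := map[string]string{overrideEnvVar: "edge-proxy-01"}
	getenv := func(key string) string { return env[key] }

	id, err := resourceIDWithOverride(getenv, func() (string, error) {
		return "derived-from-machine-id", nil
	})
	fmt.Println(id, err) // edge-proxy-01 <nil>
}
```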

Impact

Any two agents whose chain resolves to the same bytes — whether via cloned VM machine-ids, shared SMBIOS UUIDs, LXC templates, bind-mounted host files, or all-sources-empty — will report identical resource_id to NGINX One with no warning logged. Any agent that lands on boot_id (early boot, missing machine-id on minimal images) will churn its resource_id on every reboot/restart, orphaning records. Any agent whose ResourceID() errors out will silently continue with an empty resource_id due to the warning-only handling in internal/grpc/grpc.go.
