Skip to content

ipvs: tolerate distinct VSGs resolving to the same kernel service tuple#2691

Open
pierrecdn wants to merge 2 commits into
acassen:masterfrom
criteo-forks:fix/ipvs-reload-wedge
Open

ipvs: tolerate distinct VSGs resolving to the same kernel service tuple#2691
pierrecdn wants to merge 2 commits into
acassen:masterfrom
criteo-forks:fix/ipvs-reload-wedge

Conversation

@pierrecdn
Copy link
Copy Markdown
Contributor

The docs (doc/man/man5/keepalived.conf.5.in:2280) forbid two VSes using the same VSG with the same protocol/family.
The same kernel constraint - one IPVS service per (proto, vip, port) or (proto, family, fwmark) - applies when distinct VSGs emit entries that collide on that tuple.

While that should not happen, in practice (provisioning streams, churn between renders on automation system) it sometimes does, and keepalived currently lets it silently desync from the kernel and never recovers.
Typically, every reload then logs IP_VS_SO_SET_ADDDEST(1159) ENOENT(2) / IP_VS_SO_SET_DELDEST ENOENT against a service the kernel no longer has but the surviving VSG still expects.

These patches makes keepalived hopefully deal with that case:

  1. ipvs: recover from missing parent service on ADDDEST/EDITDEST: when the kernel returns ENOENT on a dest add, recreate the parent service from the same srule and retry once. No behavior change is expected on healthy reloads.
  2. ipvs: do not trust per-vsge "reloaded" mark on LVS_CMD_ADD: drop the reloaded skip and is_vsge_alive shortcut on the ADD path; EEXIST is already idempotent in ipvs_talk(), so unconditionally re-issuing it prevents the wedge from forming.

Reproducer

Two VSGs (A, B) both containing e.g. 2001:db8::1 80 with protocol tcp, alive RSes.
Reload after dropping the entry from B only. Without the series: the kernel service for (tcp, 2001:db8::1, 80) disappears, syslog starts streaming ADDDEST/DELDEST ENOENT, and the wedge survives next reloads.

With patch 1 the next ADDDEST repairs it. With patch 2 it never forms.

When several Virtual Service Groups resolve to the same (proto, addr,
port) IPVS tuple (distinct VSG names with overlapping entries, common
in dynamically generated configs), the kernel only stores one service
for the tuple.

Each VSG has its own per-VSG entry alive counter, so a clear_diff_vsg()
that removes one VSG's entry can issue LVS_CMD_DEL on the shared
service while the surviving VSG was just marked vsge->reloaded=true and
will be skipped on the next LVS_CMD_ADD pass. From that point on every
healthcheck transition logs:

IPVS cmd IP_VS_SO_SET_ADDDEST(1159) error: No such file or directory(2)

because the kernel cannot find the parent service to attach the dst to.

This change makes ipvs_talk() self-heal: when ADDDEST returns ENOENT,
re-create the service from the same srule and retry the dest add once.

If the service recreation fails for another reason, the original errno
is restored and let the existing error path log it.

Signed-off-by: Pierre Cheynier <p.cheynier@criteo.com>
ipvs_group_cmd() used to skip LVS_CMD_ADD for any VSG entry with
vsg_entry->reloaded set on a reload, and to additionally skip when
!is_vsge_alive(). Both checks reflect keepalived's in-memory
state, potentially different than the actual kernel state.

When several VSGs (different vsgnames, possibly emitted by a
provisioning template that does not deduplicate across frontends)
resolve to the same IPVS tuple (proto, addr, tuple), the kernel only
keeps one service. Any clear_diff_vsg() pass that removes one VSG's
entry can issue LVS_CMD_DEL on the shared service while the
surviving VSG is just marked "reloaded=true" and so is skipped
on the next reloads. The kernel state then diverges from keepalived's
view and stays wedged forever: re-skips the ADD because reloaded=true,
every checker transition logs IP_VS_SO_SET_ADDDEST ENOENT and
IP_VS_SO_SET_DELDEST ENOENT against the missing parent service.

This commit always issue LVS_CMD_ADD: ipvs_talk() already turns
EEXIST on ADD into success, so the operation is idempotent and
self-healing.

To prevent log noise from the new EEXIST flood (one per reloaded
VSG entry per reload), suppress logging when ipvs_talk() turns EEXIST
or ENOENT into success (the "benign" case).

Signed-off-by: Pierre Cheynier <p.cheynier@criteo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant