Summary
Define explicit cross-backend coordination primitives for Dynamo discovery: atomic create-if-absent, generic compare-and-set (CAS) on revisioned keys, and bounded lease/slot acquisition for service leader election. The immediate goal is to make the current singleton-claim pattern explicit and extensible, especially for FileStore, while leaving room for future workflows such as allowing exactly N served indexers or leaders per component.
Motivation
PR #9302 fixes a concrete race for Approximate served indexers: topology validation currently does a distributed list-then-register sequence, so two processes can both observe that no indexer endpoint exists and then both start. Approximate indexers are not synchronized with each other, so two indexers in one Dynamo component can make routing materially wrong.
That PR only needs the narrow CAS-on-absence case: one canonical claim key transitions from absent to present exactly once. This is currently expressed through Bucket::insert(..., revision = 0), where etcd and NATS already model revision 0 as create-if-absent, and FileStore needs the same no-clobber behavior to make the claim real.
However, future coordination needs are broader than a single singleton claim. For example, automatic service leader election or bounded service ownership may want exactly three indexers/leaders in a component, plus renewal, release, and inspection of current holders. Callers should not each reinvent that as ad-hoc discovery list checks. We should define the coordination contract once and make backend-specific guarantees explicit.
Proposal
Add an explicit coordination API to the runtime discovery/KV layer. Exact names are open, but the semantic pieces should be:
- create_if_absent(key, value, ttl) -> Created | Exists: atomic absent-to-present transition.
- compare_and_set(key, expected_revision, value, ttl) -> Created/Updated | Conflict: update only when the current revision matches the expected revision.
- acquire_slot(group_key, slot_count, owner, ttl) -> Option<SlotLease>: acquire one of a bounded set of lease slots, e.g. one leader or up to three indexers per component.
- renew(lease) and release(lease): keep a claim alive or explicitly give it up.
- list_holders(group_key): inspect currently valid holders for debugging and orchestration.
The singleton served-indexer claim from PR #9302 is then the special case acquire_slot(group_key, 1, owner, ttl) or create_if_absent(claim_key, owner, ttl).
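As a rough illustration of how these pieces could fit together, the sketch below expresses them as a Rust trait. Every name in it (`Coordination`, `CreateOutcome`, `CasOutcome`, `CoordError`, `claim_singleton`, `FailClosed`) is hypothetical and chosen for this example only; the TTL is assumed to be a `Duration`, and revision 0 is assumed to mean "absent", mirroring the existing insert(..., revision = 0) convention. The `FailClosed` backend shows one way a backend without atomic support might refuse claims instead of reporting advisory success.

```rust
use std::time::Duration;

/// Backend revision; 0 is assumed here to mean "absent".
pub type Revision = u64;

#[derive(Debug, PartialEq, Eq)]
pub enum CreateOutcome { Created, Exists }

#[derive(Debug, PartialEq, Eq)]
pub enum CasOutcome { Created, Updated, Conflict }

/// Handle for one acquired slot in a bounded group (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SlotLease {
    pub group_key: String,
    pub slot: usize,
    pub owner: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CoordError {
    /// The backend cannot enforce this primitive atomically.
    Unsupported(&'static str),
    Backend(String),
}

pub type CoordResult<T> = Result<T, CoordError>;

pub trait Coordination {
    fn create_if_absent(&self, key: &str, value: &[u8], ttl: Duration) -> CoordResult<CreateOutcome>;
    fn compare_and_set(&self, key: &str, expected_revision: Revision, value: &[u8], ttl: Duration) -> CoordResult<CasOutcome>;
    fn acquire_slot(&self, group_key: &str, slot_count: usize, owner: &str, ttl: Duration) -> CoordResult<Option<SlotLease>>;
    fn renew(&self, lease: &SlotLease) -> CoordResult<()>;
    fn release(&self, lease: SlotLease) -> CoordResult<()>;
    fn list_holders(&self, group_key: &str) -> CoordResult<Vec<SlotLease>>;

    /// The singleton claim is the slot_count == 1 special case.
    fn claim_singleton(&self, group_key: &str, owner: &str, ttl: Duration) -> CoordResult<Option<SlotLease>> {
        self.acquire_slot(group_key, 1, owner, ttl)
    }
}

/// A backend with no atomic support: every claim fails closed.
pub struct FailClosed;

impl Coordination for FailClosed {
    fn create_if_absent(&self, _: &str, _: &[u8], _: Duration) -> CoordResult<CreateOutcome> {
        Err(CoordError::Unsupported("create_if_absent"))
    }
    fn compare_and_set(&self, _: &str, _: Revision, _: &[u8], _: Duration) -> CoordResult<CasOutcome> {
        Err(CoordError::Unsupported("compare_and_set"))
    }
    fn acquire_slot(&self, _: &str, _: usize, _: &str, _: Duration) -> CoordResult<Option<SlotLease>> {
        Err(CoordError::Unsupported("acquire_slot"))
    }
    fn renew(&self, _: &SlotLease) -> CoordResult<()> {
        Err(CoordError::Unsupported("renew"))
    }
    fn release(&self, _: SlotLease) -> CoordResult<()> {
        Err(CoordError::Unsupported("release"))
    }
    fn list_holders(&self, _: &str) -> CoordResult<Vec<SlotLease>> {
        Err(CoordError::Unsupported("list_holders"))
    }
}
```

Keeping the outcome enums separate from the error type lets callers distinguish "someone else holds the claim" (a normal result) from "this backend cannot make the guarantee" (an error).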
Backend Semantics
For etcd, implement this with transactions and leases. create_if_absent compares key version to 0. Generic CAS compares the stored version/mod revision with the caller's expected revision. Lease TTL and renewal map directly to etcd leases.
For Kubernetes discovery, implement this with a real Kubernetes coordination object before advertising support. Possible approaches include coordination.k8s.io/Lease objects, or a Dynamo-specific custom resource using Kubernetes create/update preconditions and resourceVersion. A backend must not report success for a claim it cannot enforce atomically.
For MemoryStore, use the existing mutex-protected map and store revision/lease metadata in memory. This gives deterministic unit-test behavior but does not imply distributed safety.
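A minimal sketch of the MemoryStore idea, assuming a single mutex around a map of revisioned entries. The type and method names (`MemoryCoordinator`, `compare_and_set`, `get`) are illustrative and not the existing MemoryStore API. It assumes revision 0 means "absent", uses a store-wide counter so revisions never repeat after expiry, and treats expired entries as absent.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum CasOutcome {
    Created(u64),
    Updated(u64),
    Conflict,
}

struct Entry {
    value: Vec<u8>,
    revision: u64,
    expires_at: Option<Instant>,
}

#[derive(Default)]
struct State {
    entries: HashMap<String, Entry>,
    last_revision: u64, // store-wide, monotonically increasing
}

#[derive(Default)]
pub struct MemoryCoordinator {
    state: Mutex<State>,
}

impl MemoryCoordinator {
    /// Writes `value` only if the key's live revision equals `expected`.
    /// `expected == 0` means "the key must be absent or expired".
    pub fn compare_and_set(&self, key: &str, expected: u64, value: &[u8], ttl: Option<Duration>) -> CasOutcome {
        let mut state = self.state.lock().unwrap();
        let now = Instant::now();
        // Expired entries are treated as absent (revision 0).
        let current = match state.entries.get(key) {
            Some(e) if e.expires_at.map_or(true, |t| t > now) => e.revision,
            _ => 0,
        };
        if current != expected {
            return CasOutcome::Conflict;
        }
        state.last_revision += 1;
        let revision = state.last_revision;
        let entry = Entry { value: value.to_vec(), revision, expires_at: ttl.map(|d| now + d) };
        state.entries.insert(key.to_string(), entry);
        if current == 0 { CasOutcome::Created(revision) } else { CasOutcome::Updated(revision) }
    }

    /// Returns the live value and its revision, if any.
    pub fn get(&self, key: &str) -> Option<(Vec<u8>, u64)> {
        let state = self.state.lock().unwrap();
        let now = Instant::now();
        state
            .entries
            .get(key)
            .filter(|e| e.expires_at.map_or(true, |t| t > now))
            .map(|e| (e.value.clone(), e.revision))
    }
}
```

Because the compare and the write happen under one lock, the outcome is deterministic within a process, which is what unit tests need; nothing here coordinates across processes.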
FileStore Design
FileStore should not rely on filename suffixes alone for generic CAS. Suffixes such as key.rev42 can identify immutable values, but they do not provide a serialization point for the logical key. Two writers could still both observe revision 42 and publish revision 43 unless there is one canonical object used for the compare.
Use one canonical per-key serialization point:
- Store a canonical file for each logical key, or a canonical head file pointing to immutable value files.
- Store durable metadata with the current revision, owner, optional expiration/deadline, and value pointer or inline value.
- Use same-directory temp files for writes and atomic rename for publish, as FileStore already does for watcher-safe updates.
- Use an operation lock per logical key, such as an atomic mkdir lock directory or exclusive-create lock file, to serialize generic CAS updates.
- For the narrow create_if_absent case, use the canonical file's no-clobber create path directly when possible.
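One possible shape for the no-clobber create path is sketched below. It is an assumption, not FileStore's current code: the value is written to a same-directory temp file and then published with a hard link, which fails with `AlreadyExists` if the canonical name is taken, so readers never see a partially written claim. The function name, the temp-file naming scheme, and the reliance on hard-link support in the underlying filesystem are all choices made for this example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;
use std::sync::atomic::{AtomicU64, Ordering};

// Distinguishes temp files created by concurrent threads in one process.
static TMP_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Returns Ok(true) if this call created the key, Ok(false) if it already existed.
pub fn create_if_absent(canonical: &Path, value: &[u8]) -> io::Result<bool> {
    let dir = canonical.parent().unwrap_or_else(|| Path::new("."));
    let name = canonical
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "key path has no file name"))?;

    // Hypothetical temp naming: .<key>.tmp.<pid>.<counter>
    let n = TMP_COUNTER.fetch_add(1, Ordering::Relaxed);
    let tmp = dir.join(format!(".{}.tmp.{}.{}", name.to_string_lossy(), std::process::id(), n));

    // Write and flush the full value before it becomes visible.
    let mut file = File::create(&tmp)?;
    file.write_all(value)?;
    file.sync_all()?;
    drop(file);

    // Publish: hard_link refuses to replace an existing canonical file.
    let outcome = match fs::hard_link(&tmp, canonical) {
        Ok(()) => Ok(true),
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    };

    // The temp name is no longer needed whether or not the link succeeded.
    let _ = fs::remove_file(&tmp);
    outcome
}
```

A plain exclusive-create open of the canonical file would also be atomic, but the file would be visible while still empty; the link-based publish trades one extra syscall for watcher safety.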
A generic FileStore CAS update should follow this shape:
- Acquire the per-key operation lock.
- Read the canonical head/value metadata.
- Treat expired entries as absent if the key has a TTL lease.
- Compare the current revision with the expected revision.
- If the compare fails, return Conflict without modifying the value.
- If the compare passes, write the new metadata/value to a same-directory temp file.
- Flush the temp file and atomically rename it over the canonical file/head.
- Flush the parent directory where supported.
- Release the operation lock.
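The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than a proposed implementation: the head file format (decimal revision on the first line, raw value after it), the `<key>.lock` directory name, and the function names are invented for the example, revision 0 is taken to mean "absent", and the TTL expiry step is omitted for brevity. A held lock surfaces as an `AlreadyExists` error that the caller would retry.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

#[derive(Debug, PartialEq, Eq)]
pub enum CasOutcome {
    Written(u64),
    Conflict,
}

/// Removes the lock directory when dropped, on every return path.
struct KeyLock(PathBuf);

impl Drop for KeyLock {
    fn drop(&mut self) {
        let _ = fs::remove_dir(&self.0);
    }
}

fn malformed() -> io::Error {
    io::Error::new(io::ErrorKind::InvalidData, "malformed head file")
}

/// Reads (revision, value); a missing file is revision 0 with an empty value.
pub fn read_head(path: &Path) -> io::Result<(u64, Vec<u8>)> {
    match fs::read(path) {
        Ok(bytes) => {
            let pos = bytes.iter().position(|b| *b == b'\n').ok_or_else(malformed)?;
            let rev = std::str::from_utf8(&bytes[..pos])
                .ok()
                .and_then(|s| s.parse::<u64>().ok())
                .ok_or_else(malformed)?;
            Ok((rev, bytes[pos + 1..].to_vec()))
        }
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok((0, Vec::new())),
        Err(e) => Err(e),
    }
}

pub fn compare_and_set(dir: &Path, key: &str, expected: u64, value: &[u8]) -> io::Result<CasOutcome> {
    let head = dir.join(key);

    // Acquire the per-key lock: create_dir is atomic and fails if it exists.
    let lock_path = dir.join(format!("{}.lock", key));
    fs::create_dir(&lock_path)?;
    let _lock = KeyLock(lock_path);

    // Read the canonical head and compare revisions.
    let (current, _) = read_head(&head)?;
    if current != expected {
        return Ok(CasOutcome::Conflict);
    }

    // Write the new revision and value to a same-directory temp file.
    // A fixed temp name is safe here because the lock is held.
    let next = current + 1;
    let tmp = dir.join(format!(".{}.tmp", key));
    let mut file = File::create(&tmp)?;
    writeln!(file, "{}", next)?;
    file.write_all(value)?;
    file.sync_all()?;
    drop(file);

    // Publish atomically, then flush the directory where the platform allows it.
    fs::rename(&tmp, &head)?;
    if let Ok(d) = File::open(dir) {
        let _ = d.sync_all();
    }
    Ok(CasOutcome::Written(next))
}
```

Releasing the lock through `Drop` keeps the conflict and error paths from leaking it; only a crash between `create_dir` and the drop leaves a stale lock behind.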
Crash behavior should be explicit:
- Orphan temp files are ignored by readers and cleaned opportunistically.
- A stale operation lock may be recovered only after a conservative lock timeout.
- A successfully renamed canonical file is the source of truth on restart.
- FileStore should document local-filesystem assumptions and any weaker guarantees on network filesystems.
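Two of these rules lend themselves to small helpers, sketched here under assumptions. `is_temp_name` assumes temp files follow a hypothetical `.<key>.tmp...` naming scheme, and `lock_is_stale` assumes a lock's age can be judged from the lock directory's modification time. Neither name exists in FileStore today. The staleness check only reports whether a lock looks abandoned; removing it safely when several processes attempt recovery at once needs further care that is out of scope here.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// True for names that readers and watchers should skip as temp files.
/// Assumes the hypothetical naming scheme `.<key>.tmp...`.
pub fn is_temp_name(file_name: &str) -> bool {
    file_name.starts_with('.') && file_name.contains(".tmp")
}

/// True if the lock directory exists and is at least `timeout` old.
/// A missing lock is not stale: there is nothing to recover.
pub fn lock_is_stale(lock_dir: &Path, timeout: Duration) -> io::Result<bool> {
    let modified = match fs::metadata(lock_dir) {
        Ok(meta) => meta.modified()?,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    // If the clock moved backwards, treat the lock as brand new.
    let age = SystemTime::now()
        .duration_since(modified)
        .unwrap_or(Duration::ZERO);
    Ok(age >= timeout)
}
```

The timeout should be much longer than any legitimate CAS critical section, since recovering a lock that is still held would break the serialization guarantee.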
For bounded slot acquisition, FileStore can avoid full arbitrary CAS in the common case by using canonical slot claim files:
group_key/slot-0
group_key/slot-1
group_key/slot-2
Each slot is acquired with atomic create-if-absent and kept alive with TTL/mtime or embedded lease metadata. This is simpler than general revisioned updates and maps directly to the leader-election use case.
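A minimal sketch of slot acquisition over such claim files, assuming exclusive-create opens are atomic on the target filesystem. The `SlotLease` fields and function names are illustrative, and TTL/renewal handling is left out. Because the file is created before the owner is written, a holder listing may briefly see an empty slot file; a real implementation might publish via a temp file instead.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Handle for a held slot (hypothetical shape).
#[derive(Debug)]
pub struct SlotLease {
    pub slot: usize,
    pub path: PathBuf,
}

/// Tries slot-0 .. slot-(N-1) in order and claims the first free one.
/// Returns Ok(None) when every slot is already held.
pub fn acquire_slot(group_dir: &Path, slot_count: usize, owner: &str) -> io::Result<Option<SlotLease>> {
    fs::create_dir_all(group_dir)?;
    for slot in 0..slot_count {
        let path = group_dir.join(format!("slot-{}", slot));
        // create_new(true) fails with AlreadyExists instead of clobbering.
        match OpenOptions::new().write(true).create_new(true).open(&path) {
            Ok(mut file) => {
                file.write_all(owner.as_bytes())?;
                file.sync_all()?;
                return Ok(Some(SlotLease { slot, path }));
            }
            Err(e) if e.kind() == io::ErrorKind::AlreadyExists => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(None)
}

/// Gives the slot up by removing its claim file.
pub fn release(lease: SlotLease) -> io::Result<()> {
    fs::remove_file(&lease.path)
}
```

With a slot count of 1 this reduces to the singleton claim: the only candidate file is slot-0, and exactly one caller can create it.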
Alternate Solutions
Keep only insert(..., revision = 0) create-if-absent semantics. This is enough for the current Approximate served-indexer singleton race, but it leaves future bounded leader election and revisioned updates undefined.
Require etcd or Kubernetes for all coordination and keep FileStore as best-effort local development only. This reduces FileStore complexity, but local tests and no-etcd workflows would not exercise the same coordination contracts as distributed deployments.
Implement leader election separately in each caller. This is likely to duplicate list-then-register races, backend-specific TTL handling, and cleanup logic across router/discovery code.
Requirements
- Backends must clearly state which primitives they support atomically.
- Unsupported backends must fail closed rather than returning advisory success.
- FileStore must preserve watcher safety: readers and watchers must not observe partial writes or temp files as real entries.
- FileStore CAS must be cross-process safe on supported local filesystems.
- TTL/lease behavior must be consistent enough for service failover and clean restart expectations.
- Tests should cover concurrent acquisition, conflict behavior, stale/expired claims, release, renewal, and crash leftovers where practical.
- The initial implementation may start with create_if_absent and bounded slots before committing to a full generic CAS API.
References