
✨ Read your own write client design#3473

Open
alvaroaleman wants to merge 1 commit into kubernetes-sigs:main from alvaroaleman:rowdesign

Conversation

@alvaroaleman
Member

Related to #3320
Minimal POC in #3472

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 8, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 8, 2026
called, all `Get` requests to the GVK+key and all `List` requests to the GVK will be blocked until the cache observed the
passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey`
blocking the request
* Add an internal-only `AddRequiredDeleteForObject(obj client.Object) error` method. Once set, all `Get` requests to the GVK+key
Member Author

This hinges on the assumption that we are guaranteed to get a delete event that contains the object with at least namespace, name and UID, even if we end up getting a tombstone from the cache. I think that is the case, but it would be good to get confirmation.


I believe so as well; the RV is the main thing that cannot be guaranteed due to the possibility of dropped events. We'd have to see what happens in the case that we drop watch events and have to relist, however. I wonder if anything strange occurs in the scenario that we delete -> recreate -> delete while our events get dropped. I don't think this would affect the internal implementation though.

* Add an internal-only `AddRequiredDeleteForObject(obj client.Object) error` method. Once set, all `Get` requests to the GVK+key
of object and all `List` requests to the GVK of object will be blocked until a delete event for GVK+UID of object was
observed OR the request's context times out OR `RemoveRequiredDeleteForObject` is called for the same object. The object(s)
whose deletion is awaited are copied before waiting to avoid subsequent `AddRequiredDeleteForObject` calls blocking existing
Member Author

The whole having to observe the delete event business is quite finicky, it would be a lot easier if deletes would always provide an rv, but according to @liggitt that is problematic: kubernetes/kubernetes#134937 (comment)


Yeah, I wish there was a better way to deal with delete events. Might be good to discuss if there's anything we can do upstream to make this behavior work more nicely.

@serathius

/cc @michaelasp

@k8s-ci-robot k8s-ci-robot requested a review from michaelasp March 9, 2026 08:59

The basic idea of the implementation is to make writes block concurrent reads to the same GVK+Key for Get and GVK for List
before the request is sent. If the request succeeds, the client provides the cache with either the returned resourceVersion
or the gvk+objectkey+uid of an object if it was deleted from storage, then unblocks the reads. The cache will then copy this
RV/deleted object within the get/list and block the request until it has observed it or the request's context times out.

What happens in the case where we trigger the delete but then the object gets recreated and we drop events forcing a relist? UID should be enough for this case right?

Member Author

Yeah, that is my understanding


## Goals

* Any read from the client (`Get` or `List`) contains all writes performed through the same client that were

This will cause certain high-throughput cases to become slower; we should make sure that these cases are documented. The main case that comes to mind is some sort of controller that:

  1. Receives an event
  2. Does a list that does not need read your own write consistency
  3. Does some operation on the GVK of that list

Now we would have to block for every write on that list until the cache catches up. Agreed however that most controllers would benefit from having the cache up to date with its writes.

Member Author

Yeah, fundamentally the tradeoff is performance vs consistency. I believe that consistency is more important in most cases. For the cases where it is not, this whole feature can be disabled.

Member

@sbueringer sbueringer Mar 14, 2026

We could add an option to Get/List calls that allows disabling consistent reads (not sure if we should do that initially, but I think it's something that we could add later as well).

EDIT: Saw that the option is mentioned below

Member

Somewhat related. I think sooner or later we should document some patterns on how to use this feature (including trade-offs involved).

E.g. I'm not sure how much sense it makes to share one consistency client across many (different) reconcilers in the same controller. If the client is shared, all reconcilers get blocked, but because these reconcilers are running concurrently anyway, there might not be much need/benefit for read-your-own-write across reconcilers.

Concrete example from CAPI:

  • It would be helpful to use a consistency client for the MachineSet reconciler so subsequent Reconciles of the same MachineSet can observe the Machines that have been created/deleted in previous Reconcile calls
  • But it doesn't necessarily make sense to use the same consistency client for the Cluster / MachineDeployment / Machine / ... reconcilers, because they have to deal with the fact that the MachineSet reconciler is running concurrently anyway (i.e. no need to introduce a "global mutex")

back to the client for any reason like the connection breaking. This is primarily for practical
purposes: if the write doesn't get a response indicating success or failure back, it is impossible to tell whether
the write did or did not succeed
* Automatically dealing with configurations where the cache doesn't contain all objects that are written, for

Yeah this is tricky, if we don't have all the objects then it is not clear how we track the RV in the store. I'd like to look more at how the cache in controller runtime currently works with field/label selectors.

Member Author

This here is how it is configured: https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.23.3/pkg/cache#Config

Fundamentally it boils down to being able to configure a field and/or label selector per GVK and/or namespace, and namespace(s) per GVK. The latter is implemented client-side, since the KAS does not allow watching multiple namespaces.



## Goals

* Any read from the client (`Get` or `List`) contains all writes performed through the same client that were
committed by the server at the time the read happened
Member

It would be good to also describe what happens in the following case:

  • Write succeeds
  • RV doesn't show up in the cache (either not at all or it takes a long time (minutes))

E.g. we could return an error in the Get/List calls after some timeout. We could also fall back to a live read to the apiserver (I assume that's not what we want to do; IIRC some of the caches in the apiserver work that way)

(EDIT: I saw this case is mentioned in the implementation, but would be probably good to highlight it here as well)

example because its label or field selector doesn't match them. We will provide an option for this case and
may or may not try to detect this correctly in the future
* Support `DeleteAllOf` - The apiservers response to DeleteAllOf is insufficient to implement this correctly. DeleteAllOf
will error and instruct the user to use `List` and `Delete`
Member

I'm fine with this, I assume implementing DeleteAllOf internally as a List + Delete would be a bit too much magic/surprising?

Comment on lines +45 to +46
The basic idea of the implementation is to make writes block concurrent reads to the same GVK+Key for Get and GVK for List
before the request is sent. If the request succeeds, the client provides the cache with either the returned resourceVersion
Member

I think it could be a worthwhile optimization to only block for the same GVK+namespace for List for namespaced List calls


The basic idea of the implementation is to make writes block concurrent reads to the same GVK+Key for Get and GVK for List
before the request is sent. If the request succeeds, the client provides the cache with either the returned resourceVersion
or the gvk+objectkey+uid of an object if it was deleted from storage, then unblocks the reads. The cache will then copy this
Member

nit: "then unblocks the reads" seems wrong at this point? (the reads are unblocked "when it observed it" (below) right?)

Comment on lines +52 to +53
The implementation is gated behind a `ReadYourOwnWrite *bool` `client.Options.Cache` setting. It will initially be disabled
by default; the goal is to enable it by default once we are confident in the implementation.
Member

Let's introduce some metrics to make it possible to gather data on how this feature behaves (e.g. in something similar in CAPI I introduced a metric for the "wait duration"). The number of wait timeouts would probably also be good.

Probably worth checking if k/k introduced metrics for related work


* Add an internal-only `SetMinimumRVForGVKAndKey(gvk schema.GroupVersionKind, key client.ObjectKey, rv int64)` method. Once
called, all `Get` requests to the GVK+key and all `List` requests to the GVK will be blocked until the cache observed the
passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey`
Member

I wonder if we should add an additional request-level timeout per default in the consistency client. I would not be surprised if a huge majority of controller-runtime users are not setting an overall Reconciliation or request timeout.

IIRC we didn't set a ReconciliationTimeout per default when we introduced it to avoid breaking anyone.
A cached get/list call today is not something that typically times out, so I think a lot of folks haven't realized yet that they should set timeouts, because they might never have hit issues without one.

passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey`
blocking the request
* Add an internal-only `AddRequiredDeleteForObject(obj client.Object) error` method. Once set, all `Get` requests to the GVK+key
of object and all `List` requests to the GVK of object will be blocked until a delete event for GVK+UID of object was
Member

nit: Would it make sense to "reduce" the parameter we take in this func from a full client.Object to: gvk, key, uid? (similar for RemoveRequiredDeleteForObject)

### Changes to the Client

Add a new `readYourOwnWriteClient` wrapper to the client, which will wrap any new client that is constructed with
`options.Cache.ReadYourOwnWrite: new(true)`. This wrapper maintains a map of locks for gvk+key. It will wrap all
Member

I assume this is referring to client.Options.

Would be great if it would be possible to enable/disable this per GVK

Wondering if we should simplify the field / option below to something like "ConsistentRead" (we shouldn't do that if it would be misleading though)

case of `Get`, or all current lock holders for the GVK in the case of `List`, release their lock or the request's context
expires.

Add a new `DisableReadYourOwnWriteConsistency` option that can be used for either `Get` or `List` and disables the
Member

Do we also need an option to avoid blocking Reads on the Write methods?

(Specifically wondering what happens out of the box for a Write that never shows up in the cache because the object doesn't match the selectors)

will error and instruct the user to use `List` and `Delete`
* Fail writes that use optimistic locking clientside - This could be done in the future, but is initially out of scope

## Implementation
Member

@sbueringer sbueringer Mar 14, 2026

Would it make sense to short-circuit some of the code if the write request didn't change the RV?

I think there are many cases where that can happen:

  • SSA no-op
  • Empty patches
  • Patches that apply changes that are changed back by conversion or mutating webhooks (so effectively they don't change anything)

None of these are ideal, but stuff like this happens

* Add an internal-only `SetMinimumRVForGVKAndKey(gvk schema.GroupVersionKind, key client.ObjectKey, rv int64)` method. Once
called, all `Get` requests to the GVK+key and all `List` requests to the GVK will be blocked until the cache observed the
passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey`
blocking the request
Member

Let's assume that for some reason the cache doesn't catch up at all. (I'm aware that today we don't have a way to handle that)

Would it be possible now to handle cases like that, if yes how? (not saying we have to do this automatically, just wondering if we expose something that makes it possible to handle cases like that)

I'm wondering if in cases like that it would be e.g. desirable that the entire controller binary restarts, similar to what happens on cache timeout during startup.

IIRC some controllers in k/k (in kcm?) basically stop reconciling now until the cache catches up. I think that's also what we would get with what is currently described in the proposal. I don't know if there is any additional handling if the cache never catches up.

or the gvk+objectkey+uid of an object if it was deleted from storage, then unblock the reads. The cache will then copy this
RV/deleted object within the get/list and block the request until it observed it or the requests context times out.
It is important that we block before executing the write request and not after, because we cannot know exactly when the
server commits it, only when it tells us that it has committed it.
Member

(comment belongs slightly above, but collides with other comments)

Probably a stupid question, just want to double check: we can rely on events from a watch stream being ordered by RV, right? (Otherwise we would potentially unblock too early.)

Similarly, I got this from an LLM and I'm honestly not sure whether this is an issue or not; it sounds to me like it might be. I unfortunately don't have the time before PTO to figure it out (feel free to ignore if it doesn't make sense to you)

## Behavior during informer relist is unspecified

When an informer's watch connection breaks and it relists, the reflector replays
the full state. If a minimum RV was set but the relist's "initial list" RV is
lower than the target (because the list happened before the write was committed),
the `ConsistencyHandler` might not advance `observedRV` during the relist. The
subsequent synthetic Add events would then advance it, but there's a window
during relist where the cache state is being replaced and objects might
temporarily disappear.

The design should discuss whether relist can violate the consistency guarantee,
and whether the `ConsistencyHandler` needs to handle relist events specially
(e.g., the `isInInitialList` parameter in `OnAdd` that the POC ignores).
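For reference, client-go's `ResourceEventHandler.OnAdd` does expose an `isInInitialList bool` parameter. One conservative option, sketched below with illustrative names (the `observe` callback stands in for the RV bookkeeping; this is not the POC's handler), would be to not advance the observed RV from initial-list replays at all and rely on the subsequent watch events; whether that is the right behavior for relists is exactly what the design should pin down:

```go
package main

import "strconv"

// consistencyHandler mirrors the RV bookkeeping an informer event
// handler could do; observe forwards each newly observed RV.
type consistencyHandler struct {
	observe func(rv int64)
}

// onEvent is called for add/update/delete events. isInInitialList is
// true while a (re)list replays the full state; advancing observedRV
// from those synthetic adds could be unsafe if the list snapshot
// predates the RV a reader is waiting for, so this sketch skips them.
func (h *consistencyHandler) onEvent(resourceVersion string, isInInitialList bool) {
	if isInInitialList {
		return
	}
	rv, err := strconv.ParseInt(resourceVersion, 10, 64)
	if err != nil {
		return // RVs are opaque strings; etcd-backed ones parse as ints
	}
	h.observe(rv)
}
```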

committed by the server at the time the read happened
* Reads do not get blocked on writes that started after they started, as otherwise they may end up getting
blocked indefinitely if many writes are happening.
* Writes themselves do not get blocked waiting for them to be observable by reads because:
Member

Is there anything special about the following cases that should be called out?

  • subresource writes (e.g. status, scale)
  • create with GenerateName (there is no Name available when Create is called)

the write did or did not succeed
* Automatically dealing with configurations where the cache doesn't contain all objects that are written, for
example because its label or field selector doesn't match them. We will provide an option for this case and
may or may not try to detect this correctly in the future
Member

Not sure if this is even worth calling out, but Transform funcs that strip metadata (specifically kind/apiVersion/name/namespace/UID/RV (?)) are not supported

But maybe it's worth mentioning what we rely on

Controller-Runtime's default client writes to the API and reads from an informer cache. As a result, performing a read
right after a write tends not to return that write, since it hasn't made it back to the informer cache yet. This
leads to bugs and workarounds like replacing the read portion of the client to read from the API instead. The goal of
this proposal is to provide the same consistency in the cache-reading client that a live client would with regards to
Member

@sbueringer sbueringer Mar 30, 2026

I was chatting a bit with @serathius at KubeCon and he mentioned that in core Kubernetes they intentionally didn't implement that kind of "strong consistency" (if I got it right, they are not blocking all reads after writes; they do it in a more fine-grained way).

I don't have time right now to lookup the k/k implementations, but maybe we should:

  • ensure that we have metrics & data before we make a decision to turn this feature on per default
  • also enable cases with lower consistency guarantees, e.g. a controller might only want to wait until writes for a specific reconcile.Request have been observed and not also for other writes of other controller workers or other controllers (requires more research on what exactly was done in k/k, just wanted to surface it already in case that's useful)


Right, while controller-runtime can provide more options to user, I think the defaults should be aligned with k/k to avoid surprises by both operator users and authors. Eventual consistency of controllers is one of the reasons Kubernetes can scale so well. Operators should also take advantage of it.


+1. k/k's method of reading its own writes is that it will only pause reconciliation if:

  1. An object the reconciler wrote has not been read back yet
  2. Only objects the requested object has written previously will stall reconciliation (this one is tricky to generalize, especially in cases where we still want to pause reconciliation on other resources)

As an example, statefulset Foo may create pods foo1, foo2, foo3 and statefulset Bar creates bar1, bar2 and bar3.

If foo2 is stuck and we have not seen the write, that does not prevent bar from processing its objects. I wonder if we want to only wait for writes inside of that request by default but expose helpers that can pause for the whole reconciler if necessary.

Member

More context on the changes in k/k: kubernetes/website#54792 (Thx @michaelasp for writing the blog!)
