
✨ Read your own write client design#3473

Open
alvaroaleman wants to merge 1 commit into kubernetes-sigs:main from alvaroaleman:rowdesign

Conversation

@alvaroaleman
Member

Related to #3320
Minimal POC in #3472

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 8, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 8, 2026
called, all `Get` requests to the GVK+key and all `List` requests to the GVK will be blocked until the cache observed the
passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey`
blocking the request
* Add an internal-only `AddRequiredDeleteForObject(obj client.Object) error` method. Once set, all `Get` requests to the GVK+key
Member Author

This hinges on the assumption that we are guaranteed to get a delete event that contains the object with at least namespace, name and UID, even if we end up getting a tombstone from the cache. I think that is the case, but it would be good to get confirmation.


I believe so as well; the RV is the main thing that cannot be guaranteed due to the possibility of dropped events. We'd have to see what happens in the case that we drop watch events and have to relist, however. I wonder if anything strange occurs in the scenario that we delete -> recreate -> delete while our events get dropped. I don't think this would affect the internal implementation though.

* Add an internal-only `AddRequiredDeleteForObject(obj client.Object) error` method. Once set, all `Get` requests to the GVK+key
of object and all `List` requests to the GVK of object will be blocked until a delete event for GVK+UID of object was
observed OR the request's context times out OR `RemoveRequiredDeleteForObject` is called for the same object. The object(s)
whose deletion is awaited are copied before waiting to avoid subsequent `AddRequiredDeleteForObject` calls blocking existing
Member Author

The whole having to observe the delete event business is quite finicky, it would be a lot easier if deletes would always provide an rv, but according to @liggitt that is problematic: kubernetes/kubernetes#134937 (comment)


Yeah, I wish there was a better way to deal with delete events. Might be good to discuss if there's anything we can do upstream to make this behavior work more nicely.

@serathius

/cc @michaelasp

@k8s-ci-robot k8s-ci-robot requested a review from michaelasp March 9, 2026 08:59

The basic idea of the implementation is to make writes block concurrent reads to the same GVK+Key for Get and GVK for List
before the request is sent. If the request succeeds, the client provides the cache with either the returned resourceVersion
or the gvk+objectkey+uid of an object if it was deleted from storage, then unblocks the reads. The cache will then copy this
RV/deleted object within the get/list and block the request until it has observed it or the request's context times out.

What happens in the case where we trigger the delete but then the object gets recreated and we drop events forcing a relist? UID should be enough for this case right?

Member Author

Yeah, that is my understanding


## Goals

* Any read from the client (`Get` or `List`) contains all writes performed through the same client that were

This will cause certain high-throughput cases to become slower; we should make sure that these cases are documented. The main case that comes to mind is some sort of controller that:

  1. Receives an event
  2. Does a list that does not need read your own write consistency
  3. Does some operation on the GVK of that list

Now we would have to block for every write on that list until the cache catches up. Agreed however that most controllers would benefit from having the cache up to date with its writes.

Member Author

Yeah, fundamentally the tradeoff is performance vs consistency. I believe that consistency is more important in most cases. For the cases where it is not, this whole feature can be disabled.

Member

@sbueringer sbueringer Mar 14, 2026

We could add an option to Get/List calls that allows disabling consistent reads (not sure if we should do that initially, but I think it's something that we could add later as well).

EDIT: Saw that the option is mentioned below

Member

Somewhat related. I think sooner or later we should document some patterns on how to use this feature (including trade-offs involved).

E.g. I'm not sure how much sense it makes to share one consistency client across many (different) reconcilers in the same controller. If the client is shared, all reconcilers get blocked, but because these reconcilers are running concurrently anyway, there might not be much need/benefit for read-your-own-write across reconcilers.

Concrete example from CAPI:

  • It would be helpful to use a consistency client for the MachineSet reconciler so subsequent Reconciles of the same MachineSet can observe the Machines that have been created/deleted in previous Reconcile calls
  • But it doesn't necessarily make sense to use the same consistency client for the Cluster / MachineDeployment / Machine / ... reconcilers, because they have to deal with the fact that the MachineSet reconciler is running concurrently anyway (i.e. no need to introduce a "global mutex")

back to the client for any reason like the connection breaking. This is primarily for practical
purposes: if the write doesn't get a response indicating success or failure back, it is impossible to tell whether
the write did or did not succeed
* Automatically dealing with configurations where the cache doesn't contain all objects that are written, for

Yeah this is tricky, if we don't have all the objects then it is not clear how we track the RV in the store. I'd like to look more at how the cache in controller runtime currently works with field/label selectors.

Member Author

This here is how it is configured: https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.23.3/pkg/cache#Config

Fundamentally it boils down to being able to configure a field and/or label selector per GVK and/or namespace, and namespace(s) per GVK. The latter is implemented client-side, since the KAS does not allow watching multiple namespaces.



## Goals

* Any read from the client (`Get` or `List`) contains all writes performed through the same client that were
committed by the server at the time the read happened
Member

It would be good to also describe what happens in the following case:

  • Write succeeds
  • RV doesn't show up in the cache (either not at all or it takes a long time (minutes))

E.g. we could return an error in the Get/List calls after some timeout. We could also fall back to a live read to the apiserver (I assume that's not what we want to do; IIRC some of the caches in the apiserver work that way)

(EDIT: I saw this case is mentioned in the implementation, but would be probably good to highlight it here as well)

example because its label or field selector doesn't match them. We will provide an option for this case and
may or may not try to detect this correctly in the future
* Support `DeleteAllOf` - The apiservers response to DeleteAllOf is insufficient to implement this correctly. DeleteAllOf
will error and instruct the user to use `List` and `Delete`
Member

I'm fine with this, I assume implementing DeleteAllOf internally as a List + Delete would be a bit too much magic/surprising?

Comment on lines +45 to +46
The basic idea of the implementation is to make writes block concurrent reads to the same GVK+Key for Get and GVK for List
before the request is sent. If the request succeeds, the client provides the cache with either the returned resourceVersion
Member

I think it could be a worthwhile optimization to only block for the same GVK+namespace for List for namespaced List calls


The basic idea of the implementation is to make writes block concurrent reads to the same GVK+Key for Get and GVK for List
before the request is sent. If the request succeeds, the client provides the cache with either the returned resourceVersion
or the gvk+objectkey+uid of an object if it was deleted from storage, then unblocks the reads. The cache will then copy this
Member

nit: "then unblocks the reads" seems wrong at this point? (the reads are unblocked "when it observed it" (below) right?)

Comment on lines +52 to +53
The implementation is gated behind a `ReadYourOwnWrite *bool` `client.Options.Cache` setting. It will initially be disabled
by default; the goal is to enable it by default once we are confident in the implementation.
Member

Let's introduce some metrics to make it possible to gather data on how this feature behaves (e.g. in something similar in CAPI I introduced a metric for the "wait duration"). The number of wait timeouts would probably also be good.

Probably worth checking if k/k introduced metrics for related work


* Add an internal-only `SetMinimumRVForGVKAndKey(gvk schema.GroupVersionKind, key client.ObjectKey, rv int64)` method. Once
called, all `Get` requests to the GVK+key and all `List` requests to the GVK will be blocked until the cache observed the
passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey`
Member

I wonder if we should add an additional request-level timeout per default in the consistency client. I would not be surprised if a huge majority of controller-runtime users are not setting an overall Reconciliation or request timeout.

IIRC we didn't set a ReconciliationTimeout per default when we introduced it to avoid breaking anyone.
A cached get/list call today is not something that typically times out, so I think a lot of folks haven't realized yet that they should set timeouts, because they might never have hit issues without one.

passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey`
blocking the request
* Add an internal-only `AddRequiredDeleteForObject(obj client.Object) error` method. Once set, all `Get` requests to the GVK+key
of object and all `List` requests to the GVK of object will be blocked until a delete event for GVK+UID of object was
Member

nit: Would it make sense to "reduce" the parameter we take in this func from a full client.Object to: gvk, key, uid? (similar for RemoveRequiredDeleteForObject)

### Changes to the Client

Add a new `readYourOwnWriteClient` wrapper to the client, which will wrap any new client that is constructed with
`options.Cache.ReadYourOwnWrite: new(true)`. This wrapper maintains a map of locks for gvk+key. It will wrap all
Member

I assume this is referring to client.Options.

Would be great if it would be possible to enable/disable this per GVK

Wondering if we should simplify the field / option below to something like "ConsistentRead" (we shouldn't do that if it would be misleading though)

case of `Get`, or all current lock holders for the GVK in the case of `List`, release their lock or the request's context
expires.

Add a new `DisableReadYourOwnWriteConsistency` option that can be used for either `Get` or `List` and disables the
Member

Do we also need an option to avoid blocking Reads on the Write methods?

(Specifically wondering what happens out of the box for a Write that never shows up in the cache because the object doesn't match the selectors)

will error and instruct the user to use `List` and `Delete`
* Fail writes that use optimistic locking clientside - This could be done in the future, but is initially out of scope

## Implementation
Member

@sbueringer sbueringer Mar 14, 2026

Would it make sense to short-circuit some of the code if the write request didn't change the RV?

I think there are many cases where that can happen:

  • SSA no-op
  • Empty patches
  • Patches that apply changes that are changed back by conversion or mutating webhooks (so effectively they don't change anything)

None of these are ideal, but stuff like this happens

* Add an internal-only `SetMinimumRVForGVKAndKey(gvk schema.GroupVersionKind, key client.ObjectKey, rv int64)` method. Once
called, all `Get` requests to the GVK+key and all `List` requests to the GVK will be blocked until the cache observed the
passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey`
blocking the request
Member

Let's assume that for some reason the cache doesn't catch up at all. (I'm aware that today we don't have a way to handle that)

Would it be possible now to handle cases like that, if yes how? (not saying we have to do this automatically, just wondering if we expose something that makes it possible to handle cases like that)

I'm wondering if in cases like that it would be e.g. desirable that the entire controller binary restarts, similar to what happens on cache timeout during startup.

IIRC some controllers in k/k (in kcm?) basically stop reconciling now until the cache catches up. I think that's also what we would get with what is currently described in the proposal. I don't know if there is any additional handling if the cache never catches up.

or the gvk+objectkey+uid of an object if it was deleted from storage, then unblock the reads. The cache will then copy this
RV/deleted object within the get/list and block the request until it observed it or the requests context times out.
It is important that we block before executing the write request and not after, because we cannot know exactly when the
server commits it, only when it tells us that it has committed it.
Member

(comment belongs slightly above, but collides with other comments)

Probably a stupid question, just want to double check: we can rely on events from a watch stream being ordered by RV, right? (Otherwise we would potentially unblock too early.)

Similarly, I got this from an LLM and I'm honestly not sure whether this is an issue or not; it sounds to me like it might be. I unfortunately don't have the time before PTO to figure it out (feel free to ignore if it doesn't make sense to you)

## Behavior during informer relist is unspecified

When an informer's watch connection breaks and it relists, the reflector replays
the full state. If a minimum RV was set but the relist's "initial list" RV is
lower than the target (because the list happened before the write was committed),
the `ConsistencyHandler` might not advance `observedRV` during the relist. The
subsequent synthetic Add events would then advance it, but there's a window
during relist where the cache state is being replaced and objects might
temporarily disappear.

The design should discuss whether relist can violate the consistency guarantee,
and whether the `ConsistencyHandler` needs to handle relist events specially
(e.g., the `isInInitialList` parameter in `OnAdd` that the POC ignores).
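For reference, client-go's `ResourceEventHandler.OnAdd` does expose an `isInInitialList bool` parameter. One conservative option, sketched below with illustrative names (the `observe` callback stands in for the RV bookkeeping; this is not the POC's handler), would be to not advance the observed RV from initial-list replays at all and rely on the subsequent watch events; whether that is the right behavior for relists is exactly what the design should pin down:

```go
package main

import "strconv"

// consistencyHandler mirrors the RV bookkeeping an informer event
// handler could do; observe forwards each newly observed RV.
type consistencyHandler struct {
	observe func(rv int64)
}

// onEvent is called for add/update/delete events. isInInitialList is
// true while a (re)list replays the full state; advancing observedRV
// from those synthetic adds could be unsafe if the list snapshot
// predates the RV a reader is waiting for, so this sketch skips them.
func (h *consistencyHandler) onEvent(resourceVersion string, isInInitialList bool) {
	if isInInitialList {
		return
	}
	rv, err := strconv.ParseInt(resourceVersion, 10, 64)
	if err != nil {
		return // RVs are opaque strings; etcd-backed ones parse as ints
	}
	h.observe(rv)
}
```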

committed by the server at the time the read happened
* Reads do not get blocked on writes that started after they started, as otherwise they may end up getting
blocked indefinitely if many writes are happening.
* Writes themselves do not get blocked waiting for them to be observable by reads because:
Member

Is there anything special about the following cases that should be called out?

  • subresource writes (e.g. status, scale)
  • create with GenerateName (there is no Name available when Create is called)

the write did or did not succeed
* Automatically dealing with configurations where the cache doesn't contain all objects that are written, for
example because its label or field selector doesn't match them. We will provide an option for this case and
may or may not try to detect this correctly in the future
Member

Not sure if this is even worth calling out, but Transform funcs that strip metadata (specifically kind/apiVersion/name/namespace/UID/RV (?)) are not supported

But maybe it's worth mentioning what we rely on

Controller-Runtime's default client writes to the API and reads from an informer cache. As a result, performing a read
right after a write tends not to return that write, since it hasn't made it back to the informer cache yet. This
leads to bugs and workarounds like replacing the read portion of the client to read from the API instead. The goal of
this proposal is to provide the same consistency in the cache-reading client that a live client would with regards to
Member

@sbueringer sbueringer Mar 30, 2026

I was chatting a bit with @serathius at KubeCon and he mentioned that in core Kubernetes they intentionally didn't implement that kind of "strong consistency" (if I got it right, they are not blocking all reads after writes; they do it in a more fine-grained way).

I don't have time right now to lookup the k/k implementations, but maybe we should:

  • ensure that we have metrics & data before we make a decision to turn this feature on per default
  • also enable cases with lower consistency guarantees, e.g. a controller might only want to wait until writes for a specific reconcile.Request have been observed and not also for other writes of other controller workers or other controllers (requires more research on what exactly was done in k/k, just wanted to surface it already in case that's useful)


Right, while controller-runtime can provide more options to user, I think the defaults should be aligned with k/k to avoid surprises by both operator users and authors. Eventual consistency of controllers is one of the reasons Kubernetes can scale so well. Operators should also take advantage of it.


+1. k/k's method of reading its own writes is that it will only pause reconciliation if:

  1. An object the reconciler wrote has not been read back yet
  2. Only objects the requested object has written previously will stall reconciliation (this one is tricky to generalize, especially in cases where we still want to pause reconciliation on other resources)

As an example, statefulset Foo may create pods foo1, foo2, foo3 and statefulset Bar creates bar1, bar2 and bar3.

If foo2 is stuck and we have not seen the write, that does not prevent bar from processing its objects. I wonder if we want to only wait for writes inside of that request by default but expose helpers that can pause for the whole reconciler if necessary.

Member

More context on the changes in k/k: kubernetes/website#54792 (Thx @michaelasp for writing the blog!)
