✨ Read your own write client design #3473

alvaroaleman wants to merge 1 commit into kubernetes-sigs:main
Conversation

[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: alvaroaleman. The full list of commands accepted by this bot can be found here. The pull request process is described here.
| called, all `Get` requests to the GVK+key and all `List` requests to the GVK will be blocked until the cache observed the | ||
| passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey` | ||
| blocking the request | ||
| * Add an internal-only `AddRequiredDeleteForObject(obj client.Object) error` method. Once set, all `Get` requests to the GVK+key |
This hinges on the assumption that we are guaranteed to get a delete event that contains the object with at least namespace, name and UID even if we end up getting a tombstome from the cache - I think that is the case, but would be good to get confirmation
I believe so as well; the RV is the main thing that cannot be guaranteed due to the possibility of dropped events. We'd have to see what happens in the case where we drop watch events and have to relist, however. I wonder if anything strange occurs in the scenario where we delete -> recreate -> delete while our events get dropped. I don't think this would affect the internal impl though.
| * Add an internal-only `AddRequiredDeleteForObject(obj client.Object) error` method. Once set, all `Get` requests to the GVK+key | ||
| of object and all `List` requests to the GVK of object will be blocked until a delete event for GVK+UID of object was | ||
| observed OR the request's context times out OR `RemoveRequiredDeleteForObject` is called for the same object. The object(s) | ||
| whose deletion are awaited are copied before waiting to avoid subsequent `AddRequiredDeleteForObject` calls to block existing |
The whole having-to-observe-the-delete-event business is quite finicky; it would be a lot easier if deletes always provided an rv, but according to @liggitt that is problematic: kubernetes/kubernetes#134937 (comment)
Yeah, I wish there was a better way to deal with delete events. Might be good to discuss if there's anything we can do upstream to make this behavior work more nicely.
/cc @michaelasp
| The basic idea of the implementation is to make writes block concurrent reads to the same GVK+Key for Get and GVK for List | ||
| before the request is sent. If the request succeeds, the client provides the cache with either the returned resourceVersion | ||
| or the gvk+objectkey+uid of an object if it was deleted from storage, then unblock the reads. The cache will then copy this | ||
| RV/deleted object within the get/list and block the request until it observed it or the request's context times out. |
What happens in the case where we trigger the delete but then the object gets recreated and we drop events forcing a relist? UID should be enough for this case right?
Yeah, that is my understanding
| ## Goals | ||
| * Any read from the client (`Get` or `List`) contains all writes performed through the same client that were |
This will cause certain high throughput cases to become slower, we should make sure that these cases are documented. The main case that comes to mind is some sort of controller that:
- Receives an event
- Does a list that does not need read your own write consistency
- Does some operation on the GVK of that list
Now we would have to block for every write on that list until the cache catches up. Agreed however that most controllers would benefit from having the cache up to date with its writes.
yeah, fundamentally the tradeoff is performance vs consistency. I believe that the consistency is more important in most cases. For the cases where it is not, this whole feature can be disabled.
We could add an option to Get/List calls that allows disabling consistent reads (not sure if we should do that initially, but I think it's something we could also add later).
EDIT: Saw that the option is mentioned below
Somewhat related. I think sooner or later we should document some patterns on how to use this feature (including trade-offs involved).
E.g. I'm not sure how much sense it makes to share one consistency client across many (different) reconcilers in the same controller. If the client is shared, all reconcilers get blocked, but because these reconcilers are running concurrently anyway, there might not be much need/benefit for read-your-own-write across reconcilers.
Concrete example from CAPI:
- It would be helpful to use a consistency client for the MachineSet reconciler so subsequent Reconciles of the same MachineSet can observe the Machines that have been created/deleted in previous Reconcile calls
- But it doesn't necessarily make sense to use the same consistency client for the Cluster / MachineDeployment / Machine / ... reconcilers, because they have to deal with the fact that the MachineSet reconciler is running concurrently anyway (i.e. no need to introduce a "global mutex")
| back to the client for any reason, like the connection breaking. This is primarily for practical | ||
| purposes: if the write doesn't get a response indicating success or failure back, it is impossible to tell if | ||
| the write did or did not succeed | ||
| * Automatically dealing with configurations where the cache doesn't contain all objects that are written, for |
Yeah, this is tricky: if we don't have all the objects, then it is not clear how we track the RV in the store. I'd like to look more at how the cache in controller-runtime currently works with field/label selectors.
This here is how it is configured: https://pkg.go.dev/sigs.k8s.io/controller-runtime@v0.23.3/pkg/cache#Config
Fundamentally it boils down to being able to configure a field and/or label selector per GVK and/or namespace, plus namespace(s) per GVK. The latter is implemented client-side, since the KAS does not allow watching multiple namespaces.
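Because the apiserver only serves single-namespace or cluster-wide watches, a multi-namespace cache has to run one watch per namespace and merge the streams client-side. A toy sketch of that fan-in, with all types and names invented for illustration:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// event is a toy stand-in for a watch event.
type event struct {
	Namespace, Name string
	RV              int64
}

// mergeWatches fans several per-namespace event streams into one channel,
// mirroring how a multi-namespace cache is assembled client-side.
func mergeWatches(streams ...<-chan event) <-chan event {
	out := make(chan event)
	var wg sync.WaitGroup
	for _, s := range streams {
		wg.Add(1)
		go func(s <-chan event) {
			defer wg.Done()
			for ev := range s {
				out <- ev
			}
		}(s)
	}
	go func() { wg.Wait(); close(out) }()
	return out
}

func main() {
	nsA := make(chan event, 1)
	nsB := make(chan event, 1)
	nsA <- event{"team-a", "pod-1", 10}
	nsB <- event{"team-b", "pod-2", 11}
	close(nsA)
	close(nsB)
	var names []string
	for ev := range mergeWatches(nsA, nsB) {
		names = append(names, ev.Namespace+"/"+ev.Name)
	}
	sort.Strings(names)
	fmt.Println(names) // [team-a/pod-1 team-b/pod-2]
}
```

One consequence worth noting for this proposal: with one watch per namespace, RVs arrive in order only within each stream, which is part of why tracking a single minimum RV per GVK needs care in such configurations.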
| ## Goals | ||
| * Any read from the client (`Get` or `List`) contains all writes performed through the same client that were | ||
| committed by the server at the time the read happened |
It would be good to also describe what happens in the following case:
- Write succeeds
- RV doesn't show up in the cache (either not at all or it takes a long time (minutes))
E.g. we could return an error in the Get/List calls after some timeout. We could also fall back to a live read to the apiserver (I assume that's not what we want to do; IIRC some of the caches in the apiserver work that way)
(EDIT: I saw this case is mentioned in the implementation, but would be probably good to highlight it here as well)
| example because its label or field selector doesn't match them. We will provide an option for this case and | ||
| may or may not try to detect this correctly in the future | ||
| * Support `DeleteAllOf` - The apiservers response to DeleteAllOf is insufficient to implement this correctly. DeleteAllOf | ||
| will error and instruct the user to use `List` and `Delete` |
I'm fine with this. I assume implementing DeleteAllOf internally as a List + Delete would be a bit too much magic/surprising?
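The user-side replacement the design prescribes (List, then Delete each object) could look like the following against a toy in-memory store. `toyClient` and its methods are invented stand-ins, not the controller-runtime client API; the point is that per-object Delete responses yield the UIDs whose delete events a consistency layer can await, which DeleteAllOf's response does not:

```go
package main

import "fmt"

// obj and toyClient are invented stand-ins; only what the example needs.
type obj struct{ Name, UID string }

type toyClient struct{ store map[string]obj }

func (c *toyClient) List() []obj {
	out := make([]obj, 0, len(c.store))
	for _, o := range c.store {
		out = append(out, o)
	}
	return out
}

// Delete returns the deleted object, so the caller learns the UID whose
// delete event must be awaited; DeleteAllOf gives no such per-object answer.
func (c *toyClient) Delete(name string) (obj, bool) {
	o, ok := c.store[name]
	delete(c.store, name)
	return o, ok
}

func main() {
	c := &toyClient{store: map[string]obj{
		"a": {"a", "uid-a"},
		"b": {"b", "uid-b"},
	}}
	// DeleteAllOf replacement: List, then Delete each object individually,
	// collecting the UIDs the consistency layer must see deleted.
	var awaited []string
	for _, o := range c.List() {
		if deleted, ok := c.Delete(o.Name); ok {
			awaited = append(awaited, deleted.UID)
		}
	}
	fmt.Println(len(awaited)) // 2
}
```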
| The basic idea of the implementation is to make writes block concurrent reads to the same GVK+Key for Get and GVK for List | ||
| before the request is sent. If the request succeeds, the client provides the cache with either the returned resourceVersion |
I think it could be a worthwhile optimization to only block for the same GVK+namespace for List for namespaced List calls
| The basic idea of the implementation is to make writes block concurrent reads to the same GVK+Key for Get and GVK for List | ||
| before the request is sent. If the request succeeds, the client provides the cache with either the returned resourceVersion | ||
| or the gvk+objectkey+uid of an object if it was deleted from storage, then unblock the reads. The cache will then copy this |
nit: "then unblocks the reads" seems wrong at this point? (the reads are unblocked "when it observed it" (below) right?)
| The implementation is gated behind a `ReadYourOwnWrite *bool` `client.Options.Cache` setting. It will initially be disabled | ||
| by default; the goal is to enable it by default once we are confident in the implementation. |
Let's introduce some metrics to make it possible to gather data on how this feature behaves (e.g. for something similar in CAPI I introduced a metric for the "wait duration"). Number of wait timeouts would probably also be good.
Probably worth checking if k/k introduced metrics for related work
| * Add an internal-only `SetMinimumRVForGVKAndKey(gvk schema.GroupVersionKind, key client.ObjectKey, rv int64)` method. Once | ||
| called, all `Get` requests to the GVK+key and all `List` requests to the GVK will be blocked until the cache observed the | ||
| passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey` |
I wonder if we should add an additional request-level timeout by default in the consistency client. I would not be surprised if a huge majority of controller-runtime users are not setting an overall Reconciliation or request timeout.
IIRC we didn't set a ReconciliationTimeout by default when we introduced it, to avoid breaking anyone.
A cached get/list call today is not something that typically times out, so I think a lot of folks haven't realized yet that they should set timeouts, because they might never have hit issues without one.
| passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey` | ||
| blocking the request | ||
| * Add an internal-only `AddRequiredDeleteForObject(obj client.Object) error` method. Once set, all `Get` requests to the GVK+key | ||
| of object and all `List` requests to the GVK of object will be blocked until a delete event for GVK+UID of object was |
nit: Would it make sense to "reduce" the parameter we take in this func from a full client.Object to: gvk, key, uid? (similar for RemoveRequiredDeleteForObject)
| ### Changes to the Client | ||
| Add a new `readYourOwnWriteClient` wrapper to the client, which will wrap any new client that is constructed with | ||
| `options.Cache.ReadYourOwnWrite: new(true)`. This wrapper maintains a map of locks for gvk+key. It will wrap all |
I assume this is referring to client.Options.
Would be great if it would be possible to enable/disable this per GVK
Wondering if we should simplify the field / option below to something like "ConsistentRead" (we shouldn't do that if it would be misleading though)
| case of `Get` or all current lock holders for GVK in the case of `List` release their lock or the requests context | ||
| expires. | ||
| Add a new `DisableReadYourOwnWriteConsistency` option that can be used for either `Get` or `List` and disables the |
Do we also need an option to avoid blocking Reads on the Write methods?
(Specifically wondering what happens out of the box for a Write that never shows up in the cache because the object doesn't match the selectors)
| will error and instruct the user to use `List` and `Delete` | ||
| * Fail writes that use optimistic locking clientside - This could be done in the future, but is initially out of scope | ||
| ## Implementation |
Would it make sense to short-circuit some of the code if the write request didn't change the RV?
I think there are many cases where that can happen:
- SSA no-op
- Empty patches
- Patches that apply changes that are changed back by conversion or mutating webhooks (so effectively they don't change anything)
None of these are ideal, but stuff like this happens
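The short-circuit suggested here could be as simple as comparing the resourceVersion before and after the write. A hypothetical sketch; `shouldWait` is an invented name, and treating an empty returned RV as "nothing to wait for" is an assumption:

```go
package main

import "fmt"

// shouldWait reports whether a read barrier is needed after a write: if the
// server returned the same resourceVersion the object already had (SSA no-op,
// empty patch, webhook reverting the change), the cache has nothing new to
// observe and the wait can be skipped.
func shouldWait(rvBefore, rvAfter string) bool {
	return rvAfter != "" && rvAfter != rvBefore
}

func main() {
	fmt.Println(shouldWait("100", "100")) // false: no-op write
	fmt.Println(shouldWait("100", "101")) // true: cache must catch up
}
```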
| * Add an internal-only `SetMinimumRVForGVKAndKey(gvk schema.GroupVersionKind, key client.ObjectKey, rv int64)` method. Once | ||
| called, all `Get` requests to the GVK+key and all `List` requests to the GVK will be blocked until the cache observed the | ||
| passed rv or the request times out. The rv is copied before waiting to avoid subsequent calls to `SetMinimumRVForGVKAndKey` | ||
| blocking the request |
Let's assume that for some reason the cache doesn't catch up at all. (I'm aware that today we don't have a way to handle that)
Would it be possible now to handle cases like that, if yes how? (not saying we have to do this automatically, just wondering if we expose something that makes it possible to handle cases like that)
I'm wondering if in cases like that it would be e.g. desirable that the entire controller binary restarts, similar to what happens on cache timeout during startup.
IIRC some controllers in k/k (in kcm?) basically stop reconciling now until the cache catches up. I think that's also what we would get with what is currently described in the proposal. I don't know if there is any additional handling if the cache never catches up.
| or the gvk+objectkey+uid of an object if it was deleted from storage, then unblock the reads. The cache will then copy this | ||
| RV/deleted object within the get/list and block the request until it observed it or the request's context times out. | ||
| It is important that we block before executing the write request and not after, because we cannot know when exactly the server | ||
| commits it, only when it tells us it has committed it. |
(comment belongs slightly above, but collides with other comments)
Probably a stupid question, just want to double check: we can rely on events from a watch stream being ordered by RV, right? (otherwise we would potentially unblock too early)
Similarly, I got this from an LLM and I'm honestly not sure whether it is an issue or not; it sounds to me like it might be. I unfortunately don't have the time before PTO to figure it out (feel free to ignore if it doesn't make sense to you):
## Behavior during informer relist is unspecified
When an informer's watch connection breaks and it relists, the reflector replays
the full state. If a minimum RV was set but the relist's "initial list" RV is
lower than the target (because the list happened before the write was committed),
the `ConsistencyHandler` might not advance `observedRV` during the relist. The
subsequent synthetic Add events would then advance it, but there's a window
during relist where the cache state is being replaced and objects might
temporarily disappear.
The design should discuss whether relist can violate the consistency guarantee,
and whether the `ConsistencyHandler` needs to handle relist events specially
(e.g., the `isInInitialList` parameter in `OnAdd` that the POC ignores).
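One way to address the relist concern above is to make the observed-RV watermark strictly monotonic, so replayed (initial-list) events carrying older RVs can never move it backwards. A hypothetical sketch, not the POC's actual handler; whether a relist window can transiently hide objects is a separate question this does not answer:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// consistencyHandler keeps a high-watermark of observed RVs. Relist replays
// (including the initial list after a watch break) may deliver events with
// older RVs; a strictly monotonic watermark means they can never move the
// observed RV backwards, regardless of isInInitialList.
type consistencyHandler struct{ observedRV atomic.Int64 }

func (h *consistencyHandler) OnAdd(rv int64, isInInitialList bool) {
	// Monotonic max via CAS: safe for live events and relist replays alike.
	for {
		cur := h.observedRV.Load()
		if rv <= cur || h.observedRV.CompareAndSwap(cur, rv) {
			return
		}
	}
}

func main() {
	h := &consistencyHandler{}
	h.OnAdd(50, false)
	h.OnAdd(40, true) // relist replay carrying an older RV: ignored
	fmt.Println(h.observedRV.Load()) // 50
}
```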
| committed by the server at the time the read happened | ||
| * Reads do not get blocked on writes that started after they started, as otherwise they may end up getting | ||
| blocked indefinitely if many writes are happening. | ||
| * Writes themselves do not get blocked waiting for them to be observable by reads because: |
Is there anything special about the following cases that should be called out?
- subresource writes (e.g. status, scale)
- create with GenerateName (there is no Name available when Create is called)
| the write did or did not succeed | ||
| * Automatically dealing with configurations where the cache doesn't contain all objects that are written, for | ||
| example because its label or field selector doesn't match them. We will provide an option for this case and | ||
| may or may not try to detect this correctly in the future |
Not sure if this is even worth calling out, but Transform funcs that strip metadata (specifically kind/apiVersion/name/namespace/UID/RV (?)) are not supported
But maybe it's worth mentioning what we rely on
| Controller-Runtime's default client writes to the API and reads from an informer cache. As a result, performing a read | ||
| right after a write tends to not return that write since it hasn't made it back to the informer cache yet. This | ||
| leads to bugs and workarounds like replacing the read portion of the client to read from the API instead. The goal of | ||
| this proposal is to provide the same consistency in the cache-reading client a live client would with regards to |
I was chatting a bit with @serathius at KubeCon and he mentioned that in core Kubernetes they intentionally didn't implement that kind of "strong consistency" (if I got it right, they are not blocking all reads after writes; they do it at a finer granularity).
I don't have time right now to lookup the k/k implementations, but maybe we should:
- ensure that we have metrics & data before we make a decision to turn this feature on by default
- also enable cases with lower consistency guarantees, e.g. a controller might only want to wait until writes for a specific reconcile.Request have been observed and not also for other writes of other controller workers or other controllers (requires more research on what exactly was done in k/k, just wanted to surface it already in case that's useful)
Right, while controller-runtime can provide more options to users, I think the defaults should be aligned with k/k to avoid surprises for both operator users and authors. Eventual consistency of controllers is one of the reasons Kubernetes can scale so well. Operators should also take advantage of it.
+1 k/k's method of reading its own writes is that it will only pause reconciliation if:
- An object the reconciler wrote has not been read back yet
- Only objects the requested object has written previously will stall reconciliation (this one is tricky to generalize, especially in cases where we still want to pause reconciliation on other resources)
As an example, StatefulSet Foo may create pods foo1, foo2, foo3 and StatefulSet Bar creates bar1, bar2, and bar3.
If foo2 is stuck and we have not seen the write, that does not prevent bar from processing its objects. I wonder if we want to only wait for writes inside of that request by default but expose helpers that can pause for the whole reconciler if necessary.
More context on the changes in k/k: kubernetes/website#54792 (Thx @michaelasp for writing the blog!)
Related to #3320
Minimal POC in #3472
/hold