test: fix space counter#1212
Conversation
WalkthroughExpanded migration test assertions in Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Migration Test
participant MUR as MasterUserRecord
participant Space as Space
participant SB as SpaceBinding
Note over Test,MUR: prepareDeactivatedUser / prepareBannedUser flow
Test->>MUR: verify MUR deletion
alt MUR deleted
Test->>Space: WaitUntilSpaceAndSpaceBindingsDeleted
Space-->>Test: confirmed deleted
Test->>SB: confirmed bindings deleted
else MUR still present
MUR-->>Test: still exists (retry/wait)
end
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/retest flaky test |
|
/retest branch on host side not updated with master |
metlos
left a comment
There was a problem hiding this comment.
Great find! I think this will improve the situation but I'm not sure it is a 100% bulletproof solution, see the comments below.
I'm approving anyway, because this will make the situation better for sure..
| require.NoError(t, err) | ||
|
|
||
| // let's wait until ToolchainStatus is updated with the latest numbers from the space counter | ||
| _, err = hostAwait.WaitForToolchainStatus(t, wait.UntilToolchainStatusUpdatedAfter(time.Now())) |
There was a problem hiding this comment.
Can this potentially race as well? E.g. if the space counter can react quickly enough before this line is hit, we could theoretically timeout here.
There was a problem hiding this comment.
+1
We could capture the time before deactivating the user and use that instead of time.Now()
There was a problem hiding this comment.
if there's anything happening in parallel to this test in the cluster, we could end up exiting too early though. So care needs to be taken here.
There was a problem hiding this comment.
I don't follow what you mean by the timeout. The ToolchainStatus is updated every second in e2e tests
which is exactly what we check in the
UntilToolchainStatusUpdatedAfter function: toolchain-e2e/testsupport/wait/host.go
Lines 1461 to 1468 in 6b150c2
When the ToolchainStatus is updated, then it also syncs with the counters, so we can be sure that the content of the ToolchainStatus is up-to-date to whatever happened before this line.
There was a problem hiding this comment.
I've just noticed that we already do the wait & check of the ToolchainStatus at the end of the setup runner here:
toolchain-e2e/test/migration/setup_runner.go
Lines 66 to 68 in 6b150c2
so I guess that we don't need to add the extra ones - only waiting until Space is being deleted should be sufficient.
There was a problem hiding this comment.
I agree that we don't need to add extra ones (it would be redundant)
There was a problem hiding this comment.
Should we capture the time before all the operations? This way, we would look for an update that happened during or after all the operations, rather than after the operations
There was a problem hiding this comment.
not needed, waiting until Space is gone and then for the update of the ToolchainStatus is completely sufficient
| require.NoError(t, err) | ||
|
|
||
| // let's wait until ToolchainStatus is updated with the latest numbers from the space counter | ||
| _, err = hostAwait.WaitForToolchainStatus(t, wait.UntilToolchainStatusUpdatedAfter(time.Now())) |
There was a problem hiding this comment.
Same as above, I think this can potentially race. Is there a way of knowing e.g. the specific number of spaces that we should expect?
| require.NoError(t, err) | ||
|
|
||
| // let's wait until ToolchainStatus is updated with the latest numbers from the space counter | ||
| _, err = hostAwait.WaitForToolchainStatus(t, wait.UntilToolchainStatusUpdatedAfter(time.Now())) |
There was a problem hiding this comment.
+1
We could capture the time before deactivating the user and use that instead of time.Now()
MatousJobanek
left a comment
There was a problem hiding this comment.
awesome, thanks a lot 🚀
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alexeykazakov, MatousJobanek, metlos, mfrancisc, rsoaresd The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/retest merge conflict with the pairing |
|



Description
Lately, we are often hitting this flaky test:
The space count mismatch seems to happen between "setup migration" and "verify migration" steps. It can be happening for two reasons:
Premature Operator Shutdown
we simply kill the operators (at the end of the migration setup) too early, so it either doesn't properly decreased the counter or it doesn't store the decreased value to the ToolchainStatus
No Counter Reconciliation on Startup
we don't recount the Spaces at the start of the host operator in e2e tests
Slack thread
https://redhat-internal.slack.com/archives/CHK0J6HT6/p1759830691610259
Paired PR
codeready-toolchain/host-operator#1210
Issue ticket number and link
SANDBOX-1437
Summary by CodeRabbit