Commit 0dabfd1
fix: surface precise admission errors and clear stale ResourceClaims (#573)
## The bug
Ref: #512
Users occasionally saw
> `Error from server (Forbidden): ... Insufficient quota resources
available. Review your quota usage and reach out to support if you need
additional resources.`
when creating resources such as DNS record sets, even though their
project quota showed only a handful of resources in use (for example
6/500). The message was misleading because the admission plugin returned
it for *every* failure inside the ResourceClaim pipeline, not just
actual denials.
## Root cause
The `ResourceQuotaEnforcementPlugin` has three distinct failure paths,
all of which were collapsed into the same 403 body:
1. **Actual denial** — `AllowanceBucketController` marks `Granted=False`
with reason `QuotaExceeded`.
2. **Watch timeout** — the admission watch manager times out after 30 s
because the bucket controller is still reconciling, or grants are still
propagating.
3. **Stale `ResourceClaim`** — claim names are deterministic, so a
leftover claim from a prior admission timeout blocks every retry with
`AlreadyExists`. `DeniedAutoClaimCleanupController` only reaped `Denied`
claims, not pending/timed-out ones, so these orphans could linger
indefinitely.
Case 3 produced the "insufficient quota" message even when the user had
plenty of quota available, which is exactly what the reporter hit.
## The fix (one PR, three coordinated pieces)
### 1. Typed claim-failure kinds with distinct user-facing messages
`internal/quota/admission/errors.go` (new) introduces `claimFailure`
with four kinds: `Denied`, `Timeout`, `Conflict`, `Internal`.
`createAndWaitForResourceClaim` now returns these and
`processResourceWithPolicy` maps each to a warm, actionable 403 body:
| Kind | Message |
|---|---|
| Denied | _You've reached your quota for this resource type (...). Delete unused resources to free up capacity, or contact support to request a higher limit._ |
| Timeout | _Your request took too long to be checked against your quota. Please try again in a moment — if this keeps happening, contact support._ |
| Conflict | _We're still cleaning up from a previous attempt to create this resource (...). Please try again in a few seconds._ |
| Internal | _Something went wrong while checking your quota for this request. Please try again — if this keeps happening, contact support._ |
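
The kind-to-message mapping might look roughly like the sketch below. It is not the contents of `errors.go`: the struct fields, constant names, and the `userFacingMessage` helper are hypothetical, the messages are abbreviated from the table, and falling back to the Internal message for untyped errors is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// claimFailureKind mirrors the four kinds named above. The Go identifiers
// are hypothetical; only the kind labels come from the description.
type claimFailureKind int

const (
	kindDenied claimFailureKind = iota
	kindTimeout
	kindConflict
	kindInternal
)

// claimFailure is an assumed shape: a kind plus the underlying cause.
type claimFailure struct {
	Kind claimFailureKind
	Err  error
}

func (f *claimFailure) Error() string {
	return fmt.Sprintf("claim failure (kind=%d): %v", f.Kind, f.Err)
}

// Unwrap lets errors.Is and errors.As see the underlying cause.
func (f *claimFailure) Unwrap() error { return f.Err }

// Messages abbreviated from the table; elided details are left out.
const (
	msgDenied   = "You've reached your quota for this resource type."
	msgTimeout  = "Your request took too long to be checked against your quota."
	msgConflict = "We're still cleaning up from a previous attempt to create this resource."
	msgInternal = "Something went wrong while checking your quota for this request."
)

// userFacingMessage picks the message for err. Errors that are not a
// *claimFailure fall back to the Internal message (an assumption).
func userFacingMessage(err error) string {
	if err == nil {
		return ""
	}
	var f *claimFailure
	if !errors.As(err, &f) {
		return msgInternal
	}
	switch f.Kind {
	case kindDenied:
		return msgDenied
	case kindTimeout:
		return msgTimeout
	case kindConflict:
		return msgConflict
	default:
		return msgInternal
	}
}

func main() {
	// A stale claim still surfaces as Conflict after further wrapping.
	cause := &claimFailure{Kind: kindConflict, Err: errors.New("already exists")}
	err := fmt.Errorf("admission: %w", cause)
	fmt.Println(userFacingMessage(err))
}
```

Using `errors.As` rather than a direct type assertion means the kind survives any wrapping added between the claim helper and the code that writes the 403 body.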
### 2. Pre-create handling of existing claims
Before `Create`, `createResourceClaim` now does a `Get` at the
deterministic name and branches:
| Existing claim state | Action |
|---|---|
| `Granted=True` | Short-circuit admission to allow (capacity already reserved for this resource) |
| `Granted=False` / `QuotaExceeded` | Return `Conflict` so the user gets a retry message, not "insufficient quota" |
| Pending or no condition | Delete inline, then Create a fresh claim |
| `AlreadyExists` on Create | Also mapped to `Conflict` |
Only auto-created claims (label `quota.miloapis.com/auto-created=true`)
are touched, so hand-authored claims sharing a name are never disturbed.
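
The branching in the table can be sketched as a pure decision function. All type and constant names here are hypothetical, and the real `createResourceClaim` works against API objects rather than this simplified struct. What admission returns for a hand-authored claim is not stated above, so the sketch only reports that such a claim is left untouched.

```go
package main

import "fmt"

// grantState is a simplified view of the claim's Granted condition.
type grantState int

const (
	grantPending grantState = iota // pending, or no Granted condition yet
	grantTrue                      // Granted=True
	grantFalse                     // Granted=False with reason QuotaExceeded
)

// existingClaim is a hypothetical stand-in for a claim found by the
// pre-create Get at the deterministic name.
type existingClaim struct {
	AutoCreated bool // carries quota.miloapis.com/auto-created=true
	Grant       grantState
}

type action string

const (
	actionCreate           action = "create"              // nothing found: plain Create
	actionAllow            action = "allow"               // capacity already reserved
	actionConflict         action = "conflict"            // user gets a retry message
	actionDeleteAndRecreate action = "delete-and-recreate" // stale pending claim
	actionLeaveUntouched   action = "leave-untouched"     // hand-authored claim
)

// resolveExisting decides what to do before Create. A nil claim means the
// Get returned NotFound.
func resolveExisting(c *existingClaim) action {
	if c == nil {
		return actionCreate
	}
	if !c.AutoCreated {
		// Only auto-created claims are ever modified or deleted.
		return actionLeaveUntouched
	}
	switch c.Grant {
	case grantTrue:
		return actionAllow
	case grantFalse:
		return actionConflict
	default:
		return actionDeleteAndRecreate
	}
}

func main() {
	fmt.Println(resolveExisting(nil))
	fmt.Println(resolveExisting(&existingClaim{AutoCreated: true, Grant: grantTrue}))
	fmt.Println(resolveExisting(&existingClaim{AutoCreated: true, Grant: grantPending}))
}
```

The `AlreadyExists` row is not modeled here because it arises on the later Create call, not from the state of the claim found by the Get.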
### 3. Garbage collection for stale pending auto-created claims
`DeniedAutoClaimCleanupController` now also deletes auto-created claims
that remain in a non-final state for longer than
`DefaultStalePendingClaimAge` (5 min, well above the 30 s admission
watch timeout). Fresh pending claims are requeued for the remaining time
rather than deleted, so in-flight admissions are never disturbed.
Denied-claim behavior is unchanged.
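
A minimal sketch of the delete-or-requeue decision follows. The `decide` function, its types, and the idea of passing the claim's age in as a parameter are assumptions for illustration; only the 5-minute threshold and the outcomes (delete denied and stale pending claims, requeue fresh pending ones, ignore granted and manual ones) come from the description. How the controller actually measures age is not stated.

```go
package main

import (
	"fmt"
	"time"
)

// Threshold from the description; the constant name is lowercased here.
const defaultStalePendingClaimAge = 5 * time.Minute

type phase int

const (
	phasePending phase = iota // non-final state
	phaseGranted
	phaseDenied
)

type cleanupAction int

const (
	cleanupIgnore cleanupAction = iota
	cleanupDelete
	cleanupRequeue
)

type decision struct {
	Action       cleanupAction
	RequeueAfter time.Duration // set only for cleanupRequeue
}

// decide returns what the cleanup loop should do with one claim.
// age is how long the claim has existed (measurement is an assumption).
func decide(autoCreated bool, p phase, age time.Duration) decision {
	if !autoCreated || p == phaseGranted {
		return decision{Action: cleanupIgnore}
	}
	if p == phaseDenied {
		// Existing behavior: denied auto-created claims are reaped.
		return decision{Action: cleanupDelete}
	}
	// Pending: a claim exactly at the threshold is treated as stale in
	// this sketch so the requeue delay is never zero.
	if age >= defaultStalePendingClaimAge {
		return decision{Action: cleanupDelete}
	}
	return decision{
		Action:       cleanupRequeue,
		RequeueAfter: defaultStalePendingClaimAge - age,
	}
}

func main() {
	d := decide(true, phasePending, 90*time.Second)
	fmt.Println(d.Action == cleanupRequeue, d.RequeueAfter)
}
```

Requeuing for the remaining time, rather than polling on a fixed interval, means a claim is revisited once, at the moment it would become stale.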
### Observability
New counter
`milo_quota_admission_existing_claim_resolution_total{resolution=...}`
tracks how often each pre-create path fires (`granted`, `denied`,
`deleted_pending`, `create_conflict`), so the stale-claim behavior is no
longer invisible.
## Files touched
- `internal/quota/admission/errors.go` (new)
- `internal/quota/admission/plugin.go`
- `internal/quota/admission/plugin_test.go`
- `internal/quota/controllers/lifecycle/cleanup.go`
- `internal/quota/controllers/lifecycle/cleanup_test.go` (new)
## Test plan
- [x] `go test ./internal/quota/...` passes locally
- [x] `TestResolveExistingClaim` covers granted short-circuit, denied
conflict, pending delete+recreate, and clean-slate create
- [x] `TestUserFacingClaimError` asserts each claim-failure kind
produces its expected message
- [x] `TestClaimWaitScenarios` extended with a `timeout` case (plus
the mock now matches `watchManager.evaluateClaimStatus` semantics)
- [x] `TestDeniedAutoClaimCleanupController` covers denied-deleted,
stale-pending-deleted, fresh-pending-requeued,
no-condition-pending-deleted, granted-ignored, and manual-ignored
- [x] Manual verification: create a DNS record set, kill the admission
plugin mid-flight to leave a pending claim, then retry and confirm the
retry succeeds rather than returning "insufficient quota"
- [x] Check the new
`milo_quota_admission_existing_claim_resolution_total` counter on a
staging cluster
Made with [Cursor](https://cursor.com)

5 files changed: 1011 additions & 54 deletions