Skip to content

Commit 0dabfd1

Browse files
fix: surface precise admission errors and clear stale ResourceClaims (#573)
## The bug Ref: #512 Users occasionally saw > `Error from server (Forbidden): ... Insufficient quota resources available. Review your quota usage and reach out to support if you need additional resources.` when creating resources such as DNS record sets, even though their project quota showed only a handful of resources in use (for example 6/500). The message was misleading because the admission plugin returned it for *every* failure inside the ResourceClaim pipeline, not just actual denials. ## Root cause The `ResourceQuotaEnforcementPlugin` has three distinct failure paths, all of which were collapsed into the same 403 body: 1. **Actual denial** — `AllowanceBucketController` marks `Granted=False` with reason `QuotaExceeded`. 2. **Watch timeout** — the admission watch manager times out after 30 s because the bucket controller is still reconciling, or grants are still propagating. 3. **Stale `ResourceClaim`** — claim names are deterministic, so a leftover claim from a prior admission timeout blocks every retry with `AlreadyExists`. `DeniedAutoClaimCleanupController` only reaped `Denied` claims, not pending/timed-out ones, so these orphans could linger indefinitely. Case 3 produced the "insufficient quota" message even when the user had plenty of quota available, which is exactly what the reporter hit. ## The fix (one PR, three coordinated pieces) ### 1. Typed claim-failure kinds with distinct user-facing messages `internal/quota/admission/errors.go` (new) introduces `claimFailure` with four kinds: `Denied`, `Timeout`, `Conflict`, `Internal`. `createAndWaitForResourceClaim` now returns these and `processResourceWithPolicy` maps each to a warm, actionable 403 body: | Kind | Message | |---|---| | Denied | _You've reached your quota for this resource type (...). Delete unused resources to free up capacity, or contact support to request a higher limit._ | | Timeout | _Your request took too long to be checked against your quota. Please try again in a moment — if this keeps happening, contact support._ | | Conflict | _We're still cleaning up from a previous attempt to create this resource (...). Please try again in a few seconds._ | | Internal | _Something went wrong while checking your quota for this request. Please try again — if this keeps happening, contact support._ | ### 2. Pre-create handling of existing claims Before `Create`, `createResourceClaim` now does a `Get` at the deterministic name and branches: | Existing claim state | Action | |---|---| | `Granted=True` | Short-circuit admission to allow (capacity already reserved for this resource) | | `Granted=False` / `QuotaExceeded` | Return `Conflict` so the user gets a retry message, not "insufficient quota" | | Pending or no condition | Delete inline, then Create a fresh claim | | `AlreadyExists` on Create | Also mapped to `Conflict` | Only auto-created claims (label `quota.miloapis.com/auto-created=true`) are touched, so hand-authored claims sharing a name are never disturbed. ### 3. Garbage collection for stale pending auto-created claims `DeniedAutoClaimCleanupController` now also deletes auto-created claims that remain in a non-final state for longer than `DefaultStalePendingClaimAge` (5 min, well above the 30 s admission watch timeout). Fresh pending claims are requeued for the remaining time rather than deleted, so in-flight admissions are never disturbed. Denied-claim behavior is unchanged. ### Observability New counter `milo_quota_admission_existing_claim_resolution_total{resolution=...}` tracks how often each pre-create path fires (`granted`, `denied`, `deleted_pending`, `create_conflict`), so the stale-claim behavior is no longer invisible. ## Files touched - `internal/quota/admission/errors.go` (new) - `internal/quota/admission/plugin.go` - `internal/quota/admission/plugin_test.go` - `internal/quota/controllers/lifecycle/cleanup.go` - `internal/quota/controllers/lifecycle/cleanup_test.go` (new) ## Test plan - [x] \`go test ./internal/quota/...\` passes locally - [x] \`TestResolveExistingClaim\` covers granted short-circuit, denied conflict, pending delete+recreate, and clean-slate create - [x] \`TestUserFacingClaimError\` asserts each claim-failure kind produces its expected message - [x] \`TestClaimWaitScenarios\` extended with a \`timeout\` case (plus the mock now matches \`watchManager.evaluateClaimStatus\` semantics) - [x] \`TestDeniedAutoClaimCleanupController\` covers denied-deleted, stale-pending-deleted, fresh-pending-requeued, no-condition-pending-deleted, granted-ignored, and manual-ignored - [x] Manual verification: create a DNS record set, kill the admission plugin mid-flight to leave a pending claim, then retry and confirm the retry succeeds rather than returning "insufficient quota" - [x] Check the new \`milo_quota_admission_existing_claim_resolution_total\` counter on a staging cluster Made with [Cursor](https://cursor.com)
2 parents 34bbe04 + 276d223 commit 0dabfd1

5 files changed

Lines changed: 1011 additions & 54 deletions

File tree

internal/quota/admission/errors.go

Lines changed: 102 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,102 @@
1+
package admission
2+
3+
import (
4+
"errors"
5+
"fmt"
6+
)
7+
8+
// errAlreadyGranted is a sentinel returned by createResourceClaim when an
9+
// existing ResourceClaim with the predetermined name already has a Granted
10+
// condition set to True. Its capacity was reserved on a prior admission pass
11+
// and still applies to this resource, so admission can allow the request
12+
// without registering a watch waiter or issuing a new Create. Callers
13+
// compare against this value with == (the sentinel is never wrapped).
14+
var errAlreadyGranted = errors.New("ResourceClaim already granted")
15+
16+
// claimFailureKind classifies why a ResourceClaim could not be resolved to a
17+
// Granted outcome during admission. It lets the admission plugin produce a
18+
// distinct, actionable user-facing message for each root cause rather than
19+
// collapsing every failure into "Insufficient quota".
20+
type claimFailureKind int
21+
22+
const (
23+
// claimFailureDenied indicates the AllowanceBucketController rejected the
24+
// claim because there was not enough available capacity. This is the only
25+
// case where "Insufficient quota" is accurate.
26+
claimFailureDenied claimFailureKind = iota
27+
28+
// claimFailureTimeout indicates the claim was created but did not reach a
29+
// final Granted condition within the admission plugin's watch timeout.
30+
// Typical causes: bucket controller still reconciling, grants still
31+
// propagating, or a slow control plane.
32+
claimFailureTimeout
33+
34+
// claimFailureConflict indicates a ResourceClaim with the deterministic
35+
// name already exists in a non-usable state (e.g., a previously denied
36+
// claim that has not yet been garbage collected). The caller should retry
37+
// after the GC controller deletes the stale claim.
38+
claimFailureConflict
39+
40+
// claimFailureInternal covers template render failures, watch manager
41+
// startup failures, dynamic client errors, and anything else that is not
42+
// attributable to quota capacity.
43+
claimFailureInternal
44+
)
45+
46+
// claimFailure is the structured error returned from the claim creation/wait
47+
// pipeline. It carries a classification so the admission plugin can surface a
48+
// tailored message to the caller while still preserving the underlying error
49+
// for logs and traces.
50+
type claimFailure struct {
51+
kind claimFailureKind
52+
reason string // machine-readable reason (e.g. condition reason)
53+
message string // user-facing message fragment (e.g. denial message)
54+
cause error // wrapped error for logging/tracing
55+
}
56+
57+
func (e *claimFailure) Error() string {
58+
switch {
59+
case e.message != "" && e.cause != nil:
60+
return fmt.Sprintf("%s: %v", e.message, e.cause)
61+
case e.message != "":
62+
return e.message
63+
case e.cause != nil:
64+
return e.cause.Error()
65+
default:
66+
return fmt.Sprintf("claim failure (kind=%d)", e.kind)
67+
}
68+
}
69+
70+
func (e *claimFailure) Unwrap() error { return e.cause }
71+
72+
// asClaimFailure unwraps a claimFailure from err if present. It returns the
73+
// claimFailure and true, or (nil, false) if err does not carry one.
74+
func asClaimFailure(err error) (*claimFailure, bool) {
75+
var f *claimFailure
76+
if errors.As(err, &f) {
77+
return f, true
78+
}
79+
return nil, false
80+
}
81+
82+
// newDeniedFailure constructs a claimFailure for an actual quota denial.
83+
func newDeniedFailure(reason, message string) *claimFailure {
84+
return &claimFailure{kind: claimFailureDenied, reason: reason, message: message}
85+
}
86+
87+
// newTimeoutFailure constructs a claimFailure for an admission watch timeout.
88+
func newTimeoutFailure(cause error) *claimFailure {
89+
return &claimFailure{kind: claimFailureTimeout, cause: cause}
90+
}
91+
92+
// newConflictFailure constructs a claimFailure for a pre-existing claim that
93+
// blocks retry (typically a denied claim awaiting GC).
94+
func newConflictFailure(message string, cause error) *claimFailure {
95+
return &claimFailure{kind: claimFailureConflict, message: message, cause: cause}
96+
}
97+
98+
// newInternalFailure constructs a claimFailure for infrastructure errors
99+
// (template render, watch manager, dynamic client, etc).
100+
func newInternalFailure(cause error) *claimFailure {
101+
return &claimFailure{kind: claimFailureInternal, cause: cause}
102+
}

0 commit comments

Comments
 (0)