Skip to content

Commit 9d1cdff

Browse files
fix(resources): delete-by-id, failed provision rows, RPC-deadline floor (Wave-2 A1)
Three related durability fixes from the cross-interface sweep: 1. DELETE /api/v1/resources/:id accepts BOTH the row id (as returned by GET /api/v1/resources) and the provision token. Resolution tries token first (historical contract preserved byte-for-byte), then falls back to id (resolveResourceByTokenOrID). The natural list -> delete-by-id flow 404'd 100% of the time before. Authorization is identical for both forms (team-scoped, 404-not-403). The deprovision RPC / cache key / IAM audit are now always addressed by resource.Token, never the raw path param. The env-policy middleware lookup (ResourceEnvByTokenOrIDForMiddleware) got the same fallback so the id-addressed form can't skip env_policy enforcement. openapi.go documents the dual-form param. 2. Provision rollback persists a status='failed' row instead of soft-deleting it, so a failed/timed-out provision is a pollable terminal state. models.MarkResourceFailed (atomic pending->failed, mirror of MarkResourceActive) replaces models.SoftDeleteResource at all 16 rollback sites (db/cache/nosql/queue/vector/storage x anon/auth/twin + finalizeProvision). Failed rows are visible in lists (only 'deleted' is hidden), deletable, quota-exempt (CountActiveResourcesByTeamAndType filters status='active'), and invisible to the reconciler ('pending') and TTL reaper (ReapableStatuses) - no live infra backs them. Migration 070 widens resources_status_check with 'failed'; 024/049/057 re-add the same canonical set (forward-consistency: the runner re-applies every migration on boot - 2026-06-10 incident class). SoftDeleteResource itself is removed (dead after this change). 3. Provision RPC deadlines can no longer undercut the provisioner's dedicated pod-ready wait. cacheProvisionTimeout was a 45s literal while the provisioner's k8s backends wait up to 3m for pod readiness (redisK8sReadyTO et al.) - a cold-pod Redis provision was killed mid-flight. New named constants dedicatedProvisionReadyWait (3m) + provisionRPCDeadlineMargin (30s) derive the deadline (3m30s); TestProvisionTimeouts_NeverUndercutDedicatedReadyWait enforces the floor for every tier and every provision deadline in the package. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
1 parent d77bf23 commit 9d1cdff

29 files changed

Lines changed: 603 additions & 124 deletions

docs/openapi.yaml

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -538,7 +538,17 @@ paths:
538538
security:
539539
- BearerAuth: []
540540
parameters:
541-
- $ref: "#/components/parameters/id"
541+
- name: id
542+
in: path
543+
required: true
544+
description: >-
545+
Accepts EITHER the resource's `id` (as returned by
546+
GET /api/v1/resources) OR its provision `token`. Resolution tries
547+
token first, then id; authorization (team ownership) is identical
548+
for both forms.
549+
schema:
550+
type: string
551+
format: uuid
542552
responses:
543553
"200":
544554
description: Resource deleted.

internal/db/migrations/024_resources_paused_status.sql

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -29,11 +29,12 @@ ALTER TABLE resources
2929
ADD CONSTRAINT resources_status_check
3030
-- Forward-consistent full status set (incident 2026-06-10). The migration
3131
-- runner RE-APPLIES every migration on each boot; a NARROW constraint here
32-
-- (missing 'suspended' [added in 049] / 'pending' [added in 057]) crashes
33-
-- the boot the moment a row already holds one of those later-added — but
34-
-- valid — statuses. Re-adding the canonical set makes 024 safe to re-run
35-
-- regardless of data. (024/049/057 now all define the same set.)
36-
CHECK (status IN ('pending', 'active', 'paused', 'suspended', 'expired', 'deleted', 'reaped'));
32+
-- (missing 'suspended' [added in 049] / 'pending' [added in 057] /
33+
-- 'failed' [added in 070]) crashes the boot the moment a row already holds
34+
-- one of those later-added — but valid — statuses. Re-adding the canonical
35+
-- set makes 024 safe to re-run regardless of data. (024/049/057/070 now
36+
-- all define the same set.)
37+
CHECK (status IN ('pending', 'active', 'paused', 'suspended', 'failed', 'expired', 'deleted', 'reaped'));
3738

3839
ALTER TABLE resources ADD COLUMN IF NOT EXISTS paused_at TIMESTAMPTZ;
3940

internal/db/migrations/049_resources_suspended_status.sql

Lines changed: 4 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -31,9 +31,10 @@ ALTER TABLE resources DROP CONSTRAINT IF EXISTS resources_status_check;
3131
ALTER TABLE resources
3232
ADD CONSTRAINT resources_status_check
3333
-- Forward-consistent full status set (incident 2026-06-10): include 'pending'
34-
-- (added in 057) so re-applying 049 on boot can't crash on a valid pending
35-
-- row before 057 runs. 024/049/057 now all define the same canonical set.
36-
CHECK (status IN ('pending', 'active', 'paused', 'suspended', 'expired', 'deleted', 'reaped'));
34+
-- (added in 057) and 'failed' (added in 070) so re-applying 049 on boot
35+
-- can't crash on a valid later-added row before its own migration runs.
36+
-- 024/049/057/070 now all define the same canonical set.
37+
CHECK (status IN ('pending', 'active', 'paused', 'suspended', 'failed', 'expired', 'deleted', 'reaped'));
3738

3839
-- Partial index for the auto-unsuspend scan.
3940
-- EnforceStorageQuotaWorker scans WHERE status = 'suspended' on every run to

internal/db/migrations/057_resources_pending_status.sql

Lines changed: 4 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -42,7 +42,10 @@
4242
ALTER TABLE resources DROP CONSTRAINT IF EXISTS resources_status_check;
4343
ALTER TABLE resources
4444
ADD CONSTRAINT resources_status_check
45-
CHECK (status IN ('pending', 'active', 'paused', 'suspended', 'expired', 'deleted', 'reaped'));
45+
-- Forward-consistent full status set (incident 2026-06-10): include 'failed'
46+
-- (added in 070) so re-applying 057 on boot can't crash on a valid failed
47+
-- row before 070 runs. 024/049/057/070 now all define the same canonical set.
48+
CHECK (status IN ('pending', 'active', 'paused', 'suspended', 'failed', 'expired', 'deleted', 'reaped'));
4649

4750
-- idx_resources_pending_sweep (the partial index the reconciler scans) was
4851
-- already created by migration 030_resource_heartbeat.sql — it indexes
Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
-- Migration: 070_resources_failed_status
2+
--
3+
-- Add 'failed' as a permitted value in the resources.status CHECK constraint.
4+
--
5+
-- Background (Wave-2 A1, cross-interface durability sweep 2026-06-11):
6+
-- When the synchronous backend provision RPC fails or times out, every
7+
-- provision handler used to ROLL BACK the just-created 'pending' row by
8+
-- soft-deleting it (status='deleted'). From the caller's point of view the
9+
-- resource simply VANISHED: a timed-out /db/new (or /cache/new, /nosql/new,
10+
-- /queue/new, /vector/new, /storage/new, twin provision) returned 503 and
11+
-- any follow-up GET on the token 404'd because the public read surface
12+
-- hides deleted rows. Agents polling for an in-flight provision had no
13+
-- terminal state to observe.
14+
--
15+
-- The handlers now mark the row status='failed' instead
16+
-- (models.MarkResourceFailed, pending→failed). A failed row is:
17+
-- * visible in GET /api/v1/resources (lists filter only status='deleted')
18+
-- * deletable via DELETE /api/v1/resources/:id (the status!='deleted'
19+
-- guard in SoftDeleteResourceIfActive admits it)
20+
-- * NOT counted toward plan quotas (quota counts filter status='active')
21+
-- * never swept by the provisioner_reconciler (it keys on 'pending') or
22+
-- the TTL reaper (ReapableStatuses() = active/paused/suspended) — the
23+
-- backend object either never existed or was already torn down by the
24+
-- rollback's best-effort cleanup, so there is nothing to deprovision.
25+
--
26+
-- Without this migration every MarkResourceFailed UPDATE would hit
27+
-- constraint-violation 23514 and the rollback path itself would error.
28+
--
29+
-- Status semantics (updated):
30+
-- pending — row inserted, backend provision RPC + URL persistence not yet
31+
-- complete; the transient mid-provision state. NOT usable.
32+
-- active — provisioned, accepting connections (or status-only for queue/storage/webhook)
33+
-- paused — user-initiated pause (Pro+ only); infra revoked; data preserved
34+
-- suspended — system-initiated suspend on storage quota breach; infra revoked
35+
-- failed — terminal: the backend provision RPC failed/timed out and the
36+
-- rollback kept the row as a pollable terminal state. No live
37+
-- backing infra. Visible in lists; deletable; quota-exempt.
38+
-- expired — TTL reached (anonymous resources); soft-deleted equivalent for anon
39+
-- deleted — user-deleted (permanent credentials removed)
40+
-- reaped — legacy: worker-reaped before 'deleted' was the canonical term
41+
--
42+
-- Idempotent: DROP IF EXISTS + re-ADD with the same syntax, so re-running on a
43+
-- schema that already applied this migration is harmless. Migrations 024/049/057
44+
-- were updated in the same change to define this same canonical set (the
45+
-- migration runner RE-APPLIES every migration on each boot — see the
46+
-- forward-consistency note in 024, incident 2026-06-10).
47+
48+
ALTER TABLE resources DROP CONSTRAINT IF EXISTS resources_status_check;
49+
ALTER TABLE resources
50+
ADD CONSTRAINT resources_status_check
51+
CHECK (status IN ('pending', 'active', 'paused', 'suspended', 'failed', 'expired', 'deleted', 'reaped'));

internal/db/postgres_migrations_test.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -511,7 +511,7 @@ func TestConnectRedis_PanicsOnUnreachable(t *testing.T) {
511511
// re-applying it mid-sequence can never reject a valid row. No DB needed —
512512
// reads the embedded SQL via the same seam the runner uses.
513513
func TestMigrations_ResourcesStatusCheck_ForwardConsistent(t *testing.T) {
514-
canonical := []string{"pending", "active", "paused", "suspended", "expired", "deleted", "reaped"}
514+
canonical := []string{"pending", "active", "paused", "suspended", "failed", "expired", "deleted", "reaped"}
515515
checked := 0
516516
for _, name := range MigrationFiles() {
517517
b, err := readMigrationFile(name)

internal/handlers/cache.go

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -221,8 +221,9 @@ func (h *CacheHandler) NewCache(c *fiber.Ctx) error {
221221
middleware.RecordProvisionFail("redis", middleware.ProvisionFailBackendUnavailable)
222222
slog.Error("cache.new.provision_failed",
223223
"error", err, "token", tokenStr, "request_id", requestID)
224-
// Soft-delete the resource record so limits aren't falsely consumed.
225-
if delErr := models.SoftDeleteResource(ctx, h.db, resource.ID); delErr != nil {
224+
// Mark the pending row 'failed' — a pollable terminal state. Failed
225+
// rows never count against quota (counts filter status='active').
226+
if delErr := models.MarkResourceFailed(ctx, h.db, resource.ID); delErr != nil {
226227
slog.Error("cache.new.soft_delete_failed", "error", delErr, "resource_id", resource.ID)
227228
}
228229
return respondProvisionFailed(c, err, "Failed to provision Redis namespace")
@@ -387,7 +388,7 @@ func (h *CacheHandler) newCacheAuthenticated(
387388
middleware.RecordProvisionFail("redis", middleware.ProvisionFailBackendUnavailable)
388389
slog.Error("cache.new.provision_failed_auth",
389390
"error", err, "token", tokenStr, "team_id", teamIDStr, "request_id", requestID)
390-
if delErr := models.SoftDeleteResource(ctx, h.db, resource.ID); delErr != nil {
391+
if delErr := models.MarkResourceFailed(ctx, h.db, resource.ID); delErr != nil {
391392
slog.Error("cache.new.soft_delete_failed_auth", "error", delErr, "resource_id", resource.ID)
392393
}
393394
return respondProvisionFailed(c, err, "Failed to provision Redis namespace")
@@ -563,7 +564,7 @@ func (h *CacheHandler) ProvisionForTwinCore(ctx context.Context, in ProvisionFor
563564
middleware.RecordProvisionFail(models.ResourceTypeRedis, middleware.ProvisionFailBackendUnavailable)
564565
slog.Error("twin.cache.provision_failed",
565566
"error", err, "token", tokenStr, "team_id", in.TeamID, "request_id", in.RequestID)
566-
if delErr := models.SoftDeleteResource(ctx, h.db, resource.ID); delErr != nil {
567+
if delErr := models.MarkResourceFailed(ctx, h.db, resource.ID); delErr != nil {
567568
slog.Error("twin.cache.soft_delete_failed",
568569
"error", delErr, "resource_id", resource.ID, "request_id", in.RequestID)
569570
}

internal/handlers/coverage_provisioner_grpc_test.go

Lines changed: 12 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -67,7 +67,8 @@ type fakeProvisioner struct {
6767
failProvision bool
6868
deprovisionCalls int
6969

70-
lastReq *provisionerv1.ProvisionRequest
70+
lastReq *provisionerv1.ProvisionRequest
71+
lastDeprovisionReq *provisionerv1.DeprovisionRequest
7172
}
7273

7374
func (f *fakeProvisioner) ProvisionResource(_ context.Context, req *provisionerv1.ProvisionRequest) (*provisionerv1.ProvisionResponse, error) {
@@ -113,13 +114,21 @@ func (f *fakeProvisioner) ProvisionResource(_ context.Context, req *provisionerv
113114
}
114115
}
115116

116-
func (f *fakeProvisioner) DeprovisionResource(_ context.Context, _ *provisionerv1.DeprovisionRequest) (*provisionerv1.DeprovisionResponse, error) {
117+
func (f *fakeProvisioner) DeprovisionResource(_ context.Context, req *provisionerv1.DeprovisionRequest) (*provisionerv1.DeprovisionResponse, error) {
117118
f.mu.Lock()
118119
f.deprovisionCalls++
120+
f.lastDeprovisionReq = req
119121
f.mu.Unlock()
120122
return &provisionerv1.DeprovisionResponse{}, nil
121123
}
122124

125+
// lastDeprovision returns the most recent DeprovisionRequest (nil if none).
126+
func (f *fakeProvisioner) lastDeprovision() *provisionerv1.DeprovisionRequest {
127+
f.mu.Lock()
128+
defer f.mu.Unlock()
129+
return f.lastDeprovisionReq
130+
}
131+
123132
func (f *fakeProvisioner) deprovisionCount() int {
124133
f.mu.Lock()
125134
defer f.mu.Unlock()
@@ -231,6 +240,7 @@ func setupGRPCProvFixture(t *testing.T, fake *fakeProvisioner, badAESKey bool) g
231240

232241
middleware.SetRoleLookupDB(db)
233242
api := app.Group("/api/v1", middleware.RequireAuth(cfg), middleware.PopulateTeamRole())
243+
api.Get("/resources", resourceH.List)
234244
api.Get("/resources/:id", resourceH.Get)
235245
api.Delete("/resources/:id", resourceH.Delete)
236246
api.Post("/resources/:id/provision-twin", twinH.ProvisionTwin)

internal/handlers/db.go

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -260,7 +260,7 @@ func (h *DBHandler) NewDB(c *fiber.Ctx) error {
260260
middleware.RecordProvisionFail("postgres", middleware.ProvisionFailBackendUnavailable)
261261
slog.Error("db.new.provision_failed",
262262
"error", err, "token", tokenStr, "request_id", requestID)
263-
if delErr := models.SoftDeleteResource(ctx, h.db, resource.ID); delErr != nil {
263+
if delErr := models.MarkResourceFailed(ctx, h.db, resource.ID); delErr != nil {
264264
slog.Error("db.new.soft_delete_failed", "error", delErr, "resource_id", resource.ID)
265265
}
266266
return respondProvisionFailed(c, err, "Failed to provision Postgres database")
@@ -438,7 +438,7 @@ func (h *DBHandler) newDBAuthenticated(
438438
middleware.RecordProvisionFail("postgres", middleware.ProvisionFailBackendUnavailable)
439439
slog.Error("db.new.provision_failed_auth",
440440
"error", err, "token", tokenStr, "team_id", teamIDStr, "request_id", requestID)
441-
if delErr := models.SoftDeleteResource(ctx, h.db, resource.ID); delErr != nil {
441+
if delErr := models.MarkResourceFailed(ctx, h.db, resource.ID); delErr != nil {
442442
slog.Error("db.new.soft_delete_failed_auth", "error", delErr, "resource_id", resource.ID)
443443
}
444444
return respondProvisionFailed(c, err, "Failed to provision Postgres database")
@@ -647,7 +647,7 @@ func (h *DBHandler) ProvisionForTwinCore(ctx context.Context, in ProvisionForTwi
647647
middleware.RecordProvisionFail(models.ResourceTypePostgres, middleware.ProvisionFailBackendUnavailable)
648648
slog.Error("twin.db.provision_failed",
649649
"error", err, "token", tokenStr, "team_id", in.TeamID, "request_id", in.RequestID)
650-
if delErr := models.SoftDeleteResource(ctx, h.db, resource.ID); delErr != nil {
650+
if delErr := models.MarkResourceFailed(ctx, h.db, resource.ID); delErr != nil {
651651
slog.Error("twin.db.soft_delete_failed",
652652
"error", delErr, "resource_id", resource.ID, "request_id", in.RequestID)
653653
}

internal/handlers/env_policy_helpers.go

Lines changed: 12 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -14,24 +14,27 @@ import (
1414

1515
"github.com/gofiber/fiber/v2"
1616
"github.com/google/uuid"
17-
"instant.dev/internal/models"
1817
)
1918

20-
// ResourceEnvByTokenForMiddleware reads the env stored on a resource row
21-
// addressed by the URL :id param (a public token UUID). Returns the env on
22-
// success or "" on any error — the env-policy middleware fails OPEN on
23-
// lookup error so a malformed/non-existent :id falls through to the
24-
// handler's own 400/404 instead of a confusing 403/env_policy_denied.
19+
// ResourceEnvByTokenOrIDForMiddleware reads the env stored on a resource row
20+
// addressed by the URL :id param — resolved first as a public token UUID,
21+
// then as the row's primary-key id (resolveResourceByTokenOrID), matching
22+
// the DELETE handler's own resolution so the env-policy gate covers BOTH
23+
// address forms (without the id fallback an id-addressed DELETE would skip
24+
// env-policy enforcement entirely). Returns the env on success or "" on any
25+
// error — the env-policy middleware fails OPEN on lookup error so a
26+
// malformed/non-existent :id falls through to the handler's own 400/404
27+
// instead of a confusing 403/env_policy_denied.
2528
//
2629
// Exported with the verbose suffix so its single intended caller (the
2730
// router wiring) is unambiguous; this is not a general-purpose helper.
28-
func ResourceEnvByTokenForMiddleware(c *fiber.Ctx, db *sql.DB) (string, error) {
31+
func ResourceEnvByTokenOrIDForMiddleware(c *fiber.Ctx, db *sql.DB) (string, error) {
2932
tokenStr := c.Params("id")
30-
token, err := uuid.Parse(tokenStr)
33+
pathUUID, err := uuid.Parse(tokenStr)
3134
if err != nil {
3235
return "", nil
3336
}
34-
r, err := models.GetResourceByToken(c.Context(), db, token)
37+
r, err := resolveResourceByTokenOrID(c.Context(), db, pathUUID)
3538
if err != nil {
3639
// Including ErrResourceNotFound — fail open so the handler returns
3740
// its own 404 (which contains a stable, agent-readable shape).

0 commit comments

Comments
 (0)