Skip to content

PLT-2119: Replace per-tag PSLs with single tag-agnostic PrometheusServiceLevel#620

Open
eddiebarth1 wants to merge 3 commits into
mainfrom
plt-2119/stabilize-slo-tag-agnostic-psl
Open

PLT-2119: Replace per-tag PSLs with single tag-agnostic PrometheusServiceLevel#620
eddiebarth1 wants to merge 3 commits into
mainfrom
plt-2119/stabilize-slo-tag-agnostic-psl

Conversation

@eddiebarth1
Copy link
Copy Markdown
Contributor

@eddiebarth1 eddiebarth1 commented Apr 30, 2026

Closes PLT-2119

Problem

Each deploy created a new PrometheusServiceLevel scoped to the revision tag ({app}-{target}-{tag}-servicelevels). Sloth generates burn-rate recording rules from PSLs, so every deploy reset the 1h and 6h burn-rate windows and produced duplicate alert series — one per revision.

What changed

  • One shared PSL per app/target ({app}-{target}-servicelevels) replaces per-tag PSLs. SLI queries aggregate across all revision tags; burn-rate windows are now continuous across deploys.
  • Rolling migration: each reconcile deletes the legacy per-tag PSL for the current revision. No coordinated rollout needed.
  • PSL cleanup on retirement: deleteServiceLevels wired into Deleting/Failing state transitions so the shared PSL is removed when an app/target is fully retired.
  • SLO auto-rollback restored: removing the tag label from the shared PSL broke IsRevisionTriggered for the CanaryWithSLIRules=false path (the default for all services). Fixed in prometheus/api.go — untagged Sloth alerts now correctly trigger rollback.

Canary impact

SyncCanaryRules is unchanged — per-revision PrometheusRule objects still carry tag=<revision> and drive ALERTS{canary="true",tag="<revision>"}. Canary auto-rollback is unaffected.

Test plan

  • go test ./... passes; TestIsRevisionTriggered covers the SLO rollback fallback path
  • Deploy to staging: confirm {app}-{target}-servicelevels PSL created, per-tag PSLs cleaned up
  • Verify burn-rate windows survive a deploy without resetting
  • Force a canary violation and confirm rollback fires

🤖 Generated with Claude Code

Eduardo Barth and others added 3 commits April 30, 2026 15:34
…erviceLevel

Each deploy previously created a new PrometheusServiceLevel scoped to the
current revision tag (e.g. {app}-{target}-{tag}-servicelevels). Sloth
generates burn-rate recording rules from these, so every deploy reset the
1h and 6h burn rate windows and produced a new alert series — causing
chronic SLO violations to fire N times (once per revision) rather than once.

This change:
- Adds SyncServiceLevels, a tag-agnostic plan that creates a single PSL per
  app/target ({app}-{target}-servicelevels). SLI queries aggregate across all
  tags, so burn rate windows are continuous across deploys.
- Removes SyncTaggedServiceLevels (dead code once incarnation is updated).
- incarnation.syncTaggedServiceLevels now: (1) upserts the shared PSL, and
  (2) calls DeleteTaggedServiceLevels to clean up any legacy per-tag PSL for
  the current revision, enabling a clean migration as revisions roll through
  their lifecycle.
- serviceLevelLabels() omits LabelTag/version labels from the shared PSL so
  DeleteTaggedServiceLevels cannot accidentally select and delete it.

Canary safety: CanaryWithSLIRules=true (the common case) directs Picchu to
query ALERTS{canary="true",tag="<revision>"}, which are produced by
SyncCanaryRules — a separate per-revision PrometheusRule that is unchanged.
This change is orthogonal to canary gate evaluation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix 1 (critical) — Restore SLO auto-rollback
The shared tag-agnostic PSL produces Sloth alerts with no tag label.
TaggedAlerts was dropping every empty-tag sample via metricTag != "",
making IsRevisionTriggered always return false for SLO-enabled services
(CanaryWithSLIRules defaults to false, so 100% of services were affected).

Fix: drop the empty-tag guard in the canariesOnly=false branch
(!canariesOnly || metricTag != ""). Also extend IsRevisionTriggered to
check tags[""] in SLO mode, since empty-tag alerts from the shared PSL
are stored under the empty key and would not match the revision tag lookup.

Fix 2 — Delete shared PSL on app/target retirement
DeleteTaggedServiceLevels selects by {app, target, tag}; the shared PSL
(labeled {app, target} only) never matched. Added deleteServiceLevels on
Incarnation, wired it into the Deployment interface, and called it from
the Deleting and Failing state handlers after deleteTaggedServiceLevels.
DeleteServiceLevels plan already had the correct {app, target} selector.

Fix 3 — Rename cleanup
Renamed syncTaggedServiceLevels → syncServiceLevels across incarnation.go,
state.go, mock_deployment.go, and state_test.go. Removed dead tagged SLI
helpers (taggedSLISource*, serviceLevelTagged*) and the Tag field from
SLOConfig (Tag is still present as it is used by SyncCanaryRules).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds unit tests covering the tags[""] fallback in IsRevisionTriggered
introduced for the shared tag-agnostic PSL: untagged SLO alerts trigger
rollback in SLO mode, are ignored in canary mode, and tagged alerts take
priority when both fire.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant