PLT-2119: Replace per-tag PSLs with single tag-agnostic PrometheusServiceLevel#620
Open
eddiebarth1 wants to merge 3 commits into
Open
PLT-2119: Replace per-tag PSLs with single tag-agnostic PrometheusServiceLevel#620eddiebarth1 wants to merge 3 commits into
eddiebarth1 wants to merge 3 commits into
Conversation
…erviceLevel
Each deploy previously created a new PrometheusServiceLevel scoped to the
current revision tag (e.g. {app}-{target}-{tag}-servicelevels). Sloth
generates burn-rate recording rules from these, so every deploy reset the
1h and 6h burn rate windows and produced a new alert series — causing
chronic SLO violations to fire N times (once per revision) rather than once.
This change:
- Adds SyncServiceLevels, a tag-agnostic plan that creates a single PSL per
app/target ({app}-{target}-servicelevels). SLI queries aggregate across all
tags, so burn rate windows are continuous across deploys.
- Removes SyncTaggedServiceLevels (dead code once incarnation is updated).
- incarnation.syncTaggedServiceLevels now: (1) upserts the shared PSL, and
(2) calls DeleteTaggedServiceLevels to clean up any legacy per-tag PSL for
the current revision, enabling a clean migration as revisions roll through
their lifecycle.
- serviceLevelLabels() omits LabelTag/version labels from the shared PSL so
DeleteTaggedServiceLevels cannot accidentally select and delete it.
Canary safety: CanaryWithSLIRules=true (the common case) directs Picchu to
query ALERTS{canary="true",tag="<revision>"}, which are produced by
SyncCanaryRules — a separate per-revision PrometheusRule that is unchanged.
This change is orthogonal to canary gate evaluation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fix 1 (critical) — Restore SLO auto-rollback
The shared tag-agnostic PSL produces Sloth alerts with no tag label.
TaggedAlerts was dropping every empty-tag sample via metricTag != "",
making IsRevisionTriggered always return false for SLO-enabled services
(CanaryWithSLIRules defaults to false, so 100% of services were affected).
Fix: drop the empty-tag guard in the canariesOnly=false branch
(!canariesOnly || metricTag != ""). Also extend IsRevisionTriggered to
check tags[""] in SLO mode, since empty-tag alerts from the shared PSL
are stored under the empty key and would not match the revision tag lookup.
Fix 2 — Delete shared PSL on app/target retirement
DeleteTaggedServiceLevels selects by {app, target, tag}; the shared PSL
(labeled {app, target} only) never matched. Added deleteServiceLevels on
Incarnation, wired it into the Deployment interface, and called it from
the Deleting and Failing state handlers after deleteTaggedServiceLevels.
DeleteServiceLevels plan already had the correct {app, target} selector.
Fix 3 — Rename cleanup
Renamed syncTaggedServiceLevels → syncServiceLevels across incarnation.go,
state.go, mock_deployment.go, and state_test.go. Removed dead tagged SLI
helpers (taggedSLISource*, serviceLevelTagged*) and the Tag field from
SLOConfig (Tag is still present as it is used by SyncCanaryRules).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds unit tests covering the tags[""] fallback in IsRevisionTriggered introduced for the shared tag-agnostic PSL: untagged SLO alerts trigger rollback in SLO mode, are ignored in canary mode, and tagged alerts take priority when both fire. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes PLT-2119
Problem
Each deploy created a new
PrometheusServiceLevelscoped to the revision tag ({app}-{target}-{tag}-servicelevels). Sloth generates burn-rate recording rules from PSLs, so every deploy reset the 1h and 6h burn-rate windows and produced duplicate alert series — one per revision.What changed
{app}-{target}-servicelevels) replaces per-tag PSLs. SLI queries aggregate across all revision tags; burn-rate windows are now continuous across deploys.deleteServiceLevelswired intoDeleting/Failingstate transitions so the shared PSL is removed when an app/target is fully retired.taglabel from the shared PSL brokeIsRevisionTriggeredfor theCanaryWithSLIRules=falsepath (the default for all services). Fixed inprometheus/api.go— untagged Sloth alerts now correctly trigger rollback.Canary impact
SyncCanaryRulesis unchanged — per-revisionPrometheusRuleobjects still carrytag=<revision>and driveALERTS{canary="true",tag="<revision>"}. Canary auto-rollback is unaffected.Test plan
go test ./...passes;TestIsRevisionTriggeredcovers the SLO rollback fallback path{app}-{target}-servicelevelsPSL created, per-tag PSLs cleaned up🤖 Generated with Claude Code