feat: add flagger_canary_phase metric with granular phase values #1927

Open
Softer wants to merge 1 commit into fluxcd:main from Softer:feat/canary-phase-metric

@Softer commented Jun 3, 2026

Motivation

flagger_canary_status collapses the 11 canary phases into only 3 values (0 running, 1 successful, 2 failed). On a Grafana state-timeline this makes it impossible to distinguish WaitingPromotion, Promoting, Finalising and Succeeded — they all map to 1.

Changing flagger_canary_status would break every existing dashboard and alert, so this PR adds a new metric instead and leaves the existing one untouched.

What this PR does

Adds a new gauge flagger_canary_phase (labels: name, namespace) that exposes each phase as a unique value via a deterministic phase-to-value map:

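| Value | Phase            | Value | Phase       |
|-------|------------------|-------|-------------|
| 0     | Initializing     | 6     | Finalising  |
| 1     | Initialized      | 7     | Succeeded   |
| 2     | Waiting          | 8     | Failed      |
| 3     | Progressing      | 9     | Terminating |
| 4     | WaitingPromotion | 10    | Terminated  |
| 5     | Promoting        |       |             |
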
  • SetStatus now also sets the new gauge (via SetPhase), so every existing call site is covered without changing the scheduler.
  • flagger_canary_status is not modified.
  • The Terminating (9) phase is recorded from the finalizer (for revertOnDeletion: true canaries), and Terminated (10) from the informer delete handler for any deletion. A deleted canary therefore keeps emitting a filterable value, so queries can exclude removed canaries with flagger_canary_phase < 9.
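
The mapping and the SetStatus-to-SetPhase delegation described in the bullets above can be sketched roughly as follows. This is a dependency-free illustration, not the code in this PR: `CanaryPhase`, `Recorder`, `NewRecorder`, and `legacyStatus` are stand-ins invented for the sketch (the real recorder works with `flaggerv1.CanaryPhase` and Prometheus gauges), and the maps keyed by "namespace/name" only imitate the `name` and `namespace` labels.

```go
package main

import "fmt"

// CanaryPhase is a local stand-in for flaggerv1.CanaryPhase (assumption).
type CanaryPhase string

const (
	PhaseInitializing     CanaryPhase = "Initializing"
	PhaseInitialized      CanaryPhase = "Initialized"
	PhaseWaiting          CanaryPhase = "Waiting"
	PhaseProgressing      CanaryPhase = "Progressing"
	PhaseWaitingPromotion CanaryPhase = "WaitingPromotion"
	PhasePromoting        CanaryPhase = "Promoting"
	PhaseFinalising       CanaryPhase = "Finalising"
	PhaseSucceeded        CanaryPhase = "Succeeded"
	PhaseFailed           CanaryPhase = "Failed"
	PhaseTerminating      CanaryPhase = "Terminating"
	PhaseTerminated       CanaryPhase = "Terminated"
)

// canaryPhaseValues mirrors the table in the description: one value per phase.
var canaryPhaseValues = map[CanaryPhase]float64{
	PhaseInitializing: 0, PhaseInitialized: 1, PhaseWaiting: 2,
	PhaseProgressing: 3, PhaseWaitingPromotion: 4, PhasePromoting: 5,
	PhaseFinalising: 6, PhaseSucceeded: 7, PhaseFailed: 8,
	PhaseTerminating: 9, PhaseTerminated: 10,
}

// Recorder is a hypothetical in-memory stand-in for the metrics recorder.
// Each map plays the role of a gauge keyed by "namespace/name".
type Recorder struct {
	status map[string]float64 // stands in for flagger_canary_status
	phase  map[string]float64 // stands in for flagger_canary_phase
}

// NewRecorder returns a Recorder with empty gauges.
func NewRecorder() *Recorder {
	return &Recorder{status: map[string]float64{}, phase: map[string]float64{}}
}

// SetPhase records the granular value; phases missing from the map are skipped.
func (r *Recorder) SetPhase(key string, p CanaryPhase) {
	if v, ok := canaryPhaseValues[p]; ok {
		r.phase[key] = v
	}
}

// SetStatus keeps writing the legacy three-value gauge and also delegates
// to SetPhase, so existing callers update both metrics.
func (r *Recorder) SetStatus(key string, p CanaryPhase) {
	r.status[key] = legacyStatus(p)
	r.SetPhase(key, p)
}

// legacyStatus models the collapse described in the motivation. Failed and the
// four phases named there come from the description; treating every other
// phase as 0 is an assumption of this sketch and may differ from Flagger.
func legacyStatus(p CanaryPhase) float64 {
	switch p {
	case PhaseFailed:
		return 2
	case PhaseWaitingPromotion, PhasePromoting, PhaseFinalising, PhaseSucceeded:
		return 1
	default:
		return 0
	}
}

func main() {
	r := NewRecorder()
	for _, p := range []CanaryPhase{PhaseWaitingPromotion, PhasePromoting, PhaseSucceeded} {
		r.SetStatus("test/podinfo", p)
		fmt.Printf("%-16s status=%v phase=%v\n", p, r.status["test/podinfo"], r.phase["test/podinfo"])
	}
}
```

In this sketch the three phases in `main` all produce the same legacy status while the phase gauge changes on every call, which is the distinction a state-timeline panel needs.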

This gives a non-breaking answer to the stale-metric problem in #1029: instead of deleting metrics on canary removal (flagged as a breaking change in #1856), the phase metric exposes a terminated sentinel that dashboards/alerts can filter on. It is also relevant to #1819, where distinguishing WaitingPromotion matters.
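
For consumers that post-process scraped samples instead of filtering in a query, the same `< 9` threshold could be applied client-side. The sketch below assumes a simplified, hypothetical `Sample` type and a helper named `liveCanaries`; neither exists in Flagger or in this PR.

```go
package main

import "fmt"

// Sample is a hypothetical, simplified view of one flagger_canary_phase series.
type Sample struct {
	Name      string
	Namespace string
	Value     float64
}

// terminatingThreshold is the first value that marks a canary as being
// removed (9 = Terminating, 10 = Terminated in the table above).
const terminatingThreshold = 9

// liveCanaries keeps only the samples whose phase value is below the
// threshold, mirroring the filter flagger_canary_phase < 9.
func liveCanaries(samples []Sample) []Sample {
	live := make([]Sample, 0, len(samples))
	for _, s := range samples {
		if s.Value < terminatingThreshold {
			live = append(live, s)
		}
	}
	return live
}

func main() {
	samples := []Sample{
		{Name: "podinfo", Namespace: "test", Value: 3},   // Progressing
		{Name: "frontend", Namespace: "prod", Value: 7},  // Succeeded
		{Name: "legacy", Namespace: "prod", Value: 10},   // Terminated
		{Name: "backend", Namespace: "prod", Value: 9},   // Terminating
	}
	for _, s := range liveCanaries(samples) {
		fmt.Printf("%s/%s phase=%v\n", s.Namespace, s.Name, s.Value)
	}
}
```

Because Failed is 8, a failed canary still passes this filter; only canaries in the two deletion phases are dropped.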

flagger_canary_status collapses the 11 canary phases into 3 values
(0 running, 1 successful, 2 failed), so dashboards cannot tell
WaitingPromotion, Promoting, Finalising or Succeeded apart on a
Grafana state-timeline.

Add a new flagger_canary_phase gauge that exposes each phase as a
unique value (0=Initializing ... 10=Terminated) via a deterministic
phase-to-value map. SetStatus now also sets the new gauge, so every
existing call site is covered without touching the scheduler.
flagger_canary_status is left unchanged to avoid breaking existing
dashboards and alerts.

The Terminating (9) phase is recorded from the finalizer and the
Terminated (10) phase from the informer delete handler, so deleted
canaries keep emitting a filterable value (flagger_canary_phase < 9)
instead of leaving a stale series. This addresses the stale-metric
problem from fluxcd#1029 without deleting metrics, which was flagged as a
breaking change in fluxcd#1856.

Signed-off-by: Softer <sft.nik@gmail.com>

@aryan9600 (Member) left a comment

thanks for your contribution!

Comment thread pkg/metrics/recorder.go
// (which collapses all phases into running/successful/failed), this mapping
// keeps every phase distinct so they can be rendered on a Grafana state-timeline.
var canaryPhaseValues = map[flaggerv1.CanaryPhase]float64{
flaggerv1.CanaryPhaseInitializing: 0,

No canary will ever be reported as Initializing, because unfortunately setPhaseInitializing doesn't use SetStatus to update the status, so the new gauge is never set to 0.

ref:

func (c *Controller) setPhaseInitializing(cd *flaggerv1.Canary) error {
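
The gap raised in this comment can be illustrated with a small model. The body of setPhaseInitializing is not shown in this thread, so the sketch below only models the behaviour the comment describes: `phaseRecorder`, `canary`, `updateViaSetStatus`, and `updateDirectly` are hypothetical names, and the direct update stands in for any path that changes the phase without calling SetStatus.

```go
package main

import "fmt"

// phaseRecorder is a hypothetical stand-in for the metrics recorder.
type phaseRecorder struct {
	phase map[string]float64 // plays the role of flagger_canary_phase
}

// SetStatus is the only path in this sketch that touches the gauge.
func (r *phaseRecorder) SetStatus(key string, value float64) {
	r.phase[key] = value
}

// canary is a simplified stand-in for the Canary resource.
type canary struct {
	key   string
	phase string
}

// updateViaSetStatus models call sites that report through the recorder.
func updateViaSetStatus(r *phaseRecorder, c *canary, phase string, value float64) {
	c.phase = phase
	r.SetStatus(c.key, value)
}

// updateDirectly models the path described in the comment: the phase on the
// resource changes, but the recorder is never called.
func updateDirectly(c *canary, phase string) {
	c.phase = phase
}

func main() {
	r := &phaseRecorder{phase: map[string]float64{}}
	c := &canary{key: "test/podinfo"}

	updateDirectly(c, "Initializing")
	_, recorded := r.phase[c.key]
	fmt.Println(c.phase, "recorded:", recorded)

	updateViaSetStatus(r, c, "Initialized", 1)
	v, recorded := r.phase[c.key]
	fmt.Println(c.phase, "recorded:", recorded, "value:", v)
}
```

Under these assumptions the first gauge sample appears only once a call goes through SetStatus, so the value 0 in the map would never be emitted unless the initializing path also records it.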

@codecov-commenter

Codecov Report

❌ Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 30.04%. Comparing base (61582f7) to head (3e7cd4d).
⚠️ Report is 10 commits behind head on main.

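| Files with missing lines     | Patch % | Lines                      |
|------------------------------|---------|----------------------------|
| pkg/metrics/recorder.go      | 82.35%  | 2 Missing and 1 partial ⚠️ |
| pkg/controller/controller.go | 0.00%   | 1 Missing ⚠️               |
| pkg/controller/finalizer.go  | 0.00%   | 1 Missing ⚠️               |
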
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1927      +/-   ##
==========================================
+ Coverage   30.00%   30.04%   +0.04%     
==========================================
  Files         288      288              
  Lines       18455    18474      +19     
==========================================
+ Hits         5537     5551      +14     
- Misses      12189    12193       +4     
- Partials      729      730       +1     
