cert-checker: fix logging & push metrics#8763
Conversation
|
@lenaunderwood22, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values. |
Co-authored-by: Aaron Gable <aaron@letsencrypt.org>
Co-authored-by: Aaron Gable <aaron@letsencrypt.org>
There was a problem hiding this comment.
Pull request overview
This PR updates cert-checker to emit structured audit logs (instead of printing a large multi-line JSON blob to stdout) and to publish run metrics (good/bad cert counts, latency, last-run timestamp) to a Prometheus Pushgateway discovered via DNS SRV (intended to replace an external wrapper script per issue #8753).
Changes:
- Replace stdout JSON report output with per-certificate audit error logs plus a final audit summary.
- Add Prometheus metrics for cert-checker results and push them to a Pushgateway (incl. SRV discovery + shared
cmd.PushMetricshelper). - Update tests/config fixtures and vendor the Prometheus
pushsubpackage.
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
cmd/cert-checker/main.go |
Switch to audit logging, add metrics, add SRV-based Pushgateway discovery and push behavior. |
cmd/cert-checker/main_test.go |
Update tests to assert audit log emission rather than JSON report contents. |
cmd/shell.go |
Add PushMetrics helper using Prometheus push client. |
test/config/cert-checker.json |
Remove unexpiredOnly field from fixture config. |
test/config-next/cert-checker.json |
Remove unexpiredOnly field from fixture config. |
vendor/modules.txt |
Add vendored package path for prometheus/push. |
vendor/github.com/prometheus/client_golang/prometheus/push/push.go |
Vendor Prometheus Pushgateway client implementation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
aarongable
left a comment
There was a problem hiding this comment.
LGTM with some nits / test improvements.
| checkerBadCount prometheus.Gauge | ||
| } | ||
|
|
||
| func NewCertCheckerMetrics(stats prometheus.Registerer) *certCheckerMetrics { |
There was a problem hiding this comment.
This constructor is only called from inside this package, so it doesn't need to be Exported.
| func NewCertCheckerMetrics(stats prometheus.Registerer) *certCheckerMetrics { | |
| func newCertCheckerMetrics(stats prometheus.Registerer) *certCheckerMetrics { |
| checker.processCerts(context.TODO(), wg, config.CertChecker.BadResultsOnly) | ||
| checkerLatency.Observe(checker.clock.Since(s).Seconds()) | ||
| }() | ||
| checker.processCerts(context.TODO()) |
There was a problem hiding this comment.
| checker.processCerts(context.TODO()) | |
| checker.processCerts(context.Background()) |
| wg := new(sync.WaitGroup) | ||
| wg.Add(1) | ||
| checker.processCerts(context.Background(), wg, false) | ||
| checker.processCerts(context.Background()) |
There was a problem hiding this comment.
In unit tests, always use the context provided by the test harness:
| checker.processCerts(context.Background()) | |
| checker.processCerts(t.Context()) |
| } | ||
|
|
||
| func TestGetPushgatewayURL(t *testing.T) { | ||
| ctx := context.Background() |
There was a problem hiding this comment.
Same comment here. In fact, probably leave this line out entirely, and just use t.Context() in the three places below.
| t.Run("happy path", func(t *testing.T) { | ||
| gotURL, err := getPushgatewayURL(ctx, "consul.service.consul:53", | ||
| cmd.ServiceDomain{Service: "redisratelimits", Domain: "service.consul"}) | ||
| test.AssertNotError(t, err, "") |
There was a problem hiding this comment.
For a new test, we prefer to follow the Google Go style guide. That means using traditional error checking, and not using the assertion helper library.
| test.AssertNotError(t, err, "") | |
| if err != nil { | |
| t.Fatalf("getPushgatewayURL(consul.service.consul:53) = %s, but want success", err) | |
| } |
The same applies for the assertions below. Use t.Fatalf for places where test execution should stop, and t.Errorf for places where the test should fail but we should maybe also see the results of the next thing as well.
| t.Run("SRV not found", func(t *testing.T) { | ||
| _, err := getPushgatewayURL(ctx, "consul.service.consul:53", | ||
| cmd.ServiceDomain{Service: "doesnotexist", Domain: "service.consul"}) | ||
| test.AssertError(t, err, "") |
There was a problem hiding this comment.
In addition to the comment above about not using the test assertion helper, these tests should probably also ensure that they're getting the right error:
| test.AssertError(t, err, "") | |
| if err != nil { | |
| t.Fatalf("getPushgatewayURL(consul.service.consul:53) = %s, but want success", err) | |
| } | |
| if !strings.Contains(err.Error(), "SRV lookup of doesnotexist._tcp.service.consul failed") { | |
| t.Errorf("getPushgatewayURL(doesnotexist) = %q, but want 'SRV lookup failed'"), err) | |
| } |
or something like that
Fixes #8753