Commit 33fd544

feat: Enable StatefulSet and DaemonSet monitoring; use ServiceMonitor to labeldrop a few unused labels, which reduces cardinality
1 parent b870186 commit 33fd544

7 files changed

Lines changed: 290 additions & 18 deletions

CLAUDE.md

Lines changed: 233 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,233 @@
1+
# CLAUDE.md
2+
3+
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4+
5+
## Project Overview
6+
7+
Sentinel is a Kubernetes controller that tracks container images across workloads and exposes them as Prometheus metrics. It monitors Deployments, StatefulSets, and DaemonSets (CronJobs planned) and provides:
8+
9+
1. **Container image inventory** via `sentinel_container_image_info` metric
10+
2. **Image change tracking** via `sentinel_image_changes_total` counter
11+
3. **Dynamic label enrichment** - extract workload annotations/labels into Prometheus labels
12+
13+
## Build & Deploy Commands
14+
15+
```bash
16+
# Build
17+
make build # Build Go binary
18+
make docker # Build Docker image
19+
make deploy # Build + load to KIND + deploy to k8s (uses cluster "homelab")
20+
21+
# Run locally (requires kubeconfig)
22+
make run # Build + run with -v=2
23+
24+
# Test
25+
make test # Run all tests (currently no tests exist)
26+
27+
# Dependencies
28+
make deps # go mod tidy
29+
```
30+
31+
**Manual deployment:**
32+
```bash
33+
# Build and load into custom KIND cluster
34+
docker build -t sentinel:latest .
35+
kind load docker-image sentinel:latest --name <cluster-name>
36+
kubectl apply -f manifests/install/sentinel.yaml
37+
38+
# Deploy demo workloads
39+
kubectl apply -f manifests/develop/demo-app-1.yaml
40+
kubectl apply -f manifests/develop/demo-app-2.yaml
41+
42+
# Verify
43+
kubectl port-forward -n kube-system svc/sentinel-metrics 9090:9090
44+
curl -s localhost:9090/metrics | grep sentinel_
45+
```
46+
47+
## Architecture
48+
49+
### Control Flow
50+
51+
```
52+
main.go
53+
└─> cmd/sentinel/root.go (Cobra CLI)
54+
└─> cmd/sentinel/start.go (loads config via Viper)
55+
└─> pkg/sentinel/start.go
56+
├─> pkg/prometheus/sentinel_webserver.go (Init metrics + HTTP server)
57+
│ └─> pkg/prometheus/sentinel_exposed_metrics.go (BuildMetrics with dynamic labels)
58+
59+
├─> NamespaceWatcher() (watches namespaces with label selector)
60+
│ └─> sends []string of namespace names via channel
61+
62+
└─> AppDiscovery() (consumes namespace channel)
63+
└─> pkg/sentinel/app_discovery.go
64+
├─> Creates SharedInformerFactory per namespace
65+
├─> Watches Deployments, StatefulSets, DaemonSets
66+
└─> On events: handleWorkloadAdd/Update/Delete
67+
└─> setContainerMetric() (sets Prometheus metrics)
68+
```
69+
70+
### Key Concepts
71+
72+
**Namespace Watching:**
73+
- `NamespaceWatcher()` monitors namespaces matching `Config.NamespaceSelector` (default: `sentinel.io/controlled=enabled`)
74+
- Sends updated namespace list via channel whenever namespaces are labeled/unlabeled
75+
- `AppDiscovery()` consumes this channel and starts/stops informers per namespace
76+
77+
**Informer Lifecycle:**
78+
- Each watched namespace gets its own `SharedInformerFactory`
79+
- Informers watch Deployments, StatefulSets, and DaemonSets and trigger event handlers
80+
- When a namespace is unlabeled, its informers are stopped via `close(stopCh)`
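The start/stop bookkeeping described above can be sketched with the standard library alone. This is a minimal illustration, not the code in `app_discovery.go`: `watcherSet`, `reconcile`, and the `start` callback are hypothetical names, and the callback stands in for creating a `SharedInformerFactory` and starting its informers.

```go
package main

import (
	"fmt"
	"sort"
)

// watcherSet tracks one stop channel per watched namespace.
// Hypothetical stand-in for the per-namespace informer bookkeeping.
type watcherSet struct {
	stops map[string]chan struct{}
}

func newWatcherSet() *watcherSet {
	return &watcherSet{stops: make(map[string]chan struct{})}
}

// reconcile starts watchers for namespaces that are newly listed and stops
// watchers for namespaces that are no longer listed. It returns the names
// that were started and stopped, sorted for stable output.
func (w *watcherSet) reconcile(desired []string, start func(ns string, stopCh <-chan struct{})) (started, stopped []string) {
	want := make(map[string]bool, len(desired))
	for _, ns := range desired {
		want[ns] = true
		if _, running := w.stops[ns]; !running {
			stopCh := make(chan struct{})
			w.stops[ns] = stopCh
			start(ns, stopCh) // a real controller would start informers here
			started = append(started, ns)
		}
	}
	for ns, stopCh := range w.stops {
		if !want[ns] {
			close(stopCh) // signals that namespace's informers to exit
			delete(w.stops, ns)
			stopped = append(stopped, ns)
		}
	}
	sort.Strings(started)
	sort.Strings(stopped)
	return started, stopped
}

func main() {
	w := newWatcherSet()
	noop := func(ns string, stopCh <-chan struct{}) {}

	// First namespace list received from the channel.
	fmt.Println(w.reconcile([]string{"demo-app", "production"}, noop))
	// "demo-app" was unlabeled, so only "production" remains.
	fmt.Println(w.reconcile([]string{"production"}, noop))
}
```

Keeping the stop channel in a map keyed by namespace makes the unlabel case a lookup plus a `close`, and makes repeated namespace lists idempotent.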
81+
82+
**Dynamic Prometheus Labels:**
83+
- Metrics are built at startup via `BuildMetrics(extraLabels)`
84+
- Base labels (workload_namespace, workload_type, etc.) + dynamic labels from `Config.ExtraLabels`
85+
- The Prometheus client library requires all label names to be defined when a metric is registered (labels can't be added later)
86+
- **Label naming:** Uses `workload_namespace` instead of `namespace` to avoid collision with Prometheus ServiceMonitor auto-labels
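Because label names are fixed at registration, the full name list has to be assembled before any metric is created. The sketch below shows one way to do that. It is not the repository's `BuildMetrics`: `buildLabelNames` and `baseLabels` are hypothetical, the base label list is only the subset visible in this commit, and the validation uses the classic Prometheus label-name pattern.

```go
package main

import (
	"fmt"
	"regexp"
)

// ExtraLabel mirrors the shape of one extraLabels config entry.
type ExtraLabel struct {
	Type                string // "annotation" or "label"
	Key                 string // annotation or label key on the workload
	TimeseriesLabelName string // label name used on the metric
}

// baseLabels is an illustrative subset; the real list lives in the repository.
var baseLabels = []string{
	"workload_namespace", "workload_type", "workload_name",
	"container_name", "image_registry", "image_tag",
}

// validLabel is the classic Prometheus label-name pattern.
var validLabel = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// buildLabelNames returns base labels followed by the configured extra labels,
// rejecting names that are invalid or that collide with an existing name.
func buildLabelNames(extra []ExtraLabel) ([]string, error) {
	names := append([]string{}, baseLabels...)
	seen := make(map[string]bool, len(names))
	for _, n := range names {
		seen[n] = true
	}
	for _, e := range extra {
		n := e.TimeseriesLabelName
		if !validLabel.MatchString(n) {
			return nil, fmt.Errorf("invalid label name %q", n)
		}
		if seen[n] {
			return nil, fmt.Errorf("duplicate label name %q", n)
		}
		seen[n] = true
		names = append(names, n)
	}
	return names, nil
}

func main() {
	names, err := buildLabelNames([]ExtraLabel{
		{Type: "annotation", Key: "sentinel.io/owner", TimeseriesLabelName: "owner"},
		{Type: "label", Key: "environment", TimeseriesLabelName: "env"},
	})
	fmt.Println(names, err)
}
```

Failing at startup on a bad or duplicate name is cheaper than discovering it when the client library rejects the registration.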
87+
88+
**Image Change Detection:**
89+
- `handleWorkloadUpdate()` compares old vs new containers
90+
- If `container.Image` changed, increments `sentinel_image_changes_total{old_tag="...", new_tag="..."}`
91+
- Uses `parseImage()` helper to extract registry/repo/tag
92+
93+
## File Organization
94+
95+
```
96+
cmd/sentinel/ - CLI definition (Cobra)
97+
root.go - Root command
98+
start.go - "start" subcommand + Viper config loading
99+
100+
pkg/shared/ - Shared types
101+
sentinel_config.go - Config and ExtraLabel structs
102+
103+
pkg/prometheus/ - Metrics
104+
sentinel_exposed_metrics.go - Metric definitions + BuildMetrics()
105+
sentinel_webserver.go - HTTP server on :9090/metrics
106+
107+
pkg/sentinel/ - Controller logic
108+
start.go - Main controller entrypoint
109+
app_discovery.go - Per-namespace informers + event handlers
110+
helpers.go - Utilities (parseImage, extractExtraLabelValues, etc.)
111+
112+
manifests/
113+
install/ - Production deployment (ConfigMap, Deployment, RBAC)
114+
develop/ - Demo workloads for testing
115+
116+
dashboard/
117+
grafana.json - Pre-built Grafana dashboard
118+
```
119+
120+
## Configuration
121+
122+
Configuration is loaded via Viper with this precedence (highest to lowest):
123+
1. Environment variables (e.g., `METRICSPORT`, `VERBOSITY`)
124+
2. Config file at `/etc/sentinel/sentinel.yaml`
125+
3. Defaults in `cmd/sentinel/start.go`
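The precedence order can be illustrated without Viper. The sketch below only mimics the lookup order listed above. `resolve` is a hypothetical helper, and the upper-casing rule is an assumption inferred from the `METRICSPORT` example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolve returns the value for key and where it came from, checking the
// environment first, then the parsed config file, then built-in defaults.
// lookupEnv has the same shape as os.LookupEnv so it can be swapped in tests.
func resolve(key string, lookupEnv func(string) (string, bool), file, defaults map[string]string) (value, source string) {
	// Assumption: the env var is the upper-cased key (metricsPort -> METRICSPORT).
	if v, ok := lookupEnv(strings.ToUpper(key)); ok {
		return v, "env"
	}
	if v, ok := file[key]; ok {
		return v, "file"
	}
	return defaults[key], "default"
}

func main() {
	file := map[string]string{"metricsPort": "9090"}
	defaults := map[string]string{"metricsPort": "9090", "verbosity": "2"}

	fmt.Println(resolve("metricsPort", os.LookupEnv, file, defaults))
	fmt.Println(resolve("verbosity", os.LookupEnv, file, defaults))
}
```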
126+
127+
**Example config:**
128+
```yaml
129+
namespaceSelector:
130+
"sentinel.io/controlled": "enabled"
131+
metricsPort: "9090"
132+
verbosity: 2
133+
134+
extraLabels:
135+
- type: "annotation"
136+
key: "sentinel.io/owner"
137+
timeseriesLabelName: "owner"
138+
- type: "label"
139+
key: "environment"
140+
timeseriesLabelName: "env"
141+
```
142+
143+
## Prometheus Metrics Behavior
144+
145+
### Info Metrics (sentinel_container_image_info)
146+
147+
- Always has value `1` (info pattern)
148+
- When image tag changes: old time series stops being reported, new time series starts
149+
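- Prometheus normally marks the old series stale on the next scrape; queries may still return it for up to the lookback window (5 minutes by default) if no staleness marker was written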
150+
- **Empty labels:** If annotation/label doesn't exist on workload, metric label is `""`
151+
152+
### Counter Metrics (sentinel_image_changes_total)
153+
154+
- Increments on every image tag change
155+
- **Important:** Counter is created on-demand when first change detected
156+
- Prometheus sees the counter appear at value `1` (not `0`→`1`), so `increase()` does not count the first change of a series, whatever the window
157+
- Use `increase(sentinel_image_changes_total[1h])` or longer windows to catch later changes reliably
158+
159+
## Workload Type Support
160+
161+
**Currently Supported:** Deployments, StatefulSets, DaemonSets
162+
**Planned:** CronJobs
163+
164+
### Implementation Pattern for Workload Types
165+
166+
All workload handlers use polymorphism via `metav1.Object` interface:
167+
- `handleWorkloadAdd(resourceType string, namespace string, workload metav1.Object, ...)`
168+
- `handleWorkloadUpdate(resourceType string, namespace string, newWorkload metav1.Object, ...)`
169+
- `handleWorkloadDelete(resourceType string, namespace string, name string, ...)`
170+
171+
This allows a single set of handlers to work with Deployment/StatefulSet/DaemonSet/etc.
172+
173+
**Key:** Use `.GetName()`, `.GetAnnotations()`, `.GetLabels()` methods (not direct field access like `.Name`)
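A stdlib-only sketch of why the accessor methods matter: one function can read annotations and labels from any workload kind as long as it only relies on the interface. `workloadMeta` is a hypothetical local interface containing three of the `metav1.Object` methods, `fakeWorkload` is a test double, and `extraLabelValues` is an illustration rather than the repository's `extractExtraLabelValues`.

```go
package main

import "fmt"

// workloadMeta is the subset of metav1.Object accessors used here.
type workloadMeta interface {
	GetName() string
	GetAnnotations() map[string]string
	GetLabels() map[string]string
}

// fakeWorkload stands in for a Deployment, StatefulSet, or DaemonSet.
type fakeWorkload struct {
	name        string
	annotations map[string]string
	labels      map[string]string
}

func (f fakeWorkload) GetName() string                   { return f.name }
func (f fakeWorkload) GetAnnotations() map[string]string { return f.annotations }
func (f fakeWorkload) GetLabels() map[string]string      { return f.labels }

// ExtraLabel mirrors the shape of one extraLabels config entry.
type ExtraLabel struct {
	Type, Key, TimeseriesLabelName string
}

// extraLabelValues returns one value per configured extra label, in config
// order. A missing annotation or label yields the empty string.
func extraLabelValues(w workloadMeta, extra []ExtraLabel) []string {
	values := make([]string, 0, len(extra))
	for _, e := range extra {
		var src map[string]string
		switch e.Type {
		case "annotation":
			src = w.GetAnnotations()
		case "label":
			src = w.GetLabels()
		}
		values = append(values, src[e.Key]) // reading a nil map is safe in Go
	}
	return values
}

func main() {
	w := fakeWorkload{
		name:        "demo-app-1",
		annotations: map[string]string{"sentinel.io/owner": "Spiderman"},
	}
	extra := []ExtraLabel{
		{Type: "annotation", Key: "sentinel.io/owner", TimeseriesLabelName: "owner"},
		{Type: "label", Key: "environment", TimeseriesLabelName: "env"},
	}
	fmt.Printf("%s %q\n", w.GetName(), extraLabelValues(w, extra))
}
```

Returning values in config order keeps them aligned with the label names declared when the metric was registered.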
174+
175+
## Important Implementation Details
176+
177+
### Image Parsing (helpers.go:parseImage)
178+
- Handles full registry URLs (ghcr.io, quay.io, etc.)
179+
- Defaults to `docker.io` if no registry in image string
180+
- Detects registry vs namespace by looking for `.` or `:` in first path component
181+
- Default tag is `latest` if not specified
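The rules above translate into a short function. This is a sketch written from the bullet points, not a copy of `helpers.go`. The name `parseImageSketch` is hypothetical, and digest references (`name@sha256:...`) are deliberately left out because the description does not cover them.

```go
package main

import (
	"fmt"
	"strings"
)

// parseImageSketch splits an image reference into registry, repository, and tag.
// The first path component counts as a registry only if it contains "." or ":".
func parseImageSketch(image string) (registry, repository, tag string) {
	registry, tag = "docker.io", "latest" // defaults from the rules above
	rest := image

	if i := strings.Index(rest, "/"); i >= 0 {
		first := rest[:i]
		if strings.ContainsAny(first, ".:") {
			registry = first
			rest = rest[i+1:]
		}
	}

	// A colon after the last slash separates the tag from the repository.
	if i := strings.LastIndex(rest, ":"); i > strings.LastIndex(rest, "/") {
		tag = rest[i+1:]
		rest = rest[:i]
	}
	return registry, rest, tag
}

func main() {
	for _, img := range []string{
		"nginx",
		"nginx:1.25",
		"ghcr.io/org/app:v2",
		"localhost:5000/app",
	} {
		registry, repository, tag := parseImageSketch(img)
		fmt.Printf("%-22s registry=%s repository=%s tag=%s\n", img, registry, repository, tag)
	}
}
```

Checking the colon position against the last slash is what keeps a registry port such as `localhost:5000` from being mistaken for a tag.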
182+
183+
### Change Detection (app_discovery.go:handleWorkloadUpdate)
184+
- Only processes updates where `newGen > oldGen` (spec changes, not status changes)
185+
- Skips spurious updates where `ResourceVersion` unchanged
186+
- Compares old vs new container images by building a map of `containerName -> image`
187+
- **Limitation:** Adding or removing a container does not produce a change event; only image changes on existing containers are counted
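The comparison can be sketched as follows, assuming simplified types. `Container`, `ImageChange`, `specChanged`, and `diffImages` are hypothetical names. The real handler works on Kubernetes pod specs, but the map-based comparison has the same shape.

```go
package main

import "fmt"

// Container is a reduced stand-in for a pod spec container.
type Container struct {
	Name  string
	Image string
}

// ImageChange describes one container whose image differs between revisions.
type ImageChange struct {
	Container string
	OldImage  string
	NewImage  string
}

// specChanged reports whether an update is a spec change (generation bumped)
// rather than a status-only update.
func specChanged(oldGen, newGen int64) bool {
	return newGen > oldGen
}

// diffImages compares containers by name and returns the ones whose image
// changed. Containers that exist on only one side are ignored, which matches
// the limitation noted above.
func diffImages(oldContainers, newContainers []Container) []ImageChange {
	oldImages := make(map[string]string, len(oldContainers))
	for _, c := range oldContainers {
		oldImages[c.Name] = c.Image
	}

	var changes []ImageChange
	for _, c := range newContainers {
		prev, existed := oldImages[c.Name]
		if !existed || prev == c.Image {
			continue
		}
		changes = append(changes, ImageChange{Container: c.Name, OldImage: prev, NewImage: c.Image})
	}
	return changes
}

func main() {
	oldSpec := []Container{{"nginx", "nginx:1.24"}, {"sidecar", "envoy:v1"}}
	newSpec := []Container{{"nginx", "nginx:1.25"}, {"sidecar", "envoy:v1"}, {"extra", "busybox"}}

	if specChanged(3, 4) {
		for _, ch := range diffImages(oldSpec, newSpec) {
			fmt.Printf("%s: %s -> %s\n", ch.Container, ch.OldImage, ch.NewImage)
		}
	}
}
```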
188+
189+
### Metric Deletion
190+
- Currently NOT implemented (see TODO at app_discovery.go:~200)
191+
- Series for deleted workloads presumably keep being exported until Sentinel restarts, since a `GaugeVec` retains every label set it was given
192+
- To implement: would need to track active metrics and call `.Delete()` on GaugeVec
193+
194+
## Development Workflow
195+
196+
1. **Make code changes**
197+
2. **Build:** `make build` (or `go build -o sentinel`)
198+
3. **Test locally:** `make run` (requires kubeconfig pointing to cluster)
199+
4. **Deploy to KIND:** `make deploy` (builds Docker + loads to cluster "homelab")
200+
5. **Check logs:** `kubectl logs -n kube-system -l app=sentinel-controller -f`
201+
6. **Verify metrics:** Port-forward and curl `/metrics`
202+
203+
## Prometheus ServiceMonitor
204+
205+
`manifests/install/servicemonitor.yaml` configures Prometheus Operator scraping with `metricRelabelings`:
206+
207+
```yaml
208+
metricRelabelings:
209+
- action: labeldrop
210+
regex: pod # Drop Prometheus auto-labels
211+
- action: labeldrop
212+
regex: endpoint
213+
- action: labeldrop
214+
regex: instance
215+
- action: labeldrop
216+
regex: service
217+
- action: labeldrop
218+
regex: namespace # ServiceMonitor adds namespace="kube-system"
219+
```
220+
221+
**Why:** Targets scraped through a ServiceMonitor get target labels (`pod`, `endpoint`, `instance`, `service`, `namespace`) attached to every sample. We drop them because they describe the Sentinel pod itself rather than the tracked workloads, so they add cardinality without adding meaning for Sentinel's use case.
222+
223+
**Note:** Changes to `metricRelabelings` only affect NEW samples. Old time series with previous labels persist in Prometheus TSDB until retention expires.
224+
225+
## Grafana Dashboard
226+
227+
Pre-built dashboard at `dashboard/grafana.json` includes:
228+
- Overview stats (tracked containers, workloads, changes, `:latest` usage)
229+
- Image inventory table with color-coded tags
230+
- Registry distribution pie chart
231+
- Image changes log (table format works better than graphs for counter metrics)
232+
233+
Import into Grafana via UI → Dashboards → Import → Upload JSON file.

README.md

Lines changed: 5 additions & 5 deletions
@@ -89,7 +89,7 @@ Info metric (Gauge, always `1`) providing a full inventory of container images r
8989

9090
| Label | Description | Example |
9191
|-------|-------------|---------|
92-
| `namespace` | Kubernetes namespace | `production` |
92+
| `workload_namespace` | Kubernetes namespace | `production` |
9393
| `workload_type` | Kind of workload | `Deployment` |
9494
| `workload_name` | Name of the workload | `api-server` |
9595
| `container_name` | Container within the workload | `nginx` |
@@ -103,7 +103,7 @@ Info metric (Gauge, always `1`) providing a full inventory of container images r
103103

104104
```prometheus
105105
sentinel_container_image_info{
106-
namespace="production",
106+
workload_namespace="production",
107107
workload_type="Deployment",
108108
workload_name="api-server",
109109
container_name="nginx",
@@ -125,7 +125,7 @@ Counter that increments every time a container's image tag changes, providing an
125125

126126
| Label | Description | Example |
127127
|-------|-------------|---------|
128-
| `namespace` | Kubernetes namespace | `production` |
128+
| `workload_namespace` | Kubernetes namespace | `production` |
129129
| `workload_type` | Kind of workload | `Deployment` |
130130
| `workload_name` | Name of the workload | `api-server` |
131131
| `container_name` | Container within the workload | `nginx` |
@@ -136,7 +136,7 @@ Counter that increments every time a container's image tag changes, providing an
136136

137137
```prometheus
138138
sentinel_image_changes_total{
139-
namespace="production",
139+
workload_namespace="production",
140140
workload_type="Deployment",
141141
workload_name="api-server",
142142
container_name="nginx",
@@ -152,7 +152,7 @@ sentinel_image_changes_total{
152152
sum(increase(sentinel_image_changes_total[24h]))
153153
154154
# Alert: too many image changes in production
155-
sentinel_image_changes_total{namespace="production"} > 5
155+
sentinel_image_changes_total{workload_namespace="production"} > 5
156156
157157
# Find containers still using :latest
158158
sentinel_container_image_info{image_tag="latest"}

dashboard/grafana.json

Lines changed: 33 additions & 7 deletions
@@ -19,7 +19,7 @@
1919
"editable": true,
2020
"fiscalYearStartMonth": 0,
2121
"graphTooltip": 1,
22-
"id": 32,
22+
"id": 0,
2323
"links": [],
2424
"panels": [
2525
{
@@ -446,7 +446,7 @@
446446
"targets": [
447447
{
448448
"editorMode": "code",
449-
"expr": "sentinel_container_image_info",
449+
"expr": "sentinel_container_image_info{workload_namespace=~\"$namespace\"}",
450450
"format": "table",
451451
"instant": true,
452452
"legendFormat": "__auto",
@@ -552,7 +552,7 @@
552552
"targets": [
553553
{
554554
"editorMode": "code",
555-
"expr": "count by (image_registry) (sentinel_container_image_info)",
555+
"expr": "count by (image_registry) (sentinel_container_image_info{workload_namespace=~\"$namespace\"})",
556556
"format": "time_series",
557557
"instant": true,
558558
"legendFormat": "{{image_registry}}",
@@ -706,7 +706,7 @@
706706
"uid": "prometheus"
707707
},
708708
"editorMode": "code",
709-
"expr": "sentinel_image_changes_total",
709+
"expr": "sentinel_image_changes_total{workload_namespace=~\"$namespace\"}",
710710
"format": "table",
711711
"instant": true,
712712
"legendFormat": "__auto",
@@ -761,15 +761,41 @@
761761
"containers"
762762
],
763763
"templating": {
764-
"list": []
764+
"list": [
765+
{
766+
"allowCustomValue": false,
767+
"current": {
768+
"text": [
769+
"All"
770+
],
771+
"value": [
772+
"$__all"
773+
]
774+
},
775+
"definition": "label_values(sentinel_container_image_info,workload_namespace)",
776+
"includeAll": true,
777+
"label": "namespace",
778+
"multi": true,
779+
"name": "namespace",
780+
"options": [],
781+
"query": {
782+
"qryType": 1,
783+
"query": "label_values(sentinel_container_image_info,workload_namespace)",
784+
"refId": "PrometheusVariableQueryEditor-VariableQuery"
785+
},
786+
"refresh": 1,
787+
"regex": "",
788+
"type": "query"
789+
}
790+
]
765791
},
766792
"time": {
767-
"from": "now-6h",
793+
"from": "now-15m",
768794
"to": "now"
769795
},
770796
"timepicker": {},
771797
"timezone": "browser",
772798
"title": "Sentinel - Container Image Tracking",
773799
"uid": "addk6s7",
774-
"version": 4
800+
"version": 7
775801
}

manifests/develop/demo-app-1.yaml

Lines changed: 2 additions & 0 deletions
@@ -13,6 +13,8 @@ metadata:
1313
namespace: demo-app
1414
labels:
1515
app: demo-app-1
16+
annotations:
17+
"sentinel.io/owner": "Spiderman"
1618
spec:
1719
replicas: 1
1820
selector:

manifests/install/sentinel.yaml

Lines changed: 2 additions & 2 deletions
@@ -87,8 +87,8 @@ rules:
8787
- apiGroups: [""]
8888
resources: ["namespaces"]
8989
verbs: ["get", "watch", "list"]
90-
- apiGroups: ["apps"]
91-
resources: ["deployments"]
90+
- apiGroups: ["apps"]
91+
resources: ["deployments", "statefulsets", "daemonsets"]
9292
verbs: ["get", "list", "watch"]
9393

9494
---

manifests/install/servicemonitor.yaml

Lines changed: 11 additions & 0 deletions
@@ -17,3 +17,14 @@ spec:
1717
targetPort: 9090
1818
interval: 30s
1919
path: /metrics
20+
metricRelabelings:
21+
- action: labeldrop
22+
regex: pod
23+
- action: labeldrop
24+
regex: endpoint
25+
- action: labeldrop
26+
regex: instance
27+
- action: labeldrop
28+
regex: service
29+
- action: labeldrop
30+
regex: namespace
