Commit db1de25

Merge pull request #3 from MatteoMori/statefulsets
Support for Statefulsets and Daemonsets
2 parents c271836 + 33fd544 commit db1de25

7 files changed

Lines changed: 314 additions & 20 deletions

CLAUDE.md

Lines changed: 233 additions & 0 deletions
@@ -0,0 +1,233 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Sentinel is a Kubernetes controller that tracks container images across workloads and exposes them as Prometheus metrics. It monitors Deployments, StatefulSets, and DaemonSets (CronJobs planned) and provides:

1. **Container image inventory** via `sentinel_container_image_info` metric
2. **Image change tracking** via `sentinel_image_changes_total` counter
3. **Dynamic label enrichment** - extract workload annotations/labels into Prometheus labels

## Build & Deploy Commands

```bash
# Build
make build    # Build Go binary
make docker   # Build Docker image
make deploy   # Build + load to KIND + deploy to k8s (uses cluster "homelab")

# Run locally (requires kubeconfig)
make run      # Build + run with -v=2

# Test
make test     # Run all tests (currently no tests exist)

# Dependencies
make deps     # go mod tidy
```

**Manual deployment:**
```bash
# Build and load into custom KIND cluster
docker build -t sentinel:latest .
kind load docker-image sentinel:latest --name <cluster-name>
kubectl apply -f manifests/install/sentinel.yaml

# Deploy demo workloads
kubectl apply -f manifests/develop/demo-app-1.yaml
kubectl apply -f manifests/develop/demo-app-2.yaml

# Verify
kubectl port-forward -n kube-system svc/sentinel-metrics 9090:9090
curl -s localhost:9090/metrics | grep sentinel_
```

## Architecture

### Control Flow

```
main.go
  └─> cmd/sentinel/root.go (Cobra CLI)
       └─> cmd/sentinel/start.go (loads config via Viper)
            └─> pkg/sentinel/start.go
                 ├─> pkg/prometheus/sentinel_webserver.go (Init metrics + HTTP server)
                 │    └─> pkg/prometheus/sentinel_exposed_metrics.go (BuildMetrics with dynamic labels)

                 ├─> NamespaceWatcher() (watches namespaces with label selector)
                 │    └─> sends []string of namespace names via channel

                 └─> AppDiscovery() (consumes namespace channel)
                      └─> pkg/sentinel/app_discovery.go
                           ├─> Creates SharedInformerFactory per namespace
                           ├─> Watches Deployments, StatefulSets, DaemonSets
                           └─> On events: handleWorkloadAdd/Update/Delete
                                └─> setContainerMetric() (sets Prometheus metrics)
```

### Key Concepts

**Namespace Watching:**
- `NamespaceWatcher()` monitors namespaces matching `Config.NamespaceSelector` (default: `sentinel.io/controlled=enabled`)
- Sends the updated namespace list via a channel whenever namespaces are labeled/unlabeled
- `AppDiscovery()` consumes this channel and starts/stops informers per namespace

**Informer Lifecycle:**
- Each watched namespace gets its own `SharedInformerFactory`
- Informers watch Deployments, StatefulSets, and DaemonSets and trigger event handlers
- When a namespace is unlabeled, its informers are stopped via `close(stopCh)`
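
The start/stop behaviour described above can be sketched with plain channels. This is a minimal illustration, not Sentinel's code: `informerSet` and `reconcile` are hypothetical names, and a bare stop channel stands in for the per-namespace `SharedInformerFactory`.

```go
package main

import (
	"fmt"
	"sort"
)

// informerSet tracks one stop channel per watched namespace.
// Hypothetical type: in the controller each entry would own a
// SharedInformerFactory; here the channel alone stands in for it.
type informerSet struct {
	stops map[string]chan struct{}
}

func newInformerSet() *informerSet {
	return &informerSet{stops: make(map[string]chan struct{})}
}

// reconcile starts informers for newly labeled namespaces and stops the
// ones whose namespace is no longer in the desired list.
func (s *informerSet) reconcile(desired []string) (started, stopped []string) {
	want := make(map[string]bool, len(desired))
	for _, ns := range desired {
		want[ns] = true
		if _, ok := s.stops[ns]; !ok {
			s.stops[ns] = make(chan struct{}) // would be handed to the factory as stopCh
			started = append(started, ns)
		}
	}
	for ns, stopCh := range s.stops {
		if !want[ns] {
			close(stopCh) // signals that namespace's informers to exit
			delete(s.stops, ns)
			stopped = append(stopped, ns)
		}
	}
	sort.Strings(started)
	sort.Strings(stopped)
	return started, stopped
}

func main() {
	set := newInformerSet()
	started, stopped := set.reconcile([]string{"prod", "staging"})
	fmt.Println("started:", started, "stopped:", stopped)
	started, stopped = set.reconcile([]string{"prod"})
	fmt.Println("started:", started, "stopped:", stopped)
}
```

Each namespace list received on the channel would be passed to `reconcile`, so labeling and unlabeling are handled by the same code path.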

**Dynamic Prometheus Labels:**
- Metrics are built at startup via `BuildMetrics(extraLabels)`
- Base labels (workload_namespace, workload_type, etc.) + dynamic labels from `Config.ExtraLabels`
- The Prometheus client requires all label names to be defined at registration time (labels can't be added later)
- **Label naming:** Uses `workload_namespace` instead of `namespace` to avoid collision with Prometheus ServiceMonitor auto-labels
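
As a rough sketch of how a fixed label set can be derived from the config, the snippet below builds the label names once and then resolves one value per extra label. The `ExtraLabel` fields mirror the `extraLabels` config keys; `labelNames`, `extraValues`, and the exact contents of `baseLabels` are assumptions for illustration, not the functions in `pkg/prometheus`.

```go
package main

import "fmt"

// ExtraLabel mirrors the shape of one extraLabels config entry.
type ExtraLabel struct {
	Type                string // "annotation" or "label"
	Key                 string
	TimeseriesLabelName string
}

// baseLabels is an illustrative subset, not the authoritative list.
var baseLabels = []string{"workload_namespace", "workload_type", "workload_name", "container_name"}

// labelNames returns the full label set to register the metric with.
func labelNames(extra []ExtraLabel) []string {
	names := append([]string{}, baseLabels...)
	for _, e := range extra {
		names = append(names, e.TimeseriesLabelName)
	}
	return names
}

// extraValues resolves one value per extra label, in the same order as
// labelNames, yielding "" when the annotation or label is missing.
func extraValues(extra []ExtraLabel, annotations, labels map[string]string) []string {
	values := make([]string, 0, len(extra))
	for _, e := range extra {
		src := labels
		if e.Type == "annotation" {
			src = annotations
		}
		values = append(values, src[e.Key]) // a missing key reads as ""
	}
	return values
}

func main() {
	extra := []ExtraLabel{
		{Type: "annotation", Key: "sentinel.io/owner", TimeseriesLabelName: "owner"},
		{Type: "label", Key: "environment", TimeseriesLabelName: "env"},
	}
	fmt.Println(labelNames(extra))
	fmt.Println(extraValues(extra, map[string]string{"sentinel.io/owner": "platform"}, nil))
}
```

Keeping names and values in the same order matters because vector metrics match values to label names by position.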

**Image Change Detection:**
- `handleWorkloadUpdate()` compares old vs new containers
- If `container.Image` changed, increments `sentinel_image_changes_total{old_tag="...", new_tag="..."}`
- Uses `parseImage()` helper to extract registry/repo/tag

## File Organization

```
cmd/sentinel/                   - CLI definition (Cobra)
  root.go                       - Root command
  start.go                      - "start" subcommand + Viper config loading

pkg/shared/                     - Shared types
  sentinel_config.go            - Config and ExtraLabel structs

pkg/prometheus/                 - Metrics
  sentinel_exposed_metrics.go   - Metric definitions + BuildMetrics()
  sentinel_webserver.go         - HTTP server on :9090/metrics

pkg/sentinel/                   - Controller logic
  start.go                      - Main controller entrypoint
  app_discovery.go              - Per-namespace informers + event handlers
  helpers.go                    - Utilities (parseImage, extractExtraLabelValues, etc.)

manifests/
  install/                      - Production deployment (ConfigMap, Deployment, RBAC)
  develop/                      - Demo workloads for testing

dashboard/
  grafana.json                  - Pre-built Grafana dashboard
```

## Configuration

Configuration is loaded via Viper with this precedence (highest to lowest):
1. Environment variables (e.g., `METRICSPORT`, `VERBOSITY`)
2. Config file at `/etc/sentinel/sentinel.yaml`
3. Defaults in `cmd/sentinel/start.go`

**Example config:**
```yaml
namespaceSelector:
  "sentinel.io/controlled": "enabled"
metricsPort: "9090"
verbosity: 2

extraLabels:
  - type: "annotation"
    key: "sentinel.io/owner"
    timeseriesLabelName: "owner"
  - type: "label"
    key: "environment"
    timeseriesLabelName: "env"
```

## Prometheus Metrics Behavior

### Info Metrics (sentinel_container_image_info)

- Always has value `1` (info pattern)
- When the image tag changes: the old time series stops being reported and a new time series starts
- Prometheus may keep returning the old series briefly (5-15min) before it expires
- **Empty labels:** If an annotation/label doesn't exist on the workload, the metric label is `""`

### Counter Metrics (sentinel_image_changes_total)

- Increments on every image tag change
- **Important:** The counter is created on demand when the first change is detected
- Prometheus sees the counter appear at value `1` (not `0`→`1`), so `increase()` misses that first increment and may return `0` over short windows
- Use `increase(sentinel_image_changes_total[1h])` or longer windows for reliable detection

## Workload Type Support

**Currently Supported:** Deployments, StatefulSets, DaemonSets
**Planned:** CronJobs

### Implementation Pattern for Workload Types

All workload handlers use polymorphism via the `metav1.Object` interface:
- `handleWorkloadAdd(resourceType string, namespace string, workload metav1.Object, ...)`
- `handleWorkloadUpdate(resourceType string, namespace string, newWorkload metav1.Object, ...)`
- `handleWorkloadDelete(resourceType string, namespace string, name string, ...)`

This allows a single set of handlers to work with Deployment/StatefulSet/DaemonSet/etc.

**Key:** Use `.GetName()`, `.GetAnnotations()`, `.GetLabels()` methods (not direct field access like `.Name`)
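
To show why the getter methods matter, here is a small self-contained sketch. `workloadObject` is a local stand-in that copies three method signatures from `metav1.Object` (the real interface has many more), and `fakeWorkload` and `describe` are hypothetical names used only for this example.

```go
package main

import "fmt"

// workloadObject is a local stand-in for the subset of metav1.Object
// that the handlers rely on.
type workloadObject interface {
	GetName() string
	GetAnnotations() map[string]string
	GetLabels() map[string]string
}

// fakeWorkload plays the role of a Deployment, StatefulSet, or DaemonSet.
type fakeWorkload struct {
	name        string
	annotations map[string]string
	labels      map[string]string
}

func (w fakeWorkload) GetName() string                   { return w.name }
func (w fakeWorkload) GetAnnotations() map[string]string { return w.annotations }
func (w fakeWorkload) GetLabels() map[string]string      { return w.labels }

// describe works for any workload kind because it only calls the getters
// and never touches kind-specific struct fields.
func describe(resourceType, namespace string, w workloadObject) string {
	return fmt.Sprintf("%s %s/%s owner=%q", resourceType, namespace,
		w.GetName(), w.GetAnnotations()["sentinel.io/owner"])
}

func main() {
	db := fakeWorkload{name: "db", annotations: map[string]string{"sentinel.io/owner": "data"}}
	fmt.Println(describe("StatefulSet", "prod", db))
	fmt.Println(describe("DaemonSet", "prod", fakeWorkload{name: "node-agent"}))
}
```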

## Important Implementation Details

### Image Parsing (helpers.go:parseImage)
- Handles full registry URLs (ghcr.io, quay.io, etc.)
- Defaults to `docker.io` if no registry in image string
- Detects registry vs namespace by looking for `.` or `:` in first path component
- Default tag is `latest` if not specified
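
The parsing rules listed above can be sketched roughly as follows. This is an independent illustration written from those four rules, not the contents of `helpers.go`; `parseImageSketch` and `imageRef` are hypothetical names, and digest references (`@sha256:...`) are deliberately not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// imageRef holds the three parts extracted from an image string.
type imageRef struct {
	Registry, Repository, Tag string
}

// parseImageSketch applies the rules above: default registry docker.io,
// default tag latest, and a first path component containing "." or ":"
// is treated as the registry host.
func parseImageSketch(image string) imageRef {
	ref := imageRef{Registry: "docker.io", Tag: "latest"}
	rest := image

	if i := strings.Index(rest, "/"); i >= 0 {
		first := rest[:i]
		if strings.ContainsAny(first, ".:") {
			ref.Registry = first
			rest = rest[i+1:]
		}
	}

	// A ":" counts as the tag separator only if it comes after the last "/".
	if i := strings.LastIndex(rest, ":"); i > strings.LastIndex(rest, "/") {
		ref.Tag = rest[i+1:]
		rest = rest[:i]
	}
	ref.Repository = rest
	return ref
}

func main() {
	for _, img := range []string{"nginx", "nginx:1.25", "ghcr.io/org/app:v1", "localhost:5000/app"} {
		fmt.Printf("%-22s %+v\n", img, parseImageSketch(img))
	}
}
```

Checking for `:` in the first component is what lets `localhost:5000/app` be read as a registry with a port rather than a repository with a tag.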

### Change Detection (app_discovery.go:handleWorkloadUpdate)
- Only processes updates where `newGen > oldGen` (spec changes, not status changes)
- Skips spurious updates where `ResourceVersion` unchanged
- Compares old vs new container images by building a map of `containerName -> image`
- **Limitation:** If container is added/removed, no change event (only updates to existing containers)
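
A minimal sketch of the comparison described above, using plain structs instead of the Kubernetes container type. `container`, `imageChange`, `shouldProcess`, and `diffImages` are hypothetical names; the sketch reproduces the stated limitation that added or removed containers produce no change.

```go
package main

import "fmt"

// container is a stand-in for the two fields of a Kubernetes container
// spec that the comparison needs.
type container struct {
	Name, Image string
}

// imageChange describes one container whose image differs between specs.
type imageChange struct {
	Container, OldImage, NewImage string
}

// shouldProcess mirrors the generation gate: only spec changes bump the
// generation, so status-only updates are skipped.
func shouldProcess(oldGen, newGen int64) bool { return newGen > oldGen }

// diffImages builds a containerName -> image map from the old spec and
// reports containers present in both specs whose image changed.
func diffImages(oldC, newC []container) []imageChange {
	oldByName := make(map[string]string, len(oldC))
	for _, c := range oldC {
		oldByName[c.Name] = c.Image
	}
	var changes []imageChange
	for _, c := range newC {
		prev, existed := oldByName[c.Name]
		if existed && prev != c.Image {
			changes = append(changes, imageChange{Container: c.Name, OldImage: prev, NewImage: c.Image})
		}
	}
	return changes
}

func main() {
	oldC := []container{{"app", "nginx:1.25"}, {"sidecar", "envoy:v1"}}
	newC := []container{{"app", "nginx:1.26"}, {"logger", "fluentd:v2"}}
	if shouldProcess(3, 4) {
		fmt.Printf("%+v\n", diffImages(oldC, newC))
	}
}
```

In this example only `app` is reported: `sidecar` was removed and `logger` was added, so neither counts as an image change.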

### Metric Deletion
- Currently NOT implemented (see TODO at app_discovery.go:~200)
- Deleted workloads leave stale series behind: without a `.Delete()` call the GaugeVec keeps exporting them, so they remain in Prometheus (likely until the controller restarts)
- To implement: would need to track active metrics and call `.Delete()` on GaugeVec
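
One possible shape for the tracking mentioned in the last bullet is sketched below. It is only an idea under stated assumptions: `seriesTracker` is a hypothetical type, the workload key format is made up, and the Prometheus client is left out so the example runs with the standard library alone.

```go
package main

import "fmt"

// seriesTracker remembers which label-value tuples were set for each
// workload, so they can be removed when the workload is deleted.
type seriesTracker struct {
	byWorkload map[string][][]string
}

func newSeriesTracker() *seriesTracker {
	return &seriesTracker{byWorkload: make(map[string][][]string)}
}

// record notes that a series with these label values was set.
func (t *seriesTracker) record(workloadKey string, labelValues []string) {
	t.byWorkload[workloadKey] = append(t.byWorkload[workloadKey], labelValues)
}

// forget returns every tuple recorded for the workload and drops the
// entry; the caller would then delete each tuple from the GaugeVec.
func (t *seriesTracker) forget(workloadKey string) [][]string {
	series := t.byWorkload[workloadKey]
	delete(t.byWorkload, workloadKey)
	return series
}

func main() {
	tr := newSeriesTracker()
	tr.record("prod/Deployment/api", []string{"prod", "Deployment", "api", "nginx"})
	tr.record("prod/Deployment/api", []string{"prod", "Deployment", "api", "envoy"})
	for _, lv := range tr.forget("prod/Deployment/api") {
		fmt.Println("would delete series with label values:", lv)
	}
}
```

In the delete handler, each returned tuple could be passed to the vector's `DeleteLabelValues` method, which takes the values in the same order as the registered label names.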

## Development Workflow

1. **Make code changes**
2. **Build:** `make build` (or `go build -o sentinel`)
3. **Test locally:** `make run` (requires kubeconfig pointing to cluster)
4. **Deploy to KIND:** `make deploy` (builds Docker + loads to cluster "homelab")
5. **Check logs:** `kubectl logs -n kube-system -l app=sentinel-controller -f`
6. **Verify metrics:** Port-forward and curl `/metrics`

## Prometheus ServiceMonitor

`manifests/install/servicemonitor.yaml` configures Prometheus Operator scraping with `metricRelabelings`:

```yaml
metricRelabelings:
  - action: labeldrop
    regex: pod        # Drop Prometheus auto-labels
  - action: labeldrop
    regex: endpoint
  - action: labeldrop
    regex: instance
  - action: labeldrop
    regex: service
  - action: labeldrop
    regex: namespace  # ServiceMonitor adds namespace="kube-system"
```

**Why:** Prometheus ServiceMonitor automatically adds labels (`pod`, `endpoint`, `instance`, `service`, `namespace`) when scraping. We drop these to keep metrics clean since they're not meaningful for Sentinel's use case.

**Note:** Changes to `metricRelabelings` only affect NEW samples. Old time series with previous labels persist in Prometheus TSDB until retention expires.

## Grafana Dashboard

Pre-built dashboard at `dashboard/grafana.json` includes:
- Overview stats (tracked containers, workloads, changes, `:latest` usage)
- Image inventory table with color-coded tags
- Registry distribution pie chart
- Image changes log (table format works better than graphs for counter metrics)

Import into Grafana via UI → Dashboards → Import → Upload JSON file.

README.md

Lines changed: 29 additions & 7 deletions
@@ -10,6 +10,28 @@ Sentinel monitors Kubernetes workloads across labeled namespaces and tracks whic
 
 <br>
 
+- [Sentinel](#sentinel)
+- [Why Sentinel?](#why-sentinel)
+- [Quick Start](#quick-start)
+- [Prerequisites](#prerequisites)
+- [Installation](#installation)
+- [Metrics Exposed](#metrics-exposed)
+- [`sentinel_container_image_info`](#sentinel_container_image_info)
+- [`sentinel_image_changes_total`](#sentinel_image_changes_total)
+- [Dynamic Label Enrichment](#dynamic-label-enrichment)
+- [⚙️ Configuration](#️-configuration)
+- [1. Config file (`/etc/sentinel/sentinel.yaml`)](#1-config-file-etcsentinelsentinelyaml)
+- [2. Environment variables](#2-environment-variables)
+- [3. CLI flags](#3-cli-flags)
+- [Configuration Reference](#configuration-reference)
+- [📊 Grafana Dashboard](#-grafana-dashboard)
+- [Local Development](#local-development)
+- [Build and Run Locally](#build-and-run-locally)
+- [Test with KIND](#test-with-kind)
+- [🌟 Project Status](#-project-status)
+
+<br>
+
 ## Why Sentinel?
 
 Gain real-time visibility into your cluster's container image landscape. Perfect for:
@@ -67,7 +89,7 @@ Info metric (Gauge, always `1`) providing a full inventory of container images r
 
 | Label | Description | Example |
 |-------|-------------|---------|
-| `namespace` | Kubernetes namespace | `production` |
+| `workload_namespace` | Kubernetes namespace | `production` |
 | `workload_type` | Kind of workload | `Deployment` |
 | `workload_name` | Name of the workload | `api-server` |
 | `container_name` | Container within the workload | `nginx` |
@@ -81,7 +103,7 @@ Info metric (Gauge, always `1`) providing a full inventory of container images r
 
 ```prometheus
 sentinel_container_image_info{
-namespace="production",
+workload_namespace="production",
 workload_type="Deployment",
 workload_name="api-server",
 container_name="nginx",
@@ -103,7 +125,7 @@ Counter that increments every time a container's image tag changes, providing an
 
 | Label | Description | Example |
 |-------|-------------|---------|
-| `namespace` | Kubernetes namespace | `production` |
+| `workload_namespace` | Kubernetes namespace | `production` |
 | `workload_type` | Kind of workload | `Deployment` |
 | `workload_name` | Name of the workload | `api-server` |
 | `container_name` | Container within the workload | `nginx` |
@@ -114,7 +136,7 @@ Counter that increments every time a container's image tag changes, providing an
 
 ```prometheus
 sentinel_image_changes_total{
-namespace="production",
+workload_namespace="production",
 workload_type="Deployment",
 workload_name="api-server",
 container_name="nginx",
@@ -130,7 +152,7 @@ sentinel_image_changes_total{
 sum(increase(sentinel_image_changes_total[24h]))
 
 # Alert: too many image changes in production
-sentinel_image_changes_total{namespace="production"} > 5
+sentinel_image_changes_total{workload_namespace="production"} > 5
 
 # Find containers still using :latest
 sentinel_container_image_info{image_tag="latest"}
@@ -218,9 +240,9 @@ A pre-built Grafana dashboard is included in [`dashboard/grafana.json`](dashboar
 - **Registry distribution** – Donut chart showing image count by registry
 - **Change tracking log** – Table of all detected image changes with old → new tags
 
+![example](./images/dashboard.png)
 
-
----
+<br>
 
 ## Local Development
 