# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Sentinel is a Kubernetes controller that tracks container images across workloads and exposes them as Prometheus metrics. It monitors Deployments, StatefulSets, and DaemonSets (CronJobs planned) and provides:

1. **Container image inventory** via `sentinel_container_image_info` metric
2. **Image change tracking** via `sentinel_image_changes_total` counter
3. **Dynamic label enrichment**: extract workload annotations/labels into Prometheus labels

## Build & Deploy Commands

```bash
# Build
make build    # Build Go binary
make docker   # Build Docker image
make deploy   # Build + load to KIND + deploy to k8s (uses cluster "homelab")

# Run locally (requires kubeconfig)
make run      # Build + run with -v=2

# Test
make test     # Run all tests (currently no tests exist)

# Dependencies
make deps     # go mod tidy
```

**Manual deployment:**
```bash
# Build and load into a custom KIND cluster
docker build -t sentinel:latest .
kind load docker-image sentinel:latest --name <cluster-name>
kubectl apply -f manifests/install/sentinel.yaml

# Deploy demo workloads
kubectl apply -f manifests/develop/demo-app-1.yaml
kubectl apply -f manifests/develop/demo-app-2.yaml

# Verify (port-forward blocks; run the curl from a second terminal)
kubectl port-forward -n kube-system svc/sentinel-metrics 9090:9090
curl -s localhost:9090/metrics | grep sentinel_
```

## Architecture

### Control Flow

```
main.go
  └─> cmd/sentinel/root.go (Cobra CLI)
        └─> cmd/sentinel/start.go (loads config via Viper)
              └─> pkg/sentinel/start.go
                    ├─> pkg/prometheus/sentinel_webserver.go (Init metrics + HTTP server)
                    │     └─> pkg/prometheus/sentinel_exposed_metrics.go (BuildMetrics with dynamic labels)
                    │
                    ├─> NamespaceWatcher() (watches namespaces with label selector)
                    │     └─> sends []string of namespace names via channel
                    │
                    └─> AppDiscovery() (consumes namespace channel)
                          └─> pkg/sentinel/app_discovery.go
                                ├─> Creates SharedInformerFactory per namespace
                                ├─> Watches Deployments, StatefulSets, DaemonSets
                                └─> On events: handleWorkloadAdd/Update/Delete
                                      └─> setContainerMetric() (sets Prometheus metrics)
```

### Key Concepts

**Namespace Watching:**
- `NamespaceWatcher()` monitors namespaces matching `Config.NamespaceSelector` (default: `sentinel.io/controlled=enabled`)
- Sends the updated namespace list via a channel whenever namespaces are labeled or unlabeled
- `AppDiscovery()` consumes this channel and starts/stops informers per namespace

**Informer Lifecycle:**
- Each watched namespace gets its own `SharedInformerFactory`
- Informers watch Deployments, StatefulSets, and DaemonSets and trigger the event handlers
- When a namespace is unlabeled, its informers are stopped by closing that namespace's stop channel (`close(stopCh)`)
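
The namespace bookkeeping described above can be sketched with the standard library alone. This is an illustration under assumptions, not Sentinel's code: `informerSet` and `reconcile` are hypothetical names, and each plain `chan struct{}` stands in for the stop channel that would presumably be handed to the namespace's `SharedInformerFactory` via `Start(stopCh)`.

```go
package main

import "sort"

// informerSet tracks one stop channel per watched namespace (hypothetical type).
type informerSet struct {
	stops map[string]chan struct{}
}

func newInformerSet() *informerSet {
	return &informerSet{stops: make(map[string]chan struct{})}
}

// reconcile applies one namespace list received from the watcher:
// untracked namespaces get a new stop channel ("start"), and tracked
// namespaces missing from the list have their channel closed ("stop").
// Both result slices are sorted for stable output.
func (s *informerSet) reconcile(namespaces []string) (started, stopped []string) {
	want := make(map[string]bool, len(namespaces))
	for _, ns := range namespaces {
		want[ns] = true
		if _, ok := s.stops[ns]; !ok {
			s.stops[ns] = make(chan struct{})
			started = append(started, ns)
		}
	}
	for ns, stopCh := range s.stops {
		if !want[ns] {
			close(stopCh)       // every informer started with this channel sees it closed
			delete(s.stops, ns) // deleting during range is safe in Go
			stopped = append(stopped, ns)
		}
	}
	sort.Strings(started)
	sort.Strings(stopped)
	return started, stopped
}
```

Each namespace list received from `NamespaceWatcher()` would be passed to `reconcile`. Closing a channel acts as a broadcast, so every informer started with it exits; deleting the map entry means a namespace that is labeled again gets a fresh channel instead of an already-closed one.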

**Dynamic Prometheus Labels:**
- Metrics are built at startup via `BuildMetrics(extraLabels)`
- Base labels (`workload_namespace`, `workload_type`, etc.) + dynamic labels from `Config.ExtraLabels`
- All label names must be declared when a metric vector is registered; the Prometheus client cannot add labels to a registered metric later
- **Label naming:** Uses `workload_namespace` instead of `namespace` to avoid colliding with the `namespace` label that ServiceMonitor scrapes attach automatically
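
Because the label set is frozen at registration, it can be worth validating it once at startup. The sketch below assumes an `extraLabel` struct whose fields mirror the `type`/`key`/`timeseriesLabelName` YAML keys (the real struct in `sentinel_config.go` may differ). The hypothetical `buildLabelNames` appends the configured names to the base labels and rejects names that break the classic Prometheus label-name rules, use the reserved `__` prefix, or duplicate an existing name (for example an extra label called `workload_namespace`).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extraLabel is an assumed shape for one entry of Config.ExtraLabels.
type extraLabel struct {
	Type                string // "annotation" or "label"
	Key                 string // annotation/label key read from the workload
	TimeseriesLabelName string // label name emitted on the metric
}

// Classic Prometheus label-name syntax.
var labelNameRE = regexp.MustCompile(`^[a-zA-Z_][a-zA-Z0-9_]*$`)

// buildLabelNames returns the base labels followed by one name per extra
// label, in configuration order, and fails on names Prometheus would reject.
func buildLabelNames(base []string, extras []extraLabel) ([]string, error) {
	names := append([]string(nil), base...) // copy so the caller's slice is untouched
	for _, e := range extras {
		names = append(names, e.TimeseriesLabelName)
	}
	seen := make(map[string]bool, len(names))
	for _, n := range names {
		if !labelNameRE.MatchString(n) || strings.HasPrefix(n, "__") {
			return nil, fmt.Errorf("invalid Prometheus label name %q", n)
		}
		if seen[n] {
			return nil, fmt.Errorf("duplicate label name %q", n)
		}
		seen[n] = true
	}
	return names, nil
}
```

The returned slice is what would be handed to a constructor such as `prometheus.NewGaugeVec`; values later passed to `WithLabelValues` must follow the same order.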

**Image Change Detection:**
- `handleWorkloadUpdate()` compares old vs new containers
- If `container.Image` changed, increments `sentinel_image_changes_total{old_tag="...", new_tag="..."}`
- Uses `parseImage()` helper to extract registry/repo/tag

## File Organization

```
cmd/sentinel/                   - CLI definition (Cobra)
  root.go                       - Root command
  start.go                      - "start" subcommand + Viper config loading

pkg/shared/                     - Shared types
  sentinel_config.go            - Config and ExtraLabel structs

pkg/prometheus/                 - Metrics
  sentinel_exposed_metrics.go   - Metric definitions + BuildMetrics()
  sentinel_webserver.go         - HTTP server on :9090/metrics

pkg/sentinel/                   - Controller logic
  start.go                      - Main controller entrypoint
  app_discovery.go              - Per-namespace informers + event handlers
  helpers.go                    - Utilities (parseImage, extractExtraLabelValues, etc.)

manifests/
  install/                      - Production deployment (ConfigMap, Deployment, RBAC)
  develop/                      - Demo workloads for testing

dashboard/
  grafana.json                  - Pre-built Grafana dashboard
```

## Configuration

Configuration is loaded via Viper with this precedence (highest to lowest):
1. Environment variables (e.g., `METRICSPORT`, `VERBOSITY`)
2. Config file at `/etc/sentinel/sentinel.yaml`
3. Defaults in `cmd/sentinel/start.go`

**Example config:**
```yaml
namespaceSelector:
  "sentinel.io/controlled": "enabled"
metricsPort: "9090"
verbosity: 2

extraLabels:
  - type: "annotation"
    key: "sentinel.io/owner"
    timeseriesLabelName: "owner"
  - type: "label"
    key: "environment"
    timeseriesLabelName: "env"
```

## Prometheus Metrics Behavior

### Info Metrics (sentinel_container_image_info)

- Always has value `1` (info pattern)
- When an image tag changes, the old time series stops being reported and a new time series starts
- Once a series is no longer exposed, Prometheus writes a staleness marker at the next scrape, so instant queries stop returning it; graphs over a time range still show it for the period it existed, and its samples stay in the TSDB until retention expires
- **Empty labels:** If the annotation/label doesn't exist on the workload, the metric label is `""` (Prometheus treats an empty label value the same as an absent label)
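
The empty-string fallback is what keeps every value slice aligned with the label names fixed at startup. The sketch below borrows the `extractExtraLabelValues` name from `helpers.go`, but its signature, the `extraLabel` struct, and the choice to map unrecognised `type` values to `""` are assumptions rather than the project's code.

```go
package main

// extraLabel is an assumed shape for one Config.ExtraLabels entry.
type extraLabel struct {
	Type                string // "annotation" or "label"
	Key                 string // key looked up on the workload
	TimeseriesLabelName string // Prometheus label name (not needed here)
}

// extractExtraLabelValues returns exactly one value per configured extra
// label, in configuration order. Missing keys and unrecognised types yield
// "", so the slice always lines up with the registered label names.
func extractExtraLabelValues(annotations, labels map[string]string, extras []extraLabel) []string {
	values := make([]string, len(extras)) // every element starts as ""
	for i, e := range extras {
		switch e.Type {
		case "annotation":
			values[i] = annotations[e.Key] // missing key (or nil map) reads as ""
		case "label":
			values[i] = labels[e.Key]
		}
	}
	return values
}
```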
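
### Counter Metrics (sentinel_image_changes_total)

- Increments on every image tag change
- **Important:** The series for a given `old_tag`/`new_tag` pair is created on demand when the first such change is detected
- Prometheus therefore first sees the series at value `1` (never `0`→`1`), so `increase()` cannot observe that first change: it only measures growth between samples inside the window, whatever the window length
- Longer windows such as `increase(sentinel_image_changes_total[1h])` only help with later increments of an existing series; to catch brand-new changes, look for series that appeared recently, e.g. `sentinel_image_changes_total unless sentinel_image_changes_total offset 1h`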

## Workload Type Support

**Currently Supported:** Deployments, StatefulSets, DaemonSets

**Planned:** CronJobs

### Implementation Pattern for Workload Types

All workload handlers use polymorphism via the `metav1.Object` interface:
- `handleWorkloadAdd(resourceType string, namespace string, workload metav1.Object, ...)`
- `handleWorkloadUpdate(resourceType string, namespace string, newWorkload metav1.Object, ...)`
- `handleWorkloadDelete(resourceType string, namespace string, name string, ...)`

This allows a single set of handlers to work with Deployment/StatefulSet/DaemonSet/etc.

**Key:** Use the `.GetName()`, `.GetAnnotations()`, `.GetLabels()` methods, not direct field access like `.Name`: a `metav1.Object` value only exposes the interface methods, not the concrete type's fields.
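
To see why the getters matter, here is a stdlib-only stand-in. `objectMeta` mimics the three `metav1.Object` methods named above, and `meta` plays the role of `metav1.ObjectMeta`, whose getters use pointer receivers. Embedding promotes those getters, so `*deployment` and `*statefulSet` both satisfy the interface, much as `*appsv1.Deployment` satisfies `metav1.Object`. All type and function names here are hypothetical.

```go
package main

// objectMeta is a stand-in for the subset of metav1.Object used here.
type objectMeta interface {
	GetName() string
	GetAnnotations() map[string]string
	GetLabels() map[string]string
}

// meta plays the role of metav1.ObjectMeta; its getters use pointer receivers.
type meta struct {
	Name        string
	Annotations map[string]string
	Labels      map[string]string
}

func (m *meta) GetName() string                   { return m.Name }
func (m *meta) GetAnnotations() map[string]string { return m.Annotations }
func (m *meta) GetLabels() map[string]string      { return m.Labels }

// Embedding meta promotes its getters to *deployment and *statefulSet.
type deployment struct{ meta }
type statefulSet struct{ meta }

// describe is one handler body that serves every workload kind, because it
// touches the object only through the interface.
func describe(resourceType, namespace string, w objectMeta) string {
	owner := w.GetAnnotations()["sentinel.io/owner"] // "" if absent; reading a nil map is safe
	env := w.GetLabels()["environment"]
	return resourceType + " " + namespace + "/" + w.GetName() + " owner=" + owner + " env=" + env
}
```

Passing a `deployment` value instead of a pointer would not satisfy `objectMeta`, because the promoted getters belong only to the pointer's method set.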

## Important Implementation Details

### Image Parsing (helpers.go:parseImage)
- Handles full registry URLs (ghcr.io, quay.io, etc.)
- Defaults to `docker.io` if no registry in image string
- Detects registry vs namespace by looking for `.` or `:` in first path component
- Default tag is `latest` if not specified
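
The rules above translate into a short function. This is a sketch of the described behaviour, not the code in `helpers.go`: the `imageRef` return type is hypothetical, Docker Hub official images are left without a `library/` prefix, and digest references (`image@sha256:...`) are not handled.

```go
package main

import "strings"

// imageRef is a hypothetical result type for this sketch.
type imageRef struct {
	Registry   string
	Repository string
	Tag        string
}

// parseImage splits an image reference into registry, repository and tag.
func parseImage(image string) imageRef {
	ref := imageRef{Registry: "docker.io", Tag: "latest"} // documented defaults
	rest := image
	// A first path component containing "." or ":" is treated as a registry host.
	if i := strings.Index(rest, "/"); i > 0 && strings.ContainsAny(rest[:i], ".:") {
		ref.Registry = rest[:i]
		rest = rest[i+1:]
	}
	// A tag is a ":" that appears after the last "/" of the remaining path.
	if i := strings.LastIndex(rest, ":"); i > strings.LastIndex(rest, "/") {
		ref.Tag = rest[i+1:]
		rest = rest[:i]
	}
	ref.Repository = rest
	return ref
}
```

Under this rule `localhost:5000/app` is recognised as a registry because of the port, but a bare `localhost/app` is not, even though Docker's own reference parser special-cases `localhost`.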

### Change Detection (app_discovery.go:handleWorkloadUpdate)
- Only processes updates where `newGen > oldGen` (spec changes, not status changes)
- Skips spurious updates where `ResourceVersion` is unchanged
- Compares old vs new container images by building a map of `containerName -> image`
- **Limitation:** Adding or removing a container produces no change event; only image changes on existing containers are counted
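
The two guards and the map comparison can be sketched as follows. `container` is a stand-in for `corev1.Container`, and `shouldProcess`, `diffImages` and `imageChange` are hypothetical names; in the controller, each reported pair would presumably then go through `parseImage()` to obtain the `old_tag`/`new_tag` label values.

```go
package main

// container is a minimal stand-in for corev1.Container.
type container struct {
	Name  string
	Image string
}

// imageChange records one old-to-new image transition for a container.
type imageChange struct {
	Container string
	OldImage  string
	NewImage  string
}

// shouldProcess applies the two guards: skip updates whose ResourceVersion
// is unchanged, and ignore updates that did not bump metadata.generation.
func shouldProcess(oldRV, newRV string, oldGen, newGen int64) bool {
	if oldRV == newRV {
		return false
	}
	return newGen > oldGen
}

// diffImages builds a containerName -> image map from the old spec and
// reports containers whose image differs in the new spec. Added or removed
// containers produce nothing, matching the documented limitation.
func diffImages(oldContainers, newContainers []container) []imageChange {
	oldImages := make(map[string]string, len(oldContainers))
	for _, c := range oldContainers {
		oldImages[c.Name] = c.Image
	}
	var changes []imageChange
	for _, c := range newContainers {
		if prev, ok := oldImages[c.Name]; ok && prev != c.Image {
			changes = append(changes, imageChange{Container: c.Name, OldImage: prev, NewImage: c.Image})
		}
	}
	return changes
}
```

The `ResourceVersion` guard matters because periodic informer resyncs redeliver the same object to the update handler, with identical old and new versions.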

### Metric Deletion
- Currently NOT implemented (see TODO at app_discovery.go:~200)
- Series for deleted workloads are never removed from the `GaugeVec`, so Sentinel keeps exporting them on every scrape until the controller restarts, and they remain queryable in Prometheus
- To implement: would need to track active metrics and call `.Delete()` on GaugeVec

## Development Workflow

1. **Make code changes**
2. **Build:** `make build` (or `go build -o sentinel`)
3. **Test locally:** `make run` (requires kubeconfig pointing to cluster)
4. **Deploy to KIND:** `make deploy` (builds Docker + loads to cluster "homelab")
5. **Check logs:** `kubectl logs -n kube-system -l app=sentinel-controller -f`
6. **Verify metrics:** Port-forward and curl `/metrics`

## Prometheus ServiceMonitor

`manifests/install/servicemonitor.yaml` configures Prometheus Operator scraping with `metricRelabelings`:

```yaml
metricRelabelings:
  - action: labeldrop
    regex: pod          # Drop Prometheus auto-labels
  - action: labeldrop
    regex: endpoint
  - action: labeldrop
    regex: instance
  - action: labeldrop
    regex: service
  - action: labeldrop
    regex: namespace    # ServiceMonitor adds namespace="kube-system"
```

**Why:** When Prometheus scrapes a target discovered through a ServiceMonitor, it attaches target labels (`pod`, `endpoint`, `instance`, `service`, `namespace`) to every sample. These describe the Sentinel pod itself rather than the tracked workloads, so we drop them to keep the metrics clean.

**Note:** Changes to `metricRelabelings` only affect new samples. Old time series with the previous labels persist in the Prometheus TSDB until retention expires.

## Grafana Dashboard

Pre-built dashboard at `dashboard/grafana.json` includes:
- Overview stats (tracked containers, workloads, changes, `:latest` usage)
- Image inventory table with color-coded tags
- Registry distribution pie chart
- Image changes log (table format works better than graphs for counter metrics)

Import into Grafana via UI → Dashboards → Import → Upload JSON file.