# Disk Health producer

Reads SMART attributes and NVMe data from physical disks on Ceph OSD nodes. Runs as a DaemonSet in the `rook-ceph` namespace.

## Prerequisites

- Rook-Ceph cluster running
- Nodes with physical disks at `/dev/`
- Ceph OSD base path at `/var/lib/rook/rook-ceph/` (Rook default)

The container image ships with `smartmontools` and `nvme-cli` built in.

## Deployment

### Step 1: ServiceAccount and RBAC

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ceph-disk-health-exporter
  namespace: rook-ceph
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: rook-ceph
  name: ceph-disk-health-exporter-role
rules:
  # NOTE: nodes are cluster-scoped, so a namespaced Role cannot grant access
  # to them; node reads would need a ClusterRole and ClusterRoleBinding.
  - apiGroups: [""]
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: rook-ceph
  name: ceph-disk-health-exporter-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ceph-disk-health-exporter-role
subjects:
  - kind: ServiceAccount
    name: ceph-disk-health-exporter
    namespace: rook-ceph
EOF
```

### Step 2: ConfigMap

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: disk-health-config
  namespace: rook-ceph
data:
  PROMETHEUS_ENABLED: "true"
  PROMETHEUS_PORT: "8080"
  DISKS: "/dev/sda,/dev/sdb"
  INTERVAL: "60"
  CEPH_OSD_BASE_PATH: "/var/lib/rook/rook-ceph/"
  GROWN_DEFECTS_THRESHOLD: "10"
  PENDING_SECTORS_THRESHOLD: "3"
  REALLOCATED_SECTORS_THRESHOLD: "10"
  LIFETIME_USED_THRESHOLD: "80"
EOF
```

Set `DISKS` to match your node layout. Use `*` to monitor all disks.

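If you prefer an explicit device list over `*`, the value for `DISKS` can be derived from the node's block devices. The following is a minimal sketch, not part of the producer: `disks_value` is a hypothetical helper that turns the output of `lsblk -dn -o NAME,TYPE` into a comma-separated list, and it assumes the producer accepts any whole-disk path under `/dev/` (including NVMe namespaces).

```bash
# Hypothetical helper: build a DISKS value from `lsblk -dn -o NAME,TYPE`
# output on stdin. Only rows whose TYPE is "disk" are kept, so partitions,
# loop devices, and optical drives are skipped.
disks_value() {
  awk '$2 == "disk" { printf "%s/dev/%s", sep, $1; sep = "," } END { print "" }'
}

# Usage on a node (run manually):
#   lsblk -dn -o NAME,TYPE | disks_value
```

Paste the resulting string into the `DISKS` key of the ConfigMap. Nodes with different layouts would need different values, which is one reason `*` may be simpler.
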
### Step 3: DaemonSet

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ceph-disk-health-exporter
  namespace: rook-ceph
  labels:
    app: ceph-disk-health-exporter
spec:
  selector:
    matchLabels:
      app: ceph-disk-health-exporter
  template:
    metadata:
      labels:
        app: ceph-disk-health-exporter
    spec:
      serviceAccountName: ceph-disk-health-exporter
      containers:
        - name: disk-health-exporter
          image: ghcr.io/cobaltcore-dev/prysm:0.0.36
          args:
            - local-producer
            - disk-health-metrics
          envFrom:
            - configMapRef:
                name: disk-health-config
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: INSTANCE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          securityContext:
            privileged: true
          volumeMounts:
            - name: host-dev
              mountPath: /dev
            - name: host-proc
              mountPath: /host/proc
              readOnly: true
            - name: host-rook-ceph
              mountPath: /var/lib/rook/rook-ceph
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          ports:
            - containerPort: 8080
              name: metrics
      volumes:
        - name: host-dev
          hostPath:
            path: /dev
            type: Directory
        - name: host-proc
          hostPath:
            path: /proc
            type: Directory
        - name: host-rook-ceph
          hostPath:
            path: /var/lib/rook/rook-ceph
            type: Directory
      tolerations:
        - key: "node-role.kubernetes.io/control-plane"
          effect: "NoSchedule"
        - key: "node-role.kubernetes.io/worker"
          operator: "Exists"
      nodeSelector:
        kubernetes.io/os: linux
EOF
```

`privileged: true` is required because `smartctl` needs direct access to the host devices under `/dev/`.

### Step 4: Service for Prometheus

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: disk-health-metrics
  namespace: rook-ceph
  labels:
    app: ceph-disk-health-exporter
spec:
  clusterIP: None
  selector:
    app: ceph-disk-health-exporter
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
EOF
```

### Step 5: ServiceMonitor (Prometheus Operator)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: disk-health-metrics
  namespace: rook-ceph
  labels:
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: ceph-disk-health-exporter
  endpoints:
    - port: metrics
      interval: 60s
      path: /metrics
```

### Step 6: Verify

```bash
# Check DaemonSet status
kubectl -n rook-ceph get ds ceph-disk-health-exporter

# Check logs
kubectl -n rook-ceph logs -l app=ceph-disk-health-exporter --tail=20

# Test the metrics endpoint from inside the first exporter pod
# (no TTY is requested because the output is piped)
kubectl -n rook-ceph exec "$(kubectl -n rook-ceph get pod -l app=ceph-disk-health-exporter -o name | head -1)" -- wget -qO- http://localhost:8080/metrics | head -30
```

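Once the endpoint responds, it can help to check quickly whether any disk already exceeds one of the configured thresholds. The sketch below is an illustration only: `over_threshold` is a hypothetical helper that filters Prometheus text output read from stdin, and it assumes each sample line ends with its value (no trailing timestamp).

```bash
# Hypothetical helper: print every series of metric $1 whose value is
# greater than threshold $2. Reads Prometheus text exposition on stdin.
over_threshold() {
  awk -v metric="$1" -v limit="$2" '
    /^#/ { next }                          # skip HELP and TYPE comment lines
    {
      name = $1
      brace = index(name, "{")             # position of the label set, if any
      if (brace > 0) name = substr(name, 1, brace - 1)
      # $NF is the sample value under the no-timestamp assumption
      if (name == metric && $NF + 0 > limit + 0) print
    }
  '
}

# Usage against a running pod (run manually), with POD set to a pod name:
#   kubectl -n rook-ceph exec "$POD" -- wget -qO- http://localhost:8080/metrics \
#     | over_threshold disk_pending_sectors 3
```

An empty result means no series of that metric is above the given value.
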
## Environment variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DISKS` | Comma-separated device list, or `*` for all | `/dev/sda,/dev/sdb` |
| `INTERVAL` | Collection interval in seconds | `10` |
| `PROMETHEUS_ENABLED` | Enable metrics endpoint | `false` |
| `PROMETHEUS_PORT` | HTTP port for metrics | `8080` |
| `NODE_NAME` | Node identifier (use fieldRef) | |
| `INSTANCE_ID` | Instance identifier (use fieldRef) | |
| `CEPH_OSD_BASE_PATH` | Rook-Ceph OSD directory | `/var/lib/rook/rook-ceph/` |
| `GROWN_DEFECTS_THRESHOLD` | Alert threshold: grown defects | `10` |
| `PENDING_SECTORS_THRESHOLD` | Alert threshold: pending sectors | `3` |
| `REALLOCATED_SECTORS_THRESHOLD` | Alert threshold: reallocated sectors | `10` |
| `LIFETIME_USED_THRESHOLD` | Alert threshold: SSD lifetime used (%) | `80` |
| `ALL_ATTR` | Export all SMART attributes | `false` |
| `NATS_URL` | NATS server URL (optional) | |
| `NATS_SUBJECT` | NATS publish subject | `osd.disk.health` |

## OSD mapping

When `CEPH_OSD_BASE_PATH` is set, the producer maps physical devices to Ceph OSD IDs automatically. Every Prometheus metric gets an `osd_id` label.

This works with both direct block devices and LVM logical volumes.

## Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `smart_attributes` | Gauge | SMART attributes (labeled by `attribute`) |
| `disk_temperature_celsius` | Gauge | Disk temperature |
| `disk_reallocated_sectors` | Gauge | Reallocated sector count |
| `disk_pending_sectors` | Gauge | Pending sector count |
| `disk_power_on_hours_total` | Gauge | Cumulative power-on hours |
| `ssd_life_used_percentage` | Gauge | SSD wear level |
| `disk_error_counts_total` | Gauge | Error counts (labeled by `error_type`) |
| `disk_capacity_gb` | Gauge | Disk capacity in GB |
| `disk_info` | Gauge | Device metadata: vendor, model, serial, firmware, media_type |

For NVMe devices, `smart_attributes` includes `critical_warning`, `available_spare`, `available_spare_threshold`, and vendor IDs in hex.

Full list: [metrics reference](../pkg/producers/diskhealthmetrics/README.md).
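
As an illustration of how these metrics could drive alerting with the Prometheus Operator, here is a hedged sketch of a `PrometheusRule`. The rule name, alert names, `for` durations, and severity labels are assumptions made for this example; the `prometheus: kube-prometheus` label is assumed to match your Prometheus rule selector, and the `osd_id` label is only present when `CEPH_OSD_BASE_PATH` is set. The thresholds mirror the ConfigMap values from Step 2.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: disk-health-alerts          # hypothetical name
  namespace: rook-ceph
  labels:
    prometheus: kube-prometheus     # assumed to match the Prometheus ruleSelector
spec:
  groups:
    - name: disk-health
      rules:
        - alert: DiskPendingSectorsHigh        # hypothetical alert name
          expr: disk_pending_sectors > 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pending sectors above threshold on OSD {{ $labels.osd_id }}"
        - alert: DiskReallocatedSectorsHigh    # hypothetical alert name
          expr: disk_reallocated_sectors > 10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Reallocated sectors above threshold on OSD {{ $labels.osd_id }}"
        - alert: SSDLifetimeUsedHigh           # hypothetical alert name
          expr: ssd_life_used_percentage > 80
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "SSD wear above 80% on OSD {{ $labels.osd_id }}"
```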