Commit b91778b

Merge pull request #47 from cobaltcore-dev/docs/deployment-guides
Looks clean for a first version of the documentation. Docs/deployment guides
2 parents 66949d6 + 043e7f7 commit b91778b

5 files changed

Lines changed: 1087 additions & 8 deletions

README.md

Lines changed: 8 additions & 8 deletions
@@ -107,14 +107,14 @@ Key Responsibilities:
 [Kernel Metrics](pkg/producers/kernelmetrics/README.md)
 [Resource Usage](pkg/producers/resourceusage/README.md)
 
-## Usage
-
-Prysm can be employed across a wide range of observability scenarios, from
-monitoring the health of Ceph storage clusters and RadosGW instances to
-ensuring the reliability of hardware components through SMART attribute
-analysis. Whether you need to integrate with Prometheus, send real-time alerts
-via NATS, or simply log and visualize system performance, Prysm offers the
-tools and flexibility to meet your needs.
+## Documentation
+
+Step-by-step deployment guides for Kubernetes:
+
+- [Getting Started](docs/getting-started.md) -- requirements, images, configuration, Prometheus setup
+- [RadosGW Usage Producer](docs/radosgw-usage.md) -- collect bucket/user/quota metrics from the RadosGW Admin API
+- [Disk Health Producer](docs/disk-health.md) -- monitor SMART/NVMe attributes on Ceph OSD nodes
+- [Ops Log Producer](docs/ops-log.md) -- parse RGW operation logs, expose metrics, publish CADF audit events
 
 ## Support, Feedback, Contributing
 

docs/disk-health.md

Lines changed: 256 additions & 0 deletions
@@ -0,0 +1,256 @@
# Disk Health producer

Reads SMART attributes and NVMe data from physical disks on Ceph OSD nodes. Runs as a DaemonSet in the `rook-ceph` namespace.

## Prerequisites

- Rook-Ceph cluster running
- Nodes with physical disks at `/dev/`
- Ceph OSD base path at `/var/lib/rook/rook-ceph/` (Rook default)

The container image ships with `smartmontools` and `nvme-cli` built in.

## Deployment

### Step 1: ServiceAccount and RBAC

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ceph-disk-health-exporter
  namespace: rook-ceph
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: rook-ceph
  name: ceph-disk-health-exporter-role
rules:
  # Note: nodes are cluster-scoped, so this namespaced Role only takes effect
  # for pods. Reading nodes would need a ClusterRole and ClusterRoleBinding.
  - apiGroups: [""]
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: rook-ceph
  name: ceph-disk-health-exporter-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ceph-disk-health-exporter-role
subjects:
  - kind: ServiceAccount
    name: ceph-disk-health-exporter
    namespace: rook-ceph
EOF
```

### Step 2: ConfigMap

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: disk-health-config
  namespace: rook-ceph
data:
  PROMETHEUS_ENABLED: "true"
  PROMETHEUS_PORT: "8080"
  DISKS: "/dev/sda,/dev/sdb"
  INTERVAL: "60"
  CEPH_OSD_BASE_PATH: "/var/lib/rook/rook-ceph/"
  GROWN_DEFECTS_THRESHOLD: "10"
  PENDING_SECTORS_THRESHOLD: "3"
  REALLOCATED_SECTORS_THRESHOLD: "10"
  LIFETIME_USED_THRESHOLD: "80"
EOF
```

Set `DISKS` to match your node layout. Use `*` to monitor all disks.
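
If you prefer an explicit device list over `*`, a small helper can build the `DISKS` value from the block devices present on a node. This is a minimal sketch, not part of Prysm: the function name `list_disks` and the skip list are assumptions, and whether the producer accepts every device name it emits (for example NVMe namespaces) should be checked against the producer README.

```bash
#!/usr/bin/env bash
# list_disks: print a comma-separated DISKS value for the block devices
# found under a sysfs-style directory (default: /sys/block).
# Hypothetical helper; the skip list below is an assumption, adjust as needed.
list_disks() {
  local sys_block="${1:-/sys/block}"
  local path name out=""
  for path in "$sys_block"/*; do
    [ -e "$path" ] || continue                    # empty or missing directory
    name="${path##*/}"                            # strip the directory prefix
    case "$name" in
      loop*|ram*|zram*|dm-*|md*|sr*) continue ;;  # virtual or optical devices
    esac
    out="${out:+$out,}/dev/$name"                 # append with a comma separator
  done
  printf '%s\n' "$out"
}
```

Run `list_disks` on an OSD node (not inside the pod) and paste the result into the `DISKS` key of the ConfigMap.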

### Step 3: DaemonSet

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ceph-disk-health-exporter
  namespace: rook-ceph
  labels:
    app: ceph-disk-health-exporter
spec:
  selector:
    matchLabels:
      app: ceph-disk-health-exporter
  template:
    metadata:
      labels:
        app: ceph-disk-health-exporter
    spec:
      serviceAccountName: ceph-disk-health-exporter
      containers:
        - name: disk-health-exporter
          image: ghcr.io/cobaltcore-dev/prysm:0.0.36
          args:
            - local-producer
            - disk-health-metrics
          envFrom:
            - configMapRef:
                name: disk-health-config
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: INSTANCE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          securityContext:
            privileged: true
          volumeMounts:
            - name: host-dev
              mountPath: /dev
            - name: host-proc
              mountPath: /host/proc
              readOnly: true
            - name: host-rook-ceph
              mountPath: /var/lib/rook/rook-ceph
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          ports:
            - containerPort: 8080
              name: metrics
      volumes:
        - name: host-dev
          hostPath:
            path: /dev
            type: Directory
        - name: host-proc
          hostPath:
            path: /proc
            type: Directory
        - name: host-rook-ceph
          hostPath:
            path: /var/lib/rook/rook-ceph
            type: Directory
      tolerations:
        - key: "node-role.kubernetes.io/control-plane"
          effect: "NoSchedule"
        - key: "node-role.kubernetes.io/worker"
          operator: "Exists"
      nodeSelector:
        kubernetes.io/os: linux
EOF
```

`privileged: true` is required -- smartctl needs direct access to host `/dev/` devices.

### Step 4: Service for Prometheus

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: disk-health-metrics
  namespace: rook-ceph
  labels:
    app: ceph-disk-health-exporter
spec:
  clusterIP: None
  selector:
    app: ceph-disk-health-exporter
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
EOF
```

### Step 5: ServiceMonitor (Prometheus Operator)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: disk-health-metrics
  namespace: rook-ceph
  labels:
    prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app: ceph-disk-health-exporter
  endpoints:
    - port: metrics
      interval: 60s
      path: /metrics
```
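
For clusters that run Prometheus without the Operator, the same Service can be scraped through endpoint discovery. The sketch below writes a scrape job to a file for you to merge into your own `prometheus.yml`. It assumes Prometheus runs inside the cluster with permission to list endpoints in `rook-ceph`; the file name and the `node` target label are illustrative choices, not something this guide defines.

```bash
# Write a scrape job fragment for a plain (non-Operator) Prometheus.
# The file name is an arbitrary choice for this example.
cat > prometheus-disk-health-scrape.yaml <<'EOF'
scrape_configs:
  - job_name: disk-health-metrics
    scrape_interval: 60s
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [rook-ceph]
    relabel_configs:
      # Keep only the headless Service created in Step 4.
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: disk-health-metrics
      # Keep only the port named "metrics".
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        action: keep
        regex: metrics
      # Attach the node name so each series can be tied to a host.
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
EOF
```

Because the Service is headless and backed by a DaemonSet, discovery yields one target per node.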

### Step 6: Verify

```bash
# Check DaemonSet status
kubectl -n rook-ceph get ds ceph-disk-health-exporter

# Check logs
kubectl -n rook-ceph logs -l app=ceph-disk-health-exporter --tail=20

# Test the metrics endpoint
kubectl -n rook-ceph exec -it $(kubectl -n rook-ceph get pod -l app=ceph-disk-health-exporter -o name | head -1) -- wget -qO- http://localhost:8080/metrics | head -30
```

## Environment variables

| Variable | Description | Default |
|----------|-------------|---------|
| `DISKS` | Comma-separated device list, or `*` for all | `/dev/sda,/dev/sdb` |
| `INTERVAL` | Collection interval in seconds | `10` |
| `PROMETHEUS_ENABLED` | Enable metrics endpoint | `false` |
| `PROMETHEUS_PORT` | HTTP port for metrics | `8080` |
| `NODE_NAME` | Node identifier (use fieldRef) | |
| `INSTANCE_ID` | Instance identifier (use fieldRef) | |
| `CEPH_OSD_BASE_PATH` | Rook-Ceph OSD directory | `/var/lib/rook/rook-ceph/` |
| `GROWN_DEFECTS_THRESHOLD` | Alert threshold: grown defects | `10` |
| `PENDING_SECTORS_THRESHOLD` | Alert threshold: pending sectors | `3` |
| `REALLOCATED_SECTORS_THRESHOLD` | Alert threshold: reallocated sectors | `10` |
| `LIFETIME_USED_THRESHOLD` | Alert threshold: SSD lifetime used (%) | `80` |
| `ALL_ATTR` | Export all SMART attributes | `false` |
| `NATS_URL` | NATS server URL (optional) | |
| `NATS_SUBJECT` | NATS publish subject | `osd.disk.health` |
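
The optional variables in the table can be combined with the Step 2 settings. As a rough sketch, the block below writes a variant of the ConfigMap that monitors all disks, exports every SMART attribute, and publishes to NATS. The NATS URL is a hypothetical placeholder and the file name is arbitrary. The threshold keys are left out here, so the defaults from the table apply.

```bash
# Write a ConfigMap variant that enables ALL_ATTR and NATS publishing.
cat > disk-health-config.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: disk-health-config
  namespace: rook-ceph
data:
  PROMETHEUS_ENABLED: "true"
  PROMETHEUS_PORT: "8080"
  DISKS: "*"
  INTERVAL: "60"
  CEPH_OSD_BASE_PATH: "/var/lib/rook/rook-ceph/"
  ALL_ATTR: "true"
  # Hypothetical NATS endpoint; replace with your own server URL.
  NATS_URL: "nats://nats.example.svc:4222"
  NATS_SUBJECT: "osd.disk.health"
EOF
```

Apply it with `kubectl apply -f disk-health-config.yaml`. Values injected through `envFrom` are read when a container starts, so restart the pods afterwards, for example with `kubectl -n rook-ceph rollout restart daemonset ceph-disk-health-exporter`.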

## OSD mapping

When `CEPH_OSD_BASE_PATH` is set, the producer maps physical devices to Ceph OSD IDs automatically. Every Prometheus metric gets an `osd_id` label.

This works with both direct block devices and LVM logical volumes.

## Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `smart_attributes` | Gauge | SMART attributes (labeled by `attribute`) |
| `disk_temperature_celsius` | Gauge | Disk temperature |
| `disk_reallocated_sectors` | Gauge | Reallocated sector count |
| `disk_pending_sectors` | Gauge | Pending sector count |
| `disk_power_on_hours_total` | Gauge | Cumulative power-on hours |
| `ssd_life_used_percentage` | Gauge | SSD wear level |
| `disk_error_counts_total` | Gauge | Error counts (labeled by `error_type`) |
| `disk_capacity_gb` | Gauge | Disk capacity in GB |
| `disk_info` | Gauge | Device metadata: vendor, model, serial, firmware, media_type |

For NVMe devices, `smart_attributes` includes `critical_warning`, `available_spare`, `available_spare_threshold`, and vendor IDs in hex.

Full list: [metrics reference](../pkg/producers/diskhealthmetrics/README.md).
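
These gauges can drive alerts on the Prometheus side. The following sketch writes a `PrometheusRule` for the Prometheus Operator, reusing the `prometheus: kube-prometheus` label from Step 5. The alert names, `for` durations, and severities are illustrative assumptions, and the thresholds simply mirror the ConfigMap values from Step 2. Only the `osd_id` label is documented above; `instance` is the standard Prometheus target label.

```bash
# Write example alert rules based on the disk health gauges.
cat > disk-health-alerts.yaml <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: disk-health-alerts
  namespace: rook-ceph
  labels:
    prometheus: kube-prometheus
spec:
  groups:
    - name: disk-health
      rules:
        - alert: DiskPendingSectors
          expr: disk_pending_sectors > 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pending sectors on OSD {{ $labels.osd_id }} ({{ $labels.instance }})"
        - alert: DiskReallocatedSectors
          expr: disk_reallocated_sectors > 10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Reallocated sectors on OSD {{ $labels.osd_id }} ({{ $labels.instance }})"
        - alert: SSDLifetimeUsed
          expr: ssd_life_used_percentage > 80
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "SSD wear above 80% on OSD {{ $labels.osd_id }} ({{ $labels.instance }})"
EOF
```

Apply the file with `kubectl apply -f disk-health-alerts.yaml` once the ServiceMonitor is scraping. The quoted heredoc delimiter keeps the shell from expanding `$labels`.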
