Commit 899e345

wip
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
1 parent c74eab8 commit 899e345

3 files changed

+2212
-0
lines changed

observability/README.md

Lines changed: 246 additions & 0 deletions
@@ -0,0 +1,246 @@
# Observability Stack for Java Operator SDK

This directory contains the setup scripts and Grafana dashboards for monitoring Java Operator SDK applications.

## Installation

Run the installation script to deploy the full observability stack (OpenTelemetry Collector, Prometheus, and Grafana):

```bash
./install-observability.sh
```

This will install:
- **cert-manager** - Required by the OpenTelemetry Operator
- **OpenTelemetry Operator** - Manages OpenTelemetry Collector instances
- **OpenTelemetry Collector** - Receives OTLP metrics and exports them to Prometheus
- **Prometheus** - Metrics storage and querying
- **Grafana** - Metrics visualization
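
The components take a little while to become ready after the script finishes. The following is a minimal sketch for waiting on them, assuming the workloads run in the `observability` namespace used throughout this README. The function name `wait_for_stack`, the `KUBECTL` override, and the default timeout are illustrative and not part of the install script:

```bash
#!/usr/bin/env bash
# KUBECTL can be overridden (e.g. with a stub) to preview the command without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Block until every Deployment in the namespace reports the Available condition.
# The namespace and timeout defaults are assumptions; adjust them to your setup.
wait_for_stack() {
  local namespace="${1:-observability}"
  local timeout="${2:-300s}"
  "$KUBECTL" wait --for=condition=Available deployment --all \
    -n "$namespace" --timeout="$timeout"
}
```

Call `wait_for_stack` after `./install-observability.sh` returns. The check only covers Deployments, so workloads of other kinds (for example StatefulSets) would need a separate check.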

## Accessing Services

### Grafana
```bash
kubectl port-forward -n observability svc/kube-prometheus-stack-grafana 3000:80
```
Then open http://localhost:3000
- Username: `admin`
- Password: `admin`

### Prometheus
```bash
kubectl port-forward -n observability svc/kube-prometheus-stack-prometheus 9090:9090
```
Then open http://localhost:9090
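
With both port-forwards running, the services can also be probed from a terminal. This is a minimal sketch, assuming the local ports shown above, Grafana's `/api/health` endpoint, and Prometheus' `/-/ready` endpoint. The function names and the `CURL`, `GRAFANA_URL`, and `PROMETHEUS_URL` variables are hypothetical helpers:

```bash
#!/usr/bin/env bash
# CURL can be overridden (e.g. with a stub) to preview the requests.
CURL="${CURL:-curl}"

# Probe a health endpoint; -f makes curl exit non-zero on HTTP error responses.
check_endpoint() {
  local url="$1"
  "$CURL" -fsS -o /dev/null "$url"
}

# Base URLs default to the port-forward targets used in this README.
check_grafana()    { check_endpoint "${GRAFANA_URL:-http://localhost:3000}/api/health"; }
check_prometheus() { check_endpoint "${PROMETHEUS_URL:-http://localhost:9090}/-/ready"; }
```

For example, `check_grafana && check_prometheus && echo "stack reachable"` prints the message only when both probes succeed.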

## Grafana Dashboards

Two pre-configured dashboards are **automatically imported** during installation:

### 1. JVM Metrics Dashboard (`jvm-metrics-dashboard.json`)

Monitors Java Virtual Machine health and performance:

**Panels:**
- **JVM Memory Used** - Heap and non-heap memory consumption by memory pool
- **JVM Threads** - Live, daemon, and peak thread counts
- **GC Pause Time Rate** - Garbage collection pause duration
- **GC Pause Count Rate** - Frequency of garbage collection events
- **CPU Usage** - System CPU utilization percentage
- **Classes Loaded** - Number of classes currently loaded
- **Process Uptime** - Application uptime in seconds
- **CPU Count** - Available processor cores
- **GC Memory Allocation Rate** - Memory allocation and promotion rates
- **Heap Memory Max vs Committed** - Heap memory limits and commitments

**Key Metrics:**
- `jvm.memory.used`, `jvm.memory.max`, `jvm.memory.committed`
- `jvm.gc.pause`, `jvm.gc.memory.allocated`, `jvm.gc.memory.promoted`
- `jvm.threads.live`, `jvm.threads.daemon`, `jvm.threads.peak`
- `jvm.classes.loaded`, `jvm.classes.unloaded`
- `system.cpu.usage`, `system.cpu.count`
- `process.uptime`

### 2. Java Operator SDK Metrics Dashboard (`josdk-operator-metrics-dashboard.json`)

Monitors Kubernetes operator performance and health:

**Panels:**
- **Reconciliation Rate (Started)** - Rate of reconciliation loops triggered
- **Reconciliation Success vs Failure Rate** - Success/failure ratio over time
- **Currently Executing Reconciliations** - Active reconciliation threads
- **Reconciliation Queue Size** - Pending reconciliation work
- **Total Reconciliations** - Cumulative count of reconciliations
- **Error Rate** - Overall error rate across all reconciliations
- **Reconciliation Execution Time** - P50, P95, P99 latency percentiles
- **Event Reception Rate** - Kubernetes event processing rate
- **Failures by Exception Type** - Breakdown of errors by exception class
- **Controller Execution Success vs Failure** - Controller-level success metrics
- **Delete Event Rate** - Resource deletion event frequency
- **Reconciliation Retry Rate** - Retry attempts and patterns

**Key Metrics:**
- `operator.sdk.reconciliations.started`, `.success`, `.failed`
- `operator.sdk.reconciliations.executions` - Current execution count
- `operator.sdk.reconciliations.queue.size` - Queue depth
- `operator.sdk.controllers.execution.reconcile` - Execution timing histograms
- `operator.sdk.events.received`, `.delete` - Event reception
- Retry metrics and failure breakdowns

## Importing Dashboards into Grafana

### Automatic Import (Default)

The dashboards are **automatically imported** when you run `./install-observability.sh`. They appear in Grafana within 30-60 seconds after installation. No manual steps are required.

To verify the dashboards were imported:
1. Access Grafana at http://localhost:3000
2. Navigate to **Dashboards** → **Browse**
3. Look for "JOSDK - JVM Metrics" and "JOSDK - Operator Metrics"

### Manual Import Methods

If you need to re-import or update the dashboards manually:

#### Method 1: Via Grafana UI

1. Access Grafana at http://localhost:3000
2. Log in with `admin`/`admin`
3. Navigate to **Dashboards** → **Import**
4. Click **Upload JSON file**
5. Select `jvm-metrics-dashboard.json` or `josdk-operator-metrics-dashboard.json`
6. Select **Prometheus** as the data source
7. Click **Import**

#### Method 2: Via kubectl ConfigMap

```bash
# Re-import JVM dashboard
kubectl create configmap jvm-metrics-dashboard \
  --from-file=jvm-metrics-dashboard.json \
  -n observability \
  -o yaml --dry-run=client | \
  kubectl label --dry-run=client --local -f - grafana_dashboard=1 -o yaml | \
  kubectl apply -f -

# Re-import Operator dashboard
kubectl create configmap josdk-operator-metrics-dashboard \
  --from-file=josdk-operator-metrics-dashboard.json \
  -n observability \
  -o yaml --dry-run=client | \
  kubectl label --dry-run=client --local -f - grafana_dashboard=1 -o yaml | \
  kubectl apply -f -
```

The dashboards will be automatically discovered and loaded by Grafana within 30-60 seconds.
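
Grafana only picks up ConfigMaps that carry the `grafana_dashboard=1` label, so listing by that label is a quick way to confirm the re-import took effect. This is a minimal sketch; the function name `list_dashboard_configmaps` and the `KUBECTL` override are illustrative:

```bash
#!/usr/bin/env bash
# KUBECTL can be overridden (e.g. with a stub) to preview the command without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# List the ConfigMaps labelled grafana_dashboard=1 in the given namespace
# (defaults to the observability namespace used in this README).
list_dashboard_configmaps() {
  local namespace="${1:-observability}"
  "$KUBECTL" get configmap -n "$namespace" -l grafana_dashboard=1
}
```

Both `jvm-metrics-dashboard` and `josdk-operator-metrics-dashboard` should appear in the output of `list_dashboard_configmaps`.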

## Configuring Your Operator

To enable metrics export from your JOSDK operator, ensure your application:

1. **Has the required dependency** (already included in the webpage sample):
   ```xml
   <dependency>
     <groupId>io.micrometer</groupId>
     <artifactId>micrometer-registry-otlp</artifactId>
   </dependency>
   ```

2. **Configures OTLP export** via `otlp-config.yaml`:
   ```yaml
   otlp:
     url: "http://otel-collector-collector.observability.svc.cluster.local:4318/v1/metrics"
     step: 15s
     batchSize: 15000
     aggregationTemporality: "cumulative"
   ```

3. **Registers JVM and JOSDK metrics** (see `WebPageOperator.java` for a reference implementation)

## OTLP Endpoints

The OpenTelemetry Collector provides the following endpoints:

- **OTLP gRPC**: `otel-collector-collector.observability.svc.cluster.local:4317`
- **OTLP HTTP**: `otel-collector-collector.observability.svc.cluster.local:4318`
- **Prometheus Scrape**: `http://otel-collector-prometheus.observability.svc.cluster.local:8889/metrics`
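
To check that the OTLP HTTP endpoint accepts data before wiring up an operator, an empty metrics payload can be posted to it. This is a minimal sketch, assuming port 4318 has been forwarded to localhost and that the receiver accepts OTLP/JSON. The function name `otlp_smoke_test` and the `CURL` override are hypothetical:

```bash
#!/usr/bin/env bash
# CURL can be overridden (e.g. with a stub) to preview the request.
CURL="${CURL:-curl}"

# POST an empty OTLP/JSON metrics payload; a 2xx response suggests the HTTP receiver is reachable.
# Assumes the port has been forwarded first, for example:
#   kubectl port-forward -n observability svc/otel-collector-collector 4318:4318
otlp_smoke_test() {
  local base_url="${1:-http://localhost:4318}"
  "$CURL" -fsS -X POST "$base_url/v1/metrics" \
    -H 'Content-Type: application/json' \
    -d '{"resourceMetrics":[]}'
}
```

A non-zero exit status from `otlp_smoke_test` points at a connectivity or receiver configuration problem rather than at the operator itself.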

## Troubleshooting

### Check OpenTelemetry Collector Logs
```bash
kubectl logs -n observability -l app.kubernetes.io/name=otel-collector -f
```

### Check Prometheus Targets
```bash
kubectl port-forward -n observability svc/kube-prometheus-stack-prometheus 9090:9090
```
Open http://localhost:9090/targets and verify the OpenTelemetry Collector target is UP.

### Verify Metrics in Prometheus
Open the Prometheus UI and search for metrics:
- JVM metrics: `otel_jvm_*`
- Operator metrics: `otel_operator_sdk_*`
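
The Prometheus names differ from the Micrometer names listed under **Key Metrics**. Judging from the names used in this README (for example `jvm.memory.used` appears as `otel_jvm_memory_used_bytes`), dots become underscores, `otel_` is prepended, and a unit or type suffix may follow. A minimal sketch of that mapping, with a hypothetical function name:

```bash
#!/usr/bin/env bash
# Convert a Micrometer meter name into the prefix of its Prometheus series name:
# dots become underscores and "otel_" is prepended. Suffixes such as "_bytes" or
# "_total" may be appended on export, so treat the result as a search prefix.
to_prom_prefix() {
  printf 'otel_%s\n' "$(printf '%s' "$1" | tr '.' '_')"
}
```

For example, `to_prom_prefix operator.sdk.reconciliations.started` yields a prefix that can be typed into the Prometheus expression browser to find the matching series.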

### Check Grafana Data Source
1. Navigate to **Configuration** → **Data Sources**
2. Verify that the Prometheus data source is configured
3. Click **Test** to verify connectivity

## Uninstalling

To remove the observability stack:

```bash
kubectl delete configmap -n observability jvm-metrics-dashboard josdk-operator-metrics-dashboard
kubectl delete -n observability OpenTelemetryCollector otel-collector
helm uninstall -n observability kube-prometheus-stack
helm uninstall -n observability opentelemetry-operator
helm uninstall -n cert-manager cert-manager
kubectl delete namespace observability cert-manager
```

## Customizing Dashboards

The dashboard JSON files can be modified to:
- Add new panels for custom metrics
- Adjust time ranges and refresh intervals
- Change visualization types
- Add templating variables for filtering
- Modify alert thresholds

After making changes, re-import the dashboard using one of the methods above.

## Example Queries

### JVM Metrics
```promql
# Heap memory usage percentage
(otel_jvm_memory_used_bytes{area="heap"} / otel_jvm_memory_max_bytes{area="heap"}) * 100

# GC throughput (percentage of time NOT in GC)
100 - (rate(otel_jvm_gc_pause_seconds_sum[5m]) * 100)

# Thread count trend
otel_jvm_threads_live_threads
```
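
Queries like these can also be run from a terminal through the Prometheus HTTP API, which is convenient for scripted checks. This is a minimal sketch, assuming the Prometheus port-forward on `localhost:9090` is active. The function name `prom_query` and the `CURL` override are illustrative:

```bash
#!/usr/bin/env bash
# CURL can be overridden (e.g. with a stub) to preview the request.
CURL="${CURL:-curl}"

# Run an instant query against the Prometheus HTTP API and print the JSON response.
# --data-urlencode takes care of escaping braces, quotes, and spaces in the PromQL.
prom_query() {
  local query="$1"
  local base_url="${2:-http://localhost:9090}"
  "$CURL" -fsS --get "$base_url/api/v1/query" --data-urlencode "query=$query"
}
```

For example, `prom_query 'otel_jvm_threads_live_threads'` returns the current live thread count series as JSON.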

### Operator Metrics
```promql
# Reconciliation success rate
rate(otel_operator_sdk_reconciliations_success_total[5m]) / rate(otel_operator_sdk_reconciliations_started_total[5m])

# Average reconciliation time
rate(otel_operator_sdk_controllers_execution_reconcile_seconds_sum[5m]) / rate(otel_operator_sdk_controllers_execution_reconcile_seconds_count[5m])

# Queue saturation (each queue's size relative to the current largest queue)
otel_operator_sdk_reconciliations_queue_size / on() group_left() max(otel_operator_sdk_reconciliations_queue_size)
```
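
The **Reconciliation Execution Time** panel shows latency percentiles, which in PromQL are usually computed with `histogram_quantile` over bucket rates. The sketch below prints such a query for a chosen quantile. The `_bucket` series name is an assumption derived from the `_sum` and `_count` series above and exists only if histogram buckets are exported; the function name is hypothetical:

```bash
#!/usr/bin/env bash
# Print a PromQL expression for the given quantile (0-1) of reconciliation time.
# The "_bucket" series name is assumed, not confirmed by the dashboards in this directory.
reconcile_quantile_query() {
  local quantile="$1"
  local window="${2:-5m}"
  printf 'histogram_quantile(%s, sum by (le) (rate(otel_operator_sdk_controllers_execution_reconcile_seconds_bucket[%s])))\n' \
    "$quantile" "$window"
}
```

Running `reconcile_quantile_query 0.95` prints a P95 query that can be pasted into the Prometheus UI or a Grafana panel.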

## References

- [Java Operator SDK Documentation](https://javaoperatorsdk.io)
- [Micrometer OTLP Documentation](https://micrometer.io/docs/registry/otlp)
- [OpenTelemetry Collector](https://opentelemetry.io/docs/collector/)
- [Grafana Dashboards](https://grafana.com/docs/grafana/latest/dashboards/)
