Commit 80f4e97

RobotGF, claude, and Copilot authored
[feat] Add metrics exporter and dashboard for TransferQueue (#83)
Expose Prometheus metrics from the controller and storage units so TransferQueue activity can be monitored end to end. Include a Grafana dashboard and tests so the observability workflow is easier to validate and adopt.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 539569d commit 80f4e97

12 files changed

Lines changed: 1704 additions & 22 deletions

docs/metrics.md

Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@
# Prometheus Metrics & Grafana Dashboard

> Last updated: 05/08/2026

## Overview

TransferQueue provides built-in Prometheus metrics exporting for both the **Controller** and **SimpleStorageUnit** processes. When enabled, each process exposes an HTTP `/metrics` endpoint that can be scraped by Prometheus, and a pre-built Grafana dashboard is provided for visualization.

## Quick Start

### 1. Enable Metrics in Config

```yaml
metrics:
  enabled: true
  port: 0  # 0 = auto-assign free port; set a fixed port for production
```

Or pass via `init()`:

```python
import transfer_queue as tq

tq.init({
    "metrics": {
        "enabled": True,
        "port": 9090,
    }
})
```

### 2. Discover the Endpoint

```python
import transfer_queue as tq

# Assumes tq.init(...) has already been called with metrics enabled
endpoint = tq.get_metrics_endpoint()
print(f"http://{endpoint}/metrics")
```
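Once the URL is known, a quick sanity check is to fetch the page and look for a metric you expect. The sketch below uses only the standard library. `parse_metric_lines` and `fetch_metrics` are hypothetical helpers, not part of TransferQueue; the parser assumes plain `name{labels} value` lines without trailing timestamps, and the sample text (including the `train_0` partition id) is made up for illustration.

```python
from urllib.request import urlopen


def parse_metric_lines(text: str) -> dict[str, float]:
    """Map each sample line ('name{labels} value') to its float value.

    Hypothetical helper: skips '# HELP' / '# TYPE' comments and blank lines,
    and assumes there is no trailing timestamp on sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field on the line
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples


def fetch_metrics(endpoint: str, timeout: float = 5.0) -> dict[str, float]:
    """Fetch http://<endpoint>/metrics; `endpoint` is assumed to be 'host:port'."""
    with urlopen(f"http://{endpoint}/metrics", timeout=timeout) as resp:
        return parse_metric_lines(resp.read().decode("utf-8"))


# Offline example with made-up exposition text
sample = """\
# HELP tq_partitions_total Number of active partitions
# TYPE tq_partitions_total gauge
tq_partitions_total 3.0
tq_partition_samples_total{partition_id="train_0"} 128.0
"""
parsed = parse_metric_lines(sample)
print(parsed["tq_partitions_total"])
```

Against a live process, `fetch_metrics(tq.get_metrics_endpoint())` would return the same kind of dictionary, provided the endpoint string has the `host:port` shape assumed here.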

### 3. Import Grafana Dashboard

Import the pre-built dashboard JSON into your Grafana instance:

**[`scripts/grafana_dashboard.json`](../scripts/grafana_dashboard.json)**

Steps:
1. Open Grafana → Dashboards → Import
2. Upload the JSON file or paste its content
3. Select your Prometheus datasource
4. Done

## Configuration

| Config Key | Default | Description |
|------------|---------|-------------|
| `metrics.enabled` | `false` | Enable/disable the metrics exporter |
| `metrics.port` | `0` | HTTP port for `/metrics` endpoint (0 = OS auto-assign) |

| Environment Variable | Default | Description |
|---------------------|---------|-------------|
| `TQ_METRICS_COLLECT_INTERVAL` | `10` | Background collection interval (seconds) |
| `TQ_METRICS_STORAGE_TIMEOUT` | `5` | ZMQ timeout for storage unit queries (seconds) |
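The two environment variables are plain numeric overrides with defaults. How TransferQueue parses them internally is not shown here, so the following is only a sketch of the usual pattern; `read_float_env` is a hypothetical helper, and falling back to the default on an invalid value is an assumption.

```python
import os


def read_float_env(name: str, default: float) -> float:
    """Return the env var as a float, falling back to `default` if unset or invalid."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        # Assumption: an unparsable value silently falls back to the default
        return default


# Defaults taken from the table above
collect_interval = read_float_env("TQ_METRICS_COLLECT_INTERVAL", 10.0)
storage_timeout = read_float_env("TQ_METRICS_STORAGE_TIMEOUT", 5.0)
print(collect_interval, storage_timeout)
```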

## Architecture

```
┌───────────────────────────────────────────────────────────────────────┐
│                          Controller Process                           │
│                                                                       │
│  TransferQueueController                                              │
│       │                                                               │
│       │── snapshot push (every 10s) ──▶ TQMetricsExporter             │
│       │                                (role="controller")            │
│       │                                         │                     │
│       │                                         ├─ HTTP /metrics ◀─────── Prometheus
│       │                                         │                     │
│       │                                         └─ ZMQ GET_METRICS    │
│       │                                         │                     │
└───────┼─────────────────────────────────────────┼─────────────────────┘
        │                                         │
        ▼                                         ▼
┌───────────────────┐                   ┌───────────────────┐
│ SimpleStorageUnit │                   │ SimpleStorageUnit │
│                   │                   │                   │
│ TQMetricsExporter │                   │ TQMetricsExporter │
│ (role="storage")  │                   │ (role="storage")  │
│ HTTP /metrics   ◀─┼── Prometheus      │ HTTP /metrics     │
└───────────────────┘                   └───────────────────┘
```

- **Controller** (`role="controller"`) pushes plain-dict snapshots to its exporter (no lock contention). Its exporter also queries storage units via ZMQ for capacity/utilization and per-operation request stats.
- **Storage Units** (`role="storage"`) each run their own exporter with native Histogram/Counter metrics for request latency/throughput (PUT_DATA, GET_DATA, CLEAR_DATA).
- **Two scrape paths**: If Prometheus scrapes only the controller endpoint, storage request metrics are available as ZMQ-collected gauges. If Prometheus scrapes each storage unit directly, native histogram data provides more precise quantiles.
- Metrics are **role-prefixed**: the controller uses `tq_controller_request_*` and storage units use `tq_storage_request_*`, so the names never conflict.
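On the direct scrape path, quantiles are not stored; they are estimated at query time from cumulative histogram buckets (in PromQL, via `histogram_quantile`). The sketch below mimics that linear interpolation to show where the estimate comes from. `estimate_quantile` is a hypothetical helper, and the bucket bounds and counts are made up rather than taken from TransferQueue.

```python
import math


def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with (+inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # report the highest bound seen so far
                return lower_bound
            # Interpolate linearly inside the bucket that contains the rank
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Made-up cumulative counts for bounds of 1 ms, 10 ms, 100 ms, 1 s, +Inf
buckets = [(0.001, 10), (0.01, 60), (0.1, 95), (1.0, 100), (math.inf, 100)]
p50 = estimate_quantile(0.50, buckets)  # ≈ 0.0082 s
p99 = estimate_quantile(0.99, buckets)  # ≈ 0.82 s
print(p50, p99)
```

Because the result is interpolated within a bucket, its accuracy depends on how finely the bucket bounds cover the latency range of interest.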

## Metrics Reference

### Controller Process Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `tq_controller_uptime_seconds` | Gauge | (none) | Controller process uptime |
| `tq_controller_memory_rss_bytes` | Gauge | (none) | Controller RSS memory |

### Partition Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `tq_partitions_total` | Gauge | (none) | Number of active partitions |
| `tq_partition_samples_total` | Gauge | `partition_id` | Samples per partition |
| `tq_partition_production_progress` | Gauge | `partition_id`, `task_name` | Production progress (0.0–1.0) |
| `tq_partition_consumption_progress` | Gauge | `partition_id`, `task_name` | Consumption progress (0.0–1.0) |

### Index Manager Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `tq_global_index_allocated_total` | Gauge | (none) | Total allocated global indexes |
| `tq_global_index_reusable_total` | Gauge | (none) | Reusable global indexes |

### Request Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `tq_controller_request_total` | Counter | `op_type` | Total requests processed |
| `tq_controller_request_duration_seconds` | Histogram | `op_type` | Request latency (buckets: 1ms–5s) |
| `tq_controller_request_errors_total` | Counter | `op_type` | Total request errors |
| `tq_controller_request_samples_total` | Counter | `op_type` | Total samples processed per operation (for batch-aware accounting) |

### Storage Unit Metrics (collected via ZMQ, exposed on controller)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `tq_storage_capacity_total` | Gauge | `storage_unit_id` | Max storage capacity |
| `tq_storage_active_keys_total` | Gauge | `storage_unit_id` | Active keys in storage |
| `tq_storage_utilization_ratio` | Gauge | `storage_unit_id` | Utilization (active/capacity) |
| `tq_storage_memory_rss_bytes` | Gauge | `storage_unit_id` | Storage process RSS memory |
| `tq_storage_request_ops` | Gauge | `storage_unit_id`, `op_type` | Total requests processed by storage unit |
| `tq_storage_request_latency_avg` | Gauge | `storage_unit_id`, `op_type` | Average request latency (seconds) |
| `tq_storage_request_latency_p50` | Gauge | `storage_unit_id`, `op_type` | P50 request latency (seconds) |
| `tq_storage_request_latency_p99` | Gauge | `storage_unit_id`, `op_type` | P99 request latency (seconds) |

### Storage Unit Native Metrics (exposed on each storage unit's own endpoint)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `tq_storage_request_duration_seconds` | Histogram | `op_type` | Request latency (buckets: 1ms–5s) |
| `tq_storage_request_total` | Counter | `op_type` | Total requests processed |
| `tq_storage_request_errors_total` | Counter | `op_type` | Total request errors |
| `tq_storage_request_samples_total` | Counter | `op_type` | Total samples processed per operation |

> **Note on naming**: The ZMQ-collected request gauges on the controller (`tq_storage_request_ops`, `tq_storage_request_latency_*`) avoid the Prometheus reserved suffixes (`_total`, `_bucket`, `_sum`, `_count`, `_info`, `_created`) and the reserved `le` label. This prevents type metadata conflicts that break `label_values()` queries. P50/P99 are computed on the storage unit side and sent as pre-calculated values. The storage unit's own endpoint uses standard Counter/Histogram naming conventions.
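To make "pre-calculated values" concrete, here is a minimal sketch of how a storage unit could reduce a window of request latencies to the three gauges reported over ZMQ. The nearest-rank percentile method, the `summarize_latencies` name, and the zero fallback for an empty window are all assumptions; the actual computation in TransferQueue may differ.

```python
import math


def summarize_latencies(latencies: list[float]) -> dict[str, float]:
    """Return avg/p50/p99 (seconds) using the nearest-rank percentile method."""
    if not latencies:
        # Assumption: report zeros when no requests were observed
        return {"avg": 0.0, "p50": 0.0, "p99": 0.0}
    ordered = sorted(latencies)

    def nearest_rank(pct: int) -> float:
        # Smallest observed value with at least pct% of samples at or below it
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]

    return {
        "avg": sum(ordered) / len(ordered),
        "p50": nearest_rank(50),
        "p99": nearest_rank(99),
    }


# 100 made-up latencies: 1 ms, 2 ms, ..., 100 ms
stats = summarize_latencies([i / 1000 for i in range(1, 101)])
print(stats)
```

Percentiles computed this way are exact for the window they cover, but unlike histogram buckets they cannot be re-aggregated across storage units or time ranges afterwards.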

## Grafana Dashboard

The dashboard ([`scripts/grafana_dashboard.json`](../scripts/grafana_dashboard.json)) includes:

### Panels

| Section | Panels |
|---------|--------|
| **Controller Overview** | Uptime, RSS Memory, Active Partitions, Indexes Allocated, Reusable Indexes |
| **Request Throughput & Latency** | Controller Request Rate (ops/s), Controller Request Latency (repeats per quantile) |
| **Partition Status** | Samples per Partition, Production Progress, Consumption Progress |
| **Storage Units** | Utilization Bar Gauge, Active Keys, Capacity vs Active Keys, RSS Memory, Storage Request Rate, Storage Request Latency (repeats per quantile), Produced vs Cleared Samples/s, Active Keys Delta |

### Template Variables

| Variable | Type | Description |
|----------|------|-------------|
| `datasource` | Datasource | Prometheus datasource selector |
| `task_name` | Query | Filter Production/Consumption Progress panels by task |
| `op_type` | Custom | Filter request panels by operation (PUT_DATA, GET_DATA, CLEAR_DATA, etc.) |
| `quantile` | Custom | Filter latency panels by quantile (p50, p99) |

### Thresholds

- **Storage Utilization**: Green < 70%, Yellow 70–90%, Red > 90%
- **Controller RSS Memory**: Green < 2GB, Yellow 2–4GB, Red > 4GB
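The storage utilization bands can be reproduced outside Grafana, for example in an alerting script. The sketch below applies the same cut-offs to the utilization ratio (active keys divided by capacity). `utilization_color` is a hypothetical helper, and treating exactly 70% and exactly 90% as yellow is an assumption about how the boundaries are handled.

```python
def utilization_color(active_keys: int, capacity: int) -> str:
    """Classify storage utilization using the dashboard's threshold bands."""
    # Guard against a zero capacity to avoid division by zero
    ratio = active_keys / capacity if capacity > 0 else 0.0
    if ratio > 0.90:
        return "red"
    if ratio >= 0.70:
        return "yellow"
    return "green"


print(utilization_color(active_keys=950, capacity=1000))
```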

## Detecting Leaks: Produced vs Cleared

A common concern is whether consumed samples are being properly cleared from storage. The dashboard provides two panels for this:

### Produced vs Cleared Samples (per second)

Compares the **actual sample count** (not request count) between production and consumption:

- `rate(tq_controller_request_samples_total{op_type="NOTIFY_DATA_UPDATE"}[5m])`: samples produced/s
- `rate(tq_controller_request_samples_total{op_type="CLEAR_META"}[5m])`: samples cleared/s

PromQL `rate()` requires a range selector; the `[5m]` window shown here is illustrative.

> **Why sample count, not request rate?** A single `CLEAR_META` request can batch-clear hundreds of samples. Comparing request rates would be misleading.

| Observation | Meaning |
|-------------|---------|
| Two lines track closely | Production/consumption balanced, no leak |
| Produced consistently > Cleared | Samples accumulating, potential leak |
| Cleared spikes after Produced plateau | Batch consumer pattern (normal) |
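The comparison behind this panel is simple arithmetic on two counters. The sketch below computes an average per-second rate from the first and last sample of each counter and reports how fast the backlog grows. `per_second_rate` is a hypothetical helper, the sample values are made up, and unlike PromQL `rate()` it does not handle counter resets.

```python
def per_second_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase of a counter over (timestamp, value) samples.

    Simplified: uses only the first and last sample and ignores counter resets.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0) if t1 > t0 else 0.0


# Made-up counter readings taken 60 seconds apart
produced = [(0.0, 0.0), (60.0, 6000.0)]  # op_type="NOTIFY_DATA_UPDATE"
cleared = [(0.0, 0.0), (60.0, 4800.0)]   # op_type="CLEAR_META"

produced_rate = per_second_rate(produced)
cleared_rate = per_second_rate(cleared)
backlog_growth = produced_rate - cleared_rate  # samples accumulating per second
print(produced_rate, cleared_rate, backlog_growth)
```

A backlog growth that stays positive over many windows corresponds to the "Produced consistently > Cleared" row in the table.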

### Active Keys Delta

Shows `sum(tq_storage_active_keys_total)` over time:

| Observation | Meaning |
|-------------|---------|
| Stable or oscillating | Healthy steady-state |
| Monotonically increasing | Leak: keys are never freed |
| Approaching capacity | Imminent storage exhaustion |

### Quick Troubleshooting

1. **Active Keys rising?** → Check "Produced vs Cleared Samples": is CLEAR keeping up?
2. **CLEAR rate is zero?** → Consumer is not calling `clear_samples()` / `clear_partition()`
3. **CLEAR rate > 0 but keys still rising?** → Check Consumption Progress: is the consumer actually finishing before clearing?
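The three checks above can be read as a small decision procedure. The following sketch encodes them for a series of active-key readings and a measured clear rate. `triage` is a hypothetical helper, and interpreting "rising" as strictly increasing across all readings is an assumption made to keep the example short.

```python
def triage(active_keys: list[float], clear_rate: float) -> str:
    """Walk the troubleshooting steps for a series of active-key readings."""
    # Step 1: are active keys rising? (assumed to mean strictly increasing)
    rising = len(active_keys) >= 2 and all(
        later > earlier for earlier, later in zip(active_keys, active_keys[1:])
    )
    if not rising:
        return "healthy: active keys are stable or oscillating"
    # Step 2: no clearing at all points at the consumer not calling clear
    if clear_rate == 0:
        return "no clearing: consumer is not calling clear_samples() / clear_partition()"
    # Step 3: clearing happens but cannot keep up
    return "clearing but still rising: check Consumption Progress"


print(triage([100, 180, 260, 340], clear_rate=0.0))
```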

## Integration with `IntervalPerfMonitor`

When metrics are **disabled** (default), both the Controller and SimpleStorageUnit use `IntervalPerfMonitor`, a lightweight logger-based fallback that prints aggregated stats every 5 minutes.

When metrics are **enabled**, `TQMetricsExporter` replaces the perf monitor transparently (same `measure(op_type=...)` interface), providing Prometheus-native counters and histograms instead of log-based summaries.
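A shared `measure(op_type=...)` interface is what lets the two implementations be swapped without touching call sites. The sketch below shows one way such an interface could look, assuming `measure` is a context manager. Only the `measure(op_type=...)` call shape comes from this document; `SketchPerfMonitor` and its fields are hypothetical and are not the real `IntervalPerfMonitor` or `TQMetricsExporter`.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SketchPerfMonitor:
    """Hypothetical stand-in that records per-operation counts and durations."""

    def __init__(self) -> None:
        self.counts = defaultdict(int)
        self.errors = defaultdict(int)
        self.total_seconds = defaultdict(float)

    @contextmanager
    def measure(self, op_type: str):
        start = time.perf_counter()
        try:
            yield
        except Exception:
            # Count the failure, then let the caller see the exception
            self.errors[op_type] += 1
            raise
        finally:
            # Every call is counted and timed, whether it succeeded or not
            self.counts[op_type] += 1
            self.total_seconds[op_type] += time.perf_counter() - start


monitor = SketchPerfMonitor()
with monitor.measure(op_type="PUT_DATA"):
    sum(range(1000))  # placeholder for the real request handler

print(monitor.counts["PUT_DATA"], monitor.errors["PUT_DATA"])
```

A logger-based monitor and a Prometheus-backed exporter could both expose this method, with the difference confined to what happens to the recorded numbers.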

requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -5,4 +5,5 @@ hydra-core
 numpy<2.0.0
 msgspec
 psutil
-omegaconf
+omegaconf
+prometheus_client>=0.20.0
