|
| 1 | + |
| 2 | +# Prometheus Metrics & Grafana Dashboard |
| 3 | + |
| 4 | +> Last updated: 05/08/2026 |
| 5 | +
|
| 6 | +## Overview |
| 7 | + |
| 8 | +TransferQueue provides built-in Prometheus metrics exporting for both the **Controller** and **SimpleStorageUnit** processes. When enabled, each process exposes an HTTP `/metrics` endpoint that can be scraped by Prometheus, and a pre-built Grafana dashboard is provided for visualization. |
| 9 | + |
| 10 | +## Quick Start |
| 11 | + |
| 12 | +### 1. Enable Metrics in Config |
| 13 | + |
| 14 | +```yaml |
| 15 | +metrics: |
| 16 | +  enabled: true |
| 17 | +  port: 0 # 0 = auto-assign free port; set a fixed port for production |
| 18 | +``` |
| 19 | +
|
| 20 | +Or pass via `init()`: |
| 21 | + |
| 22 | +```python |
| 23 | +import transfer_queue as tq |
| 24 | +
|
| 25 | +tq.init({ |
| 26 | +    "metrics": { |
| 27 | +        "enabled": True, |
| 28 | +        "port": 9090,  # fixed port; 9090 is also the Prometheus server's default, so choose another if both share a host |
| 29 | +    } |
| 30 | +}) |
| 31 | +``` |
| 32 | + |
| 33 | +### 2. Discover the Endpoint |
| 34 | + |
| 35 | +```python |
| 36 | +endpoint = tq.get_metrics_endpoint() |
| 37 | +print(f"http://{endpoint}/metrics") |
| 38 | +``` |
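Once the endpoint is known, it can be read with the standard library alone. The sketch below assumes the endpoint serves the standard Prometheus text exposition format without per-sample timestamps; `parse_metrics` and `fetch_metrics` are illustrative helper names, not part of TransferQueue.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict[str, float]:
    """Map each series (name plus labels) to its value, skipping comments."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # blank, HELP, or TYPE line
            continue
        # Assumption: no trailing timestamp, so the last token is the value.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


def fetch_metrics(endpoint: str, timeout: float = 5.0) -> dict[str, float]:
    """Fetch and parse http://<endpoint>/metrics (endpoint is 'host:port')."""
    with urlopen(f"http://{endpoint}/metrics", timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Offline demonstration with a hand-written payload.
SAMPLE = """\
# HELP tq_partitions_total Number of active partitions
# TYPE tq_partitions_total gauge
tq_partitions_total 3.0
tq_partition_samples_total{partition_id="train_0"} 128.0
"""
samples = parse_metrics(SAMPLE)
print(samples["tq_partitions_total"])
```

Against a live process, `fetch_metrics(tq.get_metrics_endpoint())` would return the same kind of mapping.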
| 39 | + |
| 40 | +### 3. Import Grafana Dashboard |
| 41 | + |
| 42 | +Import the pre-built dashboard JSON into your Grafana instance: |
| 43 | + |
| 44 | +**[`scripts/grafana_dashboard.json`](../scripts/grafana_dashboard.json)** |
| 45 | + |
| 46 | +Steps: |
| 47 | +1. Open Grafana → Dashboards → Import |
| 48 | +2. Upload the JSON file or paste its content |
| 49 | +3. Select your Prometheus datasource |
| 50 | +4. Done |
| 51 | + |
| 52 | +## Configuration |
| 53 | + |
| 54 | +| Config Key | Default | Description | |
| 55 | +|------------|---------|-------------| |
| 56 | +| `metrics.enabled` | `false` | Enable/disable the metrics exporter | |
| 57 | +| `metrics.port` | `0` | HTTP port for `/metrics` endpoint (0 = OS auto-assign) | |
| 58 | + |
| 59 | +| Environment Variable | Default | Description | |
| 60 | +|---------------------|---------|-------------| |
| 61 | +| `TQ_METRICS_COLLECT_INTERVAL` | `10` | Background collection interval (seconds) | |
| 62 | +| `TQ_METRICS_STORAGE_TIMEOUT` | `5` | ZMQ timeout for storage unit queries (seconds) | |
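As a rough illustration of how the two environment variables and their documented defaults fit together, the sketch below reads them into a plain dict. How TransferQueue parses them internally is not shown in this document, so `read_metrics_env` and the float conversion are assumptions.

```python
import os
from typing import Mapping, Optional


def read_metrics_env(environ: Optional[Mapping[str, str]] = None) -> dict[str, float]:
    """Return the metrics tuning knobs, falling back to the documented defaults."""
    env = os.environ if environ is None else environ
    return {
        # Background collection interval in seconds (default 10).
        "collect_interval_s": float(env.get("TQ_METRICS_COLLECT_INTERVAL", "10")),
        # ZMQ timeout for storage unit queries in seconds (default 5).
        "storage_timeout_s": float(env.get("TQ_METRICS_STORAGE_TIMEOUT", "5")),
    }


defaults = read_metrics_env({})
overridden = read_metrics_env({"TQ_METRICS_COLLECT_INTERVAL": "30"})
print(defaults, overridden)
```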
| 63 | + |
| 64 | +## Architecture |
| 65 | + |
| 66 | +``` |
| 67 | +┌─────────────────────────────────────────────────────────────────┐ |
| 68 | +│                    Controller Process                           │ |
| 69 | +│                                                                 │ |
| 70 | +│  TransferQueueController                                        │ |
| 71 | +│       │                                                         │ |
| 72 | +│       │── snapshot push (every 10s) ──▶ TQMetricsExporter       │ |
| 73 | +│       │                                 (role="controller")     │ |
| 74 | +│       │                                   │                     │ |
| 75 | +│       │                                   ├─ HTTP /metrics ◀── Prometheus |
| 76 | +│       │                                   │                     │ |
| 77 | +│       │                                   └─ ZMQ GET_METRICS    │ |
| 78 | +│       │                                         │               │ |
| 79 | +└───────┼─────────────────────────────────────────┼───────────────┘ |
| 80 | +        │                                         │ |
| 81 | +        ▼                                         ▼ |
| 82 | +┌───────────────────┐                   ┌───────────────────┐ |
| 83 | +│ SimpleStorageUnit │                   │ SimpleStorageUnit │ |
| 84 | +│                   │                   │                   │ |
| 85 | +│ TQMetricsExporter │                   │ TQMetricsExporter │ |
| 86 | +│ (role="storage")  │                   │ (role="storage")  │ |
| 87 | +│ HTTP /metrics   ◀─┼── Prometheus      │ HTTP /metrics     │ |
| 88 | +└───────────────────┘                   └───────────────────┘ |
| 89 | +``` |
| 90 | +
|
| 91 | +- **Controller** (`role="controller"`) pushes plain-dict snapshots to its exporter (no lock contention). Its exporter also queries storage units via ZMQ for capacity/utilization and per-operation request stats. |
| 92 | +- **Storage Units** (`role="storage"`) each run their own exporter with native Histogram/Counter metrics for request latency/throughput (PUT_DATA, GET_DATA, CLEAR_DATA). |
| 93 | +- **Two scrape paths**: If Prometheus scrapes only the controller endpoint, storage request metrics are available via ZMQ-collected gauges. If Prometheus scrapes each storage unit directly, native histogram data provides more precise quantiles. |
| 94 | +- Metrics are **role-prefixed**: controller uses `tq_controller_request_*`, storage uses `tq_storage_request_*` — no naming conflicts. |
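To scrape both paths described above, Prometheus needs the controller endpoint and each storage unit endpoint as targets. One option, sketched here, is to write them in the JSON shape that Prometheus file-based service discovery (`file_sd_configs`) reads. The `role` target label, the helper name `build_file_sd`, and the example addresses are assumptions for illustration; this document does not show how storage unit endpoints are discovered.

```python
import json
from typing import Iterable


def build_file_sd(controller_endpoint: str, storage_endpoints: Iterable[str]) -> list[dict]:
    """Build file_sd target groups: one for the controller, one for storage units."""
    groups = [{"targets": [controller_endpoint], "labels": {"role": "controller"}}]
    storage = list(storage_endpoints)
    if storage:  # omit the group entirely when only the controller is scraped
        groups.append({"targets": storage, "labels": {"role": "storage"}})
    return groups


# Hypothetical host:port values; real ones come from the running processes.
targets = build_file_sd("10.0.0.1:9464", ["10.0.0.2:9465", "10.0.0.3:9466"])
print(json.dumps(targets, indent=2))
```

Writing this output to a file referenced by a `file_sd_configs` entry lets Prometheus pick up auto-assigned ports without editing its main configuration.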
| 95 | +
|
| 96 | +## Metrics Reference |
| 97 | +
|
| 98 | +### Controller Process Metrics |
| 99 | +
|
| 100 | +| Metric | Type | Labels | Description | |
| 101 | +|--------|------|--------|-------------| |
| 102 | +| `tq_controller_uptime_seconds` | Gauge | — | Controller process uptime | |
| 103 | +| `tq_controller_memory_rss_bytes` | Gauge | — | Controller RSS memory | |
| 104 | +
|
| 105 | +### Partition Metrics |
| 106 | +
|
| 107 | +| Metric | Type | Labels | Description | |
| 108 | +|--------|------|--------|-------------| |
| 109 | +| `tq_partitions_total` | Gauge | — | Number of active partitions | |
| 110 | +| `tq_partition_samples_total` | Gauge | `partition_id` | Samples per partition | |
| 111 | +| `tq_partition_production_progress` | Gauge | `partition_id`, `task_name` | Production progress (0.0–1.0) | |
| 112 | +| `tq_partition_consumption_progress` | Gauge | `partition_id`, `task_name` | Consumption progress (0.0–1.0) | |
| 113 | +
|
| 114 | +### Index Manager Metrics |
| 115 | +
|
| 116 | +| Metric | Type | Labels | Description | |
| 117 | +|--------|------|--------|-------------| |
| 118 | +| `tq_global_index_allocated_total` | Gauge | — | Total allocated global indexes | |
| 119 | +| `tq_global_index_reusable_total` | Gauge | — | Reusable global indexes | |
| 120 | +
|
| 121 | +### Request Metrics |
| 122 | +
|
| 123 | +| Metric | Type | Labels | Description | |
| 124 | +|--------|------|--------|-------------| |
| 125 | +| `tq_controller_request_total` | Counter | `op_type` | Total requests processed | |
| 126 | +| `tq_controller_request_duration_seconds` | Histogram | `op_type` | Request latency (buckets: 1ms–5s) | |
| 127 | +| `tq_controller_request_errors_total` | Counter | `op_type` | Total request errors | |
| 128 | +| `tq_controller_request_samples_total` | Counter | `op_type` | Total samples processed per operation (for batch-aware accounting) | |
| 129 | +
|
| 130 | +### Storage Unit Metrics (collected via ZMQ, exposed on controller) |
| 131 | +
|
| 132 | +| Metric | Type | Labels | Description | |
| 133 | +|--------|------|--------|-------------| |
| 134 | +| `tq_storage_capacity_total` | Gauge | `storage_unit_id` | Max storage capacity | |
| 135 | +| `tq_storage_active_keys_total` | Gauge | `storage_unit_id` | Active keys in storage | |
| 136 | +| `tq_storage_utilization_ratio` | Gauge | `storage_unit_id` | Utilization (active/capacity) | |
| 137 | +| `tq_storage_memory_rss_bytes` | Gauge | `storage_unit_id` | Storage process RSS memory | |
| 138 | +| `tq_storage_request_ops` | Gauge | `storage_unit_id`, `op_type` | Total requests processed by storage unit | |
| 139 | +| `tq_storage_request_latency_avg` | Gauge | `storage_unit_id`, `op_type` | Average request latency (seconds) | |
| 140 | +| `tq_storage_request_latency_p50` | Gauge | `storage_unit_id`, `op_type` | P50 request latency (seconds) | |
| 141 | +| `tq_storage_request_latency_p99` | Gauge | `storage_unit_id`, `op_type` | P99 request latency (seconds) | |
| 142 | +
|
| 143 | +### Storage Unit Native Metrics (exposed on each storage unit's own endpoint) |
| 144 | +
|
| 145 | +| Metric | Type | Labels | Description | |
| 146 | +|--------|------|--------|-------------| |
| 147 | +| `tq_storage_request_duration_seconds` | Histogram | `op_type` | Request latency (buckets: 1ms–5s) | |
| 148 | +| `tq_storage_request_total` | Counter | `op_type` | Total requests processed | |
| 149 | +| `tq_storage_request_errors_total` | Counter | `op_type` | Total request errors | |
| 150 | +| `tq_storage_request_samples_total` | Counter | `op_type` | Total samples processed per operation | |
| 151 | +
|
| 152 | +> **Note on naming**: The ZMQ-collected request gauges on the controller (`tq_storage_request_ops`, `tq_storage_request_latency_*`) avoid the reserved Prometheus suffixes (`_total`, `_bucket`, `_sum`, `_count`, `_info`, `_created`) and the reserved `le` label. This prevents type metadata conflicts that break `label_values()` queries. P50/P99 are computed on the storage unit side and sent as pre-calculated values. The storage unit's own endpoint uses standard Counter/Histogram naming conventions. |
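The suffix rule in the note can be checked mechanically. The sketch below is a small illustrative helper (`reserved_suffix` is not part of TransferQueue) that reports which reserved suffix, if any, a metric name ends with.

```python
from typing import Optional

# Suffixes listed in the naming note above.
RESERVED_SUFFIXES = ("_total", "_bucket", "_sum", "_count", "_info", "_created")


def reserved_suffix(metric_name: str) -> Optional[str]:
    """Return the reserved suffix the name ends with, or None if it has none."""
    for suffix in RESERVED_SUFFIXES:
        if metric_name.endswith(suffix):
            return suffix
    return None


# ZMQ-collected request gauge: no reserved suffix.
print(reserved_suffix("tq_storage_request_ops"))
# Native counter on the storage unit endpoint: standard `_total` suffix.
print(reserved_suffix("tq_storage_request_total"))
```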
| 153 | +
|
| 154 | +## Grafana Dashboard |
| 155 | +
|
| 156 | +The dashboard ([`scripts/grafana_dashboard.json`](../scripts/grafana_dashboard.json)) includes: |
| 157 | +
|
| 158 | +### Panels |
| 159 | +
|
| 160 | +| Section | Panels | |
| 161 | +|---------|--------| |
| 162 | +| **Controller Overview** | Uptime, RSS Memory, Active Partitions, Indexes Allocated, Reusable Indexes | |
| 163 | +| **Request Throughput & Latency** | Controller Request Rate (ops/s), Controller Request Latency (repeats per quantile) | |
| 164 | +| **Partition Status** | Samples per Partition, Production Progress, Consumption Progress | |
| 165 | +| **Storage Units** | Utilization Bar Gauge, Active Keys, Capacity vs Active Keys, RSS Memory, Storage Request Rate, Storage Request Latency (repeats per quantile), Produced vs Cleared Samples/s, Active Keys Delta | |
| 166 | +
|
| 167 | +### Template Variables |
| 168 | +
|
| 169 | +| Variable | Type | Description | |
| 170 | +|----------|------|-------------| |
| 171 | +| `datasource` | Datasource | Prometheus datasource selector | |
| 172 | +| `task_name` | Query | Filter Production/Consumption Progress panels by task | |
| 173 | +| `op_type` | Custom | Filter request panels by operation (PUT_DATA, GET_DATA, CLEAR_DATA, etc.) | |
| 174 | +| `quantile` | Custom | Filter latency panels by quantile (p50, p99) | |
| 175 | +
|
| 176 | +### Thresholds |
| 177 | +
|
| 178 | +- **Storage Utilization**: Green < 70%, Yellow 70–90%, Red > 90% |
| 179 | +- **Controller RSS Memory**: Green < 2GB, Yellow 2–4GB, Red > 4GB |
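For alerting outside Grafana, the storage utilization bands can be mirrored in a few lines. This is a sketch that follows the boundaries exactly as written above (70% and 90% both fall in Yellow); the dashboard JSON may treat the boundary values differently, and `utilization_color` is an illustrative name.

```python
def utilization_color(ratio: float) -> str:
    """Map a utilization ratio (0.0 to 1.0) to the documented threshold band."""
    if ratio < 0.70:
        return "green"
    if ratio <= 0.90:
        return "yellow"
    return "red"


for value in (0.50, 0.70, 0.90, 0.95):
    print(value, utilization_color(value))
```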
| 180 | +
|
| 181 | +## Detecting Leaks: Produced vs Cleared |
| 182 | +
|
| 183 | +A common concern is whether consumed samples are being properly cleared from storage. The dashboard provides two panels for this: |
| 184 | +
|
| 185 | +### Produced vs Cleared Samples (per second) |
| 186 | +
|
| 187 | +Compares the **actual sample count** (not request count) between production and clearing. The `[5m]` window below is illustrative; `rate()` needs a range selector, and the dashboard may use a different one: |
| 188 | +
|
| 189 | +- `rate(tq_controller_request_samples_total{op_type="NOTIFY_DATA_UPDATE"}[5m])`: samples produced/s |
| 190 | +- `rate(tq_controller_request_samples_total{op_type="CLEAR_META"}[5m])`: samples cleared/s |
| 191 | +
|
| 192 | +> **Why sample count, not request rate?** A single `CLEAR_META` request can batch-clear hundreds of samples. Comparing request rates would be misleading. |
| 193 | +
|
| 194 | +| Observation | Meaning | |
| 195 | +|-------------|---------| |
| 196 | +| Two lines track closely | Production/consumption balanced, no leak | |
| 197 | +| Produced consistently > Cleared | Samples accumulating — potential leak | |
| 198 | +| Cleared spikes after Produced plateau | Batch consumer pattern (normal) | |
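The comparison in the table can also be done by hand from two readings of each counter. The sketch below is a simplified stand-in for what PromQL `rate()` computes (it handles a counter reset by treating the new reading as the whole increase, and it does no extrapolation); the function name and sample numbers are made up for illustration.

```python
def per_second_rate(prev: float, curr: float, elapsed_s: float) -> float:
    """Average per-second increase of a counter between two readings."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta = curr - prev
    if delta < 0:  # counter went backwards: assume the process restarted
        delta = curr
    return delta / elapsed_s


# Two readings taken 60 seconds apart (hypothetical values).
produced_rate = per_second_rate(1000, 1600, 60.0)  # NOTIFY_DATA_UPDATE samples
cleared_rate = per_second_rate(900, 1200, 60.0)    # CLEAR_META samples
backlog_growth = produced_rate - cleared_rate      # > 0 means samples accumulate
print(produced_rate, cleared_rate, backlog_growth)
```

A backlog growth that stays positive across many windows corresponds to the "Produced consistently > Cleared" row.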
| 199 | +
|
| 200 | +### Active Keys Delta |
| 201 | +
|
| 202 | +Shows `sum(tq_storage_active_keys_total)` over time: |
| 203 | +
|
| 204 | +| Observation | Meaning | |
| 205 | +|-------------|---------| |
| 206 | +| Stable or oscillating | Healthy steady-state | |
| 207 | +| Monotonically increasing | Leak — keys are never freed | |
| 208 | +| Approaching capacity | Imminent storage exhaustion | |
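The three observations can be turned into a simple check over a series of `sum(tq_storage_active_keys_total)` readings. This is a sketch only: the 0.9 "near capacity" fraction is an assumption borrowed from the utilization threshold, and strict monotonic increase is a deliberately naive leak signal.

```python
from typing import Sequence


def classify_active_keys(series: Sequence[float], capacity: float,
                         near_capacity: float = 0.9) -> str:
    """Classify a time-ordered series of active-key counts."""
    if not series:
        raise ValueError("series must not be empty")
    if series[-1] >= near_capacity * capacity:
        return "approaching capacity"
    rising = len(series) >= 2 and all(b > a for a, b in zip(series, series[1:]))
    if rising:
        return "monotonically increasing"
    return "stable or oscillating"


print(classify_active_keys([10, 30, 20, 25], capacity=1000))
print(classify_active_keys([10, 20, 30], capacity=1000))
print(classify_active_keys([800, 950], capacity=1000))
```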
| 209 | +
|
| 210 | +### Quick Troubleshooting |
| 211 | +
|
| 212 | +1. **Active Keys rising?** → Check "Produced vs Cleared Samples" — is CLEAR keeping up? |
| 213 | +2. **CLEAR rate is zero?** → Consumer is not calling `clear_samples()` / `clear_partition()` |
| 214 | +3. **CLEAR rate > 0 but keys still rising?** → Check Consumption Progress — is the consumer actually finishing before clearing? |
| 215 | +
|
| 216 | +## Integration with `IntervalPerfMonitor` |
| 217 | +
|
| 218 | +When metrics are **disabled** (default), both the Controller and SimpleStorageUnit use `IntervalPerfMonitor` — a lightweight logger-based fallback that prints aggregated stats every 5 minutes. |
| 219 | +
|
| 220 | +When metrics are **enabled**, `TQMetricsExporter` replaces the perf monitor transparently (same `measure(op_type=...)` interface), providing Prometheus-native counters and histograms instead of log-based summaries. |
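To illustrate what a shared `measure(op_type=...)` interface might look like, here is a minimal stand-in recorder. It assumes `measure` is used as a context manager, which this document does not confirm; `SketchPerfRecorder` and its attributes are hypothetical and are not the actual `IntervalPerfMonitor` or `TQMetricsExporter` implementation.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SketchPerfRecorder:
    """Hypothetical recorder exposing a measure(op_type=...) context manager."""

    def __init__(self) -> None:
        self.count = defaultdict(int)            # requests per op_type
        self.errors = defaultdict(int)           # failed requests per op_type
        self.total_seconds = defaultdict(float)  # summed latency per op_type

    @contextmanager
    def measure(self, op_type: str):
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.errors[op_type] += 1
            raise
        finally:
            # Runs on both success and failure, so every request is counted.
            self.count[op_type] += 1
            self.total_seconds[op_type] += time.perf_counter() - start


recorder = SketchPerfRecorder()
with recorder.measure(op_type="PUT_DATA"):
    sum(range(1000))  # stand-in for real request handling

try:
    with recorder.measure(op_type="GET_DATA"):
        raise KeyError("missing sample")
except KeyError:
    pass

print(dict(recorder.count), dict(recorder.errors))
```

Because callers only touch `measure`, a logger-backed and a Prometheus-backed implementation could be swapped without changing call sites, which is the property the paragraph above describes.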