
Commit 67ad5d4

Add Prometheus Metrics documentation (#93)
1 parent b0e5961 commit 67ad5d4

14 files changed

Lines changed: 255 additions & 7 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -133,3 +133,4 @@ dmypy.json
 .vscode
 
 node_modules/
+.DS_Store

prometheus/README.md

Lines changed: 8 additions & 7 deletions
@@ -115,15 +115,16 @@ services:
 - "./prometheus_config.yml:/etc/prometheus/prometheus.yml" # Assumes prometheus_config.yml exists in your CWD
 ```
 
-## Available Metrics
+## Metrics
 
-The Prometheus extension exposes various LocalStack metrics through the `/_extension/metrics` endpoint, including:
-- Request counts by service
-- Request latencies
-- Resource utilization
-- Error rates
+The Prometheus extension exposes various LocalStack and system metrics through the `/_extension/metrics` endpoint.
 
-For a complete list of available metrics, visit the endpoint directly at `localhost.localstack.cloud:4566/_extension/metrics` when LocalStack is running.
+For a complete list of available metrics, view the:
+- [LocalStack Metrics documentation](./docs/localstack_metrics.md)
+- [System Metrics documentation](./docs/system_metrics.md)
+- Otherwise, visit the endpoint directly at `localhost.localstack.cloud:4566/_extension/metrics` when LocalStack is running.
+
+We've also included a [collection of PromQL queries](./docs/event_analysis.md) that are useful for analyzing LocalStack event source mappings performance.
 
 ## Licensing
 
prometheus/docs/event_analysis.md

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# PromQL Queries for Event Processing Statistics

The following queries can be used to analyse the performance of LocalStack's event processing capabilities.

## Average Propagation Delay from Event Source to Poller

The average amount of time a record waited before being processed, measured over the last 5 minutes. A high propagation delay indicates that the event pollers are taking too long to ingest new events from an event source.

```
rate(localstack_event_propagation_delay_seconds_sum[5m]) / rate(localstack_event_propagation_delay_seconds_count[5m])
```

Example:

![Average Propagation Delay](images/avg_propagation_delay.png)

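The `_sum` / `_count` division used in this kind of query can be illustrated with a small Python sketch. It assumes two hypothetical scrapes of the same metric taken at the start and end of the window; the sample values are made up, and the sketch ignores counter resets and the extrapolation that `rate()` performs. Because both rates are taken over the same window, the window length cancels out and the result is the mean observed value.

```python
def windowed_average(sum_start, sum_end, count_start, count_end):
    """Average observed value between two scrapes of a _sum/_count pair."""
    observations = count_end - count_start
    if observations == 0:
        # No observations in the window, so there is no meaningful average.
        return float("nan")
    return (sum_end - sum_start) / observations


# Hypothetical samples: 60 new observations totalling 30 seconds of delay.
avg_delay = windowed_average(sum_start=120.0, sum_end=150.0,
                             count_start=400, count_end=460)
print(avg_delay)  # 0.5
```

Here the 60 records observed during the window waited 30 seconds in total, so the average propagation delay is 0.5 seconds.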
## Batch Efficiency

A ratio showing how efficiently the pollers retrieve records from an event source relative to their maximum batch size, averaged over the last minute. A higher number indicates that batch sizes could be increased.

```
rate(localstack_batch_size_efficiency_ratio_sum[1m]) / rate(localstack_batch_size_efficiency_ratio_count[1m])
```

Example:

![Batch Efficiency Ratio](images/batch_efficiency_ratio.png)

## Records Per Poll

The average number of records pulled in by an event poller per poll, computed over the last minute. When used in conjunction with batch efficiency, this helps you interpret the performance of your batching configuration.

```
rate(localstack_records_per_poll_sum[1m]) / rate(localstack_records_per_poll_count[1m])
```

Example:

![Records Per Poll](images/records_per_poll.png)

## In-Flight Events

Gauges how many events are currently being processed by a target at a given point in time. If event processing is taking a long time, this is a good way of measuring back-pressure on the system.

```
localstack_in_flight_events
```

Example:

![In-Flight Events](images/in_flight_events.png)

## Event Processing Duration

The average time that targets spend processing an event, computed over the last minute.

```
rate(localstack_process_event_duration_seconds_sum[1m]) / rate(localstack_process_event_duration_seconds_count[1m])
```

Example:

![Event Processing Duration](images/event_processing_duration.png)

## High Latency Event Processing

Retrieves the 95th percentile of processing times over a 5-minute window, grouped by LocalStack service and operation. Useful for analysing the tail latency of event processing, since this is likely where performance bottlenecks start to show.

```
histogram_quantile(0.95, sum by(service, operation, le) (rate(localstack_request_processing_duration_seconds_bucket[5m])))
```

Example:

![High Latency Event Processing](images/high_latency_event_processing.png)

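To make the percentile estimate less opaque, here is a simplified Python sketch of how a quantile can be interpolated from cumulative histogram buckets. It is an illustration only, not Prometheus's implementation: the function name mirrors the PromQL one for readability, the bucket bounds and counts are invented, and edge cases such as negative bounds are not handled.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    if len(buckets) < 2:
        raise ValueError("need at least one finite bucket plus +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile lies beyond the last finite bound, so that
                # bound is the best available answer.
                return buckets[-2][0]
            # Linear interpolation inside the bucket containing the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count


# Hypothetical cumulative bucket counts for one service/operation pair.
example_buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example_buckets)
print(round(p95, 4))  # 0.8125
```

The estimate can only be as precise as the bucket boundaries allow: every observation between 0.5 s and 1.0 s is treated as evenly spread across that bucket.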
## Empty Poll Responses

The approximate number of empty poll responses per minute, averaged over the last 5 minutes.

```
rate(localstack_poll_miss_total[5m]) * 60
```

Example:

![Empty Poll Responses](images/empty_poll_responses.png)

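The `* 60` factor converts a per-second rate into a per-minute figure. A rough Python sketch of that arithmetic follows, assuming two hypothetical counter samples taken 5 minutes apart; the values are invented, and counter resets and the extrapolation done by `rate()` are ignored.

```python
def per_minute_rate(counter_start, counter_end, window_seconds):
    """Per-second increase of a counter over the window, scaled to per minute."""
    per_second = (counter_end - counter_start) / window_seconds
    return per_second * 60


# Hypothetical samples: 30 empty polls recorded over a 300-second window.
misses_per_minute = per_minute_rate(counter_start=1000, counter_end=1030,
                                    window_seconds=300)
print(round(misses_per_minute, 6))  # 6.0
```

The result is a per-minute average over the window, not the total number of empty polls seen in the 5 minutes.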
## Number of LocalStack Requests Processed

The average number of requests processed by the LocalStack gateway per minute. This is grouped by service (e.g. SQS) and operation (e.g. ReceiveMessage).

```
sum by(service, operation) (rate(localstack_request_processing_duration_seconds_count[1m]) * 60)
```

Example:

![Requests Processed](images/requests_processed.png)

## In-Flight Requests Against LocalStack Gateway

Sums the in-flight request gauge for the Kinesis, SQS, DynamoDB, and Lambda services over the last minute. Useful for seeing how hard a given service is currently being hit, and by which operation type.

```
sum_over_time(localstack_in_flight_requests{service=~"dynamodb|kinesis|sqs|lambda"}[1m])
```

Example:

![In-Flight Requests](images/in_flight_requests.png)
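
Because `sum_over_time` adds up every gauge sample in the window, its value depends on how many scrapes fall inside that window. The Python sketch below shows this with made-up samples and an assumed 15-second scrape interval, which is not stated in the source.

```python
def sum_over_time(samples):
    """Sum every gauge sample that falls inside the range window."""
    return sum(samples)


# Hypothetical scrapes of localstack_in_flight_requests within one minute,
# assuming a 15-second scrape interval (four samples per window).
samples = [3, 3, 3, 3]
window_sum = sum_over_time(samples)
window_avg = window_sum / len(samples)
print(window_sum)  # 12
print(window_avg)  # 3.0
```

With a steady 3 requests in flight the query returns 12, so the output is best read as a relative load indicator rather than a literal request count.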