
Commit 21aef38

CP-312094: Update rate limiting design document (#7000)
This update addresses some comments I've received on the design document. Most importantly, I add a new table for rate limiters which allows us to combine several callers under 1 rate limiter in a straightforward way.
2 parents 35055c0 + d1c6ce8 commit 21aef38

1 file changed

Lines changed: 144 additions & 41 deletions


doc/content/design/rate_limit.md

@@ -1,8 +1,8 @@
11
---
2-
title: XAPI Rate Limiting
2+
title: Rate Limiting
33
layout: default
44
design_doc: true
5-
revision: 1
5+
revision: 2
66
status: draft
77
---
88

@@ -11,19 +11,26 @@ status: draft
1111
- [Approach](#approach)
1212
- [Rate limiting](#rate-limiting)
1313
- [Client classification](#client-classification)
14+
- [Statistics](#statistics)
1415
- [API design](#api-design)
16+
- [Caller Datamodel](#caller-datamodel)
17+
- [API functions](#api-functions)
18+
- [Matching semantics](#matching-semantics)
19+
- [Caller lifecycle](#caller-lifecycle)
1520
- [XAPI integration](#xapi-integration)
1621
- [Library](#library)
1722
- [Token bucket](#token-bucket)
1823
- [Rate limit queue](#rate-limit-queue)
1924
- [Client table](#client-table)
25+
- [Operational Description](#operational-description)
26+
- [Pool Member / Multi-Host Considerations](#pool-member-multi-host-considerations)
2027
<!--toc:end-->
2128

2229
## Overview
2330

2431
We have had several customer incidents in the past that have been attributed to
25-
“overloading” xapi. This effectively means that a client is making requests at
26-
a rate that xapi cannot handle. This can result in very bad response times (“we
32+
“overloading” Xapi. This effectively means that a client is making requests at
33+
a rate that Xapi cannot handle. This can result in very bad response times (“we
2734
tried to shut down 20 VMs and this took 2 hours!”) and general system
2835
instability and unavailability.
2936

@@ -32,7 +39,7 @@ either misconfigured or make improper use of the API, hammering the pool and
3239
breaking use of the good guys. For example, a dodgy monitoring service may
3340
lock out the control software, or slow down VM lifecycle operations.
3441

35-
Part of the problem is that xapi and xenopsd are not very good at handling
42+
Part of the problem is that Xapi and xenopsd are not very good at handling
3643
load, in particular in a pool where the coordinator is often a bottleneck. A
3744
lot of work has already been done to make the Toolstack cope better under load.
3845
This is important and a lot more can be done in this space.
@@ -44,7 +51,7 @@ improvements, as a complimentary approach.
4451

4552
Last year, thread prioritisation was tried. This could be revisited, but we
4653
also need an approach that allows us to pose hard constraints on clients. For
47-
example, as an admin, I want to configure xapi to give my control panel
54+
example, as an admin, I want to configure Xapi to give my control panel
4855
unlimited access, but explicitly limit how much Monitoring App X can do.
4956

5057
The proposal here is to do a simpler kind of per-client rate limiting at the
@@ -91,60 +98,156 @@ potentially made on multiple connections.
9198
### Client classification
9299
In order to let pool administrators know who they should be rate limiting, we
93100
will also introduce a **Caller** datamodel class which tracks all requests made
94-
to xapi.
101+
to Xapi.
95102

96103
Callers will be a high-level way of tracking clients. We allow callers to be
97104
identified by a number of different parameters: AD user, IP address,
98-
originator, user agent. When an unknown caller makes a request to xapi, we
105+
originator, user agent. When an unknown caller makes a request to Xapi, we
99106
record their data in a new row. The pool administrator will be able to merge
100107
related callers together and assign them labels.
101108

102109
The caller classification allows wildcards for any field, though we require
103110
that at least one field be specified. This lets us, for example, combine all
104-
accesses from the xapi python API by specifying the user-agent .
111+
accesses from the Xapi Python API by specifying the user-agent.
105112

106-
In order to assist with rate limiting, we can store statistics about callers:
107-
- Last request timestamp
113+
A rate limiter can be associated with any number of callers, and the parameters
114+
of the rate limiter can either be derived from the usage patterns of the
115+
callers or selected from a number of preset profiles.
116+
117+
### Statistics
118+
In order to assist with rate limiting, we can store statistics about callers.
119+
We identify two kinds of statistics: volatile and stable. Volatile statistics
120+
change over time without any input, e.g. sliding windows. These will be stored
121+
in RRDs. By contrast, stable statistics only vary at most once per request, and
122+
so are safe to store in the main database.
123+
124+
**Volatile statistics:**
108125
- Tokens used over the last (5 minutes/hour/day).
109-
- Most common API requests.
126+
- Most common requests over the last (5 minutes/hour/day).
110127

111-
A rate limiter can then be attached to a particular caller, and the parameters
112-
of the rate limiter can either be derived from the usage patterns of the caller
113-
or selected from a number of preset profiles.
128+
**Stable statistics:**
129+
- Last request timestamp
114130

115131
## API design
116-
We propose two new datamodels: **Caller** and **Rate_limit**.
132+
133+
### Caller Datamodel
134+
We propose two new datamodel tables: **Caller**, which stores the data associated with each caller, and **Rate limit**, which associates one or more callers with a rate limiter.
117135

118136
Caller:
119137
| Mutability | Name | Type | Description |
120138
| ---------: | ----------- | -------- | ---------------------------------------------------|
121-
| RW | name_label | String | User-assigned label for the caller |
122-
| RO | user_agent | String | User agent of throttled client; use "*" for "any" |
123-
| RO | host_ip | String | IP address of throttled client; use "*" for "any" |
124-
| RO | last_access | DateTime | Last time the caller made a request |
125-
| RO | burst_size | Float | Amount of tokens that can be consumed in one burst; -1 if no rate limit is applied |
126-
| RO | fill_rate | Float | Tokens added to the bucket per second; -1 if no rate limit is applied |
127-
128-
129-
Matching semantics for `Caller` fields:
130-
- `user_agent` and `host_ip` are treated as match patterns, not just stored values.
131-
- The star (`"*"`) is a wildcard for that field (equivalent to “any”).
132-
- A star can be appended to the end of a field to match any string that starts with the prefix. We only allow prefix matching.
133-
- An incoming call is assigned to a `Caller` only if **all non-wildcard fields in that
134-
caller record match** (logical AND across fields).
135-
- Examples:
136-
- `{user_agent = "*", host_ip = "10.1.2.3"}` matches any user-agent from `10.1.2.3`.
137-
- `{user_agent = "Python-xmlrpc/*", host_ip = "*"}` matches any Python
138-
XML-RPC client from any IP.
139-
- `{user_agent = "Python-xmlrpc/3.11", host_ip = "10.1.2.3"}` matches only calls where both fields match.
140-
141-
Rate limits: We provide calls to set and remove a rate limiter. When no rate limiter is set, both parameters are negative.
139+
| RO | uuid | String | Unique identifier for the caller |
140+
| RW | name_label | String | User-assigned label for the caller |
141+
| RW | name_description | String | User-assigned description for the caller |
142+
| RO | user_agent | String | User agent matching pattern |
143+
| RO | client_ip | String | Client IP address matching pattern |
144+
| RO | last_access | DateTime | Last time the caller made a request |
145+
| SRW | groups | Set String | Set of labels used for cumulative metrics |
146+
| RO | rate_limit | Ref Rate_limit | Associated rate limiter - can be null |
147+
148+
Callers identify the origin of incoming requests, and they serve a dual purpose
149+
of metrics gathering and providing a target for rate limiting. We allow users to query
150+
the metrics for an individual caller, or for all callers belonging to a given
151+
group. A new caller record will be added automatically whenever a request from
152+
an unknown origin is made to Xapi, and callers can also be added manually by
153+
users.
154+
155+
Rate limit:
156+
| Mutability | Name | Type | Description |
157+
|--------: | --------------|-----------|------------------|
158+
| RO | uuid | String | Unique identifier for the rate limiter |
159+
| RW | name_label | String | User-assigned label for the rate limiter |
160+
| RW | name_description | String | User-assigned description for the rate limiter |
161+
| SRO | callers | Set (Ref Caller) | Callers associated with this rate limiter |
162+
| RO | burst_size | Float | Amount of tokens that can be consumed in one burst |
163+
| RO | fill_rate | Float | Tokens added to the bucket per second |
164+
165+
A rate limit can be applied to a group of callers, which then have a collective
166+
rate limit applied. Each caller can have at most one rate limiter applied,
167+
which is then stored in its `rate_limit` field. We have two distinct
168+
notions of groups here: rate limits store groups of callers, but groups of
169+
callers are also represented in their `groups` field. We do this to allow for a
170+
decoupling of rate limiting and data reporting, and to simplify the underlying
171+
code by storing direct references to objects where possible.
172+
173+
### API functions
174+
We define the following API functions for the caller datamodel:
175+
- `Caller.create(name_label, name_description, user_agent, client_ip)`: Create a new caller.
176+
- `Caller.set_name_label(caller, name_label)`: Set name label on the caller
177+
- `Caller.destroy(caller)`: Destroy the caller
178+
- `Caller.add_group(caller, group)`: Add caller to group
179+
- `Caller.remove_group(caller, group)`: Remove caller from group
180+
- `Caller.query_usage(caller, time_period)`: Obtain usage statistics for an individual caller
181+
- `Caller.query_group_usage(group, time_period)`: Obtain usage statistics for a group of callers
182+
183+
And the following functions for the rate limiter datamodel:
184+
- `Rate_limit.create(name_label, callers, burst_size, fill_rate)`: Create a
185+
rate limiter with the supplied parameters.
186+
- `Rate_limit.add_caller(rate_limit, caller)`: Add a caller to the callers set
187+
- `Rate_limit.remove_caller(rate_limit, caller)`: Remove a caller from the callers set
188+
- `Rate_limit.set_burst_size(rate_limit, burst_size)`: Set the burst size for the
189+
rate limiter
190+
- `Rate_limit.set_fill_rate(rate_limit, fill_rate)`: Set the fill rate for the
191+
rate limiter
192+
- `Rate_limit.destroy(rate_limit)`: Destroy the rate limiter - should also
193+
clear the `rate_limit` field from its associated callers
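
The relationship between the two tables can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the Xapi implementation: the record types, the `caller_label` field and the helper names are hypothetical, and rejecting a caller that already has a rate limiter is an assumption, since the text only says that each caller has at most one.

```ocaml
(* Hypothetical in-memory model of the Caller / Rate_limit association. *)
type rate_limit = {
  name_label: string;
  mutable callers: caller list;
  mutable burst_size: float;
  mutable fill_rate: float;
}

and caller = {caller_label: string; mutable rate_limit: rate_limit option}

(* Attach a caller. Assumption: a caller that already has a limiter is
   rejected, so the "at most one rate limiter" rule is kept. *)
let add_caller rl c =
  match c.rate_limit with
  | Some _ -> invalid_arg "caller already has a rate limiter"
  | None ->
      c.rate_limit <- Some rl;
      rl.callers <- c :: rl.callers

(* Detach a caller, but only if it really belongs to this limiter.
   Physical equality is used because the records reference each other. *)
let remove_caller rl c =
  match c.rate_limit with
  | Some r when r == rl ->
      rl.callers <- List.filter (fun x -> x != c) rl.callers;
      c.rate_limit <- None
  | _ -> ()

(* Destroying a limiter clears the rate_limit field of its callers. *)
let destroy rl =
  List.iter (fun c -> c.rate_limit <- None) rl.callers;
  rl.callers <- []
```

Keeping the reference on both sides mirrors the `callers` and `rate_limit` fields in the tables, at the cost of having to update both together.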
194+
195+
### Matching semantics
196+
When a request arrives, Xapi matches the request's metadata against the
197+
`Caller` table. Each field in a `Caller` record is a pattern matched against
198+
the corresponding field in the request. A pattern that ends with `*` is a
199+
prefix pattern: it matches iff the text before the `*` is a prefix of the
200+
request's field value. A pattern without `*` matches only a value that is
201+
exactly equal to it. A record matches a request iff all of its fields match.
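
The matching rule in this paragraph can be sketched as follows. This is a minimal illustration, not the code in the patch: `has_prefix`, `field_matches`, `record_matches` and the `key` record are hypothetical names chosen for the example.

```ocaml
(* Sketch only: all names below are hypothetical. *)

(* True iff [value] starts with [prefix]. *)
let has_prefix ~prefix value =
  let n = String.length prefix in
  String.length value >= n && String.equal (String.sub value 0 n) prefix

(* A pattern ending in '*' is a prefix pattern; any other pattern must be
   exactly equal to the value. *)
let field_matches ~pattern value =
  let n = String.length pattern in
  if n > 0 && pattern.[n - 1] = '*' then
    has_prefix ~prefix:(String.sub pattern 0 (n - 1)) value
  else String.equal pattern value

type key = {user_agent: string; client_ip: string}

(* A record matches a request iff every field matches (logical AND). *)
let record_matches ~(record : key) ~(request : key) =
  field_matches ~pattern:record.user_agent request.user_agent
  && field_matches ~pattern:record.client_ip request.client_ip
```

Under this reading a bare `*` is simply a prefix pattern with an empty prefix, so it matches any value without needing a special case.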
202+
203+
We treat logging and rate limiting differently:
204+
- **Logging**: All caller records that match with an incoming request track the
205+
request.
206+
- **Rate limiting**: Only the most specific match (which has rate limiting
207+
enabled) for any given request will trigger rate limiting and deduct tokens
208+
from its token bucket.
209+
210+
The **most specific match** is the matching `Caller` record with the longest
211+
total prefix length across all fields. For example, a record that matches
212+
`user_agent` with a full value is more specific than one that matches only a
213+
short prefix. We resolve ties through a lexicographic ordering amongst fields:
214+
IP address first, then user agent.
215+
216+
As a result, every matching caller record accumulates statistics for the
217+
request, but only the rate limiter of the most specific match is
218+
triggered.
219+
220+
Note that this means overlapping records define independent rate limiters
221+
rather than a shared budget. For example, if `foo*` is limited to 10 req/s and
222+
`foo` is limited to 5 req/s, then a request whose user-agent is exactly `foo`
223+
will only deduct from the `foo` bucket (the more specific match), while
224+
requests matching `foo*` but not `foo` (e.g. `foobar`) deduct from the `foo*`
225+
bucket. The total sustained rate across both groups can therefore reach 15
226+
req/s. This is the expected behaviour of two separate rate limiters; sharing a
227+
single budget across overlapping patterns with different limits is not
228+
supported.
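
The selection of the most specific match, including the `foo` / `foo*` example above, might look roughly like this. It is a sketch under assumptions: the `limited` flag and all function names are hypothetical, the tie-break is read as "compare the IP prefix length first", and an exact pattern is assumed to outrank a prefix pattern of the same literal length, which the `foo` / `foo*` example implies but the length rule alone does not state.

```ocaml
type caller = {
  user_agent: string;
  client_ip: string;
  limited: bool; (* hypothetical flag: a rate limiter is attached *)
}

let is_prefix_pattern p =
  let n = String.length p in
  n > 0 && p.[n - 1] = '*'

(* The pattern text without a trailing '*'. *)
let literal p =
  if is_prefix_pattern p then String.sub p 0 (String.length p - 1) else p

let field_matches ~pattern value =
  let lit = literal pattern in
  let n = String.length lit in
  if is_prefix_pattern pattern then
    String.length value >= n && String.equal (String.sub value 0 n) lit
  else String.equal lit value

let matches c ~user_agent ~client_ip =
  field_matches ~pattern:c.user_agent user_agent
  && field_matches ~pattern:c.client_ip client_ip

(* Ordering key, compared lexicographically: total literal length, then the
   number of exact fields (assumption), then the IP literal length. *)
let specificity c =
  let len p = String.length (literal p) in
  let exact p = if is_prefix_pattern p then 0 else 1 in
  ( len c.user_agent + len c.client_ip,
    exact c.user_agent + exact c.client_ip,
    len c.client_ip )

(* Among matching records with a limiter, keep the most specific one. *)
let most_specific callers ~user_agent ~client_ip =
  List.fold_left
    (fun best c ->
      if not (c.limited && matches c ~user_agent ~client_ip) then best
      else
        match best with
        | Some b when compare (specificity b) (specificity c) >= 0 -> best
        | _ -> Some c)
    None callers
```

With records for `foo*` and `foo`, a request from user-agent `foo` selects the `foo` record and a request from `foobar` selects `foo*`, so each request deducts from exactly one bucket.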
229+
230+
### Caller lifecycle
231+
- Users can proactively create callers with wildcards to identify groups of
232+
callers.
233+
- When a call comes in, a new Caller is automatically created if no existing
234+
**fully specified** record (no wildcards) matches. We want to store the details
235+
of all unique origins, even if they fall under a wildcard pattern.
236+
- Toolstack startup behavior: On toolstack startup, an in-memory data structure
237+
is created from the database fields which stores all the rate limiters and
238+
callers, as described in the implementation section. Token buckets are
239+
initialised full (i.e. each caller starts with a fresh burst budget) rather
240+
than being persisted across restarts. Volatile usage statistics surfaced via
241+
`Caller.query_usage` and `Caller.query_group_usage` are backed by RRDs, so
242+
historical data over the queried time period survives toolstack restarts.
243+
Stable statistics (e.g. `last_access`) are read from the database and are
244+
therefore also preserved.
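
The startup behaviour for token buckets can be pictured with a minimal bucket that is created full. This is a sketch, not the library described in the implementation section: `create` and `try_consume` are hypothetical names, and the current time is passed in explicitly only to keep the example deterministic.

```ocaml
(* Hypothetical token bucket; times are in seconds. *)
type bucket = {
  burst_size: float; (* maximum number of tokens the bucket can hold *)
  fill_rate: float; (* tokens added per second *)
  mutable tokens: float;
  mutable last_refill: float;
}

(* On startup a bucket is initialised full, so nothing is restored from the
   previous run. *)
let create ~burst_size ~fill_rate ~now =
  {burst_size; fill_rate; tokens = burst_size; last_refill = now}

(* Refill according to the elapsed time, capped at burst_size, then try to
   deduct [amount] tokens. Returns false if there are not enough tokens. *)
let try_consume b ~now ~amount =
  let elapsed = Float.max 0. (now -. b.last_refill) in
  b.tokens <- Float.min b.burst_size (b.tokens +. (elapsed *. b.fill_rate));
  b.last_refill <- now;
  if b.tokens >= amount then (
    b.tokens <- b.tokens -. amount;
    true)
  else false
```

Because the bucket starts at `burst_size`, a caller that had exhausted its budget before a restart gets a fresh burst afterwards, which is the trade-off the startup bullet accepts.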
142245

143246
## XAPI integration
144-
Calls into xapi are intercepted at two points:
247+
Calls into Xapi are intercepted at two points:
145248
- RPC calls are intercepted at dispatch, in the `do_dispatch` function within
146249
xapi/server_helpers.ml. At this point, we already have a session available if
147-
the caller is logged in, and we know which xapi call is being made.
250+
the caller is logged in, and we know which Xapi call is being made.
148251
- Other calls are intercepted by instrumenting the HTTP handlers as they are
149252
added to the HTTP server in the `add_handler` function within
150253
xapi/xapi_http.ml. Here we have less information, so the rate limiting is less
@@ -223,11 +326,11 @@ its function is executed by the rate limiter but may return a value.
223326

224327
### Client table
225328
We define a `Key` module to identify clients. This contains only user_agent and
226-
host_ip at present, but can be expanded to cover AD user and originator, for
329+
client_ip at present, but can be expanded to cover AD user and originator, for
227330
example.
228331
```ocaml
229332
module Key = struct
230-
type t = {user_agent: string; host_ip: string}
333+
type t = {user_agent: string; client_ip: string}
231334
```
232335

233336
This module implements a `match : t -> t -> bool` function, which treats empty
