Commit ef22b3e
docs: fix metrics endpoint description in server README (ggml-org#22879)
* docs: fix metrics endpoint description in server README

  Describe the required `model` query parameter for router mode.

  Removed metrics:
  - llamacpp:kv_cache_usage_ratio
  - llamacpp:kv_cache_tokens

  Added metrics:
  - llamacpp:prompt_seconds_total
  - llamacpp:tokens_predicted_seconds_total
  - llamacpp:n_decode_total
  - llamacpp:n_busy_slots_per_decode

* server: fix metrics type for n_busy_slots_per_decode metric
1 parent 68e7ea3 commit ef22b3e

2 files changed: 21 additions & 14 deletions


tools/server/README.md

Lines changed: 17 additions & 10 deletions
@@ -1043,16 +1043,23 @@ If query param `?fail_on_no_slot=1` is set, this endpoint will respond with stat
 
 This endpoint is only accessible if `--metrics` is set.
 
-Available metrics:
-- `llamacpp:prompt_tokens_total`: Number of prompt tokens processed.
-- `llamacpp:tokens_predicted_total`: Number of generation tokens processed.
-- `llamacpp:prompt_tokens_seconds`: Average prompt throughput in tokens/s.
-- `llamacpp:predicted_tokens_seconds`: Average generation throughput in tokens/s.
-- `llamacpp:kv_cache_usage_ratio`: KV-cache usage. `1` means 100 percent usage.
-- `llamacpp:kv_cache_tokens`: KV-cache tokens.
-- `llamacpp:requests_processing`: Number of requests processing.
-- `llamacpp:requests_deferred`: Number of requests deferred.
-- `llamacpp:n_tokens_max`: High watermark of the context size observed.
+In *router mode* the query param `?model={model_id}` has to be set. This endpoint will respond with status code 400 `model name is missing from the request` if not set.
+
+#### Available metrics
+
+| Metric | Type | Description |
+| ------ | ---------------------- | ----------- |
+| `llamacpp:prompt_tokens_total` | Counter | Number of prompt tokens processed. |
+| `llamacpp:prompt_seconds_total` | Counter | Prompt process time in seconds. |
+| `llamacpp:prompt_tokens_seconds` | Gauge | Average prompt throughput in tokens/s. |
+| `llamacpp:tokens_predicted_total` | Counter | Number of generation tokens processed. |
+| `llamacpp:tokens_predicted_seconds_total` | Counter | Predict process time in seconds. |
+| `llamacpp:predicted_tokens_seconds` | Gauge | Average generation throughput in tokens/s. |
+| `llamacpp:requests_processing` | Gauge | Number of requests processing. |
+| `llamacpp:requests_deferred` | Gauge | Number of requests deferred. |
+| `llamacpp:n_tokens_max` | Counter | High watermark of the context size observed. |
+| `llamacpp:n_decode_total` | Counter | Total number of llama_decode() calls. |
+| `llamacpp:n_busy_slots_per_decode` | Gauge | Average number of busy slots per llama_decode() call. |
 
 ### POST `/slots/{id_slot}?action=save`: Save the prompt cache of the specified slot to a file.
 
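The metrics documented in the new README table are served in the Prometheus text exposition format (counters and gauges). As a minimal sketch of how a client might consume them, not something from this commit, the snippet below parses that format into a name/value dict; the `parse_metrics` helper and the sample payload are hypothetical, and only the metric names come from the table above.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition output into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (# HELP / # TYPE annotations).
        if not line or line.startswith("#"):
            continue
        # Each sample line ends in a numeric value: "<name> <value>".
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Hypothetical payload using metric names from the README table above.
sample = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 1024
# TYPE llamacpp:n_busy_slots_per_decode gauge
llamacpp:n_busy_slots_per_decode 1.5
"""

print(parse_metrics(sample))
```

Note this sketch ignores label sets (`metric{label="x"}`); the metrics listed here are plain, unlabeled samples.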

tools/server/server-context.cpp

Lines changed: 4 additions & 4 deletions
@@ -3622,10 +3622,6 @@ void server_routes::init_routes() {
             {"name", "n_tokens_max"},
             {"help", "Largest observed n_tokens."},
             {"value", res_task->n_tokens_max}
-        }, {
-            {"name", "n_busy_slots_per_decode"},
-            {"help", "Average number of busy slots per llama_decode() call"},
-            {"value", (float) res_task->n_busy_slots_total / std::max((float) res_task->n_decode_total, 1.f)}
         }}},
         {"gauge", {{
             {"name", "prompt_tokens_seconds"},
@@ -3643,6 +3639,10 @@ void server_routes::init_routes() {
             {"name", "requests_deferred"},
             {"help", "Number of requests deferred."},
             {"value", (uint64_t) res_task->n_tasks_deferred}
+        },{
+            {"name", "n_busy_slots_per_decode"},
+            {"help", "Average number of busy slots per llama_decode() call"},
+            {"value", (float) res_task->n_busy_slots_total / std::max((float) res_task->n_decode_total, 1.f)}
         }}}
     };
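The relocated entry reports a running average, total busy slots divided by total `llama_decode()` calls, and the `std::max(..., 1.f)` clamp keeps the denominator at least 1 so a server that has not decoded yet reports 0 rather than dividing by zero. The same arithmetic can be sketched as follows (the function name is ours, not from the commit):

```python
def busy_slots_per_decode(n_busy_slots_total: int, n_decode_total: int) -> float:
    """Average busy slots per llama_decode() call.

    Mirrors the C++ expression above: clamping the denominator to at
    least 1 makes the gauge read 0.0 before any decode has happened,
    instead of raising a division-by-zero error.
    """
    return float(n_busy_slots_total) / max(float(n_decode_total), 1.0)

print(busy_slots_per_decode(10, 4))  # 2.5
print(busy_slots_per_decode(0, 0))   # 0.0, thanks to the clamp
```

Moving this entry from the `counter` group to the `gauge` group matches its semantics: an average can go down as well as up, which a Prometheus counter is not allowed to do.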
