This page documents Lemonade's llama.cpp-specific compatibility surface.
| Method | Endpoint | Description | Modality |
|---|---|---|---|
| POST | `/v1/reranking` | Reranking | query + documents -> relevance-scored documents |
| GET | `/v1/slots` | Returns the current slots processing state | slots state |
| POST | `/v1/slots/{id}?action=save` | Save the prompt cache of the specified slot to a file | prompt cache |
| POST | `/v1/slots/{id}?action=restore` | Restore the prompt cache of the specified slot from a file | prompt cache |
| POST | `/v1/slots/{id}?action=erase` | Erase the prompt cache of the specified slot | prompt cache |
| POST | `/v1/tokenize` | Tokenize a given text | tokenization |

Reranking API for llama.cpp-compatible reranker models. You provide a query and a list of documents, and receive relevance scores for each document. Lemonade will load the requested model automatically if it is not already loaded.
Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's `/v1/rerank` endpoint.
Note: This endpoint is only available for reranker-specific models using the `llamacpp` recipe, such as `bge-reranker-v2-m3-GGUF`.
=== "PowerShell"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/reranking" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{
"model": "bge-reranker-v2-m3-GGUF",
"query": "What is the capital of France?",
"documents": [
"Paris is the capital of France.",
"Berlin is the capital of Germany.",
"Madrid is the capital of Spain."
]
}' -UseBasicParsing
```
=== "Bash"
```bash
curl -X POST http://localhost:13305/v1/reranking \
-H "Content-Type: application/json" \
-d '{
"model": "bge-reranker-v2-m3-GGUF",
"query": "What is the capital of France?",
"documents": [
"Paris is the capital of France.",
"Berlin is the capital of Germany.",
"Madrid is the capital of Spain."
]
}'
```
{
"model": "bge-reranker-v2-m3-GGUF",
"object": "list",
"results": [
{
"index": 0,
"relevance_score": 8.60673713684082
},
{
"index": 1,
"relevance_score": -5.3886260986328125
},
{
"index": 2,
"relevance_score": -3.555561065673828
}
],
"usage": {
"prompt_tokens": 51,
"total_tokens": 51
}
}

Field Descriptions:

- `model` - Model identifier used for reranking
- `object` - Type of response object, always `"list"`
- `results` - Array of all input documents with relevance scores
  - `index` - Original index of the document in the input array
  - `relevance_score` - Relevance score assigned by the model; higher means more relevant
- `usage` - Token usage statistics
  - `prompt_tokens` - Number of tokens in the input
  - `total_tokens` - Total tokens processed

Note: Results are returned in input order. To rank documents by relevance, sort `results` by `relevance_score` in descending order on the client side.
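
The client-side sort described in the note can be sketched in Python as follows. This is a minimal illustration that reuses the sample response and documents shown above; the helper name `rank_documents` is hypothetical and not part of Lemonade.

```python
import json

# Sample response copied from the example above.
response_text = """
{
  "model": "bge-reranker-v2-m3-GGUF",
  "object": "list",
  "results": [
    {"index": 0, "relevance_score": 8.60673713684082},
    {"index": 1, "relevance_score": -5.3886260986328125},
    {"index": 2, "relevance_score": -3.555561065673828}
  ],
  "usage": {"prompt_tokens": 51, "total_tokens": 51}
}
"""

# The same documents, in the same order, that were sent in the request.
documents = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Madrid is the capital of Spain.",
]


def rank_documents(response, documents):
    """Return (document, score) pairs sorted from most to least relevant."""
    ordered = sorted(
        response["results"],
        key=lambda item: item["relevance_score"],
        reverse=True,
    )
    # "index" points back into the original input array.
    return [(documents[item["index"]], item["relevance_score"]) for item in ordered]


ranked = rank_documents(json.loads(response_text), documents)
for text, score in ranked:
    print(f"{score:8.3f}  {text}")
```

Keeping the original `documents` list around is what makes the `index` field useful, since the response itself does not echo the document text.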
Returns the current state of all processing slots in the llama.cpp server. Slots are parallel processing contexts that can handle multiple requests concurrently.
Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's `/slots` endpoint.
Note: This endpoint is only available when a llama.cpp model is loaded.
Note: This endpoint supports all four path prefixes: `/api/v0/slots`, `/api/v1/slots`, `/v0/slots`, and `/v1/slots`.
This endpoint accepts no parameters.
=== "PowerShell"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/slots" `
-Method GET -UseBasicParsing
```
=== "Bash"
```bash
curl http://localhost:13305/v1/slots
```
[
{
"id": 0,
"state": "idle",
"next_token": {
"has_next_token": false,
"n_remain": 0,
"n_decoded": 0
},
"task_id": -1,
"cache_tokens": 1024
},
{
"id": 1,
"state": "processing",
"next_token": {
"has_next_token": true,
"n_remain": 42,
"n_decoded": 15
},
"task_id": 123,
"cache_tokens": 512
}
]

Field Descriptions:

- `id` - Unique identifier for the slot
- `state` - Current processing state ("idle", "processing", etc.)
- `next_token` - Information about token generation state
  - `has_next_token` - Whether more tokens are expected
  - `n_remain` - Number of tokens remaining to generate
  - `n_decoded` - Number of tokens already decoded
- `task_id` - Identifier of the current task being processed (-1 if idle)
- `cache_tokens` - Number of cached tokens in the slot's prompt cache

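
As a rough illustration of how a client might use these fields, the Python sketch below picks out idle slots and totals the cached tokens. It runs against the sample response above rather than a live server, and it assumes the field names and the `"idle"` state string appear exactly as in that sample; the helper names are hypothetical.

```python
import json

# Sample response copied from the example above.
slots_text = """
[
  {"id": 0, "state": "idle",
   "next_token": {"has_next_token": false, "n_remain": 0, "n_decoded": 0},
   "task_id": -1, "cache_tokens": 1024},
  {"id": 1, "state": "processing",
   "next_token": {"has_next_token": true, "n_remain": 42, "n_decoded": 15},
   "task_id": 123, "cache_tokens": 512}
]
"""


def idle_slot_ids(slots):
    """Return the IDs of slots whose state is "idle" (assumed state string)."""
    return [slot["id"] for slot in slots if slot["state"] == "idle"]


def total_cached_tokens(slots):
    """Sum the prompt-cache token counts across all slots."""
    return sum(slot["cache_tokens"] for slot in slots)


slots = json.loads(slots_text)
print("idle slots:", idle_slot_ids(slots))
print("cached tokens:", total_cached_tokens(slots))
```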
Save the prompt cache of a specific slot to a file. This allows you to persist the current context state for later restoration.
Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's `/slots/{id}?action=save` endpoint.
Note: The llama.cpp server must be started with the `--slot-save-path` argument for save operations to work. See Server Configuration for details on configuring backend arguments.

Example configuration:

lemonade config set llamacpp.args="--slot-save-path /path/to/slot/saves"

Note: This endpoint supports all four path prefixes: `/api/v0/slots/{id}`, `/api/v1/slots/{id}`, `/v0/slots/{id}`, and `/v1/slots/{id}`.

| Parameter | Required | Description | Status |
|---|---|---|---|
| `id` | Yes | The slot ID to save (path parameter). | |
| `filename` | Yes | The filename where the slot cache should be saved (JSON body). | |

=== "PowerShell"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/slots/0?action=save" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
```
=== "PowerShell (/api/v1)"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/api/v1/slots/0?action=save" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
```
=== "Bash"
```bash
curl -X POST "http://localhost:13305/v1/slots/0?action=save" \
-H "Content-Type: application/json" \
-d '{"filename": "my_conversation_cache.bin"}'
```
{
"id_slot": 0,
"filename": "my_conversation_cache.bin",
"n_saved": 1024
}

Field Descriptions:

- `id_slot` - The slot ID that was saved
- `filename` - The filename where the cache was saved
- `n_saved` - Number of tokens saved to the cache file

Restore the prompt cache of a specific slot from a previously saved file. This allows you to resume a conversation or context from where you left off.
Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's `/slots/{id}?action=restore` endpoint.
Note: The llama.cpp server must be started with the `--slot-save-path` argument for restore operations to work.
Note: This endpoint supports all four path prefixes: `/api/v0/slots/{id}`, `/api/v1/slots/{id}`, `/v0/slots/{id}`, and `/v1/slots/{id}`.

| Parameter | Required | Description | Status |
|---|---|---|---|
| `id` | Yes | The slot ID to restore to (path parameter). | |
| `filename` | Yes | The filename from which to restore the slot cache (JSON body). | |

=== "PowerShell"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/slots/0?action=restore" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
```
=== "PowerShell (/api/v1)"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/api/v1/slots/0?action=restore" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
```
=== "Bash"
```bash
curl -X POST "http://localhost:13305/v1/slots/0?action=restore" \
-H "Content-Type: application/json" \
-d '{"filename": "my_conversation_cache.bin"}'
```
{
"id_slot": 0,
"filename": "my_conversation_cache.bin",
"n_restored": 1024
}

Field Descriptions:

- `id_slot` - The slot ID that was restored
- `filename` - The filename from which the cache was restored
- `n_restored` - Number of tokens restored from the cache file

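
The save and restore calls share the same shape: a POST to `/v1/slots/{id}` with an `action` query parameter and a JSON body carrying `filename`. The Python sketch below builds (but does not send) both requests using only the standard library. The helper name `build_slot_request` is hypothetical, and the host and port are taken from the examples above.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:13305/v1"  # host and port from the examples above


def build_slot_request(slot_id, action, filename):
    """Build a POST request that saves or restores a slot's prompt cache."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action}")
    if not filename:
        raise ValueError(f"{action} requires a filename")
    query = urllib.parse.urlencode({"action": action})
    url = f"{BASE_URL}/slots/{slot_id}?{query}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


save_req = build_slot_request(0, "save", "my_conversation_cache.bin")
restore_req = build_slot_request(0, "restore", "my_conversation_cache.bin")
print(save_req.get_method(), save_req.full_url)
print(restore_req.get_method(), restore_req.full_url)

# To send a request against a running server (not executed here):
# with urllib.request.urlopen(save_req) as resp:
#     print(json.load(resp))
```

Building the request separately from sending it makes it easy to inspect the URL and body before touching a slot's cache.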
Erase (clear) the prompt cache of a specific slot. This removes all cached context from the slot, resetting it to an empty state.
Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's `/slots/{id}?action=erase` endpoint.
Note: This endpoint supports all four path prefixes: `/api/v0/slots/{id}`, `/api/v1/slots/{id}`, `/v0/slots/{id}`, and `/v1/slots/{id}`.

| Parameter | Required | Description | Status |
|---|---|---|---|
| `id` | Yes | The slot ID to erase (path parameter). | |

=== "PowerShell"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/slots/0?action=erase" `
-Method POST -UseBasicParsing
```
=== "PowerShell (/api/v1)"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/api/v1/slots/0?action=erase" `
-Method POST -UseBasicParsing
```
=== "Bash"
```bash
curl -X POST "http://localhost:13305/v1/slots/0?action=erase"
```
{
"id_slot": 0
}

Field Descriptions:

- `id_slot` - The slot ID that was erased

Note: If the server returns an error, it may indicate that the slot was not found or that the operation failed.
Tokenize a given text. Tokenizing text this way does not count towards the current model's context window.
Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's `/tokenize` endpoint.
Note: This endpoint supports all four path prefixes: `/api/v0/tokenize`, `/api/v1/tokenize`, `/v0/tokenize`, and `/v1/tokenize`.
Note: Actual response values may vary for the same string across different models if the models do not share the same tokenizer.
=== "PowerShell"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/v1/tokenize" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{"content": "This is a string to tokenize"}' -UseBasicParsing
```
=== "PowerShell (/api/v1)"
```powershell
Invoke-WebRequest `
-Uri "http://localhost:13305/api/v1/tokenize" `
-Method POST `
-Headers @{ "Content-Type" = "application/json" } `
-Body '{"content": "This is a string to tokenize"}' -UseBasicParsing
```
=== "Bash"
```bash
curl -X POST "http://localhost:13305/v1/tokenize" \
-H "Content-Type: application/json" \
-d '{"content": "This is a string to tokenize"}'
```
{
"tokens": [1919,369,264,886,310,74995]
}

If `with_pieces` is true:

{
"tokens": [
{"id": 123, "piece": "Hello"},
{"id": 456, "piece": " world"},
{"id": 789, "piece": "!"}
]
}

Field Descriptions:

- `tokens` - Array of token IDs, or an array of objects with `id` and `piece` fields when `with_pieces` is true
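
Because `tokens` can hold either plain IDs or `id`/`piece` objects, client code may want to normalize both shapes. The Python sketch below does this against the two sample responses above; the helper names `token_ids` and `pieces_text` are hypothetical.

```python
# Sample responses copied from the examples above.
plain_response = {"tokens": [1919, 369, 264, 886, 310, 74995]}
pieces_response = {
    "tokens": [
        {"id": 123, "piece": "Hello"},
        {"id": 456, "piece": " world"},
        {"id": 789, "piece": "!"},
    ]
}


def token_ids(response):
    """Return a flat list of token IDs for either response shape."""
    ids = []
    for token in response["tokens"]:
        # With with_pieces, each entry is an object carrying "id" and "piece".
        ids.append(token["id"] if isinstance(token, dict) else token)
    return ids


def pieces_text(response):
    """Join the "piece" strings; only meaningful for with_pieces responses."""
    return "".join(
        token["piece"] for token in response["tokens"] if isinstance(token, dict)
    )


print(len(token_ids(plain_response)), "tokens")
print(token_ids(pieces_response), repr(pieces_text(pieces_response)))
```

Counting the normalized IDs gives the token count for a string under the loaded model's tokenizer, regardless of which response shape was requested.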