llama.cpp-Specific API

This page documents Lemonade's llama.cpp-specific compatibility surface.

Summary

Method	Endpoint	Description	Modality
`POST`	`/v1/reranking`	Reranking	query + documents -> relevance-scored documents
`GET`	`/v1/slots`	Returns the current slots processing state	slots state
`POST`	`/v1/slots/{id}?action=save`	Save the prompt cache of the specified slot to a file	prompt cache
`POST`	`/v1/slots/{id}?action=restore`	Restore the prompt cache of the specified slot from a file	prompt cache
`POST`	`/v1/slots/{id}?action=erase`	Erase the prompt cache of the specified slot	prompt cache
`POST`	`/v1/tokenize`	Tokenize a given text	tokenization

`POST /v1/reranking`

Reranking API for llama.cpp-compatible reranker models. You provide a query and a list of documents, and receive relevance scores for each document. Lemonade will load the requested model automatically if it is not already loaded.

Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's /v1/rerank endpoint.

Note: This endpoint is only available for reranker-specific models using the llamacpp recipe, such as bge-reranker-v2-m3-GGUF.

Parameters

Parameter	Required	Description
`query`	Yes	The search query text.
`documents`	Yes	Array of document strings to score against the query.
`model`	Yes	The reranking model to use. If not already loaded, Lemonade loads it before forwarding the request.

Example request

=== "PowerShell"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/v1/reranking" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{
    "model": "bge-reranker-v2-m3-GGUF",
    "query": "What is the capital of France?",
    "documents": [
      "Paris is the capital of France.",
      "Berlin is the capital of Germany.",
      "Madrid is the capital of Spain."
    ]
  }' -UseBasicParsing
```

=== "Bash"

```bash
curl -X POST http://localhost:13305/v1/reranking \
  -H "Content-Type: application/json" \
  -d '{
        "model": "bge-reranker-v2-m3-GGUF",
        "query": "What is the capital of France?",
        "documents": [
          "Paris is the capital of France.",
          "Berlin is the capital of Germany.",
          "Madrid is the capital of Spain."
        ]
      }'
```

Response format

{
  "model": "bge-reranker-v2-m3-GGUF",
  "object": "list",
  "results": [
    {
      "index": 0,
      "relevance_score": 8.60673713684082
    },
    {
      "index": 1,
      "relevance_score": -5.3886260986328125
    },
    {
      "index": 2,
      "relevance_score": -3.555561065673828
    }
  ],
  "usage": {
    "prompt_tokens": 51,
    "total_tokens": 51
  }
}

Field Descriptions:

model - Model identifier used for reranking
object - Type of response object, always "list"
results - Array of all input documents with relevance scores
- index - Original index of the document in the input array
- relevance_score - Relevance score assigned by the model; higher means more relevant
usage - Token usage statistics
- prompt_tokens - Number of tokens in the input
- total_tokens - Total tokens processed

Note: Results are returned in input order. To rank documents by relevance, sort results by relevance_score in descending order on the client side.

`GET /v1/slots`

Returns the current state of all processing slots in the llama.cpp server. Slots are parallel processing contexts that can handle multiple requests concurrently.

Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's /slots endpoint.

Note: This endpoint is only available when a llama.cpp model is loaded.

Note: This endpoint supports all four path prefixes: /api/v0/slots, /api/v1/slots, /v0/slots, and /v1/slots.

Parameters

This endpoint accepts no parameters.

Example request

=== "PowerShell"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/v1/slots" `
  -Method GET -UseBasicParsing
```

=== "Bash"

```bash
curl http://localhost:13305/v1/slots
```

Response format

[
  {
    "id": 0,
    "state": "idle",
    "next_token": {
      "has_next_token": false,
      "n_remain": 0,
      "n_decoded": 0
    },
    "task_id": -1,
    "cache_tokens": 1024
  },
  {
    "id": 1,
    "state": "processing",
    "next_token": {
      "has_next_token": true,
      "n_remain": 42,
      "n_decoded": 15
    },
    "task_id": 123,
    "cache_tokens": 512
  }
]

Field Descriptions:

id - Unique identifier for the slot
state - Current processing state ("idle", "processing", etc.)
next_token - Information about token generation state
- has_next_token - Whether more tokens are expected
- n_remain - Number of tokens remaining to generate
- n_decoded - Number of tokens already decoded
task_id - Identifier of the current task being processed (-1 if idle)
cache_tokens - Number of cached tokens in the slot's prompt cache

`POST /v1/slots/{id}?action=save`

Save the prompt cache of a specific slot to a file. This allows you to persist the current context state for later restoration.

Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's /slots/{id}?action=save endpoint.

Note: The llama.cpp server must be started with the --slot-save-path argument for save operations to work. See Server Configuration for details on configuring backend arguments.

Example configuration:
lemonade config set llamacpp.args="--slot-save-path /path/to/slot/saves"

Note: This endpoint supports all four path prefixes: /api/v0/slots/{id}, /api/v1/slots/{id}, /v0/slots/{id}, and /v1/slots/{id}.

Parameters

Parameter	Required	Description	Status
`id`	Yes	The slot ID to save (path parameter).
`filename`	Yes	The filename where the slot cache should be saved (JSON body).

Example request

=== "PowerShell"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/v1/slots/0?action=save" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
```

=== "PowerShell (/api/v1)"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/api/v1/slots/0?action=save" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
```

=== "Bash"

```bash
curl -X POST "http://localhost:13305/v1/slots/0?action=save" \
  -H "Content-Type: application/json" \
  -d '{"filename": "my_conversation_cache.bin"}'
```

Response format

{
  "id_slot": 0,
  "filename": "my_conversation_cache.bin",
  "n_saved": 1024
}

Field Descriptions:

id_slot - The slot ID that was saved
filename - The filename where the cache was saved
n_saved - Number of tokens saved to the cache file

`POST /v1/slots/{id}?action=restore`

Restore the prompt cache of a specific slot from a previously saved file. This allows you to resume a conversation or context from where you left off.

Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's /slots/{id}?action=restore endpoint.

Note: The llama.cpp server must be started with the --slot-save-path argument for restore operations to work.

Note: This endpoint supports all four path prefixes: /api/v0/slots/{id}, /api/v1/slots/{id}, /v0/slots/{id}, and /v1/slots/{id}.

Parameters

Parameter	Required	Description	Status
`id`	Yes	The slot ID to restore to (path parameter).
`filename`	Yes	The filename from which to restore the slot cache (JSON body).

Example request

=== "PowerShell"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/v1/slots/0?action=restore" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
```

=== "PowerShell (/api/v1)"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/api/v1/slots/0?action=restore" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{"filename": "my_conversation_cache.bin"}' -UseBasicParsing
```

=== "Bash"

```bash
curl -X POST "http://localhost:13305/v1/slots/0?action=restore" \
  -H "Content-Type: application/json" \
  -d '{"filename": "my_conversation_cache.bin"}'
```

Response format

{
  "id_slot": 0,
  "filename": "my_conversation_cache.bin",
  "n_restored": 1024
}

Field Descriptions:

id_slot - The slot ID that was restored
filename - The filename from which the cache was restored
n_restored - Number of tokens restored from the cache file

`POST /v1/slots/{id}?action=erase`

Erase (clear) the prompt cache of a specific slot. This removes all cached context from the slot, resetting it to an empty state.

Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's /slots/{id}?action=erase endpoint.

Note: This endpoint supports all four path prefixes: /api/v0/slots/{id}, /api/v1/slots/{id}, /v0/slots/{id}, and /v1/slots/{id}.

Parameters

Parameter	Required	Description	Status
`id`	Yes	The slot ID to erase (path parameter).

Example request

=== "PowerShell"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/v1/slots/0?action=erase" `
  -Method POST -UseBasicParsing
```

=== "PowerShell (/api/v1)"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/api/v1/slots/0?action=erase" `
  -Method POST -UseBasicParsing
```

=== "Bash"

```bash
curl -X POST "http://localhost:13305/v1/slots/0?action=erase"
```

Response format

{
  "id_slot": 0
}

Field Descriptions:

id_slot - The slot ID that was erased

Note: If the server returns an error, it may indicate that the slot was not found or that the operation failed.

`POST /v1/tokenize`

Tokenize a given text. Does not count towards the current model's context window.

Note: This endpoint is part of Lemonade's llama.cpp compatibility layer. Internally, Lemonade forwards the request to llama.cpp's /tokenize endpoint.

Note: This endpoint supports all four path prefixes: /api/v0/tokenize, /api/v1/tokenize, /v0/tokenize, and /v1/tokenize.

Note: Actual response values may vary for the same string across different models if the models do not share the same tokenizer.

Parameters

Parameter	Required	Description
`content`	Yes	The text to tokenize.
`add_special`	No	Boolean indicating if special tokens, i.e. `BOS`, should be inserted. Default: `false`
`parse_special`	No	Boolean indicating if special tokens should be tokenized. When `false` special tokens are treated as plaintext. Default: `true`
`with_pieces`	No	Boolean indicating whether to return token pieces along with IDs. Default: `false`

Example request

=== "PowerShell"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/v1/tokenize" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{"content": "This is a string to tokenize"}' -UseBasicParsing
```

=== "PowerShell (/api/v1)"

```powershell
Invoke-WebRequest `
  -Uri "http://localhost:13305/api/v1/tokenize" `
  -Method POST `
  -Headers @{ "Content-Type" = "application/json" } `
  -Body '{"content": "This is a string to tokenize"}' -UseBasicParsing
```

=== "Bash"

```bash
curl -X POST "http://localhost:13305/v1/tokenize" \
  -H "Content-Type: application/json" \
  -d '{"content": "This is a string to tokenize"}'
```

Response format

{
  "tokens": [1919,369,264,886,310,74995]
}

If with_pieces is true:

{
  "tokens": [
    {"id": 123, "piece": "Hello"},
    {"id": 456, "piece": " world"},
    {"id": 789, "piece": "!"}
  ]
}

Field Descriptions:

tokens - Array of token IDs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama.cpp-Specific API

Summary

`POST /v1/reranking`

Parameters

Example request

Response format

`GET /v1/slots`

Parameters

Example request

Response format

`POST /v1/slots/{id}?action=save`

Parameters

Example request

Response format

`POST /v1/slots/{id}?action=restore`

Parameters

Example request

Response format

`POST /v1/slots/{id}?action=erase`

Parameters

Example request

Response format

`POST /v1/tokenize`

Parameters

Example request

Response format

FilesExpand file tree

llamacpp.md

Latest commit

History

llamacpp.md

File metadata and controls

llama.cpp-Specific API

Summary

POST /v1/reranking

Parameters

Example request

Response format

GET /v1/slots

Parameters

Example request

Response format

POST /v1/slots/{id}?action=save

Parameters

Example request

Response format

POST /v1/slots/{id}?action=restore

Parameters

Example request

Response format

POST /v1/slots/{id}?action=erase

Parameters

Example request

Response format

POST /v1/tokenize

Parameters

Example request

Response format

`POST /v1/reranking`

`GET /v1/slots`

`POST /v1/slots/{id}?action=save`

`POST /v1/slots/{id}?action=restore`

`POST /v1/slots/{id}?action=erase`

`POST /v1/tokenize`