docs/config.json
+54 lines changed: 54 additions & 0 deletions
@@ -363,6 +363,46 @@
   "title": "CORSConfiguration",
   "type": "object"
 },
+"CompactionConfiguration": {
+  "additionalProperties": false,
+  "description": "Configuration for conversation history compaction.\n\nCompaction summarizes older conversation turns when their estimated\ntoken count approaches the context window limit, keeping the\nconversation usable instead of failing with HTTP 413. The\nconfiguration here controls when compaction triggers and how much\nrecent context is preserved verbatim.\n\nAttributes:\n enabled: Master switch. When False, compaction never triggers\n and other fields are inert.\n threshold_ratio: Trigger compaction when estimated input tokens\n exceed this fraction of the model's context window\n (clamped to 0.0..1.0).\n token_floor: Minimum estimated token count before compaction\n can trigger, regardless of threshold_ratio. Prevents\n triggering on very small context windows.\n buffer_turns: Initial number of recent turns to keep verbatim.\n The runtime applies a degrading guard — if these turns\n exceed the available budget, it reduces buffer_turns by\n one repeatedly until the budget fits, down to zero.\n buffer_max_ratio: Hard cap on the fraction of the context\n window the buffer zone may occupy, regardless of\n buffer_turns.",
+  "properties": {
+    "enabled": {
+      "default": false,
+      "description": "When true, older conversation turns are summarized when estimated tokens approach the context window limit.",
+      "title": "Enable compaction",
+      "type": "boolean"
+    },
+    "threshold_ratio": {
+      "default": 0.7,
+      "description": "Trigger compaction when estimated tokens exceed this fraction of the model's context window (0.0-1.0).",
+      "title": "Threshold ratio",
+      "type": "number"
+    },
+    "token_floor": {
+      "default": 4096,
+      "description": "Minimum token count before compaction can trigger. Prevents triggering on very small context windows.",
+      "minimum": 0,
+      "title": "Token floor",
+      "type": "integer"
+    },
+    "buffer_turns": {
+      "default": 4,
+      "description": "Number of recent turns to keep verbatim.",
+      "minimum": 0,
+      "title": "Buffer turns",
+      "type": "integer"
+    },
+    "buffer_max_ratio": {
+      "default": 0.3,
+      "description": "Maximum fraction of context window the buffer zone can occupy, regardless of buffer_turns.",
+  "description": "Controls when conversation history is summarized to keep the model's input below the context window limit. Disabled by default — when disabled, requests that exceed the window continue to surface as HTTP 413.",
+  "title": "Conversation compaction configuration"
+},
 "byok_rag": {
   "description": "BYOK RAG configuration. This configuration can be used to reconfigure Llama Stack through its run.yaml configuration file",
   "items": {
@@ -718,6 +763,15 @@
   "default": null,
   "description": "Identification of default provider used when no other model is specified.",
   "title": "Default provider"
+},
+"context_windows": {
+  "additionalProperties": {
+    "minimum": 0,
+    "type": "integer"
+  },
+  "description": "Map of fully-qualified model identifier (e.g., \"openai/gpt-4o-mini\") to context window size in tokens. Used by the conversation compaction trigger to decide when older turns must be summarized before the input exceeds the window. Models absent from this map have no registered window — callers fall back to their own default or skip the token-based trigger.",
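The `context_windows` description says that a model missing from the map has no registered window, so the caller either falls back to its own default or skips the token-based trigger. A minimal Python sketch of that lookup follows. The function name `resolve_context_window` and its `default` parameter are hypothetical and not part of this PR, and the 128000 value is only an illustrative entry.

```python
from typing import Mapping, Optional


def resolve_context_window(
    model_id: str,
    context_windows: Mapping[str, int],
    default: Optional[int] = None,
) -> Optional[int]:
    """Return the registered window for model_id, else the caller's default.

    Hypothetical helper: a None result means "no window known", which a
    caller could treat as "skip the token-based compaction trigger".
    """
    window = context_windows.get(model_id)
    if window is None:
        return default
    return window


# Illustrative map keyed by fully-qualified model identifier.
windows = {"openai/gpt-4o-mini": 128000}

known = resolve_context_window("openai/gpt-4o-mini", windows)
unknown = resolve_context_window("example/unlisted-model", windows)
with_fallback = resolve_context_window("example/unlisted-model", windows, default=8192)
```

Returning `None` rather than raising keeps the "skip the trigger" path explicit at the call site; whether the actual runtime does this is not shown in the diff.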
docs/config.md
+40 lines changed: 40 additions & 0 deletions
@@ -147,6 +147,44 @@ Useful resources:
 | allow_headers | array | A list of HTTP request headers that should be supported for cross-origin requests. You can use ['*'] to allow all headers. The Accept, Accept-Language, Content-Language and Content-Type headers are always allowed for simple CORS requests. |
 
 
+## CompactionConfiguration
+
+
+Configuration for conversation history compaction.
+
+Compaction summarizes older conversation turns when their estimated
+token count approaches the context window limit, keeping the
+conversation usable instead of failing with HTTP 413. The
+configuration here controls when compaction triggers and how much
+recent context is preserved verbatim.
+
+Attributes:
+enabled: Master switch. When False, compaction never triggers
+and other fields are inert.
+threshold_ratio: Trigger compaction when estimated input tokens
+exceed this fraction of the model's context window
+(clamped to 0.0..1.0).
+token_floor: Minimum estimated token count before compaction
+can trigger, regardless of threshold_ratio. Prevents
+triggering on very small context windows.
+buffer_turns: Initial number of recent turns to keep verbatim.
+The runtime applies a degrading guard — if these turns
+exceed the available budget, it reduces buffer_turns by
+one repeatedly until the budget fits, down to zero.
+buffer_max_ratio: Hard cap on the fraction of the context
+window the buffer zone may occupy, regardless of
+buffer_turns.
+
+
+| Field | Type | Description |
+|-------|------|-------------|
+| enabled | boolean | When true, older conversation turns are summarized when estimated tokens approach the context window limit. |
+| threshold_ratio | number | Trigger compaction when estimated tokens exceed this fraction of the model's context window (0.0-1.0). |
+| token_floor | integer | Minimum token count before compaction can trigger. Prevents triggering on very small context windows. |
+| buffer_turns | integer | Number of recent turns to keep verbatim. |
+| buffer_max_ratio | number | Maximum fraction of context window the buffer zone can occupy, regardless of buffer_turns. |
+
+
 ## Configuration
 
 
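To make the interaction of `enabled`, `threshold_ratio`, and `token_floor` concrete, here is a minimal Python sketch of the trigger decision as the CompactionConfiguration text describes it. The `CompactionSettings` dataclass and `should_compact` function are hypothetical stand-ins, not code from this PR. The sketch assumes "exceed" means a strict comparison and that the floor is inclusive; the real runtime may differ on both points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactionSettings:
    """Hypothetical mirror of the documented fields and their defaults."""

    enabled: bool = False
    threshold_ratio: float = 0.7
    token_floor: int = 4096
    buffer_turns: int = 4
    buffer_max_ratio: float = 0.3


def should_compact(
    estimated_tokens: int, context_window: int, cfg: CompactionSettings
) -> bool:
    """Decide whether compaction would trigger under the documented rules."""
    if not cfg.enabled:
        # Master switch off: the other fields are inert.
        return False
    if estimated_tokens < cfg.token_floor:
        # Below the floor, never trigger, regardless of threshold_ratio.
        return False
    # The docs state the ratio is clamped to 0.0..1.0.
    ratio = min(max(cfg.threshold_ratio, 0.0), 1.0)
    return estimated_tokens > ratio * context_window


cfg = CompactionSettings(enabled=True)
over_threshold = should_compact(100_000, 128_000, cfg)
under_threshold = should_compact(80_000, 128_000, cfg)
small_window = should_compact(3_000, 4_000, cfg)
```

The `small_window` case shows why the floor exists: 3000 tokens is above 70% of a 4000-token window, yet it stays below the default `token_floor` of 4096, so no compaction is triggered in this sketch.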
@@ -166,6 +204,7 @@ Global service configuration.
 | customization || It is possible to customize Lightspeed Core Stack via this section. System prompt can be customized and also different parts of the service can be replaced by custom Python modules. |
 | inference || One LLM provider and one its model might be selected as default ones. When no provider+model pair is specified in REST API calls (query endpoints), the default provider and model are used. |
 | conversation_cache |||
+| compaction || Controls when conversation history is summarized to keep the model's input below the context window limit. Disabled by default — when disabled, requests that exceed the window continue to surface as HTTP 413. |
 | byok_rag | array | BYOK RAG configuration. This configuration can be used to reconfigure Llama Stack through its run.yaml configuration file |
 | a2a_state || Configuration for A2A protocol persistent state storage. |
 | default_model | string | Identification of default model used when no other model is specified. |
 | default_provider | string | Identification of default provider used when no other model is specified. |
+| context_windows | object | Map of fully-qualified model identifier (e.g., "openai/gpt-4o-mini") to context window size in tokens. Used by the conversation compaction trigger to decide when older turns must be summarized before the input exceeds the window. Models absent from this map have no registered window — callers fall back to their own default or skip the token-based trigger. |
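The degrading guard documented for `buffer_turns` and `buffer_max_ratio` in CompactionConfiguration can be sketched in Python as below. This is an illustration under stated assumptions, not the PR's implementation: the function name `effective_buffer_turns` is hypothetical, per-turn token counts are assumed to be available as a list ordered oldest to newest, and the "available budget" is assumed to equal `buffer_max_ratio` times the context window.

```python
from typing import Sequence


def effective_buffer_turns(
    turn_tokens: Sequence[int],
    buffer_turns: int,
    context_window: int,
    buffer_max_ratio: float,
) -> int:
    """Return how many recent turns fit in the buffer budget.

    Starts from the configured buffer_turns and drops one turn at a time
    (oldest of the kept turns first) until the kept turns fit, down to zero.
    """
    # Assumption: the budget is the buffer_max_ratio share of the window.
    budget = int(buffer_max_ratio * context_window)
    keep = min(buffer_turns, len(turn_tokens))
    while keep > 0 and sum(turn_tokens[-keep:]) > budget:
        keep -= 1
    return keep


# Estimated tokens per turn, oldest first (illustrative numbers).
history = [500, 900, 1200, 800, 700]

degraded = effective_buffer_turns(history, 4, 8_000, 0.3)
roomy = effective_buffer_turns(history, 4, 100_000, 0.3)
tiny = effective_buffer_turns(history, 4, 1_000, 0.3)
```

With an 8000-token window the budget is 2400 tokens: the last four turns (3600) and the last three (2700) do not fit, so the guard settles on two turns (1500). With a 1000-token window no single recent turn fits and the buffer degrades to zero.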