Commit 392413e

Merge pull request lightspeed-core#1745 from tisnik/lcore-1356-updated-config-doc
LCORE-1356: Updated configuration doc
2 parents b9ce55f + b44d507 commit 392413e

3 files changed

Lines changed: 183 additions & 8 deletions

docs/config.html

Lines changed: 89 additions & 8 deletions
@@ -498,6 +498,69 @@ <h2 id="corsconfiguration">CORSConfiguration</h2>
 </tr>
 </tbody>
 </table>
+<h2 id="compactionconfiguration">CompactionConfiguration</h2>
+<p>Configuration for conversation history compaction.</p>
+<p>Compaction summarizes older conversation turns when their estimated
+token count approaches the context window limit, keeping the
+conversation usable instead of failing with HTTP 413. The configuration
+here controls when compaction triggers and how much recent context is
+preserved verbatim.</p>
+<p>Attributes: enabled: Master switch. When False, compaction never
+triggers and other fields are inert. threshold_ratio: Trigger compaction
+when estimated input tokens exceed this fraction of the model&#x2019;s context
+window (clamped to 0.0..1.0). token_floor: Minimum estimated token count
+before compaction can trigger, regardless of threshold_ratio. Prevents
+triggering on very small context windows. buffer_turns: Initial number
+of recent turns to keep verbatim. The runtime applies a degrading guard
+&#x2014; if these turns exceed the available budget, it reduces buffer_turns by
+one repeatedly until the budget fits, down to zero. buffer_max_ratio:
+Hard cap on the fraction of the context window the buffer zone may
+occupy, regardless of buffer_turns.</p>
+<table>
+<colgroup>
+<col style="width: 26%"/>
+<col style="width: 23%"/>
+<col style="width: 50%"/>
+</colgroup>
+<thead>
+<tr class="header">
+<th>Field</th>
+<th>Type</th>
+<th>Description</th>
+</tr>
+</thead>
+<tbody>
+<tr class="odd">
+<td>enabled</td>
+<td>boolean</td>
+<td>When true, older conversation turns are summarized when estimated
+tokens approach the context window limit.</td>
+</tr>
+<tr class="even">
+<td>threshold_ratio</td>
+<td>number</td>
+<td>Trigger compaction when estimated tokens exceed this fraction of the
+model&#x2019;s context window (0.0-1.0).</td>
+</tr>
+<tr class="odd">
+<td>token_floor</td>
+<td>integer</td>
+<td>Minimum token count before compaction can trigger. Prevents
+triggering on very small context windows.</td>
+</tr>
+<tr class="even">
+<td>buffer_turns</td>
+<td>integer</td>
+<td>Number of recent turns to keep verbatim.</td>
+</tr>
+<tr class="odd">
+<td>buffer_max_ratio</td>
+<td>number</td>
+<td>Maximum fraction of context window the buffer zone can occupy,
+regardless of buffer_turns.</td>
+</tr>
+</tbody>
+</table>
 <h2 id="configuration">Configuration</h2>
 <p>Global service configuration.</p>
 <table>
@@ -588,49 +651,57 @@ <h2 id="configuration">Configuration</h2>
 <td/>
 </tr>
 <tr class="even">
+<td>compaction</td>
+<td/>
+<td>Controls when conversation history is summarized to keep the model&#x2019;s
+input below the context window limit. Disabled by default &#x2014; when
+disabled, requests that exceed the window continue to surface as HTTP
+413.</td>
+</tr>
+<tr class="odd">
 <td>byok_rag</td>
 <td>array</td>
 <td>BYOK RAG configuration. This configuration can be used to
 reconfigure Llama Stack through its run.yaml configuration file</td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td>a2a_state</td>
 <td/>
 <td>Configuration for A2A protocol persistent state storage.</td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td>quota_handlers</td>
 <td/>
 <td>Quota handlers configuration</td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td>azure_entra_id</td>
 <td/>
 <td/>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td>rlsapi_v1</td>
 <td/>
 <td>Configuration for the rlsapi v1 /infer endpoint used by the RHEL
 Lightspeed Command Line Assistant (CLA).</td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td>splunk</td>
 <td/>
 <td>Splunk HEC configuration for sending telemetry events.</td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td>deployment_environment</td>
 <td>string</td>
 <td>Deployment environment name (e.g., &#x2018;development&#x2019;, &#x2018;staging&#x2019;,
 &#x2018;production&#x2019;). Used in telemetry events.</td>
 </tr>
-<tr class="odd">
+<tr class="even">
 <td>rag</td>
 <td/>
 <td>Configuration for all RAG strategies (inline and tool-based).</td>
 </tr>
-<tr class="even">
+<tr class="odd">
 <td>okp</td>
 <td/>
 <td>OKP provider settings. Only used when &#x2018;okp&#x2019; is listed in rag.inline
@@ -827,6 +898,16 @@ <h2 id="inferenceconfiguration">InferenceConfiguration</h2>
 <td>Identification of default provider used when no other model is
 specified.</td>
 </tr>
+<tr class="odd">
+<td>context_windows</td>
+<td>object</td>
+<td>Map of fully-qualified model identifier (e.g., &#x201C;openai/gpt-4o-mini&#x201D;)
+to context window size in tokens. Used by the conversation compaction
+trigger to decide when older turns must be summarized before the input
+exceeds the window. Models absent from this map have no registered
+window &#x2014; callers fall back to their own default or skip the token-based
+trigger.</td>
+</tr>
 </tbody>
 </table>
 <h2 id="jsonpathoperator">JsonPathOperator</h2>
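
The trigger rule described in the hunks above (`enabled`, `threshold_ratio`, `token_floor`, and the `context_windows` map) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `CompactionSettings` and `should_compact` are hypothetical names, the strictness of each comparison is assumed, and a model missing from `context_windows` is assumed to skip the token-based trigger, which is one of the two fallbacks the description mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactionSettings:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    enabled: bool = False
    threshold_ratio: float = 0.7
    token_floor: int = 4096
    buffer_turns: int = 4
    buffer_max_ratio: float = 0.3


def should_compact(settings, estimated_tokens, model_id, context_windows):
    """Return True when compaction would trigger under the assumed rule."""
    if not settings.enabled:
        # Master switch: when disabled, the other fields are inert.
        return False
    window = context_windows.get(model_id)
    if window is None:
        # Assumption: no registered window means the token-based trigger is skipped.
        return False
    if estimated_tokens < settings.token_floor:
        # Below the floor, compaction cannot trigger regardless of the ratio.
        return False
    # The docs say the ratio is clamped to 0.0..1.0.
    ratio = min(max(settings.threshold_ratio, 0.0), 1.0)
    return estimated_tokens > ratio * window


settings = CompactionSettings(enabled=True)
windows = {"openai/gpt-4o-mini": 128_000}
print(should_compact(settings, 90_000, "openai/gpt-4o-mini", windows))  # True
```

With the documented defaults, a window of 128,000 tokens puts the threshold at roughly 89,600 estimated tokens, so the 90,000-token call above triggers while an 80,000-token call would not.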

docs/config.json

Lines changed: 54 additions & 0 deletions
@@ -363,6 +363,46 @@
 "title": "CORSConfiguration",
 "type": "object"
 },
+"CompactionConfiguration": {
+"additionalProperties": false,
+"description": "Configuration for conversation history compaction.\n\nCompaction summarizes older conversation turns when their estimated\ntoken count approaches the context window limit, keeping the\nconversation usable instead of failing with HTTP 413. The\nconfiguration here controls when compaction triggers and how much\nrecent context is preserved verbatim.\n\nAttributes:\n enabled: Master switch. When False, compaction never triggers\n and other fields are inert.\n threshold_ratio: Trigger compaction when estimated input tokens\n exceed this fraction of the model's context window\n (clamped to 0.0..1.0).\n token_floor: Minimum estimated token count before compaction\n can trigger, regardless of threshold_ratio. Prevents\n triggering on very small context windows.\n buffer_turns: Initial number of recent turns to keep verbatim.\n The runtime applies a degrading guard — if these turns\n exceed the available budget, it reduces buffer_turns by\n one repeatedly until the budget fits, down to zero.\n buffer_max_ratio: Hard cap on the fraction of the context\n window the buffer zone may occupy, regardless of\n buffer_turns.",
+"properties": {
+"enabled": {
+"default": false,
+"description": "When true, older conversation turns are summarized when estimated tokens approach the context window limit.",
+"title": "Enable compaction",
+"type": "boolean"
+},
+"threshold_ratio": {
+"default": 0.7,
+"description": "Trigger compaction when estimated tokens exceed this fraction of the model's context window (0.0-1.0).",
+"title": "Threshold ratio",
+"type": "number"
+},
+"token_floor": {
+"default": 4096,
+"description": "Minimum token count before compaction can trigger. Prevents triggering on very small context windows.",
+"minimum": 0,
+"title": "Token floor",
+"type": "integer"
+},
+"buffer_turns": {
+"default": 4,
+"description": "Number of recent turns to keep verbatim.",
+"minimum": 0,
+"title": "Buffer turns",
+"type": "integer"
+},
+"buffer_max_ratio": {
+"default": 0.3,
+"description": "Maximum fraction of context window the buffer zone can occupy, regardless of buffer_turns.",
+"title": "Buffer max ratio",
+"type": "number"
+}
+},
+"title": "CompactionConfiguration",
+"type": "object"
+},
 "Configuration": {
 "additionalProperties": false,
 "description": "Global service configuration.",
@@ -440,6 +480,11 @@
 "$ref": "#/components/schemas/ConversationHistoryConfiguration",
 "title": "Conversation history configuration"
 },
+"compaction": {
+"$ref": "#/components/schemas/CompactionConfiguration",
+"description": "Controls when conversation history is summarized to keep the model's input below the context window limit. Disabled by default — when disabled, requests that exceed the window continue to surface as HTTP 413.",
+"title": "Conversation compaction configuration"
+},
 "byok_rag": {
 "description": "BYOK RAG configuration. This configuration can be used to reconfigure Llama Stack through its run.yaml configuration file",
 "items": {
@@ -718,6 +763,15 @@
 "default": null,
 "description": "Identification of default provider used when no other model is specified.",
 "title": "Default provider"
+},
+"context_windows": {
+"additionalProperties": {
+"minimum": 0,
+"type": "integer"
+},
+"description": "Map of fully-qualified model identifier (e.g., \"openai/gpt-4o-mini\") to context window size in tokens. Used by the conversation compaction trigger to decide when older turns must be summarized before the input exceeds the window. Models absent from this map have no registered window — callers fall back to their own default or skip the token-based trigger.",
+"title": "Per-model context window sizes (tokens)",
+"type": "object"
 }
 },
 "title": "InferenceConfiguration",

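The constraints visible in the `CompactionConfiguration` schema hunk above (`"additionalProperties": false`, the per-field types, the `minimum` of 0 on `token_floor` and `buffer_turns`, and the defaults) can be checked with a few lines of standard-library Python. The sketch below is only an illustration: `SCHEMA_FIELDS` and `validate_compaction` are hypothetical names, and the project itself presumably relies on its own model classes rather than a hand-written check like this. The schema sets no bounds on the two ratio fields, so none are enforced here.

```python
# Hypothetical table derived from the schema hunk:
# field name -> (accepted Python types, minimum or None, documented default)
SCHEMA_FIELDS = {
    "enabled": ((bool,), None, False),
    "threshold_ratio": ((int, float), None, 0.7),
    "token_floor": ((int,), 0, 4096),
    "buffer_turns": ((int,), 0, 4),
    "buffer_max_ratio": ((int, float), None, 0.3),
}


def validate_compaction(raw):
    """Return the mapping with defaults applied, or raise ValueError."""
    unknown = set(raw) - set(SCHEMA_FIELDS)
    if unknown:
        # "additionalProperties": false rejects keys outside the schema.
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    result = {}
    for name, (types, minimum, default) in SCHEMA_FIELDS.items():
        value = raw.get(name, default)
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if isinstance(value, bool) and bool not in types:
            raise ValueError(f"{name}: expected a number, got a boolean")
        if not isinstance(value, types):
            raise ValueError(f"{name}: unexpected type {type(value).__name__}")
        if minimum is not None and value < minimum:
            raise ValueError(f"{name}: must be >= {minimum}")
        result[name] = value
    return result


print(validate_compaction({"enabled": True, "buffer_turns": 2}))
```

Fields left out of the input fall back to the defaults listed in the schema, so an empty mapping yields a disabled configuration.
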
docs/config.md

Lines changed: 40 additions & 0 deletions
@@ -147,6 +147,44 @@ Useful resources:
 | allow_headers | array | A list of HTTP request headers that should be supported for cross-origin requests. You can use ['*'] to allow all headers. The Accept, Accept-Language, Content-Language and Content-Type headers are always allowed for simple CORS requests. |
 
 
+## CompactionConfiguration
+
+
+Configuration for conversation history compaction.
+
+Compaction summarizes older conversation turns when their estimated
+token count approaches the context window limit, keeping the
+conversation usable instead of failing with HTTP 413. The
+configuration here controls when compaction triggers and how much
+recent context is preserved verbatim.
+
+Attributes:
+enabled: Master switch. When False, compaction never triggers
+and other fields are inert.
+threshold_ratio: Trigger compaction when estimated input tokens
+exceed this fraction of the model's context window
+(clamped to 0.0..1.0).
+token_floor: Minimum estimated token count before compaction
+can trigger, regardless of threshold_ratio. Prevents
+triggering on very small context windows.
+buffer_turns: Initial number of recent turns to keep verbatim.
+The runtime applies a degrading guard — if these turns
+exceed the available budget, it reduces buffer_turns by
+one repeatedly until the budget fits, down to zero.
+buffer_max_ratio: Hard cap on the fraction of the context
+window the buffer zone may occupy, regardless of
+buffer_turns.
+
+
+| Field | Type | Description |
+|-------|------|-------------|
+| enabled | boolean | When true, older conversation turns are summarized when estimated tokens approach the context window limit. |
+| threshold_ratio | number | Trigger compaction when estimated tokens exceed this fraction of the model's context window (0.0-1.0). |
+| token_floor | integer | Minimum token count before compaction can trigger. Prevents triggering on very small context windows. |
+| buffer_turns | integer | Number of recent turns to keep verbatim. |
+| buffer_max_ratio | number | Maximum fraction of context window the buffer zone can occupy, regardless of buffer_turns. |
+
+
 ## Configuration
 
 
@@ -166,6 +204,7 @@ Global service configuration.
 | customization | | It is possible to customize Lightspeed Core Stack via this section. System prompt can be customized and also different parts of the service can be replaced by custom Python modules. |
 | inference | | One LLM provider and one its model might be selected as default ones. When no provider+model pair is specified in REST API calls (query endpoints), the default provider and model are used. |
 | conversation_cache | | |
+| compaction | | Controls when conversation history is summarized to keep the model's input below the context window limit. Disabled by default — when disabled, requests that exceed the window continue to surface as HTTP 413. |
 | byok_rag | array | BYOK RAG configuration. This configuration can be used to reconfigure Llama Stack through its run.yaml configuration file |
 | a2a_state | | Configuration for A2A protocol persistent state storage. |
 | quota_handlers | | Quota handlers configuration |
@@ -254,6 +293,7 @@ Inference configuration.
 |-------|------|-------------|
 | default_model | string | Identification of default model used when no other model is specified. |
 | default_provider | string | Identification of default provider used when no other model is specified. |
+| context_windows | object | Map of fully-qualified model identifier (e.g., "openai/gpt-4o-mini") to context window size in tokens. Used by the conversation compaction trigger to decide when older turns must be summarized before the input exceeds the window. Models absent from this map have no registered window — callers fall back to their own default or skip the token-based trigger. |
 
 
 ## JsonPathOperator
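
The "degrading guard" described for `buffer_turns` in the `CompactionConfiguration` text above lends itself to a short sketch. The version below is an illustration under assumptions rather than the runtime's actual code: `effective_buffer_turns` is a hypothetical name, per-turn token estimates are assumed to be supplied by the caller (oldest first), and the "available budget" is assumed to be `buffer_max_ratio` times the context window.

```python
def effective_buffer_turns(turn_tokens, buffer_turns, buffer_max_ratio, context_window):
    """Return how many of the most recent turns can be kept verbatim.

    turn_tokens lists estimated token counts per turn, oldest first.
    """
    # Assumption: the buffer budget is the hard cap expressed by buffer_max_ratio.
    budget = buffer_max_ratio * context_window
    # Start from the configured count, limited by the turns that actually exist.
    keep = min(buffer_turns, len(turn_tokens))
    # Drop one turn at a time until the kept turns fit, down to zero.
    while keep > 0 and sum(turn_tokens[-keep:]) > budget:
        keep -= 1
    return keep


history = [500, 1000, 1500, 2000]
print(effective_buffer_turns(history, 4, 0.3, 10_000))  # 1
```

In the example the budget is about 3,000 tokens: the last four, three, and two turns total 5,000, 4,500, and 3,500 tokens and are rejected in turn, while the most recent turn alone (2,000 tokens) fits, so one turn is kept verbatim.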
