Merge pull request lightspeed-core#1745 from tisnik/lcore-1356-updated-config-doc

tisnik · web-flow · commit 392413e9adfd · 2026-05-17T11:45:46.000+02:00
LCORE-1356: Updated configuration doc
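
For reviewers, here is a minimal Python sketch of how the documented fields could interact. It is not the service's implementation. `CompactionSettings`, `should_compact`, and `turns_to_keep` are hypothetical names, and the "available budget" is assumed to be `buffer_max_ratio * context_window`. Only the default values are taken from the schema in this change.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompactionSettings:
    """Hypothetical mirror of CompactionConfiguration, using the schema defaults."""
    enabled: bool = False
    threshold_ratio: float = 0.7
    token_floor: int = 4096
    buffer_turns: int = 4
    buffer_max_ratio: float = 0.3


def should_compact(cfg: CompactionSettings, estimated_tokens: int,
                   context_window: Optional[int]) -> bool:
    """Decide whether compaction would trigger (assumed semantics)."""
    # Disabled, or no window registered for the model: skip the token-based trigger.
    if not cfg.enabled or context_window is None:
        return False
    # Below the floor, never trigger, regardless of threshold_ratio.
    if estimated_tokens < cfg.token_floor:
        return False
    # Clamp the ratio to 0.0..1.0 as the field description states.
    ratio = min(max(cfg.threshold_ratio, 0.0), 1.0)
    return estimated_tokens > ratio * context_window


def turns_to_keep(cfg: CompactionSettings, turn_tokens: list[int],
                  context_window: int) -> int:
    """Degrading guard: shrink the verbatim buffer one turn at a time.

    turn_tokens holds per-turn token estimates, ordered oldest to newest.
    Assumption: the budget is buffer_max_ratio * context_window.
    """
    budget = cfg.buffer_max_ratio * context_window
    keep = min(cfg.buffer_turns, len(turn_tokens))
    while keep > 0 and sum(turn_tokens[-keep:]) > budget:
        keep -= 1
    return keep


# Example lookup in an inference.context_windows style map (illustrative value).
context_windows = {"openai/gpt-4o-mini": 128000}
window = context_windows.get("some/unregistered-model")  # None: no registered window
cfg = CompactionSettings(enabled=True)
print(should_compact(cfg, 6000, window))
```

With a registered 8192-token window and the defaults, the sketch triggers once the estimate passes 70% of the window and keeps at most four recent turns. The number of kept turns drops as needed so the buffer stays within 30% of the window.
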
diff --git a/docs/config.html b/docs/config.html
@@ -498,6 +498,69 @@ <h2 id="corsconfiguration">CORSConfiguration</h2>
         </tr>
       </tbody>
     </table>
+    <h2 id="compactionconfiguration">CompactionConfiguration</h2>
+    <p>Configuration for conversation history compaction.</p>
+    <p>Compaction summarizes older conversation turns when their estimated
+token count approaches the context window limit, keeping the
+conversation usable instead of failing with HTTP 413. The configuration
+here controls when compaction triggers and how much recent context is
+preserved verbatim.</p>
+    <p>Attributes:</p>
+    <ul>
+      <li>enabled: Master switch. When false, compaction never triggers and the other fields are inert.</li>
+      <li>threshold_ratio: Trigger compaction when estimated input tokens exceed this fraction of the model&#x2019;s context window (clamped to 0.0..1.0).</li>
+      <li>token_floor: Minimum estimated token count before compaction can trigger, regardless of threshold_ratio.
+Prevents triggering on very small context windows.</li>
+      <li>buffer_turns: Initial number of recent turns to keep verbatim. The runtime applies a degrading guard:
+if these turns exceed the available budget, it reduces buffer_turns by one repeatedly until the budget fits,
+down to zero.</li>
+      <li>buffer_max_ratio: Hard cap on the fraction of the context window the buffer zone may occupy, regardless of buffer_turns.</li>
+    </ul>
+    <table>
+      <colgroup>
+        <col style="width: 26%"/>
+        <col style="width: 23%"/>
+        <col style="width: 50%"/>
+      </colgroup>
+      <thead>
+        <tr class="header">
+          <th>Field</th>
+          <th>Type</th>
+          <th>Description</th>
+        </tr>
+      </thead>
+      <tbody>
+        <tr class="odd">
+          <td>enabled</td>
+          <td>boolean</td>
+          <td>When true, older conversation turns are summarized when estimated
+tokens approach the context window limit.</td>
+        </tr>
+        <tr class="even">
+          <td>threshold_ratio</td>
+          <td>number</td>
+          <td>Trigger compaction when estimated tokens exceed this fraction of the
+model&#x2019;s context window (0.0-1.0).</td>
+        </tr>
+        <tr class="odd">
+          <td>token_floor</td>
+          <td>integer</td>
+          <td>Minimum token count before compaction can trigger. Prevents
+triggering on very small context windows.</td>
+        </tr>
+        <tr class="even">
+          <td>buffer_turns</td>
+          <td>integer</td>
+          <td>Number of recent turns to keep verbatim.</td>
+        </tr>
+        <tr class="odd">
+          <td>buffer_max_ratio</td>
+          <td>number</td>
+          <td>Maximum fraction of context window the buffer zone can occupy,
+regardless of buffer_turns.</td>
+        </tr>
+      </tbody>
+    </table>
     <h2 id="configuration">Configuration</h2>
     <p>Global service configuration.</p>
     <table>
@@ -588,49 +651,57 @@ <h2 id="configuration">Configuration</h2>
           <td/>
         </tr>
         <tr class="even">
+          <td>compaction</td>
+          <td/>
+          <td>Controls when conversation history is summarized to keep the model&#x2019;s
+input below the context window limit. Disabled by default &#x2014; when
+disabled, requests that exceed the window continue to surface as HTTP
+413.</td>
+        </tr>
+        <tr class="odd">
           <td>byok_rag</td>
           <td>array</td>
           <td>BYOK RAG configuration. This configuration can be used to
 reconfigure Llama Stack through its run.yaml configuration file</td>
         </tr>
-        <tr class="odd">
+        <tr class="even">
           <td>a2a_state</td>
           <td/>
           <td>Configuration for A2A protocol persistent state storage.</td>
         </tr>
-        <tr class="even">
+        <tr class="odd">
           <td>quota_handlers</td>
           <td/>
           <td>Quota handlers configuration</td>
         </tr>
-        <tr class="odd">
+        <tr class="even">
           <td>azure_entra_id</td>
           <td/>
           <td/>
         </tr>
-        <tr class="even">
+        <tr class="odd">
           <td>rlsapi_v1</td>
           <td/>
           <td>Configuration for the rlsapi v1 /infer endpoint used by the RHEL
 Lightspeed Command Line Assistant (CLA).</td>
         </tr>
-        <tr class="odd">
+        <tr class="even">
           <td>splunk</td>
           <td/>
           <td>Splunk HEC configuration for sending telemetry events.</td>
         </tr>
-        <tr class="even">
+        <tr class="odd">
           <td>deployment_environment</td>
           <td>string</td>
           <td>Deployment environment name (e.g., &#x2018;development&#x2019;, &#x2018;staging&#x2019;,
 &#x2018;production&#x2019;). Used in telemetry events.</td>
         </tr>
-        <tr class="odd">
+        <tr class="even">
           <td>rag</td>
           <td/>
           <td>Configuration for all RAG strategies (inline and tool-based).</td>
         </tr>
-        <tr class="even">
+        <tr class="odd">
           <td>okp</td>
           <td/>
           <td>OKP provider settings. Only used when &#x2018;okp&#x2019; is listed in rag.inline
@@ -827,6 +898,16 @@ <h2 id="inferenceconfiguration">InferenceConfiguration</h2>
           <td>Identification of default provider used when no other model is
 specified.</td>
         </tr>
+        <tr class="odd">
+          <td>context_windows</td>
+          <td>object</td>
+          <td>Map of fully-qualified model identifier (e.g., &#x201C;openai/gpt-4o-mini&#x201D;)
+to context window size in tokens. Used by the conversation compaction
+trigger to decide when older turns must be summarized before the input
+exceeds the window. Models absent from this map have no registered
+window &#x2014; callers fall back to their own default or skip the token-based
+trigger.</td>
+        </tr>
       </tbody>
     </table>
     <h2 id="jsonpathoperator">JsonPathOperator</h2>
diff --git a/docs/config.json b/docs/config.json
@@ -363,6 +363,46 @@
         "title": "CORSConfiguration",
         "type": "object"
       },
+      "CompactionConfiguration": {
+        "additionalProperties": false,
+        "description": "Configuration for conversation history compaction.\n\nCompaction summarizes older conversation turns when their estimated\ntoken count approaches the context window limit, keeping the\nconversation usable instead of failing with HTTP 413. The\nconfiguration here controls when compaction triggers and how much\nrecent context is preserved verbatim.\n\nAttributes:\n    enabled: Master switch. When False, compaction never triggers\n        and other fields are inert.\n    threshold_ratio: Trigger compaction when estimated input tokens\n        exceed this fraction of the model's context window\n        (clamped to 0.0..1.0).\n    token_floor: Minimum estimated token count before compaction\n        can trigger, regardless of threshold_ratio. Prevents\n        triggering on very small context windows.\n    buffer_turns: Initial number of recent turns to keep verbatim.\n        The runtime applies a degrading guard — if these turns\n        exceed the available budget, it reduces buffer_turns by\n        one repeatedly until the budget fits, down to zero.\n    buffer_max_ratio: Hard cap on the fraction of the context\n        window the buffer zone may occupy, regardless of\n        buffer_turns.",
+        "properties": {
+          "enabled": {
+            "default": false,
+            "description": "When true, older conversation turns are summarized when estimated tokens approach the context window limit.",
+            "title": "Enable compaction",
+            "type": "boolean"
+          },
+          "threshold_ratio": {
+            "default": 0.7,
+            "description": "Trigger compaction when estimated tokens exceed this fraction of the model's context window (0.0-1.0).",
+            "title": "Threshold ratio",
+            "type": "number"
+          },
+          "token_floor": {
+            "default": 4096,
+            "description": "Minimum token count before compaction can trigger. Prevents triggering on very small context windows.",
+            "minimum": 0,
+            "title": "Token floor",
+            "type": "integer"
+          },
+          "buffer_turns": {
+            "default": 4,
+            "description": "Number of recent turns to keep verbatim.",
+            "minimum": 0,
+            "title": "Buffer turns",
+            "type": "integer"
+          },
+          "buffer_max_ratio": {
+            "default": 0.3,
+            "description": "Maximum fraction of context window the buffer zone can occupy, regardless of buffer_turns.",
+            "title": "Buffer max ratio",
+            "type": "number"
+          }
+        },
+        "title": "CompactionConfiguration",
+        "type": "object"
+      },
       "Configuration": {
         "additionalProperties": false,
         "description": "Global service configuration.",
@@ -440,6 +480,11 @@
             "$ref": "#/components/schemas/ConversationHistoryConfiguration",
             "title": "Conversation history configuration"
           },
+          "compaction": {
+            "$ref": "#/components/schemas/CompactionConfiguration",
+            "description": "Controls when conversation history is summarized to keep the model's input below the context window limit. Disabled by default — when disabled, requests that exceed the window continue to surface as HTTP 413.",
+            "title": "Conversation compaction configuration"
+          },
           "byok_rag": {
             "description": "BYOK RAG configuration. This configuration can be used to reconfigure Llama Stack through its run.yaml configuration file",
             "items": {
@@ -718,6 +763,15 @@
             "default": null,
             "description": "Identification of default provider used when no other model is specified.",
             "title": "Default provider"
+          },
+          "context_windows": {
+            "additionalProperties": {
+              "minimum": 0,
+              "type": "integer"
+            },
+            "description": "Map of fully-qualified model identifier (e.g., \"openai/gpt-4o-mini\") to context window size in tokens. Used by the conversation compaction trigger to decide when older turns must be summarized before the input exceeds the window. Models absent from this map have no registered window — callers fall back to their own default or skip the token-based trigger.",
+            "title": "Per-model context window sizes (tokens)",
+            "type": "object"
           }
         },
         "title": "InferenceConfiguration",
diff --git a/docs/config.md b/docs/config.md
@@ -147,6 +147,44 @@ Useful resources:
 | allow_headers | array | A list of HTTP request headers that should be supported for cross-origin requests. You can use ['*'] to allow all headers. The Accept, Accept-Language, Content-Language and Content-Type headers are always allowed for simple CORS requests. |
 
 
+## CompactionConfiguration
+
+
+Configuration for conversation history compaction.
+
+Compaction summarizes older conversation turns when their estimated
+token count approaches the context window limit, keeping the
+conversation usable instead of failing with HTTP 413. The
+configuration here controls when compaction triggers and how much
+recent context is preserved verbatim.
+
+Attributes:
+
+- enabled: Master switch. When false, compaction never triggers
+  and the other fields are inert.
+- threshold_ratio: Trigger compaction when estimated input tokens
+  exceed this fraction of the model's context window
+  (clamped to 0.0..1.0).
+- token_floor: Minimum estimated token count before compaction
+  can trigger, regardless of threshold_ratio. Prevents
+  triggering on very small context windows.
+- buffer_turns: Initial number of recent turns to keep verbatim.
+  The runtime applies a degrading guard: if these turns exceed the
+  available budget, it reduces buffer_turns by one repeatedly
+  until the budget fits, down to zero.
+- buffer_max_ratio: Hard cap on the fraction of the context window
+  the buffer zone may occupy, regardless of buffer_turns.
+
+
+| Field | Type | Description |
+|-------|------|-------------|
+| enabled | boolean | When true, older conversation turns are summarized when estimated tokens approach the context window limit. |
+| threshold_ratio | number | Trigger compaction when estimated tokens exceed this fraction of the model's context window (0.0-1.0). |
+| token_floor | integer | Minimum token count before compaction can trigger. Prevents triggering on very small context windows. |
+| buffer_turns | integer | Number of recent turns to keep verbatim. |
+| buffer_max_ratio | number | Maximum fraction of context window the buffer zone can occupy, regardless of buffer_turns. |
+
+
 ## Configuration
 
 
@@ -166,6 +204,7 @@ Global service configuration.
 | customization |  | It is possible to customize Lightspeed Core Stack via this section. System prompt can be customized and also different parts of the service can be replaced by custom Python modules. |
 | inference |  | One LLM provider and one its model might be selected as default ones. When no provider+model pair is specified in REST API calls (query endpoints), the default provider and model are used. |
 | conversation_cache |  |  |
+| compaction |  | Controls when conversation history is summarized to keep the model's input below the context window limit. Disabled by default — when disabled, requests that exceed the window continue to surface as HTTP 413. |
 | byok_rag | array | BYOK RAG configuration. This configuration can be used to reconfigure Llama Stack through its run.yaml configuration file |
 | a2a_state |  | Configuration for A2A protocol persistent state storage. |
 | quota_handlers |  | Quota handlers configuration |
@@ -254,6 +293,7 @@ Inference configuration.
 |-------|------|-------------|
 | default_model | string | Identification of default model used when no other model is specified. |
 | default_provider | string | Identification of default provider used when no other model is specified. |
+| context_windows | object | Map of fully-qualified model identifier (e.g., "openai/gpt-4o-mini") to context window size in tokens. Used by the conversation compaction trigger to decide when older turns must be summarized before the input exceeds the window. Models absent from this map have no registered window — callers fall back to their own default or skip the token-based trigger. |
 
 
 ## JsonPathOperator