Commit d1c79ab
authored
LCORE-1572: Integrate conversation compaction into the query flow (#1796)
* LCORE-1572: add conversation compaction and wire it into /v1/query
Introduce runtime conversation compaction (Option A): once a conversation
approaches the model's context window, lightspeed-stack summarizes older
turns and owns the LLM context itself instead of letting Llama Stack reload
the full history.
- src/utils/conversation_compaction.py: apply_compaction() async generator
and apply_compaction_blocking() wrapper. Holds a per-conversation lock
(R11), estimates tokens (LCORE-1569), partitions and summarizes old turns
(LCORE-1570), writes the summary into the conversation as a marker item,
and rebuilds the request as explicit input (summaries + recent verbatim
turns + new query). Marker items track the boundary; the conversation_id
is preserved and the full history stays in Llama Stack items for audit.
- models/common/responses/responses_api_params.py: omit_conversation flag so
the conversation parameter is dropped from the request body in compacted
mode while remaining on the object for identity.
- configuration.py: AppConfig.compaction accessor.
- app/endpoints/query.py: apply compaction after preparing params; in
compacted mode store the completed turn against the original user query
(the conversation parameter is no longer sent, so Llama Stack does not
persist the turn automatically).
Background: the spec's original marker-keeps-conversation-parameter approach
was found unimplementable on llama-stack 0.6.0, which always reloads the full
conversation history when the conversation parameter is set. This restores
the spike's original explicit-input approach.
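The explicit-input assembly described above (summaries, then recent verbatim turns, then the new query) can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `Message` and `build_explicit_input` are hypothetical names, and carrying summaries as `system` messages is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    """Hypothetical stand-in for a typed conversation message."""

    role: str
    text: str


def build_explicit_input(
    summaries: List[str], recent_turns: List[Message], query: str
) -> List[Message]:
    """Assemble the request input in order: summaries, recent turns, new query."""
    # Assumption: each summary is sent as a system message.
    items = [
        Message("system", f"Summary of earlier conversation:\n{text}")
        for text in summaries
    ]
    # Recent turns are passed through verbatim, in their original order.
    items.extend(recent_turns)
    # The new user query always comes last.
    items.append(Message("user", query))
    return items
```

Because the whole context is passed as input, the conversation parameter can be dropped from the request body while the conversation_id is still used for identity and storage.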
* LCORE-1572: unit tests for conversation compaction core and /v1/query
Cover marker detection and boundary selection, explicit-input assembly, the
trigger threshold, the disabled / no-context-window / existing-marker /
triggered paths of apply_compaction, the streaming CompactionStartedEvent
ordering, and compacted-turn storage.
* LCORE-1572: apply conversation compaction in the A2A endpoint
The A2A executor uses the same prepare_responses_params + Responses API
flow as /v1/query and persists conversation_id for multi-turn contexts, so
it accumulates context and must compact too.
- Run apply_compaction_blocking before responses.create (A2A is not a
browser SSE stream, so no progress event is emitted).
- In compacted mode, persist the completed turn from the response.completed
stream event, since the conversation parameter is no longer sent and Llama
Stack therefore does not store the turn automatically.
* LCORE-1572: apply conversation compaction in the streaming_query endpoint
Stream /v1/streaming_query through the compaction-aware path only when the
conversation actually compacts, so non-compacting requests are unaffected
(byte-for-byte the existing flow, including HTTP error handling).
- conversation_compaction: add needs_compaction_path(), a cheap pre-stream
predicate (no LLM, no lock) that is true only when the conversation already
has a summary marker or would trigger a new compaction.
- streaming_query: when the predicate is true, stream via the new
generate_response_with_compaction(), which emits the compaction progress
event before the summarization LLM call (R12) and creates the response
inside the stream, surfacing create-time errors as SSE error events.
generate_response gains emit_start/compacted parameters and, in compacted
mode, appends the completed turn to the conversation (the conversation
parameter is not sent, so Llama Stack does not store it automatically).
- a2a: silence too-many-lines after the earlier compaction wiring.
* LCORE-1572: tests for the streaming compaction gate
Cover needs_compaction_path: disabled, existing-marker, over-threshold, and
under-threshold — the gate that keeps non-compacting requests on the
unchanged streaming path.
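The four gate outcomes listed above can be illustrated with a small predicate. This is a sketch only: the function name, its parameters, and the 0.7 trigger ratio are assumptions for illustration, not the signature or threshold used by needs_compaction_path.

```python
from typing import Optional


def needs_compaction_path_sketch(
    enabled: bool,
    has_summary_marker: bool,
    estimated_tokens: int,
    context_window: Optional[int],
    trigger_ratio: float = 0.7,  # assumed value, not taken from the commit
) -> bool:
    """Cheap pre-stream check: no LLM call and no lock are involved."""
    if not enabled:
        # Compaction is switched off: always use the unchanged path.
        return False
    if has_summary_marker:
        # Already compacted earlier: the request must stay in compacted mode.
        return True
    if not context_window:
        # Without a known context window no threshold can be computed.
        return False
    # A new compaction would trigger once the estimate reaches the threshold.
    return estimated_tokens >= trigger_ratio * context_window
```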
* LCORE-1572: apply conversation compaction in the /v1/responses endpoint
/v1/responses is the OpenAI-compatible Responses API, so compaction is
silent: no custom SSE event is injected (preserving wire compatibility) and
create-time error handling is unchanged. Summarization runs before the
response is created, on both the streaming and non-streaming paths.
- responses_endpoint_handler: run apply_compaction_blocking before the
streaming/non-streaming split, gated to stateful single-conversation
requests (store=True, a conversation present, no previous_response_id).
- ResponsesContext: carry compacted_original_input so the finalization can
store the turn against the original user input.
- _append_previous_response_turn: generalized to also append the turn in
compacted mode (the conversation parameter is dropped, so Llama Stack does
not store the turn automatically) using the original input.
* LCORE-1572: tests for /v1/responses compacted-turn storage
Verify _append_previous_response_turn stores the turn against the original
input in compacted mode, and stores nothing when store is disabled.
* LCORE-1572: update spec doc to the as-built compaction design
Revise R10, R12, the architecture flow, the changed-request-flow section, and
the implementation guidance to match what was built: in compacted mode
lightspeed-stack builds explicit input and omits the Llama Stack conversation
parameter (which always reloads full history), preserving conversation_id and
the full item history. Record the redesign and the four affected endpoints
(query, streaming_query, A2A, /v1/responses) in a new Changelog section.
* LCORE-1572: fix needs_compaction_path docstring (pydocstyle D400)
* LCORE-1572: build compacted input as typed messages (silence Pydantic warning)
The explicit compacted input was assembled as plain dicts, which produced
PydanticSerializationUnexpectedValue warnings when ResponsesApiParams was
dumped (its input field is typed ResponseInput). Build the summary, recent
verbatim, and query items as typed OpenAIResponseMessage objects instead.
Verified end-to-end against a live stack: the serializer warning is gone and
compaction still triggers, preserves conversation identity, and recalls
earlier context correctly.
* LCORE-1572: raise instead of assert on the drained compaction result
apply_compaction_blocking asserted that the generator yielded a result. Under
python -O asserts are stripped, so the guard would vanish and a None result
could propagate to callers. Replace it with an explicit None check that raises
RuntimeError.
Clears a GitHub code-scanning (CodeQL) "use of assert" finding. The repository's
Bandit configuration skips B101, so this only surfaced via code scanning, not
the Bandit CI job.
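A minimal sketch of the pattern, assuming a generator that yields progress events followed by a final result. `StartedEvent`, `Result`, `compaction_steps`, and `drain_blocking` are hypothetical names used only for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional, Union


@dataclass
class StartedEvent:
    """Hypothetical progress event emitted before summarization."""


@dataclass
class Result:
    """Hypothetical final result of the compaction generator."""

    compacted: bool


async def compaction_steps(
    emit_result: bool,
) -> AsyncIterator[Union[StartedEvent, Result]]:
    """Yield a progress event, then (normally) the final result."""
    yield StartedEvent()
    if emit_result:
        yield Result(compacted=True)


async def drain_blocking(
    steps: AsyncIterator[Union[StartedEvent, Result]],
) -> Result:
    """Consume the generator, discard progress events, return the result."""
    result: Optional[Result] = None
    async for item in steps:
        if isinstance(item, Result):
            result = item
    # An explicit check survives `python -O`, unlike `assert result is not None`.
    if result is None:
        raise RuntimeError("compaction generator finished without a result")
    return result
```

With `python -O` an `assert` here would be compiled away and `None` would be returned to the caller; the explicit check fails loudly in both modes.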
* LCORE-1572: wire persisted recursive fold (R3) via the summary cache
Make the conversation summary cache the preferred source of truth for
compaction summaries and the home of the persisted recursive fold.
- apply_compaction / apply_compaction_blocking gain cache + user_id +
skip_user_id_check. Summaries are read from the cache (get_summaries) and each
new chunk is written to it (store_summary); the Llama Stack marker texts remain
an authoritative fallback when no persisting cache is configured (marker-only
mode, additive summaries, no fold).
- When the persisted summaries themselves exceed the threshold, they are folded
via recursively_resummarize and the fold is persisted with replace_summaries,
so it is computed once and reused rather than recomputed per request (R3).
- configured_conversation_cache() resolves the configured cache (or None) for
the endpoints.
- Wired into /v1/query, /v1/streaming_query, and /v1/responses. The A2A executor
stays marker-only: it has no resolved user_id for the (user_id, conversation_id)
cache key.
Adds 7 unit tests: cache-preferred reads, store-on-compaction, fold trigger and
persistence, no-fold-without-cache, marker fallback, and the cache resolver.
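The read path and the persisted fold can be sketched as below. This is an illustration under stated assumptions: the in-memory cache, the simplified method signatures (the real cache is keyed by user_id and conversation_id), the crude token estimate, and the `resummarize` callable are all hypothetical stand-ins.

```python
from typing import Callable, List, Optional


class InMemorySummaryCache:
    """Hypothetical cache exposing the three operations named above."""

    def __init__(self) -> None:
        self._summaries: List[str] = []

    def get_summaries(self) -> List[str]:
        return list(self._summaries)

    def store_summary(self, text: str) -> None:
        self._summaries.append(text)

    def replace_summaries(self, texts: List[str]) -> None:
        self._summaries = list(texts)


def estimate_tokens(text: str) -> int:
    """Crude placeholder estimate (about four characters per token)."""
    return len(text) // 4 + 1


def summaries_for_request(
    cache: Optional[InMemorySummaryCache],
    marker_texts: List[str],
    threshold: int,
    resummarize: Callable[[List[str]], str],
) -> List[str]:
    """Prefer cached summaries; fold and persist when they grow too large."""
    if cache is None:
        # Marker-only mode: additive summaries, no fold.
        return list(marker_texts)
    summaries = cache.get_summaries()
    if sum(estimate_tokens(s) for s in summaries) > threshold:
        folded = [resummarize(summaries)]
        # Persist the fold so later requests reuse it instead of recomputing.
        cache.replace_summaries(folded)
        return folded
    return summaries
```

In this sketch a second request against the same cache sees the single folded summary and does not invoke the summarizer again, which is the property R3 asks for.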
* LCORE-1572: address CodeRabbit review — list-form input tokens + clarity rename
- Count tokens for list-form ResponseInput (e.g. /v1/responses), not only the
string form, so compaction is not skipped on large item-list inputs that
could otherwise still hit HTTP 413. Adds _estimate_response_input_tokens and a
regression test.
- Rename CompactionResult.summarized to compacted: the flag means "served in
compacted / explicit-input mode" (set whenever the conversation has any
summary, reused or fresh), not "a summary was created this request". The old
name caused reviewer confusion about turn-persistence gating, which is correct
as written.
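For the first item, a token estimate that handles both input shapes might look like the sketch below. The `TextPart` and `InputMessage` types and the characters-per-token heuristic are assumptions for illustration; they are not the estimator from LCORE-1569 or the Llama Stack models.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    """Hypothetical content part carrying text."""

    text: str


@dataclass
class InputMessage:
    """Hypothetical list-form input item: content is a string or a list of parts."""

    role: str
    content: Union[str, List[TextPart]]


def estimate_text_tokens(text: str) -> int:
    """Crude placeholder: roughly four characters per token."""
    return len(text) // 4


def estimate_input_tokens(value: Union[str, List[InputMessage]]) -> int:
    """Estimate tokens for string-form and list-form input alike."""
    if isinstance(value, str):
        return estimate_text_tokens(value)
    total = 0
    for item in value:
        if isinstance(item.content, str):
            total += estimate_text_tokens(item.content)
        else:
            # Sum every text part so large item-list inputs are not counted as zero.
            total += sum(estimate_text_tokens(part.text) for part in item.content)
    return total
```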
* LCORE-1572: persist compacted streaming turns with structure (CodeRabbit #4)
In compacted mode the streaming endpoint persisted the completed turn as
flattened strings via append_turn_to_conversation, dropping attachments and
non-text output items, and double-storing for shield-blocked requests. Persist
the structured turn instead:
- Capture the response's structured output items onto TurnSummary.output_items
(set at response.completed, and to the refusal item on a shield block).
- generate_response now takes original_input and persists via store_compacted_turn
with the original input plus structured output items, matching the /v1/query
and A2A paths.
- The shield-blocked branch no longer stores the turn when the conversation
parameter was omitted (compacted mode); generate_response stores it once with
the correct original input, avoiding the duplicate refusal turn.
Adds tests for the structured compacted persistence and the shield dedup
(compacted and non-compacted).
* LCORE-1572: do not initialize the conversation cache when compaction is disabled
configured_conversation_cache() is evaluated eagerly as a call argument in the
query endpoint, so it ran on every request and accessed
configuration.conversation_cache unconditionally — forcing the (SQLite) cache to
initialize even when compaction is disabled. On configurations whose cache file
could not be opened that raised and returned HTTP 500, which failed the e2e
suites (where compaction is off). Return None without touching the cache when
compaction is disabled; the cache is only used by compaction on this path.
Adds a regression test.
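The underlying Python behavior is that call arguments are evaluated before the callee runs, so the guard has to live inside the resolver. A minimal sketch, with a hypothetical `Config` class standing in for the real configuration object:

```python
from typing import Optional


class Config:
    """Hypothetical configuration whose cache is initialized on first access."""

    def __init__(self, compaction_enabled: bool) -> None:
        self.compaction_enabled = compaction_enabled
        self.cache_initializations = 0

    @property
    def conversation_cache(self) -> object:
        # Stands in for opening the (SQLite) cache, which may raise.
        self.cache_initializations += 1
        return object()


def configured_cache(config: Config) -> Optional[object]:
    """Return the cache only when compaction is enabled."""
    if not config.compaction_enabled:
        # Return early without touching config.conversation_cache.
        return None
    return config.conversation_cache


def handle_query(query: str, cache: Optional[object]) -> bool:
    """Placeholder endpoint body: reports whether a cache was supplied."""
    return cache is not None


disabled = Config(compaction_enabled=False)
# configured_cache(disabled) is evaluated before handle_query is entered.
used_cache = handle_query("hello", cache=configured_cache(disabled))
```

With the early return, a request on a compaction-disabled configuration never triggers cache initialization, regardless of how eagerly the argument is evaluated.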
* LCORE-1572: address CodeRabbit round 2 (compacted-mode persistence edges)
Follow-ups to the streaming-persistence work, all for non-happy-path terminals in
compacted mode (conversation parameter omitted), so the persisted turn uses the
original user input + structured output rather than the explicit rewrite or
flattened strings:
- /v1/responses: shield-blocked turns persist against compacted_original_input,
not api_params.input (the explicit rewrite).
- streaming: interrupted (CancelledError) turns thread original_input through the
interrupt callback and persist structured items, fixing the wrong-input storage
and the cast(str, input) break on list inputs.
- streaming: capture output_items on response.failed / response.incomplete
terminals too, not only response.completed, so compacted persistence keeps
partial output.
- TurnSummary.output_items typed as list[OpenAIResponseOutput] instead of list[Any].
Also documents that disabling compaction mid-conversation on an already-compacted
conversation reverts it to full-history replay (unsupported transition); the
enabled flag stays a full off-switch (CodeRabbit E, declined by design).
Adds unit tests for the blocked /responses path, the interrupted compacted path,
and output_items capture on a failed terminal.
* LCORE-1572: document the disable-after-compaction limitation in the spec doc (CodeRabbit E)
* LCORE-1572: document as-built divergences in spec doc (cache source-of-truth, persisted fold)
The spec still described the earlier design (cache as a parallel/best-effort
layer, markers as the summary source). Update Summary storage, Additive
summarization, and Changed request flow to the as-built design, and add a
Changelog entry: the cache is the preferred source of truth for summaries (marker
texts as fallback + audit/boundary), the recursive fold is persisted via
replace_summaries (in-memory fold rejected), A2A is marker-only, and the
enabled flag stays a full off-switch.
* LCORE-1572: fix line-too-long (C0301) in interrupted-turn test docstring
* LCORE-1572: harden disabled-cache regression test to fail on eager cache access (CodeRabbit)
* LCORE-1572: ref-count per-conversation lock + extract apply_compaction helpers (review)
Addresses two inline review nits from tisnik on the LCORE-1572 PR.
Per-conversation lock cleanup (R11):
Replace the bare ``dict[str, asyncio.Lock]`` registry with a ref-counted
``_LockEntry`` and an ``@asynccontextmanager`` helper guarded by a registry
mutex. Entries are removed once the last waiter exits, so the registry no
longer grows unbounded with the set of conversation_ids ever seen by the
process. Adds tests for serialization, deletion-after-last-release,
entry-kept-while-waiters-queued, and cleanup-on-cancellation.
apply_compaction refactor:
Extract five helpers — ``_load_compaction_state``, ``_estimate_total_tokens``,
``_persist_new_summary_chunk``, ``_maybe_persist_fold``, ``_compacted_result``
— leaving the orchestrating generator linear and roughly one screen long.
The state-loading, token-estimation, persistence-side-effects, and result-
building concerns are now each named and individually testable.
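A ref-counted lock registry along the lines described above could be sketched as follows. The `ConversationLocks` class and its `hold` method are hypothetical, and using an asyncio.Lock as the registry mutex is an assumption; only the `_LockEntry` name comes from the commit message.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass, field
from typing import AsyncIterator, Dict


@dataclass
class _LockEntry:
    """A per-conversation lock plus the number of holders and waiters."""

    lock: asyncio.Lock = field(default_factory=asyncio.Lock)
    refs: int = 0


class ConversationLocks:
    """Hypothetical registry that drops entries once nobody references them."""

    def __init__(self) -> None:
        self._entries: Dict[str, _LockEntry] = {}
        self._mutex = asyncio.Lock()  # guards the registry itself

    @asynccontextmanager
    async def hold(self, conversation_id: str) -> AsyncIterator[None]:
        async with self._mutex:
            entry = self._entries.setdefault(conversation_id, _LockEntry())
            entry.refs += 1  # counted before waiting, so waiters keep it alive
        try:
            async with entry.lock:
                yield
        finally:
            # Runs on normal exit and on cancellation while waiting or holding.
            async with self._mutex:
                entry.refs -= 1
                if entry.refs == 0:
                    del self._entries[conversation_id]
```

Counting a waiter before it blocks on the per-conversation lock is what keeps the entry alive while waiters are queued; the entry disappears only after the last one leaves.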
* LCORE-1572: tighten typed-item handling in compaction helpers (review)
Addresses asimurka's review nit about the dual dict-or-model branches in
``_verbatim_input_message`` and the surrounding token-estimator helpers.
Llama Stack's ``client.conversations.items.list`` returns items as typed
Pydantic models (the ``ItemListResponse`` discriminated union). The dict
branches in ``is_message_item``, ``extract_message_text``,
``estimate_conversation_tokens``, ``format_conversation_for_summary`` and
``_verbatim_input_message`` were defensive code for a shape that never
arrives from production code paths — they only kept the dict-using test
fixtures alive.
Drop the dict branches and tighten the docstrings to state the typed-item
contract. Update the compaction test fixtures (``_msg``, ``_marker``) to
return ``OpenAIResponseMessage`` instances instead of dicts. Remove the
token-estimator and compaction tests that explicitly asserted dict-shape
acceptance; replace with single tests verifying that dicts are now ignored.
* LCORE-1572: soften R12 doc on silent /v1/responses compaction (review)
Addresses asimurka's review note: emitting a compaction event on the
``/v1/responses`` endpoint would itself be spec-compliant under the
OpenResponses extension-events convention, so framing silent compaction as a
forced choice for "wire compatibility" overstated the constraint. Reword R12
and the changelog entry to acknowledge the spec-compliant option and to frame
silent as the *initial* choice, kept to preserve drop-in compatibility with
clients written against the upstream OpenAI Responses API; emitting the event
on this endpoint is left open as a follow-up. Lightspeed's own clients can
already use ``/v1/streaming_query`` to receive the event.
1 parent 2606037 commit d1c79ab
18 files changed
Lines changed: 2325 additions & 183 deletions
File tree
- docs/design/conversation-compaction
- src
  - app/endpoints
  - models/common
    - responses
  - utils
- tests/unit
  - app/endpoints
  - utils