InternLM · lzhangzz · Jun 29, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
diff --git a/docs/en/inference/turbomind_config.md b/docs/en/inference/turbomind_config.md
@@ -105,6 +105,27 @@ Prefix caching feature is mainly applicable to scenarios where multiple requests
 
 Since k/v block is the smallest granularity for reuse in prefix caching, if the identical prompt prefix is less than one block (prefix length \< cache_block_seq_len), there will be no improvement in inference performance.
 
+### Partial-block boundary reuse
+
+Two mode knobs control whether partial-block prefix nodes are published at the prompt and generation boundaries: `cache_prompt` (`'all'` or `'auto'`, default `'auto'`) and `cache_generation` (`'all'`, `'auto'`, or `'none'`, default `'auto'`). Both require `enable_prefix_caching` and apply to any prefix-cached model: the published node carries the partial block's k/v, and on a recurrent/hybrid model (e.g. those with GatedDeltaNet layers) it additionally carries a recurrent-state checkpoint.
+
+Set `cache_prompt='all'` when the same prompt is processed repeatedly (multi-sample decoding, shared or system prompts), so that a duplicate prompt skips prefill. The default `'auto'` publishes the node only when the partial prompt block holds image tokens (reusing the vision-encoded KV) and has no effect on text-only prompts. Set `cache_generation='all'` when a later request must resume from the exact end of generation (e.g. multi-turn chat). `'auto'` (default) caches only full generated blocks, and `'none'` caches no generated blocks. A partial-block node costs extra VRAM and copy bandwidth, so prefer `'auto'` unless the reuse pays off.
+
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+
+backend_config = TurbomindEngineConfig(
+    enable_prefix_caching=True,
+    cache_prompt='all',
+    cache_generation='all',
+)
+pipe = pipeline('your-model', backend_config=backend_config)
+```
+
+- `cache_prompt`: partial prompt-boundary publication mode, `'all'` or `'auto'` (default `'auto'`). `'all'` publishes a reusable prompt-boundary node at `B = prompt_len - cache_prompt_boundary_skip` whenever `B` falls mid-block (and arms a recurrent-state checkpoint clamp when `B` is block-aligned), so a duplicate prompt skips prefill. `'auto'` does so only when the partial block holds image tokens (reusing the vision-encoded KV) and has no effect on text-only prompts. The cost is one extra prefill forward pass for the producing request and, when `B` falls mid-block, one partial cache block.
+- `cache_prompt_boundary_skip`: number of trailing prompt tokens treated as the volatile generation-prompt suffix (e.g. a chat template's `<think>\n`) and excluded from the reusable prompt-boundary node, moving it to `prompt_len - cache_prompt_boundary_skip`. Applies when `cache_prompt` is `'all'` or `'auto'`. Default 1 (exclude only the last token). Increase it for thinking models whose chat template appends a multi-token suffix that the next turn drops from history.
+- `cache_generation`: generated-block caching mode, `'all'`, `'auto'` (default), or `'none'`. `'all'` indexes full generated blocks and the terminal partial block, and adopts the terminal recurrent frontier checkpoint (exact multi-turn resume). `'auto'` indexes full generated blocks only. `'none'` indexes no generated blocks. Costs one partial cache block when `'all'` indexes a terminal partial block. Full-block checkpoints at block boundaries are always published regardless of this setting.
+
 ### kv quantization and inference switch
 
 - `quant_policy=4` means 4bit k/v quantization and inference

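Reviewer note on the documentation hunk above: the boundary arithmetic `B = prompt_len - cache_prompt_boundary_skip` can be illustrated with a small sketch. `prompt_boundary` and `PromptBoundary` are hypothetical names and not part of lmdeploy, the 64-token block length is only an example value for `cache_block_seq_len`, and clamping `B` at zero is an assumption the docs do not state.

```python
from dataclasses import dataclass


@dataclass
class PromptBoundary:
    position: int        # B = prompt_len - boundary_skip
    full_blocks: int     # whole cache blocks covered by the first B tokens
    partial_tokens: int  # tokens of B that spill into a partial block
    mid_block: bool      # True when B does not land on a block edge


def prompt_boundary(prompt_len: int, block_seq_len: int, boundary_skip: int = 1) -> PromptBoundary:
    """Locate the prompt-boundary node (hypothetical helper, for illustration only)."""
    if block_seq_len <= 0:
        raise ValueError('block_seq_len must be positive')
    # Assumption: a skip larger than the prompt clamps the boundary to 0.
    position = max(prompt_len - boundary_skip, 0)
    full_blocks, partial_tokens = divmod(position, block_seq_len)
    return PromptBoundary(position, full_blocks, partial_tokens, partial_tokens != 0)


# 130-token prompt, 64-token blocks, default skip of 1:
# B = 129, which is one token into the third block (mid-block).
print(prompt_boundary(130, 64))

# 129-token prompt: B = 128 is block-aligned, so no partial block is involved.
print(prompt_boundary(129, 64))
```

Under this reading, only the first case would cost a partial cache block with `cache_prompt='all'`; the second corresponds to the block-aligned branch the docs describe.
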
diff --git a/docs/zh_cn/inference/turbomind_config.md b/docs/zh_cn/inference/turbomind_config.md
@@ -107,6 +107,27 @@ cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_da
 
 由于前缀缓存对 k/v 重复利用的最小粒度是block，如果相同prompt前缀不足一个block（前缀长度\<`cache_block_seq_len`），则推理性能不会有提升。
 
+### 非整块边界复用
+
+有两个模式开关控制是否在 prompt 与生成边界处发布非整块（partial-block）前缀节点：`cache_prompt`（取 `'all'` 或 `'auto'`，默认 `'auto'`）与 `cache_generation`（取 `'all'`、`'auto'` 或 `'none'`，默认 `'auto'`）。二者都需要开启 `enable_prefix_caching`，且适用于所有开启前缀缓存的模型：所发布的节点携带该非整块的 k/v；对于循环/混合模型（例如包含 GatedDeltaNet 层的模型），该节点还会额外携带循环状态 checkpoint。
+
+将 `cache_prompt='all'` 用于同一 prompt 会被反复处理的场景（多次采样解码、共享/系统 prompt），使重复 prompt 跳过 prefill；默认 `'auto'` 仅对包含图像 token 的非整块 prompt 块生效（复用视觉编码 KV），对纯文本 prompt 无效果。将 `cache_generation='all'` 用于需要从生成精确末端恢复的场景（例如多轮对话）；`'auto'`（默认）仅缓存完整生成块；`'none'` 不缓存任何生成块。非整块节点会带来额外的显存与拷贝带宽开销，除非复用收益明确，否则建议使用 `'auto'`。
+
+```python
+from lmdeploy import pipeline, TurbomindEngineConfig
+
+backend_config = TurbomindEngineConfig(
+    enable_prefix_caching=True,
+    cache_prompt='all',
+    cache_generation='all',
+)
+pipe = pipeline('your-model', backend_config=backend_config)
+```
+
+- `cache_prompt`：非整块 prompt 边界发布模式，取 `'all'` 或 `'auto'`（默认 `'auto'`）。`'all'` 在 `B = prompt_len - cache_prompt_boundary_skip` 且 `B` 落在块内部时发布可复用的 prompt 边界节点（当 `B` 对齐块边界时还会启用循环状态 checkpoint 钳位），使重复 prompt 跳过 prefill。`'auto'` 仅当该非整块包含图像 token 时生效（复用视觉编码 KV），对纯文本 prompt 无效果。代价是产生该节点的请求需要额外一次 prefill 前向计算，以及（当 `B` 落在块内部时）一个非整块缓存块（partial 块）。
+- `cache_prompt_boundary_skip`：将 prompt 末尾的若干 token 视为易变的生成前缀后缀（例如 chat 模板的 `<think>\n`），从可复用的 prompt 边界节点中排除，使节点移动到 `prompt_len - cache_prompt_boundary_skip`。当 `cache_prompt` 为 `'all'` 或 `'auto'` 时生效。默认 1（仅排除最后一个 token）。对于 chat 模板会追加多 token 后缀、且下一轮历史会丢弃该后缀的思考型模型，可调大该值。
+- `cache_generation`：生成块缓存模式，取 `'all'`、`'auto'`（默认）或 `'none'`。`'all'` 索引完整生成块及末端非整块，并采用末端循环 frontier checkpoint（精确多轮恢复）。`'auto'` 仅索引完整生成块。`'none'` 不索引任何生成块。当 `'all'` 索引末端非整块时，代价是一个非整块缓存块（partial 块）。无论该设置如何，块边界处的整块 checkpoint 始终会发布。
+
 ### kv 量化推理开关
 
 `quant_policy`是 kv 量化和推理开关。

diff --git a/lmdeploy/cli/chat.py b/lmdeploy/cli/chat.py
@@ -1,11 +1,14 @@
 # Copyright (c) OpenMMLab. All rights reserved.
-from contextlib import closing
+import os
+from urllib.parse import urlparse
 
 import fire
 
 from lmdeploy import GenerationConfig, PytorchEngineConfig, TurbomindEngineConfig, pipeline
 from lmdeploy.archs import autoget_backend
 
+IMAGE_COMMAND = '/image'
+
 
 def input_prompt():
     """Input a prompt in the consolo interface."""
@@ -14,11 +17,82 @@ def input_prompt():
     return '\n'.join(iter(input, sentinel))
 
 
+def normalize_image_source(image_url: str):
+    """Resolve local relative image paths against the CLI working directory."""
+    if urlparse(image_url).scheme or os.path.isabs(image_url):
+        return image_url
+    return os.path.abspath(image_url)
+
+
+def parse_image_command(stripped_line: str):
+    """Parse one stripped /image command line into an OpenAI content block."""
+    if stripped_line == IMAGE_COMMAND:
+        raise ValueError('/image requires an image path or URL')
+    if not stripped_line.startswith(f'{IMAGE_COMMAND} '):
+        return None
+
+    image_url = stripped_line[len(IMAGE_COMMAND):].strip()
+    if not image_url:
+        raise ValueError('/image requires an image path or URL')
+    image_url = normalize_image_source(image_url)
+
+    return {
+        'kind': 'content',
+        'block': {
+            'type': 'image_url',
+            'image_url': {
+                'url': image_url
+            }
+        },
+    }
+
+
+def parse_prompt_line(line: str):
+    """Parse one raw prompt line into a text or content segment."""
+    stripped = line.strip()
+    for parse_command in (parse_image_command, ):
+        segment = parse_command(stripped)
+        if segment is not None:
+            return segment
+    return {'kind': 'text', 'text': line}
+
+
+def merge_text_segments(segments: list[dict]) -> list[dict]:
+    """Merge adjacent text segments into OpenAI text content blocks."""
+    content = []
+    pending_text = []
+
+    def append_pending_text():
+        if any(line.strip() for line in pending_text):
+            content.append({'type': 'text', 'text': '\n'.join(pending_text)})
+        pending_text.clear()
+
+    for segment in segments:
+        if segment['kind'] == 'text':
+            pending_text.append(segment['text'])
+            continue
+
+        append_pending_text()
+        content.append(segment['block'])
+
+    append_pending_text()
+    return content
+
+
+def parse_interactive_prompt(prompt: str):
+    """Parse a completed interactive chat prompt block.
+
+    Text-only prompts return the original string to preserve existing CLI behavior. Prompts with image commands return
+    ordered OpenAI multimodal content blocks.
+    """
+    segments = [parse_prompt_line(line) for line in prompt.splitlines()]
+    content = merge_text_segments(segments)
+    has_image = any(block['type'] == 'image_url' for block in content)
+    return content if has_image else prompt
+
+
 def build_pipe(model_path, backend, trust_remote_code=False, **kwargs):
     engine_config = None
-    if kwargs.get('enable_prefix_caching', False):
-        print('interactive chat cannot be used when prefix caching is enabled')
-        exit(-1)
     if backend == 'turbomind':
         engine_config = TurbomindEngineConfig()
         for key, value in kwargs.items():
@@ -74,35 +148,55 @@ def main(model_path, backend, trust_remote_code=False, **kwargs):
         # set auto backend mode
         backend = autoget_backend(model_path, trust_remote_code=trust_remote_code)
     quit = False
+    messages = []
     with build_pipe(model_path, backend, trust_remote_code=trust_remote_code, **kwargs) as pipe:
         gen_config = build_gen_config(**kwargs)
         adapter_name = get_adapter_name(**kwargs)
         while not quit:
-            with closing(pipe.session()) as sess:
-                while True:
-                    try:
-                        prompt = input_prompt()
-                    except KeyboardInterrupt:
-                        quit = True
-                        break
-                    if prompt == 'end':
-                        sess.close()
-                        break
-                    if prompt == 'exit':
-                        quit = True
-                        break
-                    if prompt.strip() == '':
-                        continue
-                    resps = pipe.chat(prompt,
-                                      session=sess,
+            try:
+                prompt = input_prompt()
+            except KeyboardInterrupt:
+                quit = True
+                continue
+            if prompt == 'end':
+                messages.clear()
+                continue
+            if prompt == 'exit':
+                quit = True
+                continue
+            if prompt.strip() == '':
+                continue
+
+            try:
+                content = parse_interactive_prompt(prompt)
+            except ValueError as exc:
+                print(f'Error: {exc}')
+                continue
+
+            messages.append({'role': 'user', 'content': content})
+            request_messages = [message.copy() for message in messages]
+            # This session is only a per-request cancellation handle; the
+            # conversation state lives in the Python transcript above.
+            request_session = pipe.session()
+            response_text = ''
+            resps = pipe.stream_infer(request_messages,
+                                      sessions=request_session,
                                       gen_config=gen_config,
                                       adapter_name=adapter_name,
-                                      stream_response=True)
-                    try:
-                        for resp in resps:
-                            print(resp.text, end='', flush=True)
-                    except KeyboardInterrupt:
-                        sess.abort()
+                                      stream_response=True,
+                                      sequence_start=True,
+                                      sequence_end=True)
+            try:
+                for resp in resps:
+                    print(resp.text, end='', flush=True)
+                    response_text += resp.text
+            except KeyboardInterrupt:
+                request_session.abort()
+                messages.pop()
+                print()
+                continue
+
+            messages.append({'role': 'assistant', 'content': response_text})
         else:
             print('exiting...')
 
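Reviewer note: to make the shape of the parsed prompt concrete, here is a condensed, standalone restatement of the `/image` parsing added in this file. It is a sketch, not the PR's code: `to_content` is a hypothetical name, the path normalization done by `normalize_image_source` is left out, and the URLs are placeholders.

```python
IMAGE_COMMAND = '/image'


def to_content(prompt: str):
    """Return the prompt string for text-only input, else ordered content blocks."""
    blocks, pending, has_image = [], [], False

    def flush():
        # Emit buffered text lines as one text block, skipping all-blank runs.
        if any(line.strip() for line in pending):
            blocks.append({'type': 'text', 'text': '\n'.join(pending)})
        pending.clear()

    for line in prompt.splitlines():
        stripped = line.strip()
        if stripped == IMAGE_COMMAND:
            raise ValueError('/image requires an image path or URL')
        if stripped.startswith(f'{IMAGE_COMMAND} '):
            flush()
            url = stripped[len(IMAGE_COMMAND):].strip()
            blocks.append({'type': 'image_url', 'image_url': {'url': url}})
            has_image = True
        else:
            pending.append(line)
    flush()
    return blocks if has_image else prompt


prompt = '\n'.join([
    'Describe this picture.',
    '/image https://example.com/cat.png',
    'Then compare it with:',
    '/image https://example.com/dog.png',
])
for block in to_content(prompt):
    print(block)
```

Text and image blocks keep their input order, and a prompt without any `/image` line comes back unchanged as a plain string, which matches the behavior described in the `parse_interactive_prompt` docstring.
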

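Reviewer note: the rewritten loop keeps the conversation in a Python-side transcript and rolls back the pending user message when generation is interrupted. The sketch below isolates that bookkeeping with stub generators in place of the engine. `run_turn`, `echo_stream`, and `interrupted_stream` are hypothetical names used only for illustration; nothing here calls lmdeploy.

```python
from typing import Callable, Iterable


def run_turn(messages: list, user_content, stream_fn: Callable[[list], Iterable[str]]) -> bool:
    """Run one chat turn; return True if the assistant reply was committed."""
    messages.append({'role': 'user', 'content': user_content})
    # Send shallow copies so the request cannot mutate the transcript entries.
    request_messages = [message.copy() for message in messages]
    reply = ''
    try:
        for chunk in stream_fn(request_messages):
            reply += chunk
    except KeyboardInterrupt:
        # Drop the unanswered user message so the transcript stays balanced.
        messages.pop()
        return False
    messages.append({'role': 'assistant', 'content': reply})
    return True


def echo_stream(request_messages):
    # Stub "model": streams a reply in two chunks.
    yield 'You said: '
    yield request_messages[-1]['content']


def interrupted_stream(request_messages):
    # Stub "model": simulates Ctrl-C arriving mid-stream.
    yield 'partial'
    raise KeyboardInterrupt


history = []
run_turn(history, 'hi', echo_stream)
run_turn(history, 'again', interrupted_stream)
print(history)
```

After both calls the transcript holds only the completed first exchange, so the next request replays a history of alternating user and assistant messages. With prefix caching enabled, that replayed history is where cache reuse would be expected to help.
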
diff --git a/lmdeploy/lib b/lmdeploy/lib
@@ -0,0 +1 @@
+/data/lmdeploy-memory/build/lib