99 changes: 0 additions & 99 deletions CLAUDE.md

This file was deleted.

21 changes: 21 additions & 0 deletions docs/en/inference/turbomind_config.md
@@ -105,6 +105,27 @@ Prefix caching feature is mainly applicable to scenarios where multiple requests

Because a k/v block is the smallest unit of reuse in prefix caching, a shared prompt prefix shorter than one block (prefix length \< cache_block_seq_len) yields no improvement in inference performance.
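
As a rough illustration of the block-granularity rule above, the sketch below computes how many tokens of a shared prefix are covered by whole k/v blocks. The helper name `reusable_prefix_tokens` and the block size of 64 are assumptions chosen for illustration; they are not part of the lmdeploy API.

```python
def reusable_prefix_tokens(shared_prefix_len: int, cache_block_seq_len: int = 64) -> int:
    """Tokens of a shared prefix covered by whole k/v blocks (hypothetical helper)."""
    if shared_prefix_len < 0:
        raise ValueError('shared_prefix_len must be non-negative')
    if cache_block_seq_len <= 0:
        raise ValueError('cache_block_seq_len must be positive')
    # Only complete blocks count; the trailing remainder is not reusable
    # under plain full-block prefix caching.
    full_blocks = shared_prefix_len // cache_block_seq_len
    return full_blocks * cache_block_seq_len


if __name__ == '__main__':
    for prefix_len in (50, 128, 200):
        print(prefix_len, reusable_prefix_tokens(prefix_len))
```

With the assumed block size, a 50-token shared prefix covers no complete block, which is the case the paragraph above describes.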

### Partial-block boundary reuse

Two mode knobs control whether partial-block prefix nodes are published at the prompt and generation boundaries: `cache_prompt` (`'all'` or `'auto'`, default `'auto'`) and `cache_generation` (`'all'`, `'auto'`, or `'none'`, default `'auto'`). Both require `enable_prefix_caching` and apply to any prefix-cached model: the published node carries the partial block's k/v, and on a recurrent/hybrid model (e.g. those with GatedDeltaNet layers) it additionally carries a recurrent-state checkpoint.

Set `cache_prompt='all'` when the same prompt is processed repeatedly (multi-sample decoding, shared/system prompts) so a duplicate prompt skips prefill; the default `'auto'` does this only for image-bearing partial prompt blocks (reusing vision-encoded KV) and is inert for text-only prompts. Set `cache_generation='all'` when you need to resume from the exact generation end (e.g. multi-turn chat); `'auto'` (default) caches full generated blocks only; `'none'` caches no generated blocks. The partial-block node costs extra VRAM and copy bandwidth, so prefer `'auto'` unless the reuse pays off.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    enable_prefix_caching=True,
    cache_prompt='all',
    cache_generation='all',
)
pipe = pipeline('your-model', backend_config=backend_config)
```

- `cache_prompt`: partial prompt-boundary publication mode, `'all'` or `'auto'` (default `'auto'`). `'all'` publishes a reusable prompt-boundary node at `B = prompt_len - cache_prompt_boundary_skip` whenever `B` is mid-block (and arms a recurrent-state checkpoint clamp when `B` is block-aligned), so a duplicate prompt skips prefill. `'auto'` does that only when the partial block holds image tokens (reusing vision-encoded KV) and is inert for text-only prompts. Costs one extra prefill forward for the producing request and (when `B` is mid-block) one partial cache block.
- `cache_prompt_boundary_skip`: number of trailing prompt tokens treated as the volatile generation-prompt suffix (e.g. a chat template's `<think>\n`) and excluded from the reusable prompt-boundary node, moving it to `prompt_len - cache_prompt_boundary_skip`. Applies when `cache_prompt` is `'all'` or `'auto'`. Default 1 (exclude only the last token). Increase it for thinking models whose chat template appends a multi-token suffix that the next turn drops from history.
- `cache_generation`: generated-block caching mode, `'all'`, `'auto'` (default), or `'none'`. `'all'` indexes full generated blocks and the terminal partial block, and adopts the terminal recurrent frontier checkpoint (exact multi-turn resume). `'auto'` indexes full generated blocks only. `'none'` indexes no generated blocks. Costs one partial cache block when `'all'` indexes a terminal partial block. Full-block checkpoints at block boundaries are always published regardless of this setting.
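
The boundary arithmetic described in the bullets above can be sketched as follows. This is a minimal illustration, not TurboMind's implementation: `prompt_boundary` and `Boundary` are hypothetical names, the block size of 64 is an assumed value for `cache_block_seq_len`, and clamping the boundary at zero is an assumption made only to keep the example total.

```python
from dataclasses import dataclass


@dataclass
class Boundary:
    position: int        # B = prompt_len - cache_prompt_boundary_skip
    mid_block: bool      # True when B falls strictly inside a block
    partial_tokens: int  # tokens in the partial block ending at B (0 if aligned)


def prompt_boundary(prompt_len: int, boundary_skip: int = 1, block_seq_len: int = 64) -> Boundary:
    """Locate the reusable prompt boundary B (hypothetical helper)."""
    if boundary_skip < 0:
        raise ValueError('boundary_skip must be non-negative')
    if block_seq_len <= 0:
        raise ValueError('block_seq_len must be positive')
    # Assumption: a skip longer than the prompt clamps the boundary to 0.
    position = max(prompt_len - boundary_skip, 0)
    partial_tokens = position % block_seq_len
    return Boundary(position, partial_tokens != 0, partial_tokens)


if __name__ == '__main__':
    print(prompt_boundary(130))                   # B lands inside a block
    print(prompt_boundary(129))                   # B lands on a block boundary
    print(prompt_boundary(131, boundary_skip=3))  # a larger skip moves B back
```

Under this reading, a mid-block `B` is the case in which `cache_prompt='all'` would publish a partial-block node, while a block-aligned `B` needs no partial block.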

### kv quantization and inference switch

- `quant_policy=4` means 4bit k/v quantization and inference
21 changes: 21 additions & 0 deletions docs/zh_cn/inference/turbomind_config.md
@@ -107,6 +107,27 @@ cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_da

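Because the block is the smallest unit at which prefix caching reuses k/v, a shared prompt prefix shorter than one block (prefix length \< `cache_block_seq_len`) yields no improvement in inference performance.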

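### Partial-block boundary reuse

Two mode switches control whether partial-block prefix nodes are published at the prompt and generation boundaries: `cache_prompt` (`'all'` or `'auto'`, default `'auto'`) and `cache_generation` (`'all'`, `'auto'`, or `'none'`, default `'auto'`). Both require `enable_prefix_caching` and apply to every model with prefix caching enabled: the published node carries the partial block's k/v, and on a recurrent/hybrid model (e.g. one containing GatedDeltaNet layers) it additionally carries a recurrent-state checkpoint.

Use `cache_prompt='all'` when the same prompt is processed repeatedly (multi-sample decoding, shared/system prompts) so that a duplicate prompt skips prefill; the default `'auto'` takes effect only for partial prompt blocks that contain image tokens (reusing vision-encoded KV) and has no effect on text-only prompts. Use `cache_generation='all'` when you need to resume from the exact end of generation (e.g. multi-turn chat); `'auto'` (default) caches full generated blocks only; `'none'` caches no generated blocks. Partial-block nodes cost extra VRAM and copy bandwidth, so prefer `'auto'` unless the reuse clearly pays off.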

```python
from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    enable_prefix_caching=True,
    cache_prompt='all',
    cache_generation='all',
)
pipe = pipeline('your-model', backend_config=backend_config)
```

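- `cache_prompt`: partial prompt-boundary publication mode, `'all'` or `'auto'` (default `'auto'`). `'all'` publishes a reusable prompt-boundary node at `B = prompt_len - cache_prompt_boundary_skip` when `B` falls inside a block (and also enables a recurrent-state checkpoint clamp when `B` is block-aligned), so that a duplicate prompt skips prefill. `'auto'` takes effect only when the partial block contains image tokens (reusing vision-encoded KV) and has no effect on text-only prompts. The cost is one extra prefill forward pass for the request that produces the node and, when `B` falls inside a block, one partial cache block.
- `cache_prompt_boundary_skip`: number of trailing prompt tokens treated as the volatile generation-prompt suffix (e.g. a chat template's `<think>\n`) and excluded from the reusable prompt-boundary node, moving the node to `prompt_len - cache_prompt_boundary_skip`. Takes effect when `cache_prompt` is `'all'` or `'auto'`. Default 1 (exclude only the last token). Increase it for thinking models whose chat template appends a multi-token suffix that the next turn's history drops.
- `cache_generation`: generated-block caching mode, `'all'`, `'auto'` (default), or `'none'`. `'all'` indexes full generated blocks and the terminal partial block, and adopts the terminal recurrent frontier checkpoint (exact multi-turn resume). `'auto'` indexes full generated blocks only. `'none'` indexes no generated blocks. When `'all'` indexes a terminal partial block, the cost is one partial cache block. Full-block checkpoints at block boundaries are always published regardless of this setting.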

### kv quantization and inference switch

`quant_policy` is the switch for kv quantization and inference.
148 changes: 121 additions & 27 deletions lmdeploy/cli/chat.py
@@ -1,11 +1,14 @@
# Copyright (c) OpenMMLab. All rights reserved.
from contextlib import closing
import os
from urllib.parse import urlparse

import fire

from lmdeploy import GenerationConfig, PytorchEngineConfig, TurbomindEngineConfig, pipeline
from lmdeploy.archs import autoget_backend

IMAGE_COMMAND = '/image'


def input_prompt():
    """Input a prompt in the console interface."""
@@ -14,11 +17,82 @@ def input_prompt():
return '\n'.join(iter(input, sentinel))


def normalize_image_source(image_url: str):
    """Resolve local relative image paths against the CLI working directory."""
    if urlparse(image_url).scheme or os.path.isabs(image_url):
        return image_url
    return os.path.abspath(image_url)


def parse_image_command(stripped_line: str):
    """Parse one stripped /image command line into an OpenAI content block."""
    if stripped_line == IMAGE_COMMAND:
        raise ValueError('/image requires an image path or URL')
    if not stripped_line.startswith(f'{IMAGE_COMMAND} '):
        return None

    image_url = stripped_line[len(IMAGE_COMMAND):].strip()
    if not image_url:
        raise ValueError('/image requires an image path or URL')
    image_url = normalize_image_source(image_url)

    return {
        'kind': 'content',
        'block': {
            'type': 'image_url',
            'image_url': {
                'url': image_url
            }
        },
    }


def parse_prompt_line(line: str):
    """Parse one raw prompt line into a text or content segment."""
    stripped = line.strip()
    for parse_command in (parse_image_command, ):
        segment = parse_command(stripped)
        if segment is not None:
            return segment
    return {'kind': 'text', 'text': line}


def merge_text_segments(segments: list[dict]) -> list[dict]:
    """Merge adjacent text segments into OpenAI text content blocks."""
    content = []
    pending_text = []

    def append_pending_text():
        if any(line.strip() for line in pending_text):
            content.append({'type': 'text', 'text': '\n'.join(pending_text)})
        pending_text.clear()

    for segment in segments:
        if segment['kind'] == 'text':
            pending_text.append(segment['text'])
            continue

        append_pending_text()
        content.append(segment['block'])

    append_pending_text()
    return content


def parse_interactive_prompt(prompt: str):
    """Parse a completed interactive chat prompt block.

    Text-only prompts return the original string to preserve existing CLI behavior. Prompts with image commands return
    ordered OpenAI multimodal content blocks.
    """
    segments = [parse_prompt_line(line) for line in prompt.splitlines()]
    content = merge_text_segments(segments)
    has_image = any(block['type'] == 'image_url' for block in content)
    return content if has_image else prompt


def build_pipe(model_path, backend, trust_remote_code=False, **kwargs):
engine_config = None
if kwargs.get('enable_prefix_caching', False):
print('interactive chat cannot be used when prefix caching is enabled')
exit(-1)
if backend == 'turbomind':
engine_config = TurbomindEngineConfig()
for key, value in kwargs.items():
@@ -74,35 +148,55 @@ def main(model_path, backend, trust_remote_code=False, **kwargs):
# set auto backend mode
backend = autoget_backend(model_path, trust_remote_code=trust_remote_code)
quit = False
messages = []
with build_pipe(model_path, backend, trust_remote_code=trust_remote_code, **kwargs) as pipe:
gen_config = build_gen_config(**kwargs)
adapter_name = get_adapter_name(**kwargs)
while not quit:
with closing(pipe.session()) as sess:
while True:
try:
prompt = input_prompt()
except KeyboardInterrupt:
quit = True
break
if prompt == 'end':
sess.close()
break
if prompt == 'exit':
quit = True
break
if prompt.strip() == '':
continue
resps = pipe.chat(prompt,
session=sess,
try:
prompt = input_prompt()
except KeyboardInterrupt:
quit = True
continue
if prompt == 'end':
messages.clear()
continue
if prompt == 'exit':
quit = True
continue
if prompt.strip() == '':
continue

try:
content = parse_interactive_prompt(prompt)
except ValueError as exc:
print(f'Error: {exc}')
continue

messages.append({'role': 'user', 'content': content})
request_messages = [message.copy() for message in messages]
# This session is only a per-request cancellation handle; the
# conversation state lives in the Python transcript above.
request_session = pipe.session()
response_text = ''
resps = pipe.stream_infer(request_messages,
sessions=request_session,
gen_config=gen_config,
adapter_name=adapter_name,
stream_response=True)
try:
for resp in resps:
print(resp.text, end='', flush=True)
except KeyboardInterrupt:
sess.abort()
stream_response=True,
sequence_start=True,
sequence_end=True)
try:
for resp in resps:
print(resp.text, end='', flush=True)
response_text += resp.text
except KeyboardInterrupt:
request_session.abort()
messages.pop()
print()
continue

messages.append({'role': 'assistant', 'content': response_text})
else:
print('exiting...')
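
To make the `/image` handling in this diff easier to follow, here is a condensed, self-contained sketch of the same idea: lines of the form `/image <path-or-url>` become OpenAI-style `image_url` blocks, and adjacent text lines merge into `text` blocks. It is a simplified restatement rather than the PR's code; `to_content` is a hypothetical name, and the path normalization step is omitted.

```python
IMAGE_COMMAND = '/image'


def to_content(prompt: str):
    """Return the prompt unchanged if it has no image command, else a list of content blocks.

    Hypothetical helper that condenses the per-line parsing and text merging into one pass.
    """
    content, pending, has_image = [], [], False

    def flush():
        # Emit buffered text only if it holds something besides whitespace.
        if any(line.strip() for line in pending):
            content.append({'type': 'text', 'text': '\n'.join(pending)})
        pending.clear()

    for line in prompt.splitlines():
        stripped = line.strip()
        if stripped == IMAGE_COMMAND:
            raise ValueError('/image requires an image path or URL')
        if stripped.startswith(IMAGE_COMMAND + ' '):
            flush()
            url = stripped[len(IMAGE_COMMAND):].strip()
            content.append({'type': 'image_url', 'image_url': {'url': url}})
            has_image = True
        else:
            pending.append(line)
    flush()
    return content if has_image else prompt


if __name__ == '__main__':
    print(to_content('What is in this picture?\n/image https://example.com/cat.png'))
```

Returning the original string for text-only prompts mirrors the behavior stated in the `parse_interactive_prompt` docstring, so existing text-only usage is unaffected.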

1 change: 1 addition & 0 deletions lmdeploy/lib