diff --git a/CLAUDE.md b/CLAUDE.md deleted file mode 100644 index 6a572ff93f..0000000000 --- a/CLAUDE.md +++ /dev/null @@ -1,99 +0,0 @@ -# CLAUDE.md - -This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. - -## Commands - -**Linting:** - -```bash -pre-commit run --all-files -``` - -Style: PEP8, max line length 120, double quotes, LF endings. C++ source under `src/` uses clang-format. - -**Tests:** - -```bash -pytest tests/test_lmdeploy # all unit tests -pytest tests/test_lmdeploy/test_model.py # specific file -pytest tests/test_lmdeploy/test_lite/ # quantization tests -pytest tests/test_lmdeploy/test_vl/ # vision-language tests -``` - -**Debug logging:** - -```bash -LMDEPLOY_LOG_LEVEL=DEBUG python ... -``` - -**Build (TurboMind C++ extension):** - -- Controlled via `setup.py` + CMake. Relevant env vars: `LMDEPLOY_TARGET_DEVICE` (default `cuda`), `DISABLE_TURBOMIND`, `CMAKE_BUILD_TYPE`, `CUDACXX`. -- Requirements split by device: `requirements/runtime_cuda.txt`, `runtime_ascend.txt`, etc. - -## Architecture - -### Two Backends, One Pipeline - -`lmdeploy/pipeline.py` is the main user-facing entry point (`pipeline()` in `api.py`). It instantiates either the **PyTorch engine** (`lmdeploy/pytorch/`) or the **TurboMind engine** (`lmdeploy/turbomind/`) based on config. - -### PyTorch Backend - -**Model patching** is the core mechanism: HuggingFace models are loaded normally, then their layers are dynamically replaced with optimized LMDeploy implementations. - -- `lmdeploy/pytorch/models/module_map.py` — registry mapping HF class names → LMDeploy replacement classes. Device-specific overrides in `DEVICE_SPECIAL_MODULE_MAP`. -- `lmdeploy/pytorch/models/patch.py` — applies the substitutions at runtime via `_get_rewrite_qualname()` / `_class_from_qualname()`. -- `lmdeploy/pytorch/models/` — 40+ per-model files (e.g., `llama.py`, `qwen.py`, `deepseek_v2.py`). 
Each reimplements attention, MLP, and embeddings using custom kernels. -- `lmdeploy/pytorch/nn/` — reusable optimized modules: `linear/` (AWQ, W8A8, blocked-FP8, LoRA variants), `attention.py`, `norm.py`, `rotary_embedding.py`, `moe/`. -- `lmdeploy/pytorch/kernels/` — Triton/CUDA kernels (e.g., `w8a8_triton_kernels.py`). -- `lmdeploy/pytorch/backends/` — kernel/operator dispatchers per quantization type (FP8, AWQ, CUDA). - -**Engine execution flow (key files):** - -- `engine.py` — main PyTorch engine. -- `paging/scheduler.py` — sequences → batches; prefill/decode, block eviction, prefix caching (`BlockTrie`). -- `engine/engine_loop.py` — async inference loop. -- (See `pytorch/engine/` and `pytorch/paging/` for full execution detail.) - -**Configuration dataclasses** (`lmdeploy/pytorch/config.py`): `ModelConfig`, `CacheConfig`, `SchedulerConfig`, `BackendConfig`, `DistConfig`, `MiscConfig`. - -### TurboMind Backend - -- Python wrapper: `lmdeploy/turbomind/turbomind.py` (~800 lines). Bridges into `lmdeploy/lib/_turbomind` (pybind11 extension built from `src/turbomind/`). -- Tensor interop via `torch.from_dlpack()` / `_tm.from_dlpack()`. -- Config and model conversion: `lmdeploy/turbomind/deploy/config.py`, `supported_models.py`. -- Parallel config helpers: `update_parallel_config()`, `complete_parallel_config()` in `messages.py`. - -### Lite / Quantization - -Entrypoints in `lmdeploy/lite/apis/`: `calibrate.py` (main), `auto_awq.py`, `gptq.py`, `smooth_quant.py`. - -**Flow:** load HF model → `CalibrationContext` collects activation statistics → scale computation (`lmdeploy/lite/quantization/`) → write quantized weights. - -- `lite/quantization/awq.py` — AWQ (NORM_FCS_MAP, FC_FCS_MAP define per-model layer structure). -- `lite/quantization/weight/quantizer.py` — weight quantizer. -- `lite/quantization/activation/observer.py` — activation statistics. -- `lite/modeling/` — model-specific GPTQ implementations (e.g., `internlm2_gptq.py`). 
-- `lite/utils/cal_qparams.py` — quantization parameter calculation utilities. - -Layer/norm/head mappings per model family are defined directly in `calibrate.py` and `awq.py`. - -### Vision-Language Models - -- `lmdeploy/vl/model/` — VLM preprocessing (InternVL, Qwen-VL, LLaVA, CogVLM, etc.). -- `lmdeploy/vl/media/` — image/video loaders and base classes. -- `lmdeploy/pytorch/multimodal/` — multimodal input handling for the PyTorch engine. -- Reference VLM implementation: `lmdeploy/vl/model/qwen3.py`. - -### Other Key Files - -- `lmdeploy/messages.py` — core types: `GenerationConfig`, `EngineConfig`, `TurbomindEngineConfig`, `SchedulerSequence`, `MessageStatus`. -- `lmdeploy/model.py` — chat templates; critical for correct conversation formatting. -- `lmdeploy/archs.py` — architecture registry mapping model arch names to runtime patches. -- `lmdeploy/tokenizer.py` — HuggingFace/SentencePiece tokenizer wrapper. -- `lmdeploy/serve/openai/` — OpenAI-compatible API server. - -## Adding a New PyTorch Model - -Use the `/support-new-model` skill for a complete step-by-step guide. diff --git a/docs/en/inference/turbomind_config.md b/docs/en/inference/turbomind_config.md index 5f549fb0c7..bbeb72cd6b 100644 --- a/docs/en/inference/turbomind_config.md +++ b/docs/en/inference/turbomind_config.md @@ -105,6 +105,27 @@ Prefix caching feature is mainly applicable to scenarios where multiple requests Since k/v block is the smallest granularity for reuse in prefix caching, if the identical prompt prefix is less than one block (prefix length \< cache_block_seq_len), there will be no improvement in inference performance. +### Partial-block boundary reuse + +Two mode knobs control whether partial-block prefix nodes are published at the prompt and generation boundaries: `cache_prompt` (`'all'` or `'auto'`, default `'auto'`) and `cache_generation` (`'all'`, `'auto'`, or `'none'`, default `'auto'`). 
Both require `enable_prefix_caching` and apply to any prefix-cached model: the published node carries the partial block's k/v, and on a recurrent/hybrid model (e.g. those with GatedDeltaNet layers) it additionally carries a recurrent-state checkpoint. + +Set `cache_prompt='all'` when the same prompt is processed repeatedly (multi-sample decoding, shared/system prompts) so a duplicate prompt skips prefill; the default `'auto'` does this only for image-bearing partial prompt blocks (reusing vision-encoded KV) and is inert for text-only prompts. Set `cache_generation='all'` when you need to resume from the exact generation end (e.g. multi-turn chat); `'auto'` (default) caches full generated blocks only; `'none'` caches no generated blocks. The partial-block node costs extra VRAM and copy bandwidth, so prefer `'auto'` unless the reuse pays off. + +```python +from lmdeploy import pipeline, TurbomindEngineConfig + +backend_config = TurbomindEngineConfig( + enable_prefix_caching=True, + cache_prompt='all', + cache_generation='all', +) +pipe = pipeline('your-model', backend_config=backend_config) +``` + +- `cache_prompt`: partial prompt-boundary publication mode, `'all'` or `'auto'` (default `'auto'`). `'all'` publishes a reusable prompt-boundary node at `B = prompt_len - cache_prompt_boundary_skip` whenever `B` is mid-block (and arms a recurrent-state checkpoint clamp when `B` is block-aligned), so a duplicate prompt skips prefill. `'auto'` does that only when the partial block holds image tokens (reusing vision-encoded KV) and is inert for text-only prompts. Costs one extra prefill forward for the producing request and (when `B` is mid-block) one partial cache block. +- `cache_prompt_boundary_skip`: number of trailing prompt tokens treated as the volatile generation-prompt suffix (e.g. a chat template's `\n`) and excluded from the reusable prompt-boundary node, moving it to `prompt_len - cache_prompt_boundary_skip`. Applies when `cache_prompt` is `'all'` or `'auto'`. 
Default 1 (exclude only the last token). Increase it for thinking models whose chat template appends a multi-token suffix that the next turn drops from history. +- `cache_generation`: generated-block caching mode, `'all'`, `'auto'` (default), or `'none'`. `'all'` indexes full generated blocks and the terminal partial block, and adopts the terminal recurrent frontier checkpoint (exact multi-turn resume). `'auto'` indexes full generated blocks only. `'none'` indexes no generated blocks. Costs one partial cache block when `'all'` indexes a terminal partial block. Full-block checkpoints at block boundaries are always published regardless of this setting. + ### kv quantization and inference switch - `quant_policy=4` means 4bit k/v quantization and inference diff --git a/docs/zh_cn/inference/turbomind_config.md b/docs/zh_cn/inference/turbomind_config.md index 1033ecaa6d..e66c57af40 100644 --- a/docs/zh_cn/inference/turbomind_config.md +++ b/docs/zh_cn/inference/turbomind_config.md @@ -107,6 +107,27 @@ cache_block_seq_len * num_layer * kv_head_num * size_per_head * 2 * sizeof(kv_da 由于前缀缓存对 k/v 重复利用的最小粒度是block,如果相同prompt前缀不足一个block(前缀长度\<`cache_block_seq_len`),则推理性能不会有提升。 +### 非整块边界复用 + +有两个模式开关控制是否在 prompt 与生成边界处发布非整块(partial-block)前缀节点:`cache_prompt`(取 `'all'` 或 `'auto'`,默认 `'auto'`)与 `cache_generation`(取 `'all'`、`'auto'` 或 `'none'`,默认 `'auto'`)。二者都需要开启 `enable_prefix_caching`,且适用于所有开启前缀缓存的模型:所发布的节点携带该非整块的 k/v;对于循环/混合模型(例如包含 GatedDeltaNet 层的模型),该节点还会额外携带循环状态 checkpoint。 + +将 `cache_prompt='all'` 用于同一 prompt 会被反复处理的场景(多次采样解码、共享/系统 prompt),使重复 prompt 跳过 prefill;默认 `'auto'` 仅对包含图像 token 的非整块 prompt 块生效(复用视觉编码 KV),对纯文本 prompt 无效果。将 `cache_generation='all'` 用于需要从生成精确末端恢复的场景(例如多轮对话);`'auto'`(默认)仅缓存完整生成块;`'none'` 不缓存任何生成块。非整块节点会带来额外的显存与拷贝带宽开销,除非复用收益明确,否则建议使用 `'auto'`。 + +```python +from lmdeploy import pipeline, TurbomindEngineConfig + +backend_config = TurbomindEngineConfig( + enable_prefix_caching=True, + cache_prompt='all', + cache_generation='all', +) +pipe = 
pipeline('your-model', backend_config=backend_config) +``` + +- `cache_prompt`:非整块 prompt 边界发布模式,取 `'all'` 或 `'auto'`(默认 `'auto'`)。`'all'` 在 `B = prompt_len - cache_prompt_boundary_skip` 且 `B` 落在块内部时发布可复用的 prompt 边界节点(当 `B` 对齐块边界时还会启用循环状态 checkpoint 钳位),使重复 prompt 跳过 prefill。`'auto'` 仅当该非整块包含图像 token 时生效(复用视觉编码 KV),对纯文本 prompt 无效果。代价是产生该节点的请求需要额外一次 prefill 前向计算,以及(当 `B` 落在块内部时)一个非整块缓存块(partial 块)。 +- `cache_prompt_boundary_skip`:将 prompt 末尾的若干 token 视为易变的生成前缀后缀(例如 chat 模板的 `\n`),从可复用的 prompt 边界节点中排除,使节点移动到 `prompt_len - cache_prompt_boundary_skip`。当 `cache_prompt` 为 `'all'` 或 `'auto'` 时生效。默认 1(仅排除最后一个 token)。对于 chat 模板会追加多 token 后缀、且下一轮历史会丢弃该后缀的思考型模型,可调大该值。 +- `cache_generation`:生成块缓存模式,取 `'all'`、`'auto'`(默认)或 `'none'`。`'all'` 索引完整生成块及末端非整块,并采用末端循环 frontier checkpoint(精确多轮恢复)。`'auto'` 仅索引完整生成块。`'none'` 不索引任何生成块。当 `'all'` 索引末端非整块时,代价是一个非整块缓存块(partial 块)。无论该设置如何,块边界处的整块 checkpoint 始终会发布。 + ### kv 量化推理开关 `quant_policy`是 kv 量化和推理开关。 diff --git a/lmdeploy/cli/chat.py b/lmdeploy/cli/chat.py index 099f0825cc..b2a906a97b 100644 --- a/lmdeploy/cli/chat.py +++ b/lmdeploy/cli/chat.py @@ -1,11 +1,14 @@ # Copyright (c) OpenMMLab. All rights reserved. 
-from contextlib import closing +import os +from urllib.parse import urlparse import fire from lmdeploy import GenerationConfig, PytorchEngineConfig, TurbomindEngineConfig, pipeline from lmdeploy.archs import autoget_backend +IMAGE_COMMAND = '/image' + def input_prompt(): """Input a prompt in the consolo interface.""" @@ -14,11 +17,82 @@ def input_prompt(): return '\n'.join(iter(input, sentinel)) +def normalize_image_source(image_url: str): + """Resolve local relative image paths against the CLI working directory.""" + if urlparse(image_url).scheme or os.path.isabs(image_url): + return image_url + return os.path.abspath(image_url) + + +def parse_image_command(stripped_line: str): + """Parse one stripped /image command line into an OpenAI content block.""" + if stripped_line == IMAGE_COMMAND: + raise ValueError('/image requires an image path or URL') + if not stripped_line.startswith(f'{IMAGE_COMMAND} '): + return None + + image_url = stripped_line[len(IMAGE_COMMAND):].strip() + if not image_url: + raise ValueError('/image requires an image path or URL') + image_url = normalize_image_source(image_url) + + return { + 'kind': 'content', + 'block': { + 'type': 'image_url', + 'image_url': { + 'url': image_url + } + }, + } + + +def parse_prompt_line(line: str): + """Parse one raw prompt line into a text or content segment.""" + stripped = line.strip() + for parse_command in (parse_image_command, ): + segment = parse_command(stripped) + if segment is not None: + return segment + return {'kind': 'text', 'text': line} + + +def merge_text_segments(segments: list[dict]) -> list[dict]: + """Merge adjacent text segments into OpenAI text content blocks.""" + content = [] + pending_text = [] + + def append_pending_text(): + if any(line.strip() for line in pending_text): + content.append({'type': 'text', 'text': '\n'.join(pending_text)}) + pending_text.clear() + + for segment in segments: + if segment['kind'] == 'text': + pending_text.append(segment['text']) + continue + + 
append_pending_text() + content.append(segment['block']) + + append_pending_text() + return content + + +def parse_interactive_prompt(prompt: str): + """Parse a completed interactive chat prompt block. + + Text-only prompts return the original string to preserve existing CLI behavior. Prompts with image commands return + ordered OpenAI multimodal content blocks. + """ + segments = [parse_prompt_line(line) for line in prompt.splitlines()] + content = merge_text_segments(segments) + has_image = any(block['type'] == 'image_url' for block in content) + return content if has_image else prompt + + def build_pipe(model_path, backend, trust_remote_code=False, **kwargs): engine_config = None - if kwargs.get('enable_prefix_caching', False): - print('interactive chat cannot be used when prefix caching is enabled') - exit(-1) if backend == 'turbomind': engine_config = TurbomindEngineConfig() for key, value in kwargs.items(): @@ -74,35 +148,55 @@ def main(model_path, backend, trust_remote_code=False, **kwargs): # set auto backend mode backend = autoget_backend(model_path, trust_remote_code=trust_remote_code) quit = False + messages = [] with build_pipe(model_path, backend, trust_remote_code=trust_remote_code, **kwargs) as pipe: gen_config = build_gen_config(**kwargs) adapter_name = get_adapter_name(**kwargs) while not quit: - with closing(pipe.session()) as sess: - while True: - try: - prompt = input_prompt() - except KeyboardInterrupt: - quit = True - break - if prompt == 'end': - sess.close() - break - if prompt == 'exit': - quit = True - break - if prompt.strip() == '': - continue - resps = pipe.chat(prompt, - session=sess, + try: + prompt = input_prompt() + except KeyboardInterrupt: + quit = True + continue + if prompt == 'end': + messages.clear() + continue + if prompt == 'exit': + quit = True + continue + if prompt.strip() == '': + continue + + try: + content = parse_interactive_prompt(prompt) + except ValueError as exc: + print(f'Error: {exc}') + continue + + 
messages.append({'role': 'user', 'content': content}) + request_messages = [message.copy() for message in messages] + # This session is only a per-request cancellation handle; the + # conversation state lives in the Python transcript above. + request_session = pipe.session() + response_text = '' + resps = pipe.stream_infer(request_messages, + sessions=request_session, gen_config=gen_config, adapter_name=adapter_name, - stream_response=True) - try: - for resp in resps: - print(resp.text, end='', flush=True) - except KeyboardInterrupt: - sess.abort() + stream_response=True, + sequence_start=True, + sequence_end=True) + try: + for resp in resps: + print(resp.text, end='', flush=True) + response_text += resp.text + except KeyboardInterrupt: + request_session.abort() + messages.pop() + print() + continue + + messages.append({'role': 'assistant', 'content': response_text}) else: print('exiting...') diff --git a/lmdeploy/lib b/lmdeploy/lib new file mode 120000 index 0000000000..7042cd2673 --- /dev/null +++ b/lmdeploy/lib @@ -0,0 +1 @@ +/data/lmdeploy-memory/build/lib \ No newline at end of file diff --git a/lmdeploy/messages.py b/lmdeploy/messages.py index 9486cc62b0..f850dc1a62 100644 --- a/lmdeploy/messages.py +++ b/lmdeploy/messages.py @@ -247,6 +247,29 @@ class TurbomindEngineConfig: a k/v block, default to 64 enable_prefix_caching: enable cache prompts for block reuse, default to False + cache_checkpoint_interval: minimum token gap between reusable + recurrent-state checkpoints (CacheRegistry checkpoint_min_interval). + Must be > 0. Default 4096. + cache_prompt: partial prompt-boundary publication mode, one of + 'all' | 'auto'. 'all' publishes the reusable partial fork_to node at + B = prompt_len - cache_prompt_boundary_skip whenever B is mid-block + (and arms a recurrent-state checkpoint clamp when B is block-aligned), + so a duplicate prompt skips prefill (costs one extra prefill forward + + a partial block). 
'auto' (default) does that only when the partial + block holds image tokens (reusing vision-encoded KV) and is inert for + text-only prompts. Requires enable_prefix_caching. + cache_prompt_boundary_skip: number of trailing prompt tokens treated as + the volatile generation-prompt suffix (e.g. a chat template's + `\n`) and excluded from the reusable prompt-boundary node, so + the node ends at prompt_len - cache_prompt_boundary_skip. Default 1 + (exclude only the last token). Applies when cache_prompt is 'all' or + 'auto'. + cache_generation: generated-block caching mode, one of + 'all' | 'auto' | 'none'. 'all' indexes full generated blocks and the + terminal partial block, and adopts the terminal recurrent frontier + checkpoint (exact multi-turn resume, costs a partial block). 'auto' + (default) indexes full generated blocks only. 'none' indexes no + generated blocks at all. Requires enable_prefix_caching. quant_policy: default to 0. For TurboMind, when k/v is quantized into int4 or int8, set it to 4 or 8, respectively rope_scaling_factor: scaling factor used for dynamic ntk, @@ -297,6 +320,10 @@ class TurbomindEngineConfig: cache_chunk_size: int = -1 cache_block_seq_len: int = 64 enable_prefix_caching: bool = False + cache_checkpoint_interval: int = 4096 + cache_prompt: str = 'auto' + cache_prompt_boundary_skip: int = 1 + cache_generation: str = 'auto' quant_policy: int = 0 rope_scaling_factor: float = 0.0 use_logn_attn: bool = False @@ -330,6 +357,10 @@ def __post_init__(self): assert self.max_prefill_token_num >= 0, \ 'invalid max_prefill_token_num' assert self.num_tokens_per_iter >= 0, 'invalid num_tokens_per_iter' + assert self.cache_prompt in ('all', 'auto'), 'invalid cache_prompt' + assert self.cache_generation in ('all', 'auto', 'none'), 'invalid cache_generation' + assert self.cache_checkpoint_interval > 0, 'invalid cache_checkpoint_interval' + assert self.cache_prompt_boundary_skip >= 1, 'invalid cache_prompt_boundary_skip' assert self.async_ in (0, 1), 
'async_ must be 0 (disabled) or 1 (enabled)' @@ -528,8 +559,10 @@ class ResponseType(enum.Enum): INPUT_LENGTH_ERROR = enum.auto() INTERNAL_ENGINE_ERROR = enum.auto() CANCEL = enum.auto() - PREFIX_CACHE_CONFLICT_INTERACTIVE_MODE = enum.auto() + PREFIX_CACHE_CONFLICT = enum.auto() NO_QUEUE = enum.auto() + NOT_SUPPORTED = enum.auto() + OUT_OF_MEMORY = enum.auto() @dataclass diff --git a/lmdeploy/metrics/metrics_processor.py b/lmdeploy/metrics/metrics_processor.py index 3973c6b82d..83dfa232f5 100644 --- a/lmdeploy/metrics/metrics_processor.py +++ b/lmdeploy/metrics/metrics_processor.py @@ -83,6 +83,10 @@ async def _run_metrics_handler(self): async def update_schedule_stats(self, schedule_metrics: ScheduleMetrics): """Update schedule stats.""" + if schedule_metrics is None: + # Backend scheduler metrics unavailable (e.g. TurboMind metrics revival is + # deferred); skip the schedule-stat update rather than crash the logger loop. + return self.scheduler_stats.update_from_schedule_metrics(schedule_metrics) # record schedule stats for stat_logger in self.stat_loggers: diff --git a/lmdeploy/serve/core/vl_async_engine.py b/lmdeploy/serve/core/vl_async_engine.py index d246a20f75..6709321785 100644 --- a/lmdeploy/serve/core/vl_async_engine.py +++ b/lmdeploy/serve/core/vl_async_engine.py @@ -31,11 +31,15 @@ def __init__(self, backend_config=backend_config, trust_remote_code=trust_remote_code) if backend_config and backend_config.enable_prefix_caching: - supports_prefix_caching = backend == 'pytorch' and getattr(self.vl_encoder, '_uses_new_preprocess', False) - if not supports_prefix_caching: + native_tm_vision = (backend == 'turbomind' + and getattr(self.vl_encoder.model, '_turbomind_native_vision', False)) + pytorch_new_preprocess = (backend == 'pytorch' + and getattr(self.vl_encoder, '_uses_new_preprocess', False)) + if not (native_tm_vision or pytorch_new_preprocess): backend_config.enable_prefix_caching = False logger.warning('Prefix caching is disabled for this VL model 
path. ' - 'Only PyTorch new-preprocess multimodal inputs are supported.') + 'Supported: TurboMind native-vision models and PyTorch new-preprocess ' + 'multimodal inputs.') super().__init__(model_path, backend=backend, backend_config=backend_config, diff --git a/lmdeploy/turbomind/models/qwen3_5.py b/lmdeploy/turbomind/models/qwen3_5.py index 0d13f372ae..8f54e5d48f 100644 --- a/lmdeploy/turbomind/models/qwen3_5.py +++ b/lmdeploy/turbomind/models/qwen3_5.py @@ -19,8 +19,10 @@ """ from __future__ import annotations +import hashlib import math import re +import struct from typing import TYPE_CHECKING, Any import _turbomind as _tm @@ -328,6 +330,49 @@ def _split_packed_vision_qkv(qkv): return tuple(x.contiguous() for x in qkv.chunk(3, dim=-1)) +def _image_fingerprint(input_mm: dict) -> bytes: + """SHA-256 over the Qwen3.5 ViT-forward inputs plus the mRoPE scalar. + + Post-preprocess (phase A): every input is already on the item dict. Two requests hash equal iff their ViT embeddings + and cached LM KV for the image span are identical -- i.e. reuse is correct. + """ + modality = input_mm['modality'] + is_video = modality in (Modality.VIDEO, Modality.VIDEO.value) + pv = input_mm['pixel_values_videos'] if is_video else input_mm['pixel_values'] + gthw = input_mm['video_grid_thw'] if is_video else input_mm['image_grid_thw'] + if isinstance(gthw, torch.Tensor): + values = gthw.flatten().tolist() + else: + values = list(gthw) + t, h, w = int(values[0]), int(values[1]), int(values[2]) + spg = input_mm.get('second_per_grid') # video only; float | None + + h_obj = hashlib.sha256() + h_obj.update(struct.pack(' same bytes; the dtype is constant per engine instance. 
+ h_obj.update(pv.contiguous().cpu().view(torch.uint8).numpy().tobytes()) + return h_obj.digest() # 32 bytes; never all-zero + + +def _resolve_fingerprint(input_mm: dict) -> bytes: + """Use a pre-placed fingerprint if present (future pre-preprocess + generator, or a test forcing empty/dormant); otherwise derive it from the + ViT inputs. + + `is not None` (not `or`) so an explicit b'' stays empty rather than falling + through to compute -- the empty-fingerprint sentinel must be preserved + (empty never compares equal -> image-span reuse stays dormant). + """ + fp = input_mm.get('fingerprint') + return fp if fp is not None else _image_fingerprint(input_mm) + + class Qwen3_5VisionModel(TextModel): """Vision sub-tree for Qwen3.5 VLM, rooted at ModelRoot.vision_model. @@ -401,6 +446,7 @@ def to_turbomind_multimodal(self, multimodal: list[dict[str, Any]]): raise ValueError(f'Qwen3.5 TurboMind does not support modality {modality!r}') token_begin, token_end = self._offset_pair(input_mm['offset']) + fingerprint = _resolve_fingerprint(input_mm) items.append( _tm.multimodal.Qwen3_5VitItem( modality=tm_modality, @@ -408,6 +454,7 @@ def to_turbomind_multimodal(self, multimodal: list[dict[str, Any]]): token_begin=token_begin, token_end=token_end, grid_thw=grid_thw, + fingerprint=fingerprint, )) return _tm.multimodal.Qwen3_5VitInput(items) diff --git a/lmdeploy/turbomind/turbomind.py b/lmdeploy/turbomind/turbomind.py index 2df03cf340..7b52ff216f 100644 --- a/lmdeploy/turbomind/turbomind.py +++ b/lmdeploy/turbomind/turbomind.py @@ -238,6 +238,10 @@ def _from_hf(self, model_path: str, engine_config: TurbomindEngineConfig, ec.cache_max_block_count = engine_config.cache_max_entry_count ec.cache_chunk_size = engine_config.cache_chunk_size ec.enable_prefix_caching = engine_config.enable_prefix_caching + ec.cache_checkpoint_interval = engine_config.cache_checkpoint_interval + ec.cache_prompt = engine_config.cache_prompt + ec.cache_prompt_boundary_skip = 
engine_config.cache_prompt_boundary_skip + ec.cache_generation = engine_config.cache_generation ec.enable_metrics = engine_config.enable_metrics ec.num_tokens_per_iter = engine_config.num_tokens_per_iter ec.max_prefill_iters = engine_config.max_prefill_iters @@ -391,6 +395,11 @@ def create_instance(self, cuda_stream_id=0): def get_schedule_metrics(self): # TODO: support dp tm_metrics = self.model_comm.get_schedule_metrics(0) + if tm_metrics is None: + # ScheduleMetrics is not yet wired onto the new scheduler (metrics revival is + # deferred). Report no metrics so consumers (health probe / metrics logger) + # degrade gracefully instead of dereferencing a missing metrics object. + return None return ScheduleMetrics(active_seqs=tm_metrics.active_seqs, waiting_seqs=tm_metrics.waiting_seqs, total_blocks=tm_metrics.total_blocks, @@ -568,14 +577,13 @@ def __init__(self, tm_model: 'TurboMind', cuda_stream_id: int = 0): 0: ResponseType.SUCCESS, 1: ResponseType.SESSION_NOT_EXIST, 2: ResponseType.SESSION_REPEAT, - 3: ResponseType.SESSION_REPEAT, - 4: ResponseType.INTERNAL_ENGINE_ERROR, 5: ResponseType.INTERNAL_ENGINE_ERROR, 6: ResponseType.INPUT_LENGTH_ERROR, 7: ResponseType.FINISH, 8: ResponseType.CANCEL, - 9: ResponseType.PREFIX_CACHE_CONFLICT_INTERACTIVE_MODE, + 9: ResponseType.PREFIX_CACHE_CONFLICT, 10: ResponseType.NO_QUEUE, + 11: ResponseType.OUT_OF_MEMORY, -1: ResponseType.INTERNAL_ENGINE_ERROR, } @@ -671,15 +679,9 @@ def prepare_inputs(self, async def async_cancel(self, session_id: int = None): self.model_inst.cancel() - def async_end_cb(self, fut: asyncio.Future, status: int): - """Executing on engine's signaling thread.""" - logger.info(f'[async_end_cb] session ended, status = {status}') - fut.get_loop().call_soon_threadsafe(fut.set_result, status) - async def async_end(self, session_id): - fut = asyncio.get_running_loop().create_future() - self.model_inst.end(partial(self.async_end_cb, fut), session_id) - await fut + """TurboMind is stateless; there is no 
engine-side session to end.""" + return def async_signal_cb(self, s: StreamingSemaphore): """Executing on engine's signaling thread.""" @@ -706,15 +708,21 @@ async def async_stream_infer(self, input_embeddings (list[numpy.ndarray]): embeddings features input_embedding_ranges (list[tuple[int,int]]): the begin/end offsets of input_embeddings to input_ids - sequence_start (bool): indicator for starting a sequence - sequence_end (bool): indicator for ending a sequence + sequence_start (bool): must be True; TurboMind is stateless-only + sequence_end (bool): must be True; TurboMind is stateless-only step (int): the offset of the k/v cache - stop (bool): indicator for cancelling the session gen_config (GenerationConfig): generation config stream_output (bool): indicator for stream output kwargs (dict): kwargs for backward compatibility """ logger.info(f'[async_stream_infer] session {session_id} start') + if not (sequence_start and sequence_end): + logger.error(f'[async_stream_infer] session {session_id}: TurboMind supports only ' + f'stateless requests; stateful/interactive inference ' + f'(sequence_start={sequence_start}, sequence_end={sequence_end}) is not ' + f'supported - use prefix caching instead') + yield EngineOutput(ResponseType.NOT_SUPPORTED, []) + return gen_cfg = self._get_generation_config(gen_config) inputs, input_len = self.prepare_inputs(input_ids=input_ids, @@ -758,7 +766,7 @@ async def async_stream_infer(self, f'disable guided decoding: {e}') gen_config.response_format = None - session = _tm.SessionParam(id=session_id, step=step, start=sequence_start, end=sequence_end) + session = _tm.SessionParam(id=session_id, step=step) inputs = _np_dict_to_tm_dict(inputs) mm_inputs = self.tm_model.mm_input_converter(multimodal) diff --git a/scripts/test_turbomind_model.py b/scripts/test_turbomind_model.py index abca29290f..ca05e99ee4 100644 --- a/scripts/test_turbomind_model.py +++ b/scripts/test_turbomind_model.py @@ -1,7 +1,8 @@ #!/usr/bin/env python3 """Smoke-test 
one TurboMind model for agent/subagent harnesses. -Code flow: parse argv → configure HF/GPU and run one inference → print sections. +Code flow: parse argv → resolve prompts → configure HF/GPU and run batch inference +→ print sections. Stdout is plain text in short sections, for example: @@ -9,19 +10,31 @@ model: tp: gpus: - TM_DEBUG_LEVEL: DEBUG (only if --debug was passed) + max_new_tokens: 256 + async: 1 + session_len: 16384 + max_batch_size: 8 + enable_prefix_caching: 0 + cache_checkpoint_interval: 4096 + cache_prompt: 'auto' + cache_generation: 'auto' + prompt_count: 1 + prompt_source: default + CUDA_LAUNCH_BLOCKING: 1 (only if --debug was passed) --- timing --- pipeline load: s inference: s --- tokens --- - input: - generated: + [0] source_index: 0 + [0] prompt: Write a short paragraph about the importance of reading books. + [0] input: + [0] generated: - --- response begin --- + --- response 0 begin --- - --- response end --- + --- response 0 end --- Exit code: 0 only if no uncaught exception (pipeline load + inference complete). On failure the full traceback is printed to stderr. @@ -30,144 +43,582 @@ Usage (from repo root): python scripts/test_turbomind_model.py \\ - [--debug] + --model-id ID \\ + --cache-dir PATH \\ + --tp N \\ + --gpus DEVICES \\ + [--prompt TEXT ...] \\ + [--prompt-file PATH] \\ + [--prompt-ids N [N ...]] \\ + [--max-new-tokens N] \\ + [--async {0,1}] \\ + [--session-len N] \\ + [--max-batch-size N] \\ + [--enable-prefix-caching] \\ + [--max-prefill-token-num N] \\ + [--cache-checkpoint-interval N] \\ + [--cache-prompt {all,auto}] \\ + [--cache-generation {all,auto,none}] \\ + [--debug] + +Optional prompts: repeat --prompt for multiple strings, or --prompt-file for a JSON +array of strings. --prompt and --prompt-file are mutually exclusive. When omitted, a +built-in default prompt is used. --prompt-ids selects/reorders/duplicates by 0-based +index; when omitted, all prompts run in order. 
+ +Example --prompt-ids (0-based; repeats run the same prompt again): + + # prompts.json: ["First prompt.", "Second prompt."] + # Run second, then first twice (three responses total) + python scripts/test_turbomind_model.py \\ + --model-id ID --cache-dir PATH --tp 1 --gpus 0 \\ + --prompt-file prompts.json \\ + --prompt-ids 1 0 0 + + # Same with CLI prompts: run "B", then "A" twice + python scripts/test_turbomind_model.py \\ + --model-id ID --cache-dir PATH --tp 1 --gpus 0 \\ + --prompt "A" --prompt "B" \\ + --prompt-ids 1 0 0 + +Example with prefix caching and repeated prompts: -Optional --debug sets TM_DEBUG_LEVEL=DEBUG before loading TurboMind so asynchronous -CUDA errors surface after kernel launch (see TurboMind CUDA helpers). + python scripts/test_turbomind_model.py \\ + --model-id ID --cache-dir PATH --tp 2 --gpus 0,1 \\ + --enable-prefix-caching \\ + --prompt "A" --prompt "B" \\ + --prompt-ids 0 0 1 + +Optional engine params (defaults shown): --max-new-tokens 256, --async 1, +--session-len 16384, --max-batch-size 8. Prefix caching: pass +--enable-prefix-caching (default off). + +Optional --debug sets CUDA_LAUNCH_BLOCKING=1 before loading TurboMind so CUDA +kernels run synchronously and errors surface at the launch site. Example gpus: "0" for tp=1, "0,1" for tp=2. + +Python (repo root on PYTHONPATH): + + from scripts.test_turbomind_model import run_smoke_test + + result = run_smoke_test( + model_id="/path/to/model", + cache_dir="/nvme2/huggingface_hub/hub", + tp=2, + gpus="0,1", + prompts=["Hello", "World"], + prompt_ids=[0, 1, 0], + enable_prefix_caching=True, + max_new_tokens=512, + emit_report=False, + ) + assert len(result.responses) == 3 + assert result.responses[0].text.strip() + +SmokeResult.responses is a list of PromptResult (index, source_index, prompt_preview, +text, input_token_len, generate_token_len). Use result.responses[0].text for the +first decoded output. 
""" from __future__ import annotations +import argparse +import json import os import sys import time import traceback +from pathlib import Path from typing import NamedTuple import huggingface_hub.constants as hf_constants +DEFAULT_MAX_NEW_TOKENS = 256 +DEFAULT_ASYNC = 1 +DEFAULT_SESSION_LEN = 16384 +DEFAULT_MAX_BATCH_SIZE = 8 +DEFAULT_MAX_PREFILL_TOKEN_NUM = 1024 +DEFAULT_CACHE_CHECKPOINT_INTERVAL = 4096 +DEFAULT_CACHE_PROMPT = 'auto' +DEFAULT_CACHE_GENERATION = 'auto' +DEFAULT_PROMPT = 'Write a short paragraph about the importance of reading books.' +PROMPT_PREVIEW_LEN = 64 -class SmokeResult(NamedTuple): - create_s: float - infer_s: float + +class PromptConfigError(ValueError): + """Invalid prompt CLI/API configuration (maps to exit 2).""" + + +class PromptResult(NamedTuple): + index: int + source_index: int + prompt_preview: str text: str input_token_len: int generate_token_len: int +class ResolvedPrompts(NamedTuple): + prompts: list[str] + source_indices: list[int] + source: str # 'default' | 'cli' | 'file' + + +class SmokeResult(NamedTuple): + create_s: float + infer_s: float + responses: list[PromptResult] + + def _set_hf_cache(path: str) -> None: hf_constants.HF_HUB_CACHE = path hf_constants.HF_HUB_OFFLINE = 1 -def parse_args(argv: list[str]) -> tuple[str, str, int, str, bool]: - prog = os.path.basename(argv[0]) if argv else 'test_turbomind_model.py' - rest = [a for a in argv[1:] if a != '--debug'] - debug = len(rest) != len(argv) - 1 +def _positive_int(value: str) -> int: + n = int(value) + if n < 1: + raise argparse.ArgumentTypeError(f'{value!r} must be >= 1') + return n - if len(rest) != 4: - print( - f'usage: {prog} [--debug] ', - file=sys.stderr, - ) - sys.exit(2) - model_path, cache_dir, tp_s, gpus = rest +def _validate_engine_params( + *, + max_new_tokens: int, + async_: int, + session_len: int, + max_batch_size: int, + max_prefill_token_num: int, + cache_checkpoint_interval: int, +) -> None: + if max_new_tokens < 1: + raise 
ValueError(f'max_new_tokens must be >= 1, got {max_new_tokens}') + if async_ not in (0, 1): + raise ValueError(f'async_ must be 0 or 1, got {async_}') + if session_len < 1: + raise ValueError(f'session_len must be >= 1, got {session_len}') + if max_batch_size < 1: + raise ValueError(f'max_batch_size must be >= 1, got {max_batch_size}') + if max_prefill_token_num < 1: + raise ValueError(f'max_prefill_token_num must be >= 1, got {max_prefill_token_num}') + if cache_checkpoint_interval < 1: + raise ValueError( + f'cache_checkpoint_interval must be >= 1, got {cache_checkpoint_interval}') + + +def _format_prompt_preview(text: str) -> str: + escaped = text.replace('\n', '\\n') + if len(escaped) > PROMPT_PREVIEW_LEN: + return escaped[:PROMPT_PREVIEW_LEN] + '...' + return escaped + + +def _load_prompt_file(path: str) -> list[str]: + prompt_path = Path(path) + if not prompt_path.is_file(): + raise PromptConfigError(f'prompt file not found: {path}') try: - tp = int(tp_s) - except ValueError: - print(f'invalid tp: {tp_s!r}', file=sys.stderr) - sys.exit(2) - return model_path, cache_dir, tp, gpus, debug + data = json.loads(prompt_path.read_text(encoding='utf-8')) + except json.JSONDecodeError as exc: + raise PromptConfigError(f'invalid JSON in prompt file {path}: {exc}') from exc + if not isinstance(data, list): + raise PromptConfigError(f'prompt file must contain a JSON array of strings: {path}') + if not data: + raise PromptConfigError(f'prompt file is empty: {path}') + for i, item in enumerate(data): + if not isinstance(item, str): + raise PromptConfigError(f'prompt file entry {i} is not a string: {path}') + if not item: + raise PromptConfigError(f'prompt file entry {i} is empty: {path}') + return data + + +def resolve_prompts( + *, + prompts: list[str] | None = None, + prompt_file: str | None = None, + prompt_ids: list[int] | None = None, +) -> ResolvedPrompts: + if prompts is not None and prompt_file is not None: + raise PromptConfigError('cannot use both prompts and 
prompt_file') + + if prompt_file is not None: + base = _load_prompt_file(prompt_file) + source = 'file' + elif prompts is not None: + if not prompts: + raise PromptConfigError('prompts list is empty') + for i, prompt in enumerate(prompts): + if not prompt: + raise PromptConfigError(f'prompt {i} is empty') + base = list(prompts) + source = 'cli' + else: + base = [DEFAULT_PROMPT] + source = 'default' + + if prompt_ids is None: + ids = list(range(len(base))) + else: + ids = list(prompt_ids) + + for idx in ids: + if idx < 0 or idx >= len(base): + raise PromptConfigError( + f'prompt_ids index {idx} out of range for {len(base)} prompts') + + resolved = [base[i] for i in ids] + if not resolved: + raise PromptConfigError('prompt_ids produced an empty prompt list') + return ResolvedPrompts(resolved, ids, source) + + +def build_arg_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser( + description='Smoke-test one TurboMind model for agent/subagent harnesses.', + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog="""\ +stdout sections: --- setup ---, --- timing ---, --- tokens ---, --- response N begin/end --- +Optional prompts: --prompt, --prompt-file, --prompt-ids +Optional engine params: + --max-new-tokens + --async + --session-len + --max-batch-size + --enable-prefix-caching + --max-prefill-token-num + --cache-checkpoint-interval + --cache-prompt + --cache-generation +Exit 0: load + inference complete. Exit 1: exception (traceback on stderr). Exit 2: usage error. +""", + ) + parser.add_argument('--model-id', required=True, help='HuggingFace model id or local path') + parser.add_argument('--cache-dir', required=True, help='HF hub cache directory (HF_HUB_CACHE)') + parser.add_argument('--tp', required=True, type=int, help='Tensor parallel size') + parser.add_argument('--gpus', required=True, help='CUDA_VISIBLE_DEVICES value, e.g. 
"0" or "0,1"') + parser.add_argument( + '--max-new-tokens', + type=_positive_int, + default=DEFAULT_MAX_NEW_TOKENS, + help=f'GenerationConfig.max_new_tokens (default: {DEFAULT_MAX_NEW_TOKENS})', + ) + parser.add_argument( + '--async', + type=int, + default=DEFAULT_ASYNC, + choices=[0, 1], + dest='async_', + help='Enable async execution (default: 1). Set 0 to disable, 1 to enable.', + ) + parser.add_argument( + '--session-len', + type=_positive_int, + default=DEFAULT_SESSION_LEN, + help=f'TurbomindEngineConfig.session_len (default: {DEFAULT_SESSION_LEN})', + ) + parser.add_argument( + '--max-batch-size', + type=_positive_int, + default=DEFAULT_MAX_BATCH_SIZE, + help=f'TurbomindEngineConfig.max_batch_size (default: {DEFAULT_MAX_BATCH_SIZE})', + ) + parser.add_argument( + '--enable-prefix-caching', + action='store_true', + help='Enable TurboMind prefix caching (TurbomindEngineConfig.enable_prefix_caching)', + ) + parser.add_argument( + '--max-prefill-token-num', + type=_positive_int, + default=DEFAULT_MAX_PREFILL_TOKEN_NUM, + help=f'TurbomindEngineConfig.max_prefill_token_num (default: {DEFAULT_MAX_PREFILL_TOKEN_NUM})', + ) + parser.add_argument( + '--cache-checkpoint-interval', + type=_positive_int, + default=DEFAULT_CACHE_CHECKPOINT_INTERVAL, + help=('TurbomindEngineConfig.cache_checkpoint_interval ' + f'(default: {DEFAULT_CACHE_CHECKPOINT_INTERVAL})'), + ) + parser.add_argument( + '--cache-prompt', + choices=['all', 'auto'], + default=DEFAULT_CACHE_PROMPT, + help=('Prompt-boundary caching mode ' + f'(TurbomindEngineConfig.cache_prompt; default: {DEFAULT_CACHE_PROMPT!r})'), + ) + parser.add_argument( + '--cache-generation', + choices=['all', 'auto', 'none'], + default=DEFAULT_CACHE_GENERATION, + help=('Generation caching mode ' + f'(TurbomindEngineConfig.cache_generation; default: {DEFAULT_CACHE_GENERATION!r})'), + ) + parser.add_argument( + '--prompt', + action='append', + default=None, + metavar='TEXT', + help='Prompt text (repeatable). 
Mutually exclusive with --prompt-file.', + ) + parser.add_argument( + '--prompt-file', + default=None, + metavar='PATH', + help='JSON file containing an array of prompt strings. Mutually exclusive with --prompt.', + ) + parser.add_argument( + '--prompt-ids', + nargs='+', + type=int, + default=None, + metavar='N', + help='0-based indices into the prompt list; repeats allowed. Default: all in order.', + ) + parser.add_argument( + '--debug', + action='store_true', + help='Set CUDA_LAUNCH_BLOCKING=1 before TurboMind load', + ) + return parser + + +def parse_args(argv: list[str]) -> argparse.Namespace: + parser = build_arg_parser() + return parser.parse_args(argv[1:]) def run_smoke_infer( - model_path: str, + model_id: str, cache_dir: str, tp: int, gpus: str, + resolved: ResolvedPrompts, *, + max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS, + async_: int = DEFAULT_ASYNC, + session_len: int = DEFAULT_SESSION_LEN, + max_batch_size: int = DEFAULT_MAX_BATCH_SIZE, + enable_prefix_caching: bool = False, + max_prefill_token_num: int = DEFAULT_MAX_PREFILL_TOKEN_NUM, + cache_checkpoint_interval: int = DEFAULT_CACHE_CHECKPOINT_INTERVAL, + cache_prompt: str = DEFAULT_CACHE_PROMPT, + cache_generation: str = DEFAULT_CACHE_GENERATION, debug: bool = False, ) -> SmokeResult: + _validate_engine_params( + max_new_tokens=max_new_tokens, + async_=async_, + session_len=session_len, + max_batch_size=max_batch_size, + max_prefill_token_num=max_prefill_token_num, + cache_checkpoint_interval=cache_checkpoint_interval, + ) _set_hf_cache(cache_dir) os.environ['CUDA_VISIBLE_DEVICES'] = gpus if debug: - os.environ['TM_DEBUG_LEVEL'] = 'DEBUG' + os.environ['CUDA_LAUNCH_BLOCKING'] = '1' from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline engine_config = TurbomindEngineConfig( - async_=1, - max_batch_size=4, - session_len=4096, + async_=async_, + max_batch_size=max_batch_size, + session_len=session_len, cache_max_entry_count=0.5, - max_prefill_token_num=1024, + 
max_prefill_token_num=max_prefill_token_num, tp=tp, dp=1, enable_metrics=False, communicator='nccl', + enable_prefix_caching=enable_prefix_caching, + cache_checkpoint_interval=cache_checkpoint_interval, + cache_prompt=cache_prompt, + cache_generation=cache_generation, ) - gen_config = GenerationConfig(max_new_tokens=128, do_sample=False) - prompt = 'Write a short paragraph about the importance of reading books.' + gen_config = GenerationConfig(max_new_tokens=max_new_tokens, do_sample=False) t0 = time.perf_counter() - with pipeline(model_path, backend_config=engine_config, log_level='WARNING', + with pipeline(model_id, backend_config=engine_config, log_level='WARNING', trust_remote_code=True) as pipe: create_s = time.perf_counter() - t0 t1 = time.perf_counter() - out = pipe([prompt], gen_config=gen_config, do_preprocess=True) + out = pipe(resolved.prompts, gen_config=gen_config, do_preprocess=True) infer_s = time.perf_counter() - t1 - res = out[0] - text = res.text if hasattr(res, 'text') else str(res) - input_token_len = getattr(res, 'input_token_len', -1) - generate_token_len = getattr(res, 'generate_token_len', -1) - return SmokeResult(create_s, infer_s, text, input_token_len, generate_token_len) + if not isinstance(out, list): + out = [out] + if len(out) != len(resolved.prompts): + raise RuntimeError( + f'pipeline returned {len(out)} responses for {len(resolved.prompts)} prompts') + + responses: list[PromptResult] = [] + for res, source_index, prompt_text in zip(out, resolved.source_indices, resolved.prompts): + text = res.text if hasattr(res, 'text') else str(res) + batch_index = getattr(res, 'index', len(responses)) + responses.append(PromptResult( + index=batch_index, + source_index=source_index, + prompt_preview=_format_prompt_preview(prompt_text), + text=text, + input_token_len=getattr(res, 'input_token_len', -1), + generate_token_len=getattr(res, 'generate_token_len', -1), + )) + return SmokeResult(create_s, infer_s, responses) def print_report( - 
model_path: str, + model_id: str, tp: int, gpus: str, + resolved: ResolvedPrompts, result: SmokeResult, *, + max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS, + async_: int = DEFAULT_ASYNC, + session_len: int = DEFAULT_SESSION_LEN, + max_batch_size: int = DEFAULT_MAX_BATCH_SIZE, + enable_prefix_caching: bool = False, + max_prefill_token_num: int = DEFAULT_MAX_PREFILL_TOKEN_NUM, + cache_checkpoint_interval: int = DEFAULT_CACHE_CHECKPOINT_INTERVAL, + cache_prompt: str = DEFAULT_CACHE_PROMPT, + cache_generation: str = DEFAULT_CACHE_GENERATION, debug: bool = False, ) -> None: - if not result.text.strip(): - print('warning: empty response text', file=sys.stderr) - print('--- setup ---') - print(f'model: {model_path}') + print(f'model: {model_id}') print(f'tp: {tp}') print(f'gpus: {gpus}') + print(f'max_new_tokens: {max_new_tokens}') + print(f'async: {async_}') + print(f'session_len: {session_len}') + print(f'max_batch_size: {max_batch_size}') + print(f'enable_prefix_caching: {1 if enable_prefix_caching else 0}') + print(f'cache_checkpoint_interval: {cache_checkpoint_interval}') + print(f'cache_prompt: {cache_prompt!r}') + print(f'cache_generation: {cache_generation!r}') + print(f'max_prefill_token_num: {max_prefill_token_num}') + print(f'prompt_count: {len(resolved.prompts)}') + print(f'prompt_source: {resolved.source}') if debug: - print('TM_DEBUG_LEVEL: DEBUG') + print('CUDA_LAUNCH_BLOCKING: 1') print() print('--- timing ---') print(f'pipeline load: {result.create_s:.2f} s') print(f'inference: {result.infer_s:.2f} s') print() print('--- tokens ---') - print(f'input: {result.input_token_len}') - print(f'generated: {result.generate_token_len}') + for item in result.responses: + print(f'[{item.index}] source_index: {item.source_index}') + print(f'[{item.index}] prompt: {item.prompt_preview}') + print(f'[{item.index}] input: {item.input_token_len}') + print(f'[{item.index}] generated: {item.generate_token_len}') print() - print('--- response begin ---') - print(result.text, 
end='') - if result.text and not result.text.endswith('\n'): - print() - print('--- response end ---') + for item in result.responses: + if not item.text.strip(): + print(f'warning: empty response text at index {item.index}', file=sys.stderr) + print(f'--- response {item.index} begin ---') + print(item.text, end='') + if item.text and not item.text.endswith('\n'): + print() + print(f'--- response {item.index} end ---') + + +def run_smoke_test( + *, + model_id: str, + cache_dir: str, + tp: int, + gpus: str, + prompts: list[str] | None = None, + prompt_file: str | None = None, + prompt_ids: list[int] | None = None, + max_new_tokens: int = DEFAULT_MAX_NEW_TOKENS, + async_: int = DEFAULT_ASYNC, + session_len: int = DEFAULT_SESSION_LEN, + max_batch_size: int = DEFAULT_MAX_BATCH_SIZE, + enable_prefix_caching: bool = False, + max_prefill_token_num: int = DEFAULT_MAX_PREFILL_TOKEN_NUM, + cache_checkpoint_interval: int = DEFAULT_CACHE_CHECKPOINT_INTERVAL, + cache_prompt: str = DEFAULT_CACHE_PROMPT, + cache_generation: str = DEFAULT_CACHE_GENERATION, + debug: bool = False, + emit_report: bool = True, +) -> SmokeResult: + resolved = resolve_prompts( + prompts=prompts, + prompt_file=prompt_file, + prompt_ids=prompt_ids, + ) + result = run_smoke_infer( + model_id, + cache_dir, + tp, + gpus, + resolved, + max_new_tokens=max_new_tokens, + async_=async_, + session_len=session_len, + max_batch_size=max_batch_size, + enable_prefix_caching=enable_prefix_caching, + max_prefill_token_num=max_prefill_token_num, + cache_checkpoint_interval=cache_checkpoint_interval, + cache_prompt=cache_prompt, + cache_generation=cache_generation, + debug=debug, + ) + if emit_report: + print_report( + model_id, + tp, + gpus, + resolved, + result, + max_new_tokens=max_new_tokens, + async_=async_, + session_len=session_len, + max_batch_size=max_batch_size, + enable_prefix_caching=enable_prefix_caching, + max_prefill_token_num=max_prefill_token_num, + cache_checkpoint_interval=cache_checkpoint_interval, + 
cache_prompt=cache_prompt, + cache_generation=cache_generation, + debug=debug, + ) + return result def main() -> None: - model_path, cache_dir, tp, gpus, debug = parse_args(sys.argv) - result = run_smoke_infer(model_path, cache_dir, tp, gpus, debug=debug) - print_report(model_path, tp, gpus, result, debug=debug) + args = parse_args(sys.argv) + run_smoke_test( + model_id=args.model_id, + cache_dir=args.cache_dir, + tp=args.tp, + gpus=args.gpus, + prompts=args.prompt, + prompt_file=args.prompt_file, + prompt_ids=args.prompt_ids, + max_new_tokens=args.max_new_tokens, + async_=args.async_, + session_len=args.session_len, + max_batch_size=args.max_batch_size, + enable_prefix_caching=args.enable_prefix_caching, + max_prefill_token_num=args.max_prefill_token_num, + cache_checkpoint_interval=args.cache_checkpoint_interval, + cache_prompt=args.cache_prompt, + cache_generation=args.cache_generation, + debug=args.debug, + emit_report=True, + ) if __name__ == '__main__': try: main() + except PromptConfigError as exc: + print(exc, file=sys.stderr) + sys.exit(2) except Exception: traceback.print_exc() sys.exit(1) diff --git a/scripts/test_vlm.py b/scripts/test_vlm.py deleted file mode 100644 index 693daa2067..0000000000 --- a/scripts/test_vlm.py +++ /dev/null @@ -1,72 +0,0 @@ -#!/usr/bin/env python3 -"""Smoke-test InternVL3.5 VLM with an image.""" -import os -import sys -import time - -import huggingface_hub.constants as hf_constants - - -def main(): - model_path = sys.argv[1] if len(sys.argv) > 1 else 'OpenGVLab/InternVL3_5-8B' - cache_dir = sys.argv[2] if len(sys.argv) > 2 else '/nvme2/huggingface_hub/hub' - image_path = sys.argv[3] if len(sys.argv) > 3 else '/data/lmdeploy-modeling/resources/batch_memory.png' - gpus = sys.argv[4] if len(sys.argv) > 4 else '0' - - hf_constants.HF_HUB_CACHE = cache_dir - hf_constants.HF_HUB_OFFLINE = 1 - os.environ['CUDA_VISIBLE_DEVICES'] = gpus - - from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline - from lmdeploy.vl import 
load_image - - engine_config = TurbomindEngineConfig( - async_=1, - max_batch_size=4, - session_len=8192, - cache_max_entry_count=0.5, - max_prefill_token_num=1024, - tp=1, - dp=1, - enable_metrics=False, - communicator='nccl', - ) - gen_config = GenerationConfig(max_new_tokens=256, do_sample=False) - - image = load_image(image_path) - prompt = 'Describe this image in detail. What do you see?' - - print('--- setup ---') - print(f'model: {model_path}') - print(f'image: {image_path}') - print(f'gpus: {gpus}') - print() - - t0 = time.perf_counter() - with pipeline(model_path, backend_config=engine_config, log_level='WARNING') as pipe: - load_s = time.perf_counter() - t0 - print('--- timing ---') - print(f'pipeline load: {load_s:.2f} s') - - t1 = time.perf_counter() - out = pipe([(prompt, image)], gen_config=gen_config, do_preprocess=True) - infer_s = time.perf_counter() - t1 - print(f'inference: {infer_s:.2f} s') - print() - - res = out[0] - text = res.text if hasattr(res, 'text') else str(res) - input_tokens = getattr(res, 'input_token_len', -1) - gen_tokens = getattr(res, 'generate_token_len', -1) - - print('--- tokens ---') - print(f'input: {input_tokens}') - print(f'generated: {gen_tokens}') - print() - print('--- response begin ---') - print(text) - print('--- response end ---') - - -if __name__ == '__main__': - main() diff --git a/src/turbomind/CMakeLists.txt b/src/turbomind/CMakeLists.txt index f66ef0a2df..8988b9ac37 100644 --- a/src/turbomind/CMakeLists.txt +++ b/src/turbomind/CMakeLists.txt @@ -14,6 +14,7 @@ add_subdirectory(utils) add_subdirectory(core) +add_subdirectory(memory) add_subdirectory(kernels) add_subdirectory(comm) add_subdirectory(generation) @@ -32,6 +33,7 @@ target_link_libraries(turbomind PUBLIC device_comm host_comm core + memory memory_utils nvtx_utils CUDA::cublasLt diff --git a/src/turbomind/engine/CMakeLists.txt b/src/turbomind/engine/CMakeLists.txt index c76f2b92b3..ec6ac903e9 100644 --- a/src/turbomind/engine/CMakeLists.txt +++ 
b/src/turbomind/engine/CMakeLists.txt @@ -3,9 +3,12 @@ cmake_minimum_required(VERSION 3.25) add_library(engine STATIC + block.cc + cache_registry.cc gateway.cc request.cc request_queue.cc + scheduler.cc model_request.cc model_executor.cc engine.cc @@ -13,3 +16,8 @@ add_library(engine STATIC target_link_libraries(engine PRIVATE xgrammar core) set_property(TARGET engine PROPERTY POSITION_INDEPENDENT_CODE ON) set_property(TARGET engine PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS ON) + +if (BUILD_TEST) + add_executable(test_prefix_trie test_prefix_trie.cc) + target_link_libraries(test_prefix_trie PRIVATE engine memory core Catch2::Catch2WithMain) +endif () diff --git a/src/turbomind/engine/README.md b/src/turbomind/engine/README.md new file mode 100644 index 0000000000..4f427c084d --- /dev/null +++ b/src/turbomind/engine/README.md @@ -0,0 +1,472 @@ +# TurboMind Engine Async Execution Model + +## addressing + +Address a top-level section by its heading, such as `concepts`, `principles`, `ownership`, `invariants`, `contracts`, or `checklist`. + +Address a leaf by `
<section>.<leaf>`, using the top-level section plus the `###` heading text. Examples: `concepts.phase`, `ownership.cache`, `invariants.cleanup`, `contracts.cache-prepare`, `checklist.cache-memory`. + +Leaf headings are local to their section and intentionally short. Do not introduce a front index, global prefix, root id, or sentence-length id. + +## scope + +### status + +This document is the normative developer contract for the TurboMind C++ engine execution model. It covers the host-side engine loop, model executor handoff, scheduler transaction, request lifecycle, cache metadata, prefix ownership, and module-level `BatchOp` contracts under `src/turbomind/engine` and the TurboMind model modules that participate in `BatchOp`. + +This document does not describe the PyTorch engine. It is not a refactor proposal and does not prescribe a new scheduler. It records the concepts, ownership rules, invariants, and contracts that current and future TurboMind changes must preserve unless this document is updated in the same change. + +When code and this document disagree, treat the disagreement as a design bug. Either fix the code to satisfy the contract or update this document with the new contract and the reason for the change. + +## concepts + +### request + +`Request` is the API-facing unit of work. It owns request identity, a history/KV offset (`step`), generation configuration, input and output tensor references, cancellation state, callbacks, and the externally visible request state. + +### sequence + +`Sequence` is the engine-local mutable execution state for one accepted request on one local rank. It is created from a `Request` during admission and is the object passed through scheduler and model-module contracts. It stores token progress, scheduling decisions, logical block handles, cache-category request state, generation rows, lifecycle flags, and transient per-pass fields. 
+ +### multimodal-spans + +`Sequence::multimodal_spans` is the engine-visible `(token span, fingerprint)` projection of multimodal inputs; `multimodal_inputs` (pixels) stays opaque. + +`cache_prompt_boundary_skip` is the engine knob for the trailing volatile-suffix length; `Sequence::prompt_boundary_pos` is its per-sequence resolved boundary `B = prompt_len - cache_prompt_boundary_skip`. +`cache_prompt` / `cache_generation` are the two `CacheMode` publication knobs; `cache_checkpoint_interval` is the recurrent-checkpoint spacing (`CacheRegistry::checkpoint_min_interval`, > 0). + +### batch-data + +`BatchData` is a reusable phase-local carrier between the engine thread and the model executor thread. It contains the phase id, current and previous batch sizes, the active-batch permutation, token-count metadata, and CUDA events used to order host setup and device execution. + +### phase + +Phase is one slot in the async pipeline. With one phase, the engine behaves synchronously: a submitted batch is updated before the next batch is prepared. With multiple phases, host scheduling and setup may run ahead of model execution by reusing different `BatchData` slots. + +### scheduler-transaction + +Scheduler transaction is one scheduling pass over eligible `Sequence` objects. For each request the scheduler plans (`Resume` for inactive, `Continue` for active): it sizes logical blocks, reserves cache ids, computes `resume_len`, and emits restore copy plans. `Scheduler::Schedule()` then commits: it decides which requests become active, assigns `history_len` and `input_len`, commits cache allocation and eviction through the memory replay, selects and attaches checkpoint publication slots, emits publication copy plans, and records producer marks. + +### logical-block + +Logical block is scheduler-owned metadata for a fixed token interval. 
It records offset, capacity, current size, cache-object slots, prefix-index identity, an intrusive strong refcount (request and fork references held through `BlockHandle`s; the cache-allocation reference taken via `Retain`/`Drop` keyed on the slot's `CacheBlock::owner` identity), and node-level producer ownership. + +### cache-object + +Cache object is an object-typed allocation handle tracked by `CacheBlockPool` and backed by `ObjectAllocator`. The scheduler owns cache object lifetime, allocation metadata, validity, and release; modules own the meaning and contents of their registered byte ranges within the object. A cache object may be composite: one handle whose bytes are several independent sub-allocations (parts), resolved to multiple `(address, bytes)` segments. + +### module + +Module is any TurboMind model component that participates in `LanguageModel::Run(BatchOp, phase, env)`, such as input processing, attention, GDN, generation, or output processing. Modules may validate and prepare their own state, but they must obey the `BatchOp` contracts in this document. + +### signal + +Signal is a callback scheduled from the engine into the `Gateway` signal thread. Signals update externally visible request state and invoke user callbacks outside the engine scheduling thread. + +### gateway + +Gateway accepts external requests into per-queue `RequestQueue` objects and owns the signal thread used for callbacks. Queue operations may happen concurrently with the engine loop, but accepted requests enter engine-owned mutable state only when the engine thread pops them from the gateway. + +### engine-thread + +Engine thread runs `Engine::Impl::InternalThreadEntry()`. It owns request admission, validation, cancellation observation, scheduling, host-side setup, completed-batch update, lifecycle retirement, and notification submission. All scheduler state is mutated on this thread. 
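The admission handoff described under `gateway` and `engine-thread` can be modeled in a few lines. This is a minimal Python sketch, not the engine's C++ code: the `Gateway.push`, `Gateway.pop_all`, and `admit` names are hypothetical, and the `Request`/`Sequence` classes are reduced to two fields each. It only shows that producers may enqueue from any thread while a `Sequence` is created solely at the point where the engine pops the request.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class Request:
    """API-facing unit of work (simplified for illustration)."""
    request_id: int
    prompt: str


@dataclass
class Sequence:
    """Engine-local state created from a Request at admission (simplified)."""
    request: Request
    history_len: int = 0


class Gateway:
    """Hypothetical stand-in: accepts requests from any thread."""

    def __init__(self) -> None:
        self._queue = queue.Queue()  # thread-safe FIFO

    def push(self, request: Request) -> None:
        self._queue.put(request)

    def pop_all(self) -> list:
        """Drain whatever is queued right now without blocking."""
        drained = []
        while True:
            try:
                drained.append(self._queue.get_nowait())
            except queue.Empty:
                return drained


def admit(gateway: Gateway, state: dict) -> int:
    """Engine-thread step: popped requests become engine-owned Sequences."""
    admitted = gateway.pop_all()
    for request in admitted:
        state[request.request_id] = Sequence(request)
    return len(admitted)


# Producers enqueue concurrently; only the caller of admit() touches state.
gateway = Gateway()
producers = [
    threading.Thread(target=gateway.push, args=(Request(i, f"prompt {i}"),))
    for i in range(4)
]
for thread in producers:
    thread.start()
for thread in producers:
    thread.join()

engine_state = {}
admitted_count = admit(gateway, engine_state)
```

In this model the `engine_state` dictionary is never shared with producer threads, which mirrors the rule that all scheduler state is mutated on the engine thread.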
+ +### model-executor-thread + +Model executor thread runs `ModelExecutor::Impl::InternalThreadEntry()`. It owns the CUDA execution context for `BatchOp::kPrepare`, `BatchOp::kForward`, and `BatchOp::kUnprep`. It consumes ready `BatchData` objects from the outbound queue, waits for the setup event, runs device work, records the done event, and returns the batch through the inbound queue. + +### data-path + +The request data path is: + +```text +Request + -> Sequence + -> BatchData and module-owned per-phase buffers + -> device/module state + -> BatchData and module-owned per-phase buffers + -> Sequence + -> Request outputs and signals +``` + +The engine and executor exchange `BatchData` slots through queues. Each slot has a stable phase id. The phase id selects module-owned per-phase buffers, while the batch slot itself carries the current active membership and CUDA ordering events. + +## principles + +### engine-state + +The engine thread is the owner of request scheduling state. It admits requests, mutates `Sequence` lifecycle fields, runs scheduler transactions, calls host-side module-level `BatchOp` handlers, submits batches, processes completed batches, and releases request-owned state. + +### scheduler-boundary + +The scheduler is the transaction boundary for shared execution resources. Request-level planning (`Accept`, `Resume`, `Continue`) may match or create logical blocks, reserve cache ids, compute `resume_len`, and emit copy intent, but allocation, eviction, active admission, `history_len`, `input_len`, publication slot attachment, and producer marking are committed by `Scheduler::Schedule()`. + +### cache-semantics + +Modules own registered byte-range semantics. The scheduler may know that a cache id has an object id, an allocation handle, and a timestamp; it must not know whether bytes in that object contain KV blocks, GDN recurrent state, publication checkpoints, or future module state. 
+ +### resume-proof + +Generic cache validity is a lifetime fact, not a resume proof. A valid allocation can keep a prefix node alive, but only `Scheduler::Resume()` may decide whether cached state lets a request skip tokens, and only from content proven produced: `is_valid` set by publication for indexed nodes, `filled_len` for private blocks. + +### device-content + +Device content operations happen on the model executor thread. Module-specific content work (clearing or post-processing a module's own byte range, preparing pointers, reading model outputs) belongs to the relevant `BatchOp` handler. Whole-object cache copies planned by the scheduler as `(src, dst)` cache-id pairs are resolved to addresses during engine-thread setup and performed by the executor: restore copies before `BatchOp::kPrepare`, publication copies after `BatchOp::kUnprep`. The scheduler never knows what the copied bytes mean; modules never know why a copy happened. Resolving a composite handle yields one or more segments, so a scheduler-planned whole-object copy fans out to one device copy per part (same `(src_id, dst_id)` cache-id plan; only the engine-thread resolution multiplies). + +### delayed-cleanup + +Async execution requires delayed cleanup. A request that has finished or been canceled must be excluded from future scheduling immediately, but its request-owned resources cannot be released until every submitted batch that references it has completed and decremented `inflight`. + +### callbacks + +Externally visible callbacks do not run on the engine scheduling path. The engine records callback work as signals and the gateway signal thread invokes them with the appropriate external context. 
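The `delayed-cleanup` rule above lends itself to a short model. The following is a minimal Python sketch assuming a hypothetical `Engine` class with `submit`, `finish`, and `complete` methods; only the `inflight` counter comes from the text, and the real bookkeeping lives in the C++ engine. It shows a finished sequence leaving scheduling immediately while its release waits for every submitted batch that references it.

```python
from dataclasses import dataclass


@dataclass
class Sequence:
    seq_id: int
    inflight: int = 0       # submitted batches still referencing this sequence
    finished: bool = False  # excluded from future scheduling once set
    released: bool = False  # request-owned resources have been returned


class Engine:
    """Hypothetical model of the engine-thread bookkeeping."""

    def __init__(self) -> None:
        self.sequences = {}

    def add(self, seq_id: int) -> None:
        self.sequences[seq_id] = Sequence(seq_id)

    def schedulable(self) -> list:
        return [s.seq_id for s in self.sequences.values() if not s.finished]

    def submit(self, seq_ids: list) -> list:
        """Submit a batch; each member gains one in-flight reference."""
        for seq_id in seq_ids:
            self.sequences[seq_id].inflight += 1
        return list(seq_ids)

    def finish(self, seq_id: int) -> None:
        """Finish or cancel: stop scheduling now, release when safe."""
        seq = self.sequences[seq_id]
        seq.finished = True
        self._maybe_release(seq)

    def complete(self, batch: list) -> None:
        """A submitted batch came back from the executor."""
        for seq_id in batch:
            seq = self.sequences[seq_id]
            seq.inflight -= 1
            self._maybe_release(seq)

    def _maybe_release(self, seq: Sequence) -> None:
        if seq.finished and seq.inflight == 0:
            seq.released = True


engine = Engine()
engine.add(0)
batch_a = engine.submit([0])  # first phase slot
batch_b = engine.submit([0])  # second phase slot, submitted ahead

engine.finish(0)
after_finish = (engine.schedulable(), engine.sequences[0].released)

engine.complete(batch_a)
after_first = engine.sequences[0].released

engine.complete(batch_b)
after_second = engine.sequences[0].released
```

With two phases in flight, release happens only after the second completion, which is the case a single-phase (synchronous) configuration never exercises.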
+ +### boundary-policy + +Partial-block boundary publication is decided entirely at Accept-time in `SetupForks` (prompt) and at finalization in `PublishGeneration` (generation), from two `CacheMode` knobs — `EngineConfig::cache_prompt` (`all`|`auto`) and `EngineConfig::cache_generation` (`all`|`auto`|`none`) — parsed once into `Scheduler::prompt_cache_mode_` / `generation_cache_mode_`. There is no runtime veto object. `cache_prompt=all` publishes the partial prompt `fork_to` node whenever `B` is mid-block and arms the block-aligned checkpoint clamp otherwise; `cache_prompt=auto` publishes the partial node only when its token range `[j*bs, B)` overlaps a multimodal span (`Scheduler::HasMultimodalOverlap`) and never arms the block-aligned clamp. `cache_generation=all` indexes the terminal partial generated block and adopts the terminal recurrent frontier checkpoint; `auto` indexes full generated blocks only; `none` indexes no generated blocks. The decision is a pure function of cross-rank-identical sequence attributes (prompt geometry, `cache_prompt_boundary_skip`, `multimodal_spans`), so it is consistent across ranks. `Sequence::prompt_boundary_node` now means the boundary will be published (no deferred re-check). + +## ownership + +### gateway + +`Gateway` owns request queues and signal delivery, routing each incoming request to a queue round-robin. It does not own engine-local execution state. After a request is accepted, externally visible completion and streaming updates are delivered by signals scheduled back through the gateway. + +### request + +`Request` is shared API state. It is referenced by the gateway, engine, callbacks, and request-local engine state. The engine may update `Request::cancel_flag`, `Request::ec`, and external state through `UpdateState()`, but the execution details are kept in `Sequence`. + +### sequence + +`Sequence` is owned by `Engine::Impl::State::rc`. It remains owned by the engine until retirement cleanup resets the owning slot. 
Modules may store module-specific handles in `Sequence`, but they do not own the `Sequence` object. + +### batch-data + +`BatchData` slots are owned by the engine/executor queues. A submitted slot temporarily owns the active membership snapshot encoded by `bs0`, `bsz`, and `perm`, plus CUDA events that order setup and execution. It does not own `Sequence` objects. + +### scheduler + +`Scheduler` is owned by `Engine::Impl`. It owns the `CacheRegistry` (registration is closed before construction), `LogicalBlockPool`, `PrefixTrie`, and `CacheBlockPool`, and it holds a reference to the engine-owned `ObjectAllocator`. `LogicalBlockPool` is a prefix-agnostic node factory and recycle policy; `PrefixTrie` owns prefix indexing (`Find`/`Search`/`Insert`/`Erase`). The scheduler wires them with `LogicalBlockPool::set_recycle_hook`, so when a node's refcount reaches zero the pool fires the hook to erase it from the trie index before the node is destroyed. Scheduler pools persist across scheduling passes. The scheduler parses `EngineConfig::cache_prompt` / `cache_generation` into `CacheMode` values used for partial-block boundary publish decisions (`concepts.boundary-policy`). + +### object-allocator + +`ObjectAllocator` owns the backing cache memory region and allocation validity. `CacheBlockPool` stores object ids, allocation handles, timestamps, and a per-slot weak `owner` back-reference to the logical block the slot belongs to (see `cache-metadata`). Logical blocks point to cache ids, not raw memory. + +### module-cache + +Modules register anonymous byte requirements with prefix or checkpoint cache categories during construction and keep only byte offsets or base part ids (per registration channel). Each category registers one composite `ObjectAllocator` object id after all modules have registered. 
A category exposes two registration channels: an accumulation channel (grows part 0, returns a within-part byte offset) and a composite channel (appends parts 1..N, returns the base part id). Slab classes in `ObjectAllocator` are deduped by aligned size, and two same-aligned-size simple categories would share an object id (out of scope: prefix is the only simple category). Modules own the content semantics of their registered byte ranges. The `CacheRegistry` is a registration table only; cache id reservation, validity checks, resume selection, and release all live in the scheduler. + +### generation-row + +Generation rows are request-owned logical resources managed by the `Generation` module. A row is allocated lazily when a request first generates and is returned only by `BatchOp::kDel` during request cleanup. + +### prefix + +Prefix production is guarded by `LogicalBlock::producer`, the id of the request currently writing a block's token range. It is set for committed requests by `Scheduler::Schedule()` and cleared by the same pass's publication step for the produced range. A request must not be admitted to write a range whose blocks carry a foreign producer mark. Logical block lifetime is governed by a single intrusive refcount (`LogicalBlock::refs`). Requests and fork edges hold strong references through RAII `BlockHandle`s. When a slot's `CacheBlock::owner` is set, a valid allocation on that slot also holds a strong reference in `LogicalBlock::refs`; `CacheBlock::owner` is only a weak identity back-reference to the block a slot belongs to, and the strong allocation reference is taken and dropped explicitly via `LogicalBlockPool::Retain`/`Drop` keyed on that `owner` (request-owned slots leave `owner == nullptr` and take no allocation ref). 
The allocation reference is taken when the memory replay commits an allocation (`ReplayMemory`), when a finished request adopts its frontier as a terminal checkpoint (`PublishGeneration`), and when checkpoint publication attaches the publish slot to its target block (`CommitResults`); it is dropped on eviction (`ReplayMemory`), when a private block's allocations are released (`Release`), and when the scheduler drains live allocations at teardown (`~Scheduler`). `LogicalBlockPool::Drop` is the sole decrement funnel — `~BlockHandle` calls it for handle references and explicit allocation drops call it directly — and the last drop triggers `Recycle`. + +### callbacks + +Callbacks are owned outside the engine scheduling path. The engine creates signal closures, and the gateway signal thread invokes them. + +## invariants + +### seq-len + +`seq_len` is the number of known tokens in `Sequence::token_ids` after the last completed update. During generation, `Update()` appends the sampled token and advances `seq_len`. + +### resume-len + +`resume_len` is the prefix length that can be skipped for the next scheduler transaction. `Scheduler::Resume()` computes it from the async executable upper bound, contiguous valid prefix coverage, and, when checkpoint bytes are registered, the exact restorable checkpoint or frontier position. + +### readonly-block-num + +`readonly_block_num` is the per-pass count of leading `Sequence::block_ids` reused read-only: fully-valid whole blocks (rounded down to a whole block) whose KV the forward reads for context but must not re-write. `Scheduler::Resume()` counts them; `Continue()` sets it to 0 (decode writes only the new token). It gates only the KV cache *stores* — they are skipped for positions `< readonly_block_num * block_size`; reads, the set of processed tokens, recurrent recomputation, and producer marking are unaffected. + +### history-len + +`history_len` is the committed resume point for the active forward. 
`Scheduler::Schedule()` sets `history_len = resume_len` only for admitted requests. Module setup and output selection use `history_len` as the start of already-available state for the submitted batch. + +### input-len + +`input_len` is the number of tokens admitted for the active forward. It is set by `Scheduler::Schedule()` after resource admission and allocation planning. Inactive requests must have `input_len == 0` and `history_len == 0`. + +### filled-len + +`filled_len` is the contiguous prefix context currently established for the request — the position a subsequent resume or decode builds on — not limited to KV this request's own forward produced. It is reconciled in two places. (1) `Engine::Update()` reconciles it from a completed forward: a generating request excludes the newly sampled token, so `filled_len` is `sequence_length - 1`; a non-generating prefill chunk uses `sequence_length`. (2) `Scheduler::CommitResults()` reconciles a resuming request to `filled_len = resume_len`, recording the prefix it reused read-only (prefix cache) or restored from a checkpoint; the `[resume_len, end)` span the in-flight resume forward rebuilds is carried by `inflight_input_len` until that forward completes. The resume-commit write never races `Update()` because a resuming request is inactive (not part of the in-flight batch). + +### inflight-input-len + +`inflight_input_len` is submitted prefix growth that has not yet been reflected into `filled_len`. In async mode, after update of a completed batch, an active request that was submitted into the next batch records `inflight_input_len = input_len`. This equals `input_len` even for a prefix-skipping resume because `CommitResults()` reconciles `filled_len` to `resume_len`, so the growth the forward produces (`end - filled_len`) is exactly `input_len`. + +### inflight-new-tokens + +`inflight_new_tokens` is submitted sequence-length growth that has not yet been reflected into `seq_len`. 
In async mode, after update of a completed batch, an active generating request records `inflight_new_tokens = 1`; otherwise it records `inflight_new_tokens = 0`. + +### executable-context + +The executable context length for a scheduling pass is `seq_len + inflight_new_tokens - inflight_input_len`. Prefix matching and resume initialization must not assume that the full `seq_len + inflight_new_tokens` context is already safely reusable; the engine still needs to execute at least one token to produce logits for generation. + +### generating + +`generating` means the submitted forward reaches the current context boundary and can produce a next token. The engine sets it from `resume_len + inflight_input_len + input_len == seq_len + inflight_new_tokens`. + +### autoregres + +`autoregres` means the submitted forward is an already-active one-token decode that can take its input token from the model's autoregressive output path instead of copying prompt tokens from host memory. + +### is-active + +`is_active` has two time-dependent meanings that must not be collapsed. Before scheduler commit, it describes whether the request was active in the previous scheduling state and is used for resource accounting. After scheduler commit, it describes whether the request is active in the current scheduling state. + +### retiring + +`retiring` means the request has finished or been canceled and must never be scheduled again. It does not mean resources can be released. + +### inflight + +`inflight` is the number of submitted batches that still reference the request. It is incremented during `Setup()` for each active request in the submitted batch and decremented during `Update()` for the completed batch membership. A retiring request is releasable only when `inflight == 0`. + +### done + +`done` records request completion/cancellation for output and update logic. It is not the physical cleanup condition; cleanup is governed by `retiring && inflight == 0`. 
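The `executable-context`, `generating`, and `inflight` invariants above reduce to three small predicates. The following is a minimal sketch, not the engine's code: the `SeqState` struct and the helper names `ExecutableContext`, `IsGenerating`, and `IsReleasable` are hypothetical, and only the field names and formulas are taken from the text.

```cpp
#include <cassert>

// Hypothetical mirror of the per-request counters named in the invariants above.
struct SeqState {
    int  seq_len             = 0;      // known tokens after the last completed update
    int  resume_len          = 0;      // prefix skipped for the next scheduler transaction
    int  input_len           = 0;      // tokens admitted for the active forward
    int  inflight_input_len  = 0;      // submitted prefix growth not yet in filled_len
    int  inflight_new_tokens = 0;      // submitted sequence growth not yet in seq_len
    int  inflight            = 0;      // submitted batches that still reference the request
    bool retiring            = false;  // finished or canceled, never scheduled again
};

// executable-context: seq_len + inflight_new_tokens - inflight_input_len.
int ExecutableContext(const SeqState& s)
{
    assert(s.inflight_new_tokens >= 0 && s.inflight_input_len >= 0);
    return s.seq_len + s.inflight_new_tokens - s.inflight_input_len;
}

// generating: the submitted forward reaches the current context boundary.
bool IsGenerating(const SeqState& s)
{
    return s.resume_len + s.inflight_input_len + s.input_len == s.seq_len + s.inflight_new_tokens;
}

// retiring alone is not enough: every submitted batch must have completed first.
bool IsReleasable(const SeqState& s)
{
    return s.retiring && s.inflight == 0;
}
```

For a 10-token prompt with no in-flight work, a pass that resumes at 4 and admits 6 tokens is generating, while a pass that admits only 3 is a non-generating prefill chunk.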
+
+### cleanup
+
+The cleanup invariant is:
+
+```cpp
+if (request.retiring && request.inflight == 0) {
+    Run(BatchOp::kDel, -1, env);
+    scheduler.Release(request);
+    remove_sequence();
+}
+```
+
+### protection-set
+
+The eviction-protection set a request stamps (`involved_cache_ids`) is exactly what it needs to run the forward, namely its prefix blocks and single frontier (when checkpoints are registered), and is its **required** allocation set. Published block checkpoints are resume-time optimizations, not run-time state, and are deliberately excluded so they stay evictable: a high-priority sequence runs whenever memory fits its prefix blocks + one frontier and may reclaim its own prior checkpoints. The single checkpoint or fork source actually restored in a pass is protected for that pass via stamping its `restore_copies` source.
+
+## contracts
+
+### scheduler-start
+
+A scheduler transaction starts with a list of eligible, non-retiring `Sequence` objects. The engine resets transient scheduling fields and asks the scheduler to plan each request (`Resume` for inactive, `Continue` for active) before commit.
+
+### prefix-prepare
+
+When prefix caching is enabled and the request is trie-eligible, `Scheduler::Accept()` matches the prompt against the prefix trie at admission: full blocks are matched or created and indexed, the first miss may bind a partial-match source (`fork_from`), and a prompt-boundary publish node (`fork_to`) may be created. `fork_from` is always bound (any prior request may have published a prompt or generation partial node). A partial `fork_to` node is created when `B` falls inside a block and `cache_prompt` admits it: `all` always, `auto` only when the node's token range overlaps a multimodal span. The partial node carries the partial block's KV for every prefix-cached model; a recurrent model additionally publishes a recurrent-state checkpoint onto the same node (the checkpoint payload attaches only when checkpoint cache ids exist).
Accept must not allocate backing memory or select `resume_len`. The reusable prompt boundary ends at `B = prompt_len - cache_prompt_boundary_skip` (the configured count of trailing volatile generation-prompt tokens, default 1, so the default excludes only the last prompt token; `B` is capped by the `seq_len-1` resume cap). A partial `fork_to` node is published only when `B` falls inside a block (`B % block_size != 0`); when `B` is block-aligned the whole-block prefix already tiles `[0, B)` and only the boundary clamp/checkpoint applies. Over-excluding (a larger skip) is safe: segment tokens are exact-compared, so a too-long suffix only shortens reuse and never causes a false hit. Accept sets `Sequence::prompt_boundary_node` when `SetupForks` decides the boundary will be published (node insert succeeded, or the block-aligned boundary case) (`concepts.boundary-policy`); KV and checkpoint publication follow at scheduler commit (`contracts.checkpoint-publish`). Every indexing site folds each image's fingerprint into the cumulative key at the block where the image starts (from `Sequence::multimodal_spans`) and stores it on that `LogicalBlock`: `Accept`'s block creation, the partial-block `Search` when a partial prompt-boundary node may be published, and `PublishGeneration` when it later indexes the prompt-tail block that block creation left private (an image start can only fall in that block; generated positions never carry one). The folding is therefore uniform across lookup and indexing, so a published prompt-tail node has the same identity a future request's `Accept` rebuilds. + +### cache-prepare + +Request-level planning (`Resume` for inactive requests, `Continue` for active ones) runs inside the scheduling pass before admission. It may create missing logical blocks, reserve missing category cache ids, compute `resume_len`, and emit restore copy intent as cache-id pairs. 
It must not allocate or deallocate backing object memory, run module callbacks, copy, clear, restore, publish, mark a request active, set `history_len`, or set `input_len`. + +`Continue` maintains the request's `involved_cache_ids` incrementally rather than rebuilding it: a request that was active in the last pass was committed in that pass, so none of its involved cache ids were evicted and its whole required set was allocated; only blocks appended by `EnsureBlocks` since the last plan are new (and, being freshly created, unallocated). `Resume` cannot do this, because shared prefix nodes it references can be evicted by other requests between its passes, so it rebuilds `involved_cache_ids` from a full scan each pass. + +### scheduler-commit + +`Scheduler::Schedule()` is the commit step. It sorts candidate requests by `Request::unique_id`, stamps each request's `involved_cache_ids` and the sources of its `restore_copies`, tests composed resources, clamps each forward's end to a boundary candidate (a block boundary, or exactly `B = prompt_len - cache_prompt_boundary_skip` when `prompt_boundary_node` is set; the publish decision is finalized in `SetupForks`, and the clamp fires on the pass that can reach `B`), checks producer conflicts, selects checkpoint publication targets, and plans cache allocation and eviction with a `ScratchAllocator`. Admission and replay run in two phases (see `contracts.scheduler-admission`): `ReplayMemory` is applied once for the required tier and again for the optional tier, and each call applies only its phase's committed replay to the real allocator and then clears the replay buffer. After replay it attaches committed publication slots, emits publication copy plans, updates frontier metadata, and publishes produced ranges.
+
+For each committed request, the scheduler sets:
+
+```cpp
+r.history_len = r.resume_len;
+// input_len is clamped to a boundary candidate: a block boundary, or
+// B = prompt_len - cache_prompt_boundary_skip when prompt_boundary_node is set.
+r.input_len = admitted;
+r.is_active = true;
+```
+
+### scheduler-inactive
+
+For each uncommitted request, the scheduler must leave it inactive for the current pass:
+
+```cpp
+r.is_active = false;
+r.input_len = 0;
+r.history_len = 0;
+r.publish_target = nullptr;
+r.alloc_cache_ids.clear();
+r.restore_copies.clear();
+r.publish_copies.clear();
+```
+
+### scheduler-admission
+
+Admission is two-phase. The **required** tier (prefix blocks + frontier) evicts up to the request's `cutoff[i]` and, on failure, defers the request and stops the pass; this is priority enforcement, gated by `max_evict_ts`. The **optional** tier (checkpoint publication, fork-to population) runs only after every required forward is placed. It plans on a `ScratchAllocator`, a copy of the committed slab capacity (`MemoryState`); a committed handle is itself the `Allocation` pointer, read for its slot lists during eviction, and the handle store is never copied, since `ObjectAllocator` is move-only. The optional tier reclaims only **inactive** slots (`timestamp < pass_floor`, where `pass_floor` is the pass-start timestamp) of any category via the allocator's evict/allocate path. An optional allocation that does not fit is dropped; it never evicts active state and never defers a forward.
+
+### allocation
+
+Allocation planning must be atomic at the transaction boundary. If a request cannot allocate all required cache objects, the scheduler must not partially mutate the real allocator for that failed suffix. Evictions and allocations are applied only for the committed prefix of the planning replay.
+
+### eviction
+
+Eviction is timestamp-based and object-type agnostic.
Evicting an allocation releases the allocation reference it held on its logical block when the slot's `owner` is set; evicting a prefix-category allocation also clears the block's `is_valid`. A block whose reference count reaches zero is recycled by the pool, which fires a recycle hook that removes it from the `PrefixTrie` index, and its fork-edge handles release as the node is destroyed. + +### prefix-conflict + +The scheduler may skip a request whose produced range carries a foreign producer mark and continue considering later requests. Producer conflict handling is block-level and must not reset or release the skipped request's logical blocks. + +### scheduler-output + +The scheduler's output is a set of current active requests plus updated scheduler metadata. The engine owns batch partitioning, permutation construction, setup submission, update processing, and retirement after the scheduler transaction. + +### batchop + +`BatchOp` is the module-level operation protocol used by `LanguageModel::Run()`. Each operation has a narrow contract. A module may ignore operations that do not apply to it. There is no module-level scheduling operation; scheduler cache preparation owns host-side cache reservation and resume selection. + +### batchop-add + +`BatchOp::kAdd` runs on the engine thread when new `Sequence` objects are admitted. It initializes module-specific request fields and validates request-local inputs. It may set `Sequence::status` to reject a request. It must not require scheduler logical blocks or cache object allocations. + +### batchop-setup + +`BatchOp::kSetup` runs on the engine thread after scheduler commit and before batch submission. It consumes committed active requests and scheduler metadata. It prepares host and device metadata buffers, copies non-cache input metadata, may resolve committed cache allocation handles to raw addresses, and may update request-owned module handles that describe the submitted work. 
It must treat the scheduler decision as fixed. + +### object-address + +Resolving an `ObjectAllocator` allocation handle to an address is metadata preparation, not backing-memory access. The address may be copied as a pointer value. The engine thread must not dereference that address or issue copies, clears, restores, publishes, kernels, or other operations whose source or destination is the cache object backing memory. A handle resolves to one or more `(address, bytes)` segments (one for a simple object, N+1 for a composite); the same engine-thread restriction applies to every segment. A handle is a typed `const Allocation*` that dereferences directly to a stable `Allocation` holding the per-part `bases`; identity and staleness are owned by a single always-on mechanism — a monotonic `Allocation::key` that a consumer snapshots and later compares (there is no compile-time backend split). Each slab slot stores its owning handle (`slot_owner_`), the reverse link a future compaction pass uses to rewrite the one `Allocation` that owns a relocated slot. + +### batchop-prepare + +`BatchOp::kPrepare` runs on the model executor thread after the setup event is visible on the executor stream and after scheduler-planned restore copies have been enqueued. It prepares device-side state for forward execution. It may use raw cache object addresses prepared by setup and perform module-owned byte-range content operations, such as clearing state for requests whose forward starts at position 0 (`history_len + inflight_input_len == 0`) or post-processing restored content. + +### batchop-forward + +`BatchOp::kForward` runs on the model executor thread. It executes model computation for the submitted batch, mutates module device state for the active requests, writes sampled output ids when generation is active, and updates device-side finished and sequence-length state. 
KV cache writes are bounded below by `readonly_block_num * block_size`; positions in read-only leading blocks are read but not re-written. + +### batchop-unprep + +`BatchOp::kUnprep` runs on the model executor thread after forward execution and before scheduler-planned publication copies are enqueued. It exports device-side results needed by the engine update path into per-phase module buffers and is the module's last chance to finalize frontier contents before publication snapshots them. It must not invoke external request callbacks. + +### batchop-fetch + +`BatchOp::kFetch` runs on the engine thread after the completed batch's done event is visible on the engine stream. It schedules copies from per-phase module buffers to host-visible buffers and publishes fetched tensors into `env` for `kUpdate`. + +### batchop-update + +`BatchOp::kUpdate` runs on the engine thread after fetch copies have completed and the engine stream has synchronized. It updates request-local host state from fetched results and module-owned host buffers. It may update generation sampling state and other CPU-side bookkeeping. It must not release request-owned resources. + +### batchop-del + +`BatchOp::kDel` runs on the engine thread during retirement cleanup before `Scheduler::Release()`. It releases module-owned request resources such as generation rows. It must tolerate partially initialized request state and must not depend on the request being active in a current batch. + +### executor-only + +Only `kPrepare`, `kForward`, and `kUnprep` are executed by the model executor thread. Cache object backing memory is accessed only by these module-level operations and by the executor-run, scheduler-planned whole-object copies that bracket them. + +### cache-metadata + +Cache metadata is generic. 
`CacheBlockPool` records cache ids, object ids, allocation handles, timestamps, and a weak `owner` back-reference to the logical block a slot belongs to (a valid allocation holds one strong ref on its owner). `LogicalBlock` records which cache ids are attached to a token interval. Neither type defines what an object's bytes mean. A slot caches the resolved `Allocation` handle (giving the per-part `bases` and the part count) plus an `alloc_key` snapshot for ABA-safe stale detection; the cached `allocation` being non-null is the validity flag. The pool still does not know a segment is a layer. + +### cache-content + +Cache contents are module-specific within registered byte ranges. `UnifiedAttentionLayer` owns KV byte-range semantics. `GatedDeltaNetLayer` owns recurrent and convolution state byte-range semantics. Future modules that register category bytes must define their own resumability and content-update rules. + +### cache-reuse + +A valid cache allocation is sufficient to keep an indexed prefix node alive, but it is not sufficient to make the node reusable for a request. Reuse requires the block to have been published (`is_valid`) and is revalidated by `Scheduler::Resume()` on every pass. + +### resume-selection + +`resume_len` is selected by `Scheduler::Resume()`. Without checkpoint bytes it is the contiguous prefix-valid token end, capped by the async executable upper bound. With checkpoint bytes it is the latest position among the request frontier, published block checkpoints, and fork sources that is covered by valid prefix content; restore intent is expressed as copy plans into the frontier. Generic cache validity alone never raises `resume_len` past content that was not proven produced (`is_valid` for indexed nodes, `filled_len` for private blocks). 
`resume_len` (what every stateful module skips) is distinct from `readonly_block_num` (the KV-store boundary): full validity of leading whole blocks marks them read-only for KV stores even when checkpoint coarseness keeps `resume_len` lower, so the re-processed window `[resume_len, readonly_block_num * block_size)` rebuilds recurrent state without re-writing already-valid KV. When a published prompt-boundary node exists (`prompt_boundary_node`), a duplicate (or history-extending) prompt resumes via fork-extension at the producer's prompt-boundary node end `B` (the producer's `prompt_boundary_pos = prompt_len - cache_prompt_boundary_skip` on the source node) by restoring the node's KV, plus its recurrent-state checkpoint when the model is recurrent; full-block prompt and generation checkpoints remain always-on regardless of the knobs. + +### category-registration + +Modules register anonymous byte requirements with the prefix or checkpoint category during construction and keep only byte offsets or base part ids (per registration channel). Each category registers one composite `ObjectAllocator` object id after all modules have registered. A category exposes two registration channels: an accumulation channel (grows part 0, returns a within-part byte offset) and a composite channel (appends parts 1..N, returns the base part id). Slab classes in `ObjectAllocator` are deduped by aligned size, and two same-aligned-size simple categories would share an object id (out of scope: prefix is the only simple category). Modules own the content semantics of their registered byte ranges. The `CacheRegistry` only maps categories to object ids and byte offsets; cache id reservation, validity, resume selection, and release are scheduler policy. When a module sizes its composite parts to equal another category's aligned object size (e.g. 
`GatedDeltaNetLayer` block-sizing recurrent parts to the prefix object), they share one slab class and become interchangeable at slot granularity under the already category-agnostic eviction sweep; page-granular reclamation (`slab.h`, `kMaxEmptySlabs == 0`) is the pre-existing baseline that also applies when sizes differ. + +### unified-attention + +`UnifiedAttentionLayer` registers its KV byte requirement with the prefix category during construction and stores the returned byte offset. During setup it resolves committed prefix cache ids from logical blocks and prepares KV pointer metadata. Reserving logical-block cache ids and validating contiguous prefix coverage are scheduler planning, not module work. It skips KV cache stores for positions in read-only leading blocks (`< readonly_block_num * block_size`) and supplies those positions from the already-valid blocks during reads. + +### gated-deltanet + +`GatedDeltaNetLayer` registers its recurrent/convolution state byte requirement with the checkpoint category during construction and stores the relevant offsets (per-layer conv element offsets within part 0, computed by the module; the base part id `rec_base` for recurrent parts). During setup it resolves the committed frontier cache part bases for each request and records which requests start their forward at position 0 (`history_len + inflight_input_len == 0`; in-flight tokens advance the frontier before this batch runs). During `kPrepare` it clears its registered parts (conv part 0 and each recurrent block part, including any rounding padding) for those requests. It does not know whether checkpoints are restored, published, or shared; those are scheduler-planned, executor-run whole-object copies. The recurrent state is a rounded-up 2D `(L_b layers × H_b v_heads)` block grid: one uniform composite part (`block_bytes_`) per block, conv unchanged.
`GatedDeltaNetLayer` resolves a per-(layer-group, batch, head-group) recurrent base (composite part `rec_base + (L/L_b)*ng + (h/H_b)`, shared by all `L_b` layers of the block-row) plus a per-layer in-block element offset `linear_state_offset == (L%L_b)*H_b*cell_elems`, and one accumulated conv base (part 0) with the per-layer conv element offset, instead of one recurrent base per layer. The recurrent kernel indexes head-groups: `state_ptrs[b*ng + h/H_b] + linear_state_offset + (h%H_b)*state_size`. With `TM_GDN_BLOCK_CONFIG` unset (`L_b=1, H_b=num_v_heads, ng=1`) this reduces exactly to one base per layer at offset 0. Consumers that reuse a prompt-boundary checkpoint resume at `B` with a restored checkpoint (not position 0), so the "clear at start" path (`history_len + inflight_input_len == 0`) is unaffected. + +### checkpoint-publish + +Checkpoint publication is planned and committed entirely by the scheduler. Planning reserves a request-owned publication cache id. Commit knows the forward end only after admitted `input_len`, and skips nodes that already hold a valid checkpoint. Publication planning is routed mutually exclusively by the pass's forward end — a prompt-boundary group (a `fork_to` node's KV copy only when `B` is mid-block, plus the boundary checkpoint published either onto that partial `fork_to` node or onto the block-aligned boundary block, planned only when `prompt_boundary_node` is set and the forward landed at `B`, so a not-yet-reached pass allocates neither the KV block nor the checkpoint slot) and a full-block group. The full-block group is coverage-driven: it publishes iff a full block ends exactly at the forward end, subject to the configured minimum interval, with no knowledge of prompt-boundary mode. The prompt-boundary checkpoint bypasses the minimum interval. 
(This drops the prior behavior of suppressing a full-block checkpoint just below an upcoming prompt boundary; that checkpoint is now kept, since full-block publication depends only on coverage.) It attaches the allocated cache id to the target's checkpoint slot and emits a frontier-to-slot publication copy that the executor runs after `kUnprep`. + +### prefix-identity + +Prefix identity is token identity, per-image content identity, plus parent identity. Index lookup must use cumulative `PrefixKey`, exact parent identity, exact segment-token comparison, and exact comparison of the block's start-fingerprints (`LogicalBlock::image_fps`). A fingerprint is the image's opaque 256-bit content identity; an empty fingerprint never compares equal to anything, including another empty fingerprint. Blocks interior to an image carry no fingerprint of their own — their identity is carried by the cumulative key and the parent chain, since the image's first block exact-compares the fingerprint. Hash equality alone is never identity. + +### prefix-ownership + +Producer marking is a per-pass exclusion mechanism. `Scheduler::Schedule()` sets `LogicalBlock::producer` on the committed produced range and the same pass's publication step clears it. A request must not be admitted to write blocks carrying a foreign producer mark. There is no cross-pass ownership state to clean up on cancel. + +### prefix-publish + +Publication of produced ranges happens at scheduler commit, after the memory replay. Indexed nodes become `is_valid` only when the committed forward end fully covers them; private blocks become `is_valid` with their content extent tracked by `filled_len`. Device-side content arrives in submission order, so a consumer batch always executes after the producer batch that committed before it. + +### cancel-release + +Canceling or releasing a request drops the request's references; the order among blocks is immaterial. 
Private (un-indexed) blocks have their allocations deallocated immediately (dropping the allocation ref while the request ref still pins the block), then clearing `Sequence::block_ids` drops the request refs and recycles any now-unreferenced block; indexed nodes keep valid allocations alive (each allocation holds an allocation ref in `LogicalBlock::refs` via `Retain`/`Drop` on the slot's `owner`) and remain discoverable. Incomplete indexed nodes are left `is_valid == false`, so no consumer can resume from their content; they are reclaimed by eviction. + +### checkpoint-adoption + +Terminal checkpoint frontier adoption happens inside `Scheduler::PublishGeneration()` for normally finished, non-canceled, trie-eligible requests when the frontier allocation is valid, and is gated by `cache_generation == all`. Adoption does not test `frontier_pos`: at finalization the live recurrent buffer is guaranteed to correspond to `filled_len` because the finishing pass stored its state there and the GDN recurrence kernel bypasses its state write-back whenever the device finished mask is set, so any async over-shoot pass leaves the buffer untouched. `frontier_pos` is resume-fast-path bookkeeping committed speculatively as the scheduled forward end (`CommitResults()`), so under async lookahead it over-counts past `filled_len`; testing it would spuriously block a safe adoption. Adoption transfers the frontier cache id into the checkpoint slot of the newly indexed terminal block and transfers the allocation's reference to that block. On adoption, redundant full-block checkpoints within `checkpoint_min_interval` below `filled_len` that sit on blocks being indexed this pass (`pos > prompt_len`, still private — no consumer reference) are dropped, mirroring eviction. If the only in-window checkpoint sits on an already-shared block (`pos <= prompt_len`), adoption is skipped instead, preserving the interval without touching shared state. 
The terminal partial generated block is itself indexed into the prefix trie whenever `cache_generation == all`, independent of model type (full generated blocks always index): it carries the partial block's KV for every prefix-cached model, and the frontier-checkpoint adoption above applies only when the frontier id is valid (recurrent models). So the generation-boundary partial node exists only when fork matching can reach it. + +### cache-eviction + +Eviction may remove cache objects without module-specific knowledge. After eviction, a prefix node remains indexed only while its reference count is positive (requests, fork edges, or remaining valid allocations). Checkpoint and prefix resumability are revalidated by `Resume()` on every pass from current allocation validity. Published checkpoints are not held in any request's eviction-protection set (`involved_cache_ids`), so they age and are reclaimed before live working-set blocks under pressure. Eviction frees a cache *slot* (the allocation), not the `LogicalBlock`: a block referenced by a living sequence or a fork edge survives even with all of its allocations evicted, and is recycled only when its last reference drops. + +## checklist + +Before changing TurboMind async execution, scheduler, cache management, or module-level `BatchOp` behavior, verify the change preserves these rules: + +### state-owner + +Does exactly one component own each state mutation? + +### cache-prepare + +Do `Accept`/`Resume`/`Continue` only match or create logical blocks, reserve cache ids, compute `resume_len`, and emit copy intent, without backing allocation, active admission, or device content mutation? Does `Continue` maintain `involved_cache_ids` incrementally (appending only tail blocks from `EnsureBlocks`) while `Resume` rebuilds it from a full scan each pass? 
+ +### scheduler-commit + +Does `Scheduler::Schedule()` remain the only active-admission, allocation, eviction, `history_len`, `input_len`, and publication-attach commit point? + +### cache-semantics + +Are cache object byte ranges interpreted only by the module that registered the byte range? + +### cache-validity + +Is generic cache validity used only for lifetime, not to raise `resume_len`? + +### cache-memory + +Are cache object backing-memory reads and writes limited to executor-thread `BatchOp` handlers and executor-run, scheduler-planned whole-object copies, with KV writes further limited to `[readonly_block_num * block_size, end)` (read-only leading blocks are reads only)? For composite objects, are whole-object copies issued as one device copy per part? + +### delayed-release + +Can a finishing or canceled request be excluded from scheduling before its resources are physically released? + +### cleanup + +Is every request-owned resource released only after `retiring && inflight == 0`? + +### async-progress + +Does async state account for submitted-but-not-yet-reflected work through `inflight_input_len`, `inflight_new_tokens`, and `inflight`? + +### forward-progress + +On a scheduling pass that admits nothing (empty active batch) with no in-flight work remaining (`inflight == 0` for every request), does the engine fail the highest-priority eligible request (smallest `unique_id`) with `kOutOfMemory` — rather than resubmitting empty batches indefinitely — so a request too large for the cache always receives a terminal status? + +### callbacks + +Are external callbacks delivered through gateway signals rather than directly on the engine scheduling path? + +### prefix-ownership + +If producer marking is touched, is it set only at scheduler commit and cleared by the same pass's publication step? 
+ +### module-cache + +If a module registers cache bytes, does it register with exactly one category, store only its byte offset, define setup pointer resolution, and keep backing-memory reads and writes in executor-thread `BatchOp` handlers? + +### boundary-policy + +Are partial-block boundary publishes decided at Accept/finalization from `cache_prompt` / `cache_generation` (`CacheMode`), with no runtime veto object? Is `cache_prompt=auto` gated on multimodal overlap of the partial node's range? Is the decision a pure function of cross-rank-identical attributes? Is full-block publication kept mode-free (coverage-driven)? Is the recurrent-checkpoint spacing knob `cache_checkpoint_interval` (> 0, no block_seq_len fallback)? + +### contract-sync + +If this document no longer matches the intended behavior, is the contract updated in the same change as the code? diff --git a/src/turbomind/engine/batch.h b/src/turbomind/engine/batch.h index 6059e43458..1f7de3a42b 100644 --- a/src/turbomind/engine/batch.h +++ b/src/turbomind/engine/batch.h @@ -1,7 +1,9 @@ #pragma once +#include #include +#include #include "src/turbomind/core/core.h" #include "src/turbomind/engine/request.h" @@ -10,30 +12,39 @@ namespace turbomind { enum class BatchOp { - kAdd, // Se -> Rc H - kSetup, // Rc -> (B -> D) H2D - kPrepare, // (D -> St) D - kForward, // St -> St D - kUnprep, // (St -> D) D - kFetch, // (D -> B) D2H - kUpdate, // B -> Rc H - kDel, // Rc -> Se H + kAdd, // Request -> Seq H Sched + kSetup, // Seq -> (B -> D) H2D Sched + kPrepare, // (D -> St) D Exec + kForward, // St -> St D Exec + kUnprep, // (St -> D) D Exec + kFetch, // (D -> B) D2H Sched + kUpdate, // B -> Seq H Sched + kDel, // Seq -> Request H Sched }; -// Se -> Rc -> (B -> D) -> St -> (D -> B) -> Rc -> Se +// Request -> Seq -> (B -> D) -> St -> (D -> B) -> Seq -> Request /* -Se -> Rc (add: rc) - Rc -> B - (B -> D) (setup: rc, d, copy) +Request -> Sequence (add) + Sequence -> B + (B -> D) (setup: Sequence, d, copy) (D -> St) 
St -> St (forward) (St -> D) (D -> B) - B -> Rc (sync) -Rc -> Se (del: rc) + B -> Sequence (sync) +Sequence -> Request (del) */ +// A scheduler-planned cache copy resolved to device pointers on the engine +// thread. Executed by the model executor as a whole-object device copy. +struct ResolvedCopy { + void* src; + void* dst; + size_t bytes; +}; + +/// TODO: The structure itself should not be passed as a part of ENV struct BatchData { explicit BatchData(int phase): self{this}, phase{phase} @@ -52,12 +63,13 @@ struct BatchData { const int phase; - int bs0 = 0; - int bsz = 0; + int bs0 = 0; // prev batch size + int bsz = 0; // curr batch size Buffer_ perm; - std::vector> rc; + std::vector restore_copies; // run before BatchOp::kPrepare + std::vector publish_copies; // run after BatchOp::kUnprep std::vector local_token_num; int global_token_num = 0; diff --git a/src/turbomind/engine/block.cc b/src/turbomind/engine/block.cc new file mode 100644 index 0000000000..da1b6ed319 --- /dev/null +++ b/src/turbomind/engine/block.cc @@ -0,0 +1,114 @@ +#include "src/turbomind/engine/block.h" + +#include +#include +#include + +namespace turbomind { + +void CacheBlockPool::Invalidate(int index) +{ + TM_CHECK_GT(index, 0); + TM_CHECK_LT(index, static_cast(blocks_.size())); + TM_CHECK_GE(blocks_[index].object_id, 0); + blocks_[index] = {}; + free_list_.push_back(index); +} + +int CacheBlockPool::Create(int object_id, LogicalBlock* owner) +{ + TM_CHECK_GE(object_id, 0); + if (TM_UNLIKELY(free_list_.empty())) { + free_list_.push_back(static_cast(blocks_.size())); + blocks_.emplace_back(); + } + + const int idx = free_list_.back(); + free_list_.pop_back(); + blocks_[idx] = {}; + blocks_[idx].object_id = object_id; + blocks_[idx].owner = owner; + return idx; +} + +void CacheBlockPool::Deallocate(ObjectAllocator& alloc, int cache_id) +{ + auto& c = blocks_[cache_id]; + TM_CHECK(c.valid()); + alloc.Deallocate(c.object_id, c.allocation); + c.allocation = {}; + c.alloc_key = 0; + c.timestamp
= 0; +} + +std::vector CacheBlockPool::SortedIndices() const +{ + std::vector idxs; + idxs.reserve(blocks_.size()); + for (int i = 1; i < static_cast(blocks_.size()); ++i) { + if (blocks_[i].valid()) { + idxs.push_back(i); + } + } + std::sort(idxs.begin(), idxs.end(), [this](int i, int j) { return blocks_[i].timestamp < blocks_[j].timestamp; }); + return idxs; +} + +uint64_t CacheBlockPool::Stamp(const std::vector& cache_ids) +{ + const auto ret = next_timestamp_; + for (auto it = cache_ids.rbegin(); it != cache_ids.rend(); ++it) { + if (*it) { + Stamp(*it); + } + } + return ret; +} + +uint64_t CacheBlockPool::Stamp(int cache_id) +{ + TM_CHECK_GT(cache_id, 0); + TM_CHECK_LT(cache_id, static_cast(blocks_.size())); + TM_CHECK_GE(blocks_[cache_id].object_id, 0); + blocks_[cache_id].timestamp = next_timestamp_++; + return blocks_[cache_id].timestamp; +} + +LogicalBlockPool::~LogicalBlockPool() +{ + if (live_ != 0) { + TM_LOG_ERROR("leaked {} logical blocks", live_); + } +} + +BlockHandle LogicalBlockPool::Create(int logical_index) +{ + TM_CHECK_GT(block_size_, 0); + TM_CHECK_GE(logical_index, 0); + + LogicalBlock* p = alloc_.allocate(1); + std::allocator_traits::construct(alloc_, p); + p->mgr = this; + p->offset = logical_index * block_size_; + p->capacity = block_size_; + ++live_; + return BlockHandle{p}; // refs 0 -> 1 +} + +void LogicalBlockPool::Recycle(LogicalBlock* p) +{ + if (on_recycle_) { + on_recycle_(*p); // PrefixTrie::Erase (pool stays prefix-agnostic) + } + if (const int c = p->prefix_id) { + cache_.Invalidate(c); // allocation already gone (see class comment) + } + if (const int c = p->checkpoint_id) { + cache_.Invalidate(c); + } + std::allocator_traits::destroy(alloc_, p); // ~LogicalBlock drops fork edges, frees tokens + alloc_.deallocate(p, 1); // back to the pmr pool + --live_; +} + +} // namespace turbomind diff --git a/src/turbomind/engine/block.h b/src/turbomind/engine/block.h new file mode 100644 index 0000000000..7b4de81402 --- /dev/null +++ 
b/src/turbomind/engine/block.h @@ -0,0 +1,248 @@ +#pragma once + +#include +#include +#include +#include +#include + +#include "src/turbomind/core/check.h" +#include "src/turbomind/engine/fingerprint.h" +#include "src/turbomind/engine/prefix_key.h" +#include "src/turbomind/memory/common.h" +#include "src/turbomind/memory/object.h" + +namespace turbomind { + +struct LogicalBlock; +class LogicalBlockPool; + +// Intrusive, non-atomic strong handle to a logical block (engine-thread only). +// Holds one ref for its lifetime; copy retains, destruction drops. +class BlockHandle { + LogicalBlock* p_{}; + +public: + BlockHandle() = default; + explicit BlockHandle(LogicalBlock* p); + BlockHandle(const BlockHandle& o); + BlockHandle(BlockHandle&& o) noexcept: p_{std::exchange(o.p_, nullptr)} {} + BlockHandle& operator=(BlockHandle o) noexcept + { + std::swap(p_, o.p_); + return *this; + } + ~BlockHandle(); + + LogicalBlock& operator*() const noexcept; // defined after LogicalBlock is complete + LogicalBlock* operator->() const noexcept + { + return p_; + } + LogicalBlock* get() const noexcept + { + return p_; + } + explicit operator bool() const noexcept + { + return p_ != nullptr; + } + friend bool operator==(const BlockHandle& a, const BlockHandle& b) noexcept + { + return a.p_ == b.p_; + } +}; + +struct CacheBlock { + uint64_t timestamp{}; // eviction priority; zero means highest + int object_id{-1}; // ObjectAllocator registration id + object_alloc_t allocation{}; // {const Allocation*}; .a == nullptr => no live allocation + uint64_t alloc_key{}; // snapshot of allocation->key at replay (ABA stale check) + + // Slot -> owning logical block (weak identity). Set at Create; persists + // across evict/realloc. nullptr = request-owned (frontier/publish). + LogicalBlock* owner{}; + + // Base of part `p`; `part` indexes the resolved Allocation. 
+ char* base(int part) const + { + return allocation->base(part); + } + int part_count() const + { + return allocation->part_count(); + } + bool valid() const noexcept + { + return allocation.a != nullptr; + } +}; + +class CacheBlockPool { +public: + CacheBlockPool() + { + blocks_.emplace_back(); + } + + void Invalidate(int index); + + int Create(int object_id, LogicalBlock* owner = nullptr); + + // Deallocates the slot's backing object and clears it back to "no + // allocation" state (the owner identity persists). Pre-condition: the slot + // has a live allocation (allocation set). + void Deallocate(ObjectAllocator& alloc, int cache_id); + + // Eviction candidates: exactly the currently allocated blocks. The cached + // allocation handle is the validity flag; the timestamp only orders the candidates. + std::vector SortedIndices() const; + + uint64_t Stamp(const std::vector& cache_ids); + uint64_t Stamp(int cache_id); + + CacheBlock& operator[](int index) noexcept + { + return blocks_[index]; + } + const CacheBlock& operator[](int index) const noexcept + { + return blocks_[index]; + } + + size_t size() const noexcept + { + return blocks_.size() - free_list_.size(); + } + +private: + uint64_t next_timestamp_{1}; + + std::vector blocks_; + std::vector free_list_; +}; + +struct LogicalBlock { + // Position and content extent within the sequence + int offset{-1}; + int capacity{0}; + int size{0}; // filled tokens of an indexed node; 0 for private blocks + + // Intrusive strong refcount (requests + fork edges + valid allocations) + int refs{0}; + LogicalBlockPool* mgr{}; // set at Create; used by handle / Retain / Drop + + // Cache slots, one per category + int prefix_id{0}; + int checkpoint_id{0}; + + // Prefix trie node state (mutated only via the trie methods) + const LogicalBlock* parent{}; // nullptr = root; non-owning identity + PrefixKey key{}; // empty => not (yet) a prefix node + std::vector tokens; + std::vector image_fps; // start-fingerprints of images 
beginning in this block (usually empty) + bool indexed{false}; // present in the trie index + + // Fork edges (strong, RAII) + BlockHandle fork_from; // partial-match source (read side) + BlockHandle fork_to; // prompt-boundary publish target (write side) + + bool is_valid{false}; // content proven produced; cleared on prefix evict + uint64_t producer{0}; // request currently writing this range; 0 = none +}; + +// Owns logical block lifetime via an intrusive refcount. Nodes are allocated +// discretely from a pooled memory resource, so a LogicalBlock* is a stable +// identity. When refs reaches 0 the node is recycled: a recycle hook removes +// it from the PrefixTrie index, every attached cache slot's allocation is +// already invalid (a valid allocation holds a ref via CacheBlock::owner), so +// Invalidate only returns slot metadata. +class LogicalBlockPool { + using NodeAlloc = std::pmr::polymorphic_allocator; + +public: + LogicalBlockPool(CacheBlockPool& cache, int block_size = 0): block_size_{block_size}, cache_{cache} {} + + ~LogicalBlockPool(); + + void ResetBlockSize(int block_size) + { + TM_CHECK_GT(block_size, 0); + TM_CHECK_EQ(live_, 0); + block_size_ = block_size; + } + + int block_size() const noexcept + { + return block_size_; + } + + void set_recycle_hook(std::function h) + { + on_recycle_ = std::move(h); + } + + BlockHandle Create(int logical_index); + + void Retain(LogicalBlock* p) noexcept + { + if (p) { + ++p->refs; + } + } + + void Drop(LogicalBlock* p) // sole decrement funnel + { + if (p) { + TM_CHECK_GT(p->refs, 0); + if (--p->refs == 0) { + Recycle(p); + } + } + } + + size_t size() const noexcept + { + return live_; + } + +private: + void Recycle(LogicalBlock* p); // sole place that frees a node + + int block_size_{}; + int live_{}; + + CacheBlockPool& cache_; + + std::pmr::unsynchronized_pool_resource res_; + NodeAlloc alloc_{&res_}; + std::function on_recycle_; +}; + +inline BlockHandle::BlockHandle(LogicalBlock* p): p_{p} +{ + if (p_) { + 
p_->mgr->Retain(p_); + } +} + +inline BlockHandle::BlockHandle(const BlockHandle& o): p_{o.p_} +{ + if (p_) { + p_->mgr->Retain(p_); + } +} + +inline BlockHandle::~BlockHandle() +{ + if (p_) { + p_->mgr->Drop(p_); + } +} + +inline LogicalBlock& BlockHandle::operator*() const noexcept +{ + return *p_; +} + +} // namespace turbomind diff --git a/src/turbomind/engine/cache_mode.h b/src/turbomind/engine/cache_mode.h new file mode 100644 index 0000000000..1ffc6d1a4f --- /dev/null +++ b/src/turbomind/engine/cache_mode.h @@ -0,0 +1,52 @@ +// Copyright (c) OpenMMLab. All rights reserved. + +#pragma once + +#include + +#include "src/turbomind/core/logger.h" // TM_LOG_FATAL + +namespace turbomind { + +// Per-side prefix-cache publication mode. cache_prompt uses {kAuto, kAll} +// (no kNone); cache_generation uses all three. +enum class CacheMode +{ + kNone, + kAuto, + kAll +}; + +// String -> CacheMode. cache_prompt never receives "none" (rejected by the +// Python TurbomindEngineConfig.__post_init__ assert); the shared parser still +// accepts it for cache_generation. TM_LOG_FATAL is [[noreturn]] (std::abort), +// so no trailing return is needed. +inline CacheMode ParseCacheMode(const std::string& s) +{ + if (s == "none") { + return CacheMode::kNone; + } + if (s == "auto") { + return CacheMode::kAuto; + } + if (s == "all") { + return CacheMode::kAll; + } + TM_LOG_FATAL("invalid cache mode: {}", s); +} + +// Pure prompt-boundary publish decision, given the mode, whether the geometric +// plan wants a partial fork_to node (else block-aligned checkpoint clamp), and +// whether that partial node's token range holds image tokens. 'all' publishes a +// partial node whenever the plan is partial and arms the clamp when +// block-aligned; 'auto' publishes only an image-bearing partial node and never +// arms the block-aligned clamp. 
+inline bool DecidePromptBoundaryPublish(CacheMode prompt_mode, bool plan_partial, bool has_image_in_node) +{ + if (plan_partial) { + return prompt_mode == CacheMode::kAll || (prompt_mode == CacheMode::kAuto && has_image_in_node); + } + return prompt_mode == CacheMode::kAll; +} + +} // namespace turbomind diff --git a/src/turbomind/engine/cache_registry.cc b/src/turbomind/engine/cache_registry.cc new file mode 100644 index 0000000000..904fa4b436 --- /dev/null +++ b/src/turbomind/engine/cache_registry.cc @@ -0,0 +1,19 @@ +#include "src/turbomind/engine/cache_registry.h" + +namespace turbomind { + +void CacheCategory::RegisterObjectId(ObjectAllocator& allocator) +{ + TM_CHECK_LT(object_id_, 0); + if (used()) { + object_id_ = allocator.Register(effective_parts()); + } +} + +void CacheRegistry::RegisterObjectIds(ObjectAllocator& allocator) +{ + prefix_.RegisterObjectId(allocator); + checkpoint_.RegisterObjectId(allocator); +} + +} // namespace turbomind diff --git a/src/turbomind/engine/cache_registry.h b/src/turbomind/engine/cache_registry.h new file mode 100644 index 0000000000..0c3f2f7537 --- /dev/null +++ b/src/turbomind/engine/cache_registry.h @@ -0,0 +1,156 @@ +#pragma once + +#include +#include +#include +#include +#include + +#include "src/turbomind/core/check.h" +#include "src/turbomind/memory/object.h" + +namespace turbomind { + +// Per-category composite object layout and ObjectAllocator registration. +// Modules claim [offset, offset + bytes) during model construction and keep +// only the returned byte offset. All cache policy lives in the scheduler. +class CacheCategory { +public: + // Accumulation channel: claims [offset, offset + bytes) in part 0 and returns + // the within-part byte offset. Modules keep only the returned offset. 
+ size_t Register(size_t bytes, size_t alignment) + { + TM_CHECK_LT(object_id_, 0); // registration closes once the object id exists + TM_CHECK_GT(bytes, 0); + TM_CHECK_GT(alignment, 0); + + auto& acc = parts_[0]; // {bytes, alignment, count=1} + constexpr size_t kMaxSize = std::numeric_limits::max(); + TM_CHECK_LE(alignment - 1, kMaxSize - acc[0]); + + const size_t offset = (acc[0] + alignment - 1) / alignment * alignment; + TM_CHECK_LE(bytes, kMaxSize - offset); + + acc[0] = offset + bytes; + acc[1] = std::max(acc[1], alignment); + return offset; + } + + // Composite channel: appends sum(count) parts of the given aligned sizes and + // returns the base part id (>= 1). Part ids count expanded parts. + int Register(const std::vector>& parts) + { + TM_CHECK_LT(object_id_, 0); + const int base = next_part_id_; + for (const auto& p : parts) { + TM_CHECK_GT(p[0], 0); + TM_CHECK_GT(p[1], 0); + TM_CHECK_GT(p[2], 0); + parts_.push_back(p); + next_part_id_ += static_cast(p[2]); + } + return base; + } + + void RegisterObjectId(ObjectAllocator& allocator); + + int object_id() const + { + TM_CHECK_GE(object_id_, 0); + return object_id_; + } + + int object_id_or_negative() const noexcept + { + return object_id_; + } + + // Bytes accumulated into part 0 (accumulation channel). For the prefix + // category this is the simple object's byte size; 0 before any Register. + size_t accumulation_bytes() const noexcept + { + return parts_[0][0]; + } + + bool used() const noexcept + { + return parts_[0][0] != 0 || next_part_id_ > 1; // accumulation non-empty or any composite part + } + + // Parts list for ObjectAllocator::Register: accumulation part 0 first (when + // non-empty), then composite parts in registration order. Part ids start at + // 1 for composites, so part 0 must be present whenever composites exist or + // positional indices would shift and base(part_id) would be wrong. 
+ std::vector> effective_parts() const + { + const bool has_acc = parts_[0][0] != 0; + const bool has_composite = next_part_id_ > 1; + TM_CHECK(!has_composite || has_acc) << "composite category requires a non-empty accumulation part 0"; + + std::vector> out; + if (has_acc) { + out.push_back(parts_[0]); + } + for (size_t k = 1; k < parts_.size(); ++k) { + out.push_back(parts_[k]); + } + return out; + } + +private: + // parts_[0] is the accumulation part (id 0): {bytes, alignment, count=1}. + // parts_[1..] are composite member specs in registration order. + std::vector> parts_{{0, 1, 1}}; + int next_part_id_{1}; // next composite part id (counts expanded parts) + int object_id_{-1}; +}; + +class CacheRegistry { +public: + CacheCategory& prefix() noexcept + { + return prefix_; + } + + const CacheCategory& prefix() const noexcept + { + return prefix_; + } + + CacheCategory& checkpoint() noexcept + { + return checkpoint_; + } + + const CacheCategory& checkpoint() const noexcept + { + return checkpoint_; + } + + // True once a module registered checkpoint bytes and the object id exists. 
+ bool has_checkpoint() const noexcept + { + return checkpoint_.object_id_or_negative() >= 0; + } + + int checkpoint_min_interval() const noexcept + { + return checkpoint_min_interval_; + } + + void set_checkpoint_min_interval(int interval) + { + TM_CHECK_GT(interval, 0); + checkpoint_min_interval_ = interval; + } + + void RegisterObjectIds(ObjectAllocator& allocator); + +private: + CacheCategory prefix_; + CacheCategory checkpoint_; + + int checkpoint_min_interval_{1}; +}; + +} // namespace turbomind diff --git a/src/turbomind/engine/engine.cc b/src/turbomind/engine/engine.cc index de0eed861d..6e5e631b46 100644 --- a/src/turbomind/engine/engine.cc +++ b/src/turbomind/engine/engine.cc @@ -4,10 +4,12 @@ #include #include #include +#include #include #include "nvtx3/nvToolsExt.h" +#include "src/turbomind/comm/env.h" #include "src/turbomind/comm/host_comm.h" #include "src/turbomind/core/allocator.h" #include "src/turbomind/core/check.h" @@ -15,19 +17,21 @@ #include "src/turbomind/engine/engine.h" #include "src/turbomind/engine/model_executor.h" #include "src/turbomind/engine/request.h" +#include "src/turbomind/engine/scheduler.h" #include "src/turbomind/core/copy.h" #include "src/turbomind/core/logger.h" #include "src/turbomind/core/scope.h" -#include "src/turbomind/models/decoder_layer_weight.h" -#include "src/turbomind/models/delta_net_weight.h" #include "src/turbomind/models/language_model.h" -#include "src/turbomind/models/llama/SequenceManager.h" +#include "src/turbomind/models/llama/context_token_resource.h" #include "src/turbomind/models/llama/llama_params.h" -#include "src/turbomind/models/model_weight.h" +#include "src/turbomind/models/vision_model.h" #include "src/turbomind/utils/cuda_utils.h" #include "src/turbomind/utils/metrics.h" +#include "src/turbomind/memory/object.h" +#include "src/turbomind/memory/stats.h" + // #include "dbg.h" namespace turbomind { @@ -37,18 +41,15 @@ using std::unique_ptr; using std::vector; struct RequestData { - vector> infer; // 
incoming inference request - vector> kill; // incoming kill request - - vector cancel; // canceled indices in current batch - bool abort; + vector> infer; // incoming inference request + vector cancel; // canceled indices in current batch + bool abort; }; template void serdes(Archive& ar, RequestData& r) { ar& r.infer; - ar& r.kill; ar& r.cancel; ar& r.abort; } @@ -58,23 +59,22 @@ struct Engine::Impl { using Requests = vector>; using Signal = std::function; + struct State; + Impl(EngineParam param, + ObjectAllocator alloc, + CacheRegistry cache_registry, LanguageModel model, std::unique_ptr vision_model, - const ModelWeight& weights, Context& ctx, Gateway& gateway, int device_id, int queue_id, int phases); - void CreateSequenceManager(); - void InternalThreadEntry(); - void Validate(Requests& infer_rs, Requests& kill_rs); - - void Kill(const Requests& rs, vector& signals); + void Validate(Requests& infer_reqs); vector GetCanceled(); @@ -82,19 +82,26 @@ struct Engine::Impl { void Accept(const Requests& rs, vector& signals); - void Interrupt(RequestCache& c); + void Interrupt(Sequence& c); + + void Retire(State& s); // Allocation of memory / compute resources void Schedule(); - // intiailize RC from `Sequence` + // Forward-progress guard: fail the head-of-line request on genuine cache OOM + void FailStalledHeadOfLine(std::vector& signals); + + // Initialize batch data from engine-local sequence state void Setup(BatchData& d); - // Sync vars from batch output to RC + // Sync vars from batch output to engine-local sequence state void Update(BatchData& d, std::vector& signals); void Run(BatchOp op, int phase, Ref env) { + // Vision sub-graph runs first so its env outputs (image embeddings, + // mrope tensors) are visible to the language model in the same pass. 
if (vision_model_) { vision_model_->Run(op, phase, env); } @@ -109,6 +116,8 @@ struct Engine::Impl { void UpdateScheduleMetrics(); + void MaybeLogCacheStats(); + ~Impl(); const EngineParam param_; @@ -129,27 +138,29 @@ struct Engine::Impl { int& is_warm_up_; - unique_ptr seq_mgr_; + ObjectAllocator object_allocator_; + Scheduler scheduler_; Queue> inbound_; Queue> outbound_; LanguageModel model_; - std::unique_ptr vision_model_; // null for text-only models - const ModelWeight& weights_; + std::unique_ptr vision_model_; // null for text-only checkpoints ModelExecutor executor_; std::thread internal_thread_; - int session_len_trunc_; + // int session_len_trunc_; shared_ptr metrics_; - std::atomic scheduler_tick_{}; + int cache_log_interval_ = GetEnv(); // read once (GetEnv caches statically) + uint64_t schedule_counter_ = 0; struct State { - vector> rc; - vector perm; + vector> rc; + + vector perm; // current -> previous int bs0 = 0; int active = 0; @@ -167,27 +178,36 @@ struct Engine::Impl { struct Data { }; vector data_; - - // staging buffers - Buffer_ block_ptrs_buf_; - Buffer_ block_ptrs_offsets_buf_; }; Engine::Impl::~Impl() { TM_LOG_INFO("{}", __PRETTY_FUNCTION__); + if (cache_log_interval_ && tp_rank_ == 0) { + TM_LOG_WARN("dp{} cache stats:\n{}", dp_rank_, FormatMemoryStats(object_allocator_.Stats())); + } inbound_.close(); outbound_.close(); if (internal_thread_.joinable()) { internal_thread_.join(); } executor_ = {}; + + for (auto& state : states_) { + for (auto& cache : state.rc) { + if (cache) { + scheduler_.Release(*cache); + cache.reset(); + } + } + } } Engine::Impl::Impl(EngineParam param, + ObjectAllocator alloc, + CacheRegistry cache_registry, LanguageModel model, std::unique_ptr vision_model, - const ModelWeight& weights, Context& ctx, Gateway& gateway, int device_id, @@ -204,9 +224,17 @@ Engine::Impl::Impl(EngineParam param, queue_id_{queue_id}, async_{phases > 1}, is_warm_up_{*ctx.is_warm_up}, + object_allocator_{std::move(alloc)}, + 
scheduler_{object_allocator_, + std::move(cache_registry), + param_.cache_block_seq_len, + param_.enable_prefix_caching, + param_.cache_prompt, + param_.cache_prompt_boundary_skip, + param_.cache_generation, + is_warm_up_}, model_{std::move(model)}, - vision_model_{std::move(vision_model)}, - weights_{weights} + vision_model_{std::move(vision_model)} { states_.emplace_back(); @@ -216,141 +244,50 @@ Engine::Impl::Impl(EngineParam param, executor_ = ModelExecutor{model_, vision_model_.get(), ctx, device_id_, outbound_, inbound_}; - CreateSequenceManager(); // initializes `session_len_trunc_` - - const ssize_t max_batch_block_num = param.max_batch_size * cdiv(session_len_trunc_, param_.cache_block_seq_len); - block_ptrs_buf_ = {max_batch_block_num, kCPUpinned}; - block_ptrs_offsets_buf_ = {param.max_batch_size + 1, kCPUpinned}; -} - -void Engine::Impl::CreateSequenceManager() -{ - const auto cache_block_seq_len = param_.cache_block_seq_len; - - // Derive DeltaNet fields if linear attention exists - bool has_linear_attention = false; - int linear_key_head_dim = 0, linear_value_head_dim = 0; - int linear_conv_kernel_dim = 0, linear_num_key_heads = 0, linear_num_value_heads = 0; - for (int i = 0; i < weights_.num_layer; ++i) { - if (auto* dn = weights_.layer(i)->linear_attn.get()) { - has_linear_attention = true; - linear_key_head_dim = dn->key_head_dim; - linear_value_head_dim = dn->value_head_dim; - linear_conv_kernel_dim = dn->d_conv; - linear_num_key_heads = dn->num_k_heads * param_.attn_tp_size; - linear_num_value_heads = dn->num_v_heads * param_.attn_tp_size; - break; - } - } - - if (has_linear_attention && param_.enable_prefix_caching) { - TM_LOG_FATAL("Prefix caching is unsupported when linear attention is present"); + if (cache_log_interval_ && tp_rank_ == 0) { + TM_LOG_WARN("dp{} cache stats:\n{}", dp_rank_, FormatMemoryStats(object_allocator_.Stats())); } - - const auto get_free_size = [&] { - size_t free{}, total{}; - TM_CUDA_CHECK(cudaMemGetInfo(&free, 
&total)); - return AllReduce(tp_group_, free, comm::RedOp::kMin); - }; - - seq_mgr_ = std::make_unique(weights_.head_dim, - weights_.kv_head_num / param_.attn_tp_size, - weights_.num_layer, - weights_.layer_types, - param_.quant_policy, - weights_.data_type, - weights_.data_type, // runtime_dtype = data_type - linear_key_head_dim, - linear_value_head_dim, - linear_conv_kernel_dim, - linear_num_key_heads, - linear_num_value_heads, - cache_block_seq_len, - param_.attn_tp_size, - param_.max_batch_size, - param_.cache_max_block_count, - param_.cache_chunk_size, - param_.enable_prefix_caching, - tp_rank_, - param_.attn_cp_size, - core::Context::alloc(kDEVICE), - get_free_size); - - const auto max_cached_tokens = seq_mgr_->max_block_count() * (size_t)cache_block_seq_len * param_.attn_cp_size; - session_len_trunc_ = std::min(max_cached_tokens, (size_t)param_.session_len); - TM_LOG_INFO("max cached tokens: {}", max_cached_tokens); - if (session_len_trunc_ != param_.session_len) { - TM_LOG_WARN("`session_len` truncated to {} due to limited KV cache memory", session_len_trunc_); - } - UpdateScheduleMetrics(); } -void Engine::Impl::Validate(Requests& infer_reqs, Requests& kill_reqs) +void Engine::Impl::Validate(Requests& infer_reqs) { std::pmr::monotonic_buffer_resource mbr; std::pmr::unordered_map occur(&mbr); - bool has_linear_attention = false; - for (auto t : weights_.layer_types) { - if (t == 1) { - has_linear_attention = true; - break; + for (const auto& s : states_) { + for (int i = 0; i < s.size(); ++i) { + if (s.rc[i]) { + ++occur[s.rc[i]->req->id]; + } } } + for (const auto& r : infer_reqs) { + ++occur[r->id]; + } - auto count = [&occur](const auto& reqs) { - for (const auto& r : reqs) { - ++occur[r->id]; + for (const auto& r : infer_reqs) { + if (occur[r->id] > 1) { + TM_LOG_ERROR("Skip conflicting infer request for ID {}", r->id); + r->ec = Request::kConflict; } - }; - - auto validate = [&](auto& reqs, const char* type, bool is_infer) { - for (const auto& r : 
reqs) { - if (occur[r->id] > 1) { - TM_LOG_ERROR("Skip conflicting {} request for ID {}", type, r->id); - r->ec = Request::kConflict; + if (!r->ec && param_.enable_prefix_caching) { + if (r->step != 0) { + TM_LOG_ERROR("Skip inconsistent infer request for ID {} step {}: " + "prefix caching is incompatible with a nonzero step", + r->id, + r->step); + r->ec = Request::kInconsistency; } - if (!r->ec && is_infer && has_linear_attention && !r->session.end_flag) { - TM_LOG_ERROR("Skip inconsistent {} request for ID {}. Linear attention only supports stateless " - "requests", - type, + else if (r->gen_cfg.output_logits == GenerationConfig::kAll + || r->gen_cfg.output_last_hidden_state == GenerationConfig::kAll || r->gen_cfg.return_ppl) { + TM_LOG_ERROR("Skip inconsistent infer request for ID {}: prefix caching cannot " + "output logits/last_hidden_states for all tokens or ppl", r->id); r->ec = Request::kInconsistency; } - if (param_.enable_prefix_caching) { - if (r->session.step != 0) { - // Prefix caching is incompatible with interactive mode - TM_LOG_ERROR("Skip inconsistent {} request for ID {} step {}", type, r->id, r->session.step); - r->ec = Request::kInconsistency; - } - else if (r->gen_cfg.output_logits == GenerationConfig::kAll - || r->gen_cfg.output_last_hidden_state == GenerationConfig::kAll || r->gen_cfg.return_ppl) { - // Prefix caching is incompatible with outputting all tokens' logits or last_hidden_state - TM_LOG_ERROR("Skip inconsistent {} request for ID {}. 
It cannot output logits or " - "last_hidden_states for all tokens or ppl", - type, - r->id); - r->ec = Request::kInconsistency; - } - } - } - }; - - for (const auto& s : states_) { - for (int i = 0; i < s.size(); ++i) { - if (s.rc[i]) { - ++occur[s.rc[i]->req->id]; - } } } - count(kill_reqs); - count(infer_reqs); - - validate(kill_reqs, "kill", false); - validate(infer_reqs, "infer", true); - - // New requests that never get a chance to start for (auto& r : infer_reqs) { if (r && r->cancel_flag.load(std::memory_order_acquire) == -1) { r->ec = Request::kCancel; @@ -372,44 +309,26 @@ vector Engine::Impl::GetCanceled() return idxs; } -void Engine::Impl::Kill(const Requests& kills, vector& signals) +void Engine::Impl::Interrupt(Sequence& c) { - for (auto& r : kills) { - if (r) { - int ec = r->ec; - if (!ec) { - if (!seq_mgr_->Erase(r->id)) { - ec = Request::kInvalid; - } - } - signals.push_back([=] { r->end_cb ? r->end_cb(ec) : void(); }); - } - } + Sequence* p = &c; + Buffer_ rs{&p, 1, kCPU}; + Run(BatchOp::kDel, -1, TensorMap{{"requests", rs}}); + + scheduler_.Release(c); } -void Engine::Impl::Interrupt(RequestCache& c) +void Engine::Impl::Retire(State& s) { - auto& s = *TM_CHECK_NOTNULL(c.seq); - if (c.req->session.end_flag) { - if (!is_warm_up_ && s.status != Sequence::kCached) { // At least `Locked` status is required for caching - seq_mgr_->CacheGeneration(s); - } - TM_CHECK(seq_mgr_->Erase(c.req->id)); - } - else { - if (s.recurrent_states && c.seq_len != s.cache_len) { - TM_LOG_WARN( - "[Engine][Interrupt] Invalidating cache for ID {} due to linear-state/cache mismatch ({} vs {})", - s.id, - c.seq_len, - s.cache_len); - seq_mgr_->InvalidateStatesAndCache(s); - } - else { - seq_mgr_->UpdateAndSetUnlock(s); + for (auto& p : s.rc) { + if (!p || !p->retiring || p->inflight != 0) { + continue; } + + Interrupt(*p); + p.reset(); + ++s.finish; } - c.seq = nullptr; } void Engine::Impl::Cancel(vector& indices, vector& signals) @@ -417,13 +336,14 @@ void 
Engine::Impl::Cancel(vector& indices, vector& signals) auto& s = states_.at(0); for (const auto& i : indices) { auto& c = TM_CHECK_NOTNULL(s.rc[i]); - c->done = true; - Interrupt(*c); - signals.push_back([r = std::move(c->req), l = c->seq_len] { // - UpdateState(*r, Request::kCancel, l); - }); - c.reset(); - s.finish += 1; + if (c->retiring) { + continue; + } + + c->is_canceled = true; + c->retiring = true; + c->done = true; + signals.push_back([r = c->req, l = c->seq_len] { UpdateState(*r, Request::kCancel, l); }); } } @@ -431,7 +351,7 @@ void Engine::Impl::Accept(const Requests& rs, vector& signals) { auto& s = states_.at(0); - vector> incoming; + vector> incoming; incoming.reserve(rs.size()); for (const auto& r : rs) { @@ -441,89 +361,32 @@ void Engine::Impl::Accept(const Requests& rs, vector& signals) continue; } - const int input_len = r->inputs.at("input_ids").shape(0); - - if (input_len > session_len_trunc_) { - signals.push_back([r] { UpdateState(*r, Request::kTooLong, 0); }); - continue; - } - - auto ptr = r->session.start_flag ? 
seq_mgr_->Create(r->id) : seq_mgr_->Get(r->id); - if (!ptr) { - signals.push_back([r] { UpdateState(*r, Request::kInvalid, 0); }); - continue; - } - - const int step = [&] { - int s = r->session.step; - if (s < 0) { - s = ptr->tokens.size(); - } - else if (s > ptr->tokens.size()) { - if (tp_rank_ == 0) { - TM_LOG_WARN("Skipping invalid step ({}) setting for ID {}", s, ptr->id); - } - s = ptr->tokens.size(); - } - return s; - }(); + const auto& input_ids = r->inputs.at("input_ids"); + const int input_len = input_ids.shape(0); - if (step + input_len > session_len_trunc_) { + if (input_len > param_.session_len) { signals.push_back([r] { UpdateState(*r, Request::kTooLong, 0); }); continue; } - if (step && param_.enable_prefix_caching) { - // step not supported in prefix-caching mode - signals.push_back([r] { UpdateState(*r, Request::kInconsistency, 0); }); - continue; - } - - auto& seq = *ptr; - seq_mgr_->AcquireLinearStateSlot(seq); + /// TODO: force step after prefix matching - if (seq.recurrent_states) { - if (step != seq.cache_len) { - signals.push_back([r] { UpdateState(*r, Request::kInvalid, 0); }); - continue; - } - } - - auto c = std::make_unique(r, seq); - - if (step < seq.tokens.size()) { - seq.tokens.resize(step); - seq.cache_len = std::min(seq.cache_len, step); - } - - c->step0 = step; - - // const int* input_ids = r->inputs.at("input_ids").data(); - auto& input_ids = r->inputs.at("input_ids"); + auto c = std::make_unique(r); int* token_ids = c->token_ids = r->output_ids.data(); - /// TODO: move this somewhere else - token_ids = std::copy_n(seq.tokens.data(), seq.tokens.size(), token_ids); token_ids = std::copy_n(input_ids.data(), input_len, token_ids); c->prompt_len = c->seq_len = token_ids - c->token_ids; // all known tokens - // Only prefix cache needs prompt data - if (param_.enable_prefix_caching && input_len && r->session.start_flag) { - seq.prompt.insert(seq.prompt.end(), input_ids.data(), input_ids.data() + input_len); - } - - // dbg(seq.cache_len, 
seq.tokens.size(), input_len, c->seq_len); - int max_seq_len = c->prompt_len + c->gen_cfg.max_new_tokens; - if (max_seq_len > session_len_trunc_) { - max_seq_len = session_len_trunc_; + if (max_seq_len > param_.session_len) { + max_seq_len = param_.session_len; if (tp_rank_ == 0) { const int trunc_output_len = max_seq_len - c->prompt_len; // clang-format off TM_LOG_WARN("ID {}: total sequence length ({} + {}) exceeds `session_len` ({}), `max_new_tokens` is truncated to {}", - seq.id, c->prompt_len, c->gen_cfg.max_new_tokens, session_len_trunc_, trunc_output_len); + r->id, c->prompt_len, c->gen_cfg.max_new_tokens, param_.session_len, trunc_output_len); // clang-format on } } @@ -532,7 +395,7 @@ void Engine::Impl::Accept(const Requests& rs, vector& signals) incoming.push_back(std::move(c)); } - Buffer_ buf(incoming.size(), kCPU); + Buffer_ buf(incoming.size(), kCPU); for (int i = 0; i < incoming.size(); ++i) { buf[i] = incoming[i].get(); } @@ -542,6 +405,7 @@ void Engine::Impl::Accept(const Requests& rs, vector& signals) for (auto& x : incoming) { if (x->status == 0) { + scheduler_.Accept(*x); s.rc.push_back(std::move(x)); } else { @@ -558,75 +422,82 @@ void Engine::Impl::Schedule() TM_FUNCTION_SCOPE(); auto& s = states_.at(0); - vector sequences; - vector status; - vector context_length; - vector alpha; - vector priorities; - vector cache; - vector inv; + vector eligible; + + vector was_active; + vector context_length; + vector orignal_idxs; + vector inflight_input_len; for (int i = 0; i < s.size(); ++i) { - // skip invalid positions - if (const auto& c = s.rc[i]) { - cache.push_back(c.get()); - sequences.push_back(c->seq); - status.push_back(c->seq->status); - priorities.push_back(c->req->unique_id); - context_length.push_back(c->seq_len + c->beta /* plus draft tokens */); - alpha.push_back(c->alpha); - TM_CHECK(c->seq->status == Sequence::kActive || c->alpha == 0) << c->seq->status << " " << c->alpha; - inv.push_back(i); - c->input_len = c->history_len = 0; - // 
dbg(c->request->id, c->seq_len, c->sequence.cache_len, c->alpha, c->beta, c->is_decoding, - // c->is_generate); + auto& p = s.rc[i]; + if (!p) { + continue; + } + auto& c = *p; + if (!c.retiring) { + eligible.push_back(&c); + was_active.push_back(c.is_active); + context_length.push_back(c.seq_len + c.inflight_new_tokens /* plus draft tokens */); + inflight_input_len.push_back(c.inflight_input_len); + orignal_idxs.push_back(i); + c.input_len = c.history_len = 0; } } - // dbg("Schedule"); + ScheduleResources resources; + resources.Add(param_.max_forward_token_num); + resources.Add(param_.max_context_token_num); - seq_mgr_->Materialize( - sequences, context_length, alpha, priorities, param_.max_forward_token_num, param_.max_context_token_num); + scheduler_.Schedule(eligible, resources); - vector idxs(sequences.size()); + vector idxs(eligible.size()); std::iota(idxs.begin(), idxs.end(), 0); - subrange active{idxs.begin(), std::stable_partition(idxs.begin(), idxs.end(), [&](int i) { - return sequences[i]->status == Sequence::kActive; // IS active - })}; + subrange active{idxs.begin(), + std::stable_partition(idxs.begin(), idxs.end(), [&](int i) { return eligible[i]->is_active; })}; - TM_CHECK(sequences.empty() || !active.empty()) << "No enough blocks"; + // An empty active batch (cache OOM / resource starvation) is handled by + // FailStalledHeadOfLine, called after Schedule() returns, where request + // lifecycle and signal emission live (see README forward-progress). 
if (is_warm_up_) { // Avoid extra iteration for warm up request in async mode (force inactivate) - active = {active.begin(), std::stable_partition(active.begin(), active.end(), [&](int i) { // - return alpha[i] == 0; + active = {active.begin(), std::stable_partition(active.begin(), active.end(), [&](int i) { + return inflight_input_len[i] == 0; })}; } subrange inactive{active.end(), idxs.end()}; - subrange existing{active.begin(), std::stable_partition(active.begin(), active.end(), [&](int i) { - return status[i] == Sequence::kActive; // WAS active in active - })}; + for (auto i : active) { + eligible[i]->is_active = true; + } + for (auto i : inactive) { + eligible[i]->is_active = false; + eligible[i]->input_len = 0; + eligible[i]->history_len = 0; + } + + subrange existing{active.begin(), + std::stable_partition(active.begin(), active.end(), [&](int i) { return was_active[i]; })}; subrange swap_in{existing.end(), active.end()}; - subrange swap_out{inactive.begin(), std::stable_partition(inactive.begin(), inactive.end(), [&](int i) { - return status[i] == Sequence::kActive; // WAS active in inactive - })}; + subrange swap_out{inactive.begin(), + std::stable_partition(inactive.begin(), inactive.end(), [&](int i) { return was_active[i]; })}; // |<-- existing -->|<-- swap-in -->|<- swap-out ->| // |<----------- active ----------->|<------- inactive ----->| for (auto i : swap_in) { - cache[i]->autoregres = {}; - cache[i]->generating = {}; + eligible[i]->autoregres = {}; + eligible[i]->generating = {}; } if (param_.enable_metrics) { for (auto i : swap_in) { - if (auto& m = cache[i]->req->metrics; TM_LIKELY(m)) { + if (auto& m = eligible[i]->req->metrics; TM_LIKELY(m)) { int64_t expected = 0; m->scheduled_time.compare_exchange_strong( expected, RequestMetrics::timestamp(), std::memory_order_relaxed); @@ -635,86 +506,140 @@ void Engine::Impl::Schedule() } for (auto i : existing) { - if (cache[i]->generating) { - cache[i]->autoregres = true; - } + auto& c = *eligible[i]; + 
c.autoregres = c.generating && c.input_len == 1; } for (auto i : active) { - auto& s = *sequences[i]; - auto& c = *cache[i]; - if (s.cache_len + c.alpha + s.input_length == c.seq_len + c.beta) { - c.generating = true; - } + auto& c = *eligible[i]; + c.generating = c.resume_len + c.inflight_input_len + c.input_len == c.seq_len + c.inflight_new_tokens; } // move partially prefilled sequences to the back - subrange partial{std::stable_partition(active.begin(), active.end(), [&](int i) { return cache[i]->generating; }), - active.end()}; - TM_CHECK_LE(partial.size(), 1); + subrange partial{ + std::stable_partition(active.begin(), active.end(), [&](int i) { return eligible[i]->generating; }), + active.end()}; // dbg(inv); - vector> rc(idxs.size()); - vector perm(idxs.size()); + vector> rc; + vector perm; + rc.reserve(s.size()); + perm.reserve(s.size()); for (int i = 0; i < idxs.size(); ++i) { - perm[i] = inv[idxs[i]]; // inverse map to original indices - rc[i] = std::move(s.rc[perm[i]]); // warp the request cache + perm.push_back(orignal_idxs[idxs[i]]); // inverse map to original indices (curr -> prev) + rc.push_back(std::move(s.rc[perm[i]])); // permute the engine-local sequence state } + // Put done sequences to the back, logical blocks need to be updated. + for (int i = 0; i < s.size(); ++i) { + if (auto& p = s.rc[i]) { + perm.push_back(i); + rc.push_back(std::move(p)); + } + } + s.rc.swap(rc); s.perm.swap(perm); - for (auto& c : s.rc) { - /// ! 
input_length not updated for inactive seqs - c->input_len = c->seq->input_length; - c->history_len = c->seq->cache_len; - // dbg(c->request->id, - // c->seq_len, - // c->history_len, - // c->input_len, - // c->alpha, - // c->beta, - // c->is_decoding, - // c->is_generate); - } - - s.bs0 = std::exchange(s.active, active.size()); + s.bs0 = std::exchange(s.active, active.size()); + if (cache_log_interval_ && schedule_counter_ % cache_log_interval_ == 0) { + TM_LOG_WARN("dp{} total: {}, eligible: {}, active: {}", dp_rank_, s.size(), eligible.size(), s.bs0); + } s.swapout = swap_out.size(); s.finish = 0; } -void Engine::Impl::Setup(BatchData& d) +void Engine::Impl::FailStalledHeadOfLine(std::vector& signals) { - TM_FUNCTION_SCOPE(); - auto& st = states_.at(0); + auto& s = states_.at(0); - d.rc.resize(st.active); - std::copy_n(st.rc.begin(), st.active, d.rc.begin()); + if (s.active != 0 || is_warm_up_) { + return; // work was admitted, or warm-up legitimately forces empty active + } - block_ptrs_offsets_buf_[0] = 0; - auto block_ptrs = block_ptrs_buf_.data(); - for (int i = 0; i < st.active; ++i) { - const auto& s = *st.rc[i]->seq; - block_ptrs_offsets_buf_[i + 1] = block_ptrs_offsets_buf_[i] + s.blocks.size(); - block_ptrs = std::transform(s.blocks.cbegin(), s.blocks.cend(), block_ptrs, [&](int block_id) { - return seq_mgr_->GetBlockPtr(block_id); - }); + // Nothing was admitted this pass. If no in-flight work remains, no memory + // will ever be released, so the highest-priority eligible request cannot + // make progress even with maximum eviction. Fail it with kOutOfMemory: it + // retires, releases its held cache, and the next request becomes + // head-of-line (see README forward-progress). 
+ Sequence* victim = nullptr; + for (auto& p : s.rc) { + if (!p) { + continue; + } + if (p->inflight > 0) { + return; // in-flight batch will release memory when it completes (transient drain) + } + if (!p->retiring && (!victim || p->req->unique_id < victim->req->unique_id)) { + victim = p.get(); // smallest unique_id == highest priority == root of the OOM + } } - d.bs0 = st.bs0; - d.bsz = st.active; + if (!victim) { + return; + } - d.perm = {d.bsz, kCPU}; - std::copy_n(st.perm.data(), d.bsz, d.perm.data()); + TM_LOG_WARN("dp{} ID {}: cache out of memory, no request can be admitted; failing head-of-line request", + dp_rank_, + victim->req->id); - // dbg(d.bs0, d.bsz, d.perm); + victim->retiring = true; + victim->done = true; + signals.push_back([r = victim->req] { UpdateState(*r, Request::kOutOfMemory, 0); }); +} + +void Engine::Impl::Setup(BatchData& d) +{ + TM_FUNCTION_SCOPE(); + auto& s = states_.at(0); + + d.bs0 = s.bs0; + d.bsz = s.active; + + d.perm = {d.bsz, kCPU}; + std::copy_n(s.perm.data(), d.bsz, d.perm.data()); BatchCopy copy{}; - TensorMap env{{"batch", d.buf()}, + Buffer_ rs{s.active, kCPU}; + for (int i = 0; i < s.active; ++i) { + auto* c = TM_CHECK_NOTNULL(s.rc[i].get()); + ++c->inflight; + rs[i] = c; + } + + d.restore_copies.clear(); + d.publish_copies.clear(); + { + const ObjectAllocator& alloc = scheduler_.allocator(); + auto resolve = [&](std::vector& in, std::vector& out) { + for (const auto& [src, dst] : in) { + const auto& cs = scheduler_.cache()[src]; + const auto& cd = scheduler_.cache()[dst]; + TM_CHECK_NOTNULL(cs.allocation.a); // validity (resolved allocation) on both ends + TM_CHECK_NOTNULL(cd.allocation.a); + TM_CHECK_EQ(cs.object_id, cd.object_id); // same object => same part layout + TM_CHECK_EQ(cs.part_count(), cd.part_count()); // both replay-populated to the same layout + TM_CHECK_EQ(cs.part_count(), alloc.PartCount(cs.object_id)); + for (int p = 0; p < cs.part_count(); ++p) { + out.push_back({cs.base(p), cd.base(p), 
alloc.PartBytes(cs.object_id, p)}); + } + } + in.clear(); + }; + for (int i = 0; i < s.active; ++i) { + auto& c = *s.rc[i]; + resolve(c.restore_copies, d.restore_copies); + resolve(c.publish_copies, d.publish_copies); + } + } + + const CacheBlockPool* cache_block_pool = &scheduler_.cache(); + + TensorMap env{{"requests", rs}, + {"batch", d.buf()}, {"copy", copy.buf()}, - {"block_ptrs_offsets", block_ptrs_offsets_buf_}, - {"block_ptrs", block_ptrs_buf_}}; + {"cache_block_pool", Buffer_{&cache_block_pool, 1, kCPU}}}; Run(BatchOp::kSetup, d.phase, env); @@ -727,7 +652,6 @@ void Engine::Impl::Setup(BatchData& d) AllGather(dp_group_, d.local_token_num.data(), 1); } d.global_token_num = std::accumulate(d.local_token_num.begin(), d.local_token_num.end(), 0); - // dbg(dp_group_->rank(), d.local_token_num, d.global_token_num); } void Engine::Impl::Update(BatchData& b, std::vector& signals) @@ -756,67 +680,72 @@ void Engine::Impl::Update(BatchData& b, std::vector& signals) env = {}; - vector sequences_to_cache; - - for (int i = 0; i < b.rc.size(); ++i) { - // In async mode, `seq` may be nullptr when the request is done - if (auto& c = *b.rc[i]; c.seq) { - if (auto& s = *c.seq; generating[i]) { - c.token_ids[c.seq_len] = output_ids[i]; - c.seq_len = sequence_length[i]; - s.cache_len = sequence_length[i] - 1; - if (const int new_tokens = c.seq_len - s.tokens.size()) { - s.tokens.insert(s.tokens.end(), c.token_ids + c.seq_len - new_tokens, c.token_ids + c.seq_len); + vector perm(s.size()); + if (async_) { + perm = s.perm; + } + else { + std::iota(perm.begin(), perm.end(), 0); + } + + for (int i = 0; i < s.size(); ++i) { + int j = perm[i]; + if (j < b.bsz) { + auto& c = *TM_CHECK_NOTNULL(s.rc[i]); + c.filled_len = generating[j] ? 
sequence_length[j] - 1 : sequence_length[j]; + if (c.retiring) { + continue; + } + if (generating[j]) { + c.token_ids[c.seq_len] = output_ids[j]; + c.seq_len = sequence_length[j]; + if (int new_tokens = c.seq_len - c.tokens.size(); TM_LIKELY(new_tokens)) { + c.tokens.insert(c.tokens.end(), c.token_ids + c.seq_len - new_tokens, c.token_ids + c.seq_len); } - if (TM_UNLIKELY(finished[i])) { - signals.push_back([r = c.req, l = c.seq_len] { // - UpdateState(*r, Request::kFinish, l); - }); + if (TM_UNLIKELY(finished[j])) { + if (!c.is_canceled) { + scheduler_.PublishGeneration(c); + } + signals.push_back([r = c.req, l = c.seq_len] { UpdateState(*r, Request::kFinish, l); }); + c.retiring = true; + c.done = true; } - else if (c.req->stream_output) { - signals.push_back([r = c.req, l = c.seq_len] { // - UpdateState(*r, Request::kOk, l); - }); + else if (TM_LIKELY(c.req->stream_output)) { + signals.push_back([r = c.req, l = c.seq_len] { UpdateState(*r, Request::kOk, l); }); } } - else { - s.cache_len = sequence_length[i]; - } - c.done |= finished[i]; - if (c.seq->status != Sequence::kCached) { // At least `Locked` status is required for caching - sequences_to_cache.push_back(c.seq); - } - // dbg(c.seq_len, c.sequence.cache_len, c.alpha, c.beta, c.is_decoding, c.is_generate); + } + else { // new } } - if (!is_warm_up_) { - seq_mgr_->CachePrompt(sequences_to_cache, sequences_to_cache.size()); - } - - b.rc.clear(); + // b.rc.clear(); if (async_) { const int size = s.active + s.swapout; for (int i = 0; i < size; ++i) { auto& c = *s.rc[i]; if (i < s.active) { - c.alpha = c.input_len; - c.beta = c.generating; + c.inflight_input_len = c.input_len; + c.inflight_new_tokens = c.generating; } else { // Just got swaped-out - c.alpha = c.beta = 0; + c.inflight_input_len = 0; + c.inflight_new_tokens = 0; } } } - for (auto& x : s.rc) { - if (TM_UNLIKELY(x->done)) { - Interrupt(*x); - x.reset(); - s.finish += 1; + for (int i = 0; i < s.size(); ++i) { + const int j = perm[i]; + if (j >= 
b.bsz) { + continue; } + + auto& c = *TM_CHECK_NOTNULL(s.rc[i]); + TM_CHECK_GT(c.inflight, 0); + --c.inflight; } } @@ -847,9 +776,9 @@ void Engine::Impl::InternalThreadEntry() const int n_free = param_.max_batch_size - st.size() + st.finish; const bool blocking = n_free == param_.max_batch_size; - gateway_.pop(rs->infer, rs->kill, n_free, blocking, rs->abort, dp_group_, queue_id_); + gateway_.pop(rs->infer, n_free, blocking, rs->abort, dp_group_, queue_id_); - Validate(rs->infer, rs->kill); + Validate(rs->infer); rs->cancel = GetCanceled(); } @@ -870,14 +799,14 @@ void Engine::Impl::InternalThreadEntry() vector signals; - Kill(rs->kill, signals); - Accept(rs->infer, signals); Cancel(rs->cancel, signals); gateway_.notify(std::move(signals), tp_rank_ == 0); + Retire(st); + int n_active = st.size() - st.finish; TM_CHECK_GE(n_active, 0); @@ -888,8 +817,12 @@ void Engine::Impl::InternalThreadEntry() Schedule(); + FailStalledHeadOfLine(signals); + UpdateScheduleMetrics(); + MaybeLogCacheStats(); + Setup(*d); d->ready.Record(core::Context::stream()); @@ -909,6 +842,8 @@ void Engine::Impl::InternalThreadEntry() Update(*d, signals); + Retire(st); + gateway_.notify(std::move(signals), tp_rank_ == 0); // if (future.valid()) { @@ -927,16 +862,25 @@ Engine::Engine(Engine&&) noexcept = default; Engine& Engine::operator=(Engine&&) noexcept = default; Engine::Engine(EngineParam param, + ObjectAllocator alloc, + CacheRegistry cache_registry, LanguageModel model, std::unique_ptr vision_model, - const ModelWeight& weights, Context& ctx, Gateway& gateway, int device_id, int dp_rank, int phases): - impl_{std::make_unique( - param, std::move(model), std::move(vision_model), weights, ctx, gateway, device_id, dp_rank, phases)} + impl_{std::make_unique(param, + std::move(alloc), + std::move(cache_registry), + std::move(model), + std::move(vision_model), + ctx, + gateway, + device_id, + dp_rank, + phases)} { } @@ -945,30 +889,43 @@ void Engine::Start() return impl_->Start(); } -void 
Engine::Impl::UpdateScheduleMetrics() +void Engine::Impl::MaybeLogCacheStats() { - const auto scheduler_tick = scheduler_tick_.fetch_add(1, std::memory_order_relaxed) + 1; + if (cache_log_interval_ <= 0 || tp_rank_ != 0) { + return; // disabled, or non-primary TP rank (avoid duplicate lines) + } + if (++schedule_counter_ % static_cast(cache_log_interval_) != 0) { + return; + } + TM_LOG_WARN("dp{} cache stats:\n{}", dp_rank_, FormatMemoryStats(object_allocator_.Stats())); +} - const auto [total, active, cached] = seq_mgr_->seq_stats(); +void Engine::Impl::UpdateScheduleMetrics() +{ + if (param_.enable_metrics) { + // const auto& [total, active, cached] = seq_mgr_->seq_stats(); - auto m = std::make_shared(); + // auto m = std::make_shared(); - m->total_seqs = total; - m->active_seqs = active; - m->waiting_seqs = total - active; - m->scheduler_tick = scheduler_tick; + // m->total_seqs = total; + // m->active_seqs = active; + // m->waiting_seqs = total - active; - m->total_blocks = seq_mgr_->total_count(); - m->active_blocks = seq_mgr_->active_count(); - m->cached_blocks = seq_mgr_->cached_count(); - m->free_blocks = seq_mgr_->free_count(); + // m->total_blocks = seq_mgr_->total_count(); + // m->active_blocks = seq_mgr_->active_count(); + // m->cached_blocks = seq_mgr_->cached_count(); + // m->free_blocks = seq_mgr_->free_count(); - std::atomic_store_explicit(&metrics_, std::move(m), std::memory_order_release); + // std::atomic_store_explicit(&metrics_, std::move(m), std::memory_order_release); + } } shared_ptr Engine::GetScheduleMetrics() { - return std::atomic_load_explicit(&impl_->metrics_, std::memory_order_acquire); + if (impl_->param_.enable_metrics) { + return std::atomic_load_explicit(&impl_->metrics_, std::memory_order_acquire); + } + return {}; } } // namespace turbomind diff --git a/src/turbomind/engine/engine.h b/src/turbomind/engine/engine.h index 74122ef65f..7e752f500d 100644 --- a/src/turbomind/engine/engine.h +++ b/src/turbomind/engine/engine.h @@ -3,16 
+3,17 @@ #include +#include "src/turbomind/engine/cache_registry.h" #include "src/turbomind/engine/gateway.h" #include "src/turbomind/models/language_model.h" #include "src/turbomind/models/llama/context.h" #include "src/turbomind/models/llama/llama_params.h" -#include "src/turbomind/models/vision_model.h" namespace turbomind { struct ScheduleMetrics; +class VisionModel; class Engine { public: @@ -28,9 +29,10 @@ class Engine { } Engine(EngineParam param, + ObjectAllocator alloc, + CacheRegistry cache_registry, LanguageModel model, - std::unique_ptr vision_model, // null for text-only - const ModelWeight& weights, + std::unique_ptr vision_model, // null for text-only checkpoints Context& ctx, Gateway& gateway, int device_id, diff --git a/src/turbomind/engine/engine_config.h b/src/turbomind/engine/engine_config.h index 2c0381e9c4..02ec4a2ff2 100644 --- a/src/turbomind/engine/engine_config.h +++ b/src/turbomind/engine/engine_config.h @@ -23,6 +23,10 @@ struct EngineConfig { X(float, cache_max_block_count, 0) \ X(int, cache_chunk_size, 0) \ X(bool, enable_prefix_caching, false) \ + X(int, cache_checkpoint_interval, 4096) \ + X(std::string, cache_prompt, "auto") \ + X(int, cache_prompt_boundary_skip, 1) \ + X(std::string, cache_generation, "auto") \ X(bool, enable_metrics, false) \ X(int, num_tokens_per_iter, 0) \ X(int, max_prefill_iters, 1) \ diff --git a/src/turbomind/engine/fingerprint.h b/src/turbomind/engine/fingerprint.h new file mode 100644 index 0000000000..cfa10e0f1a --- /dev/null +++ b/src/turbomind/engine/fingerprint.h @@ -0,0 +1,31 @@ +#pragma once +#include +#include + +namespace turbomind { + +// 256-bit (SHA-256) opaque multimodal content identity, stored as four 64-bit +// words. All-zero is the reserved "empty" sentinel; an empty fingerprint never +// compares equal to anything -- including another empty fingerprint. 
+struct Fingerprint { + std::array words{}; + + bool empty() const noexcept + { + return words == std::array{}; + } + + friend bool operator==(const Fingerprint& a, const Fingerprint& b) noexcept + { + if (a.empty() || b.empty()) { + return false; + } + return a.words == b.words; + } + friend bool operator!=(const Fingerprint& a, const Fingerprint& b) noexcept + { + return !(a == b); + } +}; + +} // namespace turbomind diff --git a/src/turbomind/engine/gateway.cc b/src/turbomind/engine/gateway.cc index 82e3bd119a..160f582961 100644 --- a/src/turbomind/engine/gateway.cc +++ b/src/turbomind/engine/gateway.cc @@ -30,32 +30,16 @@ void Gateway::shutdown() void Gateway::push(std::shared_ptr r) { - int rank = -1; - - if (TM_UNLIKELY(!r->session.start_flag)) { - // route to corresponding rank - rank = binding_.find(r->session.id); - } - else if (TM_LIKELY(size_)) { - rank = next_.fetch_add(1, std::memory_order_relaxed) % size_; - } - else { + if (TM_UNLIKELY(!size_)) { TM_LOG_ERROR("No queues available for submitting the request"); notify({[r = std::move(r)] { UpdateState(*r, Request::kNoQueue, 0); }}); return; } - - if (TM_LIKELY(rank >= 0)) { - queues_[rank]->push({std::move(r)}); - } - else { - TM_LOG_ERROR("Failed to find a binded queue for {}", r->session.id); - notify({[r = std::move(r)] { UpdateState(*r, Request::kInvalid, 0); }}); - } + const int rank = next_.fetch_add(1, std::memory_order_relaxed) % size_; + queues_[rank]->push({std::move(r)}); } void Gateway::pop(std::vector>& infer_reqs, - std::vector>& kill_reqs, unsigned max_infer, bool blocking, bool& abort, @@ -67,10 +51,9 @@ void Gateway::pop(std::vector>& infer_reqs, auto& q = *queues_.at(qid); infer_reqs.clear(); - kill_reqs.clear(); if (dp_group->n_ranks() == 1) { - q.pop(infer_reqs, kill_reqs, max_infer, blocking, abort); + q.pop(infer_reqs, max_infer, blocking, abort); } else { union { @@ -78,8 +61,8 @@ void Gateway::pop(std::vector>& infer_reqs, uint32_t value; }; while (true) { - q.pop(infer_reqs, 
kill_reqs, max_infer, false, abort); - data[0] = !(blocking && infer_reqs.empty() && kill_reqs.empty()); // ready? + q.pop(infer_reqs, max_infer, false, abort); + data[0] = !(blocking && infer_reqs.empty()); // ready? data[1] = abort; value = comm::AllReduce(dp_group, value, comm::RedOp::kSum); if (data[0] >= dp_thr_ || data[1]) { @@ -91,28 +74,6 @@ void Gateway::pop(std::vector>& infer_reqs, // Assign a monotonic increasing id for each infer request q.assign_unique_ids(infer_reqs); - - // Bind for stateful inference - std::vector bind_ids; - for (const auto& r : infer_reqs) { - if (r->session.start_flag && !r->session.end_flag) { // started but not ended - bind_ids.push_back(r->session.id); - } - } - - /// TODO: fix qid <-> rank mapping - if (!bind_ids.empty()) { - binding_.bind(bind_ids, qid); - } - - // Unbind for stateful kill - std::vector unbind_ids; - for (const auto& r : kill_reqs) { - unbind_ids.push_back(r->session.id); - } - if (!unbind_ids.empty()) { - binding_.unbind(unbind_ids, qid); - } } void Gateway::cancel(std::shared_ptr r) @@ -128,19 +89,6 @@ void Gateway::cancel(std::shared_ptr r) } } -void Gateway::kill(std::shared_ptr r) -{ - if (auto rank = binding_.find(r->session.id); rank >= 0) { - queues_[rank]->kill(std::move(r)); - } - else { - TM_LOG_ERROR("Failed to find a binded queue for {}", r->session.id); - notify({[r = std::move(r)] { // - UpdateState(*r, Request::kInvalid, 0); - }}); - } -} - void Gateway::notify(std::vector signals, bool pred) { if (pred) { diff --git a/src/turbomind/engine/gateway.h b/src/turbomind/engine/gateway.h index ddd80892d0..6ac36fea83 100644 --- a/src/turbomind/engine/gateway.h +++ b/src/turbomind/engine/gateway.h @@ -17,49 +17,6 @@ namespace turbomind { -class SequenceBinding { -public: - int find(uint64_t seq_id) - { - std::lock_guard lock{mutex_}; - if (auto it = map_.find(seq_id); it != map_.end()) { - return it->second; - } - return -1; - } - - void bind(const std::vector& seq_ids, int rank) - { - 
std::lock_guard lock{mutex_}; - for (const auto& x : seq_ids) { - if (auto [it, success] = map_.emplace(x, rank); !success) { - TM_LOG_WARN("Duplicated binding for {}, {} vs {}", x, rank, it->second); - } - } - } - - void unbind(const std::vector& seq_ids, int rank) - { - std::lock_guard lock{mutex_}; - for (const auto& x : seq_ids) { - auto it = map_.find(x); - if (it == map_.end()) { - TM_LOG_WARN("No entry found for unbinding {}, {}", x, rank); - } - else if (it->second != rank) { - TM_LOG_WARN("Mismatched entry for unbinding {}, {} vs {}", x, rank, it->second); - } - else { - map_.erase(it); - } - } - } - -private: - std::mutex mutex_; - std::unordered_map map_; -}; - class Gateway { public: Gateway(int size, std::function()> ctx_factory); @@ -69,7 +26,6 @@ class Gateway { void push(std::shared_ptr r); void pop(std::vector>& infer_reqs, - std::vector>& kill_reqs, unsigned max_infer, bool blocking, bool& abort, @@ -78,8 +34,6 @@ class Gateway { void cancel(std::shared_ptr r); - void kill(std::shared_ptr r); - void notify(std::vector signals, bool pred = true); void set_threshold(int value) @@ -103,8 +57,6 @@ class Gateway { SignalBuffer signal_buffer_; std::thread signal_thread_; - SequenceBinding binding_; - std::atomic next_; }; diff --git a/src/turbomind/engine/model_executor.cc b/src/turbomind/engine/model_executor.cc index bb9672b8b9..5ab0052101 100644 --- a/src/turbomind/engine/model_executor.cc +++ b/src/turbomind/engine/model_executor.cc @@ -22,7 +22,7 @@ using std::unique_ptr; struct ModelExecutor::Impl { LanguageModel& model_; - VisionModel* vision_model_; // nullable + VisionModel* vision_model_; // nullable: only set for VLM checkpoints LlamaLinear& linear_; const int device_id_; @@ -56,19 +56,32 @@ struct ModelExecutor::Impl { } } + static void RunCopies(std::vector& copies) + { + for (const auto& c : copies) { + Copy(Buffer_{static_cast(c.src), static_cast(c.bytes), kDEVICE}, + Buffer_{static_cast(c.dst), static_cast(c.bytes), kDEVICE}); + } + 
copies.clear(); + } + void Run(BatchData& d) { TM_FUNCTION_SCOPE(); - auto batch = &d; BatchCopy copy; TensorMap env{{"batch", d.buf()}, {"copy", copy.buf()}}; + // Restore copies first so kPrepare may post-process restored content + // (a module reset overrides whatever a whole-object restore wrote). + RunCopies(d.restore_copies); + + // Vision sub-graph runs before the language model in each phase so its + // env outputs (image embeddings, mrope tensors) are visible downstream. if (vision_model_) { vision_model_->Run(BatchOp::kPrepare, d.phase, env); } model_.Run(BatchOp::kPrepare, d.phase, env); - // dbg(copy); copy.Run(); if (vision_model_) { @@ -77,10 +90,12 @@ struct ModelExecutor::Impl { model_.Run(BatchOp::kForward, d.phase, env); model_.Run(BatchOp::kUnprep, d.phase, env); - // dbg(copy); copy.Run(); - // TM_CHECK(0); + // Publication copies last: kUnprep is the module's final chance to + // finalize frontier contents before the snapshot. + RunCopies(d.publish_copies); + AnomalyHandler::instance().Summarize([](...) 
{}); AnomalyHandler::instance().Reset(); } @@ -125,12 +140,7 @@ ModelExecutor::ModelExecutor(LanguageModel& model, int device_id, Queue>& inbound, Queue>& outbound): - impl_{std::make_unique(model, // - vision_model, - context, - device_id, - inbound, - outbound)} + impl_{std::make_unique(model, vision_model, context, device_id, inbound, outbound)} { } diff --git a/src/turbomind/engine/model_request.cc b/src/turbomind/engine/model_request.cc index 7e6a7b32cd..f52c0d08d3 100644 --- a/src/turbomind/engine/model_request.cc +++ b/src/turbomind/engine/model_request.cc @@ -33,18 +33,6 @@ void ModelRequest::Cancel() } } -void ModelRequest::End(std::function cb, uint64_t session_id) -{ - auto r = std::make_shared(); - - r->id = r->session.id = session_id; - r->session.kill_flag = true; - - r->end_cb = std::move(cb); - - gateway_->kill(std::move(r)); -} - auto ModelRequest::Forward(InputParam param, std::function cb) -> OutputParam { inputs_ = std::make_shared(); @@ -72,7 +60,7 @@ auto ModelRequest::Forward(InputParam param, std::function cb) -> Output // is used instead const int max_seq_len = session_len_ + 1; const int max_out_len = std::min(output_len, session_len_) + 1; - // This does not include histroy length in interactive mode + // Sized by `session_len` since the actual history length isn't known here const int max_in_out_len = std::min(input_len + output_len, session_len_) + 1; for (auto& [k, v] : *param.tensors) { @@ -120,12 +108,10 @@ auto ModelRequest::Forward(InputParam param, std::function cb) -> Output metrics->scheduled_time.store(0, std::memory_order_relaxed); } - if (param.session.start_flag) { - session_id_ = param.session.id; - } + session_id_ = param.session.id; r->id = param.session.id; - r->session = param.session; + r->step = param.session.step; r->gen_cfg = param.gen_cfg; r->stream_output = param.stream_output; r->forward_cb = std::move(cb); diff --git a/src/turbomind/engine/model_request.h b/src/turbomind/engine/model_request.h index 
6240e14fe1..7161c46d5e 100644 --- a/src/turbomind/engine/model_request.h +++ b/src/turbomind/engine/model_request.h @@ -23,9 +23,6 @@ class ModelRequest { // Cancel running request void Cancel(); - // Reset the channel to uninitailized state, calls `notify` when done - void End(std::function cb, uint64_t session_id); - struct InputParam { std::shared_ptr tensors; std::shared_ptr mm_inputs; diff --git a/src/turbomind/engine/prefix_key.h b/src/turbomind/engine/prefix_key.h new file mode 100644 index 0000000000..cc49327eec --- /dev/null +++ b/src/turbomind/engine/prefix_key.h @@ -0,0 +1,92 @@ +#pragma once + +#include +#include + +#include "src/turbomind/core/check.h" +#include "src/turbomind/engine/fingerprint.h" + +namespace turbomind { + +struct TokenSpan { + const int* data{}; + int size{}; + + const int* begin() const noexcept + { + return data; + } + + const int* end() const noexcept + { + return size == 0 ? data : data + size; + } +}; + +inline TokenSpan MakeTokenSpan(const std::vector& tokens) noexcept +{ + return TokenSpan{tokens.data(), static_cast(tokens.size())}; +} + +inline TokenSpan MakeTokenSpan(const int* data, int size) noexcept +{ + return TokenSpan{data, size}; +} + +inline size_t HashCombine(size_t seed, size_t value) noexcept +{ + return seed ^ (value + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2)); +} + +inline size_t HashCombine(size_t seed, const Fingerprint& fp) noexcept +{ + for (uint64_t w : fp.words) { + seed = HashCombine(seed, static_cast(w)); + } + return seed; +} + +struct PrefixKey { + int length{}; + size_t hash{}; + + explicit operator bool() const noexcept + { + return hash || length; + } + + friend bool operator==(const PrefixKey& a, const PrefixKey& b) noexcept + { + return a.length == b.length && a.hash == b.hash; + } +}; + +// PrefixKey::hash is already a cumulative hash; using it directly lets the +// trie key be extended incrementally token by token. 
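The incremental property noted in the comment above can be checked with a small standalone sketch. It copies the `HashCombine` mixing step and the token-only `ExtendPrefixKey` fold from this header into a hypothetical `sketch` namespace, and takes `std::vector<int>` instead of `TokenSpan` (an assumption made only to keep it self-contained). It then shows that extending a key block by block gives the same key as one fold over the concatenated tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of turbomind

// Same mixing step as HashCombine in prefix_key.h.
inline std::size_t HashCombine(std::size_t seed, std::size_t value) noexcept
{
    return seed ^ (value + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

struct PrefixKey {
    int         length{};
    std::size_t hash{};

    friend bool operator==(const PrefixKey& a, const PrefixKey& b) noexcept
    {
        return a.length == b.length && a.hash == b.hash;
    }
};

// Token-only fold; std::vector<int> stands in for TokenSpan (assumption).
inline PrefixKey Extend(PrefixKey key, const std::vector<int>& tokens) noexcept
{
    for (int t : tokens) {
        key.hash = HashCombine(key.hash, static_cast<std::size_t>(t));
        ++key.length;
    }
    return key;
}

// Extending in two steps must agree with one fold over all tokens.
inline bool IncrementalMatchesWhole(const std::vector<int>& a, const std::vector<int>& b)
{
    std::vector<int> whole = a;
    whole.insert(whole.end(), b.begin(), b.end());
    return Extend(Extend(PrefixKey{}, a), b) == Extend(PrefixKey{}, whole);
}

}  // namespace sketch
```

Because the key of a block is just the parent's key folded with that block's tokens, a matcher can walk the prompt one block at a time and reuse the previous key instead of rehashing from position 0.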
+struct PrefixKeyHash { + size_t operator()(const PrefixKey& key) const noexcept + { + return key.hash; + } +}; + +inline PrefixKey ExtendPrefixKey(PrefixKey key, TokenSpan tokens) +{ + TM_CHECK_GE(tokens.size, 0); + for (const int* it = tokens.begin(); it != tokens.end(); ++it) { + key.hash = HashCombine(key.hash, static_cast(*it)); + ++key.length; + } + return key; +} + +inline PrefixKey ExtendPrefixKey(PrefixKey key, TokenSpan tokens, const std::vector& fps) +{ + key = ExtendPrefixKey(key, tokens); // existing token fold + for (const Fingerprint& fp : fps) { + key.hash = HashCombine(key.hash, fp); + } + return key; +} + +} // namespace turbomind diff --git a/src/turbomind/engine/prefix_trie.h b/src/turbomind/engine/prefix_trie.h new file mode 100644 index 0000000000..24e2f868d5 --- /dev/null +++ b/src/turbomind/engine/prefix_trie.h @@ -0,0 +1,104 @@ +#pragma once + +#include +#include +#include + +#include "src/turbomind/core/check.h" +#include "src/turbomind/engine/block.h" +#include "src/turbomind/engine/prefix_key.h" + +namespace turbomind { + +// Prefix trie index over logical block nodes. Holds raw (weak) LogicalBlock* +// kept consistent by Erase on recycle, so a lookup never returns a dead node. +class PrefixTrie { +public: + explicit PrefixTrie(int block_size): block_size_{block_size} {} + + // Exact full lookup: hash, parent, length, token identity, and start- + // fingerprints must match. `fps` are the start-fingerprints of images whose + // start token falls inside this block (empty for ordinary blocks). 
+ LogicalBlock* Find(const LogicalBlock* parent, + const PrefixKey& key, + TokenSpan tokens, + const std::vector& fps = {}) const + { + if (auto it = index_.find(key); it != index_.end()) { + LogicalBlock* b = it->second; + if (b->parent == parent && b->size == tokens.size + && std::equal(tokens.begin(), tokens.end(), b->tokens.begin()) + && b->image_fps == fps) { // vector==; empty Fingerprint never equal + return b; + } + } + return nullptr; + } + + // Longest partial match within one block (never the full block). `fps`/`fp_pos` + // describe images whose start token falls inside this block: fp_pos[k] is the + // block-relative start position of fps[k] (ascending). On a hit, `key` is + // replaced with the matched node's key. + LogicalBlock* Search(const LogicalBlock* parent, + PrefixKey& key, + TokenSpan tokens, + const std::vector& fps = {}, + const std::vector& fp_pos = {}) const + { + std::vector prefixes; // token-only cumulative keys + PrefixKey k = key; + for (const int* it = tokens.begin(); it != tokens.end(); ++it) { + k.hash = HashCombine(k.hash, static_cast(*it)); + ++k.length; + prefixes.push_back(k); + } + if (static_cast(prefixes.size()) == block_size_) { + prefixes.pop_back(); // enforce a partial match + } + for (int i = static_cast(prefixes.size()); i > 0; --i) { + std::vector sub; // images that begin within [0, i) + for (size_t j = 0; j < fps.size() && j < fp_pos.size() && fp_pos[j] < i; ++j) { + sub.push_back(fps[j]); + } + PrefixKey ki = prefixes[i - 1]; + for (const Fingerprint& fp : sub) { + ki.hash = HashCombine(ki.hash, fp); + } + if (LogicalBlock* b = Find(parent, ki, TokenSpan{tokens.begin(), i}, sub)) { + key = ki; + return b; + } + } + return nullptr; + } + + // First-wins insertion; reads node.key/parent/tokens already set. 
+ bool Insert(LogicalBlock& b) + { + TM_CHECK(!b.indexed); + TM_CHECK(static_cast(b.key)); + if (auto it = index_.find(b.key); it == index_.end()) { + index_.emplace_hint(it, b.key, &b); + b.indexed = true; + return true; + } + return false; + } + + // Fired by the pool recycle hook. + void Erase(LogicalBlock& b) + { + if (b.indexed) { + auto it = index_.find(b.key); + TM_CHECK(it != index_.end()); + TM_CHECK_EQ(it->second, &b); + index_.erase(it); + } + } + +private: + int block_size_; + std::unordered_map index_; +}; + +} // namespace turbomind diff --git a/src/turbomind/engine/prompt_boundary.h b/src/turbomind/engine/prompt_boundary.h new file mode 100644 index 0000000000..3ffaf7280a --- /dev/null +++ b/src/turbomind/engine/prompt_boundary.h @@ -0,0 +1,58 @@ +// Copyright (c) OpenMMLab. All rights reserved. + +#pragma once + +namespace turbomind { + +// Pure geometry of the prompt-boundary publish point. Decides where the reusable +// boundary B = prompt_len - skip lands and whether a partial fork_to node is +// needed (B strictly inside a block) or B is block-aligned (whole blocks already +// tile [0, B); only the clamp/checkpoint applies). `miss` is the first prompt +// block index not matched in the trie (AcceptState::miss). No scheduler state. 
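To make the boundary geometry described in the comment above concrete, here is a standalone sketch of the arithmetic only: B = prompt_len - skip, the index j of the block holding the last token before B, and whether B falls strictly inside a block. `BoundaryGeometry` and `ComputeBoundary` are hypothetical names used for illustration; the `miss` check that decides whether a plan is actually valid is deliberately left out.

```cpp
#include <cassert>

namespace sketch {  // hypothetical namespace, illustration only

struct BoundaryGeometry {
    bool exists    = false;  // false when B < 1 (nothing to publish)
    bool mid_block = false;  // B strictly inside block j
    int  pos       = 0;      // B
    int  block     = 0;      // j
    int  tail      = 0;      // B - j * block_size when mid_block, else 0
};

inline BoundaryGeometry ComputeBoundary(int prompt_len, int block_size, int skip)
{
    BoundaryGeometry g{};
    if (skip < 1) {
        skip = 1;  // same defensive clamp as the original
    }
    const int B = prompt_len - skip;
    if (B < 1) {
        return g;  // boundary before the first token
    }
    g.exists    = true;
    g.pos       = B;
    g.block     = (B - 1) / block_size;  // block holding the last token before B
    g.mid_block = (B % block_size) != 0;
    g.tail      = g.mid_block ? B - g.block * block_size : 0;
    return g;
}

}  // namespace sketch
```

With `block_size = 64` and `skip = 1`, a 100-token prompt gives B = 99 in block 1 with a 35-token partial node, while a 129-token prompt gives a block-aligned B = 128 that also ends block 1.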
+struct PromptBoundaryPlan { + bool valid = false; // a boundary exists (set prompt_boundary_node) + bool partial = false; // needs a fork_to partial node; else block-aligned + int pos = 0; // B + int block = 0; // j = block holding the last token before B + int node_size = 0; // partial node length (B - j*block_size) when partial +}; + +inline PromptBoundaryPlan PlanPromptBoundary(int prompt_len, int block_size, int skip, int miss) +{ + PromptBoundaryPlan p{}; + if (skip < 1) { + skip = 1; // defensive; the scheduler also clamps at construction + } + const int B = prompt_len - skip; + if (B < 1) { + return p; // boundary before the first token: nothing to publish + } + const int j = (B - 1) / block_size; // block holding the last token before B + if (B % block_size != 0) { + // B strictly inside block j: a partial fork_to node is required so [0, B) + // is fully matchable (whole blocks [0, j*bs) + this node). miss < j keeps + // the parent chain [0..j-1] indexed and the node off the fork_from miss + // block. miss < j (with miss >= 0) implies j >= 1, so block j-1 exists. + if (miss < j) { + p.valid = true; + p.partial = true; + p.pos = B; + p.block = j; + p.node_size = B - j * block_size; + } + } + else { + // B block-aligned: block j ends exactly at B and already tiles [0, B); + // no partial node, only the clamp target. miss <= j: block j is matched + // or created-and-indexed. 
+ if (miss <= j) { + p.valid = true; + p.partial = false; + p.pos = B; + p.block = j; + } + } + return p; +} + +} // namespace turbomind diff --git a/src/turbomind/engine/request.h b/src/turbomind/engine/request.h index ce400d63e1..84ee8aa72a 100644 --- a/src/turbomind/engine/request.h +++ b/src/turbomind/engine/request.h @@ -2,15 +2,21 @@ #pragma once +#include #include #include #include #include +#include #include #include +#include +#include #include "src/turbomind/core/core.h" #include "src/turbomind/core/interval.h" +#include "src/turbomind/engine/block.h" +#include "src/turbomind/engine/fingerprint.h" #include "src/turbomind/engine/multimodal_input.h" #include "src/turbomind/utils/metrics.h" @@ -57,12 +63,7 @@ std::ostream& operator<<(std::ostream& os, const GenerationConfig& c); struct SessionParam { uint64_t id; - - int step; - - bool start_flag; - bool end_flag; - bool kill_flag; + int step; }; struct RequestState { @@ -91,7 +92,8 @@ struct Request { uint64_t id; // sequence id uint64_t unique_id; // monotonic increasing - SessionParam session; + int step; // KV/output offset (replaces SessionParam session; start/end/kill removed) + GenerationConfig gen_cfg; bool stream_output; @@ -105,8 +107,6 @@ struct Request { Tensor_ output_ids; Tensor_ sequence_length; - std::function end_cb; - std::atomic cancel_flag; std::function forward_cb; @@ -120,16 +120,15 @@ struct Request { enum { kOk = 0, - kInvalid = 1, // Sequence not exist or both `start` & `stop` (instead of `end`) is set - kConflict = 2, // Concurrent requests to the same sequence - kBusy = 3, // Sequence is already running - kInactive = 4, // Sequence to `stop` is not active - kFail = 5, // Can't find sequence for `stop` request or internal error during inference - kTooLong = 6, // history + prompt > session_len, + kInvalid = 1, // Malformed request (e.g. 
invalid input embeddings) or routing failure + kConflict = 2, // Concurrent requests to the same sequence id + kFail = 5, // Internal error during inference + kTooLong = 6, // history + prompt > session_len kFinish = 7, kCancel = 8, - kInconsistency = 9, // Inconsistent request parameters, e.g. prefix caching is not allowed in interactive mode - kNoQueue = 10, // No queue available for submitting the request (in current process) + kInconsistency = 9, // Prefix caching incompatible with nonzero step or all-token logits/hidden-state output + kNoQueue = 10, + kOutOfMemory = 11, }; std::shared_ptr grammar; @@ -138,23 +137,52 @@ struct Request { void UpdateState(Request& r, int status, int seq_len); -class Sequence; +struct Sequence; + +struct MultiModalData; // defined in models/vision_model.h + +// The prefix-cache projection of one multimodal input: its token span and +// content identity. The engine never sees MultiModalData / pixels. +struct MultiModalSpan { + Interval interval; // absolute token span [begin, end) + Fingerprint fingerprint; // empty until the generation PR +}; + +// A scheduler-planned device copy between two cache blocks of the same +// category. Resolved to pointers on the engine thread at setup and executed +// as a whole-object copy by the model executor. +struct CacheCopy { + int src{}; + int dst{}; +}; + +// What set this pass's resume_len. resume_len is a single number, produced by +// whichever mechanism reached the highest skip position in Scheduler::Resume(). +// Observability-only; the scheduler stays category-agnostic. 
+enum class ResumeSource +{ + kNone = 0, // resume_len == 0, nothing skipped + kPrefix, // contiguous valid prefix-category cache (no checkpoint category) + kFrontier, // request's own checkpoint frontier (no restore copy) + kCheckpoint, // restored a published block checkpoint into the frontier + kFork, // extended from a forked sibling's prefix node +}; + +// Unlike `Request` which is shared by all local TP ranks, each rank has its own `Sequence`. +struct Sequence { -// Unlike `Request` which is shared by all local TP ranks, each rank has its own `RequestCache`. -struct RequestCache { std::shared_ptr req; - const Sequence* seq; // May be NULL in `Update` (seq get erased when req is done) - const GenerationConfig& gen_cfg; - RequestCache(std::shared_ptr r, const Sequence& s): req{std::move(r)}, seq{&s}, gen_cfg{req->gen_cfg} {} + const GenerationConfig& gen_cfg; + + explicit Sequence(std::shared_ptr r): req{std::move(r)}, gen_cfg{req->gen_cfg} {} int status = Request::kOk; // These members may be opaque handles from individual modules (pointers to forward declared types), but we tend to // keep it simple as long as the complexity is manageable - int* token_ids = nullptr; // currently the `output_ids` buf of request - uint8_t* random_state = nullptr; + int* token_ids = nullptr; // currently the `output_ids` buf of request int step0 = 0; // set at request init, constant, first prefill step int prompt_len = 0; // set at request init, constant, first decode step @@ -165,23 +193,76 @@ struct RequestCache { int seq_len = 0; // set at request init, updated per step - int input_len = 0; // set at schedule (set to `seq.input_len`) - int history_len = 0; // set at schedule (set to `seq.cache_len`) + int input_len = 0; // set at schedule + int history_len = 0; // set at schedule from `resume_len` bool autoregres = false; // set at schedule, `seq_len` and `input_ids` taken from the engine bool generating = false; // set at schedule bool done = false; // set at cancel / 
update, is the request finished / canceled - int alpha = 0; // pending growth of cache_len (draft_len + input_len) - int beta = 0; // pending growth of seq_len (draft_len + {0,1}) + bool retiring = false; // finished/canceled; never schedule again + int inflight = 0; // submitted executor batches containing this request + + int generation_token_ids_row = -1; // owned by Generation, allocated lazily + int generation_random_state_row = -1; // owned by Generation, allocated lazily + + int inflight_input_len = 0; // submitted input tokens not yet reflected into filled_len + int inflight_new_tokens = 0; // submitted generated tokens not yet reflected into seq_len float rope_base = 0.f; Interval output_hidden_states; Interval output_logits; - Interval input_ce_loss; + ////////////////////////// Engine-local execution state /////////////////////////// + + std::vector block_ids; // logical (each holds one request ref) + + std::vector alloc_cache_ids; // cache ids needing allocation this schedule pass + std::vector involved_cache_ids; // cache ids stamped for eviction protection (= required alloc set); + // persistent across Continue, rebuilt by Resume + + std::vector restore_copies; // run before BatchOp::kPrepare + std::vector publish_copies; // run after BatchOp::kUnprep + + int resume_len = 0; // prefix length every stateful module agrees can be skipped + int filled_len = 0; // prefix state actually produced by the latest completed forward + + int readonly_block_num = 0; // leading block_ids reused read-only (no KV re-write) + + // Prefix-cache logging only; never read by scheduling/admission logic. 
+ int matched_blocks = 0; // set at Accept: leading prompt blocks found in trie + bool resuming = false; // transient: planned by Resume() this pass + ResumeSource resume_source = ResumeSource::kNone; // transient: mechanism that set resume_len + + int frontier_cache_id = 0; // checkpoint working state for the next forward + int frontier_pos = 0; // sequence position the frontier corresponds to + int publish_cache_id = 0; // reserved slot for the next checkpoint publication + LogicalBlock* publish_target = nullptr; // logical block selected for publication this pass + int publish_end = 0; // sequence position of the pending publication + int last_ckpt_pos = 0; // end of the last published checkpoint + bool prompt_boundary_node = + false; // a reusable prompt-boundary exists and WILL be published: a partial fork_to + // node when B is mid-block, else a block-aligned checkpoint clamp target. The + // producer clamps its forward to prompt_boundary_pos to populate the node's KV + // (and publish a checkpoint when the model is recurrent). Decided in SetupForks. + int prompt_boundary_pos = 0; // resolved boundary B = prompt_len - cache_prompt_boundary_skip; 0 = none + + std::vector tokens; + + std::vector input_embeds; + std::vector input_embeds_offsets; + + // persistent per-sequence vision features (qwen3.5-vit, W1) + std::vector multimodal_spans; // engine-visible projection; consumed by scheduler + std::vector> multimodal_inputs; // opaque (unchanged) + + bool is_active = false; + bool is_canceled = false; + + // get_ppl / CE-loss (W2) + Interval input_ce_loss; Buffer_ ce_loss; // device, size 1; rank-0 CE-loss accumulator. 
}; @@ -247,7 +328,7 @@ void serdes(Archive& ar, Request& r) // clang-format off ar & r.id; ar & r.unique_id; - ar & r.session; + ar & r.step; ar & r.gen_cfg; ar & r.stream_output; ar & r.inputs; @@ -262,4 +343,79 @@ void serdes(Archive& ar, Request& r) // clang-format on } +class Resource { +public: + virtual ~Resource() = default; + + virtual int Test(const Sequence& s) const noexcept = 0; + virtual void Commit(const Sequence& s) noexcept = 0; +}; + +class ScheduleResources final: public Resource { +public: + template + T& Add(Args&&... args) + { + auto resource = std::make_unique(std::forward(args)...); + auto& ref = *resource; + resources_.push_back(std::move(resource)); + return ref; + } + + int Test(const Sequence& s) const noexcept override + { + int admitted = std::numeric_limits::max(); + for (const auto& resource : resources_) { + const int next = resource->Test(s); + if (next == 0) { + return 0; + } + admitted = std::min(admitted, next); + } + return admitted == std::numeric_limits::max() ? 
0 : admitted; + } + + void Commit(const Sequence& s) noexcept override + { + for (const auto& resource : resources_) { + resource->Commit(s); + } + } + +private: + std::vector> resources_; +}; + +class ForwardTokenResource final: public Resource { +public: + explicit ForwardTokenResource(int max_fwd_tokens) noexcept: max_fwd_tokens_{max_fwd_tokens} {} + + int Test(const Sequence& s) const noexcept override + { + const int input_len = InputLen(s); + if (input_len <= 0 || max_fwd_tokens_ <= 0) { + return 0; + } + return std::min(input_len, max_fwd_tokens_); + } + + void Commit(const Sequence& s) noexcept override + { + max_fwd_tokens_ -= s.input_len; + } + + int remaining_tokens() const noexcept + { + return max_fwd_tokens_; + } + +private: + static int InputLen(const Sequence& s) noexcept + { + return s.seq_len + s.inflight_new_tokens - s.inflight_input_len - s.resume_len; + } + + int max_fwd_tokens_{}; +}; + } // namespace turbomind diff --git a/src/turbomind/engine/request_queue.h b/src/turbomind/engine/request_queue.h index 694f96bc79..f022857e5f 100644 --- a/src/turbomind/engine/request_queue.h +++ b/src/turbomind/engine/request_queue.h @@ -27,28 +27,12 @@ class RequestQueue { cv_.notify_one(); } - void kill(std::shared_ptr r) - { - { - std::lock_guard lock{mutex_}; - if (closed_) { - throw std::runtime_error("Queue is closed"); - } - kill_.push_back(std::move(r)); - } - cv_.notify_one(); - } - - void pop(std::vector>& infer_reqs, - std::vector>& kill_reqs, - unsigned max_infer, - bool blocking, - bool& abort) + void pop(std::vector>& infer_reqs, unsigned max_infer, bool blocking, bool& abort) { std::unique_lock lock{mutex_}; if (blocking) { - cv_.wait(lock, [this] { return !(queue_.empty() && kill_.empty()) || closed_; }); + cv_.wait(lock, [this] { return !queue_.empty() || closed_; }); } if (closed_) { @@ -62,9 +46,6 @@ class RequestQueue { } queue_.pop_front(); } - - kill_reqs.insert(kill_reqs.end(), kill_.begin(), kill_.end()); - kill_.clear(); } void 
close() @@ -89,13 +70,11 @@ class RequestQueue { } private: - std::atomic unique_id_{}; + std::atomic unique_id_{1}; std::pmr::unsynchronized_pool_resource pool_; std::pmr::list> queue_; - std::vector> kill_; - std::mutex mutex_; std::condition_variable cv_; diff --git a/src/turbomind/engine/scheduler.cc b/src/turbomind/engine/scheduler.cc new file mode 100644 index 0000000000..5d5b41257d --- /dev/null +++ b/src/turbomind/engine/scheduler.cc @@ -0,0 +1,1591 @@ +#include "src/turbomind/engine/scheduler.h" + +#include +#include +#include +#include +#include +#include +#include + +#include "src/turbomind/core/check.h" +#include "src/turbomind/core/logger.h" +#include "src/turbomind/engine/cache_mode.h" +#include "src/turbomind/engine/prompt_boundary.h" +#include "src/turbomind/memory/common.h" + +namespace turbomind { + +namespace { + +inline int InitialResumeUpperBound(const Sequence& s) +{ + const int context_len = s.seq_len + s.inflight_new_tokens - s.inflight_input_len; + return std::max(0, std::min(s.seq_len, context_len - 1)); +} + +// Clear per-pass planning buffers (alloc, restore, publish); involved_cache_ids persists. +inline void ResetPassBuffers(Sequence& s) +{ + s.alloc_cache_ids.clear(); + s.restore_copies.clear(); + s.publish_copies.clear(); + s.publish_target = nullptr; + s.publish_end = 0; +} + +// Full rebuild: per-pass buffers plus involved_cache_ids (Resume only). 
+inline void ResetPlanBuffers(Sequence& s) +{ + ResetPassBuffers(s); + s.involved_cache_ids.clear(); +} + +struct AllocReplay { + int cache_id; +}; + +struct EvictReplay { + int cache_id; +}; + +using Replay = std::vector>; + +class EvictingIterator { +public: + EvictingIterator(const std::vector& cache_ids, const CacheBlockPool& cache): + cache_ids_{&cache_ids}, cache_{&cache} + { + } + + EvictingIterator(std::vector&&, const CacheBlockPool&) = delete; + + EvictingIterator(const EvictingIterator& base, uint64_t cutoff): + cache_ids_{base.cache_ids_}, pos_{base.pos_}, cache_{base.cache_}, cutoff_{cutoff} + { + } + + EvictingIterator(const EvictingIterator&) noexcept = default; + EvictingIterator& operator=(const EvictingIterator&) noexcept = default; + + explicit operator bool() const noexcept + { + return pos_ < cache_ids_->size() && (*cache_)[(*cache_ids_)[pos_]].timestamp < cutoff_; + } + + uint64_t Evict(ScratchAllocator& scratch, Replay& replay) + { + const int cache_id = (*cache_ids_)[pos_++]; + const auto& cache = (*cache_)[cache_id]; + scratch.Evict(cache.object_id, cache.allocation.a); + replay.push_back(EvictReplay{cache_id}); + return cache.timestamp; + } + + size_t pos() const noexcept + { + return pos_; + } + + void SeekTo(size_t pos) noexcept + { + pos_ = pos; + } + +private: + const std::vector* cache_ids_; + size_t pos_{}; + const CacheBlockPool* cache_; + uint64_t cutoff_{std::numeric_limits::max()}; +}; + +class AllocatingIterator { +public: + AllocatingIterator(const std::vector& cache_ids, const CacheBlockPool& cache): + iter_{cache_ids.begin()}, end_{cache_ids.end()}, cache_{cache} + { + } + + AllocatingIterator(std::vector&&, const CacheBlockPool&) = delete; + + AllocatingIterator(const AllocatingIterator&) = delete; + AllocatingIterator& operator=(const AllocatingIterator&) = delete; + + explicit operator bool() const noexcept + { + return iter_ != end_; + } + + // Idempotent: ids already allocated for real (cached alloc set), or + // planned 
by an earlier request in this pass, are skipped. + bool + Allocate(ScratchAllocator& scratch, std::unordered_set& planned, std::vector& planned_now, Replay& replay) + { + const int cache_id = *iter_; + const auto& cache = cache_[cache_id]; + if (cache.valid() || planned.count(cache_id)) { + ++iter_; + return true; + } + if (scratch.Allocate(cache.object_id)) { + ++iter_; + planned.insert(cache_id); + planned_now.push_back(cache_id); + replay.push_back(AllocReplay{cache_id}); + return true; + } + return false; + } + +private: + std::vector::const_iterator iter_; + std::vector::const_iterator end_; + const CacheBlockPool& cache_; +}; + +const char* ResumeSourceName(ResumeSource src) +{ + switch (src) { + case ResumeSource::kPrefix: + return "prefix"; + case ResumeSource::kFrontier: + return "frontier"; + case ResumeSource::kCheckpoint: + return "checkpoint"; + case ResumeSource::kFork: + return "fork"; + default: + return "none"; + } +} + +// Collect start-fingerprints of images whose start token lies in [lo, hi), with +// their block-relative start positions. multimodal_spans is prompt-ordered +// ascending by interval.begin(). +void CollectStartFps(const Sequence& s, int lo, int hi, std::vector& fps, std::vector* pos = nullptr) +{ + for (const auto& sp : s.multimodal_spans) { + const int b = sp.interval.begin(); + if (b < lo) { + continue; + } + if (b >= hi) { + break; + } + fps.push_back(sp.fingerprint); + if (pos) { + pos->push_back(b - lo); + } + } +} + +enum class CollisionSite +{ + kAccept, + kPromptBoundary, + kPublish +}; + +// Finalize-event record, filled by PublishGeneration's index loop. 
+struct GenStat { + int first_offset = 0; // offset of first newly-indexed generated block (token) + int indexed = 0; // generated blocks newly inserted into the trie + int last_size = 0; // filled tokens of the last inserted block + bool terminal_ckpt = false; + int dropped = 0; // redundant full-block checkpoints dropped on terminal adoption +}; + +// Prefix-cache log helpers (definitions at the bottom of this file). Each opens +// with an isolated level gate so its formatting is skipped when level > INFO. +// Rule: derive what survives the pass from `s`; pass a record for ephemeral +// within-pass facts. `bs` only where range math needs it. +void LogAccept(const Sequence& s, int bs); +void LogResume(const Sequence& s); +void LogDeferred(const Sequence& s, int bs, const Scheduler::ProducerConflict& c); +void LogPublished(const Sequence& s, int bs, const Scheduler::PublishStat& p); +void LogFinalized(const Sequence& s, int bs, const GenStat& g); +void LogCollision(const Sequence& s, CollisionSite site, int begin, int end); + +} // namespace + +// True if any multimodal span overlaps [lo, hi). Interval is the absolute token +// span [begin, end); a partial prompt block "contains image tokens" when a span +// intersects it, even one that started in an earlier (full) block and extends +// in. multimodal_spans is prompt-ordered ascending by interval.begin(). 
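The early-exit overlap test described in the comment above can be sketched in isolation. `Span` is a hypothetical stand-in for the `Interval` carried by `MultiModalSpan`, and the spans are assumed to be sorted ascending by `begin`, as the comment states.

```cpp
#include <cassert>
#include <vector>

namespace sketch {  // hypothetical namespace, illustration only

struct Span {
    int begin;  // absolute token position, inclusive
    int end;    // absolute token position, exclusive
};

// True if any span intersects the half-open range [lo, hi).
inline bool HasOverlap(const std::vector<Span>& spans, int lo, int hi)
{
    for (const Span& sp : spans) {
        if (sp.begin >= hi) {
            break;  // sorted by begin: no later span can start before hi
        }
        if (sp.end > lo) {
            return true;  // starts before hi and ends after lo
        }
    }
    return false;
}

}  // namespace sketch
```

A span that starts in an earlier block and extends into the queried range still counts, which is the case the comment calls out for partial prompt blocks.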
+bool Scheduler::HasMultimodalOverlap(const Sequence& s, int lo, int hi) +{ + for (const auto& sp : s.multimodal_spans) { + if (sp.interval.begin() >= hi) { + break; // ascending; no later span can overlap + } + if (sp.interval.end() > lo) { + return true; + } + } + return false; +} + +static PerformanceCounter make_perf_counter() +{ + constexpr int kSchedPerfCounters = 32; + return PerformanceCounter{kSchedPerfCounters}; +} + +struct Scheduler::ScheduleState { + std::vector requests; + + std::vector cutoff; // per-request eviction cutoff stamps + uint64_t floor{}; // pass-start timestamp; inactive < floor <= cutoff[i] + Replay replay; // alloc/evict ops of the current phase + size_t committed_replay_size{0}; // replay prefix from committed requests (phase 1) + std::vector committed; + std::vector pending_fork; // fork_to node per request, nullptr = none + std::vector pending_publish; // checkpoint publication intent per request + bool has_optionals{false}; // any optional intent recorded => run phase 2 + std::vector evict_ids; // SortedIndices() snapshot, shared by both phases + size_t evict_pos{0}; // oldest-first eviction cursor shared by both phases + std::unordered_set planned; // cache ids planned/reserved for allocation +}; + +bool Scheduler::PrefixEligible(const Sequence& s) const noexcept +{ + // Native VLM (multimodal_spans) is eligible: image identity is carried by the + // per-image fingerprint folded into the prefix key. The legacy Python-embedding + // path (input_embeds) stays excluded -- out of scope for this change. 
+ return enable_prefix_caching_ && !is_warm_up_ && s.input_embeds.empty() && s.input_embeds_offsets.empty() + && s.token_ids != nullptr; +} + +TokenSpan Scheduler::TokenSegment(const Sequence& s, int offset, int size) const +{ + TM_CHECK_NOTNULL(s.token_ids); + TM_CHECK_GE(offset, 0); + TM_CHECK_GE(size, 0); + TM_CHECK_LE(offset + size, s.seq_len); + return MakeTokenSpan(s.token_ids + offset, size); +} + +Scheduler::Scheduler(ObjectAllocator& alloc, + CacheRegistry registry, + int cache_block_seq_len, + bool enable_prefix_caching, + const std::string& cache_prompt, + int cache_prompt_boundary_skip, + const std::string& cache_generation, + const int& is_warm_up): + enable_prefix_caching_{enable_prefix_caching}, + prompt_cache_mode_{ParseCacheMode(cache_prompt)}, + cache_prompt_boundary_skip_{cache_prompt_boundary_skip < 1 ? 1 : cache_prompt_boundary_skip}, + generation_cache_mode_{ParseCacheMode(cache_generation)}, + is_warm_up_{is_warm_up}, + alloc_{alloc}, + registry_{std::move(registry)}, + logical_{cache_, cache_block_seq_len}, + trie_{cache_block_seq_len}, + accum_{make_perf_counter()}, + interv_{make_perf_counter()} +{ + logical_.set_recycle_hook([this](LogicalBlock& b) { trie_.Erase(b); }); +} + +Scheduler::~Scheduler() +{ + if (interv_) { + accum_ += interv_; + interv_ = {}; + } + LogProfile(accum_); + + // Drain all live allocations so allocation-held refs are released and the + // remaining trie nodes recycle before the pools are destroyed. + // SortedIndices() returns exactly the allocated blocks (alloc set). 
+ for (const int id : cache_.SortedIndices()) { + cache_.Deallocate(alloc_, id); + if (LogicalBlock* o = cache_[id].owner) { + logical_.Drop(o); + } + } +} + +void Scheduler::EnsureBlocks(Sequence& s) +{ + const int bs = logical_.block_size(); + const int length = s.seq_len + s.inflight_new_tokens; + const int needed = (length + bs - 1) / bs; + while (static_cast(s.block_ids.size()) < needed) { + const int i = static_cast(s.block_ids.size()); + BlockHandle h = logical_.Create(i); + h->prefix_id = cache_.Create(registry_.prefix().object_id(), h.get()); // owner = node + s.block_ids.push_back(std::move(h)); // request ref + } +} + +struct Scheduler::AcceptState { + const LogicalBlock* parent{}; // trie node reached so far (nullptr = root) + PrefixKey key{}; + + int miss{}; // first block index not matched in the trie + const LogicalBlock* miss_parent{}; // trie position at the miss, for fork_from + PrefixKey miss_key{}; + + size_t next_fp = 0; // monotonic cursor into Sequence::multimodal_spans +}; + +void Scheduler::Accept(Sequence& s) +{ + TM_CHECK(s.block_ids.empty()); + if (!PrefixEligible(s)) { + return; // blocks are created lazily by EnsureBlocks + } + AcceptState st{}; // parent defaults to nullptr (root) + MatchPrompt(s, st); // match full blocks to the first miss + s.matched_blocks = st.miss; // leading prompt blocks found in the trie + CreateMissingBlocks(s, st); // create + index the remaining prompt blocks + SetupForks(s, st); // fork_from (partial match) + fork_to (prompt boundary) + LogAccept(s, logical_.block_size()); +} + +void Scheduler::MatchPrompt(Sequence& s, AcceptState& st) +{ + const int bs = logical_.block_size(); + const int full_blocks = s.prompt_len / bs; + + int i = 0; + for (; i < full_blocks; ++i) { + const int offset = i * bs; + size_t cur = st.next_fp; // working copy; do not commit on a miss + std::vector fps; + while (cur < s.multimodal_spans.size() && s.multimodal_spans[cur].interval.begin() < offset + bs) { + 
fps.push_back(s.multimodal_spans[cur].fingerprint); + ++cur; + } + const auto tokens = TokenSegment(s, offset, bs); + const auto next = ExtendPrefixKey(st.key, tokens, fps); + if (LogicalBlock* b = trie_.Find(st.parent, next, tokens, fps)) { + s.block_ids.emplace_back(b); // retain via BlockHandle copy + st.parent = b; + st.key = next; + st.next_fp = cur; // commit advance only on a match + } + else { + break; // cursor still at the miss block's first span + } + } + + st.miss = i; + st.miss_parent = st.parent; + st.miss_key = st.key; +} + +void Scheduler::CreateMissingBlocks(Sequence& s, AcceptState& st) +{ + const int bs = logical_.block_size(); + const int prompt = s.prompt_len; + + const int all_blocks = (prompt + bs - 1) / bs; + + for (int i = st.miss; i < all_blocks; ++i) { + const int offset = i * bs; + const int size = std::min(prompt - offset, bs); + std::vector fps; + while (st.next_fp < s.multimodal_spans.size() + && s.multimodal_spans[st.next_fp].interval.begin() < offset + size) { + fps.push_back(s.multimodal_spans[st.next_fp].fingerprint); + ++st.next_fp; + } + const auto tokens = TokenSegment(s, offset, size); + BlockHandle h = logical_.Create(i); + LogicalBlock& x = *h; + x.prefix_id = cache_.Create(registry_.prefix().object_id(), h.get()); + if (size == bs) { + const auto next = ExtendPrefixKey(st.key, tokens, fps); + x.parent = st.parent; + x.key = next; + x.size = size; + x.tokens.assign(tokens.begin(), tokens.end()); + x.image_fps = fps; // usually empty + if (!trie_.Insert(x)) { + LogCollision(s, CollisionSite::kAccept, offset, offset + size); + // Stays un-indexed; treated as a private block from here on. + x.parent = nullptr; + x.key = {}; + x.size = 0; + x.tokens.clear(); + x.image_fps.clear(); + } + else { + st.parent = h.get(); + st.key = next; + } + } + // The partial last block stays private; parent/key do not advance. 
+ s.block_ids.push_back(std::move(h)); // request ref + } +} + +void Scheduler::SetupForks(Sequence& s, AcceptState& st) +{ + const int bs = logical_.block_size(); + const int prompt = s.prompt_len; + + const int all_blocks = (prompt + bs - 1) / bs; + + // fork_from (read side) is always armed: any prior request may have published + // a prompt partial node (cache_prompt in {all, auto}) or a generation + // terminal partial ('all'), so the read edge must always try to match. + if (st.miss < all_blocks) { + LogicalBlock& x = *s.block_ids[st.miss]; + const int offset = st.miss * bs; + const int size = std::min(prompt - offset, bs); + PrefixKey k = st.miss_key; + + std::vector fps; + std::vector fp_pos; + CollectStartFps(s, offset, offset + size, fps, &fp_pos); + + if (LogicalBlock* v = trie_.Search(st.miss_parent, k, TokenSegment(s, offset, size), fps, fp_pos)) { + x.fork_from = BlockHandle{v}; // edge ref + } + } + + // Prompt-boundary publish point (fork_to). B = prompt_len - K (K = + // cache_prompt_boundary_skip). 'all' publishes a partial node whenever B is + // mid-block and arms the checkpoint clamp when B is block-aligned. 'auto' + // publishes the partial node only when its own token range [j*bs, B) overlaps + // a multimodal span (including a span that began in an earlier block and + // extends into this range), and never arms the block-aligned clamp. 
+ const auto plan = PlanPromptBoundary(prompt, bs, cache_prompt_boundary_skip_, st.miss); + if (plan.valid) { + const bool need_image = plan.partial && prompt_cache_mode_ == CacheMode::kAuto; + const bool has_image = need_image && HasMultimodalOverlap(s, plan.block * bs, plan.pos); + + if (DecidePromptBoundaryPublish(prompt_cache_mode_, plan.partial, has_image)) { + bool have_target = true; + + if (plan.partial) { + const int j = plan.block; // j >= 1 (guaranteed by the planner) + LogicalBlock& x = *s.block_ids[j]; + const auto tokens = TokenSegment(s, j * bs, plan.node_size); + + std::vector fps; + CollectStartFps(s, j * bs, j * bs + plan.node_size, fps); + + const auto next = ExtendPrefixKey(s.block_ids[j - 1]->key, tokens, fps); + BlockHandle vh = logical_.Create(j); + LogicalBlock& y = *vh; + y.parent = s.block_ids[j - 1].get(); + y.key = next; + y.size = plan.node_size; + y.tokens.assign(tokens.begin(), tokens.end()); + y.image_fps = fps; + y.prefix_id = cache_.Create(registry_.prefix().object_id(), vh.get()); + if (trie_.Insert(y)) { + x.fork_to = std::move(vh); // edge holds the only ref + } + else { + LogCollision(s, CollisionSite::kPromptBoundary, j * bs, j * bs + plan.node_size); + have_target = false; // undiscoverable: vh drops at scope end -> recycle + } + } + + if (have_target) { + s.prompt_boundary_node = true; + s.prompt_boundary_pos = plan.pos; // clamp the producer's prefill to B + } + } + } +} + +void Scheduler::Resume(Sequence& s) +{ + TM_CHECK(!s.is_active); + + s.resuming = true; + + EnsureBlocks(s); + + ResetPlanBuffers(s); + + const bool ckpt = registry_.has_checkpoint(); + const int upper = InitialResumeUpperBound(s); + const int bs = logical_.block_size(); + + if (ckpt) { + if (s.frontier_cache_id == 0) { + s.frontier_cache_id = cache_.Create(registry_.checkpoint().object_id()); + s.frontier_pos = 0; + } + if (s.publish_cache_id == 0) { + s.publish_cache_id = cache_.Create(registry_.checkpoint().object_id()); + } + } + + // 1. 
Contiguous reusable prefix end (token level). Indexed nodes carry + // their own content extent (size); private blocks are capped by what + // this request has proven produced (filled_len). + int prefix_end = 0; + int readonly_block_num = 0; + for (const BlockHandle& h : s.block_ids) { + const LogicalBlock& x = *h; + if (!x.is_valid || !ValidAlloc(x.prefix_id)) { + break; + } + const int extent = x.key ? x.size : std::min(std::max(s.filled_len - x.offset, 0), x.capacity); + if (extent <= 0) { + break; + } + prefix_end = x.offset + extent; + if (extent < x.capacity) { + break; // partial / own-frontier: first writable block + } + ++readonly_block_num; // fully-valid whole block: read-only reusable + } + prefix_end = std::min(prefix_end, upper); // resume bound, unchanged + s.readonly_block_num = readonly_block_num; + + // 2. Resume step selection + int step = prefix_end; // without checkpointing, KV grants per-token resume + ResumeSource source = prefix_end > 0 ? ResumeSource::kPrefix : ResumeSource::kNone; + LogicalBlock* fork_dst = nullptr; + LogicalBlock* fork_src = nullptr; + int restore_ckpt = 0; // checkpoint cache id to copy into the frontier + + if (ckpt) { + step = 0; + source = ResumeSource::kNone; + + // Frontier fast path (no copy needed) + const int fpos = s.frontier_pos - s.inflight_input_len; + if (ValidAlloc(s.frontier_cache_id) && 0 < fpos && fpos <= prefix_end) { + step = fpos; + source = ResumeSource::kFrontier; + } + + // Latest block checkpoint within the reusable prefix + if (step < prefix_end) { + for (int i = std::min(s.block_ids.size(), (prefix_end + bs - 1) / bs); i > 0; --i) { + const LogicalBlock& x = *s.block_ids[i - 1]; + const int e = x.key ? x.offset + x.size : x.offset + x.capacity; + if (e <= step) { + break; + } + if (e <= prefix_end && ValidAlloc(x.checkpoint_id)) { + step = e; + source = ResumeSource::kCheckpoint; + restore_ckpt = x.checkpoint_id; + break; + } + } + } + } + + // 3. 
Fork extension: an indexed partial node can beat the current step by + // copying its content into our private block at the boundary. + if (prefix_end % bs == 0 && prefix_end / bs < static_cast(s.block_ids.size())) { + LogicalBlock& x = *s.block_ids[prefix_end / bs]; + if (x.fork_from) { + const LogicalBlock& y = *x.fork_from; + const int e = y.offset + y.size; + if (y.is_valid && e <= upper && e > step && ValidAlloc(y.prefix_id) + && (!ckpt || ValidAlloc(y.checkpoint_id))) { + step = e; + source = ResumeSource::kFork; + fork_dst = &x; + fork_src = x.fork_from.get(); + restore_ckpt = ckpt ? y.checkpoint_id : 0; + } + } + } + + s.resume_len = step; + // source is kNone exactly when step == 0, so no extra guard is needed. + s.resume_source = source; + + // 4. Restore copy plans (cache ids; resolved to pointers at setup) + if (fork_dst) { + s.restore_copies.push_back({fork_src->prefix_id, fork_dst->prefix_id}); + } + if (ckpt && step > 0 && restore_ckpt) { + s.restore_copies.push_back({restore_ckpt, s.frontier_cache_id}); + } + // step == 0 with checkpointing: GDN recognizes a forward starting at + // position 0 (history_len + inflight_input_len == 0) and resets. + + // 5. Allocation set and eviction-protection set. Protect only what is + // needed to run the forward: the prefix blocks (read-only context + the + // written tail) and the single frontier. Published checkpoints are + // resume-time optimizations, not run-time state — they stay out of the + // protected set so they remain evictable (a long/high-priority sequence + // can reclaim its own prior checkpoints to run). The one checkpoint (or + // fork source) actually restored this pass is protected separately via + // its restore_copies entry (stamped in PlanRequests, Section 4). 
+ for (const BlockHandle& h : s.block_ids) { + const LogicalBlock& x = *h; + s.involved_cache_ids.push_back(x.prefix_id); + if (!ValidAlloc(x.prefix_id)) { + s.alloc_cache_ids.push_back(x.prefix_id); + } + } + if (ckpt) { + s.involved_cache_ids.push_back(s.frontier_cache_id); + if (!ValidAlloc(s.frontier_cache_id)) { + s.alloc_cache_ids.push_back(s.frontier_cache_id); + } + } +} + +void Scheduler::Continue(Sequence& s) +{ + TM_CHECK(s.is_active); + + s.resume_len = s.filled_len; + s.readonly_block_num = 0; // decode writes only the new token (past the boundary) + s.resuming = false; + + const int first_new = static_cast<int>(s.block_ids.size()); + EnsureBlocks(s); + + ResetPassBuffers(s); // per-pass buffers only; involved_cache_ids persists + + const bool ckpt = registry_.has_checkpoint(); + + if (ckpt && s.publish_cache_id == 0) { + s.publish_cache_id = cache_.Create(registry_.checkpoint().object_id()); + } + + // Active-request invariant: a request that committed last pass kept every + // involved cache id (none were evicted) and allocated its whole required + // set, so the persistent involved set is still valid. Only the blocks + // appended by EnsureBlocks since the last plan are new, and being freshly + // created they are unallocated. Published checkpoints are deliberately not + // tracked here: they are not needed to run and must stay evictable so the + // sequence can run with just its prefix blocks and frontier. + for (int i = first_new; i < static_cast<int>(s.block_ids.size()); ++i) { + const int p = s.block_ids[i]->prefix_id; + s.involved_cache_ids.push_back(p); + s.alloc_cache_ids.push_back(p); + } + + // The frontier was added to involved_cache_ids by the activating Resume and + // stays valid while active (it is in the protected set); nothing to re-add.
+ if (ckpt) { + TM_CHECK(ValidAlloc(s.frontier_cache_id)); + } +} + +void Scheduler::SetProducers(Sequence& s, int t0, int end) +{ + const int bs = logical_.block_size(); + for (int i = t0 / bs; i < (end + bs - 1) / bs; ++i) { + s.block_ids[i]->producer = s.req->unique_id; + } +} + +Scheduler::ProducerConflict Scheduler::CheckProducers(const Sequence& s, int t0, int end) const +{ + const int bs = logical_.block_size(); + for (int i = t0 / bs; i < (end + bs - 1) / bs; ++i) { + const LogicalBlock& x = *s.block_ids[i]; + if (x.producer && x.producer != s.req->unique_id) { + return {x.producer, i}; + } + } + return {}; +} + +Scheduler::PublishStat Scheduler::Publish(Sequence& s, int t0, int end) +{ + const int bs = logical_.block_size(); + const int last = std::min<int>((end + bs - 1) / bs, s.block_ids.size()); + PublishStat stat{}; + // Start at t0/bs, mirroring SetProducers/CheckProducers: this pass only + // marks producers on [t0/bs, ceil(end/bs)) and clears them here, and every + // indexed block below t0 is already valid (Resume advances resume_len only + // over valid prefix; the in-flight [resume_len, t0) region was published at + // the prior forward's commit). The block straddling t0 sits at index t0/bs, + // so it is still processed.
+ for (int i = t0 / bs; i < last; ++i) { + LogicalBlock& x = *s.block_ids[i]; + if (x.producer == s.req->unique_id) { + x.producer = 0; + } + if (x.key) { + // Indexed nodes become valid only when fully covered + if (x.offset + x.size <= end) { + if (!x.is_valid) { + if (stat.reusable_blocks == 0) { + stat.start = x.offset; + } + ++stat.reusable_blocks; + stat.end = x.offset + x.size; + } + x.is_valid = true; + } + } + else { + // Private blocks: content extent is tracked via filled_len + x.is_valid = true; + } + } + return stat; +} + +void Scheduler::ReleaseCacheId(int cache_id) +{ + if (cache_id == 0) { + return; + } + auto& c = cache_[cache_id]; + if (c.object_id >= 0) { + TM_CHECK(c.owner == nullptr); // request-owned ids only (frontier/publish) + if (c.valid()) { + cache_.Deallocate(alloc_, cache_id); + } + cache_.Invalidate(cache_id); + } +} + +void Scheduler::Release(Sequence& s) +{ + for (const BlockHandle& h : s.block_ids) { + LogicalBlock& x = *h; + if (!x.indexed) { + // Private blocks are undiscoverable: drop their allocations now so + // the allocation-held refs go away and the block can recycle. 
+ for (const int c : {x.prefix_id, x.checkpoint_id}) { + if (ValidAlloc(c)) { + cache_.Deallocate(alloc_, c); + logical_.Drop(&x); // the allocation's ref (request ref still pins x) + } + } + } + } + s.block_ids.clear(); // request refs -> recycles unreferenced blocks + + ReleaseCacheId(std::exchange(s.frontier_cache_id, 0)); + ReleaseCacheId(std::exchange(s.publish_cache_id, 0)); + + s.frontier_pos = 0; + s.last_ckpt_pos = 0; + s.publish_target = nullptr; + s.publish_end = 0; + s.alloc_cache_ids.clear(); + s.involved_cache_ids.clear(); + s.restore_copies.clear(); + s.publish_copies.clear(); + s.resume_len = 0; + s.filled_len = 0; + s.readonly_block_num = 0; + s.input_len = 0; + s.history_len = 0; +} + +void Scheduler::PublishGeneration(Sequence& s) +{ + if (!PrefixEligible(s) || s.filled_len <= 0) { + return; + } + if (generation_cache_mode_ == CacheMode::kNone) { + return; // index no generated blocks at all + } + + // 'all' indexes the terminal partial block + adopts the terminal recurrent + // frontier checkpoint; 'auto' indexes full generated blocks only. + const bool publish_generation_boundary = (generation_cache_mode_ == CacheMode::kAll); + + const LogicalBlock* parent = nullptr; + PrefixKey key{}; + + GenStat gen{}; // index-loop summary for the finalized log + + for (size_t i = 0; i < s.block_ids.size(); ++i) { + LogicalBlock* up = s.block_ids[i].get(); + LogicalBlock& x = *up; + if (x.offset >= s.filled_len) { + break; + } + if (x.indexed) { + if (x.offset + x.size > s.filled_len) { + break; + } + parent = up; + key = x.key; + continue; + } + const int size = std::min(s.filled_len - x.offset, x.capacity); + if (!x.is_valid || !ValidAlloc(x.prefix_id)) { + break; + } + // The terminal partial generated block is the generation-boundary partial + // node; index it only when generation_cache_mode_ is kAll + // (publish_generation_boundary). 
It carries the partial block's KV for + // every model; a recurrent model additionally adopts the terminal frontier + // checkpoint below (guarded by a valid frontier id). Full generated blocks + // always index. It ends at filled_len, so nothing follows. + if (size < x.capacity && !publish_generation_boundary) { + break; + } + const auto tokens = TokenSegment(s, x.offset, size); + std::vector fps; + if (x.offset < s.prompt_len) { + // Only the prompt-tail block (private until now) can hold an image start; + // generated positions never do. Fold + store so this node's identity + // matches what a future request's MatchPrompt rebuilds. + CollectStartFps(s, x.offset, x.offset + size, fps); + } + const auto next = ExtendPrefixKey(key, tokens, fps); + x.parent = parent; + x.key = next; + x.size = size; + x.tokens.assign(tokens.begin(), tokens.end()); + x.image_fps = fps; // usually empty + if (!trie_.Insert(x)) { + LogCollision(s, CollisionSite::kPublish, x.offset, x.offset + size); + x.parent = nullptr; + x.key = {}; + x.size = 0; + x.tokens.clear(); + x.image_fps.clear(); + break; + } + if (gen.indexed == 0) { + gen.first_offset = x.offset; + } + ++gen.indexed; + gen.last_size = size; + // Adopt the frontier as the terminal checkpoint of the last block. + // Gated by publish_generation_boundary (this is the only partial-block + // generation checkpoint; full-block ones at boundaries stay always-on). + // + // The live recurrent buffer is guaranteed to correspond to filled_len + // here: the finishing pass stored its state at filled_len, and the GDN + // recurrence kernel bypasses its state write-back whenever the device + // finished mask is set, so any async over-shoot pass leaves the buffer + // untouched. 
We deliberately do NOT test s.frontier_pos: that field is + // resume-fast-path bookkeeping, committed speculatively as the scheduled + // forward end (CommitResults), so async lookahead over-counts it past + // filled_len and it would spuriously block this (safe) adoption. + // A valid frontier id implies checkpoints are registered (created only + // under has_checkpoint()), so no separate has_checkpoint() gate here. + if (publish_generation_boundary && x.offset + size == s.filled_len && ValidAlloc(s.frontier_cache_id) + && x.checkpoint_id == 0) { + + const int interval = registry_.checkpoint_min_interval(); + + // Classify in-window checkpoints below filled_len. A checkpoint on a + // block being indexed in *this* call (pos > prompt_len, still + // private until now -> no consumer ref) is droppable; one on an + // already-shared block (pos <= prompt_len) is a blocker we must not + // touch, so we skip adoption to preserve min_interval spacing. + bool blocked = false; + for (int j = static_cast(i); j-- > 0;) { + const LogicalBlock& p = *s.block_ids[j]; + const int pos = p.offset + p.size; + if (s.filled_len - pos >= interval) { + break; // outside the window + } + if (const int c = p.checkpoint_id; ValidAlloc(c) && pos <= s.prompt_len) { + blocked = true; + break; + } + } + + if (!blocked) { + const int f = std::exchange(s.frontier_cache_id, 0); + x.checkpoint_id = f; + cache_[f].owner = up; + logical_.Retain(up); // ref held by the live allocation + gen.terminal_ckpt = true; + + // Drop droppable redundant full-block checkpoints in the window; + // the terminal checkpoint supersedes them. Mirror eviction + // exactly: free memory + drop the allocation's logical ref. 
+ for (int j = static_cast(i); j-- > 0;) { + LogicalBlock& p = *s.block_ids[j]; + const int pos = p.offset + p.size; + if (s.filled_len - pos >= interval) { + break; // outside the window; spacing already satisfies min_interval + } + if (pos > s.prompt_len) { + if (const int c = p.checkpoint_id; ValidAlloc(c)) { + cache_.Deallocate(alloc_, c); // free memory + drop the alloc ref + logical_.Drop(&p); // block stays (request + index refs); slot left as evicted leftover + ++gen.dropped; // observability only (LogFinalized) + } + } + } + } + } + parent = up; + key = next; + } + + LogFinalized(s, logical_.block_size(), gen); +} + +// When this pass reaches the prompt boundary, plan the device copy that +// populates the indexed prompt-end partial node (fork_to). Returns the +// fork_to node when a copy is planned, nullptr otherwise. +LogicalBlock* Scheduler::PlanForkToPopulation(Sequence& s, int end, std::unordered_set& planned) +{ + const int bs = logical_.block_size(); + + const LogicalBlock& x = *s.block_ids[(end - 1) / bs]; + if (!x.fork_to) { + return nullptr; + } + const LogicalBlock& y = *x.fork_to; + const int y_cache = y.prefix_id; + if (y.offset + y.size != end || y.is_valid || ValidAlloc(y_cache) || planned.count(y_cache)) { + return nullptr; // boundary not reached, or another request already covers it + } + // Reserve the node so a later request sharing it does not also plan to + // populate it. The slot itself is allocated in the optional phase (from + // inactive memory); this reservation only dedups intent within the pass. A + // fork-to node is a distinct logical block from any request's required + // prefix blocks, so it never collides with a required allocation id. + planned.insert(y_cache); + return x.fork_to.get(); +} + +// Prompt-boundary group (caller guarantees end == B == prompt_boundary_pos): fork_to KV copy + +// checkpoint, both partial-block, bypassing the min-interval. 
+void Scheduler::PlanPromptBoundaryPublication(ScheduleState& pass, int i, Sequence& s, int end) +{ + // (a) copy the request's partial KV into the shared fork_to node. + if (LogicalBlock* node = PlanForkToPopulation(s, end, pass.planned)) { + pass.pending_fork[i] = node; + pass.has_optionals = true; + } + + // (b) checkpoint onto the fork_to node, or onto the block itself when + // B is block-aligned. + if (s.publish_cache_id) { + LogicalBlock& x = *s.block_ids[(end - 1) / logical_.block_size()]; + const bool at_block = x.offset + x.capacity == end; + const bool at_fork_to = x.fork_to && x.fork_to->offset + x.fork_to->size == end; + LogicalBlock* target = at_block ? &x : (at_fork_to ? x.fork_to.get() : nullptr); + if (target && !ValidAlloc(target->checkpoint_id)) { + pass.pending_publish[i] = {target, end, s.publish_cache_id}; + pass.has_optionals = true; + } + } +} + +// Full-block group: coverage-driven checkpoint, published iff a full block ends +// exactly at `end` (subject to min-interval); no prompt-boundary mode involved. +// The full block's prefix is published in place by Publish() (no KV copy).
+void Scheduler::PlanFullBlockPublication(ScheduleState& pass, int i, Sequence& s, int end) +{ + if (s.publish_cache_id == 0) { + return; + } + LogicalBlock& x = *s.block_ids[(end - 1) / logical_.block_size()]; + if (x.offset + x.capacity != end) { + return; // partial block — nothing to publish + } + const int interval = registry_.checkpoint_min_interval(); + if (end - s.last_ckpt_pos >= interval && !ValidAlloc(x.checkpoint_id)) { + pass.pending_publish[i] = {&x, end, s.publish_cache_id}; + pass.has_optionals = true; + } +} + +void Scheduler::Schedule(std::vector requests, Resource& resource) +{ + counter_ = make_perf_counter(); + + counter_.tick(0); + + ScheduleState pass{std::move(requests)}; + PlanRequests(pass); // Resume/Continue, sort, stamp involved + restore srcs, pass.floor + + counter_.tick(1); + + RunRequiredAdmission(pass, resource); // phase 1: required scratch alloc + eviction; collect intents + + counter_.tick(2); + + ReplayMemory(pass); // commit phase 1; clears pass.replay + + counter_.tick(3); + + if (pass.has_optionals) { + counter_.tick(10); + + RunOptionalAdmission(pass); // allocate intents from inactive memory; fill pass.replay + + counter_.tick(11); + + ReplayMemory(pass); // commit phase 2; clears pass.replay + + counter_.tick(12); + } + counter_.tick(4); + + CommitResults(pass); // publication attach, fork_to populate, Publish + + counter_.tick(5); + + interv_ += counter_; + +#if TM_SCHED_PROFILE + if (int n = GetEnv(); n > 0 && interv_.passes[0] == n) { + LogProfile(interv_); + accum_ += interv_; + interv_ = make_perf_counter(); + } +#endif +} + +void Scheduler::PlanRequests(ScheduleState& pass) +{ + for (Sequence* sp : pass.requests) { + if (sp->is_active) { + Continue(*sp); + } + else { + Resume(*sp); + } + } + + std::sort(pass.requests.begin(), pass.requests.end(), [](Sequence* a, Sequence* b) { + return a->req->unique_id < b->req->unique_id; + }); + + pass.cutoff.resize(pass.requests.size()); + const int n = 
static_cast<int>(pass.requests.size()); + for (int i = n; i > 0; --i) { + Sequence& s = *pass.requests[i - 1]; + const uint64_t pre = cache_.Stamp(s.involved_cache_ids); // pre-stamp value = cutoff + pass.cutoff[i - 1] = pre; + if (i == n) { + pass.floor = pre; // pass-start timestamp: the inactive/active boundary + } + // Protect the sources read by this pass's restore copies (a restored + // checkpoint and/or a fork source). They are foreign blocks not in this + // request's involved set, but must survive eviction until the restore + // copy runs before kPrepare. They land just above this request's cutoff, + // in the same band the old code gave them when they lived in involved. + // restore_copies is empty on the Continue path, so this adds nothing there. + for (const CacheCopy& c : s.restore_copies) { + cache_.Stamp(c.src); + } + } + + pass.committed.assign(pass.requests.size(), false); + pass.pending_fork.assign(pass.requests.size(), nullptr); + pass.pending_publish.assign(pass.requests.size(), PublishPlan{}); +} + +void Scheduler::RunRequiredAdmission(ScheduleState& pass, Resource& resource) +{ + counter_.tick(20); + + pass.evict_ids = cache_.SortedIndices(); + + counter_.tick(21); + + EvictingIterator evict_pos{pass.evict_ids, cache_}; + + uint64_t max_evict_ts = 0; + + ScratchAllocator scratch{alloc_}; + + counter_.tick(22); + + const int bs = logical_.block_size(); + + // Required admission loop: place every forward that fits (prefix blocks + + // frontier), evicting up to each request's cutoff. Optional optimizations + // (publication, fork-to population) are only decided here; their slots are + // allocated later, in RunOptionalAdmission, from inactive memory.
+ for (int i = 0; i < static_cast<int>(pass.requests.size()); ++i) { + auto& s = *pass.requests[i]; + + if (max_evict_ts >= pass.cutoff[i]) { + break; // would run on memory evicted from a higher-priority request + } + + const int admitted = resource.Test(s); + if (admitted == 0) { + TM_LOG_INFO("hit resource limit at {}/{}", i, pass.requests.size()); + break; + } + + s.history_len = s.resume_len; + + // Land the forward end on a checkpoint candidate. The prompt-boundary + // clamp (forward ends exactly at B) takes precedence; + // otherwise truncate partial prefill chunks to a block boundary. + const int begin = s.resume_len + s.inflight_input_len; + const int ctx_end = s.seq_len + s.inflight_new_tokens; // == prompt_len for a fresh prefill + int desired = begin + admitted; + + const int prompt_boundary_pos = s.prompt_boundary_pos; + + // The publish decision is finalized in SetupForks (prompt_boundary_node); + // the clamp fires on the pass that can reach B (>= so an exact landing + // isn't truncated away).
+ const bool publish_prompt = + s.prompt_boundary_node && begin < prompt_boundary_pos && desired >= prompt_boundary_pos; + + if (publish_prompt) { + desired = prompt_boundary_pos; // land exactly on B + } + else if (desired < ctx_end) { // partial chunk: truncate to a block boundary + desired = desired / bs * bs; + } + + const int len = desired - begin; + if (len <= 0) { + continue; // nothing admitted this pass; CommitResults leaves it inactive + } + s.input_len = len; + + const int end = begin + s.input_len; + + if (const ProducerConflict conflict = CheckProducers(s, begin, end); conflict.producer) { + LogDeferred(s, bs, conflict); + continue; // deferred; CommitResults leaves it inactive + } + + EvictingIterator evicting{evict_pos, pass.cutoff[i]}; + AllocatingIterator allocating{s.alloc_cache_ids, cache_}; + + uint64_t evict_ts = 0; + std::vector planned_now; + + bool ok = true; + while (allocating) { + bool success = allocating.Allocate(scratch, pass.planned, planned_now, pass.replay); + while (!success && evicting) { + evict_ts = evicting.Evict(scratch, pass.replay); + success = allocating.Allocate(scratch, pass.planned, planned_now, pass.replay); + } + if (!success) { + ok = false; + break; + } + } + + if (!ok) { // out of memory: roll back this request's planning, stop the pass + for (const int id : planned_now) { + pass.planned.erase(id); + } + TM_LOG_INFO("out of memory at {}/{}", i, pass.requests.size()); + break; // CommitResults leaves this and all later requests inactive + } + + resource.Commit(s); + s.is_active = true; + pass.committed[i] = true; + pass.committed_replay_size = pass.replay.size(); + max_evict_ts = std::max(max_evict_ts, evict_ts); + evict_pos = evicting; + + // Optional optimizations (allocated later, from inactive memory). 
One + // checkpoint per forward, routed by its end; publish_prompt is false when + // prompt_boundary_node was not set in SetupForks, or when this forward's + // geometry does not reach B, so nothing prompt-boundary is allocated. + // PlanPromptBoundaryPublication reserves the fork-to id in pass.planned for + // cross-request intent dedup. + if (publish_prompt) { + PlanPromptBoundaryPublication(pass, i, s, end); // fork_to KV + prompt-boundary checkpoint + } + else { + PlanFullBlockPublication(pass, i, s, end); // full-block checkpoint (coverage only) + } + + SetProducers(s, begin, end); + + if (s.resuming) { + // emit here so a producer's resume precedes any later consumer's defer log + LogResume(s); + } + } + + counter_.tick(23); + + scratch = {}; + + counter_.tick(24); + + pass.replay.resize(pass.committed_replay_size); + pass.evict_pos = evict_pos.pos(); // hand the oldest-first cursor to the optional phase +} + +void Scheduler::RunOptionalAdmission(ScheduleState& pass) +{ + ScratchAllocator opt{alloc_}; + + // Continue the monotonic oldest-first sweep from where phase 1 stopped, but + // reach only INACTIVE slots (timestamp < pass.floor) of any category. + // Phase-1-evicted candidates lie strictly before evict_pos; phase-1-allocated + // blocks are not in the pass-start snapshot; surviving candidates are still + // allocated (a slot stays allocated unless evicted, and its block stays alive + // via its own allocation ref or a sequence/fork ref), so they are valid to + // evict here. opt is a ScratchAllocator over the committed (post-phase-1) + // state: it holds a capacity (MemoryState) copy and borrows the live + // allocator's object registry, so no ObjectAllocator is cloned. The + // recorded replay is applied to the real allocator afterward. + EvictingIterator base{pass.evict_ids, cache_}; + base.SeekTo(pass.evict_pos); + EvictingIterator evicting{base, pass.floor}; + + // Skip only on real, committed memory (c.valid()). 
A fork-to id reserved in + // pass.planned during phase 1 still needs its slot allocated here, so we must + // NOT treat membership in pass.planned as "already allocated". + auto try_optional = [&](int cache_id) -> bool { + const auto& c = cache_[cache_id]; + if (c.valid()) { + return true; + } + bool ok = opt.Allocate(c.object_id); + while (!ok && evicting) { + evicting.Evict(opt, pass.replay); + ok = opt.Allocate(c.object_id); + } + if (!ok) { + return false; + } + pass.planned.insert(cache_id); + pass.replay.push_back(AllocReplay{cache_id}); + return true; + }; + + for (int i = 0; i < static_cast<int>(pass.requests.size()); ++i) { + if (!pass.committed[i]) { + continue; + } + Sequence& s = *pass.requests[i]; + + // fork-to population (prefix reuse for future forks) + if (LogicalBlock* node = pass.pending_fork[i]) { + if (!try_optional(node->prefix_id)) { + pass.pending_fork[i] = nullptr; // dropped; CommitResults won't populate it + } + } + // checkpoint publication + if (const PublishPlan& pub = pass.pending_publish[i]; pub.cache_id) { + if (try_optional(pub.cache_id)) { + s.publish_target = pub.target; // confirmed; CommitResults attaches it + s.publish_end = pub.end; + } + // else: skip publication this pass; publish_cache_id stays reserved + } + } +} + +void Scheduler::ReplayMemory(ScheduleState& pass) +{ + // Memory replay: the ONLY place where actual allocation/deallocation + // happens during a scheduling pass.
+ for (const auto& op : pass.replay) { + std::visit( + [&](const auto& item) { + using T = std::decay_t; + auto& c = cache_[item.cache_id]; + if constexpr (std::is_same_v) { + const bool is_prefix = c.object_id == registry_.prefix().object_id_or_negative(); + cache_.Deallocate(alloc_, item.cache_id); // clears allocation; owner persists + if (LogicalBlock* o = c.owner) { + if (is_prefix) { + o->is_valid = false; + } + logical_.Drop(o); // may recycle the block and free this slot + } + } + else { + c.allocation = alloc_.Allocate(c.object_id); // single-object; {nullptr} on OOM + TM_CHECK(c.allocation.a); // admission guarantees capacity + c.alloc_key = c.allocation->key; // snapshot for stale detection + logical_.Retain(c.owner); // no-op when owner == nullptr (request-owned) + } + }, + op); + } + pass.replay.clear(); // each phase materializes only its own segment +} + +void Scheduler::CommitResults(ScheduleState& pass) +{ + const int bs = logical_.block_size(); + + // Post-replay commit: publication attach, fork_to population, frontier + // metadata, and publication of produced ranges. + for (int i = 0; i < static_cast(pass.requests.size()); ++i) { + auto& s = *pass.requests[i]; + + // README scheduler-inactive: reset every uncommitted request here only. + // RunRequiredAdmission reject paths rely on committed[i] == false reaching this + // branch; do not zero these fields at reject sites. + if (!pass.committed[i]) { + s.is_active = false; + s.input_len = 0; + s.history_len = 0; + s.publish_target = nullptr; + s.publish_end = 0; + s.alloc_cache_ids.clear(); + s.restore_copies.clear(); + s.publish_copies.clear(); + continue; + } + + // A resuming request was inactive, so its prior `filled_len` predates the + // prefix it now reuses read-only / restores from a checkpoint. Reconcile it + // to the resume point: `filled_len` is the context currently established, not + // just KV this request's own forward produced. 
The `[resume_len, end)` span + // the in-flight resume forward rebuilds is carried by `inflight_input_len`. + // Safe to write here: the request is inactive, so this never races Update() + // of the previous batch. + if (s.resuming) { + s.filled_len = s.resume_len; + } + + const int begin = s.history_len + s.inflight_input_len; + const int end = begin + s.input_len; + + bool ckpt_published = false; + + if (LogicalBlock* v = pass.pending_fork[i]) { + LogicalBlock& y = *v; + y.is_valid = true; // content arrives via the device-ordered copy below + s.publish_copies.push_back({s.block_ids[(end - 1) / bs]->prefix_id, y.prefix_id}); + // Allocated outside the stamped involved sets: stamp now so the + // freshly populated node is not the top eviction candidate. + cache_.Stamp(y.prefix_id); + } + + if (s.publish_target) { + LogicalBlock& t = *s.publish_target; + if (ValidAlloc(t.checkpoint_id)) { + // Another request in this pass already published this node + ReleaseCacheId(std::exchange(s.publish_cache_id, 0)); + } + else { + if (const int stale = t.checkpoint_id) { + cache_.Invalidate(stale); // evicted leftover slot + } + const int id = std::exchange(s.publish_cache_id, 0); + t.checkpoint_id = id; + cache_[id].owner = s.publish_target; + logical_.Retain(s.publish_target); // ref held by the live allocation + s.last_ckpt_pos = s.publish_end; + ckpt_published = true; + s.publish_copies.push_back({s.frontier_cache_id, id}); + // Allocated outside the stamped involved sets: stamp now so + // the fresh checkpoint is not the top eviction candidate. + cache_.Stamp(id); + } + s.publish_target = nullptr; + s.publish_end = 0; + } + + if (registry_.has_checkpoint()) { + s.frontier_pos = end; + } + + // Content is guaranteed to be produced by this iteration (device + // execution is in submission order); no point deferring to Update(). 
+ PublishStat pub = Publish(s, begin, end); + pub.forked = pass.pending_fork[i] != nullptr; + pub.ckpt = ckpt_published; + LogPublished(s, bs, pub); + } +} + +void Scheduler::LogProfile(const PerformanceCounter& counter) const +{ +#if TM_SCHED_PROFILE + // TODO: Gate TP rank + + fmt::memory_buffer buf; + + fmt::format_to(std::back_inserter(buf), + "\n[sched] total {:.2f}, plan {:.2f}, required {:.2f}, replay {:.2f}, commit {:.2f}", + counter.dist(0, 5), + counter.dist(0, 1), + counter.dist(1, 2), + counter.dist(2, 3), + counter.dist(4, 5)); + + if (counter.passes[10]) { + fmt::format_to(std::back_inserter(buf), + "\n[sched] optional {:.2f}, replay {:.2f}", + counter.dist(10, 11), + counter.dist(11, 12)); + } + + fmt::format_to(std::back_inserter(buf), + "\n[sched] sort {:.2f}, scratch {:.2f}, loop {:.2f}, ~scratch {:.2f}", + counter.dist(20, 21), + counter.dist(21, 22), + counter.dist(22, 23), + counter.dist(23, 24)); + + TM_LOG_WARN("sched stats:{}", fmt::to_string(buf)); +#endif +} + +namespace { + +using turbomind::core::Logger; + +constexpr auto kCacheLogLevel = Logger::Level::kWarning; + +void LogAccept(const Sequence& s, int bs) +{ + auto msg = [&] { + const int prompt = s.prompt_len, full = prompt / bs, all = (prompt + bs - 1) / bs; + const int matched = s.matched_blocks, M = matched * bs; + std::string mtail, clast, ctail; + if (matched < (int)s.block_ids.size() && s.block_ids[matched]->fork_from) { + const LogicalBlock& y = *s.block_ids[matched]->fork_from; + mtail = fmt::format(", fork_from@{}", y.offset + y.size); // matched-side partial reuse + } + if (all - matched > 0 && prompt % bs) { + clast = fmt::format(", last {}/{}", prompt - full * bs, bs); // created-side partial tail + } + if (s.prompt_boundary_pos > 0) { + const int j = (s.prompt_boundary_pos - 1) / bs; // block holding B (matches PlanPromptBoundary) + if (j >= 0 && j < (int)s.block_ids.size() && s.block_ids[j]->fork_to) { + const LogicalBlock& ft = *s.block_ids[j]->fork_to; + ctail = 
fmt::format(", fork_to@{}", ft.offset + ft.size); // created-side publish node end + } + } + return fmt::format("req {} (uid {}) matched [0,{}) ({} blk){} | created [{},{}) ({} blk{}){}", + s.req->id, + s.req->unique_id, + M, + matched, + mtail, + M, + prompt, + all - matched, + clast, + ctail); + }; + + TM_LOG(kCacheLogLevel, msg()); +} + +void LogResume(const Sequence& s) +{ + auto msg = [&] { + const int begin = s.history_len + s.inflight_input_len; + const int end = begin + s.input_len; + const int total = s.seq_len + s.inflight_new_tokens; + const int pct = total > 0 ? 100 * s.history_len / total : 0; + return fmt::format("req {} (uid {}) resume [0,{}) {} blk ro ({}%) source={} | computed [{},{}) {} tok", + s.req->id, + s.req->unique_id, + s.history_len, + s.readonly_block_num, + pct, + ResumeSourceName(s.resume_source), + begin, + end, + s.input_len); + }; + + TM_LOG(kCacheLogLevel, msg()); +} + +void LogDeferred(const Sequence& s, int bs, const Scheduler::ProducerConflict& c) +{ + auto msg = [&] { + const int begin = s.resume_len + s.inflight_input_len; + const int end = begin + s.input_len; + const int b0 = std::max(begin, c.block * bs); + const int b1 = std::min(end, (c.block + 1) * bs); + return fmt::format("req {} (uid {}) deferred: tok [{},{}) held by producer uid {}", + s.req->id, + s.req->unique_id, + b0, + b1, + c.producer); + }; + TM_LOG(kCacheLogLevel, msg()); +} + +void LogPublished(const Sequence& s, int bs, const Scheduler::PublishStat& p) +{ + if (!(p.reusable_blocks > 0 || p.forked || p.ckpt)) { + return; + } + auto msg = [&] { + const int end = s.history_len + s.inflight_input_len + s.input_len; // forward end this pass + std::string body; + auto add = [&](std::string c) { body += body.empty() ? 
c : ", " + c; }; + if (p.reusable_blocks > 0) { + add(fmt::format("prefix [{},{}) ({} blk)", p.start, p.end, p.reusable_blocks)); + } + if (p.forked) { + const int b0 = (end - 1) / bs * bs; + add(fmt::format("boundary [{},{}) ({} tok)", b0, end, end - b0)); + } + if (p.ckpt) { + add(fmt::format("ckpt@{}", s.last_ckpt_pos)); + } + return fmt::format("req {} (uid {}) published {}", s.req->id, s.req->unique_id, body); + }; + TM_LOG(kCacheLogLevel, msg()); +} + +void LogFinalized(const Sequence& s, int bs, const GenStat& g) +{ + if (!(g.indexed > 0 || g.terminal_ckpt)) { + return; + } // terminal_ckpt implies indexed > 0 + auto msg = [&] { + std::string tail = (g.last_size < bs) ? fmt::format(", last {}/{}", g.last_size, bs) : ""; + std::string ckpt = + g.terminal_ckpt ? (g.dropped ? fmt::format(", terminal ckpt (dropped {})", g.dropped) : ", terminal ckpt") : + ""; + return fmt::format("req {} (uid {}) finalized gen [{},{}) ({} blk{}){}", + s.req->id, + s.req->unique_id, + g.first_offset, + s.filled_len, + g.indexed, + tail, + ckpt); + }; + TM_LOG(kCacheLogLevel, msg()); +} + +void LogCollision(const Sequence& s, CollisionSite site, int begin, int end) +{ + auto msg = [&] { + const char* where = ""; + const char* note = ""; + switch (site) { + case CollisionSite::kAccept: + where = "accept"; + note = ""; + break; + case CollisionSite::kPromptBoundary: + where = "prompt boundary"; + note = " (no fork_to)"; + break; + case CollisionSite::kPublish: + where = "publish"; + note = ", stop"; + break; + } + return fmt::format("req {} (uid {}) collision at {}: tok [{},{}) → private{}", + s.req->id, + s.req->unique_id, + where, + begin, + end, + note); + }; + + TM_LOG(kCacheLogLevel, msg()); +} + +} // namespace + +} // namespace turbomind diff --git a/src/turbomind/engine/scheduler.h b/src/turbomind/engine/scheduler.h new file mode 100644 index 0000000000..f3e44d6077 --- /dev/null +++ b/src/turbomind/engine/scheduler.h @@ -0,0 +1,241 @@ +#pragma once + +#include +#include 
+#include +#include + +#include "src/turbomind/comm/env.h" +#include "src/turbomind/core/check.h" +#include "src/turbomind/engine/block.h" +#include "src/turbomind/engine/cache_mode.h" +#include "src/turbomind/engine/cache_registry.h" +#include "src/turbomind/engine/prefix_trie.h" +#include "src/turbomind/engine/request.h" +#include "src/turbomind/memory/object.h" + +#define TM_SCHED_PROFILE 0 + +namespace turbomind { + +#if TM_SCHED_PROFILE +struct PerformanceCounter { + using time_point = std::chrono::high_resolution_clock::time_point; + std::vector timestamps; + std::vector passes; + + PerformanceCounter() = default; + explicit PerformanceCounter(int capacity): timestamps(capacity), passes(capacity) {} + + float dist(int i, int j) const noexcept + { + if (passes[i]) { + // TM_CHECK_EQ(passes[i], passes[j]); + return static_cast( + std::chrono::duration_cast(timestamps[j] - timestamps[i]).count()) + * 0.001f / passes[i]; + } + else { + return 0.f; + } + } + + void tick(int index) + { + // TM_CHECK(passes[index] == 0); + timestamps[index] = std::chrono::high_resolution_clock::now(); + passes[index] = 1; + } + + PerformanceCounter& operator+=(const PerformanceCounter& other) + { + for (size_t i = 0; i < timestamps.size(); ++i) { + if (other.passes[i]) { + timestamps[i] += (other.timestamps[i] - other.timestamps[0]); + passes[i] += other.passes[i]; + } + } + return *this; + } + + explicit operator bool() const noexcept + { + return passes[0] > 0; + } +}; +#else +struct PerformanceCounter { + PerformanceCounter() = default; + explicit PerformanceCounter(int) {} + void tick(int) {} + PerformanceCounter& operator+=(const PerformanceCounter&) + { + return *this; + } + float dist(int, int) const noexcept + { + return 0.f; + } + explicit operator bool() const noexcept + { + return false; + } +}; +#endif + +TM_ENV_VAR(CACHE, LOG_INTERVAL, 0); // 0 = disabled; N = log every N Schedule() passes + +class Scheduler { +public: + Scheduler(ObjectAllocator& alloc, + CacheRegistry 
registry, + int cache_block_seq_len, + bool enable_prefix_caching, + const std::string& cache_prompt, + int cache_prompt_boundary_skip, + const std::string& cache_generation, + const int& is_warm_up); + + ~Scheduler(); + + const CacheBlockPool& cache() const noexcept + { + return cache_; + } + + const LogicalBlockPool& logical() const noexcept + { + return logical_; + } + + const CacheRegistry& registry() const noexcept + { + return registry_; + } + + const ObjectAllocator& allocator() const noexcept + { + return alloc_; + } + + bool prefix_enabled() const noexcept + { + return enable_prefix_caching_; + } + + // True if any multimodal span overlaps [lo, hi). Pure; used by SetupForks to + // gate the 'auto' prompt-boundary publish. Public so it can be unit-tested. + static bool HasMultimodalOverlap(const Sequence& s, int lo, int hi); + + // Match the prompt against the prefix trie; create missing blocks; set up + // fork_from (partial match) and fork_to (prompt-boundary publish point). + void Accept(Sequence& s); + + // Commit step: per-request planning (Resume/Continue), admission with + // scratch allocation + eviction, memory replay, publication, Publish. + void Schedule(std::vector requests, Resource& resource); + + // Index generated blocks into the trie; adopt the frontier into the last + // partial block. Called on normal finish. + void PublishGeneration(Sequence& s); + + // Drop the request's references; pool recycling does the rest. + void Release(Sequence& s); + + // Observability-only records consumed by the file-local prefix-cache log + // helpers in scheduler.cc. Public so those file-local helpers can name them. 
+ struct PublishStat { + int start = 0; // first newly-valid prefix block offset (token); Publish() + int reusable_blocks = 0; // indexed nodes whose is_valid flipped true this pass; Publish() + int end = 0; // highest published prefix position (token); Publish() + bool forked = false; // a fork_to boundary populated this pass; set by CommitResults() + bool ckpt = false; // a checkpoint published this pass; set by CommitResults() + }; + struct ProducerConflict { + uint64_t producer = 0; // blocking request's unique_id; 0 = none + int block = -1; // first conflicting block index + }; + +private: + // Shared state of one Schedule pass; defined in scheduler.cc so the + // replay/admission types stay file-local. + struct ScheduleState; + + // Optional checkpoint-publication intent set by PlanPromptBoundaryPublication / + // PlanFullBlockPublication and allocated in the optional admission phase. + // cache_id == 0 => nothing. + struct PublishPlan { + LogicalBlock* target{}; + int end{}; + int cache_id{}; + }; + + // Schedule phases, called in order; see Schedule's body. + void PlanRequests(ScheduleState& pass); + void RunRequiredAdmission(ScheduleState& pass, Resource& resource); + void RunOptionalAdmission(ScheduleState& pass); + void ReplayMemory(ScheduleState& pass); + void CommitResults(ScheduleState& pass); + + // Trie cursor threaded through the Accept phases; defined in scheduler.cc. + struct AcceptState; + + void MatchPrompt(Sequence& s, AcceptState& cur); + void CreateMissingBlocks(Sequence& s, AcceptState& cur); + void SetupForks(Sequence& s, AcceptState& cur); + + // Per-request planning for inactive sequences: find the latest feasible + // resume step, emit restore copy plans, fill resume_len/alloc/involved. + void Resume(Sequence& s); + + // Per-request planning for sequences active in the last iteration. + void Continue(Sequence& s); + + // Clear producer marks and mark produced blocks valid for [t0, end). 
Returns + // the indexed blocks that became cross-request reusable this pass. + PublishStat Publish(Sequence& s, int t0, int end); + + void SetProducers(Sequence& s, int t0, int end); + ProducerConflict CheckProducers(const Sequence& s, int t0, int end) const; + + // Admission-loop helpers (called from Schedule only, after input_len is + // fixed). They decide and return optional intent; the slots are allocated in + // the optional admission phase. PlanForkToPopulation reserves the fork-to + // node's prefix_id in `planned` to dedup intent across requests. + LogicalBlock* PlanForkToPopulation(Sequence& s, int end, std::unordered_set<int>& planned); + void PlanPromptBoundaryPublication(ScheduleState& pass, int i, Sequence& s, int end); + void PlanFullBlockPublication(ScheduleState& pass, int i, Sequence& s, int end); + + void EnsureBlocks(Sequence& s); + void ReleaseCacheId(int cache_id); + + // The cached CacheBlock::allocation is the allocation-validity flag (set by + // the alloc replay, cleared by every deallocation path), so no + // ObjectAllocator::IsValid lookup is needed on the hot path.
+ bool ValidAlloc(int cache_id) const + { + return cache_id != 0 && cache_[cache_id].valid(); + } + + bool PrefixEligible(const Sequence& s) const noexcept; + TokenSpan TokenSegment(const Sequence& s, int offset, int size) const; + + void LogProfile(const PerformanceCounter& counter) const; + + bool enable_prefix_caching_{false}; + CacheMode prompt_cache_mode_{CacheMode::kAuto}; + int cache_prompt_boundary_skip_{1}; + CacheMode generation_cache_mode_{CacheMode::kAuto}; + const int& is_warm_up_; + ObjectAllocator& alloc_; // owned by Engine; also used outside the scheduler + CacheRegistry registry_; // owned: registration is closed before construction + CacheBlockPool cache_; + LogicalBlockPool logical_; + PrefixTrie trie_; + + PerformanceCounter counter_; + PerformanceCounter interv_; + PerformanceCounter accum_; +}; + +} // namespace turbomind diff --git a/src/turbomind/engine/test_prefix_trie.cc b/src/turbomind/engine/test_prefix_trie.cc new file mode 100644 index 0000000000..0914e02c26 --- /dev/null +++ b/src/turbomind/engine/test_prefix_trie.cc @@ -0,0 +1,322 @@ +// Copyright (c) OpenMMLab. All rights reserved. 
+ +#include "src/turbomind/core/interval.h" +#include "src/turbomind/engine/cache_mode.h" +#include "src/turbomind/engine/fingerprint.h" +#include "src/turbomind/engine/prefix_key.h" +#include "src/turbomind/engine/prefix_trie.h" +#include "src/turbomind/engine/prompt_boundary.h" +#include "src/turbomind/engine/request.h" +#include "src/turbomind/engine/scheduler.h" + +#include + +#include +#include + +using namespace turbomind; + +namespace { +Fingerprint FP(uint64_t a) +{ + return Fingerprint{{a, a + 1, a + 2, a + 3}}; +} +} // namespace + +TEST_CASE("Fingerprint: empty never equals; distinct differ; identical match", "[fingerprint]") +{ + Fingerprint empty{}; + REQUIRE(empty.empty()); + REQUIRE_FALSE(empty == empty); // empty never equals anything -- including itself + REQUIRE(empty != empty); + + const Fingerprint a = FP(100), b = FP(200), a2 = FP(100); + REQUIRE(a == a2); + REQUIRE_FALSE(a == b); + REQUIRE_FALSE(a == empty); + REQUIRE_FALSE(empty == a); +} + +TEST_CASE("PrefixTrie::Find honors image_fps", "[prefix_trie]") +{ + const int bs = 4; + PrefixTrie trie{bs}; + + std::vector toks = {1, 2, 3, 4}; + const Fingerprint fpA = FP(1), fpB = FP(2); + + LogicalBlock blkA{}; + blkA.parent = nullptr; + blkA.size = bs; + blkA.tokens = toks; + blkA.image_fps = {fpA}; + blkA.key = ExtendPrefixKey(PrefixKey{}, MakeTokenSpan(toks), {fpA}); + REQUIRE(trie.Insert(blkA)); + + // Same tokens + same fingerprint -> hit. + { + const auto key = ExtendPrefixKey(PrefixKey{}, MakeTokenSpan(toks), {fpA}); + REQUIRE(trie.Find(nullptr, key, MakeTokenSpan(toks), {fpA}) == &blkA); + } + // Same tokens, DIFFERENT fingerprint -> miss (no false hit). + { + const auto key = ExtendPrefixKey(PrefixKey{}, MakeTokenSpan(toks), {fpB}); + REQUIRE(trie.Find(nullptr, key, MakeTokenSpan(toks), {fpB}) == nullptr); + } + // Same tokens, EMPTY fingerprint -> miss (empty never equals). 
+ { + const std::vector empty_fps = {Fingerprint{}}; + const auto key = ExtendPrefixKey(PrefixKey{}, MakeTokenSpan(toks), empty_fps); + REQUIRE(trie.Find(nullptr, key, MakeTokenSpan(toks), empty_fps) == nullptr); + } +} + +TEST_CASE("PrefixTrie::Find: plain text block matches with empty fps", "[prefix_trie]") +{ + const int bs = 4; + PrefixTrie trie{bs}; + + std::vector toks = {5, 6, 7, 8}; + LogicalBlock blk{}; + blk.parent = nullptr; + blk.size = bs; + blk.tokens = toks; // no image_fps -> empty + blk.key = ExtendPrefixKey(PrefixKey{}, MakeTokenSpan(toks)); + REQUIRE(trie.Insert(blk)); + + const auto key = ExtendPrefixKey(PrefixKey{}, MakeTokenSpan(toks)); + REQUIRE(trie.Find(nullptr, key, MakeTokenSpan(toks)) == &blk); // default fps = {} + REQUIRE(trie.Find(nullptr, key, MakeTokenSpan(toks), {}) == &blk); +} + +TEST_CASE("PrefixTrie::Search finds a partial block and sub-selects fingerprints", "[prefix_trie]") +{ + const int bs = 4; + PrefixTrie trie{bs}; + + // Insert a PARTIAL block of length 2 whose first token carries an image start. + std::vector part = {1, 2}; + const Fingerprint fpA = FP(1); + LogicalBlock blkA{}; + blkA.parent = nullptr; + blkA.size = (int)part.size(); + blkA.tokens = part; + blkA.image_fps = {fpA}; + blkA.key = ExtendPrefixKey(PrefixKey{}, MakeTokenSpan(part), {fpA}); + REQUIRE(trie.Insert(blkA)); + + // Search a full 4-token span that shares the {1,2} prefix; the image starts at + // relative position 0. Search must enforce a partial match and land on blkA. 
+ std::vector full = {1, 2, 3, 4}; + PrefixKey key{}; + LogicalBlock* hit = trie.Search(nullptr, key, MakeTokenSpan(full), {fpA}, /*fp_pos=*/{0}); + REQUIRE(hit == &blkA); + REQUIRE(key == blkA.key); // on a hit, key is replaced with the matched node's key +} + +TEST_CASE("PrefixTrie::Search excludes an image that starts beyond the matched prefix", "[prefix_trie]") +{ + const int bs = 4; + PrefixTrie trie{bs}; + + // Insert a PARTIAL length-2 block with NO image in its first two tokens. + std::vector part = {1, 2}; + LogicalBlock blk{}; + blk.parent = nullptr; + blk.size = (int)part.size(); + blk.tokens = part; // empty image_fps + blk.key = ExtendPrefixKey(PrefixKey{}, MakeTokenSpan(part)); + REQUIRE(trie.Insert(blk)); + + // Image starts at relative position 2 (token index 2). For the length-2 prefix the + // sub-selection (fp_pos < 2) is empty, so it must match the empty-fps block. + std::vector full = {1, 2, 3, 4}; + const Fingerprint fpA = FP(7); + PrefixKey key{}; + LogicalBlock* hit = trie.Search(nullptr, key, MakeTokenSpan(full), {fpA}, /*fp_pos=*/{2}); + REQUIRE(hit == &blk); +} + +TEST_CASE("PrefixTrie::Search is bounded when fp_pos is shorter than fps (no OOB)", "[prefix_trie]") +{ + const int bs = 4; + PrefixTrie trie{bs}; + + // Partial length-2 block whose first token carries one image start (fpA). + std::vector part = {1, 2}; + const Fingerprint fpA = FP(1); + LogicalBlock blkA{}; + blkA.parent = nullptr; + blkA.size = (int)part.size(); + blkA.tokens = part; + blkA.image_fps = {fpA}; + blkA.key = ExtendPrefixKey(PrefixKey{}, MakeTokenSpan(part), {fpA}); + REQUIRE(trie.Insert(blkA)); + + // Regression for the `j < fp_pos.size()` bound: fps has 2 entries but fp_pos only 1. + // The loop must stop at fp_pos.size() and never read fp_pos[1]. 
+ std::vector full = {1, 2, 3, 4}; + const Fingerprint fpB = FP(2); + PrefixKey key{}; + LogicalBlock* hit = trie.Search(nullptr, key, MakeTokenSpan(full), {fpA, fpB}, /*fp_pos=*/{0}); + REQUIRE(hit == &blkA); +} + +TEST_CASE("PlanPromptBoundary: geometry and guards", "[prompt_boundary]") +{ + const int bs = 8; + + // K=1, last block has >1 token (prompt%bs==3): partial node at prompt_len-1, j==last. + { + const auto p = PlanPromptBoundary(/*prompt_len=*/19, bs, /*skip=*/1, /*miss=*/0); + REQUIRE(p.valid); + REQUIRE(p.partial); + REQUIRE(p.pos == 18); // 19 - 1 + REQUIRE(p.block == 2); // (18-1)/8 + REQUIRE(p.node_size == 2); // 18 - 16 + } + // K=1, last block has exactly 1 token (prompt%bs==1): block-aligned, no partial node. + { + const auto p = PlanPromptBoundary(/*prompt_len=*/17, bs, /*skip=*/1, /*miss=*/0); + REQUIRE(p.valid); + REQUIRE_FALSE(p.partial); + REQUIRE(p.pos == 16); // 17 - 1, block-aligned + REQUIRE(p.block == 1); // (16-1)/8 + } + // K=2, last block has >2 tokens (prompt%bs==3): partial node at prompt_len-2. + { + const auto p = PlanPromptBoundary(/*prompt_len=*/19, bs, /*skip=*/2, /*miss=*/0); + REQUIRE(p.valid); + REQUIRE(p.partial); + REQUIRE(p.pos == 17); // 19 - 2 + REQUIRE(p.block == 2); // (17-1)/8 + REQUIRE(p.node_size == 1); // 17 - 16 + } + // K=3, last block has >3 tokens (prompt%bs==5): B=18 stays in block 2, partial. + { + const auto p = PlanPromptBoundary(/*prompt_len=*/21, bs, /*skip=*/3, /*miss=*/0); + REQUIRE(p.valid); + REQUIRE(p.partial); + REQUIRE(p.pos == 18); // 21 - 3 + REQUIRE(p.block == 2); // (18-1)/8 + REQUIRE(p.node_size == 2); // 18 - 16 + } + // Partial node needs st.miss < j: miss at j blocks the node. + { + const auto p = PlanPromptBoundary(/*prompt_len=*/19, bs, /*skip=*/1, /*miss=*/2); // j==2 + REQUIRE_FALSE(p.valid); + } + // Block-aligned allows st.miss <= j: miss at j still publishes the clamp target.
+ { + const auto p = PlanPromptBoundary(/*prompt_len=*/17, bs, /*skip=*/1, /*miss=*/1); // j==1 + REQUIRE(p.valid); + REQUIRE_FALSE(p.partial); + REQUIRE(p.pos == 16); + } + // Block-aligned PROMPT at K=1 (prompt%bs==0): NOT suppressed -- a matchable + // boundary is published (option B; old code skipped this). + { + const auto p = PlanPromptBoundary(/*prompt_len=*/16, bs, /*skip=*/1, /*miss=*/0); + REQUIRE(p.valid); + REQUIRE(p.partial); + REQUIRE(p.pos == 15); // 16 - 1 + REQUIRE(p.block == 1); // (15-1)/8 + REQUIRE(p.node_size == 7); // 15 - 8 + } + // Think + full-block case: block-aligned prompt, K=2 -> partial node before + // the volatile suffix that lives in the last full block. + { + const auto p = PlanPromptBoundary(/*prompt_len=*/16, bs, /*skip=*/2, /*miss=*/0); + REQUIRE(p.valid); + REQUIRE(p.partial); + REQUIRE(p.pos == 14); // 16 - 2 + REQUIRE(p.block == 1); // (14-1)/8 + REQUIRE(p.node_size == 6); // 14 - 8 + } + // Partial geometry with j==0 (B < block_size): must be invalid -- no parent + // block exists, and miss < 0 is impossible. Locks Task 3's block_ids[j-1] safety. + { + const auto p = PlanPromptBoundary(/*prompt_len=*/4, bs, /*skip=*/1, /*miss=*/0); // B=3, j=0 + REQUIRE_FALSE(p.valid); + } + // Block-aligned but miss past j: not matchable -> invalid. + { + const auto p = PlanPromptBoundary(/*prompt_len=*/17, bs, /*skip=*/1, /*miss=*/2); // B=16, j=1 + REQUIRE_FALSE(p.valid); + } + // B < 1 -> no boundary. 
+ { + const auto p = PlanPromptBoundary(/*prompt_len=*/1, bs, /*skip=*/1, /*miss=*/0); + REQUIRE_FALSE(p.valid); + } +} + +TEST_CASE("ParseCacheMode maps strings to CacheMode", "[cache_mode]") +{ + using turbomind::CacheMode; + using turbomind::ParseCacheMode; + CHECK(ParseCacheMode("none") == CacheMode::kNone); + CHECK(ParseCacheMode("auto") == CacheMode::kAuto); + CHECK(ParseCacheMode("all") == CacheMode::kAll); +} + +TEST_CASE("DecidePromptBoundaryPublish gates by mode/partial/image", "[cache_mode]") +{ + using turbomind::CacheMode; + using turbomind::DecidePromptBoundaryPublish; + + // Partial node (B mid-block): 'all' always publishes; 'auto' only with image. + CHECK(DecidePromptBoundaryPublish(CacheMode::kAll, /*partial=*/true, /*has_image=*/false)); + CHECK(DecidePromptBoundaryPublish(CacheMode::kAll, true, true)); + CHECK_FALSE(DecidePromptBoundaryPublish(CacheMode::kAuto, true, false)); + CHECK(DecidePromptBoundaryPublish(CacheMode::kAuto, true, true)); + + // Block-aligned B (no partial node): only 'all' arms the checkpoint clamp. + CHECK(DecidePromptBoundaryPublish(CacheMode::kAll, /*partial=*/false, /*has_image=*/false)); + CHECK_FALSE(DecidePromptBoundaryPublish(CacheMode::kAuto, false, false)); + CHECK_FALSE(DecidePromptBoundaryPublish(CacheMode::kAuto, false, true)); +} + +TEST_CASE("HasMultimodalOverlap: overlaps [lo, hi) with ascending spans", "[multimodal_overlap]") +{ + using turbomind::Interval; + using turbomind::MultiModalSpan; + using turbomind::Scheduler; + using turbomind::Sequence; + + auto make_seq = [](std::vector> spans) { + auto s = std::make_shared(std::make_shared()); + for (const auto& [begin, end] : spans) { + s->multimodal_spans.push_back(MultiModalSpan{Interval{begin, end}, {}}); + } + return s; + }; + + // (a) A span fully inside [lo, hi) -> true. + { + auto s = make_seq({{12, 14}}); + CHECK(Scheduler::HasMultimodalOverlap(*s, 8, 16)); + } + // (b) No span, span entirely before lo, or span entirely at-or-after hi -> false. 
+ { + auto empty = make_seq({}); + CHECK_FALSE(Scheduler::HasMultimodalOverlap(*empty, 8, 16)); + + auto before = make_seq({{2, 8}}); // end == lo, so [2,8) is entirely before lo=8 + CHECK_FALSE(Scheduler::HasMultimodalOverlap(*before, 8, 16)); + + auto after = make_seq({{16, 20}}); // begin == hi -> at-or-after hi + CHECK_FALSE(Scheduler::HasMultimodalOverlap(*after, 8, 16)); + } + // (c) A span that begins before lo but ends after lo (begin < hi) -> true. + { + auto s = make_seq({{4, 10}}); // begins before lo=8, extends into [8,16) + CHECK(Scheduler::HasMultimodalOverlap(*s, 8, 16)); + } + // (d) Ascending early-break: a later span with begin >= hi (would return false) + // never masks an earlier overlapping span, which returns true first. + { + auto s = make_seq({{10, 12}, {20, 24}}); // first overlaps, second is >= hi + CHECK(Scheduler::HasMultimodalOverlap(*s, 8, 16)); + } +} diff --git a/src/turbomind/generation/generation.cc b/src/turbomind/generation/generation.cc index 479048b650..291ce40e80 100644 --- a/src/turbomind/generation/generation.cc +++ b/src/turbomind/generation/generation.cc @@ -7,7 +7,6 @@ #include "src/turbomind/core/check.h" #include "src/turbomind/core/copy.h" #include "src/turbomind/core/data_type.h" -#include "src/turbomind/core/state.h" #include "src/turbomind/engine/batch.h" #include "src/turbomind/engine/request.h" @@ -30,9 +29,9 @@ using std::shared_ptr; using std::vector; struct GenerationData { - Buffer_ random_state; Buffer_ random_seed; Buffer_ random_init; + Buffer_ random_state_indices; Buffer_ max_seq_len; Buffer_ token_ids_ptrs; Buffer_ output_ids; @@ -50,14 +49,12 @@ struct Generation::Impl { unique_ptr guided_decoding_; // persistent - Tensor_ token_ids_; + Tensor_ token_ids_; + Tensor_ random_states_; // scheduling states - vector h_token_ids_ptrs_; - vector h_token_ids_free_; - - // execution states - State random_state_; + vector free_token_rows_; + vector free_random_state_rows_; // immutable states Buffer_ output_ids_; 
@@ -65,9 +62,9 @@ struct Generation::Impl { std::vector> data_; // staging buffers - Buffer_ random_state_buf_; Buffer_ random_seed_buf_; Buffer_ random_init_buf_; + Buffer_ random_state_indices_buf_; Buffer_ token_ids_ptrs_buf_; Buffer_ token_ids_buf_; Buffer_ output_ids_buf_; @@ -75,6 +72,13 @@ struct Generation::Impl { const int max_batch_size_; const int session_len_; + int* RowPtr(int row) + { + TM_CHECK_GE(row, 0); + TM_CHECK_LT(row, max_batch_size_); + return token_ids_.data() + row * token_ids_.stride(0); + } + Impl(DataType dtype, int max_batch_size, int session_len, @@ -87,22 +91,22 @@ struct Generation::Impl { TM_CHECK_EQ(dtype, kFloat32); BaseGenerationParam base{max_batch_size, vocab_size, vocab_size_padded}; logits_processor_ = std::make_unique(base, phases); - sampling_ = std::make_unique(base, phases); + sampling_ = std::make_unique(base, phases, tp_group->rank()); stop_criteria_ = std::make_unique(base, phases); guided_decoding_ = std::make_unique(base, tp_group, phases); static_assert(sizeof(curandState_t) % alignof(curandState_t) == 0); - random_state_ = {{max_batch_size_, (int)sizeof(curandState_t)}, kUint8, kDEVICE}; - token_ids_ = {{max_batch_size_, session_len_}, kDEVICE}; - output_ids_ = {max_batch_size_, kDEVICE}; + random_states_ = {{max_batch_size_, (int)sizeof(curandState_t)}, kDEVICE}; + token_ids_ = {{max_batch_size_, session_len_}, kDEVICE}; + output_ids_ = {max_batch_size_, kDEVICE}; for (int i = 0; i < max_batch_size_; ++i) { - h_token_ids_free_.push_back(token_ids_.data() + i * token_ids_.stride(0)); + free_token_rows_.push_back(i); + free_random_state_rows_.push_back(i); } - h_token_ids_ptrs_.resize(max_batch_size_); - random_state_buf_ = {max_batch_size_ * (int)sizeof(curandState_t), kCPUpinned}; - random_seed_buf_ = {max_batch_size_, kCPUpinned}; - random_init_buf_ = {max_batch_size_, kCPUpinned}; + random_seed_buf_ = {max_batch_size_, kCPUpinned}; + random_init_buf_ = {max_batch_size_, kCPUpinned}; + random_state_indices_buf_ = 
{max_batch_size_, kCPUpinned}; token_ids_ptrs_buf_ = {max_batch_size_, kCPUpinned}; token_ids_buf_ = {max_batch_size_ * (ssize_t)session_len_, kCPUpinned}; @@ -112,11 +116,11 @@ struct Generation::Impl { for (int i = 0; i < phases; ++i) { auto d = std::make_unique(); - d->random_state = empty_like(random_state_buf_, kDEVICE); - d->random_seed = empty_like(random_seed_buf_, kDEVICE); - d->random_init = empty_like(random_init_buf_, kDEVICE); - d->token_ids_ptrs = empty_like(token_ids_ptrs_buf_, kDEVICE); - d->output_ids = empty_like(output_ids_, kDEVICE); + d->random_seed = empty_like(random_seed_buf_, kDEVICE); + d->random_init = empty_like(random_init_buf_, kDEVICE); + d->random_state_indices = empty_like(random_state_indices_buf_, kDEVICE); + d->token_ids_ptrs = empty_like(token_ids_ptrs_buf_, kDEVICE); + d->output_ids = empty_like(output_ids_, kDEVICE); data_.push_back(std::move(d)); } @@ -127,73 +131,57 @@ struct Generation::Impl { TM_FUNCTION_SCOPE(); auto& d = *data_.at(phase); - auto& b = *env.at("batch").data()[0]; auto& copy = *env.at("copy").data()[0]; - const auto& rc = b.rc; + Buffer_ rc = env.at("requests").buffer(); // random states d.random_init_needed = false; - for (int i = 0; i < b.perm.size(); ++i) { - const auto& c = *rc[i]; - if (TM_LIKELY(b.perm[i] < b.bs0)) { // existing - random_init_buf_[i] = false; - } - else if (c.random_state) { // already initialized - std::copy_n( - c.random_state, sizeof(curandState_t), random_state_buf_.data() + i * sizeof(curandState_t)); - } - else { // uninitialized - d.random_init_needed = true; - random_init_buf_[i] = true; - random_seed_buf_[i] = rc[i]->gen_cfg.random_seed; - } - } - copy(random_state_buf_, b.bsz, d.random_state); - if (d.random_init_needed) { - copy(random_init_buf_, b.bsz, d.random_init); - copy(random_seed_buf_, b.bsz, d.random_seed); - } + std::fill_n(random_init_buf_.data(), max_batch_size_, false); - vector used(b.bs0); - for (int i = 0; i < b.bsz; ++i) { - if (b.perm[i] < b.bs0) { - 
used[b.perm[i]] = 1; + int* token_ids_buf = token_ids_buf_.data(); + int generation_size = 0; + for (int i = 0; i < rc.size(); ++i) { + auto& c = *rc[i]; + if (!c.generating) { + continue; } - } - for (int i = 0; i < b.bs0; ++i) { - if (!used[i]) { // free unused chunks - h_token_ids_free_.push_back(h_token_ids_ptrs_[i]); + + if (c.generation_random_state_row < 0) { + TM_CHECK(!free_random_state_rows_.empty()); + + c.generation_random_state_row = free_random_state_rows_.back(); + free_random_state_rows_.pop_back(); + + random_init_buf_[c.generation_random_state_row] = true; + random_seed_buf_[c.generation_random_state_row] = c.gen_cfg.random_seed; + d.random_init_needed = true; } - } - // swap-in token_ids - int* token_ids_buf = token_ids_buf_.data(); - for (int i = 0; i < rc.size(); ++i) { - if (const auto& c = *rc[i]; TM_UNLIKELY(b.perm[i] >= b.bs0)) { - // allocation - TM_CHECK(!h_token_ids_free_.empty()); - token_ids_ptrs_buf_[i] = h_token_ids_free_.back(); - h_token_ids_free_.pop_back(); - // copy to staging buffer + + if (c.generation_token_ids_row < 0) { + TM_CHECK(!free_token_rows_.empty()); + + c.generation_token_ids_row = free_token_rows_.back(); + free_token_rows_.pop_back(); + + auto* dst = RowPtr(c.generation_token_ids_row); std::copy_n(c.token_ids, c.seq_len, token_ids_buf); - copy(token_ids_buf, c.seq_len, token_ids_ptrs_buf_[i]); + copy(token_ids_buf, c.seq_len, dst); token_ids_buf += c.seq_len; } - else { - token_ids_ptrs_buf_[i] = h_token_ids_ptrs_[b.perm[i]]; - } - } - - copy(token_ids_ptrs_buf_, b.bsz, d.token_ids_ptrs); - // update `h_token_ids_ptrs_` - std::copy_n(token_ids_ptrs_buf_.data(), b.bsz, h_token_ids_ptrs_.data()); + random_state_indices_buf_[generation_size] = c.generation_random_state_row; + token_ids_ptrs_buf_[generation_size++] = RowPtr(c.generation_token_ids_row); + } - d.generation_size = 0; - for (int i = 0; i < rc.size(); ++i) { - const auto& c = *rc[i]; - d.generation_size += c.generating; + if (d.random_init_needed) { + 
copy(random_init_buf_, max_batch_size_, d.random_init); + copy(random_seed_buf_, max_batch_size_, d.random_seed); } + + copy(token_ids_ptrs_buf_, generation_size, d.token_ids_ptrs); + copy(random_state_indices_buf_, generation_size, d.random_state_indices); + d.generation_size = generation_size; // dbg(d.generation_size); logits_processor_->Setup(phase, env); @@ -202,20 +190,32 @@ struct Generation::Impl { guided_decoding_->Setup(phase, env); } - void Prepare(int phase, TensorMap& env) + void Del(TensorMap& env) { - TM_FUNCTION_SCOPE(); - auto& d = *data_.at(phase); + Buffer_ rc = env.at("requests").buffer(); - auto& b = *env.at("batch").data()[0]; - auto& copy = *env.at("copy").data()[0]; + for (int i = 0; i < rc.size(); ++i) { + auto& token_row = rc[i]->generation_token_ids_row; + if (token_row >= 0) { + free_token_rows_.push_back(token_row); + token_row = -1; + } - if (auto g = copy.group()) { - Warp(random_state_.front(), d.random_state, b.bs0, b.perm, random_state_.back(), copy); - random_state_.Swap(); + auto& random_row = rc[i]->generation_random_state_row; + if (random_row >= 0) { + free_random_state_rows_.push_back(random_row); + random_row = -1; + } } } + void Prepare(int phase, TensorMap& env) + { + TM_FUNCTION_SCOPE(); + (void)phase; + (void)env; + } + void Unprep(int phase, TensorMap& env) { TM_FUNCTION_SCOPE(); @@ -223,8 +223,6 @@ struct Generation::Impl { auto& b = *env.at("batch").data()[0]; auto& copy = *env.at("copy").data()[0]; - // state -> data - copy(random_state_.front().buffer(), b.bsz * sizeof(curandState_t), d.random_state); copy(output_ids_, b.bsz, d.output_ids); } @@ -234,9 +232,6 @@ struct Generation::Impl { auto& d = *data_.at(phase); auto& copy = *env.at("copy").data()[0]; - copy(d.random_state, d.random_state.size(), random_state_buf_); - env.produce("random_state", random_state_buf_); - copy(d.output_ids, d.output_ids.size(), output_ids_buf_); env.produce("output_ids", output_ids_buf_); @@ -253,24 +248,24 @@ struct Generation::Impl 
{ { TM_FUNCTION_SCOPE(); auto& d = *data_.at(phase); - auto& b = *env.at("batch").data()[0]; const auto stream = core::Context::stream().handle(); if (d.random_init_needed) { - InitializeRandomStates((curandState_t*)random_state_.front().raw_data(), + InitializeRandomStates((curandState_t*)random_states_.raw_data(), d.random_seed.data(), d.random_init.data(), - b.bsz, + max_batch_size_, stream); } - env.emplace("output_ids", output_ids_); // out - env.emplace("curand_state", random_state_.front()); // inout + env.emplace("output_ids", output_ids_); // out + env.emplace("curand_state", random_states_); // inout if (const int gs = d.generation_size) { env.emplace("token_ids_ptrs", d.token_ids_ptrs.slice(0, gs)); + env.emplace("curand_state_indices", d.random_state_indices.slice(0, gs)); auto logits = env.consume("logits"); @@ -319,6 +314,9 @@ void Generation::Run(BatchOp op, int phase, TensorMap& env) if (op == BatchOp::kSetup) { return impl_->Setup(phase, env); } + else if (op == BatchOp::kDel) { + return impl_->Del(env); + } else if (op == BatchOp::kPrepare) { return impl_->Prepare(phase, env); } diff --git a/src/turbomind/generation/guided_decoding.cc b/src/turbomind/generation/guided_decoding.cc index c185156506..c90988ae0c 100644 --- a/src/turbomind/generation/guided_decoding.cc +++ b/src/turbomind/generation/guided_decoding.cc @@ -34,11 +34,15 @@ GuidedDecoding::GuidedDecoding(const BaseGenerationParam& base, const comm::Host void GuidedDecoding::Setup(int phase, TensorMap& env) { auto& d = *data_.at(phase); - auto& b = *env.at("batch").data()[0]; + // auto& b = *env.at("batch").data()[0]; + Buffer_ rs = env.at("requests").buffer(); d.matchers.clear(); d.active = false; - for (const auto& r : b.rc) { + for (const auto& r : rs) { + if (!r->generating) { + continue; + } if (d.matchers.emplace_back(r->req->matcher)) { d.active = true; } @@ -80,9 +84,8 @@ void GuidedDecoding::ApplyMask(int phase, TensorMap& env) comm::Broadcast(tp_group_, bitmask_buf_.data(), 
numel, 0); } Copy(bitmask_buf_.buffer(), numel, d.bitmask.buffer()); - // Use logits shape(0) instead of d.matchers.size() to ensure dimension match. - // d.matchers.size() is the total number of requests in batch, but logits may be - // sliced to only include requests that are still generating (generation_size). + // Matchers are compacted to the generating prefix in Setup, matching the + // logits rows consumed by generation. auto logits = env.at("logits"); ApplyTokenBitmaskInplace(logits, d.bitmask.slice(0, logits.shape(0))); } diff --git a/src/turbomind/generation/logits_processor.cc b/src/turbomind/generation/logits_processor.cc index 3999d7687a..c32030f487 100644 --- a/src/turbomind/generation/logits_processor.cc +++ b/src/turbomind/generation/logits_processor.cc @@ -123,8 +123,10 @@ void LogitsProcessor::Setup(int phase, TensorMap& env) auto& d = *data_.at(phase); - const auto& rs = env.at("batch").data()[0]->rc; - auto& copy = *env.at("copy").data()[0]; + // const auto& rs = env.at("batch").data()[0]->rc; + Buffer_ rs = env.at("requests").buffer(); + + auto& copy = *env.at("copy").data()[0]; const int bsz = rs.size(); @@ -154,7 +156,7 @@ void LogitsProcessor::Setup(int phase, TensorMap& env) // min_length min_lengths[i] = rs[i]->prompt_len + g.min_new_tokens; - if (rs[i]->seq_len + rs[i]->beta < min_lengths[i]) { + if (rs[i]->seq_len + rs[i]->inflight_new_tokens < min_lengths[i]) { d.has_min_length_penalty = true; } } diff --git a/src/turbomind/generation/sampling.cc b/src/turbomind/generation/sampling.cc index dd4f86b817..ff3266761a 100644 --- a/src/turbomind/generation/sampling.cc +++ b/src/turbomind/generation/sampling.cc @@ -31,6 +31,12 @@ namespace turbomind { struct SamplingData { + struct LogprobOutput { + int row; + int offset; + std::shared_ptr request; + }; + explicit SamplingData(int max_batch_size, DeviceType device) { top_k_buf = {max_batch_size, device}; @@ -54,14 +60,17 @@ struct SamplingData { Buffer_ kept_buf; // kept sample - bool 
output_logprobs = 0; + int generation_size = 0; + bool output_logprobs = 0; + std::vector logprob_outputs; Buffer_ sampled_logprobs; Buffer_ sampled_indices; Buffer_ sampled_nums; }; -Sampling::Sampling(const BaseGenerationParam& base, int phases): BaseGenerationParam{base} +Sampling::Sampling(const BaseGenerationParam& base, int phases, int tp_rank): + BaseGenerationParam{base}, tp_rank_{tp_rank} { top_k_ = {max_batch_size_, kCPUpinned}; top_p_ = {max_batch_size_, kCPUpinned}; @@ -152,14 +161,15 @@ void Sampling::Forward(int phase, TensorMap& args) // sample { SamplingParams params{}; - params.logits = logits.data(); - params.stride = vocab_size_padded_; - params.indices = indices.data(); - params.kept = d.kept_buf.data(); - params.curandstate = (curandState_t*)args.at("curand_state").raw_data(); - params.batch_size = bsz; - params.output_ids = args.at("output_ids").data(); // (B, 1) - params.sequence_length = args.at("sequence_length").data(); + params.logits = logits.data(); + params.stride = vocab_size_padded_; + params.indices = indices.data(); + params.kept = d.kept_buf.data(); + params.curandstate = (curandState_t*)args.at("curand_state").raw_data(); + params.curandstate_indices = args.at("curand_state_indices").data(); + params.batch_size = bsz; + params.output_ids = args.at("output_ids").data(); // (B, 1) + params.sequence_length = args.at("sequence_length").data(); if (d.output_logprobs) { params.sampled_logprobs = d.sampled_logprobs.data(); @@ -177,18 +187,42 @@ void Sampling::Setup(int phase, TensorMap& env) { TM_FUNCTION_SCOPE(); - const auto& rc = env.at("batch").data()[0]->rc; - auto& copy = *env.at("copy").data()[0]; + // const auto& rc = env.at("batch").data()[0]->rc; + Buffer_ rc = env.at("requests").buffer(); + + auto& copy = *env.at("copy").data()[0]; + + auto& d = *data_.at(phase); - const auto bsz = rc.size(); + d.generation_size = 0; + d.output_logprobs = false; + d.logprob_outputs.clear(); - for (int i = 0; i < bsz; ++i) { - top_k_[i] = 
rc[i]->gen_cfg.top_k; - top_p_[i] = rc[i]->gen_cfg.top_p; - min_p_[i] = rc[i]->gen_cfg.min_p; + for (int i = 0; i < rc.size(); ++i) { + auto& c = *rc[i]; + if (!c.generating) { + continue; + } + + const int row = d.generation_size++; + + top_k_[row] = c.gen_cfg.top_k; + top_p_[row] = c.gen_cfg.top_p; + min_p_[row] = c.gen_cfg.min_p; + + if (c.gen_cfg.output_logprobs) { + d.output_logprobs = true; + d.logprob_outputs.push_back({row, c.seq_len + c.inflight_new_tokens - c.prompt_len, c.req}); + } } - auto& d = *data_.at(phase); + const int bsz = d.generation_size; + if (bsz == 0) { + d.max_topk = d.min_topk = 0; + d.min_topp = 0.f; + d.max_minp = 0.f; + return; + } d.max_topk = *std::max_element(top_k_.begin(), top_k_.begin() + bsz); d.min_topk = *std::min_element(top_k_.begin(), top_k_.begin() + bsz); @@ -200,47 +234,48 @@ void Sampling::Setup(int phase, TensorMap& env) copy(min_p_.data(), bsz, d.min_p_buf.data()); copy(kept_.data(), bsz, d.kept_buf.data()); - - d.output_logprobs = std::any_of(rc.begin(), rc.end(), [](auto& x) { return x->gen_cfg.output_logprobs; }); } void Sampling::Fetch(int phase, TensorMap& env) { TM_FUNCTION_SCOPE(); auto& d = *data_.at(phase); - auto& b = *env.at("batch").data()[0]; auto& copy = *env.at("copy").data()[0]; if (d.output_logprobs) { - copy(d.sampled_logprobs, b.bsz * kMaxLogProb, sampled_logprobs_buf_); - copy(d.sampled_indices, b.bsz * kMaxLogProb, sampled_indices_buf_); - copy(d.sampled_nums, b.bsz, sampled_nums_buf_); + copy(d.sampled_logprobs, d.generation_size * kMaxLogProb, sampled_logprobs_buf_); + copy(d.sampled_indices, d.generation_size * kMaxLogProb, sampled_indices_buf_); + copy(d.sampled_nums, d.generation_size, sampled_nums_buf_); } } void Sampling::Update(int phase, TensorMap& env) { TM_FUNCTION_SCOPE(); + (void)env; + + if (tp_rank_ != 0) { + return; + } + auto& d = *data_.at(phase); - auto& b = *env.at("batch").data()[0]; + if (!d.output_logprobs) { + return; + } - if (d.output_logprobs) { - float* logprob_buf = 
sampled_logprobs_buf_.data(); - int* indices_buf = sampled_indices_buf_.data(); - int* n_buf = sampled_nums_buf_.data(); - for (int i = 0; i < b.rc.size(); ++i) { - if (auto& x = *b.rc[i]; x.gen_cfg.output_logprobs) { - // output buffers - auto logprob_out = x.req->outputs.at("logprob_vals").data(); - auto indices_out = x.req->outputs.at("logprob_indexes").data(); - auto n_out = x.req->outputs.at("logprob_nums").data(); - // offset into output buffers - const int offset = x.seq_len - x.prompt_len; - std::copy_n(logprob_buf + i * kMaxLogProb, n_buf[i], logprob_out + offset * kMaxLogProb); - std::copy_n(indices_buf + i * kMaxLogProb, n_buf[i], indices_out + offset * kMaxLogProb); - n_out[offset] = n_buf[i]; - } - } + float* logprob_buf = sampled_logprobs_buf_.data(); + int* indices_buf = sampled_indices_buf_.data(); + int* n_buf = sampled_nums_buf_.data(); + + for (const auto& x : d.logprob_outputs) { + auto logprob_out = x.request->outputs.at("logprob_vals").data(); + auto indices_out = x.request->outputs.at("logprob_indexes").data(); + auto n_out = x.request->outputs.at("logprob_nums").data(); + + const int n = n_buf[x.row]; + std::copy_n(logprob_buf + x.row * kMaxLogProb, n, logprob_out + x.offset * kMaxLogProb); + std::copy_n(indices_buf + x.row * kMaxLogProb, n, indices_out + x.offset * kMaxLogProb); + n_out[x.offset] = n; } } diff --git a/src/turbomind/generation/sampling.h b/src/turbomind/generation/sampling.h index 98e5c49540..8cd77e18aa 100644 --- a/src/turbomind/generation/sampling.h +++ b/src/turbomind/generation/sampling.h @@ -10,7 +10,7 @@ struct SamplingData; class Sampling: public BaseGenerationParam { public: - explicit Sampling(const BaseGenerationParam& base, int phases); + explicit Sampling(const BaseGenerationParam& base, int phases, int tp_rank); void Setup(int phase, TensorMap& env); @@ -21,6 +21,8 @@ class Sampling: public BaseGenerationParam { void Update(int phase, TensorMap& env); private: + const int tp_rank_; + std::vector> data_; // host 
buffer diff --git a/src/turbomind/generation/stop_criteria.cc b/src/turbomind/generation/stop_criteria.cc index d1efab4ba0..98f73ce4ab 100644 --- a/src/turbomind/generation/stop_criteria.cc +++ b/src/turbomind/generation/stop_criteria.cc @@ -36,8 +36,9 @@ void StopCriteria::Setup(int phase, TensorMap& env) TM_FUNCTION_SCOPE(); auto& d = *data_.at(phase); - const auto& rs = env.at("batch").data()[0]->rc; - auto& copy = *env.at("copy").data()[0]; + Buffer_ rs = env.at("requests").buffer(); + + auto& copy = *env.at("copy").data()[0]; for (int i = 0; i < rs.size(); ++i) { max_seq_len_buf_[i] = rs[i]->max_seq_len; diff --git a/src/turbomind/kernels/attention/attention_params.h b/src/turbomind/kernels/attention/attention_params.h index e9696327f9..7219cb1a83 100644 --- a/src/turbomind/kernels/attention/attention_params.h +++ b/src/turbomind/kernels/attention/attention_params.h @@ -20,7 +20,7 @@ struct LinearIteratorParams { struct BlockIteratorParams { char** block_ptrs; const int* cu_block_nums; - int layer_id; + int offset; // cache block offset int block_len; }; @@ -44,6 +44,7 @@ struct AttentionParams { // sequence-level buffers const int* cu_q_len; const int* cu_k_len; + const int* readonly_block_num; // per-batch read-only leading block count const bool* finished; const float* rope_theta; diff --git a/src/turbomind/kernels/attention/attention_universal.h b/src/turbomind/kernels/attention/attention_universal.h index c7960de01a..39ecfceaae 100644 --- a/src/turbomind/kernels/attention/attention_universal.h +++ b/src/turbomind/kernels/attention/attention_universal.h @@ -260,6 +260,12 @@ struct AttentionUniversal { const int qi = offset.y / CTA_H; const int ti = history_len; + // Read-only prefix: skip the KV store when this position falls inside a + // leading read-only block whose KV is already valid (single multiply). + const int readonly_len = params.readonly_block_num ? 
+ params.readonly_block_num[batch_idx] * params.block_iter_params.block_len : + 0; + int local_ti, local_ti_rank; local_ti = params.cp_size.divmod(local_ti_rank, ti); @@ -286,30 +292,32 @@ struct AttentionUniversal { } } - iterator.block_head_.with( - iterator.block_ptrs_, local_ti, [&](auto k_cache, auto v_cache, T* k_param, T* v_param) { - if (local_ti_rank != params.cp_rank) { - return; - } - PRAGMA_UNROLL - for (int c = 0; c < ITER_C; ++c) { - const int di = offset.x + c * Map::kDeltaC; - if (qi < CTA_Q) { - Store(&k_cache[di], out_K[0][c]); - if constexpr (HAS_V) { - Store(&v_cache[di], out_V[0][c]); + if (ti >= readonly_len) { + iterator.block_head_.with( + iterator.block_ptrs_, local_ti, [&](auto k_cache, auto v_cache, T* k_param, T* v_param) { + if (local_ti_rank != params.cp_rank) { + return; + } + PRAGMA_UNROLL + for (int c = 0; c < ITER_C; ++c) { + const int di = offset.x + c * Map::kDeltaC; + if (qi < CTA_Q) { + Store(&k_cache[di], out_K[0][c]); + if constexpr (HAS_V) { + Store(&v_cache[di], out_V[0][c]); + } } } - } - if constexpr (!std::is_same_v) { - if (qi < CTA_Q && offset.x == 0) { - StoreQuantParam(k_param, param_K[0]); - if constexpr (HAS_V) { - StoreQuantParam(v_param, param_V[0]); + if constexpr (!std::is_same_v) { + if (qi < CTA_Q && offset.x == 0) { + StoreQuantParam(k_param, param_K[0]); + if constexpr (HAS_V) { + StoreQuantParam(v_param, param_V[0]); + } } } - } - }); + }); + } __syncthreads(); } diff --git a/src/turbomind/kernels/attention/block.h b/src/turbomind/kernels/attention/block.h index 59347675bf..4ac7775ca3 100644 --- a/src/turbomind/kernels/attention/block.h +++ b/src/turbomind/kernels/attention/block.h @@ -53,44 +53,41 @@ struct Config { } }; -// Layout -> LayerId -> HeadId -> Timestep -> [Block] -> (k_data, v_data, k_param, v_param) +// Layout -> byte offset -> head id -> token id -> [block] -> (k_data, v_data, k_param, v_param) template class Head { public: - TM_HOST_DEVICE Head(Layout layout, int layer_id, int head_id): - 
layout_{layout}, layer_id_{layer_id}, head_id_{head_id} - { - } + TM_HOST_DEVICE Head(Layout layout, int offset, int head_id): layout_{layout}, offset_{offset}, head_id_{head_id} {} TM_HOST_DEVICE auto k_data(char* block, int ti) const { if constexpr (std::is_same_v) { - return SubBytePtr{block + layout_.k_data(layer_id_, head_id_, ti)}; + return SubBytePtr{block + offset_ + layout_.k_data(head_id_, ti)}; } else { - return reinterpret_cast(block + layout_.k_data(layer_id_, head_id_, ti)); + return reinterpret_cast(block + offset_ + layout_.k_data(head_id_, ti)); } } TM_HOST_DEVICE auto v_data(char* block, int ti) const { if constexpr (std::is_same_v) { - return SubBytePtr{block + layout_.v_data(layer_id_, head_id_, ti)}; + return SubBytePtr{block + offset_ + layout_.v_data(head_id_, ti)}; } else { - return reinterpret_cast(block + layout_.v_data(layer_id_, head_id_, ti)); + return reinterpret_cast(block + offset_ + layout_.v_data(head_id_, ti)); } } TM_HOST_DEVICE T* k_param(char* block, int ti) const { - return reinterpret_cast(block + layout_.k_param(layer_id_, head_id_, ti)); + return reinterpret_cast(block + offset_ + layout_.k_param(head_id_, ti)); } TM_HOST_DEVICE T* v_param(char* block, int ti) const { - return reinterpret_cast(block + layout_.v_param(layer_id_, head_id_, ti)); + return reinterpret_cast(block + offset_ + layout_.v_param(head_id_, ti)); } TM_HOST_DEVICE void get_block_coord(int seq_ti, int& block_idx, int& block_ti) const @@ -120,7 +117,7 @@ class Head { private: Layout layout_; - int layer_id_; + int offset_; int head_id_; }; @@ -178,39 +175,24 @@ struct Layout { return config().head_num() * kv_num() * head_data_size() + config().head_num() * kv_num() * head_param_size(); } - TM_HOST_DEVICE int block_size(int layer_num) const - { - return layer_size() * layer_num; - } - - TM_HOST_DEVICE int k_data(int layer, int head, int token) const - { - return layer_data(layer) + head_data(head) + token_data(token); - } - - TM_HOST_DEVICE int v_data(int 
layer, int head, int token) const - { - return k_data(layer, head, token) + (is_share_kv() ? 0 : head_data_size()); - } - - TM_HOST_DEVICE int k_param(int layer, int head, int token) const + TM_HOST_DEVICE int k_data(int head, int token) const { - return layer_param(layer) + head_param(head) + token_param(token); + return head_data(head) + token_data(token); } - TM_HOST_DEVICE int v_param(int layer, int head, int token) const + TM_HOST_DEVICE int v_data(int head, int token) const { - return k_param(layer, head, token) + (is_share_kv() ? 0 : head_param_size()); + return k_data(head, token) + (is_share_kv() ? 0 : head_data_size()); } - TM_HOST_DEVICE int layer_data(int layer) const + TM_HOST_DEVICE int k_param(int head, int token) const { - return layer * layer_size(); + return head_param(head) + token_param(token); } - TM_HOST_DEVICE int layer_param(int layer) const + TM_HOST_DEVICE int v_param(int head, int token) const { - return layer_data(layer) + head_data(config_.head_num()); + return k_param(head, token) + (is_share_kv() ? 
0 : head_param_size()); } TM_HOST_DEVICE int head_data(int head) const diff --git a/src/turbomind/kernels/attention/block_iterator.h b/src/turbomind/kernels/attention/block_iterator.h index eba99f7473..6a7be2e7d2 100644 --- a/src/turbomind/kernels/attention/block_iterator.h +++ b/src/turbomind/kernels/attention/block_iterator.h @@ -67,11 +67,11 @@ struct BlockIteratorFactory { BlockLayout_ block_layout_; char** block_ptrs_; const int* cu_block_nums_; - int layer_idx_; + int offset_; __device__ auto Create(int batch_idx, int head_idx) { - block::Head head{block_layout_, layer_idx_, head_idx}; + block::Head head{block_layout_, offset_, head_idx}; char** block_ptrs = block_ptrs_ + cu_block_nums_[batch_idx]; @@ -91,7 +91,7 @@ struct CreateCacheIterFactory block_head{block_layout, layer_id, head_idx}; + block::Head block_head{block_layout, cache_block_offset, head_idx}; + + // Read-only prefix: leading whole blocks whose KV is already valid are read for + // context but not re-written. Precompute the boundary in tokens (single multiply). + const int readonly_len = readonly_block_num ? 
readonly_block_num[batch_idx] * block_seq_len : 0; PRAGMA_UNROLL for (int s = 0; s < ITER_S; ++s) { const int qi = offset.y + s * Map::kDeltaS + token_idx; // local offset into `input_length` const int ti = history_len + qi; // timestep local_ti = cp_size.divmod(local_ti_rank, ti); - if (qi < q_len && local_ti_rank == cp_rank) { + if (qi < q_len && ti >= readonly_len && local_ti_rank == cp_rank) { block_head.with((char**)blocks, local_ti, [&](auto k_cache, auto v_cache, T* k_param, T* v_param) { PRAGMA_UNROLL for (int c = 0; c < ITER_C; ++c) { @@ -210,13 +216,14 @@ void invokeProcessKV_v2(char** blocks, const int* cu_q_len, const int* cu_k_len, const int* cu_block_num, + const int* readonly_block_num, const RopeKernelParam& rope_param, int64_t stride_b, int64_t stride_c, int64_t stride_h, int64_t stride_s, int block_seq_len, - int layer_id, + int cache_block_offset, int cp_rank, FastDivmod cp_size, int max_q_len, @@ -251,14 +258,16 @@ void invokeProcessKV_v2(char** blocks, cu_q_len, cu_k_len, cu_block_num, + readonly_block_num, rope_param, stride_b, stride_c, stride_h, stride_s, - layer_id, + cache_block_offset, cp_rank, cp_size, + block_seq_len, block_layout); }; @@ -304,13 +313,14 @@ void invokeProcessKV_v2(char** blocks, const int* cu_q_len, \ const int* cu_k_len, \ const int* cu_block_num, \ + const int* readonly_block_num, \ const RopeKernelParam& rope_param, \ int64_t stride_b, \ int64_t stride_c, \ int64_t stride_h, \ int64_t stride_s, \ int block_seq_len, \ - int layer_id, \ + int cache_block_offset, \ int cp_rank, \ FastDivmod cp_size, \ int max_q_len, \ @@ -336,7 +346,7 @@ __global__ void __launch_bounds__(128) flattenKV_v2(T* k, int64_t stride_c, int64_t stride_h, int64_t stride_s, - int layer_id, + int cache_block_offset, int cp_rank, FastDivmod cp_size, BlockLayout block_layout) @@ -377,7 +387,7 @@ __global__ void __launch_bounds__(128) flattenKV_v2(T* k, blocks += cu_block_num[batch_idx]; - block::Head block_head{block_layout, layer_id, head_idx}; + 
block::Head block_head{block_layout, cache_block_offset, head_idx}; Array param_K[ITER_S]; Array param_V[ITER_S]; @@ -467,7 +477,7 @@ void invokeFlattenKV_v2(T* k, int64_t stride_h, int64_t stride_s, int block_seq_len, - int layer_id, + int cache_block_offset, int cp_rank, FastDivmod cp_size, int max_seq_len, @@ -504,7 +514,7 @@ void invokeFlattenKV_v2(T* k, stride_c, stride_h, stride_s, - layer_id, + cache_block_offset, cp_rank, cp_size, block_layout); @@ -555,7 +565,7 @@ void invokeFlattenKV_v2(T* k, int64_t stride_h, \ int64_t stride_s, \ int block_seq_len, \ - int layer_id, \ + int cache_block_offset, \ int cp_rank, \ FastDivmod cp_size, \ int max_seq_len, \ diff --git a/src/turbomind/kernels/attention/kv_cache_utils_v2.h b/src/turbomind/kernels/attention/kv_cache_utils_v2.h index ca5259dcea..b5c2df8570 100644 --- a/src/turbomind/kernels/attention/kv_cache_utils_v2.h +++ b/src/turbomind/kernels/attention/kv_cache_utils_v2.h @@ -16,13 +16,14 @@ void invokeProcessKV_v2(char** blocks, const int* cu_q_len, const int* cu_k_len, const int* cu_block_num, + const int* readonly_block_num, const RopeKernelParam& rope_param, int64_t stride_b, int64_t stride_c, int64_t stride_h, int64_t stride_s, int block_seq_len, - int layer_id, + int cache_block_offset, int cp_rank, cutlass::FastDivmod cp_size, int max_q_len, @@ -43,13 +44,14 @@ void invokeProcessKV_v2_(const AttentionParams& params) params.cu_q_len, params.cu_k_len, params.block_iter_params.cu_block_nums, + params.readonly_block_num, params.rope_param, 0, // stride b params.stride / params.size_per_head, // stride c 1, // stride h params.stride / params.size_per_head, // stride s params.block_iter_params.block_len, - params.block_iter_params.layer_id, + params.block_iter_params.offset, params.cp_rank, params.cp_size, params.max_q_len, @@ -72,7 +74,7 @@ void invokeFlattenKV_v2(T* k, int64_t stride_h, int64_t stride_s, int block_seq_len, - int layer_id, + int cache_block_offset, int cp_rank, cutlass::FastDivmod cp_size, 
int max_seq_len, @@ -98,7 +100,7 @@ void invokeFlattenKV_v2_(const AttentionParams& params, int sum_k_len) params.linear_iter_params.stride_h / params.size_per_head, 1, params.block_iter_params.block_len, - params.block_iter_params.layer_id, + params.block_iter_params.offset, params.cp_rank, params.cp_size, params.max_k_len, diff --git a/src/turbomind/kernels/attention/test_attention.cu b/src/turbomind/kernels/attention/test_attention.cu index 147821c293..5e06b72281 100644 --- a/src/turbomind/kernels/attention/test_attention.cu +++ b/src/turbomind/kernels/attention/test_attention.cu @@ -112,7 +112,7 @@ void TestBlocks(const thrust::universal_vector& k_cache, // [B, H, S, // [B, S/s, 2, H, s, D] // blocks.resize(batch_size * n_blocks * 2 * kHsD); - blocks.resize(batch_size * n_blocks * layout.block_size(1)); + blocks.resize(batch_size * n_blocks * layout.layer_size()); thrust::fill(blocks.begin(), blocks.end(), NAN); k_ptrs.resize(batch_size * n_blocks + 1); // +1 padding @@ -125,7 +125,7 @@ void TestBlocks(const thrust::universal_vector& k_cache, // [B, H, S, for (size_t i = 0; i < idxs.size(); ++i) { // k_ptrs[i] = blocks.data().get() + idxs[i] * 2 * kHsD; - k_ptrs[i] = blocks.data().get() + idxs[i] * layout.block_size(1); + k_ptrs[i] = blocks.data().get() + idxs[i] * layout.layer_size(); } thrust::universal_vector seq_lens(batch_size); @@ -152,6 +152,7 @@ void TestBlocks(const thrust::universal_vector& k_cache, // [B, H, S, cu_seq_lens.data().get(), cu_seq_lens.data().get(), cu_block_cnts.data().get(), + nullptr, // readonly_block_num (test writes all) RopeKernelParam{}, 2 * head_num * seq_len, 0, diff --git a/src/turbomind/kernels/sampling_kernels.cu b/src/turbomind/kernels/sampling_kernels.cu index 7e9df88bd2..b42d2d0e8a 100644 --- a/src/turbomind/kernels/sampling_kernels.cu +++ b/src/turbomind/kernels/sampling_kernels.cu @@ -18,6 +18,7 @@ __global__ void sampling(const T* logits, const int* indices, const int* kept, curandState_t* curandstate, + const int* 
curandstate_indices, int* output_ids, int* sequence_length, T* sampled_logprobs, @@ -34,7 +35,8 @@ __global__ void sampling(const T* logits, __shared__ float rand_num_s; __shared__ int selected; if (tid == 0) { - rand_num_s = curand_uniform(curandstate + batch_id); + const int state_row = curandstate_indices[batch_id]; + rand_num_s = curand_uniform(curandstate + state_row); } __syncthreads(); @@ -91,6 +93,7 @@ void invokeSampling(SamplingParams& params, cudaStream_t stream) params.indices, params.kept, params.curandstate, + params.curandstate_indices, params.output_ids, params.sequence_length, (T*)params.sampled_logprobs, diff --git a/src/turbomind/kernels/sampling_kernels.h b/src/turbomind/kernels/sampling_kernels.h index 4e8724e815..c3822fc329 100644 --- a/src/turbomind/kernels/sampling_kernels.h +++ b/src/turbomind/kernels/sampling_kernels.h @@ -29,6 +29,7 @@ struct SamplingParams { int* indices; int* kept; curandState_t* curandstate; + const int* curandstate_indices; size_t batch_size; int* output_ids; int* sequence_length; diff --git a/src/turbomind/memory/CMakeLists.txt b/src/turbomind/memory/CMakeLists.txt new file mode 100644 index 0000000000..f5f775d666 --- /dev/null +++ b/src/turbomind/memory/CMakeLists.txt @@ -0,0 +1,12 @@ +# Copyright (c) OpenMMLab. All rights reserved. + +cmake_minimum_required(VERSION 3.11) + +add_library(memory STATIC object.cc stats.cc) +target_link_libraries(memory PRIVATE core) +set_property(TARGET memory PROPERTY POSITION_INDEPENDENT_CODE ON) + +if (BUILD_TEST) + add_executable(test_memory test_memory.cc) + target_link_libraries(test_memory PRIVATE memory core Catch2::Catch2WithMain) +endif () diff --git a/src/turbomind/memory/common.h b/src/turbomind/memory/common.h new file mode 100644 index 0000000000..5dfb922779 --- /dev/null +++ b/src/turbomind/memory/common.h @@ -0,0 +1,46 @@ +// Copyright (c) OpenMMLab. All rights reserved. 
+#pragma once
+
+#include
+#include
+
+namespace turbomind {
+
+struct Allocation;  // defined in object.h; the handle is just a pointer to it
+
+// Opaque 8-byte allocation handle: a pointer to the durable Allocation that
+// owns the data pointer(s). {nullptr} == null / never-allocated. Staleness is
+// detected by snapshotting Allocation::key (see ObjectAllocator::IsValid), not
+// by this pointer.
+struct object_alloc_t {
+    const Allocation* a{};
+    const Allocation* operator->() const noexcept
+    {
+        return a;
+    }
+};
+
+inline bool operator==(object_alloc_t x, object_alloc_t y) noexcept
+{
+    return x.a == y.a;
+}
+inline bool operator!=(object_alloc_t x, object_alloc_t y) noexcept
+{
+    return x.a != y.a;
+}
+inline bool operator<(object_alloc_t x, object_alloc_t y) noexcept
+{
+    return std::less{}(x.a, y.a);
+}
+
+// A pure slab coordinate: which Slab (within a SlabAllocator) and which slot.
+// No identity/stamp -- staleness is owned by the Allocation key.
+struct SlabSlot {
+    int32_t slab_id;
+    int32_t slot_id;
+};
+
+static_assert(sizeof(object_alloc_t) == 8, "object_alloc_t must be 8 bytes");
+static_assert(alignof(object_alloc_t) == 8, "object_alloc_t must be 8-byte aligned");
+
+}  // namespace turbomind
diff --git a/src/turbomind/memory/object.cc b/src/turbomind/memory/object.cc
new file mode 100644
index 0000000000..ef7cf9bc28
--- /dev/null
+++ b/src/turbomind/memory/object.cc
@@ -0,0 +1,373 @@
+#include "src/turbomind/memory/object.h"
+#include "src/turbomind/memory/slab.h"
+
+#include
+#include
+
+namespace turbomind {
+
+// ---- AllocationTable: the handle store (one backend) ----
+// The handle (object_alloc_t) wraps a const Allocation* into an address-stable
+// std::deque; resolving is a field read. Staleness is a monotonic per-Allocation
+// key: Acquire assigns the next key, Release zeroes it; a consumer that snapshots
+// the key detects reuse/free via ObjectAllocator::IsValid.
+struct AllocationTable {
+    std::deque pool_;
+    std::vector free_;
+    uint64_t next_key_{1};  // never 0, never reused -> ABA-proof
+
+    Allocation* Acquire()
+    {
+        Allocation* a;
+        if (free_.empty()) {
+            a = &pool_.emplace_back();
+        }
+        else {
+            a = free_.back();
+            free_.pop_back();
+        }
+        a->key = next_key_++;
+        return a;
+    }
+
+    void Release(Allocation* a)
+    {
+        a->key = 0;
+        a->n = 0;
+        a->base0 = nullptr;
+        a->bases.clear();  // keep capacity for reuse
+        a->slots.clear();
+        free_.push_back(a);
+    }
+};
+
+// ---- Static registration descriptors ----
+
+struct MemberSpec {
+    int slab_index;
+    int count;
+};
+
+struct ObjectSpec {
+    std::vector members;
+    int total_parts{0};
+    std::vector part_slab;  // positional part -> SlabAllocator index
+};
+
+// ---- MemoryState: capacity layer (the only thing a trial copies) ----
+
+struct MemoryState {
+    static constexpr size_t kPageSize = 32 << 20UL;
+    static constexpr size_t kMinSlabSize = 32 << 20UL;
+    static constexpr size_t kMaxSlabSize = 1 << 30UL;
+    static constexpr size_t kMaxEmptySlabs = 0;
+    static constexpr float kUtilThresh = .95f;
+
+    Buffer mem_;
+    PageAllocator pages_;
+    std::vector slabs_;
+
+    explicit MemoryState(Buffer memory):
+        mem_{std::move(memory)}, pages_{mem_.raw_data(), (size_t)mem_.byte_size(), kPageSize}
+    {
+    }
+
+    int add_slab_class(size_t aligned)
+    {
+        const int slab = static_cast(slabs_.size());
+        slabs_.emplace_back(aligned, kMinSlabSize, kMaxSlabSize, kUtilThresh, kMaxEmptySlabs);
+        return slab;
+    }
+
+    // Reserve spec.total_parts slots into `out`, atomic over members.
+ bool reserve(const ObjectSpec& spec, SlabSlot* out) + { + int filled = 0; + for (const MemberSpec& m : spec.members) { + const int got = slabs_[m.slab_index].allocate(out + filled, m.count, pages_); + if (got < m.count) { + slabs_[m.slab_index].deallocate(out + filled, got, pages_); + int back = 0; + for (const MemberSpec& mm : spec.members) { + if (&mm == &m) { + break; + } + slabs_[mm.slab_index].deallocate(out + back, mm.count, pages_); + back += mm.count; + } + return false; + } + filled += m.count; + } + return true; + } + + void free(const ObjectSpec& spec, const SlabSlot* slots) + { + int off = 0; + for (const MemberSpec& m : spec.members) { + slabs_[m.slab_index].deallocate(slots + off, m.count, pages_); + off += m.count; + } + } + + char* address_of(int part_slab, SlabSlot s) const + { + return slabs_[part_slab].AddressOf(s); + } + void set_owner(int part_slab, SlabSlot s, object_alloc_t o) + { + slabs_[part_slab].set_owner(s, o); + } +}; + +// ---- ObjectAllocator::Impl ---- + +struct ObjectAllocator::Impl { + MemoryState space_; + AllocationTable table_; + std::vector objects_; + std::unordered_map slab_of_size_; + std::unordered_map simple_id_; + + explicit Impl(Buffer memory): space_{std::move(memory)} {} + + static size_t align_up(size_t size, size_t align) + { + return (size + align - 1) / align * align; + } + + int slab_for_aligned_size(size_t aligned) + { + if (auto it = slab_of_size_.find(aligned); it != slab_of_size_.end()) { + return it->second; + } + const int slab = space_.add_slab_class(aligned); + slab_of_size_[aligned] = slab; + return slab; + } + + int register_simple(size_t size, size_t align) + { + const size_t aligned = align_up(size, align); + if (auto it = simple_id_.find(aligned); it != simple_id_.end()) { + return it->second; + } + const int slab = slab_for_aligned_size(aligned); + const int id = static_cast(objects_.size()); + ObjectSpec spec; + spec.members = {{slab, 1}}; + spec.total_parts = 1; + spec.part_slab = {slab}; + 
objects_.push_back(std::move(spec)); + simple_id_[aligned] = id; + return id; + } + + int register_composite(const std::vector>& parts) + { + TM_CHECK(!parts.empty()) << "composite must have at least one member"; + if (parts.size() == 1 && parts[0][2] == 1) { + return register_simple(parts[0][0], parts[0][1]); + } + ObjectSpec spec; + for (const auto& p : parts) { + const size_t size = p[0], align = p[1], count = p[2]; + TM_CHECK_GT(count, 0u); + const size_t aligned = align_up(size, align); + const int slab = slab_for_aligned_size(aligned); + spec.members.push_back({slab, static_cast(count)}); + for (size_t k = 0; k < count; ++k) { + spec.part_slab.push_back(slab); + } + spec.total_parts += static_cast(count); + } + const int id = static_cast(objects_.size()); + objects_.push_back(std::move(spec)); + return id; + } + + // Single-object fast path. `index` is trusted (it originates from a + // registered CacheBlock::object_id) -- no check_index on the hot path. + object_alloc_t allocate(int index) + { + const ObjectSpec& spec = objects_[index]; + Allocation* a = table_.Acquire(); + + if (spec.total_parts == 1) { // single-part fast path: no loops, no heap vectors + const int slab = spec.part_slab[0]; + if (space_.slabs_[slab].allocate(&a->slot0, 1, space_.pages_) != 1) { + table_.Release(a); + return {}; + } + a->n = 1; + a->base0 = space_.slabs_[slab].AddressOf(a->slot0); + space_.slabs_[slab].set_owner(a->slot0, object_alloc_t{a}); + return {a}; + } + + a->n = spec.total_parts; // composite + a->bases.resize(a->n); + a->slots.resize(a->n); + if (!space_.reserve(spec, a->slots.data())) { + table_.Release(a); + return {}; + } + for (int p = 0; p < a->n; ++p) { + a->bases[p] = space_.address_of(spec.part_slab[p], a->slots[p]); + space_.set_owner(spec.part_slab[p], a->slots[p], object_alloc_t{a}); + } + return {a}; + } + + void deallocate(int index, object_alloc_t h) + { + const ObjectSpec& spec = objects_[index]; + Allocation* a = const_cast(h.a); // we own it in 
pool_ + if (spec.total_parts == 1) { + space_.slabs_[spec.part_slab[0]].deallocate(&a->slot0, 1, space_.pages_); + } + else { + space_.free(spec, a->slots.data()); + } + table_.Release(a); + } + + // Batch forms (used by the unit tests): thin loops over the single-object core. + size_t allocate(int index, object_alloc_t* out, size_t count) + { + for (size_t k = 0; k < count; ++k) { + out[k] = allocate(index); + if (!out[k].a) { + return k; // [0,k) placed; matches the partial-OOM contract + } + } + return count; + } + + void deallocate(int index, const object_alloc_t* objs, size_t count) + { + for (size_t k = 0; k < count; ++k) { + deallocate(index, objs[k]); + } + } + + int part_count(int index) const + { + return objects_[index].total_parts; + } + + size_t part_bytes(int index, int part) const + { + return space_.slabs_[objects_[index].part_slab[part]].object_size(); + } + + MemoryStats stats() const + { + MemoryStats s; + s.page = space_.pages_.stats(); + s.region_bytes = static_cast<size_t>(space_.mem_.byte_size()); + s.live_allocations = table_.pool_.size() - table_.free_.size(); + s.slabs.reserve(space_.slabs_.size()); + for (const SlabAllocator& slab : space_.slabs_) { + s.slabs.push_back(slab.stats()); + } + return s; + } +}; + +// ---- ScratchAllocator::Impl ---- + +struct ScratchAllocator::Impl { + MemoryState space_; // deep copy of source capacity (shares the backing Buffer) + const ObjectAllocator* src_; // borrow: the registry (objects_) lives in src's Impl + std::vector<SlabSlot> scratch_; + + Impl(MemoryState space, const ObjectAllocator* src): space_{std::move(space)}, src_{src} {} +}; + +// ---- ObjectAllocator public surface ---- + +ObjectAllocator::~ObjectAllocator() = default; +ObjectAllocator::ObjectAllocator() = default; + +ObjectAllocator::ObjectAllocator(Buffer region): impl_{std::make_unique<Impl>(std::move(region))} {} + +ObjectAllocator::ObjectAllocator(ObjectAllocator&&) noexcept = default; +ObjectAllocator& ObjectAllocator::operator=(ObjectAllocator&&) 
noexcept = default; + +int ObjectAllocator::Register(size_t size, size_t alignment) +{ + return impl_->register_simple(size, alignment); +} +int ObjectAllocator::Register(const std::vector>& parts) +{ + return impl_->register_composite(parts); +} +object_alloc_t ObjectAllocator::Allocate(int index) +{ + return impl_->allocate(index); +} +void ObjectAllocator::Deallocate(int index, object_alloc_t handle) +{ + impl_->deallocate(index, handle); +} +size_t ObjectAllocator::Allocate(int index, object_alloc_t* objects, size_t count) +{ + return impl_->allocate(index, objects, count); +} +void ObjectAllocator::Deallocate(int index, const object_alloc_t* objects, size_t count) +{ + impl_->deallocate(index, objects, count); +} +int ObjectAllocator::PartCount(int index) const +{ + return impl_->part_count(index); +} +size_t ObjectAllocator::PartBytes(int index, int part) const +{ + return impl_->part_bytes(index, part); +} +bool ObjectAllocator::IsValid(object_alloc_t handle, uint64_t saved_key) const +{ + return handle.a != nullptr && handle->key == saved_key; +} + +MemoryStats ObjectAllocator::Stats() const +{ + return impl_->stats(); +} + +// ---- ScratchAllocator public surface ---- + +ScratchAllocator::ScratchAllocator(const ObjectAllocator& src): impl_{std::make_unique(src.impl_->space_, &src)} +{ +} + +ScratchAllocator::ScratchAllocator(ScratchAllocator&&) noexcept = default; +ScratchAllocator& ScratchAllocator::operator=(ScratchAllocator&&) noexcept = default; +ScratchAllocator::ScratchAllocator() = default; +ScratchAllocator::~ScratchAllocator() = default; + +bool ScratchAllocator::Allocate(int object_id) +{ + const ObjectSpec& spec = impl_->src_->impl_->objects_[object_id]; // friend reaches registry + if (spec.total_parts == 1) { + SlabSlot s; + return impl_->space_.slabs_[spec.part_slab[0]].allocate(&s, 1, impl_->space_.pages_) == 1; + } + impl_->scratch_.resize(spec.total_parts); + return impl_->space_.reserve(spec, impl_->scratch_.data()); +} +void 
ScratchAllocator::Evict(int object_id, const Allocation* committed) +{ + const ObjectSpec& spec = impl_->src_->impl_->objects_[object_id]; + if (committed->n == 1) { + impl_->space_.slabs_[spec.part_slab[0]].deallocate(&committed->slot0, 1, impl_->space_.pages_); + } + else { + impl_->space_.free(spec, committed->slots.data()); + } +} + +} // namespace turbomind diff --git a/src/turbomind/memory/object.h b/src/turbomind/memory/object.h new file mode 100644 index 0000000000..6e4ae00dc7 --- /dev/null +++ b/src/turbomind/memory/object.h @@ -0,0 +1,105 @@ +#pragma once + +#include "src/turbomind/core/core.h" +#include "src/turbomind/memory/common.h" +#include "src/turbomind/memory/stats.h" + +#include +#include +#include + +namespace turbomind { + +// The durable per-allocation entry. One per allocation (simple = 1 part). The +// handle (object_alloc_t) is a pointer to this. A compactor rewrites base0 / +// bases[] in place; every cached handle then sees the new address. +struct Allocation { + uint64_t key{}; // unique, monotonic; set on Acquire, 0 while free (ABA stale check) + int n{0}; // part count; 0 == free + + // Inline storage for the dominant single-part (simple) object: no heap. + char* base0{}; + SlabSlot slot0{}; + + // Used ONLY for composites (n > 1), holding all n parts. Empty for simple. + std::vector bases; + std::vector slots; + + char* base(int p) const noexcept + { + return n == 1 ? base0 : bases[p]; + } + int part_count() const noexcept + { + return n; + } +}; + +class ObjectAllocator { +public: + ~ObjectAllocator(); + ObjectAllocator(); + + explicit ObjectAllocator(Buffer region); + + // No longer copyable: trials use ScratchAllocator (copies only capacity). 
+ ObjectAllocator(const ObjectAllocator&) = delete; + ObjectAllocator& operator=(const ObjectAllocator&) = delete; + ObjectAllocator(ObjectAllocator&&) noexcept; + ObjectAllocator& operator=(ObjectAllocator&&) noexcept; + + int Register(size_t size, size_t alignment); + int Register(const std::vector>& parts); + + // Single-object fast path (primary; production count is always 1). + [[nodiscard]] object_alloc_t Allocate(int index); // {nullptr} on OOM + void Deallocate(int index, object_alloc_t handle); + + // Batch forms (thin loops over the single-object core; used by tests). + [[nodiscard]] size_t Allocate(int index, object_alloc_t* objects, size_t count); + void Deallocate(int index, const object_alloc_t* objects, size_t count); + + // Registry queries (need the index, not the handle). + int PartCount(int index) const; + size_t PartBytes(int index, int part) const; + + // ABA-safe stale check: handle.a && handle->key == saved_key. + [[nodiscard]] bool IsValid(object_alloc_t handle, uint64_t saved_key) const; + + MemoryStats Stats() const; + +private: + friend class ScratchAllocator; + struct Impl; + std::unique_ptr impl_; +}; + +// Admission probe: owns a copy of the committed allocator's capacity +// (MemoryState) and borrows the live allocator for object layout and to read a +// committed Allocation's slot list. Never touches an AllocationTable. +// +// LIFETIME: a ScratchAllocator borrows its source ObjectAllocator by pointer. +// It MUST NOT outlive that source, and the source MUST NOT be moved-from while +// the scratch is alive. 
+class ScratchAllocator { +public: + ScratchAllocator(); + explicit ScratchAllocator(const ObjectAllocator& src); + ~ScratchAllocator(); + + ScratchAllocator(const ScratchAllocator&) = delete; + ScratchAllocator& operator=(const ScratchAllocator&) = delete; + ScratchAllocator(ScratchAllocator&&) noexcept; + ScratchAllocator& operator=(ScratchAllocator&&) noexcept; + + // Reserve one object's slots in the capacity copy; false on OOM. + [[nodiscard]] bool Allocate(int object_id); + // Free a committed Allocation's slots in the capacity copy. + void Evict(int object_id, const Allocation* committed); + +private: + struct Impl; + std::unique_ptr impl_; +}; + +} // namespace turbomind diff --git a/src/turbomind/memory/page.h b/src/turbomind/memory/page.h new file mode 100644 index 0000000000..0c61c426e5 --- /dev/null +++ b/src/turbomind/memory/page.h @@ -0,0 +1,249 @@ +#pragma once + +#include "src/turbomind/core/check.h" +#include "src/turbomind/memory/stats.h" + +#include +#include +#include +#include +#include +#include + +#if defined(_MSC_VER) && !defined(__clang__) +#include +#endif + +namespace turbomind { + +#if defined(__GNUC__) || defined(__clang__) +inline int ceil_log2(int x) +{ + if (x <= 1) + return 0; + return 32 - __builtin_clz(static_cast(x - 1)); +} +#elif defined(_MSC_VER) +inline int ceil_log2(int x) +{ + if (x <= 1) + return 0; + unsigned long index; + _BitScanReverse(&index, static_cast(x - 1)); + return static_cast(index) + 1; +} +#else +#error "ceil_log2: unsupported compiler, no leading-zero-count intrinsic available" +#endif + +inline int ceil_pow2(int x) +{ + if (x <= 1) + return 1; + return 1 << ceil_log2(x); +} + +class PageAllocator { +public: + PageAllocator(void* base, size_t size, size_t page_size): + base_{align_base(base, page_size)}, + size_{align_size(base, base_, size)}, + page_size_{page_size}, + page_size_log2_{ceil_log2(page_size)}, + pages_{static_cast(size_ / page_size)}, // + max_order_{ceil_log2(pages_)}, + 
max_pages_{ceil_pow2(pages_)}, + nodes_(max_pages_ + max_order_ + 1) + { + TM_CHECK_NOTNULL(base_); + for (int k = 0; k <= max_order_; ++k) { + list_init(k); + } + build(); + } + + PageAllocator(const PageAllocator&) = default; + PageAllocator& operator=(const PageAllocator&) = default; + + void* allocate(size_t size) + { + const int order = ceil_log2(get_pages(size)); + + for (int k = order; k <= max_order_; ++k) { + const int idx = head(k); + if (idx != sentinel(k)) { + erase(idx); + split(idx, k, order); + return get_pointer(idx); + } + } + + return nullptr; + } + + void deallocate(void* addr, int size) + { + add(get_page_idx(addr), ceil_log2(get_pages(size))); + } + + PageStats stats() const + { + PageStats s; + s.pages = pages_; + s.page_size = page_size_; + s.max_order = max_order_; + s.free_blocks_by_order.assign(max_order_ + 1, 0); + int free_pages = 0; + for (int k = 0; k <= max_order_; ++k) { + int count = 0; + for (int idx = head(k); idx != sentinel(k); idx = nodes_[idx].next) { + ++count; + } + s.free_blocks_by_order[k] = count; + free_pages += count << k; + } + s.free_pages = free_pages; + s.used_pages = pages_ - free_pages; + return s; + } + +private: + // Round `base` up to the next multiple of `page_size`. Also validates that + // `page_size` is a power of two so that the bit-mask below is well-defined. + static char* align_base(void* base, size_t page_size) + { + TM_CHECK_EQ(page_size, ceil_pow2(page_size)); + const auto p = reinterpret_cast(base); + const auto mask = static_cast(page_size) - 1; + return reinterpret_cast((p + mask) & ~mask); + } + + // Subtract the padding consumed by aligning `base` from the usable region size. 
+ static size_t align_size(void* base, void* aligned, size_t size) + { + const size_t pad = static_cast(aligned) - static_cast(base); + TM_CHECK_LE(pad, size); + return size - pad; + } + + void build() + { + for (int i = 0; i < pages_; ++i) { + add(i, 0); + } + } + + int get_page_idx(void* addr) + { + auto offset = static_cast(addr) - base_; + TM_CHECK(0 <= offset && offset < size_); + return static_cast(offset >> page_size_log2_); + } + + void* get_pointer(int idx) + { + return base_ + (static_cast(idx) << page_size_log2_); + } + + int get_pages(size_t size) + { + return static_cast((size + page_size_ - 1) >> page_size_log2_); + } + + // Sentinel header node index for the order-k free list. Lives at + // `max_pages_ + k` in nodes_, just past the real-and-ghost-page region. + // Sentinels are reached via prev/next links only; their `order` field + // is unread and stays at the default -1. + int sentinel(int k) const noexcept + { + return max_pages_ + k; + } + + int head(int k) const noexcept + { + return nodes_[sentinel(k)].next; + } + + void list_init(int k) noexcept + { + const int s = sentinel(k); + nodes_[s].prev = s; + nodes_[s].next = s; + } + + // Branchless splice. Also writes nodes_[idx].order = k -- the link + // helpers own the order field's "in which list" semantics, which add() + // and split() rely on (see erase()). + void push_front(int k, int idx) noexcept + { + const int s = sentinel(k); + const int next = nodes_[s].next; + nodes_[next].prev = idx; + nodes_[idx].prev = s; + nodes_[idx].next = next; + nodes_[idx].order = k; + nodes_[s].next = idx; + } + + // Branchless detach. Resets nodes_[idx].order = -1 so that add()'s + // `nodes_[buddy].order == k` coalesce check and split()'s + // `nodes_[sibling].order == -1` assertion remain correct. The idx + // node's prev/next are left dangling -- nothing reaches it via link + // chains once it is out of every list. 
+ void erase(int idx) noexcept + { + const int prev = nodes_[idx].prev; + const int next = nodes_[idx].next; + nodes_[prev].next = next; + nodes_[next].prev = prev; + nodes_[idx].order = -1; + } + + void split(int idx, int k, int order) + { + if (k > order) { + const int sibling = idx ^ (1 << (k - 1)); + TM_CHECK(nodes_[sibling].order == -1); + push_front(k - 1, sibling); + return split(idx, k - 1, order); + } + } + + int add(int idx, int k) + { + if (k < max_order_) { + const int buddy = idx ^ (1 << k); + if (nodes_[buddy].order == k) { + erase(buddy); + return add(idx & ~(1 << k), k + 1); + } + } + push_front(k, idx); + return k; + } + +private: + char* base_; + size_t size_; + size_t page_size_; + + int page_size_log2_; + + int pages_; + int max_order_; + int max_pages_; + + struct Node { + int prev = -1; + int next = -1; + int order = -1; // -1 = not in any free list + }; + + // Layout: + // [0, pages_) real pages (may be in a free list) + // [pages_, max_pages_) ghost pages (never added; order stays -1) + // [max_pages_, max_pages_ + max_order_] sentinel header nodes (one per order) + std::vector nodes_; +}; + +} // namespace turbomind diff --git a/src/turbomind/memory/slab.h b/src/turbomind/memory/slab.h new file mode 100644 index 0000000000..c1badc02b3 --- /dev/null +++ b/src/turbomind/memory/slab.h @@ -0,0 +1,348 @@ +// Copyright (c) OpenMMLab. All rights reserved. 
+#pragma once + +#include "src/turbomind/core/logger.h" +#include "src/turbomind/memory/common.h" +#include "src/turbomind/memory/page.h" +#include "src/turbomind/memory/stats.h" + +#include +#include +#include +#include + +namespace turbomind { + +class Slab { +public: + Slab() = default; + + Slab(int slab_id, char* base, size_t size, int object_size, int object_count): + slab_id_{slab_id}, + base_{base}, + size_{size}, + object_size_{object_size}, + object_count_{object_count}, + slot_owner_(object_count) // default object_alloc_t{} == {nullptr} + { + free_list_.reserve(object_count_); + for (int i = 0; i < object_count_; ++i) { + free_list_.push_back(i); + } + } + + // Compiler-generated copy/move/dtor. + + int allocate(SlabSlot* objects, size_t count) + { + count = std::min(count, free_list_.size()); + for (size_t i = 0; i < count; ++i) { + const int slot_id = free_list_.back(); + free_list_.pop_back(); + objects[i] = {slab_id_, slot_id}; + } + return static_cast(count); + } + + void deallocate(const SlabSlot* objects, size_t count) + { + for (size_t i = 0; i < count; ++i) { + const int slot_id = objects[i].slot_id; + slot_owner_[slot_id] = {}; // clear reverse link + free_list_.push_back(slot_id); + } + } + + void set_owner(int slot_id, object_alloc_t o) noexcept + { + slot_owner_[slot_id] = o; + } + object_alloc_t owner(int slot_id) const noexcept + { + return slot_owner_[slot_id]; + } + + int slab_id() const noexcept + { + return slab_id_; + } + bool is_full() const noexcept + { + return free_list_.empty(); + } + bool is_empty() const noexcept + { + return free_list_.size() == static_cast(object_count_); + } + bool is_partial() const noexcept + { + return !is_full() && !is_empty(); + } + size_t n_free() const noexcept + { + return free_list_.size(); + } + char* base() const noexcept + { + return base_; + } + int object_size() const noexcept + { + return object_size_; + } + int object_count() const noexcept + { + return object_count_; + } + + // Intrusive 
list links -- directly read/written by SlabAllocator's list helpers. + int prev_id_ = -1; + int next_id_ = -1; + +private: + int slab_id_ = -1; + char* base_ = nullptr; + size_t size_ = 0; + int object_size_ = 0; + int object_count_ = 0; + std::vector free_list_; + std::vector slot_owner_; // slot_id -> owning token (reverse link) +}; + +class SlabAllocator { +public: + SlabAllocator(size_t object_size, size_t min_size, size_t max_size, float threshold, size_t max_empty_slabs): + object_size_{object_size}, max_empty_slabs_{max_empty_slabs} + { + auto get_ratio = [&](size_t size) -> float { + auto quantized = size / object_size * object_size; + return quantized / static_cast(size); + }; + + // smallest power-of-2 size that satisfies both min_size and object_size + size_t size = ceil_pow2(std::max(min_size, object_size)); + + // double slab size until utilization >= threshold + float ratio = get_ratio(size); + while (ratio < threshold && size < max_size) { + size <<= 1; + ratio = get_ratio(size); + } + + slab_size_ = size; + object_count_ = size / object_size; + + TM_LOG_WARN( + "slab_size {}, object_size {}, object_count {}, ratio {}", slab_size_, object_size_, object_count_, ratio); + + slabs_.resize(kFirstSlabId); // three default-constructed sentinel Slabs + list_init(kFull); + list_init(kPartial); + list_init(kEmpty); + } + + // Compiler-generated copy/move/dtor -- pure value type. 
+ + int allocate(SlabSlot* objects, size_t count, PageAllocator& page_alloc) + { + ptrdiff_t remain = count; + + for (int id = head(kPartial); id != kPartial && remain > 0;) { + Slab& slab = slabs_[id]; + int next = slab.next_id_; + remain -= slab.allocate(objects + (count - remain), remain); + if (slab.is_full()) + splice(kFull, kPartial, id); + id = next; + } + + while (remain > static_cast(sizes_[kEmpty] * object_count_)) { + if (!create_empty(page_alloc)) + break; + } + + for (int id = head(kEmpty); id != kEmpty && remain > 0;) { + Slab& slab = slabs_[id]; + int next = slab.next_id_; + remain -= slab.allocate(objects + (count - remain), remain); + if (slab.is_full()) + splice(kFull, kEmpty, id); + else + splice(kPartial, kEmpty, id); + id = next; + } + + return static_cast(count - remain); + } + + int deallocate(const SlabSlot* objects, size_t count, PageAllocator& page_alloc) + { + for (size_t i = 0; i < count;) { + const int slab_id = objects[i].slab_id; + + size_t j = i + 1; + while (j < count && objects[j].slab_id == slab_id) { + ++j; + } + + Slab& slab = slabs_[slab_id]; + const bool was_full = slab.is_full(); + slab.deallocate(objects + i, j - i); + if (was_full) + splice_front(kPartial, kFull, slab_id); + if (slab.is_empty()) + splice(kEmpty, kPartial, slab_id); + + i = j; + } + + while (static_cast(sizes_[kEmpty]) > max_empty_slabs_) { + int slab_id = head(kEmpty); + erase(kEmpty, slab_id); + page_alloc.deallocate(slabs_[slab_id].base(), slab_size_); + free_ids_.push_back(slab_id); + } + + return static_cast(count); + } + + char* AddressOf(SlabSlot h) const noexcept + { + const Slab& slab = slabs_[h.slab_id]; + return slab.base() + static_cast(h.slot_id) * slab.object_size(); + } + + void set_owner(SlabSlot h, object_alloc_t o) noexcept + { + slabs_[h.slab_id].set_owner(h.slot_id, o); + } + object_alloc_t owner(SlabSlot h) const noexcept + { + return slabs_[h.slab_id].owner(h.slot_id); + } + + size_t object_size() const noexcept + { + return 
object_size_; + } + + SlabStats stats() const + { + SlabStats s{}; + s.object_size = object_size_; + s.slab_size = slab_size_; + s.object_count = object_count_; + s.n_full = sizes_[kFull]; + s.n_partial = sizes_[kPartial]; + s.n_empty = sizes_[kEmpty]; + s.n_slabs = s.n_full + s.n_partial + s.n_empty; + s.total_objects = static_cast(s.n_slabs) * object_count_; + + size_t free_objects = static_cast(s.n_empty) * object_count_; + for (int id = head(kPartial); id != kPartial; id = slabs_[id].next_id_) { + free_objects += slabs_[id].n_free(); + } + s.free_objects = free_objects; + s.used_objects = s.total_objects - free_objects; + return s; + } + +private: + static constexpr int kFull = 0; + static constexpr int kPartial = 1; + static constexpr int kEmpty = 2; + static constexpr int kFirstSlabId = 3; + + int head(int sentinel) const noexcept + { + return slabs_[sentinel].next_id_; + } + int tail(int sentinel) const noexcept + { + return slabs_[sentinel].prev_id_; + } + + void list_init(int sentinel) noexcept + { + slabs_[sentinel].prev_id_ = sentinel; + slabs_[sentinel].next_id_ = sentinel; + } + + void push_back(int sentinel, int id) noexcept + { + const int prev = slabs_[sentinel].prev_id_; + slabs_[prev].next_id_ = id; + slabs_[id].prev_id_ = prev; + slabs_[id].next_id_ = sentinel; + slabs_[sentinel].prev_id_ = id; + ++sizes_[sentinel]; + } + + void push_front(int sentinel, int id) noexcept + { + const int next = slabs_[sentinel].next_id_; + slabs_[next].prev_id_ = id; + slabs_[id].prev_id_ = sentinel; + slabs_[id].next_id_ = next; + slabs_[sentinel].next_id_ = id; + ++sizes_[sentinel]; + } + + void erase(int sentinel, int id) noexcept + { + const int prev = slabs_[id].prev_id_; + const int next = slabs_[id].next_id_; + slabs_[prev].next_id_ = next; + slabs_[next].prev_id_ = prev; + --sizes_[sentinel]; + } + + void splice(int to, int from, int id) noexcept + { + erase(from, id); + push_back(to, id); + } + + void splice_front(int to, int from, int id) noexcept + { + 
erase(from, id); + push_front(to, id); + } + + // INVARIANT: callers (allocate, deallocate) MUST NOT hold a Slab& across + // a call to create_empty -- slabs_.emplace_back may reallocate the vector. + bool create_empty(PageAllocator& page_alloc) + { + auto memory = page_alloc.allocate(slab_size_); + if (!memory) + return false; + + int slab_id; + if (!free_ids_.empty()) { + slab_id = free_ids_.back(); + free_ids_.pop_back(); + // Move-assign over the ghost entry. Safe: Slab's move-assign only + // copies scalars and swaps containers -- it never dereferences base_. + slabs_[slab_id] = Slab{ + slab_id, (char*)memory, slab_size_, static_cast(object_size_), static_cast(object_count_)}; + } + else { + slab_id = static_cast(slabs_.size()); + slabs_.emplace_back( + slab_id, (char*)memory, slab_size_, static_cast(object_size_), static_cast(object_count_)); + } + push_back(kEmpty, slab_id); + return true; + } + + size_t object_size_ = 0; + size_t object_count_ = 0; + size_t slab_size_ = 0; + size_t max_empty_slabs_ = 0; + + std::vector slabs_; // [0..2] sentinels; [3..] real slabs (some may be ghosts) + std::vector free_ids_; // recycled real slab_ids (always >= 3) + std::array sizes_{}; // sizes_[kFull|kPartial|kEmpty] +}; + +} // namespace turbomind diff --git a/src/turbomind/memory/stats.cc b/src/turbomind/memory/stats.cc new file mode 100644 index 0000000000..0ac69b3c49 --- /dev/null +++ b/src/turbomind/memory/stats.cc @@ -0,0 +1,67 @@ +#include "src/turbomind/memory/stats.h" + +#include +#include + +namespace turbomind { + +namespace { + +std::string human_bytes(size_t n) +{ + constexpr const char* kU[] = {"B", "KiB", "MiB", "GiB", "TiB"}; + double v = static_cast(n); + int i = 0; + while (v >= 1024.0 && i < 4) { + v /= 1024.0; + ++i; + } + return fmt::format("{:.2f}{}", v, kU[i]); +} + +float pct(size_t used, size_t total) +{ + return total ? 
100.f * static_cast(used) / static_cast(total) : 0.f; +} + +} // namespace + +std::string FormatMemoryStats(const MemoryStats& s) +{ + fmt::memory_buffer buf; + + fmt::format_to(std::back_inserter(buf), + "[cache] region={} live_alloc={} | pages {}/{} used ({:.2f}%) page_size={}", + human_bytes(s.region_bytes), + s.live_allocations, + s.page.used_pages, + s.page.pages, + pct(s.page.used_pages, s.page.pages), + human_bytes(s.page.page_size)); + + fmt::format_to(std::back_inserter(buf), "\n[cache] free pages by order:"); + for (int k = 0; k <= s.page.max_order; ++k) { + if (s.page.free_blocks_by_order[k]) { + fmt::format_to(std::back_inserter(buf), " o{}={}", k, s.page.free_blocks_by_order[k]); + } + } + + for (const SlabStats& sl : s.slabs) { + fmt::format_to(std::back_inserter(buf), + "\n[cache] slab obj={} slab={} cap={} slabs={}(F{}/P{}/E{}) obj {}/{} used ({:.2f}%)", + human_bytes(sl.object_size), + human_bytes(sl.slab_size), + sl.object_count, + sl.n_slabs, + sl.n_full, + sl.n_partial, + sl.n_empty, + sl.used_objects, + sl.total_objects, + pct(sl.used_objects, sl.total_objects)); + } + + return fmt::to_string(buf); +} + +} // namespace turbomind diff --git a/src/turbomind/memory/stats.h b/src/turbomind/memory/stats.h new file mode 100644 index 0000000000..6fdee7d7a7 --- /dev/null +++ b/src/turbomind/memory/stats.h @@ -0,0 +1,41 @@ +#pragma once + +#include +#include +#include + +namespace turbomind { + +struct PageStats { + int pages; // real pages in the region + size_t page_size; + int free_pages; // sum over orders of (count << order) + int used_pages; // pages - free_pages + int max_order; + std::vector free_blocks_by_order; // index = buddy order, value = # free blocks +}; + +struct SlabStats { + size_t object_size; // bytes per object (the size class) + size_t slab_size; // bytes per slab + size_t object_count; // objects per slab + int n_full; + int n_partial; + int n_empty; + int n_slabs; // full + partial + empty + size_t total_objects; // n_slabs * 
object_count + size_t used_objects; + size_t free_objects; +}; + +struct MemoryStats { + PageStats page; + std::vector slabs; + size_t live_allocations; // pool_.size() - free_.size() + size_t region_bytes; +}; + +// Renders MemoryStats as a verbose multi-line string (defined in stats.cc). +std::string FormatMemoryStats(const MemoryStats& s); + +} // namespace turbomind diff --git a/src/turbomind/memory/test_memory.cc b/src/turbomind/memory/test_memory.cc new file mode 100644 index 0000000000..51a7824858 --- /dev/null +++ b/src/turbomind/memory/test_memory.cc @@ -0,0 +1,1120 @@ +// Copyright (c) OpenMMLab. All rights reserved. + +#include "src/turbomind/memory/object.h" +#include "src/turbomind/memory/page.h" +#include "src/turbomind/memory/slab.h" + +#include "src/turbomind/core/core.h" + +#include + +#include +#include +#include +#include +#include +#include + +using namespace turbomind; + +namespace { + +// RAII for a heap region whose base is aligned to `alignment` bytes. +struct AlignedRegion { + AlignedRegion(size_t alignment, size_t bytes): bytes_{bytes} + { + // std::aligned_alloc requires bytes to be a multiple of alignment. + size_t rounded = (bytes + alignment - 1) / alignment * alignment; + ptr_ = std::aligned_alloc(alignment, rounded); + TM_CHECK_NOTNULL(ptr_); + } + ~AlignedRegion() + { + std::free(ptr_); + } + AlignedRegion(const AlignedRegion&) = delete; + AlignedRegion& operator=(const AlignedRegion&) = delete; + + void* data() const + { + return ptr_; + } + size_t size() const + { + return bytes_; + } + +private: + void* ptr_; + size_t bytes_; +}; + +inline bool ranges_overlap(const void* a, size_t na, const void* b, size_t nb) +{ + auto pa = reinterpret_cast(a); + auto pb = reinterpret_cast(b); + return pa < pb + nb && pb < pa + na; +} + +// Bytes the buddy allocator actually reserves for a request of `size` bytes given `page_size`. 
+inline size_t reserved_bytes(size_t size, size_t page_size) +{ + int pages = static_cast((size + page_size - 1) / page_size); + return static_cast(ceil_pow2(pages)) * page_size; +} + +} // namespace + +TEST_CASE("memory test binary builds", "[memory][smoke]") +{ + REQUIRE(true); +} + +TEST_CASE("ceil_log2 / ceil_pow2", "[memory][utils]") +{ + REQUIRE(ceil_log2(0) == 0); + REQUIRE(ceil_log2(1) == 0); + REQUIRE(ceil_log2(2) == 1); + REQUIRE(ceil_log2(3) == 2); + REQUIRE(ceil_log2(4) == 2); + REQUIRE(ceil_log2(5) == 3); + REQUIRE(ceil_log2(7) == 3); + REQUIRE(ceil_log2(8) == 3); + REQUIRE(ceil_log2(16) == 4); + + REQUIRE(ceil_pow2(0) == 1); + REQUIRE(ceil_pow2(1) == 1); + REQUIRE(ceil_pow2(2) == 2); + REQUIRE(ceil_pow2(3) == 4); + REQUIRE(ceil_pow2(4) == 4); + REQUIRE(ceil_pow2(5) == 8); + REQUIRE(ceil_pow2(16) == 16); + REQUIRE(ceil_pow2(17) == 32); + + for (int x = 1; x <= 1024; ++x) { + int p = ceil_pow2(x); + REQUIRE(p >= x); + REQUIRE((p & (p - 1)) == 0); // power of two + } +} + +TEST_CASE("PageAllocator basic", "[memory][page]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kSize = 16 * kPage; + + AlignedRegion region{kPage, kSize}; + PageAllocator alloc{region.data(), kSize, kPage}; + + void* p = alloc.allocate(kPage); + REQUIRE(p != nullptr); + REQUIRE(p >= region.data()); + REQUIRE((char*)p + kPage <= (char*)region.data() + kSize); + + alloc.deallocate(p, kPage); + + void* q = alloc.allocate(kPage); + REQUIRE(q != nullptr); + REQUIRE(reinterpret_cast(q) % kPage == 0); + + alloc.deallocate(q, kPage); +} + +TEST_CASE("PageAllocator alignment", "[memory][page]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kSize = 16 * kPage; + + AlignedRegion region{kPage, kSize}; + PageAllocator alloc{region.data(), kSize, kPage}; + + // Even sub-page requests come back page-aligned. 
+ for (size_t req : {size_t{1}, size_t{17}, size_t{kPage / 2}, size_t{kPage}, size_t{kPage + 1}}) { + void* p = alloc.allocate(req); + REQUIRE(p != nullptr); + REQUIRE(reinterpret_cast(p) % kPage == 0); + alloc.deallocate(p, req); + } +} + +TEST_CASE("PageAllocator misaligned base", "[memory][page]") +{ + constexpr size_t kPage = 4096; + // Allocate enough to absorb a 3-byte preamble and still expose 16 full pages. + AlignedRegion region{kPage, 17 * kPage}; + + void* shifted_base = static_cast(region.data()) + 3; + size_t shifted_size = 16 * kPage + (kPage - 3); // base+3 .. region_end + PageAllocator alloc{shifted_base, shifted_size, kPage}; + + std::vector ptrs; + while (void* p = alloc.allocate(kPage)) { + REQUIRE(reinterpret_cast(p) % kPage == 0); + ptrs.push_back(p); + } + REQUIRE(ptrs.size() == 16); // (shifted_size - pad) / kPage == 16 + for (void* p : ptrs) { + alloc.deallocate(p, kPage); + } +} + +TEST_CASE("PageAllocator coalescing", "[memory][page]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kPages = 16; + constexpr size_t kSize = kPages * kPage; + + AlignedRegion region{kPage, kSize}; + + auto run = [&](bool reverse_free) { + PageAllocator alloc{region.data(), kSize, kPage}; + std::vector ptrs; + for (size_t i = 0; i < kPages; ++i) { + void* p = alloc.allocate(kPage); + REQUIRE(p != nullptr); + ptrs.push_back(p); + } + if (reverse_free) { + std::reverse(ptrs.begin(), ptrs.end()); + } + for (void* p : ptrs) { + alloc.deallocate(p, kPage); + } + // After all pages are freed, the whole region must be allocatable in one go. 
+ void* big = alloc.allocate(kSize); + REQUIRE(big != nullptr); + REQUIRE(big == region.data()); // (only valid because base is page-aligned) + alloc.deallocate(big, kSize); + }; + + run(/*reverse_free=*/false); + run(/*reverse_free=*/true); +} + +TEST_CASE("PageAllocator exhaustion", "[memory][page]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kPages = 16; + constexpr size_t kSize = kPages * kPage; + + AlignedRegion region{kPage, kSize}; + PageAllocator alloc{region.data(), kSize, kPage}; + + std::vector ptrs; + while (void* p = alloc.allocate(kPage)) { + ptrs.push_back(p); + } + REQUIRE(ptrs.size() == kPages); + + for (void* p : ptrs) { + alloc.deallocate(p, kPage); + } + void* big = alloc.allocate(kSize); + REQUIRE(big != nullptr); + alloc.deallocate(big, kSize); +} + +TEST_CASE("PageAllocator non-power-of-2 region", "[memory][page]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kPages = 12; // not a power of two + constexpr size_t kSize = kPages * kPage; + + AlignedRegion region{kPage, kSize}; + PageAllocator alloc{region.data(), kSize, kPage}; + + // 12 sequential single-page allocations must all succeed. + std::vector ptrs; + for (size_t i = 0; i < kPages; ++i) { + void* p = alloc.allocate(kPage); + REQUIRE(p != nullptr); + REQUIRE(reinterpret_cast(p) % kPage == 0); + // Pointer must lie within the valid region. 
+ REQUIRE(p >= region.data()); + REQUIRE(static_cast(p) < static_cast(region.data()) + kSize); + ptrs.push_back(p); + } + REQUIRE(alloc.allocate(kPage) == nullptr); + for (void* p : ptrs) { + alloc.deallocate(p, kPage); + } +} + +TEST_CASE("PageAllocator stress (random)", "[memory][page][stress]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kPages = 64; + constexpr size_t kSize = kPages * kPage; + constexpr int kOps = 5000; + + AlignedRegion region{kPage, kSize}; + PageAllocator alloc{region.data(), kSize, kPage}; + + std::mt19937 rng{0xdeadbeefU}; + + // live[addr] = requested_size + std::map live; + size_t live_reserved = 0; + + auto check_no_overlap = [&](void* p, size_t req) { + size_t res = reserved_bytes(req, kPage); + for (auto& [addr, sz] : live) { + REQUIRE_FALSE(ranges_overlap(p, res, addr, reserved_bytes(sz, kPage))); + } + }; + + for (int op = 0; op < kOps; ++op) { + bool do_alloc = live.empty() ? true : (live.size() >= kPages ? false : (rng() & 1)); + if (do_alloc) { + // Random size 1..kSize bytes (allocator rounds up internally). + size_t req = std::uniform_int_distribution{1, kSize}(rng); + void* p = alloc.allocate(req); + if (!p) { + continue; // capacity-bounded failure is OK + } + REQUIRE(p >= region.data()); + REQUIRE((char*)p + reserved_bytes(req, kPage) <= (char*)region.data() + kSize); + REQUIRE(reinterpret_cast(p) % kPage == 0); + check_no_overlap(p, req); + live[p] = req; + live_reserved += reserved_bytes(req, kPage); + REQUIRE(live_reserved <= kSize); + } + else { + // Pick a random live entry and free it. + auto it = live.begin(); + std::advance(it, std::uniform_int_distribution{0, live.size() - 1}(rng)); + alloc.deallocate(it->first, it->second); + live_reserved -= reserved_bytes(it->second, kPage); + live.erase(it); + } + } + + // Free everything; whole region should be allocatable again. 
+ for (auto& [addr, sz] : live) { + alloc.deallocate(addr, sz); + } + void* big = alloc.allocate(kSize); + REQUIRE(big != nullptr); + alloc.deallocate(big, kSize); +} + +TEST_CASE("SlabAllocator min_size < object_size", "[memory][slab]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kSize = 16 * kPage; + + AlignedRegion region{kPage, kSize}; + PageAllocator pages{region.data(), kSize, kPage}; + + // min_size (64) is smaller than object_size (1024). Pre-fix this loops forever. + SlabAllocator slab{/*object_size=*/1024, + /*min_size=*/64, + /*max_size=*/64 * 1024, + /*threshold=*/0.0f, + /*max_empty_slabs=*/2}; + + SlabSlot out[1] = {}; + int n = slab.allocate(out, 1, pages); + REQUIRE(n == 1); + char* addr = slab.AddressOf(out[0]); + REQUIRE(addr != nullptr); + REQUIRE(reinterpret_cast(addr) >= reinterpret_cast(region.data())); + REQUIRE(reinterpret_cast(addr) + 1024 <= reinterpret_cast(region.data()) + kSize); + slab.deallocate(out, 1, pages); +} + +TEST_CASE("SlabAllocator slot_owner reverse link", "[memory][slab][owner]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kSize = 16 * kPage; + constexpr size_t kObj = 64; + + AlignedRegion region{kPage, kSize}; + PageAllocator pages{region.data(), kSize, kPage}; + + // max_empty_slabs = 2 keeps the single slab resident after the free below, + // so owner() reads a live (non-reclaimed) slot. + SlabAllocator slab{kObj, kPage, kPage, 0.0f, /*max_empty_slabs=*/2}; + + SlabSlot s[1] = {}; + REQUIRE(slab.allocate(s, 1, pages) == 1); + + // The slab carries a bidirectional link slot -> owning token. A fresh slot + // starts unowned (default token); set_owner records the owner; deallocate + // clears the link. 
+ REQUIRE(slab.owner(s[0]) == object_alloc_t{}); + + const object_alloc_t tok{reinterpret_cast(0x1000)}; + slab.set_owner(s[0], tok); + REQUIRE(slab.owner(s[0]) == tok); + + slab.deallocate(s, 1, pages); + REQUIRE(slab.owner(s[0]) == object_alloc_t{}); // cleared on free +} + +TEST_CASE("SlabAllocator empty-slab reclamation", "[memory][slab]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kSize = 16 * kPage; // 64 KiB + + AlignedRegion region{kPage, kSize}; + PageAllocator pages{region.data(), kSize, kPage}; + + constexpr size_t kObj = 64; + SlabAllocator slab{/*object_size=*/kObj, + /*min_size=*/kPage, + /*max_size=*/kPage, + /*threshold=*/0.0f, + /*max_empty_slabs=*/0}; + + constexpr int kCount = 64 * 3 + 5; + std::vector outs(kCount); + int got = slab.allocate(outs.data(), kCount, pages); + REQUIRE(got == kCount); + + // All addresses are distinct (Bug E sanity check on multi-slab path). + std::vector addrs; + for (auto& o : outs) { + addrs.push_back(slab.AddressOf(o)); + } + std::sort(addrs.begin(), addrs.end()); + REQUIRE(std::unique(addrs.begin(), addrs.end()) == addrs.end()); + + // Free everything; with max_empty_slabs=0, the slab allocator must + // return its pages to the underlying PageAllocator. + slab.deallocate(outs.data(), kCount, pages); + + void* big = pages.allocate(kSize); + REQUIRE(big != nullptr); + pages.deallocate(big, kSize); +} + +TEST_CASE("SlabAllocator single slab", "[memory][slab]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kSize = 16 * kPage; + constexpr size_t kObj = 64; + + AlignedRegion region{kPage, kSize}; + PageAllocator pages{region.data(), kSize, kPage}; + + SlabAllocator slab{kObj, kPage, kPage, 0.0f, 2}; + + constexpr int kPerSlab = kPage / kObj; // 64 + std::vector outs(kPerSlab); + REQUIRE(slab.allocate(outs.data(), kPerSlab, pages) == kPerSlab); + + // Distinctness + per-slab object alignment. 
+ std::vector addrs; + for (auto& o : outs) { + char* addr = slab.AddressOf(o); + REQUIRE(addr != nullptr); + addrs.push_back(addr); + } + std::sort(addrs.begin(), addrs.end()); + REQUIRE(std::unique(addrs.begin(), addrs.end()) == addrs.end()); + for (size_t i = 1; i < addrs.size(); ++i) { + REQUIRE(static_cast(addrs[i] - addrs[i - 1]) == kObj); + } + + // One more allocation should still succeed because the region has + // pages free for another slab. + SlabSlot extra[1] = {}; + REQUIRE(slab.allocate(extra, 1, pages) == 1); + + slab.deallocate(extra, 1, pages); + slab.deallocate(outs.data(), kPerSlab, pages); + + // Round-trip: refill the same slab. + REQUIRE(slab.allocate(outs.data(), kPerSlab, pages) == kPerSlab); + slab.deallocate(outs.data(), kPerSlab, pages); +} + +TEST_CASE("SlabAllocator spans multiple slabs", "[memory][slab]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kSize = 16 * kPage; + constexpr size_t kObj = 64; + constexpr int kPerSlab = kPage / kObj; + + AlignedRegion region{kPage, kSize}; + PageAllocator pages{region.data(), kSize, kPage}; + + SlabAllocator slab{kObj, kPage, kPage, 0.0f, 2}; + + constexpr int kRequest = kPerSlab * 2 + 3; + std::vector outs(kRequest); + int got = slab.allocate(outs.data(), kRequest, pages); + REQUIRE(got == kRequest); + + std::vector addrs; + for (int i = 0; i < got; ++i) { + char* addr = slab.AddressOf(outs[i]); + REQUIRE(addr != nullptr); + REQUIRE(addr >= region.data()); + REQUIRE(addr + kObj <= (char*)region.data() + kSize); + addrs.push_back(addr); + } + std::sort(addrs.begin(), addrs.end()); + REQUIRE(std::unique(addrs.begin(), addrs.end()) == addrs.end()); + + slab.deallocate(outs.data(), got, pages); +} + +TEST_CASE("SlabAllocator round-trip", "[memory][slab]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kSize = 16 * kPage; + constexpr size_t kObj = 64; + + AlignedRegion region{kPage, kSize}; + PageAllocator pages{region.data(), kSize, kPage}; + + SlabAllocator slab{kObj, kPage, 
kPage, 0.0f, 2}; + + for (int round = 0; round < 5; ++round) { + constexpr int kBatch = 200; + std::vector a(kBatch); + std::vector b(kBatch); + REQUIRE(slab.allocate(a.data(), kBatch, pages) == kBatch); + REQUIRE(slab.allocate(b.data(), kBatch, pages) == kBatch); + + // a and b live concurrently; their object addresses must not collide. + std::vector all; + for (auto& o : a) { + all.push_back(slab.AddressOf(o)); + } + for (auto& o : b) { + all.push_back(slab.AddressOf(o)); + } + std::sort(all.begin(), all.end()); + REQUIRE(std::unique(all.begin(), all.end()) == all.end()); + + slab.deallocate(a.data(), kBatch, pages); + slab.deallocate(b.data(), kBatch, pages); + } +} + +TEST_CASE("SlabAllocator stress (random)", "[memory][slab][stress]") +{ + constexpr size_t kPage = 4096; + constexpr size_t kSize = 32 * kPage; + constexpr size_t kObj = 32; + constexpr int kOps = 5000; + + AlignedRegion region{kPage, kSize}; + PageAllocator pages{region.data(), kSize, kPage}; + + SlabAllocator slab{kObj, kPage, kPage, 0.0f, 2}; + + std::mt19937 rng{0xc0ffee01U}; + + std::vector live; + + for (int op = 0; op < kOps; ++op) { + bool do_alloc = live.empty() ? true : (rng() & 1); + if (do_alloc) { + int batch = std::uniform_int_distribution{1, 16}(rng); + std::vector out(batch); + int got = slab.allocate(out.data(), batch, pages); + for (int i = 0; i < got; ++i) { + char* addr = slab.AddressOf(out[i]); + REQUIRE(addr >= region.data()); + REQUIRE(addr + kObj <= (char*)region.data() + kSize); + live.push_back(out[i]); + } + } + else { + int batch = std::uniform_int_distribution{1, std::min(16, live.size())}(rng); + std::vector to_free; + for (int i = 0; i < batch; ++i) { + size_t idx = std::uniform_int_distribution{0, live.size() - 1}(rng); + to_free.push_back(live[idx]); + live[idx] = live.back(); + live.pop_back(); + } + slab.deallocate(to_free.data(), to_free.size(), pages); + } + + // Live objects are pairwise distinct. 
+ if ((op & 0xff) == 0 && !live.empty()) { + std::vector snap; + for (auto& o : live) { + snap.push_back(slab.AddressOf(o)); + } + std::sort(snap.begin(), snap.end()); + REQUIRE(std::unique(snap.begin(), snap.end()) == snap.end()); + } + } + + if (!live.empty()) { + slab.deallocate(live.data(), live.size(), pages); + } +} + +TEST_CASE("ObjectAllocator construct + Register", "[memory][object]") +{ + using core::Allocator; + using core::Buffer; + + // The hard-coded kPageSize/kMinSlabSize inside ObjectAllocator::Impl is 16 MiB, + // so the buffer must be large enough to hold at least a couple of slabs. + constexpr size_t kBytes = 64UL << 20; // 64 MiB + + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + + int idx0 = obj.Register(/*size=*/24, /*alignment=*/8); + int idx1 = obj.Register(/*size=*/96, /*alignment=*/16); + REQUIRE(idx0 == 0); + REQUIRE(idx1 == 1); + REQUIRE(idx0 != idx1); +} + +TEST_CASE("ObjectAllocator Allocate / Deallocate", "[memory][object]") +{ + using core::Allocator; + using core::Buffer; + + constexpr size_t kBytes = 64UL << 20; + // Use an object size that yields ~256 objects per 16 MiB slab; tiny + // sizes here would force Slab::initialize to push hundreds of + // thousands of list nodes, dominating test runtime. 
+ constexpr size_t kSize = 65536; // 64 KiB + constexpr size_t kAlign = 64; + + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + int idx = obj.Register(kSize, kAlign); + + constexpr size_t kBatch = 8; + std::vector outs(kBatch); + size_t got = obj.Allocate(idx, outs.data(), kBatch); + REQUIRE(got == kBatch); + + char* base = static_cast(buf.raw_data()); + for (size_t i = 0; i < got; ++i) { + char* addr = outs[i]->base(0); + REQUIRE(addr >= base); + REQUIRE(addr + kSize <= base + kBytes); + REQUIRE(reinterpret_cast(addr) % kAlign == 0); + } + + // Single-part objects store part 0 inline (no heap vectors) and read it back + // through base(0). + REQUIRE(outs[0]->part_count() == 1); + REQUIRE(outs[0]->base(0) == outs[0]->base0); + REQUIRE(outs[0]->bases.empty()); + + obj.Deallocate(idx, outs.data(), got); + + got = obj.Allocate(idx, outs.data(), kBatch); + REQUIRE(got == kBatch); + obj.Deallocate(idx, outs.data(), got); +} + +TEST_CASE("ObjectAllocator multiple registrations", "[memory][object]") +{ + using core::Allocator; + using core::Buffer; + + constexpr size_t kBytes = 64UL << 20; + constexpr size_t kSize0 = 65536; // 64 KiB + constexpr size_t kSize1 = 131072; // 128 KiB + + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + + int idx0 = obj.Register(kSize0, 64); + int idx1 = obj.Register(kSize1, 128); + + constexpr size_t kBatch = 8; + std::vector a(kBatch); + std::vector b(kBatch); + REQUIRE(obj.Allocate(idx0, a.data(), kBatch) == kBatch); + REQUIRE(obj.Allocate(idx1, b.data(), kBatch) == kBatch); + + for (auto& x : a) { + for (auto& y : b) { + REQUIRE_FALSE(ranges_overlap(x->base(0), kSize0, y->base(0), kSize1)); + } + } + + obj.Deallocate(idx0, a.data(), kBatch); + obj.Deallocate(idx1, b.data(), kBatch); +} + +TEST_CASE("ScratchAllocator leaves source untouched", "[memory][object][scratch]") +{ + using core::Allocator; + using core::Buffer; + + constexpr size_t 
kBytes = 64UL << 20; + constexpr size_t kSize = 65536; + constexpr size_t kAlign = 64; + + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + const int idx = obj.Register(kSize, kAlign); + + constexpr size_t kBatch = 16; + std::vector live(kBatch); + REQUIRE(obj.Allocate(idx, live.data(), kBatch) == kBatch); + + std::vector addrs(kBatch); + std::vector keys(kBatch); + for (size_t i = 0; i < kBatch; ++i) { + addrs[i] = live[i]->base(0); + keys[i] = live[i]->key; + } + + // A scratch probe shares the backing region (same addresses) but its + // alloc/evict must not perturb the source's live allocations. + { + ScratchAllocator scratch{obj}; + scratch.Evict(idx, live[0].a); // free a committed allocation in the copy + REQUIRE(scratch.Allocate(idx)); // reuse freed capacity in the copy + } + + for (size_t i = 0; i < kBatch; ++i) { + REQUIRE(obj.IsValid(live[i], keys[i])); + REQUIRE(live[i]->base(0) == addrs[i]); + } + obj.Deallocate(idx, live.data(), kBatch); +} + +TEST_CASE("ObjectAllocator IsValid sentinel and lifecycle", "[memory][object][isvalid]") +{ + using core::Allocator; + using core::Buffer; + + constexpr size_t kBytes = 64UL << 20; + constexpr size_t kSize = 65536; + constexpr size_t kAlign = 64; + + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + int idx = obj.Register(kSize, kAlign); + + // Null handle is never valid, for any snapshot key. + object_alloc_t zero{}; + REQUIRE(obj.IsValid(zero, 0) == false); + REQUIRE(obj.IsValid(zero, 12345) == false); + + // Live allocations: valid against the key snapshotted at alloc time. + constexpr size_t kBatch = 8; + std::vector live(kBatch); + REQUIRE(obj.Allocate(idx, live.data(), kBatch) == kBatch); + std::vector keys(kBatch); + for (size_t i = 0; i < kBatch; ++i) { + keys[i] = live[i]->key; + REQUIRE(obj.IsValid(live[i], keys[i]) == true); + } + + // Deallocate: the key is zeroed, so every snapshot stops matching. 
+ obj.Deallocate(idx, live.data(), kBatch); + for (size_t i = 0; i < kBatch; ++i) { + REQUIRE(obj.IsValid(live[i], keys[i]) == false); + } + + // Re-allocate: even if an Allocation* is recycled into live2, the monotonic + // key advanced, so the OLD snapshot keys still fail (ABA-proof). + std::vector live2(kBatch); + REQUIRE(obj.Allocate(idx, live2.data(), kBatch) == kBatch); + for (size_t i = 0; i < kBatch; ++i) { + REQUIRE(obj.IsValid(live2[i], live2[i]->key) == true); + REQUIRE(obj.IsValid(live[i], keys[i]) == false); + } + + obj.Deallocate(idx, live2.data(), kBatch); +} + +TEST_CASE("ObjectAllocator IsValid across slab reclaim", "[memory][object][isvalid]") +{ + using core::Allocator; + using core::Buffer; + + // 16 MiB slab / 64 KiB objects = 256 objects per slab; 768 fills 3 slabs. + // With kMaxEmptySlabs = 2, freeing all 3 reclaims one slab to PageAllocator. + constexpr size_t kBytes = 64UL << 20; + constexpr size_t kSize = 65536; + constexpr size_t kAlign = 64; + constexpr size_t kAlloc = 768; + + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + int idx = obj.Register(kSize, kAlign); + + std::vector handles(kAlloc); + REQUIRE(obj.Allocate(idx, handles.data(), kAlloc) == kAlloc); + std::vector keys(kAlloc); + for (size_t i = 0; i < kAlloc; ++i) { + keys[i] = handles[i]->key; + } + + obj.Deallocate(idx, handles.data(), kAlloc); + for (size_t i = 0; i < kAlloc; ++i) { + REQUIRE(obj.IsValid(handles[i], keys[i]) == false); // key zeroed on free + } + + // Churn (slab reclaim/reuse underneath): fresh keys are monotonic, so the + // stale snapshots never match again.
+ std::vector refill(kAlloc); + REQUIRE(obj.Allocate(idx, refill.data(), kAlloc) == kAlloc); + for (size_t i = 0; i < kAlloc; ++i) { + REQUIRE(obj.IsValid(handles[i], keys[i]) == false); + REQUIRE(obj.IsValid(refill[i], refill[i]->key) == true); + } + + obj.Deallocate(idx, refill.data(), kAlloc); +} + +TEST_CASE("object_alloc_t opacity", "[memory][object][opacity]") +{ + object_alloc_t a{}; + object_alloc_t b{}; + REQUIRE(a == b); + REQUIRE_FALSE(a != b); + REQUIRE_FALSE(a < b); + + object_alloc_t c = a; + REQUIRE(c == a); + + // A non-null sentinel compares unequal to the null handle. + object_alloc_t s{reinterpret_cast(0x1000)}; + REQUIRE(s != a); + REQUIRE(((a < s) || (s < a))); + + std::vector v{a, b, c, s}; + std::sort(v.begin(), v.end()); + + static_assert(sizeof(object_alloc_t) == 8); +} + +TEST_CASE("ObjectAllocator composite lifecycle", "[memory][object][composite]") +{ + using core::Allocator; + using core::Buffer; + + constexpr size_t kBytes = 64UL << 20; // 64 MiB + constexpr size_t kRec = 65536; // recurrent part size + constexpr size_t kConv = 131072; // conv part size (distinct slab class) + constexpr int kN = 4; // recurrent parts + + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + // recurrent parts 0..N-1 (count kN) + conv accumulation part N (count 1) + const int idx = obj.Register({{kRec, 1, kN}, {kConv, 1, 1}}); + + REQUIRE(obj.PartCount(idx) == kN + 1); + for (int p = 0; p < kN; ++p) { + REQUIRE(obj.PartBytes(idx, p) == kRec); + } + REQUIRE(obj.PartBytes(idx, kN) == kConv); + + object_alloc_t h{}; + REQUIRE(obj.Allocate(idx, &h, 1) == 1); + const uint64_t h_key = h->key; + REQUIRE(obj.IsValid(h, h_key) == true); + + REQUIRE(h->part_count() == kN + 1); + REQUIRE(h->base(0) == h->bases[0]); // composite uses the bases vector + + // All part bases distinct and inside the buffer.
+ char* base = static_cast(buf.raw_data()); + std::vector addrs; + for (int p = 0; p < kN + 1; ++p) { + REQUIRE(h->base(p) >= base); + REQUIRE(h->base(p) + obj.PartBytes(idx, p) <= base + kBytes); + addrs.push_back(h->base(p)); + } + std::sort(addrs.begin(), addrs.end()); + REQUIRE(std::unique(addrs.begin(), addrs.end()) == addrs.end()); + + obj.Deallocate(idx, &h, 1); + REQUIRE(obj.IsValid(h, h_key) == false); + + // Re-allocate succeeds (pages reclaimed); the freed snapshot stays invalid. + object_alloc_t h2{}; + REQUIRE(obj.Allocate(idx, &h2, 1) == 1); + REQUIRE(obj.IsValid(h2, h2->key) == true); + REQUIRE(obj.IsValid(h, h_key) == false); + obj.Deallocate(idx, &h2, 1); +} + +TEST_CASE("ObjectAllocator composite stale handle after recycle", "[memory][object][composite]") +{ + using core::Allocator; + using core::Buffer; + + constexpr size_t kBytes = 64UL << 20; + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + const int idx = obj.Register({{65536, 1, 2}, {131072, 1, 1}}); + + object_alloc_t a{}; + REQUIRE(obj.Allocate(idx, &a, 1) == 1); + const uint64_t a_key = a->key; + obj.Deallocate(idx, &a, 1); // frees the allocation; its key is zeroed + + object_alloc_t b{}; + REQUIRE(obj.Allocate(idx, &b, 1) == 1); // fresh, monotonic key + + // `a`'s snapshot key no longer matches (freed, or recycled into `b` with a + // new key) -> stale/invalid. + REQUIRE(obj.IsValid(a, a_key) == false); + REQUIRE(obj.IsValid(b, b->key) == true); + obj.Deallocate(idx, &b, 1); +} + +TEST_CASE("ObjectAllocator composite atomic rollback on partial OOM", "[memory][object][composite]") +{ + using core::Buffer; + using core::Device; + + // 64 MiB / 16 MiB pages = 4 single-page slots. A 16 MiB object => 1 object + // per 16 MiB slab (one page each). The region base must be 16 MiB-aligned so + // PageAllocator does not lose a page to align_base padding. 
+ constexpr size_t kPage = 16UL << 20; + constexpr size_t kBytes = 64UL << 20; + constexpr size_t kObj = 16UL << 20; // 16 MiB -> slab_size 16 MiB, 1 obj/slab + + AlignedRegion region{kPage, kBytes}; + Buffer buf{region.data(), static_cast(region.size()), data_type_v, Device{kCPU}}; + + ObjectAllocator obj{buf}; + // member 0: 3 parts (3 pages, fits); member 1: 2 parts (needs 2 pages, only 1 + // left) -> partial OOM on member 1 -> roll back all 3 of member 0. + const int idx = obj.Register({{kObj, 1, 3}, {kObj, 1, 2}}); + + object_alloc_t h{}; + REQUIRE(obj.Allocate(idx, &h, 1) == 0); // not placed + REQUIRE(h.a == nullptr); // batch leaves the slot null on OOM + REQUIRE(obj.IsValid(h, 0) == false); + + // No leak: a simple 16 MiB object (same aligned size -> shared slab class) + // can fully allocate all 4 pages. + const int sidx = obj.Register(kObj, 1); + std::vector live(4); + REQUIRE(obj.Allocate(sidx, live.data(), 4) == 4); + obj.Deallocate(sidx, live.data(), 4); +} + +TEST_CASE("ObjectAllocator size dedup", "[memory][object][dedup]") +{ + using core::Allocator; + using core::Buffer; + + constexpr size_t kBytes = 64UL << 20; + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + + // Equal aligned size (65536) -> SAME object id. + const int a = obj.Register(65500, 64); // aligns up to 65536 + const int b = obj.Register(65536, 64); + REQUIRE(a == b); + + // Different aligned size -> different id. + const int c = obj.Register(131072, 64); + REQUIRE(c != a); + + // Composite single entry count 1 collapses to the simple deduped id. 
+ const int d = obj.Register({{65536, 64, 1}}); + REQUIRE(d == a); +} + +TEST_CASE("ScratchAllocator composite source independence", "[memory][object][composite][scratch]") +{ + using core::Allocator; + using core::Buffer; + + constexpr size_t kBytes = 64UL << 20; + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + const int idx = obj.Register({{65536, 1, 3}, {131072, 1, 1}}); + + object_alloc_t a{}; + REQUIRE(obj.Allocate(idx, &a, 1) == 1); + const uint64_t a_key = a->key; + char* a_part0 = a->base(0); + + { + ScratchAllocator scratch{obj}; + scratch.Evict(idx, a.a); + // After freeing `a`'s slots in the copy, the copy can place another. + REQUIRE(scratch.Allocate(idx)); + } + + REQUIRE(obj.IsValid(a, a_key)); + REQUIRE(a->base(0) == a_part0); + obj.Deallocate(idx, &a, 1); +} + +TEST_CASE("ObjectAllocator resolve gives part bases", "[memory][object][resolve]") +{ + using core::Allocator; + using core::Buffer; + + constexpr size_t kBytes = 64UL << 20; + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + const int idx = obj.Register({{65536, 1, 2}, {131072, 1, 1}}); + + object_alloc_t h{}; + REQUIRE(obj.Allocate(idx, &h, 1) == 1); + + const Allocation* a = h.a; // the handle IS the resolved Allocation pointer + REQUIRE(a != nullptr); + REQUIRE(h->part_count() == obj.PartCount(idx)); + REQUIRE(static_cast(a->bases.size()) == obj.PartCount(idx)); // composite: all parts in bases + for (int p = 0; p < obj.PartCount(idx); ++p) { + REQUIRE(a->bases[p] == h->base(p)); + } + obj.Deallocate(idx, &h, 1); +} + +TEST_CASE("ScratchAllocator reports OOM at capacity", "[memory][object][scratch]") +{ + using core::Buffer; + using core::Device; + + // 64 MiB / 16 MiB pages = 4 single-page slots. A 16 MiB object => slab_size + // 16 MiB, 1 obj/slab, so the region holds exactly 4 such objects. 
+ constexpr size_t kPage = 16UL << 20; + constexpr size_t kBytes = 64UL << 20; + constexpr size_t kObj = 16UL << 20; + + AlignedRegion region{kPage, kBytes}; + Buffer buf{region.data(), static_cast(region.size()), data_type_v, Device{kCPU}}; + + ObjectAllocator obj{buf}; + const int idx = obj.Register(kObj, 1); + + // One live allocation in the source; snapshot its resolved address. + object_alloc_t live{}; + REQUIRE(obj.Allocate(idx, &live, 1) == 1); + const uint64_t live_key = live->key; + char* live_addr = live->base(0); + + // A scratch probe inherits the source's committed capacity (1 of 4 slots + // used). It can place the remaining 3, then the 4th must report OOM. + { + ScratchAllocator scratch{obj}; + REQUIRE(scratch.Allocate(idx)); // slot 2 + REQUIRE(scratch.Allocate(idx)); // slot 3 + REQUIRE(scratch.Allocate(idx)); // slot 4 -> capacity now full + REQUIRE_FALSE(scratch.Allocate(idx)); // capacity exceeded -> false + } + + // Source untouched: its live handle still resolves to the same address. + REQUIRE(obj.IsValid(live, live_key)); + REQUIRE(live->base(0) == live_addr); + obj.Deallocate(idx, &live, 1); +} + +TEST_CASE("ObjectAllocator Allocation pointer is stable across churn", "[memory][object][resolve][stability]") +{ + using core::Allocator; + using core::Buffer; + + constexpr size_t kBytes = 64UL << 20; + constexpr size_t kSize = 65536; + constexpr size_t kAlign = 64; + + Allocator alloc{kCPU}; + Buffer buf{kBytes, data_type_v, alloc}; + + ObjectAllocator obj{buf}; + const int idx = obj.Register(kSize, kAlign); + + // Pin one allocation and snapshot its durable entry (single-part: base0). + object_alloc_t h0{}; + REQUIRE(obj.Allocate(idx, &h0, 1) == 1); + const Allocation* const a0 = h0.a; + char* const base0 = h0->base(0); + REQUIRE(h0->part_count() == 1); + REQUIRE(h0->bases.empty()); // inline storage, no heap parts + + // Churn the table around h0: allocate a batch, free half, allocate another + // batch -- all WITHOUT touching h0. 
+ constexpr size_t kBatch = 32; + std::vector first(kBatch); + REQUIRE(obj.Allocate(idx, first.data(), kBatch) == kBatch); + obj.Deallocate(idx, first.data(), kBatch / 2); // free the first half + std::vector second(kBatch); + REQUIRE(obj.Allocate(idx, second.data(), kBatch) == kBatch); + + // h0's durable Allocation* and its resolved base are unchanged by the churn: + // this is the invariant CacheBlock relies on when it caches the handle. + REQUIRE(h0.a == a0); + REQUIRE(h0->base(0) == base0); + + obj.Deallocate(idx, &h0, 1); + obj.Deallocate(idx, first.data() + kBatch / 2, kBatch / 2); // free the still-live half + obj.Deallocate(idx, second.data(), kBatch); +} + +TEST_CASE("ObjectAllocator simple/composite share one slab class", "[memory][object][dedup][interchange]") +{ + using core::Buffer; + using core::Device; + + // Buffer == exactly ONE ObjectAllocator page (MemoryState::kPageSize, 32 MiB). + // 32 MiB-aligned base so PageAllocator does not lose the only page to padding. + // With a single page there is no second page to reclaim, so a partial donor + // slab cannot help a *separate* slab class -- sharing is the only way through. + constexpr size_t kPageBytes = 32UL << 20; // == MemoryState::kPageSize + constexpr size_t kObj = 65536; // 64 KiB -> slab holds many slots + + AlignedRegion region{kPageBytes, kPageBytes}; + Buffer buf{region.data(), static_cast(region.size()), data_type_v, Device{kCPU}}; + + ObjectAllocator obj{buf}; + + // Simple object of size kObj, and a composite whose single member is two + // parts of the SAME aligned size kObj. count=2 (!=1) keeps it a real + // composite (does not collapse to the simple id). + const int sidx = obj.Register(kObj, 1); + const int cidx = obj.Register({{kObj, 1, 2}}); + REQUIRE(sidx != cidx); // distinct objects ... + REQUIRE(obj.PartCount(cidx) == 2); // ... 
but the composite has 2 parts + + // One simple alloc creates the single slab on the only page and leaves it + // PARTIAL (one slot used, the rest free, zero free pages remain). + object_alloc_t s0{}; + REQUIRE(obj.Allocate(sidx, &s0, 1) == 1); + + // The composite's two parts must come from that same partial slab. If the + // composite had its own slab class it would need a fresh page (none left) + // and fail. Success here == simple and composite share one slab class. + object_alloc_t comp{}; + REQUIRE(obj.Allocate(cidx, &comp, 1) == 1); + REQUIRE(comp.a != nullptr); + REQUIRE(comp->part_count() == 2); + + obj.Deallocate(cidx, &comp, 1); + obj.Deallocate(sidx, &s0, 1); +} diff --git a/src/turbomind/models/CMakeLists.txt b/src/turbomind/models/CMakeLists.txt index 376b035664..931602800f 100644 --- a/src/turbomind/models/CMakeLists.txt +++ b/src/turbomind/models/CMakeLists.txt @@ -15,19 +15,16 @@ add_library(models STATIC model_weight.cc model_root.cc vision_model.cc + qwen3_5vit/qwen3_5vit_block_weight.cc + qwen3_5vit/qwen3_5vit_weight.cc qwen3_5vit/fast_pos_embed.cu qwen3_5vit/fast_rotary_pos_emb.cu qwen3_5vit/fused_embed_merge.cu qwen3_5vit/qkv_preprocess.cu qwen3_5vit/mrope_position_ids.cu qwen3_5vit/bias_gelu.cu - qwen3_5vit/qwen3_5vit_block_weight.cc - qwen3_5vit/qwen3_5vit_weight.cc qwen3_5vit/qwen3_5vit.cc llama/LlamaLinear.cu - llama/BlockManager.cc - llama/BlockTrie.cc - llama/SequenceManager.cc llama/LlamaFfnLayer.cc llama/moe_ffn_layer.cc llama/unified_decoder.cc @@ -42,6 +39,7 @@ set_property(TARGET models PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS ON) target_link_libraries(models PUBLIC generation core + memory gemm2 rms_norm layer_norm diff --git a/src/turbomind/models/attention_weight.h b/src/turbomind/models/attention_weight.h index 6462ca194d..ffdb4474e9 100644 --- a/src/turbomind/models/attention_weight.h +++ b/src/turbomind/models/attention_weight.h @@ -124,6 +124,9 @@ class AttentionWeight: public core::Module { bool use_logn_attn{}; core::RopeConfig 
rope{}; + + // Set by runtime layer + size_t cache_block_offset{}; }; } // namespace turbomind diff --git a/src/turbomind/models/delta_net_weight.h b/src/turbomind/models/delta_net_weight.h index 92fa9bbe72..3298466061 100644 --- a/src/turbomind/models/delta_net_weight.h +++ b/src/turbomind/models/delta_net_weight.h @@ -69,6 +69,10 @@ class DeltaNetWeight: public core::Module { DataType data_type{}; int tp_size{}; int tp_rank{}; + + // Set at runtime + int conv_state_offset{}; + int linear_state_offset{}; }; } // namespace turbomind diff --git a/src/turbomind/models/input_processor.cc b/src/turbomind/models/input_processor.cc index d31b461979..bfcee37dea 100644 --- a/src/turbomind/models/input_processor.cc +++ b/src/turbomind/models/input_processor.cc @@ -6,7 +6,6 @@ #include "src/turbomind/engine/request.h" -#include "src/turbomind/models/llama/SequenceManager.h" #include "src/turbomind/models/vision_model.h" namespace turbomind { @@ -36,16 +35,16 @@ struct InputProcessor::Impl { } } - int Add(RequestCache& c) + int Add(Sequence& c) { - const auto& [r, s] = std::tie(*c.req, *c.seq); + const auto& r = *c.req; // trim input embeds - if (!s.input_embeds_offsets.empty()) { - Interval l{0, (int)s.tokens.size()}; + if (!c.input_embeds_offsets.empty()) { + Interval l{0, (int)c.tokens.size()}; using Size = Interval::Size; - auto& embeds = s.input_embeds; - auto& offsets = s.input_embeds_offsets; + auto& embeds = c.input_embeds; + auto& offsets = c.input_embeds_offsets; int i = embeds.size() - 1; for (; i >= 0; --i) { Interval r{offsets[i], Size{(int)embeds[i].shape(0)}}; @@ -67,12 +66,6 @@ struct InputProcessor::Impl { return Request::kInvalid; } - // clone the embeds if the request persists - if (!r.session.end_flag) { - auto tmp = std::exchange(embeds, empty_like(embeds)); - std::copy_n((const uint8_t*)tmp.raw_data(), tmp.byte_size(), (uint8_t*)embeds.raw_data()); - } - const auto [sum, dim] = embeds.shapes(0, 1); const auto n = ranges_ptr->shape(0); const auto ranges = 
ranges_ptr->data(); @@ -94,8 +87,8 @@ struct InputProcessor::Impl { /// TODO: reject for src range OOB return Request::kInvalid; } - s.input_embeds_offsets.push_back(range.begin()); - s.input_embeds.push_back(embeds.slice(offset, size)); // reference into `embeds` + c.input_embeds_offsets.push_back(range.begin()); + c.input_embeds.push_back(embeds.slice(offset, size)); // reference into `embeds` offset += size; last = range.end(); } @@ -106,7 +99,7 @@ struct InputProcessor::Impl { void Add(int phase, TensorMap& env) { - const Buffer_ rc = env.at("requests").buffer(); + const Buffer_ rc = env.at("requests").buffer(); for (int i = 0; i < rc.size(); ++i) { auto& c = *TM_CHECK_NOTNULL(rc[i]); if (c.status == 0) { @@ -121,13 +114,13 @@ struct InputProcessor::Impl { auto& b = *env.at("batch").data()[0]; auto& copy = *env.at("copy").data()[0]; - const auto& rc = b.rc; + Buffer_ rc = env.at("requests").buffer(); input_ids_offsets_buf_[0] = 0; for (int i = 0; i < rc.size(); ++i) { input_ids_offsets_buf_[i + 1] = input_ids_offsets_buf_[i]; if (const auto& c = *rc[i]; TM_UNLIKELY(!c.autoregres)) { - const auto src = c.token_ids + c.history_len + c.alpha; + const auto src = c.token_ids + c.history_len + c.inflight_input_len; std::copy_n(src, c.input_len, input_ids_buf_.data() + input_ids_offsets_buf_[i]); // dbg(std::vector(src, src + c.input_len)); d.autoreg_ids_pos[i] = -1; @@ -160,10 +153,10 @@ struct InputProcessor::Impl { auto embed_ptr = (uint8_t*)d.input_embeds_buf.raw_data(); for (int k = 0; k < rc.size(); ++k) { if (auto& c = *rc[k]; !c.autoregres) { - const auto& embeds = c.seq->input_embeds; - const auto& offsets = c.seq->input_embeds_offsets; + const auto& embeds = c.input_embeds; + const auto& offsets = c.input_embeds_offsets; Interval p{input_ids_offsets_buf_[k], input_ids_offsets_buf_[k + 1]}; - Interval s{c.history_len + c.alpha, p.size()}; + Interval s{c.history_len + c.inflight_input_len, p.size()}; for (int i = (int)offsets.size() - 1; i >= 0; --i) { 
Interval r{offsets[i], Interval::Size{(int)embeds[i].shape(0)}}; auto o = r & s; diff --git a/src/turbomind/models/language_model.cc b/src/turbomind/models/language_model.cc index e01dc42c44..c74d6f5e86 100644 --- a/src/turbomind/models/language_model.cc +++ b/src/turbomind/models/language_model.cc @@ -12,6 +12,7 @@ #include "src/turbomind/core/scope.h" #include "src/turbomind/core/state.h" #include "src/turbomind/engine/batch.h" +#include "src/turbomind/engine/cache_registry.h" #include "src/turbomind/engine/request.h" #include "src/turbomind/generation/generation.h" #include "src/turbomind/kernels/gpt_kernels.h" @@ -60,10 +61,12 @@ struct LanguageModel::Impl { int max_logits_len_ = 0; Buffer_ sequence_length_buf_; + Buffer_ readonly_block_num_buf_; // {max_batch_size}, kCPUpinned Buffer_ finished_buf_; struct Data { Buffer_ sequence_length; + Buffer_ readonly_block_num; Buffer_ finished; Buffer_ autoregres; @@ -100,7 +103,8 @@ struct LanguageModel::Impl { } } - Impl(const EngineParam& engine, const Context& ctx, const ModelWeight& weights, int phases); + Impl( + CacheRegistry& registry, const EngineParam& engine, const Context& ctx, const ModelWeight& weights, int phases); Tensor LookupEmbedding(const Buffer_& input_ids, Buffer symm_buf); Tensor PostEmbedding(const Tensor& features, Buffer symm_buf); @@ -112,7 +116,8 @@ struct LanguageModel::Impl { void Fetch(int phase, TensorMap& env); }; -LanguageModel::Impl::Impl(const EngineParam& engine, const Context& ctx, const ModelWeight& weights, int phases): +LanguageModel::Impl::Impl( + CacheRegistry& registry, const EngineParam& engine, const Context& ctx, const ModelWeight& weights, int phases): comm_{ctx.comm}, weights_{weights}, linear_{*ctx.linear}, @@ -132,19 +137,21 @@ LanguageModel::Impl::Impl(const EngineParam& engine, const Context& ctx, const M // autoreg_ids_offsets_ = {engine.max_batch_size + 1, kCPU}; // std::fill_n(autoreg_ids_offsets_.data(), autoreg_ids_offsets_.size(), 0); - sequence_length_buf_ = 
{engine.max_batch_size, kCPUpinned}; - sequence_length_ = {{engine.max_batch_size}, kInt, kDEVICE}; + sequence_length_buf_ = {engine.max_batch_size, kCPUpinned}; + readonly_block_num_buf_ = {engine.max_batch_size, kCPUpinned}; + sequence_length_ = {{engine.max_batch_size}, kInt, kDEVICE}; for (int i = 0; i < phases; ++i) { - auto& d = data_.emplace_back(); - d.sequence_length = empty_like(sequence_length_buf_, kDEVICE); - d.finished = empty_like(finished_buf_, kDEVICE); - d.autoregres = {engine.max_batch_size, kCPU}; - d.generating = {engine.max_batch_size, kCPU}; + auto& d = data_.emplace_back(); + d.sequence_length = empty_like(sequence_length_buf_, kDEVICE); + d.readonly_block_num = empty_like(readonly_block_num_buf_, kDEVICE); + d.finished = empty_like(finished_buf_, kDEVICE); + d.autoregres = {engine.max_batch_size, kCPU}; + d.generating = {engine.max_batch_size, kCPU}; } input_processor_.emplace(engine, weights_.hidden_units, weights_.data_type, phases); - unified_decoder_ = std::make_unique(engine, ctx, phases, weights_); + unified_decoder_ = std::make_unique(registry, engine, ctx, phases, weights_); const int vocab_size = weights_.output->output_dim * tp_size_; @@ -306,7 +313,7 @@ void LanguageModel::Impl::Setup(int phase, TensorMap& env) auto& d = data_.at(phase); auto& copy = *env.at("copy").data()[0]; - const auto& rc = env.at("batch").data()[0]->rc; + Buffer_ rc = env.at("requests").buffer(); d.n_generating = 0; @@ -316,11 +323,13 @@ void LanguageModel::Impl::Setup(int phase, TensorMap& env) d.generating[i] = c.generating; d.n_generating += c.generating; if (TM_UNLIKELY(!c.autoregres)) { - sequence_length_buf_[i] = c.history_len + c.alpha + c.input_len; + sequence_length_buf_[i] = c.history_len + c.inflight_input_len + c.input_len; } + readonly_block_num_buf_[i] = c.readonly_block_num; // all rows, batch order } copy(sequence_length_buf_, rc.size(), d.sequence_length); + copy(readonly_block_num_buf_, rc.size(), d.readonly_block_num); 
unified_decoder_->Run(BatchOp::kSetup, phase, env); generation_->Run(BatchOp::kSetup, phase, env); @@ -353,7 +362,9 @@ void LanguageModel::Impl::Prepare(int phase, TensorMap& env) } if (auto group = copy.group()) { - // sequence_length = history_len + input_len + // Non-autoregressive rows use the submitted prefix length: + // sequence_length = history_len + inflight_input_len + input_len. + // Existing autoregressive rows carry the previous sequence_length forward. for (int i = 0; i < b.bsz; ++i) { if (const int j = b.perm[i]; j < b.bs0 && d.autoregres[i]) { copy(sequence_length_.front().data() + j, 1, sequence_length_.back().data() + i); @@ -381,6 +392,7 @@ void LanguageModel::Impl::Prepare(int phase, TensorMap& env) env.produce("finished", finished_.front()); env.produce("sequence_length", sequence_length_.front()); + env.produce("readonly_block_num", d.readonly_block_num); env.produce("k_offsets", k_offsets); unified_decoder_->Run(BatchOp::kPrepare, phase, env); @@ -447,6 +459,7 @@ void LanguageModel::Impl::Unprep(int phase, TensorMap& env) copy(finished_.front().buffer(), d.finished.size(), d.finished); + unified_decoder_->Run(BatchOp::kUnprep, phase, env); generation_->Run(BatchOp::kUnprep, phase, env); } @@ -470,9 +483,10 @@ LanguageModel::~LanguageModel() = default; LanguageModel::LanguageModel(LanguageModel&&) noexcept = default; -LanguageModel::LanguageModel(const EngineParam& engine, const Context& ctx, const ModelWeight& weights, int phases) +LanguageModel::LanguageModel( + CacheRegistry& registry, const EngineParam& engine, const Context& ctx, const ModelWeight& weights, int phases) { - impl_ = std::make_unique(engine, ctx, weights, phases); + impl_ = std::make_unique(registry, engine, ctx, weights, phases); } void LanguageModel::Run(BatchOp op, int phase, TensorMap& env) diff --git a/src/turbomind/models/language_model.h b/src/turbomind/models/language_model.h index 5699a9e37f..0963b7755f 100644 --- a/src/turbomind/models/language_model.h +++ 
b/src/turbomind/models/language_model.h @@ -10,6 +10,8 @@ namespace turbomind { class ModelWeight; +struct Sequence; +class CacheRegistry; class LanguageModel { public: @@ -24,7 +26,11 @@ class LanguageModel { return static_cast(impl_); } - LanguageModel(const EngineParam& engine, const Context& ctx, const ModelWeight& weights, int phases); + LanguageModel(CacheRegistry& registry, + const EngineParam& engine, + const Context& context, + const ModelWeight& weights, + int phases); void Run(BatchOp op, int phase, TensorMap& env); diff --git a/src/turbomind/models/llama/BlockManager.cc b/src/turbomind/models/llama/BlockManager.cc deleted file mode 100644 index b8a5001cf9..0000000000 --- a/src/turbomind/models/llama/BlockManager.cc +++ /dev/null @@ -1,282 +0,0 @@ -// Copyright (c) OpenMMLab. All rights reserved. - -#include - -#include "src/turbomind/core/logger.h" -#include "src/turbomind/models/llama/BlockManager.h" -#include "src/turbomind/utils/debug_utils.h" -#include "src/turbomind/utils/string_utils.h" - -namespace turbomind { - -BlockManager::BlockManager( - size_t block_size, double block_count, int chunk_size, core::Allocator allocator, GetFreeMemSize get_free_size): - block_size_(block_size), allocator_(allocator) -{ - if (block_count < 1.) 
{ - max_block_count_ = GetBlockCount(block_size, block_count, get_free_size); - } - else { - max_block_count_ = block_count; - } - - if (chunk_size == 0) { - chunk_size_ = static_cast(std::sqrt(max_block_count_)); - } - else if (chunk_size < 0) { - chunk_size_ = max_block_count_; - } - else { - chunk_size_ = chunk_size; - } - - TM_LOG_INFO("block_size = {:.3f} MB", (float)block_size_ / (1 << 20)); - TM_LOG_INFO("max_block_count = {}", max_block_count_); - TM_LOG_INFO("chunk_size = {}", chunk_size_); - - blocks_.reserve(max_block_count_); - - active_ids_.reserve(max_block_count_); - cached_ids_.reserve(max_block_count_); - free_ids_.reserve(max_block_count_); - - // pre-allocate first chunk - Malloc(); - dbg(free_ids_); -} - -BlockManager::~BlockManager() -{ - for (auto& chunk : chunks_) { - allocator_->deallocate(chunk, block_size_); - } -} - -bool BlockManager::Malloc() -{ - auto chunk_size = std::min(chunk_size_, max_block_count_ - blocks_.size()); - - if (!chunk_size) { - return false; - } - - auto ptr = (std::byte*)allocator_->allocate(block_size_ * chunk_size); - if (!ptr) { - return false; - } - - chunks_.push_back(ptr); - - for (int i = 0; i < chunk_size; ++i, ptr += block_size_) { - auto& block = blocks_.emplace_back(); - block.use_count = 0; - block.id = (int)blocks_.size() - 1; - block.timestamp = 0; - block.data = ptr; - - free_ids_.push_back(block.id); - } - - return true; -} - -size_t BlockManager::GetBlockCount(size_t block_size, double ratio, GetFreeMemSize get_free_size) -{ - size_t free = get_free_size(); - return static_cast(free * ratio) / block_size; -} - -void BlockManager::Move(std::vector& src, const std::vector& delta, std::vector& dst) -{ - TM_CHECK_GE(src.size(), delta.size()); - std::vector src1(src.size() - delta.size()); - { - auto end = std::set_difference(src.begin(), src.end(), delta.begin(), delta.end(), src1.begin()); - TM_CHECK(end == src1.end()); - } - src.swap(src1); - - std::vector dst1(dst.size() + delta.size()); - { - auto 
end = std::set_union(dst.begin(), dst.end(), delta.begin(), delta.end(), dst1.begin()); - TM_CHECK(end == dst1.end()); - } - dst.swap(dst1); -} - -auto BlockManager::Allocate(int count) -> std::pair -{ - while (free_ids_.size() < count) { - if (!Malloc()) { - throw std::runtime_error("out of memory"); - } - } - - BlockIds block_ids(count); - UniqueIds unique_ids(count); - - for (int i = 0; i < count; ++i) { - int idx = free_ids_[i]; - auto& b = blocks_[idx]; - TM_CHECK(is_free(b)); // pre-condition: uc == 0 && ts == 0 - b.use_count = 1; - b.unique_id = unique_id_++; - b.timestamp = timestamp_++; - TM_CHECK(is_active(b)); // post-condition - block_ids[i] = idx; - unique_ids[i] = b.unique_id; - } - - Move(free_ids_, block_ids, active_ids_); - - dbg(free_ids_, active_ids_); - - return {block_ids, unique_ids}; -} - -void BlockManager::Evict(int count) -{ - TM_CHECK_LE(count, cached_ids_.size()); - std::vector idxs(cached_ids_); - // get first `count` cached ids according to timestamp - std::nth_element(idxs.begin(), idxs.begin() + count, idxs.end(), [&](int i, int j) { - return blocks_[i].timestamp < blocks_[j].timestamp; - }); - idxs.resize(count); - - // sort the retrieved ids - std::sort(idxs.begin(), idxs.end()); - - // set as free - for (const auto& idx : idxs) { - auto& b = blocks_[idx]; - TM_CHECK(is_cached(b)); // pre-condition - b.unique_id = 0; - b.timestamp = 0; - TM_CHECK(is_free(b)); // post-condition - } - - Move(cached_ids_, idxs, free_ids_); - - dbg(cached_ids_, free_ids_); -} - -void BlockManager::Free(BlockIds ids) -{ - std::sort(ids.begin(), ids.end()); - - for (const auto& i : ids) { - auto& b = blocks_[i]; - TM_CHECK(is_cached(b)); // pre-condition - b.unique_id = 0; - b.timestamp = 0; - TM_CHECK(is_free(b)); // post-condition - } - - Move(cached_ids_, ids, free_ids_); -} - -int BlockManager::Unlock(const BlockIds& ids) -{ - BlockIds unlock; - unlock.reserve(ids.size()); - - for (const auto& i : ids) { - auto& b = blocks_[i]; - 
TM_CHECK(is_active(b)); // pre-condition - if (--b.use_count == 0) { - unlock.push_back(b.id); - TM_CHECK(is_cached(b)); // post-condition - } - } - - std::sort(unlock.begin(), unlock.end()); - - Move(active_ids_, unlock, cached_ids_); - - dbg(active_ids_, cached_ids_); - return unlock.size(); -} - -int BlockManager::Lock(const BlockIds& ids) -{ - BlockIds lock; - lock.reserve(ids.size()); - - for (const auto& i : ids) { - auto& b = blocks_[i]; - if (++b.use_count == 1) { - lock.push_back(i); - TM_CHECK(is_active(b)); // post-condition - } - } - - std::sort(lock.begin(), lock.end()); - - Move(cached_ids_, lock, active_ids_); - - // dbg(cached_ids_, active_ids_); - - return lock.size(); -} - -void BlockManager::Touch(const BlockIds& ids) -{ - std::for_each(ids.crbegin(), ids.crend(), [this](int i) { - TM_CHECK(is_active(blocks_[i])); - blocks_[i].timestamp = timestamp_++; - }); -} - -int BlockManager::Verify(const std::vector& block_ids, const std::vector& unique_ids) -{ - TM_CHECK_EQ(block_ids.size(), unique_ids.size()); - int valid = block_ids.size(); - for (int i = 0; i < block_ids.size(); ++i) { - if (unique_id(block_ids[i]) != unique_ids[i]) { - valid = i; - break; - } - } - int miss = 0; - for (int i = valid; i < block_ids.size(); ++i) { - miss += (unique_id(block_ids[i]) != unique_ids[i]); - } - // All later blocks should have been invalidated - TM_CHECK_EQ(miss, (int)block_ids.size() - valid) - << fmtstr("count = %d, valid = %d, miss = %d", (int)block_ids.size(), valid, miss); - return valid; -} - -Snapshot BlockManager::TakeSnapshot() -{ - std::vector use_count(blocks_.size()); - for (const auto& idx : active_ids_) { - use_count[idx] = blocks_[idx].use_count; - } - return {active_count(), cached_count(), free_count(), std::move(use_count)}; -} - -std::ostream& operator<<(std::ostream& os, const BlockManager& manager) -{ - os << "block_size: " << manager.block_size_ << ", "; - os << "max_block_count: " << manager.max_block_count_ << ", "; - os << 
"chunk_size: " << manager.chunk_size_ << ", "; - os << "chunks: " << manager.chunks_.size() << ", "; - os << "active_ids: " << manager.active_ids_.size() << ", "; - os << "cached_ids: " << manager.cached_ids_.size() << ", "; - os << "free_ids: " << manager.free_ids_.size() << ", "; - os << "blocks: " << manager.blocks_.size() << ", "; - os << "unique_id: " << manager.unique_id_ << ", "; - os << "timestamp: " << manager.timestamp_; - return os; -} - -std::ostream& operator<<(std::ostream& os, const Block& block) -{ - os << "id=" << block.id << ", use_count=" << block.use_count << ", unique_id=" << block.unique_id - << ", timestamp=" << block.timestamp << ", data=" << block.data; - return os; -} - -} // namespace turbomind diff --git a/src/turbomind/models/llama/BlockManager.h b/src/turbomind/models/llama/BlockManager.h deleted file mode 100644 index de2d9e0384..0000000000 --- a/src/turbomind/models/llama/BlockManager.h +++ /dev/null @@ -1,165 +0,0 @@ -// Copyright (c) OpenMMLab. All rights reserved. 
- -#pragma once - -#include "src/turbomind/core/allocator.h" -#include "src/turbomind/core/logger.h" -#include "src/turbomind/models/llama/Barrier.h" -#include "src/turbomind/utils/cuda_utils.h" -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -namespace turbomind { - -// [L, H, S, D] - -// [L, S/x, H, x, D] - -struct Block { - int id; // fixed linear id in the pool - int use_count; // active sequences using the block - uint64_t unique_id; // unique for every block allocation - uint64_t timestamp; - void* data; - - friend std::ostream& operator<<(std::ostream& os, const Block& block); - friend std::string to_string(const Block& b) - { - std::stringstream ss; - ss << b; - return ss.str(); - } -}; - -using BlockIds = std::vector; -using UniqueIds = std::vector; - -inline bool is_active(const Block& block) -{ - // timestamp may be 0 for newly allocated block that has not been written - return block.use_count > 0; -} - -inline bool is_cached(const Block& block) -{ - return block.use_count == 0 && block.timestamp != 0; -} - -inline bool is_free(const Block& block) -{ - return block.use_count == 0 && block.timestamp == 0; -} - -struct Snapshot { - int active; - int cached; - int free; - std::vector use_count; -}; - -using GetFreeMemSize = std::function; - -class BlockManager { -public: - explicit BlockManager( - size_t block_size, double block_count, int chunk_size, core::Allocator allocator, GetFreeMemSize get_free_size); - - ~BlockManager(); - - // free -> active (use_count = 1, ref_count = 1) - [[nodiscard]] std::pair Allocate(int count); - - // cached -> active (use_count += 1) - [[maybe_unused]] int Lock(const BlockIds& ids); - - // active -> cached (use_count -= 1) - [[maybe_unused]] int Unlock(const BlockIds& ids); - - // cached -> free (ref_count = 0) - void Evict(int count); - - // cached -> free (ref_count -= 1) - void Free(BlockIds bs); - - // increase timestamp in reversed order - void Touch(const 
BlockIds& bs); - - [[nodiscard]] int Verify(const BlockIds& block_ids, const UniqueIds& unique_ids); - - Snapshot TakeSnapshot(); - - int max_block_count() const noexcept - { - return max_block_count_; - } - - int total_count() const noexcept - { - return blocks_.size(); - } - - int active_count() const noexcept - { - return active_ids_.size(); - } - - int cached_count() const noexcept - { - return cached_ids_.size(); - } - - int free_count() const noexcept - { - return free_ids_.size(); - } - - Block& block(int idx) - { - return blocks_[idx]; - } - - int unique_id(int idx) - { - return blocks_[idx].unique_id; - } - - friend std::ostream& operator<<(std::ostream& os, const BlockManager&); - -private: - static size_t GetBlockCount(size_t block_size, double ratio, GetFreeMemSize get_free_size); - - // move indices between sets - static void Move(BlockIds& src, const BlockIds& delta, BlockIds& dst); - - // allocate a chunk of blocks - bool Malloc(); - -private: - size_t block_size_; - int max_block_count_{}; - int chunk_size_{}; - - core::Allocator allocator_; - - std::vector chunks_; - - BlockIds active_ids_; - BlockIds cached_ids_; - BlockIds free_ids_; - - std::vector blocks_; // < 100k - - uint64_t unique_id_{1}; - uint64_t timestamp_{1}; -}; - -} // namespace turbomind diff --git a/src/turbomind/models/llama/BlockTrie.cc b/src/turbomind/models/llama/BlockTrie.cc deleted file mode 100644 index 4046741bd2..0000000000 --- a/src/turbomind/models/llama/BlockTrie.cc +++ /dev/null @@ -1,129 +0,0 @@ -// Copyright (c) OpenMMLab. All rights reserved. 
- -#include "src/turbomind/models/llama/BlockTrie.h" -#include "src/turbomind/models/llama/SequenceManager.h" - -namespace turbomind { - -size_t hash(const std::vector& vec) -{ - size_t seed = vec.size(); - for (const auto& i : vec) { - seed ^= std::hash{}(i) + 0x9e3779b9 + (seed << 6) + (seed >> 2); - } - return seed; -} - -BlockTrie::BlockTrie(size_t block_len, std::shared_ptr block_manager): - block_seq_len_(block_len), block_manager_(block_manager) -{ - root_ = std::make_shared(); -} - -std::tuple BlockTrie::Match(const Sequence& seq) -{ - BlockIds block_ids; - UniqueIds unique_ids; - - auto node = root_; - auto first = seq.prompt.begin(); - - // Warning: Do not use "<=" operator even when seq.prompt length is evenly - // divisible by block_seq_len_. The model needs at least one input token to generate output. - while (first + block_seq_len_ < seq.prompt.end()) { - const std::vector segment{first, first + block_seq_len_}; - const size_t hash_key = hash(segment); - if (const auto it = node->children.find(hash_key); it != node->children.end()) { - if (segment == it->second->tokens) { - block_ids.push_back(it->second->block_id); - unique_ids.push_back(it->second->block_unique_id); - node = it->second; - first += block_seq_len_; - } - else { - TM_LOG_WARN("hash collision detected"); - break; - } - } - else { - break; - } - } - - return std::make_tuple(block_ids, unique_ids); -} - -std::tuple BlockTrie::Cache(const Sequence& seq, const std::vector& tokens) -{ - // Ensure the seq is active or locked so that all cache blocks must be valid - TM_CHECK_NE(seq.status, Sequence::kCached); - TM_CHECK_LE(seq.cache_len, seq.blocks.size() * block_seq_len_); - - auto node = root_; - - BlockIds cache_block_ids; - UniqueIds cache_block_unique_ids; - - const int n_blocks = std::min(seq.cache_len, (int)tokens.size()) / block_seq_len_; - - int new_cached = 0; - - for (int idx = 0; idx < n_blocks; ++idx) { - auto start = tokens.begin() + idx * block_seq_len_; - auto end = start + 
block_seq_len_; - - const std::vector segment(start, end); - const size_t hash_key = hash(segment); // TODO(lvhan): add salt to ensure the hash security - - int block_id = seq.blocks[idx]; - uint64_t block_unique_id = seq.block_unique_ids[idx]; - - if (auto it = node->children.find(hash_key); it != node->children.end()) { - if (segment == it->second->tokens) { // fast-forward - node = it->second; - node->block_id = block_id; - node->block_unique_id = block_unique_id; - } - else { - TM_LOG_WARN("Hash collision detected"); - break; - } - } - else { - // insert new node - node = node->children.emplace_hint(it, hash_key, std::make_shared())->second; - node->hash_key = hash_key; - node->tokens = segment; - node->block_id = block_id; - node->block_unique_id = block_unique_id; - new_cached += block_seq_len_; - } - cache_block_ids.emplace_back(block_id); - cache_block_unique_ids.emplace_back(block_unique_id); - } - - TM_LOG_INFO("{} new tokens cached", new_cached); - - return std::make_tuple(cache_block_ids, cache_block_unique_ids); -} - -void BlockTrie::Verify() -{ - DFS(root_); -} - -void BlockTrie::DFS(std::shared_ptr& node) -{ - for (auto it = node->children.begin(); it != node->children.end();) { - if (block_manager_->unique_id(it->second->block_id) != it->second->block_unique_id) { - // child invalid - it = node->children.erase(it); - } - else { - DFS(it->second); - it++; - } - } -} - -} // namespace turbomind diff --git a/src/turbomind/models/llama/BlockTrie.h b/src/turbomind/models/llama/BlockTrie.h deleted file mode 100644 index 75381c3bdd..0000000000 --- a/src/turbomind/models/llama/BlockTrie.h +++ /dev/null @@ -1,74 +0,0 @@ -// Copyright (c) OpenMMLab. All rights reserved. 
- -#pragma once - -#include "src/turbomind/models/llama/BlockManager.h" -#include -#include -#include - -namespace turbomind { - -struct Sequence; - -struct TrieNode { - std::unordered_map> children; - size_t hash_key; - std::vector tokens; - int block_id; - uint64_t block_unique_id; - int num_matched; -}; - -class BlockTrie { -public: - explicit BlockTrie(size_t block_len, std::shared_ptr block_manager); - - /** - * @brief Attempt to match cached key-value (KV) blocks for a given sequence. - * - * This function iterates the tokens of the sequence and attempts - * to match them with the cached KV blocks. If the max prefix match is found, - * it returns the IDs, unique IDs of the matched blocks. - * - * @param seq The sequence whose tokens are to be matched against the cached KV blocks. - * @return A tuple containing the following: - * - BlockIds: A list of IDs of the matched blocks. - * - UniqueIds: A list of unique IDs of the matched blocks. - * - * @note If no blocks are matched, all containers in the returned tuple will be empty. - */ - std::tuple Match(const Sequence& seq); - - /** - * @brief Cache the key-value (KV) blocks of a given sequence. - * - * This function caches the KV blocks of the specified sequence. Only valid blocks - * of a sequence whose status is NOT `Sequence::kCached` are considered - * to be cached - * - * @param seq The sequence whose KV blocks are to be cached. - * @param tokens The token list corresponding to the KV blocks - * @return A tuple containing the following: - * - BlockIds: A list of IDs of the cached blocks. - * - UniqueIds: A list of unique IDs of the cached blocks. 
- */ - std::tuple Cache(const Sequence& seq, const std::vector& tokens); - - /** - * @brief remove invalid nodes - */ - void Verify(); - -private: - void DFS(std::shared_ptr& node); - -private: - size_t block_seq_len_; - - std::shared_ptr block_manager_; - - std::shared_ptr root_; -}; - -} // namespace turbomind diff --git a/src/turbomind/models/llama/CMakeLists.txt b/src/turbomind/models/llama/CMakeLists.txt index 9b7a70e923..0596937a2c 100644 --- a/src/turbomind/models/llama/CMakeLists.txt +++ b/src/turbomind/models/llama/CMakeLists.txt @@ -9,9 +9,6 @@ add_library(Llama STATIC LlamaV2.cc LlamaBatch.cc LlamaLinear.cu - BlockManager.cc - BlockTrie.cc - SequenceManager.cc LlamaWeight.cc LlamaDecoderLayerWeight.cc LlamaFfnLayer.cc diff --git a/src/turbomind/models/llama/GatedDeltaNetLayer.cc b/src/turbomind/models/llama/GatedDeltaNetLayer.cc index 2d0c49d50f..1593cb0c0a 100644 --- a/src/turbomind/models/llama/GatedDeltaNetLayer.cc +++ b/src/turbomind/models/llama/GatedDeltaNetLayer.cc @@ -1,39 +1,110 @@ #include "src/turbomind/models/llama/GatedDeltaNetLayer.h" + +#include +#include + #include "src/turbomind/core/allocator.h" #include "src/turbomind/core/check.h" #include "src/turbomind/core/data_type.h" #include "src/turbomind/core/logger.h" #include "src/turbomind/core/scope.h" -#include "src/turbomind/models/llama/SequenceManager.h" +#include "src/turbomind/engine/block.h" #include "src/turbomind/models/llama/gated_delta_net_kernels.h" #include "src/turbomind/utils/cuda_utils.h" namespace turbomind { -GatedDeltaNetLayer::GatedDeltaNetLayer(DataType state_dtype, - const std::vector& layer_types, - const EngineParam& engine, - const Context& ctx, - int phases): - tp_size_(engine.attn_tp_size), num_linear_layers_(0), state_dtype_(state_dtype), linear_(*ctx.linear) +auto get_lc_state_size(const DeltaNetWeight& weights, int tp) { - layer_types_ = layer_types; - for (auto t : layer_types_) { - if (t == 1) - ++num_linear_layers_; - } + int num_k_heads = 
weights.num_k_heads / tp; + int num_v_heads = weights.num_v_heads / tp; + int key_head_dim = weights.key_head_dim; + int value_head_dim = weights.value_head_dim; + int d_conv = weights.d_conv; + int key_dim = num_k_heads * key_head_dim; + int value_dim = num_v_heads * value_head_dim; + int conv_dim = key_dim * 2 + value_dim; + return std::make_pair(num_v_heads * key_head_dim * value_head_dim, conv_dim * d_conv); +} - if (num_linear_layers_ > 0) { - conv_state_ptrs_buf_ = {engine.max_batch_size, kCPUpinned}; - recurrent_state_ptrs_buf_ = {engine.max_batch_size, kCPUpinned}; +GatedDeltaNetLayer::GatedDeltaNetLayer(std::vector weights, + CacheRegistry& registry, + const EngineParam& engine, + const Context& context, + int phases): + tp_size_{engine.attn_tp_size}, state_dtype_{engine.data_type}, linear_{*context.linear} +{ + TM_CHECK(!weights.empty()); + layer_num_ = static_cast(weights.size()); + + const auto [l_state_size, c_state_size] = get_lc_state_size(*weights[0], tp_size_); + + const int num_v_heads = weights[0]->num_v_heads / tp_size_; + const int cell_elems = + weights[0]->key_head_dim * weights[0]->value_head_dim; // one (layer, head) state, in elements + TM_CHECK_EQ(l_state_size, num_v_heads * cell_elems); // sanity: get_lc_state_size agrees + + // Block unit (L_b layers x H_b v_heads). Unset env => one part per layer, + // no head-grouping == today's behavior. + int L_b = 1; + int H_b = num_v_heads; + if (const char* e = std::getenv("TM_GDN_BLOCK_CONFIG")) { + TM_CHECK_EQ(std::sscanf(e, "%d,%d", &L_b, &H_b), 2) << "expected TM_GDN_BLOCK_CONFIG=l,h (e.g. 
4,16)"; } + TM_CHECK_GT(L_b, 0); + TM_CHECK_GT(H_b, 0); + + auto cdiv_i = [](int a, int b) { return (a + b - 1) / b; }; + layers_per_block_ = L_b; + heads_per_block_ = H_b; + num_head_groups_ = cdiv_i(num_v_heads, H_b); // == 1 when H_b >= num_v_heads + num_layer_groups_ = cdiv_i(layer_num_, L_b); // == layer_num_ when L_b == 1 + num_blocks_ = num_layer_groups_ * num_head_groups_; + block_bytes_ = byte_size(state_dtype_, (size_t)L_b * H_b * cell_elems); + + // recurrent: num_blocks_ uniform parts, base part id == 1 + rec_base_ = registry.checkpoint().Register({{block_bytes_, 1, static_cast(num_blocks_)}}); + + // conv: accumulation -> part 0; ELEMENT offsets kept exactly as today. + size_t off = 0; + for (int i = 0; i < layer_num_; ++i) { + weights[i]->conv_state_offset = off; + off += c_state_size; + } + conv_total_bytes_ = byte_size(state_dtype_, off); + registry.checkpoint().Register(conv_total_bytes_, /*alignment=*/1); // reserves part 0 + + // Visibility: slot-level interchange with the prefix object requires + // block_bytes_ == prefix object bytes. Not enforced (optimal sizing is + // out of scope); just log. Attention registers prefix before this ctor. + const size_t prefix_bytes = registry.prefix().accumulation_bytes(); + // Logger is fmtlib-style ({} placeholders), matching slab.h's TM_LOG_WARN. + TM_LOG_INFO("[GDN] block config L_b={} H_b={} -> num_layer_groups={} num_head_groups={} " + "num_blocks={} block_bytes={} prefix_object_bytes={} ({})", + L_b, + H_b, + num_layer_groups_, + num_head_groups_, + num_blocks_, + block_bytes_, + prefix_bytes, + (prefix_bytes != 0 && block_bytes_ == prefix_bytes) ? "slab-shared" : "separate-slab-class"); + + for (int L = 0; L < layer_num_; ++L) { + // in-block row offset (elements) for this layer within its block-row; + // == 0 when L_b == 1 (today's behavior). 
+ weights[L]->linear_state_offset = (L % L_b) * H_b * cell_elems; + layer_index_[weights[L]] = L; // weight ptr -> GDN-local layer index + } + + // Staging buffers: conv stays [batch]; recurrent becomes a [layer_group][batch][head_group] table. + conv_state_ptrs_buf_ = {engine.max_batch_size, kCPUpinned}; + recurrent_state_ptrs_buf_ = {(ssize_t)num_layer_groups_ * engine.max_batch_size * num_head_groups_, kCPUpinned}; for (int i = 0; i < phases; ++i) { data_.emplace_back(); - if (num_linear_layers_ > 0) { - data_.at(i).conv_state_ptrs = empty_like(conv_state_ptrs_buf_, kDEVICE); - data_.at(i).recurrent_state_ptrs = empty_like(recurrent_state_ptrs_buf_, kDEVICE); - } + data_.at(i).conv_state_ptrs = empty_like(conv_state_ptrs_buf_, kDEVICE); + data_.at(i).recurrent_state_ptrs = empty_like(recurrent_state_ptrs_buf_, kDEVICE); } int device = 0; @@ -56,7 +127,7 @@ GatedDeltaNetLayer::~GatedDeltaNetLayer() void GatedDeltaNetLayer::Run(BatchOp op, int phase, TensorMap& env) { if (op == BatchOp::kAdd) { - Buffer_ rc = env.at("requests").buffer(); + Buffer_ rc = env.at("requests").buffer(); for (int i = 0; i < rc.size(); ++i) {} } else if (op == BatchOp::kSetup) { @@ -66,57 +137,58 @@ void GatedDeltaNetLayer::Run(BatchOp op, int phase, TensorMap& env) auto& d = data_.at(phase); d.q_offsets = env.at("q_offsets").buffer().borrow(); d.k_offsets = env.at("k_offsets").buffer().borrow(); + d.finished = env.at("finished").buffer().borrow(); + for (const auto& [ptr, bytes] : d.reset_ptrs) { + Clear(Buffer_{ptr, static_cast(bytes), kDEVICE}); + } + d.reset_ptrs.clear(); } } void GatedDeltaNetLayer::Setup(int phase, TensorMap& env) { - auto& d = data_.at(phase); - const auto& b = *env.at("batch").data()[0]; + auto& d = data_.at(phase); - d.batch_size = b.rc.size(); - d.rc.resize(d.batch_size); + Buffer_ rc = env.at("requests").buffer(); + + d.batch_size = rc.size(); d.input_lens.resize(d.batch_size); + d.reset_ptrs.clear(); - d.conv_states.resize(d.batch_size); - 
d.recurrent_states.resize(d.batch_size); + const auto& c_pool = *env.at("cache_block_pool").data()[0]; for (int i = 0; i < d.batch_size; ++i) { - d.rc[i] = b.rc[i].get(); - d.input_lens[i] = b.rc[i]->input_len; - - auto& s = *b.rc[i]->seq; - TM_CHECK(s.conv_states && s.recurrent_states) - << "Linear-attention state slot is not bound for sequence " << s.id; - if (s.linear_states_need_reset) { - // Reset newly assigned pooled slot state on first use. Keep GPU-side - // state initialization out of SequenceManager. - Clear(s.conv_states); - Clear(s.recurrent_states); - s.linear_states_need_reset = false; + auto& s = *rc[i]; + d.input_lens[i] = s.input_len; + + const auto& cb = c_pool[s.frontier_cache_id]; + TM_CHECK_NOTNULL(cb.allocation.a); + + conv_state_ptrs_buf_[i] = cb.base(0); // conv accumulation part + // One pointer per (layer-group, head-group) == per recurrent part; the L_b + // layers of a block-row share this base (differ only by linear_state_offset). + for (int lg = 0; lg < num_layer_groups_; ++lg) { + for (int hg = 0; hg < num_head_groups_; ++hg) { + const int part = rec_base_ + lg * num_head_groups_ + hg; + recurrent_state_ptrs_buf_[(lg * d.batch_size + i) * num_head_groups_ + hg] = cb.base(part); + } } - // Linear-attention requests are restricted to stateless execution, so - // the sequence-owned states can be passed directly here. - d.conv_states[i] = s.conv_states; - d.recurrent_states[i] = s.recurrent_states; - - conv_state_ptrs_buf_[i] = d.conv_states[i].raw_data(); - recurrent_state_ptrs_buf_[i] = d.recurrent_states[i].raw_data(); + // The forward for this batch starts at history_len + inflight_input_len. + // Reset only when the true start position is 0; clear every part + // (including any rounding padding -- harmless, never read by kernels). 
+ if (s.history_len + s.inflight_input_len == 0) { + d.reset_ptrs.push_back({reinterpret_cast(cb.base(0)), conv_total_bytes_}); + for (int blk = 0; blk < num_blocks_; ++blk) { + d.reset_ptrs.push_back({reinterpret_cast(cb.base(rec_base_ + blk)), block_bytes_}); + } + } } Copy(conv_state_ptrs_buf_, d.batch_size, d.conv_state_ptrs); - Copy(recurrent_state_ptrs_buf_, d.batch_size, d.recurrent_state_ptrs); -} - -static int linear_layer_index(int layer_id, const std::vector& layer_types) -{ - int idx = 0; - for (int i = 0; i < layer_id && i < (int)layer_types.size(); ++i) { - if (layer_types[i] == 1) - ++idx; - } - return idx; + Copy(recurrent_state_ptrs_buf_, + (ssize_t)num_layer_groups_ * d.batch_size * num_head_groups_, + d.recurrent_state_ptrs); } void GatedDeltaNetLayer::Forward(ForwardParam p) @@ -190,10 +262,6 @@ void GatedDeltaNetLayer::Forward(ForwardParam p) Tensor attn_out{{token_num, value_dim}, dtype, device}; Tensor conv_out{{token_num, conv_dim}, dtype, device}; - const int state_layer_idx = linear_layer_index(p.layer_id, layer_types_); - const int conv_state_layer_offset = state_layer_idx * (conv_dim * d_conv); - const int recurrent_state_layer_offset = state_layer_idx * (num_v_heads * key_head_dim * value_head_dim); - // ----- 3a. Fused Causal Conv1d + SiLU (all requests) ----- // all_proj carries the non-contiguous qkv slice (stride = all_col); // in_stride is derived from all_proj.stride(0) inside the launcher. @@ -204,8 +272,9 @@ void GatedDeltaNetLayer::Forward(ForwardParam p) pd.conv_state_ptrs, pd.q_offsets, pd.k_offsets, + pd.finished, pd.batch_size, - conv_state_layer_offset, + weights.conv_state_offset, sm_count_, work_counter_.data(), stream); @@ -215,6 +284,9 @@ void GatedDeltaNetLayer::Forward(ForwardParam p) // Requests are sorted by input_len: decode (seq_len==1) first, prefill last. // Find the split point and dispatch each half to its optimal kernel. // When both are present, run them concurrently on separate streams. 
+ const int lg = layer_index_.at(p.weights) / layers_per_block_; // layer-group (block row) + auto layer_rec = + pd.recurrent_state_ptrs.slice(lg * pd.batch_size * num_head_groups_, pd.batch_size * num_head_groups_); { int decode_count = 0; for (int i = 0; i < pd.batch_size; ++i) { @@ -231,76 +303,92 @@ void GatedDeltaNetLayer::Forward(ForwardParam p) TM_CUDA_CHECK(cudaStreamWaitEvent(aux_stream_, ev_before_)); // Decode on main stream - auto dc_state = pd.recurrent_state_ptrs.slice(0, decode_count); + auto dc_state = layer_rec.slice(0, decode_count * num_head_groups_); auto dc_q = pd.q_offsets.slice(0, decode_count + 1); + auto dc_done = pd.finished.slice(0, decode_count); invokeGatedDeltaRuleBatched_v3(attn_out, conv_out, beta, g, dc_state, dc_q, + dc_done, decode_count, num_k_heads, - recurrent_state_layer_offset, + weights.linear_state_offset, state_dtype_, sm_count_, work_counter_.data(), - stream); + stream, + num_head_groups_, + heads_per_block_); // Prefill on aux stream (higher priority) - auto pf_state = pd.recurrent_state_ptrs.slice(decode_count, prefill_count); + auto pf_state = layer_rec.slice(decode_count * num_head_groups_, prefill_count * num_head_groups_); auto pf_q = pd.q_offsets.slice(decode_count, prefill_count + 1); + auto pf_done = pd.finished.slice(decode_count, prefill_count); invokeChunkedGatedDeltaRuleBatched(attn_out, conv_out, beta, g, pf_state, pf_q, + pf_done, prefill_count, num_k_heads, - recurrent_state_layer_offset, + weights.linear_state_offset, state_dtype_, sm_count_, work_counter_.data(), - aux_stream_); + aux_stream_, + num_head_groups_, + heads_per_block_); // Join: main stream waits for prefill to finish TM_CUDA_CHECK(cudaEventRecord(ev_after_, aux_stream_)); TM_CUDA_CHECK(cudaStreamWaitEvent(stream, ev_after_)); } else if (decode_count > 0) { - auto state_slice = pd.recurrent_state_ptrs.slice(0, decode_count); + auto state_slice = layer_rec.slice(0, decode_count * num_head_groups_); auto q_slice = pd.q_offsets.slice(0, 
decode_count + 1); + auto done_slice = pd.finished.slice(0, decode_count); invokeGatedDeltaRuleBatched_v3(attn_out, conv_out, beta, g, state_slice, q_slice, + done_slice, decode_count, num_k_heads, - recurrent_state_layer_offset, + weights.linear_state_offset, state_dtype_, sm_count_, work_counter_.data(), - stream); + stream, + num_head_groups_, + heads_per_block_); } else if (prefill_count > 0) { - auto state_slice = pd.recurrent_state_ptrs.slice(decode_count, prefill_count); + auto state_slice = layer_rec.slice(decode_count * num_head_groups_, prefill_count * num_head_groups_); auto q_slice = pd.q_offsets.slice(decode_count, prefill_count + 1); + auto done_slice = pd.finished.slice(decode_count, prefill_count); invokeChunkedGatedDeltaRuleBatched(attn_out, conv_out, beta, g, state_slice, q_slice, + done_slice, prefill_count, num_k_heads, - recurrent_state_layer_offset, + weights.linear_state_offset, state_dtype_, sm_count_, work_counter_.data(), - stream); + stream, + num_head_groups_, + heads_per_block_); // invokeChunkedGatedDeltaRuleBatched } } diff --git a/src/turbomind/models/llama/GatedDeltaNetLayer.h b/src/turbomind/models/llama/GatedDeltaNetLayer.h index bb6b1e2c9a..18a25e0880 100644 --- a/src/turbomind/models/llama/GatedDeltaNetLayer.h +++ b/src/turbomind/models/llama/GatedDeltaNetLayer.h @@ -1,7 +1,12 @@ #pragma once +#include +#include +#include + #include "src/turbomind/core/tensor.h" #include "src/turbomind/engine/batch.h" +#include "src/turbomind/engine/cache_registry.h" #include "src/turbomind/models/delta_net_weight.h" #include "src/turbomind/models/llama/LlamaLinear.h" #include "src/turbomind/models/llama/context.h" @@ -16,14 +21,14 @@ class GatedDeltaNetLayer { Tensor input; Tensor output; const DeltaNetWeight* weights; - int layer_id; + // int layer_id; }; - GatedDeltaNetLayer(DataType state_dtype, - const std::vector& layer_types, - const EngineParam& engine, - const Context& ctx, - int phases); + GatedDeltaNetLayer(std::vector weights, + 
CacheRegistry& registry, + const EngineParam& engine, + const Context& context, + int phases); ~GatedDeltaNetLayer(); @@ -35,27 +40,36 @@ class GatedDeltaNetLayer { void Setup(int phase, TensorMap& env); // Config passed at construction - int tp_size_; - int num_linear_layers_; - std::vector layer_types_; - DataType state_dtype_; + const int tp_size_; + const DataType state_dtype_; LlamaLinear& linear_; // Per-phase batch data (mirrors UnifiedAttentionLayer pattern) struct Data { - std::vector rc; - std::vector input_lens; - int batch_size = 0; - Buffer_ q_offsets; - Buffer_ k_offsets; - std::vector conv_states; - std::vector recurrent_states; - Buffer_ conv_state_ptrs; - Buffer_ recurrent_state_ptrs; + std::vector input_lens; + int batch_size = 0; + std::vector> reset_ptrs; // (frontier base, bytes) to clear + Buffer_ q_offsets; + Buffer_ k_offsets; + Buffer_ finished; + Buffer_ conv_state_ptrs; + Buffer_ recurrent_state_ptrs; }; std::vector data_; + int layer_num_{}; // == weights.size() + int rec_base_{}; // composite part id of layer 0's recurrent state (== 1) + int layers_per_block_{}; // L_b + int heads_per_block_{}; // H_b + int num_head_groups_{}; // ceil(num_v_heads / H_b) + int num_layer_groups_{}; // ceil(layer_num_ / L_b) + int num_blocks_{}; // num_layer_groups_ * num_head_groups_ + size_t block_bytes_{}; // one recurrent block's bytes (one composite part) + size_t conv_total_bytes_{}; // accumulated conv-state bytes (part 0) + + std::unordered_map layer_index_; // weight ptr -> GDN-local layer index + // staging buffers Buffer_ conv_state_ptrs_buf_; Buffer_ recurrent_state_ptrs_buf_; diff --git a/src/turbomind/models/llama/SequenceManager.cc b/src/turbomind/models/llama/SequenceManager.cc deleted file mode 100644 index 1693699555..0000000000 --- a/src/turbomind/models/llama/SequenceManager.cc +++ /dev/null @@ -1,759 +0,0 @@ -// Copyright (c) OpenMMLab. All rights reserved. 
- -#include -#include -#include -#include - -#include "src/turbomind/core/logger.h" -#include "src/turbomind/kernels/attention/block.h" -#include "src/turbomind/models/llama/BlockManager.h" -#include "src/turbomind/models/llama/SequenceManager.h" - -// #include "dbg.h" - -namespace turbomind { - -template -std::string vector2string(const std::vector& data) -{ - if (data.empty()) { - return "nil"; - } - std::stringstream ss; - - auto it = data.begin(); - ss << *it; - - for (++it; it != data.end(); ++it) { - ss << ", " << *it; - } - return ss.str(); -} - -SequenceManager::SequenceManager(int head_dim, - int kv_head_num, - int num_layer, - const std::vector& layer_types, - int quant_policy, - DataType data_type, - DataType runtime_dtype, - int linear_key_head_dim, - int linear_value_head_dim, - int linear_conv_kernel_dim, - int linear_num_key_heads, - int linear_num_value_heads, - int cache_block_seq_len, - int attn_tp_size, - int max_batch_size, - double block_count, - int chunk_size, - bool enable_prefix_caching, - int rank, - int attn_cp_size, - core::Allocator allocator, - GetFreeMemSize get_free_size): - block_seq_len_(cache_block_seq_len), rank_(rank), attn_cp_size_(attn_cp_size) -{ - TM_CHECK_GT(attn_tp_size, 0); - TM_CHECK_GT(cache_block_seq_len, 0); - - int cache_layer_num = num_layer; - int num_linear_layers = 0; - for (const auto& type : layer_types) { - if (type == 1) { - --cache_layer_num; - ++num_linear_layers; - } - } - - const size_t free_before = (block_count < 1. && num_linear_layers > 0) ? get_free_size() : 0; - - if (num_linear_layers > 0) { - - const int key_head_dim = linear_key_head_dim > 0 ? linear_key_head_dim : head_dim; - const int value_head_dim = linear_value_head_dim > 0 ? linear_value_head_dim : head_dim; - const int d_conv = linear_conv_kernel_dim > 0 ? 
linear_conv_kernel_dim : 4; - const int num_k_heads = linear_num_key_heads / attn_tp_size; - const int num_v_heads = linear_num_value_heads / attn_tp_size; - const int key_dim = num_k_heads * key_head_dim; - const int value_dim = num_v_heads * value_head_dim; - const int conv_dim = key_dim * 2 + value_dim; - - TM_CHECK_GT(max_batch_size, 0); - pooled_conv_states_ = {{max_batch_size, num_linear_layers, d_conv, conv_dim}, data_type, kDEVICE}; - pooled_recurrent_states_ = { - {max_batch_size, num_linear_layers, num_v_heads, key_head_dim, value_head_dim}, data_type, kDEVICE}; - - free_linear_state_slots_.reserve(max_batch_size); - for (int slot = max_batch_size - 1; slot >= 0; --slot) { - free_linear_state_slots_.push_back(slot); - } - TM_LOG_INFO("[SeqMgr] linear-state slot pool initialized: {} slots", max_batch_size); - const auto conv_one = pooled_conv_states_.slice(0, 1).squeeze(0); - const auto recurrent_one = pooled_recurrent_states_.slice(0, 1).squeeze(0); - const double mb = 1.0 / (1024.0 * 1024.0); - TM_LOG_INFO("[SeqMgr] linear-state per slot: conv {:.2f} MB + recurrent {:.2f} MB = {:.2f} MB", - conv_one.byte_size() * mb, - recurrent_one.byte_size() * mb, - (conv_one.byte_size() + recurrent_one.byte_size()) * mb); - TM_LOG_INFO("[SeqMgr] linear-state combined total: {:.2f} MB", - (pooled_conv_states_.byte_size() + pooled_recurrent_states_.byte_size()) * mb); - } - - const int dbits = byte_size(runtime_dtype, 8); - const int elem_bits = quant_policy ? quant_policy : dbits; - - BlockConfig block_config{ - head_dim, - kv_head_num, - cache_block_seq_len, - elem_bits == dbits ? 0 : dbits, - elem_bits, - head_dim == 576, // share kv - }; - - block::Layout layout{block_config}; - // dump(layout); - - size_t block_size = layout.block_size(cache_layer_num); - - if (num_linear_layers > 0 && block_count < 1.) 
{ - const size_t linear_bytes = pooled_conv_states_.byte_size() + pooled_recurrent_states_.byte_size(); - const size_t target_bytes = static_cast(free_before * block_count); - TM_LOG_INFO("[SeqMgr] Adjusting block_count: free_before {:.2f} MB, linear {:.2f} MB, target {:.2f} MB", - free_before / (1024. * 1024.), - linear_bytes / (1024. * 1024.), - target_bytes / (1024. * 1024.)); - if (target_bytes <= linear_bytes) { - TM_LOG_ERROR("[SeqMgr] Linear-state memory ({:.2f} MB) >= cache budget ({:.2f} MB). ", - linear_bytes / (1024. * 1024.), - target_bytes / (1024. * 1024.)); - TM_LOG_FATAL( - "Please decrease max_batch_size to reduce total linear state size or increase cache_max_entry_count."); - } - const size_t cache_bytes = target_bytes - linear_bytes; - block_count = static_cast(cache_bytes) / static_cast(block_size); - TM_LOG_INFO("[SeqMgr] Adjusted block_count to {:.0f}", block_count); - } - - block_manager_ = std::make_shared(block_size, block_count, chunk_size, allocator, get_free_size); - - if (enable_prefix_caching) { - block_trie_ = std::make_shared(block_config.block_len_, block_manager_); - } - TM_LOG_WARN("prefix caching is {}", enable_prefix_caching ? 
"enabled" : "disabled"); -} - -const Sequence* SequenceManager::Create(uint64_t id) -{ - Sequence sequence{id}; - auto it = sequences_.find(id); - if (it != sequences_.end()) { - if (rank_ == 0) { - TM_LOG_WARN("Removing conflicting ID {}", id); - } - Erase(it); - } - it = sequences_.emplace_hint(it, id, std::move(sequence)); - if (rank_ == 0) { - TM_LOG_INFO("ID {}", id); - } - return &it->second; -} - -const Sequence* SequenceManager::Get(uint64_t id) -{ - if (auto it = sequences_.find(id); it != sequences_.end()) { - return &it->second; - } - return nullptr; -} - -bool SequenceManager::Contains(uint64_t id) -{ - return sequences_.find(id) != sequences_.end(); -} - -void SequenceManager::Erase(std::map::iterator& it) -{ - auto& seq = it->second; - if (seq.status == Sequence::kCached) { - const int count = block_manager_->Verify(seq.blocks, seq.block_unique_ids); - seq.blocks.resize(count); - } - else { - UpdateAndSetUnlock(seq); - } - // if prefix cache enabled, blocks will be shared by sequences, cannot be freed immediately - if (!block_trie_) { - freed_.insert(freed_.end(), seq.blocks.begin(), seq.blocks.end()); - } - ReleaseLinearStateSlot(seq); - it = sequences_.erase(it); -} - -bool SequenceManager::Erase(uint64_t id) -{ - if (auto it = sequences_.find(id); it != sequences_.end()) { - Erase(it); - return true; - } - return false; -} - -void SequenceManager::AcquireLinearStateSlot(const Sequence& sequence) -{ - if (!pooled_recurrent_states_) { - return; - } - - auto& seq = const_cast(sequence); - - auto slot_it = seq_to_linear_state_slot_.find(seq.id); - if (slot_it != seq_to_linear_state_slot_.end()) { - const int slot = slot_it->second; - seq.conv_states = pooled_conv_states_.slice(slot).squeeze(0); - seq.recurrent_states = pooled_recurrent_states_.slice(slot).squeeze(0); - return; - } - - TM_CHECK(!free_linear_state_slots_.empty()) << "No free linear-state slot for sequence " << seq.id - << ", max_batch_size=" << pooled_recurrent_states_.shape(0); - - 
const int slot = free_linear_state_slots_.back(); - free_linear_state_slots_.pop_back(); - seq_to_linear_state_slot_.emplace(seq.id, slot); - - seq.conv_states = pooled_conv_states_.slice(slot).squeeze(0); - seq.recurrent_states = pooled_recurrent_states_.slice(slot).squeeze(0); - seq.linear_states_need_reset = true; -} - -void SequenceManager::ReleaseLinearStateSlot(const Sequence& sequence) -{ - if (!pooled_recurrent_states_) { - return; - } - - auto& seq = const_cast(sequence); - - if (auto slot_it = seq_to_linear_state_slot_.find(seq.id); slot_it != seq_to_linear_state_slot_.end()) { - free_linear_state_slots_.push_back(slot_it->second); - seq_to_linear_state_slot_.erase(slot_it); - } - seq.conv_states = {}; - seq.recurrent_states = {}; - seq.linear_states_need_reset = false; -} - -void SequenceManager::InvalidateStatesAndCache(const Sequence& sequence) -{ - InvalidateStatesAndCache(sequence, freed_); -} - -void SequenceManager::InvalidateStatesAndCache(const Sequence& sequence, BlockIds& freed_blocks) -{ - auto& seq = const_cast(sequence); - if (seq.status != Sequence::kCached) { - UpdateAndSetUnlock(seq); - } - freed_blocks.insert(freed_blocks.end(), seq.blocks.begin(), seq.blocks.end()); - - seq.blocks.clear(); - seq.block_unique_ids.clear(); - seq.input_length = 0; - seq.cache_len = 0; - ReleaseLinearStateSlot(seq); -} - -void SequenceManager::CachePrompt(const Sequences& sequences, int active_size) -{ - if (!block_trie_) { - return; - } - - for (int i = 0; i < active_size; ++i) { - if (auto& seq = *sequences[i]; !seq.prompt.empty()) { - const auto& [block_ids, unique_ids] = block_trie_->Cache(seq, seq.prompt); - if (rank_ == 0) { - // clang-format off - TM_LOG_INFO("ID {}, cached blocks {}, tokens {}", seq.id, - (int)block_ids.size(), (int)seq.prompt.size()); - TM_LOG_DEBUG("ID {}, cached block_ids {}, unique_ids {}", seq.id, - vector2string(block_ids), vector2string(unique_ids)); - // clang-format on - } - if (seq.cache_len >= seq.prompt.size()) { - 
seq.prompt.clear(); - } - } - } -} - -void SequenceManager::CacheGeneration(const Sequence& seq) -{ - if (!block_trie_) { - return; - } - - const auto& [block_ids, unique_ids] = block_trie_->Cache(seq, seq.tokens); - - if (rank_ == 0) { - // clang-format off - TM_LOG_INFO("ID {}, cached blocks {}, tokens {}", - seq.id, (int)block_ids.size(), (int)seq.tokens.size()); - TM_LOG_DEBUG("ID {}, cached block_ids {}, unique_ids {}", seq.id, - vector2string(block_ids), vector2string(unique_ids)); - // clang-format on - } -} - -void SequenceManager::VerifyAndLockCached(const Sequences& sequences) -{ - BlockIds valid_blocks; - BlockIds freed_blocks; - for (const auto& p : sequences) { - auto& seq = const_cast(*p); - if (seq.status != Sequence::kCached) { - continue; - } - TM_CHECK_EQ(seq.blocks.size(), seq.block_unique_ids.size()); - // Verify cache blocks that may be invalidated - const int original_count = seq.blocks.size(); - const int count = block_manager_->Verify(seq.blocks, seq.block_unique_ids); - seq.blocks.resize(count); - seq.block_unique_ids.resize(count); - - const bool has_linear_states = static_cast(seq.recurrent_states); - if (has_linear_states && count < original_count) { - InvalidateStatesAndCache(seq, freed_blocks); - // This request can still continue in the current scheduling round. - // Rebind a slot immediately so GatedDeltaNetLayer::Setup always sees - // valid linear-state views. 
- AcquireLinearStateSlot(seq); - continue; - } - - valid_blocks.insert(valid_blocks.end(), seq.blocks.begin(), seq.blocks.end()); - seq.cache_len = std::min(seq.cache_len, seq.blocks.size() * block_seq_len_); - seq.status = Sequence::kLocked; - } - if (!freed_blocks.empty()) { - block_manager_->Free(freed_blocks); - } - block_manager_->Lock(valid_blocks); -} - -void SequenceManager::CommitUnlockAndFree() -{ - if (!unlocked_.empty()) { - block_manager_->Unlock(unlocked_); - unlocked_.clear(); - } - - if (!freed_.empty()) { - block_manager_->Free(freed_); - freed_.clear(); - } -} - -void SequenceManager::UpdateAndSetUnlock(const Sequence& sequence) -{ - TM_CHECK_NE(sequence.status, Sequence::kCached); - auto& seq = const_cast(sequence); - block_manager_->Touch(seq.blocks); - unlocked_.insert(unlocked_.end(), seq.blocks.begin(), seq.blocks.end()); - seq.status = Sequence::kCached; -} - -namespace { - -struct Schedule { - int free; - int cached; - - int allocate{}; - int evict{}; - int preempt{}; - - int last; - - int max_fwd_tokens; - int max_tmp_tokens; - - Sequences active; - std::vector block_counts; - Sequences inactive; - Sequences victims; - - Schedule(Snapshot snapshot, int size, int max_fwd_tokens, int max_tmp_tokens): - free{snapshot.free}, - cached{snapshot.cached}, - last{size}, - max_fwd_tokens{max_fwd_tokens}, - max_tmp_tokens{max_tmp_tokens}, - use_count_{std::move(snapshot.use_count)}, - unlocked_(size), // ! 
This is a vector, DO NOT brace initialize it - it_{size} - { - } - - int Unlock(const Sequences& seqs, int vidx) - { - while (vidx < it_) { - const auto& blocks = seqs[--it_]->blocks; - int count = 0; - for (const auto& bid : blocks) { - count += static_cast(--use_count_[bid] == 0); - } - unlocked_[it_] = count; - } - return unlocked_[vidx]; - } - -private: - std::vector use_count_; - std::vector unlocked_; - int it_; -}; - -template -std::ostream& operator<<(std::ostream& os, const std::vector& v) -{ - os << "["; - for (int i = 0; i < v.size(); ++i) { - os << (i ? "," : "") << v[i]; - } - os << "]"; - return os; -} - -std::ostream& operator<<(std::ostream& os, const Schedule& s) -{ - os << "free=" << s.free << ", cached=" << s.cached << ", allocate=" << s.allocate << ", evict=" << s.evict - << ", preempt=" << s.preempt << ", active=" << s.active << ", victims=" << s.victims - << ", block_counts=" << s.block_counts << ", inactive=" << s.inactive; - return os; -} - -struct Transaction { - int index_; - int block_count_; - int input_len_; - int temp_len_; - - int allocate_{}; - int evict_{}; - int preempt_{}; - - Sequences victims_; - - const Sequences& sequences_; - Schedule& schedule_; - - explicit Transaction( - const Sequences& sequences, int index, int block_count, int input_len, int temp_len, Schedule& sched): - index_{index}, - block_count_{block_count}, - input_len_{input_len}, - temp_len_{temp_len}, - sequences_{sequences}, - schedule_{sched} - { - } - - void Process() - { - if (schedule_.max_fwd_tokens > 0 && schedule_.max_tmp_tokens >= temp_len_) { - int count = block_count_; - - int tmp = std::min(schedule_.free, count); - count -= tmp; - allocate_ += tmp; - - tmp = std::min(schedule_.cached, count); - count -= tmp; - evict_ += tmp; - - for (int vidx = schedule_.last - 1; count && vidx > index_; --vidx) { - if (sequences_[vidx]->status == Sequence::kCached) { - continue; - } - victims_.push_back(sequences_[vidx]); - preempt_ += 
schedule_.Unlock(sequences_, vidx); - - if (count <= preempt_) { - evict_ += count; - count -= count; - schedule_.last = vidx; // ! modifiying `sched_.last` is part of commit - break; - } - } - if (count == 0) { - return Commit(); - } - } - - const_cast(sequences_[index_])->input_length = 0; - schedule_.inactive.push_back(sequences_[index_]); - } - - void Commit() - { - // update available resources - schedule_.free -= allocate_; - TM_CHECK_GE(schedule_.free, 0); - schedule_.cached += preempt_; - schedule_.cached -= evict_; - TM_CHECK_GE(schedule_.cached, 0); - - // update scheduled operations - schedule_.allocate += allocate_; - schedule_.evict += evict_; - schedule_.preempt += preempt_; - schedule_.victims.insert(schedule_.victims.end(), victims_.begin(), victims_.end()); - - // update active sequences - schedule_.active.push_back(sequences_[index_]); - schedule_.block_counts.push_back(block_count_); - - input_len_ = std::min(input_len_, schedule_.max_fwd_tokens); - schedule_.max_fwd_tokens -= input_len_; - const_cast(sequences_[index_])->input_length = input_len_; - - schedule_.max_tmp_tokens -= temp_len_; - } -}; - -std::ostream& operator<<(std::ostream& os, const Transaction& trans) -{ - os << "index=" << trans.index_ << ", block_count=" << trans.block_count_ << ", allocate=" << trans.allocate_ - << ", evict=" << trans.evict_ << ", preempt=" << trans.preempt_ << ", victims=" << trans.victims_; - return os; -} - -} // namespace - -template -static void SortByKey(const std::vector& keys, std::vector&... 
vals) -{ - std::vector idxs(keys.size()); - std::iota(idxs.begin(), idxs.end(), 0); - std::sort(idxs.begin(), idxs.end(), [&](int i, int j) { return keys[i] < keys[j]; }); - auto reorder = [&](auto& xs) { - std::remove_reference_t ys(xs.size()); - for (size_t i = 0; i < xs.size(); ++i) { - ys[i] = xs[idxs[i]]; - } - xs.swap(ys); - }; - (reorder(vals), ...); -} - -std::vector SequenceManager::CountRequiredBlocks(const Sequences& sequences, - const std::vector& context_length) -{ - std::vector required(sequences.size()); - for (int i = 0; i < sequences.size(); ++i) { - int length = (context_length[i] + attn_cp_size_ - 1) / attn_cp_size_; - int count = (length + block_seq_len_ - 1) / block_seq_len_ - static_cast(sequences[i]->blocks.size()); - required[i] = std::max(0, count); - } - return required; -} - -void SequenceManager::AssignAndActivate(const Sequences& sequences, // - const std::vector& counts, - const BlockIds& blocks, - const UniqueIds& unique_ids) -{ - TM_CHECK_EQ(sequences.size(), counts.size()); - int first = 0; - for (int i = 0; i < sequences.size(); ++i) { - auto& s = const_cast(*sequences[i]); - auto count = counts[i]; - int last = first + count; - TM_CHECK_LE(last, blocks.size()); - s.blocks.insert(s.blocks.end(), blocks.begin() + first, blocks.begin() + last); - s.block_unique_ids.insert(s.block_unique_ids.end(), unique_ids.begin() + first, unique_ids.begin() + last); - s.status = Sequence::kActive; - first = last; - } -} - -void SequenceManager::PrefixMatch(Sequences& sequences, const std::vector& alpha) -{ - if (!block_trie_) { - return; - } - - for (int i = 0; i < sequences.size(); i++) { - - auto& seq = const_cast(*sequences[i]); - - /// TODO: Is there a way to exploit the alpha[i] != 0 case? 
- if (alpha[i] != 0 || seq.cache_len >= seq.prompt.size()) { - continue; - } - - const auto& [block_ids, unique_ids] = block_trie_->Match(seq); - - if (rank_ == 0) { - // clang-format off - TM_LOG_INFO("ID {}, hit blocks {}, cache_len {}", seq.id, (int)block_ids.size(), seq.cache_len); - TM_LOG_DEBUG("ID {}, hit block_ids {}, unique_ids {}", seq.id, - vector2string(block_ids), vector2string(unique_ids)); - // clang-format on - } - - /// TODO: `Unlock` and `Lock` can't be batched because there may be repeated blocks between sequences - if (const int offset = seq.cache_len / block_seq_len_; offset < block_ids.size()) { - if (BlockIds tail{seq.blocks.begin() + offset, seq.blocks.end()}; !tail.empty()) { - block_manager_->Unlock(tail); - seq.blocks.resize(offset); - seq.block_unique_ids.resize(offset); - } - seq.blocks.insert(seq.blocks.end(), block_ids.begin() + offset, block_ids.end()); - seq.block_unique_ids.insert(seq.block_unique_ids.end(), unique_ids.begin() + offset, unique_ids.end()); - seq.cache_len = seq.blocks.size() * block_seq_len_; - block_manager_->Lock({block_ids.begin() + offset, block_ids.end()}); - } - - if (rank_ == 0) { - // clang-format off - TM_LOG_INFO("ID {}, after matching, blocks {}, cache_len {}", - seq.id, seq.blocks.size(), seq.cache_len); - TM_LOG_DEBUG("ID {}, after matching, block_ids {}, unique_ids {}", seq.id, - vector2string(seq.blocks), vector2string(seq.block_unique_ids)); - // clang-format on - } - } -} - -auto SequenceManager::Materialize(Sequences sequences, - std::vector context_length, - std::vector alpha, - std::vector priorities, - int max_fwd_tokens, - int max_tmp_tokens) -> Outcome -{ - //////////////////////////////////////////////////////////////////////////////// - /// Schedule the assignment of blocks to sequences - - // process deferred unlock and free operations - CommitUnlockAndFree(); - - SortByKey(priorities, sequences, context_length, alpha); - - // Verify and lock cache sequences to avoid their blocks being 
evicted unnoticed - // the blocks can still be preempted later - VerifyAndLockCached(sequences); - - PrefixMatch(sequences, alpha); - - std::vector required = CountRequiredBlocks(sequences, context_length); - - Schedule schedule(block_manager_->TakeSnapshot(), sequences.size(), max_fwd_tokens, max_tmp_tokens); - - // `schedule.last` is decreasing in the loop - for (int i = 0; i < schedule.last; ++i) { - auto& s = *sequences[i]; - const int input_len = context_length[i] - alpha[i] - s.cache_len; - // sanity check - TM_CHECK_GT(input_len, 0) << "Logical error: " << context_length[i] << " " << alpha[i] << " " << s.cache_len - << " " << s.status; - // temp buffer for flatten KV cache - const int temp_len = (input_len > 1 || s.status != Sequence::kActive) ? context_length[i] : 0; - Transaction{sequences, i, required[i], input_len, temp_len, schedule}.Process(); - } - - // mark remaining sequences invalid - for (int i = schedule.last; i < sequences.size(); ++i) { - schedule.inactive.push_back(sequences[i]); - } - - //////////////////////////////////////////////////////////////////////////////// - /// Schedule is ready, time to execute it. 
(locked -> cached -> free -> locked) - - // combine allocate and evict since evicted blocks are reused by allocation - schedule.allocate += schedule.evict; - - // if (schedule.allocate) { - // dbg(*block_manager_); - // } - - Outcome outcome{}; - outcome.allocation = schedule.allocate; - outcome.swap_in = std::count_if(schedule.active.begin(), schedule.active.end(), [](auto p) { - // if (p->status != Sequence::kActive) { - // dbg(*p); - // } - return p->status != Sequence::kActive; - }); - outcome.swap_out = std::count_if(schedule.inactive.begin(), schedule.inactive.end(), [](auto p) { - // if (p->status == Sequence::kActive) { - // dbg(*p); - // } - return p->status == Sequence::kActive; - }); - - // release preempted blocks -> cached - if (!schedule.victims.empty()) { - TM_LOG_INFO("#victim: {}", (int)schedule.victims.size()); - for (const auto& p : schedule.victims) { - UpdateAndSetUnlock(*p); - } - CommitUnlockAndFree(); - } - - // evict cached blocks -> free - if (schedule.evict) { - block_manager_->Evict(schedule.evict); - } - - // allocate & assign blocks - { - BlockIds block_ids; - UniqueIds unique_ids; - if (schedule.allocate) { - std::tie(block_ids, unique_ids) = block_manager_->Allocate(schedule.allocate); - } - AssignAndActivate(schedule.active, schedule.block_counts, block_ids, unique_ids); - } - - // active -> locked - for (const auto& p : schedule.inactive) { - if (p->status == Sequence::kActive) { - const_cast(p)->status = Sequence::kLocked; - } - } - - // TM_LOG_ERROR("active: {:4}, cached: {:4}, free: {:4}", - // block_manager_->active_count(), - // block_manager_->cached_count(), - // block_manager_->free_count()); - if (block_trie_) { - block_trie_->Verify(); - } - - return outcome; -} - -std::tuple SequenceManager::seq_stats() const noexcept -{ - int total = static_cast(sequences_.size()); - int active = 0; - int cached = 0; - for (const auto& p : sequences_) { - if (p.second.status == Sequence::kActive) { - ++active; - } - else if 
(p.second.status == Sequence::kCached) { - ++cached; - } - } - return std::make_tuple(total, active, cached); -} - -} // namespace turbomind diff --git a/src/turbomind/models/llama/SequenceManager.h b/src/turbomind/models/llama/SequenceManager.h deleted file mode 100644 index 73ec7e71c2..0000000000 --- a/src/turbomind/models/llama/SequenceManager.h +++ /dev/null @@ -1,257 +0,0 @@ -// Copyright (c) OpenMMLab. All rights reserved. - -#pragma once - -#include -#include -#include - -#include "src/turbomind/core/allocator.h" -#include "src/turbomind/core/core.h" - -#include "src/turbomind/models/llama/BlockManager.h" -#include "src/turbomind/models/llama/BlockTrie.h" - -namespace turbomind { - -struct MultiModalData; - -struct Sequence { - - enum Status - { - kCached = 0, - kLocked, - kActive - }; - - uint64_t id; - Status status = kCached; - - BlockIds blocks; - UniqueIds block_unique_ids; - - int input_length = 0; // the number of tokens to be processed in each forward iter - - mutable std::vector prompt; - - mutable std::vector tokens; // update by user or when the sequence is finished - - mutable int cache_len = 0; - - // additional data kept round-to-round - mutable std::vector random_state; // update by user - - mutable float rope_theta = 0.f; - - // embedding data - mutable std::vector input_embeds; - mutable std::vector input_embeds_offsets; - - // multimodal inputs - mutable std::vector> multimodal_inputs; - - // Gated DeltaNet linear attention persistent states (e.g. Qwen3.5-MoE). - // Allocated on first request, preserved across requests for the same session, - // and freed automatically when the sequence is erased from the SequenceManager. 
- // conv_states: (num_linear_layers, conv_dim, d_conv) — per-channel rolling conv history - // recurrent_states: (num_linear_layers, num_v_heads, key_head_dim, value_head_dim) — SSM state - mutable Tensor conv_states; - mutable Tensor recurrent_states; - mutable bool linear_states_need_reset = false; - - explicit Sequence(uint64_t _id): id(_id) {} - - friend std::ostream& operator<<(std::ostream& os, const Sequence& seq); -}; - -using Sequences = std::vector; - -inline std::ostream& operator<<(std::ostream& os, const Sequence& seq) -{ - os << "id=" << seq.id << ", status=" << seq.status << ", token_count=" << seq.tokens.size() - << ", block_count=" << seq.blocks.size() << ", cache_len=" << seq.cache_len - << ", random_state_size=" << seq.random_state.size() << ", input_length=" << seq.input_length; - return os; -} - -class SequenceManager { -public: - // clang-format off - struct BlockConfig { - int head_dim_; - int head_num_; - int block_len_; - int t_bits_; - int q_bits_; - bool share_kv_; - int t_bits() const { return t_bits_; } - int q_bits() const { return q_bits_; } - int head_dim() const { return head_dim_; } - int head_num() const { return head_num_; } - int block_len() const { return block_len_; } - bool is_share_kv() const { return share_kv_; } - }; - // clang-format on - - explicit SequenceManager(int head_dim, - int kv_head_num, - int num_layer, - const std::vector& layer_types, - int quant_policy, - DataType data_type, - DataType runtime_dtype, - int linear_key_head_dim, - int linear_value_head_dim, - int linear_conv_kernel_dim, - int linear_num_key_heads, - int linear_num_value_heads, - int cache_block_seq_len, - int attn_tp_size, - int max_batch_size, - double block_count, - int chunk_size, - bool enable_prefix_caching, - int rank, - int attn_cp_size, - core::Allocator allocator, - GetFreeMemSize get_free_size); - - SequenceManager(const SequenceManager&) = delete; - SequenceManager(SequenceManager&&) noexcept = default; - - [[nodiscard]] const 
Sequence* Create(uint64_t id); - - [[nodiscard]] const Sequence* Get(uint64_t id); - - [[nodiscard]] bool Contains(uint64_t id); - - [[nodiscard]] bool Erase(uint64_t id); - - void AcquireLinearStateSlot(const Sequence& seq); - - void ReleaseLinearStateSlot(const Sequence& seq); - - void InvalidateStatesAndCache(const Sequence& seq); - - void UpdateAndSetUnlock(const Sequence& seq); - - struct Outcome { - int allocation; - int swap_in; - int swap_out; - }; - - using AdjustInputCount = std::function&)>; - - // 50 1 0 50 - // context = seq_len + beta = cache + alpha + input - // alpha' = input - // beta' = int(is_gen) - // ----------------------------------- - // seq_len += output - // cache += input + output - 1 or cache = seq_len - 1 - - [[maybe_unused]] Outcome Materialize(Sequences sequences, - std::vector context_length, - std::vector alpha, - std::vector priorities, - int max_fwd_tokens, - int max_tmp_tokens); - - /** @brief cache the input prompt tokens of each seq in sequences[0:active_size-1] - * - * @param sequences The sequence list - * @param active_size the number of active sequences in the list - */ - void CachePrompt(const Sequences& sequences, int active_size); - - /** @brief cache the generated tokens of a given sequence - * - * @param sequence the given sequence - * - * @note This function can only be called after the sequence finish generation - * and all tokens including the prompt tokens and generated tokens have been put to - * `seq.tokens` - */ - void CacheGeneration(const Sequence& sequence); - - [[nodiscard]] void* GetBlockPtr(int block_id) - { - return block_manager_->block(block_id).data; - } - - int max_block_count() const noexcept - { - return block_manager_->max_block_count(); - } - - int total_count() const noexcept - { - return block_manager_->total_count(); - } - - int active_count() const noexcept - { - return block_manager_->active_count(); - } - - int free_count() const noexcept - { - return block_manager_->free_count(); - } - - 
int cached_count() const noexcept - { - return block_manager_->cached_count(); - } - - // return #total_seq, #active_seq, #cached_seq - std::tuple seq_stats() const noexcept; - -private: - void Erase(std::map::iterator& it); - - void CommitUnlockAndFree(); - - void InvalidateStatesAndCache(const Sequence& seq, BlockIds& freed_blocks); - - void VerifyAndLockCached(const Sequences& sequences); - - std::vector CountRequiredBlocks(const Sequences& sequences, // - const std::vector& context_length); - - static void AssignAndActivate(const Sequences& sequences, // - const std::vector& counts, - const BlockIds& blocks, - const UniqueIds& unique_ids); - - void PrefixMatch(Sequences& sequences, const std::vector& alpha); - -private: - int block_seq_len_; - int rank_; - int attn_cp_size_; - - // Use `std::map` to avoid reference invalidation - std::map sequences_; - - std::shared_ptr block_manager_; - std::shared_ptr block_trie_; - - Tensor pooled_conv_states_; - Tensor pooled_recurrent_states_; - std::vector free_linear_state_slots_; - std::unordered_map seq_to_linear_state_slot_; - - BlockIds unlocked_; - BlockIds freed_; -}; - -inline std::ostream& operator<<(std::ostream& os, const SequenceManager::Outcome& oc) -{ - os << "allocation: " << oc.allocation << ", swap-in: " << oc.swap_in << ", swap-out: " << oc.swap_out; - return os; -} - -} // namespace turbomind diff --git a/src/turbomind/models/llama/context_token_resource.h b/src/turbomind/models/llama/context_token_resource.h new file mode 100644 index 0000000000..292eb8d64c --- /dev/null +++ b/src/turbomind/models/llama/context_token_resource.h @@ -0,0 +1,55 @@ +#pragma once + +#include + +#include "src/turbomind/engine/request.h" + +namespace turbomind { + +class ContextTokenResource final: public Resource { +public: + explicit ContextTokenResource(int max_context_tokens) noexcept: max_context_tokens_{max_context_tokens} {} + + int Test(const Sequence& s) const noexcept override + { + const int input_len = InputLen(s, 
s.resume_len); + if (input_len <= 0) { + return 0; + } + if (TempLen(s, input_len) > max_context_tokens_) { + return 0; + } + return input_len; + } + + void Commit(const Sequence& s) noexcept override + { + const int input_len = InputLen(s, s.history_len); + max_context_tokens_ -= TempLen(s, input_len); + } + + int remaining_tokens() const noexcept + { + return max_context_tokens_; + } + +private: + static int ContextLen(const Sequence& s) noexcept + { + return s.seq_len + s.inflight_new_tokens; + } + + static int InputLen(const Sequence& s, int history_len) noexcept + { + return ContextLen(s) - s.inflight_input_len - history_len; + } + + static int TempLen(const Sequence& s, int input_len) noexcept + { + return (input_len > 1 || !s.is_active) ? ContextLen(s) : 0; + } + + int max_context_tokens_{}; +}; + +} // namespace turbomind diff --git a/src/turbomind/models/llama/gated_delta_net_kernels.cu b/src/turbomind/models/llama/gated_delta_net_kernels.cu index 0370b9f8f9..0682e57357 100644 --- a/src/turbomind/models/llama/gated_delta_net_kernels.cu +++ b/src/turbomind/models/llama/gated_delta_net_kernels.cu @@ -28,7 +28,9 @@ __global__ void recurrent_gated_delta_rule_kernel_v2(T* v_out, int num_v_heads, int num_k_heads, int k_dim_total, - int state_layer_offset) + int state_layer_offset, + int num_head_groups, + int heads_per_block) { const int bh = blockIdx.x; const int b = bh / num_v_heads; @@ -42,7 +44,10 @@ __global__ void recurrent_gated_delta_rule_kernel_v2(T* v_out, const int conv_dim = 2 * k_dim_total + num_v_heads * v_head_dim; const int v_dim = num_v_heads * v_head_dim; - S* s_ptr = state_ptrs[b] + state_layer_offset + h * state_size; + const int hg = h / heads_per_block; // head-group within the layer + S* s_ptr = state_ptrs[b * num_head_groups + hg] // per (batch, layer-group, head-group) block base + + state_layer_offset // (L % L_b) * H_b * state_size (per-layer, via weights) + + (h % heads_per_block) * state_size; // head within the head-group const 
float scale = rsqrtf((float)k_head_dim); @@ -255,11 +260,15 @@ void invokeGatedDeltaRuleBatched_v2(Ref v_out_, DataType state_dtype, int /*sm_count*/, int* /*work_counter*/, - cudaStream_t stream) + cudaStream_t stream, + int num_head_groups, + int heads_per_block) { auto& v_out = v_out_.get(); - const int num_v_heads = beta.shape(1); + const int num_v_heads = beta.shape(1); + if (heads_per_block <= 0) + heads_per_block = num_v_heads; // sentinel: one head-group spans all heads (today's behavior) const int v_dim = v_out.shape(1); const int value_head_dim = v_dim / num_v_heads; const int k_dim_total = (qkv_in.shape(1) - v_dim) / 2; @@ -296,7 +305,9 @@ void invokeGatedDeltaRuleBatched_v2(Ref v_out_, num_v_heads, num_k_heads, k_dim_total, - state_layer_offset); + state_layer_offset, + num_head_groups, + heads_per_block); }; if (state_dtype == kFloat32) { launch(float{}); @@ -323,18 +334,21 @@ void invokeGatedDeltaRuleBatched_v2(Ref v_out_, // directly from global memory. smem_sz = 0 in the host launcher. 
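The state addressing introduced in the hunks above splits a layer's value heads into groups of `heads_per_block`, looks up one base pointer per (batch, head-group) pair, and then adds the layer offset plus the head's offset inside its group. A minimal host-side sketch of that index arithmetic is given here for orientation. `StateAddress` and `locate_state` are hypothetical names that do not appear in the patch, and the sketch returns indices rather than dereferencing real device pointers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type (not part of the patch): where one head's state lives.
struct StateAddress {
    int         ptr_index;  // index into the state_ptrs table
    std::size_t offset;     // element offset from that base pointer
};

// Host-side mirror of the kernel arithmetic:
//   s_ptr = state_ptrs[b * num_head_groups + hg]
//         + state_layer_offset
//         + (h % heads_per_block) * state_size
// A non-positive heads_per_block is treated as the sentinel
// "one head-group spans all heads", as in the launchers.
inline StateAddress locate_state(int         b,
                                 int         h,
                                 int         num_v_heads,
                                 int         num_head_groups,
                                 int         heads_per_block,
                                 std::size_t state_layer_offset,
                                 std::size_t state_size)
{
    assert(h >= 0 && h < num_v_heads);
    if (heads_per_block <= 0) {
        heads_per_block = num_v_heads;
    }
    const int hg = h / heads_per_block;  // head-group within the layer
    return {b * num_head_groups + hg,
            state_layer_offset + static_cast<std::size_t>(h % heads_per_block) * state_size};
}
```

With the header defaults (`num_head_groups = 1`, `heads_per_block = 0`) this reduces to `state_ptrs[b] + state_layer_offset + h * state_size`, which is the addressing the removed lines used.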
// ============================================================================= template -__global__ __launch_bounds__(block_dim, 2) void recurrent_gated_delta_rule_kernel_v3(T* v_out, - const T* qkv_in, - const T* beta_in, - const T* g_in, - S* const* state_ptrs, - const int* q_offsets, - int* work_counter, - int total_work, - int num_v_heads, - int num_k_heads, - int k_dim_total, - int state_layer_offset) +__global__ __launch_bounds__(block_dim, 2) void recurrent_gated_delta_rule_kernel_v3(T* v_out, + const T* qkv_in, + const T* beta_in, + const T* g_in, + S* const* state_ptrs, + const int* q_offsets, + const bool* finished, + int* work_counter, + int total_work, + int num_v_heads, + int num_k_heads, + int k_dim_total, + int state_layer_offset, + int num_head_groups, + int heads_per_block) { constexpr int state_size = k_head_dim * v_head_dim; const int conv_dim = 2 * k_dim_total + num_v_heads * v_head_dim; @@ -368,9 +382,13 @@ __global__ __launch_bounds__(block_dim, 2) void recurrent_gated_delta_rule_kerne const int ratio = num_v_heads / num_k_heads; const int kh = h / ratio; - const int global_t = q_offsets[b]; // seq_len == 1 guaranteed + const bool skip_state_store = finished != nullptr && finished[b]; + const int global_t = q_offsets[b]; // seq_len == 1 guaranteed - S* s_ptr = state_ptrs[b] + state_layer_offset + h * state_size; + const int hg = h / heads_per_block; // head-group within the layer + S* s_ptr = state_ptrs[b * num_head_groups + hg] // per (batch, layer-group, head-group) block base + + state_layer_offset // (L % L_b) * H_b * state_size (per-layer, via weights) + + (h % heads_per_block) * state_size; // head within the head-group // --- Load state: global → registers (direct strided tile loads, tile_v contiguous) --- Array vec_S[v_iters][tile_k]; @@ -472,12 +490,14 @@ __global__ __launch_bounds__(block_dim, 2) void recurrent_gated_delta_rule_kerne } // --- Store state: registers → global (direct strided tile stores, tile_v contiguous) --- - 
PRAGMA_UNROLL - for (int v_iter = 0; v_iter < v_iters; ++v_iter) { + if (!skip_state_store) { PRAGMA_UNROLL - for (int k = 0; k < tile_k; ++k) { - auto tmp = cast(vec_S[v_iter][k]); - Store(&s_ptr[(offset_k * tile_k + k) * v_head_dim + (offset_v + v_iter * v_threads) * tile_v], tmp); + for (int v_iter = 0; v_iter < v_iters; ++v_iter) { + PRAGMA_UNROLL + for (int k = 0; k < tile_k; ++k) { + auto tmp = cast(vec_S[v_iter][k]); + Store(&s_ptr[(offset_k * tile_k + k) * v_head_dim + (offset_v + v_iter * v_threads) * tile_v], tmp); + } } } } @@ -489,17 +509,22 @@ void invokeGatedDeltaRuleBatched_v3(Ref v_out_, const Tensor& g, const Buffer_& state_ptrs, const Buffer_& q_offsets, + const Buffer_& finished, int batch_size, int num_k_heads, int state_layer_offset, DataType state_dtype, int sm_count, int* work_counter, - cudaStream_t stream) + cudaStream_t stream, + int num_head_groups, + int heads_per_block) { auto& v_out = v_out_.get(); const int num_v_heads = beta.shape(1); + if (heads_per_block <= 0) + heads_per_block = num_v_heads; // sentinel: one head-group spans all heads (today's behavior) const int v_dim = v_out.shape(1); const int k_dim_total = (qkv_in.shape(1) - v_dim) / 2; @@ -532,12 +557,15 @@ void invokeGatedDeltaRuleBatched_v3(Ref v_out_, g.data(), (S* const*)state_ptrs.data(), q_offsets.data(), + finished ? finished.data() : nullptr, work_counter, total_work, num_v_heads, num_k_heads, k_dim_total, - state_layer_offset); + state_layer_offset, + num_head_groups, + heads_per_block); }; if (state_dtype == kFloat32) launch(float{}); @@ -557,16 +585,19 @@ void invokeGatedDeltaRuleBatched_v3(Ref v_out_, // State load/store uses the full swizzled smem buffer (same as v2). 
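The `finished` pointer threaded through these kernels lets a slot still run the recurrence and produce output while leaving its persistent state untouched, and a null pointer disables the check so older call sites keep their behavior. A minimal CPU sketch of that pattern follows. `step_state` is a hypothetical name, and the `+ 1.0f` update is only a placeholder, not the gated delta rule itself.

```cpp
#include <cassert>

// Hypothetical stand-in for one decode step over a batch of per-slot states.
// `finished` may be null, meaning "no slot is finished" (the legacy behavior).
inline void step_state(float* states, float* out, const bool* finished, int batch_size)
{
    assert(states != nullptr && out != nullptr && batch_size >= 0);
    for (int b = 0; b < batch_size; ++b) {
        const bool skip_state_store = finished != nullptr && finished[b];

        float s = states[b];  // load state
        s += 1.0f;            // placeholder update, NOT the real recurrence
        out[b] = s;           // output is produced for every slot

        if (!skip_state_store) {
            states[b] = s;  // write back only for slots that are still live
        }
    }
}
```

Passing `nullptr` for `finished` corresponds to what the inline overloads in the header do when they forward a default-constructed `Buffer_` to the new signatures.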
// ============================================================================= template -__global__ void chunked_gated_delta_rule_kernel(T* v_out, - const T* qkv_in, - const T* beta_in, - const T* g_in, - S* const* state_ptrs, - const int* q_offsets, - int num_v_heads, - int num_k_heads, - int k_dim_total, - int state_layer_offset) +__global__ void chunked_gated_delta_rule_kernel(T* v_out, + const T* qkv_in, + const T* beta_in, + const T* g_in, + S* const* state_ptrs, + const int* q_offsets, + const bool* finished, + int num_v_heads, + int num_k_heads, + int k_dim_total, + int state_layer_offset, + int num_head_groups, + int heads_per_block) { constexpr int C = kChunkSize; constexpr int D = kHeadDim; @@ -577,16 +608,20 @@ __global__ void chunked_gated_delta_rule_kernel(T* v_out, const int ratio = num_v_heads / num_k_heads; const int kh = h / ratio; - const int tok_off = q_offsets[b]; - const int seq_len = q_offsets[b + 1] - tok_off; - const int state_size = D * D; - const int conv_dim = 2 * k_dim_total + num_v_heads * D; - const int v_dim = num_v_heads * D; + const bool skip_state_store = finished != nullptr && finished[b]; + const int tok_off = q_offsets[b]; + const int seq_len = q_offsets[b + 1] - tok_off; + const int state_size = D * D; + const int conv_dim = 2 * k_dim_total + num_v_heads * D; + const int v_dim = num_v_heads * D; if (seq_len == 0) return; - S* s_ptr = state_ptrs[b] + state_layer_offset + h * state_size; + const int hg = h / heads_per_block; // head-group within the layer + S* s_ptr = state_ptrs[b * num_head_groups + hg] // per (batch, layer-group, head-group) block base + + state_layer_offset // (L % L_b) * H_b * state_size (per-layer, via weights) + + (h % heads_per_block) * state_size; // head within the head-group const float scale = rsqrtf((float)D); // ── State tiling (same as v2) ── @@ -804,7 +839,7 @@ __global__ void chunked_gated_delta_rule_kernel(T* v_out, // ================================================================ // STORE 
STATE registers → smem (swizzled) → global (same as v2) // ================================================================ - { + if (!skip_state_store) { using Map_S = ThreadMap_V2; constexpr int kBase = (sizeof(S) == 4) ? 2 : 3; constexpr int kShift = 10 - kBase; @@ -850,17 +885,22 @@ void invokeChunkedGatedDeltaRuleBatched(Ref v_out_, const Tensor& g, const Buffer_& state_ptrs, const Buffer_& q_offsets, + const Buffer_& finished, int batch_size, int num_k_heads, int state_layer_offset, DataType state_dtype, int /*sm_count*/, int* /*work_counter*/, - cudaStream_t stream) + cudaStream_t stream, + int num_head_groups, + int heads_per_block) { auto& v_out = v_out_.get(); - const int num_v_heads = beta.shape(1); + const int num_v_heads = beta.shape(1); + if (heads_per_block <= 0) + heads_per_block = num_v_heads; // sentinel: one head-group spans all heads (today's behavior) const int v_dim = v_out.shape(1); const int value_head_dim = v_dim / num_v_heads; const int k_dim_total = (qkv_in.shape(1) - v_dim) / 2; @@ -903,10 +943,13 @@ void invokeChunkedGatedDeltaRuleBatched(Ref v_out_, g.data(), (S* const*)state_ptrs.data(), q_offsets.data(), + finished ? 
finished.data() : nullptr, num_v_heads, num_k_heads, k_dim_total, - state_layer_offset); + state_layer_offset, + num_head_groups, + heads_per_block); }; if (state_dtype == kFloat32) { launch(float{}); @@ -1075,6 +1118,7 @@ __global__ void __launch_bounds__(BLOCK_DIM) fused_conv1d_batched_kernel_v2(T* void* const* conv_state_ptrs, const int* q_offsets, const int* k_offsets, + const bool* finished, int* work_counter, int batch_size, int conv_dim, @@ -1176,8 +1220,9 @@ __global__ void __launch_bounds__(BLOCK_DIM) fused_conv1d_batched_kernel_v2(T* n_tokens = min(NUM_TOKENS, seq_len - t_local_start); } - const int ring_start = (history_len + t_local_start + 1) % D_CONV; - T* state_base = (T*)conv_state_ptrs[b] + state_layer_offset; + const bool skip_state_store = finished != nullptr && finished[b]; + const int ring_start = (history_len + t_local_start + 1) % D_CONV; + T* state_base = (T*)conv_state_ptrs[b] + state_layer_offset; if (ch_active) { constexpr int VALS_SIZE = NUM_TOKENS + D_CONV - 1; @@ -1222,7 +1267,7 @@ __global__ void __launch_bounds__(BLOCK_DIM) fused_conv1d_batched_kernel_v2(T* } } - if (t_local_start + n_tokens >= seq_len) { + if (!skip_state_store && t_local_start + n_tokens >= seq_len) { PRAGMA_UNROLL for (int i = 0; i < VALS_SIZE; ++i) { int pos = t_local_start - (D_CONV - 1) + i; @@ -1243,6 +1288,7 @@ void invokeFusedConv1dSiLU(Ref out_, const Buffer_& conv_state_ptrs, const Buffer_& q_offsets, const Buffer_& k_offsets, + const Buffer_& finished, int batch_size, int state_layer_offset, int sm_count, @@ -1285,6 +1331,7 @@ void invokeFusedConv1dSiLU(Ref out_, conv_state_ptrs.data(), q_offsets.data(), k_offsets.data(), + finished ? 
finished.data() : nullptr, work_counter, batch_size, conv_dim, diff --git a/src/turbomind/models/llama/gated_delta_net_kernels.h b/src/turbomind/models/llama/gated_delta_net_kernels.h index 9519db4da7..c95aa5c5e9 100644 --- a/src/turbomind/models/llama/gated_delta_net_kernels.h +++ b/src/turbomind/models/llama/gated_delta_net_kernels.h @@ -27,12 +27,41 @@ void invokeFusedConv1dSiLU(Ref out, const Buffer_& conv_state_ptrs, const Buffer_& q_offsets, const Buffer_& k_offsets, + const Buffer_& finished, int batch_size, int state_layer_offset, int sm_count, int* work_counter, cudaStream_t stream); +inline void invokeFusedConv1dSiLU(Ref out, + const Tensor& in, + const Tensor& weight, + const Tensor& bias, + const Buffer_& conv_state_ptrs, + const Buffer_& q_offsets, + const Buffer_& k_offsets, + int batch_size, + int state_layer_offset, + int sm_count, + int* work_counter, + cudaStream_t stream) +{ + invokeFusedConv1dSiLU(out, + in, + weight, + bias, + conv_state_ptrs, + q_offsets, + k_offsets, + Buffer_{}, + batch_size, + state_layer_offset, + sm_count, + work_counter, + stream); +} + // All three recurrent-rule launchers share the same trailing parameters for // interface consistency: // sm_count — multiprocessor count, queried once by the caller at init @@ -53,7 +82,9 @@ void invokeGatedDeltaRuleBatched_v2(Ref v_out, DataType state_dtype, int sm_count, int* work_counter, - cudaStream_t stream); + cudaStream_t stream, + int num_head_groups = 1, + int heads_per_block = 0); // v3: persistent decode kernel, seq_len == 1 only. 
// Launches min(total_work, blocks_per_sm * sm_count) blocks; each block claims @@ -65,13 +96,46 @@ void invokeGatedDeltaRuleBatched_v3(Ref v_out, const Tensor& g, const Buffer_& state_ptrs, const Buffer_& q_offsets, + const Buffer_& finished, int batch_size, int num_k_heads, int state_layer_offset, DataType state_dtype, int sm_count, int* work_counter, - cudaStream_t stream); + cudaStream_t stream, + int num_head_groups = 1, + int heads_per_block = 0); + +inline void invokeGatedDeltaRuleBatched_v3(Ref v_out, + const Tensor& qkv_in, + const Tensor& beta, + const Tensor& g, + const Buffer_& state_ptrs, + const Buffer_& q_offsets, + int batch_size, + int num_k_heads, + int state_layer_offset, + DataType state_dtype, + int sm_count, + int* work_counter, + cudaStream_t stream) +{ + invokeGatedDeltaRuleBatched_v3(v_out, + qkv_in, + beta, + g, + state_ptrs, + q_offsets, + Buffer_{}, + batch_size, + num_k_heads, + state_layer_offset, + state_dtype, + sm_count, + work_counter, + stream); +} // ============================================================================= // Chunked Gated Delta Rule — for accelerating prefill @@ -88,13 +152,46 @@ void invokeChunkedGatedDeltaRuleBatched(Ref v_out, const Tensor& g, const Buffer_& state_ptrs, const Buffer_& q_offsets, + const Buffer_& finished, int batch_size, int num_k_heads, int state_layer_offset, DataType state_dtype, int sm_count, int* work_counter, - cudaStream_t stream); + cudaStream_t stream, + int num_head_groups = 1, + int heads_per_block = 0); + +inline void invokeChunkedGatedDeltaRuleBatched(Ref v_out, + const Tensor& qkv_in, + const Tensor& beta, + const Tensor& g, + const Buffer_& state_ptrs, + const Buffer_& q_offsets, + int batch_size, + int num_k_heads, + int state_layer_offset, + DataType state_dtype, + int sm_count, + int* work_counter, + cudaStream_t stream) +{ + invokeChunkedGatedDeltaRuleBatched(v_out, + qkv_in, + beta, + g, + state_ptrs, + q_offsets, + Buffer_{}, + batch_size, + num_k_heads, + 
state_layer_offset, + state_dtype, + sm_count, + work_counter, + stream); +} // ============================================================================= // Helper kernels diff --git a/src/turbomind/models/llama/test_cache_manager.cc b/src/turbomind/models/llama/test_cache_manager.cc deleted file mode 100644 index 16629565f1..0000000000 --- a/src/turbomind/models/llama/test_cache_manager.cc +++ /dev/null @@ -1,116 +0,0 @@ -// Copyright (c) OpenMMLab. All rights reserved. - -#include "BlockManager.h" -#include "SequenceManager.h" - -#include "src/turbomind/utils/allocator.h" - -#include "src/turbomind/utils/debug_utils.h" -#include -#include - -using namespace turbomind; - -std::ostream& operator<<(std::ostream& os, const Block* b) -{ - os << "(" << b->id << "," << b->timestamp << ")"; - return os; -} - -TEST_CASE("BlockManager") -{ - Allocator allocator(0); - - BlockManager m(1024, 32, 8, &allocator); - REQUIRE(m.max_block_count() == 32); - REQUIRE(m.free_count() == 32); - - auto blocks1 = m.Allocate(10); - - dbg(blocks1); - - REQUIRE(blocks1.size() == 10); - REQUIRE(m.active_count() == blocks1.size()); - REQUIRE(m.free_count() == 22); - - auto blocks2 = m.Allocate(6); - REQUIRE(blocks2.size() == 6); - REQUIRE(m.active_count() == blocks1.size() + blocks2.size()); - REQUIRE(m.free_count() == 16); - - auto blocks3 = m.Allocate(16); - REQUIRE(blocks3.size() == 16); - REQUIRE(m.active_count() == 32); - REQUIRE(m.free_count() == 0); - - std::copy(blocks3.begin(), blocks3.end(), std::back_inserter(blocks1)); - std::copy(blocks2.begin(), blocks2.end(), std::back_inserter(blocks1)); - - m.Touch(blocks1); - - REQUIRE(m.Unlock(blocks1) == 32); - REQUIRE(m.active_count() == 0); - REQUIRE(m.free_count() == 0); - REQUIRE(m.cached_count() == 32); - - m.Evict(16); - REQUIRE(m.active_count() == 0); - REQUIRE(m.free_count() == 16); - REQUIRE(m.cached_count() == 16); - - auto blocks4 = m.Allocate(14); - REQUIRE(m.active_count() == 14); - REQUIRE(m.free_count() == 2); - 
REQUIRE(m.cached_count() == 16); -} - -TEST_CASE("SequenceManager basic test") -{ - Allocator allocator(0); - - SequenceManager manager(32, 32, 128, 128, 20, 4, 16, 0, &allocator); - - REQUIRE(manager.max_block_count() == 20); - REQUIRE(manager.Contains(1) == false); - - auto s1 = manager.Create(1); - dbg(*s1); - REQUIRE(manager.Contains(1) == true); - - manager.Erase(1); - REQUIRE(manager.Contains(1) == false); - - s1 = manager.Create(1); - REQUIRE(manager.Contains(1) == true); - - auto outcome = manager.Materialize({s1}, {128}, {100}, 1); - dbg(s1->blocks); - REQUIRE(s1->blocks.size() == 2); - - auto s2 = manager.Create(2); - REQUIRE(manager.Contains(2)); - - outcome = manager.Materialize({s1, s2}, {128, 2559}, {2, 1}, 1); - dbg(outcome); - REQUIRE(outcome.allocation == 20); - REQUIRE(outcome.swap_in == 1); - REQUIRE(outcome.swap_out == 1); - - auto s3 = manager.Create(3); - outcome = manager.Materialize({s1, s2, s3}, {127, 2559, 255}, {1, 100, 2}, 1); - dbg(outcome); -} - -TEST_CASE("SequenceManager functional test") -{ - Allocator allocator(0); - SequenceManager manager(32, 32, 128, 128, 20, 4, 16, 0, &allocator); - - auto seq = manager.Create(1); - for (int i = 0; i < 1024; ++i) { - auto outcome = manager.Materialize({seq}, {i}, {0}, 1); - if (outcome.allocation) { - dbg(i, outcome); - } - } -} diff --git a/src/turbomind/models/llama/unified_attention_layer.cc b/src/turbomind/models/llama/unified_attention_layer.cc index cc0e529eb4..0be375f345 100644 --- a/src/turbomind/models/llama/unified_attention_layer.cc +++ b/src/turbomind/models/llama/unified_attention_layer.cc @@ -19,6 +19,7 @@ // Modified from // https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/layers/attention_layers/GptContextAttentionLayer.cc +#include "src/turbomind/engine/block.h" #include #include #include @@ -39,6 +40,9 @@ #include "src/turbomind/macro.h" +#include "src/turbomind/kernels/attention/block.h" +#include "src/turbomind/memory/object.h" +#include 
"src/turbomind/models/attention_weight.h" #include "src/turbomind/models/llama/llama_kernels.h" #include "src/turbomind/models/llama/llama_rope.h" #include "src/turbomind/models/llama/llama_utils.h" @@ -54,6 +58,31 @@ namespace turbomind { +namespace { +// clang-format off +struct BlockConfig { + int head_dim_; + int head_num_; + int block_len_; + int t_bits_; + int q_bits_; + bool share_kv_; + int t_bits() const { return t_bits_; } + int q_bits() const { return q_bits_; } + int head_dim() const { return head_dim_; } + int head_num() const { return head_num_; } + int block_len() const { return block_len_; } + bool is_share_kv() const { return share_kv_; } + auto as_tuple() const noexcept { + return std::tie(head_dim_, head_num_, block_len_, t_bits_, q_bits_, share_kv_); + } + friend bool operator==(const BlockConfig& a, const BlockConfig& b) { + return a.as_tuple() == b.as_tuple(); + } +}; +// clang-format on +} // namespace + struct AttentionData { struct Stat { int n; @@ -76,6 +105,7 @@ struct AttentionData { Buffer_ finished; Buffer_ q_offsets; Buffer_ k_offsets; + Buffer_ readonly_block_num; // per-request, batch order // int dbg_offset; // int dbg_size; @@ -92,26 +122,53 @@ UnifiedAttentionLayer::~UnifiedAttentionLayer() aux_stream_ = {}; } -UnifiedAttentionLayer::UnifiedAttentionLayer(int quant_policy, - const std::vector& layer_types, - int layer_num, - std::vector attn_weights, +UnifiedAttentionLayer::UnifiedAttentionLayer(std::vector weights, + CacheRegistry& registry, const EngineParam& engine, - const Context& ctx, + const Context& context, int phases, bool init): - quant_policy_{quant_policy}, - rope_{attn_weights[0]->rope}, + quant_policy_{engine.quant_policy}, + rope_{weights.at(0)->rope}, engine_param_{engine}, - cp_fn_ctx_{ctx.comm.d_comm, ctx.comm.d_cp_group}, - is_warm_up_{*ctx.is_warm_up}, - context_{ctx}, + cp_fn_ctx_{context.comm.d_comm, context.comm.d_cp_group}, + is_warm_up_{*context.is_warm_up}, + context_{context}, init_{init}, - 
linear_(*ctx.linear), + linear_(*context.linear), arch_{getSMVersion()} { - TM_CHECK(!attn_weights.empty()) << "attn_weights must not be empty"; - TM_CHECK(attn_weights[0]) << "attn_weights[0] must not be null"; + TM_CHECK_GE(weights.size(), 1); + + const auto dtype = engine.data_type; + + const int dtype_bits = byte_size(dtype, 8); + const int qaunt_bits = quant_policy_ ? quant_policy_ : dtype_bits; + + auto get_block_config = [&](const AttentionWeight& w) { + BlockConfig b{w.head_dim, + w.kv_head_num, + engine.cache_block_seq_len, + dtype_bits == qaunt_bits ? 0 : dtype_bits, + qaunt_bits, + w.head_dim == 576}; + return b; + }; + + size_t offset = 0; // byte size (quantization aware) + for (int i = 0; i < weights.size(); ++i) { + block::Layout layout{get_block_config(*weights[i])}; + weights[i]->cache_block_offset = offset; + offset += layout.layer_size(); + } + + const size_t cache_block_byte_size = offset; + prefix_cache_offset_ = registry.prefix().Register(cache_block_byte_size, /*alignment=*/1); + + const auto max_block_num = engine.max_batch_size * cdiv(engine.session_len, engine.cache_block_seq_len); + + block_ptrs_buf_ = {max_block_num, kCPUpinned}; + block_ptrs_offsets_buf_ = {engine.max_batch_size + 1, kCPUpinned}; TM_CUDA_CHECK(cudaStreamCreateWithFlags(&aux_stream_, cudaStreamNonBlocking)); TM_CUDA_CHECK(cudaEventCreateWithFlags(&qkv_event_, cudaEventDisableTiming)); @@ -119,25 +176,14 @@ UnifiedAttentionLayer::UnifiedAttentionLayer(int quant init_rope_kernel_param(rope_, rope_param_); - // Skip other attention layer types - std::vector types = layer_types; - types.resize(layer_num); - cache_layer_ids_.resize(types.size(), -1); - int next_cache_id = 0; - for (size_t i = 0; i < types.size(); ++i) { - if (types[i] == 0) { - cache_layer_ids_[i] = next_cache_id++; - } - } - const int bsz = engine.max_batch_size; if (rope_param_.type == RopeType::kDynamic) { rope_base_buf_ = {bsz + 1, kCPUpinned}; } if (rope_param_.mrope_mode != MropeMode::kNone) { - // 
mrope device buffers are allocated lazily — borrowed from env when the vision encoder - // produced them, or owned (allocated in legacy_mrope_setup) when only r.inputs supplies them. + // CPU-pinned staging buffers for the legacy r.inputs mrope path; per-phase device + // tensors are allocated below. (W1 also borrows env tensors from the vision encoder.) mrope_position_delta_buf_ = {bsz, kCPUpinned}; mrope_length_buf_ = {bsz, kCPUpinned}; } @@ -149,11 +195,16 @@ UnifiedAttentionLayer::UnifiedAttentionLayer(int quant if (rope_param_.type == RopeType::kDynamic) { d->rope_base = empty_like(rope_base_buf_, kDEVICE); } + if (rope_param_.mrope_mode != MropeMode::kNone) { + d->mrope_position_ids = {{bsz, engine.session_len, 3}, kDEVICE}; + d->mrope_position_delta = empty_like(mrope_position_delta_buf_, kDEVICE); + d->mrope_length = empty_like(mrope_length_buf_, kDEVICE); + } } // Eagerly initialize workspace buffers (was previously lazy in Init()) { - const auto& w = *attn_weights[0]; + const auto& w = *weights[0]; const int tp_size = w.tp_size; const int local_head_num = w.head_num / tp_size; const int size_per_head = w.head_dim; @@ -180,7 +231,7 @@ UnifiedAttentionLayer::UnifiedAttentionLayer(int quant } } -static void init_dynamic_ntk(RequestCache& cache, const core::RopeConfig& rope) +static void init_dynamic_ntk(Sequence& cache, const core::RopeConfig& rope) { cache.rope_base = rope.base; if (auto scaling_factor = rope.factor; scaling_factor > 1.f) { @@ -200,7 +251,7 @@ static void init_dynamic_ntk(RequestCache& cache, const core::RopeConfig& rope) void UnifiedAttentionLayer::Run(BatchOp op, int phase, TensorMap& env) { if (op == BatchOp::kAdd) { - Buffer_ rc = env.at("requests").buffer(); + Buffer_ rc = env.at("requests").buffer(); if (rope_param_.type == RopeType::kDynamic) { for (int i = 0; i < rc.size(); ++i) { init_dynamic_ntk(*rc[i], rope_); @@ -211,9 +262,10 @@ void UnifiedAttentionLayer::Run(BatchOp op, int phase, TensorMap& env) Setup(phase, env); } else if 
(op == BatchOp::kPrepare) { - data_.at(phase)->finished = env.at("finished").buffer().borrow(); - data_.at(phase)->q_offsets = env.at("q_offsets").buffer().borrow(); - data_.at(phase)->k_offsets = env.at("k_offsets").buffer().borrow(); + data_.at(phase)->finished = env.at("finished").buffer().borrow(); + data_.at(phase)->q_offsets = env.at("q_offsets").buffer().borrow(); + data_.at(phase)->k_offsets = env.at("k_offsets").buffer().borrow(); + data_.at(phase)->readonly_block_num = env.at("readonly_block_num").buffer().borrow(); // This is needed in async mode to clear the `attn` buffer for the finished sequences. Ohterwise random NaNs // will crash the MoE router later @@ -231,16 +283,34 @@ void UnifiedAttentionLayer::Run(BatchOp op, int phase, TensorMap& env) void UnifiedAttentionLayer::Setup(int phase, TensorMap& env) { - const auto& rc = env.at("batch").data()[0]->rc; - const int bsz = rc.size(); + // const auto& rc = env.at("batch").data()[0]->rc; // active requests + Buffer_ rc = env.at("requests").buffer(); + + const int bsz = rc.size(); auto& d = *data_.at(phase); auto& copy = *env.at("copy").data()[0]; { /// Upload KV cache ptrs - const Buffer_ offsets = env.at("block_ptrs_offsets").buffer(); - copy(env.at("block_ptrs").buffer(), offsets[bsz], d.block_ptrs); - copy(offsets, bsz + 1, d.block_ptrs_offsets); + const auto& c_pool = *env.at("cache_block_pool").data()[0]; + + auto blocks = block_ptrs_buf_.data(); + auto offsets = block_ptrs_offsets_buf_.data(); + + offsets[0] = 0; + for (int i = 0; i < rc.size(); ++i) { + const auto& r = *rc[i]; + for (const auto& h : r.block_ids) { + const int cache_id = h->prefix_id; + auto& cb = c_pool[cache_id]; + TM_CHECK_NOTNULL(cb.allocation.a); + *blocks++ = cb.base(0) + prefix_cache_offset_; + } + offsets[i + 1] = offsets[i] + r.block_ids.size(); + } + + copy(block_ptrs_buf_, block_ptrs_offsets_buf_[bsz], d.block_ptrs); + copy(block_ptrs_offsets_buf_, bsz + 1, d.block_ptrs_offsets); } /// prepare Q/K stats for 
decode/prefill @@ -261,9 +331,9 @@ void UnifiedAttentionLayer::Setup(int phase, TensorMap& env) auto& s = i < d.decode.n ? d.decode : d.prefill; s.q_sum += c.input_len; - s.k_sum += c.history_len + c.alpha + c.input_len; + s.k_sum += c.history_len + c.inflight_input_len + c.input_len; s.q_max = std::max(s.q_max, c.input_len); - s.k_max = std::max(s.k_max, c.history_len + c.alpha + c.input_len); + s.k_max = std::max(s.k_max, c.history_len + c.inflight_input_len + c.input_len); } // auto &D = d.decode, &P = d.prefill; @@ -276,26 +346,19 @@ void UnifiedAttentionLayer::Setup(int phase, TensorMap& env) } copy(rope_base_buf_, bsz, d.rope_base); } - if (rope_param_.mrope_mode != MropeMode::kNone) { - // mrope tensors can come from two sources: - // 1. env: the C++ vision encoder produced device tensors in the exact layout - // FastRoPE expects — borrow them with no copy. - // 2. r.inputs: legacy Python-preprocessor path, per-request shaped (length, 3) + - // scalar delta. Falls back here when env did not produce mrope. + else if (rope_param_.mrope_mode != MropeMode::kNone) { if (env.try_("mrope_length")) { + // The C++ qwen3.5-vit encoder already built the mrope tensors in FastRoPE's + // exact (slot-indexed) layout during its kSetup, which runs before this layer's + // Setup in the same pass. Borrow them with no copy; they live on the encoder's + // per-phase Data (worst-case allocated) so the non-owning views stay valid + // through forward. d.mrope_length = env.at("mrope_length").buffer().borrow(); d.mrope_position_delta = env.at("mrope_position_delta").buffer().borrow(); d.mrope_position_ids = env.at("mrope_position_ids").borrow(); } else { - // Legacy r.inputs path. Lazily allocate owned device buffers on first hit. 
- if (!d.mrope_position_ids) { - /// TODO: total space for `mrope_position_ids` can be reduced to (max_fwd_tokens, 3) - d.mrope_position_ids = - Tensor_{{engine_param_.max_batch_size, engine_param_.session_len, 3}, kDEVICE}; - d.mrope_position_delta = Buffer_{engine_param_.max_batch_size, kDEVICE}; - d.mrope_length = Buffer_{engine_param_.max_batch_size, kDEVICE}; - } + // Legacy r.inputs mrope path (Python preprocessor); `d.mrope_*` allocated at setup. const auto stride = d.mrope_position_ids.stride(0); for (int i = 0; i < rc.size(); ++i) { auto& c = *rc[i]; @@ -304,7 +367,8 @@ void UnifiedAttentionLayer::Setup(int phase, TensorMap& env) int length = pos_ids->shape(0); mrope_length_buf_[i] = length; mrope_position_delta_buf_[i] = *r.inputs.at("mrope_position_delta").data(); - if (auto o = Interval{0, length} & Interval{c.history_len + c.alpha, Interval::Size{c.input_len}}) { + if (auto o = Interval{0, length} + & Interval{c.history_len + c.inflight_input_len, Interval::Size{c.input_len}}) { copy(pos_ids->data() + o.begin() * 3, (int)o.size() * 3, d.mrope_position_ids.data() + i * stride + o.begin() * 3); @@ -431,8 +495,6 @@ Tensor UnifiedAttentionLayer::core_attention(Tensor& qkv, const ForwardParam& p, Tensor tmp_kv{{local_kv_head_num, is_mla ? 
1 : 2, d.prefill.k_sum + MAX_CTA_S, size_per_head}, dtype, device}; - const int cache_layer_id = cache_layer_ids_[p.layer_id]; - auto CreateParams = [&](int offset, AttentionData::Stat stat, int max_kv_splits, cudaStream_t stream) { AttentionParams params{}; @@ -469,10 +531,12 @@ Tensor UnifiedAttentionLayer::core_attention(Tensor& qkv, const ForwardParam& p, params.max_q_len = stat.q_max; params.max_k_len = stat.k_max; + TM_CHECK_LE(weights.cache_block_offset, INT_MAX); + // decode only params.block_iter_params = BlockIteratorParams{(char**)d.block_ptrs.data(), // d.block_ptrs_offsets.data() + offset, - cache_layer_id, + (int)weights.cache_block_offset, engine_param_.cache_block_seq_len}; // prefill only @@ -491,14 +555,14 @@ Tensor UnifiedAttentionLayer::core_attention(Tensor& qkv, const ForwardParam& p, }; } - params.finished = d.finished.data() + offset; - params.cu_q_len = d.q_offsets.data() + offset; - params.cu_k_len = d.k_offsets.data() + offset; + params.finished = d.finished.data() + offset; + params.cu_q_len = d.q_offsets.data() + offset; + params.cu_k_len = d.k_offsets.data() + offset; + params.readonly_block_num = d.readonly_block_num.data() + offset; params.num_heads = local_head_num; params.num_kv_heads = local_kv_head_num; params.size_per_head = size_per_head; - params.layer_id = cache_layer_id; double scaling = 1.; if (weights.softmax_scale) { // model predefined softmax scale diff --git a/src/turbomind/models/llama/unified_attention_layer.h b/src/turbomind/models/llama/unified_attention_layer.h index 79c20d3115..5131d2d99e 100644 --- a/src/turbomind/models/llama/unified_attention_layer.h +++ b/src/turbomind/models/llama/unified_attention_layer.h @@ -26,6 +26,7 @@ #include "src/turbomind/core/core.h" #include "src/turbomind/engine/batch.h" +#include "src/turbomind/engine/cache_registry.h" #include "src/turbomind/kernels/attention/cp_utils.h" #include "src/turbomind/kernels/gemm/test/test_utils.h" #include "src/turbomind/models/attention_weight.h" 
@@ -55,10 +56,8 @@ class UnifiedAttentionLayer { ~UnifiedAttentionLayer(); - UnifiedAttentionLayer(int quant_policy, - const std::vector& layer_types, - int layer_num, - std::vector attn_weights, + UnifiedAttentionLayer(std::vector weights, + CacheRegistry& registry, const EngineParam& engine, const Context& context, int phases, @@ -100,7 +99,10 @@ class UnifiedAttentionLayer { std::vector> data_; - std::vector cache_layer_ids_; + size_t prefix_cache_offset_{}; + + Buffer_ block_ptrs_buf_; + Buffer_ block_ptrs_offsets_buf_; /////////////////////////////////////////////////////// /// temp runtime buffers (allocated in constructor) diff --git a/src/turbomind/models/llama/unified_decoder.cc b/src/turbomind/models/llama/unified_decoder.cc index 2562156634..ba7857c7c6 100644 --- a/src/turbomind/models/llama/unified_decoder.cc +++ b/src/turbomind/models/llama/unified_decoder.cc @@ -9,7 +9,9 @@ #include "src/turbomind/core/scope.h" #include "src/turbomind/kernels/core/math.h" #include "src/turbomind/kernels/norm/rms_norm.h" +#include "src/turbomind/models/attention_weight.h" #include "src/turbomind/models/decoder_layer_weight.h" +#include "src/turbomind/models/delta_net_weight.h" #include "src/turbomind/models/llama/llama_kernels.h" #include "src/turbomind/models/llama/llama_utils.h" #include "src/turbomind/models/llama/moe_ffn_layer.h" @@ -33,7 +35,8 @@ void UnifiedDecoder::Run(BatchOp op, int phase, TensorMap& env) } } -UnifiedDecoder::UnifiedDecoder(const EngineParam& engine, +UnifiedDecoder::UnifiedDecoder(CacheRegistry& registry, + const EngineParam& engine, const Context& ctx, int phases, const ModelWeight& model_weight): @@ -48,55 +51,50 @@ UnifiedDecoder::UnifiedDecoder(const EngineParam& engine, tune_layer_num_(engine.tune_layer_num), is_warm_up_{*ctx.is_warm_up} { - bool has_moe = false; + std::vector moe_weights; + std::vector ffn_weights; + std::vector gdn_weights; + std::vector attn_weights; + for (int i = 0; i < model_weight.num_layer; ++i) { - if 
(model_weight.layer(i)->moe_ffn) { - has_moe = true; - break; + auto layer = model_weight.layer(i); + if (layer->moe_ffn) { + moe_weights.push_back(layer->moe_ffn.get()); + } + if (layer->linear_attn) { + gdn_weights.push_back(layer->linear_attn.get()); + } + if (layer->attention) { + attn_weights.push_back(layer->attention.get()); + } + if (layer->feed_forward) { + ffn_weights.push_back(layer->feed_forward.get()); } } - if (has_moe) { + + if (!moe_weights.empty()) { moe_ffn_layer_ = std::make_unique(engine, ctx); } - std::vector attn_weights; - attn_weights.reserve(model_weight.num_layer); - for (int i = 0; i < model_weight.num_layer; ++i) { - if (auto* attn = model_weight.layer(i)->attention.get()) { - attn_weights.push_back(attn); - } + if (!ffn_weights.empty()) { + ffn_layer_ = std::make_unique(ctx); } - attn_layer_ = std::make_unique(engine.quant_policy, - model_weight.layer_types, - model_weight.num_layer, - attn_weights, - engine, - ctx, - phases, - (bool)moe_ffn_layer_); - - bool has_linear_attn = false; - for (auto t : model_weight.layer_types) { - if (t == 1) { - has_linear_attn = true; - break; - } - } - if (has_linear_attn) { - linear_attn_layer_ = - std::make_unique(model_weight.data_type, model_weight.layer_types, engine, ctx, phases); + if (!attn_weights.empty()) { + attn_layer_ = std::make_unique(attn_weights, // + registry, + engine, + ctx, + phases, + (bool)moe_ffn_layer_); } - bool has_ffn = false; - for (int i = 0; i < model_weight.num_layer; ++i) { - if (model_weight.layer(i)->feed_forward) { - has_ffn = true; - break; - } - } - if (has_ffn) { - ffn_layer_ = std::make_unique(ctx); + if (!gdn_weights.empty()) { + linear_attn_layer_ = std::make_unique(gdn_weights, // + registry, + engine, + ctx, + phases); } } @@ -250,7 +248,7 @@ void UnifiedDecoder::Forward(int phase, TensorMap& args, const std::vectorlinear_attn) { linear_attn_layer_->Forward( - {phase, local_hidden_states, local_hidden_states, weights.at(layer)->linear_attn.get(), layer}); + 
{phase, local_hidden_states, local_hidden_states, weights.at(layer)->linear_attn.get()}); } else { auto* attn = weights.at(layer)->attention.get(); diff --git a/src/turbomind/models/llama/unified_decoder.h b/src/turbomind/models/llama/unified_decoder.h index 7c1f36af65..ac60921978 100644 --- a/src/turbomind/models/llama/unified_decoder.h +++ b/src/turbomind/models/llama/unified_decoder.h @@ -12,12 +12,17 @@ namespace turbomind { class ModelWeight; class DecoderLayerWeight; +class CacheRegistry; class UnifiedDecoder { public: using WeightType = DecoderLayerWeight; - UnifiedDecoder(const EngineParam& engine, const Context& ctx, int phases, const ModelWeight& model_weight); + UnifiedDecoder(CacheRegistry& registry, + const EngineParam& engine, + const Context& ctx, + int phases, + const ModelWeight& model_weight); void Run(BatchOp op, int phase, TensorMap& env); diff --git a/src/turbomind/models/output_processor.cc b/src/turbomind/models/output_processor.cc index 067c6d251b..577b152ec6 100644 --- a/src/turbomind/models/output_processor.cc +++ b/src/turbomind/models/output_processor.cc @@ -12,6 +12,7 @@ namespace turbomind { using std::vector; using std::shared_ptr; +using std::unique_ptr; struct OutputProcessor::Impl { @@ -31,18 +32,38 @@ struct OutputProcessor::Impl { } } + struct OutputRange { + std::shared_ptr request; + int type; + Interval src; + Interval dst; + }; + + // A CE-loss scoring segment for one request in the current forward. Unlike `main`, which carries + // the `RequestCache` on the batch and reaches `c.ce_loss`/`c.input_ce_loss` on the executor + // thread, our executor side never touches `Sequence`. So we capture everything the executor needs + // at Setup time: the request (for `outputs["ce_loss"]`), a handle to the Sequence's persistent + // rank-0 accumulator, the hidden-buffer position interval, and whether `input_ce_loss` was fully + // consumed this forward (chunked prefill emits only on the final forward). 
+ struct CeLossSegment { + std::shared_ptr request; + Buffer_ ce_loss; + Interval range; + bool last; + }; + struct Data { Interval full_states; // requested range for full hidden states Interval full_logits; // requested range for full logits - vector> output_states; - vector> output_logits; + vector output_states; + vector output_logits; - Interval full_ce_loss; // requested range for CE loss logits - // Per CE scoring segment: (request_idx, range). `range` is the hidden-buffer position - // interval; `ce_targets` is indexed by that same position, so no extra offset is stored. - vector> ce_loss_segments; - Buffer_ ce_targets; + Interval full_ce_loss; // requested range for CE-loss logits + // Per CE scoring segment; `ce_targets` is indexed by the segment's hidden-buffer position + // (the `range`), so no extra offset is stored. + vector ce_loss_segments; + Buffer_ ce_targets; }; vector data_; @@ -68,7 +89,7 @@ struct OutputProcessor::Impl { void Add(int phase, TensorMap& env) { - const Buffer_ rc = env.at("requests").buffer(); + const Buffer_ rc = env.at("requests").buffer(); for (int i = 0; i < rc.size(); ++i) { auto& c = *rc[i]; @@ -94,9 +115,9 @@ struct OutputProcessor::Impl { { auto& d = data_.at(phase); - const auto& rc = env.at("batch").data()[0]->rc; - - auto& copy = *env.at("copy").data()[0]; + // const auto& rc = env.at("batch").data()[0]->rc; + Buffer_ rc = env.at("requests").buffer(); + auto& copy = *env.at("copy").data()[0]; vector all_tokens; vector sel_tokens; @@ -104,8 +125,8 @@ struct OutputProcessor::Impl { for (int i = 0; i < rc.size(); ++i) { using Size = Interval::Size; auto& c = *rc[i]; - all_tokens.emplace_back(c.history_len + c.alpha, Size{c.input_len}); - sel_tokens.emplace_back(c.history_len + c.alpha + c.input_len - 1, Size{1}); + all_tokens.emplace_back(c.history_len + c.inflight_input_len, Size{c.input_len}); + sel_tokens.emplace_back(c.history_len + c.inflight_input_len + c.input_len - 1, Size{1}); if (!c.generating) { 
sel_tokens.back() = {}; } @@ -149,7 +170,7 @@ struct OutputProcessor::Impl { type = 2; } if (type) { - d.output_states.emplace_back(i, type, m.src, m.dst); + d.output_states.push_back({c.req, type, m.src, m.dst}); // dbg(type, &m.src, &m.dst); } } @@ -163,12 +184,13 @@ struct OutputProcessor::Impl { type = 2; } if (type) { - d.output_logits.emplace_back(i, type, m.src, m.dst); + d.output_logits.push_back({c.req, type, m.src, m.dst}); } } if (c.input_ce_loss) { if (tp_rank_ == 0 && !c.ce_loss) { - // Per-request accumulator, allocated and zeroed once. + // Per-request accumulator, allocated and zeroed once; persists across the + // chunked-prefill forwards that erode `input_ce_loss`. c.ce_loss = {1, kDEVICE}; Clear(c.ce_loss); } @@ -177,7 +199,10 @@ struct OutputProcessor::Impl { if (tp_rank_ == 0) { copy(c.token_ids + m.dst.begin() + 1, (int)m.src.size(), d.ce_targets.data() + m.src.begin()); } - d.ce_loss_segments.emplace_back(i, m.src); + // Capture everything the executor side needs (no `Sequence` access there): the + // request, a handle to its persistent accumulator, the hidden-buffer range, and + // whether this forward fully consumed `input_ce_loss` (emit only then). 
+ d.ce_loss_segments.push_back({c.req, c.ce_loss, m.src, !c.input_ce_loss}); } } offset += c.input_len; @@ -195,7 +220,20 @@ struct OutputProcessor::Impl { } } - void ComputeCeLoss(Data& data, const Tensor& logits, int base, const vector>& rs) + template + void OutputHiddenStates(const Ranges& ranges, const Tensor& h, int type) + { + for (const auto& r : ranges) { + if (r.type == type) { + auto& out = r.request->outputs.at("last_hidden_state"); + if (tp_rank_ == 0) { + Copy(h.slice(r.src.begin(), (int)r.src.size()), out.slice(r.dst.begin(), (int)r.dst.size())); + } + } + } + } + + void ComputeCeLoss(Data& data, const Tensor& logits, int base) { if (tp_rank_ != 0 || data.ce_loss_segments.empty()) { return; @@ -204,12 +242,12 @@ struct OutputProcessor::Impl { const auto stream = core::Context::stream().handle(); const Interval rows{base, Interval::Size{(int)logits.shape(0)}}; - for (const auto& [request_idx, range] : data.ce_loss_segments) { - if (auto src = range & rows) { + for (auto& seg : data.ce_loss_segments) { + if (auto src = seg.range & rows) { const int tokens = (int)src.size(); const int target_offset = src.begin(); const int logit_offset = src.begin() - base; - invokeCrossEntropyLoss(rs[request_idx]->ce_loss.data(), + invokeCrossEntropyLoss(seg.ce_loss.data(), logits, data.ce_targets.data(), target_offset, @@ -221,36 +259,19 @@ struct OutputProcessor::Impl { } } - void OutputCELoss(Data& data, BatchCopy& copy, const vector>& rs) + void OutputCELoss(const Data& data) { if (tp_rank_ != 0) { return; } - for (const auto& [i, range] : data.ce_loss_segments) { - if (auto& c = *rs[i]; !c.input_ce_loss) { - copy(c.ce_loss, 1, c.req->outputs.at("ce_loss").buffer()); - } - } - } - - template - void OutputHiddenStates(const Ranges& ranges, const Tensor& h, int type, const vector>& rs) - { - for (const auto& [i, t, src, dst] : ranges) { - if (t == type) { - auto& out = rs[i]->req->outputs.at("last_hidden_state"); - if (tp_rank_ == 0) { - // dbg(&src, &dst); - 
Copy(h.slice(src.begin(), (int)src.size()), out.slice(dst.begin(), (int)dst.size())); - } + for (const auto& seg : data.ce_loss_segments) { + if (seg.last) { // input_ce_loss fully consumed -> accumulator is final + Copy(seg.ce_loss, seg.request->outputs.at("ce_loss").buffer()); } } } - void ComputeAndOutputLogits(Data& data, // - const Tensor& h, - BatchCopy& copy, - const vector>& rs) + void ComputeAndOutputLogits(Data& data, const Tensor& h) { const int step_size = max_logits_len_; @@ -261,51 +282,51 @@ struct OutputProcessor::Impl { using Size = Interval::Size; bool success = ranges.empty(); - // Erode the range iteratively until empty + // Erode the range iteratively until empty. Each chunk feeds two independent consumers: + // full-logits output and CE-loss accumulation. for (auto r = data.full_logits | data.full_ce_loss; r; r = -step_size | r) { // dbg(&r); if (auto chunk = r & Interval{r.begin(), Size{step_size}}) { // dbg(&chunk); - // Compute & output full logits by chunks - // The chunked logits feeds two independent consumers + // Compute full logits by chunks auto logits = lm_head_(h.slice(chunk.begin(), (int)chunk.size())); if (!success) { - success = OutputLogitsImpl(ranges, p, logits, chunk.begin(), 2, rs); + success = OutputLogitsImpl(ranges, p, logits, chunk.begin(), 2); } - ComputeCeLoss(data, logits, chunk.begin(), rs); + ComputeCeLoss(data, logits, chunk.begin()); } } TM_CHECK(success); // every type-2 logits range must have been output // CE loss is fully accumulated now that the chunk loop is done; emit it. 
- OutputCELoss(data, copy, rs); + OutputCELoss(data); } template - void OutputLogits(Ranges& ranges_, const Tensor& l, int type, const vector>& rs) + void OutputLogits(Ranges& ranges_, const Tensor& l, int type) { // Coroutine frame int p = 0; auto ranges = ranges_; - TM_CHECK(OutputLogitsImpl(ranges, p, l, /* base */ 0, type, rs)); + TM_CHECK(OutputLogitsImpl(ranges, p, l, /* base */ 0, type)); } template - bool OutputLogitsImpl( - Ranges& ranges, int& p, const Tensor& l, int base, int type, const vector>& rs) + bool OutputLogitsImpl(Ranges& ranges, int& p, const Tensor& l, int base, int type) { // dbg("OutputLogitsImpl"); const auto stream = core::Context::stream().handle(); for (; p < ranges.size(); ++p) { - if (auto& [i, t, src, dst] = ranges[p]; t == type) { - Tensor& out = rs[i]->req->outputs.at("logits"); + auto& r = ranges[p]; + if (r.type == type) { + Tensor& out = r.request->outputs.at("logits"); const DataType dtype = out.dtype(); - TM_CHECK_LE(base, src.begin()); // logical error - if (Interval msrc = src & Interval{base, Interval::Size{(int)l.shape(0)}}) { + TM_CHECK_LE(base, r.src.begin()); // logical error + if (Interval msrc = r.src & Interval{base, Interval::Size{(int)l.shape(0)}}) { const int tokens = (int)msrc.size(); - Interval mdst{dst.begin(), msrc.size()}; + Interval mdst{r.dst.begin(), msrc.size()}; // TODO: support strides in `DLTensor`, so that batched 1D copy can be used if (tp_rank_ == 0) { // dbg(&mdst, &msrc, tokens, out, base, l); @@ -320,11 +341,11 @@ struct OutputProcessor::Impl { 0); } // move to next request if they are empty after the erosion - src = -(int)msrc.size() | src; - dst = -(int)mdst.size() | dst; + r.src = -(int)msrc.size() | r.src; + r.dst = -(int)mdst.size() | r.dst; } - // dbg(&src, (int)src.size(), &dst, (int)dst.size()); - if (src) { + // dbg(&r.src, (int)r.src.size(), &r.dst, (int)r.dst.size()); + if (r.src) { // request not compeleted, suspend and wait for next chunk return false; } @@ -336,25 +357,23 @@ struct 
OutputProcessor::Impl { void OutputHiddenStatesAndLogits(int phase, TensorMap& env, int type) { auto& d = data_.at(phase); - auto& b = *env.at("batch").data()[0]; if (type == 2 && d.full_states) { auto hidden_states = env.consume("full_hidden_states"); if (!d.output_states.empty()) { - OutputHiddenStates(d.output_states, hidden_states, 2, b.rc); + OutputHiddenStates(d.output_states, hidden_states, 2); } if (d.full_logits || d.full_ce_loss) { - auto& copy = *env.at("copy").data()[0]; - ComputeAndOutputLogits(d, hidden_states, copy, b.rc); + ComputeAndOutputLogits(d, hidden_states); } } if (type == 1) { if (!d.output_states.empty()) { - OutputHiddenStates(d.output_states, env.at("hidden_states"), 1, b.rc); + OutputHiddenStates(d.output_states, env.at("hidden_states"), 1); } if (!d.output_logits.empty()) { - OutputLogits(d.output_logits, env.at("logits"), 1, b.rc); + OutputLogits(d.output_logits, env.at("logits"), 1); } } } diff --git a/src/turbomind/models/qwen3_5vit/qwen3_5vit.cc b/src/turbomind/models/qwen3_5vit/qwen3_5vit.cc index 547206c31e..2fa5a8e0f9 100644 --- a/src/turbomind/models/qwen3_5vit/qwen3_5vit.cc +++ b/src/turbomind/models/qwen3_5vit/qwen3_5vit.cc @@ -9,7 +9,6 @@ #include "src/turbomind/kernels/norm/layer_norm.h" #include "src/turbomind/kernels/norm/rms_norm.h" #include "src/turbomind/models/layer_norm_weight.h" -#include "src/turbomind/models/llama/SequenceManager.h" #include "src/turbomind/models/qwen3_5vit/bias_gelu.h" #include "src/turbomind/models/qwen3_5vit/fast_pos_embed.h" #include "src/turbomind/models/qwen3_5vit/fast_rotary_pos_emb.h" @@ -204,15 +203,12 @@ struct Qwen3_5Vit::Impl { return output; } - int Add(RequestCache& c) + int Add(Sequence& s) { - const auto& [r, s] = std::tie(*c.req, *c.seq); + auto& r = *s.req; if (r.mm_inputs) { - if ((not r.session.start_flag) or (not r.session.end_flag)) { - // only support non-interactive inference - return Request::kInvalid; - } - + // The stateful-session subsystem was removed: every request is 
a single + // start+end shot, so there is no interactive (start_flag/end_flag) guard. const auto mm_inputs = std::dynamic_pointer_cast(r.mm_inputs); if (!mm_inputs) { return Request::kInvalid; @@ -228,9 +224,10 @@ struct Qwen3_5Vit::Impl { return Request::kInvalid; } - auto mm_item = std::make_shared( - MultiModalData{item.data, Interval{item.token_begin, Interval::Size{tokens}}, item.grid_thw}); + const Interval interval{item.token_begin, Interval::Size{tokens}}; + auto mm_item = std::make_shared(MultiModalData{item.data, interval, item.grid_thw}); s.multimodal_inputs.push_back(mm_item); + s.multimodal_spans.push_back(MultiModalSpan{interval, item.fingerprint}); } } @@ -240,11 +237,11 @@ struct Qwen3_5Vit::Impl { void Add(int phase, TensorMap& env) { // convert model-specific multimodal inputs to internal MultiModalData - const Buffer_ rc = env.at("requests").buffer(); + const Buffer_ rc = env.at("requests").buffer(); for (int i = 0; i < rc.size(); ++i) { - auto& c = *TM_CHECK_NOTNULL(rc[i]); - if (c.status == 0) { - c.status = Add(c); + auto& s = *TM_CHECK_NOTNULL(rc[i]); + if (s.status == 0) { + s.status = Add(s); } } } @@ -261,9 +258,9 @@ struct Qwen3_5Vit::Impl { // ownership via shared_ptr; UAL borrows safely across env clears. void SetupMrope(int phase, TensorMap& env) { - auto& d = data_.at(phase); - auto& b = *env.at("batch").data()[0]; - auto& rc = b.rc; + auto& d = data_.at(phase); + + Buffer_ rc = env.at("requests").buffer(); const int bsz = (int)rc.size(); if (bsz <= 0) { @@ -276,9 +273,9 @@ struct Qwen3_5Vit::Impl { // Worst case per prefill slot with mrope: 2*num_images + 1 segments. 
int upper_segs = 0; for (int i = 0; i < bsz; ++i) { - const auto& c = *rc[i]; - if (!c.autoregres && !c.seq->multimodal_inputs.empty()) { - upper_segs += 2 * (int)c.seq->multimodal_inputs.size() + 1; + const auto& s = *rc[i]; + if (!s.autoregres && !s.multimodal_inputs.empty()) { + upper_segs += 2 * (int)s.multimodal_inputs.size() + 1; } } const ssize_t upper_ints = (ssize_t)upper_segs * kMropeSegInts; @@ -296,12 +293,11 @@ struct Qwen3_5Vit::Impl { int max_seg_len = 0; for (int i = 0; i < bsz; ++i) { - const auto& c = *rc[i]; - const auto& s = *c.seq; - const int seq_len = (int)c.req->inputs.at("input_ids").shape(0); - const bool needs_table = !c.autoregres && !s.multimodal_inputs.empty(); - const int active_start = c.history_len + c.alpha; - const int active_end = active_start + c.input_len; + const auto& s = *rc[i]; + const int seq_len = (int)s.req->inputs.at("input_ids").shape(0); + const bool needs_table = !s.autoregres && !s.multimodal_inputs.empty(); + const int active_start = s.history_len + s.inflight_input_len; + const int active_end = active_start + s.input_len; auto emit = [&](int run_start, int run_n, int run_base, int h2, int w2) { const int a = std::max(run_start, active_start); @@ -377,7 +373,6 @@ struct Qwen3_5Vit::Impl { { // create batch data according to scheduled sequences auto& d = data_.at(phase); - auto& b = *env.at("batch").data()[0]; auto& copy = *env.at("copy").data()[0]; auto& cfg = weights_.config(); @@ -387,13 +382,16 @@ struct Qwen3_5Vit::Impl { std::vector pixel_values; // collect image/video pixel values, grid_thws and embeds_coords - const auto& rc = b.rc; + Buffer_ rc = env.at("requests").buffer(); + int mm_prefill_seqs = 0; // prefill sequences carrying multimodal inputs + int images_total = 0; // total images across those sequences for (int i = 0; i < rc.size(); ++i) { - const auto& c = *rc[i]; - const auto& s = *c.seq; + const auto& s = *rc[i]; - if ((not c.autoregres) && (not s.multimodal_inputs.empty())) { - Interval 
text{c.history_len + c.alpha, Interval::Size{c.input_len}}; + if ((not s.autoregres) && (not s.multimodal_inputs.empty())) { + ++mm_prefill_seqs; + images_total += (int)s.multimodal_inputs.size(); + Interval text{s.history_len + s.inflight_input_len, Interval::Size{s.input_len}}; for (const auto& mm : s.multimodal_inputs) { auto o = mm->interval & text; if (auto size = (int)o.size()) { @@ -413,7 +411,19 @@ struct Qwen3_5Vit::Impl { } } - input_ids_offsets += c.autoregres ? 1 : c.input_len; + input_ids_offsets += s.autoregres ? 1 : s.input_len; + } + + // Prefix-cache observability: on a fully-cached image, the window filter + // above batches 0 images (ViT skipped). Only logged for multimodal + // prefill passes so decode steps stay quiet. + if (mm_prefill_seqs > 0) { + const int images_batched = (int)pixel_values.size(); + TM_LOG_INFO("Qwen3.5 ViT setup: mm_seqs={} images_batched={} images_skipped={} patches={}", + mm_prefill_seqs, + images_batched, + images_total - images_batched, + d.batch_size); } // copy pixel values to batch input diff --git a/src/turbomind/models/qwen3_5vit/qwen3_5vit_input.h b/src/turbomind/models/qwen3_5vit/qwen3_5vit_input.h index 122bd49524..279ab4fc06 100644 --- a/src/turbomind/models/qwen3_5vit/qwen3_5vit_input.h +++ b/src/turbomind/models/qwen3_5vit/qwen3_5vit_input.h @@ -3,6 +3,7 @@ #pragma once #include "src/turbomind/core/core.h" +#include "src/turbomind/engine/fingerprint.h" #include "src/turbomind/engine/multimodal_input.h" #include @@ -18,11 +19,22 @@ struct Qwen3_5VitItem { int token_begin; int token_end; std::array grid_thw; + Fingerprint fingerprint{}; // image content hash from the converter (empty if none supplied) Qwen3_5VitItem() = default; - Qwen3_5VitItem(Modality modality, Tensor data, int token_begin, int token_end, std::array grid_thw): - modality{modality}, data{std::move(data)}, token_begin{token_begin}, token_end{token_end}, grid_thw{grid_thw} + Qwen3_5VitItem(Modality modality, + Tensor data, + int token_begin, + 
int token_end, + std::array grid_thw, + Fingerprint fingerprint = {}): + modality{modality}, + data{std::move(data)}, + token_begin{token_begin}, + token_end{token_end}, + grid_thw{grid_thw}, + fingerprint{fingerprint} { } }; diff --git a/src/turbomind/python/bind.cpp b/src/turbomind/python/bind.cpp index 1a7a107a7f..0593337b08 100644 --- a/src/turbomind/python/bind.cpp +++ b/src/turbomind/python/bind.cpp @@ -1,6 +1,7 @@ // Copyright (c) OpenMMLab. All rights reserved. #include +#include #include #include #include @@ -20,6 +21,7 @@ #include "src/turbomind/core/module.h" #include "src/turbomind/core/tensor.h" #include "src/turbomind/engine/engine_config.h" +#include "src/turbomind/engine/fingerprint.h" #include "src/turbomind/engine/model_request.h" #include "src/turbomind/engine/multimodal_input.h" #include "src/turbomind/models/attention_weight.h" @@ -344,20 +346,41 @@ PYBIND11_MODULE(_turbomind, m) .value("AUDIO", MMModality::kAudio) .value("TIME_SERIES", MMModality::kTimeSeries) .export_values(); + auto fp_from_bytes = [](const py::bytes& b) -> ft::Fingerprint { + ft::Fingerprint fp{}; + char* buf = nullptr; + Py_ssize_t len = 0; + if (PyBytes_AsStringAndSize(b.ptr(), &buf, &len) != 0) { + throw py::error_already_set(); + } + if (len == 0) { + return fp; // empty sentinel + } + if (len != 32) { + throw std::invalid_argument("Qwen3_5VitItem.fingerprint must be 0 or 32 bytes (SHA-256)"); + } + std::memcpy(fp.words.data(), buf, 32); + return fp; + }; + auto fp_to_bytes = [](const ft::Fingerprint& fp) -> py::bytes { + return py::bytes(reinterpret_cast(fp.words.data()), 32); + }; py::class_(multimodal, "Qwen3_5VitItem") .def(py::init<>()) - .def(py::init([](MMModality modality, - std::shared_ptr data, - int token_begin, - int token_end, - std::array grid_thw) { - return QwenVitItem{modality, *data, token_begin, token_end, grid_thw}; + .def(py::init([fp_from_bytes](MMModality modality, + std::shared_ptr data, + int token_begin, + int token_end, + std::array grid_thw, 
+ py::bytes fingerprint) { + return QwenVitItem{modality, *data, token_begin, token_end, grid_thw, fp_from_bytes(fingerprint)}; }), "modality"_a, "data"_a, "token_begin"_a, "token_end"_a, - "grid_thw"_a) + "grid_thw"_a, + "fingerprint"_a = py::bytes()) .def_readwrite("modality", &QwenVitItem::modality) .def_property( "data", @@ -365,7 +388,11 @@ PYBIND11_MODULE(_turbomind, m) [](QwenVitItem& self, std::shared_ptr data) { self.data = *data; }) .def_readwrite("token_begin", &QwenVitItem::token_begin) .def_readwrite("token_end", &QwenVitItem::token_end) - .def_readwrite("grid_thw", &QwenVitItem::grid_thw); + .def_readwrite("grid_thw", &QwenVitItem::grid_thw) + .def_property( + "fingerprint", + [fp_to_bytes](const QwenVitItem& self) { return fp_to_bytes(self.fingerprint); }, + [fp_from_bytes](QwenVitItem& self, py::bytes b) { self.fingerprint = fp_from_bytes(b); }); py::class_>(multimodal, "Qwen3_5VitInput") .def(py::init<>()) .def(py::init>(), "items"_a) @@ -390,25 +417,16 @@ PYBIND11_MODULE(_turbomind, m) .def_readonly("scheduler_tick", &ft::ScheduleMetrics::scheduler_tick); py::class_(m, "SessionParam") - .def(py::init([](uint64_t id, int step, bool start, bool end) { - if (!start && end) { - throw std::logic_error("unsupported arguments: start=false, end=true"); - } + .def(py::init([](uint64_t id, int step) { ft::SessionParam param{}; - param.id = id; - param.step = step; - param.start_flag = start; - param.end_flag = end; + param.id = id; + param.step = step; return param; }), "id"_a, - "step"_a, - "start"_a, - "end"_a) + "step"_a) .def_readwrite("id", &ft::SessionParam::id) - .def_readwrite("step", &ft::SessionParam::step) - .def_readwrite("start", &ft::SessionParam::start_flag) - .def_readwrite("end", &ft::SessionParam::end_flag); + .def_readwrite("step", &ft::SessionParam::step); py::class_(m, "GenerationConfig") .def(py::init()) @@ -614,14 +632,6 @@ PYBIND11_MODULE(_turbomind, m) model_request->Cancel(); // }, py::call_guard()) - .def( - "end", - 
[](ModelRequest* model_request, std::function cb, uint64_t session_id) { - model_request->End(std::move(cb), session_id); // - }, - py::call_guard(), - "cb"_a, - "session_id"_a) .def( "set_grammar", [](ModelRequest* model_request, const xgrammar::CompiledGrammar& grammar) { diff --git a/src/turbomind/turbomind.cc b/src/turbomind/turbomind.cc index 9ed314bff2..e4c40bb79b 100644 --- a/src/turbomind/turbomind.cc +++ b/src/turbomind/turbomind.cc @@ -1,6 +1,7 @@ // Copyright (c) OpenMMLab. All rights reserved. #include +#include #include #include "src/turbomind/turbomind.h" @@ -11,6 +12,7 @@ #include "src/turbomind/core/core.h" #include "src/turbomind/core/data_type.h" +#include "src/turbomind/engine/cache_registry.h" #include "src/turbomind/engine/engine.h" #include "src/turbomind/engine/gateway.h" #include "src/turbomind/engine/model_executor.h" @@ -22,7 +24,6 @@ #include "src/turbomind/models/model_root.h" #include "src/turbomind/models/model_weight.h" #include "src/turbomind/models/vision_model.h" -#include "src/turbomind/models/vision_model_weight.h" #include "src/turbomind/kernels/gemm/tuner/params.h" @@ -218,7 +219,7 @@ void TurboMind::Impl::CreateContext(int index) auto& c = ctx->comm; - c.h_global = group_id_->CreateCommunicator(comm_size_ * p.outer_dp_size, global_rank, p.node_rank); + c.h_global = group_id_->CreateCommunicator(comm_size_, global_rank, p.node_rank); c.h_comm = c.h_global->Split(outer_rank, 0); @@ -275,20 +276,45 @@ void TurboMind::Impl::CreateEngine(int index) ctx.comm.h_comm->Sync(); + const double cache_ratio = param.cache_max_block_count; + TM_CHECK_GT(cache_ratio, 0.) << "object-cache path expects 0 < cache_max_block_count < 1"; + TM_CHECK_LT(cache_ratio, 1.) 
<< "object-cache path no longer accepts cache_max_block_count as a block count"; + + size_t free_bytes{}, total_bytes{}; + TM_CUDA_CHECK(cudaMemGetInfo(&free_bytes, &total_bytes)); + free_bytes = AllReduce(ctx.comm.h_tp_group, free_bytes, comm::RedOp::kMin); + + const size_t cache_bytes = static_cast(static_cast(free_bytes) * cache_ratio); + TM_CHECK_GT(cache_bytes, size_t{0}); + TM_CHECK_LE(cache_bytes, static_cast(std::numeric_limits::max())); + + TM_LOG_INFO("Object cache budget: {:.2f} MB from free {:.2f} MB and ratio {:.3f}", + cache_bytes / (1024. * 1024.), + free_bytes / (1024. * 1024.), + cache_ratio); + + Buffer cache_region{static_cast(cache_bytes), data_type_v, core::Context::device_alloc()}; + ObjectAllocator alloc{std::move(cache_region)}; + CacheRegistry cache_registry; + cache_registry.set_checkpoint_min_interval(param.cache_checkpoint_interval); + // create model - LanguageModel model{param, ctx, *weights_[index]->text_model_ptr(), phases_}; + LanguageModel model{cache_registry, param, ctx, *weights_[index]->text_model_ptr(), phases_}; - // create optional vision model + // create vision model for VLM checkpoints; null for text-only (no vision sub-tree attached) std::unique_ptr vision_model; if (auto* vw = weights_[index]->vision_model_ptr()) { vision_model = CreateVisionModel(*vw, param, ctx, phases_); } + cache_registry.RegisterObjectIds(alloc); + // create engine engines_[index] = Engine{param, + std::move(alloc), + std::move(cache_registry), std::move(model), std::move(vision_model), - *weights_[index]->text_model_ptr(), ctx, *gateway_, engine_param_.devices[index], @@ -386,8 +412,6 @@ void TurboMind::Impl::WarmUp(int index) TensorMap inputs{{"input_ids", input_ids.slice(0, token_num)}}; ModelRequest::InputParam param{}; - param.session.start_flag = true; - param.session.end_flag = true; param.gen_cfg.max_new_tokens = 1; param.tensors = std::make_shared(inputs); diff --git a/tests/test_lmdeploy/test_pipeline.py 
b/tests/test_lmdeploy/test_pipeline.py index 9774590cab..5e6196b569 100644 --- a/tests/test_lmdeploy/test_pipeline.py +++ b/tests/test_lmdeploy/test_pipeline.py @@ -120,9 +120,26 @@ def test_stream_infer_batch(self, pipe): full_text = ''.join([c.text for c in chunks]) assert len(full_text) > 0 - def test_stream_infer_with_session(self, pipe): - """Test stream_infer with session for multi-turn context.""" + def test_stream_infer_with_session(self, pipe, backend): + """Multi-turn interactive session. + + PyTorch supports stateful multi-turn inference. TurboMind is stateless-only and rejects any request that is not + (sequence_start and sequence_end) with ResponseType.NOT_SUPPORTED, which surfaces as an error string in the + response text. + """ session = pipe.session() + + if backend == 'turbomind': + generator = pipe.stream_infer(prompts='Hello! My name is Alice.', + sessions=session, + gen_config=GenerationConfig(max_new_tokens=30), + sequence_start=True, + sequence_end=False, + enable_thinking=False) + text = ''.join(out.text for out in generator if out.text) + assert 'NOT_SUPPORTED' in text + return + prompt1 = 'Hello! My name is Alice.' step = 0