# NOTE on transformers version pin:
#   Upper bound lifted 2026-06-09. It was `<5.0` to keep the legacy
#   Qwen3 dLM proposer (`dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1`)
#   running: its custom `modeling_qwen3.py` depends on the 4.x
#   `decoder_layer.attention_type` API that transformers 5.x removed.
#
#   K3 critical path needs transformers 5.0+:
#     * Gemma 4 26B-A4B verifier (per ADR 0008 §11.7.0)
#     * scripts/research/k3_dflash_specdecode_eval.py and
#       k3_dflash_alignment_train.py (load Gemma 4 via transformers)
#
#   So the upper bound was dropped.
#
#   KNOWN ISSUE (K2.B Qwen backport path, NOT K3 critical path): under
#   transformers 5.x, attempts to load
#   `dllm-hub/Qwen3-0.6B-diffusion-mdlm-v0.1` will likely raise an
#   AttributeError on `attention_type`. This affects:
#       * `training/repr_align/proposer_surgery.py`
#       * `kv_cache_proposer/proposer.py`
#       * `inference_engine/proposer/sparse_logits.py`
#       * `inference_engine/backends/mlx/proposer.py`
#       * `tests/system/test_http_*_real_engine.py` (if they download)
#   None of these are on the K3 critical path. K2.B Qwen backport is
#   the natural place to author a compat patch when that path resumes
#   after K3 ships.
#
#   For now, if you need the legacy Qwen3 dLM path, install transformers
#   4.x in a dedicated venv:  pip install 'transformers>=4.45,<5.0'
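#
#   A minimal sketch of that setup, assuming a POSIX shell (the venv
#   name `.venv-qwen3-legacy` is a hypothetical choice, not a project
#   convention):
#       python -m venv .venv-qwen3-legacy
#       . .venv-qwen3-legacy/bin/activate
#       pip install 'transformers>=4.45,<5.0'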
torch>=2.4,<3.0
transformers>=4.45
accelerate>=0.34
safetensors>=0.4
huggingface_hub>=0.24
numpy>=1.26

# HTTP serving stack (E2: OpenAI-compatible API + SSE streaming).
# fastapi/starlette pin: 0.115+ for the BackgroundTasks/anyio combo we
# rely on; sse-starlette gives us the EventSourceResponse class that
# handles SSE framing including disconnect detection.
fastapi>=0.115,<1.0
starlette>=0.41,<1.0
sse-starlette>=2.1,<3.0
uvicorn>=0.32,<1.0
pydantic>=2.7,<3.0
httpx>=0.27,<1.0
prometheus-client>=0.20,<1.0

# gRPC runtime (PR-B1 of ADR 0008 Phase B; the runtime/SDK protocol).
# grpcio-tools is needed only at build time (regenerating stubs from
# proto/kakeya/v1/runtime.proto) — kept in this single requirements
# file because (a) we want CI to be able to drift-check the
# committed stubs, and (b) the install footprint is small and there
# is no separate dev-requirements file in this project today.
#
# grpcio-tools is PINNED to an EXACT version: the generated stub embeds
# the generator version as `GRPC_GENERATED_VERSION = '<grpcio-tools ver>'`,
# so a loose range lets a patch release (e.g. 1.81.0 -> 1.81.1) silently
# change the committed bytes and fail the `proto stub drift` CI gate.
# Bump this pin (and re-run scripts/regenerate_proto_stubs.sh) deliberately.
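#
# A sketch of a deliberate bump, assuming the script rewrites the
# committed stubs in place (the `git diff` check is illustrative and
# not necessarily what the CI gate runs):
#     pip install -r requirements.txt
#     bash scripts/regenerate_proto_stubs.sh
#     git diff --stat   # review which generated files changed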
grpcio>=1.65,<2.0
grpcio-tools==1.81.1

# K2.A KakeyaLattice K/V cache compression (ADR 0008 §11.11).
# Lower bound is >= 1.5 (the release tested on Mac M4; see the
# PR-K2.A.0 Mac smoke evidence at
# results/research/k2a_kl_mac_smoke_*.json).
# The K2.A.1 integration treats the dependency as REQUIRED for v0.4
# runtime functionality with KL on. With --kl-off the runtime falls
# back to IdentityCompressor (K1 baseline) and kakeyalattice is unused
# at runtime, but the `inference_engine.v04.kv_compressor` module
# still attempts an eager import during make_default_compressor, so
# installation is mandatory for a complete v0.4 install.
kakeyalattice>=1.5

# Test stack (used by tests/ and run_platform_tests.sh)
pytest>=8.0
pytest-cov>=5.0
pytest-asyncio>=0.23
pytest-timeout>=2.3