Releases · dphnAI/aphrodite-engine

02 May 14:05

AlpinDale

v0.21.0

18f852d

v0.21.0 Latest


What's Changed

build: support python 3.14 by @AlpinDale in #1636
fix: GLM-5.1 on ROCm by @AlpinDale in #1637
fix: replica selection bias in fusedmoe router by @AlpinDale in #1638
fix: respect TORCH_COMPILE_DISABLE env var for torch 2.12 by @AlpinDale in #1639
chore: remove dead code from worker by @AlpinDale in #1640
feat: warmup readonly mm processor during renderer startup by @AlpinDale in #1641
fix: GPU memory leaks in engine shutdown for rocm by @AlpinDale in #1642
chore: optimize deepstack buffer handling for MM Qwen3 models by @AlpinDale in #1643
feat: support kv offload storing with multiple KV groups by @AlpinDale in #1644
feat: add perf benchmark script by @AlpinDale in #1645
fix: only unpad routed output before shared expert add by @AlpinDale in #1646
fix: DSML token leakage in DeepSeek-V4 and 3.2 by @AlpinDale in #1647
fix: size the MNNVL workspace for flashinfer to EP group by @AlpinDale in #1648
fix: offload all KV blocks when doing prefill in P/D by @AlpinDale in #1649
fix: disable sequence parallelism for piecewise compilation by @AlpinDale in #1650
feat: implement DeepSeek-V4 model by @AlpinDale in #1651
perf: EXL3 performance tuning on GeForce Blackwell by @AlpinDale in #1652
fix: TRT-LLM MXFP4 MoE compile for DeepSeek-V4 by @AlpinDale in #1653
fix: normalize nested args in DeepSeek DSML by @AlpinDale in #1654
perf: exl3 decode kernel optimization experiments by @AlpinDale in #1655
perf: exl3 optims with guarded MoE down tuning by @AlpinDale in #1656
fix: auto-disable expandable_segments around cumem memory pool by @AlpinDale in #1657
fix: rejection sampling acceptance rate in MRv2 by @AlpinDale in #1658
fix: cap SWA/chunked-local runtime admission to startup pool-sizing bound by @AlpinDale in #1659
feat: FP8 ViT Attention w/ FlashInfer by @AlpinDale in #1660
chore: share dequant buffers in TurboQuant to save memory by @AlpinDale in #1661
fix: remove invalid deepstack boundary check for Qwen3-VL by @AlpinDale in #1664
feat: add silu clamp limit to shared expert for DeepSeek-V4 by @AlpinDale in #1665
chore: sync to upstream 985961345a13f3e3bb15d29c94b011ba9a6b858b by @AlpinDale in #1666

Full Changelog: v0.20.0...v0.21.0

Contributors

AlpinDale


26 Apr 12:27

AlpinDale

v0.20.0

c0178f1

v0.20.0

What's Changed

[engine] add API for concurrency rate and kv cache token limit by @AlpinDale in #1608
[diffusion] aphrodite diffusion backend by @AlpinDale in #1607
[cli][diffusion] only import diffusion backend when it is called by @AlpinDale in #1610
[logger][metrics] log number of cache hits in the request-level logger by @AlpinDale in #1611
[cli] add CLI arg for selecting attention backend by @AlpinDale in #1612
fix: tokenizer server init by @AlpinDale in #1617
[models] add support for GLM-4.7 Flash by @AlpinDale in #1620
fix: mark GLM-4 MoE Lite as an MLA model by @AlpinDale in #1621
fix: compute engine max_concurrency from worker KV cache configs by @lucyknada in #1622
feat: add support for the Qwen3.5 family of models by @AlpinDale in #1624
feat: update aphrodite to 0.20.0 by @AlpinDale in #1628
feat: add tensor parallel support for exllamav3 by @AlpinDale in #1629
chore: remove unused csrc code by @AlpinDale in #1630
chore: bump cuda to 13.0 by @AlpinDale in #1631
chore: sync to upstream vllm f768b4473e1bd55023dcaff63984cfdd08902fc8 by @AlpinDale in #1632
chore: massively improve DRY performance by @AlpinDale in #1634
feat: optimize lm_head by fusing more kernels and actually quantizing lm_head by @AlpinDale in #1635

New Contributors

@lucyknada made their first contribution in #1622

Full Changelog: v0.10.0...v0.20.0

Contributors

AlpinDale and lucyknada


08 Nov 13:51

AlpinDale

v0.10.0

5e3afa0

v0.10.0

What's Changed

feat: qwen3-next tool parser by @AlpinDale in #1512
[Build] feat: add support for incremental cmake builds by @AlpinDale in #1515
chore: cleanup aphrodite FA directory by @AlpinDale in #1516
docs: update documentation on adding support for new models by @AlpinDale in #1517
fix: multi-node serving with ray by @AlpinDale in #1518
chore: migrate whisper to TensorSchema by @AlpinDale in #1519
feat: add logging for model parameter count by @AlpinDale in #1525
[Attention] feat: add support for Context Parallelism by @AlpinDale in #1521
[Model] feat: support BailingMoe V2 by @AlpinDale in #1527
[API] chore: separate Kobold API code to its own serving class by @AlpinDale in #1529
Revert "[API] chore: separate Kobold API code to its own serving class" by @AlpinDale in #1530
fix: error propagation in chat completions by @AlpinDale in #1532
[API] chore: separate Kobold API code to its own serving class by @AlpinDale in #1531
[API] chore: remove dead code from the old kobold api module by @AlpinDale in #1533
[API] fix: anthropic messages API by @AlpinDale in #1534
[build] fix: relax xformers dependency version by @AlpinDale in #1536
[PP] fix: Qwen3-Next with Pipeline Parallelism by @AlpinDale in #1537
Update readme by @AlpinDale in #1538
[Kernel] chore: add tuned kernel configs for BailingMoEV2 by @AlpinDale in #1542
[config] fix: set the correct max_model_len with YaRN scaling by @AlpinDale in #1543
[API] feat: add lightweight tokenizer-only API server by @AlpinDale in #1545
release: v0.10 by @AlpinDale in #1549
[build] bump flashinfer to 0.5.0 by @AlpinDale in #1551
[API] feat: add model management endpoints for loading and unloading models by @AlpinDale in #1553
[core] feat: enable dynamic KV cache allocation by @AlpinDale in #1552
fix: quantization import for kimi-linear KDA by @AlpinDale in #1555
[API] feat: add multi-model support by @AlpinDale in #1554
fix: Kimi-Linear with AWQ quants by @AlpinDale in #1556
ci: make gemini PR reviews less verbose by @AlpinDale in #1557
fix: avoid GPU-CPU sync in MTP by @AlpinDale in #1558
[kernel] fix: use the same H200 config for both H200 and H200 NVL by @AlpinDale in #1559
[API] fix: task log when multi-model is not enabled by @AlpinDale in #1560
fix: ensure model_registry is not empty before accessing models in OpenAIServing by @AlpinDale in #1561
[ci] chore: update pre-commit scripts by @AlpinDale in #1562
[ci] chore: make all pre-commit checks pass by @AlpinDale in #1563
[core]: update cu_num_accepted_tokens for all req_index by @AlpinDale in #1564
[API] feat: enable DP-aware routing in OAI requests by @AlpinDale in #1565
[logger] fix: don't record sleep mode logs when not in dev mode by @AlpinDale in #1566
[distributed] remove APHRODITE_DEEPEP_LOW_LATENCY_ALLOW_NVLINK env var by @AlpinDale in #1567
[core] invoke save_new_computed_blocks when computed blocks are not empty by @AlpinDale in #1568
[ci] feat: remove ruff workflow and add a pre-commit one by @AlpinDale in #1569
[compilation] allow torch.compile with batch invariant inference by @AlpinDale in #1570
[cpu] fix APHRODITE_CPU_OMP_THREADS_BIND="auto" for PowerPC CPU by @AlpinDale in #1571
[fix] avoid too small block m/n for flex attention by @AlpinDale in #1572
[kernel] perf: significantly enhance KDA/Kimi Linear throughput by decoupling torch op from GDA to use torch.compile by @AlpinDale in #1573
[lora] allow int64 values for LoRA ID to avoid overflow by @AlpinDale in #1574
[mm] fix broken MRoPE for GLM-4.1/4.5V by @AlpinDale in #1575
[nixl] fix: missing metadata for handshake in multi-node by @AlpinDale in #1576
[mm] fix: missing cached item in beam search by @AlpinDale in #1577
[moe] fuse nvfp4 quant with flashinfer_cutlass_moe in TP by @AlpinDale in #1578
[spec] remove unused args in spec decoding by @AlpinDale in #1579
[chore] pydantic validation for scheduler and structured outputs configs by @AlpinDale in #1580
[hybrid] simpler algorithm to find kernel_block_size by @AlpinDale in #1581
[core] feat: support async scheduling with structured outputs by @AlpinDale in #1582
[lora] enable bias support for fused moe lora by @AlpinDale in #1583
[mm] vision attention backend for XPU by @AlpinDale in #1584
[lora] add SplitK in fused MoE lora by @AlpinDale in #1585
[ci] bump transformers to 4.57.1 by @AlpinDale in #1586
[mm] mrope for keye by @AlpinDale in #1587
[cli] add CLI args for kv cache offloading by @AlpinDale in #1588
[kvoffload] feat: make LMCache connector work by @AlpinDale in #1589
[v0] remove APHRODITE_USE_V1 from platform and v1 by @AlpinDale in #1590
[spec] fix DeepSeek v3.2 MTP metadata and cuda graph by @AlpinDale in #1591
[python3.10] import Self from typing_extensions by @AlpinDale in #1592
[TPU] prevent single-process DP by @AlpinDale in #1593
[sampler] fix mixed penalties in batch with async scheduling by @AlpinDale in #1594
[api] bring back anthropic /v1/messages endpoint in OpenAI server by @AlpinDale in #1595
[offloader] fix async scheduling support with KV cache offloader by @AlpinDale in #1596
[sync] sync to upstream 03c4c4a by @AlpinDale in #1597
[build] downgrade flashinfer to 0.4.1 by @AlpinDale in #1598
[lora][moe] fix MoE models by registering the correct op by @AlpinDale in #1599
[quant] fix GLM-4.5V AWQ by @AlpinDale in #1600
[build] upgrade flashinfer to 0.5.1 by @AlpinDale in #1601
Require structured output parameters to be explicitly None or valid by @50h100a in #1604
[multi node] better cluster example script by @AlpinDale in #160...

Contributors

AlpinDale and 50h100a


10 Sep 18:23

github-actions

v0.9.1

d7b7417

v0.9.1

What's Changed

feat: implement Motif model by @AlpinDale in #1439
[V1] feat: support fp8 kv on ampere through flashinfer by @AlpinDale in #1440
fix: enable Kobold API by default by @AlpinDale in #1441
feat: add DeepConf sampling by @AlpinDale in #1442
build: add option to disable flash attn compilation by @AlpinDale in #1443
[PP] feat: optimize PP performance through throttling by @AlpinDale in #1447
chore: sync to upstream 4071c76 by @AlpinDale in #1448
feat: support encoder data parallel for MiniCPM-V by @AlpinDale in #1449
feat: enable LoRA support for DeepSeek models by @AlpinDale in #1450
fix: add reorder_batch to AttentionMetadataBuilder by @AlpinDale in #1451
chore: move freezing_value/cuda_event init outside the try/finally block by @AlpinDale in #1452
chore: better type hint for rearrange return value in eplbstate by @AlpinDale in #1453
chore: faster LoRA startup by @AlpinDale in #1454
fix: truncate_prompt_tokens type hint by @AlpinDale in #1455
feat: allow passing multi_modal_uuids as multimodal identifiers by @AlpinDale in #1456
fix: vocab_size check by @AlpinDale in #1457
feat: support KV events from connectors by @AlpinDale in #1458
fix: AutoGPTQ Qwen3-MoE models by @AlpinDale in #1459
fix: avoid redundant copy for enc-only models by @AlpinDale in #1460
fix: loading GPTQ 3-bit models by @AlpinDale in #1461
chore: move fast prefill logic to a separate method by @AlpinDale in #1462
fix: update constructors and type hints for multimodal input handling by @AlpinDale in #1463
chore: update import for torch inductor configuration in CPU model runner by @AlpinDale in #1464
chore: migrate Phi4 inputs to TensorSchema by @AlpinDale in #1465
feat: IO processor plugin for pooling models by @AlpinDale in #1466
fix: add support for <tool_call> format in streaming mode for xlam by @AlpinDale in #1467
fix: allow FP16 inference on Turing and below by @AlpinDale in #1468
build: upgrade DeepGEMM by @AlpinDale in #1469
feat: Gemma3n audio endpoint support by @AlpinDale in #1470
feat: support KeyeVL-1.5 model by @AlpinDale in #1471
chore: minor code simplification for spec decode by @AlpinDale in #1472
fix: IO processor plugin fixes by @AlpinDale in #1473
feat: support DP for Kimi-VL ViT by @AlpinDale in #1474
chore: move logprob to a separate module by @AlpinDale in #1475
[kernel] feat: support FP32 for mamba by @AlpinDale in #1476
fix: blip2 inference by @AlpinDale in #1477
chore: remove runtime checks for pooling params by @AlpinDale in #1478
chore: migrate Ovis to TensorSchema by @AlpinDale in #1479
feat: add online FP8 support for XPU by @AlpinDale in #1480
chore: migrate interns to TensorSchema by @AlpinDale in #1481
feat: support DP for GLM-4.5V ViT by @AlpinDale in #1482
fix: gemma3n batched audio by @AlpinDale in #1483
fix: EXAONE4 model RoPE by @AlpinDale in #1484
chore: add logit_bias/sigmoid_normalize support for classification models by @AlpinDale in #1485
fix: transform_config parsing in compressed tensors by @AlpinDale in #1486
fix: pack_factor -> packed_factor in linear layers by @AlpinDale in #1487
chore: deprecate TPOT in favor of ITL by @AlpinDale in #1488
fix: weight loading for Apertus model by @AlpinDale in #1489
fix: only print profiler results on rank 0 by @AlpinDale in #1490
fix: cast offsets tensor bn to tl.int64 to avoid GPU segfault by @AlpinDale in #1491
fix: DeepSeek-R1 accuracy by setting routed_scaling_factor=1.0 by @AlpinDale in #1492
fix: LoRA logits on XPU by @AlpinDale in #1493
feat: multi-threaded model weights loader by @AlpinDale in #1494
feat: fully async model execution by @AlpinDale in #1495
feat: enable pytorch symmetric memory for A100 GPUs by @AlpinDale in #1496
[kernel] feat: vectorized kernels by @AlpinDale in #1497
chore: upgrade xgrammar to 0.1.23 by @AlpinDale in #1499
feat: enable request-level logits processor in batch-level logits processing by @AlpinDale in #1500
fix: compile warning for w4a8_mm_entry.cu by @AlpinDale in #1501
fix: add check for dual chunk attn by @AlpinDale in #1502
fix: double mul for dots1 and GLM4 MoE by @AlpinDale in #1503
chore: remove NCCL cumem env var override by @AlpinDale in #1504
feat: enable heterogeneous TP for Nixl KV connector w/ flashinfer by @AlpinDale in #1505
chore: support add_generation_prompt in embeddings endpoint by @AlpinDale in #1506
refactor: simplify weight handling in MiniMaxText01RMSNormTP class by removing dead code by @AlpinDale in #1507
fix: division by zero in triton_attn by @AlpinDale in #1508
feat: enable full CUDA graph support for PLaMo2 on V1 by @AlpinDale in #1509
feat: add support for Qwen3-Next model and add Flash Linear Kernels by @AlpinDale in #1510
Fix for logit bias crash (probably) by @50h100a in #1511

Full Changelog: v0.9.0...v0.9.1

Contributors

AlpinDale and 50h100a


24 Aug 15:33

github-actions

v0.9.0

c8b1ea6

v0.9.0

It has been a long time. There have been many, many changes between this release and v0.6.7. I'll try to summarize the most important ones, but I'll likely miss quite a lot.

New Models

Transformers backend

You can now load otherwise unsupported models using the integrated Transformers backend. By default, when no native implementation exists for a model, Aphrodite will attempt to load it using Transformers.
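
The fallback rule described above can be pictured as a simple lookup. This is a minimal sketch only, not Aphrodite's actual loader: `NATIVE_ARCHITECTURES` and `resolve_backend` are hypothetical names, and the architecture strings are placeholders chosen for illustration.

```python
# Hypothetical set of architectures that have a native implementation.
# The real registry is not described in these notes.
NATIVE_ARCHITECTURES = frozenset({"LlamaForCausalLM", "Qwen2ForCausalLM"})


def resolve_backend(architecture: str, native=NATIVE_ARCHITECTURES) -> str:
    """Return "native" when a native implementation exists, else "transformers"."""
    return "native" if architecture in native else "transformers"


if __name__ == "__main__":
    for arch in ("LlamaForCausalLM", "SomeNewModelForCausalLM"):
        print(arch, "->", resolve_backend(arch))
```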

Quantization Methods

A few quantization methods have been added:

NVFP4 - the new datatype supported by Blackwell GPUs. Will also work on Ampere and Hopper using Marlin kernels.
MXFP4 - popularized by gpt-oss. Natively supported by Blackwell; Hopper and Ampere will use Marlin.
GPTQAllSpark - optimization for GPTQ models, supported when the model has group_size=-1 and desc_act=False. Seems to provide better perf than Marlin. Enable with -q gptq_allspark.
BitBLAS - support for BitNet-quantized 1.58bit models. Will also support GPTQ.
TorchAO - support for models quantized using TorchAO.
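
The GPTQAllSpark condition in the list above can be checked against a model's GPTQ config before passing -q gptq_allspark. The sketch below is an illustration, not Aphrodite code: `allspark_eligible` is a hypothetical helper, and it assumes the config is a plain dict using the keys `group_size` and `desc_act`, which is how GPTQ quantization configs typically spell the activation-order flag.

```python
def allspark_eligible(quant_config: dict) -> bool:
    """True when the config matches the condition stated in the notes:
    group_size == -1 and activation reordering (desc_act) disabled."""
    return (
        quant_config.get("group_size") == -1
        and quant_config.get("desc_act") is False
    )


if __name__ == "__main__":
    # Example configs are made up for illustration.
    print(allspark_eligible({"bits": 4, "group_size": -1, "desc_act": False}))
    print(allspark_eligible({"bits": 4, "group_size": 128, "desc_act": False}))
```

A missing key is treated as not eligible, so the check errs on the side of keeping the default kernel.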

New Features

There are a lot of new features! Here are some of the more important ones:

Sophisticated support for torch.compile
DeepGEMM for DeepSeek-V3
Expert Load Balancing and Expert Parallel
FlexAttention
TreeAttention
Differential Flash Attention
Dual Chunk Flash Attention
Flash Attention V3 (supported only for A100/H100 GPUs)
Disaggregated Prefill – run separate instances of Aphrodite for prefill and decode. Boosts throughput by eliminating compute starvation caused by expensive prefill requests at the cost of requiring more GPUs.
Async Scheduling via NanoFlow – provides a ~13% throughput and latency boost (--async-scheduling)
Async Tensor Parallel – provides an ~8% throughput and latency boost on Hopper GPUs, ~3% on Ampere (-O '{"level":3, "compile_sizes": [512], "pass_config": {"enable_async_tp": true}}')
Mirostat Sampling – it returns!
String/Phrase Banning – also known as "anti-slop sampler", this one allows you to specify a list of phrases to ban from generation. Pass banned_strings as a parameter in the API request body, and provide it with a list of strings.
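
Two items in the list above take structured input: the -O value for Async Tensor Parallel and the banned_strings request parameter. The sketch below shows one way to build both in Python. The helper names are hypothetical, and the `prompt` field and example phrases are assumptions; the notes only specify that banned_strings is a list of strings in the request body, and they do not name an endpoint.

```python
import json


def completion_body(prompt: str, banned: list[str]) -> dict:
    """Build a request body carrying banned_strings as a list of strings.
    The "prompt" field is an assumption; only banned_strings is from the notes."""
    return {"prompt": prompt, "banned_strings": list(banned)}


def async_tp_flag() -> str:
    """Serialize the -O value quoted in the notes for Async Tensor Parallel."""
    config = {
        "level": 3,
        "compile_sizes": [512],
        "pass_config": {"enable_async_tp": True},
    }
    return json.dumps(config)


if __name__ == "__main__":
    body = completion_body("Once upon a time", ["shivers down", "barely above a whisper"])
    print(json.dumps(body, indent=2))
    print("-O", async_tp_flag())
```

Building the -O value with `json.dumps` avoids hand-quoting mistakes such as writing Python's `True` where JSON expects `true`.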

As always, many thanks to the vLLM team!

What's Changed

ci: bump aphrodite-engine version to 0.6.7 by @AlpinDale in #1253
docs: update documentation by @AlpinDale in #1254
chore: update docs site packages by @ahme-dev in #1255
fix: multi-step scheduling for TPU model runner by @AlpinDale in #1256
fix: tool call finish reason in streaming case by @AlpinDale in #1257
spec decode: use flash_attn_varlen_func for MQA scorer by @AlpinDale in #1260
fix: xformer attn backend prefill for encoder-decoder models by @AlpinDale in #1261
fix: placeholder attn max_decode_seq_len by @AlpinDale in #1262
fix: check for head_size in spec decode models by @AlpinDale in #1263
XPU: enable async output processing by @AlpinDale in #1264
feat: migrate docs to starlight by @ahme-dev in #1259
docs: fix broken links by @AlpinDale in #1266
TPU: fix SMEM OOM error by @AlpinDale in #1267
VLM: add multi-image support for Mllama (llama 3.2) by @AlpinDale in #1268
VLM: add image embeds support for InternVL by @AlpinDale in #1269
fix: chat API continuous usage stats by @AlpinDale in #1270
fix: UsageInfo and logprobs=None assertion w/ empty token_ids by @AlpinDale in #1271
fix: text-only input bug for Molmo by @AlpinDale in #1272
chore: fix mrope handling by @AlpinDale in #1273
model: add support for VLM2Vec - multimodal embedding model by @AlpinDale in #1274
fix: use proper vocab size for logit_bias by @nyxkrage in https://gi...

Contributors

dsk7, Nero10578, and 4 other contributors


07 Mar 11:36

github-actions

v0.6.7

c63ad41

v0.6.7

What's Changed

Core: add output streaming support to multi-step + async by @AlpinDale in #1112
tests: update scheduler tests by @AlpinDale in #1113
(1/N) XQA: integrate the XQA CUDA kernels within Aphrodite by @AlpinDale in #1115
chore: support loading weights by ID within models by @AlpinDale in #1116
chore: expose phi3_v num_crops as an mm_processor_kwargs by @AlpinDale in #1117
fix: unsafe all-reduce sync by @AlpinDale in #1118
kernels: split marlin kernels for faster compile, fix MoE, temporarily remove HQQ by @AlpinDale in #1119
LLM: enable batched inference for llm.chat() API by @AlpinDale in #1120
Quantization: re-enable Marlin serialization for AWQ quants by @AlpinDale in #1121
fix: torch.compile dynamo fix by @AlpinDale in #1122
chore: bump bitsandbytes version to latest; enable cuda graphs for 4bit bnb by @AlpinDale in #1123
(1/N) Triton Backend: integrate Triton layernorm kernels by @AlpinDale in #1125
(2/N) Triton Backend: integrate Triton activation kernels by @AlpinDale in #1126
chore: remove trailing whitespaces by @AlpinDale in #1128
chore: support prompt_logprobs with speculative decoding by @AlpinDale in #1129
feat: add Priority-based Scheduling by @AlpinDale in #1130
API: use heartbeats instead of health checks by @AlpinDale in #1131
kernel: fix custom all-reduce kernel compilation on Pascal GPUs by @AlpinDale in #1132
fix: propagate trust_remote_code in InternVL and MiniCPM-V by @AlpinDale in #1133
fix: load fully-connected layer bias for EAGLE models by @AlpinDale in #1134
API: propagate usage accounting to FastAPI middleware layer by @AlpinDale in #1135
fix: ray 2.9.x does not expose available_resources_per_node by @AlpinDale in #1136
fix: multi-step scheduling with InternVL by @AlpinDale in #1137
chore: support FP8 MoE for compressed-tensors by @AlpinDale in #1138
model: add support for Mllama (Llama 3.2) models by @AlpinDale in #1139
fix: quantization for Mllama models by @AlpinDale in #1140
fix: include encoder prompt len in non-stream api usage response by @AlpinDale in #1141
fix: downgrade logger.warning for BOS fallback to print_warning_once by @AlpinDale in #1142
fix: only set tool_choice to auto if at least one tool is provided by @AlpinDale in #1143
API: add tool calling support for Llama 3.1 and 3.2 by @AlpinDale in #1144
fix: batched inference with fuyu by @AlpinDale in #1145
TPU: support Trillium by @AlpinDale in #1146
torch.compile: use empty tensor instead of None for profiling by @AlpinDale in #1147
kernel: Integrate asymmetric quantization for INT8 activations by @AlpinDale in #1148
core: add support for chunked prefill + multi-step scheduling by @AlpinDale in #1149
distributed: add env var to force custom all-reduce by skipping p2p check by @AlpinDale in #1150
chore: add priority scheduling to async engine by @AlpinDale in #1151
fix: XPU docker build by @AlpinDale in #1153
distributed: force full nvlink when APHRODITE_FORCE_P2P env var is set by @AlpinDale in #1154
fix: multi-step scheduling with Pipeline Parallel by @AlpinDale in #1155
chore: improve implicit choice of spawn/fork for multiprocessing method by @AlpinDale in #1156
fix: block manager v2 with preemption and lookahead slots by @AlpinDale in #1157
fix: marlin MoE act order when is_k_full==False by @AlpinDale in #1158
build: set FETCHCONTENT_BASE_DIR to one location for better caching by @AlpinDale in #1159
model: add support for Qwen2.5-Math-RM-72B reward model by @AlpinDale in #1160
lora: add LoRA support for MiniCPMV-2.5 by @AlpinDale in #1161
fix: seeded gens with encoder-decoder models by @AlpinDale in #1162
api: add support for prefill to chat completions endpoint by @AlpinDale in #1163
kernel: varlen prefill + prefill chunking support for mamba kernels by @AlpinDale in #1164
model: support input embeddings for qwen2-vl by @AlpinDale in #1165
models: add LoRA support for MiniCPM-V 2.6 by @AlpinDale in #1166
vlm: expose internvl2 max_dynamic_patch as a mm_processor_kwarg by @AlpinDale in #1167
api: expose priority scheduling in the API server by @AlpinDale in #1168
feat: add request-level logging by @AlpinDale in #1169
fix: adjust max_position_embeddings for LoRA by @AlpinDale in #1170
core: move guided decoding params into sampling params by @AlpinDale in #1171
build: fix machete generation file ordering by @AlpinDale in #1172
fix: torch.compile tensor alias by @AlpinDale in #1173
chore: add process_weights_after_loading for DummyLoader by @AlpinDale in #1174
fix: tensor-parallel inference with fuyu by @AlpinDale in #1175
fix: token IDs reference for MiniCPM-V when images are provided with no placeholders by @AlpinDale in #1176
(1/N) MQA Scorer: add MQA scorer by @AlpinDale in #1177
feat: support multi-step, chunked prefill, prefix cache, cuda graph combo by @AlpinDale in #1178
fix: guided decoding default values breaking text completions API by @AlpinDale in #1179
OpenVINO: add support for GPU, fix Docker build by @AlpinDale in #1181
models: add support for Granite MoE model (PowerMoE) by @AlpinDale in #1182
fix: mistral parallel tool call template fix by @AlpinDale in #1183
fix: enforce mistral tool call ID constraint by @AlpinDale in #1184
core: make v2 block manager the default by @AlpinDale in #1185
logging: only log the non-default parameters in engine by @AlpinDale in #1180
chore: parse literals out of --override-neuron-config by @AlpinDale in #1186
[torch.compile]: add forward context for attention by @AlpinDale in #1187
[torch.compile]: add forward context for flashinfer by @AlpinDale in #1188
fix: OPT model loading for checkpoints with no tied embeds by @AlpinDale in #1189
api: add tool parser plugin + Inte...

Contributors

wejoncy and AlpinDale


27 Jan 15:53

github-actions

v0.6.6

b20c457

v0.6.6

What's Changed

distributed: support pipeline parallelism for internvl and internlm2 by @AlpinDale in #965
tpu: add support for async postprocessing by @AlpinDale in #968
fix: prometheus.yaml path in monitoring example by @AlpinDale in #969
tpu: support single and multi-host TPUs on GKE and RayServe by @AlpinDale in #970
vlm: add tensor parallel support for vision transformer models by @AlpinDale in #971
tests: update internvl test for #971 by @AlpinDale in #972
vlm: increase the default max_num_batched_tokens for multimodal models by @AlpinDale in #973
core: fix chunked prefill not being enabled by default for long contexts by @AlpinDale in #974
tpu: fix TPU type api by @AlpinDale in #975
fix: modelscope for VLMs by @AlpinDale in #976
fix: crash when cancelling a request with multi-step by @AlpinDale in #977
models: add support for IBM Granite (PowerLM) models by @AlpinDale in #978
tpu: align worker index with node boundary by @AlpinDale in #979
fix: InternLM2 model with Tensor Parallel by @AlpinDale in #980
core: slightly improve chunked prefill performance by @AlpinDale in #981
vlm: fallback to SDPA for ViT models on CPU backend by @AlpinDale in #982
core: improve async postproc + multi-step performance by @AlpinDale in #983
fix: raise exception when accessing logger for disable_log_stats=True case by @AlpinDale in #984
chore: rename task_handler to worker by @AlpinDale in #985
tpu: fix outputs by correcting the next_token_ids shape by @AlpinDale in #986
quants: add GPTQ and FBGEMM to AphroditeParameters by @AlpinDale in #987
benchmarks: add --async-engine arg to throughput benchmark by @AlpinDale in #988
tpu: use XLA rank for persistent cache path by @AlpinDale in #989
vlm: support multiple audios per prompt for Ultravox by @AlpinDale in #990
vlm: fix siglip layernorm and paligemma weight loading by @AlpinDale in #991
vlm: enable multimodal inputs for the LLM class by @AlpinDale in #992
api: implement OpenAI-compatible tools API for Hermes/Mistral models by @AlpinDale in #993
neuron: add 8bit quantization for Neuron by @AlpinDale in #994
models: add support for QwenVL by @AlpinDale in #995
fix: gptq_marlin exception on older GPUs by @AlpinDale in #996
chore: use ray[adag] dep instead of cuda by @AlpinDale in #997
quants: improve awq_triton throughput by @AlpinDale in #998
fix: hermes tool call chat template by @AlpinDale in #999
core: fix async postprocessor in case of preemption by @AlpinDale in #1000
vlm: add multi-input support for LLaVA and InternVL models by @AlpinDale in #1002
tools: fix tool calls to more strictly follow OpenAI format by @AlpinDale in #1003
fix: LoRA support for Cohere and Jamba models by @AlpinDale in #1004
spec decode: move ops.advance_step to flash attention backend by @AlpinDale in #1005
chore: remove peft as a requirement by @AlpinDale in #1006
chore: keep chunked prefill enabled with prefix caching by @AlpinDale in #1007
fix: ensure multistep lookahead allocation is compatible with cugraph max capture by @AlpinDale in #1008
fix: pass APHRODITE_ATTENTION_BACKEND to ray workers by @AlpinDale in #1009
build: shallow clone cutlass 3.5.1 tag by @AlpinDale in #1010
chore: skip loading extra bias for qwen2 moe GPTQ by @AlpinDale in #1011
fix: internvl pipeline parallel by @AlpinDale in #1012
quants: add support for NVIDIA's ModelOpt checkpoints by @AlpinDale in #1013
vlm: add support for video modality + llava next video model by @AlpinDale in #1014
vlm: add support for Qwen2-VL model by @AlpinDale in #1015
cpu: fix issue with sampling kernels by @AlpinDale in #1016
cpu: add support for W8A8 quantization via compressed-tensor by @AlpinDale in #1017
kernel: add meta functions for ops to prevent graph breaks by @AlpinDale in #1019
chore: move device keys to a constant by @AlpinDale in #1020
tests: refactor speculative decoding tests to remove the async engine by @AlpinDale in #1021
vlm: add support for Pixtral model by @AlpinDale in #1022
core: dump model runner inputs during crash by @AlpinDale in #1023
chore: remove engine_use_ray by @AlpinDale in #1024
api: fix logic for deciding if tool parser is used by @AlpinDale in #1025
quants: add bitsandbytes support for gemma2 model by @AlpinDale in #1026
cpu: raise error if using encoder-decoder models by @AlpinDale in #1027
chore: use RoPE cache for MRoPE method by @AlpinDale in #1028
torch.compile: hide slicing under custom op for inductor by @AlpinDale in #1029
vlm: fix internvl2 inference with various num_patches by @AlpinDale in #1030
vlm: support multiple images for qwen-vl by @AlpinDale in #1031
fix: lazy init _copy_stream by @AlpinDale in #1032
multi-step: add support for flashinfer attention backend by @AlpinDale in #1033
api: add sampling/engine option to return only deltas or final output by @AlpinDale in #1035
fix: multi-step + flashinfer with cuda graphs by @AlpinDale in #1036
fix: disable chunked prefill and prefix caching for multimodal models by @AlpinDale in #1037
fix: grouped_topk return type by @AlpinDale in #1038
core: factor out input preprocessing into a separate class by @AlpinDale in #1039
fix: skip loading extra bias for Qwen2-VL GPTQ by @AlpinDale in #1040
torch.compile: allow adding custom compile backends via plugins by @AlpinDale in #1041
xpu: bump IPEX to 2.3, support GQA by @AlpinDale in #1042
rocm: add custom paged attention kernels for ROCm by @AlpinDale in #1043
model: add support for MiniCPM-3 by @AlpinDale in #1044
torch.compile: fix functionalization by @AlpinDale in #1045
tpu: implement...

Contributors

AlpinDale


22 Dec 06:43

github-actions

v0.6.5

cbd51a2

v0.6.5

What's Changed

xpu: refactor XPU worker & executor by @AlpinDale in #861
build: add jinja2 to requirements file by @AlpinDale in #862
attention: add AttentionState abstraction by @AlpinDale in #863
xpu: disable punica kernels for XPU by @AlpinDale in #864
executor: pipe worker_class_fn arg in executor by @AlpinDale in #865
server: log the process occupying our port by @AlpinDale in #866
feat: AWQ quantization for InternVL by @AlpinDale in #867
Rewrite DRY sampler to be a lot faster by @50h100a in #868
fix: ROCm build by @Naomiusearch in #817
fix: temp_last warning being repeated for every output token by @AlpinDale in #869
feat: add support for chunked prefill + prefix caching by @AlpinDale in #871
async: avoid premature exit in the async generator by @AlpinDale in #872
cpu: fix mm_limits initialization by @AlpinDale in #873
spec decoding: set the draft model ctxlen to target model by @AlpinDale in #874
sampler: pad dry sequence breakers tensor by @AlpinDale in #875
fix: add_generation_template -> add_generation_prompt in llm by @AlpinDale in #877
Update README.md by @NoahBPeterson in #876
api: fix crashes under very high loads by @AlpinDale in #878
build: pass PYTHONPATH from setup.py to cmake by @AlpinDale in #879
async: disable multi-step scheduling for sync engine by @AlpinDale in #880
api: better startup failure UX by @AlpinDale in #881
chore: consolidate environment variables within one file by @AlpinDale in #882
core: fix spec decode metrics and envs circular import by @AlpinDale in #889
feat: add support for audio models by @AlpinDale in #891
distributed: fix issue for when nodes have multiple network interfaces by @AlpinDale in #892
rocm: fix compile issues with rocm 6.2 by @AlpinDale in #893
build: fix invalid path for envs.py in setup by @AlpinDale in #894
kernel: use cub::BlockReduce instead of custom impl by @AlpinDale in #895
fix: Phi 3.5 Vision model loading by @AlpinDale in #896
api: add client timeouts for the ZeroMQ server by @AlpinDale in #897
feat: add torch.compile for GemmaRMSNorm by @AlpinDale in #898
spec decode: add support for EAGLE by @AlpinDale in #899
fix: ShardedStateLoader with fp8 quant by @AlpinDale in #900
kernel: do not compile machete for cuda 11 and below by @AlpinDale in #901
chore: add AphroditeParameter support for FP8 quant by @AlpinDale in #902
spec decode: fix logprobs when using speculative decoding by @AlpinDale in #904
api: error suppression cleanup + timeout suppression on aborts by @AlpinDale in #905
ray: better error when placement group topology is incorrect by @AlpinDale in #906
xpu: refactor the model runner for tensor parallelism by @AlpinDale in #910
fix: empty prompt crashing the server by @AlpinDale in #912
quantization: update marlin to use AphroditeParameters by @AlpinDale in #913
core: add multi-step scheduling support for the synchronous engine by @AlpinDale in #914
api: add json_schema to OpenAI server by @AlpinDale in #915
fix: phi3v crash with unusual image sizes by @AlpinDale in #916
feat: multi-image input support for Phi3V by @AlpinDale in #917
spec decode: streamline batch expansion tensor manipulation by @AlpinDale in #918
api: use fp32 for base64 embeddings by @AlpinDale in #919
core: improve warmup times for prefix caching in block manager v2 by @AlpinDale in #920
quants: update qqq and gptq_marlin_24 to use AphroditeParameters by @AlpinDale in #921
distributed: fix custom allreduce p2p cache file generation by @AlpinDale in #922
neuron: add support for tensor parallelism by @AlpinDale in #923
quants: update compressed tensors lifecycle to remove prefix from create_weights by @AlpinDale in #924
feat: add async postprocessor by @AlpinDale in #925
api: add endpoint for loading and unloading the model by @AlpinDale in #926
feat: add single user mode by @AlpinDale in #927
api: add inline model loading by @AlpinDale in #928
api: support aphrodite_config.yaml with inline loading by @AlpinDale in #929
fix: inline model loading conflicts with lora by @AlpinDale in #930
core: do not compile for profiling by @AlpinDale in #931
xpu: support pipeline parallel by @AlpinDale in #932
fix: phi3v image_idx in async server by @AlpinDale in #933
feat: add fused Marlin MoE kernel by @AlpinDale in #934
chore: multi-image support for llava-next by @AlpinDale in #935
model: add support for paligemma2 by @AlpinDale in #936
vlm: stack multimodal tensors to represent multiple images within each prompt by @AlpinDale in #937
core: do not compile ScalarType for torch < 2.4.0 by @AlpinDale in #938
core: add virtual engine for async outproc by @AlpinDale in #939
api: log prompt truncation by @AlpinDale in #940
vlm: fix incompatibility between nested tensors and multi-image llava-next by @AlpinDale in #941
vlm: fix persimmon and fuyu issues with transformers 4.45 by @AlpinDale in #942
Fix SentencePieceTokenizer error when generating on Mistral Large 2411 with --tokenizer-mode mistral by @khanonnie in #943
core: use flashinfer for FP8 KV when available by @AlpinDale in #944
tests: update flashinfer test for #944 by @AlpinDale in #945
quants: add triton kernels for AWQ by @AlpinDale in #946
tests: add kernel tests for causal_conv1d and mamba_ssm by @AlpinDale in #947
fix: do not register punica with torch if using older torch by @AlpinDale in #948
tpu: avoid dynamo guard eval overhead by @AlpinDale in #949
fix: issues with flashinfer fp8 kv by @AlpinDale in #950
api: optimize zeromq frontend performance by @AlpinDale in #951
tpu: remove torch._dynamo.reset() by @AlpinDale in #952
vlm: fix errors on ragged NestedTensors by @AlpinDal...

Contributors

NoahBPeterson, AlpinDale, and 3 other contributors

03 Dec 01:51

github-actions

v0.6.4.post1

8b8d2ce

What's Changed

add linux arm64/aarch64/GH200 installation tips by @qpwo in #851
DRY Fix: Add output_tokens to sampler by @selalipop in #849
sampler: fix DRY concurrency issue by @AlpinDale in #852
sampler: add range parameter for DRY by @AlpinDale in #855
sampler: optimize DRY performance using z-algorithm by @AlpinDale in #856
sampler: allow parsing sampler order using strings by @AlpinDale in #858

New Contributors

@qpwo made their first contribution in #851

Full Changelog: v0.6.4...v0.6.4.post1

Contributors

qpwo, selalipop, and AlpinDale

27 Nov 07:31

github-actions

v0.6.4

d2971a6

What's Changed

frontend: enable kobold api by default by @AlpinDale in #803
feat: add serviceinfo endpoint by @AlpinDale in #807
feat: update to serviceinfo v0.2 by @AlpinDale in #808
Mask dynatemp using min/max, rather than exp by @50h100a in #813
fix: temperature issues by @50h100a in #814
fix: --max-seq-len-to-capture arg by @AlpinDale in #818
[IMPORTANT] updating test units by @AlpinDale in #769
fix: tokenization api test by @AlpinDale in #821
feat: add chat method for LLM class by @AlpinDale in #822
feat: support chunked prefill with LoRA by @AlpinDale in #823
SPMD optimizations by @AlpinDale in #824
fix: sampler test with new transformers version by @AlpinDale in #826
feat: add cuda sampling kernels for top_k and top_p by @AlpinDale in #828
feat: add metrics for prefix cache hit rate by @AlpinDale in #829
fix: unbound tokenizer error by @AlpinDale in #830
feat: multi-step scheduling by @AlpinDale in #831
feat: Add DRY (Do not Repeat Yourself) sampling by @selalipop in #827
feat: add no_repeat_ngram sampler by @AlpinDale in #832
feat: add skew sampling by @AlpinDale in #834
fix: hidden states handling in batch expansion for spec decoding by @AlpinDale in #839
chore: refactor executor classes for easier inheritance by @AlpinDale in #840
fix: latency and serving benchmarks by @AlpinDale in #841
feat: Machete Kernels for Hopper GPUs by @AlpinDale in #842
feat: add sampler_priority by @AlpinDale in #837
fix: disable awq_marlin override for awq models by @AlpinDale in #843
chore: bump mistral_common to 1.5.0 by @AlpinDale in #844
ci: bump version to 0.6.4 by @AlpinDale in #845

New Contributors

@dependabot made their first contribution in #796
@selalipop made their first contribution in #827

Full Changelog: v0.6.3...v0.6.4

Contributors

selalipop, dependabot, and 2 other contributors
