Releases: dphnAI/aphrodite-engine
v0.21.0
What's Changed
- build: support python 3.14 by @AlpinDale in #1636
- fix: GLM-5.1 on ROCm by @AlpinDale in #1637
- fix: replica selection bias in fusedmoe router by @AlpinDale in #1638
- fix: respect TORCH_COMPILE_DISABLE env var for torch 2.12 by @AlpinDale in #1639
- chore: remove dead code from worker by @AlpinDale in #1640
- feat: warmup readonly mm processor during renderer startup by @AlpinDale in #1641
- fix: GPU memory leaks in engine shutdown for rocm by @AlpinDale in #1642
- chore: optimize deepstack buffer handling for MM Qwen3 models by @AlpinDale in #1643
- feat: support kv offload storing with multiple KV groups by @AlpinDale in #1644
- feat: add perf benchmark script by @AlpinDale in #1645
- fix: only unpad routed output before shared expert add by @AlpinDale in #1646
- fix: DSML token leakage in DeepSeek-V4 and 3.2 by @AlpinDale in #1647
- fix: size the MNNVL workspace for flashinfer to EP group by @AlpinDale in #1648
- fix: offload all KV blocks when doing prefill in P/D by @AlpinDale in #1649
- fix: disable sequence parallelism for piecewise compilation by @AlpinDale in #1650
- feat: implement DeepSeek-V4 model by @AlpinDale in #1651
- perf: EXL3 performance tuning on GeForce Blackwell by @AlpinDale in #1652
- fix: TRT-LLM MXFP4 MoE compile for DeepSeek-V4 by @AlpinDale in #1653
- fix: normalize nested args in DeepSeek DSML by @AlpinDale in #1654
- perf: exl3 decode kernel optimization experiments by @AlpinDale in #1655
- perf: exl3 optims with guarded MoE down tuning by @AlpinDale in #1656
- fix: auto-disable expandable_segments around cumem memory pool by @AlpinDale in #1657
- fix: rejection sampling acceptance rate in MRv2 by @AlpinDale in #1658
- fix: cap SWA/chunked-local runtime admission to startup pool-sizing bound by @AlpinDale in #1659
- feat: FP8 ViT Attention w/ FlashInfer by @AlpinDale in #1660
- chore: share dequant buffers in TurboQuant to save memory by @AlpinDale in #1661
- fix: remove invalid deepstack boundary check for Qwen3-VL by @AlpinDale in #1664
- feat: add silu clamp limit to shared expert for DeepSeek-V4 by @AlpinDale in #1665
- chore: sync to upstream 985961345a13f3e3bb15d29c94b011ba9a6b858b by @AlpinDale in #1666
Full Changelog: v0.20.0...v0.21.0
v0.20.0
What's Changed
- [engine] add API for concurrency rate and kv cache token limit by @AlpinDale in #1608
- [diffusion] aphrodite diffusion backend by @AlpinDale in #1607
- [cli][diffusion] only import diffusion backend when it is called by @AlpinDale in #1610
- [logger][metrics] log number of cache hits in the request-level logger by @AlpinDale in #1611
- [cli] add CLI arg for selecting attention backend by @AlpinDale in #1612
- fix: tokenizer server init by @AlpinDale in #1617
- [models] add support for GLM-4.7 Flash by @AlpinDale in #1620
- fix: mark GLM-4 MoE Lite as an MLA model by @AlpinDale in #1621
- fix: compute engine max_concurrency from worker KV cache configs by @lucyknada in #1622
- feat: add support for the Qwen3.5 family of models by @AlpinDale in #1624
- feat: update aphrodite to 0.20.0 by @AlpinDale in #1628
- feat: add tensor parallel support for exllamav3 by @AlpinDale in #1629
- chore: remove unused csrc code by @AlpinDale in #1630
- chore: bump cuda to 13.0 by @AlpinDale in #1631
- chore: sync to upstream vllm f768b4473e1bd55023dcaff63984cfdd08902fc8 by @AlpinDale in #1632
- chore: massively improve DRY performance by @AlpinDale in #1634
- feat: optimize lm_head by fusing more kernels and actually quantizing lm_head by @AlpinDale in #1635
New Contributors
- @lucyknada made their first contribution in #1622
Full Changelog: v0.10.0...v0.20.0
v0.10.0
What's Changed
- feat: qwen3-next tool parser by @AlpinDale in #1512
- [Build] feat: add support for incremental cmake builds by @AlpinDale in #1515
- chore: cleanup aphrodite FA directory by @AlpinDale in #1516
- docs: update documentation on adding support for new models by @AlpinDale in #1517
- fix: multi-node serving with ray by @AlpinDale in #1518
- chore: migrate whisper to TensorSchema by @AlpinDale in #1519
- feat: add logging for model parameter count by @AlpinDale in #1525
- [Attention] feat: add support for Context Parallelism by @AlpinDale in #1521
- [Model] feat: support BailingMoe V2 by @AlpinDale in #1527
- [API] chore: separate Kobold API code to its own serving class by @AlpinDale in #1529
- Revert "[API] chore: separate Kobold API code to its own serving class" by @AlpinDale in #1530
- fix: error propagation in chat completions by @AlpinDale in #1532
- [API] chore: separate Kobold API code to its own serving class by @AlpinDale in #1531
- [API] chore: remove dead code from the old kobold api module by @AlpinDale in #1533
- [API] fix: anthropic messages API by @AlpinDale in #1534
- [build] fix: relax xformers dependency version by @AlpinDale in #1536
- [PP] fix: Qwen3-Next with Pipeline Parallelism by @AlpinDale in #1537
- Update readme by @AlpinDale in #1538
- [Kernel] chore: add tuned kernel configs for BailingMoEV2 by @AlpinDale in #1542
- [config] fix: set the correct max_model_len with YaRN scaling by @AlpinDale in #1543
- [API] feat: add lightweight tokenizer-only API server by @AlpinDale in #1545
- release: v0.10 by @AlpinDale in #1549
- [build] bump flashinfer to 0.5.0 by @AlpinDale in #1551
- [API] feat: add model management endpoints for loading and unloading models by @AlpinDale in #1553
- [core] feat: enable dynamic KV cache allocation by @AlpinDale in #1552
- fix: quantization import for kimi-linear KDA by @AlpinDale in #1555
- [API] feat: add multi-model support by @AlpinDale in #1554
- fix: Kimi-Linear with AWQ quants by @AlpinDale in #1556
- ci: make gemini PR reviews less verbose by @AlpinDale in #1557
- fix: avoid GPU-CPU sync in MTP by @AlpinDale in #1558
- [kernel] fix: use the same H200 config for both H200 and H200 NVL by @AlpinDale in #1559
- [API] fix: task log when multi-model is not enabled by @AlpinDale in #1560
- fix: ensure model_registry is not empty before accessing models in OpenAIServing by @AlpinDale in #1561
- [ci] chore: update pre-commit scripts by @AlpinDale in #1562
- [ci] chore: make all pre-commit checks pass by @AlpinDale in #1563
- [core]: update cu_num_accepted_tokens for all req_index by @AlpinDale in #1564
- [API] feat: enable DP-aware routing in OAI requests by @AlpinDale in #1565
- [logger] fix: don't record sleep mode logs when not in dev mode by @AlpinDale in #1566
- [distributed] remove APHRODITE_DEEPEP_LOW_LATENCY_ALLOW_NVLINK env var by @AlpinDale in #1567
- [core] invoke save_new_computed_blocks when computed blocks are not empty by @AlpinDale in #1568
- [ci] feat: remove ruff workflow and add a pre-commit one by @AlpinDale in #1569
- [compilation] allow torch.compile with batch invariant inference by @AlpinDale in #1570
- [cpu] fix APHRODITE_CPU_OMP_THREADS_BIND="autho" for PowerPC CPU by @AlpinDale in #1571
- [fix] avoid too small block m/n for flex attention by @AlpinDale in #1572
- [kernel] perf: significantly enhance KDA/Kimi Linear throughput by decoupling torch op from GDA to use torch.compile by @AlpinDale in #1573
- [lora] allow int64 values for LoRA ID to avoid overflow by @AlpinDale in #1574
- [mm] fix broken MRoPE for GLM-4.1/4.5V by @AlpinDale in #1575
- [nixl] fix: missing metadata for handshake in multi-node by @AlpinDale in #1576
- [mm] fix: missing cached item in beam search by @AlpinDale in #1577
- [moe] fuse nvfp4 quant with flashinfer_cutlass_moe in TP by @AlpinDale in #1578
- [spec] remove unused args in spec decoding by @AlpinDale in #1579
- [chore] pydantic validation for scheduler and structured outputs configs by @AlpinDale in #1580
- [hybrid] simpler algorithm to find kernel_block_size by @AlpinDale in #1581
- [core] feat: support async scheduling with structured outputs by @AlpinDale in #1582
- [lora] enable bias support for fused moe lora by @AlpinDale in #1583
- [mm] vision attention backend for XPU by @AlpinDale in #1584
- [lora] add SplitK in fused MoE lora by @AlpinDale in #1585
- [ci] bump transformers to 4.57.1 by @AlpinDale in #1586
- [mm] mrope for keye by @AlpinDale in #1587
- [cli] add CLI args for kv cache offloading by @AlpinDale in #1588
- [kvoffload] feat: make LMCache connector work by @AlpinDale in #1589
- [v0] remove APHRODITE_USE_V1 from platform and v1 by @AlpinDale in #1590
- [spec] fix DeepSeek v3.2 MTP metadata and cuda graph by @AlpinDale in #1591
- [python3.10] import Self from typing_extensions by @AlpinDale in #1592
- [TPU] prevent single-process DP by @AlpinDale in #1593
- [sampler] fix mixed penalties in batch with async scheduling by @AlpinDale in #1594
- [api] bring back anthropic /v1/messages endpoint in OpenAI server by @AlpinDale in #1595
- [offloader] fix async scheduling support with KV cache offloader by @AlpinDale in #1596
- [sync] sync to upstream 03c4c4a by @AlpinDale in #1597
- [build] downgrade flashinfer to 0.4.1 by @AlpinDale in #1598
- [lora][moe] fix MoE models by registering the correct op by @AlpinDale in #1599
- [quant] fix GLM-4.5V AWQ by @AlpinDale in #1600
- [build] upgrade flashinfer to 0.5.1 by @AlpinDale in #1601
- Require structured output parameters to be explicitly None or valid by @50h100a in #1604
- [multi node] better cluster example script by @AlpinDale in #160...
v0.9.1
What's Changed
- feat: implement Motif model by @AlpinDale in #1439
- [V1] feat: support fp8 kv on ampere through flashinfer by @AlpinDale in #1440
- fix: enable Kobold API by default by @AlpinDale in #1441
- feat: add DeepConf sampling by @AlpinDale in #1442
- build: add option to disable flash attn compilation by @AlpinDale in #1443
- [PP] feat: optimize PP performance through throttling by @AlpinDale in #1447
- chore: sync to upstream 4071c76 by @AlpinDale in #1448
- feat: support encoder data parallel for MiniCPM-V by @AlpinDale in #1449
- feat: enable LoRA support for DeepSeek models by @AlpinDale in #1450
- fix: add reorder_batch to AttentionMetadataBuilder by @AlpinDale in #1451
- chore: move out freezing_value/cuda_event init outside try/finally block by @AlpinDale in #1452
- chore: better type hint for rearrange return value in eplbstate by @AlpinDale in #1453
- chore: faster LoRA startup by @AlpinDale in #1454
- fix: truncate_prompt_tokens type hint by @AlpinDale in #1455
- feat: allow passing multi_modal_uuids as multimodal identifiers by @AlpinDale in #1456
- fix: vocab_size check by @AlpinDale in #1457
- feat: support KV events from connectors by @AlpinDale in #1458
- fix: AutoGPTQ Qwen3-MoE models by @AlpinDale in #1459
- fix: avoid redundant copy for enc-only models by @AlpinDale in #1460
- fix: loading GPTQ 3-bit models by @AlpinDale in #1461
- chore: move fast prefill logic to a separate method by @AlpinDale in #1462
- fix: update constructors and type hints for multimodal input handling by @AlpinDale in #1463
- chore: update import for torch inductor configuration in CPU model runner by @AlpinDale in #1464
- chore: migrate Phi4 inputs to TensorSchema by @AlpinDale in #1465
- feat: IO processor plugin for pooling models by @AlpinDale in #1466
- fix: add support for <tool_call> format in streaming mode for xlam by @AlpinDale in #1467
- fix: allow FP16 inference on Turing and below by @AlpinDale in #1468
- build: upgrade DeepGEMM by @AlpinDale in #1469
- feat: Gemma3n audio endpoint support by @AlpinDale in #1470
- feat: support KeyeVL-1.5 model by @AlpinDale in #1471
- chore: minor code simplification for spec decode by @AlpinDale in #1472
- fix: IO processor plugin fixes by @AlpinDale in #1473
- feat: support DP for Kimi-VL ViT by @AlpinDale in #1474
- chore: move logprob to a separate module by @AlpinDale in #1475
- [kernel] feat: support FP32 for mamba by @AlpinDale in #1476
- fix: blip2 inference by @AlpinDale in #1477
- chore: remove runtime checks for pooling params by @AlpinDale in #1478
- chore: migrate Ovis to TensorSchema by @AlpinDale in #1479
- feat: add online FP8 support for XPU by @AlpinDale in #1480
- chore: migrate interns to TensorSchema by @AlpinDale in #1481
- feat: support DP for GLM-4.5V ViT by @AlpinDale in #1482
- fix: gemma3n batched audio by @AlpinDale in #1483
- fix: EXAONE4 model RoPE by @AlpinDale in #1484
- chore: add logit_bias/sigmoid_normalize support for classification models by @AlpinDale in #1485
- fix: transform_config parsing in compressed tensors by @AlpinDale in #1486
- fix: pack_factor -> packed_factor in linear layers by @AlpinDale in #1487
- chore: deprecate TPOT in favor of ITL by @AlpinDale in #1488
- fix: weight loading for Apertus model by @AlpinDale in #1489
- fix: only print profiler results on rank 0 by @AlpinDale in #1490
- fix: cast offsets tensor bn to tl.int64 to avoid GPU segfault by @AlpinDale in #1491
- fix: DeepSeek-R1 accuracy by setting routed_scaling_factor=1.0 by @AlpinDale in #1492
- fix: LoRA logits on XPU by @AlpinDale in #1493
- feat: multi-threaded model weights loader by @AlpinDale in #1494
- feat: fully async model execution by @AlpinDale in #1495
- feat: enable pytorch symmetric memory for A100 GPUs by @AlpinDale in #1496
- [kernel] feat: vectorized kernels by @AlpinDale in #1497
- chore: upgrade xgrammar to 0.1.23 by @AlpinDale in #1499
- feat: enable request-level logits processor in the batch-level logit-procing by @AlpinDale in #1500
- fix: compile warning for w4a8_mm_entry.cu by @AlpinDale in #1501
- fix: add check for dual chunk attn by @AlpinDale in #1502
- fix: double mul for dots1 and GLM4 MoE by @AlpinDale in #1503
- chore: remove NCCL cumem env var override by @AlpinDale in #1504
- feat: enable heterogenous TP for Nixl KV connector w/ flashinfer by @AlpinDale in #1505
- chore: support add_generation_prompt in embeddings endpoint by @AlpinDale in #1506
- refactor: simplify weight handling in MiniMaxText01RMSNormTP class by removing dead code by @AlpinDale in #1507
- fix: division by zero in triton_attn by @AlpinDale in #1508
- feat: enable full CUDA graph support for PLaMo2 on V1 by @AlpinDale in #1509
- feat: add support for Qwen3-Next model and add Flash Linear Kernels by @AlpinDale in #1510
- Fix for logit bias crash (probably) by @50h100a in #1511
Full Changelog: v0.9.0...v0.9.1
v0.9.0
It has been a long time. There have been many, many changes between this release and v0.6.7. I'll try to summarize the most important ones, but I'll likely miss quite a lot.
New Models
- AIMv2
- Arcee
- Aria
- Aya Vision
- BaiLing
- Bamba
- BertModel (encoder-only embedding)
- Command A Vision
- CLIP Text Model
- DeepSeek-V3
- dots.llm1
- Ernie-4.5
- Ernie-4.5 MoE
- EXAONE-4
- Falcon-H1
- Florence-2
- Gemma-3
- Gemma-3n
- GLM-4
- GLM-4V
- GLM-4.1V
- GLM-4.5
- GPT-Oss
- Granite Speech
- Granite MoE Hybrid
- Granite MoE Shared
- GritLM
- Grok-1
- H2O-VL
- Hunyuan V1
- HyperCLOVA X SEED
- Idefics3
- Mono-InternVL
- Intern-S1
- JinaVL Reranker
- Keye
- Kimi-VL
- Llama-4
- Mamba-2
- MiMo
- MiniCPM-O
- MiniMax-M1
- MiniMax-VL
- Mistral-3
- ModernBERT
- Nemotron-H
- Nemotron-NAS (Super)
- Nemotron-VL
- OLMo-2
- Ovis
- Ovis-2
- Phi-4-MM
- Phi-4 Flash
- Plamo-2
- Prithvi GeoSpatial Model
- Qwen2.5 Omni Thinker
- Qwen2.5 VL
- Qwen2 Audio
- Qwen3
- Qwen3 MoE
- XLM Roberta
- Skywork-R1V
- Smol-VLM
- Step-3
- Tarsier
- TeleChat-2
- Tele-FLM
- Voxtral
- Whisper
- Zamba2
Transformers backend
You can now load models that lack a native implementation using the integrated Transformers backend. By default, if a native implementation doesn't exist, Aphrodite will attempt to load the model using Transformers.
Quantization Methods
There have been a few quant methods added.
- NVFP4 - the new datatype supported by Blackwell GPUs. Will also work on Ampere and Hopper using Marlin kernels.
- MXFP4 - popularized by GPT-Oss. Natively supported by Blackwell; Hopper and Ampere will use Marlin.
- GPTQAllSpark - optimization for GPTQ models, supported when the model has group_size=-1 and desc_act=False. Seems to provide better perf than Marlin. Enable with -q gptq_allspark.
- BitBLAS - support for BitNet-quantized 1.58bit models. Will also support GPTQ.
- TorchAO - support for models quantized using TorchAO.
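As a rough illustration of the GPTQAllSpark condition above, the eligibility check can be sketched in Python. The helper name `is_allspark_eligible` is hypothetical and not part of Aphrodite, and the sketch assumes a GPTQ-style quantization config dict with the usual `group_size` and `desc_act` keys; the engine's real check may differ.

```python
def is_allspark_eligible(quant_config: dict) -> bool:
    """Return True if a GPTQ config matches the conditions in the notes.

    Hypothetical helper for illustration only: the model must use a single
    quantization group (group_size == -1) and no activation reordering
    (desc_act is False).
    """
    return (
        quant_config.get("group_size") == -1
        and quant_config.get("desc_act") is False
    )


# Example configs (made-up values, not taken from any real checkpoint).
per_channel = {"bits": 4, "group_size": -1, "desc_act": False}
grouped = {"bits": 4, "group_size": 128, "desc_act": False}

print(is_allspark_eligible(per_channel))
print(is_allspark_eligible(grouped))
```

A model that passes this check would be a candidate for `-q gptq_allspark`; anything else would stay on the default GPTQ path.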
New Features
There are a lot of new features! Here are some of the more important ones:
- Sophisticated support for torch.compile
- DeepGEMM for DeepSeek-V3
- Expert Load Balancing and Expert Parallel
- FlexAttention
- TreeAttention
- Differential Flash Attention
- Dual Chunk Flash Attention
- Flash Attention V3 (supported only for A100/H100 GPUs)
- Disaggregated Prefill - run separate instances of Aphrodite for prefill and decode. Boosts throughput by eliminating compute starvation caused by expensive prefill requests at the cost of requiring more GPUs.
- Async Scheduling via NanoFlow - provides a ~13% throughput and latency boost (--async-scheduling)
- Async Tensor Parallel - provides an ~8% throughput and latency boost on Hopper GPUs, ~3% on Ampere (-O '{"level":3, "compile_sizes": [512], "pass_config": {"enable_async_tp": true}}')
- Mirostat Sampling - it returns!
- String/Phrase Banning - also known as "anti-slop sampler", this one allows you to specify a list of phrases to ban from generation. Pass banned_strings as a parameter in the API request body, and provide it with a list of strings.
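For the String/Phrase Banning feature, a request body might look like the sketch below. Only the `banned_strings` field comes from the notes above; the helper name, the model name, and the other fields are assumptions modeled on an OpenAI-style completions request and may need adjusting for your server.

```python
import json


def build_completion_request(prompt, banned_strings, model="my-model", max_tokens=128):
    """Build a JSON-serializable body for a completions request.

    Hypothetical helper: `model` and `max_tokens` are placeholder values,
    and `banned_strings` is passed as a plain list of strings.
    """
    if not all(isinstance(s, str) for s in banned_strings):
        raise TypeError("banned_strings must be a list of strings")
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "banned_strings": list(banned_strings),
    }


body = build_completion_request(
    "Once upon a time",
    ["shivers down", "a testament to"],
)
print(json.dumps(body, indent=2))
```

The resulting dict can be sent as the JSON body of a POST request with any HTTP client.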
As always, many thanks to the vLLM team!
What's Changed
- ci: bump aphrodite-engine version to 0.6.7 by @AlpinDale in #1253
- docs: update documentation by @AlpinDale in #1254
- chore: update docs site packages by @ahme-dev in #1255
- fix: multi-step scheduling for TPU model runner by @AlpinDale in #1256
- fix: tool call finish reason in streaming case by @AlpinDale in #1257
- spec decode: use flash_attn_varlen_func for MQA scorer by @AlpinDale in #1260
- fix: xformer attn backend prefill for encoder-decoder models by @AlpinDale in #1261
- fix: placeholder attn max_decode_seq_len by @AlpinDale in #1262
- fix: check for head_size in spec decode models by @AlpinDale in #1263
- XPU: enable async output processing by @AlpinDale in #1264
- feat: migrate docs to starlight by @ahme-dev in #1259
- docs: fix broken links by @AlpinDale in #1266
- TPU: fix SMEM OOM error by @AlpinDale in #1267
- VLM: add multi-image support for Mllama (llama 3.2) by @AlpinDale in #1268
- VLM: add image embeds support for InternVL by @AlpinDale in #1269
- fix: chat API continuous usage stats by @AlpinDale in #1270
- fix: UsageInfo and logprobs=None assertion w/ empty token_ids by @AlpinDale in #1271
- fix: text-only input bug for Molmo by @AlpinDale in #1272
- chore: fix mrope handling by @AlpinDale in #1273
- model: add support for VLM2Vec - multimodal embedding model by @AlpinDale in #1274
- fix: use proper vocab size for logit_bias by @nyxkrage in https://gi...
v0.6.7
What's Changed
- Core: add output streaming support to multi-step + async by @AlpinDale in #1112
- tests: update scheduler tests by @AlpinDale in #1113
- (1/N) XQA: integrate the XQA CUDA kernels within Aphrodite by @AlpinDale in #1115
- chore: support loading weights by ID within models by @AlpinDale in #1116
- chore: expose phi3_v num_crops as an mm_processor_kwargs by @AlpinDale in #1117
- fix: unsafe all-reduce sync by @AlpinDale in #1118
- kernels: split marlin kernels for faster compile, fix MoE, temporarily remove HQQ by @AlpinDale in #1119
- LLM: enable batched inference for llm.chat() API by @AlpinDale in #1120
- Quantization: re-enable Marlin serialization for AWQ quants by @AlpinDale in #1121
- fix: torch.compile dynamo fix by @AlpinDale in #1122
- chore: bump bitsandbytes version to latest; enable cuda graphs for 4bit bnb by @AlpinDale in #1123
- (1/N) Triton Backend: integrate Triton layernorm kernels by @AlpinDale in #1125
- (2/N) Triton Backend: integrate Triton activation kernels by @AlpinDale in #1126
- chore: remove trailing whitespaces by @AlpinDale in #1128
- chore: support prompt_logprobs with speculative decoding by @AlpinDale in #1129
- feat: add Priority-based Scheduling by @AlpinDale in #1130
- API: use heartbeats instead of health checks by @AlpinDale in #1131
- kernel: fix custom all-reduce kernel compilation on Pascal GPUs by @AlpinDale in #1132
- fix: propagate trust_remote_code in InternVL and MiniCPM-V by @AlpinDale in #1133
- fix: load fully-connected layer bias for EAGLE models by @AlpinDale in #1134
- API: propagate usage accounting to FastAPI middleware layer by @AlpinDale in #1135
- fix: ray 2.9.x does not expose available_resources_per_node by @AlpinDale in #1136
- fix: multi-step scheduling with InternVL by @AlpinDale in #1137
- chore: support FP8 MoE for compressed-tensors by @AlpinDale in #1138
- model: add support for Mllama (Llama 3.2) models by @AlpinDale in #1139
- fix: quantization for Mllama models by @AlpinDale in #1140
- fix: include encoder prompt len to non-stream api usage response by @AlpinDale in #1141
- fix: downgrade logger.warning for BOS fallback to print_warning_once by @AlpinDale in #1142
- fix: only set tool_choice to auto if at least one tool is provided by @AlpinDale in #1143
- API: add tool calling support for Llama 3.1 and 3.2 by @AlpinDale in #1144
- fix: batched inference with fuyu by @AlpinDale in #1145
- TPU: support Trillium by @AlpinDale in #1146
- torch.compile: use empty tensor instead of None for profiling by @AlpinDale in #1147
- kernel: Integrate asymmetric quantization for INT8 activations by @AlpinDale in #1148
- core: add support for chunked prefill + multi-step scheduling by @AlpinDale in #1149
- distributed: add env var to force custom all-reduce by skipping p2p check by @AlpinDale in #1150
- chore: add priority scheduling to async engine by @AlpinDale in #1151
- fix: XPU docker build by @AlpinDale in #1153
- distributed: force full nvlink when APHRODITE_FORCE_P2P env var by @AlpinDale in #1154
- fix: multi-step scheduling with Pipeline Parallel by @AlpinDale in #1155
- chore: improve implicit choice of spawn/fork for multiprocessing method by @AlpinDale in #1156
- fix: block manager v2 with preemption and lookahead slots by @AlpinDale in #1157
- fix: marlin MoE act order when is_k_full==False by @AlpinDale in #1158
- build: set FETCHCONTENT_BASE_DIR to one location for better caching by @AlpinDale in #1159
- model: add support for Qwen2.5-Math-RM-72B reward model by @AlpinDale in #1160
- lora: add LoRA support for MiniCPMV-2.5 by @AlpinDale in #1161
- fix: seeded gens with encoder-decoder models by @AlpinDale in #1162
- api: add support for prefill to chat completions endpoint by @AlpinDale in #1163
- kernel: varlen prefill + prefill chunking support for mamba kernels by @AlpinDale in #1164
- model: support input embeddings for qwen2-vl by @AlpinDale in #1165
- models: add LoRA support for MiniCPM-V 2.6 by @AlpinDale in #1166
- vlm: expose internvl2 max_dynamic_patch as a mm_processor_kwarg by @AlpinDale in #1167
- api: expose priority scheduling in the API server by @AlpinDale in #1168
- feat: add request-level logging by @AlpinDale in #1169
- fix: adjust max_position_embeddings for LoRA by @AlpinDale in #1170
- core: move guided decoding params into sampling params by @AlpinDale in #1171
- build: fix machete generation file ordering by @AlpinDale in #1172
- fix: torch.compile tensor alias by @AlpinDale in #1173
- chore: add process_weights_after_loading for DummyLoader by @AlpinDale in #1174
- fix: tensor-parallel inference with fuyu by @AlpinDale in #1175
- fix: token IDs reference for MiniCPM-V when images are provided with no placeholders by @AlpinDale in #1176
- (1/N) MQA Scorer: add MQA scorer by @AlpinDale in #1177
- feat: support multi-step, chunked prefill, prefix cache, cuda graph combo by @AlpinDale in #1178
- fix: guided decoding default values breaking text completions API by @AlpinDale in #1179
- OpenVINO: add support for GPU, fix Docker build by @AlpinDale in #1181
- models: add support for Granite MoE model (PowerMoE) by @AlpinDale in #1182
- fix: mistral parallel tool call template fix by @AlpinDale in #1183
- fix: enforce mistral tool call ID constraint by @AlpinDale in #1184
- core: make v2 block manager the default by @AlpinDale in #1185
- logging: only log the non-default parameters in engine by @AlpinDale in #1180
- chore: parse literals out of --override-neuron-config by @AlpinDale in #1186
- [torch.compile]: add forward context for attention by @AlpinDale in #1187
- [torch.compile]: add forward context for flashinfer by @AlpinDale in #1188
- fix: OPT model loading for checkpoints with no tied embeds by @AlpinDale in #1189
- api: add tool parser plugin + Inte...
v0.6.6
What's Changed
- distributed: support pipeline parallelism for internvl and internlm2 by @AlpinDale in #965
- tpu: add support for async postprocessing by @AlpinDale in #968
- fix: prometheus.yaml path in monitoring example by @AlpinDale in #969
- tpu: support single and multi-host TPUs on GKE and RayServe by @AlpinDale in #970
- vlm: add tensor parallel support for vision transformer models by @AlpinDale in #971
- tests: update internvl test for #971 by @AlpinDale in #972
- vlm: increase the default max_num_batched_tokens for multimodal models by @AlpinDale in #973
- core: fix chunked prefill not being enabled by default for long contexts by @AlpinDale in #974
- tpu: fix TPU type api by @AlpinDale in #975
- fix: modelscope for VLMs by @AlpinDale in #976
- fix: crash when cancelling a request with multi-step by @AlpinDale in #977
- models: add support for IBM Granite (PowerLM) models by @AlpinDale in #978
- tpu: align worker index with node boundary by @AlpinDale in #979
- fix: InternLM2 model with Tensor Parallel by @AlpinDale in #980
- core: slightly improve chunked prefill performance by @AlpinDale in #981
- vlm: fallback to SDPA for ViT models on CPU backend by @AlpinDale in #982
- core: improve async postproc + multi-step performance by @AlpinDale in #983
- fix: raise exception when accessing logger for disable_log_stats=True case by @AlpinDale in #984
- chore: rename task_handler to worker by @AlpinDale in #985
- tpu: fix outputs by correcting the next_token_ids shape by @AlpinDale in #986
- quants: add GPTQ and FBGEMM to AphroditeParameters by @AlpinDale in #987
- benchmarks: add --async-engine arg to throughput benchmark by @AlpinDale in #988
- tpu: use XLA rank for persistent cache path by @AlpinDale in #989
- vlm: support multiple audios per prompt for Ultravox by @AlpinDale in #990
- vlm: fix siglip layernorm and paligemma weight loading by @AlpinDale in #991
- vlm: enable multimodal inputs for the LLM class by @AlpinDale in #992
- api: implement OpenAI-compatible tools API for Hermes/Mistral models by @AlpinDale in #993
- neuron: add 8bit quantization for Neuron by @AlpinDale in #994
- models: add support for QwenVL by @AlpinDale in #995
- fix: gptq_marlin exception on older GPUs by @AlpinDale in #996
- chore: use ray[adag] dep instead of cuda by @AlpinDale in #997
- quants: improve awq_triton throughput by @AlpinDale in #998
- fix: hermes tool call chat template by @AlpinDale in #999
- core: fix async postprocessor in case of preemption by @AlpinDale in #1000
- vlm: add multi-input support for LLaVA and InternVL models by @AlpinDale in #1002
- tools: fix tool calls to more strictly follow OpenAI format by @AlpinDale in #1003
- fix: LoRA support for Cohere and Jamba models by @AlpinDale in #1004
- spec decode: move ops.advance_step to flash attention backend by @AlpinDale in #1005
- chore: remove peft as a requirement by @AlpinDale in #1006
- chore: keep chunked prefill enabled with prefix caching by @AlpinDale in #1007
- fix: ensure multistep lookahead allocation is compatible with cugraph max capture by @AlpinDale in #1008
- fix: pass APHRODITE_ATTENTION_BACKEND to ray workers by @AlpinDale in #1009
- build: shallow clone cutlass 3.5.1 tag by @AlpinDale in #1010
- chore: skip loading extra bias for qwen2 moe GPTQ by @AlpinDale in #1011
- fix: internvl pipeline parallel by @AlpinDale in #1012
- quants: add support for NVIDIA's ModelOpt checkpoints by @AlpinDale in #1013
- vlm: add support for video modality + llava next video model by @AlpinDale in #1014
- vlm: add support for Qwen2-VL model by @AlpinDale in #1015
- cpu: fix issue with sampling kernels by @AlpinDale in #1016
- cpu: add support for W8A8 quantization via compressed-tensor by @AlpinDale in #1017
- kernel: add meta functions for ops to prevent graph breaks by @AlpinDale in #1019
- chore: move device keys to a constant by @AlpinDale in #1020
- tests: refactor speculative decoding tests to remove the async engine by @AlpinDale in #1021
- vlm: add support for Pixtral model by @AlpinDale in #1022
- core: dump model runner inputs during crash by @AlpinDale in #1023
- chore: remove engine_use_ray by @AlpinDale in #1024
- api: fix logic for deciding if tool parser is used by @AlpinDale in #1025
- quants: add bitsandbytes support for gemma2 model by @AlpinDale in #1026
- cpu: raise error if using encoder-decoder models by @AlpinDale in #1027
- chore: use RoPE cache for MRoPE method by @AlpinDale in #1028
- torch.compile: hide slicing under custom op for inductor by @AlpinDale in #1029
- vlm: fix internvl2 inference with various num_patches by @AlpinDale in #1030
- vlm: support multiple images for qwen-vl by @AlpinDale in #1031
- fix: lazy init _copy_stream by @AlpinDale in #1032
- multi-step: add support for flashinfer attention backend by @AlpinDale in #1033
- api: add sampling/engine option to return only deltas or final output by @AlpinDale in #1035
- fix: multi-step + flashinfer with cuda graphs by @AlpinDale in #1036
- fix: disable chunked prefill and prefix caching for multimodal models by @AlpinDale in #1037
- fix: grouped_topk return type by @AlpinDale in #1038
- core: factor out input preprocessing into a separate class by @AlpinDale in #1039
- fix: skip loading extra bias for Qwen2-VL GPTQ by @AlpinDale in #1040
- torch.compile: allow adding custom compile backends via plugins by @AlpinDale in #1041
- xpu: bump IPEX to 2.3, support GQA by @AlpinDale in #1042
- rocm: add custom paged attention kernels for ROCm by @AlpinDale in #1043
- model: add support for MiniCPM-3 by @AlpinDale in #1044
- torch.compile: fix functionalization by @AlpinDale in #1045
- tpu: implement...
v0.6.5
What's Changed
- xpu: refactor XPU worker & executor by @AlpinDale in #861
- build: add jinja2 to requirements file by @AlpinDale in #862
- attention: add AttentionState abstraction by @AlpinDale in #863
- xpu: disable punica kernels for XPU by @AlpinDale in #864
- executor: pipe worker_class_fn arg in executor by @AlpinDale in #865
- server: log the process occupying our port by @AlpinDale in #866
- feat: AWQ quantization for InternVL by @AlpinDale in #867
- Rewrite DRY sampler to be a lot faster by @50h100a in #868
- fix: ROCm build by @Naomiusearch in #817
- fix: temp_last warning being repeated for every output token by @AlpinDale in #869
- feat: add support for chunked prefill + prefix caching by @AlpinDale in #871
- async: avoid premature exit in the async generator by @AlpinDale in #872
- cpu: fix mm_limits initialization by @AlpinDale in #873
- spec decoding: set the draft model ctxlen to target model by @AlpinDale in #874
- sampler: pad dry sequence breakers tensor by @AlpinDale in #875
- fix: add_generation_template -> add_generation_prompt in llm by @AlpinDale in #877
- Update README.md by @NoahBPeterson in #876
- api: fix crashes under very high loads by @AlpinDale in #878
- build: pass PYTHONPATH from setup.py to cmake by @AlpinDale in #879
- async: disable multi-step scheduling for sync engine by @AlpinDale in #880
- api: better startup failure UX by @AlpinDale in #881
- chore: consolidate environment variables within one file by @AlpinDale in #882
- core: fix spec decode metrics and envs circular import by @AlpinDale in #889
- feat: add support for audio models by @AlpinDale in #891
- distributed: fix issue for when nodes have multiple network interfaces by @AlpinDale in #892
- rocm: fix compile issues with rocm 6.2 by @AlpinDale in #893
- build: fix invalid path for envs.py in setup by @AlpinDale in #894
- kernel: use cub::BlockReduce instead of custom impl by @AlpinDale in #895
- fix: Phi 3.5 Vision model loading by @AlpinDale in #896
- api: add client timeouts for the ZeroMQ server by @AlpinDale in #897
- feat: add torch.compile for GemmaRMSNorm by @AlpinDale in #898
- spec decode: add support for EAGLE by @AlpinDale in #899
- fix: ShardedStateLoader with fp8 quant by @AlpinDale in #900
- kernel: do not compile machete for cuda 11 and below by @AlpinDale in #901
- chore: add AphroditeParameter support for FP8 quant by @AlpinDale in #902
- spec decode: fix logprobs when using speculative decoding by @AlpinDale in #904
- api: error suppression cleanup + timeout suppression on aborts by @AlpinDale in #905
- ray: better error when placement group topology is incorrect by @AlpinDale in #906
- xpu: refactor the model runner for tensor parallelism by @AlpinDale in #910
- fix: empty prompt crashing the server by @AlpinDale in #912
- quantization: update marlin to use AphroditeParameters by @AlpinDale in #913
- core: add multi-step scheduling support for the synchronous engine by @AlpinDale in #914
- api: add json_schema to OpenAI server by @AlpinDale in #915
- fix: phi3v crash with unusual image sizes by @AlpinDale in #916
- feat: multi-image input support for Phi3V by @AlpinDale in #917
- spec decode: streamline batch expansion tensor manipulation by @AlpinDale in #918
- api: use fp32 for base64 embeddings by @AlpinDale in #919
- core: improve warmup times for prefix caching in block manager v2 by @AlpinDale in #920
- quants: update qqq and gptq_marlin_24 to use AphroditeParameters by @AlpinDale in #921
- distributed: fix custom allreduce p2p cache file generation by @AlpinDale in #922
- neuron: add support for tensor parallelism by @AlpinDale in #923
- quants: update compressed tensors lifecycle to remove prefix from create_weights by @AlpinDale in #924
- feat: add async postprocessor by @AlpinDale in #925
- api: add endpoint for loading and unloading the model by @AlpinDale in #926
- feat: add single user mode by @AlpinDale in #927
- api: add inline model loading by @AlpinDale in #928
- api: support aphrodite_config.yaml with inline loading by @AlpinDale in #929
- fix: inline model loading conflicts with lora by @AlpinDale in #930
- core: do not compile for profiling by @AlpinDale in #931
- xpu: support pipeline parallel by @AlpinDale in #932
- fix: phi3v image_idx in async server by @AlpinDale in #933
- feat: add fused Marlin MoE kernel by @AlpinDale in #934
- chore: multi-image support for llava-next by @AlpinDale in #935
- model: add support for paligemma2 by @AlpinDale in #936
- vlm: stack multimodal tensors to represent multiple images within each prompt by @AlpinDale in #937
- core: do not compile ScalarType for torch < 2.4.0 by @AlpinDale in #938
- core: add virtual engine for async outproc by @AlpinDale in #939
- api: log prompt truncation by @AlpinDale in #940
- vlm: fix incompatibility nested tensors and multi-image llava-next by @AlpinDale in #941
- vlm: fix persimmon and fuyu issues with transformers 4.45 by @AlpinDale in #942
- Fix SentencePieceTokenizer error when generating on Mistral Large 2411 with --tokenizer-mode mistral by @khanonnie in #943
- core: use flashinfer for FP8 KV when available by @AlpinDale in #944
- tests: update flashinfer test for #944 by @AlpinDale in #945
- quants: add triton kernels for AWQ by @AlpinDale in #946
- tests: add kernel tests for causal_conv1d and mamba_ssm by @AlpinDale in #947
- fix: do not register punica with torch if using older torch by @AlpinDale in #948
- tpu: avoid dynamo guard eval overhead by @AlpinDale in #949
- fix: issues with flashinfer fp8 kv by @AlpinDale in #950
- api: optimize zeromq frontend performance by @AlpinDale in #951
- tpu: remove torch._dynamo.reset() by @AlpinDale in #952
- vlm: fix errors on ragged NestedTensors by @AlpinDal...
v0.6.4.post1
What's Changed
- add linux arm64/aarch64/GH200 installation tips by @qpwo in #851
- DRY Fix: Add output_tokens to sampler by @selalipop in #849
- sampler: fix DRY concurrency issue by @AlpinDale in #852
- sampler: add range parameter for DRY by @AlpinDale in #855
- sampler: optimize DRY performance using z-algorithm by @AlpinDale in #856
- sampler: allow parsing sampler order using strings by @AlpinDale in #858
New Contributors
Full Changelog: v0.6.4...v0.6.4.post1
v0.6.4
What's Changed
- frontend: enable kobold api by default by @AlpinDale in #803
- feat: add serviceinfo endpoint by @AlpinDale in #807
- feat: update to serviceinfo v0.2 by @AlpinDale in #808
- Mask dynatemp using min/max, rather than exp by @50h100a in #813
- fix: temperature issues by @50h100a in #814
- fix: --max-seq-len-to-capture arg by @AlpinDale in #818
- [IMPORTANT] updating test units by @AlpinDale in #769
- fix: tokenization api test by @AlpinDale in #821
- feat: add chat method for LLM class by @AlpinDale in #822
- feat: support chunked prefill with LoRA by @AlpinDale in #823
- SPMD optimizations by @AlpinDale in #824
- fix: sampler test with new transformers version by @AlpinDale in #826
- feat: add cuda sampling kernels for top_k and top_p by @AlpinDale in #828
- feat: add metrics for prefix cache hit rate by @AlpinDale in #829
- fix: unbound tokenizer error by @AlpinDale in #830
- feat: multi-step scheduling by @AlpinDale in #831
- feat: Add DRY (Do not Repeat Yourself) sampling by @selalipop in #827
- feat: add no_repeat_ngram sampler by @AlpinDale in #832
- feat: add skew sampling by @AlpinDale in #834
- fix: hidden states handling in batch expansion for spec decoding by @AlpinDale in #839
- chore: refactor executor classes for easier inheritance by @AlpinDale in #840
- fix: latency and serving benchmarks by @AlpinDale in #841
- feat: Machete Kernels for Hopper GPUs by @AlpinDale in #842
- feat: add sampler_priorty by @AlpinDale in #837
- fix: disable awq_marlin override for awq models by @AlpinDale in #843
- chore: bump mistral_common to 1.5.0 by @AlpinDale in #844
- ci: bump version to 0.6.4 by @AlpinDale in #845
New Contributors
- @dependabot made their first contribution in #796
- @selalipop made their first contribution in #827
Full Changelog: v0.6.3...v0.6.4