Releases: dphnAI/aphrodite-engine

v0.21.0

02 May 14:05
18f852d

What's Changed

Full Changelog: v0.20.0...v0.21.0

v0.20.0

26 Apr 12:27

What's Changed

New Contributors

Full Changelog: v0.10.0...v0.20.0

v0.10.0

08 Nov 13:51
5e3afa0

What's Changed

v0.9.1

10 Sep 18:23
d7b7417

What's Changed

Full Changelog: v0.9.0...v0.9.1

v0.9.0

24 Aug 15:33

It has been a long time. There have been many, many changes between this release and v0.6.7. I'll try to summarize the most important ones, but I'll likely miss quite a lot.

New Models

Transformers backend

You can now load models that lack a native implementation by using the integrated Transformers backend. By default, when a model has no native implementation, Aphrodite will attempt to load it using Transformers.
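
The fallback described above can be pictured with a minimal sketch. This is not Aphrodite's actual loader: the `NATIVE_ARCHITECTURES` set and the `resolve_backend` function are hypothetical names invented for illustration, and the real engine may decide differently.

```python
# Hypothetical set of architectures with a native implementation
# (illustration only; not Aphrodite's real registry).
NATIVE_ARCHITECTURES = {"LlamaForCausalLM", "MistralForCausalLM"}


def resolve_backend(architecture: str) -> str:
    """Pick a backend name for a model architecture.

    Prefers the native implementation and falls back to the
    Transformers backend when none is registered.
    """
    if architecture in NATIVE_ARCHITECTURES:
        return "native"
    return "transformers"


print(resolve_backend("LlamaForCausalLM"))
print(resolve_backend("SomeBrandNewArchitecture"))
```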

Quantization Methods

A few quantization methods have been added.

  • NVFP4 - the new datatype supported by Blackwell GPUs. Will also work on Ampere and Hopper using Marlin kernels.
  • MXFP4 - popularized by gpt-oss. Natively supported by Blackwell; Hopper and Ampere will use Marlin.
  • GPTQAllSpark - optimization for GPTQ models, supported when the model has group_size=-1 and desc_act=False. Seems to provide better performance than Marlin. Enable with -q gptq_allspark.
  • BitBLAS - support for BitNet-quantized 1.58-bit models. Will also support GPTQ.
  • TorchAO - support for models quantized using TorchAO.
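
As a rough aid for the GPTQAllSpark condition above, the following sketch checks a GPTQ quantization config for the two stated requirements. The helper name `allspark_eligible` is hypothetical, and the sketch assumes the activation-order flag is stored under the key `desc_act`, as it usually is in a GPTQ `quantize_config.json`; it does not inspect anything else the engine may require.

```python
from typing import Any, Mapping


def allspark_eligible(quantize_config: Mapping[str, Any]) -> bool:
    """Return True if the config matches the two conditions from the notes.

    Assumption: keys follow the common GPTQ quantize_config.json layout
    ("group_size" and "desc_act"). A missing key counts as not eligible.
    """
    return (
        quantize_config.get("group_size") == -1
        and quantize_config.get("desc_act") is False
    )


# Example configs (made up for illustration).
per_channel = {"bits": 4, "group_size": -1, "desc_act": False}
grouped = {"bits": 4, "group_size": 128, "desc_act": False}

print(allspark_eligible(per_channel))
print(allspark_eligible(grouped))
```

If the check passes, the notes say the method is enabled with -q gptq_allspark.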

New Features

There are a lot of new features! Here are some of the more important ones:

  • Sophisticated support for torch.compile
  • DeepGEMM for DeepSeek-V3
  • Expert Load Balancing and Expert Parallel
  • FlexAttention
  • TreeAttention
  • Differential Flash Attention
  • Dual Chunk Flash Attention
  • Flash Attention V3 (supported only for A100/H100 GPUs)
  • Disaggregated Prefill – run separate instances of Aphrodite for prefill and decode. Boosts throughput by eliminating compute starvation caused by expensive prefill requests at the cost of requiring more GPUs.
  • Async Scheduling via NanoFlow – provides a ~13% throughput and latency boost (--async-scheduling)
  • Async Tensor Parallel – provides an ~8% throughput and latency boost on Hopper GPUs, ~3% on Ampere (-O '{"level":3, "compile_sizes": [512], "pass_config": {"enable_async_tp": true}}')
  • Mirostat Sampling – it returns!
  • String/Phrase Banning – also known as "anti-slop sampler", this one allows you to specify a list of phrases to ban from generation. Pass banned_strings as a parameter in the API request body, and provide it with a list of strings.
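
For the phrase-banning item, a request body might be built like this. Only the `banned_strings` field comes from the notes above; the model name, prompt, the other fields, and the helper `build_completion_request` are placeholders chosen for illustration, assuming an OpenAI-style completions request. The sketch only builds the JSON and does not send it.

```python
import json
from typing import Any, Dict, List


def build_completion_request(
    prompt: str,
    banned_strings: List[str],
    model: str = "my-model",  # placeholder model name (assumption)
    max_tokens: int = 128,
) -> Dict[str, Any]:
    """Build a completions-style request body with a phrase ban list."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        # Field named in the release notes: a list of phrases to ban.
        "banned_strings": list(banned_strings),
    }


body = build_completion_request(
    "Write a short story about a lighthouse.",
    ["shivers down", "a testament to"],
)
payload = json.dumps(body)
print(payload)
```

The resulting string can be sent as the body of a POST request with any HTTP client.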

As always, many thanks to the vLLM team!

What's Changed

v0.6.7

07 Mar 11:36
c63ad41

What's Changed

  • Core: add output streaming support to multi-step + async by @AlpinDale in #1112
  • tests: update scheduler tests by @AlpinDale in #1113
  • (1/N) XQA: integrate the XQA CUDA kernels within Aphrodite by @AlpinDale in #1115
  • chore: support loading weights by ID within models by @AlpinDale in #1116
  • chore: expose phi3_v num_crops as an mm_processor_kwargs by @AlpinDale in #1117
  • fix: unsafe all-reduce sync by @AlpinDale in #1118
  • kernels: split marlin kernels for faster compile, fix MoE, temporarily remove HQQ by @AlpinDale in #1119
  • LLM: enable batched inference for llm.chat() API by @AlpinDale in #1120
  • Quantization: re-enable Marlin serialization for AWQ quants by @AlpinDale in #1121
  • fix: torch.compile dynamo fix by @AlpinDale in #1122
  • chore: bump bitsandbytes version to latest; enable cuda graphs for 4bit bnb by @AlpinDale in #1123
  • (1/N) Triton Backend: integrate Triton layernorm kernels by @AlpinDale in #1125
  • (2/N) Triton Backend: integrate Triton activation kernels by @AlpinDale in #1126
  • chore: remove trailing whitespaces by @AlpinDale in #1128
  • chore: support prompt_logprobs with speculative decoding by @AlpinDale in #1129
  • feat: add Priority-based Scheduling by @AlpinDale in #1130
  • API: use heartbeats instead of health checks by @AlpinDale in #1131
  • kernel: fix custom all-reduce kernel compilation on Pascal GPUs by @AlpinDale in #1132
  • fix: propagate trust_remote_code in InternVL and MiniCPM-V by @AlpinDale in #1133
  • fix: load fully-connected layer bias for EAGLE models by @AlpinDale in #1134
  • API: propagate usage accounting to FastAPI middleware layer by @AlpinDale in #1135
  • fix: ray 2.9.x does not expose available_resources_per_node by @AlpinDale in #1136
  • fix: multi-step scheduling with InternVL by @AlpinDale in #1137
  • chore: support FP8 MoE for compressed-tensors by @AlpinDale in #1138
  • model: add support for Mllama (Llama 3.2) models by @AlpinDale in #1139
  • fix: quantization for Mllama models by @AlpinDale in #1140
  • fix: include encoder prompt len to non-stream api usage response by @AlpinDale in #1141
  • fix: downgrade logger.warning for BOS fallback to print_warning_once by @AlpinDale in #1142
  • fix: only set tool_choice to auto if at least one tool is provided by @AlpinDale in #1143
  • API: add tool calling support for Llama 3.1 and 3.2 by @AlpinDale in #1144
  • fix: batched inference with fuyu by @AlpinDale in #1145
  • TPU: support Trillium by @AlpinDale in #1146
  • torch.compile: use empty tensor instead of None for profiling by @AlpinDale in #1147
  • kernel: Integrate asymmetric quantization for INT8 activations by @AlpinDale in #1148
  • core: add support for chunked prefill + multi-step scheduling by @AlpinDale in #1149
  • distributed: add env var to force custom all-reduce by skipping p2p check by @AlpinDale in #1150
  • chore: add priority scheduling to async engine by @AlpinDale in #1151
  • fix: XPU docker build by @AlpinDale in #1153
  • distributed: force full nvlink when APHRODITE_FORCE_P2P env var is set by @AlpinDale in #1154
  • fix: multi-step scheduling with Pipeline Parallel by @AlpinDale in #1155
  • chore: improve implicit choice of spawn/fork for multiprocessing method by @AlpinDale in #1156
  • fix: block manager v2 with preemption and lookahead slots by @AlpinDale in #1157
  • fix: marlin MoE act order when is_k_full==False by @AlpinDale in #1158
  • build: set FETCHCONTENT_BASE_DIR to one location for better caching by @AlpinDale in #1159
  • model: add support for Qwen2.5-Math-RM-72B reward model by @AlpinDale in #1160
  • lora: add LoRA support for MiniCPMV-2.5 by @AlpinDale in #1161
  • fix: seeded gens with encoder-decoder models by @AlpinDale in #1162
  • api: add support for prefill to chat completions endpoint by @AlpinDale in #1163
  • kernel: varlen prefill + prefill chunking support for mamba kernels by @AlpinDale in #1164
  • model: support input embeddings for qwen2-vl by @AlpinDale in #1165
  • models: add LoRA support for MiniCPM-V 2.6 by @AlpinDale in #1166
  • vlm: expose internvl2 max_dynamic_patch as a mm_processor_kwarg by @AlpinDale in #1167
  • api: expose priority scheduling in the API server by @AlpinDale in #1168
  • feat: add request-level logging by @AlpinDale in #1169
  • fix: adjust max_position_embeddings for LoRA by @AlpinDale in #1170
  • core: move guided decoding params into sampling params by @AlpinDale in #1171
  • build: fix machete generation file ordering by @AlpinDale in #1172
  • fix: torch.compile tensor alias by @AlpinDale in #1173
  • chore: add process_weights_after_loading for DummyLoader by @AlpinDale in #1174
  • fix: tensor-parallel inference with fuyu by @AlpinDale in #1175
  • fix: token IDs reference for MiniCPM-V when images are provided with no placeholders by @AlpinDale in #1176
  • (1/N) MQA Scorer: add MQA scorer by @AlpinDale in #1177
  • feat: support multi-step, chunked prefill, prefix cache, cuda graph combo by @AlpinDale in #1178
  • fix: guided decoding default values breaking text completions API by @AlpinDale in #1179
  • OpenVINO: add support for GPU, fix Docker build by @AlpinDale in #1181
  • models: add support for Granite MoE model (PowerMoE) by @AlpinDale in #1182
  • fix: mistral parallel tool call template fix by @AlpinDale in #1183
  • fix: enforce mistral tool call ID constraint by @AlpinDale in #1184
  • core: make v2 block manager the default by @AlpinDale in #1185
  • logging: only log the non-default parameters in engine by @AlpinDale in #1180
  • chore: parse literals out of --override-neuron-config by @AlpinDale in #1186
  • [torch.compile]: add forward context for attention by @AlpinDale in #1187
  • [torch.compile]: add forward context for flashinfer by @AlpinDale in #1188
  • fix: OPT model loading for checkpoints with no tied embeds by @AlpinDale in #1189
  • api: add tool parser plugin + Inte...
v0.6.6

27 Jan 15:53
b20c457

What's Changed

v0.6.5

22 Dec 06:43
cbd51a2

What's Changed

v0.6.4.post1

03 Dec 01:51
8b8d2ce

What's Changed

New Contributors

  • @qpwo made their first contribution in #851

Full Changelog: v0.6.4...v0.6.4.post1

v0.6.4

27 Nov 07:31
d2971a6

What's Changed

New Contributors

Full Changelog: v0.6.3...v0.6.4