speculative-decoding

Here are 313 public repositories matching this topic...

Luce-Org / lucebox

LLM speculative inference server for consumer hardware & heterogeneous computing

spark kernel cuda cuda-kernels rocm heterogeneous-computing luce poolside rtx3090 llama-cpp local-ai qwen speculative-decoding strix-halo dflash megakernel speculative-prefill pflash

Updated Aug 1, 2026
C++

SafeAILab / EAGLE

Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3 (NeurIPS'25).

large-language-models llm-inference speculative-decoding

Updated Feb 20, 2026
Python

intel / intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

retrieval chatbot rag habana large-language-model chatpdf llm-inference 4-bits speculative-decoding llm-cpu streamingllm intel-optimized-llamacpp neural-chat neural-chat-7b autoround gaudi3

Updated Oct 8, 2024
Python

dphnAI / sonar

Large-scale LLM inference engine

machine-learning cuda intel api-rest lora rocm inference-engine tpu inferentia speculative-decoding

Updated Jul 31, 2026
C++

Tencent / AngelSlim

Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.

audio eagle quantization diffusion vlm llm qwen speculative-decoding llm-compression hunyuan deepseek fp4 dflash

Updated Jul 30, 2026
Python

youssofal / MTPLX

3x decode TPS increase On Qwen 3.6 27B @ temp 0.6 | Native MTP Speculative Decoding On Apple Silicon With No External Drafter.

metal mtp mlx inference-engine apple-silicon local-ai qwen speculative-decoding speculative-sampling openai-compatible qwen3-next anthropic-compatible native-mtp mtplx

Updated Jul 31, 2026
Python

Avarok-Cybersecurity / atlas

Pure Rust Inference Engine

rust cuda transformers ssm mamba dgx openai-api llm-inference speculative-decoding gb10 nvfp4 dgx-spark

Updated Jul 30, 2026
Rust

chrisliu298 / awesome-on-policy-distillation

A curated collection of papers, technical reports, frameworks, and tools for on-policy distillation (OPD) of large language models

awesome reinforcement-learning rl awesome-list knowledge-distillation post-training opd distillation self-distillation llm rlhf gkd llm-training speculative-decoding on-policy-distillation minillm llm-distillation

Updated Jul 31, 2026

zengxiao-he / tessera

From teacher to tiles: a from-scratch LLM distillation & serving engine with custom Triton/CUDA kernels, FSDP distillation, paged-KV continuous batching, speculative decoding, a Rust gateway, a JAX oracle, and interpretability tooling.

rust cuda pytorch triton quantization knowledge-distillation inference-engine jax kv-cache ml-systems llm mechanistic-interpretability fsdp flash-attention speculative-decoding paged-attention

Updated Jun 5, 2026
Python

AEON-7 / Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash

Fully uncensored, capability-enhanced abliteration of Qwen3.6-27B. NVFP4 + z-lab DFlash speculative decoding (n=12) on the unified ghcr.io/aeon-7/aeon-vllm-ultimate:latest container, tuned for long-context draft acceptance on DGX Spark. 6 HF variants (BF16/NVFP4/MTP/MTP-XS), docker-compose, and QuickStart.

quantization uncensored blackwell llm vllm qwen speculative-decoding abliteration qwen3 nvfp4 dgx-spark dflash

Updated Jul 3, 2026
Python

Infini-AI-Lab / Sequoia

scalable and robust tree-based speculative decoding algorithm

efficiency inference llm speculative-decoding

Updated Jan 28, 2025
Python

facebookresearch / LayerSkip

Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024

optimization transformers early-exit llm speculative-decoding layer-drop

Updated Jul 20, 2026
Python

AtomicBot-ai / atomic-llama-cpp-turboquant

llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP and Qwen 3.6 NextN speculative decoding (+30-50% throughput).

Updated Jul 30, 2026
C++

avifenesh / memra

from-scratch LLM inference for RTX 5090 (sm_120a) and H100 (sm_90a)

rust ai cuda inference moe hoper gpu-kernels blackwell llm llama-cpp llm-inference gguf speculative-decoding nvfp4 sm120a sm90a

Updated Aug 1, 2026
Rust

Infini-AI-Lab / TriForce

[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

acceleration efficiency inference llm long-context llm-inference speculative-decoding

Updated Aug 31, 2024
Python

raketenkater / ggrun

llama.cpp/ik_llama.cpp launcher: loads big MoE models across mismatched multi-GPU rigs by exact VRAM math.

golang metal vulkan cuda self-hosted moe inference-server multi-gpu openai-api llm llamacpp llama-cpp local-llm gguf speculative-decoding localllama ollama-alternative

Updated Jul 31, 2026
Go

FasterDecoding / REST

REST: Retrieval-Based Speculative Decoding, NAACL 2024

retrieval llm-inference speculative-decoding

Updated Mar 5, 2026
C

hao-ai-lab / JetSpec

JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Causal Parallel Tree Drafting

large-language-models speculative-decoding efficient-llm-inference

Updated Jul 31, 2026
Python

Tencent / AngelSpec

A unified, torch-native training framework for MTP and block-parallel speculative decoding.

inference-acceleration speculative-decoding dflash d-cut dfly

Updated Jul 31, 2026
Python

humanrouter / ddtree-mlx

Tree-based speculative decoding for Apple Silicon (MLX). ~10-15% faster than DFlash on code, ~1.5x over autoregressive. First MLX port with custom Metal kernels for hybrid model support.

inference mlx apple-silicon llm speculative-decoding

Updated Apr 15, 2026
Python
