meta: reposition turboquant-vllm around HF reference, verification, and research #86

@Alberto-Codes

Summary

The project now has a clearer strategic split:

  • Native vLLM serving path should converge on upstream vllm-project/vllm#38479
  • turboquant-vllm should become the reference implementation for:
    • HuggingFace DynamicCache workflows
    • model verification / policy tooling
    • multimodal + heterogeneous architecture validation
    • incubation of upstreamable TurboQuant ideas

This issue tracks the repo-level repositioning and the next execution slices.

Why

Recent repo/docs state and validation work point in the same direction:

  • docs/index.md, docs/ARCHITECTURE.md, and docs/ROADMAP.md already say the native vLLM serving roadmap is superseded by upstream PR #38479
  • The repo remains strongest where upstream is narrower or not optimized:
    • HuggingFace compression workflows
    • verify CLI / model compatibility checks
    • multimodal models (Molmo2)
    • heterogeneous / shared-KV / sliding-window architectures (Gemma 3/4)
    • experimental algorithm work (e.g. WHT / Hadamard, norm correction backports)

The goal is to stop treating the CUSTOM backend as the main long-term product and instead treat it as optional bridge / staging infrastructure.

Strategic decision

Primary identity for turboquant-vllm:

  1. HuggingFace reference implementation for TurboQuant KV compression
  2. Verification + policy engine for deciding whether/how TQ should be used on a model
  3. Multimodal / weird-architecture validation lab
  4. Incubator for ideas that may later be upstreamed into vLLM

Secondary identity:

  • Optional compatibility bridge for vLLM via plugin path when useful

Non-goal:

  • Competing long-term with upstream native vLLM TurboQuant as a parallel production serving stack

Proposed work breakdown

Phase 1 — messaging + product shape

  • Reposition README and docs to make HF/reference/validation the headline
  • De-emphasize plugin-first language where it implies the main future direction
  • Add a short strategy page explaining:
    • upstream native vLLM path
    • turboquant-vllm HF path
    • when to use each

Phase 2 — verification / policy tooling

  • Upgrade python -m turboquant_vllm.verify from a "cosine checker" to an architecture-aware advisor
  • Detect and report risky traits:
    • sliding window
    • shared KV layers
    • heterogeneous head_dim
    • hybrid Mamba / non-standard layer types
  • Emit recommended policy/config hints, e.g.
    • full-attention dense → safe default profile
    • sliding-window small model → bypass full-attn routing hubs
    • shared-KV model → donor-rotation handling required
    • hybrid model → experimental / native-vLLM-preferred
  • Document validated / likely / unsupported model classes
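
A rough sketch of how the trait detection and policy hints above could fit together. Everything here is hypothetical: ArchTraits, recommend_policy, and the field names are illustration only and are not part of the current verify CLI. The hint strings mirror the examples listed above, and the sketch ignores model size, so it does not distinguish the "small model" case for sliding-window architectures.

```python
from dataclasses import dataclass


@dataclass
class ArchTraits:
    """Hypothetical summary of the risky traits listed in Phase 2."""

    sliding_window: bool = False      # any sliding-window attention layers
    shared_kv_layers: bool = False    # layers that reuse another layer's KV
    head_dims: tuple = (128,)         # head_dim values seen across layers
    hybrid_layers: bool = False       # Mamba or other non-standard layer types

    @property
    def heterogeneous_head_dim(self) -> bool:
        # More than one distinct head_dim means per-layer handling is needed.
        return len(set(self.head_dims)) > 1


def recommend_policy(traits: ArchTraits) -> list[str]:
    """Map detected traits to the policy hints named in the issue."""
    hints = []
    if traits.hybrid_layers:
        hints.append("hybrid model: experimental / native-vLLM-preferred")
    if traits.shared_kv_layers:
        hints.append("shared-KV model: donor-rotation handling required")
    if traits.sliding_window:
        hints.append("sliding-window model: bypass full-attn routing hubs")
    if traits.heterogeneous_head_dim:
        # The issue lists this trait as risky but names no preset for it.
        hints.append("heterogeneous head_dim: flagged, no preset defined")
    if not hints:
        hints.append("full-attention dense: safe default profile")
    return hints


if __name__ == "__main__":
    example = ArchTraits(sliding_window=True, shared_kv_layers=True)
    for hint in recommend_policy(example):
        print(hint)
```

Returning a list rather than a single verdict lets one model carry several hints at once, which matches architectures that combine sliding-window and shared-KV layers.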

Phase 3 — algorithm incubation

  • Prototype WHT / Hadamard rotation as an alternative to QR random rotation
  • Compare against current rotation on:
    • quality
    • memory / storage complexity
    • runtime cost
    • kernel-friendliness
  • Evaluate which upstream PR ideas belong here:
    • norm correction
    • alternative value quantization choices
    • architecture-aware boundary/full-attn passthrough presets
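
For context on the rotation comparison above, here is a minimal pure-Python sketch of a sign-randomized Walsh-Hadamard rotation. It is not the repo's implementation; the function names and the seed parameter are assumptions for illustration. The relevant trade-off is that a dense rotation from a QR decomposition stores a d x d matrix and costs O(d^2) per vector, while this variant stores only d signs and costs O(d log d), at the price of requiring d to be a power of two.

```python
import math
import random


def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (length must be 2^k)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly pass: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def random_signs(n, seed=0):
    """Fixed +/-1 diagonal; this is the only state the rotation stores."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def hadamard_rotate(x, signs):
    """Compute y = (1/sqrt(n)) * H * D * x, an orthogonal transform."""
    scale = 1.0 / math.sqrt(len(x))
    return [scale * v for v in fwht([s * v for s, v in zip(signs, x)])]


def hadamard_unrotate(y, signs):
    """Invert the rotation: x = D * (1/sqrt(n)) * H * y."""
    scale = 1.0 / math.sqrt(len(y))
    return [s * scale * v for s, v in zip(signs, fwht(y))]


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 0.0, 3.0, -0.25, 1.5, 4.0]
    signs = random_signs(len(x))
    y = hadamard_rotate(x, signs)
    restored = hadamard_unrotate(y, signs)
    print(max(abs(a - b) for a, b in zip(x, restored)))
```

Because the transform is orthogonal, vector norms are preserved and the inverse reuses the same butterfly, so a quality comparison against the QR rotation can hold the quantizer fixed and swap only the rotation step.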

Phase 4 — multimodal + architecture moat

  • Continue first-class validation for Molmo2 / Gemma-family / future multimodal models
  • Preserve repo leadership on:
    • heterogeneous head_dim support
    • shared-KV correctness
    • sliding-window/full-attn policy logic
    • benchmark harnesses for hard architectures

Candidate follow-up issues

  1. docs: reposition project around HF reference + validation workflow
  2. feat(verify): architecture-aware compatibility and policy recommendations
  3. research(rotation): prototype WHT/Hadamard rotation path
  4. docs: add decision guide for upstream native vLLM vs turboquant-vllm

Success criteria

  • New users understand this repo's role in under 2 minutes
  • The verify CLI answers "will TQ work on my model and how should I use it?"
  • New architecture work lands here first, then only validated pieces move upstream
  • Plugin/backend work becomes optional support infrastructure, not the strategic center

Labels: documentation, enhancement, priority: P1 (High, needed for model support parity)
