meta: reposition turboquant-vllm around HF reference, verification, and research

## Summary

The project now has a clearer strategic split:

- **Native vLLM serving path** should converge on upstream `vllm-project/vllm#38479`
- **turboquant-vllm** should become the reference implementation for:
  - HuggingFace `DynamicCache` workflows
  - model verification / policy tooling
  - multimodal + heterogeneous architecture validation
  - incubation of upstreamable TurboQuant ideas

This issue tracks the repo-level repositioning and the next execution slices.

## Why

Recent repo/docs state and validation work point in the same direction:

- `docs/index.md`, `docs/ARCHITECTURE.md`, and `docs/ROADMAP.md` already say the native vLLM serving roadmap is superseded by upstream PR #38479
- The repo remains strongest where upstream is narrower or not optimized:
  - HuggingFace compression workflows
  - verify CLI / model compatibility checks
  - multimodal models (Molmo2)
  - heterogeneous / shared-KV / sliding-window architectures (Gemma 3/4)
  - experimental algorithm work (e.g. WHT / Hadamard, norm correction backports)

The goal is to stop treating the CUSTOM backend as the main long-term product and instead treat it as optional bridge / staging infrastructure.

## Strategic decision

Primary identity for `turboquant-vllm`:

1. **HuggingFace reference implementation** for TurboQuant KV compression
2. **Verification + policy engine** for deciding whether/how TQ should be used on a model
3. **Multimodal / weird-architecture validation lab**
4. **Incubator** for ideas that may later be upstreamed into vLLM

Secondary identity:

- Optional compatibility bridge for vLLM via plugin path when useful

Non-goal:

- Competing long-term with upstream native vLLM TurboQuant as a parallel production serving stack

## Proposed work breakdown

### Phase 1 — messaging + product shape

- [ ] Reposition README and docs to make HF/reference/validation the headline
- [ ] De-emphasize plugin-first language where it implies the main future direction
- [ ] Add a short strategy page explaining:
  - upstream native vLLM path
  - turboquant-vllm HF path
  - when to use each

### Phase 2 — verification / policy tooling

- [ ] Upgrade `python -m turboquant_vllm.verify` from a "cosine checker" to an architecture-aware advisor
- [ ] Detect and report risky traits:
  - sliding window
  - shared KV layers
  - heterogeneous head_dim
  - hybrid Mamba / non-standard layer types
- [ ] Emit recommended policy/config hints, e.g.
  - full-attention dense → safe default profile
  - sliding-window small model → bypass full-attn routing hubs
  - shared-KV model → donor-rotation handling required
  - hybrid model → experimental / native-vLLM-preferred
- [ ] Document validated / likely / unsupported model classes
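
To make the advisor idea concrete, here is a minimal sketch of the trait detection and hint mapping listed above. It is not the existing `verify` implementation. The function name `advise`, the `Advice` container, and the config keys it reads (`sliding_window`, `num_kv_shared_layers`, `head_dims`, `layer_types`) are assumptions for illustration; real model configs may expose these traits under different names. The sketch also ignores model size, so it treats every sliding-window model the same way.

```python
from dataclasses import dataclass, field


@dataclass
class Advice:
    """Detected risky traits plus the policy hints they trigger."""
    traits: list[str] = field(default_factory=list)
    hints: list[str] = field(default_factory=list)


def advise(config: dict) -> Advice:
    """Map a plain config dict to traits and hints (hypothetical key names)."""
    advice = Advice()
    layer_types = [str(t).lower() for t in config.get("layer_types") or []]

    if config.get("sliding_window"):
        advice.traits.append("sliding window")
        advice.hints.append("bypass full-attn routing hubs")

    if (config.get("num_kv_shared_layers") or 0) > 0:
        advice.traits.append("shared KV layers")
        advice.hints.append("donor-rotation handling required")

    # More than one distinct head_dim across layers counts as heterogeneous.
    # The issue lists no hint for this trait, so it is only reported.
    if len(set(config.get("head_dims") or [])) > 1:
        advice.traits.append("heterogeneous head_dim")

    if any("mamba" in t for t in layer_types):
        advice.traits.append("hybrid Mamba / non-standard layer types")
        advice.hints.append("experimental / native-vLLM-preferred")

    # No risky trait found: treat as a full-attention dense model.
    if not advice.traits:
        advice.hints.append("safe default profile")
    return advice


if __name__ == "__main__":
    example = {"sliding_window": 4096, "num_kv_shared_layers": 2}
    result = advise(example)
    print(result.traits)
    print(result.hints)
```

Keeping detection as a pure function over a dict would let the CLI, tests, and docs share one rule table, but that split is a design suggestion rather than a decision made in this issue.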

### Phase 3 — algorithm incubation

- [ ] Prototype WHT / Hadamard rotation as an alternative to QR random rotation
- [ ] Compare against current rotation on:
  - quality
  - memory / storage complexity
  - runtime cost
  - kernel-friendliness
- [ ] Evaluate which upstream PR ideas belong here:
  - norm correction
  - alternative value quantization choices
  - architecture-aware boundary/full-attn passthrough presets
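
As a starting point for the prototype, the sketch below shows one common way to build a Hadamard-based rotation: flip signs with a fixed random ±1 vector, then apply a normalized fast Walsh-Hadamard transform. This construction is an assumption, not something the issue or upstream specifies, and all names (`fwht`, `make_signs`, `hadamard_rotate`, `hadamard_unrotate`, `rotation_costs`) are hypothetical. It uses pure Python for clarity and requires the dimension to be a power of two.

```python
import math
import random


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform (orthogonal, self-inverse)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly pass: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def make_signs(dim, seed=0):
    """Fixed random +/-1 diagonal; this is the only state the rotation stores."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def hadamard_rotate(vec, signs):
    """Apply R = H * D: sign flip first, then the transform."""
    return fwht([s * x for s, x in zip(signs, vec)])


def hadamard_unrotate(vec, signs):
    """Apply the inverse D * H, since H and D are each their own inverse."""
    return [s * x for s, x in zip(signs, fwht(vec))]


def rotation_costs(dim):
    """Rough per-vector counts for a dense d x d matrix vs. sign flip + FWHT."""
    log2_dim = dim.bit_length() - 1  # exact for powers of two
    return {
        "dense_stored_values": dim * dim,
        "dense_multiplies": dim * dim,
        "hadamard_stored_values": dim,
        "hadamard_add_subs": dim * log2_dim,
    }


if __name__ == "__main__":
    signs = make_signs(8, seed=1)
    x = [float(i) for i in range(8)]
    y = hadamard_rotate(x, signs)
    print(math.isclose(sum(v * v for v in x), sum(v * v for v in y)))
    print(rotation_costs(128))
```

For a head dimension of 128, a dense rotation matrix such as one produced by QR stores 16,384 values, while the sign vector stores 128. That covers only the memory and runtime rows of the comparison; quality and kernel-friendliness still need measurement.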

### Phase 4 — multimodal + architecture moat

- [ ] Continue first-class validation for Molmo2 / Gemma-family / future multimodal models
- [ ] Preserve repo leadership on:
  - heterogeneous head_dim support
  - shared-KV correctness
  - sliding-window/full-attn policy logic
  - benchmark harnesses for hard architectures

## Candidate follow-up issues

1. `docs: reposition project around HF reference + validation workflow`
2. `feat(verify): architecture-aware compatibility and policy recommendations`
3. `research(rotation): prototype WHT/Hadamard rotation path`
4. `docs: add decision guide for upstream native vLLM vs turboquant-vllm`

## Success criteria

- New users understand this repo's role in under 2 minutes
- The verify CLI answers "will TQ work on my model and how should I use it?"
- New architecture work lands here first, and only validated pieces then move upstream
- Plugin/backend work becomes optional support infrastructure, not the strategic center

