docs(roadmap): queue distilled classifier + per-query alpha regression + soft routing; capture HyDE query-time wipeout (#1067)

jamie8johnson · claude · web-flow · commit d290f23b266e · 2026-04-20T20:17:00.000-05:00
* docs(roadmap): HyDE wipeout result + queue label expansion + context features

Three edits to the CPU Lane roadmap:

1. HyDE moved from "[ ] most promising untested lever" to
   "[x] tested 2026-04-20, CATASTROPHIC" with the diagnosis. Query-time
   HyDE: R@5 = 0.0% across all 8 categories on both test and dev. Cause:
   synthetic Rust code from Gemma is generic and has zero cqs-specific
   identifiers, so search returns generic chunks and never matches the
   gold. The v2-era HyDE result that motivated the experiment was
   index-time, not query-time — wrong direction tested.

   Index-time HyDE remains open (existing cqs index --hyde-queries
   pipeline) but is now lower priority than the classifier/regression
   items above it.

2. Queued: Expand the v3 label set with Gemma-generated synthetic
   queries. Prerequisite for the distilled classifier and per-query α
   regression. ~1 day of compute, no new engineering.

3. Queued: Context-aware classification. Add index language
   distribution + project category + recent-search features as input
   dims to the distilled classifier head. Speculative ceiling but
   cheap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(roadmap): queue soft routing as orthogonal-to-classifier-choice lever

The whole categorization arc is fundamentally a classification-and-routing
problem. Hard routing (argmax category → α(category)) throws away the
classifier's confidence — even the centroid classifier internally has
soft cosine scores per category, but we collapse them to one bucket.
Soft routing keeps the distribution and computes effective α = Σ P(c) × α(c).
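
A minimal sketch of the hard-versus-soft difference, assuming a three-variant `Category` enum and an `alpha_for` lookup. Both names are hypothetical stand-ins, not the real types in `src/search/router.rs`, and the alpha values are the illustrative ones from the roadmap entry:

```rust
/// Hypothetical category set; the real `QueryCategory` enum in cqs may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Category {
    Behavioral,
    Structural,
    MultiStep,
}

/// Per-category alpha, using the example values from the roadmap entry.
fn alpha_for(c: Category) -> f32 {
    match c {
        Category::Behavioral => 0.80,
        Category::Structural => 0.90,
        Category::MultiStep => 0.10,
    }
}

/// Hard routing: keep only the most probable category and use its alpha.
fn hard_alpha(dist: &[(Category, f32)]) -> Option<f32> {
    dist.iter()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|&(c, _)| alpha_for(c))
}

/// Soft routing: expected alpha under the distribution, Σ P(c) × α(c).
/// Divides by the total mass so unnormalized weights are also accepted.
fn soft_alpha(dist: &[(Category, f32)]) -> Option<f32> {
    let total: f32 = dist.iter().map(|&(_, p)| p).sum();
    if total <= 0.0 {
        return None;
    }
    let weighted: f32 = dist.iter().map(|&(c, p)| p * alpha_for(c)).sum();
    Some(weighted / total)
}
```

For a 60% behavioral / 30% structural / 10% multi_step split, `hard_alpha` returns 0.80 and `soft_alpha` returns 0.48 + 0.27 + 0.01 = 0.76.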

Compatible with rule+centroid today (use centroid cosines as the soft
distribution), with the distilled head (softmax outputs natural), and with
the per-query α regression (regression is already soft).
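
One way the centroid cosines could be turned into a soft distribution is a temperature-scaled softmax. This is a sketch only: the function name and the `temperature` parameter are assumptions for illustration, and the right temperature (or whether to use a softmax at all) would have to come out of the eval harness.

```rust
/// Turn per-category centroid cosine scores into a probability distribution.
/// `temperature` is an assumed tuning knob: cosines sit in a narrow range,
/// so a small value sharpens the distribution and a large one flattens it.
fn softmax_with_temperature(scores: &[f32], temperature: f32) -> Vec<f32> {
    if scores.is_empty() || temperature <= 0.0 {
        return Vec::new();
    }
    // Subtract the max before exponentiating for numerical stability.
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores
        .iter()
        .map(|&s| ((s - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

The output preserves the ranking of the input scores and sums to 1, so it can be fed directly into the effective-α sum.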

Risk: mixing alphas may attenuate their per-category effect. Worth
measuring with a synthetic test where we know the true category from
fixture metadata.

Pairs particularly well with the per-query α regression — train on a soft
target (R@5-weighted distribution) instead of a hard one-hot.

Half-day to wire centroid-based soft routing today; the rest comes for
free with the distilled head.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: jamie8johnson <jamie8johnson@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/ROADMAP.md b/ROADMAP.md
@@ -48,7 +48,23 @@ The R@5 re-sweep also surfaced direction-stable but small-magnitude moves on `cr
 ### CPU Lane
 
 **Retrieval quality:**
-- [ ] **HyDE for structural queries — most promising untested lever.** v2-era data: +14pp structural, +12pp type_filtered, −22pp conceptual, −15pp behavioral. Router → LLM generates synthetic code → embed → search, per-category by design. Treat v2 numbers as motivation, not promise: this session saw several wins vanish through the full router (centroid, reranker v2, full alpha sweep). Design the experiment to hold the production router fixed and vary only the query embedding source. Prereqs already built (Gemma 4 31B via vLLM, BGE embedder, v3 eval harness).
+- [x] **Query-time HyDE: tested 2026-04-20, CATASTROPHIC.** Per `evals/hyde_per_category_eval.py`: generate synthetic Rust code via Gemma 4 31B per query, then search with the synthetic code as the query string. **R@5 = 0.0% across all 8 categories on both test and dev splits** (vs baseline 65-95% per category). Inspecting samples: the synthetic code is generic Rust/SQL with zero cqs-specific identifiers (e.g. for "table named notes AND columns with NOT NULL constraint" Gemma generated a generic `CREATE TABLE notes (id INTEGER PRIMARY KEY, ...)`, which has nothing in common with cqs's actual schema chunks). Search returns generic-looking chunks; gold is never matched. The v2-era HyDE result that motivated the experiment was index-time, not query-time, so we tested the wrong direction.
+
+  **Index-time HyDE re-eval still open.** cqs already has `cqs index --hyde-queries` that adds LLM-generated "queries that would find me" strings to each chunk at index time. The 2026-04-08 measurement on v2_300q showed +14pp structural / +12pp type_filtered / −22pp conceptual / −15pp behavioral — net negative on R@1 in a single-config measurement. Per-category routing (only enable hyde-augmented chunks for queries where the v3 sweep says it helps) was never tried. Properly testing this requires: (1) regenerate HyDE for all chunks via the existing Claude Batches pipeline, (2) reindex with `--hyde-queries`, (3) per-category A/B harness that toggles the hyde-augmented embedding column. ~1 day. Lower expected lift than the categorization improvements above; promote only if classifier/regression work plateaus.
+
+- [ ] **Expand the v3 label set with Gemma-generated synthetic queries.** Current v3 train + dev + test = 544 queries. Categorical optimization (alpha sweep, distilled classifier, per-query α regression) is data-bound past 50-100q per category. Generate ~5-10k more via the existing chunk-driven pipeline (`evals/generate_from_chunks.py`), classified self-consistently via Gemma. Bias generation toward thin categories (`conceptual` 0% rule fire, `negation` small-N noise on test). Prerequisite for the distilled classifier and per-query α regression items above. ~1 day of compute (Gemma already up via vLLM); negligible engineering since the pipeline is already working.
+
+- [ ] **Context-aware classification.** Currently the router classifies the query in isolation. Add features available at query time: index language distribution (Rust-heavy vs Python-heavy vs polyglot), project category if known, top-N most-recently-searched terms. The intuition: same query in different project shapes might want different α (e.g., "function with retry" in a Go project routes to behavioral, in a Rust project might route to structural because Rust queries are more often structural in nature). Cheap to add as additional input dims to the distilled classifier or per-query α regression heads (no separate model needed). Effort: ~1 day after the distilled head is in place. Speculative ceiling — could be 0pp if context doesn't predict, or +3-5pp if there's signal we're not using. Also unlocks better behavior when an index spans heterogeneous projects (refs).
+
+- [ ] **Soft routing: a distribution over categories instead of argmax.** Today the classifier returns a single `QueryCategory` and the router picks `α(category)`, so any misclassification fully swaps the alpha. Sometimes that is harmless (behavioral=0.80 vs structural=0.90 are close), but a multi_step query misfiled as `structural` gets 0.90 instead of 0.10, which is catastrophic. Soft routing: the classifier outputs `P(c)` per category, and effective α = `Σ P(c) × α(c)`. A query that's 60% behavioral / 30% structural / 10% multi_step gets α = 0.6×0.80 + 0.3×0.90 + 0.1×0.10 = 0.76.
+
+  **Why now**: this whole arc is fundamentally a classification-and-routing problem. Hard routing throws away the classifier's confidence — even today's centroid classifier internally has soft cosine scores per category, but we softmax → argmax → pick one. Soft routing reuses that signal end-to-end.
+
+  **Compatible with everything**: works on rule+centroid (use centroid cosines as the soft distribution), works on the distilled head (softmax outputs natural), works on the per-query α regression (the regression IS already producing a soft α). Probably a half-day in `src/search/router.rs` to wire centroid-based soft routing today; the rest follows for free when the distilled head lands.
+
+  **Risk**: mixing alphas may attenuate their effect. If behavioral wants 0.80 and structural wants 0.90, an even mix gives 0.85, which might be in the "neither helps much" middle ground. Worth measuring with a synthetic test where we know the true category from fixture metadata.
+
+  **Pairs particularly well with the per-query α regression**: train on a soft target (R@5-weighted distribution over categories) instead of a hard one-hot, which gives the model nuanced training signal.
 - [ ] **BGE → E5 v9-200k.** Clean-index eval ties on R@1, slight edge on R@5/R@20, 1/3 the embedding dim (768 vs 1024). Gated on [#949](https://github.com/jamie8johnson/cqs/issues/949) (embedder abstraction) + v3 re-run to rule out noise.
 - [ ] **CAGRA filtering regression on enriched index.** Fully-routed v1.24.0: conceptual −5.5pp, structural −3.8pp, identifier −2pp vs pre-release. Hypothesis: CAGRA graph walk strands in filtered-out regions. Concrete proposal in [#962](https://github.com/jamie8johnson/cqs/issues/962).
 - [~] **Classifier accuracy — SCOPE REOPENED 2026-04-20.** The earlier "SCOPE REDUCED" call was correct in a world where per-category α tuning was R@1-targeted (and gave small gains). The 2026-04-20 R@5 re-sweep changed the picture: per-category α has substantial latent R@5 signal (sweep predicted +14pp held-out lift) but ~8× of it dilutes through the rule-based + centroid classifier (production reality: +0.9pp test, ±0 dev). **Classifier accuracy is now the bottleneck on R@5, not the alpha grid.**