evalops
diff --git a/TODO.md b/TODO.md
Lines changed: 154 additions & 142 deletions
@@ -1,146 +1,158 @@
-# Deep Refactor TODO
+# Deep Research Improvement Roadmap
+
+This roadmap is derived from deep research into Greptile's public docs, blog, MCP surface, self-hosted architecture, and GitHub repos, then mapped onto DiffScope's current architecture and gaps.
+
+## Research Signals
+
+- Greptile treats review as a full-codebase intelligence product, not just a PR comment bot.
+- Their learning loop is explicit: thumbs, replies, and addressed/not-addressed outcomes reshape future comments.
+- Their `v3` review flow is agentic and tool-using, not a rigid single-pass flowchart.
+- They productize workflow state: unresolved comments, review completeness, weekly reports, merge readiness.
+- They pull in external intent via Jira/Notion/Docs and cross-repo context via pattern repositories.
+- They expose review operations back into IDE/agent workflows through MCP and skills.
+- They sell an operational platform: self-hosted, queued workflows, analytics, and enterprise controls.
 
 ## Working Rules
 
-- Keep refactors behavior-preserving.
-- Validate every checkpoint with `cargo fmt --check`, `cargo clippy --all-targets -- -D warnings`, `cargo test`, and `bash scripts/check-workflows.sh`.
+- Keep changes additive and behavior-preserving unless an item explicitly requires workflow changes.
+- Validate each checkpoint with `cargo fmt --check`, `cargo clippy --all-targets --all-features -- -D warnings`, `cargo test`, and `bash scripts/check-workflows.sh`; when frontend code changes, also run `npm --prefix web run lint`, `npm --prefix web run build`, and `npm --prefix web run test`.
 - Commit and push after each validated slice.
-- Prefer extracting pure helpers and formatter/parsing boundaries before moving async orchestration.
-- Keep module roots thin; if a root becomes mostly re-exports, let children carry the logic.
-
-## Improvement Queue
-
-- [ ] `src/commands/eval/`
-  - Add suite/category/language baseline comparisons so regressions are gated by dimension, not only whole-run totals.
-  - Add model-matrix and repeat execution support so the same suite can be compared across frontier models and flake-checked.
-  - Capture failed-run artifacts, including emitted comments, verifier warnings, and per-fixture mismatch details.
-  - Reduce fixture brittleness with semantic/alias expectation matching instead of exact wording dependence.
-  - Extend trend history with suite/category/language series plus verifier-health counters and model/provider labels.
-  - Expand `review-depth-core` with authz, supply-chain, and async-correctness benchmark packs.
-- [ ] `src/commands/feedback_eval/`
-  - Correlate feedback calibration with eval-suite category performance and rule-level precision/recall.
-  - Surface high-confidence but frequently rejected categories/rules so review quality gaps are obvious.
-
-## Immediate Queue
-
-- [ ] `src/core/semantic.rs`
-  - Split source-file discovery and excerpt/query builders from index refresh bookkeeping.
-  - Split semantic diff retrieval and feedback-example matching from feedback-store maintenance.
-- [ ] `src/core/symbol_index.rs`
-  - Split LSP command detection and extension scanning from index-building entry points.
-  - Split regex-based symbol extraction and dependency-hint parsing from graph/file-summary registration.
-  - Split `LspClient` protocol transport from symbol-result decoding and path/URI utilities.
-  - Keep `build()` and `build_with_lsp()` as thin orchestration entry points.
-
-## Core Backlog
-
-- [ ] `src/core/semantic.rs`
-  - Split semantic chunk hashing/key generation from summary/excerpt assembly.
-  - Split changed-range filtering and per-query match scoring from context chunk rendering.
-  - Split feedback fingerprint helpers from feedback-store reconciliation.
-- [ ] `src/config.rs`
-  - Split defaults/model-role conversion from load/deserialize paths.
-  - Split env/path resolution from validation/migration logic.
-  - Split serialization-focused test helpers from production config code.
-- [ ] `src/core/symbol_index.rs`
-  - Split language-pattern tables and path candidate expansion from dependency resolution.
-  - Split file collection and byte-size filtering from index population.
-  - Split symbol graph and reverse-dependency registration from symbol storage.
-  - Split LSP symbol collection/range extraction from request/notification plumbing.
-- [ ] `src/core/symbol_graph.rs`
-  - Split graph construction from traversal/query helpers.
-  - Split serialization/persistence helpers from graph algorithms.
-- [ ] `src/core/pr_summary.rs`
-  - Split stats aggregation, prompt generation, response parsing, and diagram helpers.
-- [ ] `src/core/enhanced_review.rs`
-  - Split context construction, guidance generation, and response handling.
-- [ ] `src/core/eval_benchmarks.rs`
-  - Split fixture loading, threshold selection, scoring, and aggregation/reporting.
-- [ ] `src/core/prompt.rs`
-  - Split prompt fragments, model-specific tuning, and reusable prompt builders.
-- [ ] `src/core/context.rs`
-  - Split context chunk construction, provenance helpers, and formatting/rendering.
-- [ ] `src/core/offline.rs`
-  - Split endpoint/model probing, metadata parsing, and recommendation helpers.
-- [ ] `src/core/function_chunker.rs`
-  - Split parsing, chunk planning, and scoring heuristics.
-- [ ] `src/core/agent_tools.rs`
-  - Split tool registry/definitions from execution adapters and tool-context helpers.
-- [ ] `src/core/agent_loop.rs`
-  - Split loop orchestration, state transitions, and tool/result handling.
-- [ ] `src/core/code_summary.rs`
-  - Split summary planning, extraction, cache helpers, and formatting.
-- [ ] `src/core/changelog.rs`
-  - Split git/history ingestion from final changelog rendering.
-- [ ] `src/core/multi_pass.rs`
-  - Split pass planning, execution bookkeeping, and result merging.
-- [ ] `src/core/composable_pipeline.rs`
-  - Split stage wiring from execution semantics and result transport.
-- [ ] `src/core/convention_learner.rs`
-  - Split store persistence, scoring, and feedback ingestion helpers.
-- [ ] `src/core/git_history.rs`
-  - Split log collection, parsing, and summarization.
-- [ ] `src/core/diff_parser.rs`
-  - Split unified diff parsing, text diff parsing, hunk assembly, and post-processing helpers.
-- [ ] `src/core/interactive.rs`
-  - Split REPL/input loop, commands, and output formatting.
-
-## Server and Storage Backlog
-
-- [ ] `src/server/api.rs`
-  - Split route handlers by domain plus shared request/response and error helpers.
-- [ ] `src/server/state.rs`
-  - Split session state, queueing, and persistence coordination.
-- [ ] `src/server/storage_json.rs`
-  - Split file I/O, indexing, migrations, and query helpers.
-- [ ] `src/server/storage_pg.rs`
-  - Split SQL-backed persistence by domain and query grouping.
-- [ ] `src/server/github.rs`
-  - Split webhook parsing, API interactions, and review-session orchestration.
-- [ ] `src/server/metrics.rs`
-  - Split metric registration from event emission helpers.
-- [ ] `src/server/mod.rs`
-  - Keep top-level wiring thin as submodules mature.
-
-## Adapters, Parsing, and Plugins Backlog
-
-- [ ] `src/adapters/llm.rs`
-  - Split request shaping, retry/policy logic, and response normalization.
-- [ ] `src/adapters/openai.rs`
-  - Split request builders, streaming handling, and schema/response parsing.
-- [ ] `src/adapters/anthropic.rs`
-  - Split request conversion, retries, and response parsing.
-- [ ] `src/adapters/ollama.rs`
-  - Split local model capabilities, request building, and response parsing.
-- [ ] `src/adapters/common.rs`
-  - Split shared retry/auth/http helpers.
-- [ ] `src/parsing/llm_response.rs`
-  - Split fenced-block parsing, comment extraction, structured JSON handling, and validation.
-- [ ] `src/parsing/smart_response.rs`
-  - Split structured smart-review parsing from fallback parsing paths.
-- [ ] `src/plugins/builtin/secret_scanner.rs`
-  - Split rule loading, scanning, and finding shaping.
-- [ ] `src/plugins/builtin/supply_chain.rs`
-  - Split manifest parsing, registry lookups, and finding generation.
-- [ ] `src/plugins/builtin/eslint.rs`
-  - Split command execution, parser helpers, and finding conversion.
-- [ ] `src/plugins/builtin/semgrep.rs`
-  - Split command assembly, result parsing, and finding mapping.
-- [ ] `src/plugins/builtin/duplicate_filter.rs`
-  - Split fingerprinting from suppression heuristics.
-- [ ] `src/plugins/plugin.rs`
-  - Split plugin traits/types from execution helpers.
-
-## Output and Entrypoint Backlog
-
-- [ ] `src/output/format.rs`
-  - Split smart review formatting, patch output, and walkthrough generation.
-- [ ] `src/main.rs`
-  - Split CLI wiring by command group and shared config/bootstrap helpers.
-- [ ] `src/vault.rs`
-  - Split vault discovery, parsing, and maintenance operations.
-
-## Ongoing Watchlist
-
-- [ ] Revisit freshly split files once they cross roughly 150 LOC again, especially `src/review/pipeline/execution/dispatcher/job.rs`, `src/review/pipeline/session/build.rs`, `src/review/pipeline/services/support.rs`, and `src/review/pipeline/postprocess/feedback/lookup.rs`.
-- [ ] Keep module roots thin; if a root becomes only re-exports plus tests, leave it alone until children regrow.
+- Prefer turning existing primitives into first-class product surfaces before inventing brand new subsystems.
+- Optimize for independent validation, tight feedback loops, and high-signal comments over superficial feature parity.
+
+## 1. Feedback, Memory, and Outcomes
+
+1. [ ] Add first-class comment outcome states beyond thumbs: `new`, `accepted`, `rejected`, `addressed`, `stale`, `auto_fixed`.
+2. [ ] Infer "addressed by later commit" by diffing follow-up pushes against the original commented lines.
+3. [ ] Feed addressed/not-addressed outcomes into the reinforcement store alongside thumbs.
+4. [ ] Separate false-positive rejections from "valid but won't fix" dismissals in stored feedback.
+5. [ ] Weight reinforcement by reviewer role or trust level when GitHub identity is available.
+6. [ ] Add rule-level reinforcement decay so old team preferences do not dominate forever.
+7. [ ] Add path-scoped reinforcement buckets so teams can prefer different standards in `tests/`, `scripts/`, and production code.
+8. [ ] Persist explanation text from follow-up feedback replies and mine it into reusable review guidance.
+9. [ ] Learn "preferred phrasing" for accepted comments so comment tone and specificity improve over time.
+10. [ ] Backfill existing stored reviews into the new outcome-aware feedback store for cold-start reduction.
+
+## 2. Review Lifecycle and Merge Readiness
+
+11. [ ] Track unresolved vs resolved findings for PR reviews as a first-class lifecycle state.
+12. [ ] Add review completeness metrics: total findings, acknowledged findings, fixed findings, stale findings.
+13. [ ] Compute merge-readiness summaries for GitHub PR reviews using severity, unresolved count, and verification state.
+14. [ ] Add stale-review detection when new commits land after the latest completed review.
+15. [ ] Show "needs re-review" state in review detail and history pages for incremental PR workflows.
+16. [ ] Distinguish informational findings from blocking findings in lifecycle and readiness calculations.
+17. [ ] Add "critical blockers" summary cards for unresolved `Error` and `Warning` comments.
+18. [ ] Add per-PR readiness timelines showing when a PR became mergeable.
+19. [ ] Store resolution timestamps for findings so mean-time-to-fix can be measured.
+20. [ ] Add CLI and API surfaces to query PR readiness without opening the web UI.
+
+## 3. Agentic Validation Loops
+
+21. [ ] Build a first-class `fix until clean` loop that can run review, apply fixes, rerun review, and stop on convergence.
+22. [ ] Reuse the existing DAG runtime to model iterative review/fix loops as resumable workflow nodes.
+23. [ ] Add a max-iteration policy and loop budget controls for autonomous review convergence.
+24. [ ] Add "issue replay" prompts that hand unresolved findings back to a coding agent with file-local context.
+25. [ ] Add a handoff contract from reviewer findings to fix agents with rule IDs, evidence, and suggested diffs.
+26. [ ] Persist loop-level telemetry: iterations, fixes attempted, findings cleared, findings reopened.
+27. [ ] Add "challenge the finding" verification loops where a validator tries to falsify a suspected issue before keeping it.
+28. [ ] Add caching between iterations so repeated codebase retrieval and verification runs are cheaper.
+29. [ ] Allow loop policies to differ by profile: conservative auditor, high-autonomy fixer, or report-only.
+30. [ ] Add eval fixtures specifically for loop convergence and reopened-issue regressions.
+
+## 4. Code Graph and Repository Intelligence
+
+31. [ ] Turn the current symbol graph into a persisted repository graph with durable storage and reload support.
+32. [ ] Add caller/callee expansion APIs for multi-hop impact analysis from changed symbols.
+33. [ ] Add contract edges between interfaces, implementations, and API endpoints.
+34. [ ] Add "similar implementation" lookup so repeated patterns and divergences are explicit.
+35. [ ] Add cross-file blast-radius summaries to findings when a change affects many callers.
+36. [ ] Add graph freshness/version metadata so reviews know whether they are using stale repository intelligence.
+37. [ ] Add graph-backed ranking of related files before semantic RAG retrieval.
+38. [ ] Add graph query traces to `dag_traces` or review artifacts for explainability and debugging.
+39. [ ] Add graph-aware eval fixtures that require multi-hop code understanding to pass.
+40. [ ] Split `src/core/symbol_graph.rs` into construction, persistence, traversal, and ranking modules as it grows.
+
+## 5. External Context and Pattern Repositories
+
+41. [x] Surface pattern repository sources in the Settings UI with validation and defaults.
+42. [x] Surface review rule file sources in the Settings UI instead of requiring config edits by hand.
+43. [ ] Add structured UI editing for custom context notes, files, and scopes.
+44. [ ] Add per-path scoped review instructions in the Settings UI for common repo areas.
+45. [ ] Support Jira/Linear issue context ingestion for PR-linked reviews.
+46. [ ] Support document-backed context ingestion for design docs, RFCs, and runbooks.
+47. [ ] Add explicit "intent mismatch" review checks comparing PR changes to ticket acceptance criteria.
+48. [ ] Add review artifacts that show which external context sources influenced a finding.
+49. [ ] Add tests for pattern repository resolution across local paths, Git URLs, and broken sources.
+50. [ ] Add analytics on which context sources actually improve acceptance and fix rates.
+
+## 6. Review UX and Workflow Integration
+
+51. [ ] Add visible accepted/rejected/dismissed badges to comments throughout the UI, not just icon state.
+52. [ ] Add comment grouping by unresolved, fixed, stale, and informational sections in `ReviewView`.
+53. [ ] Add a "show only blockers" mode for large reviews.
+54. [ ] Add keyboard actions for thumbs, resolve, and jump-to-next-finding workflows.
+55. [ ] Add file-level readiness summaries in the diff sidebar.
+56. [ ] Add lifecycle-aware PR summaries that explain what still blocks merge.
+57. [ ] Add a "train the reviewer" callout when thumbs coverage on a review is low.
+58. [ ] Add review-change comparisons so users can diff one review run against the next on the same PR.
+59. [ ] Add better surfacing for incremental PR reviews so users know when only the delta was reviewed.
+60. [ ] Add discussion workflows that can convert repeated human comments into candidate rules or context snippets.
+
+## 7. Analytics, Reporting, and Quality Dashboards
+
+61. [x] Add feedback coverage metrics: percent of findings with thumbs or explicit disposition.
+62. [x] Add acceptance/rejection trend lines over time for recent reviews.
+63. [x] Add top accepted categories/rules and top rejected categories/rules to Analytics.
+64. [ ] Add unresolved blocker counts per repository and per PR.
+65. [ ] Add review completeness and mean-time-to-resolution charts.
+66. [ ] Add feedback-learning effectiveness metrics: did reranked findings get higher acceptance after rollout?
+67. [ ] Add pattern-repository utilization analytics showing when extra context actually affected findings.
+68. [ ] Add eval-vs-production dashboards comparing benchmark strength against real-world acceptance.
+69. [ ] Add drill-downs from trend charts directly into the affected reviews, findings, and rules.
+70. [ ] Add exportable JSON/CSV reports for review quality, lifecycle, and reinforcement metrics.
+
+## 8. APIs, Automation, and MCP-Like Surfaces
+
+71. [ ] Expose unresolved/resolved comment search through the HTTP API.
+72. [ ] Expose PR readiness through the HTTP API for CI and agent integrations.
+73. [ ] Add API endpoints to fetch learned rules, attention gaps, and top rejected patterns.
+74. [ ] Add machine-friendly APIs to fetch findings grouped by severity, file, and lifecycle state.
+75. [ ] Add a "trigger re-review" API that reuses existing PR metadata and loop policy.
+76. [ ] Add APIs for comment resolution and lifecycle updates, not just thumbs.
+77. [ ] Add an MCP server for DiffScope with review, analytics, and rule-management tools.
+78. [ ] Add reusable agent skills/workflows for checking PR readiness and running fix loops.
+79. [ ] Add signed webhook or event-stream integration for downstream automation consumers.
+80. [ ] Add rate-limited API auth and audit trails for automation-heavy deployments.
+
+## 9. Infra, Self-Hosting, and Enterprise Operations
+
+81. [ ] Split `src/server/api.rs` by domain so the growing platform API stays maintainable.
+82. [ ] Split `src/server/state.rs` into session lifecycle, persistence, progress, and GitHub coordination modules.
+83. [ ] Add queue depth and worker saturation metrics for long-running review and eval jobs.
+84. [ ] Add retention policies for review artifacts, eval artifacts, and trend histories.
+85. [ ] Add storage migrations for richer comment lifecycle and reinforcement schemas.
+86. [ ] Add deployment docs for self-hosted review + analytics + trend retention setups.
+87. [ ] Add secret-management guidance and validation for multi-provider enterprise installs.
+88. [ ] Add background jobs for recomputing analytics after schema or scoring changes.
+89. [ ] Add cost dashboards by provider/model/role for review, verification, and eval workloads.
+90. [ ] Add failure forensics bundles for self-hosted users when review or eval jobs degrade.
+
+## 10. Eval, Benchmarking, and Model Governance
+
+91. [ ] Add eval fixtures for external-context alignment, not just diff-local correctness.
+92. [ ] Add eval fixtures for merge-readiness judgments and unresolved-blocker classification.
+93. [ ] Add eval fixtures for addressed-vs-stale finding lifecycle inference.
+94. [ ] Add eval fixtures for multi-hop graph reasoning across call chains and contract edges.
+95. [ ] Add eval runs that compare single-pass review against agentic loop review.
+96. [ ] Add production replay evals using anonymized accepted/rejected review outcomes.
+97. [ ] Add leaderboard reporting for reviewer usefulness metrics, not just precision/recall.
+98. [ ] Add regression gates for feedback coverage, verifier health, and lifecycle-state accuracy.
+99. [ ] Add model-routing policies that explicitly separate generation, verification, and auditing roles.
+100. [ ] Publish a repeatable "independent auditor" benchmark story in the UI and CLI so DiffScope's differentiation is measurable.
+
+## Current Execution Slice
+
+- [x] Rewrite this roadmap into the active backlog and keep it updated as slices ship.
+- [x] Productize the learning loop in Analytics with reaction coverage and acceptance trends.
+- [x] Surface repository rule sources and pattern repository sources in Settings.
+- [ ] Commit and push each validated checkpoint before moving to the next epic.
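
Roadmap items 13, 14, 16, and 17 together describe a readiness calculation: unresolved `Error` and `Warning` findings block a merge, informational findings do not, and a review goes stale when new commits land after it completes. The sketch below is one possible reading of those items, not DiffScope's implementation. The names `Finding`, `Outcome`, `Readiness`, `merge_readiness`, and the `Info` severity are hypothetical. Treating `new` and `accepted` as the unresolved states, and comparing commit SHAs as plain strings, are assumptions. The verification state mentioned in item 13 is left out for brevity.

```rust
/// Severity of a finding. `Error` and `Warning` appear in the roadmap;
/// `Info` is an assumed name for informational findings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
    Info,
}

/// Comment outcome states listed in roadmap item 1.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    New,
    Accepted,
    Rejected,
    Addressed,
    Stale,
    AutoFixed,
}

/// Hypothetical minimal finding record.
pub struct Finding {
    pub severity: Severity,
    pub outcome: Outcome,
}

impl Finding {
    /// Assumption: a finding is unresolved while it is `New`, or `Accepted`
    /// but not yet addressed. All other states count as resolved.
    pub fn is_unresolved(&self) -> bool {
        matches!(self.outcome, Outcome::New | Outcome::Accepted)
    }

    /// Item 17: blockers are unresolved `Error` and `Warning` findings.
    pub fn is_blocker(&self) -> bool {
        self.is_unresolved()
            && matches!(self.severity, Severity::Error | Severity::Warning)
    }
}

/// Hypothetical readiness summary for a single PR.
#[derive(Debug, PartialEq, Eq)]
pub enum Readiness {
    Ready,
    NeedsReReview,
    Blocked { blockers: usize },
}

/// Computes readiness from the findings of the latest completed review.
/// `reviewed_commit` is the SHA that review ran against and `head_commit`
/// is the current PR head.
pub fn merge_readiness(
    findings: &[Finding],
    reviewed_commit: &str,
    head_commit: &str,
) -> Readiness {
    // Item 14: new commits after the latest review make it stale.
    if reviewed_commit != head_commit {
        return Readiness::NeedsReReview;
    }
    // Item 16: informational findings never block.
    let blockers = findings.iter().filter(|f| f.is_blocker()).count();
    if blockers > 0 {
        Readiness::Blocked { blockers }
    } else {
        Readiness::Ready
    }
}
```

Checking staleness before counting blockers is a deliberate choice in this sketch: a blocker count computed against an outdated commit could mislead, so the stale state takes priority.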