|
1 | | -# Deep Refactor TODO |
| 1 | +# Deep Research Improvement Roadmap |
| 2 | + |
| 3 | +This roadmap is derived from deep research into Greptile's public docs, blog, MCP surface, self-hosted architecture, and GitHub repos, then mapped onto DiffScope's current architecture and gaps. |
| 4 | + |
| 5 | +## Research Signals |
| 6 | + |
| 7 | +- Greptile treats review as a full-codebase intelligence product, not just a PR comment bot. |
| 8 | +- Their learning loop is explicit: thumbs, replies, and addressed/not-addressed outcomes reshape future comments. |
| 9 | +- Their `v3` review flow is agentic and tool-using, not a rigid single-pass flowchart. |
| 10 | +- They productize workflow state: unresolved comments, review completeness, weekly reports, merge readiness. |
| 11 | +- They pull in external intent via Jira/Notion/Docs and cross-repo context via pattern repositories. |
| 12 | +- They expose review operations back into IDE/agent workflows through MCP and skills. |
| 13 | +- They sell an operational platform: self-hosted, queued workflows, analytics, and enterprise controls. |
2 | 14 |
|
3 | 15 | ## Working Rules |
4 | 16 |
|
5 | | -- Keep refactors behavior-preserving. |
6 | | -- Validate every checkpoint with `cargo fmt --check`, `cargo clippy --all-targets -- -D warnings`, `cargo test`, and `bash scripts/check-workflows.sh`. |
| 17 | +- Keep changes additive and behavior-preserving unless an item explicitly requires workflow changes. |
| 18 | +- Validate each checkpoint with `cargo fmt --check`, `cargo clippy --all-targets --all-features -- -D warnings`, `cargo test`, and `bash scripts/check-workflows.sh`; when frontend code changes, also run `npm --prefix web run lint`, `npm --prefix web run build`, and `npm --prefix web run test`. |
7 | 19 | - Commit and push after each validated slice. |
8 | | -- Prefer extracting pure helpers and formatter/parsing boundaries before moving async orchestration. |
9 | | -- Keep module roots thin; if a root becomes mostly re-exports, let children carry the logic. |
10 | | - |
11 | | -## Improvement Queue |
12 | | - |
13 | | -- [ ] `src/commands/eval/` |
14 | | - - Add suite/category/language baseline comparisons so regressions are gated by dimension, not only whole-run totals. |
15 | | - - Add model-matrix and repeat execution support so the same suite can be compared across frontier models and flake-checked. |
16 | | - - Capture failed-run artifacts, including emitted comments, verifier warnings, and per-fixture mismatch details. |
17 | | - - Reduce fixture brittleness with semantic/alias expectation matching instead of exact wording dependence. |
18 | | - - Extend trend history with suite/category/language series plus verifier-health counters and model/provider labels. |
19 | | - - Expand `review-depth-core` with authz, supply-chain, and async-correctness benchmark packs. |
20 | | -- [ ] `src/commands/feedback_eval/` |
21 | | - - Correlate feedback calibration with eval-suite category performance and rule-level precision/recall. |
22 | | - - Surface high-confidence but frequently rejected categories/rules so review quality gaps are obvious. |
23 | | - |
24 | | -## Immediate Queue |
25 | | - |
26 | | -- [ ] `src/core/semantic.rs` |
27 | | - - Split source-file discovery and excerpt/query builders from index refresh bookkeeping. |
28 | | - - Split semantic diff retrieval and feedback-example matching from feedback-store maintenance. |
29 | | -- [ ] `src/core/symbol_index.rs` |
30 | | - - Split LSP command detection and extension scanning from index-building entry points. |
31 | | - - Split regex-based symbol extraction and dependency-hint parsing from graph/file-summary registration. |
32 | | - - Split `LspClient` protocol transport from symbol-result decoding and path/URI utilities. |
33 | | - - Keep `build()` and `build_with_lsp()` as thin orchestration entry points. |
34 | | - |
35 | | -## Core Backlog |
36 | | - |
37 | | -- [ ] `src/core/semantic.rs` |
38 | | - - Split semantic chunk hashing/key generation from summary/excerpt assembly. |
39 | | - - Split changed-range filtering and per-query match scoring from context chunk rendering. |
40 | | - - Split feedback fingerprint helpers from feedback-store reconciliation. |
41 | | -- [ ] `src/config.rs` |
42 | | - - Split defaults/model-role conversion from load/deserialize paths. |
43 | | - - Split env/path resolution from validation/migration logic. |
44 | | - - Split serialization-focused test helpers from production config code. |
45 | | -- [ ] `src/core/symbol_index.rs` |
46 | | - - Split language-pattern tables and path candidate expansion from dependency resolution. |
47 | | - - Split file collection and byte-size filtering from index population. |
48 | | - - Split symbol graph and reverse-dependency registration from symbol storage. |
49 | | - - Split LSP symbol collection/range extraction from request/notification plumbing. |
50 | | -- [ ] `src/core/symbol_graph.rs` |
51 | | - - Split graph construction from traversal/query helpers. |
52 | | - - Split serialization/persistence helpers from graph algorithms. |
53 | | -- [ ] `src/core/pr_summary.rs` |
54 | | - - Split stats aggregation, prompt generation, response parsing, and diagram helpers. |
55 | | -- [ ] `src/core/enhanced_review.rs` |
56 | | - - Split context construction, guidance generation, and response handling. |
57 | | -- [ ] `src/core/eval_benchmarks.rs` |
58 | | - - Split fixture loading, threshold selection, scoring, and aggregation/reporting. |
59 | | -- [ ] `src/core/prompt.rs` |
60 | | - - Split prompt fragments, model-specific tuning, and reusable prompt builders. |
61 | | -- [ ] `src/core/context.rs` |
62 | | - - Split context chunk construction, provenance helpers, and formatting/rendering. |
63 | | -- [ ] `src/core/offline.rs` |
64 | | - - Split endpoint/model probing, metadata parsing, and recommendation helpers. |
65 | | -- [ ] `src/core/function_chunker.rs` |
66 | | - - Split parsing, chunk planning, and scoring heuristics. |
67 | | -- [ ] `src/core/agent_tools.rs` |
68 | | - - Split tool registry/definitions from execution adapters and tool-context helpers. |
69 | | -- [ ] `src/core/agent_loop.rs` |
70 | | - - Split loop orchestration, state transitions, and tool/result handling. |
71 | | -- [ ] `src/core/code_summary.rs` |
72 | | - - Split summary planning, extraction, cache helpers, and formatting. |
73 | | -- [ ] `src/core/changelog.rs` |
74 | | - - Split git/history ingestion from final changelog rendering. |
75 | | -- [ ] `src/core/multi_pass.rs` |
76 | | - - Split pass planning, execution bookkeeping, and result merging. |
77 | | -- [ ] `src/core/composable_pipeline.rs` |
78 | | - - Split stage wiring from execution semantics and result transport. |
79 | | -- [ ] `src/core/convention_learner.rs` |
80 | | - - Split store persistence, scoring, and feedback ingestion helpers. |
81 | | -- [ ] `src/core/git_history.rs` |
82 | | - - Split log collection, parsing, and summarization. |
83 | | -- [ ] `src/core/diff_parser.rs` |
84 | | - - Split unified diff parsing, text diff parsing, hunk assembly, and post-processing helpers. |
85 | | -- [ ] `src/core/interactive.rs` |
86 | | - - Split REPL/input loop, commands, and output formatting. |
87 | | - |
88 | | -## Server and Storage Backlog |
89 | | - |
90 | | -- [ ] `src/server/api.rs` |
91 | | - - Split route handlers by domain plus shared request/response and error helpers. |
92 | | -- [ ] `src/server/state.rs` |
93 | | - - Split session state, queueing, and persistence coordination. |
94 | | -- [ ] `src/server/storage_json.rs` |
95 | | - - Split file I/O, indexing, migrations, and query helpers. |
96 | | -- [ ] `src/server/storage_pg.rs` |
97 | | - - Split SQL-backed persistence by domain and query grouping. |
98 | | -- [ ] `src/server/github.rs` |
99 | | - - Split webhook parsing, API interactions, and review-session orchestration. |
100 | | -- [ ] `src/server/metrics.rs` |
101 | | - - Split metric registration from event emission helpers. |
102 | | -- [ ] `src/server/mod.rs` |
103 | | - - Keep top-level wiring thin as submodules mature. |
104 | | - |
105 | | -## Adapters, Parsing, and Plugins Backlog |
106 | | - |
107 | | -- [ ] `src/adapters/llm.rs` |
108 | | - - Split request shaping, retry/policy logic, and response normalization. |
109 | | -- [ ] `src/adapters/openai.rs` |
110 | | - - Split request builders, streaming handling, and schema/response parsing. |
111 | | -- [ ] `src/adapters/anthropic.rs` |
112 | | - - Split request conversion, retries, and response parsing. |
113 | | -- [ ] `src/adapters/ollama.rs` |
114 | | - - Split local model capabilities, request building, and response parsing. |
115 | | -- [ ] `src/adapters/common.rs` |
116 | | - - Split shared retry/auth/http helpers. |
117 | | -- [ ] `src/parsing/llm_response.rs` |
118 | | - - Split fenced-block parsing, comment extraction, structured JSON handling, and validation. |
119 | | -- [ ] `src/parsing/smart_response.rs` |
120 | | - - Split structured smart-review parsing from fallback parsing paths. |
121 | | -- [ ] `src/plugins/builtin/secret_scanner.rs` |
122 | | - - Split rule loading, scanning, and finding shaping. |
123 | | -- [ ] `src/plugins/builtin/supply_chain.rs` |
124 | | - - Split manifest parsing, registry lookups, and finding generation. |
125 | | -- [ ] `src/plugins/builtin/eslint.rs` |
126 | | - - Split command execution, parser helpers, and finding conversion. |
127 | | -- [ ] `src/plugins/builtin/semgrep.rs` |
128 | | - - Split command assembly, result parsing, and finding mapping. |
129 | | -- [ ] `src/plugins/builtin/duplicate_filter.rs` |
130 | | - - Split fingerprinting from suppression heuristics. |
131 | | -- [ ] `src/plugins/plugin.rs` |
132 | | - - Split plugin traits/types from execution helpers. |
133 | | - |
134 | | -## Output and Entrypoint Backlog |
135 | | - |
136 | | -- [ ] `src/output/format.rs` |
137 | | - - Split smart review formatting, patch output, and walkthrough generation. |
138 | | -- [ ] `src/main.rs` |
139 | | - - Split CLI wiring by command group and shared config/bootstrap helpers. |
140 | | -- [ ] `src/vault.rs` |
141 | | - - Split vault discovery, parsing, and maintenance operations. |
142 | | - |
143 | | -## Ongoing Watchlist |
144 | | - |
145 | | -- [ ] Revisit freshly split files once they cross roughly 150 LOC again, especially `src/review/pipeline/execution/dispatcher/job.rs`, `src/review/pipeline/session/build.rs`, `src/review/pipeline/services/support.rs`, and `src/review/pipeline/postprocess/feedback/lookup.rs`. |
146 | | -- [ ] Keep module roots thin; if a root becomes only re-exports plus tests, leave it alone until children regrow. |
| 20 | +- Prefer turning existing primitives into first-class product surfaces before inventing brand-new subsystems. |
| 21 | +- Optimize for independent validation, tight feedback loops, and high-signal comments over superficial feature parity. |
| 22 | + |
| 23 | +## 1. Feedback, Memory, and Outcomes |
| 24 | + |
| 25 | +1. [ ] Add first-class comment outcome states beyond thumbs: `new`, `accepted`, `rejected`, `addressed`, `stale`, `auto_fixed`. |
| 26 | +2. [ ] Infer "addressed by later commit" by diffing follow-up pushes against the original commented lines. |
| 27 | +3. [ ] Feed addressed/not-addressed outcomes into the reinforcement store alongside thumbs. |
| 28 | +4. [ ] Separate false-positive rejections from "valid but won't fix" dismissals in stored feedback. |
| 29 | +5. [ ] Weight reinforcement by reviewer role or trust level when GitHub identity is available. |
| 30 | +6. [ ] Add rule-level reinforcement decay so old team preferences do not dominate forever. |
| 31 | +7. [ ] Add path-scoped reinforcement buckets so teams can prefer different standards in `tests/`, `scripts/`, and production code. |
| 32 | +8. [ ] Persist explanation text from follow-up feedback replies and mine it into reusable review guidance. |
| 33 | +9. [ ] Learn "preferred phrasing" for accepted comments so comment tone and specificity improve over time. |
| 34 | +10. [ ] Backfill existing stored reviews into the new outcome-aware feedback store for cold-start reduction. |
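To make items 1, 3, and 6 concrete, here is a minimal Rust sketch of outcome-aware, decaying reinforcement. `CommentOutcome`, `decay_weight`, and `rule_score` are hypothetical names, not existing DiffScope APIs, and the per-outcome signals and exponential half-life are placeholder assumptions to be tuned against real feedback.

```rust
/// Hypothetical lifecycle states for one review comment (item 1).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CommentOutcome {
    New,
    Accepted,
    Rejected,
    Addressed,
    Stale,
    AutoFixed,
}

impl CommentOutcome {
    /// Signed reinforcement signal (item 3). Positive outcomes push a rule up,
    /// rejections push it down, and undecided states contribute nothing.
    /// The magnitudes are illustrative assumptions, not tuned values.
    pub fn signal(self) -> f64 {
        match self {
            CommentOutcome::Accepted | CommentOutcome::Addressed | CommentOutcome::AutoFixed => 1.0,
            CommentOutcome::Rejected => -1.0,
            CommentOutcome::New | CommentOutcome::Stale => 0.0,
        }
    }
}

/// Exponential decay (item 6): an event loses half its weight every
/// `half_life_days`.
pub fn decay_weight(age_days: f64, half_life_days: f64) -> f64 {
    0.5_f64.powf(age_days / half_life_days)
}

/// Decay-weighted reinforcement score for one rule from `(outcome, age_days)` pairs.
pub fn rule_score(events: &[(CommentOutcome, f64)], half_life_days: f64) -> f64 {
    events
        .iter()
        .map(|&(outcome, age_days)| outcome.signal() * decay_weight(age_days, half_life_days))
        .sum()
}
```

A half-life keeps the decay knob interpretable: with a 30-day half-life, a rejection from a month ago carries half the weight of one from today.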
| 35 | + |
| 36 | +## 2. Review Lifecycle and Merge Readiness |
| 37 | + |
| 38 | +11. [ ] Track unresolved vs resolved findings for PR reviews as a first-class lifecycle state. |
| 39 | +12. [ ] Add review completeness metrics: total findings, acknowledged findings, fixed findings, stale findings. |
| 40 | +13. [ ] Compute merge-readiness summaries for GitHub PR reviews using severity, unresolved count, and verification state. |
| 41 | +14. [ ] Add stale-review detection when new commits land after the latest completed review. |
| 42 | +15. [ ] Show "needs re-review" state in review detail and history pages for incremental PR workflows. |
| 43 | +16. [ ] Distinguish informational findings from blocking findings in lifecycle and readiness calculations. |
| 44 | +17. [ ] Add "critical blockers" summary cards for unresolved `Error` and `Warning` comments. |
| 45 | +18. [ ] Add per-PR readiness timelines showing when a review became mergeable. |
| 46 | +19. [ ] Store resolution timestamps for findings so mean-time-to-fix can be measured. |
| 47 | +20. [ ] Add CLI and API surfaces to query PR readiness without opening the web UI. |
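Items 13, 14, 16, and 17 combine into a single readiness decision. The sketch below assumes a hypothetical `Finding` with a severity and a resolved flag, treats unresolved `Error` and `Warning` findings as blockers (per item 17), and uses a head-SHA comparison for stale-review detection. `Info` stands in for whatever non-blocking severities DiffScope actually uses; none of these types are existing DiffScope APIs.

```rust
/// Hypothetical severity scale; only `Error` and `Warning` block merges here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Info,
    Warning,
    Error,
}

/// Hypothetical finding shape, reduced to what readiness needs.
#[derive(Debug, Clone)]
pub struct Finding {
    pub severity: Severity,
    pub resolved: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Readiness {
    /// New commits landed after the last completed review (item 14).
    NeedsReReview,
    /// Unresolved blocking findings remain (items 16 and 17).
    Blocked { blockers: usize },
    /// No blockers, but verification has not confirmed the review yet.
    Unverified,
    Ready,
}

pub fn merge_readiness(
    findings: &[Finding],
    reviewed_head: &str,
    current_head: &str,
    verified: bool,
) -> Readiness {
    // A review of an older head says nothing reliable about the current one.
    if reviewed_head != current_head {
        return Readiness::NeedsReReview;
    }
    let blockers = findings
        .iter()
        .filter(|f| !f.resolved && matches!(f.severity, Severity::Error | Severity::Warning))
        .count();
    if blockers > 0 {
        Readiness::Blocked { blockers }
    } else if !verified {
        Readiness::Unverified
    } else {
        Readiness::Ready
    }
}
```

Checking staleness first means a PR is never reported as ready on the strength of a review that predates its latest commits.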
| 48 | + |
| 49 | +## 3. Agentic Validation Loops |
| 50 | + |
| 51 | +21. [ ] Build a first-class `fix until clean` loop that can run review, apply fixes, rerun review, and stop on convergence. |
| 52 | +22. [ ] Reuse the existing DAG runtime to model iterative review/fix loops as resumable workflow nodes. |
| 53 | +23. [ ] Add a max-iteration policy and loop budget controls for autonomous review convergence. |
| 54 | +24. [ ] Add "issue replay" prompts that hand unresolved findings back to a coding agent with file-local context. |
| 55 | +25. [ ] Add a handoff contract from reviewer findings to fix agents with rule IDs, evidence, and suggested diffs. |
| 56 | +26. [ ] Persist loop-level telemetry: iterations, fixes attempted, findings cleared, findings reopened. |
| 57 | +27. [ ] Add "challenge the finding" verification loops where a validator tries to falsify a suspected issue before keeping it. |
| 58 | +28. [ ] Add caching between iterations so repeated codebase retrieval and verification runs are cheaper. |
| 59 | +29. [ ] Allow loop policies to differ by profile: conservative auditor, high-autonomy fixer, or report-only. |
| 60 | +30. [ ] Add eval fixtures specifically for loop convergence and reopened-issue regressions. |
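A bounded `fix until clean` loop (items 21 and 23) needs three stop conditions: no findings left, no progress between rounds, and an exhausted iteration budget. This is a sketch under assumptions: `review` and `fix` are caller-supplied closures standing in for the real review pipeline and fix agent, findings are plain strings, and "progress" is measured only by the finding count, which ignores reopened issues (item 26).

```rust
/// How a bounded review/fix loop ended. `iterations` counts review passes.
#[derive(Debug, PartialEq, Eq)]
pub enum LoopOutcome {
    /// A review pass came back with no findings.
    Clean { iterations: usize },
    /// A fix round did not reduce the finding count, so more rounds are unlikely to help.
    Stalled { iterations: usize, remaining: usize },
    /// The review-pass budget ran out while findings were still shrinking.
    BudgetExhausted { iterations: usize, remaining: usize },
}

/// Runs review -> fix -> review. `max_iterations` caps review passes (at least one);
/// every pass except the last is followed by a fix round.
/// `review` and `fix` are placeholders for the real pipeline and fix agent.
pub fn fix_until_clean<S, R, F>(
    state: &mut S,
    max_iterations: usize,
    mut review: R,
    mut fix: F,
) -> LoopOutcome
where
    R: FnMut(&S) -> Vec<String>,
    F: FnMut(&mut S, &[String]),
{
    let max_iterations = max_iterations.max(1);
    let mut previous: Option<usize> = None;
    let mut iteration = 0;
    loop {
        iteration += 1;
        let findings = review(&*state);
        let remaining = findings.len();
        if remaining == 0 {
            return LoopOutcome::Clean { iterations: iteration };
        }
        // Stop early when the last fix round made no progress.
        if matches!(previous, Some(prev) if remaining >= prev) {
            return LoopOutcome::Stalled { iterations: iteration, remaining };
        }
        if iteration == max_iterations {
            return LoopOutcome::BudgetExhausted { iterations: iteration, remaining };
        }
        previous = Some(remaining);
        fix(&mut *state, &findings);
    }
}
```

Treating "no progress" as a stop condition keeps an autonomous fixer from burning its whole budget on findings it cannot clear; per-profile policies (item 29) could tune `max_iterations` or swap in a stricter progress test.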
| 61 | + |
| 62 | +## 4. Code Graph and Repository Intelligence |
| 63 | + |
| 64 | +31. [ ] Turn the current symbol graph into a persisted repository graph with durable storage and reload support. |
| 65 | +32. [ ] Add caller/callee expansion APIs for multi-hop impact analysis from changed symbols. |
| 66 | +33. [ ] Add contract edges between interfaces, implementations, and API endpoints. |
| 67 | +34. [ ] Add "similar implementation" lookup so repeated patterns and divergences are explicit. |
| 68 | +35. [ ] Add cross-file blast-radius summaries to findings when a change affects many callers. |
| 69 | +36. [ ] Add graph freshness/version metadata so reviews know whether they are using stale repository intelligence. |
| 70 | +37. [ ] Add graph-backed ranking of related files before semantic RAG retrieval. |
| 71 | +38. [ ] Add graph query traces to `dag_traces` or review artifacts for explainability and debugging. |
| 72 | +39. [ ] Add graph-aware eval fixtures that require multi-hop code understanding to pass. |
| 73 | +40. [ ] Split `src/core/symbol_graph.rs` into construction, persistence, traversal, and ranking modules as it grows. |
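Multi-hop impact analysis (items 32 and 35) is essentially a breadth-first walk over reverse call edges. The `CallGraph` type here is hypothetical and unrelated to the existing `src/core/symbol_graph.rs` internals; it only shows the traversal, with a hop limit and a visited set so cycles terminate.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical reverse call graph: callee name -> names of its callers.
#[derive(Debug, Default)]
pub struct CallGraph {
    callers: HashMap<String, Vec<String>>,
}

impl CallGraph {
    /// Records that `caller` invokes `callee`.
    pub fn add_call(&mut self, caller: &str, callee: &str) {
        self.callers
            .entry(callee.to_string())
            .or_default()
            .push(caller.to_string());
    }

    /// Breadth-first expansion of transitive callers of the changed symbols,
    /// up to `max_hops`. Returns each affected symbol with its hop distance;
    /// the visited set keeps cycles from looping forever.
    pub fn blast_radius(&self, changed: &[&str], max_hops: usize) -> Vec<(String, usize)> {
        let mut seen: HashSet<String> = changed.iter().map(|s| s.to_string()).collect();
        let mut queue: VecDeque<(String, usize)> =
            changed.iter().map(|s| (s.to_string(), 0)).collect();
        let mut affected = Vec::new();
        while let Some((symbol, hops)) = queue.pop_front() {
            if hops == max_hops {
                continue;
            }
            for caller in self.callers.get(&symbol).into_iter().flatten() {
                if seen.insert(caller.clone()) {
                    affected.push((caller.clone(), hops + 1));
                    queue.push_back((caller.clone(), hops + 1));
                }
            }
        }
        affected
    }
}
```

Returning the hop distance lets a finding distinguish direct callers from transitive ones instead of reporting a flat count.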
| 74 | + |
| 75 | +## 5. External Context and Pattern Repositories |
| 76 | + |
| 77 | +41. [x] Surface pattern repository sources in the Settings UI with validation and defaults. |
| 78 | +42. [x] Surface review rule file sources in the Settings UI instead of requiring config edits by hand. |
| 79 | +43. [ ] Add structured UI editing for custom context notes, files, and scopes. |
| 80 | +44. [ ] Add per-path scoped review instructions in the Settings UI for common repo areas. |
| 81 | +45. [ ] Support Jira/Linear issue context ingestion for PR-linked reviews. |
| 82 | +46. [ ] Support document-backed context ingestion for design docs, RFCs, and runbooks. |
| 83 | +47. [ ] Add explicit "intent mismatch" review checks comparing PR changes to ticket acceptance criteria. |
| 84 | +48. [ ] Add review artifacts that show which external context sources influenced a finding. |
| 85 | +49. [ ] Add tests for pattern repository resolution across local paths, Git URLs, and broken sources. |
| 86 | +50. [ ] Add analytics on which context sources actually improve acceptance and fix rates. |
| 87 | + |
| 88 | +## 6. Review UX and Workflow Integration |
| 89 | + |
| 90 | +51. [ ] Add visible accepted/rejected/dismissed badges to comments throughout the UI, not just icon state. |
| 91 | +52. [ ] Add comment grouping by unresolved, fixed, stale, and informational sections in `ReviewView`. |
| 92 | +53. [ ] Add a "show only blockers" mode for large reviews. |
| 93 | +54. [ ] Add keyboard actions for thumbs, resolve, and jump-to-next-finding workflows. |
| 94 | +55. [ ] Add file-level readiness summaries in the diff sidebar. |
| 95 | +56. [ ] Add lifecycle-aware PR summaries that explain what still blocks merge. |
| 96 | +57. [ ] Add a "train the reviewer" callout when thumbs coverage on a review is low. |
| 97 | +58. [ ] Add review-change comparisons so users can diff one review run against the next on the same PR. |
| 98 | +59. [ ] Label incremental PR reviews clearly so users know when only the delta since the previous review was examined. |
| 99 | +60. [ ] Add discussion workflows that can convert repeated human comments into candidate rules or context snippets. |
| 100 | + |
| 101 | +## 7. Analytics, Reporting, and Quality Dashboards |
| 102 | + |
| 103 | +61. [x] Add feedback coverage metrics: percent of findings with thumbs or explicit disposition. |
| 104 | +62. [x] Add acceptance/rejection trend lines over time for recent reviews. |
| 105 | +63. [x] Add top accepted categories/rules and top rejected categories/rules to Analytics. |
| 106 | +64. [ ] Add unresolved blocker counts per repository and per PR. |
| 107 | +65. [ ] Add review completeness and mean-time-to-resolution charts. |
| 108 | +66. [ ] Add feedback-learning effectiveness metrics: did reranked findings get higher acceptance after rollout? |
| 109 | +67. [ ] Add pattern-repository utilization analytics showing when extra context actually affected findings. |
| 110 | +68. [ ] Add eval-vs-production dashboards comparing benchmark strength against real-world acceptance. |
| 111 | +69. [ ] Add drill-downs from trend charts directly into the affected reviews, findings, and rules. |
| 112 | +70. [ ] Add exportable JSON/CSV reports for review quality, lifecycle, and reinforcement metrics. |
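Feedback coverage (item 61) and mean-time-to-resolution (item 65) reduce to simple aggregates once findings carry a feedback flag and a resolution timestamp (item 19). This sketch assumes a hypothetical `FindingRecord` with Unix-second timestamps; it is one way to compute the numbers, not necessarily how the shipped Analytics coverage metric works.

```rust
/// Hypothetical per-finding record; timestamps are Unix seconds.
#[derive(Debug, Clone)]
pub struct FindingRecord {
    pub created_at: u64,
    pub resolved_at: Option<u64>,
    /// True when the finding has a thumbs reaction or an explicit disposition.
    pub has_feedback: bool,
}

/// Fraction of findings with any feedback (0.0 when there are no findings).
pub fn feedback_coverage(findings: &[FindingRecord]) -> f64 {
    if findings.is_empty() {
        return 0.0;
    }
    let covered = findings.iter().filter(|f| f.has_feedback).count();
    covered as f64 / findings.len() as f64
}

/// Mean seconds from creation to resolution over resolved findings only;
/// `None` when nothing has been resolved yet.
pub fn mean_time_to_resolution_secs(findings: &[FindingRecord]) -> Option<f64> {
    let durations: Vec<u64> = findings
        .iter()
        .filter_map(|f| f.resolved_at.map(|r| r.saturating_sub(f.created_at)))
        .collect();
    if durations.is_empty() {
        return None;
    }
    let total: u64 = durations.iter().sum();
    Some(total as f64 / durations.len() as f64)
}
```

Returning `None` rather than `0.0` for an unresolved-only set keeps charts from implying instant fixes.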
| 113 | + |
| 114 | +## 8. APIs, Automation, and MCP-Like Surfaces |
| 115 | + |
| 116 | +71. [ ] Expose unresolved/resolved comment search through the HTTP API. |
| 117 | +72. [ ] Expose PR readiness through the HTTP API for CI and agent integrations. |
| 118 | +73. [ ] Add API endpoints to fetch learned rules, attention gaps, and top rejected patterns. |
| 119 | +74. [ ] Add machine-friendly APIs to fetch findings grouped by severity, file, and lifecycle state. |
| 120 | +75. [ ] Add a "trigger re-review" API that reuses existing PR metadata and loop policy. |
| 121 | +76. [ ] Add APIs for comment resolution and lifecycle updates, not just thumbs. |
| 122 | +77. [ ] Add an MCP server for DiffScope with review, analytics, and rule-management tools. |
| 123 | +78. [ ] Add reusable agent skills/workflows for checking PR readiness and running fix loops. |
| 124 | +79. [ ] Add signed webhook or event-stream integration for downstream automation consumers. |
| 125 | +80. [ ] Add rate-limited API auth and audit trails for automation-heavy deployments. |
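For the rate-limiting half of item 80, a token bucket per API key is a common choice. The sketch below is an assumption, not an existing DiffScope component: `TokenBucket` is hypothetical, and time is passed in explicitly as seconds so the policy is deterministic and testable. A server would feed it a monotonic clock and keep one bucket per authenticated key.

```rust
/// Hypothetical token-bucket limiter for one API key.
#[derive(Debug, Clone)]
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill_secs: f64,
}

impl TokenBucket {
    /// Starts full so a new client can burst up to `capacity` requests.
    pub fn new(capacity: f64, refill_per_sec: f64, now_secs: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last_refill_secs: now_secs }
    }

    /// Refills for the elapsed time, then spends one token if available.
    /// Returns false when the caller should receive HTTP 429 Too Many Requests.
    pub fn try_acquire(&mut self, now_secs: f64) -> bool {
        // Clamp at zero so a clock that steps backwards cannot drain tokens.
        let elapsed = (now_secs - self.last_refill_secs).max(0.0);
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill_secs = now_secs;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```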
| 126 | + |
| 127 | +## 9. Infra, Self-Hosting, and Enterprise Operations |
| 128 | + |
| 129 | +81. [ ] Split `src/server/api.rs` by domain so the growing platform API stays maintainable. |
| 130 | +82. [ ] Split `src/server/state.rs` into session lifecycle, persistence, progress, and GitHub coordination modules. |
| 131 | +83. [ ] Add queue depth and worker saturation metrics for long-running review and eval jobs. |
| 132 | +84. [ ] Add retention policies for review artifacts, eval artifacts, and trend histories. |
| 133 | +85. [ ] Add storage migrations for richer comment lifecycle and reinforcement schemas. |
| 134 | +86. [ ] Add deployment docs for self-hosted setups that combine review, analytics, and trend retention. |
| 135 | +87. [ ] Add secret-management guidance and validation for multi-provider enterprise installs. |
| 136 | +88. [ ] Add background jobs for recomputing analytics after schema or scoring changes. |
| 137 | +89. [ ] Add cost dashboards by provider/model/role for review, verification, and eval workloads. |
| 138 | +90. [ ] Add failure forensics bundles for self-hosted users when review or eval jobs degrade. |
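A retention policy (item 84) usually combines a maximum age with a floor of recent items to keep. In this hypothetical sketch, `Artifact` and `artifacts_to_prune` are illustrative names, and the rule "keep the newest N, then delete anything older than the cutoff" is an assumed policy rather than a decided one.

```rust
use std::cmp::Reverse;

/// Hypothetical stored artifact (review, eval, or trend snapshot).
#[derive(Debug, Clone)]
pub struct Artifact {
    pub id: String,
    pub created_at_secs: u64,
}

/// Returns the ids to delete: the `keep_latest` newest artifacts are always
/// kept, and any older artifact past `max_age_secs` is pruned.
pub fn artifacts_to_prune(
    mut artifacts: Vec<Artifact>,
    now_secs: u64,
    max_age_secs: u64,
    keep_latest: usize,
) -> Vec<String> {
    // Newest first, so `skip` protects the most recent artifacts.
    artifacts.sort_by_key(|a| Reverse(a.created_at_secs));
    artifacts
        .into_iter()
        .skip(keep_latest)
        .filter(|a| now_secs.saturating_sub(a.created_at_secs) > max_age_secs)
        .map(|a| a.id)
        .collect()
}
```

The keep-latest floor means a quiet repository never loses its only trend baseline just because nothing ran recently.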
| 139 | + |
| 140 | +## 10. Eval, Benchmarking, and Model Governance |
| 141 | + |
| 142 | +91. [ ] Add eval fixtures for external-context alignment, not just diff-local correctness. |
| 143 | +92. [ ] Add eval fixtures for merge-readiness judgments and unresolved-blocker classification. |
| 144 | +93. [ ] Add eval fixtures for addressed-vs-stale finding lifecycle inference. |
| 145 | +94. [ ] Add eval fixtures for multi-hop graph reasoning across call chains and contract edges. |
| 146 | +95. [ ] Add eval runs that compare single-pass review against agentic loop review. |
| 147 | +96. [ ] Add production replay evals using anonymized accepted/rejected review outcomes. |
| 148 | +97. [ ] Add leaderboard reporting for reviewer usefulness metrics, not just precision/recall. |
| 149 | +98. [ ] Add regression gates for feedback coverage, verifier health, and lifecycle-state accuracy. |
| 150 | +99. [ ] Add model-routing policies that explicitly separate generation, verification, and auditing roles. |
| 151 | +100. [ ] Publish a repeatable "independent auditor" benchmark story in the UI and CLI so DiffScope's differentiation is measurable. |
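Regression gates (item 98) can compare each tracked metric against a stored baseline and fail when it drops by more than a tolerance. The metric names in the usage below are hypothetical, and the single absolute `max_drop` is a simplifying assumption; real gates might need per-metric tolerances or minimum sample sizes.

```rust
use std::collections::BTreeMap;

/// Compares each baseline metric (higher is better) against the current run.
/// Returns one message per failing metric; an empty vec means the gate passes.
pub fn regression_gate(
    baseline: &BTreeMap<&str, f64>,
    current: &BTreeMap<&str, f64>,
    max_drop: f64,
) -> Vec<String> {
    let mut failures = Vec::new();
    for (metric, &base) in baseline {
        match current.get(*metric) {
            // A metric that silently disappears should fail loudly.
            None => failures.push(format!("{metric}: missing from current run")),
            Some(&now) if base - now > max_drop => {
                failures.push(format!("{metric}: dropped from {base:.3} to {now:.3}"))
            }
            Some(_) => {}
        }
    }
    failures
}
```

For example, with a baseline of `feedback_coverage = 0.80` and `verifier_health = 0.95`, a run at `0.78` and `0.85` with `max_drop = 0.05` fails only on `verifier_health`.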
| 152 | + |
| 153 | +## Current Execution Slice |
| 154 | + |
| 155 | +- [x] Rewrite this roadmap into the active backlog and keep it updated as slices ship. |
| 156 | +- [x] Productize the learning loop in Analytics with reaction coverage and acceptance trends. |
| 157 | +- [x] Surface repository rule sources and pattern repository sources in Settings. |
| 158 | +- [ ] Commit and push each validated checkpoint before moving to the next epic. |