| 1 | +# Plan: Improving the Complexity Prompt with Human-Labeled PRs |
| 2 | + |
| 3 | +This document outlines a plan for tuning the complexity-labeling prompt to your company over time using human-labeled PRs, in line with the "Human labeled. AI trained." approach (for example, the GEPA optimizer in DSPy). |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Phase 1: Collect Human Labels (Foundation) |
| 8 | + |
| 9 | +### 1.1 Define the labeling format |
| 10 | + |
| 11 | +- Use the same 1–10 scale and JSON output as the current prompt. |
| 12 | +- Add a `human_complexity` column (and optionally `human_explanation`) to distinguish human labels from LLM output. |
| 13 | + |
| 14 | +### 1.2 Create a labeling dataset |
| 15 | + |
| 16 | +- **Option A — CSV with human override column** |
| 17 | + Add `human_complexity` to the schema. When present, treat it as the ground truth for evaluation and training. |
| 18 | + |
| 19 | +- **Option B — Separate labeling file** |
| 20 | + Maintain `human_labels.csv` with: `pr_url`, `human_complexity`, `human_explanation`, `labeler`, `labeled_at`. |
| 21 | + |
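As a minimal sketch of Option B, the snippet below joins a separate labels file onto the report by `pr_url` and treats the human score as ground truth. It assumes the report also carries a `pr_url` column and an LLM score column named `complexity`; the function name `attach_human_labels` and the sample rows are made up for illustration.

```python
import csv
import io


def attach_human_labels(report_rows, label_rows):
    """Return only the report rows that have a human label, with the label attached."""
    labels = {}
    for row in label_rows:
        score = int(row["human_complexity"])
        if not 1 <= score <= 10:
            raise ValueError(f"human_complexity out of range for {row['pr_url']}: {score}")
        # If a PR was labeled twice, the later row wins (an assumption of this sketch).
        labels[row["pr_url"]] = score
    merged = []
    for row in report_rows:
        if row["pr_url"] in labels:
            merged.append({**row, "human_complexity": labels[row["pr_url"]]})
    return merged


# In-memory stand-ins for complexity-report.csv and human_labels.csv (illustrative data).
report_csv = "pr_url,complexity\nhttps://example.com/pr/1,4\nhttps://example.com/pr/2,7\n"
labels_csv = (
    "pr_url,human_complexity,human_explanation,labeler,labeled_at\n"
    "https://example.com/pr/2,6,touches two services,alice,2024-01-15\n"
)
report_rows = list(csv.DictReader(io.StringIO(report_csv)))
label_rows = list(csv.DictReader(io.StringIO(labels_csv)))
merged = attach_human_labels(report_rows, label_rows)
```

Keeping labels in their own file means the report can be regenerated at any time without overwriting reviewer input.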
| 22 | +### 1.3 Labeling workflow |
| 23 | + |
| 24 | +- Sample PRs from `complexity-report.csv` (e.g., stratified by current LLM score and team). |
| 25 | +- Use a simple UI (spreadsheet, internal tool, or lightweight Streamlit app) where reviewers see the PR diff + title and assign 1–10. |
| 26 | +- Aim for **~500–1000 PRs** for initial calibration; **2000+** for GEPA-style optimization. |
| 27 | +- Have at least 2 reviewers on a subset (e.g., 10–20%) to measure inter-rater agreement. |
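One way to quantify agreement on the double-labeled subset is Cohen's kappa. The sketch below uses the quadratically weighted variant, which penalizes a 2-vs-9 disagreement more than a 5-vs-6 one; that choice is an assumption, not something this plan prescribes.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=10):
    """Quadratically weighted Cohen's kappa for two raters scoring the same PRs."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of PRs")
    n = len(rater_a)
    k = max_score - min_score + 1
    # observed[i][j] counts PRs scored i by rater A and j by rater B.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters scored independently.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0:
        return 1.0  # both raters gave the same single score to every PR
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([1, 5, 10, 3], [1, 5, 10, 3])
opposite = quadratic_weighted_kappa([1, 10], [10, 1])
```

A value near 1 means the reviewers largely agree; a value near 0 means agreement is no better than chance, which suggests the labeling guidelines need tightening before the labels are used for training.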
| 28 | + |
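The stratified sampling step could look roughly like the sketch below. It assumes each report row has a `complexity` score and a `team` field (the `team` column name is a guess), and groups scores into two-point bands so that rare high-complexity PRs are not crowded out.

```python
import random
from collections import defaultdict


def score_band(score):
    """Map a 1-10 score to its two-point band label, e.g. 7 -> '7-8'."""
    low = 2 * ((score - 1) // 2) + 1
    return f"{low}-{low + 1}"


def stratified_sample(rows, per_stratum, seed=0):
    """Pick up to `per_stratum` rows from every (score band, team) group."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for row in rows:
        strata[(score_band(int(row["complexity"])), row["team"])].append(row)
    sample = []
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Illustrative rows only; real rows would come from complexity-report.csv.
rows = [
    {"pr_url": f"pr-{i}", "complexity": 1 + i % 10, "team": "data" if i % 2 else "platform"}
    for i in range(40)
]
sample = stratified_sample(rows, per_stratum=2)
```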
| 29 | +--- |
| 30 | + |
| 31 | +## Phase 2: Evaluate Current Prompt |
| 32 | + |
| 33 | +### 2.1 Metrics |
| 34 | + |
| 35 | +| Metric | Purpose | |
| 36 | +|--------|---------| |
| 37 | +| **MAE** (Mean Absolute Error) | Overall score accuracy | |
| 38 | +| **Within-1 agreement** | % of PRs where AI is within ±1 of human | |
| 39 | +| **Exact match** | % of exact score matches | |
| 40 | +| **Per-band accuracy** | Performance by band (1–2, 3–4, 5–6, 7–8, 9–10) | |
| 41 | + |
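The metrics in the table can be computed in a few lines. In this sketch, per-band accuracy is interpreted as the within-1 rate among PRs whose human score falls in each band; that interpretation is an assumption, since the table does not pin it down.

```python
from collections import defaultdict


def score_band(score):
    """Map a 1-10 score to its two-point band label, e.g. 4 -> '3-4'."""
    low = 2 * ((score - 1) // 2) + 1
    return f"{low}-{low + 1}"


def evaluate(pairs):
    """Compute the metrics for a list of (ai_score, human_score) pairs."""
    if not pairs:
        raise ValueError("need at least one labeled PR")
    errors = [abs(ai - human) for ai, human in pairs]
    n = len(errors)
    per_band = defaultdict(list)
    for (_, human), err in zip(pairs, errors):
        per_band[score_band(human)].append(err <= 1)
    return {
        "mae": sum(errors) / n,
        "within_1": sum(e <= 1 for e in errors) / n,
        "exact": sum(e == 0 for e in errors) / n,
        "per_band_within_1": {
            band: sum(hits) / len(hits) for band, hits in sorted(per_band.items())
        },
    }


pairs = [(3, 3), (5, 4), (8, 5), (2, 2)]  # illustrative scores
report = evaluate(pairs)
```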
| 42 | +### 2.2 Baseline run |
| 43 | + |
| 44 | +```bash |
| 45 | +# Run current prompt on human-labeled PRs, compare to human_complexity |
| 46 | +complexity-cli batch-analyze --input-file human_labeled_prs.txt --output baseline_eval.csv |
| 47 | +# Then compute metrics: MAE, within-1, exact match |
| 48 | +``` |
| 49 | + |
| 50 | +### 2.3 Error analysis |
| 51 | + |
| 52 | +- Identify systematic biases (e.g., over-scoring infra PRs, under-scoring migrations). |
| 53 | +- Group by: team, repo, PR type (feat/fix/refactor), lines changed. |
| 54 | +- Use this to decide what to fix in the prompt or via optimization. |
| 55 | + |
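Systematic bias shows up in the signed error rather than the absolute error. A minimal sketch, assuming each evaluated row carries `complexity`, `human_complexity`, and whatever grouping field you choose (here a hypothetical `team` column):

```python
from collections import defaultdict


def signed_bias_by_group(rows, key):
    """Mean of (AI score - human score) per group; positive means over-scoring."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of signed errors, count]
    for row in rows:
        error = int(row["complexity"]) - int(row["human_complexity"])
        totals[row[key]][0] += error
        totals[row[key]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


# Illustrative rows only.
rows = [
    {"team": "infra", "complexity": 7, "human_complexity": 5},
    {"team": "infra", "complexity": 6, "human_complexity": 5},
    {"team": "data", "complexity": 3, "human_complexity": 5},
]
bias = signed_bias_by_group(rows, "team")
```

The same function works for repo, PR type, or a lines-changed bucket by passing a different `key`.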
| 56 | +--- |
| 57 | + |
| 58 | +## Phase 3: Iterate (Two Paths) |
| 59 | + |
| 60 | +### Path A: Manual Prompt Iteration (Simpler, No New Dependencies) |
| 61 | + |
| 62 | +1. Add **company-specific guidance** to `default.txt` based on error analysis (e.g., "Rivery data pipeline changes: add +1 when multiple sources/targets are involved"). |
| 63 | +2. Re-run evaluation and compare metrics. |
| 64 | +3. Repeat until MAE and within-1 agreement are acceptable. |
| 65 | +4. Bump `Prompt-Version` in the prompt file when you make changes. |
| 66 | + |
| 67 | +**Pros:** No new tools, fits current architecture. |
| 68 | +**Cons:** Manual, slower, no automatic prompt search. |
| 69 | + |
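When comparing a revised prompt against the previous one in Path A, it helps to score both on the same labeled PRs and to look at how many individual PRs moved closer to or further from the human label. A rough sketch (the function name and return keys are illustrative):

```python
def compare_prompts(human, baseline, candidate):
    """Compare two prompt versions scored on the same human-labeled PRs."""
    if not human or not (len(human) == len(baseline) == len(candidate)):
        raise ValueError("all three score lists must be non-empty and the same length")

    def mae(scores):
        return sum(abs(s - h) for s, h in zip(scores, human)) / len(human)

    def within_one(scores):
        return sum(abs(s - h) <= 1 for s, h in zip(scores, human)) / len(human)

    triples = list(zip(human, baseline, candidate))
    return {
        "mae_delta": mae(candidate) - mae(baseline),  # negative is better
        "within_1_delta": within_one(candidate) - within_one(baseline),  # positive is better
        "improved": sum(abs(c - h) < abs(b - h) for h, b, c in triples),
        "worsened": sum(abs(c - h) > abs(b - h) for h, b, c in triples),
    }


# Illustrative scores only.
result = compare_prompts(human=[5, 3, 8, 2], baseline=[7, 3, 5, 2], candidate=[6, 3, 7, 4])
```

A change that improves the averages while worsening many individual PRs may simply be trading one bias for another, so the `worsened` rows are worth reading before accepting the new prompt.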
| 70 | +--- |
| 71 | + |
| 72 | +### Path B: DSPy + GEPA (Automated Optimization) |
| 73 | + |
| 74 | +1. **Add DSPy** to the project and define a signature: |
| 75 | + |
| 76 | +```python |
| 77 | +class PRComplexity(dspy.Signature):  # assumes `import dspy` earlier in the module |
| 78 | +    """Estimate implementation complexity of a PR on a 1-10 scale.""" |
| 79 | +    diff_excerpt: str = dspy.InputField(desc="PR diff and metadata") |
| 80 | +    stats_json: str = dspy.InputField(desc="Additions, deletions, file counts") |
| 81 | +    title: str = dspy.InputField() |
| 82 | +    complexity: int = dspy.OutputField(desc="1-10 integer") |
| 83 | +    explanation: str = dspy.OutputField(desc="Short rationale") |
| 84 | +``` |
| 85 | + |
| 86 | +2. **Build a training set** in DSPy format: |
| 87 | + |
| 88 | +```python |
| 89 | +# Convert human_labels.csv to dspy.Example with inputs + complexity as label |
| 90 | +trainset = [dspy.Example(pr_url=..., diff_excerpt=..., stats_json=..., title=..., complexity=human_score).with_inputs("diff_excerpt", "stats_json", "title")] |
| 91 | +``` |
| 92 | + |
| 93 | +3. **Define a metric** (e.g., negative MAE or within-1 agreement). |
| 94 | +4. **Run GEPA** to optimize the prompt: |
| 95 | + |
| 96 | +```python |
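| 97 | +optimizer = dspy.GEPA(metric=lambda gold, pred, *args, **kwargs: -abs(int(pred.complexity) - int(gold.complexity)), auto="light", reflection_lm=reflection_lm)  # gold comes first; reflection_lm is a dspy.LM you configure |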
| 98 | +optimized = optimizer.compile(complexity_module, trainset=trainset) |
| 99 | +``` |
| 100 | + |
| 101 | +5. **Extract the optimized prompt** from the compiled module and save it as your new `default.txt` (or a company-specific prompt file). |
| 102 | + |
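The metric from step 3 can be sketched without any DSPy import, since it only reads a `complexity` attribute from the gold example and the prediction. The five-argument form below is what GEPA is understood to call; treat the extra parameters as an assumption to check against your DSPy version. Unparseable model output scores 0 so that malformed answers are penalized rather than crashing the run.

```python
from types import SimpleNamespace


def _parse_score(pred):
    """Return the predicted complexity as an int, or None if it cannot be parsed."""
    try:
        return int(pred.complexity)
    except (AttributeError, TypeError, ValueError):
        return None


def within_one_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """1.0 if the prediction is within 1 point of the human label, else 0.0."""
    predicted = _parse_score(pred)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - int(gold.complexity)) <= 1 else 0.0


def scaled_error_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """1.0 for an exact match, falling linearly to 0.0 at the maximum error of 9."""
    predicted = _parse_score(pred)
    if predicted is None:
        return 0.0
    return max(0.0, 1.0 - abs(predicted - int(gold.complexity)) / 9)


# Stand-ins for a dspy.Example and a dspy.Prediction.
gold = SimpleNamespace(complexity=5)
close = SimpleNamespace(complexity="6")
far = SimpleNamespace(complexity=8)
broken = SimpleNamespace(complexity="n/a")
```

Scaling to the 0 to 1 range keeps "higher is better" semantics, which is equivalent to negative absolute error for ranking purposes but easier to read in optimizer logs.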
| 103 | +**Pros:** Automatic prompt search, can improve over time with more data. |
| 104 | +**Cons:** New dependency, setup effort, need to wire diff fetching into DSPy. |
| 105 | + |
| 106 | +--- |
| 107 | + |
| 108 | +## Phase 4: Ongoing Improvement |
| 109 | + |
| 110 | +### 4.1 Continuous labeling |
| 111 | + |
| 112 | +- Label a small batch of new PRs each sprint (e.g., 20–50). |
| 113 | +- Prioritize PRs where AI and human disagree. |
| 114 | +- Add them to the human-labeled set. |
| 115 | + |
| 116 | +### 4.2 Periodic re-evaluation |
| 117 | + |
| 118 | +- Every quarter (or after adding roughly 200 new labels), re-run the evaluation. |
| 119 | +- If metrics degrade, re-optimize (Path B) or refine the prompt (Path A). |
| 120 | + |
| 121 | +### 4.3 Versioning |
| 122 | + |
| 123 | +- Keep `Prompt-Version` in the prompt file. |
| 124 | +- Log which prompt version was used for each analysis (e.g., in CSV or metadata). |
| 125 | + |
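Logging the prompt version alongside each result might look like the sketch below. It assumes the prompt file contains a header line of the form `Prompt-Version: <value>`; the exact header format, the output column name `prompt_version`, and both function names are assumptions.

```python
import csv
import io
import re


def read_prompt_version(prompt_text):
    """Extract the value of a 'Prompt-Version:' header line, or 'unknown' if absent."""
    match = re.search(r"^Prompt-Version:\s*(\S+)", prompt_text, flags=re.MULTILINE)
    return match.group(1) if match else "unknown"


def write_results(rows, prompt_version, out):
    """Write analysis rows as CSV, stamping each with the prompt version used."""
    writer = csv.DictWriter(out, fieldnames=["pr_url", "complexity", "prompt_version"])
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "prompt_version": prompt_version})


# Illustrative prompt text and result row.
prompt_text = "Prompt-Version: 3\nEstimate the complexity of the following PR.\n"
version = read_prompt_version(prompt_text)
buffer = io.StringIO()
write_results([{"pr_url": "https://example.com/pr/1", "complexity": 4}], version, buffer)
lines = buffer.getvalue().splitlines()
```

With the version stored per row, scores produced by different prompt versions can be filtered apart when metrics are recomputed later.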
| 126 | +--- |
| 127 | + |
| 128 | +## Suggested Rollout |
| 129 | + |
| 130 | +| Week | Action | |
| 131 | +|------|--------| |
| 132 | +| 1–2 | Define schema, create `human_labels.csv`, build a simple labeling workflow | |
| 133 | | 3 | Label a first batch of 200–500 PRs, starting with a stratified sample | |
| 134 | +| 4 | Run baseline evaluation, compute MAE and within-1 | |
| 135 | +| 5 | Error analysis: identify biases and failure modes | |
| 136 | +| 6–8 | **Path A:** Add company-specific rules and re-evaluate, **or** **Path B:** Integrate DSPy + GEPA and run first optimization | |
| 137 | +| 9+ | Ongoing labeling and quarterly re-evaluation | |
| 138 | + |
| 139 | +--- |
| 140 | + |
| 141 | +## Quick Wins (No New Infrastructure) |
| 142 | + |
| 143 | +1. **Export a sample for labeling** |
| 144 | + Use `complexity-cli batch-analyze` to produce a CSV, then add a `human_complexity` column in a spreadsheet. |
| 145 | + |
| 146 | +2. **Compare AI vs human** |
| 147 | + Write a small script that computes MAE and within-1 between `complexity` and `human_complexity`. |
| 148 | + |
| 149 | +3. **Add 3–5 company-specific rules** |
| 150 | + Based on your domains (e.g., Rivery, Boomi, data pipelines), add short guidance to `default.txt` and re-run on a sample. |