Commit 4f807f9

Add prompt improvement plan with human-labeled PRs
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 52ec1d4 commit 4f807f9

1 file changed

Lines changed: 150 additions & 0 deletions

# Plan: Improving the Complexity Prompt with Human-Labeled PRs

This document outlines a plan to improve the labeling prompt over time for your specific company using human-labeled PRs, in line with the "Human labeled. AI trained." approach (e.g., the GEPA optimizer in DSPy).

---

## Phase 1: Collect Human Labels (Foundation)

### 1.1 Define the labeling format

- Use the same 1–10 scale and JSON output as the current prompt.
- Add a `human_complexity` column (and optionally `human_explanation`) to distinguish human labels from LLM output.

### 1.2 Create a labeling dataset

- **Option A: CSV with human override column**
  Add `human_complexity` to the schema. When present, treat it as the ground truth for evaluation and training.

- **Option B: Separate labeling file**
  Maintain `human_labels.csv` with: `pr_url`, `human_complexity`, `human_explanation`, `labeler`, `labeled_at`.
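For Option B, the separate file has to be joined back onto the report before evaluation. A minimal sketch of that join, assuming `complexity-report.csv` carries `pr_url` and `complexity` columns (the report's exact schema is not shown here, so treat those names as assumptions) and using tiny in-memory stand-ins for both files:

```python
import csv
import io


def read_csv(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def attach_human_labels(report_rows, label_rows):
    """Return copies of report rows with human_complexity attached (None if unlabeled)."""
    labels = {row["pr_url"]: row for row in label_rows}
    merged = []
    for row in report_rows:
        label = labels.get(row["pr_url"])
        out = dict(row)
        out["human_complexity"] = int(label["human_complexity"]) if label else None
        merged.append(out)
    return merged


# Hypothetical stand-ins for complexity-report.csv and human_labels.csv.
report = read_csv(
    "pr_url,complexity\n"
    "https://example.com/pr/1,4\n"
    "https://example.com/pr/2,7\n"
)
labels = read_csv(
    "pr_url,human_complexity,labeler\n"
    "https://example.com/pr/1,5,alice\n"
)
merged = attach_human_labels(report, labels)
```

Rows without a human label keep `human_complexity = None`, so evaluation code can filter them out explicitly instead of silently treating them as zero.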

### 1.3 Labeling workflow

- Sample PRs from `complexity-report.csv` (e.g., stratified by current LLM score and team).
- Use a simple UI (spreadsheet, internal tool, or lightweight Streamlit app) where reviewers see the PR diff and title and assign a score from 1 to 10.
- Aim for **~500–1000 PRs** for initial calibration; **2000+** for GEPA-style optimization.
- Have at least two reviewers label a shared subset (e.g., 10–20%) to measure inter-rater agreement.
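The stratified sampling step can be sketched as follows. This is an illustration under assumptions: the row fields `complexity` and `team`, the helper names, and the per-stratum quota are all hypothetical, and the score bands follow the 1–2, 3–4, ... grouping used for evaluation later in this plan.

```python
import random
from collections import defaultdict


def band(score):
    """Map a 1-10 score to a band index: 1-2 -> 0, 3-4 -> 1, ..., 9-10 -> 4."""
    return (int(score) - 1) // 2


def stratified_sample(rows, key, per_stratum, seed=0):
    """Draw up to per_stratum rows from each stratum defined by key(row)."""
    rng = random.Random(seed)  # fixed seed so the labeling batch is reproducible
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for name in sorted(strata):
        group = strata[name]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample


# Synthetic rows standing in for complexity-report.csv.
rows = [
    {"pr_url": f"pr-{i}", "complexity": (i % 10) + 1, "team": "a" if i % 2 else "b"}
    for i in range(100)
]
sample = stratified_sample(
    rows, key=lambda r: (band(r["complexity"]), r["team"]), per_stratum=2
)
```

Stratifying on the current LLM score keeps rare high-complexity PRs from being drowned out by the many low-scoring ones in a uniform random sample.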

---

## Phase 2: Evaluate Current Prompt

### 2.1 Metrics

| Metric | Purpose |
|--------|---------|
| **MAE** (Mean Absolute Error) | Overall score accuracy |
| **Within-1 agreement** | % of PRs where AI is within ±1 of human |
| **Exact match** | % of exact score matches |
| **Per-band accuracy** | Performance by band (1–2, 3–4, 5–6, 7–8, 9–10) |
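The metrics in the table can be computed with a few lines of standard-library Python. The sketch below assumes the inputs are already paired as `(ai_score, human_score)` integers. "Per-band accuracy" is interpreted here as the share of PRs in each human-assigned band where the AI score falls in the same band; that is one reasonable reading, not the only one.

```python
from collections import defaultdict

BANDS = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)]


def band_of(score):
    """Return the band label (e.g. '3-4') that contains the score."""
    for low, high in BANDS:
        if low <= score <= high:
            return f"{low}-{high}"
    raise ValueError(f"score out of range: {score}")


def evaluate(pairs):
    """Compute MAE, within-1, exact match, and per-band accuracy.

    pairs: non-empty list of (ai_score, human_score) integer tuples.
    """
    if not pairs:
        raise ValueError("no labeled pairs to evaluate")
    n = len(pairs)
    hits_by_band = defaultdict(list)
    for ai, human in pairs:
        # Group by the human band; a hit means the AI landed in the same band.
        hits_by_band[band_of(human)].append(band_of(ai) == band_of(human))
    return {
        "mae": sum(abs(ai - human) for ai, human in pairs) / n,
        "within_1": sum(abs(ai - human) <= 1 for ai, human in pairs) / n,
        "exact": sum(ai == human for ai, human in pairs) / n,
        "per_band": {b: sum(h) / len(h) for b, h in hits_by_band.items()},
    }


metrics = evaluate([(3, 3), (5, 4), (8, 5), (2, 2)])
```

For the four example pairs the absolute errors are 0, 1, 3, and 0, so MAE is 1.0, within-1 agreement is 0.75, and exact match is 0.5.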

### 2.2 Baseline run

```bash
# Run current prompt on human-labeled PRs, compare to human_complexity
complexity-cli batch-analyze --input-file human_labeled_prs.txt --output baseline_eval.csv
# Then compute metrics: MAE, within-1, exact match
```

### 2.3 Error analysis

- Identify systematic biases (e.g., over-scoring infra PRs, under-scoring migrations).
- Group by: team, repo, PR type (feat/fix/refactor), lines changed.
- Use this to decide what to fix in the prompt or via optimization.
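One way to surface systematic bias is to look at the signed error (AI minus human) per group: a positive mean suggests over-scoring, a negative mean under-scoring. The sketch below assumes each row already has `complexity`, `human_complexity`, and a grouping field such as `pr_type`; the field names and sample rows are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def signed_error_by_group(rows, group_key):
    """Mean signed error (AI - human) and sample count for each group."""
    errors = defaultdict(list)
    for row in rows:
        errors[row[group_key]].append(row["complexity"] - row["human_complexity"])
    return {
        group: {"n": len(values), "mean_signed_error": mean(values)}
        for group, values in errors.items()
    }


# Made-up rows for illustration.
rows = [
    {"pr_type": "feat", "complexity": 6, "human_complexity": 5},
    {"pr_type": "feat", "complexity": 7, "human_complexity": 5},
    {"pr_type": "fix", "complexity": 2, "human_complexity": 4},
]
bias = signed_error_by_group(rows, "pr_type")
```

Reporting `n` alongside the mean matters: a large bias on a group with three PRs is weak evidence, so small groups should be read with caution before changing the prompt.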

---

## Phase 3: Iterate (Two Paths)

### Path A: Manual Prompt Iteration (Simpler, No New Dependencies)

1. Add **company-specific guidance** to `default.txt` based on error analysis (e.g., "Rivery data pipeline changes: add +1 when multiple sources/targets are involved").
2. Re-run evaluation and compare metrics.
3. Repeat until MAE and within-1 agreement are acceptable.
4. Bump `Prompt-Version` in the prompt file when you make changes.

**Pros:** No new tools, fits current architecture.
**Cons:** Manual, slower, no automatic prompt search.

---

### Path B: DSPy + GEPA (Automated Optimization)

1. **Add DSPy** to the project and define a signature:

   ```python
   import dspy

   class PRComplexity(dspy.Signature):
       """Estimate implementation complexity of a PR on a 1-10 scale."""

       diff_excerpt: str = dspy.InputField(desc="PR diff and metadata")
       stats_json: str = dspy.InputField(desc="Additions, deletions, file counts")
       title: str = dspy.InputField()
       complexity: int = dspy.OutputField(desc="1-10 integer")
       explanation: str = dspy.OutputField(desc="Short rationale")
   ```

2. **Build a training set** in DSPy format:

   ```python
   # Convert human_labels.csv to dspy.Example with inputs + complexity as label.
   # with_inputs() names the signature's input fields; the remaining fields act as labels.
   trainset = [dspy.Example(pr_url=..., diff_excerpt=..., stats_json=..., title=..., complexity=human_score).with_inputs("diff_excerpt", "stats_json", "title")]
   ```
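
3. **Define a metric** (e.g., negative MAE or within-1 agreement).
4. **Run GEPA** to optimize the prompt:

   ```python
   # DSPy metrics take (gold, pred, ...); GEPA also passes trace, pred_name, pred_trace. Check your DSPy version.
   metric = lambda gold, pred, trace=None, pred_name=None, pred_trace=None: -abs(pred.complexity - gold.complexity)
   optimizer = dspy.GEPA(metric=metric, auto="light", reflection_lm=reflection_lm)  # needs a budget and a reflection LM (a dspy.LM you configure)
   optimized = optimizer.compile(complexity_module, trainset=trainset)
   ```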

5. **Extract the optimized prompt** from the compiled module and save it as your new `default.txt` (or a company-specific prompt file).

**Pros:** Automatic prompt search, can improve over time with more data.
**Cons:** New dependency, setup effort, need to wire diff fetching into DSPy.
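The within-1 alternative mentioned in step 3 could be written as a standalone function. This is a sketch, not DSPy's own API: it only assumes that `gold` and `pred` expose a `complexity` attribute, and that the optimizer calls the metric with the gold example first. The extra keyword parameters mirror what GEPA is understood to pass, which should be verified against the installed DSPy version. `SimpleNamespace` objects stand in for DSPy examples and predictions so the snippet runs without DSPy installed.

```python
from types import SimpleNamespace


def within_one_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Return 1.0 if the predicted score is within ±1 of the human label, else 0.0."""
    try:
        predicted = int(pred.complexity)
    except (TypeError, ValueError):
        return 0.0  # unparseable model output counts as a miss
    return 1.0 if abs(predicted - int(gold.complexity)) <= 1 else 0.0


# Stand-ins for a dspy.Example (gold) and a dspy.Prediction (pred).
gold = SimpleNamespace(complexity=5)
close = within_one_metric(gold, SimpleNamespace(complexity="6"))
far = within_one_metric(gold, SimpleNamespace(complexity=8))
garbled = within_one_metric(gold, SimpleNamespace(complexity="n/a"))
```

Compared with negative MAE, a 0/1 score stays in a bounded range and directly matches the within-1 agreement metric from Phase 2, at the cost of ignoring how far off a miss is.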

---

## Phase 4: Ongoing Improvement

### 4.1 Continuous labeling

- Label a small batch of new PRs each sprint (e.g., 20–50).
- Prioritize PRs where AI and human disagree.
- Add them to the human-labeled set.

### 4.2 Periodic re-evaluation

- Every quarter (or when you add ~200+ new labels), re-run evaluation.
- If metrics degrade, re-optimize (Path B) or refine the prompt (Path A).

### 4.3 Versioning

- Keep `Prompt-Version` in the prompt file.
- Log which prompt version was used for each analysis (e.g., in CSV or metadata).
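Logging the version per analysis could look like the sketch below. The prompt file's actual header format is not shown in this plan, so the `Prompt-Version: <value>` line pattern, the `prompt_version` column name, and both helper functions are assumptions for illustration.

```python
import csv
import io
import re


def read_prompt_version(prompt_text):
    """Extract the version from a line like 'Prompt-Version: 3' (assumed format)."""
    match = re.search(r"^Prompt-Version:\s*(\S+)", prompt_text, flags=re.MULTILINE)
    return match.group(1) if match else "unknown"


def write_results(rows, prompt_version, out):
    """Write analysis rows to CSV, stamping each with the prompt version used."""
    writer = csv.DictWriter(out, fieldnames=["pr_url", "complexity", "prompt_version"])
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "prompt_version": prompt_version})


# Hypothetical prompt text and a single analysis row.
prompt_text = "Prompt-Version: 3\nYou are estimating PR complexity...\n"
version = read_prompt_version(prompt_text)
buffer = io.StringIO()
write_results([{"pr_url": "https://example.com/pr/1", "complexity": 4}], version, buffer)
csv_text = buffer.getvalue()
```

With the version stored per row, evaluation results from different prompt revisions can be compared without guessing which prompt produced which score.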

---

## Suggested Rollout

| Week | Action |
|------|--------|
| 1–2 | Define schema, create `human_labels.csv`, build a simple labeling workflow |
| 3 | Label 200–500 PRs (start with stratified sample) |
| 4 | Run baseline evaluation, compute MAE and within-1 |
| 5 | Error analysis: identify biases and failure modes |
| 6–8 | **Path A:** Add company-specific rules and re-evaluate, **or** **Path B:** Integrate DSPy + GEPA and run first optimization |
| 9+ | Ongoing labeling and quarterly re-evaluation |

---

## Quick Wins (No New Infrastructure)

1. **Export a sample for labeling**
   Use `complexity-cli batch-analyze` to produce a CSV, then add a `human_complexity` column in a spreadsheet.

2. **Compare AI vs human**
   Write a small script that computes MAE and within-1 between `complexity` and `human_complexity`.

3. **Add 3–5 company-specific rules**
   Based on your domains (e.g., Rivery, Boomi, data pipelines), add short guidance to `default.txt` and re-run on a sample.
