Commit 537c7c3

docs: Phase 10 quality loop CI design plan
Config-driven improvement loop: run prompts → validate → deduplicate findings into GitHub Issues → assign top 3 to Copilot Workspace → fix → merge → loop. Generic across modules (PDF, future DOCX, etc.) via watch_paths in module config.
1 parent 26c9386 commit 537c7c3

1 file changed: +237 -0 lines changed

docs/design/QUALITY-LOOP-PLAN.md

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@
# Phase 10: Quality Loop CI - Design Plan

## Overview

A generic, config-driven continuous improvement loop that:
1. Runs prompts against HyperAgent modules (PDF, future DOCX, etc.)
2. Validates outputs (structural, visual, content, feedback extraction)
3. Deduplicates findings into tracked GitHub Issues
4. Auto-assigns the top 3 issues (ranked by frequency × severity × recency) to Copilot Workspace for fixing
5. Loops after fixes are merged

## Architecture: Same Repo, Watch Paths in Module Config

- Quality loop is a GitHub Actions workflow in hyperagent
- Issues tracked with `quality-loop` + module labels
- Copilot Workspace creates PRs from `copilot-fix` issues
- Module detection: orchestrator reads module configs, compares `watch_paths` against git diff
- Adding a new module = adding a config file + prompts, no workflow changes

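The module detection step can be sketched as a small matcher over changed file paths. This is a minimal sketch, not the planned `detect-modules.mjs`: the function names `matchesWatchPath` and `detectModules` are hypothetical, and the matching rules (a trailing `/` or `/**` means "everything under this directory", anything else is an exact path) are an assumption about how `watch_paths` entries are meant to be read.

```javascript
// Assumed matching rules (not confirmed by the plan):
//   "dir/"    matches every file under dir/
//   "dir/**"  matches every file under dir/
//   anything else must equal the changed path exactly
function matchesWatchPath(file, pattern) {
  if (pattern.endsWith("/**")) {
    return file.startsWith(pattern.slice(0, -2)); // keep the trailing "/"
  }
  if (pattern.endsWith("/")) {
    return file.startsWith(pattern);
  }
  return file === pattern;
}

// Returns the names of modules whose watch_paths match at least one changed file.
function detectModules(moduleConfigs, changedFiles) {
  return moduleConfigs
    .filter((cfg) =>
      changedFiles.some((file) =>
        cfg.watch_paths.some((pattern) => matchesWatchPath(file, pattern))
      )
    )
    .map((cfg) => cfg.name);
}

// Trimmed-down, illustrative configs for two modules.
const configs = [
  { name: "pdf", watch_paths: ["builtin-modules/src/pdf.ts", "skills/pdf-expert/", "src/agent/**"] },
  { name: "docx", watch_paths: ["builtin-modules/src/docx.ts", "skills/docx-expert/", "src/agent/**"] },
];
```

With these rules, a change to `builtin-modules/src/pdf.ts` selects only `pdf`, while a change under `src/agent/` selects every module that lists `src/agent/**`.
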
## Module Config Format

Each module registers itself via a config file:

```yaml
# quality-loop/modules/pdf.yaml
name: pdf
skill: pdf-expert
profiles: file-builder
prompts_dir: tests/pdf-prompts
expected_patterns:
  - "*.yaml"
watch_paths:
  - builtin-modules/src/pdf.ts
  - builtin-modules/src/pdf-charts.ts
  - builtin-modules/src/doc-core.ts
  - skills/pdf-expert/
  - tests/pdf-prompts/
  - src/agent/** # core changes trigger all modules
  - src/sandbox/**
validation:
  structural: true # qpdf --check
  visual: true # pdftoppm + pixelmatch against golden
  content: true # expected_content field matching
  feedback: true # extract LLM feedback JSON
  text_extraction: true # pdftotext for custom font docs
timeout: 300 # per-prompt timeout in seconds
```

Future DOCX module:
```yaml
# quality-loop/modules/docx.yaml
name: docx
skill: docx-expert
profiles: file-builder
prompts_dir: tests/docx-prompts
watch_paths:
  - builtin-modules/src/docx.ts
  - skills/docx-expert/
  - tests/docx-prompts/
  - src/agent/**
  - src/sandbox/**
validation:
  structural: true
  content: true
  feedback: true
timeout: 300
```

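Because new modules are added purely through config, a load-time sanity check may be worth having. The sketch below is illustrative only: `validateModuleConfig` is a hypothetical helper, the field names are taken from the YAML above, and which fields count as mandatory is an assumption. It operates on an already-parsed object, so no YAML library is implied.

```javascript
// Field names come from the module configs; treating these as required is an assumption.
const REQUIRED_FIELDS = ["name", "skill", "prompts_dir", "watch_paths", "validation"];

// Returns a list of human-readable problems; an empty list means the config looks usable.
function validateModuleConfig(config) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    if (config[field] === undefined) problems.push(`missing field: ${field}`);
  }
  if (
    config.watch_paths !== undefined &&
    (!Array.isArray(config.watch_paths) || config.watch_paths.length === 0)
  ) {
    problems.push("watch_paths must be a non-empty list");
  }
  if (
    config.timeout !== undefined &&
    !(Number.isInteger(config.timeout) && config.timeout > 0)
  ) {
    problems.push("timeout must be a positive integer (seconds)");
  }
  return problems;
}

// The docx config as it might look after YAML parsing.
const docxConfig = {
  name: "docx",
  skill: "docx-expert",
  profiles: "file-builder",
  prompts_dir: "tests/docx-prompts",
  watch_paths: [
    "builtin-modules/src/docx.ts",
    "skills/docx-expert/",
    "tests/docx-prompts/",
    "src/agent/**",
    "src/sandbox/**",
  ],
  validation: { structural: true, content: true, feedback: true },
  timeout: 300,
};
```

Returning a list of problems rather than throwing on the first one lets the orchestrator report every issue in a new module config in a single run.
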
## Pipeline Stages

### Stage 1: Run Prompts
- For each module config, run all prompts via generalized `run-prompts.sh`
- Collect: output files, debug logs, code logs, transcripts, feedback JSON
- Output: `/tmp/quality-loop-results/<run-id>/<module>/<prompt>/`

### Stage 2: Validate & Score
For each prompt result:
- **Structural**: qpdf/file header/EOF checks → pass/fail
- **Content**: expected_content field matches → score (found/total)
- **Visual**: pixelmatch against golden baselines → diff pixel count
- **Feedback**: extract LLM feedback JSON → parse errors/hard/improvements
- **Text extraction**: pdftotext output sanity check
- **Code analysis**: read code.log for patterns (unused imports, error recovery attempts)
- **Timing**: duration, number of LLM edits/retries

Output: `evaluation-report.json` per prompt

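The content check and the per-prompt report could look roughly like the following. This is a sketch under assumptions: `contentScore` and `buildEvaluationReport` are hypothetical names, case-insensitive substring matching is one possible reading of "expected_content field matches", and the report field names are illustrative rather than a defined schema.

```javascript
// Content check: how many expected strings appear in the extracted text.
// Case-insensitive substring matching is an assumption.
function contentScore(extractedText, expectedContent) {
  const haystack = extractedText.toLowerCase();
  const found = expectedContent.filter((needle) =>
    haystack.includes(needle.toLowerCase())
  );
  const total = expectedContent.length;
  return {
    found: found.length,
    total,
    score: total === 0 ? 1 : found.length / total, // nothing expected counts as a full score
  };
}

// Hypothetical shape for evaluation-report.json; every field name is illustrative.
function buildEvaluationReport({ module, prompt, structuralPass, content, diffPixels, durationMs, retries }) {
  return {
    module,
    prompt,
    structural: structuralPass ? "pass" : "fail",
    content,
    visual: { diffPixels },
    timing: { durationMs, retries },
  };
}
```

For example, `contentScore("Invoice #42 Total: $100", ["invoice", "total", "due date"])` finds 2 of 3 expected strings.
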
### Stage 3: Deduplicate & Track Issues
- Parse all evaluation reports + feedback across all prompts
- Group findings by category:
  - `bug`: runtime errors, qpdf failures, visual regressions
  - `api-gap`: missing features reported by 2+ prompts
  - `ux-friction`: confusing APIs, misleading types, extra LLM attempts needed
  - `performance`: slow prompts, excessive retries
- For each finding:
  - Hash the finding (category + description + key details) → fingerprint
  - Check existing GitHub Issues with label `quality-loop` for matching fingerprint
  - If exists: increment occurrence count in issue body, add latest evidence
  - If new: create new issue with label `quality-loop`, priority tag, evidence

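Fingerprinting a finding can be sketched with Node's built-in `crypto` module. This is one possible approach, not the plan's specified algorithm: the `fingerprint` and `normalize` helpers are hypothetical, and the normalization (lowercasing, collapsing whitespace, sorting key details) is an assumption added so that trivially different wording of the same finding still deduplicates. The block uses `require` so it can be pasted into a plain Node script; in an `.mjs` file the equivalent is `import { createHash } from "node:crypto"`.

```javascript
const { createHash } = require("node:crypto");

// Assumed normalization: lowercase, collapse runs of whitespace, trim.
function normalize(text) {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Hashes category + description + key details into a stable "sha256:<hex>" string.
// Key details are sorted so their order does not change the fingerprint.
function fingerprint({ category, description, keyDetails = [] }) {
  const parts = [category, description, ...[...keyDetails].sort()];
  const canonical = parts.map(normalize).join("\n");
  return "sha256:" + createHash("sha256").update(canonical, "utf8").digest("hex");
}
```

A free-text description produced by an LLM may still vary enough between runs to defeat exact hashing, so the inputs to the hash probably need to be chosen from the more stable parts of a finding.
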
### Stage 4: Prioritize & Assign
- Score issues by: frequency × severity × recency
- Top 3 issues by score → assign to Copilot Workspace
  - Add label `copilot-fix`
  - Issue body includes: reproduction steps, relevant code paths, suggested fix
- Remaining issues: labelled `quality-loop-backlog`

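The scoring rule can be made concrete as follows. This is a sketch with assumed numbers: the plan names the three factors but not their scales, so the severity weights (`P0` = 3, `P1` = 2, `P2` = 1) and the recency curve `1 / (1 + days)` are placeholders, and `scoreIssue` and `prioritize` are hypothetical names.

```javascript
// Assumed severity weights; the plan does not define them.
const SEVERITY_WEIGHT = { P0: 3, P1: 2, P2: 1 };

// Assumed recency curve: 1.0 for an issue seen today, decaying as it ages.
function recencyFactor(daysSinceLastSeen) {
  return 1 / (1 + daysSinceLastSeen);
}

// score = frequency x severity x recency
function scoreIssue(issue) {
  return (
    issue.occurrences *
    SEVERITY_WEIGHT[issue.priority] *
    recencyFactor(issue.daysSinceLastSeen)
  );
}

// Splits issues into the top N (to be labelled copilot-fix) and the backlog.
function prioritize(issues, topN = 3) {
  const ranked = [...issues].sort((a, b) => scoreIssue(b) - scoreIssue(a));
  return {
    copilotFix: ranked.slice(0, topN),
    backlog: ranked.slice(topN),
  };
}
```

Multiplying the factors means a frequent but stale issue can rank below a rarer issue seen in the latest run, which matches the intent of including recency at all.
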
### Stage 5: Fix & Merge
- Copilot Workspace creates PRs from `copilot-fix` issues
- PRs run standard CI (`just check`)
- Human reviews and merges
- On merge → quality loop runs again (triggered by merge event + watch_paths match)

## Workflow Definition

```yaml
name: Quality Loop
on:
  workflow_dispatch:
    inputs:
      modules:
        description: "Comma-separated module names (or 'all')"
        default: "all"
      max_iterations:
        description: "Max improvement iterations"
        default: "1"
  schedule:
    - cron: "0 2 * * 1-5" # Weekdays at 02:00 UTC
  push:
    branches: [main]
    paths:
      - "builtin-modules/src/**"
      - "skills/**"
      - "src/agent/**"
      - "src/sandbox/**"
      - "quality-loop/modules/**"
      - "tests/*-prompts/**" # prompt dirs are listed in module watch_paths

jobs:
  detect-modules:
    runs-on: ubuntu-latest
    outputs:
      modules: ${{ steps.detect.outputs.modules }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2 # HEAD~1 must exist for the diff below
      - id: detect
        run: |
          node scripts/quality-loop/detect-modules.mjs \
            --changed "$(git diff --name-only HEAD~1)" \
            >> "$GITHUB_OUTPUT"

  quality-loop:
    needs: detect-modules
    if: needs.detect-modules.outputs.modules != 'none'
    runs-on:
      - self-hosted
      - 1ES.Pool=hld-kvm-amd
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - name: Setup
        run: |
          sudo apt-get update
          sudo apt-get install -y poppler-utils qpdf fonts-dejavu-core
          just setup
      - name: Build
        run: just build
      - name: Run quality loop
        run: |
          node scripts/quality-loop/orchestrator.mjs \
            --modules "${{ needs.detect-modules.outputs.modules }}" \
            --max-iterations "${{ inputs.max_iterations || '1' }}"
      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: quality-loop-${{ github.run_number }}
          path: /tmp/quality-loop-results/
          retention-days: 30
```

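The workflow relies on two small string contracts: `detect-modules.mjs` must print a `modules=<value>` line (its stdout is appended to `$GITHUB_OUTPUT`, with `none` as the skip sentinel), and the orchestrator must parse the comma-separated `--modules` value. A minimal sketch of both, with hypothetical helper names `formatModulesOutput` and `parseModulesArg`:

```javascript
// Formats the single line detect-modules.mjs would write to stdout.
// "none" is the sentinel the quality-loop job's `if:` condition checks for.
function formatModulesOutput(moduleNames) {
  const value = moduleNames.length > 0 ? moduleNames.join(",") : "none";
  return `modules=${value}`;
}

// Parses a --modules value such as "pdf,docx" or "all".
// knownModules is assumed to come from the config files in quality-loop/modules/.
function parseModulesArg(arg, knownModules) {
  if (arg.trim() === "all") return [...knownModules];
  return arg
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}
```

Since everything the detection script prints to stdout ends up in `$GITHUB_OUTPUT`, any diagnostic logging in that script would need to go to stderr.
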
## Scripts

| Script | Purpose | Approx. LOC |
|--------|---------|-------------|
| `scripts/quality-loop/orchestrator.mjs` | Main entry: load module configs, run stages, loop | 100 |
| `scripts/quality-loop/detect-modules.mjs` | Compare changed files against watch_paths | 60 |
| `scripts/quality-loop/runner.mjs` | Stage 1: execute prompts for a module | 80 |
| `scripts/quality-loop/evaluator.mjs` | Stage 2: validate outputs, generate scores | 150 |
| `scripts/quality-loop/deduplicator.mjs` | Stage 3: group findings, manage GitHub Issues | 200 |
| `scripts/quality-loop/prioritizer.mjs` | Stage 4: score issues, assign top 3 to Copilot | 80 |
| `scripts/quality-loop/reporter.mjs` | Generate HTML summary report | 100 |

## Issue Format

````markdown
## Quality Loop Finding: [category] [title]

**Module:** pdf
**Priority:** P0 | P1 | P2
**Occurrences:** 3 (across 3 runs)
**Fingerprint:** `sha256:abc123...`

### Description
[What's wrong]

### Evidence
- Run 2026-04-14 #42: invoice.yaml - [details]
- Run 2026-04-13 #41: letter.yaml - [details]
- Run 2026-04-12 #40: resume.yaml - [details]

### Suggested Fix
[Code paths to change, approach]

### Reproduction
```bash
./scripts/run-pdf-prompts.sh invoice.yaml
# Then check: ...
```

Labels: `quality-loop`, `pdf`, `P0`
````

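Given this body format, matching and updating an existing issue comes down to text manipulation. The sketch below is illustrative: `extractFingerprint` and `recordOccurrence` are hypothetical helpers, and incrementing the run count together with the occurrence count assumes each update represents one new occurrence from one new run.

```javascript
// Pulls the fingerprint out of an issue body written in the format above.
// Returns null when no fingerprint line is present.
function extractFingerprint(issueBody) {
  const match = issueBody.match(/\*\*Fingerprint:\*\* `(sha256:[0-9a-f]+)`/);
  return match ? match[1] : null;
}

// Bumps the occurrence and run counters and puts the new evidence line first.
function recordOccurrence(issueBody, evidenceLine) {
  return issueBody
    .replace(
      /(\*\*Occurrences:\*\* )(\d+)( \(across )(\d+)( runs\))/,
      (_, label, count, mid, runs, tail) =>
        `${label}${Number(count) + 1}${mid}${Number(runs) + 1}${tail}`
    )
    .replace(/### Evidence\n/, (heading) => `${heading}- ${evidenceLine}\n`);
}
```

Keeping the counters and fingerprint in the issue body means no separate datastore is required, at the cost of the body format becoming a contract that both helpers and humans editing the issue have to respect.
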
## Stop Conditions
- All prompts score ≥ 90% on all metrics
- No new issues found in 2 consecutive runs
- Max iterations reached (configurable)
- Zero P0 issues remain open

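One way to encode these conditions is a single function that returns the reason for stopping, or `null` to continue. The plan does not say whether the conditions combine with AND or OR; this sketch assumes that any single condition ends the loop. `stopReason` and the fields of `state` are hypothetical names.

```javascript
// Assumption: any one condition is enough to stop.
// state.minScore is the lowest score across all prompts and metrics, in the range 0..1.
function stopReason(state, maxIterations) {
  if (state.iteration >= maxIterations) return "max iterations reached";
  if (state.minScore >= 0.9) return "all prompts score >= 90%";
  if (state.runsWithoutNewIssues >= 2) return "no new issues in 2 consecutive runs";
  if (state.openP0 === 0) return "zero P0 issues remain open";
  return null; // keep looping
}
```

Under the any-of reading, a run that starts with no open P0 issues stops immediately, so a stricter combination (for example, requiring zero P0 issues together with the score threshold) may be closer to what is wanted.
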
## Key Design Principles
1. **Generic**: module configs make it work for any document type
2. **Idempotent**: same run twice → same issues (deduplication via fingerprint)
3. **Observable**: HTML report, GitHub Issues, artifact uploads
4. **Safe**: never auto-merges, always needs human review for PRs
5. **Incremental**: fixes top 3 issues per iteration, not everything at once
