| 1 | +# Phase 10: Quality Loop CI — Design Plan |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +A generic, config-driven continuous improvement loop that: |
| 6 | +1. Runs prompts against HyperAgent modules (PDF, future DOCX, etc.) |
| 7 | +2. Validates outputs (structural, visual, content, feedback extraction) |
| 8 | +3. Deduplicates findings into tracked GitHub Issues |
| 9 | +4. Auto-assigns the top 3 issues (ranked by frequency × severity × recency) to Copilot Workspace for fixing |
| 10 | +5. Loops after fixes are merged |
| 11 | + |
| 12 | +## Architecture: Same Repo, Watch Paths in Module Config |
| 13 | + |
| 14 | +- Quality loop is a GitHub Actions workflow in the hyperagent repo |
| 15 | +- Issues tracked with `quality-loop` + module labels |
| 16 | +- Copilot Workspace creates PRs from `copilot-fix` issues |
| 17 | +- Module detection: orchestrator reads module configs, compares `watch_paths` against git diff |
| 18 | +- Adding a new module = adding a config file + prompts, no workflow changes |
| 19 | + |
| 20 | +## Module Config Format |
| 21 | + |
| 22 | +Each module registers itself via a config file: |
| 23 | + |
| 24 | +```yaml |
| 25 | +# quality-loop/modules/pdf.yaml |
| 26 | +name: pdf |
| 27 | +skill: pdf-expert |
| 28 | +profiles: file-builder |
| 29 | +prompts_dir: tests/pdf-prompts |
| 30 | +expected_patterns: |
| 31 | +  - "*.yaml" |
| 32 | +watch_paths: |
| 33 | +  - builtin-modules/src/pdf.ts |
| 34 | +  - builtin-modules/src/pdf-charts.ts |
| 35 | +  - builtin-modules/src/doc-core.ts |
| 36 | +  - skills/pdf-expert/ |
| 37 | +  - tests/pdf-prompts/ |
| 38 | +  - src/agent/**  # core changes trigger all modules |
| 39 | +  - src/sandbox/** |
| 40 | +validation: |
| 41 | +  structural: true       # qpdf --check |
| 42 | +  visual: true           # pdftoppm + pixelmatch against golden |
| 43 | +  content: true          # expected_content field matching |
| 44 | +  feedback: true         # extract LLM feedback JSON |
| 45 | +  text_extraction: true  # pdftotext for custom font docs |
| 46 | +timeout: 300 # per-prompt timeout in seconds |
| 47 | +``` |
| 48 | + |
| 49 | +Future DOCX module: |
| 50 | +```yaml |
| 51 | +# quality-loop/modules/docx.yaml |
| 52 | +name: docx |
| 53 | +skill: docx-expert |
| 54 | +profiles: file-builder |
| 55 | +prompts_dir: tests/docx-prompts |
| 56 | +watch_paths: |
| 57 | +  - builtin-modules/src/docx.ts |
| 58 | +  - skills/docx-expert/ |
| 59 | +  - tests/docx-prompts/ |
| 60 | +  - src/agent/** |
| 61 | +  - src/sandbox/** |
| 62 | +validation: |
| 63 | +  structural: true |
| 64 | +  content: true |
| 65 | +  feedback: true |
| 66 | +timeout: 300 |
| 67 | +``` |
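A minimal sketch of how `detect-modules.mjs` might compare changed files against `watch_paths`. The matching rules are assumptions, since the plan does not define them: an entry ending in `/**` or `/` is treated as a directory prefix, and any other entry must equal the file path exactly. `matchesWatchPath` and `detectModules` are hypothetical names.

```javascript
// Hypothetical helper: does one watch_paths entry cover a changed file?
function matchesWatchPath(watchPath, file) {
  if (watchPath.endsWith("/**")) {
    // "src/agent/**" -> prefix "src/agent/"
    return file.startsWith(watchPath.slice(0, -2));
  }
  if (watchPath.endsWith("/")) {
    // "skills/pdf-expert/" covers everything below that directory
    return file.startsWith(watchPath);
  }
  return file === watchPath; // plain file entry: exact match only
}

// Returns the names of all modules with at least one matching watch path.
function detectModules(configs, changedFiles) {
  return configs
    .filter((cfg) =>
      changedFiles.some((file) =>
        cfg.watch_paths.some((w) => matchesWatchPath(w, file))
      )
    )
    .map((cfg) => cfg.name);
}

// Trimmed-down versions of the two configs above.
const configs = [
  { name: "pdf", watch_paths: ["builtin-modules/src/pdf.ts", "skills/pdf-expert/", "src/agent/**"] },
  { name: "docx", watch_paths: ["builtin-modules/src/docx.ts", "skills/docx-expert/", "src/agent/**"] },
];

// A core change under src/agent/ matches both modules.
console.log(detectModules(configs, ["src/agent/loop.ts"]));
```

Under these rules a change that touches only `builtin-modules/src/pdf.ts` selects the `pdf` module alone, and a change outside every watch path selects nothing.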
| 68 | + |
| 69 | +## Pipeline Stages |
| 70 | + |
| 71 | +### Stage 1: Run Prompts |
| 72 | +- For each module config, run all prompts via generalized `run-prompts.sh` |
| 73 | +- Collect: output files, debug logs, code logs, transcripts, feedback JSON |
| 74 | +- Output: `/tmp/quality-loop-results/<run-id>/<module>/<prompt>/` |
| 75 | + |
| 76 | +### Stage 2: Validate & Score |
| 77 | +For each prompt result: |
| 78 | +- **Structural**: qpdf/file header/EOF checks → pass/fail |
| 79 | +- **Content**: expected_content field matches → score (found/total) |
| 80 | +- **Visual**: pixelmatch against golden baselines → diff pixel count |
| 81 | +- **Feedback**: extract LLM feedback JSON → parse errors/hard/improvements |
| 82 | +- **Text extraction**: pdftotext output sanity check |
| 83 | +- **Code analysis**: read code.log for patterns (unused imports, error recovery attempts) |
| 84 | +- **Timing**: duration, number of LLM edits/retries |
| 85 | + |
| 86 | +Output: `evaluation-report.json` per prompt |
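
The content check above (found/total) could be computed roughly as follows. This is a sketch under assumptions: it takes `expected_content` to be a list of strings and uses case-insensitive substring matching, neither of which the plan specifies. `scoreContent` is a hypothetical name.

```javascript
// Hypothetical scorer: counts how many expected strings appear in the extracted text.
function scoreContent(extractedText, expectedContent) {
  const haystack = extractedText.toLowerCase();
  const missing = expectedContent.filter(
    (needle) => !haystack.includes(needle.toLowerCase())
  );
  const total = expectedContent.length;
  const found = total - missing.length;
  // An empty expectation list counts as a perfect score (assumption).
  const score = total === 0 ? 1 : found / total;
  return { found, total, score, missing };
}

const result = scoreContent("Invoice #42  Total: $100", ["invoice", "total", "due date"]);
console.log(result.found, result.total, result.missing);
```

Keeping the `missing` list in the report gives Stage 3 concrete evidence to attach to an issue.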
| 87 | + |
| 88 | +### Stage 3: Deduplicate & Track Issues |
| 89 | +- Parse all evaluation reports + feedback across all prompts |
| 90 | +- Group findings by category: |
| 91 | +  - `bug`: runtime errors, qpdf failures, visual regressions |
| 92 | +  - `api-gap`: missing features reported by 2+ prompts |
| 93 | +  - `ux-friction`: confusing APIs, misleading types, extra LLM attempts needed |
| 94 | +  - `performance`: slow prompts, excessive retries |
| 95 | +- For each finding: |
| 96 | +  - Hash the finding (category + description + key details) → fingerprint |
| 97 | +  - Check existing GitHub Issues with label `quality-loop` for matching fingerprint |
| 98 | +  - If exists: increment occurrence count in issue body, add latest evidence |
| 99 | +  - If new: create new issue with label `quality-loop`, priority tag, evidence |
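
One way the fingerprint step might look, as a sketch only. The normalization (lowercasing, collapsing whitespace, sorting the details) is an assumption added so that trivially different wordings hash to the same value; the plan only says that category, description, and key details are hashed. `fingerprint`, `extractFingerprint`, and the `keyDetails` field are hypothetical names. The `sha256:` prefix follows the Issue Format section.

```javascript
const { createHash } = require("node:crypto");

// Assumed normalization: lowercase, collapse whitespace runs, trim.
function normalize(text) {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Hash category + description + key details into a stable fingerprint.
function fingerprint(finding) {
  const details = (finding.keyDetails ?? []).map(normalize).sort();
  const parts = [normalize(finding.category), normalize(finding.description), ...details];
  const digest = createHash("sha256").update(parts.join("\n")).digest("hex");
  return `sha256:${digest}`;
}

// Pull the fingerprint back out of an existing issue body, or null if absent.
function extractFingerprint(issueBody) {
  const match = /\*\*Fingerprint:\*\*\s*`(sha256:[0-9a-f]+)`/.exec(issueBody);
  return match ? match[1] : null;
}

const fp = fingerprint({
  category: "bug",
  description: "qpdf --check fails on embedded fonts",
  keyDetails: ["invoice.yaml"],
});
console.log(fp.length); // "sha256:" plus 64 hex characters
```

Matching then reduces to comparing `fingerprint(finding)` with `extractFingerprint(issue.body)` for each open `quality-loop` issue.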
| 100 | + |
| 101 | +### Stage 4: Prioritize & Assign |
| 102 | +- Score issues by: frequency × severity × recency |
| 103 | +- Top 3 issues by score → assign to Copilot Workspace |
| 104 | +  - Add label `copilot-fix` |
| 105 | +  - Issue body includes: reproduction steps, relevant code paths, suggested fix |
| 106 | +- Remaining issues: labelled `quality-loop-backlog` |
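
The scoring rule above can be sketched as below. The plan gives only the product frequency × severity × recency; the severity weights (P0=3, P1=2, P2=1) and the 7-day half-life for recency are placeholder assumptions, and `prioritize` and the issue fields are hypothetical names.

```javascript
// Assumed weights; the plan does not define how severity maps to a number.
const SEVERITY_WEIGHT = { P0: 3, P1: 2, P2: 1 };
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumed recency model: the factor halves every `halfLifeDays` since last seen.
function recencyFactor(lastSeen, now, halfLifeDays = 7) {
  const ageDays = Math.max(0, (now - lastSeen) / MS_PER_DAY);
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Rank issues by frequency × severity × recency and split off the top N.
function prioritize(issues, now = new Date(), topN = 3) {
  const ranked = issues
    .map((issue) => ({
      ...issue,
      score:
        issue.occurrences *
        SEVERITY_WEIGHT[issue.priority] *
        recencyFactor(new Date(issue.lastSeen), now),
    }))
    .sort((a, b) => b.score - a.score);
  return { assign: ranked.slice(0, topN), backlog: ranked.slice(topN) };
}

const now = new Date("2026-04-14T02:00:00Z");
const { assign, backlog } = prioritize(
  [
    { id: "A", priority: "P0", occurrences: 1, lastSeen: "2026-04-14T02:00:00Z" },
    { id: "B", priority: "P2", occurrences: 5, lastSeen: "2026-04-14T02:00:00Z" },
    { id: "C", priority: "P1", occurrences: 2, lastSeen: "2026-04-14T02:00:00Z" },
    { id: "D", priority: "P2", occurrences: 1, lastSeen: "2026-04-14T02:00:00Z" },
  ],
  now
);
console.log(assign.map((i) => i.id), backlog.map((i) => i.id));
```

Issues in `assign` would receive the `copilot-fix` label and the rest `quality-loop-backlog`. With a pure product, a frequent P2 can outrank a rare P0, so the weights deserve tuning before this is relied on.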
| 107 | + |
| 108 | +### Stage 5: Fix & Merge |
| 109 | +- Copilot Workspace creates PRs from `copilot-fix` issues |
| 110 | +- PRs run standard CI (`just check`) |
| 111 | +- Human reviews and merges |
| 112 | +- On merge → quality loop runs again (triggered by merge event + watch_paths match) |
| 113 | + |
| 114 | +## Workflow Definition |
| 115 | + |
| 116 | +```yaml |
| 117 | +name: Quality Loop |
| 118 | +on: |
| 119 | +  workflow_dispatch: |
| 120 | +    inputs: |
| 121 | +      modules: |
| 122 | +        description: "Comma-separated module names (or 'all')" |
| 123 | +        default: "all" |
| 124 | +      max_iterations: |
| 125 | +        description: "Max improvement iterations" |
| 126 | +        default: "1" |
| 127 | +  schedule: |
| 128 | +    - cron: "0 2 * * 1-5" # Weekdays at 02:00 UTC (GitHub cron schedules run in UTC) |
| 129 | +  push: |
| 130 | +    branches: [main] |
| 131 | +    paths: |
| 132 | +      - "builtin-modules/src/**" |
| 133 | +      - "skills/**" |
| 134 | +      - "src/agent/**" |
| 135 | +      - "src/sandbox/**" |
| 136 | +      - "quality-loop/modules/**" |
| 137 | + |
| 138 | +jobs: |
| 139 | +  detect-modules: |
| 140 | +    runs-on: ubuntu-latest |
| 141 | +    outputs: |
| 142 | +      modules: ${{ steps.detect.outputs.modules }} |
| 143 | +    steps: |
| 144 | +      - uses: actions/checkout@v4 |
| 145 | +        with: |
| 146 | +          fetch-depth: 2 |
| 147 | +      - id: detect |
| 148 | +        run: | |
| 149 | +          node scripts/quality-loop/detect-modules.mjs \ |
| 150 | +            --changed "$(git diff --name-only HEAD~1)" \ |
| 151 | +            >> "$GITHUB_OUTPUT" |
| 152 | + |
| 153 | +  quality-loop: |
| 154 | +    needs: detect-modules |
| 155 | +    if: needs.detect-modules.outputs.modules != 'none' |
| 156 | +    runs-on: |
| 157 | +      - self-hosted |
| 158 | +      - 1ES.Pool=hld-kvm-amd |
| 159 | +    env: |
| 160 | +      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} |
| 161 | +    steps: |
| 162 | +      - uses: actions/checkout@v4 |
| 163 | +      - name: Setup |
| 164 | +        run: | |
| 165 | +          sudo apt-get update |
| 166 | +          sudo apt-get install -y poppler-utils qpdf fonts-dejavu-core |
| 167 | +          just setup |
| 168 | +      - name: Build |
| 169 | +        run: just build |
| 170 | +      - name: Run quality loop |
| 171 | +        run: | |
| 172 | +          node scripts/quality-loop/orchestrator.mjs \ |
| 173 | +            --modules "${{ needs.detect-modules.outputs.modules }}" \ |
| 174 | +            --max-iterations "${{ inputs.max_iterations || '1' }}" |
| 175 | +      - name: Upload results |
| 176 | +        if: always() |
| 177 | +        uses: actions/upload-artifact@v4 |
| 178 | +        with: |
| 179 | +          name: quality-loop-${{ github.run_number }} |
| 180 | +          path: /tmp/quality-loop-results/ |
| 181 | +          retention-days: 30 |
| 182 | +``` |
| 183 | + |
| 184 | +## Scripts |
| 185 | + |
| 186 | +| Script | Purpose | ~LOC | |
| 187 | +|--------|---------|------| |
| 188 | +| `scripts/quality-loop/orchestrator.mjs` | Main entry: load module configs, run stages, loop | ~100 | |
| 189 | +| `scripts/quality-loop/detect-modules.mjs` | Compare changed files against watch_paths | ~60 | |
| 190 | +| `scripts/quality-loop/runner.mjs` | Stage 1: execute prompts for a module | ~80 | |
| 191 | +| `scripts/quality-loop/evaluator.mjs` | Stage 2: validate outputs, generate scores | ~150 | |
| 192 | +| `scripts/quality-loop/deduplicator.mjs` | Stage 3: group findings, manage GitHub Issues | ~200 | |
| 193 | +| `scripts/quality-loop/prioritizer.mjs` | Stage 4: score issues, assign top 3 to Copilot | ~80 | |
| 194 | +| `scripts/quality-loop/reporter.mjs` | Generate HTML summary report | ~100 | |
| 195 | + |
| 196 | +## Issue Format |
| 197 | + |
| 198 | +```markdown |
| 199 | +## Quality Loop Finding: [category] [title] |
| 200 | + |
| 201 | +**Module:** pdf |
| 202 | +**Priority:** P0 | P1 | P2 |
| 203 | +**Occurrences:** 3 (across 3 runs) |
| 204 | +**Fingerprint:** `sha256:abc123...` |
| 205 | + |
| 206 | +### Description |
| 207 | +[What's wrong] |
| 208 | + |
| 209 | +### Evidence |
| 210 | +- Run 2026-04-14 #42: invoice.yaml — [details] |
| 211 | +- Run 2026-04-13 #41: letter.yaml — [details] |
| 212 | +- Run 2026-04-12 #40: resume.yaml — [details] |
| 213 | + |
| 214 | +### Suggested Fix |
| 215 | +[Code paths to change, approach] |
| 216 | + |
| 217 | +### Reproduction |
| 218 | +\```bash |
| 219 | +./scripts/run-pdf-prompts.sh invoice.yaml |
| 220 | +# Then check: ... |
| 221 | +\``` |
| 222 | + |
| 223 | +Labels: `quality-loop`, `pdf`, `P0` |
| 224 | +``` |
| 225 | + |
| 226 | +## Stop Conditions |
| 227 | +- All prompts score ≥ 90% on all metrics |
| 228 | +- No new issues found in 2 consecutive runs |
| 229 | +- Max iterations reached (configurable) |
| 230 | +- Zero P0 issues remain open |
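
A sketch of how the orchestrator might evaluate these conditions. The plan does not say whether they combine with AND or OR, so this hypothetical `stopReasons` helper only reports which conditions currently hold and leaves the policy to the caller. The state field names are assumptions, and each entry in `promptScores` is taken to be a prompt's lowest score across all metrics.

```javascript
// Returns the list of stop conditions that currently hold (empty list = keep looping).
function stopReasons(state, { minScore = 0.9, quietRuns = 2 } = {}) {
  const reasons = [];

  // All prompts score >= 90% on all metrics.
  if (state.promptScores.length > 0 && state.promptScores.every((s) => s >= minScore)) {
    reasons.push("all-prompts-pass");
  }

  // No new issues found in the last `quietRuns` consecutive runs.
  const recent = state.newIssuesPerRun.slice(-quietRuns);
  if (recent.length === quietRuns && recent.every((n) => n === 0)) {
    reasons.push("no-new-issues");
  }

  // Max iterations reached.
  if (state.iteration >= state.maxIterations) {
    reasons.push("max-iterations");
  }

  // Zero P0 issues remain open.
  if (state.openP0Count === 0) {
    reasons.push("no-open-p0");
  }

  return reasons;
}

const reasons = stopReasons({
  promptScores: [0.95, 0.85],
  newIssuesPerRun: [3, 0, 0],
  iteration: 1,
  maxIterations: 3,
  openP0Count: 1,
});
console.log(reasons);
```

A caller that wants "stop on any condition" checks `reasons.length > 0`; a stricter policy could require specific reasons to appear together.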
| 231 | + |
| 232 | +## Key Design Principles |
| 233 | +1. **Generic**: module configs make it work for any document type |
| 234 | +2. **Idempotent**: running the loop twice on the same inputs creates no duplicate issues (findings are deduplicated by fingerprint) |
| 235 | +3. **Observable**: HTML report, GitHub Issues, artifact uploads |
| 236 | +4. **Safe**: never auto-merges, always needs human review for PRs |
| 237 | +5. **Incremental**: fixes top 3 issues per iteration, not everything at once |