Commit b1e8c8c

update(bubble-packed): enhance quality evaluation process
- Refine quality score reporting with category breakdown and local repair iterations
- Update self-check section to include formal evaluation steps
- Clarify scoring calibration and apply score caps for quality categories
1 parent 7b39e5c commit b1e8c8c

1 file changed

Lines changed: 43 additions & 14 deletions

agentic/commands/update.md

@@ -150,7 +150,7 @@ Agents report back via `SendMessage` (auto-delivered to you). Agents may report
 2. **Present a summary to the user** for each library that completed successfully:
 - What was changed (bullet points from agent)
 - **Preview:** `{absolute path}/plots/{spec_id}/implementations/.update-preview/{library}/plot.png` (use the absolute path reported by the agent in its `IMAGE:` field — display it on its own line so terminal emulators render it as a clickable link)
-- Agent's self-assessment score
+- Agent's quality score with category breakdown and number of local repair iterations
 - Any spec changes the agent made
 
 **After the summary**, create before/after comparison images and open them:
@@ -369,7 +369,8 @@ Updated **{library}** implementation for **{spec_id}**.
 
 ### Changes
 {bullet points of changes from agent}
-- Quality self-assessment: {score}/100
+- Quality: {score}/100 (after {N} local repair iterations)
+- VQ: {vq}/30 | DE: {de}/20 | SC: {sc}/15 | DQ: {dq}/15 | CQ: {cq}/10 | LM: {lm}/10
 
 ## Test Plan
 
@@ -640,21 +641,46 @@ uv run python -m core.images process \
 plots/{SPEC_ID}/implementations/.update-preview/{LIBRARY}/plot_thumb.png
 ```
 
-### Step 7: Self-Check
+### Step 7: Quality Evaluation & Local Repair Loop
 
-View the generated image at `plots/{SPEC_ID}/implementations/.update-preview/{LIBRARY}/plot.png`.
+Before shipping, formally evaluate your implementation against the same criteria CI uses. This catches issues locally
+(cheaper) instead of triggering expensive remote repair cycles.
 
-Check against the quality criteria from `prompts/quality-criteria.md`:
+**7a. View the generated image** at `plots/{SPEC_ID}/implementations/.update-preview/{LIBRARY}/plot.png`.
 
-- Text legibility (title 24pt, labels 20pt, ticks 16pt)
-- No overlapping elements
-- Elements visible and distinguishable
-- Color accessibility
-- Layout balance (16:9)
-- Correct axis labels with units
-- Spec compliance
+**7b. Score against all 6 quality categories** using the criteria from `prompts/quality-criteria.md` (read in Step 1).
+Produce category totals only (no per-criterion notes needed):
 
-Fix any obvious issues before reporting.
+```
+VQ: __/30 | DE: __/20 | SC: __/15 | DQ: __/15 | CQ: __/10 | LM: __/10 → TOTAL: __/100
+```
+
+**Scoring calibration — apply these defaults strictly:**
+- DE-01 = 4 (configured default, not exceptional) unless design is genuinely outstanding
+- DE-02 = 2 (library defaults, minimal refinement) unless you added intentional polish
+- DE-03 = 2 (data displayed, no storytelling) unless visual hierarchy clearly guides the viewer
+- LM-01 = 3 (correct but not best patterns) unless you used high-level API expertly
+- LM-02 = 1 (generic usage) unless you used a feature distinctive to this library
+- **Median implementation scores 72-78, not 90+.** If your total is above 85, re-check — are you inflating?
+
+**Apply score caps** (from quality-criteria.md):
+- VQ-02 = 0 (severe overlap) → max 49
+- VQ-03 = 0 (invisible elements) → max 49
+- SC-01 = 0 (wrong plot type) → max 40
+- DQ-02 = 0 (controversial data) → max 49
+- DE-01 ≤ 2 AND DE-02 ≤ 2 (generic + no refinement) → max 75
+- CQ-04 = 0 (fake functionality) → max 70
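As a reading aid for the caps added in this hunk, here is a minimal Python sketch of how they could be applied to a total. The function name `apply_score_caps`, the dictionary of per-criterion scores, and the rule that the lowest triggered cap wins when several apply are assumptions for illustration. The diff only lists the individual caps and does not say how they combine.

```python
from typing import Dict


def apply_score_caps(criteria: Dict[str, int], total: int) -> int:
    """Clamp `total` to the lowest cap triggered by the given criterion scores.

    `criteria` maps criterion IDs such as "VQ-02" to their scores. A criterion
    that is missing from the mapping is treated as not triggering its cap
    (an assumption of this sketch).
    """
    triggered = []

    # Single-criterion caps: a score of 0 on the criterion limits the total.
    zero_caps = {"VQ-02": 49, "VQ-03": 49, "SC-01": 40, "DQ-02": 49, "CQ-04": 70}
    for criterion, cap in zero_caps.items():
        if criteria.get(criterion) == 0:
            triggered.append(cap)

    # Combined cap: generic design (DE-01 <= 2) with no refinement (DE-02 <= 2).
    de01 = criteria.get("DE-01")
    de02 = criteria.get("DE-02")
    if de01 is not None and de02 is not None and de01 <= 2 and de02 <= 2:
        triggered.append(75)

    # Assumption: when several caps apply, the most restrictive one wins.
    return min([total] + triggered)
```

For example, `apply_score_caps({"DE-01": 2, "DE-02": 2}, 82)` returns 75, while a total that is already below every triggered cap is returned unchanged.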
+
+**7c. If score ≥ 90** → proceed to Step 8.
+
+**7d. If score < 90** → repair locally (max **2 iterations**, separate from CI's 3 repair attempts after PR submission):
+1. Identify the top 2-3 weakest categories/criteria dragging the score down
+2. Fix the implementation code to address those specific weaknesses
+3. Re-run the implementation (Step 4), re-lint (Step 5), re-process images (Step 6)
+4. **Re-read the generated image** and **formally re-score** — produce the full VQ/DE/SC/DQ/CQ/LM breakdown again (fixes can introduce new issues)
+5. If score ≥ 90 → proceed to Step 8
+6. If score < 90 and iterations < 2 → repeat from substep 1
+7. If score < 90 after 2 iterations → proceed to Step 8 anyway with your most recent score (ship to CI for fresh perspective)
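The control flow of steps 7c and 7d can be sketched in Python as follows. This is only an illustration of the loop's shape: the callables `evaluate` and `repair`, and the constant names, are hypothetical stand-ins for the agent's own scoring pass and its fix/re-run/re-lint/re-process cycle, which the diff describes only in prose.

```python
from typing import Callable, Tuple

PASS_THRESHOLD = 90      # step 7c: a score >= 90 proceeds to Step 8
MAX_LOCAL_REPAIRS = 2    # step 7d: at most 2 local repair iterations


def run_local_repair_loop(
    evaluate: Callable[[], int],
    repair: Callable[[], None],
) -> Tuple[int, int]:
    """Return (most recent score, number of local repair iterations used).

    `evaluate` stands in for a full formal re-score of the generated image.
    `repair` stands in for substeps 1-3 (identify weaknesses, fix the code,
    re-run, re-lint, re-process images).
    """
    score = evaluate()
    iterations = 0
    while score < PASS_THRESHOLD and iterations < MAX_LOCAL_REPAIRS:
        repair()
        iterations += 1
        # Always re-score after a repair, since fixes can introduce new issues.
        score = evaluate()
    # Either the score passed, or the repair budget is spent and the most
    # recent score is shipped to CI anyway (substep 7).
    return score, iterations
```

The returned iteration count is the `{N}` reported in Step 8, where 0 means the first evaluation already passed.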
 
 ### Step 8: Report to Lead

@@ -670,7 +696,10 @@ CHANGES:
 - ...
 
 IMAGE: {absolute path to plots/{SPEC_ID}/implementations/.update-preview/{LIBRARY}/plot.png — use pwd to resolve}
-SELF_SCORE: {your estimated quality score}/100
+
+QUALITY: {total}/100 (after {N} local repair iterations, 0 = passed first evaluation)
+VQ: {vq}/30 | DE: {de}/20 | SC: {sc}/15
+DQ: {dq}/15 | CQ: {cq}/10 | LM: {lm}/10
 
 SPEC_CHANGES: {none, or describe what you changed in specification.md}
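To make the new report fields concrete, the following Python sketch renders the three `QUALITY` lines from a set of category totals. The helper `format_quality_report` and its parameters are hypothetical. The category maxima come from the template in this hunk; the optional `total` argument is an assumption that lets a capped total be reported in place of the raw sum.

```python
from typing import Dict, Optional

# Category maxima as they appear in the report template (sum is 100).
CATEGORY_MAX = {"VQ": 30, "DE": 20, "SC": 15, "DQ": 15, "CQ": 10, "LM": 10}


def format_quality_report(
    scores: Dict[str, int],
    iterations: int,
    total: Optional[int] = None,
) -> str:
    """Render the QUALITY block of the Step 8 report as a three-line string."""
    for name, maximum in CATEGORY_MAX.items():
        if not 0 <= scores[name] <= maximum:
            raise ValueError(f"{name} must be between 0 and {maximum}")

    raw_total = sum(scores[name] for name in CATEGORY_MAX)
    if total is None:
        total = raw_total
    elif not 0 <= total <= raw_total:
        # A cap can only lower the total, never raise it above the raw sum.
        raise ValueError("total must be between 0 and the sum of the categories")

    parts = [f"{name}: {scores[name]}/{maximum}" for name, maximum in CATEGORY_MAX.items()]
    return "\n".join([
        f"QUALITY: {total}/100 (after {iterations} local repair iterations, "
        "0 = passed first evaluation)",
        " | ".join(parts[:3]),
        " | ".join(parts[3:]),
    ])
```

Calling it with `{"VQ": 24, "DE": 12, "SC": 14, "DQ": 13, "CQ": 9, "LM": 5}` and one iteration yields a first line of `QUALITY: 77/100 (after 1 local repair iterations, 0 = passed first evaluation)`.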
