Commit 3a32ff8

Merge pull request #59 from Forward-Future/codex/production-grade-product-qa
[codex] Upgrade full product evaluation loop
2 parents d48ac01 + dff498b commit 3a32ff8

9 files changed

Lines changed: 95 additions & 82 deletions

scripts/loop-data.mjs

Lines changed: 22 additions & 18 deletions
@@ -407,37 +407,41 @@ export const loops = [
  slug: "full-product-evaluation-loop",
  title: "The full product evaluation loop",
  summary:
- "Tests every major product capability and fixes outcomes below the quality bar.",
- seoTitle: "Full Product Evaluation Loop for AI Systems | Loop Library",
+ "Recreates production locally, tests every product surface, and fixes all verified bugs holistically.",
+ seoTitle: "Production-Grade Full Product Evaluation Loop | Loop Library",
  description:
  "A comprehensive product-quality workflow that evaluates realistic scenarios across every major capability, fixes weak outcomes, and reruns them to the defined bar.",
  categoryLabel: "AI product evaluation workflow",
  author: "Matthew Berman",
  published: "2026-06-16",
- modified: "2026-06-17",
+ modified: "2026-06-21",
  prompt:
- "Create [N] realistic scenarios covering every major capability. Before testing, define clear success criteria and choose a consistent evaluation method, such as pass/fail checks or a scoring rubric. Run every scenario under the same conditions and record evidence for each outcome. Fix the underlying cause of anything that does not meet the criteria, rerun the affected scenarios, and then rerun the complete set. Continue until every scenario meets the original quality bar.",
- verifyTitle: "Every one of the [N] scenarios meets the defined quality bar.",
+ "Build sanitized, production-scale local data under production-like settings. Inventory every user-facing feature, role, route, button, input, modal, state, and workflow; define documented acceptance criteria and finite risk-based edge cases for each. Test as a real user, logging every bug with reproduction evidence. Review findings for shared causes and dependencies; implement coherent fixes with regression tests, then rerun the full inventory. Stop at a clean pass or blocked handoff. Ask before production, sensitive data, or destructive actions.",
+ verifyTitle: "Every inventoried product surface meets its documented acceptance criteria.",
  verifyDetail:
- "The final evaluated run covers every major capability under the original conditions.",
+ "The final full regression run covers every inventoried surface and its finite risk-based edge cases in the production-like local environment, with each reproducible bug fixed and backed by evidence.",
  useWhen:
- "Use this for an end-to-end product evaluation when quality must be measured across the full feature set rather than a narrow regression or a few hand-picked examples.",
+ "Use this for an exhaustive, end-to-end application QA pass when a production-like local environment and complete interactive-surface coverage matter more than a narrow regression or sample of major features.",
  steps: [
- "List every major capability, define the success criteria and evaluation method, choose [N], and allocate realistic scenarios across the product surface.",
- "Run the full set under consistent conditions and evaluate every outcome with evidence.",
- "Document each scenario that misses the criteria, fix the underlying issue, and add focused regression coverage where appropriate.",
- "Rerun affected scenarios and then the complete set until every outcome meets the original quality bar.",
+ "Build a sanitized or synthetic production-scale local dataset, mirror safe production settings, and record unavoidable differences.",
+ "Inventory every user-facing feature, role, route, control, state, and workflow; define documented acceptance criteria and a finite risk-based edge-case set for each item.",
+ "Exercise every inventory item as a real user under its normal and defined edge-case conditions, logging each bug immediately with reproducible evidence.",
+ "Review the complete bug set for shared causes, dependencies, and conflicting fixes, then implement the smallest coherent solution with regression coverage.",
+ "Rerun affected paths and the complete inventory; stop only at a clean full pass or an explicit blocked handoff.",
  ],
  why:
- "A fixed capability map and consistent evaluation method make product quality visible across the whole system. Requiring a final complete run catches fixes that improve one scenario while weakening another.",
+ "A finite surface inventory prevents major controls and states from disappearing behind a few happy-path scenarios. Reviewing all findings before fixing them exposes shared causes and interactions, while the final full run catches changes that repair one path but weaken another.",
  note:
- "Keep the scenario set representative and preserve failed examples. Aggregate results can hide severe misses, so require every scenario to clear the bar.",
+ "Do not copy secrets or sensitive production data into the local environment, touch production without approval, or count an untested or blocked surface as passing. Preserve the inventory, bug log, environment differences, and final evidence for review.",
  keywords: [
- "AI product evaluation",
- "full product testing",
- "response scoring",
- "quality benchmark",
- "feature coverage",
+ "production-grade QA",
+ "production-like local testing",
+ "exhaustive product testing",
+ "real user testing",
+ "UI control coverage",
+ "edge case testing",
+ "bug documentation",
+ "full regression testing",
  ],
  related: ["quality-streak-loop", "production-data-cleanup-loop"],
  },
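
The stop rule in the new steps ("stop only at a clean full pass or an explicit blocked handoff") and the note that an untested or blocked surface never counts as passing can be sketched as a small check. This is a minimal illustration only: the record shape (`id`, `status`), the status names, and the `evaluateRun` helper are hypothetical and do not come from the repository, and reading "blocked handoff" as "nothing failing or untested, at least one item blocked" is an assumption.

```javascript
// Hypothetical inventory statuses; these names are illustrative, not from the repository.
const STATUSES = ["pass", "fail", "blocked", "untested"];

// Decide whether a full-inventory run may stop, given one record per inventoried surface.
function evaluateRun(inventory) {
  const counts = { pass: 0, fail: 0, blocked: 0, untested: 0 };
  for (const item of inventory) {
    if (!STATUSES.includes(item.status)) {
      throw new Error(`Unknown status for ${item.id}: ${item.status}`);
    }
    counts[item.status] += 1;
  }

  let outcome;
  if (inventory.length > 0 && counts.pass === inventory.length) {
    // Every inventoried surface passed: a clean full pass.
    outcome = "clean-pass";
  } else if (counts.fail === 0 && counts.untested === 0 && counts.blocked > 0) {
    // Assumption: a handoff is allowed only when the remaining items are explicitly blocked.
    outcome = "blocked-handoff";
  } else {
    // Failing or untested surfaces remain, so the loop keeps going.
    outcome = "continue";
  }
  return { outcome, counts };
}

// Usage with made-up surface names.
const sampleInventory = [
  { id: "login-form", status: "pass" },
  { id: "export-modal", status: "blocked" },
];
const result = evaluateRun(sampleInventory);
console.log(result.outcome); // prints "blocked-handoff"
```

Keeping "untested" as its own status, rather than defaulting missing results to "pass", is what stops an unexercised surface from silently counting toward a clean run.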

site/catalog.json

Lines changed: 21 additions & 17 deletions
@@ -478,28 +478,32 @@
  },
  "author": "Matthew Berman",
  "published": "2026-06-16",
- "modified": "2026-06-17",
+ "modified": "2026-06-21",
  "description": "A comprehensive product-quality workflow that evaluates realistic scenarios across every major capability, fixes weak outcomes, and reruns them to the defined bar.",
- "useWhen": "Use this for an end-to-end product evaluation when quality must be measured across the full feature set rather than a narrow regression or a few hand-picked examples.",
- "prompt": "Create [N] realistic scenarios covering every major capability. Before testing, define clear success criteria and choose a consistent evaluation method, such as pass/fail checks or a scoring rubric. Run every scenario under the same conditions and record evidence for each outcome. Fix the underlying cause of anything that does not meet the criteria, rerun the affected scenarios, and then rerun the complete set. Continue until every scenario meets the original quality bar.",
+ "useWhen": "Use this for an exhaustive, end-to-end application QA pass when a production-like local environment and complete interactive-surface coverage matter more than a narrow regression or sample of major features.",
+ "prompt": "Build sanitized, production-scale local data under production-like settings. Inventory every user-facing feature, role, route, button, input, modal, state, and workflow; define documented acceptance criteria and finite risk-based edge cases for each. Test as a real user, logging every bug with reproduction evidence. Review findings for shared causes and dependencies; implement coherent fixes with regression tests, then rerun the full inventory. Stop at a clean pass or blocked handoff. Ask before production, sensitive data, or destructive actions.",
  "verification": {
- "title": "Every one of the [N] scenarios meets the defined quality bar.",
- "detail": "The final evaluated run covers every major capability under the original conditions."
+ "title": "Every inventoried product surface meets its documented acceptance criteria.",
+ "detail": "The final full regression run covers every inventoried surface and its finite risk-based edge cases in the production-like local environment, with each reproducible bug fixed and backed by evidence."
  },
  "steps": [
- "List every major capability, define the success criteria and evaluation method, choose [N], and allocate realistic scenarios across the product surface.",
- "Run the full set under consistent conditions and evaluate every outcome with evidence.",
- "Document each scenario that misses the criteria, fix the underlying issue, and add focused regression coverage where appropriate.",
- "Rerun affected scenarios and then the complete set until every outcome meets the original quality bar."
- ],
- "why": "A fixed capability map and consistent evaluation method make product quality visible across the whole system. Requiring a final complete run catches fixes that improve one scenario while weakening another.",
- "implementationNote": "Keep the scenario set representative and preserve failed examples. Aggregate results can hide severe misses, so require every scenario to clear the bar.",
+ "Build a sanitized or synthetic production-scale local dataset, mirror safe production settings, and record unavoidable differences.",
+ "Inventory every user-facing feature, role, route, control, state, and workflow; define documented acceptance criteria and a finite risk-based edge-case set for each item.",
+ "Exercise every inventory item as a real user under its normal and defined edge-case conditions, logging each bug immediately with reproducible evidence.",
+ "Review the complete bug set for shared causes, dependencies, and conflicting fixes, then implement the smallest coherent solution with regression coverage.",
+ "Rerun affected paths and the complete inventory; stop only at a clean full pass or an explicit blocked handoff."
+ ],
+ "why": "A finite surface inventory prevents major controls and states from disappearing behind a few happy-path scenarios. Reviewing all findings before fixing them exposes shared causes and interactions, while the final full run catches changes that repair one path but weaken another.",
+ "implementationNote": "Do not copy secrets or sensitive production data into the local environment, touch production without approval, or count an untested or blocked surface as passing. Preserve the inventory, bug log, environment differences, and final evidence for review.",
  "keywords": [
- "AI product evaluation",
- "full product testing",
- "response scoring",
- "quality benchmark",
- "feature coverage"
+ "production-grade QA",
+ "production-like local testing",
+ "exhaustive product testing",
+ "real user testing",
+ "UI control coverage",
+ "edge case testing",
+ "bug documentation",
+ "full regression testing"
  ],
  "related": [
  {

site/catalog.md

Lines changed: 4 additions & 4 deletions
@@ -94,10 +94,10 @@ URL above.
  ## 010 — [The full product evaluation loop](https://signals.forwardfuture.ai/loop-library/loops/full-product-evaluation-loop/)

  - Category: Evaluation
- - Use when: Use this for an end-to-end product evaluation when quality must be measured across the full feature set rather than a narrow regression or a few hand-picked examples.
- - Prompt: Create [N] realistic scenarios covering every major capability. Before testing, define clear success criteria and choose a consistent evaluation method, such as pass/fail checks or a scoring rubric. Run every scenario under the same conditions and record evidence for each outcome. Fix the underlying cause of anything that does not meet the criteria, rerun the affected scenarios, and then rerun the complete set. Continue until every scenario meets the original quality bar.
- - Verify: Every one of the [N] scenarios meets the defined quality bar. The final evaluated run covers every major capability under the original conditions.
- - Keywords: AI product evaluation, full product testing, response scoring, quality benchmark, feature coverage
+ - Use when: Use this for an exhaustive, end-to-end application QA pass when a production-like local environment and complete interactive-surface coverage matter more than a narrow regression or sample of major features.
+ - Prompt: Build sanitized, production-scale local data under production-like settings. Inventory every user-facing feature, role, route, button, input, modal, state, and workflow; define documented acceptance criteria and finite risk-based edge cases for each. Test as a real user, logging every bug with reproduction evidence. Review findings for shared causes and dependencies; implement coherent fixes with regression tests, then rerun the full inventory. Stop at a clean pass or blocked handoff. Ask before production, sensitive data, or destructive actions.
+ - Verify: Every inventoried product surface meets its documented acceptance criteria. The final full regression run covers every inventoried surface and its finite risk-based edge cases in the production-like local environment, with each reproducible bug fixed and backed by evidence.
+ - Keywords: production-grade QA, production-like local testing, exhaustive product testing, real user testing, UI control coverage, edge case testing, bug documentation, full regression testing
  - Related: [The quality streak loop](https://signals.forwardfuture.ai/loop-library/loops/quality-streak-loop/), [The production data cleanup loop](https://signals.forwardfuture.ai/loop-library/loops/production-data-cleanup-loop/)

  ## 011 — [The test-suite speed loop](https://signals.forwardfuture.ai/loop-library/loops/test-suite-speed-loop/)
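
The catalog lines in this hunk line up field by field with the entry in `scripts/loop-data.mjs`: the Verify line reads as `verifyTitle` followed by `verifyDetail`, and the Keywords line as the `keywords` array joined with commas. A minimal sketch of that mapping follows, assuming the catalog is generated from the loop objects. The commit does not show the generator, so `renderCatalogFields` is a hypothetical helper, and the sample values are shortened placeholders rather than the real strings.

```javascript
// Hypothetical helper: renders the four catalog bullets that changed in this commit
// from a loop object shaped like the entries in scripts/loop-data.mjs.
function renderCatalogFields(loop) {
  return [
    `- Use when: ${loop.useWhen}`,
    `- Prompt: ${loop.prompt}`,
    // The catalog's Verify bullet appears to be the title and detail joined by a space.
    `- Verify: ${loop.verifyTitle} ${loop.verifyDetail}`,
    `- Keywords: ${loop.keywords.join(", ")}`,
  ].join("\n");
}

// Shortened placeholder values, not the full strings from the commit.
const sampleLoop = {
  useWhen: "Use this for an exhaustive, end-to-end application QA pass.",
  prompt: "Build sanitized, production-scale local data.",
  verifyTitle: "Every inventoried product surface meets its documented acceptance criteria.",
  verifyDetail: "The final full regression run covers every inventoried surface.",
  keywords: ["production-grade QA", "edge case testing"],
};
console.log(renderCatalogFields(sampleLoop));
```

Deriving every published format from one source object would explain why the same wording changes appear across all the site files in this commit.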

site/catalog.txt

Lines changed: 4 additions & 4 deletions
@@ -94,10 +94,10 @@ URL above.
  ## 010 — [The full product evaluation loop](https://signals.forwardfuture.ai/loop-library/loops/full-product-evaluation-loop/)

  - Category: Evaluation
- - Use when: Use this for an end-to-end product evaluation when quality must be measured across the full feature set rather than a narrow regression or a few hand-picked examples.
- - Prompt: Create [N] realistic scenarios covering every major capability. Before testing, define clear success criteria and choose a consistent evaluation method, such as pass/fail checks or a scoring rubric. Run every scenario under the same conditions and record evidence for each outcome. Fix the underlying cause of anything that does not meet the criteria, rerun the affected scenarios, and then rerun the complete set. Continue until every scenario meets the original quality bar.
- - Verify: Every one of the [N] scenarios meets the defined quality bar. The final evaluated run covers every major capability under the original conditions.
- - Keywords: AI product evaluation, full product testing, response scoring, quality benchmark, feature coverage
+ - Use when: Use this for an exhaustive, end-to-end application QA pass when a production-like local environment and complete interactive-surface coverage matter more than a narrow regression or sample of major features.
+ - Prompt: Build sanitized, production-scale local data under production-like settings. Inventory every user-facing feature, role, route, button, input, modal, state, and workflow; define documented acceptance criteria and finite risk-based edge cases for each. Test as a real user, logging every bug with reproduction evidence. Review findings for shared causes and dependencies; implement coherent fixes with regression tests, then rerun the full inventory. Stop at a clean pass or blocked handoff. Ask before production, sensitive data, or destructive actions.
+ - Verify: Every inventoried product surface meets its documented acceptance criteria. The final full regression run covers every inventoried surface and its finite risk-based edge cases in the production-like local environment, with each reproducible bug fixed and backed by evidence.
+ - Keywords: production-grade QA, production-like local testing, exhaustive product testing, real user testing, UI control coverage, edge case testing, bug documentation, full regression testing
  - Related: [The quality streak loop](https://signals.forwardfuture.ai/loop-library/loops/quality-streak-loop/), [The production data cleanup loop](https://signals.forwardfuture.ai/loop-library/loops/production-data-cleanup-loop/)

  ## 011 — [The test-suite speed loop](https://signals.forwardfuture.ai/loop-library/loops/test-suite-speed-loop/)

site/feed.xml

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@
  <id>https://signals.forwardfuture.ai/loop-library/loops/full-product-evaluation-loop/</id>
  <link href="https://signals.forwardfuture.ai/loop-library/loops/full-product-evaluation-loop/" />
  <published>2026-06-16T00:00:00-07:00</published>
- <updated>2026-06-17T00:00:00-07:00</updated>
+ <updated>2026-06-21T00:00:00-07:00</updated>
  <author>
  <name>Matthew Berman</name>
  </author>

site/index.html

Lines changed: 12 additions & 11 deletions
@@ -1025,7 +1025,7 @@ <h3>
  data-category="evaluation"
  data-published="2026-06-16"
  data-featured="true"
- data-search="full product evaluation realistic tests test cases scenarios major features capabilities score responses results outcomes success criteria pass fail scoring rubric evidence quality bar rerun matthew berman"
+ data-search="production grade product qa full app testing production like local dataset real user every feature role route button input modal state workflow edge case bug documentation regression fix matthew berman"
  >
  <td class="cell-loop">
  <div class="loop-meta">
@@ -1038,17 +1038,18 @@ <h3>
  The full product evaluation loop
  </a>
  </h3>
- <p class="loop-summary">Tests every major product capability and fixes outcomes below the quality bar.</p>
+ <p class="loop-summary">Recreates production locally, tests every product surface, and fixes all verified bugs holistically.</p>
  <p data-prompt>
- Create [N] realistic scenarios covering every major
- capability. Before testing, define clear success criteria
- and choose a consistent evaluation method, such as pass/fail
- checks or a scoring rubric. Run every scenario under the
- same conditions and record evidence for each outcome. Fix
- the underlying cause of anything that does not meet the
- criteria, rerun the affected scenarios, and then rerun the
- complete set. Continue until every scenario meets the
- original quality bar.
+ Build sanitized, production-scale local data under
+ production-like settings. Inventory every user-facing
+ feature, role, route, button, input, modal, state, and
+ workflow; define documented acceptance criteria and finite
+ risk-based edge cases for each. Test as a real user,
+ logging every bug with reproduction evidence. Review
+ findings for shared causes and dependencies; implement
+ coherent fixes with regression tests, then rerun the full
+ inventory. Stop at a clean pass or blocked handoff. Ask
+ before production, sensitive data, or destructive actions.
  </p>
  </td>
  <td class="cell-action">
