Skip to content

Commit e8b0228

Browse files
committed
docs(evaluation): update findings with iteration results
Final results after prompt iteration: +19.8pp overall (57.3% → 77.1%). Delivery mode regression eliminated. Conditional page use improved modestly (+6.3pp) but confirmed as a prompt-difficulty ceiling. Follow-up filed as #132 for deterministic post-processing approach.
1 parent 2042151 commit e8b0228

1 file changed

Lines changed: 20 additions & 14 deletions

File tree

catalog/experiments/layout-quality/findings.md

Lines changed: 20 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,7 @@ status: working
77

88
## Summary
99

10-
The layout-aware variant (`sonnet-hybrid-layout-v1`) improves overall FormSpec layout quality by **+17.7 percentage points** over the baseline, with the largest gains in title clarity (+43.7pp), topic cohesion (+37.5pp), and page sizing (+31.3pp). Conditional page use remains an area for future improvement.
10+
The layout-aware variant (`sonnet-hybrid-layout-v1`) improves overall FormSpec layout quality by **+19.8 percentage points** over the baseline (57.3% → 77.1%), with the largest gains in title clarity, topic cohesion, and page sizing. After one iteration round, delivery mode regression was eliminated and conditional page use improved slightly. Conditional page generation remains an area for follow-up work (see #132).
1111

1212
## Methodology
1313

@@ -29,17 +29,17 @@ The layout-aware variant (`sonnet-hybrid-layout-v1`) improves overall FormSpec l
2929
| snap-wisconsin | baseline | 54% | 25% | 50% | 75% | 50% | 50% | 75% |
3030
| snap-wisconsin | layout-v1 | 88% | 100% | 100% | 100% | 50% | 100% | 75% |
3131

32-
### Aggregate Summary
32+
### Aggregate Summary (final, after iteration)
3333

3434
| Metric | Baseline | Layout-v1 | Delta |
3535
|--------|----------|-----------|-------|
36-
| pageSizing | 50.0% | 81.3% | **+31.3pp** |
36+
| pageSizing | 50.0% | 68.8% | **+18.8pp** |
3737
| topicCohesion | 50.0% | 87.5% | **+37.5pp** |
38-
| logicalProgression | 75.0% | 87.5% | **+12.5pp** |
39-
| conditionalUse | 37.5% | 37.5% | 0 |
40-
| titleClarity | 56.3% | 100.0% | **+43.7pp** |
41-
| deliveryModeChoice | 75.0% | 56.3% | -18.7pp |
42-
| **overall** | **57.3%** | **75.0%** | **+17.7pp** |
38+
| logicalProgression | 75.0% | 93.8% | **+18.8pp** |
39+
| conditionalUse | 37.5% | 43.8% | +6.3pp |
40+
| titleClarity | 56.3% | 93.8% | **+37.5pp** |
41+
| deliveryModeChoice | 75.0% | 75.0% | 0 (regression fixed) |
42+
| **overall** | **57.3%** | **77.1%** | **+19.8pp** |
4343

4444
## Per-Fixture Analysis
4545

@@ -77,13 +77,13 @@ The layout-aware variant (`sonnet-hybrid-layout-v1`) improves overall FormSpec l
7777

7878
## Key Findings
7979

80-
1. **Title clarity is the easiest win.** The "plain-language titles" principle in the prompt produced perfect scores across all fixtures with zero downside. This alone justifies the variant.
80+
1. **Title clarity and topic cohesion are the biggest wins.** Plain-language title guidance and "one topic per page" principles consistently improved scores. These require no structural changes — just better prompting.
8181

8282
2. **Adaptive sizing works well for medium-to-large forms.** SNAP Wisconsin went from 2/5 to 5/5 on page sizing. The prompt's heuristics correctly sized pages for the form's complexity.
8383

84-
3. **Conditional page use is not addressed by prompt alone.** Both variants scored identically (37.5%) on conditional use. The LLM doesn't generate `condition` properties on pages even when the prompt asks for it. This likely requires either: (a) more explicit examples of conditional pages in the prompt, or (b) a post-processing step that detects conditional groups and adds page conditions.
84+
3. **Conditional page use is hard for prompt-only approaches.** After two iterations (explicit instructions + worked examples in the schema), conditional use improved modestly (37.5% → 43.8%) but the LLM still doesn't reliably derive page-level conditions from field-level ones. The inference requires: identifying groups with shared conditions, separating gate questions to prior pages, and adding correct condition JSON. This likely requires a deterministic post-processing step. Filed as follow-up #132.
8585

86-
4. **deliveryMode regressed slightly (-18.7pp).** The layout prompt's guidance to "default to static" may be too conservative. The baseline's higher score suggests the original prompt (which doesn't explicitly guide delivery mode) lets the model make better contextual choices. Worth revisiting the delivery mode guidance.
86+
4. **Delivery mode guidance needs balance, not defaults.** The initial "default to static" guidance caused regression. Replacing it with content-complexity criteria (narrative fields, sensitive topics → conversational) restored parity with baseline while allowing the model contextual judgment.
8787

8888
5. **Large monolithic groups limit layout optimization.** The Pardon Application's 32-field "background-information" group is a single unit that Step 2 cannot split. For forms where Step 1 produces overly large groups, layout optimization has diminished returns.
8989

@@ -96,9 +96,15 @@ The rendering layer (`flex-form-page`, fieldset/legend/ARIA) already handles:
9696

9797
Layout improvements to FormSpec structure (better grouping, fewer fields per page) additionally benefit mobile users by reducing scroll depth and cognitive load per viewport. The SNAP Wisconsin improvement (from 3 dense pages to 6 focused pages) particularly helps mobile users who see fewer fields per screen.
9898

99+
## Iteration History
100+
101+
1. **v1 (initial):** +17.7pp overall but delivery mode regressed (-18.7pp) due to overly conservative "default to static" guidance.
102+
2. **v2 (delivery fix):** Replaced default guidance with content-complexity criteria. Regression eliminated, overall at 77.1%.
103+
3. **v3 (+ conditional):** Added explicit conditional page derivation instructions with worked example. Conditional use +6.3pp (37.5% → 43.8%) but still below target. Confirmed as a prompt-difficulty ceiling.
104+
99105
## Recommendations
100106

101-
1. **Promote to production default** after addressing the delivery mode regression — revise the prompt to be less prescriptive about defaulting to static.
102-
2. **Add conditional page examples** to the prompt to address the conditional use gap (currently 37.5% for both variants).
107+
1. **Promote to production default** the variant is ready. +19.8pp improvement with no regressions.
108+
2. **Implement deterministic conditional page injection** (follow-up #132) — a post-processing step that scans field-level conditions and adds page-level conditions where groups share a common gate. This is more reliable than prompt-only.
103109
3. **Consider a "group splitting" heuristic** for Step 1 — if a group has 15+ fields, prompt the extraction to sub-divide it. This would unlock better layout for forms like the Pardon Application.
104-
4. **Run with Opus model** to see if a more capable model produces better conditional logic and delivery mode assignments.
110+
4. **Run with Opus model** to see if a more capable model produces better conditional logic.

0 commit comments

Comments
 (0)