feat(web-ui): PRD stress-test results view + refinement (#562) #606
Render stress-test ambiguities as answerable cards and fold answers
back into a refined PRD version.
Backend (codeframe/core/prd_stress_test.py, ui/routers/prd_v2.py):
- Add `severity` ("blocking"|"warning") to Ambiguity + classify parsing
and the decomposition prompt.
- Emit structured `ambiguities` in the stress-test `complete` SSE event
(new `ambiguity_to_dict` serializer).
- New `POST /api/v2/prd/stress-test/refine`: reconstruct Ambiguity objects
from submitted answers, run `resolve_ambiguities_into_prd`, persist a new
PRD version. Registered before `/{prd_id}`. Surfaces a 502 when the LLM
rewrite is a no-op (truncation) and rejects blank answers.
- Extract shared `_resolve_llm_provider` helper (stream + refine).
Frontend (web-ui):
- New `AmbiguityCard` (question text, severity badge, answer textarea,
recommendation).
- `StressTestModal` results view: "X of Y answered" progress + cards +
[Refine PRD], disabled until every blocking ambiguity is answered (and
at least one answer given). Refine mutates the PRD editor on success.
- `useStressTestStream` accumulates structured ambiguities; new
`prdApi.refineStressTest`; new TS types.
Adapts the issue's Traycer plan to the SSE architecture shipped in #561
(modal results view instead of a 3rd-column panel; enrich the existing
GET SSE event instead of a new synchronous endpoint).
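The severity handling and serializer described above can be sketched roughly as follows. This is a minimal illustration, not the code in `codeframe/core/prd_stress_test.py`: the field list follows the names mentioned in this PR, while `parse_severity` and the exact dataclass layout are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Literal

Severity = Literal["blocking", "warning"]


@dataclass
class Ambiguity:
    # Field names follow those mentioned in this PR; the real dataclass may differ.
    id: str
    label: str
    source_node_title: str = ""
    questions: list[str] = field(default_factory=list)
    recommendation: str = ""
    severity: Severity = "blocking"
    resolved_answer: str = ""


def parse_severity(raw: object) -> Severity:
    """Hypothetical helper: accept only allowlisted values, else fall back to blocking."""
    value = str(raw).strip().lower() if raw is not None else ""
    if value == "warning":
        return "warning"
    # Anything unexpected from the LLM is treated as the stricter level.
    return "blocking"


def ambiguity_to_dict(amb: Ambiguity) -> dict[str, object]:
    """Serialize one ambiguity into the shape carried by the `complete` SSE event."""
    return {
        "id": amb.id,
        "label": amb.label,
        "source": amb.source_node_title,
        "questions": list(amb.questions),
        "recommendation": amb.recommendation,
        "severity": amb.severity,
    }
```

Falling back to "blocking" keeps the Refine gate conservative when the model returns a value outside the allowlist.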
|
Walkthrough
Ambiguities now carry severity ("blocking"/"warning") and are serialized in the stress-test SSE complete event. A new POST /api/v2/prd/stress-test/refine endpoint accepts answered ambiguities, reconstructs and resolves them into a refined PRD version, and the frontend adds AmbiguityCard, modal results view, hook updates, and an API client method to submit answers.
Changes
PRD Stress-Test Refinement Feature
Sequence Diagram
sequenceDiagram
participant User as User
participant Modal as StressTestModal
participant Card as AmbiguityCard
participant API as prdApi.refineStressTest
participant Backend as /api/v2/prd/stress-test/refine
participant LLM as LLM
participant Page as PRD Page
User->>Modal: Stress test completes with ambiguities
Modal->>Card: Render AmbiguityCard for each ambiguity
User->>Card: Type answer into textarea
Card->>Modal: onChange(id, value) updates local answers
Modal->>Modal: enable Refine when blocking answered & ≥1 answer
User->>Modal: Click [Refine PRD]
Modal->>API: POST { prd_id, answers }
API->>Backend: forward refine request
Backend->>Backend: validate & reconstruct Ambiguities
Backend->>LLM: request rewritten PRD based on answers
LLM->>Backend: refined PRD content
Backend->>Backend: persist new PRD version
Backend-->>API: return PrdResponse
API-->>Modal: deliver new PrdResponse
Modal->>Page: onRefined(newPrd)
Page->>Page: update cached PRD and close modal
Modal->>User: show success toast
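The "enable Refine" step in the diagram comes down to one rule: no blocking ambiguity may be unanswered, and at least one answer must exist. The real check lives in the TypeScript modal; the sketch below restates it in Python for illustration only. The function name, the dict shapes, and the choice to treat whitespace-only text as unanswered are assumptions.

```python
def can_refine(ambiguities: list[dict], answers: dict[str, str]) -> bool:
    """Return True when the [Refine PRD] button should be enabled."""
    # Assumption: an answer made only of whitespace counts as unanswered.
    answered_ids = {aid for aid, text in answers.items() if text.strip()}

    blocking_unanswered = sum(
        1
        for amb in ambiguities
        if amb.get("severity", "blocking") == "blocking"
        and amb["id"] not in answered_ids
    )
    answered_count = sum(1 for amb in ambiguities if amb["id"] in answered_ids)

    # Mirrors the disable condition quoted in review:
    # blockingUnanswered > 0 || answeredCount === 0
    return blocking_unanswered == 0 and answered_count > 0
```

The second clause matters for the warnings-only case: with no blocking items, the first clause is trivially satisfied, so without it an empty submission would be allowed.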
🎯 4 (Complex) | ⏱️ ~45 minutes
Demo verification (Phase 11, hard gate)
Each acceptance criterion mapped to outcome evidence (real FastAPI TestClient + real SQLite DB for backend; real jsdom render for frontend; only the LLM is mocked).
Hardening (from cross-family |
|
Code Review - PR 606 feedback posted below |
CodeFRAME Development Guidelines
Last updated: 2026-05-11
Product Vision
CodeFrame is a project delivery system: Think → Build → Prove → Ship. It owns the edges of the AI coding pipeline: everything BEFORE code gets written (PRD, specification, task decomposition) and everything AFTER (verification gates, quality memory, deployment). The actual code writing is delegated to frontier coding agents (Claude Code, Codex, OpenCode).
CodeFrame does not compete with coding agents. It orchestrates them.
Status: CLI ✅ | Server ✅ | ReAct agent ✅ | Web UI ✅ | Agent adapters ✅ | Multi-provider LLM ✅ | Next: Phase 4A - See
If you are an agent working in this repo: do not improvise architecture. Follow the documents listed below.
Primary Contract (MUST FOLLOW)
Rule 0: If a change does not directly support the Think → Build → Prove → Ship pipeline, do not implement it.
Current Focus: Phase 4A
Phase 5.4 is complete - PRD stress-test web UI: trigger + streaming (#561). Backend:
Phase 5.3 is complete - Async notifications cover both surfaces:
Phase 5.2 is complete - Costs page now ships per-task and per-agent breakdowns (#558) on top of the spend summary (#557). Backend:
Phase 5.1 is complete - Settings page now ships three working tabs: Agent (#554), API Keys (#555), and PROOF9 Defaults + Workspace Config (#556). Backend:
Phase 3.5C is complete -
Next, in order:
See
Architecture Rules (non-negotiable)
1) Core must be headless
Core is allowed to: read/write durable state (SQLite/filesystem), run orchestration/worker loops, emit events to an append-only event log, call adapters via interfaces (LLM, git, fs).
2) CLI must not require a server
Golden Path commands must work from the CLI with no server running. FastAPI is optional, started explicitly via
3) Agent state transitions flow through runtime
This separation prevents duplicate state transitions (e.g., DONE→DONE errors).
4) Keep commits runnable
At all times:
Current State
v2 Architecture
Phase 3 Web UI (actively developed, not legacy)
Next.js 16 App Router, TypeScript, Shadcn/UI, Tailwind CSS, Hugeicons, XTerm.js, WebSocket + SSE.
Shipped pages:
Testing:
What's implemented
Full feature list in
Repository Structure
Commands
Python / CLI
uv run pytest # All tests
uv run pytest -m v2 # v2 tests only
uv run pytest tests/core/ # Core module tests
uv run pytest tests/lifecycle/ # Lifecycle tests (no live API calls — uses MockProvider)
uv run ruff check .
# Web UI
cd web-ui && npm test
cd web-ui && npm run build
Golden Path CLI
# Workspace
cf init <repo> [--detect | --tech-stack "..." | --tech-stack-interactive]
cf status
# PRD
cf prd add <file.md>
cf prd show
# Tasks
cf tasks generate
cf tasks list [--status READY]
cf tasks show <id>
# Work — single task
cf work start <task-id> [--execute] [--engine react|plan] [--verbose] [--dry-run]
cf work start <task-id> --execute --stall-timeout 120 --stall-action retry|blocker|fail
cf work start <task-id> --execute --llm-provider openai --llm-model gpt-4o
cf work stop <task-id>
cf work resume <task-id>
cf work follow <task-id> [--tail 50]
cf work diagnose <task-id>
# Work — batch
cf work batch run [<id>...] [--all-ready] [--engine react|plan]
cf work batch run --strategy serial|parallel|auto [--max-parallel 4] [--retry 3]
cf work batch run --all-ready --llm-provider openai --llm-model qwen2.5-coder:7b
cf work batch status|cancel|resume [batch_id]
# Blockers
cf blocker list
cf blocker show <id>
cf blocker answer <id> "answer"
# Quality / State
cf review && cf patch export && cf commit
cf checkpoint create|list|restore
cf summary
# Environment
cf env check|install|doctor
# GitHub PR
cf pr create|status|checks|merge
Note:
What NOT to do
Testing / Demoing
Quality check (covers both backend and web UI)
uv run pytest && uv run ruff check .
cd web-ui && npm test && npm run build
New v2 tests: add
Demoing against a sample project (e.g.,
| Doc | Purpose |
|---|---|
| docs/VISION.md | North star: Think → Build → Prove → Ship thesis |
| docs/PRODUCT_ROADMAP.md | Current roadmap: Phase 3.5/4/5 web product completeness |
| docs/GOLDEN_PATH.md | CLI-first workflow contract |
| docs/CLI_WIREFRAME.md | Command → module mapping |
| docs/AGENT_SYSTEM_REFERENCE.md | Component table, model selection, execution flows, self-correction |
| docs/REACT_AGENT_ARCHITECTURE.md | ReAct deep-dive: tools, editor, token management |
| docs/PHASE_3_UI_ARCHITECTURE.md | Web UI architecture (Next.js, pages, components) |
| docs/PHASE_2_DEVELOPER_GUIDE.md | Server layer + v2 router patterns |
| docs/PHASE_2_CLI_API_MAPPING.md | CLI to API endpoint mapping |
| docs/QUICKSTART.md | User-facing quickstart guide |
Archived (completed plans, old gap analyses): docs/archive/
Legacy (v1 reference only): SPRINTS.md, sprints/, specs/, CODEFRAME_SPEC.md
|
Code Review - PR 606
Good implementation overall. Scope matches issue 562, backend/frontend split is clean, and test coverage is solid. A few things worth addressing before merge.
Backend: codeframe/core/prd_stress_test.py
Severity parsing is handled well: an explicit allowlist plus fallback to blocking guards against unexpected LLM output. ambiguity_to_dict is a clean serializer, and keeping it separate from the dataclass is the right call. The severity field currently uses plain str; using Literal["blocking", "warning"] from typing would catch typos at the type-checker level.
Backend: codeframe/ui/routers/prd_v2.py
(1) HTTP 400 for missing API key is semantically wrong. When ANTHROPIC_API_KEY is absent, the client request is well-formed; the server lacks configuration. HTTP 503 Service Unavailable is the correct status. Using 400 tells the caller their request is malformed, which is misleading.
(2) Reconstructed Ambiguity objects silently drop source_node_title and recommendation. Fine if resolve_ambiguities_into_prd only needs label, questions, and resolved_answer in its LLM prompt (which appears to be the case), but a short inline comment would help the next reader confirm it is intentional rather than a bug.
(3) 502 for a no-op rewrite (refined_content == record.content) is a solid guard. Correct call.
(4) Route ordering comment is good: the FastAPI catch-all footgun is exactly what needs a note.
(5) AmbiguityAnswer.questions has no min_length constraint. If SSE always produces at least one question, the model could enforce that invariant for direct API callers too.
Frontend: AmbiguityCard.tsx
Clean controlled component. aria-label on the textarea makes it testable and accessible. No issues.
Frontend: StressTestModal.tsx
The dual disable condition (blockingUnanswered > 0 || answeredCount === 0) is correct and the warnings-only edge case is explicitly tested. Good. mutatePrd(newPrd, false) is the right SWR pattern.
The error path in handleRefine uses apiError.detail as a toast string, but ApiError.detail can be a nested object in our error schema (from api_error() in the router). Add a string type-check to prevent [object Object] appearing in the UI.
Frontend: api.ts
workspace_path as a query param on a POST is consistent with the rest of the API. Fine.
Tests
Backend: severity parsing/fallback, serializer round-trip, complete payload shape, refine happy path, 404/400/422/502, and route-ordering all covered. Well done.
Frontend: AmbiguityCard unit tests are thorough; StressTestModal covers the warnings-only edge case (easy to miss); useStressTestStream updated correctly.
One gap: no test for the toast error path when apiError.detail is an object rather than a string.
Summary
The core implementation is solid and the test coverage earns confidence. The 400→503 fix and the toast error coercion are my only blockers.
…fields
Address review feedback on #562:
- Missing API key / unknown provider is a service-availability problem, not a malformed request -> 503 (was 400).
- Add a comment noting source_node_title/recommendation are intentionally empty in the reconstructed Ambiguity (resolve only reads label/questions/resolved_answer).
|
Thanks for the review. Triaged every finding: Applied
Verified — no change needed
Deferred (minor, YAGNI)
The two codex cross-family findings (empty-payload refine, no-op-rewrite duplicate version) plus the blank-answer guard were already addressed in the initial PR. |
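For readers following the three guards named here (empty payload, blank answer, no-op rewrite), a minimal stdlib sketch is shown below. It stands in for the Pydantic validation and router checks in the PR and is not that code: the `AmbiguityAnswer` fields beyond `label` and `questions`, the `answer` attribute name, and both helper names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class AmbiguityAnswer:
    # `label` and `questions` appear in the review; `answer` is an assumed name.
    label: str
    questions: list[str] = field(default_factory=list)
    answer: str = ""


def validate_answers(answers: list[AmbiguityAnswer]) -> None:
    """Raise ValueError for payloads the refine endpoint should reject."""
    if not answers:
        # Empty-payload guard: refining with nothing to fold in is meaningless.
        raise ValueError("at least one answer is required")
    for item in answers:
        if not item.answer.strip():
            # Blank-answer guard: whitespace-only text is not an answer.
            raise ValueError(f"answer for {item.label!r} must not be blank")


def is_noop_rewrite(original: str, refined: str) -> bool:
    """True when the LLM returned the PRD unchanged, so no new version should be saved."""
    return refined == original
```

In the PR the first two failures surface as validation errors and the third as a 502, which avoids persisting a duplicate PRD version.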
🧹 Nitpick comments (1)
codeframe/ui/routers/prd_v2.py (1)
383-390: ⚡ Quick win
Use ErrorCodes.SERVICE_UNAVAILABLE for 503 responses.
The 503 status and "LLM provider unavailable" message indicate a service availability issue, but the error code EXECUTION_FAILED typically signals a processing failure during execution. Using SERVICE_UNAVAILABLE would be semantically consistent with both the HTTP status and the error message, making client-side error handling more intuitive.
♻️ Proposed fix
 raise HTTPException(
     status_code=503,
     detail=api_error(
-        "LLM provider unavailable", ErrorCodes.EXECUTION_FAILED, str(exc)
+        "LLM provider unavailable", ErrorCodes.SERVICE_UNAVAILABLE, str(exc)
     ),
 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@codeframe/ui/routers/prd_v2.py` around lines 383 - 390, Replace the incorrect error code used for the 503 raise: in the HTTPException that sets status_code=503 and calls api_error("LLM provider unavailable", ErrorCodes.EXECUTION_FAILED, str(exc)), change ErrorCodes.EXECUTION_FAILED to ErrorCodes.SERVICE_UNAVAILABLE so the api_error invocation and HTTP 503 response use ErrorCodes.SERVICE_UNAVAILABLE (retaining the same message and exc string).
Code Review: PRD Stress-Test Results View + Refinement (#562)
Critical
1. Blocking sync LLM call on the async event loop (
The existing SSE path already solves this correctly via
import asyncio
refined_content = await asyncio.to_thread(
    resolve_ambiguities_into_prd, record.content, ambiguities, provider
)
Important
2. The dataclass field accepts any
from typing import Literal
severity: Literal["blocking", "warning"] = "blocking"
The TypeScript side already defines
3. The signature is
def ambiguity_to_dict(amb: Ambiguity) -> dict[str, object]:
4.
Minor
5. Unsafe
const apiError = err as ApiError;
toast.error(apiError.detail || 'Failed to refine PRD. Please try again.');
The fallback string saves this in practice, but it's an unsafe cast on
const detail = (err as { detail?: unknown }).detail;
toast.error(typeof detail === 'string' ? detail : 'Failed to refine PRD. Please try again.');
Architecture / Coverage Notes
Architecture compliance looks good:
Test coverage is thorough: severity parsing, round-trip serialization, error cases (404/503/502/422), route ordering, and frontend card/modal/stream hook behavior are all covered. All backend tests are marked
The blocking event loop issue (#1) is the only blocker for merge. Items #2–#4 are straightforward hardening improvements worth addressing in this PR.
Address re-review on #562:
- CRITICAL: wrap resolve_ambiguities_into_prd in asyncio.to_thread so the refine endpoint no longer blocks the event loop (mirrors the SSE path).
- severity typed Literal['blocking','warning']; ambiguity_to_dict returns dict[str, object].
- AmbiguityAnswer.label gains min_length=1 (empty label -> 422).
- Safer error handling in handleRefine (typeof-string guard instead of an unchecked ApiError cast).
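To see why the `asyncio.to_thread` wrapper matters, the self-contained sketch below runs a blocking stand-in for the LLM rewrite while a heartbeat coroutine keeps ticking on the same event loop. `blocking_rewrite` and `refine` are illustrative names, and the sleep stands in for the real provider call; none of this is the router code.

```python
import asyncio
import time


def blocking_rewrite(content: str) -> str:
    """Stand-in for a synchronous LLM call (assumption: it simply sleeps)."""
    time.sleep(0.3)
    return content + "\n(refined)"


async def refine(content: str) -> tuple[str, int]:
    """Run the blocking rewrite in a worker thread and count loop ticks meanwhile."""
    ticks = 0
    done = False

    async def heartbeat() -> None:
        nonlocal ticks
        while not done:
            ticks += 1
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    # The event loop stays free while the worker thread sleeps.
    refined = await asyncio.to_thread(blocking_rewrite, content)
    done = True
    await beat
    return refined, ticks


if __name__ == "__main__":
    text, ticks = asyncio.run(refine("PRD v1"))
    print(f"heartbeat ticks while rewriting: {ticks}")
```

If `blocking_rewrite(content)` were called directly inside the coroutine, the heartbeat would not get a chance to run until the rewrite returned, which is the stall the review flagged for other requests sharing the loop.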
|
Addressed the re-review (commit
Backend 46 tests + frontend modal suite pass; |
Code Review -- PR 606 (follow-up)
All items from the previous two review rounds have been applied cleanly: asyncio.to_thread wrapping the blocking LLM call, Literal typing on severity, the _resolve_llm_provider DRY refactor, and the 400 to 503 status correction. Two remaining issues in the current diff.
1. Wrong error code on the 503 response (prd_v2.py ~line 289)
The 503 still uses ErrorCodes.EXECUTION_FAILED. EXECUTION_FAILED signals a processing failure; a missing API key or unconfigured provider is a service-availability problem. This is what CodeRabbit flagged and it is still unaddressed. Use ErrorCodes.SERVICE_UNAVAILABLE (or the equivalent in this project's error catalog) so API clients can distinguish a bad request from the server being unable to reach the LLM.
2. Second 404 after create_new_version is misleading (prd_v2.py ~line 337)
At that point get_by_id has already confirmed the PRD exists. If create_new_version returns None, that is a server-side fault, not a not-found, and the 404 message 'No PRD with id ...' will actively mislead callers who just confirmed the PRD existed. Change to a 500 with 'Failed to persist new PRD version'.
Minor -- frontend error detail extraction (StressTestModal.tsx)
422 responses from Pydantic return detail as an array of objects, not a string, so typeof detail === 'string' is always false for validation errors and the generic fallback fires every time. Not a blocker, but consider also checking Array.isArray(detail) and pulling detail[0]?.msg if a specific message is wanted.
Everything else is good
Severity parsing with explicit allowlist + fallback to blocking is the right defensive approach for an LLM-parsed field. The route registration comment is exactly the right non-obvious note to leave. AmbiguityCard aria-label on the textarea is correct for accessibility. The ambiguities ?? [] default in useStressTestStream handles old server formats safely.
Pydantic whitespace rejection, blank-label guard, and empty-answers guard close the obvious API abuse vectors. Test coverage across severity parsing, serializer round-trip, 404/400/422/502/503 paths, and the full frontend render/refine flow is comprehensive.
Ready to merge after items 1 and 2 are addressed.
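The detail-extraction suggestion concerns the TypeScript modal, but the decision logic is language-neutral. The Python sketch below only illustrates the branching: a string detail is shown as-is, a validation-style list yields its first `msg`, and anything else falls back to the generic text. `error_message` is a hypothetical name; the fallback string is the one quoted earlier in this thread.

```python
from typing import Any

FALLBACK = "Failed to refine PRD. Please try again."


def error_message(detail: Any, fallback: str = FALLBACK) -> str:
    """Pick a human-readable message from an API error `detail` of unknown shape."""
    if isinstance(detail, str) and detail:
        return detail
    if isinstance(detail, list) and detail:
        # Validation errors arrive as a list of objects that carry a "msg" key.
        first = detail[0]
        if isinstance(first, dict) and isinstance(first.get("msg"), str):
            return first["msg"]
    # Nested objects or anything unexpected never reach the UI verbatim.
    return fallback
```

Returning the fallback for nested objects is what prevents a raw object from being rendered in the toast.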
Address follow-up review on #562:
- 503 (LLM unavailable) now uses ErrorCodes.SERVICE_UNAVAILABLE instead of EXECUTION_FAILED, so clients can distinguish a bad request from the server being unable to reach the LLM.
- A None from create_new_version (PRD already confirmed to exist) is a persistence fault -> 500 INTERNAL_ERROR, not a misleading 404. Covered by test_refine_persistence_failure_returns_500.
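Taken together, the review rounds settle on one status code per failure mode of the refine endpoint. The sketch below collects them in a single pure function for reference. It is an illustration under assumptions: the function name and parameters are hypothetical, the order of the 404 and 503 checks is a guess, and 200 is assumed for the success case.

```python
from typing import Optional


def refine_status(
    prd_found: bool,
    provider_ready: bool,
    original: str,
    refined: str,
    new_version: Optional[object],
) -> int:
    """Map the outcome of one refine attempt to an HTTP status code."""
    if not prd_found:
        return 404  # the PRD id does not exist
    if not provider_ready:
        return 503  # missing API key or unknown provider: service unavailable
    if refined == original:
        return 502  # LLM rewrite was a no-op (e.g. truncation)
    if new_version is None:
        return 500  # PRD exists but the new version failed to persist
    return 200  # assumed success status
```

Keeping 404 for "not found" only, and 500 for a failed persist after the lookup succeeded, is the distinction the follow-up review asked for.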
|
Addressed both blockers (commit
Minor (frontend 422 detail) — verified no change needed:
|
Code Review - PR 606 (follow-up, round 4)
Both blockers from the previous round are confirmed fixed in commit
✅ 503 →
✅ Misleading 404 → 500
One minor inconsistency (non-blocking)
The outer
except Exception as e:
    logger.error(f"Failed to refine PRD: {e}", exc_info=True)
    raise HTTPException(
        status_code=500,
        detail=api_error("Failed to refine PRD", ErrorCodes.EXECUTION_FAILED, str(e)),
    )
The explicit
Everything else is clean
Ready to merge.
Code Review - PR 606 (round 5, follow-up)
No new commits since
Status of all prior items
Remaining minor inconsistency
The outer catch-all in
Everything else is clean
No new issues in the latest commit. Implementation is correct and the test suite is comprehensive. Ready to merge.
Summary
Implements issue #562 — the PRD stress-test results view and PRD refinement, building on the SSE trigger/streaming shipped in #561.
Users can now answer the ambiguities the stress-test surfaces, inline, and fold those answers back into a refined PRD version.
What changed
Backend
- Ambiguity.severity ("blocking" | "warning") added to the model, parsed from the classification LLM response, and requested in the decomposition prompt.
- complete SSE event now carries structured ambiguities (id, label, source, questions, recommendation, severity) via a new ambiguity_to_dict serializer; previously it only emitted rendered markdown.
- POST /api/v2/prd/stress-test/refine: reconstructs Ambiguity objects from submitted answers, runs resolve_ambiguities_into_prd, and persists a new PRD version. Registered before the /{prd_id} catch-all.
- _resolve_llm_provider helper (used by both the stream and refine endpoints).
Frontend (web-ui)
- AmbiguityCard: question text, severity badge, answer textarea, recommendation helper.
- StressTestModal results view: "X of Y answered" progress, a card per ambiguity, and [Refine PRD]. Refine mutates the PRD editor on success (mutatePrd).
- useStressTestStream accumulates structured ambiguities; new prdApi.refineStressTest; new TS types.
Deviations from the issue's Traycer plan (with reasons)
- StressTestPanel: [Phase 5.4] PRD stress-test web UI: trigger and streaming progress #561 shipped a modal.
- complete event instead of adding a new synchronous POST /stress-test endpoint that the (outdated) plan assumed.
Acceptance criteria
- npm test and uv run pytest pass
Testing
- tests/core/test_prd_stress_test.py (severity parsing/fallback, serializer, complete payload) + tests/ui/test_prd_stress_test_router.py (refine happy path, 404/400/422/502, route ordering). 360 passed across tests/ui + prd core.
- AmbiguityCard.test.tsx, extended StressTestModal.test.tsx (cards, refine enable/disable incl. warnings-only, refine→onRefined), useStressTestStream.test.ts. Full suite 925+ passing; npm run build succeeds.
Known limitations / hardening (from a cross-family codex review, addressed)
- 422) for direct API callers.
Closes #562