Skip to content

feat(web-ui): PRD stress-test results view + refinement (#562)#606

Merged
frankbria merged 5 commits into
mainfrom
feat/562-stress-test-results-refine
May 31, 2026
Merged

feat(web-ui): PRD stress-test results view + refinement (#562)#606
frankbria merged 5 commits into
mainfrom
feat/562-stress-test-results-refine

Conversation

@frankbria

@frankbria frankbria commented May 31, 2026

Copy link
Copy Markdown
Owner

Summary

Implements issue #562 — the PRD stress-test results view and PRD refinement, building on the SSE trigger/streaming shipped in #561.

Users can now answer the ambiguities the stress-test surfaces, inline, and fold those answers back into a refined PRD version.

What changed

Backend

  • Ambiguity.severity ("blocking" | "warning") added to the model, parsed from the classification LLM response, and requested in the decomposition prompt.
  • The stress-test complete SSE event now carries structured ambiguities (id, label, source, questions, recommendation, severity) via a new ambiguity_to_dict serializer — previously it only emitted rendered markdown.
  • POST /api/v2/prd/stress-test/refine — reconstructs Ambiguity objects from submitted answers, runs resolve_ambiguities_into_prd, and persists a new PRD version. Registered before the /{prd_id} catch-all.
  • Shared _resolve_llm_provider helper (used by both the stream and refine endpoints).

Frontend (web-ui)

  • New AmbiguityCard — question text, severity badge, answer textarea, recommendation helper.
  • StressTestModal results view"X of Y answered" progress, a card per ambiguity, and [Refine PRD]. Refine mutates the PRD editor on success (mutatePrd).
  • useStressTestStream accumulates structured ambiguities; new prdApi.refineStressTest; new TS types.

Deviations from the issue's Traycer plan (with reasons)

  • Results live in the existing modal, not a 3rd-column StressTestPanel[Phase 5.4] PRD stress-test web UI: trigger and streaming progress #561 shipped a modal.
  • Enriched the existing GET SSE complete event instead of adding a new synchronous POST /stress-test endpoint that the (outdated) plan assumed.
  • Full decomposition-tree visualization is out of scope — the acceptance criteria require only ambiguity cards; the streaming log already shows the goal breakdown (YAGNI).

Acceptance criteria

  • Ambiguity cards render with question text and answer fields
  • [Refine PRD] is disabled until all blocking questions answered
  • Refined PRD saves and is visible in the PRD editor
  • npm test and uv run pytest pass

Testing

  • Backend: tests/core/test_prd_stress_test.py (severity parsing/fallback, serializer, complete payload) + tests/ui/test_prd_stress_test_router.py (refine happy path, 404/400/422/502, route ordering). 360 passed across tests/ui + prd core.
  • Frontend: new AmbiguityCard.test.tsx, extended StressTestModal.test.tsx (cards, refine enable/disable incl. warnings-only, refine→onRefined), useStressTestStream.test.ts. Full suite 925+ passing; npm run build succeeds.

Known limitations / hardening (from a cross-family codex review, addressed)

  • Refine stays disabled until ≥1 answer is given (avoids submitting an empty payload when only warnings exist).
  • A no-op LLM rewrite (truncation fallback) returns 502 instead of silently saving a duplicate version.
  • Blank/whitespace-only answers are rejected (422) for direct API callers.

Closes #562

Summary by CodeRabbit

  • New Features

    • Ambiguities now carry a severity (blocking|warning, default blocking) and resolved-answer data; stress-test completion emits structured ambiguity results and a refine flow that creates a new PRD version.
  • Validation & Errors

    • Answers reject blank/whitespace-only input; refine flow surfaces errors for missing PRD, missing provider/API key, no-op refinements, and persistence failures.
  • UI

    • Ambiguity cards, progress indicator, gated “Refine PRD” action, toasts for success/error, and modal wiring to update the displayed PRD on refine.
  • Tests

    • Expanded coverage for severity parsing, streaming payloads, refine endpoint and UI refine flows.
  • Docs

    • Roadmap and guidance updated to reflect shipped stress-test results and refine behavior.

Render stress-test ambiguities as answerable cards and fold answers
back into a refined PRD version.

Backend (codeframe/core/prd_stress_test.py, ui/routers/prd_v2.py):
- Add `severity` ("blocking"|"warning") to Ambiguity + classify parsing
  and the decomposition prompt.
- Emit structured `ambiguities` in the stress-test `complete` SSE event
  (new `ambiguity_to_dict` serializer).
- New `POST /api/v2/prd/stress-test/refine`: reconstruct Ambiguity objects
  from submitted answers, run `resolve_ambiguities_into_prd`, persist a new
  PRD version. Registered before `/{prd_id}`. Surfaces a 502 when the LLM
  rewrite is a no-op (truncation) and rejects blank answers.
- Extract shared `_resolve_llm_provider` helper (stream + refine).

Frontend (web-ui):
- New `AmbiguityCard` (question text, severity badge, answer textarea,
  recommendation).
- `StressTestModal` results view: "X of Y answered" progress + cards +
  [Refine PRD], disabled until every blocking ambiguity is answered (and
  at least one answer given). Refine mutates the PRD editor on success.
- `useStressTestStream` accumulates structured ambiguities; new
  `prdApi.refineStressTest`; new TS types.

Adapts the issue's Traycer plan to the SSE architecture shipped in #561
(modal results view instead of a 3rd-column panel; enrich the existing
GET SSE event instead of a new synchronous endpoint).
@coderabbitai

coderabbitai Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5026054e-657c-4bc9-a6a3-464433397d5e

📥 Commits

Reviewing files that changed from the base of the PR and between 19c525c and 13cd780.

📒 Files selected for processing (2)
  • CLAUDE.md
  • docs/PRODUCT_ROADMAP.md
✅ Files skipped from review due to trivial changes (1)
  • CLAUDE.md

Walkthrough

Ambiguities now carry severity ("blocking"/"warning") and are serialized in the stress-test SSE complete event. A new POST /api/v2/prd/stress-test/refine endpoint accepts answered ambiguities, reconstructs and resolves them into a refined PRD version, and the frontend adds AmbiguityCard, modal results view, hook updates, and an API client method to submit answers.

Changes

PRD Stress-Test Refinement Feature

Layer / File(s) Summary
Ambiguity severity field and serialization
codeframe/core/prd_stress_test.py, tests/core/test_prd_stress_test.py
Ambiguity.severity field defaults to "blocking". LLM system prompt now requires severity. Parser normalizes severity from LLM output. New ambiguity_to_dict() serializer converts Ambiguity to a transport dict. Streaming "complete" event now includes ambiguities array. Tests verify default/parsed/invalid severity handling and serialization round-trip.
PRD refinement endpoint and provider resolution
codeframe/ui/routers/prd_v2.py, tests/ui/test_prd_stress_test_router.py
Adds Pydantic DTOs AmbiguityAnswer and StressTestRefineRequest (reject whitespace-only answers). Introduces _resolve_llm_provider() with env/workspace fallback and user-facing errors. SSE stream maps provider resolution failures to in-stream error frames. New POST /api/v2/prd/stress-test/refine reconstructs ambiguities from answers, calls resolver, rejects unchanged output, persists a new PRD version, and maps failures to appropriate HTTP responses. Tests cover success, 404/503/502 paths, validation errors, and route precedence.
Frontend types, hook, and API client
web-ui/src/types/prd.ts, web-ui/src/types/index.ts, web-ui/src/hooks/useStressTestStream.ts, web-ui/src/lib/api.ts, web-ui/src/__tests__/hooks/useStressTestStream.test.ts
Adds AmbiguitySeverity and StressTestAmbiguity. Extends StressTestCompleteEvent and StressTestResultData with ambiguities. Hook reads event.ambiguities into result. Adds prdApi.refineStressTest() client method. Hook tests updated to expect full ambiguity payload.
AmbiguityCard component for individual display
web-ui/src/components/prd/AmbiguityCard.tsx, web-ui/src/__tests__/components/prd/AmbiguityCard.test.tsx
New React component renders severity badge (blocking/warning), source node, question list, recommendation, and a controlled textarea; calls onChange(id, value) on edits. Tests cover rendering, badge variants, control behavior, and callback invocation.
StressTestModal results view and refinement
web-ui/src/components/prd/StressTestModal.tsx, web-ui/src/__tests__/components/prd/StressTestModal.test.tsx
Modal accepts prdId and onRefined, tracks user answers and refining state, computes answered/blocking counts, renders AmbiguityCard list, and provides a Refine PRD action that posts answers to the API, handles success/error toasts, triggers onRefined, and closes the modal. Tests cover UI states, enable/disable logic, submission payloads, and error handling.
PRD page integration and exports
web-ui/src/app/prd/page.tsx, web-ui/src/components/prd/index.ts
PRD page wires prdId and onRefined into the modal and updates cached PRD on refinement. AmbiguityCard added to PRD components barrel exports.

Sequence Diagram

sequenceDiagram
  participant User as User
  participant Modal as StressTestModal
  participant Card as AmbiguityCard
  participant API as prdApi.refineStressTest
  participant Backend as /api/v2/prd/stress-test/refine
  participant LLM as LLM
  participant Page as PRD Page

  User->>Modal: Stress test completes with ambiguities
  Modal->>Card: Render AmbiguityCard for each ambiguity
  User->>Card: Type answer into textarea
  Card->>Modal: onChange(id, value) updates local answers
  Modal->>Modal: enable Refine when blocking answered & ≥1 answer
  User->>Modal: Click [Refine PRD]
  Modal->>API: POST { prd_id, answers }
  API->>Backend: forward refine request
  Backend->>Backend: validate & reconstruct Ambiguities
  Backend->>LLM: request rewritten PRD based on answers
  LLM->>Backend: refined PRD content
  Backend->>Backend: persist new PRD version
  Backend-->>API: return PrdResponse
  API-->>Modal: deliver new PrdResponse
  Modal->>Page: onRefined(newPrd)
  Page->>Page: update cached PRD and close modal
  Modal->>User: show success toast
Loading

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly Related PRs

  • frankbria/codeframe#595: Extends PRD stress-test SSE streaming by adding serialized ambiguity payloads and severity handling, related to the streaming infrastructure used here.

"🐰 I nibble on questions, label them with care,

blocking or warning, each caught in my snare.
Cards ask, users answer, the PRD gets refined,
hop! a new version saved — neat and well-lined. 🥕"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 34.88% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding a stress-test results view and PRD refinement feature to the web UI, directly corresponding to the core changes across backend and frontend files.
Linked Issues check ✅ Passed All primary objectives from #562 are met: ambiguity cards render with questions/severity/answers (#562), refine button gates on blocking questions (#562), refined PRD persists and appears in editor (#562), and test suites pass (#562).
Out of Scope Changes check ✅ Passed All changes align with #562 scope: results view UI, refinement flow, severity handling, and validation/error cases; explicitly out-of-scope items (decomposition tree visualization, re-running stress-test) are correctly excluded.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/562-stress-test-results-refine

Comment @coderabbitai help to get the list of available commands and usage tips.

@frankbria

Copy link
Copy Markdown
Owner Author

Demo verification (Phase 11 — hard gate)

Each acceptance criterion mapped to outcome evidence (real FastAPI TestClient + real SQLite DB for backend; real jsdom render for frontend — only the LLM is mocked).

Criterion Evidence
1. Ambiguity cards render with question text + answer fields Live stress-test complete event carries ambiguity_count=1, label='AUTH SCOPE', severity='blocking', question='Email/password or OAuth?'. Render asserts: "renders an answerable card per ambiguity with question text", "renders label, questions, and recommendation", "reflects the controlled answer value".
2. [Refine PRD] disabled until all blocking answered "disables Refine PRD until all blocking questions are answered" (blocker unanswered → disabled; answered → enabled) + "keeps Refine PRD disabled until at least one answer is given (warnings only)".
3. Refined PRD saves + visible in editor Live POST /stress-test/refine200, version 1 → 2, content head '# Invoice SaaS (refined)', 2 versions persisted in DB. onRefinedmutatePrd reflects it in the editor ("refines the PRD and reports the new version via onRefined").
4. npm test + uv run pytest pass 925 frontend tests; 360 backend tests across tests/ui + prd core.

Hardening (from cross-family codex review): truncated LLM rewrite → 502 (no duplicate version); blank/whitespace answer → 422; empty-payload refine prevented client-side.

@claude

claude Bot commented May 31, 2026

Copy link
Copy Markdown

Code Review - PR 606 feedback posted below

@claude

claude Bot commented May 31, 2026

Copy link
Copy Markdown

CodeFRAME Development Guidelines

Last updated: 2026-05-11

Product Vision

CodeFrame is a project delivery system: Think → Build → Prove → Ship.

It owns the edges of the AI coding pipeline — everything BEFORE code gets written (PRD, specification, task decomposition) and everything AFTER (verification gates, quality memory, deployment). The actual code writing is delegated to frontier coding agents (Claude Code, Codex, OpenCode).

CodeFrame does not compete with coding agents. It orchestrates them.

THINK:  cf prd generate → cf prd stress-test → cf tasks generate
BUILD:  cf work start --engine claude-code  (or codex, opencode, built-in)
PROVE:  cf proof run  (9-gate evidence-based quality system)
SHIP:   cf pr create → cf pr merge
LOOP:   Glitch → cf proof capture → New REQ → Enforced forever

Status: CLI ✅ | Server ✅ | ReAct agent ✅ | Web UI ✅ | Agent adapters ✅ | Multi-provider LLM ✅ | Next: Phase 4A — See docs/PRODUCT_ROADMAP.md.

If you are an agent working in this repo: do not improvise architecture. Follow the documents listed below.


Primary Contract (MUST FOLLOW)

  1. Golden Path: docs/GOLDEN_PATH.md — the only workflow we build until it works end-to-end
  2. Command Tree + Module Mapping: docs/CLI_WIREFRAME.md — CLI commands → core modules
  3. Product Roadmap: docs/PRODUCT_ROADMAP.md — current phase plan (Phase 3.5/4/5)
  4. Vision: docs/VISION.md — north star for all decisions
  5. Agent System Reference: docs/AGENT_SYSTEM_REFERENCE.md — agent components, execution flows

Rule 0: If a change does not directly support the Think → Build → Prove → Ship pipeline, do not implement it.

Current Focus: Phase 4A

Phase 5.4 is complete — PRD stress-test web UI: trigger + streaming (#561). Backend: GET /api/v2/prd/stress-test SSE endpoint streams goals_extracted, goal_analyzed, complete, and error events from core/prd_stress_test.py:stress_test_prd_stream(), resolving the LLM provider via the standard chain and applying the standard rate limit. Frontend: useStressTestStream hook manages the SSE connection and event accumulation; StressTestModal renders the streaming progress and is opened via a "Stress Test" button on the /prd page (enabled only when a PRD exists). Results rendering (#562) is out of scope and still pending.

Phase 5.3 is complete — Async notifications cover both surfaces:

  • Browser + in-app center ([Phase 5.3] Async notifications: browser notifications + in-app center #559): useNotifications hook with workspace-scoped localStorage persistence and browser Notification dispatch (only when tab hidden + permission granted); NotificationProvider in root layout; NotificationCenter (bell icon + dropdown) mounts in sidebar footer. BatchExecutionMonitor dispatches batch.completed on terminal status transitions (distinguishing COMPLETED/FAILED/CANCELLED in both the in-app message and the success icon) and blocker.created on per-task BLOCKED transitions. /execution requests browser permission once on mount when permission is 'default'. /proof dispatches gate.run.failed per failed gate when a proof run completes with passed === false. Known limitation: notifications only fire while BatchExecutionMonitor is mounted (cross-page background poller is out of scope; tracked for future work).
  • Outbound webhook ([Phase 5.3] Async notifications: outbound webhook #560): Settings → Notifications tab takes a single URL + enabled toggle, persisted to .codeframe/notifications_config.json via atomic_write_json. GET/PUT /api/v2/settings/notifications and POST /api/v2/settings/notifications/test (test fires a sample payload and surfaces status code). WebhookNotificationService.send_event is the generic backend; dispatched fire-and-forget (5s timeout) from core/conductor.py on BATCH_COMPLETED only (not PARTIAL/FAILED/CANCELLED), core/blockers.py:create() after BLOCKER_CREATED, and ui/routers/pr_v2.py:merge_pull_request after successful merge. Failures are logged but never break the triggering operation.

Phase 5.2 is complete — Costs page now ships per-task and per-agent breakdowns (#558) on top of the spend summary (#557). Backend: GET /api/v2/costs/tasks?days=N&limit=M (top-N tasks with titles, agent, tokens, cost) and GET /api/v2/costs/by-agent?days=N (per-agent rollup + total input/output tokens), both via TokenRepository.get_top_tasks_by_cost and get_costs_by_agent. Task board cards show an inline MoneyBag02Icon cost badge with token-breakdown tooltip when cost data exists. Fixed a v2 data-loss bug where react_agent int-cast UUID task IDs and stored NULL in token_usage.

Phase 5.1 is complete — Settings page now ships three working tabs: Agent (#554), API Keys (#555), and PROOF9 Defaults + Workspace Config (#556). Backend: GET/PUT /api/v2/proof/config and /api/v2/workspaces/config, plus run_proof() now honors enabled_gates filtering and strictness (strict vs warn). Atomic JSON writes via codeframe/ui/routers/_helpers.atomic_write_json. The 9-gate canonical order and proof_config.json filename live in codeframe/core/proof/models.py.

Phase 3.5C is completeCaptureGlitchModal form (description/markdown, source, scope, gate obligations, severity, expiry) reachable from the PROOF9 page and the persistent sidebar "Capture Glitch" button. REQ detail view (/proof/[req_id]) ships markdown description rendering, ProofScope metadata display, obligations table with Latest Run column, sortable/filterable evidence history, and empty-state CTA. Backend: ScopeOut model on RequirementResponse. Issues #568, #569.

Next, in order:

See docs/PRODUCT_ROADMAP.md for full specs and issue links.


Architecture Rules (non-negotiable)

1) Core must be headless

codeframe/core/** must NOT import FastAPI, WebSocket frameworks, HTTP request/response objects, or UI modules.

Core is allowed to: read/write durable state (SQLite/filesystem), run orchestration/worker loops, emit events to an append-only event log, call adapters via interfaces (LLM, git, fs).

2) CLI must not require a server

Golden Path commands must work from the CLI with no server running. FastAPI is optional, started explicitly via codeframe serve, and must wrap core.

3) Agent state transitions flow through runtime

  • Agent (agent.py) manages its own AgentState (IDLE, PLANNING, EXECUTING, BLOCKED, COMPLETED, FAILED)
  • Runtime (runtime.py) handles all TaskStatus transitions (BACKLOG, READY, IN_PROGRESS, DONE, BLOCKED)
  • Agent does NOT call tasks.update_status() — runtime does this based on agent state

This separation prevents duplicate state transitions (e.g., DONE→DONE errors).

4) Keep commits runnable

At all times: codeframe --help works, Golden Path stubs can run, no breaking renames/moves.


Current State

v2 Architecture

  • Core-first: Domain logic lives in codeframe/core/ (headless, no FastAPI imports)
  • CLI-first: Golden Path works without any running FastAPI server
  • Adapters: LLM providers in codeframe/adapters/llm/
  • Server/UI optional: FastAPI and UI are thin adapters over core; web UI connects via REST/WebSocket

Phase 3 Web UI (actively developed — not legacy)

Next.js 16 App Router, TypeScript, Shadcn/UI, Tailwind CSS, Hugeicons, XTerm.js, WebSocket + SSE.

Shipped pages: /, /prd, /tasks, /execution, /execution/[taskId], /blockers, /proof, /proof/[req_id], /review, /sessions, /sessions/[id], /settings.

Testing: cd web-ui && npm test must pass; npm run build must succeed. The frontend-tests CI job enforces this on every PR.

What's implemented

Full feature list in docs/PRODUCT_ROADMAP.md. Key capabilities: ReAct agent execution, batch execution (serial/parallel/auto), task dependencies, stall detection, self-correction, GitHub PR workflow, SSE streaming, API auth, rate limiting, OpenAPI docs, multi-provider LLM (Anthropic/OpenAI-compatible), agent adapters (ClaudeCode/Codex/OpenCode/Kilocode), worktree isolation, E2B cloud execution, interactive agent sessions (WebSocket chat + XTerm.js terminal), PROOF9 quality system (gate runs, per-gate evidence, run history).


Repository Structure

codeframe/
├── core/           # Headless domain + orchestration (NO FastAPI imports)
│   ├── react_agent.py, tools.py, editor.py   # ReAct engine (default)
│   ├── agent.py, planner.py, executor.py     # Plan engine (legacy --engine plan)
│   ├── runtime.py                            # Run lifecycle, engine selection
│   ├── conductor.py                          # Batch orchestration + worker pool
│   ├── dependency_graph.py, dependency_analyzer.py
│   ├── gates.py, fix_tracker.py, quick_fixes.py  # Verification + self-correction
│   ├── stall_detector.py, stall_monitor.py   # Stall detection
│   ├── tasks.py, blockers.py, prd.py, workspace.py
│   ├── context.py, state_machine.py, events.py, streaming.py
│   ├── environment.py, installer.py, diagnostics.py, diagnostic_agent.py
│   ├── credentials.py, agents_config.py
│   └── sandbox/context.py, sandbox/worktree.py   # Isolation abstractions
├── adapters/
│   ├── llm/base.py, llm/anthropic.py, llm/openai.py, llm/mock.py
│   └── e2b/        # Cloud sandbox (optional: pip install codeframe[cloud])
├── cli/app.py      # Typer CLI entry + subcommands
├── ui/             # FastAPI server (thin adapter over core)
│   ├── server.py, models.py, dependencies.py
│   └── routers/    # 16 v2 router modules
├── auth/           # API key service + auth dependencies
├── lib/            # rate_limiter.py, audit_logger.py, metrics_tracker.py
└── platform_store/ # Control-plane store: auth, api keys, audit logs,
                    # interactive sessions, token usage (slim Database + repos)

web-ui/             # Phase 3 Web UI (Next.js, actively developed)
tests/
├── core/           # Core module tests (auto-marked v2)
├── adapters/       # LLM + E2B adapter tests
├── agents/         # dependency_resolver tests
├── integration/    # Cross-module integration tests
├── lifecycle/      # End-to-end lifecycle tests (CLI + API + web, uses MockProvider)
└── ui/             # FastAPI router tests

Commands

Python / CLI

uv run pytest                     # All tests
uv run pytest -m v2               # v2 tests only
uv run pytest tests/core/         # Core module tests
uv run pytest tests/lifecycle/    # Lifecycle tests (no live API calls — uses MockProvider)
uv run ruff check .

# Web UI
cd web-ui && npm test
cd web-ui && npm run build

Golden Path CLI

# Workspace
cf init <repo> [--detect | --tech-stack "..." | --tech-stack-interactive]
cf status

# PRD
cf prd add <file.md>
cf prd show

# Tasks
cf tasks generate
cf tasks list [--status READY]
cf tasks show <id>

# Work — single task
cf work start <task-id> [--execute] [--engine react|plan] [--verbose] [--dry-run]
cf work start <task-id> --execute --stall-timeout 120 --stall-action retry|blocker|fail
cf work start <task-id> --execute --llm-provider openai --llm-model gpt-4o
cf work stop <task-id>
cf work resume <task-id>
cf work follow <task-id> [--tail 50]
cf work diagnose <task-id>

# Work — batch
cf work batch run [<id>...] [--all-ready] [--engine react|plan]
cf work batch run --strategy serial|parallel|auto [--max-parallel 4] [--retry 3]
cf work batch run --all-ready --llm-provider openai --llm-model qwen2.5-coder:7b
cf work batch status|cancel|resume [batch_id]

# Blockers
cf blocker list
cf blocker show <id>
cf blocker answer <id> "answer"

# Quality / State
cf review && cf patch export && cf commit
cf checkpoint create|list|restore
cf summary

# Environment
cf env check|install|doctor

# GitHub PR
cf pr create|status|checks|merge

Note: codeframe serve exists but Golden Path does not depend on it.


What NOT to do

  • Don't add HTTP endpoints to support CLI commands (CLI must work without a server)
  • Don't require codeframe serve for CLI workflows
  • Don't implement UI concepts (tabs, panels, progress bars) inside codeframe/core/
  • Don't "clean up the repo" as a goal — only refactor to enable the pipeline
  • Don't update task status from agent.py — let runtime.py handle transitions
  • Don't skip web UI testing when verifying features that have a web surface
  • Don't leave a CI gate disabled when its feature area becomes active. Re-enable DISABLED: / # COMMENTED OUT: jobs before the first PR in that area. Verify frontend-tests is wired into test-summary.

Testing / Demoing

Quality check (covers both backend and web UI)

uv run pytest && uv run ruff check .
cd web-ui && npm test && npm run build

New v2 tests: add @pytest.mark.v2 or pytestmark = pytest.mark.v2 at module level.

Demoing against a sample project (e.g., cf-test/)

You are observing the CodeFRAME agent's work, not doing the work yourself.

  • Do NOT help out, fix errors, or write code on behalf of the agent
  • Do NOT intervene when the agent makes mistakes — that's data
  • Report what worked, what failed, final state vs. acceptance criteria

Practical Working Mode

  1. Read docs/GOLDEN_PATH.md — confirm the change is required
  2. Find the command in docs/CLI_WIREFRAME.md
  3. Implement core functionality in codeframe/core/
  4. Call it from Typer command in codeframe/cli/
  5. Emit events + persist state
  6. Keep it runnable. Commit.

When unsure: simpler state, fewer dependencies, smaller surface area, core-first, CLI-first.


Environment Variables

ANTHROPIC_API_KEY=sk-ant-...          # Required for Anthropic provider (default)
E2B_API_KEY=e2b_...                   # Required for --engine cloud
DATABASE_PATH=./codeframe.db          # Optional

# LLM Provider selection (multi-provider support)
# Priority: CLI flag > env var > .codeframe/config.yaml > default (anthropic)
CODEFRAME_LLM_PROVIDER=anthropic      # Provider: anthropic (default), openai, ollama, vllm, compatible
CODEFRAME_LLM_MODEL=gpt-4o            # Model override (used with openai/ollama/vllm/compatible)
OPENAI_API_KEY=sk-...                 # Required for openai provider; not needed for local providers
OPENAI_BASE_URL=http://localhost:11434/v1  # Base URL override (for ollama, vllm, or custom endpoints)
# Per-workspace config: .codeframe/config.yaml supports llm: block
# llm:
#   provider: openai
#   model: qwen2.5-coder:7b
#   base_url: http://localhost:11434/v1   # optional, for local models

# Optional — Rate limiting
RATE_LIMIT_ENABLED=true
RATE_LIMIT_DEFAULT=100/minute
RATE_LIMIT_AUTH=10/minute
RATE_LIMIT_AI=20/minute
REDIS_URL=redis://localhost:6379

CODEFRAME_API_KEY_SECRET=<secret>     # API key hashing

Documentation

Doc Purpose
docs/VISION.md North star: Think → Build → Prove → Ship thesis
docs/PRODUCT_ROADMAP.md Current roadmap — Phase 3.5/4/5 web product completeness
docs/GOLDEN_PATH.md CLI-first workflow contract
docs/CLI_WIREFRAME.md Command → module mapping
docs/AGENT_SYSTEM_REFERENCE.md Component table, model selection, execution flows, self-correction
docs/REACT_AGENT_ARCHITECTURE.md ReAct deep-dive: tools, editor, token management
docs/PHASE_3_UI_ARCHITECTURE.md Web UI architecture (Next.js, pages, components)
docs/PHASE_2_DEVELOPER_GUIDE.md Server layer + v2 router patterns
docs/PHASE_2_CLI_API_MAPPING.md CLI to API endpoint mapping
docs/QUICKSTART.md User-facing quickstart guide

Archived (completed plans, old gap analyses): docs/archive/

Legacy (v1 reference only): SPRINTS.md, sprints/, specs/, CODEFRAME_SPEC.md

@claude

claude Bot commented May 31, 2026

Copy link
Copy Markdown

Code Review — PR 606

Good implementation overall. Scope matches issue 562, backend/frontend split is clean, and test coverage is solid. A few things worth addressing before merge.

Backend: codeframe/core/prd_stress_test.py

Severity parsing is handled well: explicit allowlist + fallback to blocking guards against unexpected LLM output. ambiguity_to_dict is a clean serializer - right call to keep it separate from the dataclass. The severity field currently uses plain str; using Literal[blocking, warning] from typing would catch typos at the type-checker level.

Backend: codeframe/ui/routers/prd_v2.py

(1) HTTP 400 for missing API key is semantically wrong. When ANTHROPIC_API_KEY is absent, the client request is well-formed — the server lacks configuration. HTTP 503 Service Unavailable is the correct status. Using 400 tells the caller their request is malformed, which is misleading.

(2) Reconstructed Ambiguity objects silently drop source_node_title and recommendation. Fine if resolve_ambiguities_into_prd only needs label, questions, and resolved_answer in its LLM prompt (which appears to be the case), but a short inline comment would help the next reader confirm it is intentional rather than a bug.

(3) 502 for a no-op rewrite (refined_content == record.content) is a solid guard. Correct call.

(4) Route ordering comment is good — the FastAPI catch-all footgun is exactly what needs a note.

(5) AmbiguityAnswer.questions has no min_length constraint. If SSE always produces at least one question, the model could enforce that invariant for direct API callers too.

Frontend: AmbiguityCard.tsx

Clean controlled component. aria-label on the textarea makes it testable and accessible. No issues.

Frontend: StressTestModal.tsx

The dual disable condition (blockingUnanswered > 0 || answeredCount === 0) is correct and the warnings-only edge case is explicitly tested. Good. mutatePrd(newPrd, false) is the right SWR pattern. Error path in handleRefine uses apiError.detail as a toast string — ApiError.detail can be a nested object in our error schema (from api_error() in the router). Add a string type-check to prevent [object Object] appearing in the UI.

Frontend: api.ts

workspace_path as a query param on a POST is consistent with the rest of the API. Fine.

Tests

Backend: severity parsing/fallback, serializer round-trip, complete payload shape, refine happy path, 404/400/422/502, and route-ordering all covered. Well done. Frontend: AmbiguityCard unit tests are thorough; StressTestModal covers the warnings-only edge case (easy to miss); useStressTestStream updated correctly. One gap: no test for the toast error path when apiError.detail is an object rather than a string.

Summary

Priority Issue
Blocking HTTP 400 should be 503 for missing API key
Worth fixing apiError.detail string coercion before toast
Minor Literal type on severity; comment on dropped fields in refine; questions min_length

The core implementation is solid and the test coverage earns confidence. The 400→503 fix and the toast error coercion are my only blockers.

…fields

Address review feedback on #562:
- Missing API key / unknown provider is a service-availability problem,
  not a malformed request -> 503 (was 400).
- Add a comment noting source_node_title/recommendation are intentionally
  empty in the reconstructed Ambiguity (resolve only reads label/questions/
  resolved_answer).
@frankbria

Copy link
Copy Markdown
Owner Author

Thanks for the review. Triaged every finding:

Applied

  • HTTP 400 → 503 for missing LLM config (commit 127fbbb). Agreed — a missing API key / unknown provider is a service-availability problem, not a malformed request. Test updated (test_refine_missing_api_key_returns_503).
  • Inline comment on dropped source_node_title/recommendation in the reconstructed Ambiguity (same commit) — confirms it's intentional (resolve_ambiguities_into_prd only reads label/questions/resolved_answer).

Verified — no change needed

  • apiError.detail[object Object] in toast: the axios interceptor already runs normalizeErrorDetail() on every response, coercing the structured {error, code, detail} object into a string before it reaches handleRefine. ApiError.detail is typed string, so the object case cannot reach the toast. The suggested string-coercion (and a test for it) would guard an unreachable path.

Deferred (minor, YAGNI)

  • Literal["blocking","warning"] on severity: the value is already runtime-validated against an allowlist with a blocking fallback in classify_and_decompose; the dataclass stays str to avoid rippling a type change for no runtime benefit.
  • questions min_length: empty questions is harmless (only joined into prompt text); enforcing ≥1 could reject otherwise-valid input.

The two codex cross-family findings (empty-payload refine, no-op-rewrite duplicate version) plus the blank-answer guard were already addressed in the initial PR.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
codeframe/ui/routers/prd_v2.py (1)

383-390: ⚡ Quick win

Use ErrorCodes.SERVICE_UNAVAILABLE for 503 responses.

The 503 status and "LLM provider unavailable" message indicate a service availability issue, but the error code EXECUTION_FAILED typically signals a processing failure during execution. Using SERVICE_UNAVAILABLE would be semantically consistent with both the HTTP status and the error message, making client-side error handling more intuitive.

♻️ Proposed fix
         raise HTTPException(
             status_code=503,
             detail=api_error(
-                "LLM provider unavailable", ErrorCodes.EXECUTION_FAILED, str(exc)
+                "LLM provider unavailable", ErrorCodes.SERVICE_UNAVAILABLE, str(exc)
             ),
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@codeframe/ui/routers/prd_v2.py` around lines 383 - 390, Replace the incorrect
error code used for the 503 raise: in the HTTPException that sets
status_code=503 and calls api_error("LLM provider unavailable",
ErrorCodes.EXECUTION_FAILED, str(exc)), change ErrorCodes.EXECUTION_FAILED to
ErrorCodes.SERVICE_UNAVAILABLE so the api_error invocation and HTTP 503 response
use ErrorCodes.SERVICE_UNAVAILABLE (retaining the same message and exc string).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@codeframe/ui/routers/prd_v2.py`:
- Around line 383-390: Replace the incorrect error code used for the 503 raise:
in the HTTPException that sets status_code=503 and calls api_error("LLM provider
unavailable", ErrorCodes.EXECUTION_FAILED, str(exc)), change
ErrorCodes.EXECUTION_FAILED to ErrorCodes.SERVICE_UNAVAILABLE so the api_error
invocation and HTTP 503 response use ErrorCodes.SERVICE_UNAVAILABLE (retaining
the same message and exc string).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5ee1fcd3-5bc0-49f3-b854-86afd40cc124

📥 Commits

Reviewing files that changed from the base of the PR and between 737c144 and 127fbbb.

📒 Files selected for processing (2)
  • codeframe/ui/routers/prd_v2.py
  • tests/ui/test_prd_stress_test_router.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/ui/test_prd_stress_test_router.py

@claude

claude Bot commented May 31, 2026

Copy link
Copy Markdown

Code Review: PRD Stress-Test Results View + Refinement (#562)

Critical

1. Blocking sync LLM call on the async event loop (prd_v2.py:408)

refine_prd_from_stress_test is async def but calls resolve_ambiguities_into_prd — a synchronous, blocking LLM call — directly on the event loop. This stalls all concurrent requests for the duration of the LLM round-trip (typically several seconds).

The existing SSE path already solves this correctly via asyncio.to_thread inside stress_test_prd_stream. The fix is consistent:

import asyncio
refined_content = await asyncio.to_thread(
    resolve_ambiguities_into_prd, record.content, ambiguities, provider
)

prd.create_new_version on the next few lines is synchronous SQLite I/O and should be wrapped the same way, though its latency is negligible compared to the LLM call.


Important

2. severity field should be Literal["blocking", "warning"] (prd_stress_test.py:57)

The dataclass field accepts any str at the type level; only classify_and_decompose normalizes the value at runtime. Using a Literal type annotation makes the constraint explicit and lets static analysis catch mismatches:

from typing import Literal
severity: Literal["blocking", "warning"] = "blocking"

The TypeScript side already defines AmbiguitySeverity = 'blocking' | 'warning' — the Python model should match.

3. ambiguity_to_dict return type is untyped (prd_stress_test.py:335)

The signature is -> dict:. The frontend depends on the exact shape of this dict, so a typed return (-> dict[str, object]: or a TypedDict) gives static analyzers something to check:

def ambiguity_to_dict(amb: Ambiguity) -> dict[str, object]:

4. AmbiguityAnswer.label has no content validation (prd_v2.py:102)

answer has min_length=1 plus a whitespace-strip validator. label has neither. An empty label would produce a confusing resolution line in the LLM prompt (e.g., "- : question → Answer: y"). Adding min_length=1 is consistent with the rest of the model.


Minor

5. Unsafe err as ApiError cast in handleRefine (StressTestModal.tsx:114)

const apiError = err as ApiError;
toast.error(apiError.detail || 'Failed to refine PRD. Please try again.');

The fallback string saves this in practice, but it's an unsafe cast on unknown. The safer pattern used elsewhere (e.g., useProofRun.ts) is:

const detail = (err as { detail?: unknown }).detail;
toast.error(typeof detail === 'string' ? detail : 'Failed to refine PRD. Please try again.');

Architecture / Coverage Notes

Architecture compliance looks good: core/prd_stress_test.py has no FastAPI imports (headless boundary maintained), the /stress-test/refine route is registered before the /{prd_id} catch-all (and explicitly tested in test_refine_route_not_shadowed_by_prd_id), and the stateless design (client round-trips ambiguity content rather than server persisting run state) is appropriate and documented.

Test coverage is thorough — severity parsing, round-trip serialization, error cases (404/503/502/422), route ordering, and frontend card/modal/stream hook behavior are all covered. All backend tests are marked @pytest.mark.v2.

The blocking event loop issue (#1) is the only blocker for merge. Items #2#4 are straightforward hardening improvements worth addressing in this PR.

Address re-review on #562:
- CRITICAL: wrap resolve_ambiguities_into_prd in asyncio.to_thread so the
  refine endpoint no longer blocks the event loop (mirrors the SSE path).
- severity typed Literal['blocking','warning']; ambiguity_to_dict returns
  dict[str, object].
- AmbiguityAnswer.label gains min_length=1 (empty label -> 422).
- Safer error handling in handleRefine (typeof-string guard instead of an
  unchecked ApiError cast).
@frankbria

Copy link
Copy Markdown
Owner Author

Addressed the re-review (commit 1f26526):

  • Critical — blocking sync LLM call on the event loop: fixed. resolve_ambiguities_into_prd now runs via asyncio.to_thread, mirroring stress_test_prd_stream. The refine endpoint no longer stalls concurrent requests.
  • severityLiteral["blocking", "warning"]: applied (matches the TS AmbiguitySeverity union; mypy clean).
  • ambiguity_to_dict typed return: now -> dict[str, object].
  • AmbiguityAnswer.label min_length=1: applied + test test_refine_rejects_empty_label (422).
  • Unsafe err as ApiError cast: replaced with the typeof detail === 'string' guard pattern; dropped the now-unused ApiError import.

Backend 46 tests + frontend modal suite pass; npm run build and mypy codeframe/ clean.

@claude

claude Bot commented May 31, 2026

Copy link
Copy Markdown

Code Review -- PR 606 (follow-up)

All items from the previous two review rounds have been applied cleanly: asyncio.to_thread wrapping the blocking LLM call, Literal typing on severity, the _resolve_llm_provider DRY refactor, and the 400 to 503 status correction. Two remaining issues in the current diff.

1. Wrong error code on the 503 response (prd_v2.py ~line 289)

The 503 still uses ErrorCodes.EXECUTION_FAILED. EXECUTION_FAILED signals a processing failure; a missing API key or unconfigured provider is a service-availability problem. This is what CodeRabbit flagged and it is still unaddressed. Use ErrorCodes.SERVICE_UNAVAILABLE (or the equivalent in this project's error catalog) so API clients can distinguish a bad request from the server being unable to reach the LLM.

2. Second 404 after create_new_version is misleading (prd_v2.py ~line 337)

At that point get_by_id has already confirmed the PRD exists. If create_new_version returns None, that is a server-side fault, not a not-found, and the 404 message 'No PRD with id ...' will actively mislead callers who just confirmed the PRD existed. Change to a 500 with 'Failed to persist new PRD version'.

Minor -- frontend error detail extraction (StressTestModal.tsx)

422 responses from Pydantic return detail as an array of objects, not a string, so typeof detail === 'string' is always false for validation errors and the generic fallback fires every time. Not a blocker, but consider also checking Array.isArray(detail) and pulling detail[0]?.msg if a specific message is wanted.

Everything else is good

Severity parsing with explicit allowlist + fallback to blocking is the right defensive approach for an LLM-parsed field. The route registration comment is exactly the right non-obvious note to leave. AmbiguityCard aria-label on the textarea is correct for accessibility. The ambiguities ?? [] default in useStressTestStream handles old server formats safely. Pydantic whitespace rejection, blank-label guard, and empty-answers guard close the obvious API abuse vectors. Test coverage across severity parsing, serializer round-trip, 404/400/422/502/503 paths, and the full frontend render/refine flow is comprehensive.

Ready to merge after items 1 and 2 are addressed.

Address follow-up review on #562:
- 503 (LLM unavailable) now uses ErrorCodes.SERVICE_UNAVAILABLE instead of
  EXECUTION_FAILED, so clients can distinguish a bad request from the server
  being unable to reach the LLM.
- A None from create_new_version (PRD already confirmed to exist) is a
  persistence fault -> 500 INTERNAL_ERROR, not a misleading 404. Covered by
  test_refine_persistence_failure_returns_500.
@frankbria

Copy link
Copy Markdown
Owner Author

Addressed both blockers (commit 19c525c):

  1. 503 error code: now ErrorCodes.SERVICE_UNAVAILABLE (was EXECUTION_FAILED) so clients can distinguish a malformed request from the server being unable to reach the LLM.
  2. Misleading 404 after create_new_version: since get_by_id already confirmed the PRD exists, a None is a persistence fault → now 500 INTERNAL_ERROR ("Failed to persist new PRD version"). New test test_refine_persistence_failure_returns_500.

Minor (frontend 422 detail) — verified no change needed: prdApi.refineStressTest goes through the shared axios client, whose response interceptor runs normalizeErrorDetail() and rejects with ApiError { detail: string }. Pydantic's array-of-objects 422 body is collapsed to "msg; msg" before it reaches handleRefine, so typeof detail === 'string' is true and the specific validation message is shown — the array case never reaches the component. (Same axios-interceptor detail the previous rounds flagged.)

mypy codeframe/ clean; 15 refine-router tests pass.

@claude

claude Bot commented May 31, 2026

Copy link
Copy Markdown

Code Review — PR 606 (follow-up, round 4)

Both blockers from the previous round are confirmed fixed in commit 19c525c:

✅ 503 → ErrorCodes.SERVICE_UNAVAILABLE
The _resolve_llm_provider failure path now correctly uses SERVICE_UNAVAILABLE instead of EXECUTION_FAILED. Clients can now distinguish a provider-configuration gap from a bad request.

✅ Misleading 404 → 500 INTERNAL_ERROR
The create_new_version returning None after get_by_id confirmed the PRD exists is now a persistence-fault 500, not a not-found 404. The test_refine_persistence_failure_returns_500 test anchors this.


One minor inconsistency (non-blocking)

The outer except Exception catch-all in refine_prd_from_stress_test (prd_v2.py, end of the try block) uses ErrorCodes.EXECUTION_FAILED for what is an unexpected internal error (e.g. a DB exception from create_new_version):

except Exception as e:
    logger.error(f"Failed to refine PRD: {e}", exc_info=True)
    raise HTTPException(
        status_code=500,
        detail=api_error("Failed to refine PRD", ErrorCodes.EXECUTION_FAILED, str(e)),
    )

The explicit None-check directly above it already uses INTERNAL_ERROR correctly; this outer catch-all is slightly inconsistent. The 500 status code is correct and it's a rarely-hit path, so this won't cause a regression — but swapping to INTERNAL_ERROR would make the error catalog uniform. Worth a follow-up, not a blocker.


Everything else is clean

  • Severity parsing (allowlist + fallback to "blocking") is the right defensive pattern for an LLM-parsed field.
  • asyncio.to_thread correctly offloads the blocking LLM call.
  • _resolve_llm_provider DRY helper is well-documented and used consistently by both the stream and refine endpoints.
  • AmbiguityCard aria-labels, answeredCount === 0 guard, and the ambiguities ?? [] fallback are all correct.
  • Pydantic min_length/whitespace validators on both answer and label close the obvious API abuse vectors.
  • Route ordering comment is the right non-obvious note to leave, and test_refine_route_not_shadowed_by_prd_id anchors it.
  • Test coverage across severity parsing, serializer round-trip, 404/400/422/502/503/500 paths, and the full frontend render/refine flow is comprehensive.

Ready to merge.

@claude

claude Bot commented May 31, 2026

Copy link
Copy Markdown

Code Review — PR 606 (round 5, follow-up)

No new commits since 19c525c. Confirming current state for completeness.

Status of all prior items

Round Item Status
1 Blocking LLM call on async event loop ✅ Fixed — asyncio.to_thread
1 severityLiteral["blocking", "warning"] ✅ Fixed
1 ambiguity_to_dict untyped return ✅ Fixed — dict[str, object]
1 AmbiguityAnswer.label missing min_length ✅ Fixed
1 Unsafe err as ApiError cast in handleRefine ✅ Fixed
2 503 using EXECUTION_FAILED instead of SERVICE_UNAVAILABLE ✅ Fixed
2 Misleading 404 after create_new_version (should be 500) ✅ Fixed — INTERNAL_ERROR
4 Outer except Exception catch-all uses EXECUTION_FAILED ⏳ Still open (non-blocking)

Remaining minor inconsistency

The outer catch-all in refine_prd_from_stress_test (prd_v2.py, end of the try block) still uses ErrorCodes.EXECUTION_FAILED for what is an unexpected internal error (e.g. a DB exception that escapes create_new_version). The explicit None-check directly above it correctly uses INTERNAL_ERROR. The 500 status code is right and this is a rarely-hit path — no regression risk. Worth a one-line follow-up in a subsequent PR.

Everything else is clean

No new issues in the latest commit. Implementation is correct and the test suite is comprehensive. Ready to merge.

@frankbria frankbria merged commit 6cd1554 into main May 31, 2026
11 checks passed
@frankbria frankbria deleted the feat/562-stress-test-results-refine branch May 31, 2026 00:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Phase 5.4] PRD stress-test web UI: results view and PRD refinement

1 participant