OpenBuildApp System Design

1. Executive Summary

OpenBuildApp is a production-grade software twin engine, not a cloning toy. A user places an authorized target artifact into a workspace, the platform launches an isolated run, specialized agents observe the target and reconstruct a twin, and deterministic verification loops continue until the twin meets the run's fidelity thresholds or a stop condition is reached. Every action is logged, every important inference is explainable, and every run can be replayed and rolled back.

The product is built around a run ledger, strict agent contracts, sandbox-first execution, and evidence packs that make every conclusion reviewable by a human. The core unit of work is a run folder, not a conversational session.

2. Product Positioning

Positioning statement

OpenBuildApp is the first software archaeology and reconstruction platform for authorized analysis: it does not merely generate code from prompts, it studies a real software artifact, constructs a reversible behavioral model, produces a twin, and proves how close the twin is with replay, diffing, and structured evidence.

What it competes against

AI coding assistants that start from intent but not evidence
UI copiers that mimic layout without understanding behavior
Test automation tools that observe behavior but do not reconstruct it
Reverse engineering utilities that lack productized safety, replay, or auditability

What it is not

A DRM bypass tool
A credential extraction tool
A covert traffic interception system
A generic autonomous coding shell

3. Design Principles

Single source of truth per run: run-state.json
Event-sourced execution: every material step emits an append-only event
Behavior before pixels: interaction and state fidelity outrank cosmetic similarity
Safe by default: sandbox, policy gate, confirmation gate, rollback gate
Short-lived agent memory: agents consume scoped artifacts, then exit
No uncontrolled peer-to-peer agent chat: all coordination flows through the orchestrator
Deterministic artifacts: repeatable folder layout, schemas, logs, and replay frames
Explainability: every major inference cites evidence references
Hypotheses must be testable, versioned, and either validated, downgraded, or rejected
Confidence is an explicit field on observations, models, verification results, and final reports
The machine-readable model is a first-class product, not a byproduct of code generation

4. System Shape

User Workspace
  -> Run Bootstrapper
  -> Control Layer
       -> Policy Gate
       -> Run Ledger
       -> Snapshot Manager
       -> Agent Dispatcher
  -> Observation Layer
  -> Cognitive Graph Layer
  -> Reconstruction Layer
  -> Verification Layer
  -> Repair Layer
  -> Reporting Layer
  -> Final Evidence Pack + Replay Archive

5. Layered Architecture

Layer 1: Control Layer

Purpose: Own the run, gate actions, compress context, arbitrate conflicts, and decide whether to continue, retry, rollback, or escalate.

Core services:

Orchestrator runtime
Policy engine
Run-state store
Event bus
Snapshot scheduler
Approval gate

Primary artifacts:

run-state.json
events.jsonl
task.md
policy.md
summary.md

Layer 2: Observation Layer

Purpose: Capture what the target does without generating implementation code.

Adapters:

Browser instrumentation for web URLs
Window capture and accessibility bridge for desktop apps
Emulator instrumentation for mobile packages
Frame and OCR pipeline for screenshots and recordings
Static file readers for logs and source folders

Primary artifacts:

screenshots
screen segments
DOM or window trees
accessibility trees
network summaries
interaction traces

Layer 3: Cognitive Graph Layer

Purpose: Convert raw observations into a living, inspectable cognition model that the orchestrator and downstream agents can reason over.

Core subsystems:

Cognitive Memory Graph
Autonomous Hypothesis Engine
Living State Model
Confidence propagation engine

Graph nodes:

screens
components
controls
actions
states
transitions
entities
backend behaviors
evidence artifacts
hypotheses

Graph edges:

contains
triggers
transitions_to
reads_from
writes_to
resembles
supports
contradicts
proven_by
inferred_from

Primary artifacts:

models/cognitive-memory-graph.json
models/component-map.json
models/navigation-map.json
models/state-graph.json
models/event-graph.json
models/data-model.json
models/backend-contract.json
models/hypotheses.json

Layer 3A: Cognitive Memory Graph

The graph is the platform's durable software cognition substrate. It is not a cache. It is the machine-readable explanation of what the system believes, why it believes it, and how strongly.

Each node and edge must store:

stable id
type
confidence score
provenance
evidence refs
last-updated step
conflict markers where applicable
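
The required fields above could be captured in a shared record type. The following is a minimal sketch, not a schema defined by this design: the field names (`evidenceRefs`, `lastUpdatedStep`, `conflicts`), the `from`/`to` edge endpoints, the 0 to 1 confidence range, and the `recordsWithoutEvidence` helper are all illustrative assumptions. The node and edge type names mirror the lists in Layer 3.

```typescript
// Hypothetical record shapes for the Cognitive Memory Graph; field names are illustrative.
type GraphNodeType =
  | "screen" | "component" | "control" | "action" | "state"
  | "transition" | "entity" | "backend_behavior" | "evidence_artifact" | "hypothesis";

type GraphEdgeType =
  | "contains" | "triggers" | "transitions_to" | "reads_from" | "writes_to"
  | "resembles" | "supports" | "contradicts" | "proven_by" | "inferred_from";

interface GraphRecord {
  id: string;              // stable id
  confidence: number;      // assumed to be a score in [0, 1]
  provenance: string;      // agent or adapter that produced the record
  evidenceRefs: string[];  // ids of supporting evidence artifacts
  lastUpdatedStep: number; // step at which the record last changed
  conflicts?: string[];    // ids of records that contradict this one, if any
}

interface GraphNode extends GraphRecord { type: GraphNodeType; }
interface GraphEdge extends GraphRecord { type: GraphEdgeType; from: string; to: string; }

// Lists the ids of records that cite no evidence and so cannot be explained.
function recordsWithoutEvidence(records: GraphRecord[]): string[] {
  return records.filter((r) => r.evidenceRefs.length === 0).map((r) => r.id);
}

const saveButton: GraphNode = {
  id: "node-control-save", type: "control", confidence: 0.9,
  provenance: "ui-observer", evidenceRefs: ["shot-0004"], lastUpdatedStep: 12,
};
const opensModal: GraphEdge = {
  id: "edge-save-opens-modal", type: "triggers",
  from: "node-control-save", to: "node-screen-confirm",
  confidence: 0.4, provenance: "interaction-observer", evidenceRefs: [], lastUpdatedStep: 12,
};

// Only the edge is reported, because it carries no evidence refs.
console.log(recordsWithoutEvidence([saveButton, opensModal]));
```

A check like this could run before a graph update is accepted, so that unexplained records are rejected rather than silently stored.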

Layer 3B: Autonomous Hypothesis Engine

The hypothesis engine forms and validates claims such as:

this control opens a modal
this state transition represents save success
this endpoint returns paginated results
this component is stable across screens
this animation is ornamental, not state-bearing

Hypothesis lifecycle:

proposed -> queued -> tested -> validated
                             -> downgraded
                             -> rejected

Hypotheses must include:

natural-language claim
machine-checkable expectation
triggering evidence
required validation action
confidence before test
confidence after test
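
The lifecycle and required fields can be sketched as a small state machine. This is one possible reading, not a defined contract: the field names and the `advanceHypothesis` helper are hypothetical, and treating `validated`, `downgraded`, and `rejected` as terminal is an assumption (the design does not say whether a downgraded hypothesis may be re-queued).

```typescript
type HypothesisStatus =
  | "proposed" | "queued" | "tested" | "validated" | "downgraded" | "rejected";

// Allowed moves, taken from the lifecycle diagram. Outcomes are assumed terminal.
const NEXT_STATUSES: Record<HypothesisStatus, HypothesisStatus[]> = {
  proposed: ["queued"],
  queued: ["tested"],
  tested: ["validated", "downgraded", "rejected"],
  validated: [],
  downgraded: [],
  rejected: [],
};

interface Hypothesis {
  id: string;
  claim: string;                // natural-language claim
  expectation: string;          // machine-checkable expectation
  triggeringEvidence: string[]; // evidence refs that prompted the hypothesis
  validationAction: string;     // action required to test it
  confidenceBefore: number;
  confidenceAfter?: number;     // set once an outcome is recorded
  status: HypothesisStatus;
}

// Returns a new hypothesis in the next status, or throws on an illegal move.
function advanceHypothesis(
  h: Hypothesis,
  next: HypothesisStatus,
  confidenceAfter?: number,
): Hypothesis {
  if (NEXT_STATUSES[h.status].indexOf(next) === -1) {
    throw new Error(`illegal transition: ${h.status} -> ${next}`);
  }
  const isOutcome = NEXT_STATUSES[next].length === 0;
  if (isOutcome && confidenceAfter === undefined) {
    throw new Error("an outcome requires a post-test confidence");
  }
  return isOutcome ? { ...h, status: next, confidenceAfter } : { ...h, status: next };
}

let modalHypothesis: Hypothesis = {
  id: "hyp-001",
  claim: "this control opens a modal",
  expectation: "a dialog role appears in the a11y tree after click",
  triggeringEvidence: ["shot-0004"],
  validationAction: "click control and capture a11y tree",
  confidenceBefore: 0.5,
  status: "proposed",
};
modalHypothesis = advanceHypothesis(modalHypothesis, "queued");
modalHypothesis = advanceHypothesis(modalHypothesis, "tested");
modalHypothesis = advanceHypothesis(modalHypothesis, "validated", 0.9);
```

Returning a new object on each step keeps earlier versions intact, which fits the requirement that hypotheses be versioned.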

Layer 4: Reconstruction Layer

Purpose: Produce a working twin with explicit assumptions.

Output targets:

web twin for MVP
desktop shell wrapper in V2
mobile twin targets in V3

Primary artifacts:

generated source code
shadow API definitions
local state engine
replayable tests

Layer 5: Verification Layer

Purpose: Measure behavioral and visual distance between target and twin.

Verification modes:

pixel diff
structural diff
timing diff
replay diff
state diff
accessibility parity check
perceptual reality verification
critical path regression verification

Primary artifacts:

verification/report.json
annotated diff images
replay mismatch logs
fidelity scores
perceptual mismatch report

Layer 6: Repair Layer

Purpose: Translate verification failures into targeted patches without destabilizing passing behavior.

Strategies:

defect clustering by root cause
patch selection by highest leverage mismatch
regression guard generation before applying risky repairs
rollback to last known-good snapshot on failure
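
The first two strategies can be illustrated together: group defects by root cause, then repair the cluster with the largest combined impact first. This is a sketch under assumptions. The `Defect` shape, the `scoreImpact` field (taken here to be the fidelity score lost to the defect), and the use of summed impact as the measure of leverage are all hypothetical.

```typescript
interface Defect { id: string; rootCause: string; scoreImpact: number; }
interface DefectCluster { rootCause: string; defectIds: string[]; leverage: number; }

// Groups defects by root cause and orders clusters by summed impact, highest first.
function rankClusters(defects: Defect[]): DefectCluster[] {
  const clusters: DefectCluster[] = [];
  const indexByCause: Record<string, number> = {};
  for (const d of defects) {
    if (!Object.prototype.hasOwnProperty.call(indexByCause, d.rootCause)) {
      indexByCause[d.rootCause] = clusters.length;
      clusters.push({ rootCause: d.rootCause, defectIds: [], leverage: 0 });
    }
    const cluster = clusters[indexByCause[d.rootCause]];
    cluster.defectIds.push(d.id);
    cluster.leverage += d.scoreImpact;
  }
  return clusters.sort((a, b) => b.leverage - a.leverage);
}

const rankedClusters = rankClusters([
  { id: "d1", rootCause: "missing-focus-trap", scoreImpact: 0.04 },
  { id: "d2", rootCause: "stale-list-after-save", scoreImpact: 0.10 },
  { id: "d3", rootCause: "missing-focus-trap", scoreImpact: 0.08 },
]);

// The focus-trap cluster ranks first: two small defects outweigh one larger one.
console.log(rankedClusters[0].rootCause, rankedClusters[0].defectIds);
```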

Layer 7: Reporting Layer

Purpose: Turn the run into an auditable product artifact.

Primary outputs:

evidence.md
summary.md
report.json
exportable evidence bundle
architecture delta notes

6. Input Classification Matrix

Input type	Primary pipeline	Confidence ceiling	Notes
URL for web app	browser capture + DOM + network + replay	high	Best early target for MVP
source code folder	static parse + build/run in sandbox + screenshot capture	high	Best for explainable twin generation
desktop executable	VM or sandbox runner + window tree + screenshot + a11y	medium	Stronger isolation required
mobile package	emulator + UI automation + screenshot + a11y	medium	Start in V3
screenshot set	visual inference only	low	No behavioral claims beyond evidence
screen recording	temporal visual inference + OCR + event hints	medium-low	Good for flow reconstruction
logs	state/event augmentation	medium	Never treated as full truth alone
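
One way to enforce the confidence ceiling column is to clamp any claimed confidence band to the ceiling of the input type. The sketch below assumes an ordering of bands (low, medium-low, medium, high) and uses hypothetical identifiers for the input types; only the ceilings themselves come from the matrix.

```typescript
type InputType =
  | "web_url" | "source_folder" | "desktop_executable" | "mobile_package"
  | "screenshot_set" | "screen_recording" | "logs";

type ConfidenceBand = "low" | "medium-low" | "medium" | "high";

// Ceilings copied from the matrix above.
const CONFIDENCE_CEILING: Record<InputType, ConfidenceBand> = {
  web_url: "high",
  source_folder: "high",
  desktop_executable: "medium",
  mobile_package: "medium",
  screenshot_set: "low",
  screen_recording: "medium-low",
  logs: "medium",
};

// Assumed ordering of bands, lowest to highest.
const BAND_RANK: Record<ConfidenceBand, number> = {
  low: 0, "medium-low": 1, medium: 2, high: 3,
};

// Returns the claimed band, lowered to the ceiling when it exceeds it.
function clampBand(input: InputType, claimed: ConfidenceBand): ConfidenceBand {
  const ceiling = CONFIDENCE_CEILING[input];
  return BAND_RANK[claimed] > BAND_RANK[ceiling] ? ceiling : claimed;
}

console.log(clampBand("screenshot_set", "high")); // lowered to the screenshot ceiling
console.log(clampBand("web_url", "medium"));      // unchanged, already under the ceiling
```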

7. Technology Architecture

OpenBuildApp should be a monorepo with a split between product UI, orchestration, capture, and shared contracts.

/apps
  /control-center        # split-screen dashboard, replay timeline, run control UI
  /twin-runtime          # runtime shell for generated twins
/services
  /orchestrator          # run lifecycle, agent dispatch, policy gate
  /capture-gateway       # screenshot, a11y, DOM, replay, timing capture
  /sandbox-manager       # VM/container/session isolation and reset
  /verification-engine   # diffing, scoring, replay comparison
/packages
  /contracts             # zod/json schemas and shared types
  /agent-runtime         # base agent interfaces and runners
  /prompt-contracts      # agent contract compiler and templates
  /model-store           # state graph, component map, navigation map
  /evidence-kit          # evidence bundle assembly

Opinionated stack

Product UI: React + TypeScript + Vite + TanStack Router + Zustand
Desktop shell: Tauri for local control center packaging
Orchestration: TypeScript service for fast agent iteration and shared contracts
Isolation helpers: Rust sidecars for snapshotting, desktop capture, and low-level timing fidelity
Storage: SQLite for run metadata, JSONL for append-only event logs, filesystem for evidence artifacts
Schemas: JSON Schema plus runtime validation in Zod

8. Canonical Run Data Model

Each run gets a deterministic folder and a canonical ledger:

run-state.json: current state snapshot
events.jsonl: append-only events
actions.jsonl: operator or agent actions with monotonic sequence ids
artifacts/index.json: artifact catalog with hashes and provenance
models/live-state-target.json: current target-side live state projection
models/live-state-twin.json: current twin-side live state projection
models/hypotheses.json: current active and historical hypotheses

run-state.json should contain:

run metadata
input classification
current phase
active agent
approval state
latest snapshot id
verification status
fidelity scores
stop reason
model coverage
current confidence band
active hypotheses count
current target state id
current twin state id
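
Under an event-sourced design, a snapshot such as run-state.json can be treated as a projection of the append-only log. The sketch below folds events.jsonl text into a small subset of the fields listed above. The event kinds, the `seq` field on events, and the phase names are assumptions made for illustration; this design does not define an event schema.

```typescript
// Hypothetical event shapes; the document does not define an event schema.
type RunEvent =
  | { seq: number; kind: "phase_entered"; phase: string }
  | { seq: number; kind: "snapshot_created"; snapshotId: string }
  | { seq: number; kind: "run_stopped"; reason: string };

// A small projection of run-state.json, limited to fields these events can set.
interface RunStateView {
  currentPhase: string;
  latestSnapshotId: string | null;
  stopReason: string | null;
  lastSeq: number;
}

// Applies one event, rejecting gaps or reordering in the sequence.
function applyEvent(state: RunStateView, e: RunEvent): RunStateView {
  if (e.seq !== state.lastSeq + 1) {
    throw new Error(`sequence gap: expected ${state.lastSeq + 1}, got ${e.seq}`);
  }
  switch (e.kind) {
    case "phase_entered":
      return { ...state, currentPhase: e.phase, lastSeq: e.seq };
    case "snapshot_created":
      return { ...state, latestSnapshotId: e.snapshotId, lastSeq: e.seq };
    case "run_stopped":
      return { ...state, stopReason: e.reason, lastSeq: e.seq };
  }
}

// Rebuilds the projection from the text of an events.jsonl file.
function replayLedger(jsonl: string): RunStateView {
  const initial: RunStateView = {
    currentPhase: "setup", latestSnapshotId: null, stopReason: null, lastSeq: 0,
  };
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RunEvent)
    .reduce(applyEvent, initial);
}

const ledgerText = [
  '{"seq":1,"kind":"snapshot_created","snapshotId":"snap-1"}',
  '{"seq":2,"kind":"phase_entered","phase":"classification"}',
  '{"seq":3,"kind":"phase_entered","phase":"observation"}',
].join("\n");

console.log(replayLedger(ledgerText));
```

If the stored snapshot and the replayed projection ever disagree, the log would be the side to trust, since it is the append-only record.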

9. The 21-Agent System

Control

Master Orchestrator Agent
- Owns phase transitions, routing, stop decisions, and evidence compression.

Intake and Governance

Workspace Intake Agent
- Detects input type, hashes artifacts, validates workspace readiness.
Task Spec Agent
- Produces and updates task.md, success criteria, and scope.
Policy Agent
- Produces and enforces policy.md, approvals, and red lines.

Observation

UI Observer Agent
- Captures layout, typography, hierarchy, spacing, and screen inventory.
Interaction Observer Agent
- Records clicks, typing, focus, hover, menus, and visible transitions.
Animation Observer Agent
- Measures timing, easing, motion paths, and transition durations.
Accessibility Agent
- Extracts a11y trees, semantic roles, labels, focus order, and landmarks.
Network Observer Agent
- Summarizes safe request and response patterns without storing secrets.

Modeling

State Graph Agent
- Builds a state machine from observed events and UI changes.
Navigation Agent
- Maps routes, screens, windows, modal layers, and back-stack logic.
Component Mapper Agent
- Clusters visible primitives into reusable component candidates.
Backend Behavior Inference Agent
- Infers response contracts and data dependencies from allowed evidence.
Data Model Agent
- Infers entity shapes, collections, filters, and client-state patterns.

Reconstruction

Frontend Generator Agent
- Produces the twin UI structure and component tree.
Interaction Coding Agent
- Implements user actions, forms, keyboard flows, and local behaviors.
Shadow API Agent
- Creates mocks, inferred endpoints, and deterministic local adapters.

Verification and Repair

Verification Agent
- Runs scoring, diffing, replay comparison, and gate checks.
Repair Agent
- Applies targeted fixes based on verification deltas.
Rollback and Snapshot Agent
- Manages restore points, clean resets, and failed-repair recovery.

Reporting

Reporting and Evidence Agent
- Produces evidence packs, summaries, changelogs, and final reports.

10. Run-Time Flow

Phase 0: Workspace Setup

Create run folder and seed deterministic files
Hash inputs and register artifacts
Create initial snapshot

Phase 1: Input Classification

Detect artifact class and confidence
Select pipeline and safe tool envelope
Mark unsupported claims early

Phase 2: Safe Launch and Passive Observation

Launch target in isolated environment
Begin screenshot, DOM, a11y, and timing capture
Avoid active interaction until baseline state is recorded

Phase 3: Structured Exploration

Explore low-risk actions first
Never click destructive or purchase-like controls without approval
Record all actions, before and after states, and confidence notes
Propose and queue testable hypotheses about controls, transitions, and data behavior

Phase 4: Behavior Modeling

Build screen graph, state graph, navigation map, and component inventory
Cross-link observations to specific evidence references
Update the cognitive memory graph and prune rejected hypotheses
Maintain a living state model for both target and twin candidates

Phase 5: Reconstruction

Generate initial twin
Generate shadow backend and local state engine
Generate replayable tests for observed critical paths

Phase 6: Verification

Run deterministic comparisons
Produce weighted fidelity scores
Create defect clusters with evidence links
Run perceptual reality checks for motion, delay, hierarchy, feedback, and error handling
Update confidence based on replay stability and verification reproducibility

Phase 7: Repair Loop

Patch highest-value failures first
Re-run targeted verification
Escalate when the iteration cap is reached, a policy gate triggers, or observations become unstable

Phase 8: Finalization

Freeze final snapshot
Build evidence pack and replay archive
Produce final summary with known gaps and confidence level
Archive the time-machine replay so every step can be scrubbed with model state and evidence

11. Stop Conditions

The orchestrator stops the run when any of the following is true:

verification thresholds are met
the user requests stop
a policy boundary is hit
the target becomes unstable or nondeterministic beyond tolerance
the repair loop hits iteration or cost ceilings
evidence is insufficient to make safe claims
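
The stop conditions can be evaluated as a single pure function over run signals. The sketch below is illustrative: the signal names and the numeric instability measure are hypothetical, and the order in which conditions are checked (policy first, success before ceilings) is an assumption, since the list above does not rank them.

```typescript
type StopReason =
  | "policy_boundary" | "user_stop" | "target_unstable"
  | "insufficient_evidence" | "thresholds_met" | "repair_ceiling";

interface StopSignals {
  thresholdsMet: boolean;
  userRequestedStop: boolean;
  policyBoundaryHit: boolean;
  instability: number;          // hypothetical measure of run-to-run divergence
  instabilityTolerance: number;
  repairIterations: number;
  maxRepairIterations: number;
  costSpent: number;
  costCeiling: number;
  evidenceSufficient: boolean;
}

// Returns the first matching stop reason, or null when the run may continue.
function evaluateStop(s: StopSignals): StopReason | null {
  if (s.policyBoundaryHit) return "policy_boundary";
  if (s.userRequestedStop) return "user_stop";
  if (s.instability > s.instabilityTolerance) return "target_unstable";
  if (!s.evidenceSufficient) return "insufficient_evidence";
  if (s.thresholdsMet) return "thresholds_met";
  if (s.repairIterations >= s.maxRepairIterations || s.costSpent >= s.costCeiling) {
    return "repair_ceiling";
  }
  return null;
}

const healthyRun: StopSignals = {
  thresholdsMet: false, userRequestedStop: false, policyBoundaryHit: false,
  instability: 0.02, instabilityTolerance: 0.1,
  repairIterations: 1, maxRepairIterations: 5,
  costSpent: 3, costCeiling: 20, evidenceSufficient: true,
};

console.log(evaluateStop(healthyRun));                             // null: keep going
console.log(evaluateStop({ ...healthyRun, repairIterations: 5 })); // "repair_ceiling"
```

The returned reason maps naturally onto the stop reason field of run-state.json.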

12. Safety Model

OpenBuildApp uses four hard gates:

Authorization Gate
- Runs only on user-authorized artifacts in user-owned workspaces.
Policy Gate
- Blocks disallowed actions such as credential harvesting, DRM bypass, or privileged destructive actions.
Confirmation Gate
- Requires human approval for risky UI interactions, external side effects, or irreversible steps.
Rollback Gate
- Requires a restorable snapshot before any action classified as state-mutating.
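
The four gates can be read as an ordered chain in which the first failing gate blocks the action. The sketch below assumes that ordering and uses hypothetical shapes for a proposed action and its context; none of these names are defined elsewhere in this design.

```typescript
type ActionKind =
  | "read" | "ui_interaction" | "credential_harvest" | "drm_bypass" | "privileged_destructive";

interface ProposedAction {
  id: string;
  kind: ActionKind;
  risky: boolean;        // risky UI interaction, external side effect, or irreversible step
  mutatesState: boolean; // classified as state-mutating
}

interface GateContext {
  artifactAuthorized: boolean;     // user-authorized artifact in a user-owned workspace
  approvedActionIds: string[];     // actions a human has already approved
  latestSnapshotId: string | null; // restorable snapshot, if one exists
}

type GateDecision =
  | { allowed: true }
  | { allowed: false; gate: "authorization" | "policy" | "confirmation" | "rollback" };

// Action kinds the Policy Gate always blocks, per the list above.
const DISALLOWED_KINDS: ActionKind[] = [
  "credential_harvest", "drm_bypass", "privileged_destructive",
];

function evaluateGates(a: ProposedAction, ctx: GateContext): GateDecision {
  if (!ctx.artifactAuthorized) return { allowed: false, gate: "authorization" };
  if (DISALLOWED_KINDS.indexOf(a.kind) !== -1) return { allowed: false, gate: "policy" };
  if (a.risky && ctx.approvedActionIds.indexOf(a.id) === -1) {
    return { allowed: false, gate: "confirmation" };
  }
  if (a.mutatesState && ctx.latestSnapshotId === null) {
    return { allowed: false, gate: "rollback" };
  }
  return { allowed: true };
}

const submitForm: ProposedAction = {
  id: "act-17", kind: "ui_interaction", risky: true, mutatesState: true,
};

// Blocked at the Confirmation Gate until a human approves act-17.
console.log(evaluateGates(submitForm, {
  artifactAuthorized: true, approvedActionIds: [], latestSnapshotId: "snap-3",
}));
```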

Privacy rules

Store only redacted request and response summaries by default
Hash secrets; never persist raw secrets in evidence
Expire sensitive transient artifacts after run completion
Keep evidence precise enough for audit but not rich enough to leak user data
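
As a small illustration of redacted summaries, the sketch below masks secret-bearing request headers before they are written to evidence. The header list and the `[REDACTED]` placeholder are assumptions, and the sketch masks values rather than hashing them; a real implementation would also need to cover bodies, URLs, and screenshots.

```typescript
// Header names treated as secret-bearing; this list is an illustrative assumption.
const SECRET_HEADER = /^(authorization|proxy-authorization|cookie|set-cookie|x-api-key)$/i;

// Returns a copy of the headers with secret-bearing values masked.
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const key of Object.keys(headers)) {
    out[key] = SECRET_HEADER.test(key) ? "[REDACTED]" : headers[key];
  }
  return out;
}

const redactedSummary = redactHeaders({
  Authorization: "Bearer example-token",
  Accept: "application/json",
});

// Authorization is masked; Accept is kept because it helps reviewers and leaks nothing.
console.log(redactedSummary);
```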

13. Verification Model

Verification is a scorecard plus hard gates.

Hard gates

no policy or safety gate violations recorded during the run
replay can execute without fatal divergence
critical user journeys complete
generated twin is reproducible from stored artifacts

Weighted fidelity score

35% interaction fidelity
25% state fidelity
20% visual fidelity
10% timing fidelity
10% accessibility fidelity
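
The weighted score is a plain weighted sum of the five dimensions. The sketch below assumes each per-dimension score is already normalized to the range 0 to 1; the type and function names are illustrative.

```typescript
// Weights from the list above; they sum to 1.
const FIDELITY_WEIGHTS = {
  interaction: 0.35,
  state: 0.25,
  visual: 0.20,
  timing: 0.10,
  accessibility: 0.10,
} as const;

type FidelityDimension = keyof typeof FIDELITY_WEIGHTS;
type FidelityScores = Record<FidelityDimension, number>;

// Computes the weighted sum, rejecting scores outside [0, 1].
function weightedFidelity(scores: FidelityScores): number {
  let total = 0;
  for (const dim of Object.keys(FIDELITY_WEIGHTS) as FidelityDimension[]) {
    const value = scores[dim];
    if (!(value >= 0 && value <= 1)) {
      throw new RangeError(`${dim} score must be within [0, 1], got ${value}`);
    }
    total += FIDELITY_WEIGHTS[dim] * value;
  }
  return total;
}

const sampleScores: FidelityScores = {
  interaction: 0.9, state: 0.8, visual: 0.6, timing: 0.5, accessibility: 0.7,
};

// 0.315 + 0.200 + 0.120 + 0.050 + 0.070 = 0.755, up to floating-point rounding.
console.log(weightedFidelity(sampleScores).toFixed(3));
```

In this sample a weak visual score of 0.6 costs less than a comparable drop in interaction fidelity would, which reflects the "behavior before pixels" principle.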

Perceptual reality verification

In addition to raw scores, OpenBuildApp must answer: "Would a careful human notice this as a different product experience?"

Perceptual checks:

motion smoothness and sequencing
delay and responsiveness
information hierarchy
interaction feedback
error-state behavior
perceived polish and transition coherence

Acceptance thresholds

MVP: 0.70 weighted score, no critical journey failures
V2: 0.85 weighted score, no major visual or replay regressions
V3: 0.92 weighted score, low timing variance, high state parity
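
Because verification is a scorecard plus hard gates, acceptance needs both to pass. The sketch below combines the four hard gates with the per-release score threshold. It treats each threshold as an inclusive minimum, which is an assumption, and it does not model the qualitative conditions for V2 and V3 (regressions, timing variance, state parity). All names are illustrative.

```typescript
type ReleaseStage = "MVP" | "V2" | "V3";

// Minimum weighted scores from the thresholds above.
const MIN_WEIGHTED_SCORE: Record<ReleaseStage, number> = { MVP: 0.70, V2: 0.85, V3: 0.92 };

interface HardGates {
  noSafetyViolations: boolean;
  replayCompletes: boolean;
  criticalJourneysComplete: boolean;
  twinReproducible: boolean;
}

interface AcceptanceResult { accepted: boolean; failures: string[]; }

// A run is accepted only when every hard gate passes and the score meets the threshold.
function acceptRun(stage: ReleaseStage, weightedScore: number, gates: HardGates): AcceptanceResult {
  const failures: string[] = [];
  for (const gate of Object.keys(gates) as (keyof HardGates)[]) {
    if (!gates[gate]) failures.push(`hard gate failed: ${gate}`);
  }
  if (weightedScore < MIN_WEIGHTED_SCORE[stage]) {
    failures.push(`weighted score ${weightedScore} is below the ${stage} threshold`);
  }
  return { accepted: failures.length === 0, failures };
}

const allGatesPass: HardGates = {
  noSafetyViolations: true, replayCompletes: true,
  criticalJourneysComplete: true, twinReproducible: true,
};

console.log(acceptRun("MVP", 0.755, allGatesPass).accepted); // true
console.log(acceptRun("V2", 0.755, allGatesPass).failures);  // one score failure
```

A high score never compensates for a failed hard gate in this sketch, which matches the description of the gates as hard.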

14. Visual Dashboard Concept

The control center should not look like a generic chat UI. It should feel like a mission console.

Main layout

+----------------------------------------------------------------------------------+
| Run header: status | target type | fidelity score | approvals | snapshot id      |
+--------------------------------+-------------------------------------------------+
| Target viewport                | Twin viewport                                   |
| live frame / captured state    | current generated twin                          |
| diff overlay toggle            | mismatch hotspots                               |
+--------------------------------+-------------------------------------------------+
| Replay timeline with step ids, state transitions, timing marks, and agent lanes  |
+--------------------------------+--------------------+----------------------------+
| Evidence panel                 | Verification panel | Agent activity panel       |
| screenshots, notes, requests   | score deltas       | current phase, blockers    |
+----------------------------------------------------------------------------------+

Critical interactions

scrub replay timeline and sync both panes
toggle overlay modes: pixel, structure, state, timing
inspect mismatch hotspots and jump to evidence
freeze on a step and view agent rationale plus source artifacts
approve or deny gated actions inline

15. Replay and Evidence Pack

Replay model

Replay is a synchronized, step-indexed bundle:

action id
timestamp
target frame
twin frame
pre-state hash
post-state hash
active agent
evidence references
mismatch annotations
confidence at that moment
model state snapshot id
hypothesis changes at that moment
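
The pre-state and post-state hashes make it possible to check that a replay has no gaps: each step should begin in the state the previous step ended in. The sketch below shows that continuity check over a subset of the step fields. The field names and the rule itself are assumptions about how the hashes would be used, not a defined replay format.

```typescript
// A subset of the replay step fields listed above; names are illustrative.
interface ReplayStep {
  actionId: string;
  timestamp: string;
  preStateHash: string;
  postStateHash: string;
  activeAgent: string;
  evidenceRefs: string[];
}

// Returns the action id of the first step whose pre-state does not match the
// previous step's post-state, or null when the chain is continuous.
function firstChainBreak(steps: ReplayStep[]): string | null {
  for (let i = 1; i < steps.length; i++) {
    if (steps[i].preStateHash !== steps[i - 1].postStateHash) {
      return steps[i].actionId;
    }
  }
  return null;
}

const replaySteps: ReplayStep[] = [
  { actionId: "a1", timestamp: "2024-01-01T00:00:00Z", preStateHash: "h0", postStateHash: "h1",
    activeAgent: "interaction-observer", evidenceRefs: ["shot-0001"] },
  { actionId: "a2", timestamp: "2024-01-01T00:00:02Z", preStateHash: "h1", postStateHash: "h2",
    activeAgent: "interaction-observer", evidenceRefs: ["shot-0002"] },
  { actionId: "a3", timestamp: "2024-01-01T00:00:05Z", preStateHash: "h9", postStateHash: "h3",
    activeAgent: "interaction-observer", evidenceRefs: ["shot-0003"] },
];

// Step a3 starts from h9, but a2 ended in h2, so something changed state unlogged.
console.log(firstChainBreak(replaySteps));
```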

Time Machine Mode

Every replay step must let the user inspect:

target screenshot
twin screenshot
current target state
current twin state
active graph nodes and edges
active hypothesis outcomes
approvals and policy decisions
responsible agent and its contract

Evidence pack contents

source manifest with hashes
task and policy files
screenshots and key frames
replay traces
diff images
network summaries
inferred models
generated twin metadata
verification reports
final summary and confidence notes

16. MVP, V2, and V3

MVP

Scope:

web URLs and source folders only
8 effective agents plus the orchestrator
run bootstrapper
basic observation pipeline
first-pass twin generation
replay log
screenshot diff
rollback snapshots
initial evidence pack

Must ship:

Workspace Intake Agent
Task Spec Agent
Policy Agent
UI Observer Agent
Interaction Observer Agent
State Graph Agent
Frontend Generator Agent
Verification Agent
Orchestrator

V2

Scope:

full 21-agent graph
split-screen live twin diff dashboard
network summaries and richer state modeling
desktop target support
stronger scoring and defect clustering

V3

Scope:

mobile package support
deeper timing fidelity
improved automated repair loop
broader artifact fusion
stronger audit export and observability

17. MVP Build Plan

Milestone 1: Run System

deterministic run folder creation
task.md, policy.md, evidence.md, run-state.json
artifact hashing and registry
snapshot and rollback stub

Milestone 2: Observation Core

URL and source-folder classification
browser capture
screenshots, DOM, and interaction recording
action logging and replay timeline generation

Milestone 3: Modeling Core

screen inventory
navigation map
lightweight state graph
component extraction heuristics
first cognitive memory graph
initial hypothesis runner

Milestone 4: Reconstruction Core

initial web twin generator
local shadow API stubs
route and interaction scaffolding

Milestone 5: Verification and Repair

visual diff
replay diff
weighted scorecard
one-pass repair loop
first perceptual verification heuristic pass

Milestone 6: Control Center

split-screen dashboard
replay scrubber
diff overlays
approvals and run controls

18. MVP Acceptance Criteria

A user can drop a URL or source folder into the workspace and start a run.
The system generates a deterministic run folder with required artifacts.
The orchestrator never issues an active interaction before baseline capture completes.
Every active interaction is logged with before and after state references.
The generated twin can be launched and compared side by side with the target.
A fidelity report is produced with at least pixel, replay, and state metrics.
The run can be stopped and rolled back to a clean snapshot.
The final evidence pack can be zipped and reviewed offline.

19. Key Technical Risks and Mitigations

Partial observability
- Risk: many targets do not expose internal state, so the system can overclaim.
- Mitigation: separate observed facts from inferred claims, propagate confidence, and require evidence refs for every state transition.
Nondeterministic timing
- Risk: timing varies across runs and environments, producing false mismatches.
- Mitigation: use tolerance bands, medians, variance buckets, and multi-run replay baselines rather than exact equality.
Sandbox isolation complexity
- Risk: strong isolation can reduce visibility, while weak isolation increases risk.
- Mitigation: define target-class-specific sandbox profiles and never enable a capture adapter without a matching policy profile.
Desktop and mobile capture complexity
- Risk: OS APIs, emulators, and window trees are inconsistent.
- Mitigation: ship web-first in MVP, desktop in V2 behind adapter contracts, mobile in V3 with emulator-only support first.
Overfitting to visuals instead of behavior
- Risk: the twin can look right but behave wrong.
- Mitigation: keep interaction and state fidelity weighted above visual fidelity and treat critical path failures as hard stops.
Repair loops breaking previously working flows
- Risk: automated patches regress passing behavior.
- Mitigation: generate regression guards before risky repairs and require targeted replays after every patch.
Secret leakage in logs and screenshots
- Risk: captures can accidentally store sensitive information.
- Mitigation: default redaction, secret scanners on artifacts, masked evidence views, and short retention on sensitive transient files.
Confidence calibration
- Risk: the system can sound precise when the evidence is weak.
- Mitigation: maintain confidence at node, edge, hypothesis, and run levels, and down-rank claims that fail repeated verification.
Agent conflict and context bloat
- Risk: multi-agent systems can contradict themselves and become expensive.
- Mitigation: force all coordination through the orchestrator, store compressed shared state in the graph, and require structured outputs only.
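
The mitigation for nondeterministic timing can be made concrete with a median comparison inside a tolerance band. The sketch below is one possible form: the 15% relative tolerance and the 20 ms absolute floor are placeholder values chosen for illustration, not figures from this design.

```typescript
// Median of a non-empty sample.
function median(samples: number[]): number {
  if (samples.length === 0) throw new Error("median of empty sample");
  const sorted = samples.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Compares medians from several runs instead of requiring exact equality.
// The band is the larger of an absolute floor and a fraction of the target median.
function timingMatches(
  targetMs: number[],
  twinMs: number[],
  relTolerance = 0.15, // placeholder value
  absFloorMs = 20,     // placeholder value
): boolean {
  const targetMedian = median(targetMs);
  const twinMedian = median(twinMs);
  const band = Math.max(absFloorMs, targetMedian * relTolerance);
  return Math.abs(twinMedian - targetMedian) <= band;
}

// Target median is 205 ms despite the 400 ms outlier, so the band is 30.75 ms.
const targetDurations = [200, 210, 190, 400, 205];

console.log(timingMatches(targetDurations, [220, 215, 230])); // true: medians differ by 15 ms
console.log(timingMatches(targetDurations, [300, 310, 290])); // false: medians differ by 95 ms
```

Using the median keeps a single slow run from producing a false mismatch, and the absolute floor avoids flagging very short transitions for differences a user would not perceive.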

20. Why This Is Different

The unit of work is a reversible run, not a prompt.
The system reconstructs from evidence, not from imagination.
Every agent has an explicit contract, scope, and stop condition.
Verification is deterministic and multi-dimensional, not aesthetic.
The dashboard exposes mismatch mechanics in real time.
Replay is first-class, not a debug afterthought.
Rollback and audit are built in from day one.
The system treats behavior as the primary product surface.

21. Build Order

Run ledger and workspace bootstrapper
Task, policy, evidence, and schema enforcement
Snapshot, rollback, and approval gate
Web observation adapters and passive capture
Action log, event log, and replay timeline
State graph, navigation map, and first cognitive memory graph
Hypothesis engine and living state model
First twin generator and shadow API scaffolding
Verification engine with hard gates and weighted scoring
Repair loop with regression guards
Mission-control dashboard and replay inspector
Evidence bundle exporter and archive browser
