OpenBuildApp System Design

1. Executive Summary

OpenBuildApp is a production-grade software twin engine, not a cloning toy. A user places an authorized target artifact into a workspace, the platform launches an isolated run, specialized agents observe the target and reconstruct a twin, and deterministic verification loops continue until the twin meets the acceptance thresholds or another stop condition is reached. Every action is logged, every important inference is explainable, and every run can be replayed and rolled back.

The product is built around a run ledger, strict agent contracts, sandbox-first execution, and evidence packs that make every conclusion reviewable by a human. The core unit of work is a run folder, not a conversational session.

2. Product Positioning

Positioning statement

OpenBuildApp is the first software archaeology and reconstruction platform for authorized analysis: it does not merely generate code from prompts, it studies a real software artifact, constructs a reversible behavioral model, produces a twin, and proves how close the twin is with replay, diffing, and structured evidence.

What it competes against

  • AI coding assistants that start from intent but not evidence
  • UI copiers that mimic layout without understanding behavior
  • Test automation tools that observe behavior but do not reconstruct it
  • Reverse engineering utilities that lack productized safety, replay, or auditability

What it is not

  • A DRM bypass tool
  • A credential extraction tool
  • A covert traffic interception system
  • A generic autonomous coding shell

3. Design Principles

  • Single source of truth per run: run-state.json
  • Event-sourced execution: every material step emits an append-only event
  • Behavior before pixels: interaction and state fidelity outrank cosmetic similarity
  • Safe by default: sandbox, policy gate, confirmation gate, rollback gate
  • Short-lived agent memory: agents consume scoped artifacts, then exit
  • No uncontrolled peer-to-peer agent chat: all coordination flows through the orchestrator
  • Deterministic artifacts: repeatable folder layout, schemas, logs, and replay frames
  • Explainability: every major inference cites evidence references
  • Hypotheses must be testable, versioned, and either validated, downgraded, or rejected
  • Confidence is an explicit field on observations, models, verification results, and final reports
  • The machine-readable model is a first-class product, not a byproduct of code generation

4. System Shape

User Workspace
  -> Run Bootstrapper
  -> Control Layer
       -> Policy Gate
       -> Run Ledger
       -> Snapshot Manager
       -> Agent Dispatcher
  -> Observation Layer
  -> Cognitive Graph Layer
  -> Reconstruction Layer
  -> Verification Layer
  -> Repair Layer
  -> Reporting Layer
  -> Final Evidence Pack + Replay Archive

5. Layered Architecture

Layer 1: Control Layer

Purpose: Own the run, gate actions, compress context, arbitrate conflicts, and decide whether to continue, retry, rollback, or escalate.

Core services:

  • Orchestrator runtime
  • Policy engine
  • Run-state store
  • Event bus
  • Snapshot scheduler
  • Approval gate

Primary artifacts:

  • run-state.json
  • events.jsonl
  • task.md
  • policy.md
  • summary.md

Layer 2: Observation Layer

Purpose: Capture what the target does without generating implementation code.

Adapters:

  • Browser instrumentation for web URLs
  • Window capture and accessibility bridge for desktop apps
  • Emulator instrumentation for mobile packages
  • Frame and OCR pipeline for screenshots and recordings
  • Static file readers for logs and source folders

Primary artifacts:

  • screenshots
  • screen segments
  • DOM or window trees
  • accessibility trees
  • network summaries
  • interaction traces

Layer 3: Cognitive Graph Layer

Purpose: Convert raw observations into a living, inspectable cognition model that the orchestrator and downstream agents can reason over.

Core subsystems:

  • Cognitive Memory Graph
  • Autonomous Hypothesis Engine
  • Living State Model
  • Confidence propagation engine

Graph nodes:

  • screens
  • components
  • controls
  • actions
  • states
  • transitions
  • entities
  • backend behaviors
  • evidence artifacts
  • hypotheses

Graph edges:

  • contains
  • triggers
  • transitions_to
  • reads_from
  • writes_to
  • resembles
  • supports
  • contradicts
  • proven_by
  • inferred_from

Primary artifacts:

  • models/cognitive-memory-graph.json
  • models/component-map.json
  • models/navigation-map.json
  • models/state-graph.json
  • models/event-graph.json
  • models/data-model.json
  • models/backend-contract.json
  • models/hypotheses.json

Layer 3A: Cognitive Memory Graph

The graph is the platform's durable software cognition substrate. It is not a cache. It is the machine-readable explanation of what the system believes, why it believes it, and how strongly.

Each node and edge must store:

  • stable id
  • type
  • confidence score
  • provenance
  • evidence refs
  • last-updated step
  • conflict markers where applicable
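
As a minimal sketch of the required fields above, the shape below models one graph element and checks it before it is written to the graph. The names `GraphElement`, `Provenance`, and `validateElement`, the 0 to 1 confidence range, and the rule that at least one evidence ref must be present are assumptions made for illustration, not a confirmed schema.

```typescript
// Hypothetical element shape; field names follow the list above.
type Provenance = "observed" | "inferred"; // assumed vocabulary

interface GraphElement {
  id: string;              // stable id
  type: string;            // e.g. "screen", "control", "transitions_to"
  confidence: number;      // assumed range: 0 to 1
  provenance: Provenance;
  evidenceRefs: string[];  // ids or paths of evidence artifacts
  lastUpdatedStep: number; // step index of the last update
  conflicts?: string[];    // ids of contradicting elements, where applicable
}

// Returns a list of problems; an empty list means the element passes.
function validateElement(el: GraphElement): string[] {
  const problems: string[] = [];
  if (el.id.trim() === "") problems.push("missing stable id");
  if (!(el.confidence >= 0 && el.confidence <= 1)) problems.push("confidence out of range");
  if (el.evidenceRefs.length === 0) problems.push("no evidence refs");
  if (el.lastUpdatedStep < 0 || Math.floor(el.lastUpdatedStep) !== el.lastUpdatedStep) {
    problems.push("invalid last-updated step");
  }
  return problems;
}

const saveButton: GraphElement = {
  id: "control:save-button",
  type: "control",
  confidence: 0.82,
  provenance: "observed",
  evidenceRefs: ["artifacts/screenshots/0007.png"],
  lastUpdatedStep: 7,
};
console.log(validateElement(saveButton)); // prints []
```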

Layer 3B: Autonomous Hypothesis Engine

The hypothesis engine forms and validates claims such as:

  • this control opens a modal
  • this state transition represents save success
  • this endpoint returns paginated results
  • this component is stable across screens
  • this animation is ornamental, not state-bearing

Hypothesis lifecycle:

proposed -> queued -> tested -> validated
                         -> downgraded
                         -> rejected

Hypotheses must include:

  • natural-language claim
  • machine-checkable expectation
  • triggering evidence
  • required validation action
  • confidence before test
  • confidence after test
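
The lifecycle and required fields above can be sketched as a small state machine. The type and function names are hypothetical, and treating `validated`, `downgraded`, and `rejected` as terminal outcomes is an assumption of this sketch; the design does not say whether a downgraded hypothesis can be re-queued.

```typescript
type HypothesisStatus =
  | "proposed" | "queued" | "tested"
  | "validated" | "downgraded" | "rejected";

// Allowed transitions, read off the lifecycle diagram above.
const NEXT_STATUS: Record<HypothesisStatus, HypothesisStatus[]> = {
  proposed: ["queued"],
  queued: ["tested"],
  tested: ["validated", "downgraded", "rejected"],
  validated: [],
  downgraded: [],
  rejected: [],
};

interface Hypothesis {
  claim: string;                // natural-language claim
  expectation: string;          // machine-checkable expectation
  triggeringEvidence: string[]; // evidence refs that prompted the claim
  validationAction: string;     // required validation action
  confidenceBefore: number;
  confidenceAfter?: number;     // set once an outcome is recorded
  status: HypothesisStatus;
}

// Returns a new hypothesis in the next status, or throws on an illegal move.
function advance(h: Hypothesis, next: HypothesisStatus, confidenceAfter?: number): Hypothesis {
  if (NEXT_STATUS[h.status].indexOf(next) === -1) {
    throw new Error("illegal transition " + h.status + " -> " + next);
  }
  const isOutcome = NEXT_STATUS[next].length === 0;
  if (isOutcome && confidenceAfter === undefined) {
    throw new Error("an outcome needs a confidence-after-test value");
  }
  return isOutcome ? { ...h, status: next, confidenceAfter } : { ...h, status: next };
}

let modalClaim: Hypothesis = {
  claim: "this control opens a modal",
  expectation: "a dialog role appears in the a11y tree after click",
  triggeringEvidence: ["artifacts/a11y/0003.json"],
  validationAction: "click control and re-capture a11y tree",
  confidenceBefore: 0.6,
  status: "proposed",
};
modalClaim = advance(modalClaim, "queued");
modalClaim = advance(modalClaim, "tested");
modalClaim = advance(modalClaim, "validated", 0.9);
```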

Layer 4: Reconstruction Layer

Purpose: Produce a working twin with explicit assumptions.

Output targets:

  • web twin for MVP
  • desktop shell wrapper in V2
  • mobile twin targets in V3

Primary artifacts:

  • generated source code
  • shadow API definitions
  • local state engine
  • replayable tests

Layer 5: Verification Layer

Purpose: Measure behavioral and visual distance between target and twin.

Verification modes:

  • pixel diff
  • structural diff
  • timing diff
  • replay diff
  • state diff
  • accessibility parity check
  • perceptual reality verification
  • critical path regression verification

Primary artifacts:

  • verification/report.json
  • annotated diff images
  • replay mismatch logs
  • fidelity scores
  • perceptual mismatch report

Layer 6: Repair Layer

Purpose: Translate verification failures into targeted patches without destabilizing passing behavior.

Strategies:

  • defect clustering by root cause
  • patch selection by highest leverage mismatch
  • regression guard generation before applying risky repairs
  • rollback to last known-good snapshot on failure
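
One way to read the first two strategies is sketched below: defects are grouped by root cause, and clusters are ranked by their summed impact so the highest-leverage mismatch is patched first. The `Defect` fields and the use of a summed `impact` as the leverage measure are assumptions for illustration.

```typescript
// Hypothetical defect record; "rootCause" and "impact" are illustrative fields.
interface Defect { id: string; rootCause: string; impact: number; }
interface DefectCluster { rootCause: string; defectIds: string[]; totalImpact: number; }

// Groups defects by root cause and sorts clusters by total impact, highest first.
function clusterByRootCause(defects: Defect[]): DefectCluster[] {
  const byCause: { [cause: string]: DefectCluster } = Object.create(null);
  for (const d of defects) {
    const existing = byCause[d.rootCause];
    if (existing) {
      existing.defectIds.push(d.id);
      existing.totalImpact += d.impact;
    } else {
      byCause[d.rootCause] = { rootCause: d.rootCause, defectIds: [d.id], totalImpact: d.impact };
    }
  }
  const clusters = Object.keys(byCause).map((cause) => byCause[cause]);
  clusters.sort((a, b) => b.totalImpact - a.totalImpact);
  return clusters;
}

const ranked = clusterByRootCause([
  { id: "d1", rootCause: "modal-focus-trap", impact: 3 },
  { id: "d2", rootCause: "save-toast-timing", impact: 4 },
  { id: "d3", rootCause: "modal-focus-trap", impact: 2 },
]);
console.log(ranked[0].rootCause); // modal-focus-trap (total impact 5)
```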

Layer 7: Reporting Layer

Purpose: Turn the run into an auditable product artifact.

Primary outputs:

  • evidence.md
  • summary.md
  • report.json
  • exportable evidence bundle
  • architecture delta notes

6. Input Classification Matrix

| Input type | Primary pipeline | Confidence ceiling | Notes |
| --- | --- | --- | --- |
| URL for web app | browser capture + DOM + network + replay | high | Best early target for MVP |
| source code folder | static parse + build/run in sandbox + screenshot capture | high | Best for explainable twin generation |
| desktop executable | VM or sandbox runner + window tree + screenshot + a11y | medium | Stronger isolation required |
| mobile package | emulator + UI automation + screenshot + a11y | medium | Start in V3 |
| screenshot set | visual inference only | low | No behavioral claims beyond evidence |
| screen recording | temporal visual inference + OCR + event hints | medium-low | Good for flow reconstruction |
| logs | state/event augmentation | medium | Never treated as full truth alone |
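
The matrix above can be treated as a lookup that caps any confidence value derived from a given input type. The sketch below assumes hypothetical identifiers for the input types and purely illustrative numeric values for each band; the design names the bands but does not assign numbers to them.

```typescript
type InputType =
  | "web-url" | "source-folder" | "desktop-executable" | "mobile-package"
  | "screenshot-set" | "screen-recording" | "logs";
type Ceiling = "high" | "medium" | "medium-low" | "low";

// Bands copied from the matrix above.
const CEILING_BY_INPUT: Record<InputType, Ceiling> = {
  "web-url": "high",
  "source-folder": "high",
  "desktop-executable": "medium",
  "mobile-package": "medium",
  "screenshot-set": "low",
  "screen-recording": "medium-low",
  "logs": "medium",
};

// Illustrative numeric caps (assumption, not part of the design).
const CEILING_VALUE: Record<Ceiling, number> = {
  "high": 0.95,
  "medium": 0.75,
  "medium-low": 0.6,
  "low": 0.4,
};

// Clamps a raw confidence so it never exceeds the ceiling for its input type.
function capConfidence(input: InputType, rawConfidence: number): number {
  return Math.min(rawConfidence, CEILING_VALUE[CEILING_BY_INPUT[input]]);
}

console.log(capConfidence("screenshot-set", 0.9)); // 0.4
console.log(capConfidence("web-url", 0.5));        // 0.5
```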

7. Technology Architecture

OpenBuildApp should be a monorepo with a split between product UI, orchestration, capture, and shared contracts.

/apps
  /control-center        # split-screen dashboard, replay timeline, run control UI
  /twin-runtime          # runtime shell for generated twins
/services
  /orchestrator          # run lifecycle, agent dispatch, policy gate
  /capture-gateway       # screenshot, a11y, DOM, replay, timing capture
  /sandbox-manager       # VM/container/session isolation and reset
  /verification-engine   # diffing, scoring, replay comparison
/packages
  /contracts             # zod/json schemas and shared types
  /agent-runtime         # base agent interfaces and runners
  /prompt-contracts      # agent contract compiler and templates
  /model-store           # state graph, component map, navigation map
  /evidence-kit          # evidence bundle assembly

Opinionated stack

  • Product UI: React + TypeScript + Vite + TanStack Router + Zustand
  • Desktop shell: Tauri for local control center packaging
  • Orchestration: TypeScript service for fast agent iteration and shared contracts
  • Isolation helpers: Rust sidecars for snapshotting, desktop capture, and low-level timing fidelity
  • Storage: SQLite for run metadata, JSONL for append-only event logs, filesystem for evidence artifacts
  • Schemas: JSON Schema plus runtime validation in Zod

8. Canonical Run Data Model

Each run gets a deterministic folder and a canonical ledger:

  • run-state.json: current state snapshot
  • events.jsonl: append-only events
  • actions.jsonl: operator or agent actions with monotonic sequence ids
  • artifacts/index.json: artifact catalog with hashes and provenance
  • models/live-state-target.json: current target-side live state projection
  • models/live-state-twin.json: current twin-side live state projection
  • models/hypotheses.json: current active and historical hypotheses
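
The append-only files with monotonic sequence ids can be sketched in memory as below. The class and field names are hypothetical, and a real implementation would append lines to `actions.jsonl` on disk rather than to an array; the point is only that entries are never rewritten and that a reader can detect gaps.

```typescript
interface LoggedAction { seq: number; actor: string; kind: string; detail: string; }

class AppendOnlyLog {
  private lines: string[] = [];
  private nextSeq = 1;

  // Assigns the next sequence id and appends one JSON line; nothing is ever edited.
  append(actor: string, kind: string, detail: string): LoggedAction {
    const entry: LoggedAction = { seq: this.nextSeq, actor, kind, detail };
    this.nextSeq += 1;
    this.lines.push(JSON.stringify(entry));
    return entry;
  }

  // Serializes the log in JSON Lines form: one JSON object per line.
  toJsonl(): string {
    return this.lines.join("\n") + "\n";
  }
}

// Parses a JSON Lines string and rejects it if sequence ids skip or repeat.
function parseJsonl(jsonl: string): LoggedAction[] {
  const entries: LoggedAction[] = [];
  let expected = 1;
  for (const line of jsonl.split("\n")) {
    if (line === "") continue;
    const entry = JSON.parse(line) as LoggedAction;
    if (entry.seq !== expected) throw new Error("sequence gap at " + entry.seq);
    entries.push(entry);
    expected += 1;
  }
  return entries;
}

const actionLog = new AppendOnlyLog();
actionLog.append("interaction-observer", "click", "control:save-button");
actionLog.append("operator", "approve", "gated action 12");
console.log(parseJsonl(actionLog.toJsonl()).length); // 2
```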

run-state.json should contain:

  • run metadata
  • input classification
  • current phase
  • active agent
  • approval state
  • latest snapshot id
  • verification status
  • fidelity scores
  • stop reason
  • model coverage
  • current confidence band
  • active hypotheses count
  • current target state id
  • current twin state id
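
One possible way to keep `run-state.json` consistent with the event log is to derive it by folding events over an initial state, as sketched below. This is an assumption about how the two artifacts relate, not something the design mandates; the `RunState` shape covers only a subset of the fields listed above, and the event names are hypothetical.

```typescript
interface RunState {
  runId: string;
  inputType: string;
  phase: number;                 // 0 to 8, matching the run-time phases
  activeAgent: string | null;
  latestSnapshotId: string | null;
  stopReason: string | null;
  activeHypotheses: number;
}

type RunEvent =
  | { type: "phase-entered"; phase: number }
  | { type: "agent-dispatched"; agent: string }
  | { type: "snapshot-created"; snapshotId: string }
  | { type: "hypothesis-proposed" }
  | { type: "hypothesis-resolved" }
  | { type: "run-stopped"; reason: string };

// Pure reducer: returns the state that results from applying one event.
function applyEvent(state: RunState, ev: RunEvent): RunState {
  switch (ev.type) {
    case "phase-entered": return { ...state, phase: ev.phase };
    case "agent-dispatched": return { ...state, activeAgent: ev.agent };
    case "snapshot-created": return { ...state, latestSnapshotId: ev.snapshotId };
    case "hypothesis-proposed": return { ...state, activeHypotheses: state.activeHypotheses + 1 };
    case "hypothesis-resolved": return { ...state, activeHypotheses: state.activeHypotheses - 1 };
    case "run-stopped": return { ...state, stopReason: ev.reason, activeAgent: null };
  }
}

// Replaying the whole log from the initial state rebuilds the snapshot.
function projectRunState(initial: RunState, runEvents: RunEvent[]): RunState {
  return runEvents.reduce(applyEvent, initial);
}

const initialState: RunState = {
  runId: "run-001", inputType: "web-url", phase: 0,
  activeAgent: null, latestSnapshotId: null, stopReason: null, activeHypotheses: 0,
};
const projected = projectRunState(initialState, [
  { type: "phase-entered", phase: 2 },
  { type: "agent-dispatched", agent: "ui-observer" },
  { type: "snapshot-created", snapshotId: "snap-0001" },
  { type: "hypothesis-proposed" },
  { type: "hypothesis-proposed" },
  { type: "hypothesis-resolved" },
]);
console.log(projected.phase, projected.activeHypotheses); // 2 1
```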

9. The 21-Agent System

Control

  1. Master Orchestrator Agent
    • Owns phase transitions, routing, stop decisions, and evidence compression.

Intake and Governance

  2. Workspace Intake Agent
    • Detects input type, hashes artifacts, validates workspace readiness.
  3. Task Spec Agent
    • Produces and updates task.md, success criteria, and scope.
  4. Policy Agent
    • Produces and enforces policy.md, approvals, and red lines.

Observation

  5. UI Observer Agent
    • Captures layout, typography, hierarchy, spacing, and screen inventory.
  6. Interaction Observer Agent
    • Records clicks, typing, focus, hover, menus, and visible transitions.
  7. Animation Observer Agent
    • Measures timing, easing, motion paths, and transition durations.
  8. Accessibility Agent
    • Extracts a11y trees, semantic roles, labels, focus order, and landmarks.
  9. Network Observer Agent
    • Summarizes safe request and response patterns without storing secrets.

Modeling

  10. State Graph Agent
    • Builds a state machine from observed events and UI changes.
  11. Navigation Agent
    • Maps routes, screens, windows, modal layers, and back-stack logic.
  12. Component Mapper Agent
    • Clusters visible primitives into reusable component candidates.
  13. Backend Behavior Inference Agent
    • Infers response contracts and data dependencies from allowed evidence.
  14. Data Model Agent
    • Infers entity shapes, collections, filters, and client-state patterns.

Reconstruction

  15. Frontend Generator Agent
    • Produces the twin UI structure and component tree.
  16. Interaction Coding Agent
    • Implements user actions, forms, keyboard flows, and local behaviors.
  17. Shadow API Agent
    • Creates mocks, inferred endpoints, and deterministic local adapters.

Verification and Repair

  18. Verification Agent
    • Runs scoring, diffing, replay comparison, and gate checks.
  19. Repair Agent
    • Applies targeted fixes based on verification deltas.
  20. Rollback and Snapshot Agent
    • Manages restore points, clean resets, and failed-repair recovery.

Reporting

  21. Reporting and Evidence Agent
    • Produces evidence packs, summaries, changelogs, and final reports.

10. Run-Time Flow

Phase 0: Workspace Setup

  • Create run folder and seed deterministic files
  • Hash inputs and register artifacts
  • Create initial snapshot

Phase 1: Input Classification

  • Detect artifact class and confidence
  • Select pipeline and safe tool envelope
  • Mark unsupported claims early

Phase 2: Safe Launch and Passive Observation

  • Launch target in isolated environment
  • Begin screenshot, DOM, a11y, and timing capture
  • Avoid active interaction until baseline state is recorded

Phase 3: Structured Exploration

  • Explore low-risk actions first
  • Never click destructive or purchase-like controls without approval
  • Record all actions, before and after states, and confidence notes
  • Propose and queue testable hypotheses about controls, transitions, and data behavior

Phase 4: Behavior Modeling

  • Build screen graph, state graph, navigation map, and component inventory
  • Cross-link observations to specific evidence references
  • Update the cognitive memory graph and prune rejected hypotheses
  • Maintain a living state model for both target and twin candidates

Phase 5: Reconstruction

  • Generate initial twin
  • Generate shadow backend and local state engine
  • Generate replayable tests for observed critical paths

Phase 6: Verification

  • Run deterministic comparisons
  • Produce weighted fidelity scores
  • Create defect clusters with evidence links
  • Run perceptual reality checks for motion, delay, hierarchy, feedback, and error handling
  • Update confidence based on replay stability and verification reproducibility

Phase 7: Repair Loop

  • Patch highest-value failures first
  • Re-run targeted verification
  • Escalate when the iteration cap is reached, a policy gate blocks the next step, or observations become unstable

Phase 8: Finalization

  • Freeze final snapshot
  • Build evidence pack and replay archive
  • Produce final summary with known gaps and confidence level
  • Archive the time-machine replay so every step can be scrubbed with model state and evidence

11. Stop Conditions

The orchestrator stops the run when any of the following is true:

  • verification thresholds are met
  • the user requests stop
  • a policy boundary is hit
  • the target becomes unstable or nondeterministic beyond tolerance
  • the repair loop hits iteration or cost ceilings
  • evidence is insufficient to make safe claims
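
The stop conditions above can be expressed as a single check that the orchestrator runs after each step. The signal names are hypothetical, and the evaluation order (policy first, then user stop, and so on) is an assumption of this sketch; the design lists the conditions without ranking them.

```typescript
type StopReason =
  | "thresholds-met" | "user-stop" | "policy-boundary"
  | "target-unstable" | "repair-ceiling" | "insufficient-evidence";

// Hypothetical signals the orchestrator would collect for the current run.
interface RunSignals {
  weightedScore: number;
  threshold: number;
  hardGatesPassed: boolean;
  userRequestedStop: boolean;
  policyBoundaryHit: boolean;
  instability: number;          // measured nondeterminism
  instabilityTolerance: number;
  repairIterations: number;
  maxRepairIterations: number;
  costSpent: number;
  costCeiling: number;
  evidenceSufficient: boolean;
}

// Returns the first matching stop reason, or null if the run should continue.
function evaluateStop(s: RunSignals): StopReason | null {
  if (s.policyBoundaryHit) return "policy-boundary";
  if (s.userRequestedStop) return "user-stop";
  if (s.instability > s.instabilityTolerance) return "target-unstable";
  if (!s.evidenceSufficient) return "insufficient-evidence";
  if (s.hardGatesPassed && s.weightedScore >= s.threshold) return "thresholds-met";
  if (s.repairIterations >= s.maxRepairIterations || s.costSpent >= s.costCeiling) {
    return "repair-ceiling";
  }
  return null;
}

const midRun: RunSignals = {
  weightedScore: 0.62, threshold: 0.7, hardGatesPassed: true,
  userRequestedStop: false, policyBoundaryHit: false,
  instability: 0.05, instabilityTolerance: 0.2,
  repairIterations: 2, maxRepairIterations: 5,
  costSpent: 10, costCeiling: 100,
  evidenceSufficient: true,
};
console.log(evaluateStop(midRun)); // null
```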

12. Safety Model

OpenBuildApp uses four hard gates:

  1. Authorization Gate
    • Runs only on user-authorized artifacts in user-owned workspaces.
  2. Policy Gate
    • Blocks disallowed actions such as credential harvesting, DRM bypass, or privileged destructive actions.
  3. Confirmation Gate
    • Requires human approval for risky UI interactions, external side effects, or irreversible steps.
  4. Rollback Gate
    • Requires a restorable snapshot before any action classified as state-mutating.
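
The four gates can be sketched as an ordered chain that every proposed action passes through before execution. The `ProposedAction` fields are hypothetical; in practice the policy and risk classifications would come from the policy engine rather than from booleans supplied by the caller.

```typescript
// Hypothetical description of an action an agent wants to take.
interface ProposedAction {
  description: string;
  artifactAuthorized: boolean;        // authorization gate input
  disallowedByPolicy: boolean;        // policy gate input
  risky: boolean;                     // confirmation gate input
  humanApproved: boolean;
  stateMutating: boolean;             // rollback gate input
  restorableSnapshotId: string | null;
}

type GateDecision = { allowed: true } | { allowed: false; gate: string };

// Evaluates the gates in the order listed above and stops at the first failure.
function checkGates(a: ProposedAction): GateDecision {
  if (!a.artifactAuthorized) return { allowed: false, gate: "authorization" };
  if (a.disallowedByPolicy) return { allowed: false, gate: "policy" };
  if (a.risky && !a.humanApproved) return { allowed: false, gate: "confirmation" };
  if (a.stateMutating && a.restorableSnapshotId === null) {
    return { allowed: false, gate: "rollback" };
  }
  return { allowed: true };
}

const submitForm: ProposedAction = {
  description: "submit settings form",
  artifactAuthorized: true,
  disallowedByPolicy: false,
  risky: true,
  humanApproved: false,
  stateMutating: true,
  restorableSnapshotId: "snap-0004",
};
console.log(checkGates(submitForm)); // blocked at the confirmation gate
```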

Privacy rules

  • Store only redacted request and response summaries by default
  • Hash secrets; never persist raw secrets in evidence
  • Expire sensitive transient artifacts after run completion
  • Keep evidence precise enough for audit but not rich enough to leak user data

13. Verification Model

Verification is a scorecard plus hard gates.

Hard gates

  • no blocking safety violations
  • replay can execute without fatal divergence
  • critical user journeys complete
  • generated twin is reproducible from stored artifacts

Weighted fidelity score

  • 35% interaction fidelity
  • 25% state fidelity
  • 20% visual fidelity
  • 10% timing fidelity
  • 10% accessibility fidelity

Perceptual reality verification

In addition to raw scores, OpenBuildApp must answer: "Would a careful human notice this as a different product experience?"

Perceptual checks:

  • motion smoothness and sequencing
  • delay and responsiveness
  • information hierarchy
  • interaction feedback
  • error-state behavior
  • perceived polish and transition coherence

Acceptance thresholds

  • MVP: 0.70 weighted score, no critical journey failures
  • V2: 0.85 weighted score, no major visual or replay regressions
  • V3: 0.92 weighted score, low timing variance, high state parity
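
The weights and thresholds above combine as in the sketch below. It assumes each dimension is already scored on a 0 to 1 scale and folds the stage-specific qualitative conditions (for example "no critical journey failures") into a single `hardGatesPassed` flag, which is a simplification for illustration.

```typescript
interface FidelityScores {
  interaction: number;
  state: number;
  visual: number;
  timing: number;
  accessibility: number;
}

// Weights from the scorecard above; they sum to 1.
const WEIGHTS: FidelityScores = {
  interaction: 0.35,
  state: 0.25,
  visual: 0.20,
  timing: 0.10,
  accessibility: 0.10,
};

function weightedScore(s: FidelityScores): number {
  return (
    s.interaction * WEIGHTS.interaction +
    s.state * WEIGHTS.state +
    s.visual * WEIGHTS.visual +
    s.timing * WEIGHTS.timing +
    s.accessibility * WEIGHTS.accessibility
  );
}

type Stage = "MVP" | "V2" | "V3";
const THRESHOLD: Record<Stage, number> = { MVP: 0.70, V2: 0.85, V3: 0.92 };

// A run is accepted only if the hard gates pass and the score clears the stage threshold.
function accepts(stage: Stage, s: FidelityScores, hardGatesPassed: boolean): boolean {
  return hardGatesPassed && weightedScore(s) >= THRESHOLD[stage];
}

const sampleScores: FidelityScores = {
  interaction: 0.8, state: 0.7, visual: 0.9, timing: 0.5, accessibility: 0.6,
};
// 0.28 + 0.175 + 0.18 + 0.05 + 0.06, roughly 0.745
console.log(accepts("MVP", sampleScores, true)); // true
console.log(accepts("V2", sampleScores, true));  // false
```

Because interaction and state together carry 60% of the weight, a twin with perfect visual and timing scores but no working behavior cannot reach the MVP threshold.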

14. Visual Dashboard Concept

The control center should not look like a generic chat UI. It should feel like a mission console.

Main layout

+---------------------------------------------------------------------------------+
| Run header: status | target type | fidelity score | approvals | snapshot id     |
+--------------------------------+------------------------------------------------+
| Target viewport                | Twin viewport                                  |
| live frame / captured state    | current generated twin                         |
| diff overlay toggle            | mismatch hotspots                              |
+--------------------------------+------------------------------------------------+
| Replay timeline with step ids, state transitions, timing marks, and agent lanes |
+--------------------------------+--------------------+---------------------------+
| Evidence panel                 | Verification panel | Agent activity panel      |
| screenshots, notes, requests   | score deltas       | current phase, blockers   |
+--------------------------------+--------------------+---------------------------+

Critical interactions

  • scrub replay timeline and sync both panes
  • toggle overlay modes: pixel, structure, state, timing
  • inspect mismatch hotspots and jump to evidence
  • freeze on a step and view agent rationale plus source artifacts
  • approve or deny gated actions inline

15. Replay and Evidence Pack

Replay model

Replay is a synchronized, step-indexed bundle:

  • action id
  • timestamp
  • target frame
  • twin frame
  • pre-state hash
  • post-state hash
  • active agent
  • evidence references
  • mismatch annotations
  • confidence at that moment
  • model state snapshot id
  • hypothesis changes at that moment
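
A replay step and two checks a replay inspector might run over a sequence of steps are sketched below. The interface covers only a subset of the fields above, and the continuity rule (each step's pre-state hash equals the previous step's post-state hash) is an assumption about how the hashes are meant to be used, not a stated requirement.

```typescript
interface ReplayStep {
  actionId: string;
  timestamp: string;      // ISO 8601 string, assumed
  preStateHash: string;
  postStateHash: string;
  activeAgent: string;
  evidenceRefs: string[];
  mismatches: string[];   // mismatch annotations for this step
  confidence: number;
}

// Index of the first step whose pre-state hash does not continue from the
// previous step's post-state hash, or -1 if the chain is unbroken.
function findChainBreak(steps: ReplayStep[]): number {
  for (let i = 1; i < steps.length; i++) {
    if (steps[i].preStateHash !== steps[i - 1].postStateHash) return i;
  }
  return -1;
}

// First step carrying a mismatch annotation, or null if target and twin agree throughout.
function firstMismatch(steps: ReplayStep[]): ReplayStep | null {
  for (const step of steps) {
    if (step.mismatches.length > 0) return step;
  }
  return null;
}

// Small helper for building example steps.
function makeStep(actionId: string, pre: string, post: string, mismatches: string[]): ReplayStep {
  return {
    actionId, timestamp: "2024-01-01T00:00:00Z",
    preStateHash: pre, postStateHash: post,
    activeAgent: "verification", evidenceRefs: [], mismatches, confidence: 0.8,
  };
}

const replaySteps = [
  makeStep("a1", "h0", "h1", []),
  makeStep("a2", "h1", "h2", ["twin modal opened 300 ms late"]),
  makeStep("a3", "h2", "h3", []),
];
console.log(findChainBreak(replaySteps)); // -1
```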

Time Machine Mode

Every replay step must let the user inspect:

  • target screenshot
  • twin screenshot
  • current target state
  • current twin state
  • active graph nodes and edges
  • active hypothesis outcomes
  • approvals and policy decisions
  • responsible agent and its contract

Evidence pack contents

  • source manifest with hashes
  • task and policy files
  • screenshots and key frames
  • replay traces
  • diff images
  • network summaries
  • inferred models
  • generated twin metadata
  • verification reports
  • final summary and confidence notes

16. MVP, V2, and V3

MVP

Scope:

  • web URLs and source folders only
  • 8 effective agents plus the orchestrator
  • run bootstrapper
  • basic observation pipeline
  • first-pass twin generation
  • replay log
  • screenshot diff
  • rollback snapshots
  • initial evidence pack

Must ship:

  • Workspace Intake Agent
  • Task Spec Agent
  • Policy Agent
  • UI Observer Agent
  • Interaction Observer Agent
  • State Graph Agent
  • Frontend Generator Agent
  • Verification Agent
  • Orchestrator

V2

Scope:

  • full 21-agent graph
  • split-screen live twin diff dashboard
  • network summaries and richer state modeling
  • desktop target support
  • stronger scoring and defect clustering

V3

Scope:

  • mobile package support
  • deeper timing fidelity
  • improved automated repair loop
  • broader artifact fusion
  • stronger audit export and observability

17. MVP Build Plan

Milestone 1: Run System

  • deterministic run folder creation
  • task.md, policy.md, evidence.md, run-state.json
  • artifact hashing and registry
  • snapshot and rollback stub

Milestone 2: Observation Core

  • URL and source-folder classification
  • browser capture
  • screenshots, DOM, and interaction recording
  • action logging and replay timeline generation

Milestone 3: Modeling Core

  • screen inventory
  • navigation map
  • lightweight state graph
  • component extraction heuristics
  • first cognitive memory graph
  • initial hypothesis runner

Milestone 4: Reconstruction Core

  • initial web twin generator
  • local shadow API stubs
  • route and interaction scaffolding

Milestone 5: Verification and Repair

  • visual diff
  • replay diff
  • weighted scorecard
  • one-pass repair loop
  • first perceptual verification heuristic pass

Milestone 6: Control Center

  • split-screen dashboard
  • replay scrubber
  • diff overlays
  • approvals and run controls

18. MVP Acceptance Criteria

  • A user can drop a URL or source folder into the workspace and start a run.
  • The system generates a deterministic run folder with required artifacts.
  • The orchestrator never issues an active interaction before baseline capture completes.
  • Every active interaction is logged with before and after state references.
  • The generated twin can be launched and compared side by side with the target.
  • A fidelity report is produced with at least pixel, replay, and state metrics.
  • The run can be stopped and rolled back to a clean snapshot.
  • The final evidence pack can be zipped and reviewed offline.

19. Key Technical Risks and Mitigations

  1. Partial observability
    • Risk: many targets do not expose internal state, so the system can overclaim.
    • Mitigation: separate observed facts from inferred claims, propagate confidence, and require evidence refs for every state transition.
  2. Nondeterministic timing
    • Risk: timing varies across runs and environments, producing false mismatches.
    • Mitigation: use tolerance bands, medians, variance buckets, and multi-run replay baselines rather than exact equality.
  3. Sandbox isolation complexity
    • Risk: strong isolation can reduce visibility, while weak isolation increases risk.
    • Mitigation: define target-class-specific sandbox profiles and never enable a capture adapter without a matching policy profile.
  4. Desktop and mobile capture complexity
    • Risk: OS APIs, emulators, and window trees are inconsistent.
    • Mitigation: ship web-first in MVP, desktop in V2 behind adapter contracts, mobile in V3 with emulator-only support first.
  5. Overfitting to visuals instead of behavior
    • Risk: the twin can look right but behave wrong.
    • Mitigation: keep interaction and state fidelity weighted above visual fidelity and treat critical path failures as hard stops.
  6. Repair loops breaking previously working flows
    • Risk: automated patches regress passing behavior.
    • Mitigation: generate regression guards before risky repairs and require targeted replays after every patch.
  7. Secret leakage in logs and screenshots
    • Risk: captures can accidentally store sensitive information.
    • Mitigation: default redaction, secret scanners on artifacts, masked evidence views, and short retention on sensitive transient files.
  8. Confidence calibration
    • Risk: the system can sound precise when the evidence is weak.
    • Mitigation: maintain confidence at node, edge, hypothesis, and run levels, and down-rank claims that fail repeated verification.
  9. Agent conflict and context bloat
    • Risk: multi-agent systems can contradict themselves and become expensive.
    • Mitigation: force all coordination through the orchestrator, store compressed shared state in the graph, and require structured outputs only.
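
The mitigation for risk 2 (nondeterministic timing) can be sketched as a median comparison inside a tolerance band rather than an exact equality check. The 15% relative tolerance and 20 ms absolute floor below are illustrative defaults chosen for this sketch, not values taken from the design.

```typescript
// Median of a non-empty sample; does not modify the input array.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty sample");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Compares multi-run timing samples (in milliseconds) for one transition.
// The twin passes if its median falls within the band around the target median.
function timingWithinBand(
  targetMs: number[],
  twinMs: number[],
  relativeTolerance = 0.15,   // assumed default
  absoluteToleranceMs = 20    // assumed default
): boolean {
  const targetMedian = median(targetMs);
  const band = Math.max(absoluteToleranceMs, relativeTolerance * targetMedian);
  return Math.abs(median(twinMs) - targetMedian) <= band;
}

// Target median 110 ms, twin median 125 ms, band 20 ms.
console.log(timingWithinBand([100, 120, 110], [118, 125, 130])); // true
// Target median 110 ms, twin median 200 ms, band 20 ms.
console.log(timingWithinBand([100, 120, 110], [200, 210, 190])); // false
```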

20. Why This Is Different

  • The unit of work is a reversible run, not a prompt.
  • The system reconstructs from evidence, not from imagination.
  • Every agent has an explicit contract, scope, and stop condition.
  • Verification is deterministic and multi-dimensional, not aesthetic.
  • The dashboard exposes mismatch mechanics in real time.
  • Replay is first-class, not a debug afterthought.
  • Rollback and audit are built in from day one.
  • The system treats behavior as the primary product surface.

21. Build Order

  1. Run ledger and workspace bootstrapper
  2. Task, policy, evidence, and schema enforcement
  3. Snapshot, rollback, and approval gate
  4. Web observation adapters and passive capture
  5. Action log, event log, and replay timeline
  6. State graph, navigation map, and first cognitive memory graph
  7. Hypothesis engine and living state model
  8. First twin generator and shadow API scaffolding
  9. Verification engine with hard gates and weighted scoring
  10. Repair loop with regression guards
  11. Mission-control dashboard and replay inspector
  12. Evidence bundle exporter and archive browser