Date: 2026-01-17 Status: Analysis Complete Recommendation: NO - Keep episodes.json separate from capture.db
After comprehensive analysis of the OpenAdapt ecosystem's data architecture, I recommend keeping episodes.json separate from capture.db. While there are theoretical benefits to consolidation, the practical costs and architectural tradeoffs make separation the superior design for this specific use case.
Key Finding: The current separation reflects a separation of concerns - raw event capture vs. ML-derived semantic segmentation - that should be preserved.
Location: {recording_dir}/capture.db
Size: ~320 KB for 60s recording
Purpose: Store raw platform events during capture
-- Metadata table (1 row per recording)
CREATE TABLE capture (
id TEXT PRIMARY KEY,
started_at REAL NOT NULL,
ended_at REAL,
platform TEXT NOT NULL,
screen_width INTEGER NOT NULL,
screen_height INTEGER NOT NULL,
pixel_ratio REAL DEFAULT 1.0,
task_description TEXT,
double_click_interval_seconds REAL,
double_click_distance_pixels REAL,
video_start_time REAL,
audio_start_time REAL,
metadata JSON
);
-- Raw events table (1561 rows for 60s recording)
CREATE TABLE events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
timestamp REAL NOT NULL,
type TEXT NOT NULL, -- 'mouse.move', 'mouse.down', 'key.down', 'screen.frame'
data JSON NOT NULL, -- Event-specific data
parent_id INTEGER,
FOREIGN KEY (parent_id) REFERENCES events(id)
);
CREATE INDEX idx_events_timestamp ON events(timestamp);
CREATE INDEX idx_events_type ON events(type);Event Distribution (turn-off-nightshift recording):
screen.frame: 457 events (29%)
mouse.move: 1046 events (67%)
mouse.down: 13 events (0.8%)
mouse.up: 13 events (0.8%)
key.down: 16 events (1%)
key.up: 16 events (1%)
Total: 1561 events
Characteristics:
- Written during recording (real-time append-only writes)
- Platform-specific raw events
- No ML dependencies
- Optimized for fast writes during capture
- Never modified after recording ends
Location: {recording_dir}/episodes.json
Size: ~4 KB for 2 episodes
Purpose: Store ML-segmented episodes with semantic understanding
{
"recording_id": "turn-off-nightshift",
"recording_name": "Turn Off Night Shift Demo",
"episodes": [
{
"episode_id": "episode_001",
"name": "Navigate to System Settings",
"description": "User opens System Settings application from the dock...",
"application": "macOS System Settings",
"start_time": 0.0,
"end_time": 3.5,
"duration": 3.5,
"recording_ids": ["turn-off-nightshift"],
"frame_indices": [0, 1, 2, 3, 4],
"steps": [
"Click System Settings icon in dock",
"Wait for Settings window to open",
"Click on Displays in sidebar"
],
"boundary_confidence": 0.92,
"coherence_score": 0.88,
"screenshots": {
"thumbnail": "../openadapt-capture/.../step_0.png",
"key_frames": [...]
}
}
],
"boundaries": [
{
"timestamp": 3.5,
"confidence": 0.92,
"reason": "Transition from navigation to settings configuration"
}
],
"llm_model": "gpt-4o",
"processing_timestamp": "2026-01-17T12:00:00.000000",
"coverage": 1.0,
"avg_confidence": 0.935
}Characteristics:
- Created by ML pipeline (openadapt-ml) AFTER recording
- High-level semantic understanding
- ML model metadata (model, confidence, etc.)
- Can be regenerated with different models/parameters
- Human-readable format for inspection
transcript.json (~4 KB):
{
"text": "Okay, it's recording. I think it should be...",
"segments": [
{
"start": 0.0,
"end": 5.6,
"text": "Okay, it's recording..."
}
]
}- Whisper-generated speech-to-text
- Optional (only if audio captured)
- ML-derived (like episodes)
Media Files:
video.mp4(~1.7 MB) - Screen recordingaudio.flac(~930 KB) - Audio recordingscreenshots/*.png(22 files, ~100 KB each) - Frame snapshots
Analysis of the codebase reveals clear separation of concerns:
-
Different Lifecycles:
capture.db: Created during recording (openadapt-capture)episodes.json: Created during segmentation (openadapt-ml)
-
Different Ownership:
capture.db: Owned by openadapt-capture packageepisodes.json: Owned by openadapt-ml package
-
Different Stability:
capture.db: Immutable after recordingepisodes.json: Can be regenerated with different models
-
Different Consumers:
capture.db: Used by playback, training, evalsepisodes.json: Used by viewers, benchmarks, demos
Current Problem: Episode data lives outside the main recording database.
Benefit: All recording data in one place reduces file management complexity.
Reality Check: We ALREADY have a single source of truth - the recording directory. The boundary is files vs. database, not location.
Benefit: Atomic updates ensure data consistency.
Reality Check:
- Episodes are written ONCE after ML processing (not updated)
- No concurrent writers (ML pipeline is single-threaded)
- ACID is overkill for write-once data
Benefit: SQL queries for filtering episodes by time, confidence, etc.
Example Queries:
-- Find high-confidence episodes
SELECT * FROM episodes WHERE boundary_confidence > 0.9;
-- Episodes in time range
SELECT * FROM episodes WHERE start_time >= 2.0 AND end_time <= 5.0;
-- Join episodes with events
SELECT e.*, COUNT(ev.id) as event_count
FROM episodes e
JOIN events ev ON ev.timestamp BETWEEN e.start_time AND e.end_time
GROUP BY e.id;Reality Check:
- Current use case: Load ALL episodes and display them (no filtering)
- Viewers load entire file into memory for interactivity
- Typical recording has 2-10 episodes (not 1000s)
- JSON parsing is ~1ms for typical file (negligible)
Benefit: Link episodes to events with database constraints.
Example:
CREATE TABLE episodes (
id INTEGER PRIMARY KEY,
name TEXT,
start_event_id INTEGER,
end_event_id INTEGER,
FOREIGN KEY (start_event_id) REFERENCES events(id),
FOREIGN KEY (end_event_id) REFERENCES events(id)
);Reality Check:
- Episodes reference time ranges, not specific event IDs
- Events and episodes are at different semantic levels
- No referential integrity needed (episodes can be deleted/regenerated)
Benefit: Database indexes speed up queries.
Reality Check:
- No performance issues with current JSON approach
- Typical episode file: 4 KB, <100 lines
- Loading time: <1ms (unmeasurable)
- Database overhead would ADD latency, not reduce it
Benefit: Update episode metadata atomically.
Reality Check:
- Episodes are IMMUTABLE after generation
- Updates mean "regenerate entire file with new ML model"
- Atomic file replacement (write to temp, then rename) already provides atomicity
Benefit: SQLite supports schema migrations (Alembic).
Reality Check:
- JSON schemas are more flexible (add fields without migration)
- Breaking changes = regenerate episodes (not migrate)
- Legacy OpenAdapt DB schema (alembic_version table) shows migration complexity
Impact: HIGH
Required Changes:
- Modify openadapt-ml segmentation pipeline to write to SQLite
- Update all viewers to query DB instead of loading JSON
- Migrate existing ~10+ recordings
- Update catalog system to scan DB tables
- Change screenshot generation scripts
- Update all documentation and examples
Affected Packages:
- openadapt-capture (schema changes)
- openadapt-ml (segmentation output)
- openadapt-viewer (all viewers)
- openadapt-evals (benchmark loading)
Estimated Effort: 2-3 weeks of development + testing
Impact: HIGH
Current Code Pattern (everywhere):
# Load episodes
with open(f"{recording_dir}/episodes.json") as f:
data = json.load(f)
episodes = data["episodes"]New Code Pattern (everywhere):
# Load episodes
conn = sqlite3.connect(f"{recording_dir}/capture.db")
cursor = conn.execute("SELECT * FROM episodes WHERE recording_id = ?", (rec_id,))
episodes = [dict(row) for row in cursor.fetchall()]
conn.close()Files Requiring Changes: 16+ files across ecosystem
Impact: MEDIUM
Current ML Workflow:
# openadapt-ml segmentation
def segment_recording(recording_path):
events = load_events_from_db(recording_path)
episodes = ml_model.segment(events) # Returns Python dicts
# Write to JSON (one line)
output = {"episodes": episodes, "metadata": {...}}
json.dump(output, open("episodes.json", "w"), indent=2)Proposed DB Workflow:
# openadapt-ml segmentation
def segment_recording(recording_path):
events = load_events_from_db(recording_path)
episodes = ml_model.segment(events)
# Write to DB (complex)
conn = sqlite3.connect(f"{recording_path}/capture.db")
for episode in episodes:
conn.execute("INSERT INTO episodes VALUES (?, ?, ?, ...)", (...))
for step in episode["steps"]:
conn.execute("INSERT INTO episode_steps VALUES (?, ?, ?)", (...))
conn.commit()
conn.close()Problem:
- ML models work with Python dicts/JSON naturally
- DB writes require schema awareness and SQL construction
- Debugging harder (can't
cat episodes.jsonto inspect)
Impact: MEDIUM
Current Workflow:
# Inspect episodes
cat turn-off-nightshift/episodes.json | jq '.episodes[0]'
# Share episodes via Slack/email
cat episodes.json
# Edit for testing
vim episodes.json
# Git diff to see changes
git diff episodes.jsonProposed Workflow:
# Inspect episodes
sqlite3 turn-off-nightshift/capture.db "SELECT * FROM episodes"
# Share requires exporting
sqlite3 capture.db ".mode json" ".output episodes.json" "SELECT * FROM episodes"
# Edit requires SQL
sqlite3 capture.db "UPDATE episodes SET name = 'New Name' WHERE id = 1"
# Git diff harder (binary format)
# Can't see what changed in DBImpact on Development:
- Slower debugging
- Harder collaboration (binary diffs)
- More tooling required (sqlite3 CLI, DB browsers)
Impact: LOW-MEDIUM
Current Prototyping:
# Test new segmentation model
episodes = {
"episodes": [
{"name": "Test", "start_time": 0, ...}
]
}
json.dump(episodes, open("test.json", "w"))
# Viewer picks it up immediately
python generate_viewer.py --episodes test.jsonProposed Prototyping:
# Test new segmentation model
conn = sqlite3.connect("test.db")
conn.execute("CREATE TABLE episodes (...)")
conn.execute("INSERT INTO episodes VALUES (...)")
conn.commit()
# Viewer needs to be aware of DB location
python generate_viewer.py --db test.dbImpact: HIGH (Architectural)
Current Architecture (Clean Separation):
openadapt-capture/
└── Owns: capture.db, video.mp4, audio.flac, screenshots/
openadapt-ml/
└── Owns: episodes.json, transcript.json (in recording dir)
└── Reads: capture.db (read-only)
Proposed Architecture (Tight Coupling):
openadapt-capture/
└── Owns: capture.db (now includes ML outputs!)
openadapt-ml/
└── Writes to: capture.db (modifies capture package's data!)
└── Reads: capture.db
Problems:
- openadapt-ml now writes to openadapt-capture's database
- Cross-package ownership (who owns the schema?)
- Harder to version packages independently
- Violates single responsibility principle
Impact: MEDIUM
Scenario: New ML model adds semantic_tags field to episodes.
Current (JSON):
# Old format still works
{"episodes": [{"name": "...", "start_time": 0}]}
# New format adds field
{"episodes": [{"name": "...", "start_time": 0, "semantic_tags": ["ui", "settings"]}]}
# Viewers handle both (default to empty list if missing)
tags = episode.get("semantic_tags", [])Proposed (DB):
-- Need migration
ALTER TABLE episodes ADD COLUMN semantic_tags JSON;
-- Need to version schema
CREATE TABLE schema_version (version INTEGER);
-- Need migration script
python migrate_db.py --from-version 1 --to-version 2
-- Old viewers break if they don't expect new schemaReality: JSON's schemaless nature is a FEATURE for ML outputs.
Impact: MEDIUM
Current Test Setup:
# tests/test_viewer.py
def test_segmentation_viewer():
episodes = {"episodes": [{"name": "Test", ...}]}
with tempfile.NamedTemporaryFile("w", suffix=".json") as f:
json.dump(episodes, f)
viewer = generate_viewer(f.name)
assert "Test" in viewerProposed Test Setup:
def test_segmentation_viewer():
with tempfile.NamedTemporaryFile(suffix=".db") as f:
conn = sqlite3.connect(f.name)
conn.executescript(EPISODES_SCHEMA) # Need to maintain schema DDL
conn.execute("INSERT INTO episodes VALUES (?...)", (...))
conn.commit()
conn.close()
viewer = generate_viewer(f.name)
assert "Test" in viewerMore Fragile: DB tests require schema creation, harder to set up.
For completeness, here's how it WOULD look:
-- Main episodes table
CREATE TABLE episodes (
id INTEGER PRIMARY KEY AUTOINCREMENT,
episode_id TEXT UNIQUE NOT NULL, -- "episode_001"
recording_id TEXT NOT NULL, -- "turn-off-nightshift"
name TEXT,
description TEXT,
application TEXT,
start_time REAL NOT NULL,
end_time REAL NOT NULL,
duration REAL GENERATED ALWAYS AS (end_time - start_time) VIRTUAL,
start_time_formatted TEXT,
end_time_formatted TEXT,
boundary_confidence REAL,
coherence_score REAL,
thumbnail_path TEXT,
created_at REAL DEFAULT (unixepoch()),
FOREIGN KEY (recording_id) REFERENCES capture(id)
);
-- Episode steps (one-to-many)
CREATE TABLE episode_steps (
id INTEGER PRIMARY KEY AUTOINCREMENT,
episode_id INTEGER NOT NULL,
step_index INTEGER NOT NULL,
description TEXT NOT NULL,
frame_index INTEGER,
screenshot_path TEXT,
FOREIGN KEY (episode_id) REFERENCES episodes(id) ON DELETE CASCADE,
UNIQUE (episode_id, step_index)
);
-- Episode boundaries (one-to-many with recording)
CREATE TABLE episode_boundaries (
id INTEGER PRIMARY KEY AUTOINCREMENT,
recording_id TEXT NOT NULL,
timestamp REAL NOT NULL,
confidence REAL NOT NULL,
reason TEXT,
FOREIGN KEY (recording_id) REFERENCES capture(id) ON DELETE CASCADE
);
-- Segmentation metadata (one-to-one with recording)
CREATE TABLE segmentation_metadata (
recording_id TEXT PRIMARY KEY,
llm_model TEXT NOT NULL,
processing_timestamp TEXT NOT NULL,
coverage REAL,
avg_confidence REAL,
segmentation_version TEXT,
FOREIGN KEY (recording_id) REFERENCES capture(id) ON DELETE CASCADE
);
-- Indexes for common queries
CREATE INDEX idx_episodes_recording ON episodes(recording_id);
CREATE INDEX idx_episodes_time ON episodes(start_time, end_time);
CREATE INDEX idx_episodes_confidence ON episodes(boundary_confidence);
CREATE INDEX idx_steps_episode ON episode_steps(episode_id);
CREATE INDEX idx_boundaries_recording ON episode_boundaries(recording_id);Observations:
- Much more complex than JSON (4 tables vs 1 file)
- Need to maintain schema DDL
- Requires migrations for changes
- Higher cognitive load for developers
Phase 1: Dual Write (2 weeks)
- ML pipeline writes BOTH JSON and DB
- Viewers read from JSON (existing behavior)
- Validate DB writes are correct
Phase 2: Dual Read (1 week)
- Viewers read from DB if present, else JSON
- Test all viewers with DB source
- Fix any issues
Phase 3: Deprecate JSON (1 week)
- ML pipeline writes only to DB
- Viewers read only from DB
- Remove JSON support
Phase 4: Migration (ongoing)
- Script to migrate existing recordings
- Run on all ~25 recordings in ecosystem
def load_episodes(recording_path):
db_path = Path(recording_path) / "capture.db"
json_path = Path(recording_path) / "episodes.json"
# Try DB first
if has_episodes_table(db_path):
return load_from_db(db_path)
# Fallback to JSON and migrate
if json_path.exists():
episodes = load_from_json(json_path)
migrate_to_db(episodes, db_path)
return episodes
return []Use Alembic (like legacy OpenAdapt):
# alembic/versions/001_add_episodes.py
def upgrade():
op.create_table('episodes', ...)
op.create_table('episode_steps', ...)
def downgrade():
op.drop_table('episode_steps')
op.drop_table('episodes')Current: transcript.json (~4 KB, Whisper output)
Recommendation: KEEP AS JSON
Reasoning:
- Same issues as episodes.json (ML-derived, write-once)
- Segments are simple list (no complex queries needed)
- Easy to regenerate with different Whisper models
- Human-readable format useful for debugging
Schema if we moved:
CREATE TABLE transcript (
recording_id TEXT PRIMARY KEY,
full_text TEXT,
FOREIGN KEY (recording_id) REFERENCES capture(id)
);
CREATE TABLE transcript_segments (
id INTEGER PRIMARY KEY,
recording_id TEXT NOT NULL,
start_time REAL NOT NULL,
end_time REAL NOT NULL,
text TEXT NOT NULL,
FOREIGN KEY (recording_id) REFERENCES capture(id)
);Current: Various JSON files in openadapt-evals
Recommendation: KEEP AS JSON
Reasoning:
- Not part of capture data (separate lifecycle)
- Multiple formats (per model, per run)
- Used for analysis/visualization (load everything)
- Git-trackable for comparing runs
Current: Separate SQLite DB for indexing all recordings
Recommendation: KEEP SEPARATE ✓
Reasoning:
- Cross-recording metadata (perfect for DB)
- Many recordings (hundreds eventually)
- Search/filter queries (SQL shines here)
- Central index (one DB for all recordings)
This is the RIGHT use of a database: aggregation across many entities.
Primary Reasons:
-
Separation of Concerns ✓✓✓
- Raw events (capture.db) vs semantic segments (episodes.json)
- Different packages own different files
- Clean architectural boundaries
-
No Performance Benefit ✓✓
- Current JSON loading is <1ms (negligible)
- Typical files: 2-10 episodes (not thousands)
- Database overhead would ADD latency
-
High Migration Cost ✗✗✗
- 16+ files to modify across 4 packages
- Breaking changes for all consumers
- 2-3 weeks of development effort
-
ML Pipeline Compatibility ✓✓
- JSON is natural output for Python ML models
- Easy debugging (cat, jq, vim)
- Git-friendly (text diffs)
-
Flexibility for Experimentation ✓
- Easy to regenerate with different models
- No schema migrations needed
- Fast prototyping
The database approach would be justified if:
- Scale: Recordings had 100+ episodes (not 2-10)
- Queries: Needed complex filtering across episodes (not "load all")
- Updates: Episodes were updated frequently (not write-once)
- Joins: Needed to join episodes with events regularly (not done)
- Concurrency: Multiple writers updating episodes (not happening)
Reality: None of these conditions apply to OpenAdapt.
Instead of moving to DB, improve the JSON format:
{
"schema_version": "1.0.0",
"recording_id": "turn-off-nightshift",
"recording_name": "Turn Off Night Shift Demo",
"metadata": {
"llm_model": "gpt-4o",
"processing_timestamp": "2026-01-17T12:00:00Z",
"coverage": 1.0,
"avg_confidence": 0.935,
"openadapt_ml_version": "0.2.0"
},
"episodes": [...],
"boundaries": [...],
"index": {
"by_time": {
"0-5": ["episode_001"],
"5-10": ["episode_002"]
},
"by_confidence": {
"high": ["episode_002"],
"medium": ["episode_001"]
}
}
}Benefits:
- Maintains JSON simplicity
- Adds metadata for versioning
- Optional index for faster lookups
- Backward compatible
✅ Document Decision: This analysis document ✅ Update CLAUDE.md: Reference this decision ✅ No Action Required: Keep current architecture
🔲 Add schema_version to episodes.json: Track format versions 🔲 Validation: Add JSON schema validation in ML pipeline 🔲 Documentation: Document episodes.json format in openadapt-ml
🔲 Explore Parquet: For large-scale episode analysis (100+ recordings) 🔲 GraphQL API: If remote querying becomes needed 🔲 Episode Search: Full-text search across episode descriptions (catalog.db)
Databases:
/Users/abrichr/oa/src/openadapt-capture/turn-off-nightshift/capture.db(320 KB, 1561 events)/Users/abrichr/oa/src/OpenAdapt/openadapt.db(legacy schema with alembic_version)~/.openadapt/catalog.db(central index, correct use of SQLite)
JSON Files:
/Users/abrichr/oa/src/openadapt-capture/turn-off-nightshift/episodes.json(4 KB, 2 episodes)/Users/abrichr/oa/src/openadapt-capture/turn-off-nightshift/transcript.json(4 KB, 4 segments)/Users/abrichr/oa/src/openadapt-ml/segmentation_output/*.json(various)
Code Files:
src/openadapt_viewer/catalog.py- Catalog schema (correct DB use)src/openadapt_viewer/scanner.py- Recording discoverysrc/openadapt_viewer/viewers/benchmark/real_data_loader.py- Episode loading- 16+ files across ecosystem using episodes.json
| File | Size | Rows/Items | Purpose |
|---|---|---|---|
| capture.db | 320 KB | 1561 events | Raw event storage |
| episodes.json | 4 KB | 2 episodes | ML segmentation |
| transcript.json | 4 KB | 4 segments | Speech transcription |
| video.mp4 | 1.7 MB | 60s | Screen recording |
| screenshots/*.png | 2.2 MB | 22 frames | Visual frames |
Observation: Episodes are 1% the size of the DB, not a storage concern.
| Operation | Current (JSON) | Proposed (DB) |
|---|---|---|
| Load episodes | <1ms | ~2-5ms (open DB, query, close) |
| Inspect file | cat episodes.json |
sqlite3 .dump |
| Edit for testing | vim (instant) | SQL (slower) |
| Share via Slack | Copy/paste JSON | Export required |
| Git diff | Text diff | Binary (no diff) |
Conclusion: JSON is faster for all current use cases.
The current architecture is CORRECT for this use case.
The separation of capture.db (raw events) and episodes.json (ML-derived semantics) reflects sound architectural principles:
- Separation of concerns: Capture vs. analysis
- Package boundaries: openadapt-capture vs. openadapt-ml
- Data lifecycle: Immutable capture vs. regenerable analysis
- Technology fit: Database for events, JSON for ML outputs
Recommendation: Keep the current architecture and focus engineering effort on higher-value features.
Confidence Level: HIGH (95%)
Dissenting Opinion: If OpenAdapt grows to handle 1000s of episodes per recording or needs complex temporal queries across episodes and events, revisit this decision. But that's not the current reality or near-term roadmap.
Analysis Completed: 2026-01-17 Analyst: Claude (Sonnet 4.5) Review Status: Ready for team review