Commit fb0a458

bradleyshep, cloutiertyler, and clockwork-labs-bot authored
LLM Benchmark: Sequential Upgrades Test (#4817)
# Description of Changes

AI app generation benchmark comparing SpacetimeDB vs PostgreSQL (Express + Socket.io + Drizzle ORM). Same AI model (Claude Sonnet 4.6), same prompts, same chat app, two backends. Upgraded through 12 feature levels, manually graded at each level, bugs fixed, all costs measured via OpenTelemetry.

Results viewable at: https://spacetimedb.com/llms-benchmark-sequential-upgrade

## Benchmark harness (`tools/llm-sequential-upgrade/`)

- `run.sh`: orchestrates headless Claude Code sessions for code generation, sequential upgrades, and bug fixes. Tracks all API costs via OTel. Supports `--upgrade`, `--fix`, `--composed-prompt`, `--resume-session` modes.
- `grade.sh` / `grade-agents.sh`: grading harnesses for manual testing of generated apps.
- `docker-compose.otel.yaml`: OTel collector + PostgreSQL services.
- `generate-report.mjs` / `parse-telemetry.mjs`: aggregate per-session telemetry into cost reports.
- Backend guidelines in `backends/`: SpacetimeDB SDK reference, config templates, server setup docs, PostgreSQL setup with Drizzle/Socket.io guidance.

**After #4740 merges, we will likely want to update this so that it reads backend and SDK guidance from SKILLS**

## Two complete benchmark runs

**Run 1 (20260403):** Original methodology.

**Run 2 (20260406):** Refined methodology with domain bias removed from SpacetimeDB SDK docs and PostgreSQL instructions made feature-spec-neutral.

**Note: no meaningful changes in results were observed with these changes. Domain familiarity biases were very small and almost certainly not the cause of STDB's major gains over PG stack.**

Each run contains full L1-L12 app source for both backends, level snapshots preserving state before each upgrade, and per-session OTel cost summaries.

## 12 feature levels

| Level | Feature |
|---|---|
| L1 | Basic Chat + Typing + Read Receipts + Unread Counts |
| L2 | Scheduled Messages |
| L3 | Ephemeral Messages |
| L4 | Message Reactions |
| L5 | Message Editing with History |
| L6 | Real-Time Permissions (kick, ban, promote) |
| L7 | Rich User Presence |
| L8 | Message Threading |
| L9 | Private Rooms + Direct Messages |
| L10 | Room Activity Indicators |
| L11 | Draft Sync |
| L12 | Anonymous to Registered Migration |

## Results

| | Run 1 (20260403) | Run 2 (20260406) |
|---|---|---|
| **SpacetimeDB total cost** | $13.33 | $12.62 |
| **PostgreSQL total cost** | $17.80 | $19.68 |
| **SpacetimeDB bugs** | 5 | 2 |
| **PostgreSQL bugs** | 19 | 8 |
| **SpacetimeDB fix sessions** | 4 | 1 |
| **PostgreSQL fix sessions** | 17 | 10 |

Both runs agree: SpacetimeDB apps are cheaper to build, have fewer bugs, and require fewer fix iterations. The refined methodology (Run 2) widened the cost gap and **confirmed the advantage is structural, not an artifact of domain-biased SDK docs.**

## Performance benchmark (`perf-benchmark/`)

Stress throughput tool that fires concurrent writers at peak saturation against the AI-generated `send_message` handlers.

| Tier | SpacetimeDB (avg) | PostgreSQL (avg) | Ratio |
|---|---|---|---|
| AI-generated (as-shipped) | 5,267 msgs/sec | 694 msgs/sec | 7.6x |
| PG rate limit removed | 5,267 msgs/sec | 1,070 msgs/sec | 4.9x |
| Optimized (same features kept) | 25,278 msgs/sec | 1,139 msgs/sec | 22x |

The gap widens with optimization because SpacetimeDB's bottleneck is fixable code patterns in the reducer while PostgreSQL's bottleneck is architectural (sequential network round-trips to an external database). Optimized reference code with all features preserved is in `perf-benchmark/results/optimized-reference/`.

## Data handling

Per-session cost summaries (`cost-summary.json`, `COST_REPORT.md`, `metadata.json`) are committed. Raw OTel telemetry (`raw-telemetry.jsonl`) containing PII is excluded via `.gitignore` and stored privately.

# API and ABI breaking changes

None. All changes are in `tools/llm-sequential-upgrade/`. No production code, library, or SDK changes.

# Expected complexity level and risk

**1 - Trivial.** Self-contained benchmarking tooling and data. No interaction with production code.

# Testing

- [x] L1-L12 upgrades completed on all 4 apps (2 backends x 2 runs) with OTel cost capture
- [x] All levels manually graded after each upgrade; bugs filed and fixed via the harness
- [x] Methodology refinement between runs validated (domain bias removal, feature-neutral instructions)
- [x] Stress benchmarks run across both runs x 3 tiers (as-shipped, rate-limit-removed, optimized)
- [x] Optimized benchmarks verified to preserve all original features
- [x] Sensitive data (PII in raw telemetry) removed from repo and gitignored
- [ ] Reviewer: spot-check that METRICS_DATA.json / METRICS_REPORT.json numbers match the telemetry cost-summary.json files

---------

Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>
Co-authored-by: clockwork-labs-bot <clockwork-labs-bot@users.noreply.github.com>
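
The throughput ratios quoted in the performance benchmark section of the description can be reproduced from the reported averages. The sketch below is only an illustration: `Tier`, `tiers`, and `ratio` are names introduced here, not part of the benchmark harness, and the figures are copied from the description.

```typescript
// Throughput averages copied from the commit description (msgs/sec).
// `Tier`, `tiers`, and `ratio` are hypothetical names used only in this sketch.
interface Tier {
  name: string;
  spacetimedb: number;
  postgres: number;
}

const tiers: Tier[] = [
  { name: "AI-generated (as-shipped)", spacetimedb: 5267, postgres: 694 },
  { name: "PG rate limit removed", spacetimedb: 5267, postgres: 1070 },
  { name: "Optimized (same features kept)", spacetimedb: 25278, postgres: 1139 },
];

// SpacetimeDB-to-PostgreSQL throughput ratio, rounded to one decimal place.
function ratio(t: Tier): number {
  return Math.round((t.spacetimedb / t.postgres) * 10) / 10;
}

for (const t of tiers) {
  console.log(`${t.name}: ${ratio(t)}x`);
}
```

At one decimal place the first two tiers come out as 7.6x and 4.9x, and the optimized tier as 22.2x, which the description reports rounded to 22x.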
1 parent 0abf20b commit fb0a458

126 files changed

Lines changed: 13739 additions & 194 deletions

tools/llm-oneshot/.cursor/rules/patterns-typescript.mdc

Lines changed: 2 additions & 2 deletions
@@ -60,7 +60,7 @@ spacetime publish chat-app-20260106-183045 --module-path backend/spacetimedb
 "type": "module",
 "version": "1.0.0",
 "dependencies": {
-"spacetimedb": "^1.11.0"
+"spacetimedb": "^2.0.0"
 }
 }
 ```
@@ -109,7 +109,7 @@ src/index.ts → Import schema, define all reducers and lifecycle hooks
 "dependencies": {
 "react": "^18.3.1",
 "react-dom": "^18.3.1",
-"spacetimedb": "^1.11.0"
+"spacetimedb": "^2.0.0"
 },
 "devDependencies": {
 "@types/react": "^18.3.18",

tools/llm-oneshot/apps/chat-app/prompts/composed/01_basic.md

Lines changed: 60 additions & 5 deletions
@@ -2,14 +2,45 @@
 
 Create a **real-time chat app**.
 
-**See `language/*.md` for language-specific setup, architecture, and constraints.**
 
-## UI Requirements
+## UI & Style Guide
 
-Use SpacetimeDB brand styling (dark theme).
+### Layout
+- **Sidebar** (left, ~220px fixed): app title/branding, user info with status, room list, online users
+- **Main area** (right, flex): room header bar, scrollable message list, input bar pinned to bottom
+- **Panels** (right slide-in or overlay): threads, pinned messages, profiles, settings
+
+### Visual Design
+- Dark theme using the brand colors from the language section below
+- Background: darkest shade for main bg, slightly lighter for sidebar and cards
+- Text: light on dark, muted color for timestamps and secondary info
+- Borders: subtle 1px, low contrast against background
+- Consistent spacing scale (8/12/16/24px)
+- Font: system font stack, clear hierarchy (bold headers, regular body, small muted metadata)
+- Rounded corners on inputs, buttons, cards, and message containers
+
+### Components
+- **Messages**: sender name (colored) + timestamp (muted) + text. Group consecutive messages from same sender. Action buttons appear on hover only (which buttons depend on the features below).
+- **Inputs**: full-width, rounded, subtle border, placeholder text, focus ring using primary color
+- **Buttons**: filled with primary color for main actions, outlined/ghost for secondary. Clear hover and active states.
+- **Badges**: small pill-shaped with count, contrasting color (e.g., unread count on rooms)
+- **Modals/panels**: slide-in from right with subtle backdrop, or dropdown overlays
+- **Status indicators**: small colored dots (green=online, yellow=away, red=DND, grey=offline)
+- **Room list**: room names with optional icon prefix (#), active room highlighted, unread badge
+
+### Interaction & UX
+- Show loading/connecting state while backend connects (spinner or skeleton, not blank screen)
+- Empty states: helpful text when no rooms, no messages, no results ("Create a room to get started")
+- Error feedback: inline error messages or toast notifications, never silent failures
+- Smooth transitions: fade/slide for panels, modals, and state changes
+- Hover reveals: message action buttons, tooltips on reactions, user profile cards
+- Keyboard support: Enter to send messages, Escape to close modals/panels
+- Auto-scroll to newest message, with scroll-to-bottom button when scrolled up
 
 ## Features
 
+**Important:** Each feature below includes a "UI contract" section specifying required element attributes for automated testing. You MUST follow these — they define the user-facing interface. Your architecture, state management, and backend design are entirely up to you.
+
 ### Basic Chat Features
 
 - Users can set a display name
@@ -18,20 +49,44 @@ Use SpacetimeDB brand styling (dark theme).
 - Show who's online
 - Include reasonable validation (e.g., don't let users spam, enforce sensible limits)
 
+**UI contract:**
+- Name input: `placeholder` contains "name" (case-insensitive)
+- Name submit: `button` with text "Join", "Register", "Set Name", or `type="submit"`
+- Room creation: `button` with text containing "Create" or "New" or "+"
+- Room name input: `placeholder` contains "room" or "name" (case-insensitive)
+- Message input: `placeholder` contains "message" (case-insensitive)
+- Send message: pressing Enter in the message input sends the message
+- Room list: room names visible as clickable text in a sidebar or list
+- Join room: clicking room name joins/enters it, or a `button` with text "Join"
+- Leave room: `button` with text "Leave"
+- Online users: user names displayed as text in a visible user list or member panel
+
 ### Typing Indicators
 
-- Show when other users are currently typing in a room
+- Show when other users are currently typing in the SAME room (typing must be scoped to room — do not broadcast typing to users in different rooms)
 - Typing indicator should automatically expire after a few seconds of inactivity
 - Display "User is typing..." or "Multiple users are typing..." in the UI
 
+**UI contract:**
+- Typing text: visible text containing "typing" (case-insensitive) when another user types
+- Auto-expiry: typing indicator text disappears within 6 seconds of inactivity
+
 ### Read Receipts
 
 - Track which users have seen which messages
-- Display "Seen by X, Y, Z" under messages (or a seen indicator)
+- Display "Seen by X, Y, Z" under messages — only show OTHER users who have seen it, not the sender
 - Update read status in real-time as users view messages
 
+**UI contract:**
+- Receipt text: text containing "seen" or "read" (case-insensitive) appears near messages after another user views them
+- Reader names: the receipt text includes the viewing user’s display name
+
 ### Unread Message Counts
 
 - Show unread message count badges on the room list
 - Track last-read position per user per room
 - Update counts in real-time as new messages arrive or are read
+
+**UI contract:**
+- Badge: a visible numeric badge (e.g., "3") appears next to room names in the sidebar when there are unread messages
+- Badge clears when the room is opened/entered
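
The "UI contract" rules added in this prompt are stated as attribute and text matches, so they translate naturally into small predicates. The following is a minimal sketch of how a few of the Basic Chat rules might be checked; it is not the grading harness's code. `ElementInfo` and every function name are hypothetical, and several choices are assumptions: name-submit text is matched exactly, room-creation text is matched case-insensitively, and the message input is allowed to be either an `input` or a `textarea`.

```typescript
// Framework-free description of a rendered element.
// `ElementInfo` and all checker names are hypothetical, not part of the harness.
interface ElementInfo {
  tag: string;
  text?: string;
  placeholder?: string;
  type?: string;
}

// Case-insensitive substring test shared by several contract rules.
function containsAny(value: string | undefined, needles: string[]): boolean {
  const haystack = (value ?? "").toLowerCase();
  return needles.some((n) => haystack.indexOf(n.toLowerCase()) !== -1);
}

// Rule: name input has a placeholder containing "name" (case-insensitive).
function isNameInput(el: ElementInfo): boolean {
  return el.tag === "input" && containsAny(el.placeholder, ["name"]);
}

// Rule: name submit is a button with text "Join", "Register", "Set Name", or type="submit".
// Assumption: the listed texts are matched exactly after trimming.
function isNameSubmit(el: ElementInfo): boolean {
  if (el.tag !== "button") return false;
  const text = (el.text ?? "").trim();
  return ["Join", "Register", "Set Name"].indexOf(text) !== -1 || el.type === "submit";
}

// Rule: room creation is a button whose text contains "Create", "New", or "+".
function isRoomCreateButton(el: ElementInfo): boolean {
  return el.tag === "button" && containsAny(el.text, ["Create", "New", "+"]);
}

// Rule: message input has a placeholder containing "message" (case-insensitive).
// Assumption: either an <input> or a <textarea> is acceptable.
function isMessageInput(el: ElementInfo): boolean {
  const tagOk = el.tag === "input" || el.tag === "textarea";
  return tagOk && containsAny(el.placeholder, ["message"]);
}

// Example: classify a handful of elements as a test runner might.
const rendered: ElementInfo[] = [
  { tag: "input", placeholder: "Enter your name" },
  { tag: "button", text: "Join" },
  { tag: "button", text: "+ New Room" },
  { tag: "textarea", placeholder: "Type a message..." },
];
console.log(rendered.map((el) => [
  isNameInput(el), isNameSubmit(el), isRoomCreateButton(el), isMessageInput(el),
]));
```

Keeping the predicates independent of any DOM library means the same rules could be applied to whatever element snapshot a browser driver returns.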

tools/llm-oneshot/apps/chat-app/prompts/composed/02_scheduled.md

Lines changed: 66 additions & 5 deletions
@@ -2,14 +2,45 @@
 
 Create a **real-time chat app**.
 
-**See `language/*.md` for language-specific setup, architecture, and constraints.**
 
-## UI Requirements
+## UI & Style Guide
 
-Use SpacetimeDB brand styling (dark theme).
+### Layout
+- **Sidebar** (left, ~220px fixed): app title/branding, user info with status, room list, online users
+- **Main area** (right, flex): room header bar, scrollable message list, input bar pinned to bottom
+- **Panels** (right slide-in or overlay): threads, pinned messages, profiles, settings
+
+### Visual Design
+- Dark theme using the brand colors from the language section below
+- Background: darkest shade for main bg, slightly lighter for sidebar and cards
+- Text: light on dark, muted color for timestamps and secondary info
+- Borders: subtle 1px, low contrast against background
+- Consistent spacing scale (8/12/16/24px)
+- Font: system font stack, clear hierarchy (bold headers, regular body, small muted metadata)
+- Rounded corners on inputs, buttons, cards, and message containers
+
+### Components
+- **Messages**: sender name (colored) + timestamp (muted) + text. Group consecutive messages from same sender. Action buttons appear on hover only (which buttons depend on the features below).
+- **Inputs**: full-width, rounded, subtle border, placeholder text, focus ring using primary color
+- **Buttons**: filled with primary color for main actions, outlined/ghost for secondary. Clear hover and active states.
+- **Badges**: small pill-shaped with count, contrasting color (e.g., unread count on rooms)
+- **Modals/panels**: slide-in from right with subtle backdrop, or dropdown overlays
+- **Status indicators**: small colored dots (green=online, yellow=away, red=DND, grey=offline)
+- **Room list**: room names with optional icon prefix (#), active room highlighted, unread badge
+
+### Interaction & UX
+- Show loading/connecting state while backend connects (spinner or skeleton, not blank screen)
+- Empty states: helpful text when no rooms, no messages, no results ("Create a room to get started")
+- Error feedback: inline error messages or toast notifications, never silent failures
+- Smooth transitions: fade/slide for panels, modals, and state changes
+- Hover reveals: message action buttons, tooltips on reactions, user profile cards
+- Keyboard support: Enter to send messages, Escape to close modals/panels
+- Auto-scroll to newest message, with scroll-to-bottom button when scrolled up
 
 ## Features
 
+**Important:** Each feature below includes a "UI contract" section specifying required element attributes for automated testing. You MUST follow these — they define the user-facing interface. Your architecture, state management, and backend design are entirely up to you.
+
 ### Basic Chat Features
 
 - Users can set a display name
@@ -18,26 +49,56 @@ Use SpacetimeDB brand styling (dark theme).
 - Show who's online
 - Include reasonable validation (e.g., don't let users spam, enforce sensible limits)
 
+**UI contract:**
+- Name input: `placeholder` contains "name" (case-insensitive)
+- Name submit: `button` with text "Join", "Register", "Set Name", or `type="submit"`
+- Room creation: `button` with text containing "Create" or "New" or "+"
+- Room name input: `placeholder` contains "room" or "name" (case-insensitive)
+- Message input: `placeholder` contains "message" (case-insensitive)
+- Send message: pressing Enter in the message input sends the message
+- Room list: room names visible as clickable text in a sidebar or list
+- Join room: clicking room name joins/enters it, or a `button` with text "Join"
+- Leave room: `button` with text "Leave"
+- Online users: user names displayed as text in a visible user list or member panel
+
 ### Typing Indicators
 
-- Show when other users are currently typing in a room
+- Show when other users are currently typing in the SAME room (typing must be scoped to room — do not broadcast typing to users in different rooms)
 - Typing indicator should automatically expire after a few seconds of inactivity
 - Display "User is typing..." or "Multiple users are typing..." in the UI
 
+**UI contract:**
+- Typing text: visible text containing "typing" (case-insensitive) when another user types
+- Auto-expiry: typing indicator text disappears within 6 seconds of inactivity
+
 ### Read Receipts
 
 - Track which users have seen which messages
-- Display "Seen by X, Y, Z" under messages (or a seen indicator)
+- Display "Seen by X, Y, Z" under messages — only show OTHER users who have seen it, not the sender
 - Update read status in real-time as users view messages
 
+**UI contract:**
+- Receipt text: text containing "seen" or "read" (case-insensitive) appears near messages after another user views them
+- Reader names: the receipt text includes the viewing user’s display name
+
 ### Unread Message Counts
 
 - Show unread message count badges on the room list
 - Track last-read position per user per room
 - Update counts in real-time as new messages arrive or are read
 
+**UI contract:**
+- Badge: a visible numeric badge (e.g., "3") appears next to room names in the sidebar when there are unread messages
+- Badge clears when the room is opened/entered
+
 ### Scheduled Messages
 
 - Users can compose a message and schedule it to send at a future time
 - Show pending scheduled messages to the author (with option to cancel)
 - Message appears in the room at the scheduled time
+
+**UI contract:**
+- Schedule button: `button` with text "Schedule" or `aria-label` containing "schedule", or an icon button with `title` containing "schedule"
+- Time picker: an `input[type="datetime-local"]` or `input[type="time"]` or `input[type="number"]` for setting the send time
+- Pending list: text "Scheduled" or "Pending" visible when viewing scheduled messages
+- Cancel: `button` with text "Cancel" next to pending scheduled messages
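
The Scheduled Messages requirements (future send time, a pending list visible to the author, cancellation, delivery at the scheduled time) can be modelled without reference to either backend. The sketch below is one possible shape, not code from the generated apps. The `ScheduledMessage` fields and all function names are hypothetical, and the rule that only the author can cancel is an assumption drawn from "show pending scheduled messages to the author".

```typescript
// Hypothetical shape of a queued message; the field names are assumptions.
interface ScheduledMessage {
  id: number;
  authorId: string;
  roomId: string;
  text: string;
  sendAtMs: number; // epoch milliseconds
}

// A send time is only valid if it is a finite number strictly in the future.
function canSchedule(sendAtMs: number, nowMs: number): boolean {
  return isFinite(sendAtMs) && sendAtMs > nowMs;
}

// Split the queue into messages that are due now and those still pending.
function partitionDue(
  queue: ScheduledMessage[],
  nowMs: number,
): { due: ScheduledMessage[]; pending: ScheduledMessage[] } {
  const due: ScheduledMessage[] = [];
  const pending: ScheduledMessage[] = [];
  for (const m of queue) {
    if (m.sendAtMs <= nowMs) {
      due.push(m);
    } else {
      pending.push(m);
    }
  }
  return { due, pending };
}

// Assumption: only the author may cancel a queued message.
// Returns a new queue; the input array is left untouched.
function cancelScheduled(
  queue: ScheduledMessage[],
  id: number,
  requesterId: string,
): ScheduledMessage[] {
  return queue.filter((m) => !(m.id === id && m.authorId === requesterId));
}

// Pending list as shown to one author, soonest first.
function pendingFor(queue: ScheduledMessage[], authorId: string): ScheduledMessage[] {
  return queue
    .filter((m) => m.authorId === authorId)
    .sort((a, b) => a.sendAtMs - b.sendAtMs);
}
```

How `partitionDue` gets triggered (a server timer, a scheduled job, or a database-level scheduler) is left to the backend and is deliberately outside this sketch.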

tools/llm-oneshot/apps/chat-app/prompts/composed/03_realtime.md

Lines changed: 73 additions & 6 deletions
@@ -2,14 +2,45 @@
 
 Create a **real-time chat app**.
 
-**See `language/*.md` for language-specific setup, architecture, and constraints.**
 
-## UI Requirements
-
-Use SpacetimeDB brand styling (dark theme).
+## UI & Style Guide
+
+### Layout
+- **Sidebar** (left, ~220px fixed): app title/branding, user info with status, room list, online users
+- **Main area** (right, flex): room header bar, scrollable message list, input bar pinned to bottom
+- **Panels** (right slide-in or overlay): threads, pinned messages, profiles, settings
+
+### Visual Design
+- Dark theme using the brand colors from the language section below
+- Background: darkest shade for main bg, slightly lighter for sidebar and cards
+- Text: light on dark, muted color for timestamps and secondary info
+- Borders: subtle 1px, low contrast against background
+- Consistent spacing scale (8/12/16/24px)
+- Font: system font stack, clear hierarchy (bold headers, regular body, small muted metadata)
+- Rounded corners on inputs, buttons, cards, and message containers
+
+### Components
+- **Messages**: sender name (colored) + timestamp (muted) + text. Group consecutive messages from same sender. Action buttons appear on hover only (which buttons depend on the features below).
+- **Inputs**: full-width, rounded, subtle border, placeholder text, focus ring using primary color
+- **Buttons**: filled with primary color for main actions, outlined/ghost for secondary. Clear hover and active states.
+- **Badges**: small pill-shaped with count, contrasting color (e.g., unread count on rooms)
+- **Modals/panels**: slide-in from right with subtle backdrop, or dropdown overlays
+- **Status indicators**: small colored dots (green=online, yellow=away, red=DND, grey=offline)
+- **Room list**: room names with optional icon prefix (#), active room highlighted, unread badge
+
+### Interaction & UX
+- Show loading/connecting state while backend connects (spinner or skeleton, not blank screen)
+- Empty states: helpful text when no rooms, no messages, no results ("Create a room to get started")
+- Error feedback: inline error messages or toast notifications, never silent failures
+- Smooth transitions: fade/slide for panels, modals, and state changes
+- Hover reveals: message action buttons, tooltips on reactions, user profile cards
+- Keyboard support: Enter to send messages, Escape to close modals/panels
+- Auto-scroll to newest message, with scroll-to-bottom button when scrolled up
 
 ## Features
 
+**Important:** Each feature below includes a "UI contract" section specifying required element attributes for automated testing. You MUST follow these — they define the user-facing interface. Your architecture, state management, and backend design are entirely up to you.
+
 ### Basic Chat Features
 
 - Users can set a display name
@@ -18,32 +49,68 @@ Use SpacetimeDB brand styling (dark theme).
 - Show who's online
 - Include reasonable validation (e.g., don't let users spam, enforce sensible limits)
 
+**UI contract:**
+- Name input: `placeholder` contains "name" (case-insensitive)
+- Name submit: `button` with text "Join", "Register", "Set Name", or `type="submit"`
+- Room creation: `button` with text containing "Create" or "New" or "+"
+- Room name input: `placeholder` contains "room" or "name" (case-insensitive)
+- Message input: `placeholder` contains "message" (case-insensitive)
+- Send message: pressing Enter in the message input sends the message
+- Room list: room names visible as clickable text in a sidebar or list
+- Join room: clicking room name joins/enters it, or a `button` with text "Join"
+- Leave room: `button` with text "Leave"
+- Online users: user names displayed as text in a visible user list or member panel
+
 ### Typing Indicators
 
-- Show when other users are currently typing in a room
+- Show when other users are currently typing in the SAME room (typing must be scoped to room — do not broadcast typing to users in different rooms)
 - Typing indicator should automatically expire after a few seconds of inactivity
 - Display "User is typing..." or "Multiple users are typing..." in the UI
 
+**UI contract:**
+- Typing text: visible text containing "typing" (case-insensitive) when another user types
+- Auto-expiry: typing indicator text disappears within 6 seconds of inactivity
+
 ### Read Receipts
 
 - Track which users have seen which messages
-- Display "Seen by X, Y, Z" under messages (or a seen indicator)
+- Display "Seen by X, Y, Z" under messages — only show OTHER users who have seen it, not the sender
 - Update read status in real-time as users view messages
 
+**UI contract:**
+- Receipt text: text containing "seen" or "read" (case-insensitive) appears near messages after another user views them
+- Reader names: the receipt text includes the viewing user’s display name
+
 ### Unread Message Counts
 
 - Show unread message count badges on the room list
 - Track last-read position per user per room
 - Update counts in real-time as new messages arrive or are read
 
+**UI contract:**
+- Badge: a visible numeric badge (e.g., "3") appears next to room names in the sidebar when there are unread messages
+- Badge clears when the room is opened/entered
+
 ### Scheduled Messages
 
 - Users can compose a message and schedule it to send at a future time
 - Show pending scheduled messages to the author (with option to cancel)
 - Message appears in the room at the scheduled time
 
+**UI contract:**
+- Schedule button: `button` with text "Schedule" or `aria-label` containing "schedule", or an icon button with `title` containing "schedule"
+- Time picker: an `input[type="datetime-local"]` or `input[type="time"]` or `input[type="number"]` for setting the send time
+- Pending list: text "Scheduled" or "Pending" visible when viewing scheduled messages
+- Cancel: `button` with text "Cancel" next to pending scheduled messages
+
 ### Ephemeral/Disappearing Messages
 
 - Users can send messages that auto-delete after a set duration (e.g., 1 minute, 5 minutes)
 - Show a countdown or indicator that the message will disappear
 - Message is permanently deleted from the database when time expires
+
+**UI contract:**
+- Ephemeral toggle: `select`, `button`, or `input` with text/label containing "ephemeral", "disappear", or "expire" (case-insensitive)
+- Duration options: selectable durations (e.g., 30s, 1m, 5m)
+- Indicator: visible text containing a countdown, "expires", or "disappearing" on ephemeral messages
+- Deletion: the message text is removed from the DOM after the duration expires
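
The ephemeral-message contract asks for a visible countdown or "expires" indicator and for the message to leave the DOM once its duration elapses. A client-side sketch of that logic might look like the following. The `EphemeralMessage` shape and the function names are hypothetical, and the label format is just one wording that would satisfy the "expires" rule. The permanent database deletion required by the spec is a backend concern that this sketch does not cover.

```typescript
// Hypothetical ephemeral message record; the field names are assumptions.
interface EphemeralMessage {
  id: number;
  text: string;
  expiresAtMs: number; // epoch milliseconds
}

// Whole seconds left before expiry, rounded up and never negative.
function secondsLeft(msg: EphemeralMessage, nowMs: number): number {
  return Math.max(0, Math.ceil((msg.expiresAtMs - nowMs) / 1000));
}

// Indicator text; it contains "expires" so it matches the contract wording.
function expiryLabel(msg: EphemeralMessage, nowMs: number): string {
  const total = secondsLeft(msg, nowMs);
  const minutes = Math.floor(total / 60);
  const seconds = total % 60;
  return minutes > 0
    ? `expires in ${minutes}m ${seconds}s`
    : `expires in ${seconds}s`;
}

// Messages that should still be rendered; expired ones drop out of the list,
// which removes their text from the DOM on the next render.
function visibleMessages(msgs: EphemeralMessage[], nowMs: number): EphemeralMessage[] {
  return msgs.filter((m) => m.expiresAtMs > nowMs);
}
```

Re-evaluating `visibleMessages` and `expiryLabel` on a one-second timer would keep the countdown current; the timer itself is omitted here.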
