Commit d0f55e8
feat: benchmark harness rewrite — MCP client, first baseline results
Rewrote run.ts to use the MCP SDK client directly instead of shelling out to the CLI. 38 queries across 9 categories against the 11-file small corpus.

Baseline results:
- Recall@5: 94.6%
- Recall@10: 97.7%
- MRR: 0.868
- Content Hit Rate: 21.4% (snippet issue, not retrieval)
- Mean Latency: 653ms (570ms steady state)

Strengths: temporal (100%), task_recall (100%), exact_fact (100%)
Weaknesses: relational (86.7%); content snippets missing specific values
Parent: 561c218


66 files changed: +3295 −0 lines

benchmark/README.md

Lines changed: 119 additions & 0 deletions

# Basic Memory Benchmark

Open, reproducible retrieval quality benchmarks for the Basic Memory OpenClaw plugin.

## Why

Memory systems for AI agents make big claims with no reproducible evidence. We're building benchmarks in the open to:

1. **Improve Basic Memory** — evals are a feedback loop, not a marketing tool
2. **Compare honestly** — show where we're strong AND where we're weak
3. **Publish methodology** — anyone can reproduce our results or challenge them

## What We Measure

### Retrieval Quality (primary)

- **Recall@K** — does the correct memory appear in the top K results?
- **Precision@K** — of the top K results, how many are actually relevant?
- **MRR** — Mean Reciprocal Rank: where does the first correct answer appear?
- **Content Hit Rate** — for exact facts, did the expected value appear in results?
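As a concrete illustration, these metrics can be computed in a few lines. This is a minimal TypeScript sketch, not the harness's actual code in run.ts, and the `QueryResult` shape is an assumption:

```typescript
// Minimal sketch of the metrics above. Illustrative only: the real
// harness lives in run.ts and may structure results differently.
interface QueryResult {
  ranked: string[];   // file paths returned by the provider, best match first
  relevant: string[]; // ground-truth file paths for the query
}

// Recall@K: fraction of ground-truth files that appear in the top K results.
function recallAtK(r: QueryResult, k: number): number {
  const topK = new Set(r.ranked.slice(0, k));
  return r.relevant.filter((p) => topK.has(p)).length / r.relevant.length;
}

// Reciprocal rank: 1 / (rank of the first relevant result), 0 if none found.
function reciprocalRank(r: QueryResult): number {
  const i = r.ranked.findIndex((p) => r.relevant.includes(p));
  return i === -1 ? 0 : 1 / (i + 1);
}

// MRR: mean of reciprocal ranks over all queries.
function meanReciprocalRank(results: QueryResult[]): number {
  return results.reduce((s, r) => s + reciprocalRank(r), 0) / results.length;
}
```

For example, a query whose single ground-truth file appears at rank 2 scores Recall@1 = 0, Recall@5 = 1, and reciprocal rank 0.5.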
### Query Categories

| Category | What it tests |
|----------|---------------|
| `exact_fact` | Keyword precision — find specific values |
| `semantic` | Vector similarity — find conceptually related content |
| `temporal` | Date awareness — retrieve by when things happened |
| `relational` | Graph traversal — follow connections between entities |
| `cross_note` | Multi-document recall — stitch information across files |
| `task_recall` | Structured task queries — find active/assigned tasks |
| `needle_in_haystack` | Exact token retrieval — find specific IDs, URLs, numbers |
| `absence` | Knowing what ISN'T there — or is planned but not done |
| `evolving_fact` | Freshness — prefer newer data over stale entries |

### Providers Compared

1. **Basic Memory** (`bm search`) — semantic graph + observations + relations
2. **OpenClaw builtin** (`memory-core`) — SQLite + vector + BM25 hybrid
3. **QMD** (experimental) — BM25 + vectors + reranking sidecar

## Quick Start

```bash
# Prerequisites: bm CLI installed
# https://github.com/basicmachines-co/basic-memory

# Run the benchmark (small corpus, default)
just benchmark

# Verbose output (per-query details)
just benchmark-verbose

# Run all corpus sizes to see scaling behavior
just benchmark-all

# Run a specific size
just benchmark-medium
just benchmark-large
```

## Corpus Tiers

Three nested corpus sizes test how retrieval scales with data growth. Each tier is a superset of the previous — medium contains all of small, large contains all of medium.

### Small (~10 files, ~12KB) — `corpus-small/`

A single day's work. Baseline: "does search work at all?"

- 1 MEMORY.md, 4 daily notes, 2 tasks, 2 people, 2 topics

### Medium (~35-40 files, ~50KB) — `corpus-medium/`

A working week. Tests noise resistance and temporal ranking.

- Everything in small + 7 more daily notes, 3 more tasks (incl. done), 3 more people, 3 more topics
- Done tasks that should NOT appear in active task queries
- More entities competing for relevance on each query
- 2-hop relation chains

### Large (~100-120 files, ~150-200KB) — `corpus-large/`

A month of accumulated knowledge. The real stress test.

- Everything in medium + 25 more daily notes, 10 more tasks, 10 more people/orgs, 15 more topics
- Deep needle-in-haystack: specific IDs buried in old notes
- 3+ hop relation chains
- Heavy cross-document synthesis requirements
- Stale vs. fresh fact resolution at scale

### What scaling reveals

| Metric | Small → Medium | Medium → Large |
|--------|---------------|----------------|
| Recall@5 | Should hold steady | May degrade — more noise |
| MRR | Should hold steady | Ranking quality under pressure |
| Latency | Baseline | Index size impact |
| Content hit | High | Needle-in-haystack stress |

If recall drops significantly from small → large, that's the signal to improve chunking, ranking, or indexing.
## Queries

`benchmark/queries.json` contains 38 annotated queries with:

- Ground truth file paths (which files contain the answer)
- Expected content strings (for exact fact verification)
- Category labels (for per-category scoring)
- Notes explaining edge cases
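A single entry might look like the following. This is purely illustrative — the field names are assumptions, so check `queries.json` for the actual schema:

```json
{
  "id": "exact_fact_pricing",
  "category": "exact_fact",
  "query": "How much does the team plan cost per seat?",
  "expected_files": ["MEMORY.md"],
  "expected_content": ["$9"],
  "notes": "The price also appears in several daily notes."
}
```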

## Results

Results are written to `benchmark/results/` as JSON with full per-query breakdowns:

- Overall metrics (recall, precision, MRR, latency)
- Category breakdown
- Individual query scores
- Failure analysis
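The exact schema is whatever run.ts writes; as a hypothetical TypeScript sketch of such a breakdown, with this commit's headline baseline numbers plugged in (all field names here are illustrative assumptions, not the real format):

```typescript
// Hypothetical result-file shape. Illustrative only: the actual
// schema is defined by run.ts, not by this sketch.
interface CategoryScore {
  category: string;   // e.g. "temporal", "relational"
  recallAt5: number;  // 0..1
  mrr: number;        // 0..1
}

interface BenchmarkResult {
  provider: string;
  corpus: "small" | "medium" | "large";
  overall: {
    recallAt5: number;
    recallAt10: number;
    mrr: number;
    meanLatencyMs: number;
  };
  categories: CategoryScore[];
  failures: { queryId: string; reason: string }[];
}

// The headline baseline numbers from this commit, expressed in that shape:
const baseline: BenchmarkResult = {
  provider: "basic-memory",
  corpus: "small",
  overall: { recallAt5: 0.946, recallAt10: 0.977, mrr: 0.868, meanLatencyMs: 653 },
  categories: [],
  failures: [],
};
```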

## Contributing

We welcome contributions:

- **Add queries** — especially edge cases you've encountered
- **Expand the corpus** — more realistic memory patterns
- **Add providers** — help us compare against other memory systems
- **Challenge methodology** — if our scoring is unfair, tell us

## License

MIT — same as the plugin.

benchmark/corpus-large/MEMORY.md

Lines changed: 54 additions & 0 deletions
# MEMORY.md - Long-Term Memory

## About Me

- Name: Atlas 🔭. First boot: 2026-01-15.
- Running on dev machine (Ubuntu 24.04, always on)
- GitHub: atlas-bot (atlas@stellartools.dev), member of stellartools org

## About the Human

- Name: Maya Chen
- Role: Founder of Stellar Tools
- Timezone: America/Los_Angeles (PST/PDT)
- Prefers Slack over email for quick things
- Morning person — most productive before noon

## Team

- Maya (founder, full-stack)
- Raj Patel (eng, backend)
- Lena Vogt (design, part-time)
- All currently bootstrapped, no outside funding

## Stellar Tools — The Product

- **What:** Developer productivity CLI that aggregates metrics across GitHub, Linear, and Slack
- **Core differentiator:** Single pane of glass for engineering velocity — no dashboards, just terminal
- **Tech:** Rust, SQLite, gRPC, ships as single binary
- **OSS:** github.com/stellartools/stl (~1,800 stars)
- **Cloud:** app.stellartools.dev (hosted dashboard)
- **Pricing:** $9/mo per seat (team plan), free for solo devs
- **Revenue:** ~$2,100 MRR, 65 paying teams, growing 8% month-over-month
- **Active dev:** Linear integration, webhook pipeline, team analytics view

## Architecture Decisions

- Chose SQLite over Postgres for local-first story (2026-01-20)
- gRPC for service mesh, REST for public API (2026-01-22)
- Ship as single binary — no Docker required (key differentiator)
- Webhook ingestion via async queue, not synchronous processing (2026-02-01)

## Competitive Landscape

- **LinearB** — enterprise, expensive ($30/seat), heavy setup
- **Sleuth** — DORA metrics focused, SaaS only
- **Swarmia** — good UX but GitHub-only, no Linear integration
- **Our moat:** CLI-first, works offline, single binary, respects developer privacy

## Communication

- **Slack:** stellartools workspace (Maya + Raj + Lena)
- **Email:** maya@stellartools.dev, atlas@stellartools.dev
- **GitHub:** stellartools org
- **Linear:** Stellar Tools workspace (project key: STL)

## Opinions & Lessons

- "Ship the CLI first, web dashboard second" — Maya, every standup
- Rust compile times are brutal but the binary size payoff is worth it
- SQLite WAL mode is mandatory for concurrent reads during metric aggregation
- Never trust webhook delivery — always implement idempotent handlers
- The onboarding flow is our weakest point right now (users drop off at OAuth)
Lines changed: 34 additions & 0 deletions
---
title: '2026-02-10'
type: note
permalink: memory/2026-02-10
---

# 2026-02-10 — Monday

## Observations

- [date] 2026-02-10
- [type] daily-note

## Standup

- Maya working on Linear webhook integration — parsing cycle time events
- Raj fixing the SQLite connection pooling bug (issue STL-142)
- Lena delivered new onboarding mockups — 3-step flow replacing the current 7-step monster

## Decisions

- Moving webhook processing to background queue (Bull MQ equivalent in Rust)
- Decided to drop support for GitLab in v1 — focus on GitHub + Linear only
- OAuth flow will use PKCE instead of implicit grant (security review finding)

## Bug Report

- User "fasttrack_dev" reported metrics dashboard shows stale data after timezone change
- Root cause: cache key doesn't include timezone offset
- Raj will fix in STL-145

## Claw Time

Read an interesting paper on "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" — the key insight is that combining parametric and non-parametric memory outperforms either alone. Relevant to how we think about metric aggregation caching.

## Relations

- relates_to [[Linear Integration]]
- relates_to [[Onboarding Redesign]]
- relates_to [[Raj Patel]]
Lines changed: 44 additions & 0 deletions
---
title: '2026-02-11'
type: note
permalink: memory/2026-02-11
---

# 2026-02-11 — Tuesday

## Observations

- [date] 2026-02-11
- [type] daily-note

## Standup

- Maya shipped webhook queue prototype — 3x throughput improvement on ingestion
- Raj closed STL-142 (connection pooling) and STL-145 (timezone cache key)
- Lena presenting onboarding redesign to Maya at 2pm

## Customer Feedback

- Team at Nexus Labs (12 seats) requesting Jira integration — told them post-v1
- "fasttrack_dev" confirmed timezone fix works — sent us a thank-you tweet
- New trial signup from a YC W26 batch company called "Cortex AI"

## Meeting: Onboarding Redesign Review (2pm)

- Lena's new flow: Install → Connect GitHub → See first metric (3 steps)
- Old flow had 7 steps including email verification, team invite, preference wizard
- Maya approved the simplified flow — "if they can see value in 60 seconds, they'll stay"
- Decision: remove email verification from onboarding, move to settings
- Target: ship new onboarding by Feb 20

## Architecture Discussion

- Maya proposed caching webhook payloads in SQLite before processing
- Advantage: replay capability, audit trail, crash recovery
- Raj concerned about disk usage on high-volume teams
- Compromise: 7-day retention with configurable TTL per team
- Linear workspace ID: ws_stl_prod_7x2k

## Evening

- Pushed v0.8.3 hotfix for the timezone bug
- Release notes drafted and posted to #announcements

## Relations

- relates_to [[Onboarding Redesign]]
- relates_to [[Lena Vogt]]
- relates_to [[Cortex AI]]
Lines changed: 48 additions & 0 deletions
---
title: '2026-02-12'
type: note
permalink: memory/2026-02-12
---

# 2026-02-12 — Wednesday

## Observations

- [date] 2026-02-12
- [type] daily-note

## Standup

- Maya integrating Linear cycle time API — endpoint is rate-limited to 100 req/min
- Raj building webhook replay tool based on yesterday's architecture decision
- Lena started implementing new onboarding UI in the web dashboard

## Metrics Review

- MRR: $2,100 (up from $1,950 last month)
- Active teams: 65 (net +3 this week)
- Churn: 2 teams churned — both cited "not enough integrations"
- Trial-to-paid conversion: 23% (target: 30%)
- GitHub stars: 1,847 (was 1,800 last week)

## Incident

- 14:30 PST: webhook ingestion queue backed up for ~20 minutes
- Root cause: Linear sent a burst of 2,000 events from a large workspace migration
- Raj's fix: added backpressure mechanism with configurable max queue depth
- No data loss, all events eventually processed
- Postmortem scheduled for Friday

## Customer Call: Cortex AI

- CTO James Liu, 8-person engineering team
- They want to track deployment frequency + lead time (DORA metrics)
- Currently using a janky spreadsheet
- Interested in team plan at $9/seat
- Follow-up demo scheduled for Feb 18 at 10am PST
- Email: james@cortexai.dev

## Security

- Dependabot flagged a vulnerability in the HTTP client crate (hyper v0.14)
- Upgraded to hyper v1.2 — breaking changes in connection pooling API
- Raj handling the migration, ETA Thursday

## Relations

- relates_to [[Cortex AI]]
- relates_to [[Raj Patel]]
- relates_to [[Linear Integration]]
Lines changed: 44 additions & 0 deletions
---
title: '2026-02-14'
type: note
permalink: memory/2026-02-14
---

# 2026-02-14 — Friday

## Observations

- [date] 2026-02-14
- [type] daily-note

## Standup

- Maya: Linear integration MVP done — cycle time + throughput metrics working
- Raj: webhook replay tool shipped, hyper v1.2 migration complete
- Lena: onboarding UI 70% done, blocked on OAuth flow changes

## Postmortem: Webhook Queue Backup (Feb 12)

- Timeline: 14:22-14:42 PST, 20 min degradation
- Impact: ~200 teams experienced delayed metric updates
- Root cause: no backpressure on ingestion queue, Linear burst exceeded capacity
- Fix: configurable max queue depth (default 5000), overflow to disk-backed queue
- Action items:
  1. Add queue depth alerting (Maya, by Feb 19)
  2. Load test with simulated burst traffic (Raj, by Feb 21)
  3. Document incident response runbook (Atlas, by Feb 17)
- Severity: P2 (degraded but not down)

## Pricing Discussion

- Maya considering a free tier change: currently unlimited for solo, might cap at 3 repos
- Raj argues against caps — "developers hate artificial limits"
- Decision deferred to next week — want to look at usage data first
- Current pricing: $9/seat/month for teams, free for solo developers
- Enterprise inquiries: 2 (both want SSO + audit logs)

## Deploy

- v0.9.0-beta.1 tagged with Linear integration
- Changelog: Linear cycle time, throughput, backlog age metrics
- Webhook replay tool (admin only)
- Breaking: dropped GitLab support (deprecated in v0.8)

## Relations

- relates_to [[Linear Integration]]
- relates_to [[Onboarding Redesign]]
Lines changed: 29 additions & 0 deletions
---
title: Lena Vogt
type: person
permalink: people/lena-vogt
---

# Lena Vogt

## Observations

- [name] Lena Vogt
- [role] Designer (part-time)
- [email] lena@stellartools.dev
- [timezone] Europe/Berlin (CET/CEST)
- [expertise] UX design, Figma, onboarding flows
- [joined] 2026-01-25
- [status] active team member

## Notes

- Part-time contributor, available ~20 hours/week
- Delivered the 3-step onboarding mockups that Maya approved
- Currently implementing the onboarding UI in the web dashboard
- Timezone difference means async collaboration with Maya/Raj
- Strong advocate for "time to first value" metric

## Relations

- works_at [[Stellar Tools]]
- collaborates_with [[Maya Chen]]
- collaborates_with [[Raj Patel]]
- working_on [[Onboarding Redesign]]
