Fix PV metric to match paper's orchestration-focused evaluation by harsh-kr11 · Pull Request #4 · harsh-kr11/behavioral-memory

harsh-kr11 · 2026-05-19T15:58:27Z

Summary

Fixed the Parameter Validity (PV) metric to align with the paper's intent (Section IV.C, IV.D.3). The paper explicitly treats SQL-level parameter differences as "minor errors", so PV should measure orchestration decisions, not content generation.
Categorized parameters into three types:
- Orchestration (format, channel, source_step, mode, etc.): exact match
- Content (query, body, subject, title): structural/semantic match (right tables, right domain terms)
- Identifier (recipient, target_name): lenient key-term match
Updated the README with reproduced live benchmark results (gemini-2.5-pro, pgvector) and a detailed "How the Agent Learns" section explaining the feedback loop architecture.
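The three-way categorization above can be sketched roughly as follows. This is a minimal illustration, not the code in metrics.py: the set contents come only from the parameter names listed in the summary, while the function names (`param_matches`, `parameter_validity`), the token-overlap rule for content parameters, and the 0.5 threshold are assumptions made for the example.

```python
import re

# Hypothetical groupings, built only from the parameter names in the PR summary.
ORCHESTRATION = {"format", "channel", "source_step", "mode"}
CONTENT = {"query", "body", "subject", "title"}
IDENTIFIER = {"recipient", "target_name"}


def _terms(value):
    """Return the set of lowercase alphanumeric tokens in a value."""
    return set(re.findall(r"[a-z0-9_]+", str(value).lower()))


def param_matches(name, predicted, expected, content_threshold=0.5):
    """Compare one parameter using the rule for its category (assumed rules)."""
    if name in CONTENT:
        # Structural match: enough of the expected terms (tables, domain
        # words) must appear in the prediction, regardless of order.
        expected_terms = _terms(expected)
        if not expected_terms:
            return not _terms(predicted)
        overlap = len(expected_terms & _terms(predicted)) / len(expected_terms)
        return overlap >= content_threshold
    if name in IDENTIFIER:
        # Lenient match: any shared key term is accepted.
        return bool(_terms(expected) & _terms(predicted))
    # Orchestration parameters (and anything uncategorized) need an exact match.
    return predicted == expected


def parameter_validity(predicted, expected):
    """Fraction of expected parameters that the prediction matches."""
    if not expected:
        return 1.0
    hits = sum(
        1
        for name, value in expected.items()
        if name in predicted and param_matches(name, predicted[name], value)
    )
    return hits / len(expected)
```

Under these assumed rules, a prediction that reorders SQL columns or abbreviates a recipient still scores as valid, while a wrong `format` or `channel` does not.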

Reproduced Results (gemini-2.5-pro, pgvector)

| Metric | Paper | Live Run |
|---|---|---|
| TSA | 83.3% | 86.7% |
| PV | 84.0% | 82.2% |
| PCR | 63.3% | 80.0% |
| ESA | 83.3% | 86.7% |
| McNemar p | 0.004 | 0.039 |

All results fall within the paper's 95% bootstrap confidence intervals.
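For readers unfamiliar with the McNemar p row, the sketch below shows how a two-sided exact McNemar p-value is computed from discordant pair counts. It assumes the exact binomial form of the test; the PR does not say whether the paper uses this form or the chi-square approximation. The function name `mcnemar_exact_p` and the counts in the usage line are made up for illustration and are not taken from the paper or the live run.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.

    b: tasks that only the first system got right
    c: tasks that only the second system got right
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative counts only (hypothetical, not from the benchmark).
p_value = mcnemar_exact_p(3, 12)
```

The test ignores tasks where both systems agree, so the p-value depends only on how lopsided the disagreements are.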

Files Changed

- src/behavioral_memory/evaluation/metrics.py: core fix (parameter categorization and semantic matching)
- README.md: reproduced results table, learning architecture diagram
- examples/run_live_benchmark.py: pgvector flag for paper reproduction
- docs/GETTING_STARTED.md: updated pgvector setup instructions
- Makefile: added benchmark-pg target
- CONTRIBUTING.md: updated test count

Test plan

- All 104 existing tests pass (pytest tests/ -v)
- ruff check and mypy pass
- python examples/validate_pipeline.py passes all 30 checks
- Gatekeeper ablation script uses the corrected metrics via compute_metrics

Made with Cursor

The paper (Section IV.D.3) explicitly treats SQL-level parameter differences as "minor errors", so PV should measure orchestration decisions (tool selection, format, channel, dependencies), not content generation (SQL text, email prose). Categorize parameters into orchestration (exact match), content (structural/semantic match), and identifier (lenient match). Update README with reproduced live benchmark results.

Co-authored-by: Cursor <cursoragent@cursor.com>

harskuma and others added 2 commits May 19, 2026 21:27

Fix lint and typecheck: format metrics.py, remove stale type-ignore comments

Co-authored-by: Cursor <cursoragent@cursor.com>

5fcbafe

harsh-kr11 self-assigned this May 19, 2026

harsh-kr11 merged commit 7bc31c4 into main May 19, 2026
5 checks passed

harsh-kr11 deleted the fix/metrics-and-results branch May 19, 2026 16:05

harsh-kr11 mentioned this pull request May 19, 2026

Full audit: dedup fix, PV tests, README rewrite, release workflow #5

Merged

2 tasks
