
Fix PV metric to match paper's orchestration-focused evaluation #4

Merged
harsh-kr11 merged 2 commits into main from fix/metrics-and-results on May 19, 2026


@harsh-kr11
Owner

Summary

  • Fixed Parameter Validity (PV) metric to align with the paper's intent (Section IV.C, IV.D.3). The paper explicitly treats SQL-level parameter differences as "minor errors" — PV should measure orchestration decisions, not content generation.
  • Categorized parameters into three types:
    • Orchestration (format, channel, source_step, mode, etc.) — exact match
    • Content (query, body, subject, title) — structural/semantic match (right tables, right domain terms)
    • Identifier (recipient, target_name) — lenient key-term match
  • Updated README with reproduced live benchmark results (gemini-2.5-pro, pgvector) and a detailed "How the Agent Learns" section explaining the feedback loop architecture.
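The three parameter categories in the summary can be illustrated with a minimal sketch. This is not the code in `metrics.py`: the category sets, the token-overlap rule for content parameters, the `content_threshold` value, and the function names `param_matches` and `parameter_validity` are all assumptions made for illustration.

```python
import re

# Hypothetical category sets, loosely following the PR summary; the real
# lists in metrics.py may differ.
ORCHESTRATION = {"format", "channel", "source_step", "mode"}
CONTENT = {"query", "body", "subject", "title"}
IDENTIFIER = {"recipient", "target_name"}


def _tokens(value):
    """Return the set of lowercase alphanumeric/underscore tokens in a value."""
    return set(re.findall(r"[a-z0-9_]+", str(value).lower()))


def param_matches(name, expected, predicted, content_threshold=0.5):
    """Compare one parameter using a rule chosen by its (assumed) category."""
    if name in CONTENT:
        # Structural match: enough of the expected terms (tables, columns,
        # domain words) must reappear in the prediction.
        exp, pred = _tokens(expected), _tokens(predicted)
        if not exp:
            return not pred
        return len(exp & pred) / len(exp) >= content_threshold
    if name in IDENTIFIER:
        # Lenient match: any shared key term counts.
        return bool(_tokens(expected) & _tokens(predicted))
    # Orchestration parameters, and any unknown ones, require exact equality.
    return expected == predicted


def parameter_validity(expected, predicted):
    """Fraction of expected parameters that the prediction matches."""
    if not expected:
        return 1.0
    hits = sum(
        name in predicted and param_matches(name, value, predicted[name])
        for name, value in expected.items()
    )
    return hits / len(expected)


expected = {
    "format": "csv",
    "query": "SELECT region, SUM(amount) FROM sales GROUP BY region",
    "recipient": "finance-team@example.com",
}
predicted = {
    "format": "csv",
    "query": "select region, sum(amount) as total from sales group by region order by total",
    "recipient": "Finance Team",
}
print(parameter_validity(expected, predicted))  # 1.0
```

Under these assumptions, a differently worded SQL query still scores as valid, while changing `format` from `csv` to `json` would fail the exact-match rule and lower the score to 2/3.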

Reproduced Results (gemini-2.5-pro, pgvector)

| Metric | Paper | Live Run |
| --- | --- | --- |
| TSA | 83.3% | 86.7% |
| PV | 84.0% | 82.2% |
| PCR | 63.3% | 80.0% |
| ESA | 83.3% | 86.7% |
| McNemar p | 0.004 | 0.039 |

All results fall within the paper's 95% bootstrap confidence intervals.
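For readers unfamiliar with the McNemar p row, the following is a minimal sketch of an exact two-sided McNemar test on paired per-task outcomes. The PR does not say which variant of the test the paper or the benchmark script uses, so the exact binomial form is an assumption here, and the discordant counts in the usage line are hypothetical rather than taken from the benchmark.

```python
from math import comb


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: tasks where only system A succeeded
    c: tasks where only system B succeeded
    Tasks where both succeeded or both failed do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as (b, c),
    # doubled for the two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, for illustration only.
print(mcnemar_exact(7, 2))  # 0.1796875
```

The test uses only the tasks on which the two systems disagree, which is why a small benchmark can still produce a small p-value when nearly all disagreements favor one system.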

Files Changed

  • src/behavioral_memory/evaluation/metrics.py — Core fix: parameter categorization and semantic matching
  • README.md — Reproduced results table, learning architecture diagram
  • examples/run_live_benchmark.py — pgvector flag for paper reproduction
  • docs/GETTING_STARTED.md — Updated pgvector setup instructions
  • Makefile — Added benchmark-pg target
  • CONTRIBUTING.md — Updated test count

Test plan

  • All 104 existing tests pass (pytest tests/ -v)
  • ruff check and mypy pass
  • python examples/validate_pipeline.py passes all 30 checks
  • Gatekeeper ablation script uses the corrected metrics via compute_metrics

Made with Cursor

harskuma and others added 2 commits May 19, 2026 21:27
The paper (Section IV.D.3) explicitly treats SQL-level parameter
differences as "minor errors" — PV should measure orchestration
decisions (tool selection, format, channel, dependencies), not
content generation (SQL text, email prose).

Categorize parameters into orchestration (exact match), content
(structural/semantic match), and identifier (lenient match).
Update README with reproduced live benchmark results.

Co-authored-by: Cursor <cursoragent@cursor.com>
…omments

Co-authored-by: Cursor <cursoragent@cursor.com>
@harsh-kr11 harsh-kr11 self-assigned this May 19, 2026
@harsh-kr11 harsh-kr11 merged commit 7bc31c4 into main May 19, 2026
5 checks passed
@harsh-kr11 harsh-kr11 deleted the fix/metrics-and-results branch May 19, 2026 16:05