Add multi-model benchmark: Gemini 2.5 Pro, 3 Flash, 3.5 Flash by KhanMih · Pull Request #7 · harsh-kr11/behavioral-memory

KhanMih · 2026-05-24T10:06:11Z

Multi-model benchmark addressing single-model evaluation feedback.

Summary

Addresses reviewer feedback that the study is a single-model evaluation. This PR runs the 30-task benchmark across three Gemini models (2.5 Pro, 3 Flash, and 3.5 Flash) and presents the results side by side.

Changes

- Bug fix: fixed `PlanEngine` to handle the `langchain-google-genai` v4+ response format, in which `response.content` is a list of content blocks instead of a plain string. Without this fix, Gemini 3.x models fail on every task with JSON parse errors.
- New script: `examples/compare_models.py` loads multiple benchmark result JSON files and produces Rich terminal tables, McNemar cross-model tests, and markdown-ready output for the README.
- README update: the Key Results section now shows a side-by-side three-model comparison table with real benchmark numbers. The original paper results are preserved in a collapsible block.
- CLI update: `_print_paper_results_table()` now displays multi-model Dynamic Retrieval results.
- Makefile: added a `benchmark-multi` target that runs all three models sequentially.
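
For readers unfamiliar with the McNemar cross-model test mentioned above, here is a minimal sketch of the exact (binomial) form on paired per-task pass/fail outcomes. It is not the code in `examples/compare_models.py`: the function name `mcnemar_exact`, the input shape, and the choice of the exact variant over the chi-square approximation are all assumptions, and the outcomes in the usage lines are made-up toy data, not benchmark results.

```python
from math import comb


def mcnemar_exact(a_pass: list[bool], b_pass: list[bool]) -> tuple[int, int, float]:
    """Two-sided exact McNemar test on paired pass/fail outcomes.

    Hypothetical helper: returns (a_only, b_only, p_value), where a_only
    counts tasks only model A passed and b_only counts tasks only model B
    passed. Tasks where both models agree do not affect the test.
    """
    if len(a_pass) != len(b_pass):
        raise ValueError("both models must be scored on the same tasks")

    a_only = sum(1 for a, b in zip(a_pass, b_pass) if a and not b)
    b_only = sum(1 for a, b in zip(a_pass, b_pass) if b and not a)
    n = a_only + b_only
    if n == 0:
        # No discordant pairs: the data give no evidence of a difference.
        return a_only, b_only, 1.0

    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5). Double the lower tail
    # for a two-sided p-value and cap it at 1.
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return a_only, b_only, min(1.0, 2 * tail)


# Toy outcomes for 8 tasks (illustrative only).
model_a = [True, True, True, True, True, True, False, True]
model_b = [True, False, False, False, False, False, False, True]
a_only, b_only, p = mcnemar_exact(model_a, model_b)
print(a_only, b_only, f"{p:.4f}")  # prints: 5 0 0.0625
```

With only 30 tasks the number of discordant pairs is small, which is why the exact form is used in this sketch rather than the chi-square approximation.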

Files changed

- `src/behavioral_memory/planner/engine.py`: `_extract_text()` helper for v4+ content blocks
- `src/behavioral_memory/cli.py`: multi-model results table
- `examples/compare_models.py`: new comparison script
- `README.md`: multi-model results + reproduction instructions
- `Makefile`: `benchmark-multi` target
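
To illustrate the kind of normalization the `_extract_text()` helper performs, here is a minimal sketch, not the PR's actual implementation. It assumes text blocks arrive as dicts shaped like `{"type": "text", "text": "..."}` (the shape LangChain generally uses for message content blocks) and that non-text blocks can be skipped; the name `extract_text` is hypothetical.

```python
from typing import Any


def extract_text(content: Any) -> str:
    """Flatten a chat-model `content` value into a single string.

    Hypothetical sketch: handles a plain string (older response format)
    and a list of content blocks (assumed v4+ format).
    """
    if isinstance(content, str):
        return content

    if isinstance(content, list):
        parts: list[str] = []
        for block in content:
            if isinstance(block, str):
                # Some providers mix bare strings into the block list.
                parts.append(block)
            elif isinstance(block, dict) and block.get("type") == "text":
                parts.append(str(block.get("text", "")))
            # Other block types (assumed non-textual) are ignored.
        return "".join(parts)

    # Fallback for unexpected types, so a later JSON parse fails loudly
    # on the stringified value instead of raising here.
    return str(content)
```

The returned string can then be passed to `json.loads` whether the model returned a plain string or a list of blocks.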

Co-authored-by: Cursor <cursoragent@cursor.com>

mehkhan and others added 3 commits May 24, 2026 15:28

Add multi-model benchmark: Gemini 2.5 Pro, 3 Flash, 3.5 Flash

7058d8d

Fix lint: remove unused itertools.combinations import

a55054a

Co-authored-by: Cursor <cursoragent@cursor.com>

Fix formatting: apply ruff format to compare_models.py

46a3183

Co-authored-by: Cursor <cursoragent@cursor.com>

harsh-kr11 merged commit 8bdc9f5 into harsh-kr11:main May 25, 2026
5 checks passed

harsh-kr11 merged 3 commits into harsh-kr11:main from KhanMih:multi-model-benchmark
