Add multi-model benchmark: Gemini 2.5 Pro, 3 Flash, 3.5 Flash #7

Merged: harsh-kr11 merged 3 commits into harsh-kr11:main from KhanMih:multi-model-benchmark on May 25, 2026

KhanMih commented May 24, 2026

Multi-model benchmark addressing single-model evaluation feedback.

Summary

Addresses reviewer feedback that the study is a single-model evaluation. This PR runs the 30-task benchmark across three Gemini models (2.5 Pro, 3 Flash, and 3.5 Flash) and presents the results side by side.

Changes

  • Bug fix — Fixed `PlanEngine` to handle the `langchain-google-genai` v4+ response format, in which `response.content` is a list of content blocks rather than a plain string. Without this fix, Gemini 3.x models fail on every task with JSON parse errors.
  • New script — `examples/compare_models.py` loads multiple benchmark result JSONs and produces Rich terminal tables, McNemar cross-model tests, and markdown-ready output for the README.
  • README update — Key Results section now shows a side-by-side three-model comparison table with real benchmark numbers. Original paper results preserved in a collapsible block.
  • CLI update — `_print_paper_results_table()` now displays multi-model Dynamic Retrieval results.
  • Makefile — Added `benchmark-multi` target to run all three models sequentially.
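
For reviewers unfamiliar with the McNemar cross-model test mentioned above, here is a minimal sketch of the exact (binomial) variant on paired per-task outcomes. It is not the code in `examples/compare_models.py`: the function name `mcnemar_exact` and the `{task_id: passed}` input shape are assumptions made for illustration, and the script may use a different variant (for example, the chi-squared approximation).

```python
from math import comb


def mcnemar_exact(results_a: dict[str, bool], results_b: dict[str, bool]) -> tuple[int, int, float]:
    """Exact two-sided McNemar test on tasks that both models attempted.

    Hypothetical helper: inputs map task id -> whether the model passed.
    Returns (b, c, p_value), where b counts tasks only model A passed
    and c counts tasks only model B passed.
    """
    shared = results_a.keys() & results_b.keys()
    b = sum(1 for t in shared if results_a[t] and not results_b[t])
    c = sum(1 for t in shared if not results_a[t] and results_b[t])
    n = b + c
    if n == 0:
        # No discordant pairs: the models agree on every shared task.
        return b, c, 1.0
    # Under the null, each discordant pair favours either model with p = 0.5.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)
```

Only discordant tasks (passed by exactly one of the two models) carry information here, which is why a paired test suits a benchmark where every model runs the same 30 tasks.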

Files changed

  • `src/behavioral_memory/planner/engine.py` — `_extract_text()` helper for v4+ content blocks
  • `src/behavioral_memory/cli.py` — multi-model results table
  • `examples/compare_models.py` — new comparison script
  • `README.md` — multi-model results + reproduction instructions
  • `Makefile` — `benchmark-multi` target
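
To illustrate the kind of normalization the `_extract_text()` helper performs, here is a minimal sketch. It is not the code from `engine.py`: the name `extract_text` is hypothetical, and the block shape (`{"type": "text", "text": ...}` dicts, possibly mixed with plain strings) is an assumption about what the list-valued `response.content` contains.

```python
from typing import Any


def extract_text(content: Any) -> str:
    """Flatten a message's content into one string (illustrative sketch).

    Assumes content is either a plain string or a list whose items are
    strings or dicts shaped like {"type": "text", "text": "..."}.
    """
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts: list[str] = []
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict) and block.get("type") == "text":
                parts.append(str(block.get("text", "")))
            # Other block types are skipped in this sketch.
        return "".join(parts)
    # Fallback for unexpected types.
    return str(content)
```

With a helper along these lines, the planner can pass the flattened string to its JSON parser regardless of which response format the installed `langchain-google-genai` version returns.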

mehkhan and others added 3 commits May 24, 2026 15:28
@harsh-kr11 harsh-kr11 merged commit 8bdc9f5 into harsh-kr11:main May 25, 2026
5 checks passed