A production-ready, faithful implementation of the Berkeley Function-Calling Leaderboard benchmark, designed to drop into the elizaOS benchmark harness.
The package vendors the relevant Apache-2.0 BFCL evaluator pieces (multi-turn
runtime + tool implementations) with attribution — see
executable_runtime/NOTICE.
| Group | Category | Scoring path | Network |
|---|---|---|---|
| Non-live single-turn | simple, multiple, parallel, parallel_multiple, relevance, irrelevance, sql, java, javascript |
AST equality | no |
| Non-live single-turn — gated | rest_api |
AST equality (currently no live REST runner) | yes |
| Live (user-contributed) | live_simple, live_multiple, live_parallel, live_parallel_multiple, live_relevance, live_irrelevance |
AST equality | no |
| Multi-turn | multi_turn_base, multi_turn_miss_func, multi_turn_miss_param, multi_turn_long_context |
Executable runtime (state equality) | no |
| Agentic — web_search | web_search_base, web_search_no_snippet |
Executable runtime | yes |
| Agentic — memory | memory_kv, memory_vector, memory_rec_sum |
(skipped — see below) | no |
| Non-scoring | format_sensitivity |
n/a | n/a |
Categories that require live network / credentials are gated behind the
--enable-network CLI flag. Without it, those tests are reported in the
skipped_no_credentials bucket and excluded from the accuracy
denominator with a logged warning — they are not silently failed.
Currently network-gated:
rest_apiweb_search_base,web_search_no_snippet
# Run without network — REST / web_search are skipped
python -m benchmarks.bfcl run --sample 200
# Opt in to network-backed scoring
python -m benchmarks.bfcl run --sample 200 --enable-networkMemory categories (memory_kv, memory_vector, memory_rec_sum) depend on
additional upstream scaffolding (snapshot dirs, prereq conversation files,
bfcl_eval.utils helpers) that we do not vendor. They are reported in the
skipped_unsupported bucket. To enable, install bfcl-eval from upstream
and the runtime will fall through to the upstream helpers.
AST equality between the predicted and ground-truth call list, using the
faithful BFCL AST checker logic. The AST checker preserves a small set of
intentional quirk-tolerance fixes (annotated in
evaluators/ast_evaluator.py) — these are
documented overrides of upstream behaviour that the upstream issue tracker
also reports as undesirable.
Per-turn agent loop:
- Feed the current user turn (plus any prior assistant + tool result messages) into the agent.
- Decode the upstream-canonical python-list-of-calls from the agent's
response (
decode_python_calls). - Execute the calls against a per-test
ExecutableRuntimethat holds the vendored upstream tool instances (GorillaFileSystem, MathAPI, TwitterAPI, ...). - Repeat for the next turn (state persists across turns).
- At the end, also execute the ground-truth trajectory against a fresh runtime and compare final per-class instance state.
The previous evaluator used synthetic always-success mock handlers —
exec_accuracy was always 1.0 regardless of the model's calls. That
behaviour is removed. The new evaluator actually invokes the upstream tool
implementations and compares output state. Unregistered functions return
failure, NOT auto-success.
# List available models / providers
python -m benchmarks.bfcl models
# Sample run (50 stratified tests)
python -m benchmarks.bfcl run --sample 50
# Run a specific category set
python -m benchmarks.bfcl run --categories simple,multiple,multi_turn_base
# Full benchmark, with network-gated categories enabled
python -m benchmarks.bfcl run --full --enable-network
# Local data path (avoids HuggingFace)
python -m benchmarks.bfcl run --local-data ./data/bfcl
# Run all + show baselines
python -m benchmarks.bfcl info --baselinesThe run summary now distinguishes:
total_tests— number of tests that contributed to accuracy.passed_tests/failed_tests— amongtotal_tests.skipped_tests— excluded from accuracy.skipped_by_reason— bucketed by status:skipped_no_credentialsskipped_no_ground_truthskipped_unsupported
Vendored upstream sources are Apache License 2.0 (Berkeley/Gorilla). See
executable_runtime/NOTICE for attribution.