Commit be86a51
LLM Benchmark Improvements + More Evals (#4740)
# Description of Changes
LLM benchmark infrastructure improvements and new benchmark tasks.
**Runner & scoring:**
- Add retry logic with backoff for LLM API calls (rate limits,
502/503/504, timeouts)
- Fix `generation_duration_ms` to only time the successful attempt, not
retries+sleep delays
- Add `--dry-run` flag to run benchmarks without saving results
- Add OpenRouter client as unified fallback when direct vendor keys
aren't set
- Add web search mode via OpenRouter `:online` suffix
- Extract shared OpenAI-compatible response types into `oa_compat.rs`
- Add `ReducerCallBothScorer` for calling reducers on both golden and
LLM databases
- Set `max_tokens` on OpenRouter and Meta clients to prevent silent
truncation
**Model routing:**
- Add `ModelRoute` with display name, vendor, API model, and OpenRouter
model ID
- Support ad-hoc model IDs via `--models vendor:model` without static
registration
- Add model name normalization (OpenRouter IDs, case variants →
canonical display names)
**Context modes:**
- Add `guidelines`, `cursor_rules`, `search`, `no_context` modes with
`is_empty_context_mode()` helper
- Add mode-specific prompt preambles
- Consolidate mode alias normalization (`none`/`no_guidelines` →
`no_context`)
**CI workflows:**
- Add `llm-benchmark-periodic.yml` for scheduled nightly runs with
per-language failure tracking
- **Note**: The periodic workflow requires `OPENROUTER_API_KEY`,
`LLM_BENCHMARK_UPLOAD_URL`, and `LLM_BENCHMARK_API_KEY` as GitHub
secrets.
- Add `llm-benchmark-validate-goldens.yml` for validating golden answers
still compile
**Results & summary:**
- Add `cmd_status` to show incomplete benchmark combinations with rerun
commands
- Add `cmd_analyze` for LLM-powered failure analysis
- Split `normalize_details_file` from `write_summary_from_details_file`
- Derive task categories from filesystem for summary generation
- Add timestamp tracking (`started_at`/`finished_at`) and token usage
**New benchmark tasks:**
- 30 new tasks across auth, data_modeling, queries, basics, and schema
categories
- Updated/fixed existing task prompts and golden answers
# API and ABI breaking changes
None. Internal tooling only.
# Expected complexity level and risk
2 — Changes are scoped to the LLM benchmark CLI tool
(`xtask-llm-benchmark`) and CI workflows. No impact on SpacetimeDB core.
# Testing
- [x] `cargo check -p xtask-llm-benchmark` — zero errors, zero warnings
- [x] Dry run: `llm_benchmark run --lang typescript --modes no_context
--tasks t_001 --models openai:gpt-5-mini --dry-run` — ran end-to-end,
confirmed no results saved to disk
- [ ] Verify periodic workflow runs successfully on next scheduled
trigger
---------
Co-authored-by: Tyler Cloutier <cloutiertyler@users.noreply.github.com>1 parent 6bab85d commit be86a51
356 files changed
Lines changed: 9999 additions & 44461 deletions
File tree
- .github/workflows
- crates/cli
- src/subcommands
- docs
- llms
- scripts
- static
- ai-rules
- skills
- cli
- concepts
- cpp-server
- csharp-client
- csharp-server
- rust-server
- spacetimedb-concepts
- spacetimedb-csharp
- spacetimedb-rust
- spacetimedb-typescript
- typescript-client
- typescript-server
- unity
- unreal
- tools/xtask-llm-benchmark
- src
- api
- benchmarks
- auth
- t_026_auth_identity_check
- answers
- tasks
- t_027_private_vs_public_table
- answers
- tasks
- t_041_registered_user_gate
- answers
- tasks
- t_042_admin_bootstrap
- answers
- tasks
- t_043_role_based_access
- answers
- tasks
- t_044_ban_list
- answers
- tasks
- t_045_rate_limit
- answers
- tasks
- basics
- t_000_empty_reducers/tasks
- t_001_basic_tables/tasks
- t_002_scheduled_table
- answers
- tasks
- t_003_struct_in_table/tasks
- t_004_insert/tasks
- t_005_update
- answers
- tasks
- t_006_delete
- answers
- tasks
- t_007_crud
- answers
- tasks
- t_008_index_lookup
- answers
- tasks
- t_009_init
- answers
- tasks
- t_010_connect/tasks
- t_011_helper_function/tasks
- t_038_schedule_at_time
- answers
- tasks
- t_039_cancel_schedule
- answers
- tasks
- t_040_lifecycle_player
- answers
- tasks
- data_modeling
- t_024_event_table
- answers
- tasks
- t_025_optional_fields
- answers
- tasks
- t_028_cascade_delete
- answers
- tasks
- t_029_filter_and_aggregate
- answers
- tasks
- t_030_two_table_join
- answers
- tasks
- t_031_unique_constraint
- answers
- tasks
- queries
- t_022_view_basic
- answers
- tasks
- t_023_view_per_user
- answers
- tasks
- t_032_range_query
- answers
- tasks
- t_033_sort_and_limit
- answers
- tasks
- t_034_find_first
- answers
- tasks
- t_035_select_distinct
- answers
- tasks
- t_036_count_without_collect
- answers
- tasks
- t_037_multi_column_filter
- answers
- tasks
- schema
- t_012_spacetime_product_type/tasks
- t_013_spacetime_sum_type
- answers
- tasks
- t_014_elementary_columns
- answers
- tasks
- t_015_product_type_columns
- answers
- tasks
- t_016_sum_type_columns
- answers
- tasks
- t_017_scheduled_columns
- answers
- tasks
- t_018_constraints
- answers
- tasks
- t_019_many_to_many
- answers
- tasks
- t_020_ecs
- answers
- tasks
- t_021_multi_column_index
- answers
- tasks
- bench
- bin
- context
- eval
- generated
- llm
- clients
- results
Some content is hidden
Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
558 | 558 | | |
559 | 559 | | |
560 | 560 | | |
561 | | - | |
562 | | - | |
563 | | - | |
564 | | - | |
565 | | - | |
566 | | - | |
567 | | - | |
568 | | - | |
569 | | - | |
570 | | - | |
571 | | - | |
572 | | - | |
573 | | - | |
574 | | - | |
575 | | - | |
576 | | - | |
577 | | - | |
578 | | - | |
579 | | - | |
580 | | - | |
581 | | - | |
582 | | - | |
583 | | - | |
584 | | - | |
585 | | - | |
586 | | - | |
587 | | - | |
588 | | - | |
589 | | - | |
590 | | - | |
591 | | - | |
592 | | - | |
593 | | - | |
594 | 561 | | |
595 | 562 | | |
596 | 563 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
0 commit comments