Commit b2bf619
refactor: store summaries (17KB) instead of full results (220KB) per model
- New: evaluations/summaries/ — scores only, no raw responses or
per-permutation details. 17KB vs 220KB per model.
- evaluations/results/ added to .gitignore (full results reproducible
via pilot.py)
- Report generator reads from summaries/
- pilot.py auto-generates summary after each run
- Fixed filename suffix to use exact model IDs (CodeRabbit feedback)
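
A minimal sketch of the summary step described in the bullets above, assuming a full result is a JSON-like dict. The field names (`raw_responses`, `permutations`, `scores`), the `summary_<model_id>.json` filename pattern, and both helper functions are hypothetical; the actual schema and code in pilot.py are not shown in this commit view.

```python
import json
from pathlib import Path

# Hypothetical names for the bulky fields that the summary leaves out.
DROPPED_KEYS = {"raw_responses", "permutations"}


def summarize(full_result: dict) -> dict:
    """Return a new dict holding everything except the bulky fields."""
    return {k: v for k, v in full_result.items() if k not in DROPPED_KEYS}


def write_summary(full_result: dict, model_id: str, out_dir: Path) -> Path:
    """Write the summary to out_dir, using the exact model ID in the filename.

    Assumes model_id is filesystem-safe (for example, contains no '/').
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"summary_{model_id}.json"
    path.write_text(json.dumps(summarize(full_result), indent=2))
    return path
```

Because `summarize` builds a new dict instead of mutating its input, the full result could still be written to the ignored `evaluations/results/` directory in the same run.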
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4602bbe commit b2bf619
15 files changed
Lines changed: 4775 additions & 62665 deletions
File tree
- evaluations
- results
- summaries