name Eval regression about A golden-dataset question that used to pass now fails (often auto-filed by eval-nightly). title eval: regression on [case_id] labels bug test Case ID: Category: Difficulty: Question Expected answer Actual answer When did it regress? Reproduction uv run pytest eval/test_golden_qa.py -k "<case_id>" -v Suspected cause Invariant touched