Commit 4160a09
docs: update benchmark viewer GIF with multi-task eval results

Replace the old single-task (0% success) GIF with a new animation
showing the phase0_multi_domain_v3 evaluation (5 tasks, 2 pass,
3 fail, 40% success rate). The new GIF cycles through the overview,
task selection, and step-by-step screenshot replay for both passing
and failing tasks across different Windows application domains.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ffcb41d commit 4160a09
1 file changed