I think some metrics regarding the total and average time it took each model to solve the tasks would be a helpful addition to the leaderboard, especially for reasoning models.
Looking only at the percentage completed and the cost involved, I could get the wrong impression that model A is a better choice than model B, although realistically model B might be a much better fit for everyday use because it solves tasks in 1/10th the time.