- Knowledge-creating LLMs will need qualitatively different benchmarks: instead of testing whether they can answer questions to which we already know the answer (as most existing benchmarks do), we want them to answer *new* questions, e.g. solve an unsolved mathematical problem ([FrontierMath Open Problems](https://epoch.ai/frontiermath/open-problems)) or set a new record on an optimization problem (e.g. GSO-bench, @shetty2025gso). We can use these new frontier benchmarks as indices of capability, but they are more challenging to interpret because the frontier is always moving.