Commit 772b67f: gso-bench
1 parent f895494

5 files changed: 32 additions & 11 deletions

_freeze/posts/2026-01-29-knowledge-creating-llms/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

docs/posts/2026-01-29-knowledge-creating-llms.html

Lines changed: 8 additions & 5 deletions
@@ -302,7 +302,7 @@ <h1>Knowledge-Sharing vs Knowledge-Creating LLMs</h1>
 <p>New LLM-based discovery techniques (e.g.&nbsp;AlphaEvolve (<span class="citation" data-cites="novikov2025alphaevolve">Novikov et al. (<a href="#ref-novikov2025alphaevolve" role="doc-biblioref">2025</a>)</span>), TTT-Discover (<span class="citation" data-cites="yuksekgonul2026learning">Yuksekgonul et al. (<a href="#ref-yuksekgonul2026learning" role="doc-biblioref">2026</a>)</span>)) are distinct from prior AI discovery applications (e.g.&nbsp;AlphaFold, AlphaTensor) in that they are <em>general</em> methods: they can relatively quickly be adapted to any arbitrary optimization problem.</p>
 <p>Knowledge-creating LLMs will differ from knowledge-sharing LLMs in a number of ways:</p>
 <ul>
-<li><p>Knowledge-creating LLMs will have qualitatively different benchmarks: instead of seeing if they can answer questions which we already know the answer to (most existing benchmarks), we want them to answer <em>new</em> questions, e.g.&nbsp;solve an unsolved mathematical problem (<a href="https://epoch.ai/frontiermath/open-problems">FrontierMath Open Problems</a>) or set a new record on an optimization problem (e.g.&nbsp;GSO-bench). We can use these new frontier benchmarks are indices for capability, but they are more challenging to interpret because the frontier is always moving.</p></li>
+<li><p>Knowledge-creating LLMs will have qualitatively different benchmarks: instead of seeing if they can answer questions which we already know the answer to (most existing benchmarks), we want them to answer <em>new</em> questions, e.g.&nbsp;solve an unsolved mathematical problem (<a href="https://epoch.ai/frontiermath/open-problems">FrontierMath Open Problems</a>) or set a new record on an optimization problem (e.g.&nbsp;GSO-bench, <span class="citation" data-cites="shetty2025gso">Shetty et al. (<a href="#ref-shetty2025gso" role="doc-biblioref">2025</a>)</span>). We can use these new frontier benchmarks as indices of capability, but they are more challenging to interpret because the frontier is always moving.</p></li>
 <li><p>Knowledge-creating LLMs have high returns to compute on individual problems, unlike knowledge-sharing LLMs, for which returns asymptote quickly. It can be worth spending billions of tokens to solve a single problem if the solution is generally applicable.</p></li>
 <li><p>Knowledge-creating LLMs will be adopted by leader firms more than by followers.</p></li>
 <li><p>The demand for new knowledge is much less elastic than the demand for existing knowledge because there are high returns to <em>exclusivity</em> of new knowledge. Thus LLM providers are likely to license their technology exclusively rather than expose it through a general-purpose API. Sarah Friar, OpenAI’s CFO, said in <a href="https://openai.com/index/a-business-that-scales-with-the-value-of-intelligence/">January 2026</a>:</p></li>
@@ -340,8 +340,8 @@ <h1>A Visual Illustration</h1>
 </div>
 </div>
 </section>
-<section id="there-are-only-a-dozen-deep-problems-galaxy-brain" class="level1">
-<h1>There are Only a Dozen Deep Problems (Galaxy Brain)</h1>
+<section id="there-are-only-a-dozen-deep-problems" class="level1">
+<h1>There are Only a Dozen Deep Problems</h1>
 <dl>
 <dt>If you squint, a billion problems resolve into just a dozen common problems.</dt>
 <dd>
@@ -358,11 +358,11 @@ <h1>There are Only a Dozen Deep Problems (Galaxy Brain)</h1>
 </dd>
 <dt>(3) Statistical inference problems.</dt>
 <dd>
-For many classes of supervised learning problems there exists an existing “best practice”, e.g.&nbsp;a <a href="https://developer.nvidia.com/blog/the-kaggle-grandmasters-playbook-7-battle-tested-modeling-techniques-for-tabular-data/">recent article</a> says <em>“Over hundreds of Kaggle competitions, we’ve refined a playbook that consistently lands us near the top of the leaderboard”</em>. Thus if you ask an LLM to do statistical inference, it can relatively easily find the existing best-practice, but then it is much harder to advance on that (and if it does, then the solution would be very generally applicable).
+For many classes of supervised learning problems there exists a “best practice”, e.g.&nbsp;a <a href="https://developer.nvidia.com/blog/the-kaggle-grandmasters-playbook-7-battle-tested-modeling-techniques-for-tabular-data/">recent article</a> says <em>“Over hundreds of Kaggle competitions, we’ve refined a playbook that consistently lands us near the top of the leaderboard”</em>.
 </dd>
 <dt>This is a difficulty for LLM benchmarking.</dt>
 <dd>
-<p>The mapping between idiosyncratic and canonical problems is a difficulty for LLM benchmarking. If each problem can be mapped to a canonical problem, and there exists a best-known-algorithm for each of those canonical problems, then it’s difficult to test the model’s intelligence. A reasonably smart LLM will know how to map a new problem into a canonical problem, and will know the textbook best-practice for that canonical problem (XGboost, ARIMA, gaussian process, branch-and-cut, PPO, etc.).</p>
+<p>The mapping between idiosyncratic and canonical problems is a difficulty for LLM benchmarking. If each problem can be mapped to a canonical problem, and there exists a best-known algorithm for each of those canonical problems, then it’s difficult to test the model’s intelligence. A reasonably smart LLM will know how to map a new problem into a canonical problem, and will know the textbook best practice for that canonical problem (XGBoost, ARIMA, Gaussian process, branch-and-cut, PPO, etc.). Thus we either have to devise problems sufficiently weird that it’s difficult to map them to textbook problems, or instead ask LLMs to advance the knowledge frontier on one of the existing canonical problems.</p>
 </dd>
 <dt>Labs will spend a lot on fixed inference, a little on variable inference.</dt>
 <dd>
@@ -439,6 +439,9 @@ <h1>An Economic Model [UNFINISHED]</h1>
 <div id="ref-novikov2025alphaevolve" class="csl-entry" role="listitem">
 Novikov, Alexander, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, et al. 2025. <span>“AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery.”</span> <em>arXiv Preprint</em> arXiv:2506.13131. <a href="https://doi.org/10.48550/arXiv.2506.13131">https://doi.org/10.48550/arXiv.2506.13131</a>.
 </div>
+<div id="ref-shetty2025gso" class="csl-entry" role="listitem">
+Shetty, Manish, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, and Ion Stoica. 2025. <span>“GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents.”</span> <em>arXiv Preprint</em> arXiv:2505.23671. <a href="https://arxiv.org/pdf/2505.23671.pdf">https://arxiv.org/pdf/2505.23671.pdf</a>.
+</div>
 <div id="ref-yuksekgonul2026learning" class="csl-entry" role="listitem">
 Yuksekgonul, Mert, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, et al. 2026. <span>“Learning to Discover at Test Time.”</span> <em>arXiv Preprint</em> arXiv:2601.16175. <a href="https://test-time-training.github.io/discover.pdf">https://test-time-training.github.io/discover.pdf</a>.
 </div>

docs/search.json

Lines changed: 14 additions & 0 deletions
@@ -950,5 +950,19 @@
 "title": "Optimal Coronavirus Policy Should be Front-Loaded",
 "section": "Prior Discussion",
 "text": "Prior Discussion\nThere’s been some discussion of zig-zagging by the Imperial group (paper) and by Timothy Gowers (twitter & post)\nGowers says the optimal policy is very short zig-zags (changing policy every other day), however I think this is misleading. It comes from fixing the lower-threshold and optimizing the upper-threshold. If instead you fixed the upper-threshold and optimized the lower-threshold, then the optimal cycle-length will be long.\nIf you choose both the upper and lower threshold (both T and S) then he notes that they’ll both be arbitarily low. However this ignores the cost of getting to zero given current cases.\nInstead a well-defined problem is to choose an optimal time-path of policy given some start-point and end-point. In that case it’ll be a path of gradually decreasing strictness (without zig-zags).\nYou can see the intuition in the diagram below: the total infections is approximately the area under the zig-zag (not quite: because the y-axis is ln(cases), but this won’t matter for the argument). Thus you can reduce the area under the line by lowering the upper threshold. However if you instead take the upper threshold as fixed, then it’s optimal to choose a lower threshold that is as low as possible, i.e. you want long cycles, not short cycles.\n\n\n\nabc"
+},
+{
+"objectID": "posts/2026-01-29-knowledge-creating-llms.html",
+"href": "posts/2026-01-29-knowledge-creating-llms.html",
+"title": "Knowledge-Creating LLMs",
+"section": "",
+"text": "Thanks to Zoë Hitzig & Parker Whitfill for helpful comments."
+},
+{
+"objectID": "posts/2026-01-29-knowledge-creating-llms.html#footnotes",
+"href": "posts/2026-01-29-knowledge-creating-llms.html#footnotes",
+"title": "Knowledge-Creating LLMs",
+"section": "Footnotes",
+"text": "Footnotes\n\n\nMany other technologies share knowledge – speaking, writing, printing, the internet – LLMs just continue this progression but further lower the costs of sharing.↩︎"
 }
 ]

docs/sitemap.xml

Lines changed: 4 additions & 0 deletions
@@ -120,4 +120,8 @@
 <loc>tecunningham.github.io/posts/2020-04-05-front-loading-restrictions.html</loc>
 <lastmod>2025-10-29T15:46:32.993Z</lastmod>
 </url>
+<url>
+<loc>tecunningham.github.io/posts/2026-01-29-knowledge-creating-llms.html</loc>
+<lastmod>2026-02-05T17:33:55.198Z</lastmod>
+</url>
 </urlset>

posts/2026-01-29-knowledge-creating-llms.qmd

Lines changed: 4 additions & 4 deletions
@@ -89,7 +89,7 @@ Knowledge-creating LLMs.

 Knowledge-creating LLMs will differ from knowledge-sharing LLMs in a number of ways:

-- Knowledge-creating LLMs will have qualitatively different benchmarks: instead of seeing if they can answer questions which we already know the answer to (most existing benchmarks), we want them to answer *new* questions, e.g. solve an unsolved mathematical problem ([FrontierMath Open Problems](https://epoch.ai/frontiermath/open-problems)) or set a new record on an optimization problem (e.g. GSO-bench). We can use these new frontier benchmarks are indices for capability, but they are more challenging to interpret because the frontier is always moving.
+- Knowledge-creating LLMs will have qualitatively different benchmarks: instead of seeing if they can answer questions which we already know the answer to (most existing benchmarks), we want them to answer *new* questions, e.g. solve an unsolved mathematical problem ([FrontierMath Open Problems](https://epoch.ai/frontiermath/open-problems)) or set a new record on an optimization problem (e.g. GSO-bench, @shetty2025gso). We can use these new frontier benchmarks as indices of capability, but they are more challenging to interpret because the frontier is always moving.

 - Knowledge-creating LLMs have high returns to compute on individual problems, unlike knowledge-sharing LLMs for which returns asymptote quickly. It can be worth spending billions of tokens to solve a single problem if the solution is generally applicable.

@@ -247,7 +247,7 @@ We can then illustrate a knowledge-creating LLM as pushing below the human front
 -->


-# There are Only a Dozen Deep Problems (Galaxy Brain)
+# There are Only a Dozen Deep Problems

 If you squint, a billion problems resolve into just a dozen common problems.
 :
@@ -265,12 +265,12 @@ If you squint, a billion problems resolve into just a dozen common problems.

 (3) Statistical inference problems.
 :
-For many classes of supervised learning problems there exists an existing "best practice", e.g. a [recent article][nvidia] says *"Over hundreds of Kaggle competitions, we’ve refined a playbook that consistently lands us near the top of the leaderboard"*. Thus if you ask an LLM to do statistical inference, it can relatively easily find the existing best-practice, but then it is much harder to advance on that (and if it does, then the solution would be very generally applicable).
+For many classes of supervised learning problems there exists a "best practice", e.g. a [recent article][nvidia] says *"Over hundreds of Kaggle competitions, we’ve refined a playbook that consistently lands us near the top of the leaderboard"*.

 This is a difficulty for LLM benchmarking.
 :

-The mapping between idiosyncratic and canonical problems is a difficulty for LLM benchmarking. If each problem can be mapped to a canonical problem, and there exists a best-known-algorithm for each of those canonical problems, then it's difficult to test the model's intelligence. A reasonably smart LLM will know how to map a new problem into a canonical problem, and will know the textbook best-practice for that canonical problem (XGboost, ARIMA, gaussian process, branch-and-cut, PPO, etc.).
+The mapping between idiosyncratic and canonical problems is a difficulty for LLM benchmarking. If each problem can be mapped to a canonical problem, and there exists a best-known algorithm for each of those canonical problems, then it's difficult to test the model's intelligence. A reasonably smart LLM will know how to map a new problem into a canonical problem, and will know the textbook best practice for that canonical problem (XGBoost, ARIMA, Gaussian process, branch-and-cut, PPO, etc.). Thus we either have to devise problems sufficiently weird that it's difficult to map them to textbook problems, or instead ask LLMs to advance the knowledge frontier on one of the existing canonical problems.

 Labs will spend a lot on fixed inference, a little on variable inference.
 :
