docs/posts/2026-01-29-knowledge-creating-llms.html: 8 additions & 5 deletions
@@ -302,7 +302,7 @@ <h1>Knowledge-Sharing vs Knowledge-Creating LLMs</h1>
<p>New LLM-based discovery techniques (e.g. AlphaEvolve (<span class="citation" data-cites="novikov2025alphaevolve">Novikov et al. (<a href="#ref-novikov2025alphaevolve" role="doc-biblioref">2025</a>)</span>), TTT-Discover (<span class="citation" data-cites="yuksekgonul2026learning">Yuksekgonul et al. (<a href="#ref-yuksekgonul2026learning" role="doc-biblioref">2026</a>)</span>)) are distinct from prior AI discovery applications (e.g. AlphaFold, AlphaTensor) in that they are <em>general</em> methods: they can be adapted relatively quickly to arbitrary optimization problems.</p>
<p>Knowledge-creating LLMs will differ from knowledge-sharing LLMs in a number of ways:</p>
<ul>
-
<li><p>Knowledge-creating LLMs will have qualitatively different benchmarks: instead of seeing if they can answer questions we already know the answer to (most existing benchmarks), we want them to answer <em>new</em> questions, e.g. solve an unsolved mathematical problem (<a href="https://epoch.ai/frontiermath/open-problems">FrontierMath Open Problems</a>) or set a new record on an optimization problem (e.g. GSO-bench). We can use these new frontier benchmarks as indices of capability, but they are more challenging to interpret because the frontier is always moving.</p></li>
+
<li><p>Knowledge-creating LLMs will have qualitatively different benchmarks: instead of seeing if they can answer questions we already know the answer to (most existing benchmarks), we want them to answer <em>new</em> questions, e.g. solve an unsolved mathematical problem (<a href="https://epoch.ai/frontiermath/open-problems">FrontierMath Open Problems</a>) or set a new record on an optimization problem (e.g. GSO-bench, <span class="citation" data-cites="shetty2025gso">Shetty et al. (<a href="#ref-shetty2025gso" role="doc-biblioref">2025</a>)</span>). We can use these new frontier benchmarks as indices of capability, but they are more challenging to interpret because the frontier is always moving.</p></li>
<li><p>Knowledge-creating LLMs have high returns to compute on individual problems, unlike knowledge-sharing LLMs for which returns asymptote quickly. It can be worth spending billions of tokens to solve a single problem if the solution is generally applicable.</p></li>
<li><p>Knowledge-creating LLMs will be adopted by leader firms more than followers.</p></li>
<li><p>The demand for new knowledge is much less elastic than the demand for existing knowledge because there are high returns to <em>exclusivity</em> of new knowledge. Thus LLM-providers are likely to license their technology exclusively rather than expose it through a general-purpose API. Sarah Friar, OpenAI’s CFO, said in <a href="https://openai.com/index/a-business-that-scales-with-the-value-of-intelligence/">January 2026</a>:</p></li>
<dt>If you squint, a billion problems resolve into just a dozen common problems.</dt>
<dd>
@@ -358,11 +358,11 @@ <h1>There are Only a Dozen Deep Problems (Galaxy Brain)</h1>
</dd>
<dt>(3) Statistical inference problems.</dt>
<dd>
-
For many classes of supervised learning problems there exists a “best practice”, e.g. a <a href="https://developer.nvidia.com/blog/the-kaggle-grandmasters-playbook-7-battle-tested-modeling-techniques-for-tabular-data/">recent article</a> says <em>“Over hundreds of Kaggle competitions, we’ve refined a playbook that consistently lands us near the top of the leaderboard”</em>. Thus if you ask an LLM to do statistical inference, it can relatively easily find the existing best-practice, but then it is much harder to advance on that (and if it does, then the solution would be very generally applicable).
+
For many classes of supervised learning problems there exists a “best practice”, e.g. a <a href="https://developer.nvidia.com/blog/the-kaggle-grandmasters-playbook-7-battle-tested-modeling-techniques-for-tabular-data/">recent article</a> says <em>“Over hundreds of Kaggle competitions, we’ve refined a playbook that consistently lands us near the top of the leaderboard”</em>.
</dd>
<dt>This is a difficulty for LLM benchmarking.</dt>
<dd>
-
<p>The mapping between idiosyncratic and canonical problems is a difficulty for LLM benchmarking. If each problem can be mapped to a canonical problem, and there exists a best-known algorithm for each of those canonical problems, then it’s difficult to test the model’s intelligence. A reasonably smart LLM will know how to map a new problem into a canonical problem, and will know the textbook best-practice for that canonical problem (XGBoost, ARIMA, Gaussian process, branch-and-cut, PPO, etc.).</p>
+
<p>The mapping between idiosyncratic and canonical problems is a difficulty for LLM benchmarking. If each problem can be mapped to a canonical problem, and there exists a best-known algorithm for each of those canonical problems, then it’s difficult to test the model’s intelligence. A reasonably smart LLM will know how to map a new problem into a canonical problem, and will know the textbook best-practice for that canonical problem (XGBoost, ARIMA, Gaussian process, branch-and-cut, PPO, etc.). Thus we either have to devise problems sufficiently weird that it’s difficult to map them to textbook problems, or instead ask LLMs to advance the knowledge frontier on one of the existing canonical problems.</p>
</dd>
<dt>Labs will spend a lot on fixed inference, a little on variable inference.</dt>
<dd>
@@ -439,6 +439,9 @@ <h1>An Economic Model [UNFINISHED]</h1>
Shetty, Manish, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, and Ion Stoica. 2025. <span>“GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents.”</span> <a href="https://arxiv.org/pdf/2505.23671.pdf">https://arxiv.org/pdf/2505.23671.pdf</a>.
Yuksekgonul, Mert, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, et al. 2026. <span>“Learning to Discover at Test Time.”</span> <em>arXiv Preprint</em> arXiv:2601.16175. <a href="https://test-time-training.github.io/discover.pdf">https://test-time-training.github.io/discover.pdf</a>.
docs/search.json: 14 additions & 0 deletions
@@ -950,5 +950,19 @@
"title": "Optimal Coronavirus Policy Should be Front-Loaded",
"section": "Prior Discussion",
"text": "Prior Discussion\nThere’s been some discussion of zig-zagging by the Imperial group (paper) and by Timothy Gowers (twitter & post)\nGowers says the optimal policy is very short zig-zags (changing policy every other day), however I think this is misleading. It comes from fixing the lower-threshold and optimizing the upper-threshold. If instead you fixed the upper-threshold and optimized the lower-threshold, then the optimal cycle-length will be long.\nIf you choose both the upper and lower threshold (both T and S) then he notes that they’ll both be arbitrarily low. However this ignores the cost of getting to zero given current cases.\nInstead a well-defined problem is to choose an optimal time-path of policy given some start-point and end-point. In that case it’ll be a path of gradually decreasing strictness (without zig-zags).\nYou can see the intuition in the diagram below: the total infections is approximately the area under the zig-zag (not quite: because the y-axis is ln(cases), but this won’t matter for the argument). Thus you can reduce the area under the line by lowering the upper threshold. However if you instead take the upper threshold as fixed, then it’s optimal to choose a lower threshold that is as low as possible, i.e. you want long cycles, not short cycles.\n\n\n\nabc"
"text": "Footnotes\n\n\nMany other technologies share knowledge – speaking, writing, printing, the internet – LLMs just continue this progression but further lower the costs of sharing.↩︎"
posts/2026-01-29-knowledge-creating-llms.qmd: 4 additions & 4 deletions
@@ -89,7 +89,7 @@ Knowledge-creating LLMs.
Knowledge-creating LLMs will differ from knowledge-sharing LLMs in a number of ways:
-
- Knowledge-creating LLMs will have qualitatively different benchmarks: instead of seeing if they can answer questions we already know the answer to (most existing benchmarks), we want them to answer *new* questions, e.g. solve an unsolved mathematical problem ([FrontierMath Open Problems](https://epoch.ai/frontiermath/open-problems)) or set a new record on an optimization problem (e.g. GSO-bench). We can use these new frontier benchmarks as indices of capability, but they are more challenging to interpret because the frontier is always moving.
+
- Knowledge-creating LLMs will have qualitatively different benchmarks: instead of seeing if they can answer questions we already know the answer to (most existing benchmarks), we want them to answer *new* questions, e.g. solve an unsolved mathematical problem ([FrontierMath Open Problems](https://epoch.ai/frontiermath/open-problems)) or set a new record on an optimization problem (e.g. GSO-bench, @shetty2025gso). We can use these new frontier benchmarks as indices of capability, but they are more challenging to interpret because the frontier is always moving.
- Knowledge-creating LLMs have high returns to compute on individual problems, unlike knowledge-sharing LLMs for which returns asymptote quickly. It can be worth spending billions of tokens to solve a single problem if the solution is generally applicable.
@@ -247,7 +247,7 @@ We can then illustrate a knowledge-creating LLM as pushing below the human front
-->
-
# There are Only a Dozen Deep Problems (Galaxy Brain)
+
# There are Only a Dozen Deep Problems
If you squint, a billion problems resolve into just a dozen common problems.
:
@@ -265,12 +265,12 @@ If you squint, a billion problems resolve into just a dozen common problems.
(3) Statistical inference problems.
:
-
For many classes of supervised learning problems there exists a "best practice", e.g. a [recent article][nvidia] says *"Over hundreds of Kaggle competitions, we’ve refined a playbook that consistently lands us near the top of the leaderboard"*. Thus if you ask an LLM to do statistical inference, it can relatively easily find the existing best-practice, but then it is much harder to advance on that (and if it does, then the solution would be very generally applicable).
+
For many classes of supervised learning problems there exists a "best practice", e.g. a [recent article][nvidia] says *"Over hundreds of Kaggle competitions, we’ve refined a playbook that consistently lands us near the top of the leaderboard"*.
This is a difficulty for LLM benchmarking.
:
-
The mapping between idiosyncratic and canonical problems is a difficulty for LLM benchmarking. If each problem can be mapped to a canonical problem, and there exists a best-known algorithm for each of those canonical problems, then it's difficult to test the model's intelligence. A reasonably smart LLM will know how to map a new problem into a canonical problem, and will know the textbook best-practice for that canonical problem (XGBoost, ARIMA, Gaussian process, branch-and-cut, PPO, etc.).
+
The mapping between idiosyncratic and canonical problems is a difficulty for LLM benchmarking. If each problem can be mapped to a canonical problem, and there exists a best-known algorithm for each of those canonical problems, then it's difficult to test the model's intelligence. A reasonably smart LLM will know how to map a new problem into a canonical problem, and will know the textbook best-practice for that canonical problem (XGBoost, ARIMA, Gaussian process, branch-and-cut, PPO, etc.). Thus we either have to devise problems sufficiently weird that it's difficult to map them to textbook problems, or instead ask LLMs to advance the knowledge frontier on one of the existing canonical problems.
Labs will spend a lot on fixed inference, a little on variable inference.