Behemoth delay was due to performance concerns, not safety concerns.
Added active vs total parameter counts, MoE expert counts, and source links.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
-<p><a href="https://www.anthropic.com/research/reward-tampering">Anthropic's reward tampering research</a> (<a href="https://www.anthropic.com/research/reward-tampering">Denison et al., 2024</a>) revealed that training models to be sycophantic (agreeable) produced unexpected <strong>emergent dangerous behaviors</strong>:</p>
+<p>Anthropic's reward tampering research (<a href="https://www.anthropic.com/research/reward-tampering">Denison et al., 2024</a>) revealed that training models to be sycophantic (agreeable) produced unexpected <strong>emergent dangerous behaviors</strong>:</p>
 </div>
 <div class="warning-box" data-title="What emerged without explicit training">
 <p>Models trained to agree with users spontaneously learned to:</p>
@@ -768,13 +769,13 @@ <h1 id="from-deception-to-detection">From deception to detection</h1>
 <h1 id="mechanistic-interpretability-seeing-inside-the-black-box">Mechanistic interpretability: seeing inside the black box</h1>
-<div class="definition-box" data-title="From neurons to features to circuits">
+<div class="note-box" data-title="From neurons to features to circuits">
 <p>The core problem: individual neurons in LLMs respond to many unrelated concepts (<strong>polysemanticity</strong>), making them uninterpretable. The breakthrough: decompose activations into sparse, meaningful <strong>features</strong>.</p>
 <p>A <strong>monosemantic</strong> feature responds to exactly one concept (e.g., "Golden Gate Bridge" or "deception"). Sparse autoencoders (SAEs) can decompose polysemantic neurons into monosemantic features — giving us interpretable building blocks for understanding what a model represents internally (<a href="https://transformer-circuits.pub/2023/monosemantic-features/">Bricken et al., 2023</a>).</p>
@@ -802,14 +803,14 @@ <h1 id="mechanistic-interpretability-seeing-inside-the-black-box">Mechanistic in
 </tbody>
 </table>
 </div>
-<div class="important-box" data-title="What this means">
+<div class="tip-box" data-title="What this means">
 <p>We can now trace <em>why</em> a model produces a specific output — which features activated, how they connected, and what computation they performed. This is like going from knowing a brain region is "active" to tracing the actual neural circuit.</p>
 <div class="definition-box" data-title="How it works">
-<p><a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html">Circuit tracing</a> (<a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html">Lindsey et al., 2025</a>) introduced <strong>Cross-Layer Transcoders (CLTs)</strong> — a new architecture that creates an interpretable replacement model, producing <strong>attribution graphs</strong>:</p>
+<p>Circuit tracing (<a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html">Lindsey et al., 2025</a>) introduced <strong>Cross-Layer Transcoders (CLTs)</strong> — a new architecture that creates an interpretable replacement model, producing <strong>attribution graphs</strong>.</p>
 <p>"The broader labor market has <strong>not experienced a discernible disruption</strong> since ChatGPT's release." Fewer than 10% of US firms use AI regularly as of mid-2025. (<a href="https://budgetlab.yale.edu/research/ai-and-macroeconomy-what-economics-literature-can-tell-us">Yale Budget Lab, 2025</a>)</p>
 </div>
 <div class="warning-box" data-title="But look closer at white-collar work">
 <ul>
 <li><a href="https://www.challengergray.com/blog/2025-year-end-challenger-report-highest-q4-layoffs-since-2008-lowest-ytd-hiring-since-2010/"><strong>~55,000 US layoffs</strong></a> directly attributed to AI in 2025 (<a href="https://www.challengergray.com/blog/2025-year-end-challenger-report-highest-q4-layoffs-since-2008-lowest-ytd-hiring-since-2010/">Challenger, Gray & Christmas, 2025</a>)</li>
-<li>US employers announced <strong>696,309 total job cuts</strong> in the first 5 months of 2025 — up 80% year-over-year</li>
+<li>US employers announced <a href="https://www.investopedia.com/job-cuts-reach-highest-levels-since-pandemic-11749055"><strong>696,309 total job cuts</strong></a> in the first 5 months of 2025 — up 80% year-over-year</li>
 <li><strong>79% of employed US women</strong> work in high-automation-risk jobs vs. 58% of men (<a href="https://kenaninstitute.unc.edu/kenan-insight/will-generative-ai-disproportionately-affect-the-jobs-of-women/">Kenan Institute, 2023</a>)</li>
-<li>McKinsey laid off 200 tech employees; uses AI agents for junior consultant tasks</li>
-<li>Salesforce cut 4,000 customer support roles</li>
+<li>McKinsey laid off <a href="https://financialpost.com/fp-work/mckinsey-thousands-layoffs-consulting-slowdown">200 tech employees</a>; uses AI agents for junior consultant tasks</li>
+<li>Salesforce cut <a href="https://www.cnbc.com/2025/09/02/salesforce-ceo-confirms-4000-layoffs-because-i-need-less-heads-with-ai.html">4,000 customer support roles</a></li>
-<p><a href="https://hbr.org/2026/01/companies-are-laying-off-workers-because-of-ais-potential-not-its-performance">Harvard Business Review</a> found companies are laying off workers based on AI's <strong><em>anticipated</em> future performance</strong>, not current displacement — a forward-looking disruption pattern unlike prior automation waves (<a href="https://hbr.org/2026/01/companies-are-laying-off-workers-because-of-ais-potential-not-its-performance">HBR, January 2026</a>).</p>
+<p>Harvard Business Review found that companies are laying off workers based on AI's <strong><em>anticipated</em> future performance</strong>, not current displacement — a forward-looking disruption pattern unlike prior automation waves (<a href="https://hbr.org/2026/01/companies-are-laying-off-workers-because-of-ais-potential-not-its-performance">HBR, January 2026</a>).</p>
 <p><a href="https://artificialintelligenceact.eu/"><strong>EU AI Act</strong></a> — risk-based framework, first provisions active since February 2, 2025:</p>
 <li><strong>Status</strong>: Rules are live but no enforcement actions yet (as of Feb 2026). Finland became the first active national enforcer (Jan 2026). Full high-risk compliance required by August 2026.</li>
 </ul>
 </div>
-<div class="note-box" data-title="United States: deregulate and compete">
+<div class="important-box" data-title="United States: deregulate and compete">
-<li><strong>Day 1</strong>: Revoked Biden's October 2023 AI safety executive order</li>
-<li><strong>January 2025</strong>: Signed EO 14179 — "Removing Barriers to American Leadership in AI"</li>
-<li><strong>December 2025</strong>: Signed EO seeking <strong>federal preemption of state AI laws</strong> — states with "onerous AI laws" lose federal broadband funding</li>
+<li><strong>Day 1</strong>: Revoked Biden's October 2023 AI safety executive order by signing an <a href="https://www.whitehouse.gov/presidential-actions/2025/01/removing-barriers-to-american-leadership-in-artificial-intelligence/">EO</a> "Removing Barriers to American Leadership in AI"</li>
+<li><strong>December 2025</strong>: Signed <a href="https://www.whitehouse.gov/presidential-actions/2025/12/eliminating-state-law-obstruction-of-national-artificial-intelligence-policy/">EO</a> seeking <strong>federal preemption of state AI laws</strong> — states with "onerous AI laws" lose federal broadband funding</li>
 <li>No comprehensive federal AI legislation has passed Congress</li>
 </ul>
 </div>
@@ -904,7 +904,7 @@ <h1 id="deepfakes-and-elections-the-2024-test">Deepfakes and elections: the 2024
 <ul>
 <li><a href="https://www.fcc.gov/document/fcc-issues-6m-fine-nh-robocalls"><strong>AI robocalls</strong></a> impersonated Biden urging NH voters not to vote (creator fined <strong>$6M</strong>, criminally indicted)</li>
 <li><a href="https://blogs.microsoft.com/on-the-issues/2024/10/23/as-the-u-s-election-nears-russia-iran-and-china-step-up-influence-efforts/"><strong>Storm-1516</strong></a> network created deepfake videos of candidates — one shared by Elon Musk</li>
-<li><strong>India</strong>: Celebrity deepfakes criticizing Modi went viral on WhatsApp</li>
+<li><strong>India</strong>: <a href="https://www.reuters.com/world/india/deepfakes-bollywood-stars-spark-worries-ai-meddling-india-election-2024-04-22/">Celebrity deepfakes criticizing Modi</a> went viral on WhatsApp</li>
 <li><strong>Germany</strong>: <a href="https://www.isdglobal.org/digital-dispatch/coordinated-disinformation-network-uses-ai-media-impersonation-to-target-german-election/">100+ AI-powered websites</a> distributing deepfakes ahead of elections</li>
 </ul>
 </div>
@@ -915,42 +915,47 @@ <h1 id="deepfakes-and-elections-the-2024-test">Deepfakes and elections: the 2024
 <p>The risk moved from individual deepfakes to <strong>AI-powered disinformation infrastructure</strong> — chatbots, automated accounts, and poisoned information ecosystems that operate continuously.</p>
-<p>Meta's potential decision to withhold Behemoth represents a critical acknowledgment: <strong>there may be a capability level above which open release is irresponsible</strong>. Safety fine-tuning on open-weight models can be stripped with modest compute. Once released, weights cannot be recalled.</p>
+<p>Behemoth's delay was driven by <a href="https://www.computerworld.com/article/3987990/meta-hits-pause-on-llama-4-behemoth-ai-model-amid-capability-concerns.html">benchmark performance falling short of internal targets</a>, not safety concerns. But the broader question remains: <strong>is there a capability level above which open release is irresponsible?</strong> Safety fine-tuning on open-weight models can be stripped with modest compute. Once released, weights cannot be recalled.</p>
 </div>
 <div class="tip-box" data-title="The tension">
-<p>Advocates of openness emphasize auditability, democratization, and preventing power concentration. Critics note that the DeepSeek-R1 distillation experiment showed a $300K RL run can produce frontier reasoning from open-weight base models. Is openness still net positive at the frontier?</p>
+<p>Advocates of openness emphasize auditability, democratization, and preventing power concentration. Critics note that the <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1</a> distillation experiment showed a $300K RL run can produce frontier reasoning from open-weight base models. Is openness still net positive at the frontier?</p>