Commit 0aa4cd1

Author: MarceloClaro
cross-correlation: Superhuman/Aletheia x OpenCode — 12 dimensions
1 parent 2a3165d commit 0aa4cd1

2 files changed

Lines changed: 787 additions & 0 deletions

Lines changed: 223 additions & 0 deletions
@@ -0,0 +1,223 @@
# Cross-Correlation Report: Superhuman/Aletheia × OpenCode Ecosystem

## Feng et al. (2026) "Towards Autonomous Mathematics Research" vs OpenCode v4.3.0

**Generated:** 2026-05-30T13:54:22.203258
**References:** arXiv:2602.10177v3 | github.com/google-deepmind/superhuman | github.com/MarceloClaro/OpenCode_Ecosystem

---

## Executive Summary

| Metric | Value |
|--------|-------|
| Total dimensions compared | 12 |
| Direct matches | 1 |
| **OpenCode superior** | **7** (58%) |
| Aletheia superior | 1 (8%) |
| Complementary | 3 |
| Avg OpenCode superiority score | 0.9 |
| Avg Aletheia superiority score | 0.3 |

**Key finding:** OpenCode matches or exceeds Aletheia in 8 of 12 dimensions (67%): the 7 where it is rated superior plus 1 direct match. The critical gap is the scale of the foundation model (Gemini Deep Think).

---

## Correlation Matrix

| # | Aletheia Component | OpenCode Component | Match | Score |
|:--:|---------------------|---------------------|:-----:|:-----:|
| 1 | Aletheia G-V-R Loop | SPEC-012 Aletheia Engine | 🟰 | 0.85 |
| 2 | Informal Verifier | Cora-Debate V1-V7 + SPEC-008 Triangulation | 🟢 | 0.90 |
| 3 | Gemini Deep Think (implicit) | Reasoning Orchestrator v11 (212 explicit types) | 🟡 | 0.70 |
| 4 | 3 tools (Search, Browse, Python) | 18 MCPs + code-runner + playwright | 🟢 | 0.75 |
| 5 | IMO Bench + FirstProof + FutureMath + Erdos | CORA-Eval D1-D9 + Domain-Shift + Olympiad | 🟡 | 0.65 |
| 6 | H/C/A × 0-4 taxonomy (Feng §6.1) | Layers C1/C1B/C2/C3 (SPEC-008) | 🟡 | 0.55 |
| 7 | Single-use problem (acknowledged in §4) | SPEC-008 Triangulation (3 layers) | 🟢 | 0.95 |
| 8 | Pure mathematics only | 6+ domains (legal, physics, methodology, art, economics) | 🟢 | 0.95 |
| 9 | Paper + prompts on GitHub | TDD + seed + hash + sync mirror, 100% auditable | 🟢 | 0.95 |
| 10 | Not addressed | SPEC-008-B Layer 1B (bootstrap Jaccard, 9 CTs) | 🟢 | 0.98 |
| 11 | Reduction via tool use (Search) | Cora V4 + 6 detection patterns + citation verification | 🟢 | 0.80 |
| 12 | Gemini Deep Think (proprietary, massive scale) | OpenCode (models accessible via API) | 🔵 | 0.30 |

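The Executive Summary figures can be re-derived from the matrix. The following is a minimal sketch, not the actual `cross_correlation.py`: the `ROWS` list, the ASCII outcome labels, and the `summarize` helper are hypothetical names introduced here, and reading each "Avg superiority score" as the mean score over the rows won by that side is an assumption (it does reproduce the reported 0.9 and 0.3).

```python
from collections import Counter

# (row number, outcome, score) transcribed from the Correlation Matrix.
# Outcome labels are illustrative stand-ins for the match symbols.
ROWS = [
    (1, "match", 0.85), (2, "opencode", 0.90), (3, "complementary", 0.70),
    (4, "opencode", 0.75), (5, "complementary", 0.65), (6, "complementary", 0.55),
    (7, "opencode", 0.95), (8, "opencode", 0.95), (9, "opencode", 0.95),
    (10, "opencode", 0.98), (11, "opencode", 0.80), (12, "aletheia", 0.30),
]


def summarize(rows):
    """Tally outcomes and average the scores of the rows each side wins."""
    counts = Counter(outcome for _, outcome, _ in rows)
    total = len(rows)

    def mean_score(outcome):
        scores = [score for _, o, score in rows if o == outcome]
        return round(sum(scores) / len(scores), 1)

    return {
        "counts": dict(counts),
        "pct_opencode": round(100 * counts["opencode"] / total),
        "pct_aletheia": round(100 * counts["aletheia"] / total),
        # "Matches or exceeds" = OpenCode-superior rows plus direct matches.
        "pct_match_or_exceed": round(
            100 * (counts["opencode"] + counts["match"]) / total
        ),
        "avg_opencode": mean_score("opencode"),
        "avg_aletheia": mean_score("aletheia"),
    }


summary = summarize(ROWS)
print(summary)
```

Under that reading, the 7 OpenCode-superior scores sum to 6.28, which gives the rounded average of 0.9 shown in the summary table.
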

---

## Detailed Analysis

### 🟢 OpenCode Advantages (7 dimensions)

**Informal Verifier vs Cora-Debate V1-V7 + SPEC-008 Triangulation** (score: 0.90)
> Aletheia uses an informal verifier; OpenCode has 7 symbolic verifiers plus 3 anti-circularity layers.
> OpenCode advantage: 7 explicit verifiers (vs 1 implicit); decoupled self-critique; anti-circular triangulation.
> Aletheia limitation: (not specified)

**3 tools (Search, Browse, Python) vs 18 MCPs + code-runner + playwright** (score: 0.75)
> OpenCode has 6x more active tools (18 vs 3), covering domains beyond mathematics.
> OpenCode advantage: 18 multi-purpose MCPs (vs 3 tools); isolated sandbox; local SQLite; PDF toolkit.
> Aletheia limitation: Deep Google Search integration (model trained for tool use).

**Single-use problem (acknowledged in §4) vs SPEC-008 Triangulation (3 layers)** (score: 0.95)
> Aletheia acknowledges the "single use" problem but does not solve it; OpenCode has a complete framework for it.
> OpenCode advantage: mathematical framework for breaking circularity; domain-shift detection; bootstrap calibration.
> Aletheia limitation: (not specified)

**Pure mathematics only vs 6+ domains (legal, physics, methodology, art, economics)** (score: 0.95)
> Aletheia was designed exclusively for mathematics; OpenCode covers multiple scientific domains.
> OpenCode advantage: 6 domains, each with its own TDD suite; integration with public calls for proposals, art therapy, and CORA-Eval.
> Aletheia limitation: (not specified)

**Paper + prompts on GitHub vs TDD + seed + hash + sync mirror, 100% auditable** (score: 0.95)
> Aletheia publishes prompts and outputs but has no automated tests; OpenCode has full TDD.
> OpenCode advantage: 71 automated tests; fixed seed; verifiable hash; identical clone via sync mirror.
> Aletheia limitation: (not specified)

**Not addressed vs SPEC-008-B Layer 1B (bootstrap Jaccard, 9 CTs)** (score: 0.98)
> Aletheia does not address domain shift between problems or domains; OpenCode has a dedicated framework.
> OpenCode advantage: institutional decomposition; 3 Jaccard deltas; bootstrap calibration; 9 TDD CTs.
> Aletheia limitation: (not specified)

**Reduction via tool use (Search) vs Cora V4 + 6 detection patterns + citation verification** (score: 0.80)
> Aletheia reduces hallucinations via Search but does not detect them systematically.
> OpenCode advantage: 6 detection patterns; V4 Citation Accuracy check; score penalty.
> Aletheia limitation: Google Search integrated as a native tool of the base model.

### 🔵 Aletheia Advantages (1 dimension)

**Gemini Deep Think (proprietary, massive scale) vs OpenCode (models accessible via API)** (score: 0.30)
> Fundamental gap: Deep Think has a scale and training that public models do not reach.
> IMO-Gold (5/6); 100x compute reduction; proprietary inference-time scaling law.

### 🟡 Complementary (3 dimensions)

**Gemini Deep Think (implicit) vs Reasoning Orchestrator v11 (212 explicit types)** (score: 0.70)
> Deep Think has massive scale but implicit reasoning; OpenCode has an explicit taxonomy of 212 types but limited scale.

**IMO Bench + FirstProof + FutureMath + Erdos vs CORA-Eval D1-D9 + Domain-Shift + Olympiad** (score: 0.65)
> Different benchmarks: Aletheia focuses on pure mathematics; OpenCode covers 9 disciplines plus methodology.

**H/C/A × 0-4 taxonomy (Feng §6.1) vs Layers C1/C1B/C2/C3 (SPEC-008)** (score: 0.55)
> Different systems: Aletheia classifies the final result; OpenCode classifies the validation process.



---

## Component-by-Component Mapping

### Aletheia Components → OpenCode Equivalents

#### Aletheia Agent Architecture
- **Paper:** §2, Figure 1
- **Results:** 93% IMO-Proof Bench Advanced, 82% FutureMath Basic (conditional)
- **Key Features:** Generator: natural-language solution; Verifier: informal verification mechanism; Reviser: iterative correction...

#### Gemini Deep Think
- **Paper:** §2.1, Figure 2
- **Results:** IMO-Proof Bench 30 problems, FutureMath Basic (internal)
- **Key Features:** Scalability: 100x compute reduction (Jan 2026 vs Jul 2025); Parallelism: simultaneous exploration of ideas; Ph.D.-level transfer: scaling law transfers to exercises...

#### Tool Integration
- **Paper:** §2.3, Figures 3-4
- **Results:** Fewer fictitious citations; subtle errors persist
- **Key Features:** Google Search: reduces hallucinated citations; Web browsing: navigation of the mathematical literature; Python: marginal gains (model already proficient)...

#### Research Milestones
- **Paper:** §3, Table 1
- **Results:** 212/700 Erdos candidates; 4 confirmed as new
- **Key Features:** Feng26: 100% autonomous (Level A2), Eigenweights; LeeSeo26: Human-AI (Level C2), Independence Polynomials; BKKKZ26: generalization of Erdos-1051 (Level C2)...

#### Autonomy & Significance Taxonomy
- **Paper:** §6.1, Tables 8-9
- **Results:** A2 = Feng26, C2 = LeeSeo26/BKKKZ26, H2 = FYZ26/ACGKMP26
- **Key Features:** Axis 1: H (Human-primary), C (Collaboration), A (Autonomous); Axis 2: 0 (Negligible), 1 (Minor), 2 (Publishable), 3 (Major), 4 (Landmark); HAI Cards: transparent documentation of human-AI interaction...

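The two-axis labels used throughout this report (A2, C2, H2) combine one letter from Axis 1 with one digit from Axis 2. A minimal sketch of that encoding follows; the class names `Autonomy`, `Significance`, and `Level` are hypothetical, chosen here for illustration, and only the axis values and the paper-to-level assignments come from the entries above.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Autonomy(Enum):
    """Axis 1: who did the primary work."""
    H = "Human-primary"
    C = "Collaboration"
    A = "Autonomous"


class Significance(IntEnum):
    """Axis 2: significance of the result, 0 to 4."""
    NEGLIGIBLE = 0
    MINOR = 1
    PUBLISHABLE = 2
    MAJOR = 3
    LANDMARK = 4


@dataclass(frozen=True)
class Level:
    autonomy: Autonomy
    significance: Significance

    @property
    def code(self) -> str:
        # Compact label: axis-1 letter followed by axis-2 digit, e.g. "A2".
        return f"{self.autonomy.name}{int(self.significance)}"

    @classmethod
    def parse(cls, code: str) -> "Level":
        # Raises KeyError / ValueError for labels outside the taxonomy.
        return cls(Autonomy[code[0]], Significance(int(code[1:])))


# Assignments as listed under "Results" above.
RESULTS = {
    "Feng26": "A2", "LeeSeo26": "C2", "BKKKZ26": "C2",
    "FYZ26": "H2", "ACGKMP26": "H2",
}

for paper, code in RESULTS.items():
    level = Level.parse(code)
    print(paper, level.code, level.autonomy.value, level.significance.name)
```

Keeping the two axes as separate types makes it harder to mix up, say, "how autonomous" with "how significant" when comparing results.
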
#### Evaluation Benchmarks
- **Paper:** §2, §4, §3.3
- **Results:** FirstProof: Aletheia 7/10; GPT 5.2 Pro 2/10 baseline
- **Key Features:** IMO-AnswerBench: 400 short-answer problems; IMO-ProofBench: 60 proof-based problems; IMO-GradingBench: 1000 human gradings...


### OpenCode Components → Aletheia Equivalents

#### Aletheia Math Research Engine (SPEC-012)
- **Spec:** SPEC-012
- **Results:** 5/5 solved (100%), avg 1.0 attempts, max L1_MINOR
- **Key Features:** Generator: 16 reasoning types with adaptive per-domain selection; Verifier: Cora-Debate V1-V7 (7/7 checks) + hallucination detection (6 patterns); Reviser: feedback loop with a budget of 10 attempts...

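The Generator-Verifier-Reviser cycle with an attempt budget can be sketched as a small control loop. This is an illustrative sketch under stated assumptions, not the SPEC-012 implementation: `gvr_loop` and the three callables are hypothetical names, and the toy square-root problem stands in for a real generator and verifier. Only the loop shape and the budget of 10 attempts come from the entry above.

```python
from typing import Any, Callable, Optional, Tuple


def gvr_loop(
    problem: Any,
    generate: Callable[[Any], Any],
    verify: Callable[[Any, Any], Tuple[bool, str]],
    revise: Callable[[Any, Any, str], Any],
    budget: int = 10,
) -> Tuple[Optional[Any], int]:
    """Return (solution, attempts_used), or (None, budget) if nothing verifies."""
    candidate = generate(problem)
    for attempt in range(1, budget + 1):
        ok, feedback = verify(problem, candidate)
        if ok:
            return candidate, attempt
        # The reviser sees the verifier's feedback, not just a pass/fail bit.
        candidate = revise(problem, candidate, feedback)
    return None, budget


# Toy stand-ins (assumptions): find the integer square root of `n`.
def toy_generate(n: int) -> int:
    return 1


def toy_verify(n: int, x: int) -> Tuple[bool, str]:
    if x * x == n:
        return True, "ok"
    return False, "too small" if x * x < n else "too large"


def toy_revise(n: int, x: int, feedback: str) -> int:
    return x + 1 if feedback == "too small" else x - 1


print(gvr_loop(49, toy_generate, toy_verify, toy_revise))            # (7, 7)
print(gvr_loop(49, toy_generate, toy_verify, toy_revise, budget=3))  # (None, 3)
```

Returning the attempt count alongside the solution is what makes a statistic such as "avg 1.0 attempts" cheap to report.
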
#### Cora-Debate V1-V7
- **Spec:** N/A
- **Results:** 7/7 checks integrated into the Aletheia Verifier
- **Key Features:** V1: Logical Consistency; V2: Mathematical Correctness; V3: Edge Case Coverage...

#### Reasoning Orchestrator v11
- **Spec:** N/A
- **Results:** 212+ types mapped and documented
- **Key Features:** 68 base types + 10 Game Theory + extensions; 12 categories (logic, dialectics, strategy, innovation, etc.); 7-phase pipeline with specialized agents...

#### Anti-Circularity Triangulation (SPEC-008 + 008-B)
- **Spec:** SPEC-008, SPEC-008-B
- **Results:** 14/14 TDD, domain-shift P95=0.215, P99=0.279
- **Key Features:** Layer 1: blind temporal split (Bergmeir 2012, Cerqueira 2020); Layer 1B: domain-shift detection (bootstrap Jaccard); Layer 2: adversarial perturbation (4 transformations)...

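One way to read "bootstrap Jaccard" with P95/P99 thresholds is: resample an in-domain corpus many times, measure the Jaccard distance between the vocabularies of two resamples, and take the 95th and 99th percentiles of those distances as the level of variation expected without any shift. The sketch below follows that reading only; the report does not describe the SPEC-008-B procedure, so the tokenization, the resampling scheme, and every function name and document here are assumptions.

```python
import math
import random


def jaccard_distance(a: set, b: set) -> float:
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def vocabulary(docs):
    """Lower-cased whitespace tokens of all documents (a deliberately crude choice)."""
    vocab = set()
    for doc in docs:
        vocab.update(doc.lower().split())
    return vocab


def percentile(sorted_values, q: int) -> float:
    # Nearest-rank percentile on an already sorted list.
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def bootstrap_thresholds(docs, n_boot: int = 500, seed: int = 42):
    """Empirical P95/P99 of the Jaccard distance between two bootstrap resamples."""
    rng = random.Random(seed)
    distances = []
    for _ in range(n_boot):
        sample_a = rng.choices(docs, k=len(docs))  # sampling with replacement
        sample_b = rng.choices(docs, k=len(docs))
        distances.append(jaccard_distance(vocabulary(sample_a), vocabulary(sample_b)))
    distances.sort()
    return percentile(distances, 95), percentile(distances, 99)


def is_domain_shift(reference_docs, new_docs, threshold: float) -> bool:
    return jaccard_distance(vocabulary(reference_docs), vocabulary(new_docs)) > threshold


# Hypothetical in-domain (mathematics) and out-of-domain (legal) snippets.
MATH_DOCS = [
    "theorem on prime gaps and sieve bounds",
    "theorem about graph colorings and chromatic number",
    "theorem on convex bodies and lattice points",
    "theorem about elliptic curves and rank",
    "theorem on random matrices and eigenvalues",
    "theorem about partitions and generating functions",
]
LEGAL_DOCS = ["contract clause liability indemnity", "court ruling appeal precedent"]

p95, p99 = bootstrap_thresholds(MATH_DOCS)
print(p95, p99, is_domain_shift(MATH_DOCS, LEGAL_DOCS, p99))
```

The thresholds printed by this toy corpus are not comparable to the reported P95=0.215 and P99=0.279, which come from the project's own data.
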
#### CORA-Eval Benchmark
- **Spec:** N/A
- **Results:** D1: 14/14, D2: 8/8, D9: 12/12; baseline CORA-Score 0.67
- **Key Features:** D1: Formal Mathematical Reasoning (14 CTs, SPEC-009); D2: Physical Systems Modeling (8 CTs, SPEC-010); D9: Experimental Design and Methodology (12 CTs, SPEC-011)...

#### MCP Tool Ecosystem
- **Spec:** N/A
- **Results:** 18 active, 24 inactive (expandable)
- **Key Features:** Web Search (DuckDuckGo): web search; Sequential Thinking: multi-step reasoning; Python Interpreter: code execution...

#### Multi-Domain Coverage
- **Spec:** N/A
- **Results:** 6 domains covered, each with its own TDD suite
- **Key Features:** Legal: 6 skills (pleadings, contracts, case law, etc.); Art therapy: decolonial clinical validation (SPEC-013); Economics: ARM-IAG analysis (World Bank, complexity)...

#### Full Reproducibility Infrastructure
- **Spec:** N/A
- **Results:** 71/71 TDD, 2,091 mirrored files, 0 errors
- **Key Features:** 71/71 automated tests across 6 suites; fixed seed (42) in all scripts; verifiable MD5 hash of each artifact...

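The fixed-seed plus MD5 combination can be illustrated with the standard library alone. This is a sketch of the general technique, assuming a JSON-serializable artifact; `run_experiment`, `artifact_md5`, and `verify_artifact` are hypothetical names, and only the seed value 42 and the use of MD5 come from the entry above.

```python
import hashlib
import json
import random

SEED = 42  # the fixed seed named in the report


def run_experiment(seed: int = SEED) -> dict:
    """Stand-in for a script whose output depends only on its seed."""
    rng = random.Random(seed)  # local generator, no hidden global state
    return {"seed": seed, "samples": [rng.randint(0, 999) for _ in range(5)]}


def artifact_md5(artifact: dict) -> str:
    # Canonical serialization (sorted keys) so equal artifacts hash equally.
    payload = json.dumps(artifact, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def verify_artifact(artifact: dict, expected_md5: str) -> bool:
    return artifact_md5(artifact) == expected_md5


reference_hash = artifact_md5(run_experiment())
# A re-run with the same seed reproduces the same artifact and hash.
print(verify_artifact(run_experiment(), reference_hash))
# A different seed yields a different artifact, so verification fails.
print(verify_artifact(run_experiment(seed=43), reference_hash))
```

MD5 is adequate here for detecting accidental drift between a run and its mirror; it is not collision-resistant, so it should not be relied on against deliberate tampering.
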

---

## Critical Gaps & Roadmap

### Gaps (OpenCode needs to improve)

1. **Foundation Model Scale**
   - Deep Think: IMO-Gold, inference-time scaling, 100x compute reduction
   - OpenCode: depends on accessible API models (GPT, Claude, Gemini via API)
   - Mitigation: Cora V1-V7 partially compensates with verification rigor

2. **Proprietary Benchmarks**
   - FutureMath Basic: Ph.D. exercises (internal only)
   - FirstProof: time-limited competition (expired)
   - Mitigation: CORA-Eval D1-D9 + Olympiad benchmarks

3. **Human Expert Validation Pipeline**
   - Aletheia: team of ~15 mathematicians for validation
   - OpenCode: Layer 3 (minimal human annotation, 30 docs)
   - Mitigation: SPEC-008 Layer 3 + active learning

### Advantages (OpenCode exceeds Aletheia)

1. **Verification Rigor**: Cora V1-V7 (7 checks) > informal verifier
2. **Anti-Circularity**: SPEC-008 framework solves the "single use" problem
3. **Domain-Shift Detection**: SPEC-008-B (unique capability)
4. **Multi-Domain**: 6+ domains vs math only
5. **Reproducibility**: 71 TDD tests + seed + hash vs paper-only
6. **Tool Ecosystem**: 18 MCPs vs 3 tools
7. **Reasoning Taxonomy**: 212 explicit types vs implicit


---

## Conclusion

The OpenCode ecosystem implements the core Aletheia architecture (SPEC-012) while adding **verification rigor** (Cora V1-V7), **anti-circularity** (SPEC-008), **domain-shift detection** (SPEC-008-B), **multi-domain coverage**, and **full TDD reproducibility**.

The critical gap remains the **foundation model scale**: Gemini Deep Think's inference-time scaling law and IMO-Gold achievement are not replicable with public API models. However, OpenCode's verification layers partially compensate by catching errors that a single-pass informal verifier would miss.

In the taxonomy of Feng et al. (§6.1), OpenCode achieves **Level C2** (Human-AI Collaboration, Publishable Research) across multiple domains, with the Aletheia Math Research Engine (SPEC-012) operating at **Level A1-A2** (Autonomous, Minor to Publishable) within mathematical domains.


---
*Generated by cross_correlation.py, OpenCode Ecosystem v4.3.0*
