|
| 1 | +# Cross-Correlation Report: Superhuman/Aletheia × OpenCode Ecosystem |
| 2 | + |
| 3 | +## Feng et al. (2026) "Towards Autonomous Mathematics Research" vs OpenCode v4.3.0 |
| 4 | + |
| 5 | +**Generated:** 2026-05-30T13:54:22.203258 |
| 6 | +**References:** arXiv:2602.10177v3 | github.com/google-deepmind/superhuman | github.com/MarceloClaro/OpenCode_Ecosystem |
| 7 | + |
| 8 | +--- |
| 9 | + |
| 10 | +## Executive Summary |
| 11 | + |
| 12 | +| Metric | Value | |
| 13 | +|--------|-------| |
| 14 | +| Total dimensions compared | 12 | |
| 15 | +| Direct matches | 1 | |
| 16 | +| **OpenCode superior** | **7** (58%) | |
| 17 | +| Aletheia superior | 1 (8%) | |
| 18 | +| Complementary | 3 | |
| 19 | +| Avg OpenCode superiority score | 0.9 | |
| 20 | +| Avg Aletheia superiority score | 0.3 | |
| 21 | + |
| 22 | +**Key finding:** OpenCode matches or exceeds Aletheia in 8/12 (67%) dimensions. The critical gap is the foundation model (Gemini Deep Think scale). |
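The summary figures above can be re-derived from the per-row scores in the correlation matrix. The following is a minimal sketch of that arithmetic, not the code of `cross_correlation.py`: the category labels (`match`, `opencode`, `complementary`, `aletheia`) and the `summarize` helper are hypothetical names chosen for illustration, and the scores are transcribed by hand from the matrix.

```python
from statistics import mean

# (row number, category, score) transcribed from the correlation matrix.
# Category labels are illustrative stand-ins for the emoji markers:
# 🟰 -> "match", 🟢 -> "opencode", 🟡 -> "complementary", 🔵 -> "aletheia".
ROWS = [
    (1, "match", 0.85),
    (2, "opencode", 0.90),
    (3, "complementary", 0.70),
    (4, "opencode", 0.75),
    (5, "complementary", 0.65),
    (6, "complementary", 0.55),
    (7, "opencode", 0.95),
    (8, "opencode", 0.95),
    (9, "opencode", 0.95),
    (10, "opencode", 0.98),
    (11, "opencode", 0.80),
    (12, "aletheia", 0.30),
]


def summarize(rows):
    """Recompute the Executive Summary counts, shares, and averages."""
    counts = {}
    for _, category, _ in rows:
        counts[category] = counts.get(category, 0) + 1
    total = len(rows)

    def avg(category):
        # Mean score of one category, rounded to one decimal as in the table.
        return round(mean(s for _, c, s in rows if c == category), 1)

    return {
        "total": total,
        "counts": counts,
        "opencode_share": round(100 * counts["opencode"] / total),
        "match_or_better_share": round(
            100 * (counts["match"] + counts["opencode"]) / total
        ),
        "avg_opencode": avg("opencode"),
        "avg_aletheia": avg("aletheia"),
    }


summary = summarize(ROWS)
```

Under these assumptions the 8/12 figure is the sum of the direct match (1) and the OpenCode-superior rows (7), and the 0.9 average is the rounded mean of the seven OpenCode-superior scores.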
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## Correlation Matrix |
| 27 | + |
| 28 | +| # | Aletheia Component | OpenCode Component | Match | Score | |
| 29 | +|:--:|---------------------|---------------------|:-----:|:-----:| |
| 30 | +| 1 | Aletheia G-V-R Loop | SPEC-012 Aletheia Engine | 🟰 | 0.85 |
| 31 | +| 2 | Informal Verifier | Cora-Debate V1-V7 + SPEC-008 Triangulation | 🟢 | 0.90 |
| 32 | +| 3 | Gemini Deep Think (implicit) | Reasoning Orchestrator v11 (212 explicit types) | 🟡 | 0.70 |
| 33 | +| 4 | 3 tools (Search, Browse, Python) | 18 MCPs + code-runner + playwright | 🟢 | 0.75 |
| 34 | +| 5 | IMO Bench + FirstProof + FutureMath + Erdos | CORA-Eval D1-D9 + Domain-Shift + Olympiad | 🟡 | 0.65 |
| 35 | +| 6 | H/C/A × 0-4 taxonomy (Feng §6.1) | Layers C1/C1B/C2/C3 (SPEC-008) | 🟡 | 0.55 |
| 36 | +| 7 | Single-use problem (acknowledged in §4) | SPEC-008 Triangulation (3 layers) | 🟢 | 0.95 |
| 37 | +| 8 | Pure mathematics only | 6+ domains (legal, physics, methodology, art, economics) | 🟢 | 0.95 |
| 38 | +| 9 | Paper + prompts on GitHub | TDD + seed + hash + sync mirror, 100% auditable | 🟢 | 0.95 |
| 39 | +| 10 | Not addressed | SPEC-008-B Layer 1B (bootstrap Jaccard, 9 CTs) | 🟢 | 0.98 |
| 40 | +| 11 | Reduction via tool use (Search) | Cora V4 + 6 detection patterns + citation verification | 🟢 | 0.80 |
| 41 | +| 12 | Gemini Deep Think (proprietary, massive scale) | OpenCode (models accessible via API) | 🔵 | 0.30 |
| 42 | + |
| 43 | +--- |
| 44 | + |
| 45 | +## Detailed Analysis |
| 46 | + |
| 47 | +### 🟢 OpenCode Advantages (7 dimensions) |
| 48 | + |
| 49 | +**Informal Verifier vs Cora-Debate V1-V7 + SPEC-008 Triangulation** (score: 0.90) |
| 50 | +> Aletheia uses an informal verifier; OpenCode has 7 symbolic verifiers plus 3 anti-circularity layers |
| 51 | +> OpenCode advantage: 7 explicit verifiers (vs 1 implicit); decoupled self-critique; anti-circular triangulation |
| 52 | +> Aletheia advantage: none listed |
| 53 | +
|
| 54 | +**3 tools (Search, Browse, Python) vs 18 MCPs + code-runner + playwright** (score: 0.75) |
| 55 | +> OpenCode has 6x as many active tools (18 vs 3), covering domains beyond mathematics |
| 56 | +> OpenCode advantage: 18 multi-purpose MCPs (vs 3 tools); isolated sandbox; local SQLite; PDF toolkit |
| 57 | +> Aletheia advantage: deep Google Search integration (model trained for tool use) |
| 58 | +
|
| 59 | +**Single-use problem (acknowledged in §4) vs SPEC-008 Triangulation (3 layers)** (score: 0.95) |
| 60 | +> Aletheia acknowledges the 'single use' problem but does not solve it; OpenCode has a complete framework for it |
| 61 | +> OpenCode advantage: mathematical framework for breaking circularity; domain-shift detection; bootstrap calibration |
| 62 | +> Aletheia advantage: none listed |
| 63 | +
|
| 64 | +**Pure mathematics only vs 6+ domains (legal, physics, methodology, art, economics)** (score: 0.95) |
| 65 | +> Aletheia was designed exclusively for mathematics; OpenCode covers multiple scientific domains |
| 66 | +> OpenCode advantage: 6 domains, each with its own TDD suite; integration with public calls for proposals (editais), art therapy, CORA-Eval |
| 67 | +> Aletheia advantage: none listed |
| 68 | +
|
| 69 | +**Paper + prompts on GitHub vs TDD + seed + hash + sync mirror, 100% auditable** (score: 0.95) |
| 70 | +> Aletheia publishes prompts and outputs but no automated tests; OpenCode has a complete TDD suite |
| 71 | +> OpenCode advantage: 71 automated tests; fixed seed; verifiable hash; identical clone via sync mirror |
| 72 | +> Aletheia advantage: none listed |
| 73 | +
|
| 74 | +**Not addressed vs SPEC-008-B Layer 1B (bootstrap Jaccard, 9 CTs)** (score: 0.98) |
| 75 | +> Aletheia does not address domain shift between problems or domains; OpenCode has a dedicated framework |
| 76 | +> OpenCode advantage: institutional decomposition; 3 Jaccard deltas; bootstrap calibration; 9 TDD CTs |
| 77 | +> Aletheia advantage: none listed |
| 78 | +
|
| 79 | +**Reduction via tool use (Search) vs Cora V4 + 6 detection patterns + citation verification** (score: 0.80) |
| 80 | +> Aletheia reduces hallucinations via Search but does not detect them systematically |
| 81 | +> OpenCode advantage: 6 detection patterns; V4 Citation Accuracy check; score penalty |
| 82 | +> Aletheia advantage: Google Search integrated as a native tool of the base model |
| 83 | +
|
| 84 | +### 🔵 Aletheia Advantages (1 dimension) |
| 85 | + |
| 86 | +**Gemini Deep Think (proprietary, massive scale) vs OpenCode (models accessible via API)** (score: 0.30) |
| 87 | +> Fundamental gap: Deep Think has scale and training that public models cannot match |
| 88 | +> Aletheia advantage: IMO-Gold (5/6); 100x compute reduction; proprietary inference-time scaling law |
| 89 | +
|
| 90 | +### 🟡 Complementary (3 dimensions) |
| 91 | + |
| 92 | +**Gemini Deep Think (implicit) vs Reasoning Orchestrator v11 (212 explicit types)** (score: 0.70) |
| 93 | +> Deep Think has massive scale but implicit reasoning; OpenCode has an explicit taxonomy of 212 types but limited scale |
| 94 | +
|
| 95 | +**IMO Bench + FirstProof + FutureMath + Erdos vs CORA-Eval D1-D9 + Domain-Shift + Olympiad** (score: 0.65) |
| 96 | +> Different benchmarks: Aletheia focuses on pure mathematics; OpenCode covers 9 disciplines plus methodology |
| 97 | +
|
| 98 | +**H/C/A × 0-4 taxonomy (Feng §6.1) vs Layers C1/C1B/C2/C3 (SPEC-008)** (score: 0.55) |
| 99 | +> Different systems: Aletheia classifies the final result; OpenCode classifies the validation process |
| 100 | +
|
| 101 | + |
| 102 | +--- |
| 103 | + |
| 104 | +## Component-by-Component Mapping |
| 105 | + |
| 106 | +### Aletheia Components → OpenCode Equivalents |
| 107 | + |
| 108 | +#### Aletheia Agent Architecture |
| 109 | +- **Paper:** §2, Figure 1 |
| 110 | +- **Results:** 93% IMO-Proof Bench Advanced, 82% FutureMath Basic (conditional) |
| 111 | +- **Key Features:** Generator: natural-language solution; Verifier: informal verification mechanism; Reviser: iterative correction... |
| 112 | + |
| 113 | +#### Gemini Deep Think |
| 114 | +- **Paper:** §2.1, Figure 2 |
| 115 | +- **Results:** IMO-Proof Bench (30 problems), FutureMath Basic (internal) |
| 116 | +- **Key Features:** Scalability: 100x compute reduction (Jan 2026 vs Jul 2025); Parallelism: simultaneous exploration of ideas; Ph.D.-level transfer: scaling law transfers to exercises... |
| 117 | + |
| 118 | +#### Tool Integration |
| 119 | +- **Paper:** §2.3, Figure 3-4 |
| 120 | +- **Results:** Fewer fabricated citations; subtle errors persist |
| 121 | +- **Key Features:** Google Search: fewer hallucinated citations; Web browsing: navigation of the mathematical literature; Python: marginal gains (model already proficient)... |
| 122 | + |
| 123 | +#### Research Milestones |
| 124 | +- **Paper:** §3, Table 1 |
| 125 | +- **Results:** 212/700 Erdos candidates; 4 confirmed as new |
| 126 | +- **Key Features:** Feng26: 100% autonomous (Level A2), Eigenweights; LeeSeo26: Human-AI (Level C2), Independence Polynomials; BKKKZ26: generalization of Erdos-1051 (Level C2)... |
| 127 | + |
| 128 | +#### Autonomy & Significance Taxonomy |
| 129 | +- **Paper:** §6.1, Tables 8-9 |
| 130 | +- **Results:** A2 = Feng26, C2 = LeeSeo26/BKKKZ26, H2 = FYZ26/ACGKMP26 |
| 131 | +- **Key Features:** Axis 1: H (Human-primary), C (Collaboration), A (Autonomous); Axis 2: 0 (Negligible), 1 (Minor), 2 (Publishable), 3 (Major), 4 (Landmark); HAI Cards: transparent documentation of human-AI interaction... |
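The two-axis labels used in this report (for example `A2` or `C2`) combine one autonomy letter with one significance digit. A small sketch of how such labels could be parsed is shown below; the class names `Autonomy`, `Significance`, and `Level` are hypothetical and do not come from the paper or from OpenCode, while the axis values are the ones listed above.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Autonomy(Enum):
    """Axis 1 of the taxonomy as listed in this report."""
    H = "Human-primary"
    C = "Collaboration"
    A = "Autonomous"


class Significance(IntEnum):
    """Axis 2 of the taxonomy as listed in this report."""
    NEGLIGIBLE = 0
    MINOR = 1
    PUBLISHABLE = 2
    MAJOR = 3
    LANDMARK = 4


@dataclass(frozen=True)
class Level:
    autonomy: Autonomy
    significance: Significance

    @classmethod
    def parse(cls, label: str) -> "Level":
        """Parse a two-character label such as 'A2' into its two axes."""
        if (
            len(label) != 2
            or label[0] not in Autonomy.__members__
            or not label[1].isdigit()
        ):
            raise ValueError(f"not a valid level label: {label!r}")
        # Significance(...) raises ValueError for digits outside 0-4.
        return cls(Autonomy[label[0]], Significance(int(label[1])))

    def describe(self) -> str:
        return f"{self.autonomy.value}, {self.significance.name.title()}"


feng26 = Level.parse("A2")  # describe() gives "Autonomous, Publishable"
```

Keeping the two axes as separate types makes it harder to confuse a claim about who did the work with a claim about how significant the result is.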
| 132 | + |
| 133 | +#### Evaluation Benchmarks |
| 134 | +- **Paper:** §2, §4, §3.3 |
| 135 | +- **Results:** FirstProof: Aletheia 7/10; GPT 5.2 Pro 2/10 baseline |
| 136 | +- **Key Features:** IMO-AnswerBench: 400 short-answer problems, IMO-ProofBench: 60 proof-based problems, IMO-GradingBench: 1000 human gradings... |
| 137 | + |
| 138 | + |
| 139 | +### OpenCode Components → Aletheia Equivalents |
| 140 | + |
| 141 | +#### Aletheia Math Research Engine (SPEC-012) |
| 142 | +- **Spec:** SPEC-012 |
| 143 | +- **Results:** 5/5 solved (100%), avg 1.0 attempts, max L1_MINOR |
| 144 | +- **Key Features:** Generator: 16 reasoning types with adaptive per-domain selection; Verifier: Cora-Debate V1-V7 (7/7 checks) + hallucination detection (6 patterns); Reviser: feedback loop with a budget of 10 attempts... |
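The Generator, Verifier, and Reviser roles with a budget of 10 attempts suggest a bounded feedback loop. The sketch below illustrates that control flow only. It is not the SPEC-012 implementation: `generate_verify_revise` and the three toy callables are hypothetical, and the real verifier (Cora-Debate V1-V7) is replaced here by a trivial arithmetic check.

```python
from typing import Any, Callable, Optional, Tuple


def generate_verify_revise(
    problem: Any,
    generate: Callable[[Any], Any],
    verify: Callable[[Any, Any], Tuple[bool, str]],
    revise: Callable[[Any, Any, str], Any],
    budget: int = 10,
) -> Tuple[Optional[Any], int]:
    """Return (accepted candidate or None, number of verification attempts)."""
    candidate = generate(problem)
    for attempt in range(1, budget + 1):
        ok, feedback = verify(problem, candidate)
        if ok:
            return candidate, attempt
        if attempt < budget:
            # Feed the verifier's feedback back into the next revision.
            candidate = revise(problem, candidate, feedback)
    return None, budget


# Toy stand-ins: find an integer whose square equals the problem value.
def guess(problem: int) -> int:
    return 5


def check(problem: int, candidate: int) -> Tuple[bool, str]:
    square = candidate * candidate
    if square == problem:
        return True, ""
    feedback = "too small" if square < problem else "too large"
    return False, feedback


def adjust(problem: int, candidate: int, feedback: str) -> int:
    return candidate + 1 if feedback == "too small" else candidate - 1


solution, attempts = generate_verify_revise(49, guess, check, adjust)
```

Returning the attempt count alongside the candidate makes figures such as "avg 1.0 attempts" straightforward to aggregate across problems.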
| 145 | + |
| 146 | +#### Cora-Debate V1-V7 |
| 147 | +- **Spec:** N/A |
| 148 | +- **Results:** 7/7 checks integrated into the Aletheia Verifier |
| 149 | +- **Key Features:** V1: Logical Consistency, V2: Mathematical Correctness, V3: Edge Case Coverage... |
| 150 | + |
| 151 | +#### Reasoning Orchestrator v11 |
| 152 | +- **Spec:** N/A |
| 153 | +- **Results:** 212+ types mapped and documented |
| 154 | +- **Key Features:** 68 base types + 10 Game Theory + expansions; 12 categories (logic, dialectics, strategy, innovation, etc.); 7-phase pipeline with specialized agents... |
| 155 | + |
| 156 | +#### Anti-Circularity Triangulation (SPEC-008 + 008-B) |
| 157 | +- **Spec:** N/A |
| 158 | +- **Results:** 14/14 TDD, domain-shift P95=0.215, P99=0.279 |
| 159 | +- **Key Features:** Layer 1: Blind temporal split (Bergmeir 2012, Cerqueira 2020); Layer 1B: Domain-shift detection (bootstrap Jaccard); Layer 2: Adversarial perturbation (4 transformations)... |
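The report names bootstrap Jaccard as the domain-shift mechanism and quotes P95 and P99 values, but does not describe the procedure. One plausible reading, sketched below under stated assumptions, is to bootstrap a null distribution of Jaccard distances between resampled halves of a reference corpus and use its upper percentiles as thresholds. The whitespace tokenization, the resampling scheme, and every function name here are assumptions for illustration, not the SPEC-008-B algorithm, and the sketch will not reproduce the reported 0.215 and 0.279.

```python
import math
import random
from typing import List, Sequence, Set, Tuple


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def vocabulary(docs: Sequence[str]) -> Set[str]:
    """Lower-cased whitespace tokens (an assumed, deliberately naive tokenizer)."""
    vocab: Set[str] = set()
    for doc in docs:
        vocab.update(doc.lower().split())
    return vocab


def percentile(sorted_values: List[float], q: int) -> float:
    """Nearest-rank percentile of an ascending list, q in 1..100."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def bootstrap_thresholds(
    docs: Sequence[str], n_boot: int = 500, seed: int = 42
) -> Tuple[float, float]:
    """Return (P95, P99) of within-domain Jaccard distances."""
    rng = random.Random(seed)  # fixed seed keeps the thresholds reproducible
    half = max(1, len(docs) // 2)
    distances = []
    for _ in range(n_boot):
        # Two independent resamples (with replacement) of the same corpus.
        sample_a = rng.choices(docs, k=half)
        sample_b = rng.choices(docs, k=half)
        distances.append(
            jaccard_distance(vocabulary(sample_a), vocabulary(sample_b))
        )
    distances.sort()
    return percentile(distances, 95), percentile(distances, 99)


def is_domain_shift(
    reference: Sequence[str], new_docs: Sequence[str], threshold: float
) -> bool:
    """Flag new_docs when their vocabulary distance exceeds the threshold."""
    distance = jaccard_distance(vocabulary(reference), vocabulary(new_docs))
    return distance > threshold
```

A caller would compute the thresholds once per reference domain and then compare each incoming batch against P95 (warning) or P99 (hard flag); that two-tier use is also an assumption.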
| 160 | + |
| 161 | +#### CORA-Eval Benchmark |
| 162 | +- **Spec:** N/A |
| 163 | +- **Results:** D1:14/14, D2:8/8, D9:12/12; baseline CORA-Score 0.67 |
| 164 | +- **Key Features:** D1: Formal Mathematical Reasoning (14 CTs, SPEC-009); D2: Physical Systems Modeling (8 CTs, SPEC-010); D9: Experimental Design and Methodology (12 CTs, SPEC-011)... |
| 165 | + |
| 166 | +#### MCP Tool Ecosystem |
| 167 | +- **Spec:** N/A |
| 168 | +- **Results:** 18 active, 24 inactive (available for expansion) |
| 169 | +- **Key Features:** Web Search (DuckDuckGo): web search; Sequential Thinking: multi-step reasoning; Python Interpreter: code execution... |
| 170 | + |
| 171 | +#### Multi-Domain Coverage |
| 172 | +- **Spec:** N/A |
| 173 | +- **Results:** 6 domains covered, each with its own TDD suite |
| 174 | +- **Key Features:** Legal: 6 skills (pleadings, contracts, case law, etc.); Art therapy: decolonial clinical validation (SPEC-013); Economics: ARM-IAG analysis (World Bank, complexity)... |
| 175 | + |
| 176 | +#### Full Reproducibility Infrastructure |
| 177 | +- **Spec:** N/A |
| 178 | +- **Results:** 71/71 TDD, 2,091 mirrored files, 0 errors |
| 179 | +- **Key Features:** 71/71 automated tests across 6 suites; Fixed seed (42) in all scripts; Verifiable MD5 hash of each artifact... |
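The combination of a fixed seed (42) and an MD5 hash per artifact can be illustrated in a few lines: if every source of randomness is seeded and the artifact is serialized deterministically, two runs yield byte-identical output and therefore the same digest. The sketch below assumes a JSON artifact and uses hypothetical helper names (`build_artifact`, `md5_hex`); it does not reflect the actual OpenCode scripts or file layout.

```python
import hashlib
import json
import random

SEED = 42  # the fixed seed value stated in the report


def build_artifact(seed: int = SEED) -> bytes:
    """Produce a deterministic artifact: same seed, same bytes."""
    rng = random.Random(seed)  # local generator, no global state touched
    scores = [round(rng.random(), 6) for _ in range(5)]
    # sort_keys gives a stable key order, so serialization is reproducible.
    payload = {"seed": seed, "scores": scores}
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def md5_hex(payload: bytes) -> str:
    """MD5 digest used as an integrity fingerprint, not as a security measure."""
    return hashlib.md5(payload).hexdigest()


first = md5_hex(build_artifact())
second = md5_hex(build_artifact())
reproducible = first == second
```

MD5 is adequate for detecting accidental drift between a repository and its mirror; a setting that must resist deliberate tampering would call for a stronger hash such as SHA-256.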
| 180 | + |
| 181 | +--- |
| 182 | + |
| 183 | +## Critical Gaps & Roadmap |
| 184 | + |
| 185 | +### Gaps (OpenCode needs to improve) |
| 186 | + |
| 187 | +1. **Foundation Model Scale** |
| 188 | + - Deep Think: IMO-Gold, inference-time scaling, 100x compute reduction |
| 189 | + - OpenCode: depends on accessible API models (GPT, Claude, Gemini via API) |
| 190 | + - Mitigation: Cora V1-V7 compensates with verification rigor |
| 191 | + |
| 192 | +2. **Proprietary Benchmarks** |
| 193 | + - FutureMath Basic: Ph.D. exercises (internal only) |
| 194 | + - FirstProof: time-limited competition (expired) |
| 195 | + - Mitigation: CORA-Eval D1-D9 + Olympiad benchmarks |
| 196 | + |
| 197 | +3. **Human Expert Validation Pipeline** |
| 198 | + - Aletheia: team of ~15 mathematicians for validation |
| 199 | +   - OpenCode: Layer 3 (minimal human annotation, 30 docs) |
| 200 | +   - Mitigation: SPEC-008 Layer 3 + active learning |
| 201 | + |
| 202 | +### Advantages (OpenCode exceeds Aletheia) |
| 203 | + |
| 204 | +1. **Verification Rigor**: Cora V1-V7 (7 checks) > informal verifier |
| 205 | +2. **Anti-Circularity**: SPEC-008 framework solves the "single use" problem |
| 206 | +3. **Domain-Shift Detection**: SPEC-008-B (unique capability) |
| 207 | +4. **Multi-Domain**: 6+ domains vs math only |
| 208 | +5. **Reproducibility**: 71 TDD tests + seed + hash vs paper-only |
| 209 | +6. **Tool Ecosystem**: 18 MCPs vs 3 tools |
| 210 | +7. **Reasoning Taxonomy**: 212 explicit types vs implicit (scored as complementary, 0.70, in the correlation matrix) |
| 211 | + |
| 212 | +--- |
| 213 | + |
| 214 | +## Conclusion |
| 215 | + |
| 216 | +The OpenCode ecosystem implements the core Aletheia architecture (SPEC-012) while adding **verification rigor** (Cora V1-V7), **anti-circularity** (SPEC-008), **domain-shift detection** (SPEC-008-B), **multi-domain coverage**, and **full TDD reproducibility**. |
| 217 | + |
| 218 | +The critical gap remains the **foundation model scale** — Gemini Deep Think's inference-time scaling law and IMO-Gold achievement are not replicable with public API models. However, OpenCode's verification layers partially compensate by catching errors that a single-pass informal verifier would miss. |
| 219 | + |
| 220 | +In the taxonomy of Feng et al. (§6.1), OpenCode achieves **Level C2** (Human-AI Collaboration, Publishable Research) across multiple domains, with the Aletheia Math Research Engine (SPEC-012) operating at **Level A1-A2** (Autonomous, Minor to Publishable) within mathematical domains. |
| 221 | + |
| 222 | +--- |
| 223 | +*Generated by cross_correlation.py — OpenCode Ecosystem v4.3.0* |