|
| 1 | +# Phase 1.1 — Download & Dataset Preparation Plan |
| 2 | + |
| 3 | +**Date**: 30 May 2026 |
| 4 | +**Dataset Scope**: 700 problems from 12 domains |
| 5 | +**Status**: FINAL (counts confirmed via GitHub API) |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## 📊 Final Domain Breakdown |
| 10 | + |
| 11 | +Total confirmed: **700 problems** (1.63x vs v1.0's 430) |
| 12 | + |
| 13 | +| Domain | Files | % | Cumulative | |
| 14 | +|--------|-------|---|------------| |
| 15 | +| ErdosProblems | 430 | 61.4% | 430 | |
| 16 | +| Wikipedia | 124 | 17.7% | 554 | |
| 17 | +| GreensOpenProblems | 44 | 6.3% | 598 | |
| 18 | +| WrittenOnTheWallII | 26 | 3.7% | 624 | |
| 19 | +| Paper | 23 | 3.3% | 647 | |
| 20 | +| OEIS | 21 | 3.0% | 668 | |
| 21 | +| Mathoverflow | 11 | 1.6% | 679 | |
| 22 | +| Books (subdirs) | 3 | 0.4% | 682 | |
| 23 | +| Other | 5 | 0.7% | 687 | |
| 24 | +| Subsets | 2 | 0.3% | 689 | |
| 25 | +| Millenium | 4 | 0.6% | 693 | |
| 26 | +| OpenQuantumProblems | 3 | 0.4% | 696 | |
| 27 | +| HilbertProblems | 2 | 0.3% | 698 | |
| 28 | +| Kourovka | 2 | 0.3% | 700 | |
| 29 | +| **TOTAL** | **700** | **100%** | **700** | |
| 30 | + |
| 31 | +--- |
| 32 | + |
| 33 | +## 📥 Download Strategy |
| 34 | + |
| 35 | +### Option A: Clone Entire Repo (Recommended) |
| 36 | +```bash |
| 37 | +git clone --depth 1 https://github.com/google-deepmind/formal-conjectures.git |
| 38 | +cd formal-conjectures/FormalConjectures |
| 39 | +``` |
| 40 | + |
| 41 | +**Pros**: |
| 42 | +- All data + history |
| 43 | +- Can follow git updates |
| 44 | +- Preserve structure |
| 45 | + |
| 46 | +**Cons**: |
| 47 | +- Large (~50-100 MB) |
| 48 | +- Slower (~5 min) |
| 49 | + |
| 50 | +**Recommendation**: ✅ Use this for Phase 1.1 |
| 51 | + |
| 52 | +--- |
| 53 | + |
| 54 | +### Option B: Download via GitHub API (Parallel) |
| 55 | +Extract URLs from GitHub API, download in parallel with curl. |
| 56 | + |
| 57 | +**Pros**: |
| 58 | +- Faster (~2 min) |
| 59 | +- Minimal space |
| 60 | + |
| 61 | +**Cons**: |
| 62 | +- Need to handle rate limiting (60 req/min unauthenticated) |
| 63 | +- Complex parallel logic |
| 64 | + |
| 65 | +--- |
| 66 | + |
| 67 | +## 🔄 Implementation Plan |
| 68 | + |
| 69 | +### Stage 1: Clone (5 min) |
| 70 | +```bash |
| 71 | +git clone --depth 1 https://github.com/google-deepmind/formal-conjectures.git fc_repo |
| 72 | +# Output: ~50-100 MB |
| 73 | +``` |
| 74 | + |
| 75 | +### Stage 2: Extract Structure (10 min) |
| 76 | +```bash |
| 77 | +cd fc_repo/FormalConjectures |
| 78 | + |
| 79 | +# Copy all .lean files preserving domain structure |
| 80 | +mkdir -p ../../aletheia-superhuman-validation/raw_data/formal-conjectures |
| 81 | +for domain in ErdosProblems Wikipedia GreensOpenProblems WrittenOnTheWallII Paper OEIS Mathoverflow Books Other Subsets Millenium OpenQuantumProblems HilbertProblems Kourovka; do |
| 82 | + if [ -d "$domain" ]; then |
| 83 | + cp -r "$domain" ../../aletheia-superhuman-validation/raw_data/formal-conjectures/ |
| 84 | + fi |
| 85 | +done |
| 86 | + |
| 87 | +# Handle Books subdirs separately |
| 88 | +if [ -d "Books" ]; then |
| 89 | + mkdir -p ../../aletheia-superhuman-validation/raw_data/formal-conjectures/Books |
| 90 | + cp Books/**/*.lean ../../aletheia-superhuman-validation/raw_data/formal-conjectures/Books/ |
| 91 | +fi |
| 92 | +``` |
| 93 | + |
| 94 | +### Stage 3: Parse & Structure (30 min) |
| 95 | +Parse all 700 .lean files into `erdos_700_enriched.json`: |
| 96 | + |
| 97 | +```json |
| 98 | +{ |
| 99 | + "metadata": { |
| 100 | + "source": "google-deepmind/formal-conjectures", |
| 101 | + "date": "2026-05-30", |
| 102 | + "version": "1.0", |
| 103 | + "total_problems": 700, |
| 104 | + "domains": { |
| 105 | + "ErdosProblems": 430, |
| 106 | + "Wikipedia": 124, |
| 107 | + ... |
| 108 | + } |
| 109 | + }, |
| 110 | + "problems": [ |
| 111 | + { |
| 112 | + "id": "E001", |
| 113 | + "domain": "ErdosProblems", |
| 114 | + "filename": "1.lean", |
| 115 | + "statement": "For any set of n ≥ ...", |
| 116 | + "types": ["combinatorics", "graph_theory"], |
| 117 | + "difficulty": "medium", |
| 118 | + "dependencies": ["Finset", "Card"], |
| 119 | + "raw_lean_code": "theorem ..." |
| 120 | + }, |
| 121 | + ... |
| 122 | + ] |
| 123 | +} |
| 124 | +``` |
| 125 | + |
| 126 | +**Size estimate**: 2-3 MB |
| 127 | + |
| 128 | +### Stage 4: NLP Enrichment (1-2 hours) |
| 129 | +Extend with NLP analysis: |
| 130 | + |
| 131 | +```python |
| 132 | +for problem in problems: |
| 133 | + # Extract statement from Lean code |
| 134 | + statement = extract_statement_from_lean(problem.raw_lean_code) |
| 135 | + |
| 136 | + # Classify problem types (ML) |
| 137 | + types = classify_types(statement) # ["combinatorics", "discrete_math"] |
| 138 | + |
| 139 | + # Infer difficulty from statement length |
| 140 | + difficulty = infer_difficulty(len(statement), "lean_complexity") |
| 141 | + |
| 142 | + # Extract dependencies from Lean imports |
| 143 | + dependencies = extract_dependencies(problem.raw_lean_code) |
| 144 | + |
| 145 | + # Enrich |
| 146 | + problem.update({ |
| 147 | + "statement": statement, |
| 148 | + "types": types, |
| 149 | + "difficulty": difficulty, |
| 150 | + "dependencies": dependencies, |
| 151 | + "enriched": True |
| 152 | + }) |
| 153 | +``` |
| 154 | + |
| 155 | +**Model**: Use v1.0 trained classifier (F1=0.91) |
| 156 | + |
| 157 | +### Stage 5: Validation (30 min) |
| 158 | +Spot-check 50 samples (7% of 700): |
| 159 | +- 10 per major domain (Erdős, Wikipedia, GreensOpenProblems) |
| 160 | +- Random selection from others |
| 161 | + |
| 162 | +**Criteria**: |
| 163 | +- Statement extraction: correct |
| 164 | +- Type classification: ≥0.85 precision |
| 165 | +- Difficulty assignment: reasonable |
| 166 | +- Dependencies: complete |
| 167 | + |
| 168 | +--- |
| 169 | + |
| 170 | +## 🎯 Success Criteria |
| 171 | + |
| 172 | +| Criterion | Target | v1.0 Baseline | Notes | |
| 173 | +|-----------|--------|---------------|-------| |
| 174 | +| **Total problems** | 700 | 430 | ✅ +63% | |
| 175 | +| **Domains** | 12+ | 1 | ✅ Multi-domain | |
| 176 | +| **Parse success** | ≥95% | 100% | Expect slight degradation (non-Erdős) | |
| 177 | +| **Type classification F1** | ≥0.88 | 0.91 | Domain shift, acceptable -0.03 | |
| 178 | +| **Enrichment time** | <2 hours | N/A | Parallel NLP | |
| 179 | +| **Output JSON size** | 2-3 MB | 0.8 MB | Proportional | |
| 180 | +| **Spot-check (50 samples)** | ≥90% quality | N/A | Manual review | |
| 181 | + |
| 182 | +--- |
| 183 | + |
| 184 | +## ⏱️ Timeline |
| 185 | + |
| 186 | +| Stage | Task | Time | Dependency | |
| 187 | +|-------|------|------|------------| |
| 188 | +| 1 | Clone repo | 5 min | Network | |
| 189 | +| 2 | Extract .lean files | 10 min | Stage 1 | |
| 190 | +| 3 | Parse JSON structure | 30 min | Stage 2 | |
| 191 | +| 4 | NLP enrichment | 1-2 hours | Stage 3 | |
| 192 | +| 5 | Spot-check validation | 30 min | Stage 4 | |
| 193 | +| 6 | Save to repo | 5 min | Stage 5 | |
| 194 | +| **TOTAL** | **Phase 1.1** | **2.5-3.5 hours** | Sequential | |
| 195 | + |
| 196 | +--- |
| 197 | + |
| 198 | +## 📝 Output Artifacts |
| 199 | + |
| 200 | +### After Phase 1.1 Completion: |
| 201 | + |
| 202 | +1. **raw_data/formal-conjectures/** — Directory with all 700 .lean files preserved |
| 203 | + ``` |
| 204 | + raw_data/formal-conjectures/ |
| 205 | + ├── ErdosProblems/ |
| 206 | + │ ├── 1.lean |
| 207 | + │ ├── 10.lean |
| 208 | + │ └── ... |
| 209 | + ├── Wikipedia/ |
| 210 | + │ └── ... |
| 211 | + └── ... |
| 212 | + ``` |
| 213 | + |
| 214 | +2. **data/erdos_700_enriched.json** — Structured dataset (2-3 MB) |
| 215 | + ```json |
| 216 | + { |
| 217 | + "metadata": { ... }, |
| 218 | + "problems": [ ... ] // 700 items |
| 219 | + } |
| 220 | + ``` |
| 221 | + |
| 222 | +3. **ENRICHMENT_REPORT.md** — QA results |
| 223 | + - Parse success rate: X% |
| 224 | + - Type classification F1: X |
| 225 | + - Difficulty distribution |
| 226 | + - Domain breakdown analysis |
| 227 | + - 50 spot-check results |
| 228 | + |
| 229 | +--- |
| 230 | + |
| 231 | +## 🚀 Next Phase (1.2) |
| 232 | + |
| 233 | +After Phase 1.1 completion, proceed to **Phase 1.2: Infrastructure Setup**: |
| 234 | + |
| 235 | +- [ ] Detect GPU availability (CUDA, compute capability) |
| 236 | +- [ ] Create distributed_spec_processor.py (single-GPU or CPU fallback) |
| 237 | +- [ ] Benchmark: 430 → 700 in <30s on target hardware |
| 238 | +- [ ] Setup experiment tracking (MLflow, W&B) |
| 239 | + |
| 240 | +--- |
| 241 | + |
| 242 | +## 🔍 Decision Gates |
| 243 | + |
| 244 | +### Gate 1: Parse Success (Stage 3) |
| 245 | +- If <95% problems parse successfully |
| 246 | +- **Action**: Debug most common Lean syntax variations, retry parsing |
| 247 | + |
| 248 | +### Gate 2: Type Classification (Stage 4) |
| 249 | +- If F1 <0.85 |
| 250 | +- **Action**: Retrain classifier on combined domains, or accept degradation with explanation |
| 251 | + |
| 252 | +### Gate 3: Spot-Check (Stage 5) |
| 253 | +- If <85% quality in spot-check |
| 254 | +- **Action**: Review failure patterns, apply manual corrections to failed samples |
| 255 | + |
| 256 | +--- |
| 257 | + |
| 258 | +## 💾 Storage Plan |
| 259 | + |
| 260 | +**Repository structure**: |
| 261 | +``` |
| 262 | +aletheia-superhuman-validation/ |
| 263 | +├── data/ |
| 264 | +│ ├── erdos_430_baseline.json ← v1.0 (existing) |
| 265 | +│ ├── erdos_700_enriched.json ← v1.1 NEW |
| 266 | +│ └── distributions/ ← Domain analysis |
| 267 | +├── raw_data/ |
| 268 | +│ └── formal-conjectures/ ← 700 .lean files |
| 269 | +├── enrichment/ |
| 270 | +│ ├── classifier_model.pkl ← Pre-trained from v1.0 |
| 271 | +│ └── type_ontology.json ← 68 → 80+ reasoning types |
| 272 | +└── reports/ |
| 273 | + └── ENRICHMENT_REPORT.md ← QA summary |
| 274 | +``` |
| 275 | + |
| 276 | +**Size budget**: |
| 277 | +- raw_data: ~5-10 MB (compressed: 1-2 MB) |
| 278 | +- erdos_700_enriched.json: 2-3 MB |
| 279 | +- Models/reports: 0.5 MB |
| 280 | +- **Total**: ~8-16 MB (well within GitHub LFS) |
| 281 | + |
| 282 | +--- |
| 283 | + |
| 284 | +## ✅ Status |
| 285 | + |
| 286 | +- [x] Domain discovery (17 domains explored) |
| 287 | +- [x] Final counts confirmed (700 total) |
| 288 | +- [x] Download strategy defined |
| 289 | +- [ ] Clone repo |
| 290 | +- [ ] Extract .lean files |
| 291 | +- [ ] Parse JSON structure |
| 292 | +- [ ] NLP enrichment |
| 293 | +- [ ] Spot-check validation |
| 294 | +- [ ] Commit to aletheia-superhuman-validation/data/ |
| 295 | + |
| 296 | +--- |
| 297 | + |
| 298 | +**Ready to proceed with Stage 1?** ✨ |
| 299 | + |
0 commit comments