| 1 | +# Tech Progress Dashboard — Data Source Landscape (Research Log) |
| 2 | + |
| 3 | +_Last updated: 2026-02-14_ |
| 4 | + |
| 5 | +## 1) Scope definition (v1) |
| 6 | + |
| 7 | +### Goal |
| 8 | +Build a maintainable, high-trust dataset of **technology progress time series** for use in `posts/2026-02-14-tech-progress-dashboard.llm.qmd`. |
| 9 | + |
| 10 | +### Time scope |
| 11 | +- Include data points from **1950 onward**. |
| 12 | +- Prefer series with annual observations; allow coarser frequencies only when they represent key frontier milestones. |
| 13 | + |
| 14 | +### Metric scope |
| 15 | +Include series that track one of the following: |
| 16 | +1. **Cost decline** (e.g., inflation-adjusted $/unit performance) |
| 17 | +2. **Performance improvement** (e.g., speed, efficiency, accuracy, yield) |
| 18 | +3. **Scale and diffusion** (e.g., deployment, adoption, capacity) |
| 19 | +4. **Input intensity** (e.g., compute used in training, R&D effort) |
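
As a rough illustration of the "inflation-adjusted $/unit performance" metric in item 1, the sketch below deflates a nominal cost with a price index and divides by a performance quantity. The function names and all numbers are hypothetical; the price index would come from a deflator series such as those this log proposes to pull from FRED.

```python
def to_real_dollars(nominal: float, index_in_year: float, index_in_base_year: float) -> float:
    """Convert a nominal amount to base-year dollars using a price index."""
    if index_in_year <= 0 or index_in_base_year <= 0:
        raise ValueError("price index values must be positive")
    return nominal * index_in_base_year / index_in_year


def real_cost_per_unit(
    nominal_cost: float,
    units: float,
    index_in_year: float,
    index_in_base_year: float,
) -> float:
    """Inflation-adjusted cost per unit of performance (e.g., $ per unit of output)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return to_real_dollars(nominal_cost, index_in_year, index_in_base_year) / units


# Illustrative numbers only, not taken from any source in this log.
example = real_cost_per_unit(
    nominal_cost=100.0, units=4.0, index_in_year=50.0, index_in_base_year=100.0
)
print(example)  # 50.0
```

Keeping the base year explicit in the call makes it harder to mix nominal and real values by accident.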
| 20 | + |
| 21 | +### Domain scope |
| 22 | +- Compute hardware and semiconductors |
| 23 | +- AI/ML model capability and compute |
| 24 | +- Energy generation and storage |
| 25 | +- Communications and networking |
| 26 | +- Biotech/health technology |
| 27 | +- Manufacturing and industrial productivity |
| 28 | +- Agriculture (selected benchmark technologies) |
| 29 | +- Transportation systems |
| 30 | + |
| 31 | +### Out-of-scope (for now) |
| 32 | +- Pre-1950 historical reconstructions |
| 33 | +- Single-paper one-off benchmark results with no replicable update path |
| 34 | +- Data locked behind expensive subscriptions unless a consistent legal extraction path exists |
| 35 | + |
| 36 | +--- |
| 37 | + |
| 38 | +## 2) Quality bar and scoring rubric |
| 39 | + |
| 40 | +Each source is scored on five dimensions, each rated 1–5 (maximum total: 25): |
| 41 | + |
| 42 | +- **Coverage**: length and completeness (1950+ preferred) |
| 43 | +- **Methodological quality**: transparent definitions, data provenance, revision policy |
| 44 | +- **Accessibility**: open download/API, machine-readability, legal reuse clarity |
| 45 | +- **Updateability**: likely to keep updating for future dashboard refreshes |
| 46 | +- **Operational fit**: ease of integrating into normalized long-format pipeline |
| 47 | + |
| 48 | +### Tiers |
| 49 | +- **Tier A (include now)**: total score >= 22/25, no red flags |
| 50 | +- **Tier B (include selectively)**: 17–21, with caveats documented |
| 51 | +- **Tier C (monitor / not yet include)**: <= 16 or major access/provenance issues |
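
The tier thresholds above can be sketched as a small function. This is a minimal illustration, not part of the pipeline: the dimension keys and the `red_flag` parameter are hypothetical names, and it assumes that any red flag (a major access or provenance issue) forces Tier C regardless of the total.

```python
from typing import Mapping

# Hypothetical keys for the five rubric dimensions.
DIMENSIONS = (
    "coverage",
    "methodology",
    "accessibility",
    "updateability",
    "operational_fit",
)


def assign_tier(scores: Mapping[str, int], red_flag: bool = False) -> str:
    """Return 'A', 'B', or 'C' from five 1-5 ratings."""
    for dim in DIMENSIONS:
        if dim not in scores:
            raise ValueError(f"missing dimension: {dim}")
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be between 1 and 5")
    total = sum(scores[dim] for dim in DIMENSIONS)
    if red_flag:
        return "C"  # assumption: a red flag overrides the numeric score
    if total >= 22:
        return "A"
    if total >= 17:
        return "B"
    return "C"


# Ratings of 5, 5, 5, 5, 4 total 24, which lands in Tier A.
print(assign_tier(dict(zip(DIMENSIONS, (5, 5, 5, 5, 4)))))  # A
```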
| 52 | + |
| 53 | +--- |
| 54 | + |
| 55 | +## 3) Candidate source longlist (deep scan) |
| 56 | + |
| 57 | +### Compute hardware, semiconductors, AI compute |
| 58 | + |
| 59 | +| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier | |
| 60 | +|---|---|---:|---|---|---:|---| |
| 61 | +| Our World in Data (OWID) grapher | Transistors/chip, FLOPS/$, storage price | Varies; many start 1950–1980s | CSV + API-like URL params; open | Strong docs, reproducible, easy extraction; often sourced from underlying institutions | 24 | A | |
| 62 | +| OpenAlex | AI paper counts, citations by field | 1950s+ publication metadata | Open API + snapshots | Massive coverage, good for diffusion proxies; requires careful field classification | 22 | A | |
| 63 | +| Crossref | DOI publication trends, venue metadata | Broad historical coverage | Open API | Good scale; noisier than curated bibliometric sources | 19 | B | |
| 64 | +| arXiv bulk metadata | CS/AI preprint trends | 1991+ | Open bulk + API | Clean for modern AI era, not 1950-complete | 17 | B | |
| 65 | +| TOP500 | Frontier supercomputer LINPACK performance | 1993+, twice yearly | Public lists/download | Trusted frontier proxy; no pre-1993 coverage | 18 | B | |
| 66 | +| MLPerf results | Training/inference performance curves | ~2018+ | Public tables | High benchmark quality but short horizon | 14 | C | |
| 67 | +| Epoch AI dataset pages | Frontier model training compute | Mostly 2010+ | Public with mixed machine-readability | Substantive but methodology can evolve; verify each datapoint | 17 | B | |
| 68 | +| SemiAnalysis articles/data snippets | AI hardware economics | Recent | Mostly paywalled | Valuable context but weak reproducible access | 10 | C | |
| 69 | +| Semiconductor Industry Association (SIA) | Industry shipments/sales | Late 20th c.+ | Reports (mixed openness) | Useful aggregate indicators; sometimes paywalled details | 15 | C | |
| 70 | +| FRED (selected semiconductor indexes) | Price indexes, producer prices | Various; often decades | API | Reliable for macro proxies, less direct technical frontier | 20 | B | |
| 71 | + |
| 72 | +### Energy generation, storage, and power systems |
| 73 | + |
| 74 | +| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier | |
| 75 | +|---|---|---:|---|---|---:|---| |
| 76 | +| IEA Data Browser | Electricity generation tech mix, efficiency, prices | Often 1960+ | Mixed: many series free, some restricted | High quality official source; access terms vary | 21 | B | |
| 77 | +| EIA (US Energy Information Administration) | Generation costs, fuel efficiency, capacity | Many series 1949+ | Open API + downloads | Excellent long-run US series, consistent metadata | 24 | A | |
| 78 | +| Lazard LCOE reports | Utility-scale cost benchmarks | ~2009+ | Public PDF | Good for modern cost snapshots; short horizon, methodology changes | 14 | C | |
| 79 | +| NREL ATB | Forward-looking cost/performance assumptions | Mostly 2010+ | Public CSV | Great structure but not long-run historical primary source | 15 | C | |
| 80 | +| BP Statistical Review / Energy Institute | Global energy production/consumption | ~1965+ | Public spreadsheets | Long history, broadly used, transparent revisions | 22 | A | |
| 81 | +| Ember electricity datasets | Power-sector trends | mostly 2000+ | Open downloads | Strong recent coverage, not enough historical depth alone | 16 | C | |
| 82 | +| BNEF battery price announcements | Li-ion pack $/kWh | 2010s+ | Mostly press releases | Important indicator but source data often not fully open | 14 | C | |
| 83 | +| IRENA renewables cost database | Utility-scale technology costs | ~2010+ | Public reports/data | Good methodology, insufficient pre-2000 depth | 16 | C | |
| 84 | + |
| 85 | +### Communications and digital infrastructure |
| 86 | + |
| 87 | +| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier | |
| 88 | +|---|---|---:|---|---|---:|---| |
| 89 | +| ITU (International Telecommunication Union) | Broadband subscriptions, telecom diffusion | Several series from 1960s/70s | Public tables + API-like access | Canonical global telecom source; some usability friction | 22 | A | |
| 90 | +| OECD broadband portal | Prices, speeds, subscriptions | Mostly 1990s+ | Open tables/API | High trust for OECD countries; narrower geography | 20 | B | |
| 91 | +| World Bank WDI | Internet users, secure servers, telecom variables | Many series 1960+, depending on the variable | Open API + bulk CSV | Excellent operational access and metadata | 23 | A | |
| 92 | +| FCC data (US) | Broadband deployment/performance | Modern era | Public releases | Good US detail but short horizon | 15 | C | |
| 93 | +| Cisco annual internet reports (archival) | Traffic growth | 2000s+ | Public PDFs | Useful context but report methodology evolved | 13 | C | |
| 94 | + |
| 95 | +### Manufacturing, macro productivity, technology diffusion proxies |
| 96 | + |
| 97 | +| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier | |
| 98 | +|---|---|---:|---|---|---:|---| |
| 99 | +| Penn World Table (PWT) | TFP, capital deepening | 1950+ | Open downloads | Standard in growth accounting; model-based estimates need caveats | 22 | A | |
| 100 | +| EU KLEMS / World KLEMS | Sector productivity and ICT capital | Often 1970+ | Open downloads | Great granularity, weaker pre-1970 coverage | 20 | B | |
| 101 | +| Conference Board Total Economy Database | Productivity levels and growth | Mid-20th c.+ | Public downloads | High-quality macro benchmark | 21 | B | |
| 102 | +| UNIDO INDSTAT (where accessible) | Industrial output and intensity | Varies | Mixed access | Valuable, but access constraints in some components | 15 | C | |
| 103 | +| FRED macro series | Price deflators, productivity, industrial production | Many start pre-1950 | Open API | Strong operational backbone for normalization/deflation | 23 | A | |
| 104 | + |
| 105 | +### Agriculture and food technology |
| 106 | + |
| 107 | +| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier | |
| 108 | +|---|---|---:|---|---|---:|---| |
| 109 | +| FAOSTAT | Crop yields, inputs, production | 1961+ | Open API + bulk | Global, consistent, machine-readable | 23 | A | |
| 110 | +| USDA Quick Stats | US yields, adoption, productivity | Many series 1950+ | Open API | Excellent US depth and metadata | 24 | A | |
| 111 | +| Our World in Data agriculture grapher | Harmonized yield/diet/land series | Varies | Open CSV | Great dashboard-ready integration; verify underlying source transformations | 22 | A | |
| 112 | +| ASTI (agricultural R&D indicators) | R&D spending and personnel | Mostly 1980+ | Public tables | Important input proxy, but limited long-run depth | 17 | B | |
| 113 | + |
| 114 | +### Biotech and health technology |
| 115 | + |
| 116 | +| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier | |
| 117 | +|---|---|---:|---|---|---:|---| |
| 118 | +| NIH RePORTER / historical NIH funding | Biomedical R&D inputs | Mid-20th c.+ aggregates | API + files | Good funding signals; variable granularity over time | 20 | B | |
| 119 | +| FDA (Drugs@FDA, device approvals) | Approval counts/timelines | Modern electronic era primarily | Public DB | Robust for modern period; weak earlier machine-readable data | 16 | C | |
| 120 | +| CDC/NCHS (selected tech-related outcomes) | Mortality/outcome proxies | Long historical | Open datasets | Useful as outcomes, less direct as technical frontier measure | 18 | B | |
| 121 | +| GenBank growth statistics (NCBI) | Sequence database scale | 1980s+ | Public | Useful diffusion proxy; not direct performance/cost | 17 | B | |
| 122 | + |
| 123 | +### Patents, innovation intensity, cross-domain |
| 124 | + |
| 125 | +| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier | |
| 126 | +|---|---|---:|---|---|---:|---| |
| 127 | +| USPTO PatentsView | Patent counts, citations, classifications | 1976+ structured modern data; historical text back further | API + bulk | High-quality US patent backbone | 21 | B | |
| 128 | +| EPO PATSTAT (via access agreements) | Global patent panel data | Long historical | Restricted/licensed | Very strong data quality but access barriers | 15 | C | |
| 129 | +| WIPO IP statistics | Patent applications by country/field | Multi-decade | Public downloads | Strong international comparability at aggregate level | 21 | B | |
| 130 | +| OECD STAN / ANBERD / MSTI | R&D, industry innovation stats | Often 1970+ | Mixed (some open, some subscription) | High quality but access friction | 18 | B | |
| 131 | + |
| 132 | +--- |
| 133 | + |
| 134 | +## 4) Shortlist for immediate integration (recommended subset) |
| 135 | + |
| 136 | +These sources have the highest priority because they combine long-run coverage (starting at or near 1950), quality, and scriptable access: |
| 137 | + |
| 138 | +1. **OWID Grapher series** (cross-domain normalized quick wins) |
| 139 | +2. **EIA API** (US energy long-run, 1949+ in many series) |
| 140 | +3. **World Bank WDI API** (cross-country diffusion and infrastructure) |
| 141 | +4. **FAOSTAT + USDA Quick Stats** (agriculture and biological productivity anchors) |
| 142 | +5. **FRED API** (deflators + macro-normalization backbones) |
| 143 | +6. **PWT** (macro productivity context) |
| 144 | +7. **BP/Energy Institute Statistical Review** (global energy long-run) |
| 145 | +8. **ITU datasets** (communications diffusion) |
| 146 | + |
| 147 | +Secondary additions after MVP: |
| 148 | +- WIPO/PatentsView for innovation intensity overlays |
| 149 | +- TOP500 for compute frontier scaling (1993+) |
| 150 | +- Epoch AI model-compute series (careful provenance tagging) |
| 151 | + |
| 152 | +--- |
| 153 | + |
| 154 | +## 5) Proposed ongoing pipeline (for future source additions) |
| 155 | + |
| 156 | +### A. Repository structure |
| 157 | + |
| 158 | +- `posts/2026-02-14-tech-progress-dashboard.data-sources.llm.md` (this human-readable research log) |
| 159 | +- `posts/2026-02-14-tech-progress-dashboard.source-registry.llm.csv` (machine-readable source registry) |
| 160 | +- Future (optional): `posts/2026-02-14-tech-progress-dashboard.raw/` for fetched raw extracts |
| 161 | +- Future (optional): `posts/2026-02-14-tech-progress-dashboard.processed.csv` for normalized long-format panel |
| 162 | + |
| 163 | +### B. Source intake workflow |
| 164 | + |
| 165 | +1. **Discover** candidate source and add one row to registry with status `candidate` |
| 166 | +2. **Score** using rubric (coverage/method/access/updateability/fit) |
| 167 | +3. **License check**: verify legal reuse and citation requirements |
| 168 | +4. **Extraction test**: confirm scriptable fetch and parse |
| 169 | +5. **Quality audit**: |
| 170 | + - Missingness profile |
| 171 | + - Unit consistency |
| 172 | + - Breaks/redefinitions over time |
| 173 | + - Update cadence and revision handling |
| 174 | +6. **Decision gate**: |
| 175 | +   - Promote to `approved` if the source is Tier A, or a strong Tier B with its caveats documented |
| 176 | + - Otherwise set `watchlist` or `rejected` |
| 177 | +7. **Integrate** into dashboard canonical schema |
| 178 | +8. **Document** provenance note and assumptions in the post |
| 179 | + |
| 180 | +### C. Canonical schema for series integration |
| 181 | + |
| 182 | +For each final series keep: |
| 183 | +- `id`, `name`, `domain`, `metric_type`, `frontier_class`, `direction`, `unit` |
| 184 | +- `year`, `value` |
| 185 | +- `source_name`, `source_url` |
| 186 | +- `provenance_note` |
| 187 | +- `quality_tier`, `last_verified_at` |
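
One way to pin down the schema above is a typed record with one instance per `(id, year)` observation. This is only a sketch: the field names come from the list, but the Python types, the class name `Observation`, and the validation rules are assumptions, and the example values are placeholders rather than real data.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One long-format row: a single year/value pair plus series metadata."""

    id: str
    name: str
    domain: str
    metric_type: str
    frontier_class: str
    direction: str  # "higher_better" or "lower_better"
    unit: str
    year: int
    value: float
    source_name: str
    source_url: str
    provenance_note: str
    quality_tier: str  # "A", "B", or "C"
    last_verified_at: date  # assumed to be a calendar date

    def __post_init__(self) -> None:
        if self.direction not in {"higher_better", "lower_better"}:
            raise ValueError(f"unknown direction: {self.direction}")
        if self.quality_tier not in {"A", "B", "C"}:
            raise ValueError(f"unknown quality tier: {self.quality_tier}")


# Placeholder values for illustration only.
row = Observation(
    id="example_series",
    name="Example series",
    domain="energy",
    metric_type="cost_decline",
    frontier_class="example",
    direction="lower_better",
    unit="real USD per unit",
    year=1950,
    value=1.0,
    source_name="Example source",
    source_url="https://example.org/series",
    provenance_note="Placeholder row, not real data.",
    quality_tier="A",
    last_verified_at=date(2026, 2, 14),
)
print(sorted(asdict(row)))  # field names, ready to use as CSV columns
```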
| 188 | + |
| 189 | +### D. Quality controls before publish |
| 190 | + |
| 191 | +- Check each selected series has at least 8 observations (or explicit frontier exception) |
| 192 | +- Verify that each series has at least one citation URL that resolves |
| 193 | +- Validate that no series mixes nominal and real units without an explicit deflator |
| 194 | +- Confirm directionality (`higher_better` / `lower_better`) and transformations |
| 195 | +- Keep changelog of source revisions affecting historical points |
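
Two of the checks above (the 8-observation minimum and directionality) are easy to automate. The sketch below assumes rows are plain dictionaries with the canonical schema keys; the function names and the `frontier_exceptions` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Any, Iterable, Mapping

VALID_DIRECTIONS = {"higher_better", "lower_better"}


def series_below_minimum(
    rows: Iterable[Mapping[str, Any]],
    min_obs: int = 8,
    frontier_exceptions: frozenset = frozenset(),
) -> list:
    """Return ids of series with fewer than min_obs distinct non-null years."""
    years = defaultdict(set)
    for row in rows:
        bucket = years[row["id"]]  # register the series even if every value is null
        if row.get("value") is not None:
            bucket.add(row["year"])
    return sorted(
        sid
        for sid, seen in years.items()
        if len(seen) < min_obs and sid not in frontier_exceptions
    )


def series_with_bad_direction(rows: Iterable[Mapping[str, Any]]) -> list:
    """Return ids of series whose direction is missing or unrecognized."""
    return sorted(
        {row["id"] for row in rows if row.get("direction") not in VALID_DIRECTIONS}
    )


# Placeholder rows for illustration only.
rows = [
    {"id": "long", "year": 1950 + i, "value": float(i), "direction": "higher_better"}
    for i in range(8)
] + [
    {"id": "short", "year": 2000 + i, "value": float(i), "direction": "up"}
    for i in range(3)
]
print(series_below_minimum(rows))       # ['short']
print(series_with_bad_direction(rows))  # ['short']
```

Passing `frontier_exceptions=frozenset({"short"})` would exempt that series from the minimum, mirroring the explicit frontier exception.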
| 196 | + |
| 197 | +--- |
| 198 | + |
| 199 | +## 6) Known risks and mitigation |
| 200 | + |
| 201 | +- **Risk: methodology changes over time** (e.g., benchmark definitions) |
| 202 | + - Mitigation: version-tag each source and keep per-series caveats. |
| 203 | +- **Risk: paywalled sources become inaccessible** |
| 204 | + - Mitigation: prefer open replicas or use source only for contextual triangulation. |
| 205 | +- **Risk: frontier series are sparse and noisy** |
| 206 | +  - Mitigation: enforce the minimum-observation rule (section 5D) and run robustness checks on trend fits. |
| 207 | +- **Risk: mixed geographic scope (US-only vs global)** |
| 208 | +  - Mitigation: encode geography explicitly and avoid combining series with incompatible geographic scope or units. |
| 209 | + |
| 210 | +--- |
| 211 | + |
| 212 | +## 7) Next execution step |
| 213 | + |
| 214 | +For the next pass, prioritize building fetch scripts and adding the first approved series from: |
| 215 | +- OWID |
| 216 | +- EIA |
| 217 | +- World Bank WDI |
| 218 | +- FAOSTAT |
| 219 | +- USDA Quick Stats |
| 220 | +- FRED |
| 221 | + |
| 222 | +Then rerun the dashboard with a provenance table showing each series' quality tier and last-verified date. |