Commit 3ce9b1f

Merge pull request #4 from tecunningham/codex/conduct-literature-search-for-data-sources
Add local tech-dashboard ingestion pipeline and mark LLM-generated files
2 parents 1c3dc2f + cf21f98 commit 3ce9b1f

11 files changed, +13106 -1 lines changed
Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
# Tech Progress Dashboard — Data Source Landscape (Research Log)

_Last updated: 2026-02-14_

## 1) Scope definition (v1)

### Goal
Build a maintainable, high-trust dataset of **technology progress time series** for use in `posts/2026-02-14-tech-progress-dashboard.llm.qmd`.

### Time scope
- Include data points from **1950 onward**.
- Prefer series with annual observations; allow coarser frequencies only when they represent key frontier milestones.

### Metric scope
Include series that track one of the following:
1. **Cost decline** (e.g., inflation-adjusted $/unit performance)
2. **Performance improvement** (e.g., speed, efficiency, accuracy, yield)
3. **Scale and diffusion** (e.g., deployment, adoption, capacity)
4. **Input intensity** (e.g., compute used in training, R&D effort)

### Domain scope
- Compute hardware and semiconductors
- AI/ML model capability and compute
- Energy generation and storage
- Communications and networking
- Biotech/health technology
- Manufacturing and industrial productivity
- Agriculture (selected benchmark technologies)
- Transportation systems

### Out-of-scope (for now)
- Pre-1950 historical reconstructions
- Single-paper one-off benchmark results with no replicable update path
- Data locked behind expensive subscriptions unless a consistent legal extraction path exists

---

## 2) Quality bar and scoring rubric

Each source is scored from 1 to 5 on five dimensions, for a maximum total of 25:

- **Coverage**: length and completeness (1950+ preferred)
- **Methodological quality**: transparent definitions, data provenance, revision policy
- **Accessibility**: open download/API, machine-readability, legal reuse clarity
- **Updateability**: likely to keep updating for future dashboard refreshes
- **Operational fit**: ease of integrating into normalized long-format pipeline

### Tiers
- **Tier A (include now)**: total score >= 22/25, no red flags
- **Tier B (include selectively)**: 17–21, with caveats documented
- **Tier C (monitor / not yet include)**: <= 16 or major access/provenance issues
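
The rubric and tier thresholds above can be sketched as a small scoring helper. This is only an illustration, not code from the commit: the function names, the dimension keys, and the rule that any red flag forces Tier C are assumptions (the text only says Tier A requires no red flags and that major access/provenance issues mean Tier C). The example scores are the OWID row from the source registry, which sums to 24 and lands in Tier A.

```python
# Hypothetical helper names; the thresholds come from the tier list above.
DIMENSIONS = ("coverage", "method", "access", "updateability", "operational_fit")


def total_score(scores: dict[str, int]) -> int:
    """Sum the five rubric dimensions, each of which must be between 1 and 5."""
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} must be between 1 and 5, got {scores[dim]}")
    return sum(scores[dim] for dim in DIMENSIONS)


def assign_tier(total: int, red_flag: bool = False) -> str:
    """Map a total score (5 to 25) to a tier letter."""
    if red_flag:  # assumption: any red flag forces Tier C
        return "C"
    if total >= 22:
        return "A"
    if total >= 17:
        return "B"
    return "C"


# Component scores of the owid_grapher registry row: 5 + 5 + 5 + 4 + 5 = 24.
example = {"coverage": 5, "method": 5, "access": 5, "updateability": 4, "operational_fit": 5}
print(total_score(example), assign_tier(total_score(example)))  # 24 A
```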

---

## 3) Candidate source longlist (deep scan)

### Compute hardware, semiconductors, AI compute

| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier |
|---|---|---:|---|---|---:|---|
| Our World in Data (OWID) grapher | Transistors/chip, FLOPS/$, storage price | Varies; many start 1950–1980s | CSV + API-like URL params; open | Strong docs, reproducible, easy extraction; often sourced from underlying institutions | 24 | A |
| OpenAlex | AI paper counts, citations by field | 1950s+ publication metadata | Open API + snapshots | Massive coverage, good for diffusion proxies; requires careful field classification | 22 | A |
| Crossref | DOI publication trends, venue metadata | Broad historical coverage | Open API | Good scale; noisier than curated bibliometric sources | 19 | B |
| arXiv bulk metadata | CS/AI preprint trends | 1991+ | Open bulk + API | Clean for modern AI era, not 1950-complete | 17 | B |
| TOP500 | Frontier supercomputer LINPACK performance | 1993+, twice yearly | Public lists/download | Trusted frontier proxy; no 1950s coverage | 18 | B |
| MLPerf results | Training/inference performance curves | ~2018+ | Public tables | High benchmark quality but short horizon | 14 | C |
| Epoch AI dataset pages | Frontier model training compute | Mostly 2010+ | Public with mixed machine-readability | Substantive but methodology can evolve; verify each datapoint | 17 | B |
| SemiAnalysis articles/data snippets | AI hardware economics | Recent | Mostly paywalled | Valuable context but weak reproducible access | 10 | C |
| Semiconductor Industry Association (SIA) | Industry shipments/sales | Late 20th c.+ | Reports (mixed openness) | Useful aggregate indicators; sometimes paywalled details | 15 | C |
| FRED (selected semiconductor indexes) | Price indexes, producer prices | Various; often decades | API | Reliable for macro proxies, less direct technical frontier | 20 | B |

### Energy generation, storage, and power systems

| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier |
|---|---|---:|---|---|---:|---|
| IEA Data Browser | Electricity generation tech mix, efficiency, prices | Often 1960+ | Mixed: many series free, some restricted | High quality official source; access terms vary | 21 | B |
| EIA (US Energy Information Administration) | Generation costs, fuel efficiency, capacity | Many series 1949+ | Open API + downloads | Excellent long-run US series, consistent metadata | 24 | A |
| Lazard LCOE reports | Utility-scale cost benchmarks | ~2009+ | Public PDF | Good for modern cost snapshots; short horizon, methodology changes | 14 | C |
| NREL ATB | Forward-looking cost/performance assumptions | Mostly 2010+ | Public CSV | Great structure but not long-run historical primary source | 15 | C |
| BP Statistical Review / Energy Institute | Global energy production/consumption | ~1965+ | Public spreadsheets | Long history, broadly used, transparent revisions | 22 | A |
| Ember electricity datasets | Power-sector trends | mostly 2000+ | Open downloads | Strong recent coverage, not enough historical depth alone | 16 | C |
| BNEF battery price announcements | Li-ion pack $/kWh | 2010s+ | Mostly press releases | Important indicator but source data often not fully open | 14 | C |
| IRENA renewables cost database | Utility-scale technology costs | ~2010+ | Public reports/data | Good methodology, insufficient pre-2000 depth | 16 | C |

### Communications and digital infrastructure

| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier |
|---|---|---:|---|---|---:|---|
| ITU (International Telecommunication Union) | Broadband subscriptions, telecom diffusion | Several series from 1960s/70s | Public tables + API-like access | Canonical global telecom source; some usability friction | 22 | A |
| OECD broadband portal | Prices, speeds, subscriptions | Mostly 1990s+ | Open tables/API | High trust for OECD countries; narrower geography | 20 | B |
| World Bank WDI | Internet users, secure servers, telecom variables | Many series 1960+, depending on variable | Open API + bulk CSV | Excellent operational access and metadata | 23 | A |
| FCC data (US) | Broadband deployment/performance | Modern era | Public releases | Good US detail but short horizon | 15 | C |
| Cisco annual internet reports (archival) | Traffic growth | 2000s+ | Public PDFs | Useful context but report methodology evolved | 13 | C |

### Manufacturing, macro productivity, technology diffusion proxies

| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier |
|---|---|---:|---|---|---:|---|
| Penn World Table (PWT) | TFP, capital deepening | 1950+ | Open downloads | Standard in growth accounting; model-based estimates need caveats | 22 | A |
| EU KLEMS / World KLEMS | Sector productivity and ICT capital | Often 1970+ | Open downloads | Great granularity, weaker pre-1970 coverage | 20 | B |
| Conference Board Total Economy Database | Productivity levels and growth | Mid-20th c.+ | Public downloads | High-quality macro benchmark | 21 | B |
| UNIDO INDSTAT (where accessible) | Industrial output and intensity | Varies | Mixed access | Valuable, but access constraints in some components | 15 | C |
| FRED macro series | Price deflators, productivity, industrial production | Many start pre-1950 | Open API | Strong operational backbone for normalization/deflation | 23 | A |

### Agriculture and food technology

| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier |
|---|---|---:|---|---|---:|---|
| FAOSTAT | Crop yields, inputs, production | 1961+ | Open API + bulk | Global, consistent, machine-readable | 23 | A |
| USDA Quick Stats | US yields, adoption, productivity | Many series 1950+ | Open API | Excellent US depth and metadata | 24 | A |
| Our World in Data agriculture grapher | Harmonized yield/diet/land series | Varies | Open CSV | Great dashboard-ready integration; verify underlying source transformations | 22 | A |
| ASTI (agricultural R&D indicators) | R&D spending and personnel | Mostly 1980+ | Public tables | Important input proxy, but limited long-run depth | 17 | B |

### Biotech and health technology

| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier |
|---|---|---:|---|---|---:|---|
| NIH RePORTER / historical NIH funding | Biomedical R&D inputs | Mid-20th c.+ aggregates | API + files | Good funding signals; variable granularity over time | 20 | B |
| FDA (Drugs@FDA, device approvals) | Approval counts/timelines | Modern electronic era primarily | Public DB | Robust for modern period; weak earlier machine-readable data | 16 | C |
| CDC/NCHS (selected tech-related outcomes) | Mortality/outcome proxies | Long historical | Open datasets | Useful as outcomes, less direct as technical frontier measure | 18 | B |
| GenBank growth statistics (NCBI) | Sequence database scale | 1980s+ | Public | Useful diffusion proxy; not direct performance/cost | 17 | B |

### Patents, innovation intensity, cross-domain

| Source | Example metrics | Coverage | Access | Quality notes | Score | Tier |
|---|---|---:|---|---|---:|---|
| USPTO PatentsView | Patent counts, citations, classifications | 1976+ structured modern data; historical text back further | API + bulk | High-quality US patent backbone | 21 | B |
| EPO PATSTAT (via access agreements) | Global patent panel data | Long historical | Restricted/licensed | Very strong data quality but access barriers | 15 | C |
| WIPO IP statistics | Patent applications by country/field | Multi-decade | Public downloads | Strong international comparability at aggregate level | 21 | B |
| OECD STAN / ANBERD / MSTI | R&D, industry innovation stats | Often 1970+ | Mixed (some open, some subscription) | High quality but access friction | 18 | B |

---

## 4) Shortlist for immediate integration (recommended subset)

These sources have the highest priority because they combine 1950+ relevance, quality, and accessible extraction:

1. **OWID Grapher series** (cross-domain normalized quick wins)
2. **EIA API** (US energy long-run, 1949+ in many series)
3. **World Bank WDI API** (cross-country diffusion and infrastructure)
4. **FAOSTAT + USDA Quick Stats** (agriculture and biological productivity anchors)
5. **FRED API** (deflators + macro-normalization backbones)
6. **PWT** (macro productivity context)
7. **BP/Energy Institute Statistical Review** (global energy long-run)
8. **ITU datasets** (communications diffusion)

Secondary additions after MVP:
- WIPO/PatentsView for innovation intensity overlays
- TOP500 for compute frontier scaling (1993+)
- Epoch AI model-compute series (careful provenance tagging)

---

## 5) Proposed ongoing pipeline (for future source additions)

### A. Repository structure

- `posts/2026-02-14-tech-progress-dashboard.data-sources.llm.md` (this human-readable research log)
- `posts/2026-02-14-tech-progress-dashboard.source-registry.llm.csv` (machine-readable source registry)
- Future (optional): `posts/2026-02-14-tech-progress-dashboard.raw/` for fetched raw extracts
- Future (optional): `posts/2026-02-14-tech-progress-dashboard.processed.csv` for normalized long-format panel

### B. Source intake workflow

1. **Discover** a candidate source and add one row to the registry with status `candidate`
2. **Score** using rubric (coverage/method/access/updateability/fit)
3. **License check**: verify legal reuse and citation requirements
4. **Extraction test**: confirm scriptable fetch and parse
5. **Quality audit**:
   - Missingness profile
   - Unit consistency
   - Breaks/redefinitions over time
   - Update cadence and revision handling
6. **Decision gate**:
   - Promote to `approved` if Tier A or strong Tier B with caveats
   - Otherwise set `watchlist` or `rejected`
7. **Integrate** into dashboard canonical schema
8. **Document** provenance note and assumptions in the post
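
The decision gate in step 6 could be expressed as a small function, roughly as below. This is a sketch under stated assumptions rather than the pipeline's actual logic: the function name and parameters are hypothetical, "strong Tier B" is reduced to a caller-supplied flag, and the choice to send license failures to `rejected` and extraction failures to `watchlist` is one plausible reading of "otherwise set `watchlist` or `rejected`".

```python
VALID_TIERS = {"A", "B", "C"}


def decide_status(tier: str, license_ok: bool, extraction_ok: bool, strong_b: bool = False) -> str:
    """Return the registry status for a scored candidate source.

    Assumptions (not stated in the workflow text):
    - a failed license check means the source is rejected outright;
    - a failed extraction test parks the source on the watchlist;
    - `strong_b` is a reviewer judgment that a Tier B source has documented caveats.
    """
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    if not license_ok:
        return "rejected"
    if not extraction_ok:
        return "watchlist"
    if tier == "A" or (tier == "B" and strong_b):
        return "approved"
    return "watchlist"


print(decide_status("A", license_ok=True, extraction_ok=True))   # approved
print(decide_status("B", license_ok=True, extraction_ok=True))   # watchlist
print(decide_status("A", license_ok=False, extraction_ok=True))  # rejected
```

Keeping the gate as a pure function of the earlier steps' outcomes makes each status in the registry reproducible from recorded inputs.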

### C. Canonical schema for series integration

For each final series keep:
- `id`, `name`, `domain`, `metric_type`, `frontier_class`, `direction`, `unit`
- `year`, `value`
- `source_name`, `source_url`
- `provenance_note`
- `quality_tier`, `last_verified_at`
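
One way to pin this schema down is a typed record plus a CSV writer, sketched below. The field names are taken from the list above; the field types, the class and function names, and every value in the example row are illustrative assumptions, not data from the dashboard.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class SeriesPoint:
    """One long-format observation; types are assumed, names follow the schema."""
    id: str
    name: str
    domain: str
    metric_type: str
    frontier_class: str
    direction: str  # "higher_better" or "lower_better"
    unit: str
    year: int
    value: float
    source_name: str
    source_url: str
    provenance_note: str
    quality_tier: str
    last_verified_at: str  # assumed ISO date string


def to_csv(points: list[SeriesPoint]) -> str:
    """Serialize points to CSV text with one column per schema field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=[f.name for f in fields(SeriesPoint)], lineterminator="\n"
    )
    writer.writeheader()
    for point in points:
        writer.writerow(asdict(point))
    return buffer.getvalue()


# Placeholder row for illustration only; none of these values are real data.
example = SeriesPoint(
    id="example_series", name="Example series", domain="example_domain",
    metric_type="cost", frontier_class="example", direction="lower_better",
    unit="usd_per_unit", year=2000, value=1.0, source_name="Example source",
    source_url="https://example.org/series", provenance_note="placeholder",
    quality_tier="A", last_verified_at="2026-02-14",
)
print(to_csv([example]))
```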

### D. Quality controls before publish

- Check that each selected series has at least 8 observations (or an explicit frontier exception)
- Verify at least one citation URL resolves
- Validate that nominal and real units are not mixed without an explicit deflator
- Confirm directionality (`higher_better` / `lower_better`) and transformations
- Keep changelog of source revisions affecting historical points
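
A few of these checks lend themselves to automation. The sketch below covers the minimum-observation rule, directionality, and a single-unit check as a rough proxy for the nominal/real rule; URL resolution and the revision changelog are left out because they need network access and history. All names are hypothetical, and the rows are assumed to be dictionaries with the `id`, `unit`, and `direction` fields of the canonical schema. The demo at the end uses placeholder rows and reports one problem, for the series with only three points.

```python
from collections import defaultdict

MIN_OBSERVATIONS = 8
VALID_DIRECTIONS = {"higher_better", "lower_better"}


def check_series(rows: list[dict], frontier_exceptions: frozenset = frozenset()) -> list[str]:
    """Return human-readable problems found in long-format rows, grouped by series id."""
    by_id = defaultdict(list)
    for row in rows:
        by_id[row["id"]].append(row)

    problems = []
    for series_id, points in by_id.items():
        # Minimum-observation rule, with explicit frontier exceptions.
        if len(points) < MIN_OBSERVATIONS and series_id not in frontier_exceptions:
            problems.append(f"{series_id}: only {len(points)} observations")
        # Every point in a series must share one valid direction.
        directions = {p["direction"] for p in points}
        if len(directions) != 1 or not directions <= VALID_DIRECTIONS:
            problems.append(f"{series_id}: invalid or inconsistent direction {sorted(directions)}")
        # Assumption: more than one unit in a series signals mixed nominal/real values.
        units = {p["unit"] for p in points}
        if len(units) > 1:
            problems.append(f"{series_id}: mixed units {sorted(units)}")
    return problems


# Placeholder rows: one series with 8 points, one with only 3.
demo = [{"id": "long", "unit": "u", "direction": "higher_better"} for _ in range(8)]
demo += [{"id": "short", "unit": "u", "direction": "lower_better"} for _ in range(3)]
print(check_series(demo))
```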

---

## 6) Known risks and mitigation

- **Risk: methodology changes over time** (e.g., benchmark definitions)
  - Mitigation: version-tag each source and keep per-series caveats.
- **Risk: paywalled sources become inaccessible**
  - Mitigation: prefer open replicas or use source only for contextual triangulation.
- **Risk: frontier series are sparse and noisy**
  - Mitigation: require minimum-point rule + robustness checks on trend fits.
- **Risk: mixed geographic scope (US-only vs global)**
  - Mitigation: encode geography explicitly and avoid combining incompatible units.

---

## 7) Next execution step

For the next pass, prioritize building fetch scripts and adding first approved series from:
- OWID
- EIA
- World Bank WDI
- FAOSTAT
- USDA Quick Stats
- FRED

Then rerun the dashboard with a provenance table showing quality tier and last-verified date.

posts/2026-02-14-tech-progress-dashboard.qmd renamed to posts/2026-02-14-tech-progress-dashboard.llm.qmd

Lines changed: 11 additions & 1 deletion
@@ -8,6 +8,16 @@ format: html
 
 <script src="https://cdn.plot.ly/plotly-2.35.2.min.js"></script>
 
+## Data-source scope and research log
+
+- Scope is currently restricted to **1950 onward** data.
+- Source search, quality scoring, and pipeline notes live in:
+  - `posts/2026-02-14-tech-progress-dashboard.data-sources.llm.md`
+  - `posts/2026-02-14-tech-progress-dashboard.source-registry.llm.csv`
+- Local normalized ingestion output lives in:
+  - `posts/data/tech-progress-dashboard.llm/normalized-series.llm.csv`
+- Refresh command: `python tools/ingest_tech_progress_dashboard.py`
+
 <style>
 #tpd-controls {
 display: grid;
@@ -514,4 +524,4 @@ This MVP is built to support **forecasting and measuring AI's impact on technolo
 5. Data inclusion threshold: at least 4 time points, direction-of-improvement defined, source URL recorded.
 6. Fitting method: per-series log-linear fit on historical data (`log(value) = a + b * year`).
 7. v1 excludes causal attribution and forecasting beyond observed history.
-8. v1 data are seed-series and should be upgraded to automated ingestion in a follow-up.
+8. v1 data are seed-series and should be upgraded to automated ingestion in a follow-up.
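
The fitting method in item 6, `log(value) = a + b * year`, can be illustrated with a closed-form least-squares fit. This is a sketch, not the dashboard's own code: the post does not say which estimator or log base it uses, so ordinary least squares with the natural log is an assumption, and `doubling_time_years` is an added hypothetical helper. The demo series is synthetic (it doubles every year), so the fitted slope is ln 2 and the doubling time is one year.

```python
import math


def fit_log_linear(years: list[float], values: list[float]) -> tuple[float, float]:
    """Fit log(value) = a + b * year by ordinary least squares; return (a, b)."""
    if len(years) != len(values) or len(years) < 2:
        raise ValueError("need at least two (year, value) pairs of equal length")
    if any(v <= 0 for v in values):
        raise ValueError("log-linear fit requires strictly positive values")
    logs = [math.log(v) for v in values]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("years must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def doubling_time_years(b: float) -> float:
    """Years for the fitted value to double; negative when the series is declining."""
    return math.log(2) / b


# Synthetic series that doubles every year (not real data).
a, b = fit_log_linear([2000, 2001, 2002, 2003, 2004], [1, 2, 4, 8, 16])
print(round(b, 6), round(doubling_time_years(b), 6))
```

For a `lower_better` cost series the slope is negative, and the magnitude of the returned value is the halving time.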
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
source_id,source_name,domain_group,example_metrics,coverage_start,coverage_end,access_mode,coverage_score,method_score,access_score,updateability_score,operational_fit_score,total_score,tier,status,notes
owid_grapher,Our World in Data Grapher,multi-domain,"transistors_per_chip; gpu_perf_per_dollar; crop_yields",1950,2025,open_csv,5,5,5,4,5,24,A,approved,"Strong first-line source; verify underlying original source for each series"
eia_api,US EIA API,energy,"generation; prices; efficiency",1949,2025,open_api,5,5,5,5,4,24,A,approved,"Excellent long-run US energy coverage"
world_bank_wdi,World Bank WDI,communications_macro,"internet_users; broadband; infrastructure",1960,2025,open_api,5,4,5,4,5,23,A,approved,"Cross-country comparability with strong metadata"
faostat,FAOSTAT,agriculture,"yields; production; inputs",1961,2025,open_api,5,5,5,4,4,23,A,approved,"Global ag backbone"
usda_quickstats,USDA Quick Stats,agriculture,"us crop yields; adoption",1950,2025,open_api,5,5,5,4,5,24,A,approved,"Deep US historical coverage"
fred_api,FRED,macro,"deflators; productivity; industrial indexes",1950,2025,open_api,5,5,5,4,4,23,A,approved,"Normalization backbone and supporting indicators"
penn_world_table,Penn World Table,macro,"tfp; capital; gdp per worker",1950,2023,open_download,5,5,4,4,4,22,A,approved,"Model-based estimates require caveats"
energy_institute_review,Energy Institute Statistical Review,energy,"global energy production and use",1965,2024,open_spreadsheet,4,5,4,4,5,22,A,approved,"Long-run global energy comparisons"
itu_data,ITU datasets,communications,"subscriptions; access diffusion",1960,2025,open_table,5,5,4,4,4,22,A,approved,"Canonical telecom source"
top500,TOP500,compute,"frontier_supercomputer_flops",1993,2025,public_download,3,5,4,5,4,21,B,candidate,"Frontier compute proxy; post-1993 only"
patentsview,USPTO PatentsView,innovation,"patent_counts; citations",1976,2025,open_api,4,4,5,4,4,21,B,candidate,"Great for innovation intensity overlays"
wipo_stats,WIPO IP Statistics,innovation,"country patent applications",1970,2024,public_download,4,5,4,4,4,21,B,candidate,"Global aggregate patent trends"
epoch_ai,Epoch AI model datasets,ai_models,"training_compute; parameter_counts",2010,2025,public_web,2,4,4,4,3,17,B,watchlist,"Useful but needs per-point provenance auditing"
bnef_battery,BloombergNEF battery prices,energy_storage,"li_ion_pack_price",2013,2025,press_release,2,4,2,4,2,14,C,watchlist,"Important metric but restricted underlying dataset"
mlperf,MLPerf,ai_benchmarks,"training_time; inference_perf",2018,2025,public_table,1,5,4,5,3,18,B,watchlist,"High quality but short horizon for 1950+ dashboard"
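
Because the registry stores both the five component scores and a `total_score` and `tier`, the rows can be checked for internal consistency. The sketch below is an assumption-laden illustration, not part of the commit: the function names are hypothetical, the tier thresholds are those of the research log's rubric, and red flags (which could justify a lower tier than the score implies) are ignored. The sample embeds the header and two rows copied from the registry above.

```python
import csv
import io

SCORE_COLUMNS = (
    "coverage_score", "method_score", "access_score",
    "updateability_score", "operational_fit_score",
)


def tier_for(total: int) -> str:
    """Tier implied by the score alone (A >= 22, B 17 to 21, C <= 16)."""
    return "A" if total >= 22 else "B" if total >= 17 else "C"


def audit_registry(csv_text: str) -> list[str]:
    """Return one message per row whose total or tier disagrees with its components."""
    issues = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        computed = sum(int(row[col]) for col in SCORE_COLUMNS)
        if computed != int(row["total_score"]):
            issues.append(
                f"{row['source_id']}: total_score {row['total_score']} != component sum {computed}"
            )
        if tier_for(computed) != row["tier"]:
            issues.append(
                f"{row['source_id']}: tier {row['tier']} != expected {tier_for(computed)}"
            )
    return issues


SAMPLE = """source_id,source_name,domain_group,example_metrics,coverage_start,coverage_end,access_mode,coverage_score,method_score,access_score,updateability_score,operational_fit_score,total_score,tier,status,notes
owid_grapher,Our World in Data Grapher,multi-domain,"transistors_per_chip; gpu_perf_per_dollar; crop_yields",1950,2025,open_csv,5,5,5,4,5,24,A,approved,"Strong first-line source; verify underlying original source for each series"
mlperf,MLPerf,ai_benchmarks,"training_time; inference_perf",2018,2025,public_table,1,5,4,5,3,18,B,watchlist,"High quality but short horizon for 1950+ dashboard"
"""

print(audit_registry(SAMPLE))  # [] because both sample rows are internally consistent
```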
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
# Tech Progress Dashboard local data cache

This folder contains locally cached and normalized data used by:
- `posts/2026-02-14-tech-progress-dashboard.llm.qmd`

## Files
- `normalized-series.llm.csv`: canonical long-format panel used for chart ingestion.
- `ingest-log.llm.md`: last ingestion summary (counts and replacement coverage).
- `raw/`: raw downloaded source extracts (currently OWID grapher CSVs).

## Refresh
Run:

```bash
python tools/ingest_tech_progress_dashboard.py
```

The script parses seed rows from the post, refreshes downloadable sources, and writes normalized output.
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
# Tech Progress Dashboard ingestion log

- Source post: `posts/2026-02-14-tech-progress-dashboard.llm.qmd`
- Seed rows parsed from embedded JS: **84**
- Series replaced with fresh OWID downloads: **transistors_per_chip, us_corn_yield, solar_pv_module_price, hdd_cost_per_gb**
- Final normalized rows: **260**

## Notes
- This pipeline keeps seed data for sources that are not yet script-downloaded (e.g., BNEF, Green500, NHGRI page-based data).
- All rows are normalized to a single long-format CSV for local dashboard ingestion.
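
The replace-or-keep behaviour described in this log can be sketched as follows. The contents of `tools/ingest_tech_progress_dashboard.py` are not shown in this commit view, so this is a guess at the general shape rather than the script's logic: the function name is hypothetical, rows are assumed to be dictionaries with `id` and `year` keys, and the demo values are placeholders, not dashboard data.

```python
def merge_seed_and_fresh(seed_rows: list[dict], fresh_rows: list[dict]) -> tuple[list[dict], list[str]]:
    """Replace whole seed series with fresh downloads; keep seed data for all other series.

    Returns the merged long-format rows (sorted by id, then year) and the sorted
    list of series ids that were replaced.
    """
    fresh_ids = {row["id"] for row in fresh_rows}
    # Drop every seed row belonging to a series that has a fresh download.
    kept = [row for row in seed_rows if row["id"] not in fresh_ids]
    merged = sorted(kept + list(fresh_rows), key=lambda row: (row["id"], row["year"]))
    return merged, sorted(fresh_ids)


# Placeholder rows for illustration only.
seed = [
    {"id": "series_a", "year": 2000, "value": 1.0},
    {"id": "series_b", "year": 2000, "value": 5.0},
]
fresh = [
    {"id": "series_a", "year": 2000, "value": 1.1},
    {"id": "series_a", "year": 2001, "value": 1.3},
]
merged, replaced = merge_seed_and_fresh(seed, fresh)
print(len(merged), replaced)  # 3 ['series_a']
```

Replacing a series wholesale, rather than patching individual years, avoids mixing seed and downloaded values within one series.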
