Commit 81f9661
ETFs/Funds data quality + cross-asset invariant test + automated README stats
Per maintainer feedback on the initial version, this PR drops the
`country` column addition for ETFs/Funds — the country semantics for
those asset classes (investment scope, not listing-exchange country)
need a richer signal than `exchange -> country` can give. What
remains is still substantial:
## 1. Data-quality fixes on `etfs.csv` (surfaced by the new invariant)
- **14 non-ETF rows removed**: `^REIT` plus 13 US-equity duplicates
that already existed correctly in `equities.csv` (`BHF`, `BHFAN`,
`BHFAO`, `BHFAP`, `DTB`, `DTE`, `DTP`, `HTGC`, `PBC`, `PSEC`, `RGA`,
`RZA`, `TPVG`).
- **56 cross-asset symbol collisions removed** — all 56 were corporate
bonds / senior notes / equity share-class rows misclassified as ETFs
(Brighthouse, Corvus Gold, Great Ajax Corp. notes, Argo notes,
CMS Energy junior subordinated notes, Conifer Holdings senior notes,
Qwest Corp notes, DTE Energy variants, `ASGI` = Aberdeen Global
Infrastructure Income Fund, etc.).
- **29 ETF rows with corrupted `exchange` values fixed** — issuer name
written into the exchange column instead of a real exchange code
(`Xtrackers`, `Fundlogic`, `Purpose Investments`, `CI Investments`,
`Horizons ETFs Management`, `Harvest Portfolios Group`,
`IA Clarington Investments`, `National Bank Investments`,
`Caldwell Investment Management`, `Developed Markets`,
`Emerging Markets`, `High Yield Bonds`). Re-derived from ticker
suffix.
- **`FSST` completed** (Fidelity Sustainable U.S. Equity ETF) — row
previously had every field NaN. Filled name, currency, summary,
category_group, category, family, exchange (`PCX`), isin.
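The suffix-based repair of the `exchange` column described above could look roughly like the sketch below. The suffix table, the exchange codes in it, and the sample symbols are hypothetical placeholders; the actual mapping used for the 29 rows is not shown in this description.

```python
# Hypothetical suffix -> exchange-code table (an assumption for illustration only).
SUFFIX_TO_EXCHANGE = {
    "TO": "TOR",
    "DE": "GER",
    "L": "LSE",
}

# Values treated as real exchange codes; anything else is a candidate for repair.
KNOWN_EXCHANGES = set(SUFFIX_TO_EXCHANGE.values()) | {"PCX"}


def rederive_exchange(symbol, exchange):
    """Return a repaired exchange code, or the original value if no repair applies."""
    if exchange in KNOWN_EXCHANGES:
        return exchange  # already a real exchange code
    _, dot, suffix = symbol.rpartition(".")
    if dot and suffix in SUFFIX_TO_EXCHANGE:
        return SUFFIX_TO_EXCHANGE[suffix]
    return exchange  # no usable suffix: leave untouched rather than guess


# Made-up rows: (symbol, value currently stored in the exchange column).
rows = [
    ("XYZ.TO", "Purpose Investments"),
    ("ABC.DE", "Xtrackers"),
    ("FSST", "PCX"),
    ("NOSUFFIX", "Fundlogic"),
]
fixed = [(symbol, rederive_exchange(symbol, exchange)) for symbol, exchange in rows]
```

Rows without a recognizable suffix are deliberately left unchanged so they can be reviewed by hand instead of being assigned a guessed exchange.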
## 2. New cross-asset invariant test
`tests/test_invariants.py::test_no_symbol_collisions_across_asset_classes`
asserts that a given `symbol` belongs to at most one of `equities.csv`,
`etfs.csv`, `funds.csv`. This is the test that surfaced the 56 + 14 = 70
cleanup rows above; had it been in place earlier, it would have caught
all of them before they landed on `main`.
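A minimal sketch of the check this invariant performs, using only the standard library and in-memory sample data. The helper name `find_symbol_collisions` and the sample rows are illustrative assumptions; the real test in `tests/test_invariants.py` may be structured differently.

```python
import csv
import io
from collections import defaultdict


def find_symbol_collisions(sources):
    """Map each symbol found in more than one asset class to those classes."""
    seen = defaultdict(set)
    for asset_class, handle in sources.items():
        for row in csv.DictReader(handle):
            seen[row["symbol"]].add(asset_class)
    return {
        symbol: sorted(classes)
        for symbol, classes in seen.items()
        if len(classes) > 1
    }


# Illustrative stand-ins for equities.csv, etfs.csv and funds.csv.
sample = {
    "equities": io.StringIO("symbol,name\nBHF,Example Equity A\nDTE,Example Equity B\n"),
    "etfs": io.StringIO("symbol,name\nBHF,Example Equity A\nFSST,Example ETF\n"),
    "funds": io.StringIO("symbol,name\n"),
}
collisions = find_symbol_collisions(sample)

# The invariant is then simply `assert not collisions`; here BHF violates it.
```

Returning the offending symbols together with their asset classes, rather than a bare boolean, makes a failing run directly actionable.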
## 3. Automated README statistics
`.github/workflows/database_update.yml` gains a new
`Update-README-Statistics` job that runs after the existing
`Add-New-Ticker` / `Update-Compression-Files` / `Update-Categorization-Files`
jobs. It recomputes the statistics tables from the on-disk CSVs and
rewrites `README.md` in-place, committing the result. The
`Check-GICS-Categorisation` job now also depends on it.
The README is now split into three tables that reflect the actual schema
of each asset class, replacing the previous combined-table layout
whose `Countries` column for ETFs/Funds had no backing data:
```
Table A — Equities only (has country, sector, industry)
| Product | Quantity | Sectors | Industries | Countries | Exchanges |
| Equities | 160.113 | 11 | 62 | 113 | 84 |
Table B — ETFs / Funds (no country: no schema for it)
| Product | Quantity | Families | Categories | Exchanges |
| ETFs | 36.485 | 320 | 51 | 51 |
| Funds | 57.853 | 1.540 | 74 | 33 |
Table C — Currencies / Cryptos / Indices / Money Markets
| Product | Quantity | Category |
| Currencies | 2.556 | 175 Currencies |
| Cryptocurrencies | 3.367 | 351 Cryptocurrencies |
| Indices | 91.178 | 63 Exchanges |
| Money Markets | 1.367 | 2 Exchanges |
```
The B-table columns are renamed from the previous "Sectors / Industries"
(which were really family/category counts) to be honest about what the
numbers represent.
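As a rough sketch of how one Table B row could be recomputed from a CSV: the function names below are hypothetical, and it is an assumption that "Families", "Categories" and "Exchanges" are distinct non-empty values of the `family`, `category` and `exchange` columns. The script actually run by the workflow job is not shown in this description.

```python
import csv
import io


def format_count(n):
    """Format an integer with '.' as the thousands separator, as in the tables above."""
    return f"{n:,}".replace(",", ".")


def stats_row(product, handle):
    """Build one markdown row: product, row count, then distinct-value counts."""
    rows = list(csv.DictReader(handle))

    def distinct(column):
        # Empty cells are skipped so missing data does not count as a value.
        return len({r[column] for r in rows if r[column]})

    cells = [
        product,
        format_count(len(rows)),
        format_count(distinct("family")),
        format_count(distinct("category")),
        format_count(distinct("exchange")),
    ]
    return "| " + " | ".join(cells) + " |"


# Made-up sample standing in for etfs.csv.
sample = io.StringIO(
    "symbol,family,category,exchange\n"
    "AAA,Family One,Equities,PCX\n"
    "BBB,Family One,Bonds,PCX\n"
    "CCC,Family Two,,TOR\n"
)
row = stats_row("ETFs", sample)
```

Deriving every cell from the on-disk data is what keeps the README from drifting out of sync with the CSVs.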
## What this PR explicitly does NOT do
- **No `country` column on `etfs.csv` / `funds.csv`.** Per
@JerBouma's feedback, "country" for ETFs/Funds should reflect the
investment scope (e.g. iShares MSCI World ETF listed on NYSE has
global scope, not "United States"), not the listing exchange. The
obvious deterministic source (exchange -> country) doesn't fit that
semantics. Deferred to a focused discussion / future PR if the
community decides on a definition.
- **No country fill on missing equities.** 80k equity rows still have
empty `country`. Filling them from the listing exchange would be wrong
for the same reason: ASML is listed on Nasdaq, yet its country stays
"Netherlands". This needs an actual issuer-domicile source.
## Test plan
- [ ] `pytest tests/` — 50 tests pass (was 49 before; the new
`test_no_symbol_collisions_across_asset_classes` is the +1)
- [ ] `test_no_symbol_collisions_across_asset_classes` passes
- [ ] `black --check tests/ financedatabase/` clean
- [ ] CI `Update-README-Statistics` job rewrites the README correctly on
the next data update
- [ ] Spot-check: `BHF`, `DTE`, `RGA`, etc. exist in `equities.csv`
only (not in `etfs.csv` anymore)
- [ ] Spot-check: `FSST` has all fields populated
Related: JerBouma#140 (test infra), JerBouma#143 (introduced the `exchange -> market`
invariant pattern), JerBouma#144 (proposes splitting CSVs by exchange).
1 parent 0e9e8d3, commit 81f9661
11 files changed
Lines changed: 222 additions & 139 deletions
File tree
- .github/workflows
- database
- financedatabase
- tests
- csv/test_etfs
- json/test_etfs