Commit 0ecc9d4
Add country column to etfs.csv and funds.csv + API + invariant tests + automated README stats
Closes the long-standing data-shape gap where ETFs and Funds had no
country field, plus a series of related data-quality fixes surfaced by
the new cross-asset invariant test.
## Schema change: new `country` column on etfs.csv and funds.csv
Derived deterministically by chaining `exchange -> country` from
equities.csv (the source of truth for that mapping), with three small
manual overrides for fund-only exchange codes that don't appear in
equities.csv (`NAS`/`NYM`/`CME` -> United States) plus `NIM` for ETF
NextShares (also US). A second-pass ticker-suffix fallback handles
rows whose `exchange` was missing or corrupted.
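
The two-pass fill described above can be sketched roughly as follows. This is a plain-Python illustration, not the script used in this commit: the function names, the row dictionaries, the sample equity rows, and the `SUFFIX_TO_COUNTRY` table (including its `.TO` and `.DE` entries) are assumptions made for the example. Only the `NAS`/`NYM`/`CME`/`NIM` overrides are taken from the description.

```python
# Hypothetical sketch of the two-pass country fill; all names are illustrative.

# Overrides for exchange codes that do not appear in equities.csv
# (these four codes are the ones listed in the commit message).
MANUAL_OVERRIDES = {
    "NAS": "United States",
    "NYM": "United States",
    "CME": "United States",
    "NIM": "United States",
}

# Assumed sample of a ticker-suffix table; the real fallback table is not shown.
SUFFIX_TO_COUNTRY = {"TO": "Canada", "DE": "Germany"}


def build_exchange_map(equity_rows):
    """Collect exchange -> country pairs from equity rows."""
    mapping = {}
    for row in equity_rows:
        exchange, country = row.get("exchange"), row.get("country")
        if exchange and country:
            mapping.setdefault(exchange, country)
    return mapping


def derive_country(row, exchange_map):
    """First pass: exchange lookup. Second pass: ticker-suffix fallback."""
    exchange = row.get("exchange")
    if exchange in exchange_map:
        return exchange_map[exchange]
    if exchange in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[exchange]
    symbol = row.get("symbol") or ""
    if "." in symbol:
        suffix = symbol.rsplit(".", 1)[1]
        return SUFFIX_TO_COUNTRY.get(suffix)
    return None  # left unfilled so the row can be reviewed by hand


# Made-up equity rows standing in for equities.csv.
equities = [
    {"symbol": "AAA.TO", "exchange": "TOR", "country": "Canada"},
    {"symbol": "BBB", "exchange": "NMS", "country": "United States"},
]
exchange_map = build_exchange_map(equities)
```

Returning `None` when neither pass matches keeps unresolved rows visible instead of guessing a country for them.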
Coverage after the fill:
| File | Rows | country populated |
|---|---:|---|
| `etfs.csv` | 36,485 | 100.00% |
| `funds.csv` | 57,853 | 100.00% |
Top 5 countries:
- **ETFs**: United States (17,723), Canada (6,688), Germany (5,598),
United Kingdom (2,137), Switzerland (1,463) — 33 distinct countries
- **Funds**: United States (47,643), Spain (5,383), Canada (1,898),
United Kingdom (1,769), India (496) — 24 distinct countries
## Data-quality fixes surfaced by the new invariant tests
- **Removed 14 non-ETF rows** from `etfs.csv` (^REIT plus 13 US-equity
duplicates that existed correctly in `equities.csv` already:
`BHF`, `BHFAN`, `BHFAO`, `BHFAP`, `DTB`, `DTE`, `DTP`, `HTGC`, `PBC`,
`PSEC`, `RGA`, `RZA`, `TPVG`).
- **Removed 56 cross-asset symbol collisions** between `equities.csv`
and `etfs.csv` — all 56 were corporate bonds / senior notes / equity
share-class rows that had ended up in `etfs.csv` (Brighthouse,
Corvus Gold, Great Ajax Corp. notes, Argo notes, CMS Energy junior
subordinated notes, Conifer Holdings senior notes, Qwest Corp notes,
DTE Energy variants, ASGI = Aberdeen Global Infrastructure Income
Fund, etc.).
- **Fixed 29 ETF rows with corrupted `exchange` values** (issuer name
written into the exchange column instead of a real exchange code:
`Xtrackers`, `Fundlogic`, `Purpose Investments`, `CI Investments`,
`Horizons ETFs Management`, `Harvest Portfolios Group`,
`IA Clarington Investments`, `National Bank Investments`,
`Caldwell Investment Management`, `Developed Markets`,
`Emerging Markets`, `High Yield Bonds`). Re-derived from the ticker
suffix.
- **Completed FSST** (Fidelity Sustainable U.S. Equity ETF) — the row
previously had every field NaN. Filled name, currency, summary,
category_group, category, family, exchange (PCX), country, isin.
After this round: 0 cross-asset symbol collisions, 0 corrupted
exchange values flagged by the new invariants.
## API additions
`ETFs.select()`, `ETFs.show_options()`, `Funds.select()`, and
`Funds.show_options()` now accept a `country` parameter, validated
against the new column. Calling with an unknown country raises
`ValueError` matching the pattern used by the existing filters.
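
As a rough illustration of the validation behaviour described above, the sketch below filters rows by country and raises `ValueError` for an unknown value. It is a stand-in written for this description, not the `financedatabase` implementation: the `select_by_country` helper, the row format, and the error message wording are all assumptions.

```python
def select_by_country(rows, country=None):
    """Return rows matching `country`; reject values not present in the data."""
    if country is None:
        return list(rows)
    # Valid options are whatever the country column actually contains.
    valid = sorted({row["country"] for row in rows if row.get("country")})
    if country not in valid:
        # Hypothetical message; the library's actual wording may differ.
        raise ValueError(
            f"The country '{country}' is not available. "
            f"Choose from: {', '.join(valid)}"
        )
    return [row for row in rows if row.get("country") == country]


# Made-up rows standing in for funds.csv.
sample = [
    {"symbol": "AAA", "country": "United States"},
    {"symbol": "BBB.MC", "country": "Spain"},
]
spain = select_by_country(sample, country="Spain")
```

Against the real package, the corresponding call from the test plan is `fd.Funds(use_local_location=True).select(country="Spain")`.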
## New tests
In `tests/test_etfs.py` and `tests/test_funds.py`:
- `test_exchange_country_one_to_one` — asserts that every `exchange`
code on the asset maps to exactly one `country` value (the same
invariant introduced for equities `exchange -> market` in JerBouma#143).
- `test_select_with_invalid_value_raises` now also exercises the
`country` filter ValueError path.
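
The exchange-to-country invariant can be sketched as below. This is an assumed, plain-Python version for illustration; the helper name `exchanges_with_multiple_countries` and the sample rows are hypothetical, and the actual tests in the repository may be written differently (for example against a pandas DataFrame).

```python
from collections import defaultdict


def exchanges_with_multiple_countries(rows):
    """Return {exchange: sorted countries} for exchanges mapped to >1 country."""
    seen = defaultdict(set)
    for row in rows:
        exchange, country = row.get("exchange"), row.get("country")
        if exchange and country:
            seen[exchange].add(country)
    return {ex: sorted(cs) for ex, cs in seen.items() if len(cs) > 1}


def test_exchange_country_one_to_one_sketch():
    # Made-up rows; a real test would load etfs.csv or funds.csv instead.
    rows = [
        {"symbol": "AAA", "exchange": "PCX", "country": "United States"},
        {"symbol": "BBB", "exchange": "PCX", "country": "United States"},
    ]
    assert exchanges_with_multiple_countries(rows) == {}
```

Note that the check is one-directional: each exchange must resolve to a single country, while several exchanges may share the same country.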
New file `tests/test_invariants.py`:
- `test_no_symbol_collisions_across_asset_classes` — asserts that a
given `symbol` belongs to at most one of `equities.csv`,
`etfs.csv`, `funds.csv`. Catches the kind of drift fixed by the
56-row cleanup above before it can land on `main` again.
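
A minimal sketch of the collision check, assuming the symbols of each file have already been loaded into lists. The helper name `find_symbol_collisions` and the sample symbols are illustrative, not the code in `tests/test_invariants.py`.

```python
def find_symbol_collisions(symbols_by_asset):
    """Return {symbol: sorted asset classes} for symbols in more than one class."""
    owners = {}
    for asset, symbols in symbols_by_asset.items():
        # set() ignores duplicates within one file; only cross-file overlap counts.
        for symbol in set(symbols):
            owners.setdefault(symbol, []).append(asset)
    return {s: sorted(a) for s, a in owners.items() if len(a) > 1}


# Made-up symbol lists standing in for the three CSV files.
collisions = find_symbol_collisions(
    {
        "equities": ["DTE", "HTGC"],
        "etfs": ["DTE", "FSST"],
        "funds": ["VFIAX"],
    }
)
```

An invariant test would then assert that the returned dictionary is empty; reporting the offending symbols and their asset classes makes a failure easier to act on.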
## Automated README statistics
`.github/workflows/database_update.yml` gains a new
`Update-README-Statistics` job that runs after the existing
Add-New-Ticker / Update-Compression / Update-Categorization jobs. It
recomputes both statistics tables from the on-disk CSVs and rewrites
README.md in-place, committing the result. The Check-GICS job now
also depends on it.
Replaced the placeholder `Countries` figures for ETFs/Funds (which
were not backed by any column) with values derived from the new
`country` column. The other figures were also refreshed to the
current state:
| Product | Quantity | Sectors | Industries | Countries | Exchanges |
|-------------------|-----------:|-----------:|--------------:|----------:|----------:|
| Equities | 160.113 | 11 | 62 | 113 | 84 |
| ETFs | 36.485 | 320 | 51 | 33 | 51 |
| Funds | 57.853 | 1.540 | 74 | 24 | 33 |
| Product | Quantity | Category |
|-------------------|----------:|-----------------------|
| Currencies | 2.556 | 175 Currencies |
| Cryptocurrencies | 3.367 | 351 Cryptocurrencies |
| Indices | 91.178 | 63 Exchanges |
| Money Markets | 1.367 | 2 Exchanges |
For ETFs/Funds the `Sectors` column = `family` count and the
`Industries` column = `category` count, which is how those numbers
were always interpreted in the README — now backed by real columns.
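
The column mapping above could be computed along these lines. This is a sketch under stated assumptions, not the code run by the `Update-README-Statistics` job: `format_count`, `stats_row`, and the row dictionaries are hypothetical names and inputs chosen for the example.

```python
def format_count(n):
    """Format an integer with '.' as thousands separator, as in the tables above."""
    return f"{n:,}".replace(",", ".")


def stats_row(product, rows):
    """Build one Markdown table row for an ETF/Fund-style list of row dicts."""

    def distinct(column):
        # Count distinct non-empty values in the given column.
        return len({row[column] for row in rows if row.get(column)})

    cells = [
        product,
        format_count(len(rows)),
        format_count(distinct("family")),    # shown under "Sectors"
        format_count(distinct("category")),  # shown under "Industries"
        format_count(distinct("country")),
        format_count(distinct("exchange")),
    ]
    return "| " + " | ".join(cells) + " |"


# Made-up rows standing in for etfs.csv.
sample_rows = [
    {"family": "A", "category": "X", "country": "United States", "exchange": "PCX"},
    {"family": "B", "category": "X", "country": "United States", "exchange": "NGM"},
]
row = stats_row("ETFs", sample_rows)
```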
## Test plan
- [ ] `pytest tests/` — 52 tests pass (was 49 before this PR)
- [ ] `test_exchange_country_one_to_one` passes for ETFs and Funds
- [ ] `test_no_symbol_collisions_across_asset_classes` passes
- [ ] CI Update-README-Statistics job rewrites README correctly
- [ ] `black --check tests/ financedatabase/` clean
- [ ] Spot-check: `fd.ETFs(use_local_location=True).show_options(selection="country")` returns 33 countries
- [ ] Spot-check: `fd.Funds(use_local_location=True).select(country="Spain")` returns ~5,383 rows

1 parent 0e9e8d3 commit 0ecc9d4
46 files changed
Lines changed: 94709 additions & 94581 deletions
File tree
- .github/workflows
- database
- financedatabase
- tests
  - csv
    - test_etfs
    - test_funds
  - json
    - test_etfs
    - test_funds