Commit 0ecc9d4

Add country column to etfs.csv and funds.csv + API + invariant tests + automated README stats
Closes the long-standing data-shape gap where ETFs and Funds had no country field, plus a series of related data-quality fixes surfaced by the new cross-asset invariant test.

## Schema change: new `country` column on etfs.csv and funds.csv

Derived deterministically by chaining `exchange -> country` from equities.csv (the source of truth for that mapping), with three small manual overrides for fund-only exchange codes that don't appear in equities.csv (`NAS`/`NYM`/`CME` -> United States) plus `NIM` for ETF NextShares (also US). A second-pass ticker-suffix fallback handles rows whose `exchange` was missing or corrupted.

Coverage after the fill:

| File | Rows | country populated |
|---|---:|---|
| `etfs.csv` | 36,485 | 100.00% |
| `funds.csv` | 57,853 | 100.00% |

Top 5 countries:

- **ETFs**: United States (17,723), Canada (6,688), Germany (5,598), United Kingdom (2,137), Switzerland (1,463); 33 distinct countries
- **Funds**: United States (47,643), Spain (5,383), Canada (1,898), United Kingdom (1,769), India (496); 24 distinct countries

## Data-quality fixes surfaced by the new invariant tests

- **Removed 14 non-ETF rows** from `etfs.csv` (^REIT plus 13 US-equity duplicates that existed correctly in `equities.csv` already: `BHF`, `BHFAN`, `BHFAO`, `BHFAP`, `DTB`, `DTE`, `DTP`, `HTGC`, `PBC`, `PSEC`, `RGA`, `RZA`, `TPVG`).
- **Removed 56 cross-asset symbol collisions** between `equities.csv` and `etfs.csv`. All 56 were corporate bonds / senior notes / equity share-class rows that had ended up in `etfs.csv` (Brighthouse, Corvus Gold, Great Ajax Corp. notes, Argo notes, CMS Energy junior subordinated notes, Conifer Holdings senior notes, Qwest Corp notes, DTE Energy variants, ASGI = Aberdeen Global Infrastructure Income Fund, etc.).
- **Fixed 29 ETF rows with corrupted `exchange` values** (issuer name written into the exchange column instead of a real exchange code: `Xtrackers`, `Fundlogic`, `Purpose Investments`, `CI Investments`, `Horizons ETFs Management`, `Harvest Portfolios Group`, `IA Clarington Investments`, `National Bank Investments`, `Caldwell Investment Management`, `Developed Markets`, `Emerging Markets`, `High Yield Bonds`). Re-derived from the ticker suffix.
- **Completed FSST** (Fidelity Sustainable U.S. Equity ETF). The row previously had every field NaN. Filled name, currency, summary, category_group, category, family, exchange (PCX), country, isin.

After this round: 0 cross-asset symbol collisions, 0 corrupted exchange values flagged by the new invariants.

## API additions

`ETFs.select()`, `ETFs.show_options()`, `Funds.select()`, and `Funds.show_options()` now accept a `country` parameter, validated against the new column. Calling with an unknown country raises `ValueError` matching the pattern used by the existing filters.

## New tests

In `tests/test_etfs.py` and `tests/test_funds.py`:

- `test_exchange_country_one_to_one`: asserts that every `exchange` code on the asset maps to exactly one `country` value (the same invariant introduced for equities `exchange -> market` in #143).
- `test_select_with_invalid_value_raises` now also exercises the `country` filter ValueError path.

New file `tests/test_invariants.py`:

- `test_no_symbol_collisions_across_asset_classes`: asserts that a given `symbol` belongs to at most one of `equities.csv`, `etfs.csv`, `funds.csv`. Catches the kind of drift fixed by the 56-row cleanup above before it can land on `main` again.

## Automated README statistics

`.github/workflows/database_update.yml` gains a new `Update-README-Statistics` job that runs after the existing Add-New-Ticker / Update-Compression / Update-Categorization jobs. It recomputes both statistics tables from the on-disk CSVs and rewrites README.md in-place, committing the result. The Check-GICS job now also depends on it.

Replaced the meaningless `Countries` numbers for ETFs/Funds (which were not backed by any column) with the now-real value derived from the new `country` column. Other numbers also refreshed to current state:

| Product | Quantity | Sectors | Industries | Countries | Exchanges |
|-------------------|-----------:|-----------:|--------------:|----------:|----------:|
| Equities | 160.113 | 11 | 62 | 113 | 84 |
| ETFs | 36.485 | 320 | 51 | 33 | 51 |
| Funds | 57.853 | 1.540 | 74 | 24 | 33 |

| Product | Quantity | Category |
|-------------------|----------:|-----------------------|
| Currencies | 2.556 | 175 Currencies |
| Cryptocurrencies | 3.367 | 351 Cryptocurrencies |
| Indices | 91.178 | 63 Exchanges |
| Money Markets | 1.367 | 2 Exchanges |

For ETFs/Funds the `Sectors` column = `family` count and the `Industries` column = `category` count, which is how those numbers were always interpreted in the README, now backed by real columns.

## Test plan

- [ ] `pytest tests/`: 52 tests pass (was 49 before this PR)
- [ ] `test_exchange_country_one_to_one` passes for ETFs and Funds
- [ ] `test_no_symbol_collisions_across_asset_classes` passes
- [ ] CI Update-README-Statistics job rewrites README correctly
- [ ] `black --check tests/ financedatabase/` clean
- [ ] Spot-check: `fd.ETFs(use_local_location=True).show_options(selection="country")` returns 33 countries
- [ ] Spot-check: `fd.Funds(use_local_location=True).select(country="Spain")` returns ~5,383 rows
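The derivation and the two invariants described in the message can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the script used for the commit: the function names, the row-as-dict representation, and the `SUFFIX_TO_COUNTRY` table are assumptions made for the example, and only the four override codes come from the message itself.

```python
from collections import defaultdict

# Override codes named in the commit message; all four map to the United States.
MANUAL_OVERRIDES = {code: "United States" for code in ("NAS", "NYM", "CME", "NIM")}

# Hypothetical suffix table for illustration; the commit does not show the real one.
SUFFIX_TO_COUNTRY = {".TO": "Canada", ".L": "United Kingdom", ".DE": "Germany"}


def build_exchange_map(rows):
    """Build exchange -> country from rows, enforcing the one-to-one invariant."""
    seen = defaultdict(set)
    for row in rows:
        if row.get("exchange") and row.get("country"):
            seen[row["exchange"]].add(row["country"])
    ambiguous = {ex: sorted(c) for ex, c in seen.items() if len(c) > 1}
    if ambiguous:
        raise ValueError(f"exchange codes map to several countries: {ambiguous}")
    return {ex: next(iter(c)) for ex, c in seen.items()}


def derive_country(row, exchange_map):
    """Resolve a country: equities mapping, then overrides, then ticker suffix."""
    exchange = row.get("exchange")
    if exchange in exchange_map:
        return exchange_map[exchange]
    if exchange in MANUAL_OVERRIDES:
        return MANUAL_OVERRIDES[exchange]
    # Second pass: exchange missing or corrupted, fall back to the ticker suffix.
    symbol = row.get("symbol") or ""
    for suffix, country in SUFFIX_TO_COUNTRY.items():
        if symbol.endswith(suffix):
            return country
    return None


def find_symbol_collisions(**asset_classes):
    """Return symbols that appear in more than one asset class."""
    owners = defaultdict(set)
    for name, rows in asset_classes.items():
        for row in rows:
            owners[row["symbol"]].add(name)
    return {sym: sorted(names) for sym, names in owners.items() if len(names) > 1}
```

In this sketch a row whose `exchange` holds an issuer name such as `Purpose Investments` falls through both lookups and is resolved from its ticker suffix. An empty result from `find_symbol_collisions(equities=..., etfs=..., funds=...)` corresponds to the "0 cross-asset symbol collisions" state reported above.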
1 parent 0e9e8d3 commit 0ecc9d4

46 files changed

Lines changed: 94709 additions & 94581 deletions

.github/workflows/database_update.yml

Lines changed: 96 additions & 1 deletion
@@ -426,8 +426,103 @@ jobs:
         if: steps.run.outputs.status != '0'
         run: exit "${{ steps.run.outputs.status }}"
 
+  Update-README-Statistics:
+    needs: [Add-New-Ticker, Update-Compression-Files, Update-Categorization-Files]
+    runs-on: ubuntu-latest
+    steps:
+      - name: checkout repo content
+        uses: actions/checkout@v3
+      - name: pull changes
+        run: git pull https://${{secrets.PAT}}@github.com/JerBouma/FinanceDatabase.git main
+      - name: setup python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.13'
+      - run: pip install pandas
+      - name: Refresh the statistics tables in README.md
+        uses: jannekem/run-python-script-action@v1
+        with:
+          script: |
+            import re
+            from pathlib import Path
+            import pandas as pd
+
+            def _n(x: int) -> str:
+                """Format an integer with a thousand-separator dot (e.g. 160.113)."""
+                return f"{x:,}".replace(",", ".")
+
+            # Table 1: Equities / ETFs / Funds — Quantity, Sectors, Industries, Countries, Exchanges
+            table_1_rows = []
+            for asset, sectors_col, industries_col in [
+                ("Equities", "sector", "industry"),
+                ("ETFs", "family", "category"),
+                ("Funds", "family", "category"),
+            ]:
+                df = pd.read_csv(f"database/{asset.lower()}.csv", dtype=str)
+                table_1_rows.append((
+                    asset,
+                    _n(len(df)),
+                    _n(df[sectors_col].nunique()),
+                    _n(df[industries_col].nunique()),
+                    _n(df["country"].nunique()),
+                    _n(df["exchange"].nunique()),
+                ))
+
+            # Table 2: Currencies / Cryptocurrencies / Indices / Money Markets
+            table_2_rows = []
+            for asset, fname, col, label in [
+                ("Currencies", "currencies.csv", "quote_currency", "Currencies"),
+                ("Cryptocurrencies", "cryptos.csv", "cryptocurrency", "Cryptocurrencies"),
+                ("Indices", "indices.csv", "exchange", "Exchanges"),
+                ("Money Markets", "moneymarkets.csv", "exchange", "Exchanges"),
+            ]:
+                df = pd.read_csv(f"database/{fname}", dtype=str)
+                table_2_rows.append((asset, _n(len(df)), f"{_n(df[col].nunique())} {label}"))
+
+            def _fmt_table_1(rows):
+                h = "| Product | Quantity | Sectors | Industries | Countries | Exchanges |\n"
+                h += "| ----------------- | ---------- | ---------- | ------------- | --------- | --------- |"
+                b = "\n".join(
+                    f"| {a:<17} | {q:<10} | {s:<10} | {i:<13} | {c:<9} | {e:<9} |"
+                    for (a, q, s, i, c, e) in rows
+                )
+                return h + "\n" + b
+
+            def _fmt_table_2(rows):
+                h = "| Product | Quantity | Category |\n"
+                h += "| ----------------- | --------- | --------------------- |"
+                b = "\n".join(
+                    f"| {a:<17} | {q:<9} | {c:<21} |"
+                    for (a, q, c) in rows
+                )
+                return h + "\n" + b
+
+            readme = Path("README.md")
+            text = readme.read_text(encoding="utf-8")
+            text = re.sub(
+                r"\| Product\s+\| Quantity\s+\| Sectors[^\n]*\| Industries[^\n]*\n\|[^\n]*\n(?:\|[^\n]*\n){3}",
+                _fmt_table_1(table_1_rows) + "\n",
+                text,
+                count=1,
+            )
+            text = re.sub(
+                r"\| Product\s+\| Quantity\s+\| Category[^\n]*\n\|[^\n]*\n(?:\|[^\n]*\n){4}",
+                _fmt_table_2(table_2_rows) + "\n",
+                text,
+                count=1,
+            )
+            readme.write_text(text, encoding="utf-8")
+            print("README.md statistics tables refreshed.")
+      - name: Commit README update
+        run: |
+          git config --global user.name 'GitHub Action'
+          git config --global user.email 'action@github.com'
+          git add README.md
+          git diff-index --quiet HEAD || git commit -m "Update README statistics"
+          git push
+
   Check-GICS-Categorisation:
-    needs: [Add-New-Ticker, Update-Compression-Files, Update-Categorization-Files]
+    needs: [Add-New-Ticker, Update-Compression-Files, Update-Categorization-Files, Update-README-Statistics]
     runs-on: ubuntu-latest
     steps:
       - name: checkout repo content
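The table-rewrite step in the new job can be tried in isolation on an in-memory string. The sketch below reuses the second regex and the thousand-separator formatting from the workflow script; the function names and the sample rows are assumptions for illustration. It also raises when no table is matched, a guard the workflow script does not have (there, `re.sub` would leave the text unchanged).

```python
import re

# Same pattern as the workflow's second re.sub call: header row, separator row,
# then exactly four data rows.
TABLE_2_PATTERN = (
    r"\| Product\s+\| Quantity\s+\| Category[^\n]*\n\|[^\n]*\n(?:\|[^\n]*\n){4}"
)


def format_thousands(x: int) -> str:
    """Format an integer with a dot as thousand separator, e.g. 160113 -> 160.113."""
    return f"{x:,}".replace(",", ".")


def render_table_2(rows):
    """Render (product, quantity, category) tuples as a markdown table."""
    header = "| Product | Quantity | Category |\n| --- | --- | --- |"
    body = "\n".join(f"| {name} | {qty} | {cat} |" for name, qty, cat in rows)
    return header + "\n" + body + "\n"


def refresh_table_2(readme_text, rows):
    """Replace the first matching statistics table; fail loudly if none is found."""
    # A callable replacement avoids backslash-escape processing of the new table text.
    new_text, replaced = re.subn(
        TABLE_2_PATTERN, lambda match: render_table_2(rows), readme_text, count=1
    )
    if replaced != 1:
        raise ValueError("statistics table not found in README text")
    return new_text
```

Because the pattern hard-codes the number of data rows, adding a product to the table means updating the `{4}` quantifier as well; the explicit error makes that kind of mismatch visible instead of silently committing an unchanged README.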

README.md

Lines changed: 6 additions & 6 deletions
@@ -23,16 +23,16 @@ Some key statistics of the database:
 
 | Product | Quantity | Sectors | Industries | Countries | Exchanges |
 | ----------------- | ---------- | ---------- | ------------- | --------- | --------- |
-| Equities | 158.429 | 12 | 63 | 111 | 83 |
-| ETFs | 36.786 | 295 | 22 | 111 | 53 |
-| Funds | 57.881 | 1541 | 52 | 111 | 34 |
+| Equities | 160.113 | 11 | 62 | 113 | 84 |
+| ETFs | 36.485 | 320 | 51 | 33 | 51 |
+| Funds | 57.853 | 1.540 | 74 | 24 | 33 |
 
 | Product | Quantity | Category |
 | ----------------- | --------- | --------------------- |
 | Currencies | 2.556 | 175 Currencies |
-| Cryptocurrencies | 3.367 | 352 Cryptocurrencies |
-| Indices | 91.183 | 64 Exchanges |
-| Money Markets | 1.367 | 3 Exchanges |
+| Cryptocurrencies | 3.367 | 351 Cryptocurrencies |
+| Indices | 91.178 | 63 Exchanges |
+| Money Markets | 1.367 | 2 Exchanges |
 
 The Finance Database is used within or referenced by:
 