Commit 81f9661

ETFs/Funds data quality + cross-asset invariant test + automated README stats
Per maintainer feedback on the initial version, this PR drops the `country` column addition for ETFs/Funds: the country semantics for those asset classes (investment scope, not listing-exchange country) need a richer signal than `exchange -> country` can give. What remains is still substantial:

## 1. Data-quality fixes on `etfs.csv` (surfaced by the new invariant)

- **14 non-ETF rows removed**: `^REIT` plus 13 US-equity duplicates that already existed correctly in `equities.csv` (`BHF`, `BHFAN`, `BHFAO`, `BHFAP`, `DTB`, `DTE`, `DTP`, `HTGC`, `PBC`, `PSEC`, `RGA`, `RZA`, `TPVG`).
- **56 cross-asset symbol collisions removed**: all 56 were corporate bonds / senior notes / equity share-class rows misclassified as ETFs (Brighthouse, Corvus Gold, Great Ajax Corp. notes, Argo notes, CMS Energy junior subordinated notes, Conifer Holdings senior notes, Qwest Corp notes, DTE Energy variants, `ASGI` = Aberdeen Global Infrastructure Income Fund, etc.).
- **29 ETF rows with corrupted `exchange` values fixed**: the issuer name had been written into the exchange column instead of a real exchange code (`Xtrackers`, `Fundlogic`, `Purpose Investments`, `CI Investments`, `Horizons ETFs Management`, `Harvest Portfolios Group`, `IA Clarington Investments`, `National Bank Investments`, `Caldwell Investment Management`, `Developed Markets`, `Emerging Markets`, `High Yield Bonds`). Re-derived from ticker suffix.
- **`FSST` completed** (Fidelity Sustainable U.S. Equity ETF): the row previously had every field NaN. Filled name, currency, summary, category_group, category, family, exchange (`PCX`), isin.

## 2. New cross-asset invariant test

`tests/test_invariants.py::test_no_symbol_collisions_across_asset_classes` asserts that a given `symbol` belongs to at most one of `equities.csv`, `etfs.csv`, `funds.csv`. This is the test that surfaced the 56 + 14 = 70 cleanup rows above; it would have caught all of them before they landed on `main`.

## 3. Automated README statistics

`.github/workflows/database_update.yml` gains a new `Update-README-Statistics` job that runs after the existing `Add-New-Ticker` / `Update-Compression-Files` / `Update-Categorization-Files` jobs. It recomputes the statistics tables from the on-disk CSVs and rewrites `README.md` in place, committing the result. The `Check-GICS-Categorisation` job now also depends on it.

README is now split into three tables that reflect the actual schema of each asset class rather than the previous combined-table layout whose `Countries` column for ETFs/Funds had no backing data:

```
Table A: Equities only (has country, sector, industry)
| Product | Quantity | Sectors | Industries | Countries | Exchanges |
| Equities | 160.113 | 11 | 62 | 113 | 84 |

Table B: ETFs / Funds (no country: no schema for it)
| Product | Quantity | Families | Categories | Exchanges |
| ETFs | 36.485 | 320 | 51 | 51 |
| Funds | 57.853 | 1.540 | 74 | 33 |

Table C: Currencies / Cryptos / Indices / Money Markets
| Product | Quantity | Category |
| Currencies | 2.556 | 175 Currencies |
| Cryptocurrencies | 3.367 | 351 Cryptocurrencies |
| Indices | 91.178 | 63 Exchanges |
| Money Markets | 1.367 | 2 Exchanges |
```

The B-table columns are renamed from the previous "Sectors / Industries" (which were really family/category counts) to be honest about what the numbers represent.

## What this PR explicitly does NOT do

- **No `country` column on `etfs.csv` / `funds.csv`.** Per @JerBouma's feedback, "country" for ETFs/Funds should reflect the investment scope (e.g. iShares MSCI World ETF listed on NYSE has global scope, not "United States"), not the listing exchange. The obvious deterministic source (exchange -> country) doesn't fit that semantics. Deferred to a focused discussion / future PR if the community decides on a definition.
- **No country fill on missing equities.** 80k equity rows still have empty `country`; filling them from listing exchange would be wrong for the same reason ASML on Nasdaq stays "Netherlands". Needs an actual issuer-domicile source.

## Test plan

- [ ] `pytest tests/`: 50 tests pass (was 49 before; the new `test_no_symbol_collisions_across_asset_classes` is the +1)
- [ ] `test_no_symbol_collisions_across_asset_classes` passes
- [ ] `black --check tests/ financedatabase/` clean
- [ ] CI `Update-README-Statistics` job rewrites README correctly on next data update
- [ ] Spot-check: `BHF`, `DTE`, `RGA`, etc. exist in `equities.csv` only (not in `etfs.csv` anymore)
- [ ] Spot-check: `FSST` has all fields populated

Related: #140 (test infra), #143 (introduced the `exchange -> market` invariant pattern), #144 (proposes splitting CSVs by exchange).
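The body of `test_no_symbol_collisions_across_asset_classes` is not shown on this page, so the following is only a minimal sketch of the invariant it describes: a `symbol` may appear in at most one asset-class file. The helper name `find_symbol_collisions`, the use of the standard-library `csv` module instead of pandas, and the in-memory toy data are all assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def find_symbol_collisions(sources):
    """Return {symbol: sorted asset-class names} for symbols found in more than one class.

    `sources` maps an asset-class name to a file-like object containing CSV
    text with a `symbol` column (hypothetical helper, not the project's API).
    """
    seen = defaultdict(set)
    for asset_class, handle in sources.items():
        for row in csv.DictReader(handle):
            symbol = (row.get("symbol") or "").strip()
            if symbol:  # rows without a symbol cannot collide
                seen[symbol].add(asset_class)
    return {s: sorted(classes) for s, classes in seen.items() if len(classes) > 1}


# Toy data standing in for the real CSV files (illustrative values only).
sources = {
    "equities": io.StringIO("symbol,name\nBHF,Brighthouse\nDTE,DTE Energy\n"),
    "etfs": io.StringIO("symbol,name\nBHF,Brighthouse\nFSST,Fidelity Sustainable U.S. Equity ETF\n"),
    "funds": io.StringIO("symbol,name\nXXXXX,Hypothetical Fund\n"),
}
collisions = find_symbol_collisions(sources)
print(collisions)  # {'BHF': ['equities', 'etfs']}
```

In a real test the final step would presumably be `assert not collisions`, with the offending symbols included in the failure message so the misclassified rows are easy to locate.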
1 parent 0e9e8d3 commit 81f9661

11 files changed

Lines changed: 222 additions & 139 deletions

.github/workflows/database_update.yml

Lines changed: 112 additions & 1 deletion
@@ -426,8 +426,119 @@ jobs:
         if: steps.run.outputs.status != '0'
         run: exit "${{ steps.run.outputs.status }}"
 
+  Update-README-Statistics:
+    needs: [Add-New-Ticker, Update-Compression-Files, Update-Categorization-Files]
+    runs-on: ubuntu-latest
+    steps:
+      - name: checkout repo content
+        uses: actions/checkout@v3
+      - name: pull changes
+        run: git pull https://${{secrets.PAT}}@github.com/JerBouma/FinanceDatabase.git main
+      - name: setup python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.13'
+      - run: pip install pandas
+      - name: Refresh the statistics tables in README.md
+        uses: jannekem/run-python-script-action@v1
+        with:
+          script: |
+            import re
+            from pathlib import Path
+            import pandas as pd
+
+            def _n(x: int) -> str:
+                """Format an integer with a thousand-separator dot (e.g. 160.113)."""
+                return f"{x:,}".replace(",", ".")
+
+            # Table A: Equities — Quantity, Sectors, Industries, Countries, Exchanges
+            eq = pd.read_csv("database/equities.csv", dtype=str)
+            table_a_row = (
+                "Equities",
+                _n(len(eq)),
+                _n(eq["sector"].nunique()),
+                _n(eq["industry"].nunique()),
+                _n(eq["country"].nunique()),
+                _n(eq["exchange"].nunique()),
+            )
+
+            # Table B: ETFs / Funds — Quantity, Families, Categories, Exchanges
+            # (no Countries column: ETF/Fund 'country' is semantically the investment
+            # scope, not a HQ — not derivable from listing exchange alone.)
+            table_b_rows = []
+            for asset in ("ETFs", "Funds"):
+                df = pd.read_csv(f"database/{asset.lower()}.csv", dtype=str)
+                table_b_rows.append((
+                    asset,
+                    _n(len(df)),
+                    _n(df["family"].nunique()),
+                    _n(df["category"].nunique()),
+                    _n(df["exchange"].nunique()),
+                ))
+
+            # Table C: Currencies / Cryptocurrencies / Indices / Money Markets
+            table_c_rows = []
+            for asset, fname, col, label in [
+                ("Currencies", "currencies.csv", "quote_currency", "Currencies"),
+                ("Cryptocurrencies", "cryptos.csv", "cryptocurrency", "Cryptocurrencies"),
+                ("Indices", "indices.csv", "exchange", "Exchanges"),
+                ("Money Markets", "moneymarkets.csv", "exchange", "Exchanges"),
+            ]:
+                df = pd.read_csv(f"database/{fname}", dtype=str)
+                table_c_rows.append((asset, _n(len(df)), f"{_n(df[col].nunique())} {label}"))
+
+            def _fmt_table_a(row):
+                h = "| Product           | Quantity   | Sectors    | Industries    | Countries | Exchanges |\n"
+                h += "| ----------------- | ---------- | ---------- | ------------- | --------- | --------- |"
+                a, q, s, i, c, e = row
+                b = f"| {a:<17} | {q:<10} | {s:<10} | {i:<13} | {c:<9} | {e:<9} |"
+                return h + "\n" + b
+
+            def _fmt_table_b(rows):
+                h = "| Product           | Quantity   | Families   | Categories    | Exchanges |\n"
+                h += "| ----------------- | ---------- | ---------- | ------------- | --------- |"
+                b = "\n".join(
+                    f"| {a:<17} | {q:<10} | {f:<10} | {c:<13} | {e:<9} |"
+                    for (a, q, f, c, e) in rows
+                )
+                return h + "\n" + b
+
+            def _fmt_table_c(rows):
+                h = "| Product           | Quantity  | Category              |\n"
+                h += "| ----------------- | --------- | --------------------- |"
+                b = "\n".join(
+                    f"| {a:<17} | {q:<9} | {c:<21} |"
+                    for (a, q, c) in rows
+                )
+                return h + "\n" + b
+
+            readme = Path("README.md")
+            text = readme.read_text(encoding="utf-8")
+            # Replace the existing combined "Equities/ETFs/Funds" table with two tables (A + B).
+            text = re.sub(
+                r"\| Product\s+\| Quantity\s+\| Sectors[^\n]*\| Industries[^\n]*\n\|[^\n]*\n(?:\|[^\n]*\n){3}",
+                _fmt_table_a(table_a_row) + "\n\n" + _fmt_table_b(table_b_rows) + "\n",
+                text,
+                count=1,
+            )
+            text = re.sub(
+                r"\| Product\s+\| Quantity\s+\| Category[^\n]*\n\|[^\n]*\n(?:\|[^\n]*\n){4}",
+                _fmt_table_c(table_c_rows) + "\n",
+                text,
+                count=1,
+            )
+            readme.write_text(text, encoding="utf-8")
+            print("README.md statistics tables refreshed.")
+      - name: Commit README update
+        run: |
+          git config --global user.name 'GitHub Action'
+          git config --global user.email 'action@github.com'
+          git add README.md
+          git diff-index --quiet HEAD || git commit -m "Update README statistics"
+          git push
+
   Check-GICS-Categorisation:
-    needs: [Add-New-Ticker, Update-Compression-Files, Update-Categorization-Files]
+    needs: [Add-New-Ticker, Update-Compression-Files, Update-Categorization-Files, Update-README-Statistics]
     runs-on: ubuntu-latest
     steps:
       - name: checkout repo content
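The first `re.sub` in the new job expects a header containing `Sectors` and `Industries` followed by exactly three data rows. As a quick way to reason about re-runs, the sketch below applies that pattern (copied from the script) to trimmed stand-ins for the README before and after the split. The two sample strings are assumptions modelled on the tables in this commit, not the real file contents.

```python
import re

# Pattern copied from the workflow script's first re.sub call.
COMBINED = re.compile(
    r"\| Product\s+\| Quantity\s+\| Sectors[^\n]*\| Industries[^\n]*\n"
    r"\|[^\n]*\n(?:\|[^\n]*\n){3}"
)

# Old layout: one combined table with three data rows.
before = (
    "| Product | Quantity | Sectors | Industries | Countries | Exchanges |\n"
    "| --- | --- | --- | --- | --- | --- |\n"
    "| Equities | 158.429 | 12 | 63 | 111 | 83 |\n"
    "| ETFs | 36.786 | 295 | 22 | 111 | 53 |\n"
    "| Funds | 57.881 | 1541 | 52 | 111 | 34 |\n"
)

# New layout: table A has a single data row, then a blank line, then table B.
after = (
    "| Product | Quantity | Sectors | Industries | Countries | Exchanges |\n"
    "| --- | --- | --- | --- | --- | --- |\n"
    "| Equities | 160.113 | 11 | 62 | 113 | 84 |\n"
    "\n"
    "| Product | Quantity | Families | Categories | Exchanges |\n"
    "| --- | --- | --- | --- | --- |\n"
    "| ETFs | 36.485 | 320 | 51 | 51 |\n"
    "| Funds | 57.853 | 1.540 | 74 | 33 |\n"
)

matches_before = COMBINED.search(before) is not None
matches_after = COMBINED.search(after) is not None
print(matches_before, matches_after)  # True False
```

Under these assumptions the pattern matches only the old combined table, so it may be worth confirming that tables A and B are still refreshed on runs after the first rewrite. The table C pattern expects four data rows, which the new layout still has.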

README.md

Lines changed: 9 additions & 6 deletions
@@ -23,16 +23,19 @@ Some key statistics of the database:
 
 | Product           | Quantity   | Sectors    | Industries    | Countries | Exchanges |
 | ----------------- | ---------- | ---------- | ------------- | --------- | --------- |
-| Equities          | 158.429    | 12         | 63            | 111       | 83        |
-| ETFs              | 36.786     | 295        | 22            | 111       | 53        |
-| Funds             | 57.881     | 1541       | 52            | 111       | 34        |
+| Equities          | 160.113    | 11         | 62            | 113       | 84        |
+
+| Product           | Quantity   | Families   | Categories    | Exchanges |
+| ----------------- | ---------- | ---------- | ------------- | --------- |
+| ETFs              | 36.485     | 320        | 51            | 51        |
+| Funds             | 57.853     | 1.540      | 74            | 33        |
 
 | Product           | Quantity  | Category              |
 | ----------------- | --------- | --------------------- |
 | Currencies        | 2.556     | 175 Currencies        |
-| Cryptocurrencies  | 3.367     | 352 Cryptocurrencies  |
-| Indices           | 91.183    | 64 Exchanges          |
-| Money Markets     | 1.367     | 3 Exchanges           |
+| Cryptocurrencies  | 3.367     | 351 Cryptocurrencies  |
+| Indices           | 91.178    | 63 Exchanges          |
+| Money Markets     | 1.367     | 2 Exchanges           |
 
 The Finance Database is used within or referenced by:
 
