Commit ba9cb4e

dokson and claude authored
ETFs/Funds data quality + cross-asset invariants + equities country/ISIN backfill + SPAC cleanup + README stats (#147)
* ETFs/Funds data quality + cross-asset invariants + country backfill + README stats

Data quality on etfs.csv:
- 14 non-ETF rows removed (already correctly in equities.csv: BHF, DTE, RGA, HTGC, PSEC, TPVG, ...)
- 56 cross-asset symbol collisions with equities.csv removed (corporate bonds, senior notes, share-class variants misclassified as ETFs)
- 29 corrupted `exchange` values fixed (issuer name in exchange column: Xtrackers, Fundlogic, Purpose Investments, ...)
- FSST (Fidelity Sustainable U.S. Equity ETF) completed (was all-NaN)
- After cleanup: equities.csv and etfs.csv share zero symbols

Country backfill on equities.csv (50.3% -> 63.1%):
- 15,692 rows filled from primary listing in equities.csv (same base ticker, e.g. ASML.AS country propagated to ASML.DE)
- 5,777 rows filled via TradingView screener API (HQ country, not listing country -- ASML on Nasdaq stays "Netherlands", not "United States")
- 113 additional rows from yfinance lookup
- Skips bases that resolve ambiguously across markets (e.g. numeric bases shared between Chinese .SZ and Korean .KS exchanges)
- Russian Federation -> Russia normalization
- ALMER.PA: Reunion (French overseas dept) -> France

Cross-asset invariants:
- New tests/test_invariants.py with test_no_symbol_collisions_across_asset_classes covering all 7 asset class files (equities, etfs, funds, indices, currencies, cryptos, moneymarkets). Catches drift like the ^REIT-in-etfs case at PR time.

Automated README statistics:
- New Update-README-Statistics job in .github/workflows/database_update.yml regenerates stats tables from database/*.csv after every database update.
- README restructured into three tables (Equities w/ Countries; ETFs/Funds w/o Country; Currencies/Cryptos/Indices/Money Markets) to keep every cell honest -- ETFs/Funds country was a manual placeholder before.

Misc:
- financedatabase/helpers.py: widen base show_options() return type from pd.Series to pd.Index | dict | np.ndarray to match what subclasses actually return (LSP fix; runtime unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* equities.csv data quality: ISIN cleanup + SPAC template removal + canonicalize names

Picks up where the prior commit left off after deeper auditing surfaced two more contamination layers in equities.csv.

ISIN cleanup:
- Cross-asset duplicates (equities.csv ∩ etfs.csv): 209 ISINs cleared from equities rows where the ISIN rightfully belongs to the ETF (UBS UCITS, Lyxor, iShares, HSBC etc. previously stamped onto unrelated PNK micro-caps).
- Multi-name ISIN cleanup (different companies sharing one ISIN, which is impossible by ISO design, so one row's ISIN is wrong). Resolved canonical name per ISIN via GLEIF (legalName) + yfinance (longName) + TradingView (description). Total wrong ISIN values cleared: 9,634 across the three passes.
- Multi-name ISIN count: 4,772 baseline -> 1,467 (-69%).
- New invariant test `test_no_isin_collisions_across_asset_classes` enforces no future regression of the cross-asset case (catches the same drift that the symbol invariant catches).

Name canonicalization:
- 8,361 names rewritten to the canonical legalName/description form so that cross-listings of the same company collapse to one spelling instead of 4 variants like "STRABAG SE", "STRABAG SE-BR", "Strabag SE Inhaber-Aktien o.N."
- Conservative threshold: only rewrite when token similarity with canonical >= 0.4 (preserves originals when the row likely refers to a different company than the canonical).

SPAC template removal (upstream data poisoning):
- 1,584 equities.csv rows had identical name="one" + summary about a SPAC that "does not have significant operations. It intends to effect a merger ...", plus identical state=CA / city=San Francisco / zipcode=94129 / website=a-star.co / sector=Financials / industry=Diversified Financial Services. This is A-Star Financial Acquisition Corp's data copy-pasted onto 1,584 unrelated tickers somewhere upstream.
- 200 real names recovered via TradingView + stockevents.app + DuckDuckGo + Finnhub (the rest are micro-caps not in any free public dataset).
- All 1,608 rows carrying the SPAC fingerprint (website=a-star.co) had the contaminated sector/industry_group/industry/state/city/zipcode/website/market_cap fields cleared. The bogus uniform values were more misleading than missing data.

ALMER.PA: Reunion -> France (French overseas department; matches FD's convention for metropolitan/overseas France).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
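The commit message names `test_no_symbol_collisions_across_asset_classes`, but the test body is not shown on this page. A minimal sketch of what such an invariant could look like, assuming each asset-class CSV has a `symbol` column. `find_symbol_collisions` and the sample rows are hypothetical, and the standard-library `csv` module stands in for whatever the real test uses.

```python
import csv
import io
from collections import defaultdict


def find_symbol_collisions(tables):
    """Return {symbol: [asset classes]} for symbols present in more than one class.

    `tables` maps an asset-class name to an open file-like object holding CSV
    text with a `symbol` column (an assumption about the file layout).
    """
    seen = defaultdict(set)
    for asset_class, handle in tables.items():
        for row in csv.DictReader(handle):
            symbol = (row.get("symbol") or "").strip()
            if symbol:
                seen[symbol].add(asset_class)
    return {s: sorted(classes) for s, classes in seen.items() if len(classes) > 1}


# Tiny in-memory stand-ins for database/equities.csv and database/etfs.csv.
# The rows are illustrative only; PSEC is placed in both files on purpose.
equities = io.StringIO("symbol,name\nASML.AS,placeholder\nPSEC,placeholder\n")
etfs = io.StringIO("symbol,name\nFSST,placeholder\nPSEC,placeholder\n")

collisions = find_symbol_collisions({"equities": equities, "etfs": etfs})
print(collisions)
```

A test built on this idea would simply assert that the returned mapping is empty, so any symbol added to a second asset-class file fails the check at PR time.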
1 parent 0e9e8d3 commit ba9cb4e

36 files changed

Lines changed: 71550 additions & 71429 deletions

.github/workflows/database_update.yml

Lines changed: 112 additions & 1 deletion
@@ -426,8 +426,119 @@ jobs:
         if: steps.run.outputs.status != '0'
         run: exit "${{ steps.run.outputs.status }}"
 
+  Update-README-Statistics:
+    needs: [Add-New-Ticker, Update-Compression-Files, Update-Categorization-Files]
+    runs-on: ubuntu-latest
+    steps:
+      - name: checkout repo content
+        uses: actions/checkout@v3
+      - name: pull changes
+        run: git pull https://${{secrets.PAT}}@github.com/JerBouma/FinanceDatabase.git main
+      - name: setup python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.13'
+      - run: pip install pandas
+      - name: Refresh the statistics tables in README.md
+        uses: jannekem/run-python-script-action@v1
+        with:
+          script: |
+            import re
+            from pathlib import Path
+            import pandas as pd
+
+            def _n(x: int) -> str:
+                """Format an integer with a thousand-separator dot (e.g. 160.113)."""
+                return f"{x:,}".replace(",", ".")
+
+            # Table A: Equities -- Quantity, Sectors, Industries, Countries, Exchanges
+            eq = pd.read_csv("database/equities.csv", dtype=str)
+            table_a_row = (
+                "Equities",
+                _n(len(eq)),
+                _n(eq["sector"].nunique()),
+                _n(eq["industry"].nunique()),
+                _n(eq["country"].nunique()),
+                _n(eq["exchange"].nunique()),
+            )
+
+            # Table B: ETFs / Funds -- Quantity, Families, Categories, Exchanges
+            # (no Countries column: ETF/Fund 'country' is semantically the investment
+            #  scope, not a HQ -- not derivable from listing exchange alone.)
+            table_b_rows = []
+            for asset in ("ETFs", "Funds"):
+                df = pd.read_csv(f"database/{asset.lower()}.csv", dtype=str)
+                table_b_rows.append((
+                    asset,
+                    _n(len(df)),
+                    _n(df["family"].nunique()),
+                    _n(df["category"].nunique()),
+                    _n(df["exchange"].nunique()),
+                ))
+
+            # Table C: Currencies / Cryptocurrencies / Indices / Money Markets
+            table_c_rows = []
+            for asset, fname, col, label in [
+                ("Currencies", "currencies.csv", "quote_currency", "Currencies"),
+                ("Cryptocurrencies", "cryptos.csv", "cryptocurrency", "Cryptocurrencies"),
+                ("Indices", "indices.csv", "exchange", "Exchanges"),
+                ("Money Markets", "moneymarkets.csv", "exchange", "Exchanges"),
+            ]:
+                df = pd.read_csv(f"database/{fname}", dtype=str)
+                table_c_rows.append((asset, _n(len(df)), f"{_n(df[col].nunique())} {label}"))
+
+            def _fmt_table_a(row):
+                h = "| Product | Quantity | Sectors | Industries | Countries | Exchanges |\n"
+                h += "| ----------------- | ---------- | ---------- | ------------- | --------- | --------- |"
+                a, q, s, i, c, e = row
+                b = f"| {a:<17} | {q:<10} | {s:<10} | {i:<13} | {c:<9} | {e:<9} |"
+                return h + "\n" + b
+
+            def _fmt_table_b(rows):
+                h = "| Product | Quantity | Families | Categories | Exchanges |\n"
+                h += "| ----------------- | ---------- | ---------- | ------------- | --------- |"
+                b = "\n".join(
+                    f"| {a:<17} | {q:<10} | {f:<10} | {c:<13} | {e:<9} |"
+                    for (a, q, f, c, e) in rows
+                )
+                return h + "\n" + b
+
+            def _fmt_table_c(rows):
+                h = "| Product | Quantity | Category |\n"
+                h += "| ----------------- | --------- | --------------------- |"
+                b = "\n".join(
+                    f"| {a:<17} | {q:<9} | {c:<21} |"
+                    for (a, q, c) in rows
+                )
+                return h + "\n" + b
+
+            readme = Path("README.md")
+            text = readme.read_text(encoding="utf-8")
+            # Replace the existing combined "Equities/ETFs/Funds" table with two tables (A + B).
+            text = re.sub(
+                r"\| Product\s+\| Quantity\s+\| Sectors[^\n]*\| Industries[^\n]*\n\|[^\n]*\n(?:\|[^\n]*\n){3}",
+                _fmt_table_a(table_a_row) + "\n\n" + _fmt_table_b(table_b_rows) + "\n",
+                text,
+                count=1,
+            )
+            text = re.sub(
+                r"\| Product\s+\| Quantity\s+\| Category[^\n]*\n\|[^\n]*\n(?:\|[^\n]*\n){4}",
+                _fmt_table_c(table_c_rows) + "\n",
+                text,
+                count=1,
+            )
+            readme.write_text(text, encoding="utf-8")
+            print("README.md statistics tables refreshed.")
+      - name: Commit README update
+        run: |
+          git config --global user.name 'GitHub Action'
+          git config --global user.email 'action@github.com'
+          git add README.md
+          git diff-index --quiet HEAD || git commit -m "Update README statistics"
+          git push
+
   Check-GICS-Categorisation:
-    needs: [Add-New-Ticker, Update-Compression-Files, Update-Categorization-Files]
+    needs: [Add-New-Ticker, Update-Compression-Files, Update-Categorization-Files, Update-README-Statistics]
     runs-on: ubuntu-latest
     steps:
       - name: checkout repo content
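The first `re.sub` pattern in the script expects exactly three data rows after the separator, which matches the old combined Equities/ETFs/Funds table. Once README.md holds the split layout (one Equities row followed by a blank line), that pattern would presumably no longer match, so later runs might leave tables A and B unchanged. A hedged sketch of one alternative: replace each table on its own and accept any number of data rows. `replace_table`, `NEW_TABLE_A`, and the sample README text are hypothetical and not part of the commit.

```python
import re

# The pattern used in the workflow for the combined table (copied from the diff).
ORIGINAL_PATTERN = (
    r"\| Product\s+\| Quantity\s+\| Sectors[^\n]*\| Industries[^\n]*\n"
    r"\|[^\n]*\n(?:\|[^\n]*\n){3}"
)


def replace_table(text, header_columns, new_table):
    """Replace the first Markdown table whose header starts with header_columns.

    Matches the header row, the separator row, and one or more data rows,
    so the row count of the existing table does not matter.
    """
    header = (
        r"\|\s*"
        + r"\s*\|\s*".join(re.escape(col) for col in header_columns)
        + r"[^\n]*\n"
    )
    pattern = header + r"\|[^\n]*\n" + r"(?:\|[^\n]*(?:\n|$))+"
    # A function replacement avoids backslash processing in new_table.
    return re.sub(pattern, lambda _m: new_table + "\n", text, count=1)


# README excerpt in the split layout produced by this commit.
README = (
    "Some key statistics of the database:\n"
    "\n"
    "| Product | Quantity | Sectors | Industries | Countries | Exchanges |\n"
    "| --- | --- | --- | --- | --- | --- |\n"
    "| Equities | 160.113 | 11 | 62 | 113 | 84 |\n"
    "\n"
    "| Product | Quantity | Families | Categories | Exchanges |\n"
    "| --- | --- | --- | --- | --- |\n"
    "| ETFs | 36.485 | 320 | 51 | 51 |\n"
    "| Funds | 57.853 | 1.540 | 74 | 33 |\n"
)

# Placeholder values, not real statistics.
NEW_TABLE_A = (
    "| Product | Quantity | Sectors | Industries | Countries | Exchanges |\n"
    "| --- | --- | --- | --- | --- | --- |\n"
    "| Equities | 0 | 0 | 0 | 0 | 0 |"
)

original_matches = re.search(ORIGINAL_PATTERN, README) is not None
updated = replace_table(README, ("Product", "Quantity", "Sectors"), NEW_TABLE_A)
print("original pattern matches split layout:", original_matches)
print(updated)
```

Because each call touches only one table, running the replacement again on its own output leaves the text unchanged, and table B can be refreshed with a second call keyed on `("Product", "Quantity", "Families")`.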

README.md

Lines changed: 9 additions & 6 deletions
@@ -23,16 +23,19 @@ Some key statistics of the database:
 
 | Product | Quantity | Sectors | Industries | Countries | Exchanges |
 | ----------------- | ---------- | ---------- | ------------- | --------- | --------- |
-| Equities | 158.429 | 12 | 63 | 111 | 83 |
-| ETFs | 36.786 | 295 | 22 | 111 | 53 |
-| Funds | 57.881 | 1541 | 52 | 111 | 34 |
+| Equities | 160.113 | 11 | 62 | 113 | 84 |
+
+| Product | Quantity | Families | Categories | Exchanges |
+| ----------------- | ---------- | ---------- | ------------- | --------- |
+| ETFs | 36.485 | 320 | 51 | 51 |
+| Funds | 57.853 | 1.540 | 74 | 33 |
 
 | Product | Quantity | Category |
 | ----------------- | --------- | --------------------- |
 | Currencies | 2.556 | 175 Currencies |
-| Cryptocurrencies | 3.367 | 352 Cryptocurrencies |
-| Indices | 91.183 | 64 Exchanges |
-| Money Markets | 1.367 | 3 Exchanges |
+| Cryptocurrencies | 3.367 | 351 Cryptocurrencies |
+| Indices | 91.178 | 63 Exchanges |
+| Money Markets | 1.367 | 2 Exchanges |
 
 The Finance Database is used within or referenced by:
