ETFs/Funds data quality + cross-asset invariants + equities country/ISIN backfill + SPAC cleanup + README stats #147
|
Hmm, less enthusiastic about this one given that ETFs and Funds can come in all kinds of flavours. While listed on an American exchange, there are ETFs such as the MSCI World that have a global scope. The country column should therefore not be attached to the exchange but rather the ETF's purpose just like ASML can be on an American exchange but maintain the "Netherlands" string inside the country column. |
|
@JerBouma ok, I will keep automated readme stats only then |
0ecc9d4 to 81f9661
|
@JerBouma had to remove country column from ETFs and funds (in the README) because it was misleading...do you agree? |
|
I'd expect just the README to change. Why all the test files and the GitHub Actions as well? |
README automation is in the GitHub Action: whenever a new equity/ETF/etc. is added, the README is auto-updated as well. I also fixed some data inconsistencies I found by adding tests: there were records inside |
|
This looks good, happy with the changes! |
I am working to further improve data quality of FinanceDatabase...give me some time and I will add to this PR 😉 Thanks for the feedback @JerBouma |
Wonderful, thank you for your work! |
69146c5 to 0d1fc04
… README stats

Data quality on etfs.csv:
- 14 non-ETF rows removed (already correctly in equities.csv: BHF, DTE, RGA, HTGC, PSEC, TPVG, ...)
- 56 cross-asset symbol collisions with equities.csv removed (corporate bonds, senior notes, share-class variants misclassified as ETFs)
- 29 corrupted `exchange` values fixed (issuer name in exchange column: Xtrackers, Fundlogic, Purpose Investments, ...)
- FSST (Fidelity Sustainable U.S. Equity ETF) completed (was all-NaN)
- After cleanup: equities.csv and etfs.csv share zero symbols

Country backfill on equities.csv (50.3% -> 63.1%):
- 15,692 rows filled from primary listing in equities.csv (same base ticker, e.g. ASML.AS country propagated to ASML.DE)
- 5,777 rows filled via TradingView screener API (HQ country, not listing country -- ASML on Nasdaq stays "Netherlands", not "United States")
- 113 additional rows from yfinance lookup
- Skips bases that resolve ambiguously across markets (e.g. numeric bases shared between Chinese .SZ and Korean .KS exchanges)
- Russian Federation -> Russia normalization
- ALMER.PA: Reunion (French overseas dept) -> France

Cross-asset invariants:
- New tests/test_invariants.py with test_no_symbol_collisions_across_asset_classes covering all 7 asset class files (equities, etfs, funds, indices, currencies, cryptos, moneymarkets). Catches drift like the ^REIT-in-etfs case at PR time.

Automated README statistics:
- New Update-README-Statistics job in .github/workflows/database_update.yml regenerates stats tables from database/*.csv after every database update.
- README restructured into three tables (Equities w/ Countries; ETFs/Funds w/o Country; Currencies/Cryptos/Indices/Money Markets) to keep every cell honest -- ETFs/Funds country was a manual placeholder before.

Misc:
- financedatabase/helpers.py: widen base show_options() return type from pd.Series to pd.Index | dict | np.ndarray to match what subclasses actually return (LSP fix; runtime unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0d1fc04 to f9782c3
…onicalize names

Picks up where the prior commit left off after deeper auditing surfaced two more contamination layers in equities.csv.

ISIN cleanup:
- Cross-asset duplicates (eq.csv ∩ etfs.csv): 209 ISINs cleared from equities rows where the ISIN rightfully belongs to the ETF (UBS UCITS, Lyxor, iShares, HSBC etc. previously stamped onto unrelated PNK micro-caps).
- Multi-name ISIN cleanup (different companies sharing one ISIN, which is impossible by ISO design, so one row's ISIN is wrong). Resolved canonical name per ISIN via GLEIF (legalName) + yfinance (longName) + TradingView (description). Total wrong ISIN values cleared: 9,634 across the three passes.
- Multi-name ISIN count: 4,772 baseline -> 1,467 (-69%).
- New invariant test `test_no_isin_collisions_across_asset_classes` enforces no future regression of the cross-asset case (catches the same drift that the symbol invariant catches).

Name canonicalization:
- 8,361 names rewritten to the canonical legalName/description form so that cross-listings of the same company collapse to one spelling instead of 4 variants like "STRABAG SE", "STRABAG SE-BR", "Strabag SE Inhaber-Aktien o.N."
- Conservative threshold: only rewrite when token similarity with canonical >= 0.4 (preserves originals when the row likely refers to a different company than the canonical).

SPAC template removal (upstream data poisoning):
- 1,584 equities.csv rows had identical name="one" + summary about a SPAC that "does not have significant operations. It intends to effect a merger ...", plus identical state=CA / city=San Francisco / zipcode=94129 / website=a-star.co / sector=Financials / industry=Diversified Financial Services. This is A-Star Financial Acquisition Corp's data copy-pasted onto 1,584 unrelated tickers somewhere upstream.
- 200 real names recovered via TradingView + stockevents.app + DuckDuckGo + Finnhub (the rest are micro-caps not in any free public dataset).
- All 1,608 rows carrying the SPAC fingerprint (website=a-star.co) had the contaminated sector/industry_group/industry/state/city/zipcode/website/market_cap fields cleared. The bogus uniform values were more misleading than missing data.

ALMER.PA: Reunion -> France (French overseas department; matches FD's convention for metropolitan/overseas France).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks again, looks good! Merging.. |
Summary
Two-commit data-quality + invariants PR. Started narrow (`etfs.csv` cleanup + README automation) and grew as deeper auditing surfaced two further contamination layers in `equities.csv`.

- Commit 1: ETFs/Funds data quality + cross-asset invariants + country backfill + README stats
- Commit 2: equities.csv ISIN cleanup + SPAC template removal + name canonicalization

What's in:
- `etfs.csv` data quality: 100 rows cleaned (14 non-ETFs already in `equities.csv`, 56 cross-asset symbol collisions, 29 corrupted `exchange` values, FSST completed from all-NaN). After cleanup `equities.csv` / `etfs.csv` / `funds.csv` / `indices.csv` share zero symbols.
- Cross-asset invariants: `tests/test_invariants.py` with two tests:
  - `test_no_symbol_collisions_across_asset_classes`: a symbol belongs to at most one of the 7 asset class files
  - `test_no_isin_collisions_across_asset_classes`: an ISIN belongs to at most one of `equities.csv` / `etfs.csv` (the only two files that track ISIN)
- Country backfill on `equities.csv`: 50.3% → 71.6% using HQ-country semantics (per your earlier review). 35,118 rows filled across 7 sources.
- ISIN backfill on `equities.csv`: 24.6% → 25.6% with 9,634 wrong ISINs cleared via GLEIF + yfinance + TradingView canonical-name verification. Net coverage moves modestly because we removed a lot of bad data alongside adding good data.
- Name canonicalization to the canonical form (`legalName` / `description` / `longName`) so cross-listings of the same company collapse to one spelling. `STRABAG SE` / `STRABAG SE-BR` / `Strabag SE Inhaber-Aktien o.N.` → `STRABAG SE`.
- Automated README statistics: `Update-README-Statistics` workflow job + 3-table layout reflecting the actual schema.

51/51 tests pass (was 49 before; +1 symbol invariant, +1 ISIN invariant).
1. `etfs.csv` data quality

- 14 non-ETF rows removed (already in `equities.csv`)
- 56 cross-asset symbol collisions with `equities.csv` removed
- 29 corrupted `exchange` values fixed
- `FSST` completed (was all-NaN)

Non-ETF removals (14)
`^REIT` (later restored after yfinance verification, see §2.1) plus 13 US-equity rows already in `equities.csv`: `BHF`, `BHFAN`, `BHFAO`, `BHFAP`, `DTB`, `DTE`, `DTP`, `HTGC`, `PBC`, `PSEC`, `RGA`, `RZA`, `TPVG`.

Cross-asset collisions (56)
All 56 were corporate bonds / senior notes / equity share-class rows misclassified as ETFs. Examples: Brighthouse Financial cross-listings, Corvus Gold, senior notes (`AJXA`, `CMSA`, `CMSC`, `CTBB`, `CTDD`, `CNFRL`, `ARGD`), `ASGI`, DTE Energy variants.

Corrupted `exchange` values (29)
Issuer name in the `exchange` column. Re-derived from ticker suffix via FD's own mapping:
- `Xtrackers` (`DXSP.DE`, `DXSP.F`, `XSSX.L`, `XSSX.MI`) → `FRA`, `FRA`, `LSE`, `MIL`
- `Fundlogic` (`EHEF.L`, `EMHF.L`, `JHEF.L`, `MEEU.L`, `MIVS.L`) → `LSE`
- `Purpose Investments`, `CI Investments`, `Horizons ETFs Management`, `Harvest Portfolios Group`, `IA Clarington Investments`, `National Bank Investments`, `Caldwell Investment Management` (all `.TO`) → `TOR`
- `Developed Markets`, `Emerging Markets`, `High Yield Bonds` (descriptive labels on `.F`) → `FRA`

`FSST` completion
Was all-NaN. Now: `Fidelity Sustainable U.S. Equity ETF`, USD, full summary, `category_group=Equities`, `category=Large Blend`, `family=Fidelity Investments`, `exchange=PCX`, `isin=US3160922791`.

2. Cross-asset invariants
`tests/test_invariants.py`: two complementary invariants, pure-assert tests (no recorder fixture).

2.1 `test_no_symbol_collisions_across_asset_classes`
Covers symbols across all 7 asset class files. Caught the 70-row cleanup above; will catch future drift.
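
A minimal sketch of the check behind this kind of invariant, assuming each asset class file has already been reduced to a set of symbols. The helper name `find_symbol_collisions` and the toy data are hypothetical and are not taken from `tests/test_invariants.py`.

```python
from collections import defaultdict


def find_symbol_collisions(symbols_by_class):
    """Return {symbol: sorted asset classes} for symbols found in more than one class."""
    owners = defaultdict(set)
    for asset_class, symbols in symbols_by_class.items():
        for symbol in symbols:
            owners[symbol].add(asset_class)
    return {s: sorted(classes) for s, classes in owners.items() if len(classes) > 1}


# Toy data standing in for the symbol columns of the asset class files.
sample = {
    "equities": {"BHF", "DTE", "ASML"},
    "etfs": {"^REIT", "FSST", "BHF"},  # BHF is the deliberate collision
    "indices": {"^ARB-EU"},
}

collisions = find_symbol_collisions(sample)
print(collisions)  # {'BHF': ['equities', 'etfs']}
```

In a pure-assert test the same helper would simply be followed by `assert not collisions`, so any offending symbol and its files show up in the failure message.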
While iterating I tried moving 60 `^…` tickers from `etfs.csv` to `indices.csv` on the assumption that a `^` prefix means index. yfinance's `quoteType` showed 57 of them are actually classified `ETF` (intraday-indicative-value siblings of real ETFs: `^ACWI`, `^BND`, `^ONEQ`, `^REIT`, `^VYMI`, …). Only `^ARB-EU` / `^ARB-NV` / `^ARB-TC` come back as `quoteType=INDEX`. Final state: 57 restored to `etfs.csv` with original metadata, 3 in `indices.csv`. The invariant test keeps this consistent.

2.2 `test_no_isin_collisions_across_asset_classes`
ISIN is a unique identifier per security by ISO design; the same code appearing in both `equities.csv` and `etfs.csv` means one row has been mis-tagged with an ISIN that rightfully belongs to the other.

Baseline had 209 such collisions, almost all following the same pattern: a legitimate ETF ISIN (UBS UCITS, Lyxor MSCI, iShares MSCI, HSBC S&P 500, …) had been stamped onto unrelated PNK micro-cap rows that happened to share a ticker substring with the ETF. Cleared the equities-side ISIN for all 209. Now zero collisions; the test prevents regression.
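
The clearing step can be sketched as follows, assuming rows are plain dicts with `symbol` and `isin` keys. The function name `clear_cross_asset_isins`, the tickers `AAAA`/`BBBB`/`CCCC`, and the `XX…` ISIN are made up for illustration; only the FSST ISIN comes from this PR.

```python
def clear_cross_asset_isins(equity_rows, etf_rows):
    """Blank the equities-side ISIN wherever the same ISIN appears on an ETF row.

    Rows are plain dicts here; returns the number of equity rows cleared.
    """
    etf_isins = {row["isin"] for row in etf_rows if row.get("isin")}
    cleared = 0
    for row in equity_rows:
        if row.get("isin") in etf_isins:
            row["isin"] = None  # missing is preferred over a wrong identifier
            cleared += 1
    return cleared


# Hypothetical rows: only the FSST ISIN is real, the rest is placeholder data.
etfs = [{"symbol": "FSST", "isin": "US3160922791"}]
equities = [
    {"symbol": "AAAA", "isin": "US3160922791"},  # collides with the ETF row
    {"symbol": "BBBB", "isin": "XX0000000001"},  # unrelated, left alone
    {"symbol": "CCCC", "isin": None},            # already missing
]

print(clear_cross_asset_isins(equities, etfs))  # 1
print([row["isin"] for row in equities])        # [None, 'XX0000000001', None]
```

Running the function a second time clears nothing, which is the property the invariant test relies on.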
3. Country backfill on `equities.csv` (50.3% → 71.6%)

Per your earlier comment, country should reflect HQ, not the listing exchange (ASML on Nasdaq is still Netherlands). Used HQ-country semantics end-to-end through 7 sources, in order of yield:
- Primary-listing propagation for the same base ticker (`ASML.AS` country → `ASML.DE`, `ASML.MU`, …)
- GLEIF (`legalAddress.country`)
- yfinance (`info["country"]`)
- …

The TradingView pass is the key one for the HQ-semantics concern: an ADR of a German company on NYSE still gets `country=Germany`. Cross-checked against yfinance on a sample: 82% agreement. Disagreements were mostly ambiguous short numeric bases shared between Chinese `.SZ` and Korean `.KS` (e.g. base `000050`). The script skips those: 67 ambiguous rows were left as `country=NaN` rather than guessed.

Normalisations:
- `Russian Federation` → `Russia` (TV uses ISO formal name, FD has always used short form; 19 rows).
- `Reunion` → `France` for `ALMER.PA` (French overseas department).

Coverage tops out with ~29% still missing because the remaining rows are ISIN-as-ticker codes, Euronext `.NX` warrant series, and obscure cross-listings on German secondary exchanges (BER, DUS, MUN, STU) of micro-caps that aren't in yfinance, TradingView, GLEIF, Wikidata, OpenFIGI, Finnhub, valueinvesting.io, stockevents.app, or DuckDuckGo. Not reachable from free public APIs.

4. ISIN backfill + deep cleanup
Net coverage move: 24.6% → 25.6%. The headline is small because we also cleared a lot of bad data alongside adding good data.
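
A small worked example of why the net figure moves so little, assuming coverage is simply the share of rows with a non-empty `isin`. The `coverage` helper and the ten placeholder rows are hypothetical; the numbers are toy values, not the PR's.

```python
def coverage(rows, column):
    """Share of rows with a non-empty value in `column` (0.0 for an empty list)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column)) / len(rows)


# Ten hypothetical rows: four carry an ISIN, two of which are wrong.
rows = [{"isin": None} for _ in range(10)]
for i in range(4):
    rows[i]["isin"] = f"XX000000000{i}"
before = coverage(rows, "isin")

# Cleanup pass: clear the two wrong ISINs.
rows[0]["isin"] = None
rows[1]["isin"] = None
after_cleanup = coverage(rows, "isin")

# Backfill pass: three previously empty rows get a verified ISIN.
for i in range(4, 7):
    rows[i]["isin"] = f"XX000000001{i}"
after_backfill = coverage(rows, "isin")

print(before, after_cleanup, after_backfill)  # 0.4 0.2 0.5
```

Three values were added but coverage rose by only one row's worth net of the two that were cleared, which mirrors the small headline change here.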
- … (`isin = symbol`)
- Cross-asset duplicates cleared (`isin` in eq.csv ∩ etfs.csv): 209

Multi-name ISINs (same ISIN attached to rows with different company names, which is impossible by ISO 6166 design):
- 4,772 baseline → 1,467 (-69%)

Residual ~1,467 are mostly legitimate cross-listing naming variations of the same company (different exchange suffixes produce slightly different names). True collisions left in the residual are limited to ISINs for which neither GLEIF, yfinance, nor TradingView had data (mostly deprecated country codes like `AN` Netherlands Antilles, and ISINs without an associated LEI).

5. Name canonicalization
Same canonical-name lookup (GLEIF `legalName` + TradingView `description`) used in §4, applied to the `name` column: when a row's existing name has token similarity ≥ 0.4 with the canonical, rewrite it to the canonical form. This collapses cross-listing name variations to one spelling without overwriting genuinely different-company rows.

Result: 8,361 names rewritten. Examples:
- `STRABAG SE` / `STRABAG SE-BR` / `Strabag SE Inhaber-Aktien o.N.` → `STRABAG SE`
- `Lenzing Aktiengesellschaft` / `LENZING AG LENZING ORD SHS` → `Lenzing Aktiengesellschaft`
- `AMAG AUSTRIA METALL INH.` / `AMAG Austria Metall AG Inhaber-` → `AMAG Austria Metall AG`

6. SPAC template removal
While running canonical-name passes I noticed 1,584 equities.csv rows sharing identical:
- `name = "one"`
- `summary = "one does not have significant operations. It intends to effect a merger, capital stock exchange, asset acquisition…"`
- `state = "CA"`, `city = "San Francisco"`, `zipcode = 94129`
- `website = "http://www.a-star.co"`
- `sector = "Financials"`, `industry = "Diversified Financial Services"`
- `market_cap = "Small Cap"`

This is A-Star Financial Acquisition Corp's profile copy-pasted onto 1,584 unrelated tickers somewhere upstream, a clear case of data poisoning. The same SPAC's data has no business being on Malaysian `.KL` listings, Frankfurt `.F` warrants, etc.

Two-step cleanup:
1. Recovered 200 real names via TradingView + stockevents.app + DuckDuckGo + Finnhub (the rest are micro-caps not in any free public dataset).
2. For all 1,608 rows carrying the `website=a-star.co` fingerprint (the 200 we recovered names for + the 1,408 we didn't), cleared the contaminated fields to `NaN`: `sector`, `industry_group`, `industry`, `state`, `city`, `zipcode`, `website`, `market_cap`. Bogus uniform values were more misleading than missing data. Real names preserved where we recovered them.

The remaining ~1,400 rows with cleared name still carry `symbol` + `exchange` + `market` + `currency` (and `country` where we resolved it independently), which is enough for them to be addressable without carrying a fake company profile.

7. Automated README statistics
New `Update-README-Statistics` job in `.github/workflows/database_update.yml`. Runs after the existing `Add-New-Ticker` / `Update-Compression-Files` / `Update-Categorization-Files` chain.

README layout (three tables)
- A: Equities (with Countries)
- B: ETFs / Funds (without Country)
- C: Currencies / Cryptos / Indices / Money Markets

The previous combined "Equities / ETFs / Funds" table had a `Countries` column whose ETFs/Funds cells (`111`, `111`) were a manual placeholder not backed by any column. Splitting into A + B keeps every cell honest. B-table headings are `Families / Categories` (not `Sectors / Industries`) because that's what the underlying CSV columns are for those asset classes.

Test plan
- `pytest tests/`: 51 tests pass (was 49 before; +2 invariant tests)
- `black --check tests/ financedatabase/` clean
- `Update-README-Statistics` job rewrites README correctly on next data update
- `BHF`, `DTE`, `RGA`, … only in `equities.csv`
- `^REIT`, `^ACWI`, `^BND` still in `etfs.csv`; `^ARB-EU`, `^ARB-NV`, `^ARB-TC` in `indices.csv`
- `ASML.DE`, `ASML.MU`, … have `country=Netherlands` (HQ), not Germany (listing)
- `STRABAG SE-BR` / `Strabag SE Inhaber-Aktien o.N.` collapse to `STRABAG SE`
- … `website=http://www.a-star.co`
- … `equities.csv` and `etfs.csv`

Related: #140 (test infrastructure), #143 (introduced the `exchange → market` invariant pattern), #144 (proposes splitting CSVs by exchange).