
ETFs/Funds data quality + cross-asset invariants + equities country/ISIN backfill + SPAC cleanup + README stats #147

Merged
JerBouma merged 2 commits into JerBouma:main from dokson:feature/automate-readme-stats on May 21, 2026

Conversation

@dokson
Contributor

@dokson dokson commented May 19, 2026

Summary

Two-commit data-quality + invariants PR. Started narrow (etfs.csv cleanup + README automation) and grew as deeper auditing surfaced two further contamination layers in equities.csv.

Commit 1 — ETFs/Funds data quality + cross-asset invariants + country backfill + README stats
Commit 2 — equities.csv ISIN cleanup + SPAC template removal + name canonicalization

What's in:

  1. etfs.csv data quality — 100 rows cleaned (14 non-ETFs already in equities.csv, 56 cross-asset symbol collisions, 29 corrupted exchange values, FSST completed from all-NaN). After cleanup equities.csv / etfs.csv / funds.csv / indices.csv share zero symbols.
  2. Cross-asset invariants in CI (tests/test_invariants.py), with two tests:
    • test_no_symbol_collisions_across_asset_classes — symbol belongs to at most one of the 7 asset class files
    • test_no_isin_collisions_across_asset_classes — ISIN belongs to at most one of equities.csv/etfs.csv (the only two files that track ISIN)
  3. Country backfill on equities.csv: 50.3% → 71.6% using HQ-country semantics (per your earlier review). 34,607 rows filled across 7 sources (per-source breakdown in §3).
  4. ISIN backfill + cleanup on equities.csv: 24.6% → 25.6%, with 9,634 wrong ISINs cleared via GLEIF + yfinance + TradingView canonical-name verification. Net coverage moves modestly because we removed a lot of bad data alongside adding good data.
  5. Name canonicalization: 8,361 names rewritten to canonical (legalName / description / longName) so cross-listings of the same company collapse to one spelling. STRABAG SE / STRABAG SE-BR / Strabag SE Inhaber-Aktien o.N. → STRABAG SE.
  6. SPAC template removal — 1,584 baseline rows had identical fake metadata copy-pasted from A-Star Financial Acquisition Corp's SPAC profile. 200 real names recovered, 1,608 rows of bogus location/sector fields cleared.
  7. Automated README statistics — new Update-README-Statistics workflow job + 3-table layout reflecting the actual schema.

51/51 tests pass (was 49 before; +1 symbol invariant, +1 ISIN invariant).


1. etfs.csv data quality

| Fix | Rows |
| --- | --- |
| Non-ETF rows removed (already correctly in equities.csv) | 14 |
| Cross-asset symbol collisions with equities.csv removed | 56 |
| Corrupted exchange values fixed | 29 |
| FSST completed (was all-NaN) | 1 |

Non-ETF removals (14)

^REIT (later restored after yfinance verification — see §2.1) plus 13 US-equity rows already in equities.csv: BHF, BHFAN, BHFAO, BHFAP, DTB, DTE, DTP, HTGC, PBC, PSEC, RGA, RZA, TPVG.

Cross-asset collisions (56)

All 56 were corporate bonds / senior notes / equity share-class rows misclassified as ETFs. Examples: Brighthouse Financial cross-listings, Corvus Gold, senior notes (AJXA, CMSA, CMSC, CTBB, CTDD, CNFRL, ARGD), ASGI, DTE Energy variants.

Corrupted exchange values (29)

The exchange column held an issuer name (or a descriptive label) instead of an exchange code. Values were re-derived from the ticker suffix via FD's own mapping:

  • Xtrackers (DXSP.DE, DXSP.F, XSSX.L, XSSX.MI) → FRA, FRA, LSE, MIL
  • Fundlogic (EHEF.L, EMHF.L, JHEF.L, MEEU.L, MIVS.L) → LSE
  • Purpose Investments, CI Investments, Horizons ETFs Management, Harvest Portfolios Group, IA Clarington Investments, National Bank Investments, Caldwell Investment Management (all .TO) → TOR
  • Developed Markets, Emerging Markets, High Yield Bonds (descriptive labels on .F) → FRA
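
A minimal sketch of the suffix-based repair, assuming a plain dictionary lookup. The suffix table below contains only the pairs listed above; FD's real mapping is larger and is not shown in this PR, and the names `SUFFIX_TO_EXCHANGE`, `exchange_from_symbol` and `repair_exchange` are hypothetical.

```python
from __future__ import annotations

# Suffix -> exchange code, limited to the pairs named in the list above.
# Illustrative subset only, not FinanceDatabase's actual mapping.
SUFFIX_TO_EXCHANGE = {
    "DE": "FRA",
    "F": "FRA",
    "L": "LSE",
    "MI": "MIL",
    "TO": "TOR",
}


def exchange_from_symbol(symbol: str) -> str | None:
    """Return the exchange code implied by the ticker suffix, or None if unknown."""
    base, dot, suffix = symbol.rpartition(".")
    if not dot or not base:
        return None  # no suffix at all, e.g. a plain US ticker
    return SUFFIX_TO_EXCHANGE.get(suffix.upper())


def repair_exchange(row: dict, valid_exchanges: set[str]) -> dict:
    """Replace an exchange value that is not a known code (e.g. an issuer name)."""
    if row.get("exchange") in valid_exchanges:
        return row
    derived = exchange_from_symbol(row["symbol"])
    # Leave the row untouched when the suffix is not in the table.
    return {**row, "exchange": derived} if derived else row
```

For example, `repair_exchange({"symbol": "EHEF.L", "exchange": "Fundlogic"}, {"LSE", "FRA"})` would set the exchange to `LSE`, while a row whose suffix is missing from the table is returned unchanged rather than guessed.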

FSST completion

Was all-NaN. Now: Fidelity Sustainable U.S. Equity ETF, USD, full summary, category_group=Equities, category=Large Blend, family=Fidelity Investments, exchange=PCX, isin=US3160922791.


2. Cross-asset invariants

tests/test_invariants.py — two complementary invariants, pure-assert tests (no recorder fixture).

2.1 — test_no_symbol_collisions_across_asset_classes

Checks that no symbol appears in more than one of the 7 asset class files. It caught the 70 rows cleaned up above (14 non-ETFs + 56 collisions) and will catch future drift.

While iterating I tried moving 60 ^… tickers from etfs.csv to indices.csv on the assumption that ^ prefix = index. yfinance's quoteType showed 57 of them are actually classified ETF (intraday-indicative-value siblings of real ETFs: ^ACWI, ^BND, ^ONEQ, ^REIT, ^VYMI, …). Only ^ARB-EU/NV/TC come back as quoteType=INDEX. Final state: 57 restored to etfs.csv with original metadata, 3 in indices.csv. The invariant test enforces this stays consistent.
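
The symbol invariant can be sketched as below. This is not the code in tests/test_invariants.py, which is not shown in this thread; it assumes the symbols of each file have already been loaded into sets, and `find_symbol_collisions` is a hypothetical helper name.

```python
from __future__ import annotations

from collections import defaultdict


def find_symbol_collisions(symbols_by_file: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map every symbol that appears in more than one file to the files holding it."""
    owners: dict[str, list[str]] = defaultdict(list)
    for filename, symbols in symbols_by_file.items():
        for symbol in symbols:
            owners[symbol].append(filename)
    return {symbol: sorted(files) for symbol, files in owners.items() if len(files) > 1}


def check_no_symbol_collisions(symbols_by_file: dict[str, set[str]]) -> None:
    """Pure-assert check: fails and lists the offenders if any symbol is shared."""
    collisions = find_symbol_collisions(symbols_by_file)
    assert not collisions, f"symbols present in several asset classes: {collisions}"
```

Returning the offending symbols together with their files keeps the failure message actionable: a reviewer sees at once which row has to move or be dropped.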

2.2 — test_no_isin_collisions_across_asset_classes

ISIN is a unique identifier per security by ISO design; the same code appearing in both equities.csv and etfs.csv means one row has been mis-tagged with an ISIN that rightfully belongs to the other.

Baseline had 209 such collisions — almost all the same pattern: legitimate ETF ISIN (UBS UCITS, Lyxor MSCI, iShares MSCI, HSBC S&P 500, …) had been stamped onto unrelated PNK micro-cap rows that happened to share a ticker substring with the ETF. Cleared the equities-side ISIN for all 209. Now zero collisions; the test prevents regression.
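
A rough sketch of the ISIN check and the equities-side clearing, assuming each CSV has `symbol` and `isin` columns and that an empty cell means "no ISIN". The function names are hypothetical and the CSV content is passed in as text to keep the example self-contained.

```python
from __future__ import annotations

import csv
import io


def load_isins(csv_text: str) -> dict[str, list[str]]:
    """Map each non-empty ISIN to the symbols that carry it."""
    isins: dict[str, list[str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        isin = (row.get("isin") or "").strip()
        if isin:  # an empty cell is a missing ISIN, never a collision
            isins.setdefault(isin, []).append(row["symbol"])
    return isins


def isin_collisions(equities_csv: str, etfs_csv: str) -> dict[str, tuple[list[str], list[str]]]:
    """Return ISINs present in both files with the symbols on each side."""
    equities, etfs = load_isins(equities_csv), load_isins(etfs_csv)
    shared = sorted(equities.keys() & etfs.keys())
    return {isin: (equities[isin], etfs[isin]) for isin in shared}


def clear_equity_isins(equity_rows: list[dict], etf_isins: set[str]) -> list[dict]:
    """Blank the ISIN on equity rows whose ISIN belongs to an ETF."""
    return [
        {**row, "isin": None} if row.get("isin") in etf_isins else row
        for row in equity_rows
    ]
```

Clearing only the equities side mirrors the decision described above, where the ETF was the legitimate owner in the collisions found; a collision with the opposite ownership would need a manual look.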


3. Country backfill on equities.csv (50.3% → 71.6%)

Per your earlier comment, country should reflect HQ, not the listing exchange (ASML on Nasdaq is still Netherlands). Used HQ-country semantics end-to-end through 7 sources, in order of yield:

| Source | Rows filled |
| --- | --- |
| In-file primary-listing propagation (ASML.AS country → ASML.DE, ASML.MU, …) | 15,692 |
| ISO 6166 ISIN-prefix decoding (ticker that IS an ISIN → first 2 chars = country) | 12,574 |
| TradingView screener (HQ country, not listing country) | 5,777 |
| GLEIF (ISIN → LEI → legalAddress.country) | 451 |
| yfinance info["country"] | 113 |
| Wikidata / Euronext direct | 0 |
| Total | 34,607 |
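
The ISIN-prefix row relies on recognising when a ticker is itself an ISIN. One way to do that, sketched here as an assumption about the approach rather than the script used in this PR, is to check the ISO 6166 shape and its Luhn check digit before reading the two-letter prefix. `is_valid_isin` and `isin_prefix` are hypothetical names.

```python
from __future__ import annotations

import re

# Two letters, nine alphanumerics, one numeric check digit.
ISIN_PATTERN = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")


def is_valid_isin(code: str) -> bool:
    """Check the ISIN shape and its Luhn check digit."""
    if not ISIN_PATTERN.match(code):
        return False
    # Letters expand to two digits (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in code)
    total = 0
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def isin_prefix(symbol: str) -> str | None:
    """Return the 2-letter ISIN prefix when the symbol is a valid ISIN, else None."""
    return symbol[:2] if is_valid_isin(symbol) else None
```

With the FSST ISIN from §1, `isin_prefix("US3160922791")` gives `"US"`, while an ordinary ticker such as `"ASML.AS"` gives `None`. Turning the prefix into a country name would still need a separate lookup table, which is left out here.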

The TradingView pass is the key one for the HQ-semantics concern: an ADR of a German company on NYSE still gets country=Germany. Cross-checked against yfinance on a sample — 82% agreement. Disagreements were mostly ambiguous short numeric bases shared between Chinese .SZ and Korean .KS (e.g. base 000050). The script skips those — 67 ambiguous rows were left as country=NaN rather than guessed.
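
The propagation step with its "skip ambiguous bases" rule can be sketched as follows. It assumes the base ticker is everything before the last dot and treats a base as ambiguous when its listings already carry more than one country; the actual script may use different rules, and the function names are hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def base_ticker(symbol: str) -> str:
    """ASML.DE -> ASML; symbols without a suffix are returned unchanged."""
    return symbol.rsplit(".", 1)[0] if "." in symbol else symbol


def propagate_country(rows: list[dict]) -> list[dict]:
    """Fill a missing country from sibling listings that agree on one country."""
    countries_by_base: dict[str, set[str]] = defaultdict(set)
    for row in rows:
        if row.get("country"):
            countries_by_base[base_ticker(row["symbol"])].add(row["country"])

    filled = []
    for row in rows:
        known = countries_by_base.get(base_ticker(row["symbol"]), set())
        # Only fill when the base resolves to exactly one country; bases such as
        # 000050 (shared by .SZ and .KS listings) stay empty rather than guessed.
        if not row.get("country") and len(known) == 1:
            row = {**row, "country": next(iter(known))}
        filled.append(row)
    return filled
```

Leaving ambiguous bases empty trades a little coverage for correctness, which matches the choice above to keep 67 rows as NaN.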

Normalisations: Russian Federation → Russia (TV uses the ISO formal name, FD has always used the short form; 19 rows). Reunion → France for ALMER.PA (French overseas department).

Coverage tops out with ~28% of rows still missing a country (100% − 71.6%) because the remaining rows are ISIN-as-ticker codes, Euronext .NX warrant series, and obscure cross-listings on German secondary exchanges (BER, DUS, MUN, STU) of micro-caps that aren't in yfinance, TradingView, GLEIF, Wikidata, OpenFIGI, Finnhub, valueinvesting.io, stockevents.app, or DuckDuckGo. Not reachable from free public APIs.


4. ISIN backfill + deep cleanup

Net coverage move: 24.6% → 25.6%. The headline is small because we also cleared a lot of bad data alongside adding good data.

| Source / pass | Effect |
| --- | --- |
| ISIN-as-ticker decoding (symbol is itself a valid ISIN → set isin = symbol) | +11,287 rows |
| Cross-asset cleanup (isin in eq.csv ∩ etfs.csv) | -598 rows wrong (209 unique ISINs × cross-listings) |
| Multi-name ISIN cleanup, GLEIF canonical name pass | -3,652 rows wrong |
| Multi-name ISIN cleanup, yfinance canonical name pass | -3,884 rows wrong |
| Multi-name ISIN cleanup, TradingView canonical name pass | -1,500 rows wrong |
| Wrong ISINs cleared in total | 9,634 |
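
The figures above can be cross-checked with a few lines of arithmetic. The row count of 160,113 is an assumption taken from the Equities quantity in the README layout in §7; the count at the time of each pass may have differed slightly.

```python
# Rows gained and rows cleared, copied from the table above.
added = 11_287
cleared = {
    "cross-asset": 598,
    "GLEIF": 3_652,
    "yfinance": 3_884,
    "TradingView": 1_500,
}

total_cleared = sum(cleared.values())  # 9,634, matching the table's last row
net_rows = added - total_cleared       # 1,653 rows net gain

# Assumption: equities.csv holds 160,113 rows (README table in section 7).
total_rows = 160_113
net_points = 100 * net_rows / total_rows  # about 1.0 percentage point

print(total_cleared, net_rows, round(net_points, 2))
```

A net gain of roughly one percentage point is consistent with the reported move from 24.6% to 25.6%.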

Multi-name ISINs (same ISIN attached to rows with different company names — impossible by ISO 6166 design):

  • Baseline: 4,772 multi-name ISINs
  • After three canonical-name passes (clear rows whose name has token similarity < 0.3 with canonical): 1,467 multi-name ISINs (-69%).

Residual ~1,467 are mostly legitimate cross-listing naming variations of the same company (different exchange suffixes produce slightly different names). True-collision bugs in the residual are bounded by which ISINs neither GLEIF nor yfinance nor TradingView had data for (mostly deprecated country codes like AN Netherlands Antilles, and ISINs without an associated LEI).
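
Detecting multi-name ISINs amounts to grouping rows by ISIN and counting distinct names. The sketch below assumes names are compared after lower-casing and whitespace normalisation only, which is why spelling variants of one company still count as "multi-name"; `multi_name_isins` and `reduction_pct` are hypothetical helpers.

```python
from __future__ import annotations

from collections import defaultdict


def multi_name_isins(rows: list[dict]) -> dict[str, list[str]]:
    """Return ISINs attached to more than one distinct (normalised) name."""
    names_by_isin: dict[str, set[str]] = defaultdict(set)
    for row in rows:
        isin, name = row.get("isin"), row.get("name")
        if isin and name:
            # Case and spacing differences are not treated as different names.
            names_by_isin[isin].add(" ".join(name.lower().split()))
    return {
        isin: sorted(names)
        for isin, names in names_by_isin.items()
        if len(names) > 1
    }


def reduction_pct(before: int, after: int) -> float:
    """Percentage drop between two counts."""
    return 100 * (before - after) / before
```

`reduction_pct(4772, 1467)` evaluates to about 69.3, in line with the -69% reported above.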


5. Name canonicalization

Same canonical-name lookup (GLEIF legalName + TradingView description) used in §4, applied to the name column: when a row's existing name has token similarity ≥ 0.4 with the canonical, rewrite it to the canonical form. This collapses cross-listing name variations to one spelling without overwriting genuinely different-company rows.

Result: 8,361 names rewritten. Examples:

  • STRABAG SE / STRABAG SE-BR / Strabag SE Inhaber-Aktien o.N. → STRABAG SE
  • Lenzing Aktiengesellschaft / LENZING AG LENZING ORD SHS → Lenzing Aktiengesellschaft
  • AMAG AUSTRIA METALL INH. / AMAG Austria Metall AG Inhaber- → AMAG Austria Metall AG
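
The PR does not say which token-similarity measure was used, so the sketch below is an assumption: it uses the overlap coefficient (shared tokens divided by the size of the smaller token set) after dropping a small, illustrative set of legal-form tokens. It applies both thresholds mentioned in this description: below 0.3 the ISIN is cleared (§4), at 0.4 or above the name is rewritten (§5). All names in the code are hypothetical.

```python
from __future__ import annotations

import re

# Illustrative legal-form tokens ignored when comparing names (assumption).
LEGAL_FORMS = {"se", "ag", "sa", "nv", "plc", "inc", "corp", "ltd"}

CLEAR_ISIN_BELOW = 0.3   # threshold quoted in section 4
REWRITE_NAME_FROM = 0.4  # threshold quoted in section 5


def tokens(name: str) -> set[str]:
    """Lower-case alphanumeric tokens without legal-form words."""
    return set(re.findall(r"[a-z0-9]+", name.lower())) - LEGAL_FORMS


def token_overlap(a: str, b: str) -> float:
    """Overlap coefficient: shared tokens / size of the smaller token set."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def reconcile(row: dict, canonical: str) -> dict:
    """Rewrite the name, clear the ISIN, or leave the row alone."""
    score = token_overlap(row["name"], canonical)
    if score >= REWRITE_NAME_FROM:
        return {**row, "name": canonical}  # same company, different spelling
    if score < CLEAR_ISIN_BELOW:
        return {**row, "isin": None}       # likely a different company
    return row                             # in between: too uncertain to touch
```

Under these assumptions, `Strabag SE Inhaber-Aktien o.N.` scores 1.0 against `STRABAG SE` and is rewritten, while a name sharing only a legal form with the canonical scores 0.0 and loses its ISIN. A stricter measure such as Jaccard would reject some of the examples listed above, which is why it was not chosen for this sketch.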

6. SPAC template removal

While running canonical-name passes I noticed 1,584 equities.csv rows sharing identical:

  • name = "one"
  • summary = "one does not have significant operations. It intends to effect a merger, capital stock exchange, asset acquisition…"
  • state = "CA", city = "San Francisco", zipcode = 94129
  • website = "http://www.a-star.co"
  • sector = "Financials", industry = "Diversified Financial Services"
  • market_cap = "Small Cap"

This is A-Star Financial Acquisition Corp's profile copy-pasted onto 1,584 unrelated tickers somewhere upstream, a clear case of data poisoning. That SPAC's data has no business being on Malaysian .KL listings, Frankfurt .F warrants, etc.

Two-step cleanup:

  1. Name recovery (200 rows) — for each affected symbol, queried in order: TradingView screener (62 hits), stockevents.app (136 hits), DuckDuckGo search (2 hits before throttling), Finnhub (1 hit), yfinance / Yahoo Search / OpenFIGI / Boerse Frankfurt / valueinvesting.io (0 each). 200 real names recovered.
  2. Metadata clearing — for all 1,608 rows still bearing the website=a-star.co fingerprint (the 200 we recovered names for + the 1,408 we didn't), cleared the contaminated fields to NaN: sector, industry_group, industry, state, city, zipcode, website, market_cap. Bogus uniform values were more misleading than missing data. Real names preserved where we recovered them.

The remaining ~1,400 rows with cleared name still carry symbol + exchange + market + currency (and country where we resolved it independently), which is enough for them to be addressable but no longer carry a fake company profile.
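
The two-step cleanup can be sketched as a first-hit-wins lookup followed by field clearing. The resolvers are hypothetical callables standing in for the sources listed above (none of their real APIs are used here), and rows are plain dictionaries rather than the project's actual data structures.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional

FINGERPRINT_WEBSITE = "http://www.a-star.co"
TEMPLATE_NAME = "one"
CONTAMINATED_FIELDS = (
    "sector", "industry_group", "industry",
    "state", "city", "zipcode", "website", "market_cap",
)

# A resolver takes a symbol and returns a company name or None (hypothetical).
Resolver = Callable[[str], Optional[str]]


def recover_name(symbol: str, resolvers: Iterable[Resolver]) -> Optional[str]:
    """Query the sources in order and return the first name found."""
    for resolve in resolvers:
        name = resolve(symbol)
        if name:
            return name
    return None


def clean_row(row: dict, resolvers: Iterable[Resolver]) -> dict:
    """Clear the SPAC template fields on rows carrying the website fingerprint."""
    if row.get("website") != FINGERPRINT_WEBSITE:
        return row  # not contaminated
    cleaned = {**row, **{field: None for field in CONTAMINATED_FIELDS}}
    if row.get("name") == TEMPLATE_NAME:
        # Replace the template name with a recovered one, or leave it empty.
        cleaned["name"] = recover_name(row["symbol"], resolvers)
    return cleaned
```

Fields outside `CONTAMINATED_FIELDS`, such as symbol, exchange, market and currency, pass through untouched, so a cleaned row stays addressable.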


7. Automated README statistics

New Update-README-Statistics job in .github/workflows/database_update.yml. Runs after the existing Add-New-Ticker / Update-Compression-Files / Update-Categorization-Files chain.

README layout (three tables)

## Table A — Equities (where Country has clear semantics: HQ)
| Product  | Quantity | Sectors | Industries | Countries | Exchanges |
| Equities | 160.113  | 11      | 62         | 113       | 84        |

## Table B — ETFs / Funds (no Country: column intentionally absent)
| Product | Quantity | Families | Categories | Exchanges |
| ETFs    | 36.485   | 320      | 51         | 51        |
| Funds   | 57.853   | 1.540    | 74         | 33        |

## Table C — Currencies / Cryptos / Indices / Money Markets (unchanged)
| Product           | Quantity | Category              |
| Currencies        | 2.556    | 175 Currencies        |
| Cryptocurrencies  | 3.367    | 351 Cryptocurrencies  |
| Indices           | 91.178   | 63 Exchanges          |
| Money Markets     | 1.367    | 2 Exchanges           |

The previous combined "Equities / ETFs / Funds" table had a Countries column whose ETFs/Funds cells (111, 111) were a manual placeholder not backed by any column. Splitting into A + B keeps every cell honest. B-table headings are Families / Categories (not Sectors / Industries) because that's what the underlying CSV columns are for those asset classes.
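
How such a row might be generated from the CSVs is sketched below; this is not the workflow's actual script. It assumes each cell counts the distinct non-empty values of one column and that quantities use a dot as the thousands separator, as in the layout above. `format_quantity`, `count_unique` and `stats_row` are hypothetical names.

```python
from __future__ import annotations


def format_quantity(n: int) -> str:
    """Format 160113 as '160.113' (dot as thousands separator)."""
    return f"{n:,}".replace(",", ".")


def count_unique(rows: list[dict], column: str) -> int:
    """Count distinct non-empty values in one column."""
    return len({row[column] for row in rows if row.get(column)})


def stats_row(product: str, rows: list[dict], columns: list[str]) -> str:
    """Build one Markdown table row: product, row count, one count per column."""
    cells = [product, format_quantity(len(rows))]
    cells += [format_quantity(count_unique(rows, column)) for column in columns]
    return "| " + " | ".join(cells) + " |"
```

Because every cell is derived from a real column, a table such as B can only show Families, Categories and Exchanges for ETFs and Funds; there is no column from which a Countries cell could be computed.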


Test plan

  • pytest tests/ — 51 tests pass (was 49 before; +2 invariant tests)
  • black --check tests/ financedatabase/ clean
  • CI Update-README-Statistics job rewrites README correctly on next data update
  • Spot-check: BHF, DTE, RGA, … only in equities.csv
  • Spot-check: ^REIT, ^ACWI, ^BND still in etfs.csv; ^ARB-EU, ^ARB-NV, ^ARB-TC in indices.csv
  • Spot-check: ASML.DE, ASML.MU, … have country=Netherlands (HQ), not Germany (listing)
  • Spot-check: STRABAG SE-BR / Strabag SE Inhaber-Aktien o.N. collapse to STRABAG SE
  • Spot-check: no row has website=http://www.a-star.co
  • Spot-check: no ISIN appears in both equities.csv and etfs.csv

Related: #140 (test infrastructure), #143 (introduced the exchange → market invariant pattern), #144 (proposes splitting CSVs by exchange).

@JerBouma
Owner

Hmm, less enthusiastic about this one given that ETFs and Funds can come in all kinds of flavours. While listed on an American exchange, there are ETFs such as the MSCI World that have a global scope. The country column should therefore not be attached to the exchange but rather the ETF's purpose just like ASML can be on an American exchange but maintain the "Netherlands" string inside the country column.

@dokson
Contributor Author

dokson commented May 19, 2026

@JerBouma ok, I will keep automated readme stats only then

@dokson dokson force-pushed the feature/automate-readme-stats branch from 0ecc9d4 to 81f9661 on May 19, 2026 09:26
@dokson dokson changed the title from "Add country column to ETFs/Funds + cross-asset invariant tests + automated README stats" to "ETFs/Funds data quality + cross-asset invariant + automated README stats" on May 19, 2026
@dokson
Contributor Author

dokson commented May 19, 2026

@JerBouma I had to remove the country column from ETFs and Funds (in the README) because it was misleading. Do you agree?

@JerBouma
Owner

I'd expect just the README to change. Why all the test files and the GitHub Actions as well?

@dokson
Contributor Author

dokson commented May 19, 2026

> I'd expect just the README to change. Why all the test files and the GitHub Actions as well?

The README automation lives in the GitHub Action: whenever a new equity, ETF, etc. is added, the README is updated automatically as well.

I also fixed some data inconsistencies that surfaced while adding the tests: some records in etfs.csv were actually equities (and were already present in equities.csv).

Comment thread tests/json/test_etfs/test_show_options_5.json
@dokson dokson marked this pull request as draft May 19, 2026 10:04
@JerBouma
Owner

This looks good, happy with the changes!

@dokson
Contributor Author

dokson commented May 19, 2026

> This looks good, happy with the changes!

I am working to further improve data quality of FinanceDatabase...give me some time and I will add to this PR 😉

Thanks for the feedback @JerBouma

@JerBouma
Owner

> > This looks good, happy with the changes!
>
> I am working to further improve data quality of FinanceDatabase...give me some time and I will add to this PR 😉
>
> Thanks for the feedback @JerBouma

Wonderful, thank you for your work!

@dokson dokson force-pushed the feature/automate-readme-stats branch from 69146c5 to 0d1fc04 on May 19, 2026 12:54
@dokson dokson changed the title from "ETFs/Funds data quality + cross-asset invariant + automated README stats" to "ETFs/Funds data quality + cross-asset invariants + country backfill + README stats" on May 19, 2026
… README stats

Data quality on etfs.csv:
- 14 non-ETF rows removed (already correctly in equities.csv: BHF, DTE, RGA,
  HTGC, PSEC, TPVG, ...)
- 56 cross-asset symbol collisions with equities.csv removed (corporate
  bonds, senior notes, share-class variants misclassified as ETFs)
- 29 corrupted `exchange` values fixed (issuer name in exchange column:
  Xtrackers, Fundlogic, Purpose Investments, ...)
- FSST (Fidelity Sustainable U.S. Equity ETF) completed (was all-NaN)
- After cleanup: equities.csv and etfs.csv share zero symbols

Country backfill on equities.csv (50.3% -> 63.1%):
- 15,692 rows filled from primary listing in equities.csv (same base ticker,
  e.g. ASML.AS country propagated to ASML.DE)
- 5,777 rows filled via TradingView screener API (HQ country, not listing
  country -- ASML on Nasdaq stays "Netherlands", not "United States")
- 113 additional rows from yfinance lookup
- Skips bases that resolve ambiguously across markets (e.g. numeric
  bases shared between Chinese .SZ and Korean .KS exchanges)
- Russian Federation -> Russia normalization
- ALMER.PA: Reunion (French overseas dept) -> France

Cross-asset invariants:
- New tests/test_invariants.py with test_no_symbol_collisions_across_asset_classes
  covering all 7 asset class files (equities, etfs, funds, indices, currencies,
  cryptos, moneymarkets). Catches drift like the ^REIT-in-etfs case at PR time.

Automated README statistics:
- New Update-README-Statistics job in .github/workflows/database_update.yml
  regenerates stats tables from database/*.csv after every database update.
- README restructured into three tables (Equities w/ Countries; ETFs/Funds
  w/o Country; Currencies/Cryptos/Indices/Money Markets) to keep every cell
  honest -- ETFs/Funds country was a manual placeholder before.

Misc:
- financedatabase/helpers.py: widen base show_options() return type from
  pd.Series to pd.Index | dict | np.ndarray to match what subclasses actually
  return (LSP fix; runtime unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dokson dokson force-pushed the feature/automate-readme-stats branch from 0d1fc04 to f9782c3 on May 19, 2026 14:26
…onicalize names

Picks up where the prior commit left off after deeper auditing surfaced
two more contamination layers in equities.csv.

ISIN cleanup:
- Cross-asset duplicates (eq.csv ∩ etfs.csv): 209 ISIN cleared from equities
  rows where the ISIN rightfully belongs to the ETF (UBS UCITS, Lyxor, iShares,
  HSBC etc. previously stamped onto unrelated PNK micro-caps).
- Multi-name ISIN cleanup (different companies sharing one ISIN — impossible
  by ISO design, so one row's ISIN is wrong). Resolved canonical name per ISIN
  via GLEIF (legalName) + yfinance (longName) + TradingView (description).
  Total wrong ISIN values cleared: 9,634 across the three passes.
- Multi-name ISIN count: 4,772 baseline -> 1,467 (-69%).
- New invariant test `test_no_isin_collisions_across_asset_classes` enforces
  no future regression of the cross-asset case (catches the same drift that
  the symbol invariant catches).

Name canonicalization:
- 8,361 names rewritten to the canonical legalName/description form so that
  cross-listings of the same company collapse to one spelling instead of 4
  variants like "STRABAG SE", "STRABAG SE-BR", "Strabag SE Inhaber-Aktien o.N."
- Conservative threshold: only rewrite when token similarity with canonical
  >= 0.4 (preserves originals when the row likely refers to a different
  company than the canonical).

SPAC template removal (upstream data poisoning):
- 1,584 equities.csv rows had identical name="one" + summary about a SPAC
  that "does not have significant operations. It intends to effect a merger
  ...", plus identical state=CA / city=San Francisco / zipcode=94129
  / website=a-star.co / sector=Financials / industry=Diversified Financial
  Services. This is A-Star Financial Acquisition Corp's data copy-pasted
  onto 1,584 unrelated tickers somewhere upstream.
- 200 real names recovered via TradingView + stockevents.app + DuckDuckGo
  + Finnhub (the rest are micro-caps not in any free public dataset).
- All 1,608 rows carrying the SPAC fingerprint (website=a-star.co) had
  the contaminated sector/industry_group/industry/state/city/zipcode/
  website/market_cap fields cleared. The bogus uniform values were more
  misleading than missing data.

ALMER.PA: Reunion -> France (French overseas department; matches FD's
convention for metropolitan/overseas France).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dokson dokson changed the title from "ETFs/Funds data quality + cross-asset invariants + country backfill + README stats" to "ETFs/Funds data quality + cross-asset invariants + equities country/ISIN backfill + SPAC cleanup + README stats" on May 20, 2026
@dokson dokson marked this pull request as ready for review May 20, 2026 10:53
@JerBouma
Owner

Thanks again, looks good! Merging..

@JerBouma JerBouma merged commit ba9cb4e into JerBouma:main May 21, 2026
5 checks passed

2 participants