Commit ad58eb0

Backfill ISIN/CUSIP + fix placeholder names from yfinance (#139)
* Backfill 1,950 ISIN values in etfs.csv from yfinance + fix KWEB

  Populated the `isin` column for 1,950 US ETF rows that had a matching symbol but no ISIN. Also corrected one pre-existing incorrect ISIN.

  Data source: yfinance (`Ticker(symbol).isin`), which mirrors Yahoo Finance's security identifier for the exact symbol queried.

  Scope: US-listed ETFs only (no exchange suffix in symbol). 2,850 US ETF rows had an empty `isin`; yfinance returned a candidate ISIN for 1,955 of them.

  Gating applied:
  * ISIN regex `^[A-Z]{2}[A-Z0-9]{9}[0-9]$`
  * ISIN Mod-10 Double-Add-Double check digit
  * Skip rows where FD already has an `isin` value (zero overwrite)

  Stats:
  * 1,950 filled
  * 5 rejected by format/checksum gate (yfinance returned non-ISIN tags like "XIT", "EUVIX", "GVZ" for delisted/esoteric ETFs)

  KWEB correction:
  * Symbol KWEB (KraneShares CSI China Internet ETF, NYSE Arca) had ISIN `IE00BFXR7892` in FD, which is actually the ISIN of the Irish-domiciled UCITS share class (KWBE.L) and was likely pasted in by symbol-name confusion. yfinance returns `US5007673065` for KWEB, matching CUSIP 500767306 of the NYSE Arca ETF. Corrected.

  Note: 3 other rows (KWBP.L, KWEB.AS, KWEB.L) share the same `IE00BFXR7892` ISIN and may have similar propagation errors, but disambiguating them by exchange/currency is out of scope for this backfill and left for follow-up.

  Related: #78

* Backfill 9,326 ISIN values in equities.csv from yfinance + 185 name/summary fixes

  Populated the `isin` column for 9,326 US equity rows that had a matching symbol but no ISIN. Also corrected 185 rows whose `name` was a junk placeholder ("one", "two", ...) and replaced the boilerplate summary on 181 of those.

  Data source: yfinance `Ticker(symbol).isin` and `Ticker(symbol).info` (longName + longBusinessSummary).

  Scope: US-listed equities only (no exchange suffix in symbol). 20,101 candidate rows; yfinance returned a candidate ISIN for 11,237.

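
The format and check-digit gate listed under "Gating applied" above can be sketched as follows. This is a minimal illustration, not the script used for the backfill: the function names `isin_check_digit` and `is_valid_isin` are hypothetical, and the check digit follows the usual ISIN convention (letters expanded to A=10 ... Z=35, then Mod-10 Double-Add-Double over the resulting digit string).

```python
import re

# Same pattern as quoted in the commit message.
ISIN_RE = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")


def isin_check_digit(payload: str) -> int:
    """Mod-10 Double-Add-Double check digit for an 11-character ISIN prefix."""
    # Expand every character to its base-36 value: "0"-"9" stay, "A"=10 ... "Z"=35.
    digits = "".join(str(int(ch, 36)) for ch in payload)
    total = 0
    # Walk from the right; double every other digit, starting with the rightmost.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the doubled value
        total += d
    return (10 - total % 10) % 10


def is_valid_isin(value) -> bool:
    """Return True only if value matches the ISIN regex and its check digit."""
    if not isinstance(value, str) or not ISIN_RE.fullmatch(value):
        return False
    return isin_check_digit(value[:11]) == int(value[11])
```

Under this sketch, `is_valid_isin("US5007673065")` and `is_valid_isin("IE00BFXR7892")` both pass, while the non-ISIN tags quoted above (`"XIT"`, `"EUVIX"`, `"GVZ"`) fail at the regex step.
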
  Gating (to avoid misattribution):
  * ISIN regex `^[A-Z]{2}[A-Z0-9]{9}[0-9]$` + Mod-10 check digit
  * Skip rows where FD already has an `isin` value (zero overwrite)
  * US/CA ISIN prefix -> accept unconditionally
  * Foreign ISIN prefix -> accept only if normalized name similarity between FD `name` and yfinance `longName` is >= 0.5; otherwise reject as likely ticker collision (e.g. ticker `RELT` on Yahoo returned an Italian ISIN for Relatech S.p.A., but FD's `RELT` row is Reliant Holdings, Inc.; correctly rejected)
  * Garbage prefixes (NE/XD/FN/NN: non-ISO codes returned by yfinance for some delisted/exotic tickers) -> reject
  * Junk-name rows (FD `name` is lowercase placeholder like "one", "two") where yfinance provided a real name -> accept ISIN AND replace the junk name with yfinance's `longName`, and replace the boilerplate summary with yfinance's `longBusinessSummary`

  Stats from 11,237 candidates:
  * 9,326 filled
  * 185 of which also got `name` replaced (junk placeholder)
  * 181 of those also got `summary` replaced (junk boilerplate)
  * 1,911 rejected (name-similarity gate, garbage prefix, or no yf_name on a foreign-prefix ISIN)

  Notes:
  * The 185 placeholder-name rows have other contaminated fields (sector, website, country) that are pre-existing FD data issues beyond scope of this PR; only `name` + `summary` are touched.
  * Validation pass against the 4,957 rows that already had a populated ISIN found ~300 same-country-prefix mismatches that may indicate pre-existing FD errors; left for follow-up to keep this PR scoped to empty-cell backfill (with the one KWEB exception already noted in the etfs.csv commit).

  Related: #78

* Derive 9,577 CUSIPs from US/CA ISINs + repair 1,556 Excel-corrupted CUSIPs

  The middle 9 characters of a US- or CA-prefixed ISIN ARE the CUSIP (ISIN format: `<country><CUSIP><checksum>`). For any row whose ISIN matches `^(US|CA)[A-Z0-9]{9}[0-9]$`, the CUSIP can be derived deterministically. No external API calls; no name-matching needed.

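
The derivation described above amounts to a string slice guarded by the quoted regex. A minimal sketch, assuming the ISIN has already passed check-digit validation upstream; the helper name `cusip_from_isin` is hypothetical and not taken from the PR:

```python
import re
from typing import Optional

# Pattern quoted in the commit message: only US/CA ISINs embed a CUSIP.
US_CA_ISIN_RE = re.compile(r"^(US|CA)[A-Z0-9]{9}[0-9]$")


def cusip_from_isin(isin: Optional[str]) -> Optional[str]:
    """Return the 9-character CUSIP embedded in a US/CA ISIN, else None."""
    if not isin or not US_CA_ISIN_RE.fullmatch(isin):
        return None
    # Layout: 2-char country prefix, 9-char CUSIP, 1 check digit.
    return isin[2:11]
```

For instance, `cusip_from_isin("US5007673065")` yields `"500767306"`, the KWEB CUSIP quoted in the etfs.csv commit, while a non-US/CA ISIN such as `"IE00BFXR7892"` yields `None`.
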
  Two kinds of fix:
  1. **Filled (9,577 rows)**: `cusip` was empty, `isin` is US/CA. Derived CUSIP written.
  2. **Repaired (1,556 rows)**: `cusip` was populated but did not match the CUSIP embedded in the row's ISIN. Inspection showed these are Excel-corrupted strings: leading zeros stripped (`1055102` instead of `001055102` for AFL) or, when the CUSIP contains the letter `E`, the cell was interpreted as scientific notation and saved as a literal string (`8.92E+113` instead of `89151E109` for TotalEnergies ADR). Replaced with the canonical 9-character CUSIP from the ISIN.

  Since ISIN check digits were validated upstream (in the ISIN backfill commit and in pre-existing FD data via SEC sources), the derived CUSIPs inherit that validation; no separate CUSIP Mod-10 check needed.

  Related: #78

* Overwrite 1,297 mismatched ISINs in equities.csv from yfinance + propagate CUSIPs

  Following maintainer feedback on #139: where FD's existing populated ISIN disagreed with what yfinance returned for the same ticker, overwrite with yfinance's value. The bare US-listing symbol should carry the US-listing ISIN (e.g. plain `ASML` on Nasdaq -> `US...`, not the underlying Dutch `NL...` which belongs to `ASML.AS`).

  Same gating as the backfill:
  * yfinance ISIN must pass regex + Mod-10 check digit
  * Skip yfinance "-" / null (FD value kept as-is)
  * Skip garbage prefixes (NE/XD/FN/NN)

  Stats on the 4,957 pre-populated US equity rows:
  * 2,098 already matched yfinance: left alone
  * 1,525 yfinance returned "-": FD value kept (unverifiable)
  * 36 yfinance returned malformed ISIN: FD value kept
  * 1 garbage prefix: FD value kept
  * 1,297 ISIN overwritten with yfinance value

  CUSIP propagation: 582 CUSIP cells updated as a consequence of the ISIN overwrite (194 filled where the row now has a US/CA ISIN whose middle 9 chars are a derivable CUSIP, 388 repaired where the ISIN itself changed country prefix and a different CUSIP is now embedded).

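
The overwrite gating and the five stat buckets above can be read as one per-row decision. The sketch below is an illustration under assumptions, not the PR's code: the function name, the returned labels, and the order in which the checks run are all hypothetical (the commit message lists the gates but not their ordering).

```python
import re
from typing import Optional

ISIN_RE = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")
GARBAGE_PREFIXES = {"NE", "XD", "FN", "NN"}  # non-ISO prefixes named in the commit


def _check_digit_ok(isin: str) -> bool:
    """Mod-10 Double-Add-Double over the 11-char prefix, letters as A=10..Z=35."""
    digits = "".join(str(int(ch, 36)) for ch in isin[:11])
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch) * 2 if i % 2 == 0 else int(ch)
        total += d - 9 if d > 9 else d
    return (10 - total % 10) % 10 == int(isin[11])


def overwrite_decision(fd_isin: str, yf_isin: Optional[str]) -> str:
    """Classify one pre-populated row; only "overwrite" changes the FD value."""
    if yf_isin is None or yf_isin.strip() in ("", "-"):
        return "keep_unverifiable"      # yfinance gave nothing to compare against
    yf_isin = yf_isin.strip()
    if yf_isin[:2] in GARBAGE_PREFIXES:
        return "keep_garbage_prefix"
    if not ISIN_RE.fullmatch(yf_isin) or not _check_digit_ok(yf_isin):
        return "keep_malformed"
    if yf_isin == fd_isin.strip():
        return "already_matched"
    return "overwrite"
```

Every branch except the last keeps the existing FD value, which mirrors the "FD value kept" wording in the stats.
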
  Related: #78

* Regenerate test snapshots for equities + etfs after data changes

  The snapshot tests in tests/test_equities.py and tests/test_etfs.py compare against recorded CSV/JSON fixtures. Those fixtures were captured against the prior `equities.csv` / `etfs.csv` and no longer match after the ISIN/CUSIP/name/summary edits in this branch.

  Regenerated locally with:

      pytest tests/test_equities.py tests/test_etfs.py --rewrite-expected

  Only fixture files for these two suites are touched. The pre-existing failures in tests/test_indices.py and tests/test_moneymarkets.py (present on `main` before this PR) are unrelated and intentionally left for separate cleanup.

* Derive 680 ISINs from existing CUSIPs for US rows

  For US-domiciled rows where `cusip` was populated (mostly via PR #138's SEC 13F backfill) but `isin` was empty (including blue chips like AEP, CEG, ACHR), the ISIN can be derived deterministically:

      ISIN = "US" + cusip + check_digit

  where the check digit is computed via the standard ISIN Mod-10 Double-Add-Double algorithm on the 11-char prefix. No external API calls.

  Verified the algorithm matches authoritative public sources for a sample including AEP (US0255371017), CEG (US21037T1097), AAPL (US0378331005), MSFT (US5949181045).

  Gate:
  * country = "United States"
  * `cusip` populated and matches `^[0-9A-Z]{9}$`
  * `isin` empty (zero overwrite of pre-existing data)

  Total: 680 rows.

  Related: #78
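
The `ISIN = "US" + cusip + check_digit` rule above can be sketched as follows. The function names are hypothetical and this is not the PR's script; it assumes the usual letter expansion (A=10 ... Z=35) before the Mod-10 Double-Add-Double step, and it checks only the CUSIP format gate, not the country or empty-`isin` gates.

```python
import re

# CUSIP format gate quoted in the commit message.
CUSIP_RE = re.compile(r"^[0-9A-Z]{9}$")


def isin_check_digit(payload: str) -> int:
    """Mod-10 Double-Add-Double check digit for an 11-character ISIN prefix."""
    digits = "".join(str(int(ch, 36)) for ch in payload)  # "A"=10 ... "Z"=35
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:  # double every other digit, starting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def isin_from_us_cusip(cusip: str) -> str:
    """Build a US ISIN from a 9-character CUSIP."""
    if not CUSIP_RE.fullmatch(cusip):
        raise ValueError(f"not a 9-character CUSIP: {cusip!r}")
    payload = "US" + cusip
    return payload + str(isin_check_digit(payload))
```

Applied to the middle nine characters of the sample ISINs above, the sketch reproduces them: `isin_from_us_cusip("037833100")` returns `"US0378331005"` and `isin_from_us_cusip("21037T109")` returns `"US21037T1097"`.
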
1 parent fcd1e0d commit ad58eb0

4 files changed

Lines changed: 18698 additions & 18698 deletions

File tree

database/equities.csv

Lines changed: 16744 additions & 16744 deletions

database/etfs.csv

Lines changed: 1951 additions & 1951 deletions

tests/csv/test_equities/test_select_10.csv

Lines changed: 2 additions & 2 deletions

tests/json/test_equities/test_show_options_1.json

Lines changed: 1 addition & 1 deletion