Commit ad58eb0
Backfill ISIN/CUSIP + fix placeholder names from yfinance (#139)
* Backfill 1,950 ISIN values in etfs.csv from yfinance + fix KWEB
Populated the `isin` column for 1,950 US ETF rows that had a matching
symbol but no ISIN. Also corrected one pre-existing incorrect ISIN.
Data source: yfinance (`Ticker(symbol).isin`), which mirrors Yahoo
Finance's security identifier for the exact symbol queried.
Scope: US-listed ETFs only (no exchange suffix in symbol). 2,850 US
ETF rows had an empty `isin`; yfinance returned a candidate ISIN for
1,955 of them.
Gating applied:
* ISIN regex `^[A-Z]{2}[A-Z0-9]{9}[0-9]$`
* ISIN Mod-10 Double-Add-Double check digit
* Skip rows where FD already has an `isin` value (zero overwrite)
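The format and check-digit gates above can be sketched roughly as follows. This is a minimal illustration, not the script used for the backfill; the function names `isin_checksum_ok` and `is_valid_isin` are hypothetical.

```python
import re

# Same pattern as the gate above: 2-letter prefix, 9 alphanumerics, 1 check digit.
ISIN_RE = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")


def isin_checksum_ok(isin: str) -> bool:
    """Mod-10 Double-Add-Double check over all 12 characters."""
    # Letters expand to two digits (A=10 ... Z=35); digits stay as they are.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    # Walk from the right: the check digit itself is not doubled,
    # every second digit after it is.
    for i, d in enumerate(reversed(digits)):
        n = int(d)
        if i % 2 == 1:
            n *= 2
            if n > 9:
                n -= 9  # same as adding the two digits of the product
        total += n
    return total % 10 == 0


def is_valid_isin(value) -> bool:
    """Hypothetical gate: non-empty, matches the regex, passes the checksum."""
    if not value or ISIN_RE.fullmatch(value) is None:
        return False
    return isin_checksum_ok(value)
```

With this sketch, `is_valid_isin("US5007673065")` passes, while a tag such as `"XIT"` is rejected at the regex step.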
Stats:
* 1,950 filled
* 5 rejected by format/checksum gate (yfinance returned non-ISIN
tags like "XIT", "EUVIX", "GVZ" for delisted/esoteric ETFs)
KWEB correction:
* Symbol KWEB (KraneShares CSI China Internet ETF, NYSE Arca)
had ISIN `IE00BFXR7892` in FD, which is actually the ISIN of the
Irish-domiciled UCITS share class (KWBE.L) and was likely pasted
in by symbol-name confusion. yfinance returns `US5007673065` for
KWEB, matching CUSIP 500767306 of the NYSE Arca ETF. Corrected.
Note: 3 other rows (KWBP.L, KWEB.AS, KWEB.L) share the same
`IE00BFXR7892` ISIN and may have similar propagation errors, but
disambiguating them by exchange/currency is out of scope for this
backfill — left for follow-up.
Related: #78
* Backfill 9,326 ISIN values in equities.csv from yfinance + 185 name/summary fixes
Populated the `isin` column for 9,326 US equity rows that had a matching
symbol but no ISIN. Also corrected 185 rows whose `name` was a junk
placeholder ("one", "two", ...) and replaced the boilerplate summary on
181 of those.
Data source: yfinance `Ticker(symbol).isin` and `Ticker(symbol).info`
(longName + longBusinessSummary). Scope: US-listed equities only (no
exchange suffix in symbol). 20,101 candidate rows; yfinance returned a
candidate ISIN for 11,237.
Gating (to avoid misattribution):
* ISIN regex `^[A-Z]{2}[A-Z0-9]{9}[0-9]$` + Mod-10 check digit
* Skip rows where FD already has an `isin` value (zero overwrite)
* US/CA ISIN prefix -> accept unconditionally
* Foreign ISIN prefix -> accept only if normalized name similarity
between FD `name` and yfinance `longName` is >= 0.5; otherwise reject
as likely ticker collision (e.g. ticker `RELT` on Yahoo returned an
Italian ISIN for Relatech S.p.A., but FD's `RELT` row is Reliant
Holdings, Inc. — correctly rejected)
* Garbage prefixes (NE/XD/FN/NN — non-ISO codes returned by yfinance
for some delisted/exotic tickers) -> reject
* Junk-name rows (FD `name` is lowercase placeholder like "one", "two")
where yfinance provided a real name -> accept ISIN AND replace the
junk name with yfinance's `longName`, and replace the boilerplate
summary with yfinance's `longBusinessSummary`
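A rough sketch of the prefix and name-similarity gates listed above. The commit does not say how similarity was computed, so `difflib.SequenceMatcher` and the normalization step are assumptions, and `normalize`, `similarity`, and `gate` are hypothetical names. The junk-name replacement path is left out.

```python
import re
from difflib import SequenceMatcher

GARBAGE_PREFIXES = {"NE", "XD", "FN", "NN"}  # non-ISO codes listed above
AUTO_ACCEPT_PREFIXES = {"US", "CA"}


def normalize(name: str) -> str:
    """Assumed normalization: lowercase, drop punctuation, collapse spaces."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split())


def similarity(a: str, b: str) -> float:
    """Assumed metric: SequenceMatcher ratio on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def gate(isin: str, fd_name: str, yf_name, threshold: float = 0.5) -> str:
    """Return "accept" or "reject" for an ISIN that already passed
    the regex and check-digit validation."""
    prefix = isin[:2]
    if prefix in GARBAGE_PREFIXES:
        return "reject"
    if prefix in AUTO_ACCEPT_PREFIXES:
        return "accept"
    # Foreign prefix: a yfinance name is required for the comparison.
    if not yf_name:
        return "reject"
    return "accept" if similarity(fd_name, yf_name) >= threshold else "reject"
```

A different similarity metric would shift which borderline rows clear the 0.5 threshold, so the reject count reported below should not be expected to reproduce exactly with this sketch.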
Stats from 11,237 candidates:
* 9,326 filled
* 185 of which also got `name` replaced (junk placeholder)
* 181 of those also got `summary` replaced (junk boilerplate)
* 1,911 rejected (name-similarity gate, garbage prefix, or no yf_name
on a foreign-prefix ISIN)
Notes:
* The 185 placeholder-name rows have other contaminated fields
(sector, website, country) that are pre-existing FD data issues
beyond scope of this PR; only `name` + `summary` are touched.
* Validation pass against the 4,957 rows that already had a populated
ISIN found ~300 same-country-prefix mismatches that may indicate
pre-existing FD errors — left for follow-up to keep this PR scoped
to empty-cell backfill (with the one KWEB exception already noted
in the etfs.csv commit).
Related: #78
* Derive 9,577 CUSIPs from US/CA ISINs + repair 1,556 Excel-corrupted CUSIPs
The middle 9 characters of a US- or CA-prefixed ISIN ARE the CUSIP
(ISIN format: `<country><CUSIP><checksum>`). For any row whose ISIN
matches `^(US|CA)[A-Z0-9]{9}[0-9]$`, the CUSIP can be derived
deterministically. No external API calls; no name-matching needed.
Two kinds of fix:
1. **Filled (9,577 rows)** — `cusip` was empty, `isin` is US/CA. Derived
CUSIP written.
2. **Repaired (1,556 rows)** — `cusip` was populated but did not match
the CUSIP embedded in the row's ISIN. Inspection showed these are
Excel-corrupted strings: leading zeros stripped (`1055102` instead
of `001055102` for AFL) or, when the CUSIP contains the letter `E`,
the cell was interpreted as scientific notation and saved as a
literal string (`8.92E+113` instead of `89151E109` for TotalEnergies
ADR). Replaced with the canonical 9-character CUSIP from the ISIN.
Since ISIN check digits were validated upstream (in the ISIN backfill
commit and in pre-existing FD data via SEC sources), the derived CUSIPs
inherit that validation — no separate CUSIP Mod-10 check needed.
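The derivation and the fill/repair split can be sketched as below. The function names and the returned action labels are hypothetical; only the regex and the slicing rule come from the description above.

```python
import re

# Only US- and CA-prefixed ISINs embed a CUSIP in characters 3-11.
US_CA_ISIN_RE = re.compile(r"^(US|CA)[A-Z0-9]{9}[0-9]$")


def cusip_from_isin(isin):
    """Return the embedded 9-character CUSIP, or None if not derivable."""
    if not isin or US_CA_ISIN_RE.fullmatch(isin) is None:
        return None
    return isin[2:11]


def classify_cusip(cusip, isin):
    """Return (action, value) for one row.

    action is one of "skip", "fill", "repair", "keep" (labels are illustrative).
    """
    derived = cusip_from_isin(isin)
    if derived is None:
        return ("skip", cusip)      # no US/CA ISIN to derive from
    if not cusip:
        return ("fill", derived)    # empty cell gets the derived CUSIP
    if cusip != derived:
        return ("repair", derived)  # e.g. leading zeros stripped by Excel
    return ("keep", cusip)
```

For the AFL example, `classify_cusip("1055102", "US0010551028")` yields `("repair", "001055102")`.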
Related: #78
* Overwrite 1,297 mismatched ISINs in equities.csv from yfinance + propagate CUSIPs
Following maintainer feedback on #139: where FD's existing populated ISIN
disagreed with what yfinance returned for the same ticker, overwrite with
yfinance's value. The bare US-listing symbol should carry the US-listing
ISIN (e.g. plain `ASML` on Nasdaq -> `US...`, not the underlying Dutch
`NL...` which belongs to `ASML.AS`).
Same gating as the backfill:
* yfinance ISIN must pass regex + Mod-10 check digit
* Skip yfinance "-" / null (FD value kept as-is)
* Skip garbage prefixes (NE/XD/FN/NN)
Stats on the 4,957 pre-populated US equity rows:
* 2,098 already matched yfinance — left alone
* 1,525 yfinance returned "-" — FD value kept (unverifiable)
* 36 yfinance returned malformed ISIN — FD value kept
* 1 garbage prefix — FD value kept
* 1,297 ISIN overwritten with yfinance value
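The five outcomes above correspond to a per-row decision along these lines. This is a sketch only: the order in which the gates are checked is an assumption, the outcome labels are made up for illustration, and `is_valid_isin` stands in for whatever regex + Mod-10 validator is used.

```python
from typing import Callable, Optional

GARBAGE_PREFIXES = frozenset({"NE", "XD", "FN", "NN"})


def overwrite_decision(
    fd_isin: str,
    yf_isin: Optional[str],
    is_valid_isin: Callable[[str], bool],
) -> str:
    """Decide what to do with a row whose ISIN is already populated."""
    # yfinance returned nothing usable: FD value kept (unverifiable).
    if yf_isin is None or yf_isin in ("", "-"):
        return "keep_unverifiable"
    # Non-ISO prefix: FD value kept.
    if yf_isin[:2] in GARBAGE_PREFIXES:
        return "keep_garbage_prefix"
    # Fails regex or check digit: FD value kept.
    if not is_valid_isin(yf_isin):
        return "keep_malformed"
    # Both sources agree: nothing to do.
    if yf_isin == fd_isin:
        return "match"
    return "overwrite"
```

The five counts listed above sum to 4,957, so every pre-populated row falls into exactly one of these outcomes.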
CUSIP propagation: 582 CUSIP cells updated as a consequence of the ISIN
overwrite (194 filled where the row now has a US/CA ISIN whose middle 9
chars are a derivable CUSIP, 388 repaired where the ISIN itself changed
country prefix and a different CUSIP is now embedded).
Related: #78
* Regenerate test snapshots for equities + etfs after data changes
The snapshot tests in tests/test_equities.py and tests/test_etfs.py
compare against recorded CSV/JSON fixtures. Those fixtures were captured
against the prior `equities.csv` / `etfs.csv` and no longer match after
the ISIN/CUSIP/name/summary edits in this branch.
Regenerated locally with:
pytest tests/test_equities.py tests/test_etfs.py --rewrite-expected
Only fixture files for these two suites are touched. The pre-existing
failures in tests/test_indices.py and tests/test_moneymarkets.py
(present on `main` before this PR) are unrelated and intentionally left
for separate cleanup.
* Derive 680 ISINs from existing CUSIPs for US rows
For US-domiciled rows where `cusip` was populated (mostly via PR #138's
SEC 13F backfill) but `isin` was empty — including blue chips like AEP,
CEG, ACHR — the ISIN can be derived deterministically:
ISIN = "US" + cusip + check_digit
where the check digit is computed via the standard ISIN Mod-10
Double-Add-Double algorithm on the 11-char prefix. No external API
calls.
Verified the algorithm matches authoritative public sources for a
sample including AEP (US0255371017), CEG (US21037T1097), AAPL
(US0378331005), MSFT (US5949181045).
Gate:
* country = "United States"
* `cusip` populated and matches `^[0-9A-Z]{9}$`
* `isin` empty (zero overwrite of pre-existing data)
Total: 680 rows.
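A minimal sketch of the derivation described above, using the standard Mod-10 Double-Add-Double computation. The function names are hypothetical; the country gate and the empty-`isin` gate are assumed to be applied by the caller.

```python
import re

CUSIP_RE = re.compile(r"^[0-9A-Z]{9}$")


def isin_check_digit(prefix11: str) -> str:
    """Compute the ISIN check digit for an 11-character prefix."""
    # Letters expand to two digits (A=10 ... Z=35).
    digits = "".join(str(int(ch, 36)) for ch in prefix11)
    total = 0
    # The check digit will sit to the right of these digits, so the
    # rightmost digit here is doubled, then every second one after it.
    for i, d in enumerate(reversed(digits)):
        n = int(d)
        if i % 2 == 0:
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return str((10 - total % 10) % 10)


def isin_from_cusip(cusip: str, country: str = "US") -> str:
    """Build an ISIN as country + cusip + check_digit."""
    if CUSIP_RE.fullmatch(cusip) is None:
        raise ValueError(f"not a 9-character CUSIP: {cusip!r}")
    prefix = country + cusip
    return prefix + isin_check_digit(prefix)


if __name__ == "__main__":
    # Sample values quoted in the commit message above.
    for cusip in ("025537101", "21037T109", "037833100", "594918104"):
        print(cusip, "->", isin_from_cusip(cusip))
```

Run against the four sample CUSIPs, this reproduces the ISINs quoted above for AEP, CEG, AAPL, and MSFT.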
Related: #78
1 parent fcd1e0d commit ad58eb0
4 files changed
Lines changed: 18698 additions & 18698 deletions
File tree
- database
- tests
- csv/test_equities
- json/test_equities