Commit 16717b8
ASE exchange fix + ISIN/FIGI backfill from public data (JerBouma#143)
* Fix 580 misclassified exchanges from "ASE" to correct yfinance value
Addresses JerBouma#133. The `exchange` column in `equities.csv` reported "ASE"
for 1,632 rows, but cross-checking each against yfinance confirmed only
257 as NYSE American. Another 580 were misclassified and now report the
exchange code yfinance returns; the rest could not be validated and
stay "ASE".
Gating:
* Row must currently have exchange = "ASE"
* yfinance `Ticker(symbol).info["exchange"]` must be non-null
* yfinance's value must match `^[A-Z]{2,5}$` (plausible exchange code)
* yfinance's value must differ from "ASE"
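The four gates above can be sketched as one pure function. This is a minimal illustration, not the script used for this commit: the function name `corrected_exchange` is hypothetical, and the yfinance value is passed in as an argument so the logic can be read and run without network access.

```python
import re

# Gate 3: a plausible exchange code is 2-5 uppercase letters.
# fullmatch() is used so a trailing newline cannot slip past the pattern.
EXCHANGE_CODE = re.compile(r"[A-Z]{2,5}")


def corrected_exchange(current, yf_exchange):
    """Return the replacement exchange code, or None to leave the row as-is.

    `current` is the value in the CSV row; `yf_exchange` is whatever
    yfinance reported for the symbol (may be None or empty).
    """
    if current != "ASE":                           # gate 1: only "ASE" rows
        return None
    if not yf_exchange:                            # gate 2: value must be non-null
        return None
    if not EXCHANGE_CODE.fullmatch(yf_exchange):   # gate 3: plausible code
        return None
    if yf_exchange == "ASE":                       # gate 4: must differ from "ASE"
        return None
    return yf_exchange
```

In practice the second argument would come from `Ticker(symbol).info["exchange"]`, with lookup errors treated the same as a null value so the row stays "ASE".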
Stats from the 1,632 "ASE" rows:
| yfinance returns | Count | Action |
|---|---:|---|
| NONE / unknown | 795 | kept as "ASE" (cannot validate) |
| "ASE" (legit NYSE American) | 257 | kept as "ASE" |
| "NYQ" (NYSE main) | 546 | **fixed** -> "NYQ" |
| "PNK" (OTC Pink) | 12 | **fixed** -> "PNK" |
| "NCM"/"NMS"/"NGM" (Nasdaq tiers) | 15 | **fixed** |
| "OQB"/"OID" (OTC Markets) | 6 | **fixed** |
| "PCX" (NYSE Arca) | 1 | **fixed** |
| errors | 18 | left as "ASE" |
Total fixed: 580 rows.
Notable corrections include common NYSE main-board listings that JerBouma#133
explicitly flagged as wrong (ARX, ALH, etc., all now "NYQ"). No row
was overwritten where yfinance returned "ASE" — the 257 genuine NYSE
American listings remain untouched.
Source: yfinance `Ticker(symbol).info["exchange"]`. The 795 rows yfinance
couldn't resolve are mostly delisted preferred stocks or thinly-covered
exotic tickers — best left as-is until a different data source is added.
Related: JerBouma#133
* Regenerate snapshot fixtures after ASE exchange fix
* Backfill 16,100 ISIN values across 30 international markets
Where yfinance returned no ISIN (yfinance has thin coverage of non-US
exchanges), publicly accessible exchange listing data provides the
canonical ISIN. This commit fills 16,100 empty `isin` cells in
equities.csv across 30 international markets.
Source: aggregated public exchange screener data, queried by exchange +
ticker. The ISINs returned are matched 1:1 with the ticker symbol on
each market.
Markets covered (FinanceDatabase (FD) ticker suffix in parentheses):
| Market | ISINs filled |
|---|---:|
| Japan (.T) | 2,460 |
| India (.BO) | 2,862 |
| India (.NS) | 1,390 |
| China (.SZ) | 1,922 |
| China (.SS) | 1,396 |
| Korea (.KQ) | 931 |
| Korea (.KS) | 515 |
| Canada (.V) | 616 |
| Australia (.AX) | 605 |
| Indonesia (.JK) | 516 |
| UK (.L) | 484 |
| Thailand (.BK) | 463 |
| Hong Kong (.HK) | 319 |
| Canada (.TO) | 258 |
| France (.PA) | 243 |
| Brazil (.SA) | 212 |
| Sweden (.ST) | 208 |
| Switzerland (.SW) | 101 |
| Germany (.F/.MU/.DU/.BE) | 322 |
| Other (Norway, Netherlands, Spain, Italy, Mexico, Vietnam, Austria, Singapore) | ~870 |
Gating (same as the yfinance backfill in JerBouma#139):
* ISIN regex `^[A-Z]{2}[A-Z0-9]{9}[0-9]$`
* ISIN Mod-10 Double-Add-Double check digit
* Skip rows where FD already has an `isin` value (zero overwrite)
* Match on the market mapped from the exchange suffix plus the bare ticker (suffix stripped)
Stats: 16,100 filled, 0 rejected by checksum.
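The regex and check-digit gates above can be sketched as follows. This is a generic implementation of the ISIN check (letters expanded to two-digit numbers, then the Luhn mod-10 sum), not code taken from this commit; the name `isin_is_valid` is hypothetical.

```python
import re

# Same shape as the gate above: 2-letter country prefix, 9 alphanumerics,
# 1 numeric check digit. fullmatch() anchors both ends.
ISIN_RE = re.compile(r"[A-Z]{2}[A-Z0-9]{9}[0-9]")


def isin_is_valid(isin: str) -> bool:
    """Return True if `isin` has the right shape and a correct check digit."""
    if not ISIN_RE.fullmatch(isin):
        return False
    # Expand letters to numbers (A=10 ... Z=35); digits stay unchanged.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    # Luhn: walking from the right, double every second digit and
    # add the digit sum of each result.
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0
```

For example, `isin_is_valid("US0378331005")` returns `True`, while changing the last digit to `6` makes it return `False`.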
Markets explicitly NOT covered:
* `.NX`, `.SG`, `.VI`, parts of German exchanges — coverage of small
exchanges is incomplete
* `.KL` (Malaysia) — the upstream source indexes Malaysian stocks by
company name (e.g. "MAYBANK") whereas FD uses numeric codes (e.g.
"0007.KL"); cannot be matched without a separate name->code map
* `.US` (US tickers) — already covered by JerBouma#139 (yfinance)
CUSIP derivation: not applicable for these rows since they are
predominantly non-US/CA ISINs whose middle 9 characters come from local
national numbering systems (Japanese Securities Code, SEDOL, WKN, etc.)
rather than from CUSIPs.
Related: JerBouma#78
* Backfill 13,770 FIGI identifiers for US equities
Populates `figi`, `composite_figi`, and `shareclass_figi` for US equity
rows that had those columns empty. FIGI (Financial Instrument Global
Identifier) is Bloomberg's openly-licensed identifier standard for
financial instruments — the only freely-licensed global identifier
(ISIN/CUSIP/SEDOL are paywalled standards).
Source: public FIGI data from Bloomberg's OpenFIGI initiative, mirrored
through publicly accessible exchange listing data.
Stats:
* 13,769 figi cells filled (was: 18,114 empty in US rows)
* 13,469 composite_figi cells filled
* 13,043 shareclass_figi cells filled
* 13,770 unique rows touched
* Diff: 13,770 insertions / 13,770 deletions
Gating:
* FIGI regex `^BBG[0-9A-Z]{9}$`
* Skip rows where the column is already populated (zero overwrite)
* Match by exact symbol
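A minimal sketch of these three gates, assuming each row and each upstream record is a plain dict keyed by column name. The function name `fill_figi` and the dict layout are illustrative assumptions, not the script used for this commit.

```python
import re

# FIGI shape used as the gate above: "BBG" followed by 9 alphanumerics.
FIGI_RE = re.compile(r"BBG[0-9A-Z]{9}")
FIGI_COLUMNS = ("figi", "composite_figi", "shareclass_figi")


def fill_figi(row: dict, candidate: dict) -> int:
    """Fill empty FIGI columns of `row` from `candidate`.

    Returns the number of cells filled. Nothing is written unless the
    symbols match exactly, the target cell is empty, and the candidate
    value passes the regex.
    """
    if row.get("symbol") != candidate.get("symbol"):   # exact-symbol match
        return 0
    filled = 0
    for column in FIGI_COLUMNS:
        if row.get(column):                             # zero overwrite
            continue
        value = candidate.get(column) or ""
        if FIGI_RE.fullmatch(value):                    # shape check
            row[column] = value
            filled += 1
    return filled
```

Because each column is gated separately, a row can receive a `figi` without a `shareclass_figi`, which is consistent with the per-column counts in the stats differing from one another.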
Why FIGI matters: filling FD's FIGI columns enables downstream cross-
reference (FIGI -> ISIN, FIGI -> CUSIP, FIGI -> SEDOL) via OpenFIGI's
free public API.
Not covered:
* Non-US exchanges -- left for follow-up
* Tickers using "." instead of "-" (e.g. BRK.A vs BRK-A) -- a handful
of share-class tickers whose format diverges between sources
* Align market column + revert 4 NGM collisions
Follow-up to maintainer review on JerBouma#143: the `market` column was left
unchanged when the `exchange` codes were corrected. Aligned 576 rows
so `market` matches the canonical FD value for each new `exchange`:
* NYQ -> "New York Stock Exchange"
* NMS -> "NASDAQ Global Select"
* NCM -> "NASDAQ Capital Market"
* PNK / OQB / OID -> "OTC Bulletin Board"
* PCX -> "NYSE Arca"
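The alignment above amounts to a lookup from exchange code to market label. A minimal sketch, assuming rows are dicts with `exchange` and `market` keys; the helper name `align_market` is hypothetical and the table holds only the codes listed in this commit.

```python
# Exchange code -> canonical FD market label, as listed above.
EXCHANGE_TO_MARKET = {
    "NYQ": "New York Stock Exchange",
    "NMS": "NASDAQ Global Select",
    "NCM": "NASDAQ Capital Market",
    "PNK": "OTC Bulletin Board",
    "OQB": "OTC Bulletin Board",
    "OID": "OTC Bulletin Board",
    "PCX": "NYSE Arca",
}


def align_market(row: dict) -> bool:
    """Set row["market"] from row["exchange"]; return True if it changed.

    Rows whose exchange code is not in the table are left untouched.
    """
    market = EXCHANGE_TO_MARKET.get(row["exchange"])
    if market is None or row["market"] == market:
        return False
    row["market"] = market
    return True
```

Counting the `True` results over the corrected rows gives the number of `market` cells that needed aligning.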
Reverted 4 rows whose exchange had been set to "NGM":
* yfinance returned "NGM" meaning NASDAQ Global Market
* but FD's "NGM" code is already used for "Nordic Growth Market"
(Sweden) -- 270 existing rows
* to avoid creating an ambiguous code, those 4 rows go back to "ASE"
until FD adds a distinct code for NASDAQ Global Market
Symbols reverted: BEEP, BHIL, CHACU, VOLT.
Total fixed by this PR is now 576 (down from 580).
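The NGM revert comes down to checking whether a code about to be written already carries a different market label in the existing data. A minimal sketch of such a check, assuming the existing data is available as (exchange, market) pairs; the name `colliding_codes` is hypothetical.

```python
from collections import defaultdict


def colliding_codes(existing_pairs, proposed):
    """Return the sorted exchange codes in `proposed` that would collide.

    `existing_pairs` is an iterable of (exchange, market) tuples already
    in the dataset; `proposed` maps each new exchange code to the market
    label it is meant to carry. A code collides when the dataset already
    uses it with some other market label.
    """
    seen = defaultdict(set)
    for code, market in existing_pairs:
        seen[code].add(market)
    return sorted(
        code
        for code, market in proposed.items()
        if seen.get(code, set()) - {market}
    )
```

With `("NGM", "Nordic Growth Market")` among the existing pairs and `"NGM": "NASDAQ Global Market"` among the proposed codes, the function reports `"NGM"`, which is the situation that led to the four reverts.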
* Clean up exchange/market column inconsistencies + add consistency test
Per maintainer feedback on JerBouma#143: keep `exchange` and `market` columns
in lock-step. Fixed 87 rows where the short code and the human-readable
label did not agree.
Category A — exchange code non-canonical, market is correct (84 rows):
* NAS (12 rows, market "NASDAQ Global Select") -> NMS
* NYS (72 rows, market "New York Stock Exchange") -> NYQ
Category B — market label is wrong, exchange is correct (3 rows):
* BTS-CATG (Leverage Shares 2X Long CAT Daily ETF on BATS BZX):
market "OTC Bulletin Board" -> "BATS BZX Exchange"
(Verified via Leverage Shares / Robinhood / Benzinga.)
* NCM-CAPS (Capstone Holding Corp. on NASDAQ Capital Market):
market "OTC Bulletin Board" -> "NASDAQ Capital Market"
* NSI-LT.NS (Larsen & Toubro on NSE India, .NS suffix confirms):
market "Metropolitan Stock Exchange" -> "National Stock Exchange of India"
After the fix every exchange code in equities.csv maps to exactly one
market label.
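One way to check that property is to collect the set of market labels seen for each exchange code and flag any code with more than one. The sketch below uses only the standard library and is not the repository's test; the helper names and the idea of passing the CSV path as an argument are assumptions.

```python
import csv
from collections import defaultdict


def exchange_market_violations(rows):
    """Map each exchange code that has several market labels to those labels.

    `rows` is any iterable of dicts with "exchange" and "market" keys.
    Rows with an empty exchange or market are ignored.
    """
    markets = defaultdict(set)
    for row in rows:
        if row["exchange"] and row["market"]:
            markets[row["exchange"]].add(row["market"])
    return {
        code: sorted(labels)
        for code, labels in markets.items()
        if len(labels) > 1
    }


def check_file(path):
    """Raise AssertionError if any exchange code in the CSV has 2+ markets."""
    with open(path, newline="", encoding="utf-8") as handle:
        violations = exchange_market_violations(csv.DictReader(handle))
    assert not violations, f"exchange codes with several markets: {violations}"
```

Only the exchange -> market direction is checked here, so several codes sharing one label (as the OTC codes do) is not reported.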
Added `tests/test_equities.py::test_exchange_market_one_to_one` that
asserts the forward direction (exchange -> 1 market) so future PRs
cannot silently re-introduce the kind of drift fixed here. The reverse
direction is intentionally not asserted: a market label may legitimately
cover several exchange tiers (e.g. "OTC Bulletin Board" covers
PNK / OQB / OID / OEM / OQX).
1 parent 2117100 commit 16717b8
3 files changed
Lines changed: 30004 additions & 29976 deletions