ASE exchange fix + ISIN/FIGI backfill from public data #143
Merged
Addresses JerBouma#133. The `exchange` column in `equities.csv` reported "ASE" for 1,632 rows, but cross-checking each against yfinance showed that only 257 actually belong to NYSE American. Of the rest, 580 were misclassified and now report the exchange code yfinance returns; the remainder could not be validated and stay as "ASE".

Gating:
* Row must currently have exchange = "ASE"
* yfinance `Ticker(symbol).info["exchange"]` must be non-null
* Must match `^[A-Z]{2,5}$` (plausible exchange code)
* yfinance's value must differ from "ASE"

Stats from the 1,632 "ASE" rows:

| yfinance returns | Count | Action |
|---|---:|---|
| NONE / unknown | 795 | kept as "ASE" (cannot validate) |
| "ASE" (legit NYSE American) | 257 | kept as "ASE" |
| "NYQ" (NYSE main) | 546 | **fixed** -> "NYQ" |
| "PNK" (OTC Pink) | 12 | **fixed** -> "PNK" |
| "NCM"/"NMS"/"NGM" (Nasdaq tiers) | 15 | **fixed** |
| "OQB"/"OID" (OTC Markets) | 6 | **fixed** |
| "PCX" (NYSE Arca) | 1 | **fixed** |
| errors | 18 | left as "ASE" |

Total fixed: 580 rows. Notable corrections include common NYSE main-board listings that JerBouma#133 explicitly flagged as wrong (ARX, ALH, etc., all now "NYQ").

No row was overwritten where yfinance returned "ASE"; the 257 genuine NYSE American listings remain untouched.

Source: yfinance `Ticker(symbol).info["exchange"]`. The 795 rows yfinance couldn't resolve are mostly delisted preferred stocks or thinly-covered exotic tickers, best left as-is until a different data source is added.

Related: JerBouma#133
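The four gating rules can be sketched as a single pure function. This is an illustrative sketch, not the script used for the PR: `corrected_exchange` is a hypothetical name, and the `reported` argument stands in for the value of yfinance's `Ticker(symbol).info["exchange"]` so the example runs without network access.

```python
from __future__ import annotations

import re

# Plausible exchange code: two to five uppercase letters (pattern from the gating list).
EXCHANGE_CODE = re.compile(r"^[A-Z]{2,5}$")


def corrected_exchange(current: str, reported: str | None) -> str:
    """Return the exchange code a row should carry after the gating checks.

    `reported` is assumed to be what yfinance returned for the symbol;
    it is passed in rather than fetched so the sketch stays offline.
    """
    if current != "ASE":
        return current  # only rows currently marked "ASE" are candidates
    if not reported:
        return current  # null or empty: cannot validate, keep "ASE"
    if not EXCHANGE_CODE.fullmatch(reported):
        return current  # not a plausible exchange code
    if reported == "ASE":
        return current  # genuine NYSE American listing, nothing to change
    return reported


print(corrected_exchange("ASE", "NYQ"))
print(corrected_exchange("ASE", None))
```

Keeping the decision separate from the lookup makes every branch of the stats table (kept, fixed, cannot validate) testable on its own.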
Where yfinance returned no ISIN (yfinance has thin coverage of non-US exchanges), publicly accessible exchange listing data provides the canonical ISIN. This commit fills 16,100 empty `isin` cells in equities.csv across 30 international markets.

Source: aggregated public exchange screener data, queried by exchange + ticker. The ISINs returned are matched 1:1 with the ticker symbol on each market.

Markets covered (FD suffix -> market):

| Market | ISINs filled |
|---|---:|
| Japan (.T) | 2,460 |
| India (.BO) | 2,862 |
| India (.NS) | 1,390 |
| China (.SZ) | 1,922 |
| China (.SS) | 1,396 |
| Korea (.KQ) | 931 |
| Korea (.KS) | 515 |
| Canada (.V) | 616 |
| Australia (.AX) | 605 |
| Indonesia (.JK) | 516 |
| UK (.L) | 484 |
| Thailand (.BK) | 463 |
| Hong Kong (.HK) | 319 |
| Canada (.TO) | 258 |
| France (.PA) | 243 |
| Brazil (.SA) | 212 |
| Sweden (.ST) | 208 |
| Switzerland (.SW) | 101 |
| Germany (.F/.MU/.DU/.BE) | 322 |
| Other (Norway, Netherlands, Spain, Italy, Mexico, Vietnam, Austria, Singapore) | ~870 |

Gating (same as the yfinance backfill in JerBouma#139):
* ISIN regex `^[A-Z]{2}[A-Z0-9]{9}[0-9]$`
* ISIN Mod-10 Double-Add-Double check digit
* Skip rows where FD already has an `isin` value (zero overwrite)
* Match by exchange-suffix-mapped-to-market + naked ticker

Stats: 16,100 filled, 0 rejected by checksum.

Markets explicitly NOT covered:
* `.NX`, `.SG`, `.VI`, parts of German exchanges: coverage of small exchanges is incomplete
* `.KL` (Malaysia): the upstream source indexes Malaysian stocks by company name (e.g. "MAYBANK") whereas FD uses numeric codes (e.g. "0007.KL"); cannot be matched without a separate name->code map
* `.US` (US tickers): already covered by JerBouma#139 (yfinance)

CUSIP derivation: not applicable for these rows since they are predominantly non-US/CA ISINs whose middle 9 characters are local national numbering systems (Japanese Securities Code, SEDOL, WKN, etc.) rather than CUSIPs.

Related: JerBouma#78
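The first two gating rules (regex plus Mod-10 Double-Add-Double check digit) can be sketched as follows. The function name `is_valid_isin` is hypothetical and the PR's actual validation code is not shown here; the arithmetic is the standard ISIN check, in which letters expand to two-digit numbers (A=10 ... Z=35) before a Luhn-style sum is taken over all digits.

```python
import re

# Shape check taken from the gating list: country code, 9 alphanumerics, check digit.
ISIN_PATTERN = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")


def is_valid_isin(isin: str) -> bool:
    """Return True when `isin` has the right shape and a correct check digit."""
    if not ISIN_PATTERN.fullmatch(isin):
        return False
    # Expand letters to numbers (A=10 ... Z=35); digits map to themselves.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    # Walk from the right; double every second digit, starting with the
    # digit immediately left of the check digit.
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # same as adding the two digits of the product
        total += value
    return total % 10 == 0


print(is_valid_isin("US0378331005"))  # True
print(is_valid_isin("US0378331006"))  # False
```

Running the checksum after the regex means a candidate with a single mistyped character is rejected rather than written into an empty `isin` cell.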
Populates `figi`, `composite_figi`, and `shareclass_figi` for US equity
rows that had those columns empty. FIGI (Financial Instrument Global
Identifier) is Bloomberg's openly-licensed identifier standard for
financial instruments — the only freely-licensed global identifier
(ISIN/CUSIP/SEDOL are paywalled standards).
Source: public FIGI data from Bloomberg's OpenFIGI initiative, mirrored
through publicly accessible exchange listing data.
Stats:
* 13,769 figi cells filled (was: 18,114 empty in US rows)
* 13,469 composite_figi cells filled
* 13,043 shareclass_figi cells filled
* 13,770 unique rows touched
* Diff: 13,770 insertions / 13,770 deletions
Gating:
* FIGI regex `^BBG[0-9A-Z]{9}$`
* Skip rows where the column is already populated (zero overwrite)
* Match by exact symbol
Why FIGI matters: filling FD's FIGI columns enables downstream
cross-reference (FIGI -> ISIN, FIGI -> CUSIP, FIGI -> SEDOL) via
OpenFIGI's free public API.
Not covered:
* Non-US exchanges -- left for follow-up
* Tickers using "." instead of "-" (e.g. BRK.A vs BRK-A) -- a handful
of share-class tickers whose format diverges between sources
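The gating rules above (pattern check, zero overwrite) can be sketched for a single row like this. It is a minimal sketch under assumptions: rows are plain dicts, `fill_figi` is a hypothetical helper, and `candidate` stands for whatever the upstream source returned for the exactly matching symbol. The FIGI strings in the usage lines are placeholders that merely satisfy the pattern.

```python
from __future__ import annotations

import re

# Pattern from the gating list: "BBG" followed by nine uppercase alphanumerics.
FIGI_PATTERN = re.compile(r"^BBG[0-9A-Z]{9}$")
FIGI_COLUMNS = ("figi", "composite_figi", "shareclass_figi")


def fill_figi(row: dict[str, str], candidate: dict[str, str]) -> int:
    """Fill empty FIGI columns of `row` in place; return how many cells were filled."""
    filled = 0
    for column in FIGI_COLUMNS:
        if row.get(column):
            continue  # zero overwrite: a populated cell is never touched
        value = candidate.get(column, "")
        if FIGI_PATTERN.fullmatch(value):
            row[column] = value
            filled += 1
    return filled


row = {"symbol": "AAA", "figi": "BBG00000000A", "composite_figi": "", "shareclass_figi": ""}
candidate = {"figi": "BBG00000000B", "composite_figi": "BBG00000000C", "shareclass_figi": "bad"}
print(fill_figi(row, candidate), row)
```

Checking each column independently is one way the three per-column fill counts in the stats could differ from one another.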
fd74764 to
3ba838d
Compare
Owner
Looks good! Outside of the exchange column, did you also verify the market column? This should align with all changes of the exchanges.
Contributor
Author
give me a sec
Follow-up to maintainer review on JerBouma#143: the `market` column was left unchanged when the `exchange` codes were corrected.

Aligned 576 rows so `market` matches the canonical FD value for each new `exchange`:
* NYQ -> "New York Stock Exchange"
* NMS -> "NASDAQ Global Select"
* NCM -> "NASDAQ Capital Market"
* PNK / OQB / OID -> "OTC Bulletin Board"
* PCX -> "NYSE Arca"

Reverted 4 rows whose exchange had been set to "NGM":
* yfinance returned "NGM" meaning NASDAQ Global Market
* but FD's "NGM" code is already used for "Nordic Growth Market" (Sweden) -- 270 existing rows
* to avoid creating an ambiguous code, those 4 rows go back to "ASE" until FD adds a distinct code for NASDAQ Global Market

Symbols reverted: BEEP, BHIL, CHACU, VOLT.

Total fixed-by-this-PR is now 576 (down from 580).
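The alignment and the "NGM" revert can be sketched as one lookup. The mapping values are the labels listed in this commit message; `align_row`, `AMBIGUOUS_CODES` and the tuple-based interface are assumptions made for illustration, not the code used in the PR.

```python
from __future__ import annotations

# Exchange code -> canonical FD market label, as listed in the commit message.
EXCHANGE_TO_MARKET = {
    "NYQ": "New York Stock Exchange",
    "NMS": "NASDAQ Global Select",
    "NCM": "NASDAQ Capital Market",
    "PNK": "OTC Bulletin Board",
    "OQB": "OTC Bulletin Board",
    "OID": "OTC Bulletin Board",
    "PCX": "NYSE Arca",
}

# Codes that already mean something else in FD ("NGM" is Nordic Growth Market).
AMBIGUOUS_CODES = {"NGM"}


def align_row(previous_exchange: str, proposed_exchange: str, current_market: str) -> tuple[str, str]:
    """Return the (exchange, market) pair a corrected row should end up with."""
    if proposed_exchange in AMBIGUOUS_CODES:
        # Revert: keep the old code and label rather than create a clash.
        return previous_exchange, current_market
    market = EXCHANGE_TO_MARKET.get(proposed_exchange)
    if market is None:
        return previous_exchange, current_market  # unknown code, leave the row alone
    return proposed_exchange, market


print(align_row("ASE", "NYQ", "old label"))
print(align_row("ASE", "NGM", "old label"))
```

Only rows whose proposed code came from this PR are passed through the function, so the 270 existing Nordic Growth Market rows are never considered for a revert.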
dokson added a commit to dokson/FinanceDatabase that referenced this pull request May 18, 2026
Per maintainer feedback on JerBouma#143: keep `exchange` and `market` columns in lock-step. Fixed 87 rows where the short code and the human-readable label did not agree.

Category A: exchange code non-canonical, market is correct (84 rows):
* NAS (12 rows, market "NASDAQ Global Select") -> NMS
* NYS (72 rows, market "New York Stock Exchange") -> NYQ

Category B: market label is wrong, exchange is correct (3 rows):
* BTS-CATG (Leverage Shares 2X Long CAT Daily ETF on BATS BZX): market "OTC Bulletin Board" -> "BATS BZX Exchange" (Verified via Leverage Shares / Robinhood / Benzinga.)
* NCM-CAPS (Capstone Holding Corp. on NASDAQ Capital Market): market "OTC Bulletin Board" -> "NASDAQ Capital Market"
* NSI-LT.NS (Larsen & Toubro on NSE India, .NS suffix confirms): market "Metropolitan Stock Exchange" -> "National Stock Exchange of India"

After the fix every exchange code in equities.csv maps to exactly one market label.

Added `tests/test_equities.py::test_exchange_market_one_to_one` that asserts the forward direction (exchange -> 1 market) so future PRs cannot silently re-introduce the kind of drift fixed here. The reverse direction is intentionally not asserted: a market label may legitimately cover several exchange tiers (e.g. "OTC Bulletin Board" covers PNK / OQB / OID / OEM / OQX).
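The forward-direction check can be sketched as below. The body of the real `test_exchange_market_one_to_one` is not shown in this conversation, so this is a hypothetical equivalent: `exchanges_with_multiple_markets` is an illustrative helper that takes rows as dicts and reports every exchange code carrying more than one market label.

```python
from __future__ import annotations

from collections import defaultdict


def exchanges_with_multiple_markets(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Map each offending exchange code to the sorted market labels it carries."""
    markets: dict[str, set[str]] = defaultdict(set)
    for row in rows:
        exchange, market = row.get("exchange"), row.get("market")
        if exchange and market:  # rows with an empty cell are ignored
            markets[exchange].add(market)
    return {code: sorted(labels) for code, labels in markets.items() if len(labels) > 1}


rows = [
    {"exchange": "PNK", "market": "OTC Bulletin Board"},
    {"exchange": "OQB", "market": "OTC Bulletin Board"},  # shared label is allowed
    {"exchange": "NCM", "market": "NASDAQ Capital Market"},
    {"exchange": "NCM", "market": "OTC Bulletin Board"},  # drift: second label for NCM
]
print(exchanges_with_multiple_markets(rows))
```

A test would simply assert that the returned dict is empty. Because the grouping key is the exchange code, several codes sharing one label (the OTC tiers) never trigger a failure, which matches the decision not to assert the reverse direction.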
fd298dc to
46db4bb
Compare
46db4bb to
faa411d
Compare
Contributor
Author
done! also added a new test that got a couple of issues 😉
Owner
Perfect, thanks again. Merging.
dokson added a commit to dokson/FinanceDatabase that referenced this pull request May 18, 2026
…ures, docstrings, local data

Sets a cleaner baseline for the test suite so future data and code PRs are easier to develop, review, and triage. All changes confined to `tests/` (plus a one-character typo fix in `database/etfs.csv`). **All 32 tests pass.**

## Snapshot diffs that actually tell you what changed

The `test_show_options*.json` snapshots are JSON arrays of category labels (countries, sectors, currencies, …) stored as single-line strings of 1-2 KB. When a database update added or removed one entry, the GitHub diff was one giant string change with no way to see what differed.

```diff
- string_value = json.dumps(data, **kwargs)
+ string_value = json.dumps(data, indent=2, ensure_ascii=False, **kwargs)
```

`ensure_ascii=False` also means non-ASCII characters render natively (`Côte d'Ivoire` instead of escape sequences).

## Unified-diff on snapshot mismatch

`Recorder.assert_equal()` previously raised an `AssertionError: Change detected` with two truncated 500-char strings, often appearing identical in the visible part even when they actually differed later. We hit this exact issue while developing JerBouma#141 and JerBouma#143. Now `assert_equal()` produces a proper `difflib.unified_diff` showing which line(s) changed; `assert_in_list` gained a proper assertion message.

## Tests now read the PR branch's data, not main

The library defaults to fetching `compression/*` from the GitHub `main` branch over HTTP. This meant tests on a PR validated the data on `main`, not the data in the PR, silently passing data-breaking PRs and failing on data-fixing PRs. Two cooperating fixes:

- Every `tests/test_*.py` now instantiates with `use_local_location=True` so the library reads the local checkout.
- `tests/conftest.py` regenerates `compression/*.bz2` and `compression/categories/*.gzip` from the checked-out `database/*.csv` at import time, mirroring the production `database_update.yml` workflow. This must run *before* pytest collects the test modules (test-module imports instantiate `fd.X(use_local_location=True)`), so it is a top-level statement, not a fixture.

The compression artifacts themselves are not committed in this PR: they are deterministically derived from the CSVs and the test suite regenerates them on every run.

## Test module quality (incidental cleanup)

Since the regen touched every test file, we took the opportunity to:

- Fix wrong module docstrings (`test_equities.py` and `test_moneymarkets.py` both started with `"""Currencies Test Module"""`).
- Replace `# pylint: disable=missing-function-docstring` with real docstrings on every test function in the 7 asset-class files and in `test_sec_enrichment_controller.py`.
- Add type hints `(recorder: Recorder) -> None`; `Recorder` is imported under `TYPE_CHECKING`, guarded by `from __future__ import annotations`, so there is zero runtime cost.

## conftest.py cleanup

- Removed `# pylint: skip-file` so static analysis is no longer suppressed.
- Cast `request.config.getoption("--rewrite-expected")` return through `bool(...)` to satisfy the declared `-> bool` annotation (previously `Any | None`).

## Test infrastructure config (pyproject.toml)

- Dropped `pytest-recording` from dev deps: it was unused; the real dep used by `conftest.py` is `pytest-recorder` (which provides the `record_mode` / `disable_recording` fixtures).
- Declared `[tool.pytest.ini_options]` `testpaths = ["tests"]`, `addopts = "--strict-markers"`, and the `record_stdout` marker so future PRs can't silently introduce undeclared markers.

## Side fix: ASYMshsare → ASYMshares (etfs.csv + tests)

While running the new local-data tests we noticed `ASPY` (Leverage Shares ASYMmetric 500 ETF) had `family = "ASYMshsare"`, a typo in `database/etfs.csv`. Corrected to `ASYMshares` along with the references in `tests/test_etfs.py` and the affected fixture snapshots. The maintainer's `Update Compression Files` workflow will regenerate the binary artifacts on `main` post-merge.

## What this PR explicitly does NOT do

Deferred to focused follow-ups (out of scope here):

- `@pytest.mark.parametrize` / class-based grouping: would change pytest node IDs and break `Recorder.capture()` positional snapshot naming (renames 90+ files in one go).
- `pytest-xdist` parallel execution: adds a new dev dep; pure perf win.
- Wiring `pytest-cov` (already in dev deps) into CI for coverage reporting.
- Filling the 22% coverage gap on `financedatabase/helpers.py` (`FinanceFrame.to_toolkit`, `show_options` URL error path, `case_sensitive=True`).
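The snapshot changes described in this commit (multi-line JSON plus a unified diff on mismatch) can be sketched with the standard library alone. `snapshot_diff` is a hypothetical helper written for illustration; it is not the `Recorder.assert_equal()` implementation, whose source is not shown here.

```python
import difflib
import json


def snapshot_diff(expected: object, actual: object) -> str:
    """Return a unified diff between two JSON-serialisable values ("" when equal)."""
    # indent=2 puts each list entry on its own line; ensure_ascii=False keeps
    # characters such as "ô" readable instead of escaping them.
    expected_text = json.dumps(expected, indent=2, ensure_ascii=False)
    actual_text = json.dumps(actual, indent=2, ensure_ascii=False)
    diff = difflib.unified_diff(
        expected_text.splitlines(),
        actual_text.splitlines(),
        fromfile="expected",
        tofile="actual",
        lineterm="",
    )
    return "\n".join(diff)


old = ["Belgium", "Côte d'Ivoire"]
new = ["Belgium", "Côte d'Ivoire", "Denmark"]
print(snapshot_diff(old, new))
```

With one entry per line, adding a single category shows up as one or two changed lines rather than a rewrite of the whole string, which is the reviewability gain the commit describes.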
This was referenced May 18, 2026
dokson added a commit to dokson/FinanceDatabase that referenced this pull request May 18, 2026
Two passes:
**Pass 1 (deterministic, 17,240 rows)**: for every row with `exchange`
empty AND a recognisable ticker suffix (`.NX`, `.F`, `.DU`, `.MI`, ...),
derive the canonical exchange code from rows that already have that
same suffix populated. No external API. The mapping is unambiguous at
>=99% per suffix (verified before applying):
| Suffix | Exchange | Rows |
|---|---|---:|
| .NX | ENX (Euronext) | 11,364 |
| .F | FRA (Frankfurt) | 3,068 |
| .DU | DUS (Dusseldorf) | 624 |
| .BE | BER (Berlin) | 370 |
| .MI | MIL (Borsa Italiana) | 349 |
| .MU | MUN (Munich) | 247 |
| .SG | STU (Stuttgart) | 218 |
| .OL | OSL (Oslo) | 115 |
| .TI | TLO (EuroTLX) | 115 |
| .ST | STO (NASDAQ OMX Stockholm) | 89 |
| .BO | BSE (BSE India) | 75 |
| .PA | PAR (Euronext Paris) | 70 |
| .VI | VIE (Vienna) | 68 |
| .MX | MEX (Mexico) | 63 |
| Other suffixes | various | ~405 |
**Pass 2 (external resolution, 12 rows)**: for the 98 remaining rows
(US-style symbols with no suffix and every other column NaN — failed
scrapes), query yfinance and Finnhub. yfinance resolved 9; Finnhub's
US universe endpoint resolved 3 more (MIC mapping: XNYS->NYQ, XNAS->NMS,
OOTC->PNK). TradingView returned no additional matches on this set.
Stats:
* 17,240 deterministic suffix-based fills
* 12 external resolution fills (9 yfinance + 3 Finnhub)
* 17,252 `exchange` cells filled (was: 17,338 empty before)
* 17,252 `market` cells also filled (inherited from canonical
`exchange -> market` mapping)
* 86 rows still empty: US-style symbols not found in any free
data source (delisted warrants/units like `BFT-WT`, `CAS'U`,
`CCX'U`, etc.). Out of scope.
Validation: passes `test_exchange_market_one_to_one` (introduced in
JerBouma#143) — every exchange code in `equities.csv` still maps to exactly
one market label.
No row with a pre-existing populated `exchange` was modified. Only
empty cells filled.
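Pass 1 can be sketched in two steps: learn a suffix-to-exchange map from rows that are already populated, keeping only suffixes whose dominant exchange reaches the 99% share, then fill empty cells from that map. The function names, the dict-based rows and the "text after the last dot" suffix rule are assumptions for illustration; the script used for the commit is not shown here.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def suffix_of(symbol: str) -> str | None:
    """Return the ticker suffix including the dot (".F"), or None without one."""
    return symbol[symbol.rindex("."):] if "." in symbol else None


def derive_suffix_map(rows: list[dict[str, str]], threshold: float = 0.99) -> dict[str, str]:
    """Learn suffix -> exchange from populated rows, keeping unambiguous suffixes only."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for row in rows:
        suffix = suffix_of(row["symbol"])
        if suffix and row.get("exchange"):
            counts[suffix][row["exchange"]] += 1
    mapping = {}
    for suffix, counter in counts.items():
        exchange, hits = counter.most_common(1)[0]
        if hits / sum(counter.values()) >= threshold:
            mapping[suffix] = exchange
    return mapping


def fill_empty_exchanges(rows: list[dict[str, str]], mapping: dict[str, str]) -> int:
    """Fill only empty `exchange` cells in place; return the number filled."""
    filled = 0
    for row in rows:
        suffix = suffix_of(row["symbol"])
        if not row.get("exchange") and suffix in mapping:
            row["exchange"] = mapping[suffix]
            filled += 1
    return filled


rows = [
    {"symbol": "AAA.F", "exchange": "FRA"},
    {"symbol": "BBB.F", "exchange": "FRA"},
    {"symbol": "CCC.F", "exchange": ""},
    {"symbol": "GGG", "exchange": ""},
]
print(fill_empty_exchanges(rows, derive_suffix_map(rows)), rows[2])
```

The `market` column could then be filled from the canonical `exchange -> market` mapping in the same pass, since every filled code already has exactly one label.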
JerBouma pushed a commit that referenced this pull request May 19, 2026
JerBouma pushed a commit that referenced this pull request on May 19, 2026
…ures, docstrings, local data (#140)

Sets a cleaner baseline for the test suite so future data and code PRs are easier to develop, review, and triage. All changes are confined to `tests/` (plus a one-character typo fix in `database/etfs.csv`).

**All 32 tests pass.**

## Snapshot diffs that actually tell you what changed

The `test_show_options*.json` snapshots are JSON arrays of category labels (countries, sectors, currencies, …), stored as single-line strings of 1-2 KB. When a database update added or removed one entry, the GitHub diff was one giant string change with no way to see what differed.

```diff
- string_value = json.dumps(data, **kwargs)
+ string_value = json.dumps(data, indent=2, ensure_ascii=False, **kwargs)
```

`ensure_ascii=False` also means non-ASCII characters render natively (`Côte d'Ivoire` instead of escape sequences).

## Unified-diff on snapshot mismatch

`Recorder.assert_equal()` previously raised an `AssertionError: Change detected` with two truncated 500-char strings, which often looked identical in the visible part even when they differed later on. We hit this exact issue while developing #141 and #143. Now `assert_equal()` produces a proper `difflib.unified_diff` showing which line(s) changed; `assert_in_list` gained a proper assertion message.

## Tests now read the PR branch's data, not main

The library defaults to fetching `compression/*` from the GitHub `main` branch over HTTP. This meant tests on a PR validated the data on `main`, not the data in the PR, so data-breaking PRs passed silently and data-fixing PRs failed. Two cooperating fixes:

- Every `tests/test_*.py` now instantiates with `use_local_location=True` so the library reads the local checkout.
- `tests/conftest.py` regenerates `compression/*.bz2` and `compression/categories/*.gzip` from the checked-out `database/*.csv` at import time, mirroring the production `database_update.yml` workflow. This must run *before* pytest collects the test modules (test-module imports instantiate `fd.X(use_local_location=True)`), so it is a top-level statement, not a fixture.

The compression artifacts themselves are not committed in this PR: they are deterministically derived from the CSVs and the test suite regenerates them on every run.

## Test module quality (incidental cleanup)

Since the regen touched every test file, took the opportunity to:

- Fix wrong module docstrings (`test_equities.py` and `test_moneymarkets.py` both started with `"""Currencies Test Module"""`).
- Replace `# pylint: disable=missing-function-docstring` with real docstrings on every test function in the 7 asset-class files and in `test_sec_enrichment_controller.py`.
- Add type hints `(recorder: Recorder) -> None`; the `Recorder` import sits under `TYPE_CHECKING`, guarded by `from __future__ import annotations`, so it has zero runtime cost.

## conftest.py cleanup

- Removed `# pylint: skip-file` so static analysis is no longer suppressed.
- Cast the `request.config.getoption("--rewrite-expected")` return value through `bool(...)` to satisfy the declared `-> bool` annotation (previously `Any | None`).

## Test infrastructure config (pyproject.toml)

- Dropped `pytest-recording` from dev deps (unused); the real dep used by `conftest.py` is `pytest-recorder` (which provides the `record_mode` / `disable_recording` fixtures).
- Declared `[tool.pytest.ini_options]` `testpaths = ["tests"]`, `addopts = "--strict-markers"`, and the `record_stdout` marker so future PRs can't silently introduce undeclared markers.

## Side fix: ASYMshsare → ASYMshares (etfs.csv + tests)

While running the new local-data tests we noticed `ASPY` (Leverage Shares ASYMmetric 500 ETF) had `family = "ASYMshsare"`, a typo in `database/etfs.csv`. Corrected to `ASYMshares` along with the references in `tests/test_etfs.py` and the affected fixture snapshots. The maintainer's `Update Compression Files` workflow will regenerate the binary artifacts on `main` post-merge.

## What this PR explicitly does NOT do

Deferred to focused follow-ups (out of scope here):

- `@pytest.mark.parametrize` / class-based grouping: would change pytest node IDs and break `Recorder.capture()` positional snapshot naming (renames 90+ files in one go).
- `pytest-xdist` parallel execution: adds a new dev dep; pure perf win.
- Wiring `pytest-cov` (already in dev deps) into CI for coverage reporting.
- Filling the 22% coverage gap on `financedatabase/helpers.py` (`FinanceFrame.to_toolkit`, `show_options` URL error path, `case_sensitive=True`).
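The snapshot-diff behaviour described in this commit message can be sketched as follows. This is an illustrative stand-in, not the actual `Recorder.assert_equal()` implementation: the function name `assert_snapshot_equal` and its `name` parameter are hypothetical, and only `json.dumps` and `difflib.unified_diff` from the standard library are used.

```python
import difflib
import json
from typing import Any


def assert_snapshot_equal(expected: Any, actual: Any, name: str = "snapshot") -> None:
    """Compare two JSON-serialisable values and raise with a line-level diff."""
    # indent=2 puts each list entry on its own line; ensure_ascii=False keeps
    # characters such as "ô" readable instead of emitting \u escapes.
    expected_text = json.dumps(expected, indent=2, ensure_ascii=False)
    actual_text = json.dumps(actual, indent=2, ensure_ascii=False)
    if expected_text == actual_text:
        return

    diff = "\n".join(
        difflib.unified_diff(
            expected_text.splitlines(),
            actual_text.splitlines(),
            fromfile=f"{name} (expected)",
            tofile=f"{name} (actual)",
            lineterm="",
        )
    )
    raise AssertionError(f"Change detected in {name}:\n{diff}")
```

With one entry per line, a single added or removed category shows up as one `+` or `-` line rather than a changed 1-2 KB string.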
dokson added a commit to dokson/FinanceDatabase that referenced this pull request on May 19, 2026
…+ automated README stats

Closes the long-standing data-shape gap where ETFs and Funds had no country field, plus a series of related data-quality fixes surfaced by the new cross-asset invariant test.

## Schema change: new `country` column on etfs.csv and funds.csv

Derived deterministically by chaining `exchange -> country` from equities.csv (the source of truth for that mapping), with three small manual overrides for fund-only exchange codes that don't appear in equities.csv (`NAS`/`NYM`/`CME` -> United States) plus `NIM` for ETF NextShares (also US). A second-pass ticker-suffix fallback handles rows whose `exchange` was missing or corrupted.

Coverage after the fill:

| File | Rows | country populated |
|---|---:|---|
| `etfs.csv` | 36,485 | 100.00% |
| `funds.csv` | 57,853 | 100.00% |

Top 5 countries:

- **ETFs**: United States (17,723), Canada (6,688), Germany (5,598), United Kingdom (2,137), Switzerland (1,463); 33 distinct countries
- **Funds**: United States (47,643), Spain (5,383), Canada (1,898), United Kingdom (1,769), India (496); 24 distinct countries

## Data-quality fixes surfaced by the new invariant tests

- **Removed 14 non-ETF rows** from `etfs.csv` (^REIT plus 13 US-equity duplicates that existed correctly in `equities.csv` already: `BHF`, `BHFAN`, `BHFAO`, `BHFAP`, `DTB`, `DTE`, `DTP`, `HTGC`, `PBC`, `PSEC`, `RGA`, `RZA`, `TPVG`).
- **Removed 56 cross-asset symbol collisions** between `equities.csv` and `etfs.csv`. All 56 were corporate bonds / senior notes / equity share-class rows that had ended up in `etfs.csv` (Brighthouse, Corvus Gold, Great Ajax Corp. notes, Argo notes, CMS Energy junior subordinated notes, Conifer Holdings senior notes, Qwest Corp notes, DTE Energy variants, ASGI = Aberdeen Global Infrastructure Income Fund, etc.).
- **Fixed 29 ETF rows with corrupted `exchange` values** (issuer name written into the exchange column instead of a real exchange code: `Xtrackers`, `Fundlogic`, `Purpose Investments`, `CI Investments`, `Horizons ETFs Management`, `Harvest Portfolios Group`, `IA Clarington Investments`, `National Bank Investments`, `Caldwell Investment Management`, `Developed Markets`, `Emerging Markets`, `High Yield Bonds`). Re-derived from the ticker suffix.
- **Completed FSST** (Fidelity Sustainable U.S. Equity ETF). The row previously had every field NaN. Filled name, currency, summary, category_group, category, family, exchange (PCX), country, isin.

After this round: 0 cross-asset symbol collisions, 0 corrupted exchange values flagged by the new invariants.

## API additions

`ETFs.select()`, `ETFs.show_options()`, `Funds.select()`, and `Funds.show_options()` now accept a `country` parameter, validated against the new column. Calling with an unknown country raises `ValueError`, matching the pattern used by the existing filters.

## New tests

In `tests/test_etfs.py` and `tests/test_funds.py`:

- `test_exchange_country_one_to_one`: asserts that every `exchange` code on the asset maps to exactly one `country` value (the same invariant introduced for equities `exchange -> market` in JerBouma#143).
- `test_select_with_invalid_value_raises` now also exercises the `country` filter ValueError path.

New file `tests/test_invariants.py`:

- `test_no_symbol_collisions_across_asset_classes`: asserts that a given `symbol` belongs to at most one of `equities.csv`, `etfs.csv`, `funds.csv`. Catches the kind of drift fixed by the 56-row cleanup above before it can land on `main` again.

## Automated README statistics

`.github/workflows/database_update.yml` gains a new `Update-README-Statistics` job that runs after the existing Add-New-Ticker / Update-Compression / Update-Categorization jobs. It recomputes both statistics tables from the on-disk CSVs and rewrites README.md in-place, committing the result. The Check-GICS job now also depends on it.

Replaced the meaningless `Countries` numbers for ETFs/Funds (which were not backed by any column) with the now-real value derived from the new `country` column. Other numbers also refreshed to current state:

| Product           | Quantity   | Sectors    | Industries    | Countries | Exchanges |
|-------------------|-----------:|-----------:|--------------:|----------:|----------:|
| Equities          | 160.113    | 11         | 62            | 113       | 84        |
| ETFs              | 36.485     | 320        | 51            | 33        | 51        |
| Funds             | 57.853     | 1.540      | 74            | 24        | 33        |

| Product           | Quantity  | Category              |
|-------------------|----------:|-----------------------|
| Currencies        | 2.556     | 175 Currencies        |
| Cryptocurrencies  | 3.367     | 351 Cryptocurrencies  |
| Indices           | 91.178    | 63 Exchanges          |
| Money Markets     | 1.367     | 2 Exchanges           |

For ETFs/Funds the `Sectors` column = `family` count and the `Industries` column = `category` count, which is how those numbers were always interpreted in the README; they are now backed by real columns.

## Test plan

- [ ] `pytest tests/`: 52 tests pass (was 49 before this PR)
- [ ] `test_exchange_country_one_to_one` passes for ETFs and Funds
- [ ] `test_no_symbol_collisions_across_asset_classes` passes
- [ ] CI Update-README-Statistics job rewrites README correctly
- [ ] `black --check tests/ financedatabase/` clean
- [ ] Spot-check: `fd.ETFs(use_local_location=True).show_options(selection="country")` returns 33 countries
- [ ] Spot-check: `fd.Funds(use_local_location=True).select(country="Spain")` returns ~5,383 rows
dokson added a commit to dokson/FinanceDatabase that referenced this pull request on May 19, 2026
…ME stats

Per maintainer feedback on the initial version, this PR drops the `country` column addition for ETFs/Funds: the country semantics for those asset classes (investment scope, not listing-exchange country) need a richer signal than `exchange -> country` can give. What remains is still substantial:

## 1. Data-quality fixes on `etfs.csv` (surfaced by the new invariant)

- **14 non-ETF rows removed**: `^REIT` plus 13 US-equity duplicates that already existed correctly in `equities.csv` (`BHF`, `BHFAN`, `BHFAO`, `BHFAP`, `DTB`, `DTE`, `DTP`, `HTGC`, `PBC`, `PSEC`, `RGA`, `RZA`, `TPVG`).
- **56 cross-asset symbol collisions removed**: all 56 were corporate bonds / senior notes / equity share-class rows misclassified as ETFs (Brighthouse, Corvus Gold, Great Ajax Corp. notes, Argo notes, CMS Energy junior subordinated notes, Conifer Holdings senior notes, Qwest Corp notes, DTE Energy variants, `ASGI` = Aberdeen Global Infrastructure Income Fund, etc.).
- **29 ETF rows with corrupted `exchange` values fixed**: issuer name written into the exchange column instead of a real exchange code (`Xtrackers`, `Fundlogic`, `Purpose Investments`, `CI Investments`, `Horizons ETFs Management`, `Harvest Portfolios Group`, `IA Clarington Investments`, `National Bank Investments`, `Caldwell Investment Management`, `Developed Markets`, `Emerging Markets`, `High Yield Bonds`). Re-derived from ticker suffix.
- **`FSST` completed** (Fidelity Sustainable U.S. Equity ETF): row previously had every field NaN. Filled name, currency, summary, category_group, category, family, exchange (`PCX`), isin.

## 2. New cross-asset invariant test

`tests/test_invariants.py::test_no_symbol_collisions_across_asset_classes` asserts that a given `symbol` belongs to at most one of `equities.csv`, `etfs.csv`, `funds.csv`. This is the test that surfaced the 56 + 14 = 70 cleanup rows above; it would have caught all of them before they landed on `main`.

## 3. Automated README statistics

`.github/workflows/database_update.yml` gains a new `Update-README-Statistics` job that runs after the existing `Add-New-Ticker` / `Update-Compression-Files` / `Update-Categorization-Files` jobs. It recomputes the statistics tables from the on-disk CSVs and rewrites `README.md` in-place, committing the result. The `Check-GICS-Categorisation` job now also depends on it.

README is now split into three tables that reflect the actual schema of each asset class rather than the previous combined-table layout whose `Countries` column for ETFs/Funds had no backing data:

```
Table A: Equities only (has country, sector, industry)
| Product | Quantity | Sectors | Industries | Countries | Exchanges |
| Equities | 160.113 | 11 | 62 | 113 | 84 |

Table B: ETFs / Funds (no country: no schema for it)
| Product | Quantity | Families | Categories | Exchanges |
| ETFs | 36.485 | 320 | 51 | 51 |
| Funds | 57.853 | 1.540 | 74 | 33 |

Table C: Currencies / Cryptos / Indices / Money Markets
| Product | Quantity | Category |
| Currencies | 2.556 | 175 Currencies |
| Cryptocurrencies | 3.367 | 351 Cryptocurrencies |
| Indices | 91.178 | 63 Exchanges |
| Money Markets | 1.367 | 2 Exchanges |
```

The B-table columns are renamed from the previous "Sectors / Industries" (which were really family/category counts) to be honest about what the numbers represent.

## What this PR explicitly does NOT do

- **No `country` column on `etfs.csv` / `funds.csv`.** Per @JerBouma's feedback, "country" for ETFs/Funds should reflect the investment scope (e.g. iShares MSCI World ETF listed on NYSE has global scope, not "United States"), not the listing exchange. The obvious deterministic source (exchange -> country) doesn't fit that semantics. Deferred to a focused discussion / future PR if the community decides on a definition.
- **No country fill on missing equities.** 80k equity rows still have empty `country`; filling them from listing exchange would be wrong for the same reason ASML on Nasdaq stays "Netherlands". Needs an actual issuer-domicile source.

## Test plan

- [ ] `pytest tests/`: 50 tests pass (was 49 before; the new `test_no_symbol_collisions_across_asset_classes` is the +1)
- [ ] `test_no_symbol_collisions_across_asset_classes` passes
- [ ] `black --check tests/ financedatabase/` clean
- [ ] CI `Update-README-Statistics` job rewrites README correctly on next data update
- [ ] Spot-check: `BHF`, `DTE`, `RGA`, etc. exist in `equities.csv` only (not in `etfs.csv` anymore)
- [ ] Spot-check: `FSST` has all fields populated

Related: JerBouma#140 (test infra), JerBouma#143 (introduced the `exchange -> market` invariant pattern), JerBouma#144 (proposes splitting CSVs by exchange).
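The cross-asset invariant described in this commit message amounts to checking that the symbol sets of the asset classes are pairwise disjoint. Below is a minimal sketch under that reading; the helper name `find_symbol_collisions` and the dict-of-sets input are assumptions, and the real test in `tests/test_invariants.py` may load and compare the data differently.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def find_symbol_collisions(
    symbols_by_asset: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Return the symbols shared by each pair of asset classes (empty if none)."""
    collisions: Dict[Tuple[str, str], Set[str]] = {}
    for (left, left_symbols), (right, right_symbols) in combinations(
        symbols_by_asset.items(), 2
    ):
        shared = left_symbols & right_symbols
        if shared:
            collisions[(left, right)] = shared
    return collisions


def assert_no_symbol_collisions(symbols_by_asset: Dict[str, Set[str]]) -> None:
    """Fail with the offending symbols so the cleanup target is obvious."""
    collisions = find_symbol_collisions(symbols_by_asset)
    assert not collisions, f"Symbols present in more than one asset class: {collisions}"
```

Reporting the colliding symbols in the assertion message, rather than only a count, is what makes a failure directly actionable for a data PR.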
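Summary

Four `equities.csv` data improvements, bundled because they touch only the data and share the same public-data philosophy as #139:

1. Reclassification of misassigned `exchange="ASE"` values (closes #133, "Incorrect `exchange` code "ASE" assigned to NYSE / NYSE American securities (BRK.B, BF.B, ...)")
2. ISIN backfill from international markets
3. FIGI backfill (`figi`, `composite_figi`, `shareclass_figi`) for US equities
4. `exchange`/`market` consistency cleanup

| Change | Columns |
|---|---|
| `exchange="ASE"` → correct code | `exchange` |
| `market` alignment after exchange fix + revert 4 NGM collisions | `exchange`, `market` |
| Empty `isin` → ISIN from international markets | `isin` |
| FIGI backfill | `figi`, `composite_figi`, `shareclass_figi` |
| Consistency cleanup | `exchange`/`market` |

Total: ~31,000 cells improved. No rows added or removed.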
1. ASE exchange reclassification (closes #133)

The `exchange` column had `"ASE"` on 1,632 rows. Cross-checking each against yfinance confirmed only 257 as NYSE American listings; the remaining rows were either misclassified or could not be validated. This fixes 576 of them with the correct yfinance value (started at 580, then 4 NGM collisions were reverted; see the follow-up section below).

| yfinance returns | Resulting `exchange` |
|---|---|
| None / unknown | `ASE` (no data to fix) |
| `ASE` (legit NYSE American) | `ASE` |
| `NYQ` (NYSE main board) | `NYQ` |
| `PNK` (OTC Markets OTCPK) | `PNK` |
| `NCM`/`NMS` (Nasdaq tiers) | `NCM`/`NMS` |
| `NGM` (NASDAQ Global Market; collides with FD's NGM=Nordic) | `ASE` |
| `OQB` (OTC Markets OTCQB) | `OQB` |
| `OID` (OTC Markets OTCID) | `OID` |
| `PCX` (NYSE Arca) | `PCX` |
| errors | `ASE` |

Market column alignment + NGM reverts

Per @JerBouma's review feedback ("verify the market column"): the `market` column was left unchanged when the `exchange` codes were corrected. Aligned 576 rows so the human-readable name matches the new code (`NYQ` → "New York Stock Exchange", `NMS` → "NASDAQ Global Select", etc.).

Also reverted 4 rows that yfinance labelled `NGM` (NASDAQ Global Market), because FD's `NGM` code is already used for Nordic Growth Market (270 existing rows). To avoid creating an ambiguous code, those 4 rows go back to `ASE` until FD introduces a distinct code for NASDAQ Global Market. Affected symbols: BEEP, BHIL, CHACU, VOLT.

External verification

Sampled 4 random tickers from the 546 NYSE main-board reclassifications:

| Symbol | Reclassification | Verified |
|---|---|---|
| ZETA | `ASE` → `NYQ` | ✅ |
| IOT | `ASE` → `NYQ` | ✅ |
| JXN | `ASE` → `NYQ` | ✅ |
| GRNT | `ASE` → `NYQ` | ✅ |

2. International ISIN backfill (16,100 rows)

Where yfinance returned no ISIN (yfinance has thin coverage of non-US exchanges, as noted in #139), publicly accessible exchange listing data provides the canonical ISIN. This commit fills empty `isin` cells across 30 international markets.

Top markets covered

Gating (same as #139's yfinance backfill)

* ISIN regex `^[A-Z]{2}[A-Z0-9]{9}[0-9]$`
* ISIN Mod-10 Double-Add-Double check digit
* Skip rows where FD already has an `isin` value (zero overwrite)
* Match by exchange-suffix-mapped-to-market + naked ticker

Stats: 16,100 filled, 0 rejected by checksum.
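The ISIN gate above (format regex plus Mod-10 Double-Add-Double check digit) can be sketched in a few lines. This illustrates the standard ISIN checksum and is not the validation code used for the backfill; the function name `isin_is_valid` is an assumption.

```python
import re

ISIN_RE = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")


def isin_is_valid(isin: str) -> bool:
    """Check ISIN format and its Mod-10 Double-Add-Double (Luhn) check digit."""
    if not ISIN_RE.match(isin):
        return False

    # Expand letters to two-digit numbers (A=10 ... Z=35); digits stay as-is.
    digits = "".join(str(int(ch, 36)) for ch in isin)

    total = 0
    # Walk from the right; the check digit itself (position 0) is not doubled.
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # same as adding the two digits of the product
        total += value
    return total % 10 == 0
```

For example, `isin_is_valid("CNE0000002Q4")` (the value this PR expects for `000017.SZ`) returns `True`, while changing the last digit makes it return `False`.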
Markets explicitly NOT covered

* `.NX`, `.SG`, `.VI`, parts of `.F`/`.MU`/`.DU`/`.BE`: coverage of small exchanges is incomplete
* `.KL` (Malaysia): the upstream source indexes Malaysian stocks by company name, whereas FD uses numeric codes; cannot be matched without a separate name->code map
* `.US` (US tickers): already covered by #139 (yfinance)

3. FIGI backfill for US equities (13,770 rows)

FD's `figi`, `composite_figi`, and `shareclass_figi` columns were ~80% empty on US rows. FIGI is the only freely-licensed global financial identifier: ISIN/CUSIP/SEDOL are paywalled standards, while FIGI was created (Bloomberg/OpenFIGI) as the open alternative. Filling these columns enables downstream cross-reference (FIGI → ISIN, FIGI → CUSIP, FIGI → SEDOL) via OpenFIGI's free public API.

Source: public FIGI data from Bloomberg's OpenFIGI initiative, mirrored through publicly accessible exchange listing data.
Gating

* FIGI regex `^BBG[0-9A-Z]{9}$`

Stats

Columns filled: `figi`, `composite_figi`, `shareclass_figi`. Unique rows touched: 13,770.
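The FIGI format gate, combined with the PR's rule that only empty cells are filled, can be sketched as below. The function name `fill_figi_cells`, the dict-based row layout, and the shape of the `candidates` argument are assumptions made for illustration; only the regex comes from this PR.

```python
import re
from typing import Dict, Optional

FIGI_RE = re.compile(r"^BBG[0-9A-Z]{9}$")
FIGI_COLUMNS = ("figi", "composite_figi", "shareclass_figi")


def fill_figi_cells(row: Dict[str, str], candidates: Dict[str, Optional[str]]) -> int:
    """Fill empty FIGI cells in `row` from `candidates`; return how many were set."""
    filled = 0
    for column in FIGI_COLUMNS:
        if row.get(column):
            continue  # zero overwrite: populated cells are left untouched
        value = candidates.get(column)
        if value and FIGI_RE.match(value):
            row[column] = value
            filled += 1
    return filled
```

Checking the existing cell before the candidate value keeps the "no overwrite" guarantee independent of whatever the upstream source returns.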
Not covered

* Rows still empty after this pass: a possible follow-up is OpenFIGI's `/v3/mapping` (free, batchable, ~1 min for the empty set if pursued).
* Share-class ticker formats (`BRK.A` vs `BRK-A`): format diverges between sources for a handful of share-class tickers; left for follow-up.

4. Exchange/market consistency cleanup + test

Per @JerBouma's broader feedback to keep `exchange` and `market` columns in lock-step: a `groupby('exchange').market.nunique()` audit surfaced 87 pre-existing inconsistencies in FD where the short code and the human-readable label did not agree.

Category A: exchange code non-canonical, market correct (84 rows)

| Old `exchange` | Corrected to |
|---|---|
| `NAS` (12 rows) | `NMS` (matches the 6,726 canonical rows) |
| `NYS` (72 rows) | `NYQ` (matches the 4,089 canonical rows) |

Category B: market label wrong, exchange correct (3 rows)

CATGCAPSLT.NS

After this fix every exchange code in `equities.csv` maps to exactly one market label.

New test: `test_exchange_market_one_to_one`

Added to `tests/test_equities.py`. Asserts the forward invariant (each `exchange` code maps to exactly one `market` label) and fails fast if any future PR re-introduces the kind of drift that produced #133 and the 87 cleanup rows above.

The reverse direction (market → 1 exchange) is intentionally not asserted: one market label legitimately covers several exchange tiers, e.g. "OTC Bulletin Board" covers PNK / OQB / OID / OEM / OQX.
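The forward invariant can be illustrated without pandas by collecting the set of `market` labels seen for each `exchange` code. This is a sketch of the idea behind the `groupby('exchange').market.nunique()` audit, not the body of `test_exchange_market_one_to_one`; the function name and the row-as-dict input are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set


def exchanges_with_multiple_markets(
    rows: Iterable[Mapping[str, str]]
) -> Dict[str, Set[str]]:
    """Return exchange codes that map to more than one market label."""
    labels_by_code: Dict[str, Set[str]] = defaultdict(set)
    for row in rows:
        code, label = row.get("exchange"), row.get("market")
        if code and label:  # rows with an empty cell cannot violate the invariant
            labels_by_code[code].add(label)
    # Only the forward direction is checked: a market label shared by several
    # exchange codes (e.g. the OTC tiers) is allowed.
    return {code: labels for code, labels in labels_by_code.items() if len(labels) > 1}
```

A test built on this would assert that the returned dict is empty, and on failure the dict itself names the ambiguous codes, such as an `NGM` that means both Nordic Growth Market and NASDAQ Global Market.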
Diff shape

* `database/equities.csv` (ASE exchange)
* `database/equities.csv` (international ISIN)
* `database/equities.csv` (FIGI backfill)
* `database/equities.csv` (market alignment + NGM revert)
* `database/equities.csv` (87 consistency fixes) + test + snapshot

No rows added or removed. Columns modified: `exchange`, `market`, `isin`, `figi`, `composite_figi`, `shareclass_figi`.

Test plan

* `pytest tests/`: 32 tests pass (added 1)
* `test_exchange_market_one_to_one` passes (will fail until the compression workflow regenerates `compression/equities.bz2` post-merge, since the test reads via the library)
* Spot-check #133's examples (`ARX`, `ALH`): both should now be `NYQ` with market "New York Stock Exchange"
* Spot-check: `isin` for `000017.SZ` should now be `CNE0000002Q4`
* Spot-check: `NVDA` should now have `figi=BBG000BBJQV0`
* Rows that yfinance confirms as `"ASE"`: those 257 rows are intentionally still `ASE`
* No row where `isin` or `figi` was overwritten: only empty cells were filled

Closes #133. Related: #78.