ASE exchange fix + ISIN/FIGI backfill from public data#143

Merged
JerBouma merged 6 commits into JerBouma:main from dokson:feature/fix-ase-exchange
May 18, 2026

@dokson
Contributor

@dokson dokson commented May 18, 2026

Summary

Four equities.csv data improvements bundled because they touch only the data and share the same public-data philosophy as #139.

  1. Fix 580 misclassified exchange="ASE" values, 576 after reverting 4 NGM collisions (closes #133: Incorrect exchange code "ASE" assigned to NYSE / NYSE American securities (BRK.B, BF.B, ...))
  2. Backfill 16,100 missing ISINs across 30 international markets
  3. Backfill 13,770 FIGI identifiers (figi, composite_figi, shareclass_figi) for US equities
  4. Clean up 87 exchange/market column inconsistencies + add consistency test (per maintainer feedback)

| Change | Column(s) touched | Rows |
|---|---|---:|
| exchange="ASE" → correct code | exchange | 580 |
| Align market after exchange fix + revert 4 NGM collisions | exchange, market | 580 |
| Empty isin → ISIN from international markets | isin | 16,100 |
| Empty FIGI columns → Bloomberg FIGI | figi, composite_figi, shareclass_figi | 13,770 |
| Exchange/market consistency cleanup | exchange / market | 87 |

Total: ~31,000 cells improved. No rows added or removed.

1. ASE exchange reclassification (closes #133)

The exchange column had "ASE" on 1,632 rows. Cross-checking each against yfinance showed that only 257 are confirmed NYSE American listings; 580 carry a different exchange code on yfinance, and the remainder could not be resolved (see the table below). This PR corrects 576 of the 580 with the yfinance value: it started at 580, then 4 NGM collisions were reverted (see the follow-up section below).

| yfinance returns | Count | This PR |
|---|---:|---|
| Unknown / null | 795 | kept as ASE (no data to fix) |
| ASE (legit NYSE American) | 257 | kept as ASE |
| NYQ (NYSE main board) | 546 | ✅ fixed → NYQ |
| PNK (OTC Markets OTCPK) | 12 | ✅ fixed → PNK |
| NCM / NMS (Nasdaq tiers) | 11 | ✅ fixed |
| NGM (NASDAQ Global Market; collides with FD's NGM = Nordic) | 4 | ⚠️ reverted to ASE |
| OQB (OTC Markets OTCQB) | 4 | ✅ fixed → OQB |
| OID (OTC Markets OTCID) | 2 | ✅ fixed → OID |
| PCX (NYSE Arca) | 1 | ✅ fixed → PCX |
| Errors | 18 | kept as ASE |

Market column alignment + NGM reverts

Per @JerBouma's review feedback ("verify the market column"): the market column was left unchanged when the exchange codes were corrected. Aligned 576 rows so the human-readable name matches the new code (NYQ → "New York Stock Exchange", NMS → "NASDAQ Global Select", etc.).

Also reverted 4 rows that yfinance labelled NGM (NASDAQ Global Market) — FD's NGM code is already used for Nordic Growth Market (270 existing rows). To avoid creating an ambiguous code, those 4 rows go back to ASE until FD introduces a distinct code for NASDAQ Global Market. Affected symbols: BEEP, BHIL, CHACU, VOLT.
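
The collision rule described above can be sketched as a small check. This is an illustrative sketch only, not code from this PR: the function name `is_code_collision` and the two-entry `FD_LABELS` lookup are assumptions made for the example.

```python
# Hypothetical lookup of market labels FD already uses per exchange code.
# Only two entries are shown; the real database has many more.
FD_LABELS = {
    "NGM": "Nordic Growth Market",
    "NYQ": "New York Stock Exchange",
}


def is_code_collision(code: str, intended_market: str, existing_labels: dict) -> bool:
    """Return True when `code` is already used in FD for a different market."""
    current_label = existing_labels.get(code)
    return current_label is not None and current_label != intended_market


# yfinance's NGM means NASDAQ Global Market, but FD's NGM is Nordic Growth Market.
print(is_code_collision("NGM", "NASDAQ Global Market", FD_LABELS))
# NYQ means the same thing in both sources, so it is safe to write.
print(is_code_collision("NYQ", "New York Stock Exchange", FD_LABELS))
```

A row whose proposed code collides stays on its previous value, which is what happened to BEEP, BHIL, CHACU and VOLT.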

External verification

Sampled 4 random tickers from the 546 NYSE main-board reclassifications:

| Ticker | Company | External sources confirm | Our fix |
|---|---|---|---|
| ZETA | Zeta Global Holdings | NYSE | ASE → NYQ |
| IOT | Samsara Inc. | NYSE | ASE → NYQ |
| JXN | Jackson Financial | NYSE | ASE → NYQ |
| GRNT | Granite Ridge Resources | NYSE | ASE → NYQ |

2. International ISIN backfill (16,100 rows)

Where yfinance returned no ISIN (yfinance has thin coverage of non-US exchanges, as noted in #139), publicly accessible exchange listing data provides the canonical ISIN. This commit fills empty isin cells across 30 international markets.

Top markets covered

| Market | ISINs filled |
|---|---:|
| Japan (.T) | 2,460 |
| India (.BO) | 2,862 |
| India (.NS) | 1,390 |
| China (.SZ) | 1,922 |
| China (.SS) | 1,396 |
| Korea (.KQ) | 931 |
| Korea (.KS) | 515 |
| Canada (.V) | 616 |
| Australia (.AX) | 605 |
| Indonesia (.JK) | 516 |
| UK (.L) | 484 |
| Thailand (.BK) | 463 |
| Hong Kong (.HK) | 319 |
| Canada (.TO) | 258 |
| France (.PA) | 243 |
| Brazil (.SA) | 212 |
| Sweden (.ST) | 208 |
| Switzerland (.SW) | 101 |
| Germany (.F/.MU/.DU/.BE) | 322 |
| Norway, Netherlands, Spain, Italy, Mexico, Vietnam, Austria, Singapore | ~870 |

Gating (same as #139's yfinance backfill)

  1. ISIN regex ^[A-Z]{2}[A-Z0-9]{9}[0-9]$
  2. ISIN Mod-10 Double-Add-Double check digit
  3. Skip rows where FD already has an isin value (zero overwrite)
  4. Match by exchange-suffix-mapped-to-market + naked ticker

Stats: 16,100 filled, 0 rejected by checksum.
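
Gates 1 and 2 can be sketched as a stand-alone validator. This is a minimal illustration, not the script used for the backfill; `isin_is_valid` is a hypothetical name. It applies the regex above, expands letters to numbers (A=10 ... Z=35), and runs the Mod-10 Double-Add-Double (Luhn) check over the resulting digit string.

```python
import re

ISIN_RE = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")


def isin_is_valid(isin: str) -> bool:
    """Check ISIN shape and its Mod-10 Double-Add-Double check digit."""
    if not ISIN_RE.fullmatch(isin):
        return False
    # Expand each character to its base-36 value: digits stay, A=10 ... Z=35.
    digits = "".join(str(int(ch, 36)) for ch in isin)
    total = 0
    # Walk from the right; double every second digit, starting with the
    # digit immediately left of the check digit.
    for position, ch in enumerate(reversed(digits)):
        value = int(ch)
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # same as adding the two digits of the product
        total += value
    return total % 10 == 0


print(isin_is_valid("CNE0000002Q4"))  # the spot-check value from the test plan
print(isin_is_valid("US0378331006"))  # last digit altered, so the checksum fails
```

Because the regex runs first, lowercase or wrongly sized input is rejected before any arithmetic happens.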

Markets explicitly NOT covered

| Suffix | Reason |
|---|---|
| .NX, .SG, .VI, parts of .F/.MU/.DU/.BE | Upstream coverage of small German/Austrian exchanges is incomplete |
| .KL (Malaysia) | Upstream indexes Malaysian stocks by company name (e.g. "MAYBANK"), while FD uses numeric codes ("0007.KL"), so there is no direct match |
| .US (US tickers) | Already covered by #139 (yfinance) |

3. FIGI backfill for US equities (13,770 rows)

FD's figi, composite_figi, and shareclass_figi columns were ~80% empty on US rows. FIGI is the only freely-licensed global financial identifier — ISIN/CUSIP/SEDOL are paywalled standards, while FIGI was created (Bloomberg/OpenFIGI) as the open alternative. Filling these columns enables downstream cross-reference (FIGI → ISIN, FIGI → CUSIP, FIGI → SEDOL) via OpenFIGI's free public API.

Source: public FIGI data from Bloomberg's OpenFIGI initiative, mirrored through publicly accessible exchange listing data.

Gating

  • FIGI regex ^BBG[0-9A-Z]{9}$
  • Skip rows where the column is already populated (zero overwrite)
  • Match by exact symbol
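
The gating above can be sketched as a per-row fill. This is an assumed shape, not the PR's actual script: `fill_figi` and the dict-based rows are hypothetical, and the `composite_figi` value in the usage example is a made-up placeholder that only satisfies the regex.

```python
import re

FIGI_RE = re.compile(r"^BBG[0-9A-Z]{9}$")
FIGI_COLUMNS = ("figi", "composite_figi", "shareclass_figi")


def fill_figi(row: dict, candidate: dict) -> list:
    """Fill empty FIGI cells in `row` from `candidate`; never overwrite."""
    filled = []
    for column in FIGI_COLUMNS:
        if row.get(column):
            continue  # zero overwrite: an existing value always wins
        value = candidate.get(column) or ""
        if FIGI_RE.fullmatch(value):
            row[column] = value
            filled.append(column)
    return filled


# Rows are matched by exact symbol before this function is called.
row = {"symbol": "NVDA", "figi": "", "composite_figi": "BBG000000001", "shareclass_figi": ""}
candidate = {"figi": "BBG000BBJQV0", "composite_figi": "BBG999999999", "shareclass_figi": "not-a-figi"}
print(fill_figi(row, candidate))
```

Each column is gated independently, which is consistent with the three columns reporting different fill counts in the stats below.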

Stats

| Column | Cells filled | Was-empty before |
|---|---:|---:|
| figi | 13,769 | 18,114 |
| composite_figi | 13,469 | 18,109 |
| shareclass_figi | 13,043 | 18,109 |

Unique rows touched: 13,770.

Not covered

  • Non-US exchanges — public bulk endpoints currently cover only US listings; international FIGI backfill would need a per-ticker pass against OpenFIGI's /v3/mapping (free, batchable, ~1 min for the empty set if pursued).
  • Tickers using "." instead of "-" (e.g. BRK.A vs BRK-A) — format diverges between sources for a handful of share-class tickers; left for follow-up.

4. Exchange/market consistency cleanup + test

Per @JerBouma's broader feedback to keep exchange and market columns in lock-step: a groupby('exchange').market.nunique() audit surfaced 87 pre-existing inconsistencies in FD where the short code and the human-readable label did not agree.

Category A — exchange code non-canonical, market correct (84 rows)

| Old exchange | Market | Fix |
|---|---|---|
| NAS (12 rows) | NASDAQ Global Select | NMS (matches the 6,726 canonical rows) |
| NYS (72 rows) | New York Stock Exchange | NYQ (matches the 4,089 canonical rows) |

Category B — market label wrong, exchange correct (3 rows)

| Symbol | Exchange | Old market | New market |
|---|---|---|---|
| CATG | BTS | OTC Bulletin Board | BATS BZX Exchange (verified: Leverage Shares 2X CAT Daily ETF trades on BATS BZX) |
| CAPS | NCM | OTC Bulletin Board | NASDAQ Capital Market (Capstone Holding Corp.) |
| LT.NS | NSI | Metropolitan Stock Exchange | National Stock Exchange of India (Larsen & Toubro; .NS suffix confirms NSE) |

After this fix every exchange code in equities.csv maps to exactly one market label.

New test: test_exchange_market_one_to_one

Added to tests/test_equities.py. Asserts the forward invariant (each exchange code maps to exactly one market label) and fails fast if any future PR re-introduces the kind of drift that produced #133 and the 87 cleanup rows above.

The reverse direction (market → 1 exchange) is intentionally not asserted: one market label legitimately covers several exchange tiers — e.g. "OTC Bulletin Board" covers PNK / OQB / OID / OEM / OQX.
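
The forward invariant can be sketched without any dependencies. This is not the body of `test_exchange_market_one_to_one` (which reads the data through the library); the helper name and the sample rows below are assumptions for illustration.

```python
from collections import defaultdict


def exchanges_with_multiple_markets(rows) -> dict:
    """Map each exchange code to its market labels; keep only violations."""
    labels = defaultdict(set)
    for row in rows:
        exchange, market = row.get("exchange"), row.get("market")
        if exchange and market:
            labels[exchange].add(market)
    # A violation is an exchange code that maps to more than one label.
    return {code: sorted(found) for code, found in labels.items() if len(found) > 1}


sample = [
    {"symbol": "AAA", "exchange": "NMS", "market": "NASDAQ Global Select"},
    {"symbol": "BBB", "exchange": "NCM", "market": "NASDAQ Capital Market"},
    {"symbol": "CCC", "exchange": "NCM", "market": "OTC Bulletin Board"},  # drift
    {"symbol": "DDD", "exchange": "PNK", "market": "OTC Bulletin Board"},
    {"symbol": "EEE", "exchange": "OQB", "market": "OTC Bulletin Board"},
]
print(exchanges_with_multiple_markets(sample))
```

A test built on this would assert that the returned dict is empty. PNK and OQB sharing one label is not reported, matching the decision to leave the reverse direction unasserted.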

Diff shape

| Commit | File | +/- |
|---:|---|---|
| 1 | database/equities.csv (ASE exchange) | 581 / 581 |
| 2 | snapshot regen | 1 / 1 |
| 3 | database/equities.csv (international ISIN) | 16,100 / 16,100 |
| 4 | database/equities.csv (FIGI backfill) | 13,770 / 13,770 |
| 5 | database/equities.csv (market alignment + NGM revert) | 580 / 580 |
| 6 | database/equities.csv (87 consistency fixes) + test + snapshot | ~110 / ~120 |

No rows added or removed. Columns modified: exchange, market, isin, figi, composite_figi, shareclass_figi.

Test plan

  • CI categorization linter passes
  • CI compression linter passes
  • pytest tests/ — 32 tests pass (added 1)
  • test_exchange_market_one_to_one passes (will fail until the compression workflow regenerates compression/equities.bz2 post-merge, since the test reads via the library)
  • Spot-check #133's examples (ARX, ALH): both should now be NYQ with market "New York Stock Exchange"
  • Spot-check ISIN backfill: 000017.SZ should now be CNE0000002Q4
  • Spot-check FIGI backfill: NVDA should now have figi=BBG000BBJQV0
  • Confirm no row was changed where yfinance returned "ASE" — those 257 rows are intentionally still ASE
  • Confirm no row's existing isin or figi was overwritten — only empty cells were filled

Closes #133. Related: #78.

dokson added 3 commits May 18, 2026 22:15
Addresses JerBouma#133. The `exchange` column in `equities.csv` reported "ASE"
for 1,632 rows, but cross-checking each against yfinance revealed only
257 actually belong to NYSE American. The other rows were misclassified
and should report a different exchange code.

Gating:
 * Row must currently have exchange = "ASE"
 * yfinance `Ticker(symbol).info["exchange"]` must be non-null
 * Must match `^[A-Z]{2,5}$` (plausible exchange code)
 * yfinance's value must differ from "ASE"
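
The four gates can be sketched as a pure function. This is illustrative only: `proposed_exchange` is a hypothetical name, and the yfinance lookup (`Ticker(symbol).info["exchange"]`) is assumed to have happened already, so the sketch needs no network access.

```python
import re

CODE_RE = re.compile(r"^[A-Z]{2,5}$")


def proposed_exchange(current: str, yf_value):
    """Return the replacement exchange code, or None when any gate fails."""
    if current != "ASE":
        return None  # gate 1: only rows currently marked ASE are candidates
    if not yf_value:
        return None  # gate 2: yfinance must have returned something
    if not CODE_RE.fullmatch(yf_value):
        return None  # gate 3: must look like a plausible exchange code
    if yf_value == "ASE":
        return None  # gate 4: genuine NYSE American rows stay untouched
    return yf_value


print(proposed_exchange("ASE", "NYQ"))
print(proposed_exchange("ASE", None))
```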

Stats from the 1,632 "ASE" rows:

| yfinance returns | Count | Action |
|---|---:|---|
| NONE / unknown    | 795 | kept as "ASE" (cannot validate) |
| "ASE" (legit NYSE American) | 257 | kept as "ASE" |
| "NYQ" (NYSE main) | 546 | **fixed** -> "NYQ" |
| "PNK" (OTC Pink)  |  12 | **fixed** -> "PNK" |
| "NCM"/"NMS"/"NGM" (Nasdaq tiers) | 15 | **fixed** |
| "OQB"/"OID" (OTC Markets) |  6 | **fixed** |
| "PCX" (NYSE Arca) |   1 | **fixed** |
| errors            |  18 | left as "ASE" |

Total fixed: 580 rows.

Notable corrections include common NYSE main-board listings that JerBouma#133
explicitly flagged as wrong (ARX, ALH, etc., all now "NYQ"). No row
was overwritten where yfinance returned "ASE" — the 257 genuine NYSE
American listings remain untouched.

Source: yfinance `Ticker(symbol).info["exchange"]`. The 795 rows yfinance
couldn't resolve are mostly delisted preferred stocks or thinly-covered
exotic tickers — best left as-is until a different data source is added.

Related: JerBouma#133
Where yfinance returned no ISIN (yfinance has thin coverage of non-US
exchanges), publicly accessible exchange listing data provides the
canonical ISIN. This commit fills 16,100 empty `isin` cells in
equities.csv across 30 international markets.

Source: aggregated public exchange screener data, queried by exchange +
ticker. The ISINs returned are matched 1:1 with the ticker symbol on
each market.

Markets covered (FD suffix -> market):

| Market | ISINs filled |
|---|---:|
| Japan (.T) | 2,460 |
| India (.BO) | 2,862 |
| India (.NS) | 1,390 |
| China (.SZ) | 1,922 |
| China (.SS) | 1,396 |
| Korea (.KQ) | 931 |
| Korea (.KS) | 515 |
| Canada (.V) | 616 |
| Australia (.AX) | 605 |
| Indonesia (.JK) | 516 |
| UK (.L) | 484 |
| Thailand (.BK) | 463 |
| Hong Kong (.HK) | 319 |
| Canada (.TO) | 258 |
| France (.PA) | 243 |
| Brazil (.SA) | 212 |
| Sweden (.ST) | 208 |
| Switzerland (.SW) | 101 |
| Germany (.F/.MU/.DU/.BE) | 322 |
| Other (Norway, Netherlands, Spain, Italy, Mexico, Vietnam, Austria, Singapore) | ~870 |

Gating (same as the yfinance backfill in JerBouma#139):
 * ISIN regex `^[A-Z]{2}[A-Z0-9]{9}[0-9]$`
 * ISIN Mod-10 Double-Add-Double check digit
 * Skip rows where FD already has an `isin` value (zero overwrite)
 * Match by exchange-suffix-mapped-to-market + naked ticker

Stats: 16,100 filled, 0 rejected by checksum.

Markets explicitly NOT covered:
 * `.NX`, `.SG`, `.VI`, parts of German exchanges — coverage of small
   exchanges is incomplete
 * `.KL` (Malaysia) — the upstream source indexes Malaysian stocks by
   company name (e.g. "MAYBANK") whereas FD uses numeric codes (e.g.
   "0007.KL"); cannot be matched without a separate name->code map
 * `.US` (US tickers) — already covered by JerBouma#139 (yfinance)

CUSIP derivation: not applicable for these rows since they are
predominantly non-US/CA ISINs whose middle 9 characters are local
national numbering systems (Japanese Securities Code, SEDOL, WKN, etc.)
rather than CUSIPs.

Related: JerBouma#78
@dokson dokson changed the title Fix 580 misclassified exchanges from "ASE" to correct yfinance value (closes #133) Fix "ASE" exchange (closes #133) + backfill 16,100 international ISINs May 18, 2026
@dokson dokson changed the title Fix "ASE" exchange (closes #133) + backfill 16,100 international ISINs ASE exchange fix + ISIN/FIGI backfill from public data May 18, 2026
Populates `figi`, `composite_figi`, and `shareclass_figi` for US equity
rows that had those columns empty. FIGI (Financial Instrument Global
Identifier) is Bloomberg's openly-licensed identifier standard for
financial instruments — the only freely-licensed global identifier
(ISIN/CUSIP/SEDOL are paywalled standards).

Source: public FIGI data from Bloomberg's OpenFIGI initiative, mirrored
through publicly accessible exchange listing data.

Stats:
 * 13,769 figi cells filled (was: 18,114 empty in US rows)
 * 13,469 composite_figi cells filled
 * 13,043 shareclass_figi cells filled
 * 13,770 unique rows touched
 * Diff: 13,770 insertions / 13,770 deletions

Gating:
 * FIGI regex `^BBG[0-9A-Z]{9}$`
 * Skip rows where the column is already populated (zero overwrite)
 * Match by exact symbol

Why FIGI matters: filling FD's FIGI columns enables downstream cross-
reference (FIGI -> ISIN, FIGI -> CUSIP, FIGI -> SEDOL) via OpenFIGI's
free public API.

Not covered:
 * Non-US exchanges -- left for follow-up
 * Tickers using "." instead of "-" (e.g. BRK.A vs BRK-A) -- a handful
   of share-class tickers whose format diverges between sources
@JerBouma
Owner

Looks good! Outside of the exchange column, did you also verify the market column? This should align with all changes of the exchanges.

@dokson
Contributor Author

dokson commented May 18, 2026

> Looks good! Outside of the exchange column, did you also verify the market column? This should align with all changes of the exchanges.

give me a sec

Follow-up to maintainer review on JerBouma#143: the `market` column was left
unchanged when the `exchange` codes were corrected. Aligned 576 rows
so `market` matches the canonical FD value for each new `exchange`:

  NYQ -> "New York Stock Exchange"
  NMS -> "NASDAQ Global Select"
  NCM -> "NASDAQ Capital Market"
  PNK / OQB / OID -> "OTC Bulletin Board"
  PCX -> "NYSE Arca"

Reverted 4 rows whose exchange had been set to "NGM":
  * yfinance returned "NGM" meaning NASDAQ Global Market
  * but FD's "NGM" code is already used for "Nordic Growth Market"
    (Sweden) -- 270 existing rows
  * to avoid creating an ambiguous code, those 4 rows go back to "ASE"
    until FD adds a distinct code for NASDAQ Global Market

Symbols reverted: BEEP, BHIL, CHACU, VOLT.

Total fixed-by-this-PR is now 576 (down from 580).
dokson added a commit to dokson/FinanceDatabase that referenced this pull request May 18, 2026
Per maintainer feedback on JerBouma#143: keep `exchange` and `market` columns
in lock-step. Fixed 87 rows where the short code and the human-readable
label did not agree.

Category A — exchange code non-canonical, market is correct (84 rows):
  * NAS (12 rows, market "NASDAQ Global Select")    -> NMS
  * NYS (72 rows, market "New York Stock Exchange") -> NYQ

Category B — market label is wrong, exchange is correct (3 rows):
  * BTS-CATG (Leverage Shares 2X Long CAT Daily ETF on BATS BZX):
      market "OTC Bulletin Board" -> "BATS BZX Exchange"
      (Verified via Leverage Shares / Robinhood / Benzinga.)
  * NCM-CAPS (Capstone Holding Corp. on NASDAQ Capital Market):
      market "OTC Bulletin Board" -> "NASDAQ Capital Market"
  * NSI-LT.NS (Larsen & Toubro on NSE India, .NS suffix confirms):
      market "Metropolitan Stock Exchange" -> "National Stock Exchange of India"

After the fix every exchange code in equities.csv maps to exactly one
market label.

Added `tests/test_equities.py::test_exchange_market_one_to_one` that
asserts the forward direction (exchange -> 1 market) so future PRs
cannot silently re-introduce the kind of drift fixed here. The reverse
direction is intentionally not asserted: a market label may legitimately
cover several exchange tiers (e.g. "OTC Bulletin Board" covers
PNK / OQB / OID / OEM / OQX).
@dokson dokson force-pushed the feature/fix-ase-exchange branch from fd298dc to 46db4bb Compare May 18, 2026 21:58
@dokson dokson force-pushed the feature/fix-ase-exchange branch from 46db4bb to faa411d Compare May 18, 2026 22:01
@dokson
Contributor Author

dokson commented May 18, 2026

> Looks good! Outside of the exchange column, did you also verify the market column? This should align with all changes of the exchanges.

done! also added a new test that caught a couple of issues 😉

@JerBouma
Owner

Perfect, thanks again. Merging.

@JerBouma JerBouma merged commit 16717b8 into JerBouma:main May 18, 2026
3 checks passed
@dokson dokson deleted the feature/fix-ase-exchange branch May 18, 2026 22:06
dokson added a commit to dokson/FinanceDatabase that referenced this pull request May 18, 2026
…ures, docstrings, local data

Sets a cleaner baseline for the test suite so future data and code PRs
are easier to develop, review, and triage. All changes confined to
`tests/` (plus a one-character typo fix in `database/etfs.csv`).
**All 32 tests pass.**

## Snapshot diffs that actually tell you what changed

The `test_show_options*.json` snapshots are JSON arrays of category labels (countries, sectors, currencies, …) — single-line strings of 1-2 KB. When a database update added/removed one entry, the GitHub diff was one giant string change with no way to see what differed.

```diff
- string_value = json.dumps(data, **kwargs)
+ string_value = json.dumps(data, indent=2, ensure_ascii=False, **kwargs)
```

`ensure_ascii=False` also means non-ASCII characters render natively (`Côte d'Ivoire` instead of escape sequences).
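
A small stand-alone illustration of the effect, using a made-up two-entry list rather than a real snapshot:

```python
import json

data = ["Belgium", "Côte d'Ivoire"]

# Default: one line, non-ASCII characters escaped.
compact = json.dumps(data)
# With the new arguments: one entry per line, characters kept as-is.
pretty = json.dumps(data, indent=2, ensure_ascii=False)

print(compact)
print(pretty)
```

With one entry per line, adding or removing a category shows up in a review as a single added or removed line.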

## Unified-diff on snapshot mismatch

`Recorder.assert_equal()` previously raised an `AssertionError: Change detected` with two truncated 500-char strings, which often looked identical in the visible part even when they differed further on. We hit this exact issue while developing JerBouma#141 and JerBouma#143.

Now `assert_equal()` produces a proper `difflib.unified_diff` showing which line(s) changed; `assert_in_list` gained a proper assertion message.
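
A minimal sketch of how such a message can be built with the standard library. The helper name `snapshot_diff` and the `expected` / `actual` labels are assumptions; the actual `Recorder` implementation may differ.

```python
import difflib


def snapshot_diff(expected: str, actual: str) -> str:
    """Return a unified diff of two snapshot strings ('' when identical)."""
    lines = difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile="expected",
        tofile="actual",
        lineterm="",
    )
    return "\n".join(lines)


old = '[\n  "Belgium",\n  "Brazil"\n]'
new = '[\n  "Belgium",\n  "Canada"\n]'
print(snapshot_diff(old, new))
```

Raising `AssertionError` with this string as the message points straight at the changed line, instead of leaving the reader to compare two truncated strings by eye.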

## Tests now read the PR branch's data, not main

The library defaults to fetching `compression/*` from the GitHub `main` branch over HTTP. This meant tests on a PR validated the data on `main`, not the data in the PR — silently passing data-breaking PRs and failing on data-fixing PRs.

Two cooperating fixes:
- Every `tests/test_*.py` now instantiates with `use_local_location=True` so the library reads the local checkout.
- `tests/conftest.py` regenerates `compression/*.bz2` and `compression/categories/*.gzip` from the checked-out `database/*.csv` at import time, mirroring the production `database_update.yml` workflow. This must run *before* pytest collects the test modules (test-module imports instantiate `fd.X(use_local_location=True)`), so it is a top-level statement, not a fixture.

The compression artifacts themselves are not committed in this PR — they are deterministically derived from the CSVs and the test suite regenerates them on every run.

## Test module quality (incidental cleanup)

Since the regen touched every test file, took the opportunity to:
- Fix wrong module docstrings (`test_equities.py` and `test_moneymarkets.py` both started with `"""Currencies Test Module"""`).
- Replace `# pylint: disable=missing-function-docstring` with real docstrings on every test function in the 7 asset-class files and in `test_sec_enrichment_controller.py`.
- Add type hints `(recorder: Recorder) -> None`; `Recorder` import under `TYPE_CHECKING`, guarded by `from __future__ import annotations` — zero runtime cost.

## conftest.py cleanup

- Removed `# pylint: skip-file` so static analysis is no longer suppressed.
- Cast `request.config.getoption("--rewrite-expected")` return through `bool(...)` to satisfy the declared `-> bool` annotation (previously `Any | None`).

## Test infrastructure config (pyproject.toml)

- Dropped `pytest-recording` from dev deps — unused; the real dep used by `conftest.py` is `pytest-recorder` (which provides the `record_mode` / `disable_recording` fixtures).
- Declared `[tool.pytest.ini_options]` `testpaths = ["tests"]`, `addopts = "--strict-markers"`, and the `record_stdout` marker so future PRs can't silently introduce undeclared markers.

## Side fix: ASYMshsare → ASYMshares (etfs.csv + tests)

While running the new local-data tests we noticed `ASPY` (Leverage Shares ASYMmetric 500 ETF) had `family = "ASYMshsare"` — a typo in `database/etfs.csv`. Corrected to `ASYMshares` along with the references in `tests/test_etfs.py` and the affected fixture snapshots. The maintainer's `Update Compression Files` workflow will regenerate the binary artifacts on `main` post-merge.

## What this PR explicitly does NOT do

Deferred to focused follow-ups (out of scope here):
- `@pytest.mark.parametrize` / class-based grouping — would change pytest node IDs and break `Recorder.capture()` positional snapshot naming (renames 90+ files in one go).
- `pytest-xdist` parallel execution — adds a new dev dep; pure perf win.
- Wiring `pytest-cov` (already in dev deps) into CI for coverage reporting.
- Filling the 22% coverage gap on `financedatabase/helpers.py` (`FinanceFrame.to_toolkit`, `show_options` URL error path, `case_sensitive=True`).
dokson added a commit to dokson/FinanceDatabase that referenced this pull request May 18, 2026
…ures, docstrings, local data

Sets a cleaner baseline for the test suite so future data and code PRs
are easier to develop, review, and triage. All changes confined to
`tests/` (plus a one-character typo fix in `database/etfs.csv`).
**All 32 tests pass.**

## Snapshot diffs that actually tell you what changed

The `test_show_options*.json` snapshots are JSON arrays of category labels (countries, sectors, currencies, …) — single-line strings of 1-2 KB. When a database update added/removed one entry, the GitHub diff was one giant string change with no way to see what differed.

```diff
- string_value = json.dumps(data, **kwargs)
+ string_value = json.dumps(data, indent=2, ensure_ascii=False, **kwargs)
```

`ensure_ascii=False` also means non-ASCII characters render natively (`Côte d'Ivoire` instead of escape sequences).

## Unified-diff on snapshot mismatch

`Recorder.assert_equal()` previously raised a `AssertionError: Change detected` with two truncated 500-char strings, often appearing identical in the visible part even when they actually differed later. We hit this exact issue while developing JerBouma#141 and JerBouma#143.

Now `assert_equal()` produces a proper `difflib.unified_diff` showing which line(s) changed; `assert_in_list` gained a proper assertion message.

## Tests now read the PR branch's data, not main

The library defaults to fetching `compression/*` from the GitHub `main` branch over HTTP. This meant tests on a PR validated the data on `main`, not the data in the PR — silently passing data-breaking PRs and failing on data-fixing PRs.

Two cooperating fixes:
- Every `tests/test_*.py` now instantiates with `use_local_location=True` so the library reads the local checkout.
- `tests/conftest.py` regenerates `compression/*.bz2` and `compression/categories/*.gzip` from the checked-out `database/*.csv` at import time, mirroring the production `database_update.yml` workflow. This must run *before* pytest collects the test modules (test-module imports instantiate `fd.X(use_local_location=True)`), so it is a top-level statement, not a fixture.

The compression artifacts themselves are not committed in this PR — they are deterministically derived from the CSVs and the test suite regenerates them on every run.

## Test module quality (incidental cleanup)

Since the regen touched every test file, took the opportunity to:
- Fix wrong module docstrings (`test_equities.py` and `test_moneymarkets.py` both started with `"""Currencies Test Module"""`).
- Replace `# pylint: disable=missing-function-docstring` with real docstrings on every test function in the 7 asset-class files and in `test_sec_enrichment_controller.py`.
- Add type hints `(recorder: Recorder) -> None`; `Recorder` import under `TYPE_CHECKING`, guarded by `from __future__ import annotations` — zero runtime cost.

## conftest.py cleanup

- Removed `# pylint: skip-file` so static analysis is no longer suppressed.
- Cast `request.config.getoption("--rewrite-expected")` return through `bool(...)` to satisfy the declared `-> bool` annotation (previously `Any | None`).

## Test infrastructure config (pyproject.toml)

- Dropped `pytest-recording` from dev deps — unused; the real dep used by `conftest.py` is `pytest-recorder` (which provides the `record_mode` / `disable_recording` fixtures).
- Declared `[tool.pytest.ini_options]` `testpaths = ["tests"]`, `addopts = "--strict-markers"`, and the `record_stdout` marker so future PRs can't silently introduce undeclared markers.

## Side fix: ASYMshsare → ASYMshares (etfs.csv + tests)

While running the new local-data tests we noticed `ASPY` (Leverage Shares ASYMmetric 500 ETF) had `family = "ASYMshsare"` — a typo in `database/etfs.csv`. Corrected to `ASYMshares` along with the references in `tests/test_etfs.py` and the affected fixture snapshots. The maintainer's `Update Compression Files` workflow will regenerate the binary artifacts on `main` post-merge.

## What this PR explicitly does NOT do

Deferred to focused follow-ups (out of scope here):
- `@pytest.mark.parametrize` / class-based grouping — would change pytest node IDs and break `Recorder.capture()` positional snapshot naming (renames 90+ files in one go).
- `pytest-xdist` parallel execution — adds a new dev dep; pure perf win.
- Wiring `pytest-cov` (already in dev deps) into CI for coverage reporting.
- Filling the 22% coverage gap on `financedatabase/helpers.py` (`FinanceFrame.to_toolkit`, `show_options` URL error path, `case_sensitive=True`).
dokson added a commit to dokson/FinanceDatabase that referenced this pull request May 18, 2026
…ures, docstrings, local data

Sets a cleaner baseline for the test suite so future data and code PRs
are easier to develop, review, and triage. All changes confined to
`tests/` (plus a one-character typo fix in `database/etfs.csv`).
**All 32 tests pass.**

## Snapshot diffs that actually tell you what changed

The `test_show_options*.json` snapshots are JSON arrays of category labels (countries, sectors, currencies, …) — single-line strings of 1-2 KB. When a database update added/removed one entry, the GitHub diff was one giant string change with no way to see what differed.

```diff
- string_value = json.dumps(data, **kwargs)
+ string_value = json.dumps(data, indent=2, ensure_ascii=False, **kwargs)
```

`ensure_ascii=False` also means non-ASCII characters render natively (`Côte d'Ivoire` instead of escape sequences).

## Unified-diff on snapshot mismatch

`Recorder.assert_equal()` previously raised a `AssertionError: Change detected` with two truncated 500-char strings, often appearing identical in the visible part even when they actually differed later. We hit this exact issue while developing JerBouma#141 and JerBouma#143.

Now `assert_equal()` produces a proper `difflib.unified_diff` showing which line(s) changed; `assert_in_list` gained a proper assertion message.

## Tests now read the PR branch's data, not main

The library defaults to fetching `compression/*` from the GitHub `main` branch over HTTP. This meant tests on a PR validated the data on `main`, not the data in the PR — silently passing data-breaking PRs and failing on data-fixing PRs.

Two cooperating fixes:
- Every `tests/test_*.py` now instantiates with `use_local_location=True` so the library reads the local checkout.
- `tests/conftest.py` regenerates `compression/*.bz2` and `compression/categories/*.gzip` from the checked-out `database/*.csv` at import time, mirroring the production `database_update.yml` workflow. This must run *before* pytest collects the test modules (test-module imports instantiate `fd.X(use_local_location=True)`), so it is a top-level statement, not a fixture.

The compression artifacts themselves are not committed in this PR — they are deterministically derived from the CSVs and the test suite regenerates them on every run.

## Test module quality (incidental cleanup)

Since the regen touched every test file, took the opportunity to:
- Fix wrong module docstrings (`test_equities.py` and `test_moneymarkets.py` both started with `"""Currencies Test Module"""`).
- Replace `# pylint: disable=missing-function-docstring` with real docstrings on every test function in the 7 asset-class files and in `test_sec_enrichment_controller.py`.
- Add type hints `(recorder: Recorder) -> None`; `Recorder` import under `TYPE_CHECKING`, guarded by `from __future__ import annotations` — zero runtime cost.

## conftest.py cleanup

- Removed `# pylint: skip-file` so static analysis is no longer suppressed.
- Cast `request.config.getoption("--rewrite-expected")` return through `bool(...)` to satisfy the declared `-> bool` annotation (previously `Any | None`).

## Test infrastructure config (pyproject.toml)

- Dropped `pytest-recording` from dev deps — unused; the real dep used by `conftest.py` is `pytest-recorder` (which provides the `record_mode` / `disable_recording` fixtures).
- Declared `[tool.pytest.ini_options]` `testpaths = ["tests"]`, `addopts = "--strict-markers"`, and the `record_stdout` marker so future PRs can't silently introduce undeclared markers.

## Side fix: ASYMshsare → ASYMshares (etfs.csv + tests)

While running the new local-data tests we noticed `ASPY` (Leverage Shares ASYMmetric 500 ETF) had `family = "ASYMshsare"` — a typo in `database/etfs.csv`. Corrected to `ASYMshares` along with the references in `tests/test_etfs.py` and the affected fixture snapshots. The maintainer's `Update Compression Files` workflow will regenerate the binary artifacts on `main` post-merge.

## What this PR explicitly does NOT do

Deferred to focused follow-ups (out of scope here):
- `@pytest.mark.parametrize` / class-based grouping — would change pytest node IDs and break `Recorder.capture()` positional snapshot naming (renames 90+ files in one go).
- `pytest-xdist` parallel execution — adds a new dev dep; pure perf win.
- Wiring `pytest-cov` (already in dev deps) into CI for coverage reporting.
- Filling the 22% coverage gap on `financedatabase/helpers.py` (`FinanceFrame.to_toolkit`, `show_options` URL error path, `case_sensitive=True`).
dokson added a commit to dokson/FinanceDatabase that referenced this pull request May 18, 2026
…ures, docstrings, local data

dokson added a commit to dokson/FinanceDatabase that referenced this pull request May 18, 2026
Two passes:

**Pass 1 (deterministic, 17,240 rows)**: for every row with `exchange`
empty AND a recognisable ticker suffix (`.NX`, `.F`, `.DU`, `.MI`, ...),
derive the canonical exchange code from rows that already have that
same suffix populated. No external API. For each suffix, a single
exchange accounts for >=99% of the already-populated rows (verified
before applying):

| Suffix | Exchange | Rows |
|---|---|---:|
| .NX | ENX (Euronext) | 11,364 |
| .F | FRA (Frankfurt) | 3,068 |
| .DU | DUS (Dusseldorf) | 624 |
| .BE | BER (Berlin) | 370 |
| .MI | MIL (Borsa Italiana) | 349 |
| .MU | MUN (Munich) | 247 |
| .SG | STU (Stuttgart) | 218 |
| .OL | OSL (Oslo) | 115 |
| .TI | TLO (EuroTLX) | 115 |
| .ST | STO (NASDAQ OMX Stockholm) | 89 |
| .BO | BSE (BSE India) | 75 |
| .PA | PAR (Euronext Paris) | 70 |
| .VI | VIE (Vienna) | 68 |
| .MX | MEX (Mexico) | 63 |
| Other suffixes | various | ~405 |
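
The suffix-based pass described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script used for the commit: rows are plain dicts with `symbol` and `exchange` keys, and the helper names and the `min_share` parameter are made up for the example.

```python
from collections import Counter, defaultdict


def build_suffix_map(rows, min_share=0.99):
    """Map each ticker suffix to its dominant exchange code.

    A suffix is kept only if one exchange accounts for at least
    `min_share` of the rows where that suffix already has an exchange.
    """
    counts = defaultdict(Counter)
    for row in rows:
        symbol, exchange = row["symbol"], row["exchange"]
        if exchange and "." in symbol:
            suffix = "." + symbol.rsplit(".", 1)[1]
            counts[suffix][exchange] += 1
    mapping = {}
    for suffix, counter in counts.items():
        exchange, hits = counter.most_common(1)[0]
        if hits / sum(counter.values()) >= min_share:
            mapping[suffix] = exchange
    return mapping


def fill_empty_exchanges(rows, mapping):
    """Fill only rows whose exchange is empty; return how many were filled."""
    filled = 0
    for row in rows:
        symbol = row["symbol"]
        if not row["exchange"] and "." in symbol:
            suffix = "." + symbol.rsplit(".", 1)[1]
            if suffix in mapping:
                row["exchange"] = mapping[suffix]
                filled += 1
    return filled
```

Rows that already carry an exchange are never touched, and a suffix whose populated rows disagree too often is left out of the mapping entirely.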

**Pass 2 (external resolution, 12 rows)**: for the 98 remaining rows
(US-style symbols with no suffix and every other column NaN — failed
scrapes), query yfinance and Finnhub. yfinance resolved 9; Finnhub's
US universe endpoint resolved 3 more (MIC mapping: XNYS->NYQ, XNAS->NMS,
OOTC->PNK). TradingView returned no additional matches on this set.

Stats:
 * 17,240 deterministic suffix-based fills
 *     12 external resolution fills (9 yfinance + 3 Finnhub)
 * 17,252 `exchange` cells filled (was: 17,338 empty before)
 * 17,252 `market` cells also filled (inherited from canonical
                                      `exchange -> market` mapping)
 *     86 rows still empty: US-style symbols not found in any free
        data source (delisted warrants/units like `BFT-WT`, `CAS'U`,
        `CCX'U`, etc.). Out of scope.

Validation: passes `test_exchange_market_one_to_one` (introduced in
JerBouma#143) — every exchange code in `equities.csv` still maps to exactly
one market label.
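
For readers unfamiliar with the invariant, a check of this kind could be sketched as below. It is not the body of `test_exchange_market_one_to_one`; the helper name is hypothetical and rows are assumed to be dicts with `exchange` and `market` keys.

```python
from collections import defaultdict


def exchanges_with_multiple_markets(rows):
    """Return {exchange: sorted markets} for exchanges seen with more than one market."""
    markets = defaultdict(set)
    for row in rows:
        if row["exchange"] and row["market"]:
            markets[row["exchange"]].add(row["market"])
    return {ex: sorted(ms) for ex, ms in markets.items() if len(ms) > 1}
```

A test would then assert that the returned dict is empty, so any offending exchange code is named directly in the failure output.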

No row with a pre-existing populated `exchange` was modified. Only
empty cells filled.
JerBouma pushed a commit that referenced this pull request May 19, 2026
JerBouma pushed a commit that referenced this pull request May 19, 2026
…ures, docstrings, local data (#140)

dokson added a commit to dokson/FinanceDatabase that referenced this pull request May 19, 2026
…+ automated README stats

Closes the long-standing data-shape gap where ETFs and Funds had no
country field, plus a series of related data-quality fixes surfaced by
the new cross-asset invariant test.

## Schema change: new `country` column on etfs.csv and funds.csv

Derived deterministically by chaining `exchange -> country` from
equities.csv (the source of truth for that mapping), with three small
manual overrides for fund-only exchange codes that don't appear in
equities.csv (`NAS`/`NYM`/`CME` -> United States) plus `NIM` for ETF
NextShares (also US). A second-pass ticker-suffix fallback handles
rows whose `exchange` was missing or corrupted.

Coverage after the fill:

| File | Rows | country populated |
|---|---:|---|
| `etfs.csv` | 36,485 | 100.00% |
| `funds.csv` | 57,853 | 100.00% |

Top 5 countries:

- **ETFs**: United States (17,723), Canada (6,688), Germany (5,598),
  United Kingdom (2,137), Switzerland (1,463) — 33 distinct countries
- **Funds**: United States (47,643), Spain (5,383), Canada (1,898),
  United Kingdom (1,769), India (496) — 24 distinct countries

## Data-quality fixes surfaced by the new invariant tests

- **Removed 14 non-ETF rows** from `etfs.csv` (^REIT plus 13 US-equity
  duplicates that existed correctly in `equities.csv` already:
  `BHF`, `BHFAN`, `BHFAO`, `BHFAP`, `DTB`, `DTE`, `DTP`, `HTGC`, `PBC`,
  `PSEC`, `RGA`, `RZA`, `TPVG`).
- **Removed 56 cross-asset symbol collisions** between `equities.csv`
  and `etfs.csv` — all 56 were corporate bonds / senior notes / equity
  share-class rows that had ended up in `etfs.csv` (Brighthouse,
  Corvus Gold, Great Ajax Corp. notes, Argo notes, CMS Energy junior
  subordinated notes, Conifer Holdings senior notes, Qwest Corp notes,
  DTE Energy variants, ASGI = Aberdeen Global Infrastructure Income
  Fund, etc.).
- **Fixed 29 ETF rows with corrupted `exchange` values** (issuer name
  written into the exchange column instead of a real exchange code:
  `Xtrackers`, `Fundlogic`, `Purpose Investments`, `CI Investments`,
  `Horizons ETFs Management`, `Harvest Portfolios Group`,
  `IA Clarington Investments`, `National Bank Investments`,
  `Caldwell Investment Management`, `Developed Markets`,
  `Emerging Markets`, `High Yield Bonds`). Re-derived from the ticker
  suffix.
- **Completed FSST** (Fidelity Sustainable U.S. Equity ETF) — the row
  previously had every field NaN. Filled name, currency, summary,
  category_group, category, family, exchange (PCX), country, isin.

After this round: 0 cross-asset symbol collisions, 0 corrupted
exchange values flagged by the new invariants.

## API additions

`ETFs.select()`, `ETFs.show_options()`, `Funds.select()`, and
`Funds.show_options()` now accept a `country` parameter, validated
against the new column. Calling with an unknown country raises
`ValueError` matching the pattern used by the existing filters.

## New tests

In `tests/test_etfs.py` and `tests/test_funds.py`:

- `test_exchange_country_one_to_one` — asserts that every `exchange`
  code on the asset maps to exactly one `country` value (the same
  invariant introduced for equities `exchange -> market` in JerBouma#143).
- `test_select_with_invalid_value_raises` now also exercises the
  `country` filter ValueError path.

New file `tests/test_invariants.py`:

- `test_no_symbol_collisions_across_asset_classes` — asserts that a
  given `symbol` belongs to at most one of `equities.csv`,
  `etfs.csv`, `funds.csv`. Catches the kind of drift fixed by the
  56-row cleanup above before it can land on `main` again.
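
A sketch of how such a cross-asset check might be written, assuming the symbols of each CSV have already been loaded into a dict keyed by asset class. The helper name `find_symbol_collisions` is hypothetical and this is not the code in `tests/test_invariants.py`.

```python
from itertools import combinations


def find_symbol_collisions(symbols_by_asset):
    """Return {(asset_a, asset_b): sorted shared symbols} for every overlapping pair."""
    collisions = {}
    for (name_a, syms_a), (name_b, syms_b) in combinations(symbols_by_asset.items(), 2):
        shared = set(syms_a) & set(syms_b)
        if shared:
            collisions[(name_a, name_b)] = sorted(shared)
    return collisions
```

Asserting that the result is empty gives a failure message listing exactly which symbols appear in which pair of files.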

## Automated README statistics

`.github/workflows/database_update.yml` gains a new
`Update-README-Statistics` job that runs after the existing
Add-New-Ticker / Update-Compression / Update-Categorization jobs. It
recomputes both statistics tables from the on-disk CSVs and rewrites
README.md in-place, committing the result. The Check-GICS job now
also depends on it.

Replaced the meaningless `Countries` numbers for ETFs/Funds (which
were not backed by any column) with the now-real value derived from
the new `country` column. Other numbers also refreshed to current
state:

| Product           | Quantity   | Sectors    | Industries    | Countries | Exchanges |
|-------------------|-----------:|-----------:|--------------:|----------:|----------:|
| Equities          | 160.113    | 11         | 62            | 113       | 84        |
| ETFs              | 36.485     | 320        | 51            | 33        | 51        |
| Funds             | 57.853     | 1.540      | 74            | 24        | 33        |

| Product           | Quantity  | Category              |
|-------------------|----------:|-----------------------|
| Currencies        | 2.556     | 175 Currencies        |
| Cryptocurrencies  | 3.367     | 351 Cryptocurrencies  |
| Indices           | 91.178    | 63 Exchanges          |
| Money Markets     | 1.367     | 2 Exchanges           |

For ETFs/Funds the `Sectors` column = `family` count and the
`Industries` column = `category` count, which is how those numbers
were always interpreted in the README — now backed by real columns.

## Test plan

- [ ] `pytest tests/` — 52 tests pass (was 49 before this PR)
- [ ] `test_exchange_country_one_to_one` passes for ETFs and Funds
- [ ] `test_no_symbol_collisions_across_asset_classes` passes
- [ ] CI Update-README-Statistics job rewrites README correctly
- [ ] `black --check tests/ financedatabase/` clean
- [ ] Spot-check: `fd.ETFs(use_local_location=True).show_options(selection="country")` returns 33 countries
- [ ] Spot-check: `fd.Funds(use_local_location=True).select(country="Spain")` returns ~5,383 rows
dokson added a commit to dokson/FinanceDatabase that referenced this pull request May 19, 2026
…ME stats

Per maintainer feedback on the initial version, this PR drops the
`country` column addition for ETFs/Funds — the country semantics for
those asset classes (investment scope, not listing-exchange country)
need a richer signal than `exchange -> country` can give. What
remains is still substantial:

## 1. Data-quality fixes on `etfs.csv` (surfaced by the new invariant)

- **14 non-ETF rows removed**: `^REIT` plus 13 US-equity duplicates
  that already existed correctly in `equities.csv` (`BHF`, `BHFAN`,
  `BHFAO`, `BHFAP`, `DTB`, `DTE`, `DTP`, `HTGC`, `PBC`, `PSEC`, `RGA`,
  `RZA`, `TPVG`).
- **56 cross-asset symbol collisions removed** — all 56 were corporate
  bonds / senior notes / equity share-class rows misclassified as ETFs
  (Brighthouse, Corvus Gold, Great Ajax Corp. notes, Argo notes,
  CMS Energy junior subordinated notes, Conifer Holdings senior notes,
  Qwest Corp notes, DTE Energy variants, `ASGI` = Aberdeen Global
  Infrastructure Income Fund, etc.).
- **29 ETF rows with corrupted `exchange` values fixed** — issuer name
  written into the exchange column instead of a real exchange code
  (`Xtrackers`, `Fundlogic`, `Purpose Investments`, `CI Investments`,
  `Horizons ETFs Management`, `Harvest Portfolios Group`,
  `IA Clarington Investments`, `National Bank Investments`,
  `Caldwell Investment Management`, `Developed Markets`,
  `Emerging Markets`, `High Yield Bonds`). Re-derived from ticker
  suffix.
- **`FSST` completed** (Fidelity Sustainable U.S. Equity ETF) — row
  previously had every field NaN. Filled name, currency, summary,
  category_group, category, family, exchange (`PCX`), isin.

## 2. New cross-asset invariant test

`tests/test_invariants.py::test_no_symbol_collisions_across_asset_classes`
asserts that a given `symbol` belongs to at most one of `equities.csv`,
`etfs.csv`, `funds.csv`. This is the test that surfaced the 56 + 14 = 70
cleanup rows above; it would have caught all of them before they
landed on `main`.

## 3. Automated README statistics

`.github/workflows/database_update.yml` gains a new
`Update-README-Statistics` job that runs after the existing
`Add-New-Ticker` / `Update-Compression-Files` / `Update-Categorization-Files`
jobs. It recomputes the statistics tables from the on-disk CSVs and
rewrites `README.md` in-place, committing the result. The
`Check-GICS-Categorisation` job now also depends on it.

README is now split into three tables that reflect the actual schema
of each asset class rather than the previous combined-table layout
whose `Countries` column for ETFs/Funds had no backing data:

```
Table A — Equities only (has country, sector, industry)
| Product  | Quantity | Sectors | Industries | Countries | Exchanges |
| Equities | 160.113  | 11      | 62         | 113       | 84        |

Table B — ETFs / Funds (no country: no schema for it)
| Product | Quantity | Families | Categories | Exchanges |
| ETFs    | 36.485   | 320      | 51         | 51        |
| Funds   | 57.853   | 1.540    | 74         | 33        |

Table C — Currencies / Cryptos / Indices / Money Markets
| Product           | Quantity | Category              |
| Currencies        | 2.556    | 175 Currencies        |
| Cryptocurrencies  | 3.367    | 351 Cryptocurrencies  |
| Indices           | 91.178   | 63 Exchanges          |
| Money Markets     | 1.367    | 2 Exchanges           |
```

The B-table columns are renamed from the previous "Sectors / Industries"
(which were really family/category counts) to be honest about what the
numbers represent.
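
To make the recompute-and-rewrite step concrete, here is a small sketch under several assumptions: rows are dicts loaded from a CSV, the README section is delimited by marker comments (the marker strings in the usage note are invented for this example), and all helper names are hypothetical. The actual workflow job may be implemented quite differently.

```python
import re


def format_count(value: int) -> str:
    """Format an integer with dot thousands separators, as in the tables above."""
    return f"{value:,}".replace(",", ".")


def stats_row(product, rows, columns):
    """Build one table row: product, row count, then distinct non-empty values per column."""
    cells = [product, format_count(len(rows))]
    for column in columns:
        distinct = {row[column] for row in rows if row.get(column)}
        cells.append(format_count(len(distinct)))
    return "| " + " | ".join(cells) + " |"


def replace_between_markers(text, start, end, new_body):
    """Swap whatever sits between the start and end markers for new_body."""
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    # A callable replacement avoids backslash interpretation in new_body.
    return pattern.sub(lambda _: f"{start}\n{new_body}\n{end}", text, count=1)
```

For instance, `stats_row("ETFs", etf_rows, ["family", "category", "exchange"])` would yield one line of Table B, which `replace_between_markers` could then place between markers such as `<!-- stats:start -->` and `<!-- stats:end -->`.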

## What this PR explicitly does NOT do

- **No `country` column on `etfs.csv` / `funds.csv`.** Per
  @JerBouma's feedback, "country" for ETFs/Funds should reflect the
  investment scope (e.g. iShares MSCI World ETF listed on NYSE has
  global scope, not "United States"), not the listing exchange. The
  obvious deterministic source (exchange -> country) doesn't fit that
  semantics. Deferred to a focused discussion / future PR if the
  community decides on a definition.
- **No country fill on missing equities.** 80k equity rows still have
  empty `country`; filling them from listing exchange would be wrong
  for the same reason ASML on Nasdaq stays "Netherlands". Needs an
  actual issuer-domicile source.

## Test plan

- [ ] `pytest tests/` — 50 tests pass (was 49 before; the new
      `test_no_symbol_collisions_across_asset_classes` is the +1)
- [ ] `test_no_symbol_collisions_across_asset_classes` passes
- [ ] `black --check tests/ financedatabase/` clean
- [ ] CI `Update-README-Statistics` job rewrites README correctly on
      next data update
- [ ] Spot-check: `BHF`, `DTE`, `RGA`, etc. exist in `equities.csv`
      only (not in `etfs.csv` anymore)
- [ ] Spot-check: `FSST` has all fields populated

Related: JerBouma#140 (test infra), JerBouma#143 (introduced the `exchange -> market`
invariant pattern), JerBouma#144 (proposes splitting CSVs by exchange).