Commit ba9cb4e
ETFs/Funds data quality + cross-asset invariants + equities country/ISIN backfill + SPAC cleanup + README stats (#147)
* ETFs/Funds data quality + cross-asset invariants + country backfill + README stats
Data quality on etfs.csv:
- 14 non-ETF rows removed (already correctly in equities.csv: BHF, DTE, RGA,
HTGC, PSEC, TPVG, ...)
- 56 cross-asset symbol collisions with equities.csv removed (corporate
bonds, senior notes, share-class variants misclassified as ETFs)
- 29 corrupted `exchange` values fixed (issuer name in exchange column:
Xtrackers, Fundlogic, Purpose Investments, ...)
- FSST (Fidelity Sustainable U.S. Equity ETF) completed (was all-NaN)
- After cleanup: equities.csv and etfs.csv share zero symbols
Country backfill on equities.csv (50.3% -> 63.1%):
- 15,692 rows filled from primary listing in equities.csv (same base ticker,
e.g. ASML.AS country propagated to ASML.DE)
- 5,777 rows filled via TradingView screener API (HQ country, not listing
country -- ASML on Nasdaq stays "Netherlands", not "United States")
- 113 additional rows from yfinance lookup
- Skips bases that resolve ambiguously across markets (e.g. numeric
bases shared between Chinese .SZ and Korean .KS exchanges)
- Russian Federation -> Russia normalization
- ALMER.PA: Reunion (French overseas dept) -> France
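A minimal sketch of the primary-listing propagation described above, using only the standard library. The function name `backfill_country`, the row layout, and the rule that a base is "ambiguous" when its known listings disagree on country are assumptions for illustration; the commit does not state how ambiguity was detected.

```python
from collections import defaultdict


def backfill_country(rows):
    """Fill missing countries from sibling listings that share a base ticker.

    rows: list of dicts with 'symbol' and 'country' (None when missing).
    Assumption: the base ticker is the part of the symbol before the first dot.
    """
    # Collect every known country per base ticker.
    known = defaultdict(set)
    for row in rows:
        if row["country"]:
            known[row["symbol"].split(".")[0]].add(row["country"])

    filled = []
    for row in rows:
        base = row["symbol"].split(".")[0]
        countries = known.get(base, set())
        # Propagate only when the base resolves to exactly one country;
        # bases that map to several countries are skipped as ambiguous.
        if not row["country"] and len(countries) == 1:
            row = {**row, "country": next(iter(countries))}
        filled.append(row)
    return filled


# Toy rows: ASML.DE inherits from ASML.AS; the numeric base is left alone.
sample_rows = [
    {"symbol": "ASML.AS", "country": "Netherlands"},
    {"symbol": "ASML.DE", "country": None},
    {"symbol": "000001.SZ", "country": "China"},
    {"symbol": "000001.KS", "country": "South Korea"},
    {"symbol": "000001.F", "country": None},
]
backfilled = backfill_country(sample_rows)
```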
Cross-asset invariants:
- New tests/test_invariants.py with test_no_symbol_collisions_across_asset_classes
covering all 7 asset class files (equities, etfs, funds, indices, currencies,
cryptos, moneymarkets). Catches drift like the ^REIT-in-etfs case at PR time.
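The invariant can be sketched as a pairwise set intersection over the symbol sets of each asset class. This is an illustration, not the contents of tests/test_invariants.py; the helper name `find_symbol_collisions` and the in-memory sample data are hypothetical, and a real test would load the symbols from database/*.csv.

```python
from itertools import combinations


def find_symbol_collisions(symbols_by_class):
    """Return {(class_a, class_b): shared_symbols} for each overlapping pair."""
    collisions = {}
    # Sort by class name so the pair keys are deterministic.
    pairs = combinations(sorted(symbols_by_class.items()), 2)
    for (name_a, syms_a), (name_b, syms_b) in pairs:
        shared = set(syms_a) & set(syms_b)
        if shared:
            collisions[(name_a, name_b)] = shared
    return collisions


# Toy data reproducing the ^REIT-in-etfs drift mentioned above.
sample_symbols = {
    "equities": {"ASML.AS", "BHF"},
    "etfs": {"FSST", "^REIT"},
    "indices": {"^REIT"},
}
symbol_collisions = find_symbol_collisions(sample_symbols)
```

A test built on this would simply assert that the returned dict is empty.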
Automated README statistics:
- New Update-README-Statistics job in .github/workflows/database_update.yml
regenerates stats tables from database/*.csv after every database update.
- README restructured into three tables (Equities w/ Countries; ETFs/Funds
w/o Country; Currencies/Cryptos/Indices/Money Markets) to keep every cell
honest -- ETFs/Funds country was a manual placeholder before.
Misc:
- financedatabase/helpers.py: widen base show_options() return type from
pd.Series to pd.Index | dict | np.ndarray to match what subclasses actually
return (LSP fix; runtime unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* equities.csv data quality: ISIN cleanup + SPAC template removal + canonicalize names
Picks up where the prior commit left off after deeper auditing surfaced
two more contamination layers in equities.csv.
ISIN cleanup:
- Cross-asset duplicates (equities.csv ∩ etfs.csv): 209 ISINs cleared from
equities rows where the ISIN rightfully belongs to the ETF (UBS UCITS, Lyxor,
iShares, HSBC etc. previously stamped onto unrelated PNK micro-caps).
- Multi-name ISIN cleanup (different companies sharing one ISIN -- impossible
by ISO 6166 design, so at least one row's ISIN is wrong). Resolved canonical
name per ISIN via GLEIF (legalName) + yfinance (longName) + TradingView
(description).
Total wrong ISIN values cleared: 9,634 across the three passes.
- Multi-name ISIN count: 4,772 baseline -> 1,467 (-69%).
- New invariant test `test_no_isin_collisions_across_asset_classes` enforces
no future regression of the cross-asset case (catches the same drift that
the symbol invariant catches).
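One way to surface multi-name ISINs is to group names by ISIN and keep the groups with more than one distinct name. The sketch below assumes rows are dicts with 'isin' and 'name' keys and compares names case-insensitively; the function name and the placeholder ISINs are hypothetical, and the commit does not describe the actual detection code.

```python
from collections import defaultdict


def multi_name_isins(rows):
    """Map each ISIN to its distinct names, keeping only ISINs with 2+ names."""
    names_by_isin = defaultdict(set)
    for row in rows:
        if row.get("isin"):  # rows without an ISIN cannot collide
            names_by_isin[row["isin"]].add(row["name"].strip().casefold())
    return {isin: names for isin, names in names_by_isin.items() if len(names) > 1}


# Placeholder ISINs and names, for illustration only.
sample_isin_rows = [
    {"isin": "XX0000000001", "name": "Alpha Holdings"},
    {"isin": "XX0000000001", "name": "Beta Mining Corp"},
    {"isin": "XX0000000002", "name": "Gamma SE"},
    {"isin": "XX0000000002", "name": "GAMMA SE"},
    {"isin": None, "name": "No Isin Ltd"},
]
suspect_isins = multi_name_isins(sample_isin_rows)
```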
Name canonicalization:
- 8,361 names rewritten to the canonical legalName/description form so that
cross-listings of the same company collapse to one spelling instead of 4
variants like "STRABAG SE", "STRABAG SE-BR", "Strabag SE Inhaber-Aktien o.N."
- Conservative threshold: only rewrite when token similarity with canonical
>= 0.4 (preserves originals when the row likely refers to a different
company than the canonical).
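The conservative threshold can be illustrated as follows. The commit does not say which token-similarity metric was used; this sketch assumes the overlap coefficient (shared tokens divided by the smaller token set), and the function names are hypothetical.

```python
import re


def token_similarity(a, b):
    """Overlap coefficient on lowercased alphanumeric tokens (assumed metric)."""
    tokens_a = set(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))


def canonicalize(name, canonical, threshold=0.4):
    """Rewrite to the canonical spelling only when the names look related."""
    if token_similarity(name, canonical) >= threshold:
        return canonical
    return name  # likely a different company: keep the original


rewritten = canonicalize("Strabag SE Inhaber-Aktien o.N.", "STRABAG SE")
preserved = canonicalize("Acme Mining Corp", "STRABAG SE")
```

With this metric the long German share-class spelling is rewritten while an unrelated name is kept; a different metric could score the same pair differently, so the 0.4 cutoff only makes sense together with the metric it was tuned for.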
SPAC template removal (upstream data poisoning):
- 1,584 equities.csv rows had identical name="one" + summary about a SPAC
that "does not have significant operations. It intends to effect a merger
...", plus identical state=CA / city=San Francisco / zipcode=94129
/ website=a-star.co / sector=Financials / industry=Diversified Financial
Services. This is A-Star Financial Acquisition Corp's data copy-pasted
onto 1,584 unrelated tickers somewhere upstream.
- 200 real names recovered via TradingView + stockevents.app + DuckDuckGo
+ Finnhub (the rest are micro-caps not in any free public dataset).
- All 1,608 rows carrying the SPAC fingerprint (website=a-star.co) had
the contaminated sector/industry_group/industry/state/city/zipcode/
website/market_cap fields cleared. The bogus uniform values were more
misleading than missing data.
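The fingerprint-based clearing can be sketched like this. The field list and the website=a-star.co fingerprint come from the description above; the function name, the dict-based row layout, and the exact-match comparison are assumptions.

```python
CONTAMINATED_FIELDS = (
    "sector", "industry_group", "industry",
    "state", "city", "zipcode", "website", "market_cap",
)


def clear_spac_fingerprint(row, fingerprint_site="a-star.co"):
    """Return a copy of row with contaminated fields blanked if it matches."""
    if row.get("website") != fingerprint_site:
        return dict(row)  # untouched copy for clean rows
    # Blank only the fields known to carry the copied SPAC template.
    return {k: (None if k in CONTAMINATED_FIELDS else v) for k, v in row.items()}


# Hypothetical rows: one carrying the fingerprint, one clean.
poisoned = {"symbol": "XYZ", "name": "one", "website": "a-star.co",
            "sector": "Financials", "city": "San Francisco"}
clean = {"symbol": "ASML.AS", "name": "ASML Holding N.V.",
         "website": "asml.com", "sector": "Information Technology"}
poisoned_after = clear_spac_fingerprint(poisoned)
clean_after = clear_spac_fingerprint(clean)
```

Identity fields such as the symbol are left in place, so a cleared row can still be repaired later if a real name is recovered.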
ALMER.PA: Reunion -> France (French overseas department; matches FD's
convention for metropolitan/overseas France).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 0e9e8d3 commit ba9cb4e
36 files changed
Lines changed: 71550 additions & 71429 deletions
File tree
- .github/workflows
- database
- financedatabase
- tests
- csv
- test_equities
- test_etfs
- json
- test_equities
- test_etfs