What's the feature or data that should be improved?
The monolithic CSV files database/equities.csv, database/etfs.csv, and database/funds.csv into per-exchange shards, while keeping the compression artifacts (compression/*.bz2, compression/categories/*.gzip) byte-identical so library consumers see zero change.
| File |
Rows today |
Size |
Distinct exchange values |
equities.csv |
160,105 |
~90 MB |
84 |
etfs.csv |
36,555 |
~14 MB |
64 |
funds.csv |
57,853 |
~22 MB |
33 |
Why now: every recent contribution PR (#138 #139 #141 #142 #143 — ~30k cells across the batch) hit one of these pain points:
- GitHub UI can't render the diff on PRs that touch more than a few rows. Reviewing 16k-row changes is essentially blind.
- Merge conflicts between concurrent data PRs are guaranteed and painful to resolve manually.
- Local editing in Excel/text editors is sluggish on a 90 MB CSV; many tools won't open it at all.
- The 50 MB GitHub recommended-file-size warning fires on every push.
- Targeted contributions (e.g. "fix all NYSE rows") require scrolling through the entire file or external scripts even for one-shot edits.
Describe how you would like the feature or data improved
Today:
database/
├── equities.csv (160k rows, 90 MB, 84 exchanges)
├── etfs.csv (36k rows, 14 MB, 64 exchanges)
└── funds.csv (57k rows, 22 MB, 33 exchanges)
After:
database/
├── equities/
│ ├── FRA.csv # ~14k rows
│ ├── STU.csv # ~13k rows
│ ├── PNK.csv # ~12k rows
│ ├── ENX.csv # ~11k rows
│ ├── NMS.csv # ~7k rows
│ └── ... (84 files)
├── etfs/ (64 files)
├── funds/ (33 files)
└── delisted.csv # ~77 rows: truly delisted tickers with no current exchange
# (verified against Morningstar's reused-ticker list:
# these names have NOT been reassigned and are dead.
# Keeping them in a separate file makes the per-exchange
# shards a clean partition of "currently listed").
The "empty exchange" bucket that previously held ~17k rows in equities.csv is gone: those rows had ticker suffixes (.NX, .F, .DU, ...) whose exchange code can be derived deterministically from FD's own data with no API calls (see the follow-up PR that fills them). After that fill the only remaining empty-exchange rows are 77 truly-delisted tickers — and delisted.csv is the right home for those.
The build step concatenates the shards back into the existing in-memory DataFrame and produces the same compression/*.bz2 / compression/categories/*.gzip as today. Library consumers (and the use_local_location=True path) are unaffected because they never read database/*.csv directly.
Benefits map:
| Pain today |
After split |
Reviewing #143's diff: 16k-row strings in equities.csv, no GitHub-side rendering |
Reviewer sees only the per-exchange files that actually changed — diffs render normally |
Two PRs touching different exchanges merge-conflict on equities.csv |
Each PR touches different files — clean three-way merge |
Opening equities.csv in Excel = slow / fails |
NYQ.csv is ~7k rows = trivial |
git blame on a row in a 160k-line CSV = useless |
git blame on NYQ.csv = meaningful, shows last touch per ticker tier |
| GitHub 50 MB warning on every push |
Largest shard < 10 MB, well under the limit |
What this proposal does NOT change:
- No data change. Same rows, same columns, same values; only on-disk layout.
- No library API change.
fd.Equities() etc. continue to work identically.
- No compression artifact change.
compression/equities.bz2 byte-identical (or trivially equivalent up to row order, enforced by sorting after pd.concat).
- No new dependency.
Possibly describe the ideal way to improve this
- One-shot migration script that reads each monolithic CSV, splits by
exchange (using _no_exchange.csv for NaN), writes shards to database/<asset>/, deletes the monolithic CSV.
- Update the snippet at
database_update.yml:~300 (and any local helper) to:
df = pd.concat([pd.read_csv(p) for p in sorted(Path("database/equities").glob("*.csv"))])
df.to_csv("compression/equities.bz2", index=False, compression="bz2")
- Update
CONTRIBUTING.md ("Download the CSV files…") to reflect the new layout — pointer to the right shard for common exchanges.
- Optional CI guard: assert
pd.concat(shards) matches a checksum of the canonical sort order, so accidentally adding a row to the wrong shard is caught.
Additional information
Happy to implement this as a single PR (migration script + workflow change + CONTRIBUTING update + optional CI checksum check). Cleanest moment is after #140 (test infrastructure improvements) is merged, so the test suite catches any drift introduced by the migration. The split key is exchange because it is the most stable identity attribute for a row (more stable than name, sector, country).
What's the feature or data that should be improved?
The monolithic CSV files
database/equities.csv,database/etfs.csv, anddatabase/funds.csvinto per-exchange shards, while keeping the compression artifacts (compression/*.bz2,compression/categories/*.gzip) byte-identical so library consumers see zero change.exchangevaluesequities.csvetfs.csvfunds.csvWhy now: every recent contribution PR (#138 #139 #141 #142 #143 — ~30k cells across the batch) hit one of these pain points:
Describe how you would like the feature or data improved
Today:
After:
The "empty exchange" bucket that previously held ~17k rows in
equities.csvis gone: those rows had ticker suffixes (.NX,.F,.DU, ...) whose exchange code can be derived deterministically from FD's own data with no API calls (see the follow-up PR that fills them). After that fill the only remaining empty-exchange rows are 77 truly-delisted tickers — anddelisted.csvis the right home for those.The build step concatenates the shards back into the existing in-memory DataFrame and produces the same
compression/*.bz2/compression/categories/*.gzipas today. Library consumers (and theuse_local_location=Truepath) are unaffected because they never readdatabase/*.csvdirectly.Benefits map:
equities.csv, no GitHub-side renderingequities.csvequities.csvin Excel = slow / failsNYQ.csvis ~7k rows = trivialgit blameon a row in a 160k-line CSV = uselessgit blameonNYQ.csv= meaningful, shows last touch per ticker tierWhat this proposal does NOT change:
fd.Equities()etc. continue to work identically.compression/equities.bz2byte-identical (or trivially equivalent up to row order, enforced by sorting afterpd.concat).Possibly describe the ideal way to improve this
exchange(using_no_exchange.csvforNaN), writes shards todatabase/<asset>/, deletes the monolithic CSV.database_update.yml:~300(and any local helper) to:CONTRIBUTING.md("Download the CSV files…") to reflect the new layout — pointer to the right shard for common exchanges.pd.concat(shards)matches a checksum of the canonical sort order, so accidentally adding a row to the wrong shard is caught.Additional information
Happy to implement this as a single PR (migration script + workflow change + CONTRIBUTING update + optional CI checksum check). Cleanest moment is after #140 (test infrastructure improvements) is merged, so the test suite catches any drift introduced by the migration. The split key is
exchangebecause it is the most stable identity attribute for a row (more stable than name, sector, country).