Skip to content

[IMPROVE] Split equities/etfs/funds CSVs by exchange for maintainability (no data change) #144

@dokson

Description

@dokson

What's the feature or data that should be improved?

The monolithic CSV files database/equities.csv, database/etfs.csv, and database/funds.csv into per-exchange shards, while keeping the compression artifacts (compression/*.bz2, compression/categories/*.gzip) byte-identical so library consumers see zero change.

File Rows today Size Distinct exchange values
equities.csv 160,105 ~90 MB 84
etfs.csv 36,555 ~14 MB 64
funds.csv 57,853 ~22 MB 33

Why now: every recent contribution PR (#138 #139 #141 #142 #143 — ~30k cells across the batch) hit one of these pain points:

  • GitHub UI can't render the diff on PRs that touch more than a few rows. Reviewing 16k-row changes is essentially blind.
  • Merge conflicts between concurrent data PRs are guaranteed and painful to resolve manually.
  • Local editing in Excel/text editors is sluggish on a 90 MB CSV; many tools won't open it at all.
  • The 50 MB GitHub recommended-file-size warning fires on every push.
  • Targeted contributions (e.g. "fix all NYSE rows") require scrolling through the entire file or external scripts even for one-shot edits.

Describe how you would like the feature or data improved

Today:

database/
├── equities.csv     (160k rows, 90 MB, 84 exchanges)
├── etfs.csv         (36k rows, 14 MB, 64 exchanges)
└── funds.csv        (57k rows, 22 MB, 33 exchanges)

After:

database/
├── equities/
│   ├── FRA.csv               # ~14k rows
│   ├── STU.csv               # ~13k rows
│   ├── PNK.csv               # ~12k rows
│   ├── ENX.csv               # ~11k rows
│   ├── NMS.csv               # ~7k rows
│   └── ... (84 files)
├── etfs/   (64 files)
├── funds/  (33 files)
└── delisted.csv              # ~77 rows: truly delisted tickers with no current exchange
                              #          (verified against Morningstar's reused-ticker list:
                              #          these names have NOT been reassigned and are dead.
                              #          Keeping them in a separate file makes the per-exchange
                              #          shards a clean partition of "currently listed").

The "empty exchange" bucket that previously held ~17k rows in equities.csv is gone: those rows had ticker suffixes (.NX, .F, .DU, ...) whose exchange code can be derived deterministically from FD's own data with no API calls (see the follow-up PR that fills them). After that fill the only remaining empty-exchange rows are 77 truly-delisted tickers — and delisted.csv is the right home for those.

The build step concatenates the shards back into the existing in-memory DataFrame and produces the same compression/*.bz2 / compression/categories/*.gzip as today. Library consumers (and the use_local_location=True path) are unaffected because they never read database/*.csv directly.

Benefits map:

Pain today After split
Reviewing #143's diff: 16k-row strings in equities.csv, no GitHub-side rendering Reviewer sees only the per-exchange files that actually changed — diffs render normally
Two PRs touching different exchanges merge-conflict on equities.csv Each PR touches different files — clean three-way merge
Opening equities.csv in Excel = slow / fails NYQ.csv is ~7k rows = trivial
git blame on a row in a 160k-line CSV = useless git blame on NYQ.csv = meaningful, shows last touch per ticker tier
GitHub 50 MB warning on every push Largest shard < 10 MB, well under the limit

What this proposal does NOT change:

  • No data change. Same rows, same columns, same values; only on-disk layout.
  • No library API change. fd.Equities() etc. continue to work identically.
  • No compression artifact change. compression/equities.bz2 byte-identical (or trivially equivalent up to row order, enforced by sorting after pd.concat).
  • No new dependency.

Possibly describe the ideal way to improve this

  1. One-shot migration script that reads each monolithic CSV, splits by exchange (using _no_exchange.csv for NaN), writes shards to database/<asset>/, deletes the monolithic CSV.
  2. Update the snippet at database_update.yml:~300 (and any local helper) to:
    df = pd.concat([pd.read_csv(p) for p in sorted(Path("database/equities").glob("*.csv"))])
    df.to_csv("compression/equities.bz2", index=False, compression="bz2")
  3. Update CONTRIBUTING.md ("Download the CSV files…") to reflect the new layout — pointer to the right shard for common exchanges.
  4. Optional CI guard: assert pd.concat(shards) matches a checksum of the canonical sort order, so accidentally adding a row to the wrong shard is caught.

Additional information

Happy to implement this as a single PR (migration script + workflow change + CONTRIBUTING update + optional CI checksum check). Cleanest moment is after #140 (test infrastructure improvements) is merged, so the test suite catches any drift introduced by the migration. The split key is exchange because it is the most stable identity attribute for a row (more stable than name, sector, country).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions