[IMPROVE] Split equities/etfs/funds CSVs by exchange for maintainability (no data change)

**What's the feature or data that should be improved?**

The monolithic CSV files `database/equities.csv`, `database/etfs.csv`, and `database/funds.csv` into per-exchange shards, while keeping the compression artifacts (`compression/*.bz2`, `compression/categories/*.gzip`) byte-identical so library consumers see zero change.

| File | Rows today | Size | Distinct `exchange` values |
|---|---:|---:|---:|
| `equities.csv` | 160,105 | ~90 MB | 84 |
| `etfs.csv` | 36,555 | ~14 MB | 64 |
| `funds.csv` | 57,853 | ~22 MB | 33 |

Why now: every recent contribution PR (#138 #139 #141 #142 #143 — ~30k cells across the batch) hit one of these pain points:

- GitHub UI can't render the diff on PRs that touch more than a few rows. Reviewing 16k-row changes is essentially blind.
- Merge conflicts between concurrent data PRs are guaranteed and painful to resolve manually.
- Local editing in Excel/text editors is sluggish on a 90 MB CSV; many tools won't open it at all.
- The 50 MB GitHub recommended-file-size warning fires on every push.
- Targeted contributions (e.g. "fix all NYSE rows") require scrolling through the entire file or external scripts even for one-shot edits.

**Describe how you would like the feature or data improved**

Today:
```
database/
├── equities.csv     (160k rows, 90 MB, 84 exchanges)
├── etfs.csv         (36k rows, 14 MB, 64 exchanges)
└── funds.csv        (57k rows, 22 MB, 33 exchanges)
```

After:
```
database/
├── equities/
│   ├── FRA.csv               # ~14k rows
│   ├── STU.csv               # ~13k rows
│   ├── PNK.csv               # ~12k rows
│   ├── ENX.csv               # ~11k rows
│   ├── NMS.csv               # ~7k rows
│   └── ... (84 files)
├── etfs/   (64 files)
├── funds/  (33 files)
└── delisted.csv              # ~77 rows: truly delisted tickers with no current exchange
                              #          (verified against Morningstar's reused-ticker list:
                              #          these names have NOT been reassigned and are dead.
                              #          Keeping them in a separate file makes the per-exchange
                              #          shards a clean partition of "currently listed").
```

The "empty exchange" bucket that previously held ~17k rows in `equities.csv` is gone: those rows had ticker suffixes (`.NX`, `.F`, `.DU`, ...) whose exchange code can be derived deterministically from FD's own data with no API calls (see the follow-up PR that fills them). After that fill the only remaining empty-exchange rows are 77 truly-delisted tickers — and `delisted.csv` is the right home for those.

The build step concatenates the shards back into the existing in-memory DataFrame and produces the same `compression/*.bz2` / `compression/categories/*.gzip` as today. Library consumers (and the `use_local_location=True` path) are unaffected because they never read `database/*.csv` directly.

Benefits map:

| Pain today | After split |
|---|---|
| Reviewing #143's diff: 16k-row strings in `equities.csv`, no GitHub-side rendering | Reviewer sees only the per-exchange files that actually changed — diffs render normally |
| Two PRs touching different exchanges merge-conflict on `equities.csv` | Each PR touches different files — clean three-way merge |
| Opening `equities.csv` in Excel = slow / fails | `NYQ.csv` is ~7k rows = trivial |
| `git blame` on a row in a 160k-line CSV = useless | `git blame` on `NYQ.csv` = meaningful, shows last touch per ticker tier |
| GitHub 50 MB warning on every push | Largest shard < 10 MB, well under the limit |

What this proposal does NOT change:
- **No data change.** Same rows, same columns, same values; only on-disk layout.
- **No library API change.** `fd.Equities()` etc. continue to work identically.
- **No compression artifact change.** `compression/equities.bz2` byte-identical (or trivially equivalent up to row order, enforced by sorting after `pd.concat`).
- **No new dependency.**

**Possibly describe the ideal way to improve this**

1. One-shot migration script that reads each monolithic CSV, splits by `exchange` (using `_no_exchange.csv` for `NaN`), writes shards to `database/<asset>/`, deletes the monolithic CSV.
2. Update the snippet at `database_update.yml:~300` (and any local helper) to:
   ```python
   df = pd.concat([pd.read_csv(p) for p in sorted(Path("database/equities").glob("*.csv"))])
   df.to_csv("compression/equities.bz2", index=False, compression="bz2")
   ```
3. Update `CONTRIBUTING.md` ("Download the CSV files…") to reflect the new layout — pointer to the right shard for common exchanges.
4. Optional CI guard: assert `pd.concat(shards)` matches a checksum of the canonical sort order, so accidentally adding a row to the wrong shard is caught.

**Additional information**

Happy to implement this as a single PR (migration script + workflow change + CONTRIBUTING update + optional CI checksum check). Cleanest moment is after #140 (test infrastructure improvements) is merged, so the test suite catches any drift introduced by the migration. The split key is `exchange` because it is the most stable identity attribute for a row (more stable than name, sector, country).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[IMPROVE] Split equities/etfs/funds CSVs by exchange for maintainability (no data change) #144

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

File	Rows today	Size	Distinct `exchange` values
`equities.csv`	160,105	~90 MB	84
`etfs.csv`	36,555	~14 MB	64
`funds.csv`	57,853	~22 MB	33

Pain today	After split
Reviewing #143's diff: 16k-row strings in `equities.csv`, no GitHub-side rendering	Reviewer sees only the per-exchange files that actually changed — diffs render normally
Two PRs touching different exchanges merge-conflict on `equities.csv`	Each PR touches different files — clean three-way merge
Opening `equities.csv` in Excel = slow / fails	`NYQ.csv` is ~7k rows = trivial
`git blame` on a row in a 160k-line CSV = useless	`git blame` on `NYQ.csv` = meaningful, shows last touch per ticker tier
GitHub 50 MB warning on every push	Largest shard < 10 MB, well under the limit

Uh oh!

[IMPROVE] Split equities/etfs/funds CSVs by exchange for maintainability (no data change) #144

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions