feat(bundle): format v2 + groundwork for tiered storage rewrite (epic #540) by MDUYN · Pull Request #537 · coding-kitties/investing-algorithm-framework

MDUYN · 2026-05-10T17:46:06Z

Bundle format v2 + storage-layer groundwork

Status: Foundation for the tiered storage rewrite tracked in #540.

This PR is not the size win — measurements during review showed v2 alone moves on-disk size by ~5% (the ceiling of per-file compression). Shipping it now anyway because it lays the schema/typing/contract groundwork the storage rewrite depends on, and the engine_type split + summary_only read path are user-visible improvements in their own right.

What's in this PR

1. Bundle format v2 (additive)

IAFB magic + uint32 format version header for cheap routing/validation without decompression
v2 envelope splits runs by engine via Backtest.engine_type (vector / event / null), exposed as vector_runs / event_runs properties; legacy backtest_runs still works for engine-agnostic callers
8 metric time-series (equity_curve, drawdown_series, cumulative_return_series, rolling_sharpe_ratio, monthly_returns, yearly_returns, twr_equity_curve, twr_drawdown_series) extracted as embedded Parquet blobs with int64 epoch-ms timestamps + float64 values
v1 envelopes remain readable indefinitely; the v8.9 writer emits v2 by default and produces v1 only when format_version=1 is passed explicitly
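
To illustrate the "cheap routing/validation without decompression" point, here is a minimal sketch of header sniffing. It assumes the uint32 version is stored little-endian immediately after the 4-byte IAFB magic; the authoritative layout is whatever docs/design/bundle-format-v2.md specifies, and `sniff_format_version` is a hypothetical helper name, not part of the framework.

```python
import struct
from typing import Optional

MAGIC = b"IAFB"
# Assumption: little-endian uint32 right after the magic. The real byte
# order and offsets are defined by the bundle-format-v2 spec.
HEADER = struct.Struct("<4sI")


def sniff_format_version(data: bytes) -> Optional[int]:
    """Return the format version if data starts with an IAFB header, else None."""
    if len(data) < HEADER.size:
        return None
    magic, version = HEADER.unpack_from(data)
    if magic != MAGIC:
        return None  # no IAFB header, so the caller falls back to another reader
    return version


# Only the first 8 bytes are inspected; the payload is never decompressed.
sample = HEADER.pack(MAGIC, 2) + b"...compressed payload..."
print(sniff_format_version(sample))       # 2
print(sniff_format_version(b"not-iafb"))  # None
```

A reader could dispatch on the returned version and treat `None` as "try the legacy path", which is one plausible way to keep v1 bundles readable.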

2. Read-path additions (groundwork for the storage rewrite)

open_bundle(summary_only=True) and Backtest.open(summary_only=True) skip the per-run Parquet decode for bulk listing pipelines. Scalar summary metrics remain fully populated. This is the reader contract the future BacktestStore.list() will sit on top of.
save_bundle(format_version=, float32_ohlcv=) knobs for explicit downgrade and OHLCV size reduction.

3. zstd default 7 → 19

Measured ~14% smaller on real bundles with no observable decode-cost change (zstd decoder speed is independent of encoder level).
For a 12,500-bundle workload this trims ~9 GB at zero behavioural cost.

4. Specs (the actual story)

docs/design/bundle-format-v2.md — public, stable wire format spec
docs/design/ohlcv-dedup-protocol.md — content-addressed chunk negotiate protocol, generalizable to all chunk types
docs/design/tiered-backtest-storage.md — the actual size story: relational index + columnar Parquet bulk + content-addressed chunks. Projects 64 GB → < 20 GB on the production-shape sample. This is what the storage rewrite epic will implement.

Honest size accounting

On three real production-shape sample bundles in docs/sample_backtests/ (~570 KB each):

Configuration	Avg per bundle	Projected total (12,500)
v1, zstd 7 (pre-this-PR)	569 KB	64.0 GB
v2, zstd 19 (this PR)	489 KB	~55 GB
v2 + daily portfolio snapshots (user config change)	~431 KB	~48 GB
Tiered store + content-addressed dedup (next epic)	n/a	< 20 GB
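
The projected totals in the table can be reproduced by scaling the 64.0 GB baseline by the ratio of average bundle sizes. The sketch below assumes that is how the projections were derived; the function name `project` is illustrative only.

```python
BASELINE_KB = 569          # v1, zstd 7 (average per bundle, from the table)
BASELINE_TOTAL_GB = 64.0   # projected total for the baseline row


def project(avg_kb):
    """Return (percent saved, projected total in GB) relative to the baseline."""
    ratio = avg_kb / BASELINE_KB
    return (1 - ratio) * 100, BASELINE_TOTAL_GB * ratio


for label, kb in [("v2, zstd 19", 489), ("v2 + daily snapshots", 431)]:
    saved, total = project(kb)
    print(f"{label}: -{saved:.0f}% -> ~{total:.0f} GB")
# v2, zstd 19: -14% -> ~55 GB
# v2 + daily snapshots: -24% -> ~48 GB
```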

The big win lives in the storage rewrite, not in the per-file format. This PR is the contract layer that makes the rewrite implementable without breaking anyone.

Compatibility

v1 bundles still readable via the upgraded reader
save_bundle() defaults to v2 — no caller change required
Set format_version=1 to keep emitting legacy bundles for downstream tools that haven't upgraded
All Python APIs are additive. No removed exports.

→ Suggested release: minor (v8.9.0).

Tests

17 / 17 bundle tests pass (9 existing + 8 new v2-specific cases: v2 header, v1 downgrade, vector/event slot routing, summary_only mode, legacy default, round-trip preservation)
Full suite: 1681 passed, 42 skipped, 0 failed
Lint clean

What this PR does not claim

It does not materially shrink the archive on its own (~14% from zstd 19 is the entire size win)
It does not make listing 12,500 backtests by Sharpe fast (that's the next epic)
It does not dedup across bundles (also the next epic)

What it does do is freeze the v2 wire format, the engine_type semantics, and the summary_only read path so that the storage rewrite can be implemented purely as additive changes on top.

Follow-up

See the epic in #540 for the full tiered storage roadmap (replaces the now-closed #538 and #539).

Introduces .iafbt format version 2 (default writer); see docs/design/bundle-format-v2.md for the public spec.

Changes:

- Backtest dataclass gains an engine_type field ('vector' | 'event' | None) plus vector_runs / event_runs / vector_metrics / event_metrics derived properties. Engine-tagged at construction by VectorBacktest and EventBacktest service paths; legacy / unknown bundles keep the engine-agnostic backtest_runs / backtest_summary view.
- Bundle envelope v2 routes runs into engine-specific slots (vector_runs / event_runs) based on engine_type. v1 envelopes remain readable indefinitely.
- Eight heavy metric time-series (equity_curve, drawdown_series, cumulative_return_series, rolling_sharpe_ratio, monthly_returns, yearly_returns, twr_equity_curve, twr_drawdown_series) are extracted into embedded Parquet blobs with int64 epoch-ms timestamps and float64 values, replacing the v1 inline list-of-(value, ISO-string) shape.
- save_bundle gains format_version (default 2, accepts 1 for downgrade) and float32_ohlcv (off by default) options.
- open_bundle and Backtest.open gain summary_only=True for bulk listing pipelines that skip the per-run blob decode.

OHLCV side store, content addressing, and LazyOhlcvDict are unchanged. Public OHLCV dedup protocol spec added at docs/design/ohlcv-dedup-protocol.md for upload-style integrations.

17 bundle tests pass (8 new v2-specific cases); full suite passes 1681/1681 against unchanged behaviour for v1 bundles.
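
As an illustration of the time-series shape change described in this commit message, the following sketch converts a v1-style list of (value, ISO-string) pairs into two typed columns: int64 epoch-millisecond timestamps and float64 values. It uses only the standard library and stops short of writing Parquet; `to_columns` is a hypothetical name, and treating naive timestamps as UTC is an assumption rather than documented framework behaviour.

```python
from array import array
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MS = timedelta(milliseconds=1)


def to_columns(series):
    """Split [(value, iso_string), ...] into (epoch-ms int64, float64) columns."""
    timestamps = array("q")  # signed 64-bit integers
    values = array("d")      # 64-bit floats
    for value, iso in series:
        dt = datetime.fromisoformat(iso)
        if dt.tzinfo is None:
            dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        # Integer division of timedeltas avoids float rounding on the timestamp.
        timestamps.append((dt - EPOCH) // ONE_MS)
        values.append(float(value))
    return timestamps, values


v1_series = [
    (1000.0, "2024-01-01T00:00:00+00:00"),
    (1012.5, "2024-01-01T01:00:00+00:00"),
]
ts, vals = to_columns(v1_series)
print(list(ts))    # [1704067200000, 1704070800000]
print(list(vals))  # [1000.0, 1012.5]
```

Two fixed-width columns of this kind are what a columnar writer such as Parquet can encode compactly, in contrast to repeating an ISO string per sample.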

Measured on real production-shape bundles (~570 KB at level 7, 7 runs × 2192 hourly portfolio snapshots each):

level 7 → ~570 KB
level 19 → ~489 KB (−14%)
level 22 → ~489 KB (saturated)

Decode speed is unchanged in zstd's design (decoder is independent of encoder level). For a 12,500-bundle workload (~64 GB at level 7) this trims the archive to ~55 GB at zero behavioural cost. Level 19 is the highest level still in the standard tier (no --ultra flag, no special memory window), so the bytes are readable by any stock zstd reader.

Captures the storage architecture proposal that came out of the v8.9 size-review measurements:

- Why per-file compression has hit its ceiling (~14% headroom max)
- Three-tier model: Index (SQL) + Columnar bulk (Parquet) + Content-addressed chunks (S3-compatible)
- .iafbt demoted from storage primitive to deterministic export format; v1 and v2 remain readable forever
- Local-only OSS path (LocalTieredStore) keeps the framework self-contained without a server
- Phased migration: v8.10 read-side helpers, v8.11 store abstraction, Finterion remote store closed-source

Companion to docs/design/bundle-format-v2.md and docs/design/ohlcv-dedup-protocol.md.

…phase 1)

Lift the existing untyped flat-row index helper into a public, typed Tier-1 contract:

* New BacktestIndexRow dataclass (domain/backtesting/backtest_index_row.py) with identity / provenance / config / nested summary_metrics + forward-compat extras. Lossless to_flat_dict / from_flat_dict round trip for Parquet, SQL and JSON sinks.
* New Backtest.index_row(bundle_path=None) method. Builds without decoding any v2 Parquet metric blobs, so it works against bundles loaded with Backtest.open(..., summary_only=True). This is the fast read path the upcoming 'iaf index' CLI (phase 2) and any tiered store implementation (phase 3) will rely on.
* _backtest_to_index_row in backtest_utils now delegates to BacktestIndexRow.to_flat_dict() so the wire shape and the in-memory shape are a single source of truth (no behavioural change for the existing index.parquet sidecar).
* Re-export BacktestIndexRow from the domain and top-level packages.
* docs/design/tiered-backtest-storage.md §3.1 + roadmap row updated to reference the typed contract.

Tests: 5 new (in-memory derivation, flat round-trip incl. NaN, unknown columns landing in extras, derivation from a summary_only=True bundle load). Full backtests suite green (29/29).
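
The flat round trip with forward-compatible extras can be pictured with a small stand-in dataclass. This is a sketch only: `IndexRowSketch`, its field names, and the `summary_` key prefix are assumptions made for illustration and do not reproduce the real BacktestIndexRow definition.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

_KNOWN = ("backtest_id", "engine_type")  # hypothetical identity columns
_PREFIX = "summary_"                     # assumed prefix for nested metrics


@dataclass
class IndexRowSketch:
    backtest_id: str
    engine_type: Optional[str] = None
    summary_metrics: Dict[str, float] = field(default_factory=dict)
    extras: Dict[str, Any] = field(default_factory=dict)

    def to_flat_dict(self) -> Dict[str, Any]:
        """Flatten nested metrics into prefixed top-level keys."""
        flat = {"backtest_id": self.backtest_id, "engine_type": self.engine_type}
        flat.update({_PREFIX + k: v for k, v in self.summary_metrics.items()})
        flat.update(self.extras)
        return flat

    @classmethod
    def from_flat_dict(cls, flat: Dict[str, Any]) -> "IndexRowSketch":
        """Rebuild a row; columns this version does not know land in extras."""
        row = cls(flat["backtest_id"], flat.get("engine_type"))
        for key, value in flat.items():
            if key in _KNOWN:
                continue
            if key.startswith(_PREFIX):
                row.summary_metrics[key[len(_PREFIX):]] = value
            else:
                row.extras[key] = value
        return row


row = IndexRowSketch("bt-1", "vector", {"sharpe_ratio": 1.2, "max_drawdown": float("nan")})
flat = row.to_flat_dict()
flat["future_column"] = 7  # simulates a column written by a newer version
restored = IndexRowSketch.from_flat_dict(flat)
print(restored.extras)                                    # {'future_column': 7}
print(math.isnan(restored.summary_metrics["max_drawdown"]))  # True
```

Because NaN never compares equal to itself, a round-trip test has to check NaN fields with `math.isnan` rather than plain equality, which is presumably why the commit calls out the NaN case explicitly.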

CI flake8 flags the import as unused without the corresponding entry in __all__. Phase 1 missed adding it.

Tier-1 SQLite index over a folder of .iafbt bundles, building on the phase-1 BacktestIndexRow contract.

* services/backtest_index/sqlite_index.py: SqliteBacktestIndex with create / open / upsert / upsert_many / iter_rows / query. Every scalar field of BacktestSummaryMetrics is promoted to its own summary_<name> SQL column so analysts can filter without opening any bundle (e.g. WHERE summary_sharpe_ratio > 1.0). parameters / strategy_ids / extras round-trip as JSON text. WAL mode for safe concurrent reads. Forward-only additive schema migration via PRAGMA user_version.
* cli/index_command.py + new 'iaf index <bundle-dir>' click command: walks the directory, opens each bundle with summary_only=True (no Parquet metric-blob decode), derives BacktestIndexRow, upserts. --output, --absolute-paths and --no-progress flags.
* docs/design/tiered-backtest-storage.md updated with phase-2 status.

Tests: 12 new (8 SqliteBacktestIndex unit tests + 4 CLI integration tests via click.testing.CliRunner). Full repo suite green (1698 passed, 42 skipped). Lint clean.
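
The index design described above (promoted summary columns, JSON text for structured fields, WAL mode, and PRAGMA user_version migrations) can be sketched with the standard-library sqlite3 module. The table name, column set, file name, and helper functions below are hypothetical and much smaller than the real SqliteBacktestIndex.

```python
import json
import os
import sqlite3
import tempfile

SCHEMA_VERSION = 1


def open_index(path):
    """Open (or create) the index and apply forward-only migrations."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # readers are not blocked by a writer
    (version,) = conn.execute("PRAGMA user_version").fetchone()
    if version < 1:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS backtests ("
            " bundle_path TEXT PRIMARY KEY,"
            " summary_sharpe_ratio REAL,"   # one promoted scalar metric
            " parameters TEXT)"             # JSON text
        )
        conn.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
        conn.commit()
    return conn


def upsert(conn, bundle_path, sharpe_ratio, parameters):
    """Insert a row, replacing any existing row for the same bundle path."""
    conn.execute(
        "INSERT OR REPLACE INTO backtests VALUES (?, ?, ?)",
        (bundle_path, sharpe_ratio, json.dumps(parameters)),
    )
    conn.commit()


with tempfile.TemporaryDirectory() as tmp:
    conn = open_index(os.path.join(tmp, "index.sqlite"))
    upsert(conn, "a.iafbt", 1.4, {"fast": 10})
    upsert(conn, "b.iafbt", 0.6, {"fast": 20})
    rows = conn.execute(
        "SELECT bundle_path, parameters FROM backtests"
        " WHERE summary_sharpe_ratio > 1.0"
    ).fetchall()
    conn.close()

print(rows)  # [('a.iafbt', '{"fast": 10}')]
```

The filter runs entirely inside SQLite, so no bundle has to be opened to answer it; later schema versions would add further `if version < N` steps that only add columns.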

feat(cli): `iaf index` + SqliteBacktestIndex — epic #540 phase 2

feat(backtest): BacktestIndexRow DTO + Backtest.index_row() — epic #540 phase 1

MDUYN added 3 commits May 10, 2026 19:44

MDUYN changed the title ~~feat: bundle format v2 (vector/event split + Parquet metric blobs)~~ feat(bundle): format v2 + groundwork for tiered storage rewrite (epic #540) May 10, 2026

This was referenced May 10, 2026

feat(backtest): BacktestIndexRow DTO + Backtest.index_row() — epic #540 phase 1 #541

Merged

feat(cli): iaf index + SqliteBacktestIndex — epic #540 phase 2 #542

Merged

MDUYN added 4 commits May 10, 2026 22:31

fix(lint): export BacktestIndexRow from domain.__all__

5682157

CI flake8 flags the import as unused without the corresponding entry in __all__. Phase 1 missed adding it.

Merge pull request #542 from coding-kitties/feature/iaf-index-cli

84ccf1f

feat(cli): `iaf index` + SqliteBacktestIndex — epic #540 phase 2

Merge pull request #541 from coding-kitties/feature/index-row

78a203b

feat(backtest): BacktestIndexRow DTO + Backtest.index_row() — epic #540 phase 1

MDUYN merged commit b3f86a0 into dev May 12, 2026
8 checks passed

MDUYN mentioned this pull request May 18, 2026

Epic: tiered backtest storage (rewrite the storage layer) #540

Closed

19 tasks

MDUYN merged 8 commits into dev from feature/bundle-format-v2
