
feat(bundle): format v2 + groundwork for tiered storage rewrite (epic #540) #537

Merged: MDUYN merged 8 commits into dev from feature/bundle-format-v2 on May 12, 2026

Collaborator

@MDUYN MDUYN commented May 10, 2026

Bundle format v2 + storage-layer groundwork

Status: Foundation for the tiered storage rewrite tracked in #540.

This PR is not the size win: measurements during review showed that it moves on-disk size by only ~14% (the ceiling of per-file compression). It ships now anyway because it lays the schema/typing/contract groundwork the storage rewrite depends on, and because the engine_type split and the summary_only read path are user-visible improvements in their own right.

What's in this PR

1. Bundle format v2 (additive)

  • IAFB magic + uint32 format version header for cheap routing/validation without decompression
  • v2 envelope splits runs by engine via Backtest.engine_type (vector / event / null), exposed as vector_runs / event_runs properties; legacy backtest_runs still works for engine-agnostic callers
  • 8 metric time-series (equity_curve, drawdown_series, cumulative_return_series, rolling_sharpe_ratio, monthly_returns, yearly_returns, twr_equity_curve, twr_drawdown_series) extracted as embedded Parquet blobs with int64 epoch-ms timestamps + float64 values
  • v1 envelopes remain readable indefinitely; the v8.9 writer emits v2 by default and only emits v1 when format_version=1 is requested
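
As an illustration of the "cheap routing without decompression" point above, here is a minimal sketch of sniffing the header. Only the IAFB magic and the uint32 version field come from this PR; the byte order, the exact header layout, and the rule that a missing magic means a legacy v1 bundle are assumptions, and `sniff_format_version` is a hypothetical name, not the framework's API.

```python
import struct

MAGIC = b"IAFB"                 # magic bytes named in the PR description
HEADER = struct.Struct("<4sI")  # ASSUMPTION: little-endian uint32 directly after the magic


def sniff_format_version(blob: bytes) -> int:
    """Return the bundle format version by reading only the first 8 bytes.

    ASSUMPTION: a blob without the IAFB magic is treated as a legacy v1 bundle.
    """
    if len(blob) >= HEADER.size:
        magic, version = HEADER.unpack_from(blob)
        if magic == MAGIC:
            return version
    return 1


# A fake v2 bundle: header followed by an opaque (still compressed) payload.
v2_blob = HEADER.pack(MAGIC, 2) + b"<compressed payload>"
# A fake legacy bundle: starts with the zstd frame magic, no IAFB header.
legacy_blob = b"\x28\xb5\x2f\xfd" + b"<compressed payload>"

print(sniff_format_version(v2_blob), sniff_format_version(legacy_blob))
```

The payload is never touched, so a reader can reject or route a file after reading a fixed-size prefix.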

2. Read-path additions (groundwork for the storage rewrite)

  • open_bundle(summary_only=True) and Backtest.open(summary_only=True) skip the per-run Parquet decode for bulk listing pipelines. Scalar summary metrics remain fully populated. This is the reader contract the future BacktestStore.list() will sit on top of.
  • save_bundle(format_version=, float32_ohlcv=) knobs for explicit downgrade and OHLCV size reduction.
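
The summary_only contract can be pictured with a toy reader. This is a sketch of the idea only: the envelope layout, `ToyRun`, `encode_run` and `read_run` are hypothetical, and zlib-compressed JSON stands in for the embedded Parquet blobs so that the example needs nothing outside the standard library.

```python
import json
import zlib
from dataclasses import dataclass, field


@dataclass
class ToyRun:
    summary: dict                               # scalar metrics, always populated
    series: dict = field(default_factory=dict)  # heavy time-series, decoded on demand


def encode_run(summary: dict, series: dict) -> dict:
    """Build a toy envelope: inline scalars plus one opaque blob per series."""
    blobs = {name: zlib.compress(json.dumps(points).encode())
             for name, points in series.items()}
    return {"summary": summary, "blobs": blobs}


def read_run(envelope: dict, summary_only: bool = False) -> ToyRun:
    """Decode a toy envelope; with summary_only=True the blobs are never opened."""
    run = ToyRun(summary=dict(envelope["summary"]))
    if not summary_only:
        for name, blob in envelope["blobs"].items():
            run.series[name] = json.loads(zlib.decompress(blob))
    return run


envelope = encode_run(
    {"sharpe_ratio": 1.2},
    {"equity_curve": [[0, 100.0], [3_600_000, 101.5]]},
)
full = read_run(envelope)
light = read_run(envelope, summary_only=True)
print(light.summary, light.series)
```

A listing pipeline only needs `light.summary`, so the per-run decode cost disappears while the scalar metrics stay identical to the full read.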

3. zstd default 7 → 19

  • Measured ~14% smaller on real bundles with no observable decode-cost change (zstd decoder speed is independent of encoder level).
  • For a 12,500-bundle workload this trims ~9 GB at zero behavioural cost.

4. Specs (the actual story)

Honest size accounting

On three real production-shape sample bundles in docs/sample_backtests/ (~570 KB each):

| Configuration | Avg per bundle | Projected total (12,500 bundles) |
| --- | --- | --- |
| v1, zstd 7 (pre-this-PR) | 569 KB | 64.0 GB |
| v2, zstd 19 (this PR) | 489 KB | ~55 GB |
| v2 + daily portfolio snapshots (user config change) | ~431 KB | ~48 GB |
| Tiered store + content-addressed dedup (next epic) | n/a | < 20 GB |

The big win lives in the storage rewrite, not in the per-file format. This PR is the contract layer that makes the rewrite implementable without breaking anyone.
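
For readers who want to check the table, the projected totals are consistent with scaling the 64.0 GB baseline by the ratio of average bundle sizes. The sketch below assumes that linear scaling is how the projections were derived; the PR does not state the method, and `project_total_gb` is a hypothetical helper.

```python
V1_ZSTD7_KB = 569         # average bundle size before this PR (from the table)
BASELINE_TOTAL_GB = 64.0  # projected total for 12,500 bundles at that size


def project_total_gb(avg_kb: float) -> float:
    """Scale the baseline total by the per-bundle size ratio (ASSUMED method)."""
    return BASELINE_TOTAL_GB * avg_kb / V1_ZSTD7_KB


for label, avg_kb in [("v2, zstd 19", 489), ("v2 + daily snapshots", 431)]:
    saving = 1 - avg_kb / V1_ZSTD7_KB
    print(f"{label}: {saving:.1%} smaller, ~{project_total_gb(avg_kb):.1f} GB")
```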

Compatibility

  • v1 bundles still readable via the upgraded reader
  • save_bundle() defaults to v2 — no caller change required
  • Set format_version=1 to keep emitting legacy bundles for downstream tools that haven't upgraded
  • All Python APIs are additive. No removed exports.

→ Suggested release: minor (v8.9.0).

Tests

  • 17 / 17 bundle tests pass (9 existing + 8 new v2-specific cases: v2 header, v1 downgrade, vector/event slot routing, summary_only mode, legacy default, round-trip preservation)
  • Full suite: 1681 passed, 42 skipped, 0 failed
  • Lint clean

What this PR does not claim

  • It does not materially shrink the archive on its own (~14% from zstd 19 is the entire size win)
  • It does not make listing 12,500 backtests by Sharpe fast (that's the next epic)
  • It does not dedup across bundles (also the next epic)

What it does do is freeze the v2 wire format, the engine_type semantics, and the summary_only read path so that the storage rewrite can be implemented purely as additive changes on top.

Follow-up

See the epic in #540 for the full tiered storage roadmap (replaces the now-closed #538 and #539).

MDUYN added 3 commits May 10, 2026 19:44
Introduces .iafbt format version 2 (default writer) — see
docs/design/bundle-format-v2.md for the public spec.

Changes:
- Backtest dataclass gains an engine_type field ('vector' | 'event' |
  None) plus vector_runs / event_runs / vector_metrics / event_metrics
  derived properties. Engine-tagged at construction by VectorBacktest
  and EventBacktest service paths; legacy / unknown bundles keep the
  engine-agnostic backtest_runs / backtest_summary view.
- Bundle envelope v2 routes runs into engine-specific slots
  (vector_runs / event_runs) based on engine_type. v1 envelopes
  remain readable indefinitely.
- Eight heavy metric time-series (equity_curve, drawdown_series,
  cumulative_return_series, rolling_sharpe_ratio, monthly_returns,
  yearly_returns, twr_equity_curve, twr_drawdown_series) are
  extracted into embedded Parquet blobs with int64 epoch-ms
  timestamps and float64 values, replacing the v1 inline
  list-of-(value, ISO-string) shape.
- save_bundle gains format_version (default 2, accepts 1 for
  downgrade) and float32_ohlcv (off by default) options.
- open_bundle and Backtest.open gain summary_only=True for bulk
  listing pipelines that skip the per-run blob decode.
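
The shape change described above, from inline (value, ISO-string) pairs to an int64 epoch-ms column plus a float64 column, can be sketched with the standard library. This shows the column typing only, not the Parquet encoding itself; `to_columns` is a hypothetical helper, and treating naive timestamps as UTC is an assumption.

```python
from array import array
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MS = timedelta(milliseconds=1)


def to_columns(points):
    """Split v1-style (value, ISO-string) pairs into two typed columns."""
    timestamps = array("q")  # signed 64-bit integers: epoch milliseconds
    values = array("d")      # 64-bit floats
    for value, iso in points:
        moment = datetime.fromisoformat(iso)
        if moment.tzinfo is None:  # ASSUMPTION: naive timestamps are UTC
            moment = moment.replace(tzinfo=timezone.utc)
        timestamps.append((moment - EPOCH) // ONE_MS)  # exact integer division
        values.append(float(value))
    return timestamps, values


v1_series = [
    (100.0, "2024-01-01T00:00:00+00:00"),
    (101.5, "2024-01-01T01:00:00+00:00"),
]
timestamps, values = to_columns(v1_series)
print(list(timestamps), list(values))
```

Fixed-width numeric columns compress and decode far more predictably than repeated ISO strings, which is presumably why the v2 layout stores them this way.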

OHLCV side store, content addressing, and LazyOhlcvDict are
unchanged. Public OHLCV dedup protocol spec added at
docs/design/ohlcv-dedup-protocol.md for upload-style integrations.

17 bundle tests pass (8 new v2-specific cases); full suite passes
1681/1681 against unchanged behaviour for v1 bundles.

Measured on real production-shape bundles (~570 KB at level 7,
7 runs × 2192 hourly portfolio snapshots each):

  level 7  →  ~570 KB
  level 19 →  ~489 KB  (−14%)
  level 22 →  ~489 KB  (saturated)

Decode speed is unchanged by design in zstd (the decoder is
independent of the encoder level). For a 12,500-bundle workload
(~64 GB at level 7) this trims the archive to ~55 GB at zero
behavioural cost.

Level 19 is the highest level still in the standard tier (no
--ultra flag, no special memory window), so the bytes are readable
by any stock zstd reader.

Captures the storage architecture proposal that came out of the
v8.9 size-review measurements:

  - Why per-file compression has hit its ceiling (~14% headroom max)
  - Three-tier model: Index (SQL) + Columnar bulk (Parquet) +
    Content-addressed chunks (S3-compatible)
  - .iafbt demoted from storage primitive to deterministic export
    format; v1 and v2 remain readable forever
  - Local-only OSS path (LocalTieredStore) keeps the framework
    self-contained without a server
  - Phased migration: v8.10 read-side helpers, v8.11 store
    abstraction, Finterion remote store closed-source

Companion to docs/design/bundle-format-v2.md and
docs/design/ohlcv-dedup-protocol.md.
@MDUYN MDUYN changed the title from "feat: bundle format v2 (vector/event split + Parquet metric blobs)" to "feat(bundle): format v2 + groundwork for tiered storage rewrite (epic #540)" May 10, 2026
…phase 1)

Lift the existing untyped flat-row index helper into a public, typed
Tier-1 contract:

* New BacktestIndexRow dataclass (domain/backtesting/backtest_index_row.py)
  with identity / provenance / config / nested summary_metrics +
  forward-compat extras. Lossless to_flat_dict / from_flat_dict round
  trip for Parquet, SQL and JSON sinks.
* New Backtest.index_row(bundle_path=None) method. Builds without
  decoding any v2 Parquet metric blobs, so it works against bundles
  loaded with Backtest.open(..., summary_only=True). This is the
  fast read path the upcoming 'iaf index' CLI (phase 2) and any
  tiered store implementation (phase 3) will rely on.
* _backtest_to_index_row in backtest_utils now delegates to
  BacktestIndexRow.to_flat_dict() so the wire shape and the in-memory
  shape are a single source of truth (no behavioural change for the
  existing index.parquet sidecar).
* Re-export BacktestIndexRow from the domain and top-level packages.
* docs/design/tiered-backtest-storage.md §3.1 + roadmap row updated
  to reference the typed contract.

Tests: 5 new (in-memory derivation, flat round-trip incl. NaN, unknown
columns landing in extras, derivation from a summary_only=True bundle
load). Full backtests suite green (29/29).
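
The lossless flat round trip described in this commit, including NaN values and unknown columns landing in extras, might look roughly like the toy below. `ToyIndexRow` is not the real BacktestIndexRow: its fields and the `summary_` key prefix used for the flat shape are assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Optional

_IDENTITY = ("backtest_id", "engine_type")
_PREFIX = "summary_"  # ASSUMPTION: nested metrics are flattened with this prefix


@dataclass
class ToyIndexRow:
    backtest_id: str
    engine_type: Optional[str]
    summary_metrics: dict = field(default_factory=dict)
    extras: dict = field(default_factory=dict)  # forward-compat: unknown columns

    def to_flat_dict(self) -> dict:
        flat = {"backtest_id": self.backtest_id, "engine_type": self.engine_type}
        flat.update({_PREFIX + k: v for k, v in self.summary_metrics.items()})
        flat.update(self.extras)
        return flat

    @classmethod
    def from_flat_dict(cls, flat: dict) -> "ToyIndexRow":
        metrics, extras = {}, {}
        for key, value in flat.items():
            if key in _IDENTITY:
                continue
            if key.startswith(_PREFIX):
                metrics[key[len(_PREFIX):]] = value
            else:
                extras[key] = value  # columns this version does not know about
        return cls(flat["backtest_id"], flat.get("engine_type"), metrics, extras)


row = ToyIndexRow("bt-001", "vector",
                  {"sharpe_ratio": 1.2, "max_drawdown": math.nan})
flat = row.to_flat_dict()
flat["future_column"] = 7  # simulate a column written by a newer version
restored = ToyIndexRow.from_flat_dict(flat)
print(restored.extras)
```

Note that NaN never compares equal to itself, so a round-trip test has to check NaN fields with `math.isnan` rather than dataclass equality.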
MDUYN added 4 commits May 10, 2026 22:31
CI flake8 flags the import as unused without the corresponding entry
in __all__. Phase 1 missed adding it.

Tier-1 SQLite index over a folder of .iafbt bundles, building on the
phase-1 BacktestIndexRow contract.

* services/backtest_index/sqlite_index.py: SqliteBacktestIndex with
  create / open / upsert / upsert_many / iter_rows / query. Every
  scalar field of BacktestSummaryMetrics is promoted to its own
  summary_<name> SQL column so analysts can filter without opening
  any bundle (e.g. WHERE summary_sharpe_ratio > 1.0). parameters /
  strategy_ids / extras round-trip as JSON text. WAL mode for safe
  concurrent reads. Forward-only additive schema migration via
  PRAGMA user_version.
* cli/index_command.py + new 'iaf index <bundle-dir>' click command:
  walks the directory, opens each bundle with summary_only=True (no
  Parquet metric-blob decode), derives BacktestIndexRow, upserts.
  --output, --absolute-paths and --no-progress flags.
* docs/design/tiered-backtest-storage.md updated with phase-2 status.

Tests: 12 new (8 SqliteBacktestIndex unit tests + 4 CLI integration
tests via click.testing.CliRunner). Full repo suite green
(1698 passed, 42 skipped). Lint clean.
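
The ingredients named in this commit (promoted summary_<name> columns, JSON text for parameters, WAL mode, PRAGMA user_version migrations, upserts) can be sketched with the standard sqlite3 module. This is not the SqliteBacktestIndex implementation: the table name, the choice of bundle_path as primary key, and the `open_index` / `upsert` helpers are assumptions.

```python
import json
import os
import sqlite3
import tempfile

SCHEMA_VERSION = 1


def open_index(path: str) -> sqlite3.Connection:
    """Open (or create) a toy index and apply forward-only migrations."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # readers are not blocked by the writer
    (version,) = conn.execute("PRAGMA user_version").fetchone()
    if version < 1:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS backtests ("
            " bundle_path TEXT PRIMARY KEY,"  # ASSUMPTION: path is the key
            " engine_type TEXT,"
            " summary_sharpe_ratio REAL,"     # one promoted scalar metric
            " parameters TEXT)"               # JSON text
        )
        conn.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
        conn.commit()
    return conn


def upsert(conn: sqlite3.Connection, row: dict) -> None:
    """Insert a row, or overwrite the existing row for the same bundle_path."""
    conn.execute(
        "INSERT INTO backtests"
        " (bundle_path, engine_type, summary_sharpe_ratio, parameters)"
        " VALUES (:bundle_path, :engine_type, :summary_sharpe_ratio, :parameters)"
        " ON CONFLICT(bundle_path) DO UPDATE SET"
        " engine_type = excluded.engine_type,"
        " summary_sharpe_ratio = excluded.summary_sharpe_ratio,"
        " parameters = excluded.parameters",
        row,
    )
    conn.commit()


def make_row(path: str, sharpe: float) -> dict:
    return {"bundle_path": path, "engine_type": "vector",
            "summary_sharpe_ratio": sharpe,
            "parameters": json.dumps({"window": 20})}


with tempfile.TemporaryDirectory() as tmp:
    conn = open_index(os.path.join(tmp, "index.sqlite"))
    upsert(conn, make_row("a.iafbt", 0.4))
    upsert(conn, make_row("a.iafbt", 1.3))  # same path: updates, no duplicate
    upsert(conn, make_row("b.iafbt", 0.9))
    good = [r[0] for r in conn.execute(
        "SELECT bundle_path FROM backtests WHERE summary_sharpe_ratio > 1.0")]
    total = conn.execute("SELECT COUNT(*) FROM backtests").fetchone()[0]
    conn.close()

print(good, total)
```

Because the filter runs entirely inside SQLite, no bundle is opened to answer the Sharpe query, which is the point of promoting scalar metrics to columns.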
feat(cli): `iaf index` + SqliteBacktestIndex — epic #540 phase 2
feat(backtest): BacktestIndexRow DTO + Backtest.index_row() — epic #540 phase 1