docs(pipeline): document factor algebra, cross-sectional transforms, lazy executor (#502)

MDUYN · MDUYN · commit 4af8b2da56a9 · 2026-05-02T13:28:18.000+02:00
- Add 'Cross-sectional transforms' and 'Factor algebra' sections to pipelines.md covering zscore/demean/winsorize and arithmetic operators with shared sub-expression caching.

- Rewrite pipelines-vector-backtest.md from roadmap stub to user-facing reference: how the engine evaluates factors, lazy/streaming option, equivalence guarantee with event mode.

- Update Phase 2 status to shipped.
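
Reviewer note: the docs in this change describe a per-evaluation cache, held in a `ContextVar`, that memoises shared sub-expressions. The snippet below is only a minimal standard-library sketch of that idea, not the framework's implementation. `Node`, `evaluate` and `_cache` are hypothetical names invented for illustration, and plain Python lists stand in for Polars series.

```python
from contextvars import ContextVar

# Hypothetical per-evaluation cache; None means no evaluation is in progress.
_cache = ContextVar("factor_cache", default=None)


class Node:
    """Toy expression node standing in for a factor (illustrative only)."""

    def __init__(self, name, fn, *inputs):
        self.name = name
        self.fn = fn
        self.inputs = inputs
        self.calls = 0  # counts how often fn actually ran

    def compute(self):
        cache = _cache.get()
        # Reuse the memoised value if this node was already computed
        # during the current evaluation.
        if cache is not None and id(self) in cache:
            return cache[id(self)]
        self.calls += 1
        value = self.fn(*(node.compute() for node in self.inputs))
        if cache is not None:
            cache[id(self)] = value
        return value


def evaluate(node):
    """Evaluate `node` with a fresh cache that is discarded afterwards."""
    token = _cache.set({})
    try:
        return node.compute()
    finally:
        _cache.reset(token)


# `r` feeds two different branches of the same expression.
r = Node("r", lambda: [1.0, 2.0, 3.0])
demean = Node("demean", lambda xs: [x - sum(xs) / len(xs) for x in xs], r)
doubled = Node("doubled", lambda xs: [2 * x for x in xs], r)
combo = Node("combo", lambda a, b: [x - y for x, y in zip(a, b)], doubled, demean)

result = evaluate(combo)
```

In this sketch `r.calls` is 1 after the evaluation even though `r` appears in both branches, and a second `evaluate(combo)` starts from an empty cache, so nothing leaks between evaluations.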
diff --git a/docusaurus/docs/Advanced Concepts/pipelines-vector-backtest.md b/docusaurus/docs/Advanced Concepts/pipelines-vector-backtest.md
@@ -1,46 +1,84 @@
 ---
 sidebar_position: 11
-title: Pipelines — Vector backtest (roadmap)
-description: Vector-mode pipeline execution. Tracked under #502.
+title: Pipelines — Vector backtest
+description: Vector-mode pipeline execution. Phase 2 (#502) — shipped.
 ---
 
 # Pipelines: Vector backtest
 
-:::info Status: not yet shipped (Phase 2)
-
-Vector-mode pipelines are tracked under
-[#502](https://github.com/coding-kitties/investing-algorithm-framework/issues/502).
-The public API (`Pipeline`, `Factor`, `Filter`) defined in
-[Pipelines: Event-driven backtest](pipelines-event-backtest.md) is
-intentionally engine-agnostic, so strategies you write against Phase 1
-will keep working when Phase 2 lands.
+:::tip Status: shipped (Phase 2)
+Vector-mode pipelines run by default whenever you backtest with
+`BacktestService` — no opt-in needed. The vector engine evaluates each
+declared factor across the **entire** backtest window once per
+strategy iteration, with shared sub-expression caching.
 :::
 
-## What's planned
+## How it works
+
+When `BacktestService` runs, it inspects each strategy's `pipelines`
+list and routes them through `VectorPipelineEngine`. For every
+iteration:
+
+1. The engine builds a long-form Polars panel
+   `(datetime, symbol, open, high, low, close, volume)` truncated at
+   the current bar (no look-ahead).
+2. Each declared `Factor` is evaluated **once** in vectorised Polars,
+   per symbol, over the full window.
+3. A per-evaluation cache (a `ContextVar`) memoises shared
+   sub-expressions — for example `r.zscore() - r.demean()` only
+   computes `r` once.
+4. The optional `universe` mask filters the result; the universe
+   column itself is dropped from the output.
+5. The strategy receives the wide frame via
+   `data["YourPipelineClassName"]`.
+
+The strategy author surface is unchanged from
+[Pipelines: Event-driven backtest](pipelines-event-backtest.md): you
+write the same `Pipeline` subclasses and read the same
+`data["..."]` frames.
+
+## Lazy / streaming execution
+
+For memory-bound runs over very large universes, you can run the
+post-factor steps (universe filter, column drop, sort) on Polars'
+streaming engine:
+
+```python
+from investing_algorithm_framework.services.pipeline import (
+    VectorPipelineEngine,
+)
 
-- A vector executor that materialises every factor in the pipeline
-  once over the **entire** backtest window, instead of rebuilding the
-  panel on each event.
-- Integration with the existing vector backtester (see
-  [Vector backtesting](vector-backtesting.md)).
-- Cached intermediate frames so a `rank` of a `Returns` doesn't
-  recompute returns.
-- Optional Polars **lazy** execution path for memory-bound runs.
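+# MomentumScreener, panel_data and sym_id are placeholders for your own
+# Pipeline subclass, panel data object and symbol-to-identifier mapping.
+engine = VectorPipelineEngine(lazy=True)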
+result = engine.evaluate_window(
+    pipeline_cls=MomentumScreener,
+    data_object=panel_data,
+    symbol_to_identifier=sym_id,
+)
+```
 
-## What stays the same
+`lazy=True` produces results **bit-for-bit identical** to the default
+eager mode; an equivalence test in the suite verifies this. The flag
+only changes how the result frame is collected: factors themselves
+still return eager `pl.Series` values per symbol. On older Polars
+versions whose `collect` does not accept `engine="streaming"`, the
+engine transparently falls back to a default collect.
 
-The strategy author surface — declaring a `Pipeline` subclass, listing
-it on `strategy.pipelines`, and reading
-`data["YourPipelineClassName"]` inside `run_strategy` — does not
-change. Switching from event mode to vector mode is meant to be a
-runner choice, not a strategy rewrite.
+You typically don't need to instantiate `VectorPipelineEngine`
+yourself; `BacktestService` handles it. The `lazy` flag is exposed for
+direct users of the engine and for performance experiments.
 
-## Want to help?
+## Equivalence with event mode
 
-Track or comment on the implementation issue:
-[#502 — Pipeline API: Phase 2 (vector executor)](https://github.com/coding-kitties/investing-algorithm-framework/issues/502).
+Vector and event modes are required to produce **identical** factor
+values for the same panel and the same `as_of`. The test suite
+enforces this with cross-mode equivalence tests in
+`tests/services/pipeline/test_vector_pipeline_engine.py`. If you find
+a discrepancy, that's a bug, so please file an issue.
 
 ## See also
 
-- [Pipelines](pipelines.md) — concept page.
-- [Pipelines: Event-driven backtest](pipelines-event-backtest.md) — what works today.
+- [Pipelines](pipelines.md) — concept page (factor algebra, transforms).
+- [Pipelines: Event-driven backtest](pipelines-event-backtest.md) —
+  same surface, event executor.
+- [Pipelines: Live trading](pipelines-live.md) — stateless / serverless
+  notes.
diff --git a/docusaurus/docs/Advanced Concepts/pipelines.md b/docusaurus/docs/Advanced Concepts/pipelines.md
@@ -90,6 +90,53 @@ factor.top(n)                # boolean mask: top-n by descending value
 factor.bottom(n)             # boolean mask: bottom-n by ascending value
 ```
 
+### Cross-sectional transforms
+
+Per-bar normalisation operators (Phase 2). Each takes an optional
+`mask` so that the statistic is computed only over the symbols that
+pass the mask:
+
+```python
+factor.zscore(mask=universe)             # (x - mean) / std per bar
+factor.demean(mask=universe)             # x - mean per bar
+factor.winsorize(0.01, 0.99,             # clip to per-bar quantiles
+                 mask=universe)
+```
+
+Where the cross-sectional `std` is `0` or undefined (e.g. only one
+symbol survives the mask), `zscore` returns `null` rather than
+`inf`/`NaN`. Masked-out symbols are excluded from the bar's
+statistic *and* from the bar's output.
+
+### Factor algebra
+
+Factors compose via the standard arithmetic operators. The framework
+auto-coerces scalar operands and shares sub-expression results via a
+per-evaluation cache, so the same input factor is computed once even
+when it appears multiple times:
+
+```python
+class MyScreener(Pipeline):
+    momentum = Returns(window=30)
+    vol = Volatility(window=30)
+
+    universe = AverageDollarVolume(window=30).top(100)
+
+    # Composite alphas — `momentum` is computed once even though it
+    # appears in two terms.
+    risk_adjusted = momentum / vol
+    score = (
+        momentum.zscore(mask=universe)
+        - 0.5 * vol.zscore(mask=universe)
+    )
+```
+
+Supported operators: `+`, `-`, `*`, `/`, unary `-`. Both operands may
+be `Factor` instances, or one of them may be a Python `int` or `float`.
+Division by zero leaves `inf` in place (downstream filters can drop
+it). For safe normalisation prefer `zscore`, which guards against
+zero dispersion.
+
 ## Phased rollout
 
 Pipelines run today in the **event-driven backtest** path and in
@@ -99,7 +146,7 @@ and cached/lazy execution are tracked separately.
 | Mode | Status | Page |
 | --- | --- | --- |
 | Event-driven backtest | ✅ Phase 1 | [Pipelines: Event-driven backtest](pipelines-event-backtest.md) |
-| Vector backtest | 🚧 Phase 2 ([#502](https://github.com/coding-kitties/investing-algorithm-framework/issues/502)) | [Pipelines: Vector backtest](pipelines-vector-backtest.md) |
+| Vector backtest | ✅ Phase 2 ([#502](https://github.com/coding-kitties/investing-algorithm-framework/issues/502)) | [Pipelines: Vector backtest](pipelines-vector-backtest.md) |
 | Live trading | 🚧 Phase 3 ([#503](https://github.com/coding-kitties/investing-algorithm-framework/issues/503)) | [Pipelines: Live trading](pipelines-live.md) |
 
 Start with the event-driven backtest page — it covers the full Phase 1