Reusable narwhals pipelines #204

Open
ptomecek wants to merge 4 commits into main from pit/transforms

@ptomecek ptomecek commented Apr 29, 2026

Adds reusable narwhals pipeline abstractions in ccflow.models.narwhals, plus an end-to-end TPC-H notebook demonstrating them.

New classes

  • NarwhalsFrameTransform — pure LazyFrame -> LazyFrame transform base class. Framework-agnostic; usable standalone via lf.pipe(transform).
  • SequenceTransform — bundles a list of transforms; itself a NarwhalsFrameTransform so it nests and JSON-roundtrips.
  • NarwhalsPipelineModel — CallableModel that pipes a NarwhalsFrameResult source through a list of transforms. Delegates context_type to the source.
  • JoinTransform — joins another CallableModel's frame onto the input. Supports same-named (on=), cross-named (left_on=/right_on=), and how="cross" joins.
  • JoinBackTransform — runs an inner transform on the input and joins the result back, for fork/rejoin patterns where window functions don't fit.
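
The composition idea behind NarwhalsFrameTransform and SequenceTransform can be sketched without any dataframe library. The block below is an illustration only, not the ccflow API: it assumes a list of row dicts as a stand-in for a LazyFrame, and `FilterRows`, `AddColumn`, `Sequence`, and `pipe` are hypothetical names. It shows why making the bundle itself a transform lets bundles nest.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stand-in for a LazyFrame: a list of row dicts (assumption for illustration).
Frame = List[Dict[str, float]]
Transform = Callable[[Frame], Frame]


@dataclass
class FilterRows:
    """Keep rows whose `column` value is at least `minimum`."""
    column: str
    minimum: float

    def __call__(self, frame: Frame) -> Frame:
        return [row for row in frame if row[self.column] >= self.minimum]


@dataclass
class AddColumn:
    """Add `name` = row[source] * factor to every row."""
    name: str
    source: str
    factor: float

    def __call__(self, frame: Frame) -> Frame:
        return [{**row, self.name: row[self.source] * self.factor} for row in frame]


@dataclass
class Sequence:
    """Bundle transforms; being callable itself, it nests inside another Sequence."""
    transforms: List[Transform] = field(default_factory=list)

    def __call__(self, frame: Frame) -> Frame:
        for transform in self.transforms:
            frame = transform(frame)
        return frame


def pipe(frame: Frame, transform: Transform) -> Frame:
    """Mimic the lf.pipe(transform) calling style on the stand-in frame."""
    return transform(frame)


rows = [{"price": 10.0}, {"price": 2.0}, {"price": 5.0}]
inner = Sequence([FilterRows("price", 5.0)])
outer = Sequence([inner, AddColumn("discounted", "price", 0.5)])
result = pipe(rows, outer)
```

Because each step is a small parameterized object rather than a closure, its configuration is plain data, which is the property that makes JSON round-tripping feasible in the real classes.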

Notebook

ccflow/examples/narwhals_pipelines.ipynb — TPC-H Q1 + Q3 walkthrough covering refactoring a canonical query into reusable transforms, dependency injection, JSON serialization of full pipelines, multi-source enrichment via JoinTransform, and the confluence pattern (pipelines composed as inputs to other pipelines).


github-actions Bot commented Apr 29, 2026

Test Results

689 tests  +38   687 ✅ +38   1m 43s ⏱️ ±0s
  1 suites ± 0     2 💤 ± 0 
  1 files   ± 0     0 ❌ ± 0 

Results for commit 56d2592. ± Comparison against base commit e2ef462.

♻️ This comment has been updated with latest results.


codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 98.30028% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.08%. Comparing base (e2ef462) to head (56d2592).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ccflow/tests/models/test_narwhals.py | 98.44% | 4 Missing ⚠️ |
| ccflow/models/narwhals.py | 97.87% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #204      +/-   ##
==========================================
+ Coverage   96.00%   96.08%   +0.07%     
==========================================
  Files         140      142       +2     
  Lines        9839    10192     +353     
  Branches      568      582      +14     
==========================================
+ Hits         9446     9793     +347     
- Misses        275      280       +5     
- Partials      118      119       +1     

@ptomecek ptomecek force-pushed the pit/transforms branch 2 times, most recently from c3013e7 to 01562e7 Compare April 30, 2026 00:29
@timkpaine timkpaine marked this pull request as ready for review April 30, 2026 03:16
Comment thread ccflow/models/narwhals.py Outdated
Comment thread ccflow/tests/models/test_narwhals.py Outdated
Comment thread build_notebook.py Outdated
@ptomecek
Collaborator Author

I'm going to rework this a bit after #205, as the current setup of the TPCH generator is propagating suboptimal patterns into this notebook. Will address the feedback above as well (and probably remove the notebook builder).

pt10597 and others added 4 commits April 30, 2026 11:57
Introduce ccflow.models.narwhals providing:

- NarwhalsFrameTransform: pure LazyFrame -> LazyFrame transform base class.
  Framework-agnostic; usable standalone via lf.pipe(transform).
- SequenceTransform: bundles a strict list of transforms; itself a
  NarwhalsFrameTransform so it nests and JSON-roundtrips.
- NarwhalsPipelineModel: CallableModel that pipes a NarwhalsFrameResult
  source through a list of transforms. Delegates context_type to the
  source. Output is always a narwhals.LazyFrame (lazy contract enforced
  by re-coercing after every stage). Supports loose Callable transforms
  at runtime (strict NarwhalsFrameTransform required for serialization).
- JoinTransform: joins another CallableModel's frame onto the input.
  Other source invoked with NullContext.
- JoinBackTransform: runs an inner transform on the input and joins
  the result back -- for fork/rejoin patterns where window functions
  do not fit.
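
The fork/rejoin pattern named in the last bullet can be sketched as follows. This is a minimal stand-in, not the JoinBackTransform implementation: frames are assumed to be lists of row dicts, and `total_by_key` and `join_back` are hypothetical helpers. The inner transform aggregates per key, and the result is left-joined back so every input row carries its group total.

```python
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]
Frame = List[Row]


def total_by_key(frame: Frame) -> Frame:
    """Inner transform: one row per `key` with the summed `qty` as `total`."""
    totals: Dict[Any, float] = {}
    for row in frame:
        totals[row["key"]] = totals.get(row["key"], 0.0) + row["qty"]
    return [{"key": key, "total": total} for key, total in totals.items()]


def join_back(frame: Frame, inner: Callable[[Frame], Frame], on: Sequence[str]) -> Frame:
    """Run `inner` on `frame`, then left-join its output back onto `frame`.

    Assumes `inner` returns at most one row per combination of `on` keys.
    """
    lookup = {tuple(row[k] for k in on): row for row in inner(frame)}
    joined = []
    for row in frame:
        match = lookup.get(tuple(row[k] for k in on), {})
        extra = {k: v for k, v in match.items() if k not in on}
        joined.append({**row, **extra})
    return joined


rows = [
    {"key": "A", "qty": 1.0},
    {"key": "B", "qty": 4.0},
    {"key": "A", "qty": 2.0},
]
result = join_back(rows, total_by_key, on=["key"])
```

The input keeps its original row count and order; only the aggregated column is added.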

All classes are pydantic models, JSON-serializable, and integrate with
ccflow's graph evaluator via explicit __deps__.

Includes 28 unit tests covering base contracts, JSON roundtrip,
lazy enforcement, dependency injection, multi-source enrichment, and
confluence (pipeline as source of another pipeline).

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>

JoinTransform now accepts either same-named (on=) or cross-named
(left_on=/right_on=) join keys, and supports how='cross'. A model
validator enforces mutual exclusion and that cross joins specify no keys.
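
A rough sketch of the key rules described above, using a stdlib dataclass instead of a pydantic model validator. `JoinSpec` and `is_valid` are hypothetical names, not part of ccflow. Two of the checks (requiring left_on= and right_on= together, and requiring keys for non-cross joins) are assumptions added for completeness and are not stated in the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JoinSpec:
    """Hypothetical stand-in for the key-related fields of a join transform."""
    how: str = "inner"
    on: Optional[List[str]] = None
    left_on: Optional[List[str]] = None
    right_on: Optional[List[str]] = None

    def __post_init__(self) -> None:
        has_on = self.on is not None
        has_pair = self.left_on is not None or self.right_on is not None
        if self.how == "cross":
            # Stated in the commit message: cross joins specify no keys.
            if has_on or has_pair:
                raise ValueError("cross joins must not specify join keys")
            return
        # Stated in the commit message: on= and left_on=/right_on= are exclusive.
        if has_on and has_pair:
            raise ValueError("use either on= or left_on=/right_on=, not both")
        # Assumption: the cross-named keys only make sense as a pair.
        if has_pair and (self.left_on is None or self.right_on is None):
            raise ValueError("left_on= and right_on= must be given together")
        # Assumption: every non-cross join needs some key.
        if not has_on and not has_pair:
            raise ValueError("non-cross joins need join keys")


def is_valid(**kwargs) -> bool:
    """Return True if the keyword arguments form an accepted JoinSpec."""
    try:
        JoinSpec(**kwargs)
        return True
    except ValueError:
        return False
```

Validating at construction time means a misconfigured join fails when the pipeline is built or deserialized, rather than when it is first evaluated.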

Adds ccflow/examples/narwhals_pipelines.ipynb -- an end-to-end walkthrough
of the new abstractions on TPC-H data, refactoring Q1 into transforms,
demonstrating dependency injection, JSON serialization, multi-source
enrichment via JoinTransform, and the confluence-of-pipelines pattern
on Q3. Notebook is generated by build_notebook.py.

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>

Notebook now relies on a properly-configured environment
(`pip install -e .` from this worktree) rather than a
worktree-detection shim in cell 1.

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>

…rializeAsAny

- NarwhalsPipelineModel.__call__ now forwards the caller's context to
  source(context) rather than passing NullContext, so context-keyed sources
  (e.g. TPCHDataGenerator with TPCHTableContext) can be used directly without
  per-table adapter wrappers.

- JoinTransform gains an other_context field so a single context-keyed source
  can be reused across multiple joins (e.g. one TPCHDataGenerator instance
  serving customer/orders/lineitem). Validator now checks
  isinstance(other_context, other.context_type) at construction.

- Drop SerializeAsAny from source and other fields. ccflow's BaseModel
  metaclass already wraps bare BaseModel-typed fields, so the explicit
  annotation was redundant. Verified subclass info still survives JSON
  round-trip in tests.

- Loosen SequenceTransform.transforms to accept the same
  Union[NarwhalsFrameTransform, Callable] as NarwhalsPipelineModel.transforms,
  for consistency. Beef up the comment on NarwhalsFrameTransformOrCallable to
  spell out why both branches matter (BaseModel branch enables type_-based
  round-trip; Callable branch is a runtime escape hatch that doesn't
  serialize).

- Notebook polish: drop hardcoded TOC, drop graph-awareness subsection, drop
  trailing Pointers section, fix ascii alignment in section 7 diagram, and
  remove the TPCHTableProvider adapter (and per-table providers) by leaning
  on context passthrough -- pipelines now use the generator directly. Combine
  the AggregateByReturnStatus and SortByReturnStatus transforms into a single
  SummarizeByReturnStatus (group keys = sort keys, naturally one operation).
  Add a new section 4 'Aside' that frames NarwhalsFrameTransform as an
  opt-in convention rather than a requirement, with a tradeoffs table for
  plain function vs plain BaseModel vs NarwhalsFrameTransform.

- Tests: add coverage for plain callable + plain ccflow.BaseModel inside
  SequenceTransform (round-trips via type_), context forwarding to source,
  other_context flow-through, and other_context type-mismatch rejection
  (38 tests total, full suite 687 passed).

Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
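
The context forwarding and other_context check described in the commit above can be illustrated with a small stdlib-only sketch. All names here (`TableContext`, `TableSource`, `Join`, `run_pipeline`) are hypothetical stand-ins and do not reproduce the ccflow classes; frames are assumed to be lists of row dicts. The point is that one context-keyed source instance can serve both the pipeline input and the joined table.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

Frame = List[Dict[str, Any]]


@dataclass(frozen=True)
class TableContext:
    """Hypothetical context that selects one table by name."""
    table: str


class TableSource:
    """Hypothetical context-keyed source: one instance serves several tables."""
    context_type = TableContext

    def __init__(self, tables: Dict[str, Frame]) -> None:
        self._tables = tables

    def __call__(self, context: TableContext) -> Frame:
        return self._tables[context.table]


class Join:
    """Inner-joins `other(other_context)` onto the input on a shared key."""

    def __init__(self, other: TableSource, other_context: Any, on: str) -> None:
        # Reject a mismatched context at construction time, not at call time.
        if not isinstance(other_context, other.context_type):
            raise TypeError(
                f"other_context must be {other.context_type.__name__}, "
                f"got {type(other_context).__name__}"
            )
        self.other, self.other_context, self.on = other, other_context, on

    def __call__(self, frame: Frame) -> Frame:
        right = {row[self.on]: row for row in self.other(self.other_context)}
        return [{**row, **right[row[self.on]]} for row in frame if row[self.on] in right]


def run_pipeline(source: TableSource, transforms: list, context: TableContext) -> Frame:
    """Forward the caller's context to the source, then apply each transform."""
    frame = source(context)
    for transform in transforms:
        frame = transform(frame)
    return frame


generator = TableSource({
    "orders": [{"custkey": 1, "total": 9.5}, {"custkey": 2, "total": 3.0}],
    "customer": [{"custkey": 1, "name": "Ada"}],
})
join_customer = Join(generator, TableContext("customer"), on="custkey")
result = run_pipeline(generator, [join_customer], TableContext("orders"))
```

Passing a bare string such as `"customer"` as `other_context` raises `TypeError` when the `Join` is constructed, which mirrors the type-mismatch rejection the tests are said to cover.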

2 participants