Skip to content

Commit ff8b6f4

Browse files
ptomecekCopilot
andcommitted
Refactor TPC-H example to a registry-driven model graph
Reworks ccflow/examples/tpch to be more illustrative of how ccflow is used in practice (registry of typed providers wired in YAML), instead of having a single context-dispatched data generator and query runner. Changes: * Drop TPCHTableContext and TPCHQueryContext; both providers and the query runner now take NullContext. The discriminator (table name / query id) becomes a Pydantic field, so each instance has a fixed output schema. * Split the old TPCHDataGenerator into: - TPCHDuckDBBackend (BaseModel) — shared duckdb connection, runs dbgen exactly once. Lives in the registry as /tpch/backend so every provider that references it shares one connection. - TPCHTableProvider (CallableModel) — one instance per table. - TPCHAnswerProvider (CallableModel) — one instance per query id. * Replace TPCHQueryRunner with a generic TPCHQuery(CallableModel) whose query_id and tuple of input table providers are explicit Pydantic fields, replacing the hidden _QUERY_TABLE_MAP side-table. * Author config/conf.yaml as a heavily-commented teaching example showing the BaseModel/CallableModel distinction, registry cross-references via /abs/path strings, and how a single backend override (e.g. tpch.backend.scale_factor=1.0) flows through to all 22+8 providers. * Add a load_config() helper modeled on the etl example, with a package docstring showing the canonical 'load_config -> registry -> call' usage pattern. * Rewrite tests to drive everything off the loaded registry, looking up table/<n>, answer/Q<n>, and query/Q<n> entries by path. All 22 query results still match the canonical DuckDB answers at sf=0.1. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
1 parent a16f19b commit ff8b6f4

6 files changed

Lines changed: 501 additions & 157 deletions

File tree

ccflow/examples/tpch/__init__.py

Lines changed: 77 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,77 @@
1-
from .base import *
2-
from .data_generators import *
3-
from .query import *
1+
"""TPC-H example for ccflow.
2+
3+
This package is a *teaching* example showing how to compose a workflow from
4+
``CallableModel``s wired together through the ``ModelRegistry``. The
5+
canonical usage is::
6+
7+
from ccflow import ModelRegistry
8+
from ccflow.examples.tpch import load_config
9+
10+
load_config() # populate the root ModelRegistry from conf.yaml
11+
registry = ModelRegistry.root()
12+
result = registry["/query/Q1"]() # run TPC-H query 1
13+
print(result.df.to_native())
14+
15+
To run the same example at a different TPC-H scale factor, override the
16+
single shared backend on load (every table / answer / query references it,
17+
so the change flows through everywhere)::
18+
19+
load_config(overrides=["tpch.backend.scale_factor=1.0"])
20+
"""
21+
22+
from pathlib import Path
23+
from typing import List, Optional
24+
25+
from ccflow import RootModelRegistry, load_config as _load_config_base
26+
27+
from .data_generators import TPCHAnswerProvider, TPCHDuckDBBackend, TPCHTable, TPCHTableProvider
28+
from .query import TPCHQuery
29+
30+
__all__ = (
31+
"TPCHTable",
32+
"TPCHDuckDBBackend",
33+
"TPCHTableProvider",
34+
"TPCHAnswerProvider",
35+
"TPCHQuery",
36+
"load_config",
37+
)
38+
39+
40+
def load_config(
41+
config_dir: str = "",
42+
config_name: str = "",
43+
overrides: Optional[List[str]] = None,
44+
*,
45+
overwrite: bool = True,
46+
basepath: str = "",
47+
) -> RootModelRegistry:
48+
"""Load the TPC-H example registry into the root ``ModelRegistry``.
49+
50+
Pass hydra-style ``overrides`` to reconfigure entries on load — most
51+
usefully ``["tpch.backend.scale_factor=1.0"]`` to run the example at a
52+
different TPC-H scale factor. Every table / answer / query references the
53+
single ``/tpch/backend`` entry, so this one override flows through to all
54+
22+8 providers.
55+
56+
Arguments:
57+
config_dir: Optional extra hydra config directory to overlay on top
58+
of the bundled ``config/conf.yaml``. Empty string (the default)
59+
means "use only the bundled config".
60+
config_name: Optional config name within ``config_dir`` to load.
61+
overrides: Hydra override strings, e.g.
62+
``["tpch.backend.scale_factor=1.0"]``.
63+
overwrite: When True (the default), entries already present in the
64+
registry are replaced. This is what you want in notebooks where
65+
you re-call ``load_config()`` after tweaking overrides; set to
66+
False to require a fresh registry.
67+
basepath: Base path for resolving a relative ``config_dir``.
68+
"""
69+
return _load_config_base(
70+
root_config_dir=str(Path(__file__).resolve().parent / "config"),
71+
root_config_name="conf",
72+
config_dir=config_dir,
73+
config_name=config_name,
74+
overrides=overrides,
75+
overwrite=overwrite,
76+
basepath=basepath,
77+
)

ccflow/examples/tpch/base.py

Lines changed: 0 additions & 22 deletions
This file was deleted.
Lines changed: 262 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,262 @@
1+
# TPC-H example registry.
2+
#
3+
# This file is loaded by ``ccflow.examples.tpch.load_config()`` into the root
4+
# ``ModelRegistry`` and demonstrates several ccflow features:
5+
#
6+
# 1. A *flat* model graph defined entirely in YAML — Python code defines the
7+
# classes (``TPCHDuckDBBackend``, ``TPCHTableProvider``, ``TPCHAnswerProvider``,
8+
# ``TPCHQuery``); this file decides which instances exist and how they are
9+
# wired together.
10+
# 2. Cross-references between registry entries. Strings beginning with ``/``
11+
# are absolute paths into the registry; ccflow's pydantic validators
12+
# resolve them to the actual configured Python instance at config-load
13+
# time. Resolution is by reference (not copy), so every provider below
14+
# points at the *same* ``/tpch/backend`` instance, and ``dbgen`` runs
15+
# exactly once for the whole registry. Order within this file does not
16+
# matter — references are resolved after the whole file is parsed.
17+
# 3. Explicit dependencies on a generic ``CallableModel``. ``TPCHQuery`` has
18+
# an ``inputs: tuple[CallableModel[NullContext, NarwhalsFrameResult], ...]``
19+
# field; the registry resolves each ``/table/<name>`` reference into the
20+
# corresponding ``TPCHTableProvider`` instance, so each query's table
21+
# dependencies are first-class fields on that query's model instance.
22+
# 4. ``scale_factor`` lives on a single backend entry, so loading the same
23+
# config with a hydra override
24+
# (``load_config(overrides=["tpch.backend.scale_factor=1.0"])``) reconfigures
25+
# every table, answer and query consistently.
26+
27+
# ---------------------------------------------------------------------------
28+
# Shared DuckDB backend. Plain ``ccflow.BaseModel`` — not callable itself,
29+
# but registered so all providers share one connection and one ``dbgen`` call.
30+
# ---------------------------------------------------------------------------
31+
tpch:
32+
backend:
33+
_target_: ccflow.examples.tpch.TPCHDuckDBBackend
34+
scale_factor: 0.1
35+
36+
# ---------------------------------------------------------------------------
37+
# Per-table providers. One instance per TPC-H table; the output schema of
38+
# each instance is fixed by its ``table`` field.
39+
# ---------------------------------------------------------------------------
40+
table:
41+
customer:
42+
_target_: ccflow.examples.tpch.TPCHTableProvider
43+
backend: /tpch/backend
44+
table: customer
45+
lineitem:
46+
_target_: ccflow.examples.tpch.TPCHTableProvider
47+
backend: /tpch/backend
48+
table: lineitem
49+
nation:
50+
_target_: ccflow.examples.tpch.TPCHTableProvider
51+
backend: /tpch/backend
52+
table: nation
53+
orders:
54+
_target_: ccflow.examples.tpch.TPCHTableProvider
55+
backend: /tpch/backend
56+
table: orders
57+
part:
58+
_target_: ccflow.examples.tpch.TPCHTableProvider
59+
backend: /tpch/backend
60+
table: part
61+
partsupp:
62+
_target_: ccflow.examples.tpch.TPCHTableProvider
63+
backend: /tpch/backend
64+
table: partsupp
65+
region:
66+
_target_: ccflow.examples.tpch.TPCHTableProvider
67+
backend: /tpch/backend
68+
table: region
69+
supplier:
70+
_target_: ccflow.examples.tpch.TPCHTableProvider
71+
backend: /tpch/backend
72+
table: supplier
73+
74+
# ---------------------------------------------------------------------------
75+
# Reference answers, one per query, served straight from DuckDB's
76+
# ``tpch_answers()`` table at the configured scale factor.
77+
# ---------------------------------------------------------------------------
78+
answer:
79+
Q1:
80+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
81+
backend: /tpch/backend
82+
query_id: 1
83+
Q2:
84+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
85+
backend: /tpch/backend
86+
query_id: 2
87+
Q3:
88+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
89+
backend: /tpch/backend
90+
query_id: 3
91+
Q4:
92+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
93+
backend: /tpch/backend
94+
query_id: 4
95+
Q5:
96+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
97+
backend: /tpch/backend
98+
query_id: 5
99+
Q6:
100+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
101+
backend: /tpch/backend
102+
query_id: 6
103+
Q7:
104+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
105+
backend: /tpch/backend
106+
query_id: 7
107+
Q8:
108+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
109+
backend: /tpch/backend
110+
query_id: 8
111+
Q9:
112+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
113+
backend: /tpch/backend
114+
query_id: 9
115+
Q10:
116+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
117+
backend: /tpch/backend
118+
query_id: 10
119+
Q11:
120+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
121+
backend: /tpch/backend
122+
query_id: 11
123+
Q12:
124+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
125+
backend: /tpch/backend
126+
query_id: 12
127+
Q13:
128+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
129+
backend: /tpch/backend
130+
query_id: 13
131+
Q14:
132+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
133+
backend: /tpch/backend
134+
query_id: 14
135+
Q15:
136+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
137+
backend: /tpch/backend
138+
query_id: 15
139+
Q16:
140+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
141+
backend: /tpch/backend
142+
query_id: 16
143+
Q17:
144+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
145+
backend: /tpch/backend
146+
query_id: 17
147+
Q18:
148+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
149+
backend: /tpch/backend
150+
query_id: 18
151+
Q19:
152+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
153+
backend: /tpch/backend
154+
query_id: 19
155+
Q20:
156+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
157+
backend: /tpch/backend
158+
query_id: 20
159+
Q21:
160+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
161+
backend: /tpch/backend
162+
query_id: 21
163+
Q22:
164+
_target_: ccflow.examples.tpch.TPCHAnswerProvider
165+
backend: /tpch/backend
166+
query_id: 22
167+
168+
# ---------------------------------------------------------------------------
169+
# The 22 TPC-H queries. Each ``TPCHQuery`` is the same Python class with a
170+
# different ``query_id`` and a different tuple of table-provider inputs.
171+
# Wiring the inputs in YAML makes each query's table dependencies explicit
172+
# and overridable per-query.
173+
# ---------------------------------------------------------------------------
174+
query:
175+
Q1:
176+
_target_: ccflow.examples.tpch.TPCHQuery
177+
query_id: 1
178+
inputs: [/table/lineitem]
179+
Q2:
180+
_target_: ccflow.examples.tpch.TPCHQuery
181+
query_id: 2
182+
inputs: [/table/region, /table/nation, /table/supplier, /table/part, /table/partsupp]
183+
Q3:
184+
_target_: ccflow.examples.tpch.TPCHQuery
185+
query_id: 3
186+
inputs: [/table/customer, /table/lineitem, /table/orders]
187+
Q4:
188+
_target_: ccflow.examples.tpch.TPCHQuery
189+
query_id: 4
190+
inputs: [/table/lineitem, /table/orders]
191+
Q5:
192+
_target_: ccflow.examples.tpch.TPCHQuery
193+
query_id: 5
194+
inputs: [/table/region, /table/nation, /table/customer, /table/lineitem, /table/orders, /table/supplier]
195+
Q6:
196+
_target_: ccflow.examples.tpch.TPCHQuery
197+
query_id: 6
198+
inputs: [/table/lineitem]
199+
Q7:
200+
_target_: ccflow.examples.tpch.TPCHQuery
201+
query_id: 7
202+
inputs: [/table/nation, /table/customer, /table/lineitem, /table/orders, /table/supplier]
203+
Q8:
204+
_target_: ccflow.examples.tpch.TPCHQuery
205+
query_id: 8
206+
inputs: [/table/part, /table/supplier, /table/lineitem, /table/orders, /table/customer, /table/nation, /table/region]
207+
Q9:
208+
_target_: ccflow.examples.tpch.TPCHQuery
209+
query_id: 9
210+
inputs: [/table/part, /table/partsupp, /table/nation, /table/lineitem, /table/orders, /table/supplier]
211+
Q10:
212+
_target_: ccflow.examples.tpch.TPCHQuery
213+
query_id: 10
214+
inputs: [/table/customer, /table/nation, /table/lineitem, /table/orders]
215+
Q11:
216+
_target_: ccflow.examples.tpch.TPCHQuery
217+
query_id: 11
218+
inputs: [/table/nation, /table/partsupp, /table/supplier]
219+
Q12:
220+
_target_: ccflow.examples.tpch.TPCHQuery
221+
query_id: 12
222+
inputs: [/table/lineitem, /table/orders]
223+
Q13:
224+
_target_: ccflow.examples.tpch.TPCHQuery
225+
query_id: 13
226+
inputs: [/table/customer, /table/orders]
227+
Q14:
228+
_target_: ccflow.examples.tpch.TPCHQuery
229+
query_id: 14
230+
inputs: [/table/lineitem, /table/part]
231+
Q15:
232+
_target_: ccflow.examples.tpch.TPCHQuery
233+
query_id: 15
234+
inputs: [/table/lineitem, /table/supplier]
235+
Q16:
236+
_target_: ccflow.examples.tpch.TPCHQuery
237+
query_id: 16
238+
inputs: [/table/part, /table/partsupp, /table/supplier]
239+
Q17:
240+
_target_: ccflow.examples.tpch.TPCHQuery
241+
query_id: 17
242+
inputs: [/table/lineitem, /table/part]
243+
Q18:
244+
_target_: ccflow.examples.tpch.TPCHQuery
245+
query_id: 18
246+
inputs: [/table/customer, /table/lineitem, /table/orders]
247+
Q19:
248+
_target_: ccflow.examples.tpch.TPCHQuery
249+
query_id: 19
250+
inputs: [/table/lineitem, /table/part]
251+
Q20:
252+
_target_: ccflow.examples.tpch.TPCHQuery
253+
query_id: 20
254+
inputs: [/table/part, /table/partsupp, /table/nation, /table/lineitem, /table/supplier]
255+
Q21:
256+
_target_: ccflow.examples.tpch.TPCHQuery
257+
query_id: 21
258+
inputs: [/table/lineitem, /table/nation, /table/orders, /table/supplier]
259+
Q22:
260+
_target_: ccflow.examples.tpch.TPCHQuery
261+
query_id: 22
262+
inputs: [/table/customer, /table/orders]

0 commit comments

Comments
 (0)