Commit 2db5c42

Hore01 authored and skrawcz committed
docs(reuse): add User Guide page on reusing functions and nodes (#1045)
Walks through the four Hamilton patterns for reusing the same logic across different data sources, regions, or runtime contexts:

- driving the same function module from multiple Drivers (examples/feature_engineering/feature_engineering_multiple_contexts/)
- Builder.allow_module_overrides() for swapping same-named functions (examples/module_overrides/)
- @subdag for repeating a transformation graph inside one Driver (examples/reusing_functions/)
- @parameterized_subdag for many subdags at once, with the readability warning quoted verbatim from the source docstring

Every snippet on the page is a literalinclude pulled from the example files, so the docs and the runnable code can't drift apart.

Closes #1045

Signed-off-by: Olajumoke Akinremi <106763970+Hore01@users.noreply.github.com>
1 parent f566ba2 commit 2db5c42

2 files changed

Lines changed: 263 additions & 0 deletions

docs/how-tos/index.rst

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ directory. If there's an example you want but don't see, reach out or open an is
1212
load-data
1313
caching-tutorial
1414
use-for-feature-engineering
15+
reuse-nodes
1516
ml-training
1617
llm-workflows
1718
run-data-quality-checks

docs/how-tos/reuse-nodes.rst

Lines changed: 262 additions & 0 deletions
@@ -0,0 +1,262 @@
1+
..
2+
Licensed to the Apache Software Foundation (ASF) under one
3+
or more contributor license agreements. See the NOTICE file
4+
distributed with this work for additional information
5+
regarding copyright ownership. The ASF licenses this file
6+
to you under the Apache License, Version 2.0 (the
7+
"License"); you may not use this file except in compliance
8+
with the License. You may obtain a copy of the License at
9+
10+
http://www.apache.org/licenses/LICENSE-2.0
11+
12+
Unless required by applicable law or agreed to in writing,
13+
software distributed under the License is distributed on an
14+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
KIND, either express or implied. See the License for the
16+
specific language governing permissions and limitations
17+
under the License.
18+
19+
==========================================
20+
Reusing functions and nodes
21+
==========================================
22+
23+
A common question on `Slack <https://join.slack.com/t/hamilton-opensource/shared_invite/zt-2niepkra8-DGKGf_tTYhXuJWBTXtIs4g>`_:
24+
*"I want to run the same logic for several regions / datasets / model variants
25+
-- what is the Hamilton way?"* Hamilton has four answers, and the right one
26+
depends on how the variation is shaped.
27+
28+
This page walks through them in order from simplest to most advanced:
29+
30+
1. **Reuse a function module across multiple Drivers** -- the data is what
31+
varies, the dataflow is the same.
32+
2. **Override a module with another that has the same function names** --
33+
one or two specific functions need to be swapped (e.g. for testing or for a
34+
different runtime context).
35+
3. **Use** ``@subdag`` -- you want the *same* transformation graph evaluated
36+
several times *inside one Driver*, with different inputs or config.
37+
4. **Use** ``@parameterized_subdag`` -- the variation is large enough that
38+
writing one ``@subdag`` per case becomes tedious. (Advanced.)
39+
40+
Every code sample below is taken from a runnable example in the
41+
`examples folder <https://github.com/apache/hamilton/tree/main/examples>`_, so
42+
you can copy any of them, run them locally, and adapt them.
43+
44+
45+
1. Reuse a function module across multiple Drivers
46+
--------------------------------------------------
47+
48+
If the *dataflow* is the same and only the *data* changes, you do not need
49+
any decorator -- you just import the function module and build a Driver
50+
wherever you need one. This is the most common form of reuse and the one to
51+
reach for first.
52+
53+
The
54+
`feature_engineering_multiple_contexts <https://github.com/apache/hamilton/tree/main/examples/feature_engineering/feature_engineering_multiple_contexts>`_
55+
example shows this pattern across an offline ETL and an online FastAPI
56+
service: ``features.py`` is written **once**, then driven from two contexts.
57+
The offline ETL builds a Driver and executes it on a batch of rows; the
58+
online server builds another Driver from the same module and executes it
59+
per-request.
60+
61+
When to reach for this pattern:
62+
63+
* The same feature definitions need to run in batch *and* in a request
64+
handler.
65+
* You want to share code between training and inference.
66+
* You want different teams to consume the same canonical module with their
67+
own inputs.
68+
69+
What you do *not* need: any Hamilton-specific decorator. The reuse is just
70+
ordinary Python imports plus building a Driver per context.
71+
72+
73+
2. Override a module to swap same-named functions
74+
-------------------------------------------------
75+
76+
Sometimes you want most of a dataflow to stay the same and only swap one
77+
or two functions -- for example, replacing a real data loader with a mock
78+
one in tests, or switching between two implementations of the same business
79+
rule.
80+
81+
By default Hamilton refuses to build a DAG when two modules define
82+
functions with the same name, because the resulting graph would be
83+
ambiguous. The
84+
`module_overrides <https://github.com/apache/hamilton/tree/main/examples/module_overrides>`_
85+
example shows how to opt in to a "later wins" rule with
86+
``Builder.allow_module_overrides()``:
87+
88+
.. literalinclude:: ../../examples/module_overrides/module_a.py
89+
:language: python
90+
:lines: 19-
91+
:caption: ``examples/module_overrides/module_a.py``
92+
93+
.. literalinclude:: ../../examples/module_overrides/module_b.py
94+
:language: python
95+
:lines: 19-
96+
:caption: ``examples/module_overrides/module_b.py``
97+
98+
.. literalinclude:: ../../examples/module_overrides/run.py
99+
:language: python
100+
:lines: 18-
101+
:caption: ``examples/module_overrides/run.py``
102+
103+
When ``allow_module_overrides()`` is set, the function from the
104+
module listed **later** in ``with_modules()`` wins, so the example above prints
105+
``"This is module b."``.
106+
107+
When to reach for this pattern:
108+
109+
* You have a stable dataflow but want a small, well-named seam for swapping
110+
in a test double, a mock data source, or an environment-specific function.
111+
* You want the swap to be visible in the Driver-construction code, rather
112+
than buried inside a function or a config flag.
113+
114+
When *not* to reach for this pattern:
115+
116+
* If many functions need to vary, prefer keeping the variations in distinct
117+
modules and choosing which one to import. Module overrides are best as a
118+
surgical tool.
119+
120+
121+
3. ``@subdag`` -- repeat the same transform inside one Driver
122+
-------------------------------------------------------------
123+
124+
Sometimes you want the *same* transformation graph evaluated several times
125+
*inside the same DAG*, each time with a different input or configuration --
126+
for example, computing unique-user counts at daily / weekly / monthly grains
127+
across two regions.
128+
129+
The ``@subdag`` decorator from ``hamilton.function_modifiers`` does this
130+
declaratively. From the source documentation:
131+
132+
``@subdag`` enables you to rerun components of your DAG with varying
133+
parameters. That is, it enables you to "chain" what you could express
134+
with a Driver into a single DAG.
135+
136+
The
137+
`reusing_functions <https://github.com/apache/hamilton/tree/main/examples/reusing_functions>`_
138+
example computes ``unique_users`` for two regions and three time grains.
139+
The shared logic lives in ``unique_users.py``:
140+
141+
.. literalinclude:: ../../examples/reusing_functions/unique_users.py
142+
:language: python
143+
:lines: 18-
144+
:caption: ``examples/reusing_functions/unique_users.py``
145+
146+
Then in ``reusable_subdags.py``, each ``@subdag`` declaration creates one
147+
named instance of that subgraph, with its own ``inputs`` and ``config``:
148+
149+
.. literalinclude:: ../../examples/reusing_functions/reusable_subdags.py
150+
:language: python
151+
:pyobject: daily_unique_users_US
152+
:caption: One @subdag invocation from ``examples/reusing_functions/reusable_subdags.py``
153+
154+
Each decorated function:
155+
156+
* Takes the *output* of its sub-DAG as its argument. Above, the sub-DAG ends
157+
in ``unique_users``, so the wrapping function receives ``unique_users:
158+
pd.Series`` and returns it (perhaps after post-processing).
159+
* Receives ``inputs={"grain": value("day")}`` -- this binds the sub-DAG
160+
input ``grain`` to the literal ``"day"`` for *this instance only*.
161+
* Receives ``config={"region": "US"}`` -- this scopes Hamilton's
162+
``@config.when`` selection to ``"US"`` for this sub-DAG.
163+
164+
The same module then defines five more analogous functions (``weekly_*``,
165+
``monthly_*``, the ``CA`` variants), giving six nodes that all reuse the
166+
same underlying definitions.
167+
168+
Two parameters worth knowing:
169+
170+
* ``namespace`` -- a string prefix for the nodes that ``@subdag`` materialises.
171+
By default Hamilton uses the wrapping function's name, which is normally
172+
what you want.
173+
* ``external_inputs`` -- declare any function parameter that comes from
174+
*outside* the sub-DAG (e.g. from the surrounding DAG). This makes the
175+
boundary between the sub-DAG and its surroundings explicit.
176+
177+
When to reach for this pattern:
178+
179+
* You want one Driver, one visualised DAG, and one ``execute`` call to
180+
produce all the variants -- rather than a Python ``for`` loop over many
181+
Drivers in your application code.
182+
* You want lineage and execution metadata for every variant captured by
183+
Hamilton, not by a wrapper script.
184+
185+
186+
4. ``@parameterized_subdag`` -- many subdags at once (advanced)
187+
---------------------------------------------------------------
188+
189+
If you have *many* subdags that differ only along a small number of
190+
parameters, writing one ``@subdag`` declaration per case becomes verbose.
191+
``@parameterized_subdag`` is syntactic sugar that produces several subdags
192+
from a single decorator -- analogous to how ``@parameterize`` produces
193+
several nodes from one function.
194+
195+
From the
196+
`source documentation <https://github.com/apache/hamilton/blob/main/hamilton/function_modifiers/recursive.py>`_:
197+
198+
.. code-block:: python
199+
200+
@parameterized_subdag(
201+
feature_modules,
202+
from_datasource_1={"inputs": {"data": value("datasource_1.csv")}},
203+
from_datasource_2={"inputs": {"data": value("datasource_2.csv")}},
204+
from_datasource_3={
205+
"inputs": {"data": value("datasource_3.csv")},
206+
"config": {"filter": "only_even_client_ids"},
207+
},
208+
)
209+
def feature_engineering(feature_df: pd.DataFrame) -> pd.DataFrame:
210+
return feature_df
211+
212+
Each keyword argument passed to the decorator becomes one subdag, all built from the same
213+
``feature_modules`` but with different inputs / config.
214+
215+
The Hamilton source itself includes a deliberate warning on this decorator:
216+
217+
Think about whether this feature is really the one you want -- often
218+
times, verbose, static DAGs are far more readable than very concise,
219+
highly parameterized DAGs.
220+
221+
In practice: prefer the explicit form from section 3 until the repetition
222+
genuinely hurts. Reach for ``@parameterized_subdag`` when the parameter
223+
list comes from elsewhere (e.g. a config file resolved with ``@resolve``)
224+
or when you have a dozen-plus near-identical subdags.
225+
226+
The full reference for both decorators lives at:
227+
228+
* :doc:`/reference/decorators/subdag`
229+
* :doc:`/reference/decorators/parameterize_subdag`
230+
231+
232+
Choosing between the four patterns
233+
----------------------------------
234+
235+
A short decision tree:
236+
237+
* **The data varies, the code does not** → just build another Driver from
238+
the same module (section 1).
239+
* **One or two named functions need to be swapped** → put the swaps in
240+
another module and use ``allow_module_overrides()`` (section 2).
241+
* **You want N copies of the same transform graph in one DAG** → use
242+
``@subdag`` (section 3).
243+
* **You have many copies and the parameter list is itself data** → consider
244+
``@parameterized_subdag`` (section 4).
245+
246+
In practice, most production Hamilton projects rely heavily on (1), use (2)
247+
sparingly for testing seams, reach for (3) when modeling per-segment or
248+
per-grain pipelines, and treat (4) as an advanced tool.
249+
250+
251+
Where to go from here
252+
---------------------
253+
254+
* Walk through the runnable examples linked above:
255+
`feature_engineering_multiple_contexts <https://github.com/apache/hamilton/tree/main/examples/feature_engineering/feature_engineering_multiple_contexts>`_,
256+
`module_overrides <https://github.com/apache/hamilton/tree/main/examples/module_overrides>`_,
257+
and
258+
`reusing_functions <https://github.com/apache/hamilton/tree/main/examples/reusing_functions>`_.
259+
* Read :doc:`/concepts/best-practices/code-organization` for the module
260+
layout that makes these patterns natural.
261+
* For an end-to-end deep-dive on subdags and reuse, see the
262+
`Hamilton March 2024 Meetup tutorial notebook <https://github.com/DAGWorks-Inc/hamilton-tutorials/blob/main/2024-03-19/march-meetup.ipynb>`_.
