| 1 | +.. |
| 2 | + Licensed to the Apache Software Foundation (ASF) under one |
| 3 | + or more contributor license agreements. See the NOTICE file |
| 4 | + distributed with this work for additional information |
| 5 | + regarding copyright ownership. The ASF licenses this file |
| 6 | + to you under the Apache License, Version 2.0 (the |
| 7 | + "License"); you may not use this file except in compliance |
| 8 | + with the License. You may obtain a copy of the License at |
| 9 | + |
| 10 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 11 | + |
| 12 | + Unless required by applicable law or agreed to in writing, |
| 13 | + software distributed under the License is distributed on an |
| 14 | + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| 15 | + KIND, either express or implied. See the License for the |
| 16 | + specific language governing permissions and limitations |
| 17 | + under the License. |
| 18 | + |
| 19 | +========================================== |
| 20 | +Reusing functions and nodes |
| 21 | +========================================== |
| 22 | + |
| 23 | +A common question on `Slack <https://join.slack.com/t/hamilton-opensource/shared_invite/zt-2niepkra8-DGKGf_tTYhXuJWBTXtIs4g>`_: |
| 24 | +*"I want to run the same logic for several regions / datasets / model variants |
| 25 | +-- what is the Hamilton way?"* Hamilton has four answers, and the right one |
| 26 | +depends on how the variation is shaped. |
| 27 | + |
| 28 | +This page walks through them in order from simplest to most advanced: |
| 29 | + |
| 30 | +1. **Reuse a function module across multiple Drivers** -- the data is what |
| 31 | + varies, the dataflow is the same. |
| 32 | +2. **Override a module with another that has the same function names** -- |
| 33 | + one or two specific functions need to be swapped (e.g. for testing or for a |
| 34 | + different runtime context). |
| 35 | +3. **Use** ``@subdag`` -- you want the *same* transformation graph evaluated |
| 36 | + several times *inside one Driver*, with different inputs or config. |
| 37 | +4. **Use** ``@parameterized_subdag`` -- the variation is large enough that |
| 38 | + writing one ``@subdag`` per case becomes tedious. (Advanced.) |
| 39 | + |
| 40 | +The ``literalinclude`` samples below are taken from runnable examples in the |
| 41 | +`examples folder <https://github.com/apache/hamilton/tree/main/examples>`_, so |
| 42 | +you can copy any of them, run them locally, and adapt them. |
| 43 | + |
| 44 | + |
| 45 | +1. Reuse a function module across multiple Drivers |
| 46 | +-------------------------------------------------- |
| 47 | + |
| 48 | +If the *dataflow* is the same and only the *data* changes, you do not need |
| 49 | +any decorator -- you just import the function module and build a Driver |
| 50 | +wherever you need one. This is the most common form of reuse and the one to |
| 51 | +reach for first. |
| 52 | + |
| 53 | +The |
| 54 | +`feature_engineering_multiple_contexts <https://github.com/apache/hamilton/tree/main/examples/feature_engineering/feature_engineering_multiple_contexts>`_ |
| 55 | +example shows this pattern across an offline ETL and an online FastAPI |
| 56 | +service: ``features.py`` is written **once**, then driven from two contexts. |
| 57 | +The offline ETL builds a Driver and executes it on a batch of rows; the |
| 58 | +online server builds another Driver from the same module and executes it |
| 59 | +per-request. |
| 60 | + |
| 61 | +When to reach for this pattern: |
| 62 | + |
| 63 | +* The same feature definitions need to run in batch *and* in a request |
| 64 | + handler. |
| 65 | +* You want to share code between training and inference. |
| 66 | +* You want different teams to consume the same canonical module with their |
| 67 | + own inputs. |
| 68 | + |
| 69 | +What you do *not* need: any Hamilton-specific decorator. The reuse is just |
| 70 | +ordinary Python imports plus building a Driver per context. |
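To make the idea concrete without depending on Hamilton itself, here is a minimal plain-Python sketch. ``ToyDriver``, ``FEATURES``, and the three feature functions are hypothetical stand-ins invented for illustration; they are not Hamilton's API and not the contents of the linked example. The sketch only shows the shape of the pattern: one set of function definitions, two independently built drivers, different inputs.

```python
import inspect


# Toy "feature module": each function name is an output and each parameter
# name is a dependency. These functions are illustrative, not from the example.
def age_bucket(age: int) -> str:
    return "adult" if age >= 18 else "minor"


def is_local(country: str) -> bool:
    return country == "US"


def adult_local(age_bucket: str, is_local: bool) -> bool:
    return age_bucket == "adult" and is_local


FEATURES = {fn.__name__: fn for fn in (age_bucket, is_local, adult_local)}


class ToyDriver:
    """Hypothetical stand-in for a Driver: resolves requested outputs by name."""

    def __init__(self, functions):
        self.functions = dict(functions)

    def execute(self, outputs, inputs):
        cache = dict(inputs)  # caller-supplied inputs are already "computed"

        def resolve(name):
            if name not in cache:
                fn = self.functions[name]
                params = inspect.signature(fn).parameters
                cache[name] = fn(**{p: resolve(p) for p in params})
            return cache[name]

        return {name: resolve(name) for name in outputs}


# Offline context: one driver, applied to every row of a batch.
batch_driver = ToyDriver(FEATURES)
rows = [{"age": 30, "country": "US"}, {"age": 12, "country": "CA"}]
batch_results = [batch_driver.execute(["adult_local"], row) for row in rows]

# Online context: a second driver built from the same definitions, per request.
online_driver = ToyDriver(FEATURES)
request_result = online_driver.execute(
    ["age_bucket", "adult_local"], {"age": 45, "country": "CA"}
)
```

The feature functions never learn which context called them; only the code that builds the driver and supplies the inputs differs.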
| 71 | + |
| 72 | + |
| 73 | +2. Override a module to swap same-named functions |
| 74 | +------------------------------------------------- |
| 75 | + |
| 76 | +Sometimes you want most of a dataflow to stay the same and only swap one |
| 77 | +or two functions -- for example, replacing a real data loader with a mock |
| 78 | +one in tests, or switching between two implementations of the same business |
| 79 | +rule. |
| 80 | + |
| 81 | +By default Hamilton refuses to build a DAG when two modules define |
| 82 | +functions with the same name, because the resulting graph would be |
| 83 | +ambiguous. The |
| 84 | +`module_overrides <https://github.com/apache/hamilton/tree/main/examples/module_overrides>`_ |
| 85 | +example shows how to opt in to a "later wins" rule with |
| 86 | +``Builder.allow_module_overrides()``: |
| 87 | + |
| 88 | +.. literalinclude:: ../../examples/module_overrides/module_a.py |
| 89 | + :language: python |
| 90 | + :lines: 19- |
| 91 | + :caption: ``examples/module_overrides/module_a.py`` |
| 92 | + |
| 93 | +.. literalinclude:: ../../examples/module_overrides/module_b.py |
| 94 | + :language: python |
| 95 | + :lines: 19- |
| 96 | + :caption: ``examples/module_overrides/module_b.py`` |
| 97 | + |
| 98 | +.. literalinclude:: ../../examples/module_overrides/run.py |
| 99 | + :language: python |
| 100 | + :lines: 18- |
| 101 | + :caption: ``examples/module_overrides/run.py`` |
| 102 | + |
| 103 | +When ``allow_module_overrides()`` is set, the function from the module |
| 104 | +passed **later** to the Builder wins, so the example above prints |
| 105 | +``"This is module b."``. |
| 106 | + |
| 107 | +When to reach for this pattern: |
| 108 | + |
| 109 | +* You have a stable dataflow but want a small, well-named seam for swapping |
| 110 | + in a test double, a mock data source, or an environment-specific function. |
| 111 | +* You want the swap to be visible in the Driver-construction code, rather |
| 112 | + than buried inside a function or a config flag. |
| 113 | + |
| 114 | +When *not* to reach for this pattern: |
| 115 | + |
| 116 | +* If many functions need to vary, prefer keeping the variations in distinct |
| 117 | + modules and choosing which one to import. Module overrides are best as a |
| 118 | + surgical tool. |
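The "later wins" rule can be pictured as a left-to-right merge of name-to-function maps. The sketch below is a plain-Python model of that rule only; ``build_function_index`` and the tuple-based "modules" are hypothetical and do not reflect how Hamilton implements overrides internally.

```python
def build_function_index(modules, allow_overrides=False):
    """Merge {name: function} maps from left to right.

    Without the opt-in flag, a duplicate name is treated as an error,
    which models the default refusal to build an ambiguous graph.
    """
    index = {}
    for module_name, functions in modules:
        for name, fn in functions.items():
            if name in index and not allow_overrides:
                raise ValueError(f"duplicate function {name!r} in {module_name}")
            index[name] = fn  # a later module replaces an earlier one
    return index


# Two toy "modules" that both define a function called foo (assumed name).
module_a = ("module_a", {"foo": lambda: "This is module a."})
module_b = ("module_b", {"foo": lambda: "This is module b."})

# Default behaviour in this model: the clash is rejected.
try:
    build_function_index([module_a, module_b])
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

# Opt-in behaviour: the module listed last supplies the implementation.
index = build_function_index([module_a, module_b], allow_overrides=True)
winner = index["foo"]()
```

Swapping the order of the two modules in the list would make ``module_a`` win, which is why the ordering is worth keeping visible in the driver-construction code.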
| 119 | + |
| 120 | + |
| 121 | +3. ``@subdag`` -- repeat the same transform inside one Driver |
| 122 | +------------------------------------------------------------- |
| 123 | + |
| 124 | +Sometimes you want the *same* transformation graph evaluated several times |
| 125 | +*inside the same DAG*, each time with a different input or configuration -- |
| 126 | +for example, computing unique-user counts at daily / weekly / monthly grains |
| 127 | +across two regions. |
| 128 | + |
| 129 | +The ``@subdag`` decorator from ``hamilton.function_modifiers`` does this |
| 130 | +declaratively. From the source documentation: |
| 131 | + |
| 132 | + ``@subdag`` enables you to rerun components of your DAG with varying |
| 133 | + parameters. That is, it enables you to "chain" what you could express |
| 134 | + with a Driver into a single DAG. |
| 135 | + |
| 136 | +The |
| 137 | +`reusing_functions <https://github.com/apache/hamilton/tree/main/examples/reusing_functions>`_ |
| 138 | +example computes ``unique_users`` for two regions and three time grains. |
| 139 | +The shared logic lives in ``unique_users.py``: |
| 140 | + |
| 141 | +.. literalinclude:: ../../examples/reusing_functions/unique_users.py |
| 142 | + :language: python |
| 143 | + :lines: 18- |
| 144 | + :caption: ``examples/reusing_functions/unique_users.py`` |
| 145 | + |
| 146 | +Then in ``reusable_subdags.py``, each ``@subdag`` declaration creates one |
| 147 | +named instance of that subgraph, with its own ``inputs`` and ``config``: |
| 148 | + |
| 149 | +.. literalinclude:: ../../examples/reusing_functions/reusable_subdags.py |
| 150 | + :language: python |
| 151 | + :pyobject: daily_unique_users_US |
| 152 | + :caption: One @subdag invocation from ``examples/reusing_functions/reusable_subdags.py`` |
| 153 | + |
| 154 | +Each decorated function: |
| 155 | + |
| 156 | +* Takes the *output* of its sub-DAG as its argument. Above, the sub-DAG ends |
| 157 | + in ``unique_users``, so the wrapping function receives ``unique_users: |
| 158 | + pd.Series`` and returns it (perhaps after post-processing). |
| 159 | +* Is decorated with ``inputs={"grain": value("day")}`` -- this binds the sub-DAG |
| 160 | + input ``grain`` to the literal ``"day"`` for *this instance only*. |
| 161 | +* Is decorated with ``config={"region": "US"}`` -- this scopes Hamilton's |
| 162 | + ``@config.when`` selection to ``"US"`` for this sub-DAG. |
| 163 | + |
| 164 | +The same module then defines five more analogous functions (``weekly_*``, |
| 165 | +``monthly_*``, the ``CA`` variants), giving six sub-DAG instances (two regions |
| 166 | +times three grains) that all reuse the same underlying definitions. |
| 167 | + |
| 168 | +Two parameters worth knowing: |
| 169 | + |
| 170 | +* ``namespace`` -- a string prefix for the nodes that ``@subdag`` materialises. |
| 171 | + By default Hamilton uses the wrapping function's name, which is normally |
| 172 | + what you want. |
| 173 | +* ``external_inputs`` -- declare any function parameter that comes from |
| 174 | + *outside* the sub-DAG (e.g. from the surrounding DAG). This makes the |
| 175 | + boundary between the sub-DAG and its surroundings explicit. |
| 176 | + |
| 177 | +When to reach for this pattern: |
| 178 | + |
| 179 | +* You want one Driver, one visualised DAG, and one ``execute`` call to |
| 180 | + produce all the variants -- rather than a Python ``for`` loop over many |
| 181 | + Drivers in your application code. |
| 182 | +* You want lineage and execution metadata for every variant captured by |
| 183 | + Hamilton, not by a wrapper script. |
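As a rough mental model, each sub-DAG instance is the same pair of functions evaluated with its own region and grain, with the results stored under a per-instance prefix. The following plain-Python sketch illustrates that idea only. The data, the ``run_instance`` helper, the ``"week"`` grain value, and the dotted naming are all assumptions made for the illustration; none of it is Hamilton's implementation or the contents of ``unique_users.py``.

```python
def filtered_interactions(interactions, region):
    """Keep only the rows for one region (stands in for the config choice)."""
    return [row for row in interactions if row["region"] == region]


def unique_users(filtered, grain):
    """Count distinct users per period, where the period column is the grain."""
    seen = {(row[grain], row["user"]) for row in filtered}
    counts = {}
    for period, _user in seen:
        counts[period] = counts.get(period, 0) + 1
    return counts


def run_instance(namespace, interactions, region, grain):
    """Evaluate the shared functions once and prefix the resulting names."""
    filtered = filtered_interactions(interactions, region)
    return {
        f"{namespace}.filtered_interactions": filtered,
        f"{namespace}.unique_users": unique_users(filtered, grain),
    }


interactions = [
    {"user": "a", "region": "US", "day": "2024-01-01", "week": "2024-W01"},
    {"user": "a", "region": "US", "day": "2024-01-02", "week": "2024-W01"},
    {"user": "b", "region": "US", "day": "2024-01-02", "week": "2024-W01"},
    {"user": "c", "region": "CA", "day": "2024-01-01", "week": "2024-W01"},
]

# One entry per instance: same logic, different region and grain.
instances = {
    "daily_unique_users_US": {"region": "US", "grain": "day"},
    "weekly_unique_users_US": {"region": "US", "grain": "week"},
    "daily_unique_users_CA": {"region": "CA", "grain": "day"},
}

results = {}
for namespace, params in instances.items():
    results.update(run_instance(namespace, interactions, **params))
```

Because every intermediate value carries its instance prefix, the three evaluations can share one result dictionary without colliding, which is the role the ``namespace`` parameter plays in the text above.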
| 184 | + |
| 185 | + |
| 186 | +4. ``@parameterized_subdag`` -- many subdags at once (advanced) |
| 187 | +--------------------------------------------------------------- |
| 188 | + |
| 189 | +If you have *many* subdags that differ only along a small number of |
| 190 | +parameters, writing one ``@subdag`` declaration per case becomes verbose. |
| 191 | +``@parameterized_subdag`` is syntactic sugar that produces several subdags |
| 192 | +from a single decorator -- analogous to how ``@parameterize`` produces |
| 193 | +several nodes from one function. |
| 194 | + |
| 195 | +From the |
| 196 | +`source documentation <https://github.com/apache/hamilton/blob/main/hamilton/function_modifiers/recursive.py>`_: |
| 197 | + |
| 198 | +.. code-block:: python |
| 199 | + |
| 200 | + @parameterized_subdag( |
| 201 | + feature_modules, |
| 202 | + from_datasource_1={"inputs": {"data": value("datasource_1.csv")}}, |
| 203 | + from_datasource_2={"inputs": {"data": value("datasource_2.csv")}}, |
| 204 | + from_datasource_3={ |
| 205 | + "inputs": {"data": value("datasource_3.csv")}, |
| 206 | + "config": {"filter": "only_even_client_ids"}, |
| 207 | + }, |
| 208 | + ) |
| 209 | + def feature_engineering(feature_df: pd.DataFrame) -> pd.DataFrame: |
| 210 | + return feature_df |
| 211 | + |
| 212 | +Each keyword argument passed to the decorator becomes one subdag, all built |
| 213 | +from the same ``feature_modules`` but with different inputs / config. |
| 214 | + |
| 215 | +The Hamilton source itself includes a deliberate warning on this decorator: |
| 216 | + |
| 217 | + Think about whether this feature is really the one you want -- often |
| 218 | + times, verbose, static DAGs are far more readable than very concise, |
| 219 | + highly parameterized DAGs. |
| 220 | + |
| 221 | +In practice: prefer the explicit form from section 3 until the repetition |
| 222 | +genuinely hurts. Reach for ``@parameterized_subdag`` when the parameter |
| 223 | +list comes from elsewhere (e.g. a config file resolved with ``@resolve``) |
| 224 | +or when you have a dozen-plus near-identical subdags. |
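When the parameter list is itself data, the per-case mapping can be generated rather than typed out. The sketch below builds a dictionary with the same ``{"inputs": ..., "config": ...}`` shape as the keyword arguments in the excerpt above. The region list, the grain labels, and the ``"week"`` / ``"month"`` values are assumptions for illustration, and plain literals stand in for the ``value(...)`` wrappers that a real decorator call would use.

```python
# Hypothetical parameter data; in practice this might come from a config file.
regions = ["US", "CA"]
grains = {"daily": "day", "weekly": "week", "monthly": "month"}

# One entry per (region, grain) pair, keyed by the name the case should get.
cases = {
    f"{label}_unique_users_{region}": {
        "inputs": {"grain": grain},      # literal stands in for value(grain)
        "config": {"region": region},
    }
    for region in regions
    for label, grain in grains.items()
}

case_names = sorted(cases)
```

Two regions and three grains yield six cases from four lines of data, which is the trade-off the warning above describes: the definition is compact, but the names of the resulting subdags no longer appear literally in the source.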
| 225 | + |
| 226 | +The full reference for both decorators lives at: |
| 227 | + |
| 228 | +* :doc:`/reference/decorators/subdag` |
| 229 | +* :doc:`/reference/decorators/parameterize_subdag` |
| 230 | + |
| 231 | + |
| 232 | +Choosing between the four patterns |
| 233 | +---------------------------------- |
| 234 | + |
| 235 | +A short decision tree: |
| 236 | + |
| 237 | +* **The data varies, the code does not** → just build another Driver from |
| 238 | + the same module (section 1). |
| 239 | +* **One or two named functions need to be swapped** → put the swaps in |
| 240 | + another module and use ``allow_module_overrides()`` (section 2). |
| 241 | +* **You want N copies of the same transform graph in one DAG** → use |
| 242 | + ``@subdag`` (section 3). |
| 243 | +* **You have many copies and the parameter list is itself data** → consider |
| 244 | + ``@parameterized_subdag`` (section 4). |
| 245 | + |
| 246 | +In practice, most production Hamilton projects rely heavily on (1), use (2) |
| 247 | +sparingly for testing seams, reach for (3) when modeling per-segment or |
| 248 | +per-grain pipelines, and treat (4) as an advanced tool. |
| 249 | + |
| 250 | + |
| 251 | +Where to go from here |
| 252 | +--------------------- |
| 253 | + |
| 254 | +* Walk through the runnable examples linked above: |
| 255 | + `feature_engineering_multiple_contexts <https://github.com/apache/hamilton/tree/main/examples/feature_engineering/feature_engineering_multiple_contexts>`_, |
| 256 | + `module_overrides <https://github.com/apache/hamilton/tree/main/examples/module_overrides>`_, |
| 257 | + and |
| 258 | + `reusing_functions <https://github.com/apache/hamilton/tree/main/examples/reusing_functions>`_. |
| 259 | +* Read :doc:`/concepts/best-practices/code-organization` for the module |
| 260 | + layout that makes these patterns natural. |
| 261 | +* For an end-to-end deep-dive on subdags and reuse, see the |
| 262 | + `Hamilton March 2024 Meetup tutorial notebook <https://github.com/DAGWorks-Inc/hamilton-tutorials/blob/main/2024-03-19/march-meetup.ipynb>`_. |