Commit 34039b3

Docs: Update multi engine guide with gateway managed virtual layer info (#4171)
Co-authored-by: Trey Spiller <1831878+treysp@users.noreply.github.com>
1 parent f2f7cde commit 34039b3

5 files changed

Lines changed: 251 additions & 19 deletions

docs/guides/configuration.md

Lines changed: 33 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -865,6 +865,39 @@ This may be useful in cases where the name casing needs to be preserved, since t
865865

866866
See [here](https://sqlglot.com/sqlglot/dialects/dialect.html#NormalizationStrategy) to learn more about the supported normalization strategies.
867867

868+
##### Gateway-specific model defaults
869+
870+
You can also define gateway-specific `model_defaults` in the `gateways` section, which override the global defaults for that gateway.
871+
872+
```yaml linenums="1" hl_lines="6 14"
873+
gateways:
874+
redshift:
875+
connection:
876+
type: redshift
877+
model_defaults:
878+
dialect: "snowflake,normalization_strategy=case_insensitive"
879+
snowflake:
880+
connection:
881+
type: snowflake
882+
883+
default_gateway: snowflake
884+
885+
model_defaults:
886+
dialect: snowflake
887+
start: 2025-02-05
888+
```
889+
890+
This allows you to tailor the behavior of models for each gateway without affecting the global `model_defaults`.
891+
892+
For example, in some SQL engines identifiers like table and column names are case-sensitive, but they are case-insensitive in other engines. By default, a project that uses both types of engines would need to ensure the models for each engine aligned with the engine's normalization behavior, which makes project maintenance and debugging more challenging.
893+
894+
Gateway-specific `model_defaults` allow you to change how SQLMesh performs identifier normalization *by engine* to align the different engines' behavior.
895+
896+
In the example above, the project's default dialect is `snowflake` (line 14). The `redshift` gateway configuration overrides that global default dialect with `"snowflake,normalization_strategy=case_insensitive"` (line 6).
897+
898+
That value tells SQLMesh that the `redshift` gateway's models will be written in the Snowflake SQL dialect (so they need to be transpiled from Snowflake to Redshift), but that the resulting Redshift SQL should treat identifiers as case-insensitive to match Snowflake's behavior.
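
The override described above can be pictured as a key-by-key merge. The following is a minimal sketch in plain Python, not SQLMesh's actual implementation: the `effective_model_defaults` helper and the dictionary layout are hypothetical and only mirror the YAML example. It also assumes that keys a gateway does not set (such as `start`) fall back to the global value.

```python
# Hypothetical sketch: plain dictionaries stand in for the parsed YAML config.
GLOBAL_MODEL_DEFAULTS = {"dialect": "snowflake", "start": "2025-02-05"}

GATEWAY_MODEL_DEFAULTS = {
    "redshift": {"dialect": "snowflake,normalization_strategy=case_insensitive"},
    "snowflake": {},  # no gateway-specific overrides
}


def effective_model_defaults(gateway: str) -> dict:
    """Return the global defaults with the gateway's overrides applied on top.

    Assumption: keys the gateway does not set fall back to the global value.
    """
    return {**GLOBAL_MODEL_DEFAULTS, **GATEWAY_MODEL_DEFAULTS.get(gateway, {})}


if __name__ == "__main__":
    for name in ("redshift", "snowflake"):
        print(name, effective_model_defaults(name))
```

Under these assumptions, only the `redshift` gateway sees the case-insensitive dialect string, while the `snowflake` gateway keeps the global defaults unchanged.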
899+
900+
868901
#### Model Kinds
869902

870903
Model kinds are required in each model file's `MODEL` DDL statement. They may optionally be used to specify a default kind in the model defaults configuration key.

docs/guides/multi_engine.md

Lines changed: 217 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -2,28 +2,36 @@
22

33
Organizations typically connect to a data warehouse through a single engine to ensure data consistency. However, there are cases where the processing capabilities of one engine may be better suited to specific tasks than another.
44

5-
By decoupling storage from compute and with growing support for open table formats like Apache Iceberg and Hive, different engines can now interact with the same data.
5+
Companies are increasingly decoupling how and where data is stored from how computations are run on that data, which requires interoperability across platforms and tools. Open table formats like Apache Iceberg, Delta Lake, and Hive provide a common storage format that can be used by multiple SQL engines.
66

7-
With SQLMesh's new multi-engine feature, users can leverage multiple engine adapters within a single project, offering the flexibility to choose the best engine for each task.
7+
SQLMesh enables this decoupling by supporting multiple engine adapters within a single project, giving you the flexibility to choose the best engine for each computational task. You can specify the engine each model uses, based on what computations the model performs or other organization-specific considerations.
88

9-
This feature allows you to run each model on a specified engine, provided the data catalog is shared and the engines support read/write operations on it.
9+
## Configuring a Project with Multiple Engines
1010

11+
Configuring your project to use multiple engines follows a simple process:
1112

12-
## Configuring project with multiple engines
13+
- Include all required [gateway connections](../reference/configuration.md#connection) in your configuration.
14+
- Specify the `gateway` to be used for execution in the `MODEL` DDL.
1315

14-
To configure a SQLMesh project with multiple engines, simply include all required gateway [connections](../reference/configuration.md#connection) in your configuration.
16+
If no gateway is explicitly defined for a model, the [default_gateway](../reference/configuration.md#default-gateway) of the project is used.
1517

16-
Next, specify the appropriate `gateway` in the `MODEL` DDL for each model. If no gateway is explicitly defined, the default gateway will be used.
18+
By default, virtual layer views are created in the `default_gateway`. This approach requires that all engines can read from and write to the same shared catalog, so a view in the `default_gateway` can access a table in another gateway.
1719

18-
The [virtual layer](../concepts/glossary.md#virtual-layer) will be created within the engine corresponding to the default gateway.
20+
Alternatively, each gateway can create the virtual layer views for the models it runs. Use this approach by setting the [gateway_managed_virtual_layer](#gateway-managed-virtual-layer) flag to `true` in your project configuration.
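
The two rules above (which gateway runs a model, and which gateway hosts its virtual layer view) can be summarized in a short sketch. This is plain Python written for illustration, not SQLMesh code: the function names `execution_gateway` and `view_gateway` are hypothetical.

```python
from typing import Optional


def execution_gateway(model_gateway: Optional[str], default_gateway: str) -> str:
    """Gateway that runs the model: the model's own gateway, else the default."""
    return model_gateway or default_gateway


def view_gateway(
    model_gateway: Optional[str],
    default_gateway: str,
    gateway_managed_virtual_layer: bool = False,
) -> str:
    """Gateway that creates the model's virtual layer view.

    With the flag off (the default), every view lives in the default gateway.
    With the flag on, the view lives in the gateway that ran the model.
    """
    if gateway_managed_virtual_layer:
        return execution_gateway(model_gateway, default_gateway)
    return default_gateway


if __name__ == "__main__":
    # Shared virtual layer: DuckDB runs the model, PostgreSQL hosts the view.
    print(view_gateway("duckdb", "postgres"))
    # Gateway-managed virtual layer: Athena runs the model and hosts the view.
    print(view_gateway("athena", "redshift", gateway_managed_virtual_layer=True))
```

The first case only works when both engines can reach the same catalog, which is why the shared approach carries that requirement.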
1921

20-
### Example
22+
### Shared Virtual Layer
2123

22-
Below is a simple example of setting up a project with connections to both DuckDB and PostgreSQL.
24+
In SQLMesh, the [physical layer](../concepts/glossary.md#physical-layer) is the concrete data storage layer, where SQLMesh stores and manages data in database tables and materialized views.
25+
26+
The [virtual layer](../concepts/glossary.md#virtual-layer), in contrast, consists of views, one for each model, each pointing to a snapshot table in the physical layer.
27+
28+
In a multi-engine project with a shared data catalog, the model-specific gateway is responsible for the physical layer, while the default gateway is used for managing the virtual layer.
29+
30+
#### Example: DuckDB + PostgreSQL
2331

24-
In this setup, the PostgreSQL engine is set as the default, so it will be used to manage views in the virtual layer.
32+
Below is a simple example of setting up a project with connections to both DuckDB and PostgreSQL.
2533

26-
Meanwhile, the DuckDB's [attach](https://duckdb.org/docs/sql/statements/attach.html) feature enables read-write access to the PostgreSQL catalog's physical tables.
34+
In this setup, the PostgreSQL engine is set as the default, so it will be used to manage views in the virtual layer. Meanwhile, DuckDB's [attach](https://duckdb.org/docs/sql/statements/attach.html) feature enables read-write access to the PostgreSQL catalog's physical tables.
2735

2836
=== "YAML"
2937

@@ -81,16 +89,15 @@ Meanwhile, the DuckDB's [attach](https://duckdb.org/docs/sql/statements/attach.h
8189
port=5432,
8290
user="postgres",
8391
password="password",
84-
database="main_db",
92+
database="main_db",
8593
)
8694
),
8795
},
8896
default_gateway="postgres",
8997
)
9098
```
9199

92-
Given this configuration, when a model’s gateway is set to duckdb, it will be materialized within the PostgreSQL `main_db` catalog, but it will be evaluated using DuckDB’s engine.
93-
100+
Given this configuration, when a model’s gateway is set to `duckdb`, the DuckDB engine will perform the calculations before materializing the physical table in the PostgreSQL `main_db` catalog.
94101

95102
```sql linenums="1"
96103
MODEL (
@@ -100,12 +107,204 @@ MODEL (
100107
);
101108

102109
SELECT
103-
l_orderkey,
110+
l_orderkey,
104111
l_shipdate
105-
FROM
112+
FROM
106113
iceberg_scan('data/bucket/lineitem_iceberg', allow_moved_paths = true);
107114
```
108115

109-
In this model, the DuckDB engine can be used to scan and load data from an iceberg table and create the physical table in the PostgreSQL database.
116+
The `order_ship_date` model specifies the DuckDB engine, which will perform the computations used to create the physical table in the PostgreSQL database.
117+
118+
This allows you to efficiently scan data from an Iceberg table, or even query tables directly from S3 when used with the [HTTPFS](https://duckdb.org/docs/stable/extensions/httpfs/overview.html) extension.
119+
120+
![PostgreSQL + DuckDB](./multi_engine/postgres_duckdb.png)
121+
122+
In models where no gateway is specified, such as the `customer_orders` model, the default PostgreSQL engine will create both the physical table and the views in the virtual layer.
123+
124+
### Gateway-Managed Virtual Layer
125+
126+
By default, all virtual layer views are created in the project's default gateway.
127+
128+
If your project's engines don’t have a mutually accessible catalog or your raw data is located in different engines, you may prefer for each model's virtual layer view to exist in the gateway that ran the model. This allows a single SQLMesh project to manage isolated sets of models in different gateways, which is sometimes necessary for data governance or security concerns.
129+
130+
To enable this, set `gateway_managed_virtual_layer` to `true` in your configuration. By default, this flag is set to `false`.
131+
132+
#### Example: Redshift + Athena + Snowflake
133+
134+
Consider a scenario where you need to create a project with models in Redshift, Athena and Snowflake, where each engine hosts its models' virtual layer views.
135+
136+
First, add the connections to your configuration and set the `gateway_managed_virtual_layer` flag to `true`:
137+
138+
=== "YAML"
139+
140+
```yaml linenums="1" hl_lines="30"
141+
gateways:
142+
redshift:
143+
connection:
144+
type: redshift
145+
user: <redshift_user>
146+
password: <redshift_password>
147+
host: <redshift_host>
148+
database: <redshift_database>
149+
variables:
150+
gw_var: 'redshift'
151+
athena:
152+
connection:
153+
type: athena
154+
aws_access_key_id: <athena_aws_access_key_id>
155+
aws_secret_access_key: <athena_aws_secret_access_key>
156+
s3_warehouse_location: <athena_s3_warehouse_location>
157+
variables:
158+
gw_var: 'athena'
159+
snowflake:
160+
connection:
161+
type: snowflake
162+
account: <snowflake_account>
163+
user: <snowflake_user>
164+
database: <snowflake_database>
165+
warehouse: <snowflake_warehouse>
166+
variables:
167+
gw_var: 'snowflake'
168+
169+
default_gateway: redshift
170+
gateway_managed_virtual_layer: true
171+
172+
variables:
173+
gw_var: 'global'
174+
global_var: 5
175+
```
176+
177+
=== "Python"
178+
179+
```python linenums="1" hl_lines="48"
180+
from sqlmesh.core.config import (
181+
Config,
182+
ModelDefaultsConfig,
183+
GatewayConfig,
184+
RedshiftConnectionConfig,
185+
AthenaConnectionConfig,
186+
SnowflakeConnectionConfig,
187+
)
188+
189+
config = Config(
190+
model_defaults=ModelDefaultsConfig(dialect="redshift"),
191+
gateways={
192+
"redshift": GatewayConfig(
193+
connection=RedshiftConnectionConfig(
194+
user="<redshift_user>",
195+
password="<redshift_password>",
196+
host="<redshift_host>",
197+
database="<redshift_database>",
198+
),
199+
variables={
200+
"gw_var": "redshift"
201+
},
202+
),
203+
"athena": GatewayConfig(
204+
connection=AthenaConnectionConfig(
205+
aws_access_key_id="<athena_aws_access_key_id>",
206+
aws_secret_access_key="<athena_aws_secret_access_key>",
207+
region_name="<athena_region_name>",
208+
s3_warehouse_location="<athena_s3_warehouse_location>",
209+
),
210+
variables={
211+
"gw_var": "athena"
212+
},
213+
),
214+
"snowflake": GatewayConfig(
215+
connection=SnowflakeConnectionConfig(
216+
account="<snowflake_account>",
217+
user="<snowflake_user>",
218+
database="<snowflake_database>",
219+
warehouse="<snowflake_warehouse>",
220+
),
221+
variables={
222+
"gw_var": "snowflake"
223+
},
224+
),
225+
},
226+
default_gateway="redshift",
227+
gateway_managed_virtual_layer=True,
228+
variables={
229+
"gw_var": "global",
230+
"global_var": 5,
231+
},
232+
)
233+
```
234+
235+
Note that gateway-specific variables take precedence over global ones. In the example above, the `gw_var` used in a model will resolve to the value specified in the model's gateway.
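
As a rough illustration of that precedence rule, the sketch below layers a gateway's variables over the global ones. It is plain Python with a hypothetical `resolve_variables` helper, not SQLMesh's own resolution code. The values are copied from the example configuration above.

```python
# Hypothetical sketch: variable values copied from the example configuration.
GLOBAL_VARIABLES = {"gw_var": "global", "global_var": 5}

GATEWAY_VARIABLES = {
    "redshift": {"gw_var": "redshift"},
    "athena": {"gw_var": "athena"},
    "snowflake": {"gw_var": "snowflake"},
}


def resolve_variables(gateway: str) -> dict:
    """Layer the gateway's variables over the global ones (gateway wins)."""
    return {**GLOBAL_VARIABLES, **GATEWAY_VARIABLES.get(gateway, {})}


if __name__ == "__main__":
    for name in GATEWAY_VARIABLES:
        print(name, resolve_variables(name))
```

In this sketch, a model on the `athena` gateway sees `gw_var` as `'athena'`, while `global_var` is still `5` because no gateway redefines it.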
236+
237+
For further customization, you can also enable [gateway-specific model defaults](../guides/configuration.md#gateway-specific-model-defaults). This allows you to define custom behaviors, such as specifying a dialect with a case-insensitive normalization strategy.
238+
239+
In the example configuration above, the default gateway is `redshift`, so all models without a `gateway` specification will run on Redshift, as in this `order_dates` model:
240+
241+
```sql linenums="1"
242+
MODEL (
243+
name redshift_schema.order_dates,
244+
table_format iceberg,
245+
);
246+
247+
SELECT
248+
order_date,
249+
order_id
250+
FROM
251+
bucket.raw_data;
252+
```
253+
254+
For the `athena_schema.order_status` model, we explicitly specify the `athena` gateway:
255+
256+
```sql linenums="1" hl_lines="4"
257+
MODEL (
258+
name athena_schema.order_status,
259+
table_format iceberg,
260+
gateway athena,
261+
);
262+
263+
SELECT
264+
order_id,
265+
status
266+
FROM
267+
bucket.raw_data;
268+
```
269+
270+
Finally, specifying the `snowflake` gateway for the `customer_orders` model ensures it is isolated from the rest and reads from a table within the Snowflake database:
271+
272+
```sql linenums="1" hl_lines="4"
273+
MODEL (
274+
name snowflake_schema.customer_orders,
275+
table_format iceberg,
276+
gateway snowflake
277+
);
278+
279+
SELECT
280+
customer_id,
281+
orders
282+
FROM
283+
bronze_schema.customer_data;
284+
```
285+
286+
287+
![Athena + Redshift + Snowflake](./multi_engine/athena_redshift_snowflake.png)
288+
289+
When you run the plan, the catalogs for each model will be set automatically based on the gateway’s connection and each corresponding model will be executed by the specified engine:
290+
291+
```bash
292+
❯ sqlmesh plan
293+
294+
`prod` environment will be initialized
295+
296+
Models:
297+
└── Added:
298+
├── awsdatacatalog.athena_schema.order_status # each model uses its gateway's catalog and schema
299+
├── redshift_schema.order_dates
300+
└── silver.snowflake_schema.customer_orders
301+
Models needing backfill:
302+
├── awsdatacatalog.athena_schema.order_status: [full refresh]
303+
├── redshift_schema.order_dates: [full refresh]
304+
└── silver.snowflake_schema.customer_orders: [full refresh]
305+
Apply - Backfill Tables [y/n]: y
306+
```
307+
308+
The views of the virtual layer will also be created by each corresponding engine.
110309

111-
While the PostgreSQL engine is responsible for creating the model's view for the virtual layer.
310+
This approach provides isolation between your models, while maintaining centralized control over your project.

docs/reference/configuration.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -35,7 +35,7 @@ Configuration options for SQLMesh environment creation and promotion.
3535
| `physical_schema_override` | (Deprecated) Use `physical_schema_mapping` instead. A mapping from model schema names to names of schemas in which physical tables for the corresponding models will be placed. | dict[string, string] | N |
3636
| `physical_schema_mapping` | A mapping from regular expressions to names of schemas in which physical tables for the corresponding models [will be placed](../guides/configuration.md#physical-table-schemas). (Default physical schema name: `sqlmesh__[model schema]`) | dict[string, string] | N |
3737
| `environment_suffix_target` | Whether SQLMesh views should append their environment name to the `schema` or `table` - [additional details](../guides/configuration.md#view-schema-override). (Default: `schema`) | string | N |
38-
| `gateway_managed_virtual_layer` | Whether SQLMesh views of the virtual layer will be created by the default gateway or model specified gateways - [additional details](../guides/configuration.md#view-schema-override). (Default: False) | boolean | N |
38+
| `gateway_managed_virtual_layer` | Whether SQLMesh views of the virtual layer will be created by the default gateway or model specified gateways - [additional details](../guides/multi_engine.md#gateway-managed-virtual-layer). (Default: False) | boolean | N |
3939
| `environment_catalog_mapping` | A mapping from regular expressions to catalog names. The catalog name is used to determine the target catalog for a given environment. | dict[string, string] | N |
4040
| `log_limit` | The default number of logs to keep (Default: `20`) | int | N |
4141
