
Commit 8fae0c0

Docs: add incremental by time guide (#1402)
* Initial commit
* Add scheduling, time filtering, and forward-only sections
* Add forward only example
* Add guide to mkdocs
* Wordsmith
* Update and expand figures
* Organize guide links
* Capitalize model kind names
1 parent d3c950d commit 8fae0c0

8 files changed

Lines changed: 179 additions & 18 deletions

File tree

docs/concepts/models/model_kinds.md

Lines changed: 1 addition & 1 deletion
@@ -208,7 +208,7 @@ The model kinds described so far cause the output of a model query to be materia
208208

209209
The `VIEW` kind is different, because no data is actually written during model execution. Instead, a non-materialized view (or "virtual table") is created or replaced based on the model's query.
210210

211-
**Note:** View is the default model kind if kind is not specified.
211+
**Note:** `VIEW` is the default model kind if kind is not specified.
212212

213213
**Note:** With this kind, the model's query is evaluated every time the model is referenced in a downstream query. This may incur undesirable compute cost and time in cases where the model's query is compute-intensive, or when the model is referenced in many downstream queries.
214214

docs/guides/incremental_time.md

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
1+
# Incremental by time guide
2+
3+
SQLMesh models are classified by [kind](../concepts/models/model_kinds.md). One powerful model kind is "incremental by time range" - this guide describes how these models work and demonstrates how to use them.
4+
5+
See the [models guide](./models.md) to learn more about working with models in general or the [model kinds concepts page](../concepts/models/model_kinds.md) for an overview of the different model kinds.
6+
7+
## Load the right data
8+
9+
The incremental by time approach to data loading is motivated by efficiency. It is based on the principle of only loading a given data row one time.
10+
11+
Model kinds such as `VIEW` or `FULL` reload the entirety of the source system's data every time they run. In some cases, reloading all the data is not feasible. In other cases, it is an inefficient use of time and computational resources - both of which equate to money your business could spend on something else.
12+
13+
Incremental by time models only load new data, drastically decreasing the computational resources required for each model run.
14+
15+
## Counting time
16+
17+
Incremental by time models work by first identifying the date range for which data should be read from the source table.
18+
19+
One approach determines the date range from the most recent record timestamp observed in the data. That approach is simple to implement, but it makes three assumptions: the table already exists, there are no temporal gaps in the data, and the query can run in a single pass.
20+
21+
SQLMesh takes a different approach by using time *intervals*.
22+
23+
### Calculating intervals
24+
25+
The first step to using time intervals is to create the set of all intervals based on the model's *start datetime* and *interval unit*. The start datetime specifies when time "begins" for the model, and interval unit specifies how finely time should be divided.
26+
27+
For example, consider a model with a start datetime of 12am two days ago that we are working with today at 12pm. This is illustrated in Figure 1:
28+
29+
![Illustration of model with start datetime of 12am two days ago that we are working with today at 12pm](./incremental_time/interval-example.png)
30+
31+
*__Figure 1: Illustration of model with start datetime of 12am two days ago that we are working with today at 12pm__*
32+
33+
<br>
34+
35+
If the model's interval unit was 1 day, the model's set of intervals would have 3 entries:
36+
37+
- 1 for two days ago
38+
- 1 for yesterday
39+
- 1 for today
40+
41+
Today's interval is not yet complete because it's 12pm right now. This is illustrated in the top panel of Figure 2:
42+
43+
![Illustration of counting intervals over a 36 hour period with interval units of 1 day and 1 hour](./incremental_time/interval-counting.png)
44+
45+
*__Figure 2: Illustration of counting intervals over a 36 hour period with interval units of 1 day and 1 hour__*
46+
47+
<br>
48+
49+
If the model's interval unit was 1 hour instead, its set of time intervals would have 60 entries:
50+
51+
- 24 for each hour of two days ago
52+
- 24 for each hour of yesterday
53+
- 12 for each hour today from 12am to 12pm
54+
55+
All intervals are complete because it is 12pm (so the 11am interval has ended). This is illustrated in the bottom panel of Figure 2.
56+
57+
When we first execute and backfill the hourly model (bottom panel of Figure 2) as part of a `sqlmesh plan` today at 12pm, SQLMesh calculates its set of 60 time intervals and records that all 60 of them were backfilled. It retains this information in the SQLMesh state tables for future use.
58+
59+
If we `sqlmesh run` the model *tomorrow* at 12pm, SQLMesh calculates the set of all intervals as:
60+
61+
- 24 for two days ago
62+
- 24 for yesterday
63+
- 24 for today
64+
- 12 for tomorrow 12am to 12pm
65+
66+
This gives a total of 84 intervals.
67+
68+
It compares this set of 84 to the stored set of 60 that we already backfilled to identify the 24 unprocessed intervals from 12pm today to 12pm tomorrow. It then processes only those 24 intervals during tomorrow's run.
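The interval arithmetic in this section can be sketched in a few lines of Python. This is an illustration only, not SQLMesh's implementation: the `complete_intervals` helper and the calendar dates are hypothetical, chosen so that "two days ago" is 2024-01-01 and "today" is 2024-01-03.

```python
from datetime import datetime, timedelta

def complete_intervals(start, now, unit):
    """Return the start time of every interval of length `unit` that has fully elapsed by `now`."""
    intervals = []
    current = start
    while current + unit <= now:
        intervals.append(current)
        current += unit
    return intervals

# Hypothetical dates standing in for "two days ago", "today", and "tomorrow"
start = datetime(2024, 1, 1)           # model start datetime: 12am two days ago
today_noon = datetime(2024, 1, 3, 12)  # first plan: today at 12pm
tomorrow_noon = today_noon + timedelta(days=1)
hour = timedelta(hours=1)

# Intervals recorded as backfilled after the first plan
processed = set(complete_intervals(start, today_noon, hour))

# Intervals that exist when the model runs tomorrow at 12pm
all_tomorrow = complete_intervals(start, tomorrow_noon, hour)

# Only intervals not yet recorded need to be processed
missing = [i for i in all_tomorrow if i not in processed]

print(len(processed), len(all_tomorrow), len(missing))
```

The three counts correspond to the 60, 84, and 24 intervals described above, and the first missing interval starts at 12pm today.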
69+
70+
## Running `run`
71+
72+
SQLMesh has two different commands for processing data. If any model has been changed, [`sqlmesh plan`](../reference/cli.md#plan) is used to apply the change to data in a specific environment. If no models have changed, [`sqlmesh run`](../reference/cli.md#run) is used to execute the project's models.
73+
74+
Data accumulation rates and freshness requirements may differ across models. If `sqlmesh run` ran every model whenever it was executed, all models would be held to the same freshness requirements as the most stringent model. This is inefficient and wastes computational resources (and money).
75+
76+
Instead, you specify the [`cron` parameter](../concepts/models/overview.md#cron) for each model. `sqlmesh run` uses each model's `cron` to determine whether that model should be executed in a given run.
77+
78+
For example, if your most frequent model's `cron` is hourly, you need to execute the `sqlmesh run` command at least hourly (with a tool like Linux's cron). That model will run once every hour, but another model with a `cron` of daily will only run once per day when 24 hours have elapsed since its previous run.
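As a rough sketch of that decision, the snippet below checks which models are due at a given invocation. It is a simplification under stated assumptions: the model names are hypothetical, each `cron` is approximated by a fixed `timedelta` rather than a real cron expression, and SQLMesh's actual scheduling logic is not shown here.

```python
from datetime import datetime, timedelta

# Hypothetical models; each cadence approximates a `cron` of hourly or daily
CADENCES = {
    "hourly_model": timedelta(hours=1),
    "daily_model": timedelta(days=1),
}

def models_due(last_run, now):
    """Return the names of models whose cadence has fully elapsed since their last run."""
    return sorted(
        name for name, cadence in CADENCES.items()
        if now - last_run[name] >= cadence
    )

last_run = {
    "hourly_model": datetime(2024, 1, 3, 11),
    "daily_model": datetime(2024, 1, 3, 0),
}

# Invoked at 12pm: only the hourly model is due
due_at_noon = models_due(last_run, datetime(2024, 1, 3, 12))

# Invoked at 12am the next day, after the hourly model last ran at 11pm: both are due
last_run["hourly_model"] = datetime(2024, 1, 3, 23)
due_at_midnight = models_due(last_run, datetime(2024, 1, 4, 0))

print(due_at_noon, due_at_midnight)
```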
79+
80+
### Scheduling computations
81+
82+
By default, SQLMesh processes all intervals that have elapsed since a model's previous run in a single job. If a model's source data is large, you may want to break the computations up into smaller jobs - this is done with the model configuration's `batch_size` parameter.
83+
84+
When `batch_size` is specified, the total number of intervals to process is divided into batches of size `batch_size`, and one job is executed for each batch.
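The batching described above amounts to chunking the list of pending intervals. A minimal sketch, assuming a hypothetical `batch_intervals` helper and using integers as stand-ins for interval start times:

```python
def batch_intervals(intervals, batch_size):
    """Split the pending intervals into consecutive batches of at most `batch_size`."""
    return [
        intervals[i:i + batch_size]
        for i in range(0, len(intervals), batch_size)
    ]

# 24 pending hourly intervals, represented by their index
pending = list(range(24))

# With a batch size of 10, three jobs are needed: 10 + 10 + 4 intervals
batches = batch_intervals(pending, batch_size=10)
print([len(batch) for batch in batches])
```

The last batch is smaller whenever the number of pending intervals is not a multiple of the batch size.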
85+
86+
## Model time
87+
88+
Incremental by time models require specification of a time column in their configuration. In addition, their model SQL queries should specify a `WHERE` clause that filters the data on a time range.
89+
90+
This example shows an incremental by time model that could be added to the SQLMesh [quickstart project](../quick_start.md):
91+
92+
``` sql linenums="1"
93+
MODEL (
94+
name sqlmesh_example.new_model,
95+
kind INCREMENTAL_BY_TIME_RANGE (
96+
time_column (ds, '%Y-%m-%d'), -- Time column `ds` with format '%Y-%m-%d'
97+
),
98+
);
99+
100+
SELECT
101+
*
102+
FROM
103+
sqlmesh_example.incremental_model
104+
WHERE
105+
ds BETWEEN @start_ds AND @end_ds -- WHERE clause filters based on time
106+
```
107+
108+
The model configuration specifies that the column `ds` represents the timestamp for each row, and the model query contains a `WHERE` clause that uses the time column to filter the data.
109+
110+
The `WHERE` clause uses the [SQLMesh predefined macro variables](../concepts/macros/macro_variables.md#predefined-variables) `@start_ds` and `@end_ds` to specify the date range. SQLMesh automatically substitutes in the correct dates based on which intervals are being processed in a job.
111+
112+
In addition to the query `WHERE` clause, SQLMesh prevents data leakage by automatically wrapping the query in another time-filtering `WHERE` clause using the time column in the model's configuration.
113+
114+
This raises a question: if SQLMesh automatically adds a time filtering `WHERE` clause, why do you need to include one in the query? Because the two filters play different roles:
115+
116+
- The model query `WHERE` clause filters the data *read into the model*
117+
- The SQLMesh wrapper `WHERE` clause filters the data *output by the model*
118+
119+
The model query ensures that only the necessary data is processed by the model, so no resources are wasted. It also adds flexibility - if an upstream model uses a different time column than the model itself, that column can be used in addition to (or in place of) the model's time column in the query `WHERE` clause.
120+
121+
The SQLMesh wrapper clause prevents data leakage by ensuring the model does not return records outside the time range. This is a safety mechanism that guards against improperly specified queries.
122+
123+
For some queries, the two filters are functionally duplicative, but for others they are not. There is no way for SQLMesh to determine whether they are duplicative in any given instance, so the model query should always include a time-filtering `WHERE` clause.
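The different roles of the two filters can be illustrated with plain Python lists standing in for tables. This is a conceptual sketch only: the rows, dates, and the `run_model` function are hypothetical and do not reflect how SQLMesh executes queries.

```python
from datetime import date

# Hypothetical interval being processed: a single day
START_DS = date(2024, 1, 2)
END_DS = date(2024, 1, 2)

# Hypothetical upstream rows spanning three days
SOURCE_ROWS = [
    {"ds": date(2024, 1, 1), "amount": 5},
    {"ds": date(2024, 1, 2), "amount": 7},
    {"ds": date(2024, 1, 3), "amount": 9},
]

def in_range(row):
    """Mimic `ds BETWEEN @start_ds AND @end_ds`."""
    return START_DS <= row["ds"] <= END_DS

def run_model(rows, query_has_where):
    # Filter 1 (model query WHERE clause): limits the rows read into the model
    rows_read = [r for r in rows if in_range(r)] if query_has_where else list(rows)
    # Filter 2 (wrapper WHERE clause): limits the rows output by the model
    rows_output = [r for r in rows_read if in_range(r)]
    return len(rows_read), rows_output

read_with, out_with = run_model(SOURCE_ROWS, query_has_where=True)
read_without, out_without = run_model(SOURCE_ROWS, query_has_where=False)
print(read_with, read_without, out_with == out_without)
```

In both cases the output contains only the in-range row, but without the query's own filter the model reads all three rows to produce it.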
124+
125+
## Forward-only models
126+
127+
Every time a model is modified, SQLMesh classifies the change as "[breaking](../concepts/plans.md#breaking-change)" or "[non-breaking](../concepts/plans.md#non-breaking-change)."
128+
129+
Breaking changes may invalidate data for downstream models, so a new physical table is created and fully refreshed for the changed model and all models downstream of it. Non-breaking changes only affect the changed model, so only its physical table is refreshed.
130+
131+
Sometimes a model's data may be so large that it is not feasible to rebuild either its own or its downstream models' physical tables. In those situations a third type of change, "forward only," can be used. The name reflects that the change is only applied "going forward" in time.
132+
133+
### Specifying forward-only
134+
135+
Forward-only changes can be specified in two ways. First, a model can be [configured as forward-only](../guides/configuration.md#models) so that all changes to it are automatically classified as forward-only. This guarantees that the model's physical table will never be fully refreshed.
136+
137+
This example configures the model in the previous example to be forward only:
138+
139+
``` sql linenums="1"
140+
MODEL (
141+
name sqlmesh_example.new_model,
142+
kind INCREMENTAL_BY_TIME_RANGE (
143+
time_column (ds, '%Y-%m-%d'),
144+
),
145+
forward_only true -- All changes will be forward only
146+
);
147+
148+
SELECT
149+
*
150+
FROM
151+
sqlmesh_example.incremental_model
152+
WHERE
153+
ds BETWEEN @start_ds AND @end_ds
154+
```
155+
156+
Alternatively, all the changes contained in a *specific plan* can be classified as forward-only with a flag: `sqlmesh plan --forward-only`. A subsequent plan that did not include the forward-only flag would fully refresh the model's physical table. Learn more about forward-only plans [here](../concepts/plans.md#forward-only-plans).

docs/guides/models.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ To add a model:
1717

1818
MODEL (
1919
name sqlmesh_example.new_model,
20-
kind incremental_by_time_range (
20+
kind INCREMENTAL_BY_TIME_RANGE (
2121
time_column (ds, '%Y-%m-%d'),
2222
),
2323
);

docs/integrations/dbt.md

Lines changed: 2 additions & 2 deletions
@@ -51,8 +51,8 @@ This section describes how to adapt dbt's incremental models to run on sqlmesh a
5151

5252
SQLMesh supports two approaches to implement [idempotent](../concepts/glossary.md#idempotency) incremental loads:
5353

54-
* Using merge (with the sqlmesh [`incremental_by_unique_key` model kind](../concepts/models/model_kinds.md#incremental_by_unique_key))
55-
* Using insert-overwrite/delete+insert (with the sqlmesh [`incremental_by_time_range` model kind](../concepts/models/model_kinds.md#incremental_by_time_range))
54+
* Using merge (with the sqlmesh [`INCREMENTAL_BY_UNIQUE_KEY` model kind](../concepts/models/model_kinds.md#incremental_by_unique_key))
55+
* Using insert-overwrite/delete+insert (with the sqlmesh [`INCREMENTAL_BY_TIME_RANGE` model kind](../concepts/models/model_kinds.md#incremental_by_time_range))
5656

5757
#### Incremental by unique key
5858

docs/reference/model_configuration.md

Lines changed: 4 additions & 4 deletions
@@ -90,7 +90,7 @@ Configuration options for all incremental models.
9090

9191
#### Incremental by time range
9292

93-
Configuration options for [incremental by time models](../concepts/models/model_kinds.md#incremental_by_time_range).
93+
Configuration options for [`INCREMENTAL_BY_TIME_RANGE` models](../concepts/models/model_kinds.md#incremental_by_time_range).
9494

9595
| Option | Description | Type | Required |
9696
| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | :--: | :------: |
@@ -103,7 +103,7 @@ Python model configuration object: [IncrementalByTimeRangeKind()](https://sqlmes
103103

104104
#### Incremental by unique key
105105

106-
Configuration options for [incremental by unique key models](../concepts/models/model_kinds.md#incremental_by_unique_key).
106+
Configuration options for [`INCREMENTAL_BY_UNIQUE_KEY` models](../concepts/models/model_kinds.md#incremental_by_unique_key).
107107

108108
| Option | Description | Type | Required |
109109
| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | :--------------: | :------: |
@@ -115,7 +115,7 @@ Python model configuration object: [IncrementalByUniqueKeyKind()](https://sqlmes
115115

116116
### `SEED` models
117117

118-
Configuration options for [seed models](../concepts/models/model_kinds.md#seed).
118+
Configuration options for [`SEED` models](../concepts/models/model_kinds.md#seed).
119119

120120
| Option | Description | Type | Required |
121121
| ------ | ---------------------- | :--: | :------: |
@@ -125,7 +125,7 @@ Python model configuration object: [SeedKind()](https://sqlmesh.readthedocs.io/e
125125

126126
### SCD Type 2 models
127127

128-
Configuration options for [SCD Type 2 models](../concepts/models/model_kinds.md#scd-type-2).
128+
Configuration options for [`SCD_TYPE_2` models](../concepts/models/model_kinds.md#scd-type-2).
129129

130130
| Option | Description | Type | Required |
131131
| --------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | :-------: | :------: |

mkdocs.yml

Lines changed: 15 additions & 10 deletions
@@ -12,16 +12,21 @@ nav:
1212
- quickstart/notebook.md
1313
- quickstart/ui.md
1414
- Guides:
15-
- guides/projects.md
16-
- guides/models.md
17-
- guides/testing.md
18-
- guides/scheduling.md
19-
- guides/connections.md
20-
- guides/multi_repo.md
21-
- guides/migrations.md
22-
- guides/notifications.md
23-
- guides/tablediff.md
24-
- guides/configuration.md
15+
- Project structure:
16+
- guides/projects.md
17+
- guides/multi_repo.md
18+
- Project setup:
19+
- guides/configuration.md
20+
- guides/connections.md
21+
- guides/scheduling.md
22+
- guides/notifications.md
23+
- guides/migrations.md
24+
- Project content:
25+
- guides/models.md
26+
- guides/incremental_time.md
27+
- guides/testing.md
28+
- SQLMesh tools:
29+
- guides/tablediff.md
2530
- Concepts:
2631
- concepts/overview.md
2732
- Development:
