Commit f26cdbb

Improve explanations for query optimizations changes.
1 parent 3543d98 commit f26cdbb

2 files changed

Lines changed: 22 additions & 17 deletions

File tree

docs/integrations/data-ingestion/etl-tools/fivetran/reference.md

Lines changed: 4 additions & 8 deletions
@@ -5,7 +5,7 @@ sidebar_position: 3
55
description: 'Type mappings, table engine details, metadata columns, and debugging queries for the Fivetran ClickHouse destination.'
66
title: 'Technical Reference'
77
doc_type: 'guide'
8-
keywords: ['fivetran', 'clickhouse destination', 'type mapping', 'SharedReplacingMergeTree', 'deduplication', 'FINAL', 'reference']
8+
keywords: ['fivetran', 'clickhouse destination', 'technical reference']
99
---
1010

1111
# Technical Reference
@@ -206,22 +206,18 @@ Since `_fivetran_id` is unique and there are no other primary key options, it is
206206

207207
`SharedReplacingMergeTree` performs background data deduplication
208208
[only during merges at an unknown time](/engines/table-engines/mergetree-family/replacingmergetree).
209-
However, selecting the latest version of the data without duplicates ad-hoc is possible with the `FINAL` keyword and
210-
[select_sequential_consistency](/operations/settings/settings#select_sequential_consistency)
211-
setting:
209+
However, you can select the latest version of the data without duplicates ad hoc by using the `FINAL` keyword:
212210

213211
```sql
214212
SELECT *
215213
FROM example FINAL
216214
LIMIT 1000
217-
SETTINGS select_sequential_consistency = 1;
218215
```
219216

220-
See also [Duplicate records with ReplacingMergeTree](/integrations/fivetran/troubleshooting#duplicate-records) in the troubleshooting guide.
217+
For query optimization tips, see the [Optimizing reading queries](/integrations/fivetran/troubleshooting#optimizing-reading-queries) section of the troubleshooting guide.
221218

222219
## Retries on network failures {#retries-on-network-failures}
223220

224221
The ClickHouse Cloud destination retries transient network errors using the exponential backoff algorithm.
225222
This is safe even when the destination inserts the data, as any potential duplicates are handled by
226-
the `SharedReplacingMergeTree` table engine, either during background merges,
227-
or when querying the data with `SELECT FINAL`.
223+
the `SharedReplacingMergeTree` table engine.

docs/integrations/data-ingestion/etl-tools/fivetran/troubleshooting.md

Lines changed: 18 additions & 9 deletions
@@ -5,7 +5,7 @@ sidebar_position: 4
55
description: 'Common errors, debugging tips, and best practices for the Fivetran ClickHouse destination.'
66
title: 'Troubleshooting & Best Practices'
77
doc_type: 'guide'
8-
keywords: ['fivetran', 'clickhouse destination', 'troubleshooting', 'OOM', 'batch size', 'AST too big', 'EOF', 'replica inactive', 'ORDER BY', 'database occupied', 'BYOC']
8+
keywords: ['fivetran', 'clickhouse destination', 'troubleshooting', 'best practices', 'debugging']
99
---
1010

1111
# Troubleshooting & Best Practices
@@ -144,33 +144,42 @@ For example, the following architecture can be used:
144144
- **Service A (writer)**: Fivetran destination + other ingestion tools (ClickPipes, Kafka connectors)
145145
- **Service B (reader)**: BI tools, dashboards, ad-hoc queries
146146

147-
### Duplicate records with ReplacingMergeTree {#duplicate-records}
147+
### Optimizing reading queries {#optimizing-reading-queries}
148148

149-
ClickHouse uses `SharedReplacingMergeTree` for Fivetran destination tables. Duplicate rows with the same primary key are normal — deduplication happens asynchronously during background merges.
149+
ClickHouse uses `SharedReplacingMergeTree` for Fivetran destination tables, which is the version of the [`ReplacingMergeTree` table engine](/guides/replacing-merge-tree) in ClickHouse Cloud. Duplicate rows with the same primary key are normal — deduplication happens asynchronously during background merges. At read time, you need to be careful to avoid returning duplicate rows, as some rows may not have been deduplicated yet.
150150

151-
**Always use the `FINAL` modifier** to get deduplicated results:
151+
Using the `FINAL` keyword is the simplest way to avoid duplicate rows, as it forces a merge of any not-yet-deduplicated rows at read time:
152152

153153
```sql
154154
SELECT * FROM schema.table FINAL WHERE ...
155155
```
156156

157-
See the [table-structure](/integrations/fivetran/reference#table-structure) reference for more details.
157+
There are ways to optimize this `FINAL` operation — for example, by filtering on key columns using a `WHERE` condition. For more details, see the [FINAL performance](/guides/replacing-merge-tree#final-performance) section of the ReplacingMergeTree guide.
158+
159+
If those optimizations are not sufficient, you have additional options that avoid using `FINAL` while still handling duplicates correctly:
160+
- If you are querying a numeric column whose value only ever increases, [you can use `max(the_column)`](/guides/developer/deduplication#avoiding-final).
161+
- If you need to retrieve the latest value of some columns for a particular key, you can use [`argMax(the_column, _fivetran_synced)`](https://clickhouse.com/blog/10-best-practice-tips#perfecting_replacingmergetree), which returns the value from the most recently synced row.
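
As a rough sketch of the two options above, the query below reads one row per key without `FINAL`. The table `schema.table_raw` and the columns `user_id`, `status`, and `total_events` are hypothetical. Ordering `argMax` by `_fivetran_synced` is also an assumption (it presumes that column increases with each sync), so substitute whichever column identifies the newest row in your table.

```sql
-- Hypothetical table and columns, for illustration only.
SELECT
    user_id,
    -- Latest value per key: take status from the row with the highest
    -- _fivetran_synced (assumed to grow with every sync).
    argMax(status, _fivetran_synced) AS latest_status,
    -- A counter that only ever increases: its maximum is the current value.
    max(total_events) AS total_events
FROM schema.table_raw
GROUP BY user_id;
```

Grouping by the key collapses any not-yet-merged duplicates at query time, so the result does not depend on background merges having already run.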
158162

159163
### Primary key and ORDER BY optimization {#primary-key-optimization}
160164

161-
Fivetran replicates the source table's primary key as the ClickHouse `ORDER BY` clause. When the source has no PK, `_fivetran_id` (a UUID) becomes the sorting key, which sometimes may lead to poor query performance because ClickHouse builds its [sparse primary index](/guides/best-practices/sparse-primary-indexes) from the `ORDER BY` columns.
165+
Fivetran replicates the source table's primary key as the ClickHouse `ORDER BY` clause. When the source has no PK, `_fivetran_id` (a UUID) becomes the sorting key, which can lead to poor query performance because ClickHouse builds its [sparse primary index](/guides/best-practices/sparse-primary-indexes) from the `ORDER BY` columns.
162166
163-
**Recommendations:**
167+
**Recommendations if other optimizations are not sufficient:**
164168
165169
1. **Treat Fivetran tables as raw staging tables.** Do not query them directly for analytics.
166-
2. **Create materialized views** with an `ORDER BY` optimized for your query patterns:
170+
2. **If queries are still not performant enough**, use a [Refreshable Materialized View](/materialized-view/refreshable-materialized-view) to create a copy of the table with an `ORDER BY` optimized for your query patterns. Unlike incremental materialized views, refreshable materialized views re-run the full query on a schedule, which correctly handles the `UPDATE` and `DELETE` operations that Fivetran issues during syncs:
167171
```sql
168172
CREATE MATERIALIZED VIEW schema.table_optimized
173+
REFRESH EVERY 1 HOUR
169174
ENGINE = ReplacingMergeTree()
170175
ORDER BY (user_id, event_date)
171-
AS SELECT * FROM schema.table_raw;
176+
AS SELECT * FROM schema.table_raw FINAL;
172177
```
173178
179+
:::note
180+
Avoid incremental (non-refreshable) materialized views for Fivetran-managed tables. Because Fivetran issues `UPDATE` and `DELETE` operations to keep data in sync, incremental materialized views will not reflect these changes and will contain stale or incorrect data.
181+
:::
182+
174183
### Don't manually modify Fivetran-managed tables {#dont-modify-tables}
175184

176185
Avoid manual DDL changes (e.g., `ALTER TABLE ... MODIFY COLUMN`) to tables managed by Fivetran. The connector expects the schema it created. Manual changes can cause [type mapping errors](#uint64-type-error) and schema mismatch failures.
