Commit 3cc9ebc

clickhouse: add schema metrics (schema_metrics + view refresh) (DataDog#23900)

* Add ClickHouse schema metrics (schema_metrics + view refresh)

  Adds per-table size gauges via ClickhouseTableMetrics job class and per-view refresh status/timing gauges collected inline in check(). Introduces schema_metrics config block and _VIEW_REFRESHES_QUERY with graceful handling for ClickHouse < 24.3 and missing permissions.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(clickhouse): rename changelog entry to match PR number 23900

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(clickhouse): sync SchemaMetrics model formatting with spec.yaml

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(clickhouse): gate view refresh metrics on schema_metrics.enabled

  _collect_view_refresh_metrics was running on every check cycle when dbm was enabled, regardless of schema_metrics.enabled (which defaults to false). Gate it on self.table_metrics so it respects the opt-in config.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(clickhouse): avoid duplicate db: tag on schema metrics

  Per-table and per-view schema metrics appended `db:<database>` on top of the instance-level `db:<connection_db>` base tag, producing two conflicting `db:` tags for any entity outside the connection database. Strip the base `db:` tag before adding the entity's own database, mirroring the postgres `tags_without_db` pattern, so each series carries exactly one `db:` tag.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* Address review feedback: null guard, feature flag, constant naming, tests

  - Remove redundant schema_metrics null check (always initialized with defaults)
  - Register FeatureKey.SCHEMA_METRICS in config.py feature reporting
  - Rename _DEFAULT_COLLECTION_INTERVAL → DEFAULT_COLLECTION_INTERVAL per AGENTS.md
  - Add restart hint to permission-denied log message
  - Add unit tests for _collect_view_refresh_metrics() and _handle_view_refreshes_error()

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Move view refresh logic into ClickhouseTableMetrics

  _collect_view_refresh_metrics and _handle_view_refreshes_error now live on ClickhouseTableMetrics and run inside run_job(), sharing the same collection interval and DBMAsyncJob lifecycle as table size gauges. State flags and query/status-map constants move to table_metrics.py. clickhouse.py loses the duplicate if-table_metrics block, the two module-level constants, the class attribute, and the three init flags.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix run_job() table-size tests silently exercising view refresh path

  _patch_query now patches execute_query_raw to return [] alongside _execute_query, so the view refresh half of run_job() succeeds cleanly instead of swallowing a connection error. The cluster routing test also patches execute_query_raw and gains an assertion that view refresh queries go through clusterAllReplicas too.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix ruff formatting: collapse single-line format() call in table_metrics.py

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Restore trailing spaces in conf.yaml.example stripped by ruff

  ddev validate config requires spec-generated trailing spaces on 'default: ' and 'For example: ' comment lines.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix view_refresh metrics to emit per-replica service checks

  In single_endpoint_mode, clusterAllReplicas returns one row per replica per view. The previous seen dedup keyed on (database, view) silently discarded all but one replica's status, which could hide per-replica Error states. Now keys seen on (database, view, host) and tags each series with host: so each replica surfaces its own refresh status independently.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Use LIMIT 1 BY in table sizes query instead of Python-side dedup

  clusterAllReplicas returns one row per replica per table; pushing the dedup into SQL with LIMIT 1 BY database, name is cleaner than filtering in Python.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix test_table_metrics for host tag and SQL-side dedup changes

  - Add host field to _view_refresh_row fixture (query now returns hostName())
  - Assert host: tag appears in view refresh service check
  - Replace Python-side dedup test with assertion that LIMIT 1 BY appears in the table sizes query, since dedup is now the SQL's responsibility

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Use LIMIT 1 BY in view refreshes query instead of Python-side dedup

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove stale Python-side dedup test for view refreshes

  Deduplication is now enforced by LIMIT 1 BY in the SQL query.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix ruff: remove extra blank line in test_table_metrics.py

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove view.refresh service check in favour of view.refresh.status gauge

  Service checks are slated for deprecation; new ones should not be added. The view.refresh.status gauge already covers the same OK/WARNING/CRITICAL/UNKNOWN signal, so the service_check() call is redundant.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
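
The commit message describes moving deduplication into SQL with `LIMIT 1 BY`, and explains why the view refresh key had to grow from (database, view) to (database, view, host). A minimal Python sketch of that keying difference is below. The helper `limit_one_by` and the sample rows are hypothetical and only emulate the "keep the first row per key" behaviour in input order; they are not code from this commit.

```python
from typing import Iterable, Sequence


def limit_one_by(rows: Iterable[Sequence], key_indexes: Sequence[int]) -> list:
    """Keep the first row seen for each distinct key (rough analogue of LIMIT 1 BY)."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical rows shaped as (database, view, host, status), as if
# clusterAllReplicas had returned one row per replica per view.
rows = [
    ("analytics", "daily_mv", "replica-1", "Scheduled"),
    ("analytics", "daily_mv", "replica-2", "Error"),
    ("analytics", "daily_mv", "replica-1", "Scheduled"),
]

# Keyed on (database, view): replica-2's Error row is dropped.
by_view = limit_one_by(rows, key_indexes=(0, 1))

# Keyed on (database, view, host): each replica keeps its own row.
by_view_and_host = limit_one_by(rows, key_indexes=(0, 1, 2))
```

With the narrower key only the first replica's status survives, which is the masking problem the message describes; the wider key keeps one row per replica.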
1 parent fec12ba commit 3cc9ebc

11 files changed

Lines changed: 549 additions & 0 deletions

File tree

clickhouse/assets/configuration/spec.yaml

Lines changed: 28 additions & 0 deletions
@@ -345,6 +345,34 @@ files:
                   type: boolean
                   example: false
             fleet_configurable: false
+          - name: schema_metrics
+            description: |
+              Configure per-table size and per-view refresh gauges derived from
+              `system.tables` and `system.view_refreshes`. Independent of
+              `collect_schemas` (which controls catalog structure collection)
+              so users can dashboard table sizes without enabling Schema Explorer.
+              Requires `dbm: true`.
+            options:
+              - name: enabled
+                description: |
+                  Enable collection of per-table size and per-view refresh gauges.
+                value:
+                  type: boolean
+                  example: false
+              - name: collection_interval
+                description: |
+                  Set the schema metrics collection interval (in seconds). These
+                  gauges change continuously, so 60s is a reasonable default.
+                value:
+                  type: number
+                  example: 60
+              - name: run_sync
+                hidden: true
+                description: |
+                  Run the schema metrics collection synchronously. For testing only.
+                value:
+                  type: boolean
+                  example: false
           - name: collect_schemas
             description: |
               Configure collection of ClickHouse catalog metadata (databases,

clickhouse/changelog.d/23900.added

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Add ClickHouse schema metrics: per-table size gauges and per-view refresh status gauges under schema_metrics.

clickhouse/datadog_checks/clickhouse/clickhouse.py

Lines changed: 15 additions & 0 deletions
@@ -23,6 +23,7 @@
 from .query_errors import ClickhouseQueryErrors
 from .statement_samples import ClickhouseStatementSamples
 from .statements import ClickhouseStatementMetrics
+from .table_metrics import ClickhouseTableMetrics
 from .utils import ErrorSanitizer

 try:
@@ -123,6 +124,12 @@ def _init_dbm_components(self):
         else:
             self.query_errors = None

+        # Initialize schema metrics (per-table size and per-view refresh gauges)
+        if self._config.dbm and self._config.schema_metrics.enabled:
+            self.table_metrics = ClickhouseTableMetrics(self, self._config.schema_metrics)
+        else:
+            self.table_metrics = None
+
         # Initialize schema collection (catalog metadata for Schema Explorer)
         if self._config.dbm and self._config.collect_schemas.enabled:
             self.metadata = ClickhouseMetadata(self)
@@ -267,6 +274,10 @@ def check(self, _):
         if self.query_errors:
             self.query_errors.run_job_loop(self.tags)

+        # Run schema metrics (per-table size and per-view refresh gauges) if enabled
+        if self.table_metrics:
+            self.table_metrics.run_job_loop(self.tags)
+
         # Run schema collection if enabled
         if self.metadata:
             self.metadata.run_job_loop(self.tags)
@@ -540,6 +551,8 @@ def cancel(self):
             self.query_completions.cancel()
         if self.query_errors:
             self.query_errors.cancel()
+        if self.table_metrics:
+            self.table_metrics.cancel()
         if self.metadata:
             self.metadata.cancel()
         if self.parts_and_merges:
@@ -554,6 +567,8 @@ def cancel(self):
             self.query_completions._job_loop_future.result()
         if self.query_errors and self.query_errors._job_loop_future:
             self.query_errors._job_loop_future.result()
+        if self.table_metrics and self.table_metrics._job_loop_future:
+            self.table_metrics._job_loop_future.result()
         if self.metadata and self.metadata._job_loop_future:
             self.metadata._job_loop_future.result()
         if self.parts_and_merges and self.parts_and_merges._job_loop_future:
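
The job started from check() emits metrics on top of the instance-level tags, and the commit message notes that those tags already carry a `db:` tag for the connection database. A minimal sketch of the strip-then-append approach described there follows. The function name `tags_for_entity` and the sample tag values are assumptions for illustration, not names from the integration.

```python
def tags_for_entity(base_tags, database, **entity):
    """Return tags carrying exactly one db: tag, the entity's own database."""
    # Drop any instance-level db: tag (the connection database).
    without_db = [t for t in base_tags if not t.startswith("db:")]
    # Append the entity's database plus its identifying tags (table, view, host...).
    return without_db + [f"db:{database}"] + [f"{k}:{v}" for k, v in entity.items()]


# Hypothetical instance tags where the agent connects to the "default" database.
instance_tags = ["env:prod", "db:default", "server:ch-1"]

table_tags = tags_for_entity(instance_tags, "analytics", table="events")
view_tags = tags_for_entity(instance_tags, "analytics", view="daily_mv", host="replica-1")
```

Filtering on the `db:` prefix rather than on an exact value means the sketch also handles instances whose connection database differs from `default`.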

clickhouse/datadog_checks/clickhouse/config.py

Lines changed: 9 additions & 0 deletions
@@ -128,6 +128,10 @@ def build_config(check: ClickhouseCheck) -> Tuple[InstanceConfig, ValidationResu
             **dict_defaults.instance_parts_and_merges().model_dump(),
             **(instance.get('parts_and_merges', {})),
         },
+        "schema_metrics": {
+            **dict_defaults.instance_schema_metrics().model_dump(),
+            **(instance.get('schema_metrics', {})),
+        },
         "collect_schemas": {
             **dict_defaults.instance_collect_schemas().model_dump(),
             **(instance.get('collect_schemas', {})),
@@ -315,6 +319,11 @@ def _apply_features(config: InstanceConfig, validation_result: ValidationResult)
         config.parts_and_merges.enabled and config.dbm,
         None if config.dbm else "Requires `dbm: true`",
     )
+    validation_result.add_feature(
+        FeatureKey.SCHEMA_METRICS,
+        config.schema_metrics.enabled and config.dbm,
+        None if config.dbm else "Requires `dbm: true`",
+    )
     validation_result.add_feature(FeatureKey.SINGLE_ENDPOINT_MODE, config.single_endpoint_mode)
320329

clickhouse/datadog_checks/clickhouse/config_models/dict_defaults.py

Lines changed: 8 additions & 0 deletions
@@ -57,6 +57,14 @@ def instance_query_errors():
     )


+def instance_schema_metrics():
+    return instance.SchemaMetrics(
+        enabled=False,
+        collection_interval=60,
+        run_sync=False,
+    )
+
+
 def instance_collect_schemas():
     return instance.CollectSchemas(
         enabled=False,
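
The 60-second default set in `instance_schema_metrics()` is later converted to a job rate limit in `ClickhouseTableMetrics.__init__`, which falls back to the default when the configured interval is missing or non-positive. A minimal sketch of that arithmetic follows. The helper names `effective_interval` and `rate_limit_for` are hypothetical; only the fallback rule and the `1 / collection_interval` expression come from the diff.

```python
DEFAULT_COLLECTION_INTERVAL = 60


def effective_interval(configured):
    """Fall back to the default when the interval is unset, zero, or negative."""
    if configured is None or configured <= 0:
        return DEFAULT_COLLECTION_INTERVAL
    return configured


def rate_limit_for(configured):
    """Runs per second implied by the interval, as passed to the async job."""
    return 1 / effective_interval(configured)


default_rate = rate_limit_for(None)  # one run every 60 seconds
fast_rate = rate_limit_for(15)       # one run every 15 seconds
guarded_rate = rate_limit_for(0)     # zero would divide by zero without the fallback
```

The guard matters because the rate is a reciprocal: an interval of zero would otherwise raise `ZeroDivisionError` during job construction.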

clickhouse/datadog_checks/clickhouse/config_models/instance.py

Lines changed: 11 additions & 0 deletions
@@ -138,6 +138,16 @@ class QuerySamples(BaseModel):
     run_sync: Optional[bool] = None


+class SchemaMetrics(BaseModel):
+    model_config = ConfigDict(
+        arbitrary_types_allowed=True,
+        frozen=True,
+    )
+    collection_interval: Optional[float] = None
+    enabled: Optional[bool] = None
+    run_sync: Optional[bool] = None
+
+
 class InstanceConfig(BaseModel):
     model_config = ConfigDict(
         validate_default=True,
@@ -166,6 +176,7 @@ class InstanceConfig(BaseModel):
     query_samples: Optional[QuerySamples] = None
     read_timeout: Optional[int] = None
     reported_hostname: Optional[str] = None
+    schema_metrics: Optional[SchemaMetrics] = None
     server: str
     service: Optional[str] = None
     single_endpoint_mode: Optional[bool] = None

clickhouse/datadog_checks/clickhouse/data/conf.yaml.example

Lines changed: 19 additions & 0 deletions
@@ -214,6 +214,25 @@ instances:
         #
         # samples_per_hour_per_query: 60

+    ## Configure per-table size and per-view refresh gauges derived from
+    ## `system.tables` and `system.view_refreshes`. Independent of
+    ## `collect_schemas` (which controls catalog structure collection)
+    ## so users can dashboard table sizes without enabling Schema Explorer.
+    ## Requires `dbm: true`.
+    #
+    # schema_metrics:
+
+        ## @param enabled - boolean - optional - default: false
+        ## Enable collection of per-table size and per-view refresh gauges.
+        #
+        # enabled: false
+
+        ## @param collection_interval - number - optional - default: 60
+        ## Set the schema metrics collection interval (in seconds). These
+        ## gauges change continuously, so 60s is a reasonable default.
+        #
+        # collection_interval: 60
+
     ## Configure collection of ClickHouse catalog metadata (databases,
     ## tables, views, columns) for Database Monitoring's Schema Explorer.
     ## Requires `dbm: true`.

clickhouse/datadog_checks/clickhouse/features.py

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,7 @@ class FeatureKey(Enum):
     QUERY_COMPLETIONS = "query_completions"
     EXPLAIN_PLANS = "explain_plans"
     QUERY_ERRORS = "query_errors"
+    SCHEMA_METRICS = "schema_metrics"
     COLLECT_SCHEMAS = "collect_schemas"
     PARTS_AND_MERGES = "parts_and_merges"
     SINGLE_ENDPOINT_MODE = "single_endpoint_mode"
@@ -35,6 +36,7 @@ class FeatureKey(Enum):
     FeatureKey.QUERY_COMPLETIONS: 'Query Completions',
     FeatureKey.QUERY_ERRORS: 'Query Errors',
     FeatureKey.EXPLAIN_PLANS: 'Explain Plans',
+    FeatureKey.SCHEMA_METRICS: 'Schema Metrics',
     FeatureKey.COLLECT_SCHEMAS: 'Collect Schemas',
     FeatureKey.PARTS_AND_MERGES: 'Parts and Merges',
     FeatureKey.SINGLE_ENDPOINT_MODE: 'Single Endpoint Mode',
Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
+# (C) Datadog, Inc. 2026-present
+# All rights reserved
+# Licensed under a 3-clause BSD style license (see LICENSE)
+from __future__ import annotations
+
+from typing import TYPE_CHECKING
+
+from clickhouse_connect.driver.exceptions import OperationalError
+
+if TYPE_CHECKING:
+    from datadog_checks.clickhouse import ClickhouseCheck
+    from datadog_checks.clickhouse.config_models.instance import SchemaMetrics
+
+from datadog_checks.base import AgentCheck
+from datadog_checks.base.utils.db.utils import DBMAsyncJob
+from datadog_checks.base.utils.tracking import tracked_method
+
+DEFAULT_COLLECTION_INTERVAL = 60
+
+_TABLE_SIZES_QUERY = """\
+SELECT
+    database,
+    name,
+    toInt64(total_rows) AS total_rows,
+    toInt64(total_bytes) AS total_bytes
+FROM {tables_table}
+WHERE database NOT IN ('system', 'INFORMATION_SCHEMA', 'information_schema')
+LIMIT 1 BY database, name
+"""
+
+_VIEW_REFRESHES_QUERY = """\
+SELECT
+    database,
+    view,
+    hostName() AS host,
+    status,
+    exception,
+    toInt64(toUnixTimestamp(last_success_time)) AS last_refresh_time,
+    toInt64(toUnixTimestamp(next_refresh_time)) AS next_refresh_time,
+    toInt64(written_rows) AS written_rows,
+    toInt64(written_bytes) AS written_bytes
+FROM {view_refreshes_table}
+LIMIT 1 BY database, view, host
+"""
+
+_VIEW_REFRESH_STATUS_MAP = {
+    'Scheduled': AgentCheck.OK,
+    'Running': AgentCheck.OK,
+    'WaitingForDependencies': AgentCheck.WARNING,
+    'Disabled': AgentCheck.UNKNOWN,
+    'Error': AgentCheck.CRITICAL,
+}
+
+
+def agent_check_getter(self):
+    return self._check
+
+
+class ClickhouseTableMetrics(DBMAsyncJob):
+    """Per-table size and per-view refresh gauges from system.tables and system.view_refreshes."""
+
+    def __init__(self, check: ClickhouseCheck, config: SchemaMetrics):
+        collection_interval = config.collection_interval
+        if collection_interval is None or collection_interval <= 0:
+            collection_interval = DEFAULT_COLLECTION_INTERVAL
+
+        super(ClickhouseTableMetrics, self).__init__(
+            check,
+            rate_limit=1 / collection_interval,
+            run_sync=config.run_sync,
+            enabled=config.enabled,
+            dbms='clickhouse',
+            min_collection_interval=check._config.min_collection_interval,
+            expected_db_exceptions=(Exception,),
+            job_name='clickhouse-table-metrics',
+        )
+        self._check = check
+        self._config = config
+        self._collection_interval = collection_interval
+        self._db_client = None
+        self._view_refreshes_unsupported_logged = False
+        self._view_refreshes_permission_logged = False
+        self._view_refreshes_skip = False
+
+    def cancel(self):
+        super(ClickhouseTableMetrics, self).cancel()
+        self._close_db_client()
+
+    def _close_db_client(self):
+        if self._db_client:
+            try:
+                self._db_client.close()
+            except Exception as e:
+                self._log.debug("Error closing table-metrics client: %s", e)
+            self._db_client = None
+
+    def _execute_query(self, query: str) -> list:
+        if self._db_client is None:
+            self._db_client = self._check.create_dbm_client()
+            self._db_client.set_client_setting('max_execution_time', self._collection_interval)
+        try:
+            return self._db_client.query(query).result_rows
+        except OperationalError as e:
+            self._log.warning("Connection error on table-metrics query, will reconnect: %s", e)
+            self._close_db_client()
+            raise
+
+    @tracked_method(agent_check_getter=agent_check_getter)
+    def run_job(self):
+        self._emit_table_size_gauges()
+        self._collect_view_refresh_metrics()
+
+    def _emit_table_size_gauges(self) -> None:
+        try:
+            rows = self._execute_query(_TABLE_SIZES_QUERY.format(tables_table=self._check.get_system_table('tables')))
+        except Exception:
+            self._log.exception("Failed to collect clickhouse table sizes")
+            return
+
+        # Drop the instance-level `db:` base tag (the connection database) so each
+        # per-table series carries exactly one `db:` tag: the table's own database.
+        base_tags = [t for t in self._check.tags if not t.startswith('db:')]
+        for database, name, total_rows, total_bytes in rows:
+            entity_tags = base_tags + [f'db:{database}', f'table:{name}']
+            self._check.gauge('table.rows', int(total_rows or 0), tags=entity_tags)
+            self._check.gauge('table.bytes', int(total_bytes or 0), tags=entity_tags)
+
+    def _collect_view_refresh_metrics(self) -> None:
+        if self._view_refreshes_skip:
+            return
+        try:
+            rows = self._check.execute_query_raw(
+                _VIEW_REFRESHES_QUERY.format(view_refreshes_table=self._check.get_system_table('view_refreshes'))
+            )
+        except Exception as e:
+            self._handle_view_refreshes_error(e)
+            return
+
+        # Drop the instance-level `db:` base tag (the connection database) so each
+        # per-view series carries exactly one `db:` tag: the view's own database.
+        base_tags = [t for t in self._check.tags if not t.startswith('db:')]
+        for database, view_name, host, status, _exception, last_time, next_time, written_rows, written_bytes in rows:
+            view_tags = base_tags + [f'db:{database}', f'view:{view_name}', f'host:{host}']
+            refresh_status = _VIEW_REFRESH_STATUS_MAP.get(status, AgentCheck.UNKNOWN)
+            self._check.gauge('view.refresh.status', refresh_status, tags=view_tags)
+            self._check.gauge('view.refresh.last_time', int(last_time or 0), tags=view_tags)
+            self._check.gauge('view.refresh.next_time', int(next_time or 0), tags=view_tags)
+            self._check.gauge('view.refresh.rows', int(written_rows or 0), tags=view_tags)
+            self._check.gauge('view.refresh.bytes', int(written_bytes or 0), tags=view_tags)
+
+    def _handle_view_refreshes_error(self, e: Exception) -> None:
+        lowered = str(e).lower()
+        if 'unknown table' in lowered or 'unknowntable' in lowered or 'unknown_table' in lowered:
+            if not self._view_refreshes_unsupported_logged:
+                self._log.info(
+                    "system.view_refreshes not present (ClickHouse < 24.3); refresh status will not be populated."
+                )
+                self._view_refreshes_unsupported_logged = True
+            self._view_refreshes_skip = True
+        elif 'not enough privileges' in lowered or 'access_denied' in lowered:
+            if not self._view_refreshes_permission_logged:
+                self._log.warning(
+                    "Agent user lacks SELECT on system.view_refreshes; refresh status will not be populated. "
+                    "Grant with: GRANT SELECT ON system.view_refreshes TO <agent_user>. "
+                    "Restart the agent after granting access."
+                )
+                self._view_refreshes_permission_logged = True
+            self._view_refreshes_skip = True
+        else:
+            self._log.exception("Unexpected error querying system.view_refreshes")
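
`_handle_view_refreshes_error` sorts failures into three outcomes by matching substrings in the lowercased exception text: table missing, permission denied, or anything else. Only the first two disable further attempts. A minimal standalone sketch of that classification follows. The enum, the function names, and the sample error strings are hypothetical illustrations; only the matched substrings are taken from the diff.

```python
import enum


class ViewRefreshFailure(enum.Enum):
    UNSUPPORTED = "unsupported"   # system.view_refreshes does not exist
    PERMISSION = "permission"     # agent user cannot SELECT from it
    UNEXPECTED = "unexpected"     # anything else; retried on the next run


_UNSUPPORTED_MARKERS = ("unknown table", "unknowntable", "unknown_table")
_PERMISSION_MARKERS = ("not enough privileges", "access_denied")


def classify_view_refresh_error(error: Exception) -> ViewRefreshFailure:
    """Classify an error by substring match on its lowercased message."""
    lowered = str(error).lower()
    if any(marker in lowered for marker in _UNSUPPORTED_MARKERS):
        return ViewRefreshFailure.UNSUPPORTED
    if any(marker in lowered for marker in _PERMISSION_MARKERS):
        return ViewRefreshFailure.PERMISSION
    return ViewRefreshFailure.UNEXPECTED


def should_skip_future_runs(failure: ViewRefreshFailure) -> bool:
    """Only persistent conditions stop the job from querying again."""
    return failure is not ViewRefreshFailure.UNEXPECTED


# Illustrative messages only; real server error text may differ.
missing = classify_view_refresh_error(Exception("DB::Exception: Unknown table system.view_refreshes (UNKNOWN_TABLE)"))
denied = classify_view_refresh_error(Exception("DB::Exception: Not enough privileges (ACCESS_DENIED)"))
other = classify_view_refresh_error(Exception("read timed out"))
```

Because the skip flag in the diff is held in memory and never reset, a permission fix takes effect only after the agent restarts, which matches the restart hint in the warning message.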

clickhouse/tests/test_config_defaults.py

Lines changed: 6 additions & 0 deletions
@@ -70,6 +70,12 @@
         'max_samples_per_collection': 1000,
         'run_sync': False,
     },
+    # === DBM: Schema metrics ===
+    'schema_metrics': {
+        'enabled': False,
+        'collection_interval': 60,
+        'run_sync': False,
+    },
     # === DBM: Schema collector ===
     'collect_schemas': {
         'enabled': False,
