Commit 4e51761

feat: add query/segments/count metric to data nodes (#19624)
Adds a query/segments/count metric on data nodes, so you can quickly see how many segments a given query scanned on a particular data node. This is more direct than counting unique query/segment/time events (or similar), especially if you sample high-volume metrics such as query/segment/time but still want a quick way to get statistics on the number of segments queried per node.
1 parent c2d15df commit 4e51761
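
As a rough illustration of the use case described above, the sketch below sums query/segments/count per emitting host for one query ID. The `MetricEvent` class, the host names, and the sample values are hypothetical and much simpler than real Druid metric events; only the metric names and the idea of keying on the query `id` and the emitting host come from this commit.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SegmentCountRollup {
    /** Hypothetical, simplified metric event; real Druid events carry more fields. */
    static final class MetricEvent {
        final String host;
        final String metric;
        final String queryId;
        final long value;

        MetricEvent(String host, String metric, String queryId, long value) {
            this.host = host;
            this.metric = metric;
            this.queryId = queryId;
            this.value = value;
        }
    }

    /** Sums query/segments/count per host for a single query ID, ignoring all other metrics. */
    static Map<String, Long> segmentsPerHost(List<MetricEvent> events, String queryId) {
        Map<String, Long> perHost = new TreeMap<>();
        for (MetricEvent e : events) {
            if ("query/segments/count".equals(e.metric) && queryId.equals(e.queryId)) {
                perHost.merge(e.host, e.value, Long::sum);
            }
        }
        return perHost;
    }

    public static void main(String[] args) {
        // Made-up sample data: two data nodes serving query q1, plus unrelated events.
        List<MetricEvent> events = List.of(
            new MetricEvent("historical-1", "query/segments/count", "q1", 3),
            new MetricEvent("historical-1", "query/segment/time", "q1", 120),
            new MetricEvent("historical-2", "query/segments/count", "q1", 5),
            new MetricEvent("historical-1", "query/segments/count", "q2", 7)
        );
        System.out.println(segmentsPerHost(events, "q1")); // prints {historical-1=3, historical-2=5}
    }
}
```

Because each data node reports its own total, this rollup does not depend on every query/segment/time event surviving sampling.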

17 files changed

Lines changed: 409 additions & 14 deletions

docs/operations/metrics.md

Lines changed: 3 additions & 1 deletion
Original file line number / Diff line number / Diff line change
@@ -64,7 +64,7 @@ Most metric values reset each emission period, as specified in `druid.monitoring
6464
|`query/failed/count`|Number of failed queries.|This metric is only available if the `QueryCountStatsMonitor` module is included.| |
6565
|`query/interrupted/count`|Number of queries interrupted due to cancellation.|This metric is only available if the `QueryCountStatsMonitor` module is included.| |
6666
|`query/timeout/count`|Number of timed out queries.|This metric is only available if the `QueryCountStatsMonitor` module is included.| |
67-
|`query/segments/count`|This metric is not enabled by default. See the `QueryMetrics` Interface for reference regarding enabling this metric. Number of segments that will be touched by the query. In the broker, it makes a plan to distribute the query to realtime tasks and historicals based on a snapshot of segment distribution state. If there are some segments moved after this snapshot is created, certain historicals and realtime tasks can report those segments as missing to the broker. The broker will resend the query to the new servers that serve those segments after move. In this case, those segments can be counted more than once in this metric.||Varies|
67+
|`query/segments/count`|Number of segments that will be touched by the query. The Broker makes a plan to distribute the query to realtime tasks and historicals based on a snapshot of segment distribution state. If there are some segments moved after this snapshot is created, certain historicals and realtime tasks can report those segments as missing to the Broker. The Broker will resend the query to the new servers that serve those segments after move. In this case, those segments can be counted more than once in this metric.||Varies|
6868
|`query/priority`|Assigned lane and priority, only if Laning strategy is enabled. Refer to [Laning strategies](../configuration/index.md#laning-strategies)|`lane`, `dataSource`, `type`|0|
6969
|`sqlQuery/time`|Milliseconds taken to complete a SQL query.|`id`, `nativeQueryIds`, `dataSource`, `remoteAddress`, `success`, `engine`, `statusCode`|< 1s|
7070
|`sqlQuery/planningTimeMs`|Milliseconds taken to plan a SQL to native query.|`id`, `nativeQueryIds`, `dataSource`, `remoteAddress`, `success`, `engine`| |
@@ -107,6 +107,7 @@ Most metric values reset each emission period, as specified in `druid.monitoring
107107
|Metric|Description|Dimensions|Normal value|
108108
|------|-----------|----------|------------|
109109
|`query/time`|Milliseconds taken to complete a query.|<p>Common: `dataSource`, `type`, `interval`, `hasFilters`, `duration`, `context`, `remoteAddress`, `id`, `statusCode`.</p><p> Aggregation Queries: `numMetrics`, `numComplexMetrics`.</p><p> GroupBy: `numDimensions`.</p><p> TopN: `threshold`, `dimension`.</p>|< 1s|
110+
|`query/segments/count`|Number of segments this Historical scans for a query.|<p>Common: `dataSource`, `type`, `interval`, `hasFilters`, `duration`, `context`, `remoteAddress`, `id`.</p><p> Aggregation Queries: `numMetrics`, `numComplexMetrics`.</p><p> GroupBy: `numDimensions`.</p><p> TopN: `threshold`, `dimension`.</p>|Varies|
110111
|`query/segment/time`|Milliseconds taken to query individual segment. Includes time to page in the segment from disk.|`id`, `status`, `segment`, `vectorized`.|several hundred milliseconds|
111112
|`query/wait/time`|Milliseconds spent waiting for a segment to be scanned.|`id`, `segment`|< several hundred milliseconds|
112113
|`segment/scan/pending`|Number of segments in queue waiting to be scanned.||Close to 0|
@@ -141,6 +142,7 @@ to represent the task ID are deprecated and will be removed in a future release.
141142
|Metric|Description|Dimensions|Normal value|
142143
|------|-----------|----------|------------|
143144
|`query/time`|Milliseconds taken to complete a query.|<p>Common: `dataSource`, `type`, `interval`, `hasFilters`, `duration`, `context`, `remoteAddress`, `id`, `statusCode`.</p><p> Aggregation Queries: `numMetrics`, `numComplexMetrics`.</p><p> GroupBy: `numDimensions`.</p><p> TopN: `threshold`, `dimension`.</p>|< 1s|
145+
|`query/segments/count`|Number of segments this Peon scans for a query.|<p>Common: `dataSource`, `type`, `interval`, `hasFilters`, `duration`, `context`, `remoteAddress`, `id`.</p><p> Aggregation Queries: `numMetrics`, `numComplexMetrics`.</p><p> GroupBy: `numDimensions`. </p><p>TopN: `threshold`, `dimension`.</p>|Varies|
144146
|`query/wait/time`|Milliseconds spent waiting for a segment to be scanned.|`id`, `segment`|several hundred milliseconds|
145147
|`segment/scan/pending`|Number of segments in queue waiting to be scanned.||Close to 0|
146148
|`segment/scan/active`|Number of segments currently scanned. This metric also indicates how many threads from `druid.processing.numThreads` are currently being used.||Close to `druid.processing.numThreads`|

extensions-contrib/ambari-metrics-emitter/src/main/resources/defaultWhiteListMap.json

Lines changed: 4 additions & 0 deletions
@@ -34,6 +34,10 @@
3434
"dataSource",
3535
"type"
3636
],
37+
"query/segments/count": [
38+
"dataSource",
39+
"type"
40+
],
3741
"query/segment/time": [
3842
"dataSource",
3943
"type"

extensions-contrib/dropwizard-emitter/src/main/resources/defaultMetricDimensions.json

Lines changed: 8 additions & 1 deletion
@@ -21,6 +21,13 @@
2121
"type": "timer",
2222
"timeUnit": "MILLISECONDS"
2323
},
24+
"query/segments/count": {
25+
"dimensions": [
26+
"dataSource",
27+
"type"
28+
],
29+
"type": "counter"
30+
},
2431
"query/segment/time": {
2532
"dimensions": [],
2633
"type": "timer",
@@ -515,4 +522,4 @@
515522
],
516523
"type": "gauge"
517524
}
518-
}
525+
}

extensions-contrib/graphite-emitter/src/main/resources/defaultWhiteListMap.json

Lines changed: 4 additions & 0 deletions
@@ -21,6 +21,10 @@
2121
"dataSource",
2222
"type"
2323
],
24+
"query/segments/count": [
25+
"dataSource",
26+
"type"
27+
],
2428
"query/segment/time": [
2529
"dataSource",
2630
"type"

extensions-contrib/prometheus-emitter/src/main/resources/defaultMetrics.json

Lines changed: 1 addition & 0 deletions
@@ -28,6 +28,7 @@
2828
"schemacache/inTransitSMQPublishedResults/count" : { "dimensions" : [], "type" : "count", "help": "Number of segments for which schema is cached after back filling in the database."},
2929
"serverview/sync/healthy" : { "dimensions" : ["server"], "type" : "gauge", "help": "Sync status of the Broker with a segment-loading server such as a Historical or Peon."},
3030
"serverview/sync/unstableTime" : { "dimensions" : ["server"], "type" : "timer", "conversionFactor": 1000.0, "help": "Time in seconds for which the Broker has been failing to sync with a segment-loading server."},
31+
"query/segments/count" : { "dimensions" : ["dataSource", "type"], "type" : "count", "help": "Number of segments this data node scans for a query."},
3132
"query/segment/time" : { "dimensions" : [], "type" : "timer", "conversionFactor": 1000.0, "help": "Seconds taken to query individual segment. Includes time to page in the segment from disk."},
3233
"query/wait/time" : { "dimensions" : [], "type" : "timer", "conversionFactor": 1000.0, "help": "Seconds spent waiting for a segment to be scanned."},
3334
"segment/scan/pending" : { "dimensions" : [], "type" : "gauge", "help": "Number of segments in queue waiting to be scanned."},

extensions-contrib/statsd-emitter/src/main/resources/defaultMetricDimensions.json

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
55
"query/node/ttfb" : { "dimensions" : ["server"], "type" : "timer"},
66
"query/node/bytes" : { "dimensions" : ["server"], "type" : "count"},
77

8+
"query/segments/count" : { "dimensions" : ["dataSource", "type"], "type" : "count"},
89
"query/segment/time" : { "dimensions" : [], "type" : "timer"},
910
"query/wait/time" : { "dimensions" : [], "type" : "timer"},
1011
"segment/scan/pending" : { "dimensions" : [], "type" : "gauge"},

processing/src/main/java/org/apache/druid/query/DefaultQueryMetrics.java

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ public class DefaultQueryMetrics<QueryType extends Query<?>> implements QueryMet
4646
public static final String QUERY_BYTES = "query/bytes";
4747
public static final String QUERY_CPU_TIME = "query/cpu/time";
4848
public static final String QUERY_WAIT_TIME = "query/wait/time";
49+
public static final String QUERY_SEGMENTS_COUNT = "query/segments/count";
4950
public static final String QUERY_SEGMENT_TIME = "query/segment/time";
5051
public static final String QUERY_SEGMENT_AND_CACHE_TIME = "query/segmentAndCache/time";
5152
public static final String QUERY_RESULT_CACHE_HIT = "query/resultCache/hit";

processing/src/main/java/org/apache/druid/query/QueryMetrics.java

Lines changed: 10 additions & 1 deletion
@@ -329,7 +329,16 @@ default void filterBundle(FilterBundle.BundleInfo bundleInfo)
329329
QueryMetrics<QueryType> reportQueryBytes(long byteCount);
330330

331331
/**
332-
* Registers "segments queried count" metric.
332+
* Registers the {@code query/segments/count} metric, the number of segments touched by the query.
333+
*
334+
* Emitted once per query. The meaning of the metric depends on the emitting process:
335+
* <ul>
336+
* <li>On the Broker, it is the number of segments in the planned query distribution (a snapshot-based
337+
* count that may double-count segments that moved and were re-fetched).</li>
338+
* <li>On a data node (Historical, Peon/realtime), it is the number of segments that node actually scanned
339+
* for the query.</li>
340+
* </ul>
341+
* The two are disambiguated by the emitting service/host, not by the metric name.
333342
*/
334343
QueryMetrics<QueryType> reportQueriedSegmentCount(long segmentCount);
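
The data-node semantics in the Javadoc above (one value per query, equal to the number of segments the node actually scanned) can be sketched as follows. `SegmentCountMetrics`, `RecordingMetrics`, `scanAndReport`, and the segment IDs are hypothetical stand-ins written for this illustration; they are not the classes changed in this commit, and only the `reportQueriedSegmentCount(long)` method name and its fluent return style mirror the interface shown in the diff.

```java
import java.util.ArrayList;
import java.util.List;

public class QueriedSegmentCountSketch {
    /** Hypothetical stand-in for the single QueryMetrics method discussed above. */
    interface SegmentCountMetrics {
        SegmentCountMetrics reportQueriedSegmentCount(long segmentCount);
    }

    /** Keeps reported values in memory so the sketch can be inspected. */
    static final class RecordingMetrics implements SegmentCountMetrics {
        final List<Long> reported = new ArrayList<>();

        @Override
        public SegmentCountMetrics reportQueriedSegmentCount(long segmentCount) {
            reported.add(segmentCount);
            return this;
        }
    }

    /** Hypothetical data-node step: visit every requested segment, then report the total once. */
    static void scanAndReport(List<String> segmentIds, SegmentCountMetrics metrics) {
        long scanned = 0;
        for (String segmentId : segmentIds) {
            // A real data node would run the query against this segment here.
            scanned++;
        }
        // One report per query, not one per segment.
        metrics.reportQueriedSegmentCount(scanned);
    }

    public static void main(String[] args) {
        RecordingMetrics metrics = new RecordingMetrics();
        scanAndReport(List.of("seg-a", "seg-b", "seg-c"), metrics);
        System.out.println(metrics.reported); // prints [3]
    }
}
```

Reporting a single total per query keeps the event volume low compared with a per-segment metric such as `query/segment/time`.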
335344

processing/src/main/resources/loggingEmitterAllowedMetrics.json

Lines changed: 2 additions & 1 deletion
@@ -134,6 +134,7 @@
134134
"query/node/bytes": [],
135135
"query/node/time": [],
136136
"query/node/ttfb": [],
137+
"query/segments/count": [],
137138
"query/segment/time": [],
138139
"query/segmentAndCache/time": [],
139140
"query/success/count": [],
@@ -249,4 +250,4 @@
249250
"tier/total/capacity": [],
250251
"zk/connected": [],
251252
"zk/reconnect/time": []
252-
}
253+
}

processing/src/test/resources/loggingEmitterAllowedMetrics.json

Lines changed: 1 addition & 0 deletions
@@ -134,6 +134,7 @@
134134
"query/node/bytes": [],
135135
"query/node/time": [],
136136
"query/node/ttfb": [],
137+
"query/segments/count": [],
137138
"query/segment/time": [],
138139
"query/segmentAndCache/time": [],
139140
"query/success/count": [],
