Commit 2b48e80

Andres-Ayala1 authored and Hackerpilot committed
Refactor queries_grouped by hash and 2 additional optimization scripts from PR GoogleCloudPlatform#471 (GoogleCloudPlatform#517)
1 parent 562efc4 commit 2b48e80

8 files changed

Lines changed: 226 additions & 12 deletions

scripts/optimization/README.md

Lines changed: 60 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -336,14 +336,13 @@ and the pricing for each region found [here](https://cloud.google.com/bigquery/p
336336
337337
## Queries grouped by hash
338338
339-
The [queries_grouped_by_hash.sql](queries_grouped_by_hash.sql) script creates a
339+
The [queries_grouped_by_hash_project.sql](queries_grouped_by_hash_project.sql) script creates a
340340
table named,
341-
`queries_grouped_by_hash`. This table groups queries by their normalized query
341+
`queries_grouped_by_hash_project`. This table groups queries by their normalized query
342342
pattern, which ignores
343343
comments, parameter values, UDFs, and literals in the query text.
344344
This allows us to group queries that are logically the same, but
345-
have different literals. The `queries_grouped_by_hash` table does not expose the
346-
raw SQL text of the queries.
345+
have different literals.
347346
348347
The [viewable_queries_grouped_by_hash.sql](viewable_queries_grouped_by_hash.sql)
349348
script creates a table named,
@@ -355,6 +354,11 @@ in execution than the `queries_grouped_by_hash.sql` script because it has to
355354
loop over all projects and for each
356355
project query the `INFORMATION_SCHEMA.JOBS_BY_PROJECT` view.
357356
357+
Both `queries_grouped_by_hash` tables (org and project level) include duration percentiles (`median_time_ms`, `p75_time_ms`, `p90_time_ms`, etc.) computed from the elapsed time between each job's `creation_time` and `end_time`. These metrics help identify query performance stability:
358+
- **Median**: A high median means the query typically takes a long time to complete. Prioritize optimizing these queries (for example, filter earlier and check joins).
359+
- **Median vs p99**: A large gap indicates unstable performance (e.g., occasional slot contention or data skew).
360+
- **p95/p99**: Useful for tracking SLA violations and "worst-case" user experience.
361+
358362
For example, the following queries would be grouped together because the date
359363
literal filters are ignored:
360364
@@ -372,16 +376,25 @@ Running the `run_anti_pattern_tool.sh` bash script will build and run the Anti-P
372376
373377
```sql
374378
SELECT *
375-
FROM optimization_workshop.queries_grouped_by_hash
379+
FROM optimization_workshop.queries_grouped_by_hash_org
376380
ORDER BY total_gigabytes_processed DESC
377381
LIMIT 100
378382
```
379383
384+
* Top 200 queries with the highest total slot hours
385+
386+
```sql
387+
SELECT *
388+
FROM optimization_workshop.queries_grouped_by_hash_project
389+
ORDER BY total_slot_hours DESC
390+
LIMIT 200
391+
```
392+
380393
* Top 100 recurring queries with the highest slot hours consumed
381394
382395
```sql
383396
SELECT *
384-
FROM optimization_workshop.queries_grouped_by_hash
397+
FROM optimization_workshop.queries_grouped_by_hash_org
385398
ORDER BY total_slot_hours * days_active * job_count DESC
386399
LIMIT 100
387400
```
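
* Query patterns with the largest gap between median and p99 duration

The following is a minimal sketch rather than part of the scripts. It assumes the table exposes `query_hash` and `job_count` alongside the percentile columns, and the `job_count >= 20` cutoff is an arbitrary illustrative threshold to avoid ranking patterns with too few runs.

```sql
-- Sketch: rank query patterns by how far the p99 duration sits above the median.
-- A high ratio suggests unstable performance for that pattern.
SELECT
  query_hash,
  job_count,
  median_time_ms,
  p99_time_ms,
  -- SAFE_DIVIDE returns NULL instead of failing when median_time_ms is 0.
  SAFE_DIVIDE(p99_time_ms, median_time_ms) AS p99_to_median_ratio
FROM optimization_workshop.queries_grouped_by_hash_project
WHERE job_count >= 20  -- assumed cutoff, adjust to your workload
ORDER BY p99_to_median_ratio DESC
LIMIT 100
```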
@@ -487,6 +500,45 @@ generated for them in the past 30 days.
487500

488501
</details>
489502

503+
<details><summary><b>&#128269; BI Engine Mode Duration </b></summary>
504+
505+
## BI Engine Mode Duration
506+
507+
The [bi_engine_mode_duration](bi_engine_mode_duration.sql)
508+
script creates a table named, `bi_engine_mode_duration`. This table
509+
groups queries by day and BI Engine mode and shows, for each day, how long queries took in each mode.
510+
511+
### Examples of querying script results
512+
513+
* Order by day and BI Engine mode
514+
515+
```sql
516+
SELECT *
517+
FROM optimization_workshop.bi_engine_mode_duration
518+
ORDER BY day, bi_engine_mode ASC
519+
```
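
* Job count and worst daily p99 duration per BI Engine mode over the last 7 days

This is a hedged sketch that uses only columns the script creates (`day`, `bi_engine_mode`, `job_count`, `p99_time_ms`). The 7-day window is an assumption chosen for illustration.

```sql
-- Sketch: summarize the daily rows into one row per BI Engine mode.
SELECT
  bi_engine_mode,
  SUM(job_count) AS total_jobs,
  -- Highest daily p99 in the window, i.e. the worst day for this mode.
  MAX(p99_time_ms) AS worst_daily_p99_time_ms
FROM optimization_workshop.bi_engine_mode_duration
WHERE day >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)  -- assumed window
GROUP BY bi_engine_mode
ORDER BY worst_daily_p99_time_ms DESC
```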
520+
521+
</details>
522+
523+
<details><summary><b>&#128269; BI Engine Disabled Reasons</b></summary>
524+
525+
## BI Engine Disabled Reasons
526+
527+
The [bi_engine_disabled_reasons](bi_engine_disabled_reasons.sql)
528+
script creates a table named, `bi_engine_disabled_reasons`. This table groups queries by their BI Engine disabled reason and counts the occurrences of each reason.
529+
530+
### Examples of querying script results
531+
532+
* Order by reasons count descending
533+
534+
```sql
535+
SELECT *
536+
FROM optimization_workshop.bi_engine_disabled_reasons
537+
ORDER BY count DESC
538+
```
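
* Share of each disabled reason

This is a minimal sketch that assumes the table's columns are named `code` and `count`, matching the `ORDER BY count` example above. Because a single job can report several reasons, the percentages describe reason occurrences rather than distinct jobs.

```sql
-- Sketch: show each disabled reason as a percentage of all reported reasons.
SELECT
  code,
  `count`,
  -- The empty OVER () window sums across every row in the table.
  ROUND(100 * `count` / SUM(`count`) OVER (), 1) AS pct_of_reasons
FROM optimization_workshop.bi_engine_disabled_reasons
ORDER BY `count` DESC
```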
539+
540+
</details>
541+
490542
# Workload Analysis
491543

492544
<details><summary><b>&#128269; Hourly slot consumption by query hash</b></summary>
@@ -534,3 +586,5 @@ of that hour's slots each grouping of labels consumed.
534586
```
535587

536588
</details>
589+
590+
Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
/*
2+
* Copyright 2026 Google LLC
3+
*
4+
* Licensed under the Apache License, Version 2.0 (the "License");
5+
* you may not use this file except in compliance with the License.
6+
* You may obtain a copy of the License at
7+
*
8+
* http://www.apache.org/licenses/LICENSE-2.0
9+
*
10+
* Unless required by applicable law or agreed to in writing, software
11+
* distributed under the License is distributed on an "AS IS" BASIS,
12+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
* See the License for the specific language governing permissions and
14+
* limitations under the License.
15+
*/
16+
17+
/*
18+
* This script creates a table named, bi_engine_disabled_reasons,
19+
* which groups queries by their BI Engine Disabled reason and counts them by reason.
20+
* This table helps identify the most common reasons why BI Engine is disabled
21+
* for your queries so that you can tune your queries to be BI Engine friendly.
22+
*
23+
* 30 days is the default timeframe, but you can change this by setting the
24+
* num_days_to_scan variable to a different value.
25+
*/
26+
27+
28+
DECLARE num_days_to_scan INT64 DEFAULT 30;
29+
30+
CREATE SCHEMA IF NOT EXISTS optimization_workshop;
31+
CREATE OR REPLACE TABLE optimization_workshop.bi_engine_disabled_reasons AS
32+
SELECT reasons.code, COUNT(*) AS count
33+
FROM `region-us`.INFORMATION_SCHEMA.JOBS as jbo, UNNEST(bi_engine_statistics.bi_engine_reasons) AS reasons
34+
WHERE DATE(jbo.creation_time) >= CURRENT_DATE - num_days_to_scan
35+
AND bi_engine_statistics.bi_engine_mode = 'DISABLED'
36+
GROUP BY reasons.code;
Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
/*
2+
* Copyright 2026 Google LLC
3+
*
4+
* Licensed under the Apache License, Version 2.0 (the "License");
5+
* you may not use this file except in compliance with the License.
6+
* You may obtain a copy of the License at
7+
*
8+
* http://www.apache.org/licenses/LICENSE-2.0
9+
*
10+
* Unless required by applicable law or agreed to in writing, software
11+
* distributed under the License is distributed on an "AS IS" BASIS,
12+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
* See the License for the specific language governing permissions and
14+
* limitations under the License.
15+
*/
16+
17+
DECLARE num_days_to_scan INT64 DEFAULT 30;
18+
19+
CREATE SCHEMA IF NOT EXISTS optimization_workshop;
20+
CREATE OR REPLACE TABLE optimization_workshop.bi_engine_mode_duration AS
21+
SELECT
22+
day,
23+
bi_engine_mode,
24+
COUNT(*) as job_count,
25+
AVG(time_ms) avg_time_ms,
26+
MAX(median_time_ms) median_time_ms,
27+
MAX(p75_time_ms) p75_time_ms,
28+
MAX(p80_time_ms) p80_time_ms,
29+
MAX(p90_time_ms) p90_time_ms,
30+
MAX(p95_time_ms) p95_time_ms,
31+
MAX(p99_time_ms) p99_time_ms,
32+
FROM
33+
(
34+
SELECT
35+
day,
36+
bi_engine_mode,
37+
time_ms,
38+
PERCENTILE_CONT(time_ms, 0.5) OVER (PARTITION BY day, bi_engine_mode) as median_time_ms,
39+
PERCENTILE_CONT(time_ms, 0.75) OVER (PARTITION BY day, bi_engine_mode) as p75_time_ms,
40+
PERCENTILE_CONT(time_ms, 0.8) OVER (PARTITION BY day, bi_engine_mode) as p80_time_ms,
41+
PERCENTILE_CONT(time_ms, 0.90) OVER (PARTITION BY day, bi_engine_mode) as p90_time_ms,
42+
PERCENTILE_CONT(time_ms, 0.95) OVER (PARTITION BY day, bi_engine_mode) as p95_time_ms,
43+
PERCENTILE_CONT(time_ms, 0.99) OVER (PARTITION BY day, bi_engine_mode) as p99_time_ms,
44+
FROM
45+
(
46+
SELECT
47+
DATE(jbo.creation_time) AS day,
48+
bi_engine_statistics.bi_engine_mode as bi_engine_mode,
49+
job_id,
50+
TIMESTAMP_DIFF(jbo.end_time, jbo.creation_time, MILLISECOND) time_ms
51+
FROM
52+
`region-us`.INFORMATION_SCHEMA.JOBS jbo
53+
WHERE
54+
DATE(creation_time) >= CURRENT_DATE - num_days_to_scan
55+
AND jbo.end_time > jbo.start_time
56+
AND jbo.error_result IS NULL
57+
AND jbo.statement_type != 'SCRIPT'
58+
)
59+
)
60+
GROUP BY
61+
1,
62+
2
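
The nested `PERCENTILE_CONT` window functions followed by `MAX` can also be written as a single aggregation. The following is a minimal sketch, assuming approximate percentiles are acceptable. It borrows the `APPROX_QUANTILES` approach that the `queries_grouped_by_hash` scripts in this commit use, which is not what `bi_engine_mode_duration.sql` itself does, so its results can differ slightly from the exact values.

```sql
DECLARE num_days_to_scan INT64 DEFAULT 30;

-- Sketch: compute the quantile array once per (day, bi_engine_mode) group,
-- then read the individual percentiles out of it.
SELECT
  day,
  bi_engine_mode,
  job_count,
  avg_time_ms,
  quantiles[OFFSET(50)] AS median_time_ms,
  quantiles[OFFSET(75)] AS p75_time_ms,
  quantiles[OFFSET(80)] AS p80_time_ms,
  quantiles[OFFSET(90)] AS p90_time_ms,
  quantiles[OFFSET(95)] AS p95_time_ms,
  quantiles[OFFSET(99)] AS p99_time_ms
FROM (
  SELECT
    DATE(creation_time) AS day,
    bi_engine_statistics.bi_engine_mode AS bi_engine_mode,
    COUNT(*) AS job_count,
    AVG(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND)) AS avg_time_ms,
    -- 100 buckets yields 101 boundaries, so OFFSET(n) is roughly the nth percentile.
    APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100) AS quantiles
  FROM `region-us`.INFORMATION_SCHEMA.JOBS
  WHERE
    DATE(creation_time) >= CURRENT_DATE - num_days_to_scan
    AND end_time > start_time
    AND error_result IS NULL
    AND statement_type != 'SCRIPT'
  GROUP BY day, bi_engine_mode
);
```

The trade-off is accuracy for cost: the window-function version is exact but materializes a percentile value on every job row before collapsing them, while this form aggregates once per group.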

scripts/optimization/hourly_slot_consumption_by_labels.sql

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,19 @@
1+
/*
2+
* Copyright 2026 Google LLC
3+
*
4+
* Licensed under the Apache License, Version 2.0 (the "License");
5+
* you may not use this file except in compliance with the License.
6+
* You may obtain a copy of the License at
7+
*
8+
* http://www.apache.org/licenses/LICENSE-2.0
9+
*
10+
* Unless required by applicable law or agreed to in writing, software
11+
* distributed under the License is distributed on an "AS IS" BASIS,
12+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
* See the License for the specific language governing permissions and
14+
* limitations under the License.
15+
*/
16+
117
DECLARE num_days_to_scan INT64 DEFAULT 30;
218
DECLARE my_reservation_id STRING DEFAULT "your_reservation_id";
319

scripts/optimization/hourly_slot_consumption_by_query_hash.sql

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,19 @@
1+
/*
2+
* Copyright 2026 Google LLC
3+
*
4+
* Licensed under the Apache License, Version 2.0 (the "License");
5+
* you may not use this file except in compliance with the License.
6+
* You may obtain a copy of the License at
7+
*
8+
* http://www.apache.org/licenses/LICENSE-2.0
9+
*
10+
* Unless required by applicable law or agreed to in writing, software
11+
* distributed under the License is distributed on an "AS IS" BASIS,
12+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
* See the License for the specific language governing permissions and
14+
* limitations under the License.
15+
*/
16+
117
DECLARE num_days_to_scan INT64 DEFAULT 30;
218
DECLARE my_reservation_id STRING DEFAULT "your_reservation_id";
319

scripts/optimization/queries_grouped_by_hash_org.sql

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,8 @@
1515
*/
1616

1717
/*
18-
* This script creates a table named, top_bytes_scanning_queries_by_hash,
19-
* which contains the top 200 most expensive queries by total bytes scanned
18+
* This script creates a table named, queries_grouped_by_hash_org,
19+
* which contains the top 200 most expensive queries by total slot hours
2020
* within the past 30 days.
2121
* 30 days is the default timeframe, but you can change this by setting the
2222
* num_days_to_scan variable to a different value.
@@ -74,6 +74,12 @@ SELECT
7474
|| '.' || ref_table.table_id
7575
FROM UNNEST(referenced_tables) ref_table
7676
)) AS referenced_tables,
77+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(50)] AS median_time_ms,
78+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(75)] AS p75_time_ms,
79+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(80)] AS p80_time_ms,
80+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(90)] AS p90_time_ms,
81+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(95)] AS p95_time_ms,
82+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(99)] AS p99_time_ms
7783
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION
7884
WHERE
7985
DATE(creation_time) >= CURRENT_DATE - num_days_to_scan

scripts/optimization/queries_grouped_by_hash_project.sql

Lines changed: 11 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,8 @@
1515
*/
1616

1717
/*
18-
* This script creates a table named, top_bytes_scanning_queries_by_hash,
19-
* which contains the top 200 most expensive queries by total bytes scanned
18+
* This script creates a table named queries_grouped_by_hash_project,
19+
* which contains the top 200 most expensive queries by total slot hours
2020
* within the past 30 days.
2121
* 30 days is the default timeframe, but you can change this by setting the
2222
* num_days_to_scan variable to a different value.
@@ -75,11 +75,18 @@ SELECT
7575
|| '.' || ref_table.table_id
7676
FROM UNNEST(referenced_tables) ref_table
7777
)) AS referenced_tables,
78-
FROM `region-us`.INFORMATION_SCHEMA.JOBS
78+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(50)] AS median_time_ms,
79+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(75)] AS p75_time_ms,
80+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(80)] AS p80_time_ms,
81+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(90)] AS p90_time_ms,
82+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(95)] AS p95_time_ms,
83+
APPROX_QUANTILES(TIMESTAMP_DIFF(end_time, creation_time, MILLISECOND), 100)[OFFSET(99)] AS p99_time_ms
84+
FROM `region-us`.INFORMATION_SCHEMA.JOBS
7985
WHERE
8086
DATE(creation_time) >= CURRENT_DATE - num_days_to_scan
8187
AND state = 'DONE'
8288
AND error_result IS NULL
8389
AND job_type = 'QUERY'
8490
AND statement_type != 'SCRIPT'
85-
GROUP BY statement_type, query_hash;
91+
GROUP BY statement_type, query_hash
92+

scripts/optimization/queries_grouped_by_labels.sql

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,20 @@
1+
/*
2+
* Copyright 2026 Google LLC
3+
*
4+
* Licensed under the Apache License, Version 2.0 (the "License");
5+
* you may not use this file except in compliance with the License.
6+
* You may obtain a copy of the License at
7+
*
8+
* http://www.apache.org/licenses/LICENSE-2.0
9+
*
10+
* Unless required by applicable law or agreed to in writing, software
11+
* distributed under the License is distributed on an "AS IS" BASIS,
12+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13+
* See the License for the specific language governing permissions and
14+
* limitations under the License.
15+
*/
16+
17+
118
DECLARE num_days_to_scan INT64 DEFAULT 30;
219

320
CREATE TEMP FUNCTION num_stages_with_perf_insights(query_info ANY TYPE) AS (
