56 changes: 34 additions & 22 deletions skills/developing-with-bigquery/SKILL.md
@@ -1,14 +1,15 @@
 ---
 name: developing-with-bigquery
 description: |
-A repository of BigQuery-specific logic, knowledge, and specialized standards.
+Provides BigQuery-specific logic, knowledge, and specialized standards.
 Use this skill whenever you are doing anything with BigQuery, including:
-1. BigQuery query optimization
+1. BigQuery query optimization (SQL)
 2. BigFrames Python code
-3. BigQuery ML/AI functions.
+3. BigQuery ML/AI functions (SQL & Python)
+4. Graph Analytics (GQL & Property Graphs)
 license: Apache-2.0
 metadata:
-version: v1
+version: v2
 publisher: google
 ---

@@ -31,32 +32,43 @@ features:
 ### 1. Query Optimization

 Performance and efficiency guidelines for BigQuery SQL. Includes rules for
-column pruning, pushdown, and materialization strategies. - **Guide**:
-[OPTIMIZATION.md](references/OPTIMIZATION.md)
+column pruning, pushdown, and materialization strategies.
+- **Guide**: [OPTIMIZATION.md](references/sql/OPTIMIZATION.md)

 ### 2. BigFrames (BigQuery DataFrames)

 Guidelines for generating valid BigFrames code for data manipulation, model
-development, and visualization. - **Guide**:
-[BIGFRAMES.md](references/BIGFRAMES.md)
+development, and visualization.
+- **Guide**: [BIGFRAMES.md](references/bigframes/BIGFRAMES.md)

 Bigframes should be the default library/tool as it is more efficient than using
 the BigQuery Python client library.

 ### 3. BigQuery ML & AI Functions (BQML SQL)

 Usage rules and syntax standards for all BigQuery AI/ML functions via SQL
-(Forecasting, Generative AI, Classification, etc.). - **Guide**:
-[BQML.md](references/BQML.md) - **Functions Reference**: -
-[AI.FORECAST](references/ai-forecast.md) -
-[AI.EVALUATE](references/ai-evaluate.md) -
-[AI.GENERATE_TABLE](references/ai-generate-table.md) -
-[AI.GENERATE_EMBEDDING](references/ai-generate-embedding.md) -
-[Remote Models](references/remote-models.md)
-[CONTRIBUTION_ANALYSIS](references/ml-contribution-analysis.md)
-[VECTOR_SEARCH](references/vector-search.md)
-
-### 4. Notebook SQL cells
-
-Refer to `@skill:notebook-guidance` for standards on running BigQuery in
-notebooks.
+(Forecasting, Generative AI, Classification, etc.).
+- **Best Practices**: [ai_function_best_practices.md](references/ai-ml/ai_function_best_practices.md)
+- **Functions Reference**:
+
+- **AI.CLASSIFY**: [ai_classify.md](references/ai-ml/ai_classify.md) - Classify text.
+- **AI.DETECT_ANOMALIES**: [ai_detect_anomalies.md](references/ai-ml/ai_detect_anomalies.md) - Detect anomalies.
+- **AI.EVALUATE**: [ai_evaluate.md](references/ai-ml/ai_evaluate.md) - Evaluate models.
+- **AI.FORECAST**: [ai_forecast.md](references/ai-ml/ai_forecast.md) - Time-series forecasting.
+- **AI.GENERATE**: [ai_generate.md](references/ai-ml/ai_generate.md) - Generate text using LLMs.
+- **AI.GENERATE_EMBEDDING**: [ai_generate_embedding.md](references/ai-ml/ai_generate_embedding.md) - Generate embeddings.
+- **AI.GENERATE_TABLE**: [ai_generate_table.md](references/ai-ml/ai_generate_table.md) - Table-valued AI generation.
+- **AI.IF**: [ai_if.md](references/ai-ml/ai_if.md) - Evaluate semantic conditions.
+- **AI.KEY_DRIVERS**: [ai_key_drivers.md](references/ai-ml/ai_key_drivers.md) - Identify key drivers.
+- **AI.SCORE**: [ai_score.md](references/ai-ml/ai_score.md) - Score data.
+- **AI.SEARCH**: [ai_search.md](references/ai-ml/ai_search.md) - Semantic search.
+- **AI.SIMILARITY**: [ai_similarity.md](references/ai-ml/ai_similarity.md) - Semantic similarity.
+- **Remote Models**: [remote_models.md](references/ai-ml/remote_models.md) - Working with remote models (Vertex AI).
+- **CONTRIBUTION_ANALYSIS**: [ml_contribution_analysis.md](references/ai-ml/ml_contribution_analysis.md) - Step-by-step contribution analysis.
+- **VECTOR_SEARCH**: [vector_search.md](references/ai-ml/vector_search.md) - Vector search best practices.
+
+### 4. Graph Analytics (Property Graphs & GQL)
+
+Guidelines and best practices for querying property graphs in BigQuery.
+- **Property Graph Guidelines**: [graph_queries.md](references/graph/graph_queries.md) - Standard GQL syntax and query patterns.
+- **Semantic Graph Guidelines**: [semantic_queries.md](references/graph/semantic_queries.md) - Semantic graph operations and expand functions.
62 changes: 0 additions & 62 deletions skills/developing-with-bigquery/references/ai-forecast.md

This file was deleted.

92 changes: 92 additions & 0 deletions skills/developing-with-bigquery/references/ai-ml/ai_classify.md
@@ -0,0 +1,92 @@
# BigQuery AI.CLASSIFY

`AI.CLASSIFY` categorizes unstructured data into a predefined set of labels.

## Syntax Reference

```sql
AI.CLASSIFY(
  [ input => ] 'INPUT',
  [ categories => ] 'CATEGORIES'
  [, connection_id => 'CONNECTION_ID' ]
  [, endpoint => 'ENDPOINT' ]
  [, output_mode => 'OUTPUT_MODE' ]
)
```

### Input Arguments

| Argument | Requirement | Type | Description |
| :------------------ | :----------- | :-------------- | :---------- |
| **`input`** | **Required** | String | The text content to classify. |
| **`categories`** | **Required** | `Array<String>` | A list of target categories/labels. Can be `ARRAY<STRING>` or `ARRAY<STRUCT<STRING, STRING>>` (label, description). |
| **`connection_id`** | Optional | String | The connection ID to use for the LLM. |
| **`endpoint`** | Optional | String | The model name, e.g., `'gemini-2.5-flash'`. |
| **`output_mode`** | Optional | String | `'single'` or `'multi'`. Determines the output type; if omitted, a single `STRING` label is returned (see Output Schema). |

### Output Schema

The output type depends on the `output_mode` argument:

| Output Mode | `output_mode` Value | Type | Description |
| :-------------------------- | :------------------ | :-------------- | :---------- |
| **Single Label** | `NULL` (Default) | `STRING` | The single category that best fits the input. |
| **Single Label (Explicit)** | `'single'` | `ARRAY<STRING>` | An array containing exactly one category string. |
| **Multi Label** | `'multi'` | `ARRAY<STRING>` | An array containing zero or more matching categories. |

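Because the result is a `STRING` in one mode and an `ARRAY<STRING>` in the others, client code that reads the column may want to normalize it first. The following is a minimal Python sketch, not part of the skill itself. The helper name `normalize_labels` is hypothetical, and it assumes the client library surfaces `STRING` as `str`, `ARRAY<STRING>` as a list, and SQL `NULL` as `None`.

```python
from typing import List, Optional, Sequence, Union


def normalize_labels(value: Optional[Union[str, Sequence[str]]]) -> List[str]:
    """Return AI.CLASSIFY output as a list of labels, whatever the output_mode.

    Assumption: a NULL result (None) is treated as "no label".
    """
    if value is None:
        return []
    if isinstance(value, str):
        # Default mode: a single STRING label.
        return [value]
    # 'single' or 'multi' mode: already an array of labels.
    return list(value)


print(normalize_labels("Spam"))
print(normalize_labels(["tech", "business"]))
```

Treating every result as a list lets downstream code handle all three modes with one loop.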
## Examples

### Classify text into categories

```sql
SELECT
  content,
  AI.CLASSIFY(
    content,
    categories => ['Spam', 'Not Spam', 'Urgent'],
    connection_id => 'my-project.us.my-connection'
  ) AS classification
FROM `dataset.emails`;
```

### Classify text into multiple topics

```sql
SELECT
  title,
  body,
  AI.CLASSIFY(
    body,
    categories => ['tech', 'sport', 'business', 'politics', 'entertainment', 'other'],
    output_mode => 'multi'
  ) AS categories
FROM
  `bigquery-public-data.bbc_news.fulltext`
LIMIT 100;
```

### Classify reviews by sentiment

```sql
SELECT
  AI.CLASSIFY(
    ('Classify the review by sentiment: ', review),
    categories => [
      ('green', 'The review is positive.'),
      ('yellow', 'The review is neutral.'),
      ('red', 'The review is negative.')
    ]
  ) AS ai_review_rating,
  reviewer_rating AS human_provided_rating,
  review
FROM `bigquery-public-data.imdb.reviews`
WHERE title = 'The English Patient'
```
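
When the category list is assembled in application code, the `categories` array literal can be rendered programmatically. This is a hedged Python sketch, not part of the skill. The helpers `sql_string` and `categories_literal` are hypothetical names, and the sketch assumes backslash escaping for single-quoted string literals. For untrusted input, query parameters are usually the safer choice.

```python
from typing import List, Sequence, Tuple, Union

Category = Union[str, Tuple[str, str]]


def sql_string(text: str) -> str:
    """Quote text as a single-quoted SQL literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def categories_literal(categories: Sequence[Category]) -> str:
    """Render plain labels or (label, description) pairs as an array literal."""
    if not categories:
        raise ValueError("at least one category is required")
    plain = [isinstance(c, str) for c in categories]
    if any(plain) and not all(plain):
        # Array elements must share one type, so the two shapes cannot be mixed.
        raise ValueError("use plain labels or (label, description) pairs, not both")
    parts: List[str] = []
    for category in categories:
        if isinstance(category, str):
            parts.append(sql_string(category))
        else:
            label, description = category
            parts.append("(" + sql_string(label) + ", " + sql_string(description) + ")")
    return "[" + ", ".join(parts) + "]"


print(categories_literal(["Spam", "Not Spam", "Urgent"]))
print(categories_literal([("green", "The review is positive.")]))
```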
@@ -0,0 +1,110 @@
# BigQuery AI.DETECT_ANOMALIES

`AI.DETECT_ANOMALIES` uses the pre-trained **TimesFM** model to identify
deviations in time series data without needing to train a custom model.

## Syntax Reference

This function compares a target dataset against a historical dataset to identify
anomalies.

```sql
SELECT *
FROM AI.DETECT_ANOMALIES(
  { TABLE `project.dataset.history_table` | (SELECT * FROM history_query) },
  { TABLE `project.dataset.target_table` | (SELECT * FROM target_query) },
  data_col => 'DATA_COL',
  timestamp_col => 'TIMESTAMP_COL'
  [, model => 'MODEL']
  [, id_cols => ID_COLS]
  [, anomaly_prob_threshold => ANOMALY_PROB_THRESHOLD]
)
```

### Input Arguments

Argument | Requirement | Type | Description
:--------------------------- | :----------- | :------------ | :----------
**`historical_data`** | **Required** | Table/Query | The source table or subquery containing historical data for training context.
**`target_data`** | **Required** | Table/Query | The source table or subquery containing data to analyze for anomalies.
**`data_col`** | **Required** | String | The numeric column to analyze.
**`timestamp_col`** | **Required** | String | The column containing dates/timestamps.
**`id_cols`**                | Optional     | `Array<String>` | Grouping columns for multiple series (e.g., `['store_id']`).
**`anomaly_prob_threshold`** | Optional | Float64 | Threshold for anomaly detection (0 to 1). Defaults to 0.95.
**`model`** | Optional | String | Model version. Defaults to `'TimesFM 2.0'`.

### Output Schema

| Column | Type | Description |
| :------------------------------- | :--------- | :---------- |
| **`id_cols`** | (As Input) | Original identifiers for the series. |
| **`time_series_timestamp`** | TIMESTAMP | Timestamp for the analyzed points. |
| **`time_series_data`** | FLOAT64 | The original data value. |
| **`is_anomaly`** | BOOL | TRUE if the point is identified as an anomaly. |
| **`lower_bound`** | FLOAT64 | Lower bound of the expected range. |
| **`upper_bound`** | FLOAT64 | Upper bound of the expected range. |
| **`anomaly_probability`** | FLOAT64 | Probability that the point is an anomaly. |
| **`ai_detect_anomalies_status`** | STRING | Error messages or empty string on success. A minimum of 3 data points is required. |

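The status column means a result set can mix successful rows with error rows, so downstream code may want to separate them before acting on `is_anomaly`. The following Python sketch illustrates one way to do that. It is not part of the skill: the helper name `split_results` is hypothetical, rows are assumed to arrive as dictionaries keyed by the column names above, and the sample values are made up for illustration.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_results(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Split output rows into (anomalies, errors) using the documented columns."""
    anomalies: List[Row] = []
    errors: List[Row] = []
    for row in rows:
        # An empty (or missing) status string indicates success.
        status = row.get("ai_detect_anomalies_status") or ""
        if status:
            errors.append(row)
        elif row.get("is_anomaly"):
            anomalies.append(row)
    return anomalies, errors


# Illustrative rows only; these are not real query results.
sample = [
    {"time_series_data": 10.0, "is_anomaly": False, "ai_detect_anomalies_status": ""},
    {"time_series_data": 95.0, "is_anomaly": True, "ai_detect_anomalies_status": ""},
    {"time_series_data": None, "is_anomaly": None, "ai_detect_anomalies_status": "error text"},
]
anomalies, errors = split_results(sample)
print(len(anomalies), len(errors))
```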
## Examples

### Basic Anomaly Detection

Detect anomalies in daily bike trips for a specific 2-month window based on
prior history.

```sql
WITH bike_trips AS (
  SELECT EXTRACT(DATE FROM starttime) AS date, COUNT(*) AS num_trips
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY date
)
SELECT *
FROM AI.DETECT_ANOMALIES(
  -- Historical context (Training data equivalent)
  (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
  -- Target range (Data to inspect for anomalies)
  (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
  data_col => 'num_trips',
  timestamp_col => 'date'
);
```

### Multiple Time Series Detection

Use `id_cols` to detect anomalies separately for each series in the same query.
The example below treats every combination of `usertype` (e.g., Subscriber vs.
Customer) and `gender` as its own series.

```sql
WITH bike_trips AS (
  SELECT
    EXTRACT(DATE FROM starttime) AS date, usertype, gender,
    COUNT(*) AS num_trips
  FROM `bigquery-public-data.new_york.citibike_trips`
  GROUP BY date, usertype, gender
)
SELECT *
FROM
  AI.DETECT_ANOMALIES(
    -- Historical data from a query
    (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')),
    -- Target data from a query
    (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'),
    data_col => 'num_trips',
    timestamp_col => 'date',
    id_cols => ['usertype', 'gender'],
    model => 'TimesFM 2.5',
    anomaly_prob_threshold => 0.8);
```
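
The optional arguments in these calls can also be assembled from application code. The Python sketch below is a rough illustration, not part of the skill. The function name `build_detect_anomalies_sql` is hypothetical, column and model names are assumed to be trusted values, and the threshold range is treated as inclusive (an assumption based on the "0 to 1" description in the arguments table).

```python
from typing import List, Optional, Sequence


def build_detect_anomalies_sql(
    history_query: str,
    target_query: str,
    data_col: str,
    timestamp_col: str,
    id_cols: Optional[Sequence[str]] = None,
    model: Optional[str] = None,
    anomaly_prob_threshold: Optional[float] = None,
) -> str:
    """Render an AI.DETECT_ANOMALIES query, adding only the options that are set."""
    if anomaly_prob_threshold is not None and not 0 <= anomaly_prob_threshold <= 1:
        raise ValueError("anomaly_prob_threshold must be between 0 and 1")
    args: List[str] = [
        "(" + history_query + ")",
        "(" + target_query + ")",
        "data_col => '" + data_col + "'",
        "timestamp_col => '" + timestamp_col + "'",
    ]
    if model is not None:
        args.append("model => '" + model + "'")
    if id_cols:
        quoted = ", ".join("'" + col + "'" for col in id_cols)
        args.append("id_cols => [" + quoted + "]")
    if anomaly_prob_threshold is not None:
        args.append("anomaly_prob_threshold => " + str(anomaly_prob_threshold))
    return "SELECT *\nFROM AI.DETECT_ANOMALIES(\n  " + ",\n  ".join(args) + "\n)"


print(build_detect_anomalies_sql(
    "SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')",
    "SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'",
    data_col="num_trips",
    timestamp_col="date",
    id_cols=["usertype", "gender"],
    anomaly_prob_threshold=0.8,
))
```

Validating the threshold before the query is sent surfaces a bad value locally rather than as a query error.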