12 changes: 6 additions & 6 deletions skills/developing-with-bigquery/SKILL.md
@@ -41,16 +41,16 @@ Guidelines for generating valid BigFrames code for data manipulation, model
development, and visualization.
- **Guide**: [BIGFRAMES.md](references/bigframes/BIGFRAMES.md)

Bigframes should be the default library/tool as it is more efficient than using
the BigQuery Python client library.
BigFrames should be the default library/tool for data manipulation and analysis in Python. However, if the user explicitly requests "BigQuery ML", "BQML", "BigQuery SQL", or "SQL", you MUST use native BQML SQL (via `%%bqsql` magics) instead of BigFrames.

### 3. BigQuery ML & AI Functions (BQML SQL)

**CRITICAL** Best Practices: You MUST read and follow the global constraints and mandatory function routing rules in
[ai_function_best_practices.md](references/ai-ml/ai_function_best_practices.md) before writing any BQML AI/ML SQL query.

Usage rules and syntax standards for all BigQuery AI/ML functions via SQL
(Forecasting, Generative AI, Classification, etc.).
- **Best Practices**: [ai_function_best_practices.md](references/ai-ml/ai_function_best_practices.md)
- **Functions Reference**:

- **AI.CLASSIFY**: [ai_classify.md](references/ai-ml/ai_classify.md) - Classify text.
- **AI.DETECT_ANOMALIES**: [ai_detect_anomalies.md](references/ai-ml/ai_detect_anomalies.md) - Detect anomalies.
- **AI.EVALUATE**: [ai_evaluate.md](references/ai-ml/ai_evaluate.md) - Evaluate models.
@@ -59,12 +59,12 @@ Usage rules and syntax standards for all BigQuery AI/ML functions via SQL
- **AI.GENERATE_EMBEDDING**: [ai_generate_embedding.md](references/ai-ml/ai_generate_embedding.md) - Generate embeddings.
- **AI.GENERATE_TABLE**: [ai_generate_table.md](references/ai-ml/ai_generate_table.md) - Table-valued AI generation.
- **AI.IF**: [ai_if.md](references/ai-ml/ai_if.md) - Evaluate semantic conditions.
- **AI.KEY_DRIVERS**: [ai_key_drivers.md](references/ai-ml/ai_key_drivers.md) - Identify key drivers.
- **AI.SCORE**: [ai_score.md](references/ai-ml/ai_score.md) - Score data.
- **AI.SEARCH**: [ai_search.md](references/ai-ml/ai_search.md) - Semantic search.
- **AI.SIMILARITY**: [ai_similarity.md](references/ai-ml/ai_similarity.md) - Semantic similarity.
- **Remote Models**: [remote_models.md](references/ai-ml/remote_models.md) - Working with remote models (Vertex AI).
- **CONTRIBUTION_ANALYSIS**: [ml_contribution_analysis.md](references/ai-ml/ml_contribution_analysis.md) - Step-by-step contribution analysis.
- **CONTRIBUTION_ANALYSIS**: [ml_contribution_analysis.md](references/ai-ml/ml_contribution_analysis.md) - Finds contributing factors, key drivers of change. Requires creating a MODEL entity.
- **AI.KEY_DRIVERS**: [ai_key_drivers.md](references/ai-ml/ai_key_drivers.md) - Identifies key drivers. This is a table-valued function (TVF).
- **VECTOR_SEARCH**: [vector_search.md](references/ai-ml/vector_search.md) - Vector search best practices.

### 4. Graph Analytics (Property Graphs & GQL)
@@ -4,30 +4,58 @@ Rules and syntax standards for BigQuery AI and Machine Learning functions.

## 1. Global Constraints

* **Connection ID**: Use `'DEFAULT'` for the `connection` argument in remote `CREATE MODEL` statements.
* **Dataset Creation**: Use `CREATE SCHEMA IF NOT EXISTS <project>.<dataset>;`.

## 2. Mandatory Function Routing

Function/Use Case | Required Reference File
------------------------- | ----------------------------------------------
**AI.FORECAST** | [ai_forecast.md](ai_forecast.md)
**AI.EVALUATE** | [ai_evaluate.md](ai_evaluate.md)
**AI.GENERATE_TABLE** | [ai_generate_table.md](ai_generate_table.md)
**AI.GENERATE_EMBEDDING** | [ai_generate_embedding.md](ai_generate_embedding.md)
**Remote Models** | [remote_models.md](remote_models.md)
**CONTRIBUTION_ANALYSIS** | [ml_contribution_analysis.md](ml_contribution_analysis.md)
**VECTOR_SEARCH** | [vector_search.md](vector_search.md)
* **Connection ID**: Use `'DEFAULT'` for the `connection` argument in remote
`CREATE MODEL` statements.
* **Dataset Creation**: Use `CREATE SCHEMA IF NOT EXISTS
<project>.<dataset>;`.
* **SQL Only**: You MUST use native BigQuery SQL (via `%%bqsql` magics) for
all BQML operations (model training, evaluation, prediction). Do NOT use
BigFrames (`bigframes.ml`) or the BigQuery Python client.

## 3. Mandatory Syntax Checks

* **Table-Valued Functions (TVFs)**: `AI.GENERATE_TABLE`, `AI.FORECAST`, `AI.EVALUATE`, and `AI.GENERATE_EMBEDDING` MUST be placed in the `FROM` clause.
* **Named Arguments**: `AI.FORECAST` and `AI.EVALUATE` require the `=>` operator for optional arguments.
* **The "Prompt" Alias**: For `AI.GENERATE_TABLE`, the input subquery must contain a column aliased as `prompt`.
* **Schema Quotes**: Ensure the `output_schema` string is enclosed in quotes.
* **Table-Valued Functions (TVFs)**: Table-Valued Functions (including,
but not limited to, `AI.GENERATE_TABLE`, `AI.FORECAST`, `AI.EVALUATE`,
and `AI.GENERATE_EMBEDDING`) MUST be placed in the `FROM` clause.
* **Named Arguments**: `AI.FORECAST` and `AI.EVALUATE` require the `=>`
operator for optional arguments.
* **The "Prompt" Alias**: For `AI.GENERATE_TABLE`, the input subquery must
contain a column aliased as `prompt`.
* **Schema Quotes**: Ensure the `output_schema` string is enclosed in quotes.
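
As a rough illustration of the syntax checks above, the following sketch places `AI.GENERATE_TABLE` in the `FROM` clause, aliases the input column as `prompt`, and passes `output_schema` as a quoted string. The model, table, and column names (`my_project.my_dataset.gemini_model`, `reviews`, `review_text`) are hypothetical placeholders, and the exact argument shape should be confirmed against `ai_generate_table.md`.

```sql
-- Hypothetical names: replace the model and table with real ones.
SELECT
  *
FROM
  AI.GENERATE_TABLE(              -- TVF: appears in the FROM clause
    MODEL `my_project.my_dataset.gemini_model`,
    (
      SELECT
        -- The input subquery must expose a column aliased as `prompt`.
        CONCAT(
          'Extract the product name and star rating from this review: ',
          review_text
        ) AS prompt
      FROM
        `my_project.my_dataset.reviews`
      LIMIT 10
    ),
    -- The output schema is a quoted string.
    STRUCT('product STRING, rating INT64' AS output_schema)
  );
```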

## 4. Model Selection

* **Time-series**: `AI.FORECAST` uses **TimesFM** endpoints.
* **Generative**: `AI.GENERATE_TABLE` uses **Gemini** endpoints.
* **Freshness**: Prefer current models (e.g., `gemini-2.5-flash`) over deprecated ones.
* **Time-series**: `AI.FORECAST` uses **TimesFM** endpoints.
* **Generative**: `AI.GENERATE_TABLE` uses **Gemini** endpoints.
* **Freshness**: Prefer current models (e.g., `gemini-2.5-flash`) over
deprecated ones.

## 5. Data Exploration

* **Mandatory Exploration**: Before training any model or running AI
functions, you MUST perform data exploration using:
1. `ML.DESCRIBE_DATA` to understand the statistics of the dataset.
2. A simple `SELECT` query with a `LIMIT` operator (e.g., `LIMIT 5` or
`LIMIT 10`) to sample the first few rows.
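
The two exploration steps above might look like the following sketch. The table name `my_project.my_dataset.daily_sales` is an assumption for illustration, and the `top_k` option is only one of the settings `ML.DESCRIBE_DATA` accepts.

```sql
-- Step 1: summary statistics for every column (hypothetical table name).
SELECT
  *
FROM
  ML.DESCRIBE_DATA(
    TABLE `my_project.my_dataset.daily_sales`,
    STRUCT(3 AS top_k)  -- report the 3 most frequent values per column
  );

-- Step 2: sample a few rows to inspect the raw values.
SELECT
  *
FROM
  `my_project.my_dataset.daily_sales`
LIMIT 5;
```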

## 6. Model Training and Hyperparameters

* **Default Parameters**: Always rely on BQML's default parameters and
hyperparameters unless the prompt explicitly requests specific tuning. Do
not unnecessarily specify hyperparameters. If one is necessary, justify the
reasoning.
* **Data Splitting**: Most BQML models handle data splitting automatically
(default is `AUTO_SPLIT`). Do not perform manual training/validation/testing
splits (either via SQL subqueries or Python) unless explicitly instructed.
* **TimesFM Exception**: If performing time-series forecasting with
TimesFM (`AI.FORECAST`), you MUST split your dataset chronologically
into exactly two parts:
* **Historical Data (History)**: Used as `history_data` in `AI.EVALUATE` and
`input_data` in `AI.FORECAST`.
* **Evaluation Data (Actuals)**: Used as `actual_data` in `AI.EVALUATE` to
compare against the forecast.
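
A minimal sketch of the chronological split described above, assuming a hypothetical table `my_project.my_dataset.daily_sales` with a `sale_date` DATE column and a numeric `sales` column, and an arbitrary cutoff of 2024-01-01. Passing the history and actuals as the first and second inputs of `AI.EVALUATE` reflects my understanding of its signature; check `ai_evaluate.md` and `ai_forecast.md` before relying on it.

```sql
-- Evaluate TimesFM: rows before the cutoff are history, rows on or after it are actuals.
SELECT
  *
FROM
  AI.EVALUATE(
    (
      SELECT sale_date, sales
      FROM `my_project.my_dataset.daily_sales`
      WHERE sale_date < DATE '2024-01-01'   -- historical data
    ),
    (
      SELECT sale_date, sales
      FROM `my_project.my_dataset.daily_sales`
      WHERE sale_date >= DATE '2024-01-01'  -- evaluation data (actuals)
    ),
    data_col => 'sales',
    timestamp_col => 'sale_date'
  );

-- Forecast from the same historical slice.
SELECT
  *
FROM
  AI.FORECAST(
    (
      SELECT sale_date, sales
      FROM `my_project.my_dataset.daily_sales`
      WHERE sale_date < DATE '2024-01-01'
    ),
    data_col => 'sales',
    timestamp_col => 'sale_date',
    horizon => 30  -- number of future points; hypothetical value
  );
```

Using the same cutoff in both queries keeps the evaluation honest: the forecast never sees rows that are later used as actuals.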

## 7. Model Evaluation

* **Use BQML Functions**: Always use native BQML evaluation functions (e.g.,
`ML.EVALUATE`, `ML.ARIMA_EVALUATE`, `AI.EVALUATE`) to compute metrics.
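
For a trained (non-TimesFM) model, a native evaluation call could be as short as the sketch below. The model name is a hypothetical placeholder. As far as I understand, when no input table is supplied, `ML.EVALUATE` computes metrics on the evaluation data reserved during training.

```sql
-- Hypothetical model name; metrics returned depend on the model type.
SELECT
  *
FROM
  ML.EVALUATE(MODEL `my_project.my_dataset.my_model`);
```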