From 9da8990f16bb988942f66cbdc4374e6f0f067f18 Mon Sep 17 00:00:00 2001
From: Data Cloud Agents Team
Date: Fri, 12 Jun 2026 16:09:39 -0700
Subject: [PATCH] Update BQ ai_function_best_practices and constraints.

PiperOrigin-RevId: 931383081
---
 skills/developing-with-bigquery/SKILL.md      | 12 ++--
 .../ai-ml/ai_function_best_practices.md       | 70 +++++++++++++------
 2 files changed, 55 insertions(+), 27 deletions(-)

diff --git a/skills/developing-with-bigquery/SKILL.md b/skills/developing-with-bigquery/SKILL.md
index 1ae9a6c..33d18c3 100755
--- a/skills/developing-with-bigquery/SKILL.md
+++ b/skills/developing-with-bigquery/SKILL.md
@@ -41,16 +41,16 @@ Guidelines for generating valid BigFrames code for data manipulation, model
 development, and visualization.
 
 - **Guide**: [BIGFRAMES.md](references/bigframes/BIGFRAMES.md)
-Bigframes should be the default library/tool as it is more efficient than using
-the BigQuery Python client library.
+Bigframes should be the default library/tool for data manipulation and analysis in Python. However, if the user explicitly requests "BigQuery ML", "BQML", "BigQuery SQL", or "SQL", you MUST use native BQML SQL (via %%bqsql, magics) instead of BigFrames.
 
 ### 3. BigQuery ML & AI Functions (BQML SQL)
 
+**CRITICAL** Best Practices: You MUST read and follow the global constraints and mandatory function routing rules in
+[ai_function_best_practices.md](references/ai-ml/ai_function_best_practices.md) before writing any BQML AI/ML SQL query.
+
 Usage rules and syntax standards for all BigQuery AI/ML functions via SQL
 (Forecasting, Generative AI, Classification, etc.).
-- **Best Practices**: [ai_function_best_practices.md](references/ai-ml/ai_function_best_practices.md)
 - **Functions Reference**:
-
     - **AI.CLASSIFY**: [ai_classify.md](references/ai-ml/ai_classify.md) - Classify text.
     - **AI.DETECT_ANOMALIES**: [ai_detect_anomalies.md](references/ai-ml/ai_detect_anomalies.md) - Detect anomalies.
     - **AI.EVALUATE**: [ai_evaluate.md](references/ai-ml/ai_evaluate.md) - Evaluate models.
@@ -59,12 +59,12 @@ Usage rules and syntax standards for all BigQuery AI/ML functions via SQL
     - **AI.GENERATE_EMBEDDING**: [ai_generate_embedding.md](references/ai-ml/ai_generate_embedding.md) - Generate embeddings.
     - **AI.GENERATE_TABLE**: [ai_generate_table.md](references/ai-ml/ai_generate_table.md) - Table-valued AI generation.
     - **AI.IF**: [ai_if.md](references/ai-ml/ai_if.md) - Evaluate semantic conditions.
-    - **AI.KEY_DRIVERS**: [ai_key_drivers.md](references/ai-ml/ai_key_drivers.md) - Identify key drivers.
     - **AI.SCORE**: [ai_score.md](references/ai-ml/ai_score.md) - Score data.
     - **AI.SEARCH**: [ai_search.md](references/ai-ml/ai_search.md) - Semantic search.
     - **AI.SIMILARITY**: [ai_similarity.md](references/ai-ml/ai_similarity.md) - Semantic similarity.
     - **Remote Models**: [remote_models.md](references/ai-ml/remote_models.md) - Working with remote models (Vertex AI).
-    - **CONTRIBUTION_ANALYSIS**: [ml_contribution_analysis.md](references/ai-ml/ml_contribution_analysis.md) - Step-by-step contribution analysis.
+    - **CONTRIBUTION_ANALYSIS**: [ml_contribution_analysis.md](references/ai-ml/ml_contribution_analysis.md) - Finds contributing factors, key drivers of change. Requires creating a MODEL entity.
+    - **AI.KEY_DRIVERS**: [ai_key_drivers.md](references/ai-ml/ai_key_drivers.md) - Identifies key drivers, this is a TVF.
     - **VECTOR_SEARCH**: [vector_search.md](references/ai-ml/vector_search.md) - Vector search best practices.
 
 ### 4. Graph Analytics (Property Graphs & GQL)
diff --git a/skills/developing-with-bigquery/references/ai-ml/ai_function_best_practices.md b/skills/developing-with-bigquery/references/ai-ml/ai_function_best_practices.md
index 533b1a0..99a9f4a 100755
--- a/skills/developing-with-bigquery/references/ai-ml/ai_function_best_practices.md
+++ b/skills/developing-with-bigquery/references/ai-ml/ai_function_best_practices.md
@@ -4,30 +4,58 @@ Rules and syntax standards for BigQuery AI and Machine Learning functions.
 
 ## 1. Global Constraints
 
-* **Connection ID**: Use `'DEFAULT'` for the `connection` argument in remote `CREATE MODEL` statements.
-* **Dataset Creation**: Use `CREATE SCHEMA IF NOT EXISTS .;`.
-
-## 2. Mandatory Function Routing
-
-Function/Use Case         | Required Reference File
-------------------------- | ----------------------------------------------
-**AI.FORECAST**           | [ai_forecast.md](ai_forecast.md)
-**AI.EVALUATE**           | [ai_evaluate.md](ai_evaluate.md)
-**AI.GENERATE_TABLE**     | [ai_generate_table.md](ai_generate_table.md)
-**AI.GENERATE_EMBEDDING** | [ai_generate_embedding.md](ai_generate_embedding.md)
-**Remote Models**         | [remote_models.md](remote_models.md)
-**CONTRIBUTION_ANALYSIS** | [ml_contribution_analysis.md](ml_contribution_analysis.md)
-**VECTOR_SEARCH**         | [vector_search.md](vector_search.md)
+* **Connection ID**: Use `'DEFAULT'` for the `connection` argument in remote
+  `CREATE MODEL` statements.
+* **Dataset Creation**: Use `CREATE SCHEMA IF NOT EXISTS
+  .;`.
+* **SQL Only**: You MUST use native BigQuery SQL (via `%%bqsql` magics) for
+  all BQML operations (model training, evaluation, prediction). Do NOT use
+  BigFrames (`bigframes.ml`) or the BigQuery Python client.
 
 ## 3. Mandatory Syntax Checks
 
-* **Table-Valued Functions (TVFs)**: `AI.GENERATE_TABLE`, `AI.FORECAST`, `AI.EVALUATE`, and `AI.GENERATE_EMBEDDING` MUST be placed in the `FROM` clause.
-* **Named Arguments**: `AI.FORECAST` and `AI.EVALUATE` require the `=>` operator for optional arguments.
-* **The "Prompt" Alias**: For `AI.GENERATE_TABLE`, the input subquery must contain a column aliased as `prompt`.
-* **Schema Quotes**: Ensure the `output_schema` string is enclosed in quotes.
+* **Table-Valued Functions (TVFs)**: Table-Valued Functions (including,
+  but not limited to, `AI.GENERATE_TABLE`, `AI.FORECAST`, `AI.EVALUATE`,
+  and `AI.GENERATE_EMBEDDING`) MUST be placed in the `FROM` clause.
+* **Named Arguments**: `AI.FORECAST` and `AI.EVALUATE` require the `=>`
+  operator for optional arguments.
+* **The "Prompt" Alias**: For `AI.GENERATE_TABLE`, the input subquery must
+  contain a column aliased as `prompt`.
+* **Schema Quotes**: Ensure the `output_schema` string is enclosed in quotes.
 
 ## 4. Model Selection
 
-* **Time-series**: `AI.FORECAST` uses **TimesFM** endpoints.
-* **Generative**: `AI.GENERATE_TABLE` uses **Gemini** endpoints.
-* **Freshness**: Prefer current models (e.g., `gemini-2.5-flash`) over deprecated ones.
+* **Time-series**: `AI.FORECAST` uses **TimesFM** endpoints.
+* **Generative**: `AI.GENERATE_TABLE` uses **Gemini** endpoints.
+* **Freshness**: Prefer current models (e.g., `gemini-2.5-flash`) over
+  deprecated ones.
+
+## 5. Data Exploration
+
+* **Mandatory Exploration**: Before training any model or running AI
+  functions, you MUST perform data exploration using:
+    1. `ML.DESCRIBE_DATA` to understand the statistics of the dataset.
+    2. A simple `SELECT` query with a `LIMIT` operator (e.g., `LIMIT 5` or
+       `LIMIT 10`) to sample the first few rows.
+
+## 6. Model Training and Hyperparameters
+
+* **Default Parameters**: Always rely on BQML's default parameters and
+  hyperparameters unless the prompt explicitly requests specific tuning. Do
+  not unnecessarily specify hyperparameters. If one is necessary, justify the
+  reasoning.
+* **Data Splitting**: Most BQML models handle data splitting automatically
+  (default is `AUTO_SPLIT`). Do not perform manual training/validation/testing
+  splits (either via SQL subqueries or Python) unless explicitly instructed.
+    * **TimesFM Exception**: If performing time-series forecasting with
+      TimesFM (`AI.FORECAST`), you MUST split your dataset chronologically
+      into exactly two parts:
+        * **Historical Data (History)**: Used as history_data in `AI.EVALUATE` and
+          `input_data` in `AI.FORECAST`.
+        * **Evaluation Data (Actuals)**: Used as actual_data in `AI.EVALUATE` to
+          compare against the forecast.
+
+## 7. Model Evaluation
+
+* **Use BQML Functions**: Always use native BQML evaluation functions (e.g.,
+  `ML.EVALUATE`, `ML.ARIMA_EVALUATE`, `AI.EVALUATE`) to compute metrics.
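
The TimesFM exception and the syntax checks added by this patch can be illustrated with a minimal sketch. The table `my_project.my_dataset.daily_metrics`, its columns `ts` and `value`, the cutoff date, and the `horizon` value are all hypothetical, and the argument names (`data_col`, `timestamp_col`, `horizon`) reflect one reading of the BigQuery documentation that should be checked against `ai_forecast.md` and `ai_evaluate.md` before use. Both functions are table-valued, so they appear in the `FROM` clause, and arguments after the input tables are passed with `=>`.

```sql
-- Sketch only: table, column names, and cutoff date are hypothetical.
-- The dataset is split chronologically into exactly two parts:
-- history (before the cutoff) and actuals (on or after the cutoff).
SELECT *
FROM AI.EVALUATE(
  (SELECT ts, value
   FROM `my_project.my_dataset.daily_metrics`
   WHERE ts < TIMESTAMP '2025-01-01'),   -- history_data
  (SELECT ts, value
   FROM `my_project.my_dataset.daily_metrics`
   WHERE ts >= TIMESTAMP '2025-01-01'),  -- actual_data
  data_col => 'value',
  timestamp_col => 'ts'
);

-- Forecast from the same historical slice (the TVF sits in FROM).
SELECT *
FROM AI.FORECAST(
  (SELECT ts, value
   FROM `my_project.my_dataset.daily_metrics`
   WHERE ts < TIMESTAMP '2025-01-01'),   -- input_data
  data_col => 'value',
  timestamp_col => 'ts',
  horizon => 30                          -- assumed value, for illustration
);
```

Reusing the same historical subquery in both calls keeps the evaluation and the forecast on an identical history window, which is the point of the two-way chronological split.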