From 86cd0201237ed5ac30d8645c2ba50000fac84972 Mon Sep 17 00:00:00 2001 From: Data Cloud Agents Team Date: Thu, 18 Jun 2026 16:29:28 -0700 Subject: [PATCH] Project import generated by Copybara. PiperOrigin-RevId: 934604767 --- skills/developing-with-bigquery/SKILL.md | 56 ++- .../references/ai-forecast.md | 62 --- .../references/ai-ml/ai_classify.md | 92 ++++ .../references/ai-ml/ai_detect_anomalies.md | 110 +++++ .../{ai-evaluate.md => ai-ml/ai_evaluate.md} | 0 .../references/ai-ml/ai_forecast.md | 106 ++++ .../ai_function_best_practices.md} | 14 +- .../references/ai-ml/ai_generate.md | 104 ++++ .../ai_generate_embedding.md} | 0 .../ai_generate_table.md} | 0 .../references/ai-ml/ai_if.md | 55 +++ .../references/ai-ml/ai_key_drivers.md | 75 +++ .../references/ai-ml/ai_score.md | 52 ++ .../references/ai-ml/ai_search.md | 81 +++ .../references/ai-ml/ai_similarity.md | 81 +++ .../ml_contribution_analysis.md} | 0 .../remote_models.md} | 0 .../vector_search.md} | 0 .../references/{ => bigframes}/BIGFRAMES.md | 2 +- .../references/graph/graph_queries.md | 464 ++++++++++++++++++ .../references/graph/semantic_queries.md | 65 +++ .../references/{ => sql}/OPTIMIZATION.md | 0 22 files changed, 1327 insertions(+), 92 deletions(-) delete mode 100755 skills/developing-with-bigquery/references/ai-forecast.md create mode 100644 skills/developing-with-bigquery/references/ai-ml/ai_classify.md create mode 100644 skills/developing-with-bigquery/references/ai-ml/ai_detect_anomalies.md rename skills/developing-with-bigquery/references/{ai-evaluate.md => ai-ml/ai_evaluate.md} (100%) create mode 100644 skills/developing-with-bigquery/references/ai-ml/ai_forecast.md rename skills/developing-with-bigquery/references/{BQML.md => ai-ml/ai_function_best_practices.md} (71%) create mode 100644 skills/developing-with-bigquery/references/ai-ml/ai_generate.md rename skills/developing-with-bigquery/references/{ai-generate-embedding.md => ai-ml/ai_generate_embedding.md} (100%) rename 
skills/developing-with-bigquery/references/{ai-generate-table.md => ai-ml/ai_generate_table.md} (100%) create mode 100644 skills/developing-with-bigquery/references/ai-ml/ai_if.md create mode 100644 skills/developing-with-bigquery/references/ai-ml/ai_key_drivers.md create mode 100644 skills/developing-with-bigquery/references/ai-ml/ai_score.md create mode 100644 skills/developing-with-bigquery/references/ai-ml/ai_search.md create mode 100644 skills/developing-with-bigquery/references/ai-ml/ai_similarity.md rename skills/developing-with-bigquery/references/{ml-contribution-analysis.md => ai-ml/ml_contribution_analysis.md} (100%) rename skills/developing-with-bigquery/references/{remote-models.md => ai-ml/remote_models.md} (100%) rename skills/developing-with-bigquery/references/{vector-search.md => ai-ml/vector_search.md} (100%) rename skills/developing-with-bigquery/references/{ => bigframes}/BIGFRAMES.md (91%) create mode 100644 skills/developing-with-bigquery/references/graph/graph_queries.md create mode 100644 skills/developing-with-bigquery/references/graph/semantic_queries.md rename skills/developing-with-bigquery/references/{ => sql}/OPTIMIZATION.md (100%) diff --git a/skills/developing-with-bigquery/SKILL.md b/skills/developing-with-bigquery/SKILL.md index 155e1e2..1ae9a6c 100755 --- a/skills/developing-with-bigquery/SKILL.md +++ b/skills/developing-with-bigquery/SKILL.md @@ -1,14 +1,15 @@ --- name: developing-with-bigquery description: | - A repository of BigQuery-specific logic, knowledge, and specialized standards. + Provides BigQuery-specific logic, knowledge, and specialized standards. Use this skill whenever you are doing anything with BigQuery, including: - 1. BigQuery query optimization + 1. BigQuery query optimization (SQL) 2. BigFrames Python code - 3. BigQuery ML/AI functions. + 3. BigQuery ML/AI functions (SQL & Python) + 4. 
Graph Analytics (GQL & Property Graphs) license: Apache-2.0 metadata: - version: v1 + version: v2 publisher: google --- @@ -31,14 +32,14 @@ features: ### 1. Query Optimization Performance and efficiency guidelines for BigQuery SQL. Includes rules for -column pruning, pushdown, and materialization strategies. - **Guide**: -[OPTIMIZATION.md](references/OPTIMIZATION.md) +column pruning, pushdown, and materialization strategies. +- **Guide**: [OPTIMIZATION.md](references/sql/OPTIMIZATION.md) ### 2. BigFrames (BigQuery DataFrames) Guidelines for generating valid BigFrames code for data manipulation, model -development, and visualization. - **Guide**: -[BIGFRAMES.md](references/BIGFRAMES.md) +development, and visualization. +- **Guide**: [BIGFRAMES.md](references/bigframes/BIGFRAMES.md) Bigframes should be the default library/tool as it is more efficient than using the BigQuery Python client library. @@ -46,17 +47,28 @@ the BigQuery Python client library. ### 3. BigQuery ML & AI Functions (BQML SQL) Usage rules and syntax standards for all BigQuery AI/ML functions via SQL -(Forecasting, Generative AI, Classification, etc.). - **Guide**: -[BQML.md](references/BQML.md) - **Functions Reference**: - -[AI.FORECAST](references/ai-forecast.md) - -[AI.EVALUATE](references/ai-evaluate.md) - -[AI.GENERATE_TABLE](references/ai-generate-table.md) - -[AI.GENERATE_EMBEDDING](references/ai-generate-embedding.md) - -[Remote Models](references/remote-models.md) -[CONTRIBUTION_ANALYSIS](references/ml-contribution-analysis.md) -[VECTOR_SEARCH](references/vector-search.md) - -### 4. Notebook SQL cells - -Refer to `@skill:notebook-guidance` for standards on running BigQuery in -notebooks. +(Forecasting, Generative AI, Classification, etc.). +- **Best Practices**: [ai_function_best_practices.md](references/ai-ml/ai_function_best_practices.md) +- **Functions Reference**: + + - **AI.CLASSIFY**: [ai_classify.md](references/ai-ml/ai_classify.md) - Classify text. 
+ - **AI.DETECT_ANOMALIES**: [ai_detect_anomalies.md](references/ai-ml/ai_detect_anomalies.md) - Detect anomalies. + - **AI.EVALUATE**: [ai_evaluate.md](references/ai-ml/ai_evaluate.md) - Evaluate models. + - **AI.FORECAST**: [ai_forecast.md](references/ai-ml/ai_forecast.md) - Time-series forecasting. + - **AI.GENERATE**: [ai_generate.md](references/ai-ml/ai_generate.md) - Generate text using LLMs. + - **AI.GENERATE_EMBEDDING**: [ai_generate_embedding.md](references/ai-ml/ai_generate_embedding.md) - Generate embeddings. + - **AI.GENERATE_TABLE**: [ai_generate_table.md](references/ai-ml/ai_generate_table.md) - Table-valued AI generation. + - **AI.IF**: [ai_if.md](references/ai-ml/ai_if.md) - Evaluate semantic conditions. + - **AI.KEY_DRIVERS**: [ai_key_drivers.md](references/ai-ml/ai_key_drivers.md) - Identify key drivers. + - **AI.SCORE**: [ai_score.md](references/ai-ml/ai_score.md) - Score data. + - **AI.SEARCH**: [ai_search.md](references/ai-ml/ai_search.md) - Semantic search. + - **AI.SIMILARITY**: [ai_similarity.md](references/ai-ml/ai_similarity.md) - Semantic similarity. + - **Remote Models**: [remote_models.md](references/ai-ml/remote_models.md) - Working with remote models (Vertex AI). + - **CONTRIBUTION_ANALYSIS**: [ml_contribution_analysis.md](references/ai-ml/ml_contribution_analysis.md) - Step-by-step contribution analysis. + - **VECTOR_SEARCH**: [vector_search.md](references/ai-ml/vector_search.md) - Vector search best practices. + +### 4. Graph Analytics (Property Graphs & GQL) + +Guidelines and best practices for querying property graphs in BigQuery. +- **Property Graph Guidelines**: [graph_queries.md](references/graph/graph_queries.md) - Standard GQL syntax and query patterns. +- **Semantic Graph Guidelines**: [semantic_queries.md](references/graph/semantic_queries.md) - Semantic graph operations and expand functions. 
diff --git a/skills/developing-with-bigquery/references/ai-forecast.md b/skills/developing-with-bigquery/references/ai-forecast.md deleted file mode 100755 index 2b07fbb..0000000 --- a/skills/developing-with-bigquery/references/ai-forecast.md +++ /dev/null @@ -1,62 +0,0 @@ -# AI.FORECAST -Used for time-series forecasting without model training. It automatically -detects frequency (daily, weekly, etc.) based on your timestamp column. - -## Syntax -```sql -SELECT - * -FROM - AI.FORECAST( - { TABLE `project.dataset.table` | (QUERY_STATEMENT) }, - data_col => 'DATA_COL', - timestamp_col => 'TIMESTAMP_COL' - [, model => 'MODEL'] - [, id_cols => ID_COLS] - [, horizon => HORIZON] - [, confidence_level => CONFIDENCE_LEVEL] - [, output_historical_time_series => OUTPUT_HISTORICAL_TIME_SERIES] - [, context_window => CONTEXT_WINDOW] - ) -``` - -## Input Arguments -| Argument | Requirement | Type | Description | -| :--- | :--- | :--- | :--- | -| **`input_data`** | **Required** | | The source table or subquery containing historical data. | -| **`data_col`** | **Required** | String | The numeric column to predict. | -| **`timestamp_col`** | **Required** | String | The column containing dates/timestamps. | -| **`id_cols`** | Optional | Array | Grouping columns for multiple series (e.g., `['store_id']`). | -| **`horizon`** | Optional | Int64 | Number of future points to predict. Defaults to 30. | -| **`confidence_level`** | Optional | Float64 | Confidence interval (0 to 1). Defaults to 0.95. | -| **`model`** | Optional | String | Model version. Defaults to `'TimesFM 2.0'`. | -| **`context_window`** | Optional | Int64 | The number of historical data points the model uses to forecast. Max 512. If not set, the model determines this automatically. | - -## Output Schema -The schema adjusts based on the `output_historical_time_series` flag. 
- -| Column | Type | Included if output_historical_time_series=FALSE | Included if output_historical_time_series=TRUE | Description | -| :--- | :--- | :---: | :---: | :--- | -| **`id_cols`** | (As Input) | Yes | Yes | Original identifiers for the series. | -| **`forecast_timestamp`** | TIMESTAMP | **Yes** | No | Timestamp for predicted points. | -| **`forecast_value`** | FLOAT64 | **Yes** | No | The 50% quantile (median) prediction. | -| **`time_series_timestamp`** | TIMESTAMP | No | **Yes** | Uniform timestamp column for both history and forecast. | -| **`time_series_data`** | FLOAT64 | No | **Yes** | Merged column: actual values for history, median for forecast. | -| **`time_series_type`** | STRING | No | **Yes** | Label: `'history'` or `'forecast'`. | -| **`prediction_interval_lower_bound`** | FLOAT64 | Yes | Yes | Lower bound (NULL for historical rows). | -| **`prediction_interval_upper_bound`** | FLOAT64 | Yes | Yes | Upper bound (NULL for historical rows). | -| **`confidence_level`** | FLOAT64 | Yes | Yes | The constant confidence level used. | -| **`ai_forecast_status`** | STRING | Yes | Yes | Error messages or empty string on success. | - -## Example: Forecasting with History - -```sql --- Forecast next 10 days of sales by store and include historical data for charting -SELECT * FROM AI.FORECAST( - TABLE `my_project.sales.daily_totals`, - data_col => 'total_sales', - timestamp_col => 'date', - id_cols => ['store_id'], - output_historical_time_series => TRUE -); -``` \ No newline at end of file diff --git a/skills/developing-with-bigquery/references/ai-ml/ai_classify.md b/skills/developing-with-bigquery/references/ai-ml/ai_classify.md new file mode 100644 index 0000000..84e7534 --- /dev/null +++ b/skills/developing-with-bigquery/references/ai-ml/ai_classify.md @@ -0,0 +1,92 @@ +# BigQuery AI.Classify + +`AI.CLASSIFY` categorizes unstructured data into a predefined set of labels. 
+ +## Syntax Reference + +```sql +AI.CLASSIFY( + [ input => ] 'INPUT', + [ categories => ] 'CATEGORIES' + [, connection_id => 'CONNECTION_ID' ] + [, endpoint => 'ENDPOINT' ] + [, output_mode => 'OUTPUT_MODE' ] +) +``` + +### Input Arguments + +| Argument | Requirement | Type | Description | +| :------------------ | :----------- | :------------ | :-------------------- | +| **`input`** | **Required** | String | The text content to | +: : : : classify. : +| **`categories`** | **Required** | Array | A list of target | +: : : : categories/labels. : +: : : : Can be : +: : : : `ARRAY<STRING>` or : +: : : : `ARRAY<STRUCT<STRING, STRING>>` (label, : +: : : : description). : +| **`connection_id`** | Optional | String | The connection ID to | +: : : : use for the LLM. : +| **`endpoint`** | Optional | String | The model name, e.g., | +: : : : `'gemini-2.5-flash'`. : +| **`output_mode`** | Optional | String | `'single'` (default) | +: : : : or `'multi'`. : +: : : : Determines the output : +: : : : type. : + +### Output Schema + +The output type depends on the `output_mode` argument: + +| Output Mode | output_mode Value | Type | Description | +| :--------------- | :---------------- | :-------------- | :------------------ | +| **Single Label** | `NULL` (Default) | `STRING` | The single category | +: : : : that best fits the : +: : : : input. : +| **Single Label | `'single'` | `ARRAY<STRING>` | An array containing | +: (Explicit)** : : : exactly one : +: : : : category string. : +| **Multi Label** | `'multi'` | `ARRAY<STRING>` | An array containing | +: : : : zero or more : +: : : : matching : +: : : : categories.
: + +## Examples + +### Classify text into categories + +```sql +SELECT + content, + AI.CLASSIFY( + content, + categories => ['Spam', 'Not Spam', 'Urgent'], + connection_id => 'my-project.us.my-connection' + ) as classification +FROM `dataset.emails`; +``` + +### Classify text into multiple topics + +```sql +SELECT + title, + body, + AI.CLASSIFY( + body, + categories => ['tech', 'sport', 'business', 'politics', 'entertainment', 'other'], + output_mode => 'multi') AS categories +FROM + `bigquery-public-data.bbc_news.fulltext` +LIMIT 100; +``` + +### Classify reviews by sentiment + +```sql +SELECT AI.CLASSIFY(('Classify the review by sentiment: ', review), categories => [('green', 'The review is positive.'), ('yellow', 'The review is neutral.'), ('red', 'The review is negative.')]) AS ai_review_rating, + reviewer_rating AS human_provided_rating, review +FROM `bigquery-public-data.imdb.reviews` WHERE title = 'The English Patient' +``` diff --git a/skills/developing-with-bigquery/references/ai-ml/ai_detect_anomalies.md b/skills/developing-with-bigquery/references/ai-ml/ai_detect_anomalies.md new file mode 100644 index 0000000..5fc86a9 --- /dev/null +++ b/skills/developing-with-bigquery/references/ai-ml/ai_detect_anomalies.md @@ -0,0 +1,110 @@ +# BigQuery AI.Detect_Anomalies + +`AI.DETECT_ANOMALIES` uses the pre-trained **TimesFM** model to identify +deviations in time series data without needing to train a custom model. + +## Syntax Reference + +This function compares a target dataset against a historical dataset to identify +anomalies.
+ +```sql +SELECT * +FROM AI.DETECT_ANOMALIES( + { TABLE `project.dataset.history_table` | (SELECT * FROM history_query) }, + { TABLE `project.dataset.target_table` | (SELECT * FROM target_query) }, + data_col => 'DATA_COL', + timestamp_col => 'TIMESTAMP_COL' + [, model => 'MODEL'] + [, id_cols => ID_COLS] + [, anomaly_prob_threshold => ANOMALY_PROB_THRESHOLD] +) + +``` + +### Input Arguments + +Argument | Requirement | Type | Description +:--------------------------- | :----------- | :------------ | :---------- +**`historical_data`** | **Required** | Table/Query | The source table or subquery containing historical data for training context. +**`target_data`** | **Required** | Table/Query | The source table or subquery containing data to analyze for anomalies. +**`data_col`** | **Required** | String | The numeric column to analyze. +**`timestamp_col`** | **Required** | String | The column containing dates/timestamps. +**`id_cols`** | Optional | Array | Grouping columns for multiple series (e.g., `['store_id']`). +**`anomaly_prob_threshold`** | Optional | Float64 | Threshold for anomaly detection (0 to 1). Defaults to 0.95. +**`model`** | Optional | String | Model version. Defaults to `'TimesFM 2.0'`. + +### Output Schema + +| Column | Type | Description | +| :------------------------------- | :--------- | :--------------------------- | +| **`id_cols`** | (As Input) | Original identifiers for the | +: : : series. : +| **`time_series_timestamp`** | TIMESTAMP | Timestamp for the analyzed | +: : : points. : +| **`time_series_data`** | FLOAT64 | The original data value. | +| **`is_anomaly`** | BOOL | TRUE if the point is | +: : : identified as an anomaly. : +| **`lower_bound`** | FLOAT64 | Lower bound of the expected | +: : : range. : +| **`upper_bound`** | FLOAT64 | Upper bound of the expected | +: : : range. : +| **`anomaly_probability`** | FLOAT64 | Probability that the point | +: : : is an anomaly. 
: +| **`ai_detect_anomalies_status`** | STRING | Error messages or empty | +: : : string on success. A minimum : +: : : of 3 data points is : +: : : required. : + +## Examples + +### Basic Anomaly Detection + +Detect anomalies in daily bike trips for a specific 2-month window based on +prior history. + +```sql +WITH bike_trips AS ( + SELECT EXTRACT(DATE FROM starttime) AS date, COUNT(*) AS num_trips + FROM `bigquery-public-data.new_york.citibike_trips` + GROUP BY date +) +SELECT * +FROM AI.DETECT_ANOMALIES( + -- Historical context (Training data equivalent) + (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')), + -- Target range (Data to inspect for anomalies) + (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'), + data_col => 'num_trips', + timestamp_col => 'date' +); + +``` + +### Multivariate Detection (Multiple Series) + +Use `id_cols` to detect anomalies separately for different user types (e.g., +Subscriber vs. Customer) in the same query. + +```sql +WITH bike_trips AS ( + SELECT + EXTRACT(DATE FROM starttime) AS date, usertype, gender, + COUNT(*) AS num_trips + FROM `bigquery-public-data.new_york.citibike_trips` + GROUP BY date, usertype, gender + ) +SELECT * +FROM + AI.DETECT_ANOMALIES( + # Historical data from a query + (SELECT * FROM bike_trips WHERE date <= DATE('2016-06-30')), + # Target data from a query + (SELECT * FROM bike_trips WHERE date BETWEEN '2016-07-01' AND '2016-09-01'), + data_col => 'num_trips', + timestamp_col => 'date', + id_cols => ['usertype', 'gender'], + model => "TimesFM 2.5", + anomaly_prob_threshold => 0.8); + +``` diff --git a/skills/developing-with-bigquery/references/ai-evaluate.md b/skills/developing-with-bigquery/references/ai-ml/ai_evaluate.md similarity index 100% rename from skills/developing-with-bigquery/references/ai-evaluate.md rename to skills/developing-with-bigquery/references/ai-ml/ai_evaluate.md diff --git a/skills/developing-with-bigquery/references/ai-ml/ai_forecast.md 
b/skills/developing-with-bigquery/references/ai-ml/ai_forecast.md new file mode 100644 index 0000000..a384b2c --- /dev/null +++ b/skills/developing-with-bigquery/references/ai-ml/ai_forecast.md @@ -0,0 +1,106 @@ +# BigQuery AI.Forecast + +`AI.FORECAST` leverages the pre-trained **TimesFM** foundation model to generate +forecasts without the need to train and manage custom models. + +## Syntax Reference + +```sql +SELECT + * +FROM + AI.FORECAST( + { TABLE `project.dataset.table` | (QUERY_STATEMENT) }, + data_col => 'DATA_COL', + timestamp_col => 'TIMESTAMP_COL' + [, model => 'MODEL'] + [, id_cols => ID_COLS] + [, horizon => HORIZON] + [, confidence_level => CONFIDENCE_LEVEL] + [, output_historical_time_series => OUTPUT_HISTORICAL_TIME_SERIES] + [, context_window => CONTEXT_WINDOW] + ) +``` + +### Input Arguments + +| Argument | Requirement | Type | Description | +| :--------------------- | :----------- | :------------ | :---------------- | +| **`input_data`** | **Required** | | The source table | +: : : : or subquery : +: : : : containing : +: : : : historical data. : +| **`data_col`** | **Required** | String | The numeric | +: : : : column to : +: : : : predict. : +| **`timestamp_col`** | **Required** | String | The column | +: : : : containing : +: : : : dates/timestamps. : +| **`id_cols`** | Optional | Array | Grouping columns | +: : : : for multiple : +: : : : series (e.g., : +: : : : `['store_id']`). : +| **`horizon`** | Optional | Int64 | Number of future | +: : : : points to : +: : : : predict. Defaults : +: : : : to 10. The valid : +: : : : input range is : +: : : : [1, 10,000] : +| **`confidence_level`** | Optional | Float64 | Confidence | +: : : : interval (0 to : +: : : : 1). Defaults to : +: : : : 0.95. : +| **`model`** | Optional | String | Model version. | +: : : : Defaults to : +: : : : `'TimesFM 2.0'`. : +| **`context_window`** | Optional | Int64 | The number of | +: : : : historical data : +: : : : points the model : +: : : : uses to forecast. 
: +: : : : The min value is : +: : : : 64 and the max : +: : : : value is 2048 for : +: : : : `'TimesFM 2.0'`. : +: : : : If not set, the : +: : : : model determines : +: : : : this : +: : : : automatically. : + +### Output Schema + +The schema adjusts based on the `output_historical_time_series` flag. + +Column | Type | Included if output_historical_time_series=FALSE | Included if output_historical_time_series=TRUE | Description +:------------------------------------ | :--------- | :---------------------------------------------- | :--------------------------------------------- | :---------- +**`id_cols`** | (As Input) | Yes | Yes | Original identifiers for the series. +**`forecast_timestamp`** | TIMESTAMP | **Yes** | No | Timestamp for predicted points. +**`forecast_value`** | FLOAT64 | **Yes** | No | The 50% quantile (median) prediction. +**`time_series_timestamp`** | TIMESTAMP | No | **Yes** | Uniform timestamp column for both history and forecast. +**`time_series_data`** | FLOAT64 | No | **Yes** | Merged column: actual values for history, median for forecast. +**`time_series_type`** | STRING | No | **Yes** | Label: `'history'` or `'forecast'`. +**`prediction_interval_lower_bound`** | FLOAT64 | Yes | Yes | Lower bound (NULL for historical rows). +**`prediction_interval_upper_bound`** | FLOAT64 | Yes | Yes | Upper bound (NULL for historical rows). +**`confidence_level`** | FLOAT64 | Yes | Yes | The constant confidence level used. +**`ai_forecast_status`** | STRING | Yes | Yes | Error messages or empty string on success. A minimum of 3 data points is required. 
+ +## Examples + +### Forecasting with History + +```sql +WITH + citibike_trips AS ( + SELECT EXTRACT(DATE FROM starttime) AS date, usertype, COUNT(*) AS num_trips + FROM `bigquery-public-data.new_york.citibike_trips` + GROUP BY date, usertype + ) +SELECT * +FROM + AI.FORECAST( + TABLE citibike_trips, + data_col => 'num_trips', + timestamp_col => 'date', + id_cols => ['usertype'], + horizon => 30, + output_historical_time_series => true); +``` diff --git a/skills/developing-with-bigquery/references/BQML.md b/skills/developing-with-bigquery/references/ai-ml/ai_function_best_practices.md similarity index 71% rename from skills/developing-with-bigquery/references/BQML.md rename to skills/developing-with-bigquery/references/ai-ml/ai_function_best_practices.md index 5a86b8b..533b1a0 100755 --- a/skills/developing-with-bigquery/references/BQML.md +++ b/skills/developing-with-bigquery/references/ai-ml/ai_function_best_practices.md @@ -11,13 +11,13 @@ Rules and syntax standards for BigQuery AI and Machine Learning functions. 
Function/Use Case | Required Reference File ------------------------- | ---------------------------------------------- -**AI.FORECAST** | [ai-forecast.md](ai-forecast.md) -**AI.EVALUATE** | [ai-evaluate.md](ai-evaluate.md) -**AI.GENERATE_TABLE** | [ai-generate-table.md](ai-generate-table.md) -**AI.GENERATE_EMBEDDING** | [ai-generate-embedding.md](ai-generate-embedding.md) -**Remote Models** | [remote-models.md](remote-models.md) -**CONTRIBUTION_ANALYSIS** | [ml-contribution-analysis.md](ml-contribution-analysis.md) -**VECTOR_SEARCH** | [vector-search.md](vector-search.md) +**AI.FORECAST** | [ai_forecast.md](ai_forecast.md) +**AI.EVALUATE** | [ai_evaluate.md](ai_evaluate.md) +**AI.GENERATE_TABLE** | [ai_generate_table.md](ai_generate_table.md) +**AI.GENERATE_EMBEDDING** | [ai_generate_embedding.md](ai_generate_embedding.md) +**Remote Models** | [remote_models.md](remote_models.md) +**CONTRIBUTION_ANALYSIS** | [ml_contribution_analysis.md](ml_contribution_analysis.md) +**VECTOR_SEARCH** | [vector_search.md](vector_search.md) ## 3. Mandatory Syntax Checks diff --git a/skills/developing-with-bigquery/references/ai-ml/ai_generate.md b/skills/developing-with-bigquery/references/ai-ml/ai_generate.md new file mode 100644 index 0000000..84c39ac --- /dev/null +++ b/skills/developing-with-bigquery/references/ai-ml/ai_generate.md @@ -0,0 +1,104 @@ +# BigQuery AI.Generate + +`AI.GENERATE` is a general-purpose function for text and content generation. + +## Syntax Reference + +```sql +AI.GENERATE( + [ prompt => ] 'PROMPT' + [, endpoint => 'ENDPOINT'] + [, model_params => 'MODEL_PARAMS'] + [, output_schema => 'OUTPUT_SCHEMA'] + [, connection_id => 'CONNECTION_ID'] + [, request_type => 'REQUEST_TYPE'] +) +``` + +### Input Arguments + +| Argument | Requirement | Type | Description | +| :------------------ | :----------- | :----- | :-------------------- | +| **`prompt`** | **Required** | String | The prompt text or | +: : : : instruction for the : +: : : : model.
: +| **`connection_id`** | Optional | String | The connection ID. | +: : : : Optional if : +: : : : configured via other : +: : : : means or testing. : +| **`endpoint`** | Optional | String | The model name, e.g., | +: : : : `'gemini-2.5-flash'`. : +| **`output_schema`** | Optional | String | Schema definition for | +: : : : structured output, : +: : : : e.g., `'answer BOOL, : +: : : : reason STRING'`. : +| **`request_type`** | Optional | String | `'DEDICATED'` or | +: : : : `'SHARED'`. : +| **`model_params`** | Optional | JSON | JSON object for model | +: : : : parameters (e.g., : +: : : : `temperature`, : +: : : : `max_output_tokens`). : + +### Output Schema + +Returns a `STRUCT` with the following fields: + +| Column Name | Type | Description | +| :------------------ | :------------------- | :----------------------------- | +| **`result`** | `STRING` (or Custom) | The generated content. If | +: : : `output_schema` is used, this : +: : : field is replaced by the : +: : : schema's fields. : +| **`status`** | `STRING` | API response status (empty on | +: : : success). : +| **`full_response`** | `JSON` | The complete raw JSON response | +: : : from the model (including : +: : : safety ratings, usage : +: : : metadata). 
: + +## Examples + +### Basic Text Generation + +```sql +SELECT + AI.GENERATE( + 'Summarize this article: ' || article_content, + connection_id => 'my-project.us.my-connection', + endpoint => 'gemini-2.5-flash' + ) as summary +FROM `dataset.articles` +LIMIT 5; +``` + +### Structured Output Generation + +```sql +SELECT + AI.GENERATE( + 'Extract the date and amount from this invoice: ' || invoice_text, + output_schema => 'date DATE, amount FLOAT64' + ) as extracted_data +FROM `dataset.invoices`; +``` + +### Process images in a Cloud Storage bucket + +```sql +CREATE SCHEMA IF NOT EXISTS bqml_tutorial; + +CREATE OR REPLACE EXTERNAL TABLE bqml_tutorial.product_images + WITH CONNECTION DEFAULT OPTIONS ( + object_metadata = 'SIMPLE', + uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png']); + +SELECT + uri, + STRING(OBJ.GET_ACCESS_URL(ref,'r').access_urls.read_url) AS signed_url, + AI.GENERATE( + ("What is this: ", OBJ.GET_ACCESS_URL(ref, 'r')), + output_schema => + "image_description STRING, entities_in_the_image ARRAY<STRING>").* +FROM bqml_tutorial.product_images +WHERE uri LIKE "%aquarium%"; +``` diff --git a/skills/developing-with-bigquery/references/ai-generate-embedding.md b/skills/developing-with-bigquery/references/ai-ml/ai_generate_embedding.md similarity index 100% rename from skills/developing-with-bigquery/references/ai-generate-embedding.md rename to skills/developing-with-bigquery/references/ai-ml/ai_generate_embedding.md diff --git a/skills/developing-with-bigquery/references/ai-generate-table.md b/skills/developing-with-bigquery/references/ai-ml/ai_generate_table.md similarity index 100% rename from skills/developing-with-bigquery/references/ai-generate-table.md rename to skills/developing-with-bigquery/references/ai-ml/ai_generate_table.md diff --git a/skills/developing-with-bigquery/references/ai-ml/ai_if.md b/skills/developing-with-bigquery/references/ai-ml/ai_if.md new file mode 100644 index 0000000..c12d709 --- /dev/null +++
b/skills/developing-with-bigquery/references/ai-ml/ai_if.md @@ -0,0 +1,55 @@ +# BigQuery AI.If + +`AI.IF` is a semantic boolean function used to evaluate a condition described in +natural language. + +The function can be used to filter and join data based on conditions described +in natural language or multimodal input. The following are common use cases: + +- Sentiment analysis: Find customer reviews with negative sentiment. +- Topic analysis: Identify news articles related to a specific subject. +- Image analysis: Select images that contain a specific item. +- Security: Identify suspicious emails. + +## Syntax Reference + +```sql +AI.IF( + [ prompt => ] 'PROMPT' + [, connection_id => 'CONNECTION_ID' ] + [, endpoint => 'ENDPOINT' ] +) +``` + +### Input Arguments + +| Argument | Requirement | Type | Description | +| :------------------ | :----------- | :------------ | :--------------------- | +| **`prompt`** | **Required** | String/Struct | The prompt text or a | +: : : : struct/tuple of : +: : : : `(data, instruction)`. : +| **`connection_id`** | Optional | String | The connection ID to | +: : : : use for the LLM. : +| **`endpoint`** | Optional | String | The model endpoint | +: : : : (e.g. : +: : : : `'gemini-2.5-flash'`). : + +### Output Schema + +| Column Name | Type | Description | +| :------------------ | :----- | :---------------------------------------- | +| **(Scalar Result)** | `BOOL` | `TRUE` if the condition is met, `FALSE` | +: : : otherwise. Returns `NULL` on error/safety : +: : : filter. 
: + +## Examples + +### Filter rows based on semantic meaning + +```sql +SELECT * +FROM `dataset.table` +WHERE AI.IF( + (content_column, 'Is this review positive?') +); +``` diff --git a/skills/developing-with-bigquery/references/ai-ml/ai_key_drivers.md b/skills/developing-with-bigquery/references/ai-ml/ai_key_drivers.md new file mode 100644 index 0000000..52c13e0 --- /dev/null +++ b/skills/developing-with-bigquery/references/ai-ml/ai_key_drivers.md @@ -0,0 +1,75 @@ +# BigQuery AI.Key_Drivers + +`AI.KEY_DRIVERS` automatically identifies the key dimensional segments most +responsible for driving changes in a specified metric between a defined interest +group and a reference group. + +## Syntax Reference + +```sql +SELECT + * +FROM + AI.KEY_DRIVERS( + { TABLE TABLE | (QUERY_STATEMENT) }, + metric_col => 'METRIC_COL', + dimension_cols => DIMENSION_COLS, + interest_label_col => 'INTEREST_LABEL_COL' + [, min_apriori_support => MIN_APRIORI_SUPPORT] + [, top_k => TOP_K] + [, enable_pruning => ENABLE_PRUNING] + ) +``` + +### Input Arguments + +Argument | Requirement | Type | Description +:------------------------ | :----------- | :-------------- | :---------- +**`input_data`** | **Required** | | The source table or subquery containing the data to analyze. +**`metric_col`** | **Required** | `String` | Metric column name. Must be of type: INT64, NUMERIC, BIGNUMERIC, or FLOAT64. +**`interest_label_col`** | **Required** | `String` | Boolean column name: `TRUE` for interest group, `FALSE` for reference group. +**`dimension_cols`** | **Required** | `ARRAY<STRING>` | 1-12 dimension columns (INT64, BOOL, STRING); cannot be `metric_col` or `interest_label_col`. +**`min_apriori_support`** | Optional | `FLOAT64` | Minimum apriori support threshold [0, 1] for output segments. Default: 0.1. Cannot be used with `top_k`. +**`top_k`** | Optional | `INT64` | Return top k insights [1, 1M] by apriori support. If unset, uses `min_apriori_support=0.1`. Cannot be used with `min_apriori_support`.
+**`enable_pruning`** | Optional | `BOOL` | If `TRUE` (default), redundant insights are pruned. If `FALSE`, all insights meeting thresholds are returned. Two segments are redundant if two conditions are met: 1) their metric values are equal 2) The dimensions and corresponding values of one row are a subset of the dimensions and corresponding values of the other. In this case, the row with more dimensions (the more descriptive row) is kept. + +### Output Schema + +Returns a `STRUCT` with the following fields: + +Column Name | Type | Description +:----------------------------------- | :-------------- | :---------- +**`drivers`** | `ARRAY` | Provides a list of drivers, or dimension values of interest, which describes each of the segments. +**`metric_interest`** | `NUMERIC` | The sum of the metric_column for the data in the interest segment. +**`metric_reference`** | `NUMERIC` | The sum of the metric_column for data in the reference segment. +**`difference`** | `NUMERIC` | The difference between the interest and reference metric values for a segment. +**`relative_difference`** | `NUMERIC` | The relative change of a segment, calculated as the difference divided by the reference metric value. +**`unexpected_difference`** | `NUMERIC` | Measures deviation of segment from the rest of the population's growth. Calculated as: (segment relative_difference - complement relative_difference) * segment reference metric. +**`relative_unexpected_difference`** | `NUMERIC` | The unexpected_difference divided by the expected interest metric value for a segment. +**`contribution`** | `NUMERIC` | Contains the absolute value of the difference value: `ABS(difference)`. +**`apriori_support`** | `NUMERIC` | Segment size relative to the total population (filters small segments). 
+ +## Examples + +### Identifying Key Drivers in 2024 H2 Liquor Sales + +```sql +WITH InputData AS ( + SELECT + sale_dollars, + city, + category_name, + vendor_name, + (date >= '2024-07-01') AS IS_H2 + FROM `bigquery-public-data.iowa_liquor_sales.sales` + WHERE EXTRACT(YEAR FROM date) = 2024 +) +SELECT * +FROM AI.KEY_DRIVERS( + TABLE InputData, + metric_col => 'sale_dollars', + dimension_cols => ['city', 'vendor_name', 'category_name'], + interest_label_col => 'IS_H2', + min_apriori_support => 0 +); +``` diff --git a/skills/developing-with-bigquery/references/ai-ml/ai_score.md b/skills/developing-with-bigquery/references/ai-ml/ai_score.md new file mode 100644 index 0000000..1f7952c --- /dev/null +++ b/skills/developing-with-bigquery/references/ai-ml/ai_score.md @@ -0,0 +1,52 @@ +# BigQuery AI.Score + +The `AI.SCORE` function is commonly used with the ORDER BY clause and works well +when you want to rank items. The following are common use cases: + +- Retail: Find the top 5 most negative customer reviews about a product. +- Hiring: Find the top 10 resumes that appear most qualified for a job post. +- Customer success: Find the top 20 best customer support interactions. + +## Syntax Reference + +```sql +AI.SCORE( + [ prompt => ] 'PROMPT' + [, connection_id => 'CONNECTION_ID' ] + [, endpoint => 'ENDPOINT' ] +) +``` + +### Input Arguments + +| Argument | Requirement | Type | Description | +| :------------------ | :----------- | :------------ | :--------------------- | +| **`prompt`** | **Required** | String/Struct | The prompt text or a | +: : : : struct/tuple of : +: : : : `(data, instruction)`. : +| **`connection_id`** | Optional | String | The connection ID to | +: : : : use for the LLM. : +| **`endpoint`** | Optional | String | The model endpoint | +: : : : (e.g. : +: : : : `'gemini-2.5-flash'`).
: + +### Output Schema + +| Column Name | Type | Description | +| :------------------ | :-------- | :----------------------------------------- | +| **(Scalar Result)** | `FLOAT64` | A numerical score representing the degree | +: : : to which the data matches the instruction. : + +## Examples + +### Rank rows by semantic relevance + +```sql +SELECT * +FROM `dataset.table` +ORDER BY AI.SCORE( + (content_column, 'relevance to sports'), + connection_id => 'my-project.us.my-connection' +) DESC +LIMIT 10; +``` diff --git a/skills/developing-with-bigquery/references/ai-ml/ai_search.md b/skills/developing-with-bigquery/references/ai-ml/ai_search.md new file mode 100644 index 0000000..a712977 --- /dev/null +++ b/skills/developing-with-bigquery/references/ai-ml/ai_search.md @@ -0,0 +1,81 @@ +# BigQuery AI.Search + +`AI.SEARCH` is a table-valued function for semantic search on tables that have +autonomous embedding generation enabled. If your table has a column that has +generated_expression metadata with format "AI.EMBED(source_column)", then +AI.SEARCH uses it to optimize the search. + +You can use AI.SEARCH to help with the following tasks: + +- Semantic search: search entities ranked by semantic similarity. +- Recommendation: return entities with attributes similar to a given entity. +- Classification: return the class of entities whose attributes are similar to + the given entity. +- Clustering: cluster entities whose attributes are similar to a given entity. +- Outlier detection: return entities whose attributes are least related to the + given entity. + +## Syntax Reference + +```sql +AI.SEARCH( + { TABLE base_table | base_table_query }, + column_to_search, + query_value + [, top_k => top_k_value ] + [, distance_type => distance_type_value ] + [, options => options_value] +) +``` + +**IMPORTANT:** Do not add "column_to_search =>" prefixes to the column_to_search +argument because column_to_search is a positional argument. 
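
As a minimal sketch of the positional-argument rule above: the table, the column name, and the query text are passed by position, and only the optional settings use `=>`. The table `mydataset.products` and its `description` column are illustrative placeholders, and the table is assumed to have autonomous embedding generation enabled.

```sql
-- Positional: table, column_to_search, query_value. Named: optional arguments only.
SELECT
  base.name,
  distance
FROM AI.SEARCH(
  TABLE mydataset.products,  -- placeholder table (assumed to have autonomous embeddings)
  'description',             -- column_to_search: positional, no "column_to_search =>" prefix
  'A really fun toy',        -- query_value: positional
  top_k => 5,
  distance_type => 'COSINE'
);
```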
+ +### Input Arguments + +Argument | Requirement | Type | Description +:--------------------- | :----------- | :------- | :---------- +**`base_table`** | **One Of** | Table | The table to search for nearest neighbor embeddings. The table must have autonomous embedding generation enabled. +**`base_table_query`** | **One Of** | Subquery | The query that you can use to pre-filter the base table. Only SELECT, FROM, and WHERE clauses are allowed in this query. Don't apply any filters to the embedding column. +**`column_to_search`** | **Required** | STRING | A STRING literal that contains the name of the string column to search. +**`query_value`** | **Required** | STRING | A string literal that represents the search query. +**`top_k`** | Optional | INT64 | A named argument with an INT64 value that specifies the number of nearest neighbors to return. The default is 10. +**`distance_type`** | Optional | STRING | A named argument with a STRING value. distance_type_value specifies the type of metric to use to compute the distance between two vectors. Supported distance types are EUCLIDEAN, COSINE, and DOT_PRODUCT. The default is EUCLIDEAN. COSINE is recommended. +**`options`** | Optional | STRING | A named argument with a JSON-formatted STRING value that specifies the following search options: `fraction_lists_to_search` or `use_brute_force`. + +### Output Schema + +Column Name | Type | Description +:------------- | :------ | :---------------------------------------------------- +**`base`** | STRUCT | A struct containing all columns from the input table. +**`distance`** | FLOAT64 | The distance score between the query and the result. + +## Examples + +```sql +# Create a table with a product column, a description column, and an autonomous +# embedding generation column over the description column.
+CREATE TABLE mydataset.products ( + name STRING, + description STRING, + description_embedding STRUCT<result ARRAY<FLOAT64>, status STRING> + GENERATED ALWAYS AS (AI.EMBED( + description, + connection_id => 'us.example_connection', + endpoint => 'text-embedding-005' + )) + STORED OPTIONS( asynchronous = TRUE ) +); + +# Get the nearest matches for really fun toy products (top_k defaults to 10). +SELECT + base.name, + base.description, + distance +FROM AI.SEARCH(TABLE mydataset.products, 'description', "A really fun toy", distance_type => "COSINE"); + +# Get 5 distinct fun toy product descriptions from the 100 nearest matches. +SELECT DISTINCT(base.description) +FROM AI.SEARCH(TABLE `mydataset.products`, 'description', "A really fun toy", distance_type => "COSINE", top_k => 100) +LIMIT 5; +``` diff --git a/skills/developing-with-bigquery/references/ai-ml/ai_similarity.md b/skills/developing-with-bigquery/references/ai-ml/ai_similarity.md new file mode 100644 index 0000000..ab7931d --- /dev/null +++ b/skills/developing-with-bigquery/references/ai-ml/ai_similarity.md @@ -0,0 +1,81 @@ +# BigQuery AI.Similarity + +`AI.SIMILARITY` computes the semantic similarity between two inputs. + +Use cases include the following: + +- Semantic search: Search for text or images based on a description, without + having to match specific keywords. +- Recommendation: Return entities with attributes similar to a given entity. + +## Syntax Reference + +```sql +AI.SIMILARITY( + content1 => 'CONTENT1', + content2 => 'CONTENT2' + [, endpoint => 'ENDPOINT'] + [, model_params => 'MODEL_PARAMS'] + [, connection_id => 'CONNECTION_ID'] +) +``` + +### Input Arguments + +| Argument | Requirement | Type | Description | +| :------------------ | :----------- | :-------- | :----------------------- | +| **`content1`** | **Required** | String or | The first text content | +: : : ObjectRef : or image context. : +| **`content2`** | **Required** | String or | The second text content | +: : : ObjectRef : or image to compare : +: : : : against.
: +| **`connection_id`** | Optional | String | The connection ID to use | +: : : : for the LLM. : +| **`endpoint`** | Optional | String | The model endpoint (e.g. | +: : : : `'text-embedding-005'`). : +| **`model_params`** | Optional | JSON | JSON object for model | +: : : : parameters (e.g., : +: : : : `temperature`, : +: : : : `max_output_tokens`). : + +### Output Schema + +| Column Name | Type | Description | +| :------------------ | :-------- | :---------------------------------- | +| **(Scalar Result)** | `FLOAT64` | A similarity score (e.g., cosine | +: : : similarity). Returns null if error. : + +## Examples + +### Compute semantic similarity between two text inputs + +```sql +SELECT AI.SIMILARITY( + content1 => 'The cat sat on the mat', + content2 => 'A feline is resting on the rug' +) as similarity_score; +``` + +### Compute semantic similarity between text and image + +```sql +CREATE SCHEMA IF NOT EXISTS cymbal_pets; + +CREATE OR REPLACE EXTERNAL TABLE cymbal_pets.product_images +WITH CONNECTION DEFAULT +OPTIONS ( + object_metadata = 'SIMPLE', + uris = ['gs://cloud-samples-data/bigquery/tutorials/cymbal-pets/images/*.png'] +); + +SELECT + uri, + OBJ.GET_READ_URL(ref) AS signed_url, + ai.similarity( + "aquarium device", + ref, + endpoint => 'multimodalembedding@001') AS similarity_score +FROM cymbal_pets.product_images +ORDER BY similarity_score DESC +LIMIT 3; +``` diff --git a/skills/developing-with-bigquery/references/ml-contribution-analysis.md b/skills/developing-with-bigquery/references/ai-ml/ml_contribution_analysis.md similarity index 100% rename from skills/developing-with-bigquery/references/ml-contribution-analysis.md rename to skills/developing-with-bigquery/references/ai-ml/ml_contribution_analysis.md diff --git a/skills/developing-with-bigquery/references/remote-models.md b/skills/developing-with-bigquery/references/ai-ml/remote_models.md similarity index 100% rename from skills/developing-with-bigquery/references/remote-models.md rename to 
skills/developing-with-bigquery/references/ai-ml/remote_models.md diff --git a/skills/developing-with-bigquery/references/vector-search.md b/skills/developing-with-bigquery/references/ai-ml/vector_search.md similarity index 100% rename from skills/developing-with-bigquery/references/vector-search.md rename to skills/developing-with-bigquery/references/ai-ml/vector_search.md diff --git a/skills/developing-with-bigquery/references/BIGFRAMES.md b/skills/developing-with-bigquery/references/bigframes/BIGFRAMES.md similarity index 91% rename from skills/developing-with-bigquery/references/BIGFRAMES.md rename to skills/developing-with-bigquery/references/bigframes/BIGFRAMES.md index dbde2bc..5221769 100755 --- a/skills/developing-with-bigquery/references/BIGFRAMES.md +++ b/skills/developing-with-bigquery/references/bigframes/BIGFRAMES.md @@ -14,7 +14,7 @@ Guidelines for generating valid code with the BigFrames (BigQuery DataFrame) lib * Prefer built-in accessors (e.g., `df.col.str.*`, `df.col.dt.*`) over remote UDFs. * **Do not use lambdas** with `Series.map()` or `DataFrame.apply()`. * **Schema Verification**: Do not assume schema of intermediate outputs. Check `.dtypes` after loading, and use `display()` with `.head()` or `.peek()`. -* **Visualization**: BigFrames Dataframe mostly works directly with Matplotlib, Seaborn, and other ploting libraries. If your attempt didn't work, try using the "plot" accessor. If that didn't work either, you MUST sample or aggregate your data to make it small enough before calling "to_pandas()". +* **Visualization**: BigFrames Dataframe mostly works directly with Matplotlib, Seaborn, and other plotting libraries. If your attempt didn't work, try using the "plot" accessor. If that didn't work either, you MUST sample or aggregate your data to make it small enough before calling "to_pandas()". 
## Model Development diff --git a/skills/developing-with-bigquery/references/graph/graph_queries.md b/skills/developing-with-bigquery/references/graph/graph_queries.md new file mode 100644 index 0000000..bc28020 --- /dev/null +++ b/skills/developing-with-bigquery/references/graph/graph_queries.md @@ -0,0 +1,464 @@ +# Graph Query Language (GQL) Query Generation Guidelines + +You are querying a property graph consisting of nodes and edges. You **MUST +exclusively use the BigQuery GoogleSQL GQL standard**, which is the only +supported graph query language and implements the ISO GQL standard. + +**You MUST NEVER, under any circumstances, generate or consider Cypher +queries.** Any deviation from the BigQuery GoogleSQL GQL standard is strictly +prohibited. + +## Pre-generation Checklist + +Before generating any GQL, you MUST: + +1. **Identify Output Intent**: Determine if the user intends to **visualize a + graph network** (requires `TO_JSON()`) or **view tabular data** (requires + specific properties). +2. **Verify Language Standard**: Confirm the query will use **BigQuery + GoogleSQL GQL**. NEVER use Cypher. + +## Core Directives for Agent Query Generation + +When generating graph queries, you must adhere to the following global +directives: + +1. **Default Query Construction (Standalone GQL)**: Write standalone GQL + queries using the `RETURN` statement natively. **Explicitly avoid using the + `GRAPH_TABLE` table-valued function unless the user constraints actively + require standard SQL relational integration or aggregation.** +2. **Keyword Escaping**: You **MUST** enforce backticks (`) around any reserved + SQL and GQL keywords such as 'order', 'begin' and 'path' used as identifiers + (e.g., column names, label names, variable names). +3. **Strictly Follow Graph Schema**: Ensure all labels (e.g., `:Person`, + `:Account`) and properties (e.g., `n.id`, `e.amount`) used in the query + strictly match the provided graph schema. 
Do NOT guess or hallucinate schema + elements. +4. **Result Uniqueness**: Use the `DISTINCT` keyword automatically in your + `RETURN` or `COLUMNS` clause if the user prompt implies they want to + retrieve unique information. +5. **Graph Path Variables**: When a query involves "paths", "path traversal", + "path finding", or finding relationships between nodes, you **MUST** assign + the matched pattern to a path variable (e.g., `MATCH p = ...`). + +## Basic GQL Query Construction + +A linear query statement in BigQuery GQL executes clauses sequentially. The +output of one clause provides the input (the "working table") to the next. + +Common sequential statements include: + +- `MATCH`: Identifies topological patterns in the graph. +- `WITH`: Projects variables from current scope into the next scope, + optionally sorting, limiting or grouping. +- `LET`: Defines a new variable or alias within the query scope. +- `FILTER`: Filters intermediate graph mappings. +- `RETURN`: Ends a GQL query or subquery, projecting the final graph + variables. +- `ORDER BY`, `LIMIT`, `OFFSET`: Control sorting and pagination. + +**Chaining with `NEXT`:** Multiple linear statements can be composed into a +compound query using the `NEXT` keyword. The results of the first statement pipe +into the statement following `NEXT`. + +### Example 1: Sequential Statements + +This query demonstrates filtering and projecting data through a single linear +statement sequence. + +```sql +GRAPH .. +MATCH (src:Account)-[t:Transfers]->(dst:Account) +LET transfer_amount = t.amount +FILTER transfer_amount > 1000 +WITH src, dst, transfer_amount +ORDER BY transfer_amount DESC +LIMIT 50 +RETURN src.id AS source, dst.id AS destination, transfer_amount +``` + +### Example 2: Chaining with NEXT + +This query demonstrates using `NEXT` to pipe the results of one graph pattern +match into a subsequent pattern match. + +```sql +GRAPH .. 
+MATCH (blocked:Account WHERE blocked.is_frozen = true) +RETURN blocked.id AS frozen_id +NEXT +MATCH (a:Account)-[t:Transfers]->(b:Account) +FILTER a.id = frozen_id +RETURN a.id AS source, b.id AS destination, t.amount AS amount +``` + +## Graph Pattern Matching + +A graph pattern matches topologies within a BigQuery property graph. Patterns +consist of vertices (nodes) and connecting edges. + +### Node Patterns + +Node patterns are enclosed in parentheses `()`. They identify entities in the +graph and can optionally bind to a variable or specify label and property +filters. + +- `MATCH (n)`: Matches any node and binds it to the variable `n`. +- `MATCH (p:Person)`: Matches nodes explicitly labeled with `Person`. +- `MATCH (p:Person|Account)`: Uses a label expression `|` (OR) to match nodes + that have *either* the `Person` or `Account` label. +- `MATCH (p:Person {id: 1})`: Matches nodes that satisfy a specific property + filter. +- `MATCH (p:Person WHERE p.age > 18)`: Matches nodes that satisfy a `WHERE` + condition on their properties. + +**Important Note on Label Expressions**: BigQuery matches a node if it possesses +*any* of the labels listed in an `OR` (`|`) expression. You cannot use `&` +directly in label expressions. + +### Edge Patterns + +Edge patterns represent the relationships between nodes. They are enclosed in +square brackets `[]` and connected using arrows (`-`, `->`, `<-`) to denote +directionality. + +- `MATCH (a)-[e]->(b)`: Matches any directed edge from `a` to `b`, binding the + edge to `e`. +- `MATCH (a)-[e:Transfers]->(b)`: Directed edge specifically labeled + `Transfers`. +- `MATCH (a)-[e:Transfers {amount: 50}]->(b)`: Edge with a specific property + filter applied. +- `MATCH (a)-[e:Transfers]-(b)`: Matches an undirected (any direction) edge + between `a` and `b`. Prefer an explicit direction when possible for + better performance.
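
The node and edge patterns above can be combined into one complete query. This is a minimal sketch: the graph name `my_project.my_dataset.my_graph` is hypothetical, and it assumes a schema with `Account` nodes and `Transfers` edges that carry an `amount` property, as in the other examples here.

```sql
-- Hypothetical graph name; Account/Transfers/amount are assumed schema elements.
GRAPH my_project.my_dataset.my_graph
MATCH (a:Account)-[e:Transfers {amount: 50}]->(b:Account)  -- directed edge with a property filter
RETURN a.id AS source_id, b.id AS destination_id, e.amount AS amount
LIMIT 10
```

The direction is spelled out with `->` instead of the undirected `-` form, in line with the performance note on the last bullet.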
+ +### Pattern Joins and Commas + +A complex graph pattern consists of one or more path patterns separated by +commas `,`. When multiple comma-separated patterns are used: + +- If they do not share any variables, they result in a **cross join**. +- If they share a common variable, BigQuery automatically performs an + **equijoin** on that variable. + +```sql +-- Equijoin example where 'interm' connects the two paths +GRAPH .. +MATCH (src:Account)-[t1:Transfers]->(interm:Account), + (interm)<-[:Owns]-(p:Person) +RETURN src.id AS account_id, p.name AS owner_name +``` + +### Variable-Length Paths and Quantifiers + +You can find multi-hop connections by appending a quantifier to an edge pattern, +defining variable-length paths. + +- `{m, n}`: Specifies that the edge pattern must be repeated between `m` and + `n` times (e.g., `{1,3}`). + +**Group Variables**: When an edge variable is quantified (e.g., +`[e:Transfers]->{1,3}`), the variable `e` becomes a "group variable." This +represents an array of the matched edges in the path. You must use array +functions to interact with it, such as `ARRAY_LENGTH(e)` or horizontal +aggregation like `SUM(e.amount)`. + +### Path Search Prefixes + +Variable-length paths can result in exponential combinations and repeating +paths. You can constrain the search between source and destination pairs using +search prefixes placed immediately before the path pattern: + +- `ANY`: Returns exactly one arbitrary matching path between each unique pair + of source and destination nodes. +- `ANY SHORTEST`: Returns a single path for each unique pair, specifically + choosing from those with the minimum number of edges (hops). +- `ANY CHEAPEST`: Returns a single path with the minimum total cost, computed + by aggregating `COST` expressions defined on the edges. + +```sql +GRAPH ..
+MATCH ANY SHORTEST + (a:Account {id: 123})-[e:Transferred]->{1,3}(b:Account {id: 456}) +RETURN e +``` + +## GQL Functions and Operators + +BigQuery property graphs support specialized native functions for interrogating +graph elements and extracting path metadata. These functions can be used +directly within `MATCH`, `WHERE`, `LET`, and `RETURN`/`COLUMNS` clauses. + +### Path Extraction Functions + +When an entire path pattern is bound to a variable (e.g., `MATCH p = (...)`), +you can extract specific metadata and elements from it: + +- `PATH_FIRST(p)`: Extracts and returns the starting node of path `p`. +- `PATH_LAST(p)`: Extracts and returns the terminal (ending) node of path `p`. +- `PATH_LENGTH(p)`: Returns an `INT64` count representing the number of edge + hops in path `p`. +- `NODES(p)`: Returns an array of node elements, ordered by their sequence in + the path. +- `EDGES(p)`: Returns an array of edge elements, ordered by their sequence in + the path. + +```sql +GRAPH .. +MATCH p = (a:Account)-[t:Transfers]->{1,3}(b:Account) +RETURN PATH_LENGTH(p) AS hops, TO_JSON(NODES(p)) AS path_nodes +``` + +### Element Traversal and Inspection Functions + +These functions operate on individual node or edge element variables: + +- `DESTINATION_NODE_ID(e)`: Retrieves the unique internal string identifier of + an edge `e`'s destination node. +- `SOURCE_NODE_ID(e)`: Retrieves the unique internal string identifier of an + edge `e`'s source node. +- `ELEMENT_ID(x)`: Returns the unique internal identifier for the given node + or edge `x`. +- `LABELS(x)`: Returns an array of string labels bound to a node or edge + element `x`. + +## Output Formatting: Graph Visualization vs. Tabular Data + +When constructing the `RETURN` clause, strictly distinguish between **graph +visualization** intent and **tabular data** intent based on the user's +objective. + +### Path Variables + +You can assign an entire matched pattern sequence to a path variable using the +assignment operator `=`. 
This allows you to reference the entire topological +sequence later in the query. + +```sql +MATCH p = (a:Person)-[e:Knows]->(b:Person) +``` + +In this example, `p` represents the full path, encapsulating the nodes `a` and +`b` and the edge `e`. + +### 1. Graph Visualization Intent + +Use this when the user wants to see relationships, paths, topology, networks, +connectivity or entire entities (nodes/edges) as a whole. + +- **Trigger & Keywords**: "visualize", "show the graph", "network", + "connections", "find the path", "relationship between X and Y". +- **Default JSON Serialization (`TO_JSON`)**: Unless specific properties + (e.g., `n.name`) or path metrics (e.g., `PATH_LENGTH(p)`) are explicitly + requested, you **MUST** wrap all graph topology outputs (nodes, edges, and + path variables) in the standard `TO_JSON()` function. This ensures + compatibility with graphing UI components that expect full JSON objects. +- **Example**: `RETURN TO_JSON(src) AS source, TO_JSON(p) AS full_path` +- **Limit**: Always append `LIMIT 500` to the query to prevent overwhelming + the UI with too many nodes/edges, unless the user explicitly requests a + different number. + +```sql +GRAPH .. +MATCH p = (src:Person)-[e:Knows]->(dst:Person) +RETURN + TO_JSON(src) AS source_node, + TO_JSON(e) AS relationship, + TO_JSON(dst) AS destination_node, + TO_JSON(p) AS full_path +LIMIT 500 +``` + +### 2. Tabular or Chart Intent + +Use this when the user focuses on specific attributes, statistics, or metrics. + +- **Trigger & Keywords**: "what is the name", "list", "how many", "count", + "average", "top 10", "aggregate". +- **Action**: Return ONLY the specific required properties or aggregates. **Do + NOT** use `TO_JSON()`. +- **Example**: `RETURN account.id, SUM(t.amount) AS total_transfer` + +## GRAPH_TABLE Syntax and SQL Integration + +The `GRAPH_TABLE` table-valued function is the primary mechanism for integrating +property graph queries with standard SQL operations in BigQuery. 
+ +### When to Use GRAPH_TABLE + +You **SHOULD** use `GRAPH_TABLE()` only when your query requires integration +with SQL capabilities beyond basic graph pattern matching. Use it for: + +- **SQL Aggregations & Analysis**: Mixing graph pattern matching with standard + SQL aggregations (e.g., `SUM`, `COUNT`, `GROUP BY`). +- **Relational Joins**: Joining graph query results with relational tables or + other `GRAPH_TABLE` calls. +- **Advanced SQL Operations**: Utilizing advanced SQL filtering, reporting, or + pagination on the graph results. + +### Basic Syntax + +The basic structure of a `GRAPH_TABLE` query involves specifying the graph name, +the GQL statements, and a `COLUMNS` clause to define the output relational +schema. + +```sql +SELECT + src_account_id, + COUNT(*) AS transfer_count, + SUM(amount) AS total_transfer_volume +FROM GRAPH_TABLE( + .. + MATCH (src:Account)-[t:Transfers]->(dst:Account) + WHERE src.is_blocked = true + COLUMNS (src.id AS src_account_id, t.amount AS amount) +) +GROUP BY src_account_id +HAVING total_transfer_volume > 10000 +ORDER BY total_transfer_volume DESC +``` + +### The COLUMNS Clause + +The `COLUMNS` clause is mandatory if you want to explicitly define the returned +table's schema. + +- **Explicit Projection**: It limits the output to only the specified + expressions from the graph query scope. +- **Anonymous Columns**: You *must* alias any expressions in the `COLUMNS` + clause if they generate an anonymous column (e.g., `COLUMNS (t.amount * 2 AS + doubled_amount)`). +- **Default Behavior**: If the `COLUMNS` clause is entirely omitted, + `GRAPH_TABLE` returns all graph pattern variables present in the query + scope. +- **Aggregations**: You can include standard SQL aggregate functions directly + within the `COLUMNS` clause to perform grouping and aggregation across the + rows of the resulting graph matches. 
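
To illustrate the aliasing rule for anonymous columns, here is a small sketch that projects one plain property and one computed expression. The graph name `my_project.my_dataset.my_graph` is hypothetical; the labels and properties follow the `Account`/`Transfers` examples used in this guide.

```sql
-- Hypothetical graph name; Account, Transfers, id, and amount are assumed schema elements.
SELECT
  src_account_id,
  doubled_amount
FROM GRAPH_TABLE(
  my_project.my_dataset.my_graph
  MATCH (src:Account)-[t:Transfers]->(dst:Account)
  COLUMNS (
    src.id AS src_account_id,
    t.amount * 2 AS doubled_amount  -- computed expression, so the alias is required
  )
)
ORDER BY doubled_amount DESC
LIMIT 10;
```

Only the two aliased columns are visible to the outer `SELECT`; `dst` is matched but not projected.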
+ +### Joins with Relational Tables + +You can join the result of `GRAPH_TABLE` with other standard BigQuery tables or +even other `GRAPH_TABLE` results using standard SQL semantics (e.g., `JOIN`, +`LEFT JOIN`). + +To make a `GRAPH_TABLE` aware of variables from an earlier table in the `FROM` +clause, you can use parameterized `GRAPH_TABLE`. In the example below, `a.id` +from the `Accounts` table is passed into the `GRAPH_TABLE` scope: + +```sql +SELECT + a.name, + g.total_amount +FROM Accounts AS a +JOIN GRAPH_TABLE( + .. + MATCH (src:Account {id: a.id})-[t:Transfers]->(dst:Account) + COLUMNS (SUM(t.amount) AS total_amount) +) AS g +``` + +## Subquery Limitations + +A subquery in BigQuery GQL is enclosed in braces `{}` and evaluates nested +operations within a linear query statement. While BigQuery Graph supports +subqueries, there are critical limitations and syntax differences compared to +standard GoogleSQL that you **MUST** adhere to. + +### Mandatory Graph Name Specification + +In BigQuery Graph, unlike standard GoogleSQL, you **MUST** specify the graph +name within the subquery block. If the outer query uses `GRAPH +..`, the internal subquery must also explicitly +re-declare it. + +```sql +MATCH (n1) +WHERE EXISTS { + -- REQUIRED: You must re-specify the graph name here + GRAPH .. + MATCH (n2) + WHERE n1 = n2 + RETURN 1 as one +} +``` + +Failure to include the graph name in the subquery will result in a job-server +error. + +### The WHERE vs. FILTER Rule + +Certain types of subqueries **throw errors when used inside a `WHERE` clause** +because BigQuery's query planner cannot decorrelate them if they act as join +predicates. + +For the following subquery types, you **CANNOT** use the `WHERE` clause. You +**MUST** use the `FILTER` clause instead: `EXISTS`, `IN`, `LIKE`, and `LIKE +ANY/SOME/ALL` subqueries. + +**INCORRECT (will throw error):** `sql MATCH (p:Person) WHERE EXISTS { GRAPH +.. 
MATCH (p)-[:Owns]->(:Account) }` + +**CORRECT:** `sql MATCH (p:Person) FILTER EXISTS { GRAPH +.. MATCH (p)-[:Owns]->(:Account) }` + +### Supported Subquery Types and Correlations + +- **`ARRAY` Subquery**: Fully Supported. Evaluates the query block and returns + an array of the results. +- **`VALUE` Subquery**: Partially Supported. Evaluates the internal query and + returns a single scalar value. **Limitation**: `VALUE` subqueries throw + errors when correlated variables from the outer block are referenced inside + the `VALUE` subquery. +- **`EXISTS`, `IN`, `LIKE` Subqueries**: Partially Supported. **Limitation**: + Throw errors when correlated variables are used. Throw errors when used in + `WHERE` filter (Must use `FILTER`). + +## Query Optimization and Best Practices + +Performance is a key consideration for highly connected BigQuery graphs. Adhere +to these principles whenever writing GQL statements to ensure optimal execution. + +### 1. Start Traversals From Low-Cardinality Nodes + +Always write your path traversals so they originate from the lowest cardinality +nodes (the most specific entities). This drastically reduces the intermediate +result set sizes and speeds up execution, especially for variable-length +traversals. + +- **Example**: Instead of starting from a highly active `Account` node and + traversing backwards to find the owner, start with the specific `Person` + node and traverse forward. +- **Filter Early**: Push specific properties (e.g., `Account {id: 7}`) as + early as possible in your `MATCH` clause to prune the search space + immediately. + +### 2. Specify Labels Explicitly + +You must explicitly provide node and edge labels when they are known (e.g., +`(a:Account)-[:Transfers]->(b:Account)`). + +While BigQuery attempts to infer labels from query usage, if inference fails or +labels are omitted, the engine is forced to perform full table scans over +multiple distinct underlying node/edge tables. + +### 3. 
Avoid Bi-directional Graph Traversals + +BigQuery Graph schema physical implementations are directional. You should +always specify a source and destination node for an edge (using `->` or `<-`). + +Although query pattern syntax allows for bidirectional or undirected path +traversal (`(node)-[edge]-(node)`), doing so incurs a severe implicit +performance penalty. + +If you need to find an edge between two specific nodes regardless of direction, +**DO NOT** use a bidirectional pattern. Instead, use explicit directional +traversals combined with `UNION ALL`: + +**GOOD:** `sql GRAPH .. MATCH (a1:Account +{id:10})-[t:Transfer]->(a2:Account {id: 20}) RETURN t UNION ALL MATCH +(a2:Account{id: 20})-[t:Transfer]->(a1:Account {id: 10}) RETURN t` + +### 4. Prefer Single MATCH Statements + +When possible without sacrificing readability or violating logic intent, prefer +composing a single comprehensive `MATCH` statement over chaining multiple +individual `MATCH` statements. A single statement allows the query optimizer a +wider global view of the graph pattern, often leading to better execution plans. diff --git a/skills/developing-with-bigquery/references/graph/semantic_queries.md b/skills/developing-with-bigquery/references/graph/semantic_queries.md new file mode 100644 index 0000000..5c70605 --- /dev/null +++ b/skills/developing-with-bigquery/references/graph/semantic_queries.md @@ -0,0 +1,65 @@ +# Semantic Graph Specific Rules + +1. **Query the Flattened View:** Always query from the semantic graph using the + `GRAPH_EXPAND` table-valued function (TVF). The argument to `GRAPH_EXPAND` + should be the full graph name string (e.g., + "project_id.dataset_id.property_graph_id"). + + * **CRITICAL RULE:** The semantic graph is NOT a regular table, even + though its schema may be presented using `CREATE TABLE`. It is a graph. + You **MUST NEVER** query it directly as a table (e.g., `FROM + my_project.my_dataset.my_graph`). 
+ * **CRITICAL FALLBACK RULE:** If a query using `GRAPH_EXPAND` fails (e.g., + due to syntax errors or system limits), **DO NOT** attempt to fallback + to querying it as a standard table. Doing so will result in a critical + `NOT_FOUND` error. + * The semantic graph is a virtual flattened view of the graph, which is + optimized for data analysis and answering questions. + + ```sql + SELECT ... + FROM GRAPH_EXPAND("project_id.dataset_id.property_graph_id") + WHERE ... + ``` + +2. **Querying Measures:** Columns marked with `is_measure=TRUE` in the schema + (e.g., `Customer_customer_count INT64 OPTIONS(is_measure=TRUE)`) are measure + columns. You **MUST** query these columns using the `AGG()` function. + + * **Syntax:** `AGG()` + * **Example:** + + ```sql + -- Given Schema: + -- CREATE TABLE `my_project.my_dataset.my_graph` ( + -- Customer_name STRING, + -- Customer_total_orders INT64 OPTIONS(is_measure=TRUE), + -- Product_name STRING + -- ); + + -- Querying the measure: + SELECT + Customer_name, + AGG(Customer_total_orders) AS total_orders + FROM GRAPH_EXPAND("my_project.my_dataset.my_graph") + GROUP BY Customer_name; + ``` + + * Do not apply other aggregation functions like `SUM`, `AVG`, etc. + directly to measure columns. Use `AGG()` instead. + +3. **Prefer Measures (AGG) over Standard SQL Aggregations:** You **MUST** + prioritize using pre-defined measures (columns with `is_measure=TRUE`) over + writing standard SQL aggregations (like `COUNT(DISTINCT ...)`, `SUM`, etc.) + whenever a relevant measure is available in the schema. + + * **Context:** Semantic graphs define business logic within measures to + ensure accuracy and prevent issues like overcounting. Generating + aggregations via standard SQL bypasses this logic. 
+ * **Example Scenario:** If the user asks for the "total number of + entities", and the schema provides an `Entity_id` column as well as a + measure column `Entity_count INT64 OPTIONS(is_measure=TRUE)`: + * **INCORRECT (Standard SQL):** `sql SELECT COUNT(DISTINCT Entity_id) + AS total_entities ...` + * **CORRECT (Measure):** `sql SELECT AGG(Entity_count) AS + total_entities ...` diff --git a/skills/developing-with-bigquery/references/OPTIMIZATION.md b/skills/developing-with-bigquery/references/sql/OPTIMIZATION.md similarity index 100% rename from skills/developing-with-bigquery/references/OPTIMIZATION.md rename to skills/developing-with-bigquery/references/sql/OPTIMIZATION.md