
Commit 2939825

Merge branch 'main' into notebook-mcp-final
2 parents 0977718 + 92a912b commit 2939825

7 files changed

Lines changed: 336 additions & 26 deletions


.release-please-manifest.json

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -1,3 +1,3 @@
11
{
2-
".": "0.1.1"
2+
".": "0.1.3"
33
}

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -1,5 +1,19 @@
11
# Changelog
22

3+
## [0.1.3](https://github.com/gemini-cli-extensions/data-cloud-ai-dev-kit/compare/0.1.2...0.1.3) (2026-04-08)
4+
5+
6+
### Bug Fixes
7+
8+
* add missing notebook_guidance skill and fix data-autocleaning markdown formatting ([#6](https://github.com/gemini-cli-extensions/data-cloud-ai-dev-kit/issues/6)) ([3461cd9](https://github.com/gemini-cli-extensions/data-cloud-ai-dev-kit/commit/3461cd9f044e94cf89b5e90513ebf945e61f8863))
9+
10+
## [0.1.2](https://github.com/gemini-cli-extensions/data-cloud-ai-dev-kit/compare/0.1.1...0.1.2) (2026-04-08)
11+
12+
13+
### Bug Fixes
14+
15+
* set correct plugin path in Codex installer script ([#5](https://github.com/gemini-cli-extensions/data-cloud-ai-dev-kit/issues/5)) ([64195b6](https://github.com/gemini-cli-extensions/data-cloud-ai-dev-kit/commit/64195b605d882cf7a1e634655bc61813b93e650a))
16+
317
## [0.1.1](https://github.com/gemini-cli-extensions/data-cloud-ai-dev-kit/compare/0.1.0...0.1.1) (2026-04-07)
418

519

codex-install.sh

Lines changed: 2 additions & 1 deletion
@@ -53,7 +53,8 @@ data.plugins = data.plugins || [];
5353
data.plugins = data.plugins.filter(p => p.name !== '${PLUGIN_NAME}');
5454
data.plugins.push({
5555
name: '${PLUGIN_NAME}',
56-
source: { source: 'local', path: './${PLUGIN_NAME}' },
56+
interface: { displayName: 'Google Data Cloud AI Dev Kit' },
57+
source: { source: 'local', path: './.agents/plugins/${PLUGIN_NAME}' },
5758
policy: { installation: 'AVAILABLE', authentication: 'ON_INSTALL' },
5859
category: 'Productivity'
5960
});
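The filter-then-push sequence in this hunk makes registration idempotent: any existing entry with the same name is dropped before the new one is appended, so re-running the installer should not create duplicates. A minimal Python sketch of the same pattern follows. The `upsert_plugin` helper and the `example-plugin` name are hypothetical stand-ins for the inline script and `${PLUGIN_NAME}`; the entry fields mirror the ones in the diff.

```python
def upsert_plugin(data: dict, entry: dict) -> dict:
    """Replace any plugin sharing entry['name'], then append the new entry."""
    plugins = data.get("plugins") or []
    # Drop stale entries for this plugin so repeated runs do not duplicate it.
    plugins = [p for p in plugins if p.get("name") != entry["name"]]
    plugins.append(entry)
    data["plugins"] = plugins
    return data


# Hypothetical plugin name; the real script substitutes ${PLUGIN_NAME}.
name = "example-plugin"
entry = {
    "name": name,
    "interface": {"displayName": "Google Data Cloud AI Dev Kit"},
    "source": {"source": "local", "path": f"./.agents/plugins/{name}"},
    "policy": {"installation": "AVAILABLE", "authentication": "ON_INSTALL"},
    "category": "Productivity",
}

# Start from a registry that still holds the old, incorrect path.
registry = {"plugins": [{"name": name, "source": {"source": "local", "path": f"./{name}"}}]}
upsert_plugin(registry, entry)
upsert_plugin(registry, entry)  # a second run leaves exactly one entry
```

After both calls the registry holds a single entry for the plugin, pointing at the corrected `./.agents/plugins/` path.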

gemini-extension.json

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
11
{
22
"name": "data-cloud-ai-dev-kit",
3-
"version": "0.1.1",
3+
"version": "0.1.3",
44
"description": "This plugin provides a specialized suite of skills for data engineers and database practitioners working on Google Cloud. It acts as an expert assistant, allowing you to use natural language prompts in your preferred coding agent to architect complex data pipelines, transform data with dbt, write Spark and BigQuery SQL notebooks, and orchestrate end-to-end workflows across GCP's data ecosystem."
55
}

skills/data-autocleaning/SKILL.md

Lines changed: 28 additions & 23 deletions
@@ -121,21 +121,22 @@ Perform these checks **before** generating the `implementation_plan.md`.
121121

122122
#### Data Cleaning Rules
123123

124-
| Category | Rule |
125-
| --- | --- |
126-
| **Garbage Values** | Drop or convert to `NULL` only for malformed data (e.g., unparseable dates, zero-length strings for non-nullable integers). |
127-
| **Unit Normalization** | Standardize measurable units (e.g., `'C'``'F'`) to the most common unit. If units are too varied (e.g., `mg`, `liter`), leave as-is. |
128-
| **Type Conversion** | Use `COALESCE` with `SAFE.PARSE_*` functions for multiple date/time/datetime/timestamp formats. Fetch diverse samples when source data shows high variance. |
124+
- **Garbage Values**: Drop or convert to `NULL` only for malformed data (e.g., unparseable dates, zero-length strings for non-nullable integers).
125+
- **Unit Normalization**: Standardize measurable units (e.g., `'C'``'F'`) to the most common unit. If units are too varied (e.g., `mg`, `liter`), leave as-is.
126+
- **Type Conversion**: Use `COALESCE` with `SAFE.PARSE_*` functions for multiple date/time/datetime/timestamp formats. Fetch diverse samples when source data shows high variance.
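The Type Conversion rule can be illustrated with a small helper that assembles the `COALESCE` expression from a list of candidate formats. This is only a sketch: `build_safe_date_parse`, the `raw_date` column, and the two format strings are hypothetical, and the column name is assumed to be a trusted identifier rather than user input.

```python
from typing import Sequence


def build_safe_date_parse(column: str, formats: Sequence[str]) -> str:
    """Return a BigQuery COALESCE(...) expression that tries each format in order."""
    if not formats:
        raise ValueError("at least one format string is required")
    # SAFE.PARSE_DATE yields NULL instead of an error when a format does not match,
    # so COALESCE falls through to the next candidate.
    attempts = [f"SAFE.PARSE_DATE('{fmt}', {column})" for fmt in formats]
    return "COALESCE(\n  " + ",\n  ".join(attempts) + "\n)"


# Hypothetical column and formats; choose formats from sampled source values.
expression = build_safe_date_parse("raw_date", ["%Y-%m-%d", "%m/%d/%Y"])
print(expression)
```

If none of the formats match a value, every branch is `NULL` and so is the result, which is the case the Garbage Values rule covers.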
129127

130128
#### JSON Data Handling
131129

132-
| Rule | Detail |
133-
| --- | --- |
134-
| **Parsing** | Use `SAFE.PARSE_JSON` to cast JSON strings to `JSON` type. **Never** use deprecated `JSON_EXTRACT_*`. |
135-
| **Extraction** | Flatten or extract fields **only** if a destination schema requires it. |
136-
| **Accessors** | Use `JSON_VALUE`, `JSON_QUERY`, `JSON_QUERY_ARRAY`, `JSON_VALUE_ARRAY` without `SAFE.` prefix (they are safe by default). |
137-
| **Schema mapping** | When a destination schema is provided, extract JSON fields to match target column names and types. |
138-
| **NULL handling** | If `SAFE.PARSE_JSON` returns NULL, keep the original string and note the invalid JSON in the cleaning summary. |
130+
- **Parsing**: Use `SAFE.PARSE_JSON` to cast JSON strings to `JSON` type.
131+
**Never** use deprecated `JSON_EXTRACT_*`.
132+
- **Extraction**: Flatten or extract fields **only** if a destination schema
133+
requires it.
134+
- **Accessors**: Use `JSON_VALUE`, `JSON_QUERY`, `JSON_QUERY_ARRAY`,
135+
`JSON_VALUE_ARRAY` without `SAFE.` prefix (they are safe by default).
136+
- **Schema mapping**: When a destination schema is provided, extract JSON
137+
fields to match target column names and types.
138+
- **NULL handling**: If `SAFE.PARSE_JSON` returns NULL, keep the original
139+
string and note the invalid JSON in the cleaning summary.
139140

140141
#### Array Data Handling
141142

@@ -182,10 +183,12 @@ Perform these checks **before** generating the `implementation_plan.md`.
182183
5. **Compare profiles** (Skip if scans were denied) — Check the new profile
183184
against the Step 1 profile for **every transformed column**:
184185
185-
Anomaly Type | Threshold
186-
--------------------- | -------------------------------------------------
187-
**NULL increase** | >1% increase compared to source (unless expected)
188-
**Value range shift** | Unexpected ranges or formats
186+
```markdown
187+
| Anomaly Type | Threshold |
188+
| --- | --- |
189+
| **NULL increase** | >1% increase compared to source (unless expected) |
190+
| **Value range shift** | Unexpected ranges or formats |
191+
```
189192
190193
6. **Iterate on anomalies** — For each anomaly:
191194
@@ -212,14 +215,16 @@ here instead of Job IDs.
212215
213216
### Step 4: Documentation
214217
215-
Your `walkthrough.md` must contain the following for each transformation:
218+
Your `walkthrough.md` must contain a table for each transformation in the following format:
216219
217-
Field | Description
218-
--------------------------------- | -----------------------------------------
219-
**Destination schema considered** | The target column/type being matched
220-
**Issue Detected** | What data quality problem was found
221-
**Transformation Applied** | The SQL logic used to fix it
222-
**Benefit** | Why this transformation improves the data
220+
```markdown
221+
| Field | Description |
222+
| --- | --- |
223+
| **Destination schema considered** | The target column/type being matched |
224+
| **Issue Detected** | What data quality problem was found |
225+
| **Transformation Applied** | The SQL logic used to fix it |
226+
| **Benefit** | Why this transformation improves the data |
227+
```
223228
224229
Include a summary of all quality review steps and profiling evidence.
225230
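The **NULL increase** threshold from the profile comparison step could be checked mechanically along the following lines. This is a sketch under stated assumptions: the profile shape (column name mapped to null count and row count) is hypothetical, and ">1% increase" is read here as a rise of more than one percentage point in the null fraction, which the skill text does not spell out.

```python
from typing import Dict, List, Tuple

# Hypothetical profile shape: column name -> (null_count, row_count).
Profile = Dict[str, Tuple[int, int]]


def null_fraction(null_count: int, row_count: int) -> float:
    """Fraction of NULL values, treating an empty table as having none."""
    return null_count / row_count if row_count else 0.0


def flag_null_increases(source: Profile, cleaned: Profile, threshold: float = 0.01) -> List[str]:
    """Return columns whose null fraction rose by more than `threshold`."""
    flagged = []
    for column, (new_nulls, new_rows) in cleaned.items():
        if column not in source:
            continue  # no baseline to compare against
        old_nulls, old_rows = source[column]
        increase = null_fraction(new_nulls, new_rows) - null_fraction(old_nulls, old_rows)
        if increase > threshold:
            flagged.append(column)
    return flagged


source_profile = {"order_date": (0, 200), "amount": (10, 200)}
cleaned_profile = {"order_date": (12, 200), "amount": (10, 200)}
anomalies = flag_null_increases(source_profile, cleaned_profile)
```

Here only `order_date` is flagged, because its null fraction rises from 0% to 6% while `amount` is unchanged. Whether an increase is "expected" still needs human judgment.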

skills/notebook-guidance/SKILL.md

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
1+
---
2+
name: notebook-guidance
3+
description: |-
4+
This skill guides the use of Jupyter notebooks for data analysis, exploration, and visualization, particularly with BigQuery. It outlines best practices for cell-by-cell execution and validation, library installation, and structuring notebooks for clarity. It also covers specific rules for data cleaning, plotting, and integrating with BigQuery SQL and machine learning workflows.
5+
Relevant when any of the following conditions are true:
6+
1. The user request involves a data analysis, data exploration, data visualization, or data insights task that requires multiple steps, queries, or visualizations to answer.
7+
2. The user explicitly requests a notebook (.ipynb).
8+
3. You are creating, editing, or executing cells in a Jupyter notebook.
9+
4. You need to query BigQuery from within a notebook. DO NOT use the Python BigQuery client library; instead, you MUST use the BigQuery SQL cells feature explained in this skill.
10+
license: TBD
11+
metadata:
12+
version: v1
13+
publisher: google
14+
---
15+
16+
# Notebook Guidance
17+
18+
## When to Use a Notebook
19+
20+
Before choosing to use a notebook, evaluate the task complexity using these
21+
heuristics.
22+
23+
Use a notebook if you meet at least one of these criteria:
24+
25+
* 📈 **Data Insights & Storytelling**: Use a notebook for any request to "give insights", "find trends", "explore data", or "analyze data".
26+
These tasks benefit from using visualizations to present the data.
27+
* 📊 **Visualizations are requested**: The user explicitly asks for charts or plots.
28+
* 🔄 **Stateful / Iterative Exploration**: You need to run a query, inspect results,
29+
and decide the next query based on those results while keeping state in memory.
30+
31+
Skip the notebook ONLY if the request is one of the following:
32+
33+
* 📝 **Simple Fact/Status**: The request only requires a single number (e.g., "how many rows") or a status check (e.g., "when was this table updated").
34+
* 🏃‍♂️ **Schema Preview**: The request is only about the schema or field types.
35+
36+
**Golden Rule of Data Storytelling:** If any analytical insight, trend, or
37+
comparison is involved, favor a notebook and a visualization. A notebook is the
38+
"standard" environment for our developer workflow; do not avoid it because of
39+
"overhead".
40+
41+
## Notebook Best Practices
42+
43+
The golden rule: **STEP BY STEP GENERATE CELL -> EXECUTE CELL -> VALIDATE
44+
OUTPUT**, do not generate the entire notebook all at once.
45+
46+
1. **EXECUTE-AND-VALIDATE LOOP**: Generate ONE cell, execute it, then verify
47+
the output. If the output is data (e.g. a dataframe), you MUST inspect it to
48+
confirm the logic is correct before generating the next step. Batch
49+
generation of an entire notebook is strictly prohibited because error
50+
propagation in notebooks is expensive to fix.
51+
2. **IDENTIFY DATA EARLY**: Use `@skill:discovering-gcp-data-assets` or
52+
BigQuery list tools to find the correct `project.dataset.table` before
53+
writing ANY code. If the table ID is missing, ask the user.
54+
3. **CLEAN FINAL STATE**: The final notebook MUST NOT have failed cells. If a
55+
cell fails, you MUST fix it. If you tried several versions, delete the
56+
failed attempts before you present the notebook to the user.
57+
4. **LOGICAL CHUNK FIDELITY**: Keep cells small. One logical transformation or
58+
visualization per cell. Group related cells into logical units (e.g., a
59+
BigQuery SQL query cell followed immediately by a Python visualization cell
60+
for those results). Use descriptive **markdown cells** to separate and
61+
document different logical sections.
62+
5. **GENERATE VISUALIZATIONS**: Always accompany data insights with
63+
visualizations; charts are often more effective than raw numbers for
64+
communicating trends and comparisons.
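The execute-and-validate loop can be pictured with a toy driver that runs one cell at a time and refuses to continue past a cell whose output fails a check. This is an illustration only: `run_step_by_step` is hypothetical, and Python's built-in `exec` stands in for sending a cell to a real kernel.

```python
from typing import Callable, Dict, List


def run_step_by_step(cells: List[str], validate: Callable[[int, Dict], bool]) -> Dict:
    """Execute cells in order in one shared namespace, validating after each."""
    namespace: Dict = {}
    for index, source in enumerate(cells):
        exec(source, namespace)  # stand-in for executing the cell on a kernel
        if not validate(index, namespace):
            # Stop here so a bad result does not propagate into later cells.
            raise RuntimeError(f"cell {index} failed validation; fix it before continuing")
    return namespace


cells = [
    "rows = [{'user': 'a', 'visits': 3}, {'user': 'b', 'visits': 5}]",
    "total_visits = sum(row['visits'] for row in rows)",
]


def validate(index: int, namespace: Dict) -> bool:
    """Hypothetical per-cell checks: inspect the data before moving on."""
    if index == 0:
        return len(namespace["rows"]) == 2
    return namespace["total_visits"] > 0


state = run_step_by_step(cells, validate)
```

The point of the design is that the second cell is never run unless the first cell's data has been inspected and accepted.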
65+
66+
## Kernel & Environment Management
67+
68+
Notebooks run in specific **Kernels** (execution backends). You MUST ensure the
69+
kernel’s Python environment contains the necessary libraries (`bigframes`,
70+
`ipykernel`, etc.).
71+
72+
### Kernel Types
73+
74+
1. **Local Python**: Standard Python 3 kernel running on the notebook host
75+
(Managed instance, local machine).
76+
2. **Cloud Spark Remote (Dataproc Serverless)**: Transient Spark environment
77+
managed by GCP. Use for large-scale data processing.
78+
3. **Cloud Spark Remote (Dataproc Cluster)**: Persistent Spark clusters for
79+
shared or custom configurations.
80+
4. **Colab (Managed)**: Ephemeral Google-managed runtimes.
81+
82+
### No Active Kernel / Setup Check
83+
84+
1. **Infer or Ask about Kernel Preferences**:
85+
- **Infer from Context**:
86+
- If the task mentions "Spark", "PySpark", or "distributed compute",
87+
or if the active workspace is already a Spark cluster, lean towards
88+
**Remote Spark**.
89+
- If the task is focused on "BigQuery", "BigFrames", or standard API
90+
calls, lean towards **Local Python**.
91+
- **Ask when Ambiguous**: If multiple options fit, ask if they prefer a
92+
**Local Python** or a **Cloud/Remote Kernel** (e.g., Colab, Spark).
93+
2. **For Local Setup**: Use `@skill:managing-python-dependencies` to verify
94+
if a virtual environment exists. If not, create one. Ensure `ipykernel` is
95+
installed in that environment. Install any other relevant libraries.
96+
3. **For Remote Setup**: Advise the user to use the UI to select the
97+
appropriate remote kernel.
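The infer-or-ask rule above can be sketched as a simple keyword check. The hint lists and the `infer_kernel` function are hypothetical; they encode only the keywords the skill names and fall back to asking the user whenever the signals conflict or are absent.

```python
# Keywords taken from the skill text; matching is case-insensitive.
# "pyspark" is covered by the "spark" substring.
SPARK_HINTS = ("spark", "distributed compute")
LOCAL_HINTS = ("bigquery", "bigframes")


def infer_kernel(task: str) -> str:
    """Return a kernel preference, or a request to ask when it is ambiguous."""
    text = task.lower()
    wants_spark = any(hint in text for hint in SPARK_HINTS)
    wants_local = any(hint in text for hint in LOCAL_HINTS)
    if wants_spark and not wants_local:
        return "Remote Spark"
    if wants_local and not wants_spark:
        return "Local Python"
    return "ask the user"


choices = [
    infer_kernel("Run a PySpark job over the event logs"),
    infer_kernel("Plot monthly sales from BigQuery"),
    infer_kernel("Join BigQuery tables with Spark"),
]
```

A real agent would also consider the active workspace, which this sketch ignores.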
98+
99+
### Proper Library Installation
100+
101+
#### 1. Local Kernels
102+
103+
Before installing any Python libraries, you MUST use
104+
`@skill:managing-python-dependencies` to detect how Python dependencies are
105+
managed in the project.
106+
107+
#### 2. Remote Kernels (Spark/Colab)
108+
109+
Since these are often ephemeral or managed by GCP:
110+
111+
* Use `%pip install <package>` in the first cell if it's the only way to
112+
modify the runtime.
113+
* Check if the library is already available in the pre-installed stack.
114+
115+
When in doubt about the kernel type or preferred installation method, ask the
116+
user for clarification.
117+
118+
## Data Analysis & Visualization Rules
119+
120+
Guidelines for performing exploratory data analysis, data cleaning, and
121+
visualization in notebooks.
122+
123+
### Notebook Layout
124+
125+
The notebook should read like a story. While you have flexibility (e.g.,
126+
multiple visualizations for one data cell, or data cells building on each
127+
other), aim for this general flow:
128+
129+
1. **Title & Objective** (Markdown Cell)
130+
* What is this notebook for? (e.g., `# Retention Analysis`)
131+
2. **Section Header** (Markdown Cell)
132+
* What are we looking at now? (e.g., `## Exploring User Retention`)
133+
3. **Data Acquisition/Transformation** (SQL or Python Cell)
134+
* Query BigQuery or transform data.
135+
4. **Verification (Optional but Recommended)** (Python Cell)
136+
* `df.head()` or assert sanity checks.
137+
5. **Visualization (The Goal)** (Python Cell)
138+
* Plot the insight (e.g., `df.plot()`).
139+
140+
*Repeat steps 2-5 for each new sub-topic or insight. You can have multiple Data
141+
cells before a Visualization, or multiple Visualizations from one Data cell. The
142+
key is to keep them grouped logically and separated by Markdown headers.*
143+
144+
6. **Final Summary** (Markdown Cell)
145+
146+
* At the end of the notebook, add a markdown cell containing a summary
147+
paragraph that presents the findings to the user. The summary MUST
148+
follow these guidelines:
149+
* You MUST NOT add Python code to the summary.
150+
* The summary MUST NOT start with a code block.
151+
* The summary MUST be strictly grounded in the numerical data verified in
152+
the notebook.
153+
* The summary MUST ONLY contain the following three sections:
154+
* `### Q&A`: If the data analysis task contains questions (implied or
155+
explicit), you MUST answer them based on the solving process. Skip
156+
this section if there are no questions to answer.
157+
* `### Data Analysis Key Findings`: Summarize the key analysis findings
158+
in bullet points; quoting the numbers from the previous steps is a
159+
plus. Only report high-value findings and skip the obvious ones.
160+
* `### Insights or Next Steps`: Provide 1-2 concise insights or next
161+
steps in bullet points.
162+
163+
7. **Next Steps**: After you have generated and executed the entire
164+
notebook successfully, and the summary is complete, notify the user and
165+
propose next steps.
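The structural rules for the final summary lend themselves to a quick automated check. The sketch below is hypothetical (`check_summary` is not part of the skill) and covers only the mechanical rules: no code blocks and no sections other than the three allowed ones. Whether the summary is grounded in verified numbers still has to be judged by reading it.

```python
import re
from typing import List

FENCE = "`" * 3  # three backticks, built indirectly so this example stays fence-safe
ALLOWED_SECTIONS = {"Q&A", "Data Analysis Key Findings", "Insights or Next Steps"}
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def check_summary(markdown: str) -> List[str]:
    """Return a list of rule violations; an empty list means no issue was found."""
    problems = []
    if markdown.lstrip().startswith(FENCE):
        problems.append("summary starts with a code block")
    elif FENCE in markdown:
        problems.append("summary contains a code block")
    for title in HEADING.findall(markdown):
        if title not in ALLOWED_SECTIONS:
            problems.append(f"unexpected section: {title}")
    return problems


good = (
    "### Data Analysis Key Findings\n"
    "* Week-4 retention is lower for the newest cohort.\n\n"
    "### Insights or Next Steps\n"
    "* Segment the drop by acquisition channel.\n"
)
bad = FENCE + "python\nprint(1)\n" + FENCE + "\n### Extra\n"

good_problems = check_summary(good)
bad_problems = check_summary(bad)
```

The `Q&A` section is allowed but not required here, since the skill says to skip it when there are no questions to answer.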
166+
167+
### Plotting Rules
168+
169+
1. You MUST use different colors for different features to ensure plots are
170+
readable for humans.
171+
2. When creating a plot, you MUST adjust the figure size based on the number of
172+
features. The labels and legends MUST NOT overlap.
173+
3. You SHOULD arrange the layout wisely. Using subplots CAN help in placing
174+
different plots effectively.
175+
4. You MUST use inline figures to present figures and plots along with code and
176+
text in the notebook.
177+
5. For clustering, use PCA to reduce to 2D before scatter plotting.
178+
6. Use **Line Charts** ONLY for continuous data (e.g. time series) where
179+
interpolation between points is meaningful.
180+
181+
### Data Cleaning Rules
182+
183+
1. You MUST be careful about missing values and duplicated values.
184+
2. You MUST NOT drop columns unless absolutely necessary. Dropping columns is
185+
irreversible.
186+
3. You SHOULD focus on columns directly related to accomplishing the task; not
187+
every column NEEDS to be cleaned.
188+
189+
## Specialized Notebook Guidance
190+
191+
Refer to the following resources for guidance on specific notebook topics:
192+
193+
### 1. BigQuery in Notebooks
194+
195+
Standards for using BigQuery SQL in notebooks and accessing results in Python.
196+
197+
- **Guide**:
198+
[bigquery_sql_in_notebooks.md](resources/bigquery_sql_in_notebooks.md)
199+
and the BigQuery skills.
200+
- **MUST READ WHEN**: You are writing BigQuery SQL queries in a notebook or
201+
processing query results in Python.
202+
203+
### 2. Machine Learning in Notebooks
204+
205+
Integration with machine learning workflows and best practices.
206+
207+
- **Guide**: Use `@skill:ml-best-practices`.
208+
- **MUST READ WHEN**: The task involves machine learning, training a model, clustering, classification, regression, or time-series forecasting.
209+
210+
If any "MUST READ WHEN" condition is met, you MUST read the corresponding guide
211+
before proceeding.
