|
| 1 | +--- |
| 2 | +name: notebook-guidance |
| 3 | +description: |- |
| 4 | + This skill guides the use of Jupyter notebooks for data analysis, exploration, and visualization, particularly with BigQuery. It outlines best practices for cell-by-cell execution and validation, library installation, and structuring notebooks for clarity. It also covers specific rules for data cleaning, plotting, and integrating with BigQuery SQL and machine learning workflows. |
| 5 | + Relevant when any of the following conditions are true: |
| 6 | + 1. The user request involves a data analysis, data exploration, data visualization, or data insights task that requires multiple steps, queries, or visualizations to answer. |
| 7 | + 2. The user explicitly requests a notebook (.ipynb). |
| 8 | + 3. You are creating, editing, or executing cells in a Jupyter notebook. |
| 9 | + 4. You need to query BigQuery from within a notebook. DO NOT use the Python BigQuery client library; instead, you MUST use the BigQuery SQL cells feature explained in this skill. |
| 10 | +license: TBD |
| 11 | +metadata: |
| 12 | + version: v1 |
| 13 | + publisher: google |
| 14 | +--- |
| 15 | + |
| 16 | +# Notebook Guidance |
| 17 | + |
| 18 | +## When to Use a Notebook |
| 19 | + |
| 20 | +Before choosing to use a notebook, evaluate the task complexity using these |
| 21 | +heuristics. |
| 22 | + |
| 23 | +Use a notebook if you meet at least one of these criteria: |
| 24 | + |
| 25 | +* 📈 **Data Insights & Storytelling**: Use a notebook for any request to "give |
| 26 | + insights", "find trends", "explore data", or "analyze data". These tasks |
| 27 | + benefit from using visualizations to present the data. |
| 28 | +* 📊 **Visualizations are requested**: The user explicitly asks for charts or plots. |
| 29 | +* 🔄 **Stateful / Iterative Exploration**: You need to run a query, inspect results, and decide the next query based on those results while keeping state in memory. |
| 30 | + |
| 31 | +Skip the notebook ONLY if the request is one of the following: |
| 32 | + |
| 33 | +* 📝 **Simple Fact/Status**: The request only requires a single number (e.g., "how many rows") or a status check (e.g., "when was this table updated"). |
| 34 | +* 🏃‍♂️ **Schema Preview**: The request is only about the schema or field types. |
| 35 | + |
| 36 | +**Golden Rule of Data Storytelling:** If any analytical insight, trend, or |
| 37 | +comparison is involved, favor a notebook and a visualization. A notebook is the |
| 38 | +"standard" environment for our developer workflow; do not avoid it because of |
| 39 | +"overhead". |
| 40 | + |
| 41 | +## Notebook Best Practices |
| 42 | + |
| 43 | +The golden rule: **STEP BY STEP GENERATE CELL -> EXECUTE CELL -> VALIDATE |
| 44 | +OUTPUT**. Do not generate the entire notebook all at once. |
| 45 | + |
| 46 | +1. **EXECUTE-AND-VALIDATE LOOP**: Generate ONE cell, execute it, then verify |
| 47 | + the output. If the output is data (e.g. a dataframe), you MUST inspect it to |
| 48 | + confirm the logic is correct before generating the next step. Batch |
| 49 | + generation of an entire notebook is strictly prohibited because error |
| 50 | + propagation in notebooks is expensive to fix. |
| 51 | +2. **IDENTIFY DATA EARLY**: Use `@skill:discovering-gcp-data-assets` or |
| 52 | + BigQuery list tools to find the correct `project.dataset.table` before |
| 53 | + writing ANY code. If the table ID is missing, ask the user. |
| 54 | +3. **CLEAN FINAL STATE**: The final notebook MUST NOT have failed cells. If a |
| 55 | + cell fails, you MUST fix it. If you tried several versions, delete the |
| 56 | + failed attempts before you present the notebook to the user. |
| 57 | +4. **LOGICAL CHUNK FIDELITY**: Keep cells small. One logical transformation or |
| 58 | + visualization per cell. Group related cells into logical units (e.g., a |
| 59 | + BigQuery SQL query cell followed immediately by a Python visualization cell |
| 60 | + for those results). Use descriptive **markdown cells** to separate and |
| 61 | + document different logical sections. |
| 62 | +5. **GENERATE VISUALIZATIONS**: Always accompany data insights with |
| 63 | + visualizations; charts are often more effective than raw numbers for |
| 64 | + communicating trends and comparisons. |
| 65 | + |
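The "clean final state" rule above can be checked mechanically. The following is a minimal sketch, assuming the notebook is stored in the standard nbformat v4 JSON layout, where a failed code cell carries an output with `output_type` set to `"error"`. The helper name `failed_code_cells` and the tiny in-memory notebook are hypothetical and only serve the illustration.

```python
import json


def failed_code_cells(notebook: dict) -> list[int]:
    """Return the indices of code cells whose outputs include an error."""
    failed = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown cells have no outputs
        outputs = cell.get("outputs", [])
        if any(out.get("output_type") == "error" for out in outputs):
            failed.append(index)
    return failed


# Hypothetical three-cell notebook: a title, a failing cell, a passing cell.
notebook = json.loads("""
{
  "nbformat": 4, "nbformat_minor": 4, "metadata": {},
  "cells": [
    {"cell_type": "markdown", "metadata": {}, "source": "# Retention Analysis"},
    {"cell_type": "code", "metadata": {}, "execution_count": 1, "source": "1 / 0",
     "outputs": [{"output_type": "error", "ename": "ZeroDivisionError",
                  "evalue": "division by zero", "traceback": []}]},
    {"cell_type": "code", "metadata": {}, "execution_count": 2, "source": "1 + 1",
     "outputs": [{"output_type": "execute_result", "execution_count": 2,
                  "data": {"text/plain": "2"}, "metadata": {}}]}
  ]
}
""")

print(failed_code_cells(notebook))  # → [1]
```

Any index this returns points at a cell that should be fixed or deleted before the notebook is presented to the user.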
| 66 | +## Kernel & Environment Management |
| 67 | + |
| 68 | +Notebooks run in specific **Kernels** (execution backends). You MUST ensure the |
| 69 | +kernel’s Python environment contains the necessary libraries (`bigframes`, |
| 70 | +`ipykernel`, etc.). |
| 71 | + |
| 72 | +### Kernel Types |
| 73 | + |
| 74 | +1. **Local Python**: Standard Python 3 kernel running on the notebook host |
| 75 | + (e.g., a managed instance or a local machine). |
| 76 | +2. **Cloud Spark Remote (Dataproc Serverless)**: Transient Spark environment |
| 77 | + managed by GCP. Use for large-scale data processing. |
| 78 | +3. **Cloud Spark Remote (Dataproc Cluster)**: Persistent Spark clusters for |
| 79 | + shared or custom configurations. |
| 80 | +4. **Colab (Managed)**: Ephemeral Google-managed runtimes. |
| 81 | + |
| 82 | +### No Active Kernel / Setup Check |
| 83 | + |
| 84 | +1. **Infer or Ask about Kernel Preferences**: |
| 85 | + - **Infer from Context**: |
| 86 | + - If the task mentions "Spark", "PySpark", or "distributed compute", |
| 87 | + or if the active workspace is already a Spark cluster, lean towards |
| 88 | + **Remote Spark**. |
| 89 | + - If the task is focused on "BigQuery", "BigFrames", or standard API |
| 90 | + calls, lean towards **Local Python**. |
| 91 | + - **Ask when Ambiguous**: If multiple options fit, ask the user whether they prefer a |
| 92 | + **Local Python** or a **Cloud/Remote Kernel** (e.g., Colab, Spark). |
| 93 | +2. **For Local Setup**: Use `@skill:managing-python-dependencies` to verify |
| 94 | + if a virtual environment exists. If not, create one. Ensure `ipykernel` is |
| 95 | + installed in that environment. Install any other relevant libraries. |
| 96 | +3. **For Remote Setup**: Advise the user to use the UI to select the |
| 97 | + appropriate remote kernel. |
| 98 | + |
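The inference step above can be pictured as a small keyword heuristic. This is only a sketch: the function name `suggest_kernel`, the return labels, and the choice to ask the user when no keyword matches are assumptions made for illustration. It also ignores the "active workspace is already a Spark cluster" signal, which cannot be read from the task text.

```python
# Keywords taken from the guidance above; "spark" also matches "PySpark".
SPARK_HINTS = ("spark", "distributed compute")
LOCAL_HINTS = ("bigquery", "bigframes")


def suggest_kernel(task: str) -> str:
    """Return 'remote_spark', 'local_python', or 'ask_user' for a task description."""
    text = task.lower()
    wants_spark = any(hint in text for hint in SPARK_HINTS)
    wants_local = any(hint in text for hint in LOCAL_HINTS)
    if wants_spark and not wants_local:
        return "remote_spark"
    if wants_local and not wants_spark:
        return "local_python"
    # Both kinds of hint, or neither: treat the request as ambiguous.
    return "ask_user"


print(suggest_kernel("Run a PySpark job over the click logs"))  # → remote_spark
print(suggest_kernel("Explore the BigQuery sales table"))       # → local_python
print(suggest_kernel("Analyze this data"))                      # → ask_user
```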
| 99 | +### Proper Library Installation |
| 100 | + |
| 101 | +#### 1. Local Kernels |
| 102 | + |
| 103 | +Before installing any python libraries, you MUST use |
| 104 | +`@skill:managing-python-dependencies` to detect how python dependencies are |
| 105 | +managed in the project. |
| 106 | + |
| 107 | +#### 2. Remote Kernels (Spark/Colab) |
| 108 | + |
| 109 | +Since these are often ephemeral or managed by GCP: |
| 110 | + |
| 111 | +* Use `%pip install <package>` in the first cell if it's the only way to |
| 112 | + modify the runtime. |
| 113 | +* Check if the library is already available in the pre-installed stack. |
| 114 | + |
| 115 | +When in doubt about the kernel type or preferred installation method, ask the |
| 116 | +user for clarification. |
| 117 | + |
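One way to check whether a library is already in the pre-installed stack is to look it up without importing it. The sketch below uses the standard library's `importlib.util.find_spec`. The helper name `missing_packages` is hypothetical, and it expects top-level import names, which can differ from the package names used with `%pip install`.

```python
import importlib.util


def missing_packages(import_names: list[str]) -> list[str]:
    """Return the top-level import names that the current kernel cannot find."""
    return [
        name for name in import_names
        if importlib.util.find_spec(name) is None
    ]


# "bigframes" is mentioned in this guide; "json" ships with Python.
# The result depends on what is installed in the running kernel.
print(missing_packages(["json", "bigframes"]))
```

Only the names returned here would need an install step in the first cell.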
| 118 | +## Data Analysis & Visualization Rules |
| 119 | + |
| 120 | +Guidelines for performing exploratory data analysis, data cleaning, and |
| 121 | +visualization in notebooks. |
| 122 | + |
| 123 | +### Notebook Layout |
| 124 | + |
| 125 | +The notebook should read like a story. While you have flexibility (e.g., |
| 126 | +multiple visualizations for one data cell, or data cells building on each |
| 127 | +other), aim for this general flow: |
| 128 | + |
| 129 | +1. **Title & Objective** (Markdown Cell) |
| 130 | + * What is this notebook for? (e.g., `# Retention Analysis`) |
| 131 | +2. **Section Header** (Markdown Cell) |
| 132 | + * What are we looking at now? (e.g., `## Exploring User Retention`) |
| 133 | +3. **Data Acquisition/Transformation** (SQL or Python Cell) |
| 134 | + * Query BigQuery or transform data. |
| 135 | +4. **Verification (Optional but Recommended)** (Python Cell) |
| 136 | + * `df.head()` or assert sanity checks. |
| 137 | +5. **Visualization (The Goal)** (Python Cell) |
| 138 | + * Plot the insight (e.g., `df.plot()`). |
| 139 | + |
| 140 | +*Repeat steps 2-5 for each new sub-topic or insight. You can have multiple Data |
| 141 | +cells before a Visualization, or multiple Visualizations from one Data cell. The |
| 142 | +key is to keep them grouped logically and separated by Markdown headers.* |
| 143 | + |
| 144 | +6. **Final Summary** (Markdown Cell) |
| 145 | + |
| 146 | + * At the end of the notebook, add a markdown cell containing a summary of |
| 147 | + the findings for the user. The summary MUST follow these |
| 148 | + guidelines: |
| 149 | + * The summary MUST NOT contain Python code. |
| 150 | + * The summary MUST NOT start with a code block. |
| 151 | + * The summary MUST be strictly grounded in the numerical data verified in |
| 152 | + the notebook. |
| 153 | + * The summary MUST ONLY contain the following three sections: |
| 154 | + * `### Q&A`: If the data analysis task contains questions (implied or |
| 155 | + explicit), you MUST answer them based on the solving process. Skip |
| 156 | + this section if there are no questions to answer. |
| 157 | + * `### Data Analysis Key Findings`: Summarize the key analysis findings |
| 158 | + in bullet points; quoting the numbers from the previous steps is a |
| 159 | + plus. Only report high-value findings and skip the obvious ones. |
| 160 | + * `### Insights or Next Steps`: Provide 1-2 concise insights or next |
| 161 | + steps in bullet points. |
| 162 | + |
| 163 | +7. **Next Steps**: After you have generated and executed the entire |
| 164 | + notebook successfully and the summary is complete, notify the user and |
| 165 | + propose next steps. |
| 166 | + |
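The cell ordering described in this section can be written down as a plain nbformat v4 structure. The sketch below only illustrates the order of cell types. The helper names and placeholder sources are hypothetical, and in practice each cell is still generated, executed, and validated one at a time rather than emitted as a batch.

```python
import json


def markdown_cell(source: str) -> dict:
    """Build a minimal nbformat v4 markdown cell."""
    return {"cell_type": "markdown", "metadata": {}, "source": source}


def code_cell(source: str) -> dict:
    """Build a minimal, not-yet-executed nbformat v4 code cell."""
    return {"cell_type": "code", "metadata": {}, "execution_count": None,
            "outputs": [], "source": source}


cells = [
    markdown_cell("# Retention Analysis"),          # 1. title and objective
    markdown_cell("## Exploring User Retention"),   # 2. section header
    code_cell("# placeholder: acquire or transform data into df"),  # 3. data
    code_cell("df.head()"),                         # 4. verification
    code_cell("df.plot()"),                         # 5. visualization
    markdown_cell(                                  # 6. final summary
        "### Q&A\n\n### Data Analysis Key Findings\n\n### Insights or Next Steps"
    ),
]

notebook = {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
notebook_json = json.dumps(notebook, indent=1)
```

Steps 2 to 5 would repeat for each additional sub-topic before the final summary cell.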
| 167 | +### Plotting Rules |
| 168 | + |
| 169 | +1. You MUST use different colors for different features to ensure plots are |
| 170 | + readable for humans. |
| 171 | +2. When creating a plot, you MUST adjust the figure size based on the number of |
| 172 | + features. The labels and legends MUST NOT overlap. |
| 173 | +3. You SHOULD arrange the layout wisely. Using subplots CAN help in placing |
| 174 | + different plots effectively. |
| 175 | +4. You MUST use inline figures to present figures and plots along with code and |
| 176 | + text in the notebook. |
| 177 | +5. For clustering, use PCA to reduce to 2D before scatter plotting. |
| 178 | +6. Use **Line Charts** ONLY for continuous data (e.g. time series) where |
| 179 | + interpolation between points is meaningful. |
| 180 | + |
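Rule 5 above (reduce to 2D with PCA before a scatter plot) can be sketched with NumPy alone. This assumes `numpy` is available in the kernel. The helper name `pca_2d` and the random `features` array are placeholders, and a library implementation such as scikit-learn's `PCA` is an equally reasonable choice.

```python
import numpy as np


def pca_2d(features: np.ndarray) -> np.ndarray:
    """Project an (n_samples, n_features) array onto its top two principal components."""
    centered = features - features.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Placeholder data: 100 samples with 5 features each.
rng = np.random.default_rng(seed=0)
features = rng.normal(size=(100, 5))

coords = pca_2d(features)
print(coords.shape)  # → (100, 2)
```

The two columns of `coords` can then serve as the x and y values of the scatter plot, with one color per cluster label.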
| 181 | +### Data Cleaning Rules |
| 182 | + |
| 183 | +1. You MUST check for missing values and duplicated values and handle them deliberately. |
| 184 | +2. You MUST NOT drop columns unless absolutely necessary. Dropping columns is |
| 185 | + irreversible. |
| 186 | +3. You SHOULD focus on columns directly related to accomplishing the task; not |
| 187 | + every column NEEDS to be cleaned. |
| 188 | + |
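A minimal sketch of these cleaning rules with pandas, assuming `pandas` is installed in the kernel. The small `df` and its column names are made-up sample data. Missing values and duplicates are counted first, and the cleaned result goes into a new frame that keeps every column.

```python
import pandas as pd

# Hypothetical sample data with one duplicated row and two missing values.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "country": ["US", "DE", "DE", None],
    "spend": [10.0, 5.0, 5.0, None],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = df.isna().sum()

# Count fully duplicated rows (the first occurrence is not counted).
duplicate_rows = int(df.duplicated().sum())

# Clean into a new frame: keep every column, drop only exact duplicate rows.
cleaned = df.drop_duplicates().reset_index(drop=True)

print(missing_per_column.to_dict())  # → {'user_id': 0, 'country': 1, 'spend': 1}
print(duplicate_rows)                # → 1
print(cleaned.shape)                 # → (3, 3)
```

Because `df` itself is left untouched, the original data stays available if a later step turns out to need it.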
| 189 | +## Specialized Notebook Guidance |
| 190 | + |
| 191 | +Refer to the following resources for guidance on specific notebook topics: |
| 192 | + |
| 193 | +### 1. BigQuery in Notebooks |
| 194 | + |
| 195 | +Standards for using BigQuery SQL in notebooks and accessing results in Python. |
| 196 | + |
| 197 | +- **Guide**: |
| 198 | + [bigquery_sql_in_notebooks.md](resources/bigquery_sql_in_notebooks.md) |
| 199 | + and the BigQuery skills. |
| 200 | +- **MUST READ WHEN**: You are writing BigQuery SQL queries in a notebook or |
| 201 | + processing query results in Python. |
| 202 | + |
| 203 | +### 2. Machine Learning in Notebooks |
| 204 | + |
| 205 | +Integration with machine learning workflows and best practices. |
| 206 | + |
| 207 | +- **Guide**: Use `@skill:ml-best-practices`. |
| 208 | +- **MUST READ WHEN**: The task involves machine learning, training a model, clustering, classification, regression, or time-series forecasting. |
| 209 | + |
| 210 | +If any "MUST READ WHEN" condition is met, you MUST read the corresponding guide |
| 211 | +before proceeding. |