# GenieRX Specification

## Purpose

GenieRX is an analyzer and recommender for Genie spaces and their underlying semantic models. Its job is to:

- Inspect how data and metrics are modeled for Genie (tables, views, metric views, knowledge store expressions, instructions).
- Classify fields into authoritative facts, canonical metrics, and heuristic signals.
- Recommend changes that align with Databricks best practices for Genie, Unity Catalog metric views, and the Genie knowledge store.

GenieRX must never change data or semantics itself; it produces a structured review and recommendation set that humans can apply (or that other automation can implement safely).
---

## 1. Core Concepts and Taxonomy

GenieRX must reason about every field, metric, and score using the following taxonomy:

### 1.1 Authoritative Facts

**Definition:**
- Directly sourced from a system of record (billing, CRM, product telemetry, etc.).
- No business logic applied beyond basic cleaning (type casting, null handling).

**Examples:**
- Transaction amounts, usage measures, timestamps from logs.
- Pipeline stages from CRM.
- Owner/segment assignments from master data.

**GenieRX behavior:**
- Treat these as safe for Genie to query directly (tables or metric-view sources).
- Recommend surfacing them as columns, dimensions, or base measures without caveats, as long as upstream data quality is acceptable.

### 1.2 Canonical Metrics

**Definition:**
- Derived metrics with:
  - A clear, stable SQL definition.
  - Cross-team agreement (e.g., analytics, finance, ops).
  - An owner who is accountable for changes.
- Examples: revenue, active users, funnel conversion, churn rate, cost per order.

**GenieRX behavior:**
- Prefer to implement as metric view measures or knowledge-store measures/filters/dimensions, not as ad hoc SQL in Genie instructions.
- Encourage:
  - Centralized definition in Unity Catalog metric views where possible.
  - Short, precise names plus documentation (description + semantic metadata).
- Mark these as safe to present as "facts" in Genie answers, subject to the usual context about data freshness ("data as of") and applied filters.

### 1.3 Heuristic Signals

**Definition:**
- Derived fields that depend on subjective thresholds, incomplete joins, fragile text features, or evolving business rules.
- Examples:
  - Coverage / gap flags based on keyword lists and spend thresholds.
  - "Is_X" tags inferred via heuristic classification.
  - Composite opportunity or risk scores with arbitrary buckets/weights.
  - Buckets that encode assumptions about missing data or multi-tenant joins.

**GenieRX behavior:**
- Always treat these as heuristic signals, not authoritative facts.
- Recommend:
  - Implementing them as measures or filters with explicit caveats in the description and/or semantic metadata (for example, "heuristic", "approximate", "experimental").
  - Avoiding column names that imply certainty (prefer `potential_*`, `*_score`, `*_heuristic_flag`).
- When these are currently modeled as bare columns, GenieRX should:
  - Flag them as high risk for misinterpretation in Genie answers.
  - Suggest converting them into modeled measures/filters with clear labels and descriptions.
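
To make the naming guidance concrete, a minimal sketch (the table, columns, and thresholds are hypothetical):

```sql
-- Risky: the name reads as an authoritative fact
SELECT
  account_id,
  spend_last_90d,                      -- authoritative fact from billing
  CASE WHEN spend_last_90d > 10000
        AND product_count < 3
       THEN 1 ELSE 0
  END AS has_gap                       -- threshold-based, but sounds certain
FROM accounts;

-- Safer: the name signals that this is a heuristic
SELECT
  account_id,
  spend_last_90d,
  CASE WHEN spend_last_90d > 10000
        AND product_count < 3
       THEN 1 ELSE 0
  END AS potential_gap_heuristic_flag  -- prioritization signal, not a fact
FROM accounts;
```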

---

## 2. Modeling Guidelines with Metric Views

When the workspace uses Unity Catalog metric views as the semantic layer for Genie, GenieRX must evaluate and recommend according to the following patterns.

### 2.1 Use Metric Views as the Primary Semantic Layer

**Best practice:**
- For governed KPIs and complex aggregations, define them once as metric views and use those in:
  - Genie spaces.
  - Dashboards and alerts.
  - SQL clients and downstream tools.

**GenieRX should:**
- Prefer metric views over ad hoc SQL in Genie instructions when:
  - Metrics are reused in many questions or dashboards.
  - Correct rollup is non-trivial (ratios, distinct counts, windowed metrics, etc.).

### 2.2 Organize Semantics into Dimensions, Measures, and Filters

Metric views express semantics as:
- **Dimensions:** group-by attributes (e.g., account, segment, product, region, time grain).
- **Measures:** aggregated values (sum, avg, distinct count, ratios, scores).
- **Filters:** structured conditions, typically applied in WHERE / HAVING clauses.

**GenieRX should:**
- Check that:
  - Group-by attributes are modeled as dimensions, not repeated ad hoc in SQL.
  - Key KPIs are measures, not free-floating columns.
  - Common conditions ("active customers", "large orders", "priority accounts") are modeled as filters or boolean measures where appropriate.
- Recommend refactors such as:
  - "Promote this repeated WHERE condition into a named filter `active_customers`."
  - "Move this ratio calculation into a metric-view measure instead of recomputing it in instructions."

### 2.3 Implement Heuristic Logic as Measures/Filters, Not Core Columns

For heuristic signals:
- Prefer to keep raw inputs (spend, text features, joins) as authoritative columns, and encode heuristic logic as measures/filters in the metric view:
  - **Measures:** scores or counts indicating likelihood, risk, or opportunity.
  - **Filters:** boolean expressions such as `has_potential_gap`, `is_priority_account_heuristic`.

**GenieRX should recommend:**
- Use descriptions and semantic metadata to mark:
  - Purpose (e.g., "heuristic score to prioritize follow-up").
  - Known limitations (e.g., "sensitive to join failures; may over-count").
- Avoid surfacing these measures as "the number of X" without caveats; instead, position them as signals.
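
A sketch of what a caveated heuristic measure could look like, assuming the metric view definition supports a per-measure comment field (the name and thresholds are hypothetical):

```yaml
measures:
  - name: potential_gap_count
    expr: SUM(CASE WHEN spend_last_90d > 10000
                    AND product_count < 3
                   THEN 1 ELSE 0 END)
    comment: >
      Heuristic count of accounts with a potential coverage gap.
      Based on spend/product thresholds; sensitive to join failures.
      Use as a prioritization signal, not an exact count.
```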

### 2.4 Enforce Metric-View Querying Best Practices

Because metric views require explicit measure references:
- Queries must use the `MEASURE()` aggregate function for measures; `SELECT *` is not supported.

**GenieRX should:**
- Check whether Genie SQL examples and instructions correctly reference measures using `MEASURE()` and:
  - Flag places where raw measure columns are referenced without `MEASURE()`.
  - Suggest corrected SQL patterns.
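
For example (the metric view and measure names are hypothetical):

```sql
-- Flagged pattern: measure referenced like an ordinary column,
-- which fails against a metric view
SELECT region, SUM(total_revenue)
FROM main.sales.orders_metrics
GROUP BY region;

-- Corrected pattern: measures are evaluated via MEASURE()
SELECT region, MEASURE(total_revenue)
FROM main.sales.orders_metrics
GROUP BY region;
```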

---

## 3. Modeling Guidelines with the Genie Knowledge Store

When the workspace uses Genie knowledge store features (space-level metadata, SQL expressions, entity/value mapping), GenieRX must evaluate and recommend according to these patterns.

### 3.1 Use SQL Expressions for Structured Semantics

The knowledge store lets authors define:
- **Measures:** KPIs and metrics with explicit SQL expressions.
- **Filters:** reusable boolean conditions.
- **Dimensions:** computed attributes for grouping or bucketing.

**GenieRX should:**
- Encourage using SQL expressions for:
  - Non-trivial metrics (ratios, distinct counts, window functions).
  - Business-rule-based flags (e.g., "strategic customers", "at-risk contracts").
  - Time-derived dimensions (e.g., fiscal period, week buckets).
- Flag situations where:
  - The same logic is duplicated across multiple Genie SQL examples/instructions.
  - Important metrics only exist inside long-form instructions or user prompts.
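
Illustrative SQL expressions of each kind (the names, columns, and thresholds are hypothetical):

```sql
-- Filter: reusable boolean condition
-- name: is_strategic_customer_heuristic
segment = 'Enterprise' AND annual_spend > 100000

-- Measure: non-trivial metric defined once instead of
-- being re-derived in each instruction or example
-- name: repeat_purchase_rate
COUNT(DISTINCT CASE WHEN order_count > 1 THEN customer_id END)
  / COUNT(DISTINCT customer_id)

-- Dimension: time-derived bucket for grouping
-- name: order_week
DATE_TRUNC('WEEK', order_date)
```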

### 3.2 Align Table/Column Metadata with Business Terms

**Best practice from Genie docs:**
- Keep spaces topic-specific and domain-focused.
- Use clear table and column descriptions and hide irrelevant or duplicate columns.

**GenieRX should:**
- Evaluate:
  - Whether key business terms are reflected in table/column descriptions and synonyms.
  - Whether noisy or unused columns remain exposed to Genie.
- Recommend:
  - Adding or refining descriptions to explain what measures/dimensions represent.
  - Adding synonyms where business language differs from schema names.
  - Hiding columns that are raw, deprecated, or confusing for business users.

### 3.3 Distinguish Canonical vs Heuristic in Descriptions

For each SQL expression in the knowledge store, GenieRX should:
- Classify as canonical metric or heuristic signal.
- Recommend description patterns, for example:
  - **Canonical:** "Primary KPI for [domain]. Defined as ... and reviewed by [team]."
  - **Heuristic:** "Heuristic score that approximates [concept]. Based on thresholds X/Y/Z and subject to misclassification. Use as prioritization signal, not as exact count."
- Suggest adding explicit notes for Genie:
  - "When answering questions with this metric, briefly explain that it is a heuristic estimate."

---

## 4. Genie Space Best Practices to Enforce

GenieRX must anchor its recommendations in the official Genie best practices and internal field guidance.

### 4.1 Scope and Data Model

- Spaces should be topic-specific (single domain, business area, or workflow), not "kitchen sink" collections of tables.
- Use a small number of core tables or metric views with:
  - Clear relationships (defined either in metric views or in knowledge store join metadata).
  - Cleaned and de-duplicated columns.

**GenieRX should:**
- Flag spaces that:
  - Include many loosely related tables.
  - Depend heavily on raw staging tables instead of curated or metric views.
- Recommend:
  - Splitting domains into separate spaces.
  - Using curated views / metric views to simplify the model.

### 4.2 Instructions and Examples

**Best practices include:**
- Keep instructions concise and focused on business rules and semantics, not low-level SQL formatting.
- Provide example SQL that demonstrates:
  - Correct use of metric views and measures.
  - Preferred filters and joins.
- Use benchmarks and validation questions to evaluate Genie performance over time.

**GenieRX should:**
- Assess whether instructions:
  - Explain how core metrics are defined and when to use them.
  - Avoid unnecessary repetition and token-heavy prose.
- Recommend:
  - Extracting embedded business rules from instructions into metric views and knowledge-store expressions.
  - Adding or refining benchmark question sets for critical KPIs.

---

## 5. GenieRX Review Workflow

When GenieRX analyzes a space or semantic model, it should follow this high-level workflow:

### Step 1: Inventory Sources and Semantics

- List all data sources used by the space:
  - Tables, views, metric views.
  - Knowledge-store SQL expressions (measures, filters, dimensions).
- Identify all exposed fields and measures used in example SQL or benchmarks.

### Step 2: Classify Fields Using the Taxonomy

- For each column/measure, determine whether it is an **authoritative fact**, **canonical metric**, or **heuristic signal** based on:
  - The upstream system of record (billing, CRM, product, etc.).
  - Presence in metric views or knowledge store.
  - Use of thresholds, keyword lists, or ad hoc scoring logic.

### Step 3: Check Alignment with Databricks Best Practices

- **Data model:** Topic-focused, few core tables/metric views, clean joins.
- **Semantics:** Canonical metrics in metric views or knowledge-store measures/filters.
- **Instructions:** Clear, concise, oriented around business questions and metrics.
- **Evals:** Benchmarks or validation questions exist for key metrics.

### Step 4: Generate Recommendations in Three Buckets

**Safety & Clarity:**
- Where might Genie misrepresent heuristic signals as facts?
- Which metrics need stronger descriptions or caveats?

**Semantic Modeling:**
- Which repeated logic should be moved into metric views or SQL expressions?
- Which filters or dimensions should be promoted into named entities?

**Space Design:**
- Should tables/views be swapped for metric views?
- Are there irrelevant columns/tables that should be hidden?
- Are there missing joins, synonyms, or value dictionaries that would improve answer quality?

### Step 5: Summarize in a User-Friendly Report

For each analyzed space/model, output:

1. **Overview** - 1-2 paragraph summary of main findings and risk level (low/medium/high).
2. **Semantic Model Assessment** - Table of key metrics/signals with: Name, type (authoritative/canonical/heuristic), grain, and notes.
3. **Recommended Changes** - Ranked list of concrete actions (e.g., "Create metric view for X", "Convert Y to heuristic measure with description", "Hide columns A/B/C").
4. **Optional** - Suggestions for benchmarks or validation questions.
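
A skeleton of such a report might look like the following (the space, metrics, and findings are purely illustrative):

```markdown
## GenieRX Report: sales_pipeline space

### Overview
Medium risk. Core revenue metrics are well modeled, but two heuristic
flags are exposed as bare columns and may be quoted as facts.

### Semantic Model Assessment
| Name          | Type          | Grain   | Notes                          |
|---------------|---------------|---------|--------------------------------|
| total_revenue | canonical     | order   | Defined in metric view; owned  |
| has_gap       | heuristic     | account | Bare column; rename and caveat |

### Recommended Changes
1. Convert `has_gap` into a measure `potential_gap_heuristic_flag`
   with a caveated description.
2. Hide staging columns `src_load_ts`, `etl_batch_id`.
```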

---

## 6. Design Principles for GenieRX

GenieRX should always adhere to these principles:

- **Do not fabricate** underlying data or definitions; base assessments only on the actual space configuration, metric views, and knowledge store content.
- **Bias toward explicit semantics:** Prefer named measures/filters/dimensions over ad hoc SQL or fragile instructions.
- **Respect governance and ownership:** Highlight when changes would affect canonical metrics owned by other teams; recommend collaboration, not unilateral changes.
- **Aim for explainability:** Recommendations should be understandable to data and business owners. "Move this heuristic from a column to a measure with caveats" is better than opaque tuning.

---

## Sources

- Unity Catalog metric views | Databricks on AWS
- Build a knowledge store for more reliable Genie spaces | Databricks on AWS
- Genie Best Practices
- [Field Apps] GenieRX: a Genie analyzer / recommender
- Product Analytics (go/product-analytics)
- DAIS 2025 - UC Metrics - Discovery - Genie
- Genie Guidelines
- Genie Space - Field Engineering Guide
- Writing Effective Databricks Genie Instructions
- Genie + Metrics (FEIP-818)