| 1 | +# Building a workspace data model (advanced) |
| 2 | + |
| 3 | +This is an optional **deep pass** that produces a single authoritative Markdown data model. If you only need a short checklist, use the **Model** section in [WORKFLOWS.md](WORKFLOWS.md) and [DATA_MODEL.template.md](DATA_MODEL.template.md) instead. |
| 4 | + |
| 5 | +**Output:** Save as `DATA_MODEL.md`, `data_model.md`, or `docs/DATA_MODEL.md` in the **project directory** where you run `hotdata` (not inside agent skill folders). |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## 1. Discover connections |
| 10 | + |
| 11 | +```bash |
| 12 | +hotdata connections list |
| 13 | +``` |
| 14 | + |
| 15 | +For each connection, record `id`, `name`, and `source_type`. |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## 2. Enumerate tables, columns, and datasets |
| 20 | + |
| 21 | +If the catalog may be **stale** (recent DDL, new tables missing), run **`hotdata connections refresh <connection_id>`** for affected connections **before** relying on `tables list`. |
| 22 | + |
| 23 | +**Per connection:** |
| 24 | + |
| 25 | +```bash |
| 26 | +hotdata tables list --connection-id <connection_id> |
| 27 | +``` |
| 28 | + |
| 29 | +**Uploaded datasets:** |
| 30 | + |
| 31 | +```bash |
| 32 | +hotdata datasets list |
| 33 | +hotdata datasets <dataset_id> |
| 34 | +``` |
| 35 | + |
| 36 | +Capture schema for each dataset (columns, types) from the detail view. |
| 37 | + |
| 38 | +You can also refresh after enumeration if you discover drift: |
| 39 | + |
| 40 | +```bash |
| 41 | +hotdata connections refresh <connection_id> |
| 42 | +``` |
| 43 | + |
| 44 | +--- |
| 45 | + |
| 46 | +## 3. Enrich beyond column names (optional but valuable) |
| 47 | + |
| 48 | +Use **connector and tooling docs** when `source_type` (or table shapes) match: |
| 49 | + |
| 50 | +- **Vendor / ELT docs** — Your loader or integration vendor’s published schemas for canonical tables, PKs/FKs, and field semantics (link what you use so a human can verify). |
| 51 | +- **dlt** — [verified sources](https://dlthub.com/docs/dlt-ecosystem/verified-sources) for normalized layouts. |
| 52 | +- **dlt-loaded data** — If you see `_dlt_id`, `_dlt_load_id`, `_dlt_parent_id`: treat as pipeline metadata; `_dlt_parent_id` often links flattened child rows to parents when no explicit FK exists. Exclude these from **grain** statements unless the question is specifically about loads. |
| 53 | +- **Vectors** — Columns typed as lists of floats (e.g. embedding columns) are candidates for vector search; note them. |
| 54 | +- **Well-known SaaS shapes** — Apply general patterns (e.g. Stripe charges/customers, HubSpot contacts/deals) only when naming and structure fit; **link** the doc you used so a human can verify. |
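
The dlt guidance above can be applied mechanically once you have a table's column names. The following is a minimal sketch, assuming you have already copied the column names out of `hotdata tables list`; the helper names `split_dlt_columns` and `is_dlt_child_table` are hypothetical and not part of the `hotdata` CLI.

```python
# Column names that dlt adds as pipeline metadata (as listed in the text above).
DLT_METADATA_COLUMNS = {"_dlt_id", "_dlt_load_id", "_dlt_parent_id"}


def split_dlt_columns(columns):
    """Separate business columns from dlt pipeline-metadata columns.

    Returns (business, metadata), each preserving the input order.
    """
    business = [c for c in columns if c not in DLT_METADATA_COLUMNS]
    metadata = [c for c in columns if c in DLT_METADATA_COLUMNS]
    return business, metadata


def is_dlt_child_table(columns):
    """Heuristic: a table carrying `_dlt_parent_id` is likely a flattened child."""
    return "_dlt_parent_id" in columns
```

Use only the `business` list when writing a grain statement, and record the metadata columns separately in the column notes.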
| 55 | + |
| 56 | +Do **not** invent facts: if context is missing, say so and suggest a small sample query: |
| 57 | + |
| 58 | +```bash |
| 59 | +hotdata query "SELECT * FROM <connection>.<schema>.<table> LIMIT 5" |
| 60 | +``` |
| 61 | + |
| 62 | +--- |
| 63 | + |
| 64 | +## 4. Infer relationships |
| 65 | + |
| 66 | +For each table, capture where reasonable: |
| 67 | + |
| 68 | +1. **Grain** — One row = one `…` (required per table; if unknown, say unknown). |
| 69 | +2. **Primary keys** — `id`, `<entity>_id`, or composite patterns from names + types. |
| 70 | +3. **Foreign keys** — `_id` / `_fk` / name matches to other tables; confirm with connector docs when possible. |
| 71 | +4. **Parent–child** — Flattened API/JSON tables (often nested names) and dlt parent keys. |
| 72 | +5. **Cross-connection** — Same logical entity in two connections (keys, type mismatches, caveats). |
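
As an illustration of the foreign-key step, name matching can be sketched as below. This is a rough heuristic under stated assumptions: the function name `candidate_foreign_keys` is hypothetical, pluralization is handled only by appending `s`, and every result is a guess to confirm against connector docs or a sample query.

```python
def candidate_foreign_keys(table, columns, all_tables):
    """Guess (column, referenced_table) pairs from naming alone.

    A column such as `customer_id` or `customer_fk` is matched against
    tables named `customer` or `customers`. Self-references are skipped.
    """
    candidates = []
    for col in columns:
        for suffix in ("_id", "_fk"):
            if not col.endswith(suffix):
                continue
            stem = col[: -len(suffix)]
            for target in all_tables:
                if target == table:
                    continue
                # Naive singular/plural match; real schemas may need more rules.
                if target in (stem, stem + "s"):
                    candidates.append((col, target))
    return candidates
```

For example, calling it for a `charges` table with columns `id`, `customer_id`, and `amount` in a workspace that also contains `customers` yields a single candidate linking `customer_id` to `customers`.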
| 73 | + |
| 74 | +For **small** schemas (e.g. ≤5 tables in a domain), a short **ASCII diagram** helps. For larger ones, group by domain in prose (e.g. billing, identity, product). |
| 75 | + |
| 76 | +--- |
| 77 | + |
| 78 | +## 5. Search and index awareness |
| 79 | + |
| 80 | +For tables you care about: |
| 81 | + |
| 82 | +```bash |
| 83 | +hotdata indexes list -c <connection_id> --schema <schema> --table <table> [-w <workspace_id>] |
| 84 | +``` |
| 85 | + |
| 86 | +Note: |
| 87 | + |
| 88 | +- **Vector**-friendly columns (embeddings) vs **BM25**-friendly text (`title`, `body`, `description`, …). |
| 89 | +- **Time** columns — event grain vs slowly changing dimensions. |
| 90 | +- **Facts vs dimensions** — for analytics-oriented workspaces. |
| 91 | + |
| 92 | +When suggesting a new index, use the same connection/schema/table/column names as in `tables list` and the main skill’s `indexes create` examples. |
| 93 | + |
| 94 | +--- |
| 95 | + |
| 96 | +## 6. Document structure |
| 97 | + |
| 98 | +Start from [DATA_MODEL.template.md](DATA_MODEL.template.md) and extend as needed: |
| 99 | + |
| 100 | +- **Overview** — Domains and what the workspace is for. |
| 101 | +- **Per connection** — Optional subsection per source; for **deep** models, **repeat** one block per `connection.schema.table` (grain, column table with name/type/nullable/PK-FK/notes, relationships, queryability, caveats)—the template’s single `####` heading is a pattern to copy for each table. |
| 102 | +- **Datasets** — Same treatment as connection tables where relevant. |
| 103 | +- **Cross-connection joins** — Keys, semantics, type caveats. |
| 104 | +- **Search / index summary** — Table, column, index status, intended use. |
| 105 | + |
| 106 | +If the workspace has **many** tables (e.g. 50+), add a **table of contents** after the overview (connection → table counts). |
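
The connection-to-table-count listing can be generated rather than typed by hand. A small sketch, assuming table names are already in the `connection.schema.table` form; `render_toc` is a hypothetical helper name.

```python
from collections import Counter


def render_toc(qualified_names):
    """Build a Markdown bullet list of `connection: table count`."""
    # The connection is everything before the first dot.
    counts = Counter(name.split(".", 1)[0] for name in qualified_names)
    return "\n".join(f"- {conn}: {n}" for conn, n in sorted(counts.items()))
```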
| 107 | + |
| 108 | +--- |
| 109 | + |
| 110 | +## Error handling |
| 111 | + |
| 112 | +- If a CLI command fails, record the error in the doc and **continue** when possible. |
| 113 | +- For unreachable connections or empty table lists, note the status in the connections table (e.g. unreachable / no tables). |
| 114 | +- Do not abort the whole model for one bad connection. |
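
The continue-on-failure pattern described above might look like the sketch below. It assumes a caller-supplied `list_tables` callable (hypothetical; for instance a wrapper that runs `hotdata tables list --connection-id <connection_id>` and raises on a non-zero exit), so the loop itself makes no assumptions about CLI output.

```python
def collect_tables(connection_ids, list_tables):
    """Enumerate tables per connection without aborting on a single failure.

    Returns (results, errors): results maps connection id to its table list,
    errors maps connection id to the error message to record in the doc.
    """
    results, errors = {}, {}
    for cid in connection_ids:
        try:
            results[cid] = list_tables(cid)
        except Exception as exc:
            # Record the failure and move on to the next connection.
            errors[cid] = str(exc)
    return results, errors
```

Entries in `errors` map directly to the "unreachable / no tables" notes in the connections table.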
| 115 | + |
| 116 | +--- |
| 117 | + |
| 118 | +## Rules (keep quality high) |
| 119 | + |
| 120 | +- Every table gets an explicit **grain** (or “unknown”). |
| 121 | +- Prefer **documented** connector semantics over guesswork; **link** external docs when you use them. |
| 122 | +- Flag **test/dev** tables (`test`, `tmp`, `dev`, `staging` in names) as non-production when applicable. |
| 123 | +- Note **Utf8-stored numbers** and cast requirements where relevant. |
| 124 | +- Do not leave column **Notes** empty when domain knowledge or docs apply; “—” is weak unless the column is opaque/internal. |
| 125 | +- Align table names with **`hotdata tables list`** output (`connection.schema.table`). |
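
The test/dev flagging rule can be checked with a simple token match. This is a sketch only: `looks_non_production` is a hypothetical helper, and it matches whole name tokens (split on `.`, `_`, and `-`) so that names like `devices` or `contest_entries` are not flagged by accident.

```python
import re

# Tokens named in the rule above.
NON_PROD_TOKENS = {"test", "tmp", "dev", "staging"}


def looks_non_production(qualified_name):
    """Return True if any name token marks the table as test/dev."""
    tokens = re.split(r"[._\-]+", qualified_name.lower())
    return any(token in NON_PROD_TOKENS for token in tokens)
```

Treat a match as a prompt to check with the data owner, not as proof that the table is non-production.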