ClickHouse · heavycrystal · Feb 11, 2026 · Copilot · Feb 11, 2026 · Copilot
diff --git a/.claude/skills/clickhouse-best-practices/AGENTS.md b/.claude/skills/clickhouse-best-practices/AGENTS.md
diff --git a/.claude/skills/clickhouse-best-practices/SKILL.md b/.claude/skills/clickhouse-best-practices/SKILL.md
@@ -0,0 +1,234 @@
+---
+name: clickhouse-best-practices
+description: MUST USE when reviewing ClickHouse schemas, queries, or configurations. Contains 28 rules that MUST be checked before providing recommendations. Always read relevant rule files and cite specific rules in responses.
+license: Apache-2.0
+metadata:
+  author: ClickHouse Inc
+  version: "0.3.0"
-  version: "0.3.0"
+  version: "0.1.0"
-  version: "0.3.0"
+  version: "0.1.0"
+---
+
+# ClickHouse Best Practices
+
+Comprehensive guidance for ClickHouse covering schema design, query optimization, and data ingestion. Contains 28 rules across 3 main categories (schema, query, insert), prioritized by impact.
+
+> **Official docs:** [ClickHouse Best Practices](https://clickhouse.com/docs/best-practices)
+
+## IMPORTANT: How to Apply This Skill
+
+**Before answering ClickHouse questions, follow this priority order:**
+
+1. **Check for applicable rules** in the `rules/` directory
+2. **If rules exist:** Apply them and cite them in your response using "Per `rule-name`..."
+3. **If no rule exists:** Use the LLM's ClickHouse knowledge or search documentation
+4. **If uncertain:** Use web search for current best practices
+5. **Always cite your source:** rule name, "general ClickHouse guidance", or URL
+
+**Why rules take priority:** ClickHouse has specific behaviors (columnar storage, sparse indexes, merge tree mechanics) where general database intuition can be misleading. The rules encode validated, ClickHouse-specific guidance.
+
+### For Formal Reviews
+
+When performing a formal review of schemas, queries, or data ingestion:
+
+---
+
+## Review Procedures
+
+### For Schema Reviews (CREATE TABLE, ALTER TABLE)
+
+**Read these rule files in order:**
+
+1. `rules/schema-pk-plan-before-creation.md` - ORDER BY is immutable
+2. `rules/schema-pk-cardinality-order.md` - Column ordering in keys
+3. `rules/schema-pk-prioritize-filters.md` - Filter column inclusion
+4. `rules/schema-types-native-types.md` - Proper type selection
+5. `rules/schema-types-minimize-bitwidth.md` - Numeric type sizing
+6. `rules/schema-types-lowcardinality.md` - LowCardinality usage
+7. `rules/schema-types-avoid-nullable.md` - Nullable vs DEFAULT
+8. `rules/schema-partition-low-cardinality.md` - Partition count limits
+9. `rules/schema-partition-lifecycle.md` - Partitioning purpose
+
+**Check for:**
+- [ ] PRIMARY KEY / ORDER BY column order (low-to-high cardinality)
+- [ ] Data types match actual data ranges
+- [ ] LowCardinality applied to appropriate string columns
+- [ ] Partition key cardinality bounded (100-1,000 values)
+- [ ] ReplacingMergeTree has version column if used
+
+### For Query Reviews (SELECT, JOIN, aggregations)
+
+**Read these rule files:**
+
+1. `rules/query-join-choose-algorithm.md` - Algorithm selection
+2. `rules/query-join-filter-before.md` - Pre-join filtering
+3. `rules/query-join-use-any.md` - ANY vs regular JOIN
+4. `rules/query-index-skipping-indices.md` - Secondary index usage
+5. `rules/schema-pk-filter-on-orderby.md` - Filter alignment with ORDER BY
+
+**Check for:**
+- [ ] Filters use ORDER BY prefix columns
+- [ ] JOINs filter tables before joining (not after)
+- [ ] Correct JOIN algorithm for table sizes
+- [ ] Skipping indices for non-ORDER BY filter columns
+
+### For Insert Strategy Reviews (data ingestion, updates, deletes)
+
+**Read these rule files:**
+
+1. `rules/insert-batch-size.md` - Batch sizing requirements
+2. `rules/insert-mutation-avoid-update.md` - UPDATE alternatives
+3. `rules/insert-mutation-avoid-delete.md` - DELETE alternatives
+4. `rules/insert-async-small-batches.md` - Async insert usage
+5. `rules/insert-optimize-avoid-final.md` - OPTIMIZE TABLE risks
+
+**Check for:**
+- [ ] Batch size 10K-100K rows per INSERT
+- [ ] No ALTER TABLE UPDATE for frequent changes
+- [ ] ReplacingMergeTree or CollapsingMergeTree for update patterns
+- [ ] Async inserts enabled for high-frequency small batches
+
+---
+
+## Output Format
+
+Structure your response as follows:
+
+```
+## Rules Checked
+- `rule-name-1` - Compliant / Violation found
+- `rule-name-2` - Compliant / Violation found
+...
+
+## Findings
+
+### Violations
+- **`rule-name`**: Description of the issue
+  - Current: [what the code does]
+  - Required: [what it should do]
+  - Fix: [specific correction]
+
+### Compliant
+- `rule-name`: Brief note on why it's correct
+
+## Recommendations
+[Prioritized list of changes, citing rules]
+```
+
+---
+
+## Rule Categories by Priority
+
+| Priority | Category | Impact | Prefix | Rule Count |
+|----------|----------|--------|--------|------------|
+| 1 | Primary Key Selection | CRITICAL | `schema-pk-` | 4 |
+| 2 | Data Type Selection | CRITICAL | `schema-types-` | 5 |
+| 3 | JOIN Optimization | CRITICAL | `query-join-` | 5 |
+| 4 | Insert Batching | CRITICAL | `insert-batch-` | 1 |
+| 5 | Mutation Avoidance | CRITICAL | `insert-mutation-` | 2 |
+| 6 | Partitioning Strategy | HIGH | `schema-partition-` | 4 |
+| 7 | Skipping Indices | HIGH | `query-index-` | 1 |
+| 8 | Materialized Views | HIGH | `query-mv-` | 2 |
+| 9 | Async Inserts | HIGH | `insert-async-` | 2 |
+| 10 | OPTIMIZE Avoidance | HIGH | `insert-optimize-` | 1 |
+| 11 | JSON Usage | MEDIUM | `schema-json-` | 1 |
+
+---
+
+## Quick Reference
+
+### Schema Design - Primary Key (CRITICAL)
+
+- `schema-pk-plan-before-creation` - Plan ORDER BY before table creation (immutable)
+- `schema-pk-cardinality-order` - Order columns low-to-high cardinality
+- `schema-pk-prioritize-filters` - Include frequently filtered columns
+- `schema-pk-filter-on-orderby` - Query filters must use ORDER BY prefix
+
+### Schema Design - Data Types (CRITICAL)
+
+- `schema-types-native-types` - Use native types, not String for everything
+- `schema-types-minimize-bitwidth` - Use smallest numeric type that fits
+- `schema-types-lowcardinality` - LowCardinality for <10K unique strings
+- `schema-types-enum` - Enum for finite value sets with validation
+- `schema-types-avoid-nullable` - Avoid Nullable; use DEFAULT instead
+
+### Schema Design - Partitioning (HIGH)
+
+- `schema-partition-low-cardinality` - Keep partition count 100-1,000
+- `schema-partition-lifecycle` - Use partitioning for data lifecycle, not queries
+- `schema-partition-query-tradeoffs` - Understand partition pruning trade-offs
+- `schema-partition-start-without` - Consider starting without partitioning
+
+### Schema Design - JSON (MEDIUM)
+
+- `schema-json-when-to-use` - JSON for dynamic schemas; typed columns for known
+
+### Query Optimization - JOINs (CRITICAL)
+
+- `query-join-choose-algorithm` - Select algorithm based on table sizes
+- `query-join-use-any` - ANY JOIN when only one match needed
+- `query-join-filter-before` - Filter tables before joining
+- `query-join-consider-alternatives` - Dictionaries/denormalization vs JOIN
+- `query-join-null-handling` - join_use_nulls=0 for default values
+
+### Query Optimization - Indices (HIGH)
+
+- `query-index-skipping-indices` - Skipping indices for non-ORDER BY filters
+
+### Query Optimization - Materialized Views (HIGH)
+
+- `query-mv-incremental` - Incremental MVs for real-time aggregations
+- `query-mv-refreshable` - Refreshable MVs for complex joins
+
+### Insert Strategy - Batching (CRITICAL)
+
+- `insert-batch-size` - Batch 10K-100K rows per INSERT
+
+### Insert Strategy - Async (HIGH)
+
+- `insert-async-small-batches` - Async inserts for high-frequency small batches
+- `insert-format-native` - Native format for best performance
+
+### Insert Strategy - Mutations (CRITICAL)
+
+- `insert-mutation-avoid-update` - ReplacingMergeTree instead of ALTER UPDATE
+- `insert-mutation-avoid-delete` - Lightweight DELETE or DROP PARTITION
+
+### Insert Strategy - Optimization (HIGH)
+
+- `insert-optimize-avoid-final` - Let background merges work
+
+---
+
+## When to Apply
+
+This skill activates when you encounter:
+
+- `CREATE TABLE` statements
+- `ALTER TABLE` modifications
+- `ORDER BY` or `PRIMARY KEY` discussions
+- Data type selection questions
+- Slow query troubleshooting
+- JOIN optimization requests
+- Data ingestion pipeline design
+- Update/delete strategy questions
+- ReplacingMergeTree or other specialized engine usage
+- Partitioning strategy decisions
+
+---
+
+## Rule File Structure
+
+Each rule file in `rules/` contains:
+
+- **YAML frontmatter**: title, impact level, tags
+- **Brief explanation**: Why this rule matters
+- **Incorrect example**: Anti-pattern with explanation
+- **Correct example**: Best practice with explanation
+- **Additional context**: Trade-offs, when to apply, references
+
+---
+
+## Full Compiled Document
+
+For the complete guide with all rules expanded inline: `AGENTS.md`
+
+Use `AGENTS.md` when you need to check multiple rules quickly without reading individual files.
diff --git a/.claude/skills/clickhouse-best-practices/rules/insert-async-small-batches.md b/.claude/skills/clickhouse-best-practices/rules/insert-async-small-batches.md
@@ -0,0 +1,55 @@
+---
+title: Use Async Inserts for High-Frequency Small Batches
+impact: HIGH
+impactDescription: "Server-side buffering when client batching isn't practical"
+tags: [insert, async, buffering, small-batches]
+---
+
+## Use Async Inserts for High-Frequency Small Batches
+
+**Impact: HIGH**
+
+When client-side batching isn't practical, async inserts buffer server-side and create larger parts automatically.
+
+**Incorrect (small batches without async):**
+
+```python
+# Small batches without async_insert - creates too many parts
+for batch in chunks(events, 100):
+    client.execute("INSERT INTO events VALUES", batch)
+```
+
+**Correct (enable async inserts):**
+
+```python
+# Enable async_insert with safe defaults
+client.execute("SET async_insert = 1")
+client.execute("SET wait_for_async_insert = 1")  # Confirms durability
+
+for batch in chunks(events, 100):
+    client.execute("INSERT INTO events VALUES", batch)
+# Server buffers and creates larger parts automatically
+```
+
+```sql
+-- Configure server-side for specific users
+ALTER USER my_app_user SETTINGS
+    async_insert = 1,
+    wait_for_async_insert = 1,
+    async_insert_max_data_size = 10000000,  -- Flush at 10MB
+    async_insert_busy_timeout_ms = 1000;    -- Flush after 1s
+```
+
+**Flush conditions (whichever occurs first):**
+- Buffer reaches `async_insert_max_data_size`
+- Time threshold `async_insert_busy_timeout_ms` elapses
+- Maximum insert queries accumulate
+
+**Return modes:**
+
+| Setting | Behavior | Use Case |
+|---------|----------|----------|
+| `wait_for_async_insert=1` | Waits for flush, confirms durability | **Recommended** |
+| `wait_for_async_insert=0` | Fire-and-forget, unaware of errors | **Risky** - only if you accept data loss |
+
+Reference: [Selecting an Insert Strategy](https://clickhouse.com/docs/best-practices/selecting-an-insert-strategy)
diff --git a/.claude/skills/clickhouse-best-practices/rules/insert-batch-size.md b/.claude/skills/clickhouse-best-practices/rules/insert-batch-size.md
@@ -0,0 +1,54 @@
+---
+title: Batch Inserts Appropriately (10K-100K rows)
+impact: CRITICAL
+impactDescription: "Each INSERT creates a part; single-row inserts overwhelm merge process"
+tags: [insert, batching, parts, performance]
+---
+
+## Batch Inserts Appropriately (10K-100K rows)
+
+**Impact: CRITICAL**
+
+Each INSERT creates a new data part. Single-row or small-batch inserts create thousands of tiny parts, overwhelming the merge process and causing cluster instability.
+
+**Incorrect (single-row or tiny batches):**
+
+```python
+# Single-row inserts - creates 10,000 parts!
+for event in events:
+    client.execute("INSERT INTO events VALUES", [event])
+
+# Tiny batches - still too many parts
+for batch in chunks(events, 100):  # 100 rows per INSERT
+    client.execute("INSERT INTO events VALUES", batch)
+```
+
+**Correct (proper batch size):**
+
+```python
+# Ideal batch size: 10,000-100,000 rows
+BATCH_SIZE = 10_000
+for batch in chunks(events, BATCH_SIZE):
+    client.execute("INSERT INTO events VALUES", batch)
+```
+
+**Recommended batch sizes:**
+
+| Threshold | Value |
+|-----------|-------|
+| Minimum | 1,000 rows |
+| Ideal range | 10,000-100,000 rows |
+| Insert rate (sync) | ~1 insert per second |
+
+**Validation:**
+
+```sql
+-- Monitor part count (>3000 per partition blocks inserts)
+SELECT table, count() as parts, sum(rows) as total_rows
+FROM system.parts
+WHERE active AND database = 'default'
+GROUP BY table
+ORDER BY parts DESC;
+```
+
+Reference: [Selecting an Insert Strategy](https://clickhouse.com/docs/best-practices/selecting-an-insert-strategy)
diff --git a/.claude/skills/clickhouse-best-practices/rules/insert-format-native.md b/.claude/skills/clickhouse-best-practices/rules/insert-format-native.md
@@ -0,0 +1,29 @@
+---
+title: Use Native Format for Best Insert Performance
+impact: MEDIUM
+impactDescription: "Native format is most efficient; JSONEachRow is expensive to parse"
+tags: [insert, format, Native, performance]
+---
+
+## Use Native Format for Best Insert Performance
+
+**Impact: MEDIUM**
+
+Data format affects insert performance. Native format is column-oriented with minimal parsing overhead.
+
+**Performance Ranking (fastest to slowest):**
+
+| Format | Notes |
+|--------|-------|
+| **Native** | Most efficient. Column-oriented, minimal parsing. Recommended. |
+| **RowBinary** | Efficient row-based alternative |
+| **JSONEachRow** | Easier to use but expensive to parse |
+
+**Example:**
+
+```python
+# Use Native format for best performance
+client.execute("INSERT INTO events VALUES", data, settings={'input_format': 'Native'})
-```python
-# Use Native format for best performance
-client.execute("INSERT INTO events VALUES", data, settings={'input_format': 'Native'})
+```bash
+# Use Native format for best performance with clickhouse-client
+clickhouse-client --query="INSERT INTO events FORMAT Native" < events.native
-```python
-# Use Native format for best performance
-client.execute("INSERT INTO events VALUES", data, settings={'input_format': 'Native'})
+```bash
+# Use Native format for best performance with clickhouse-client
+clickhouse-client --query="INSERT INTO events FORMAT Native" < events.native
+```
+
+Reference: [Selecting an Insert Strategy](https://clickhouse.com/docs/best-practices/selecting-an-insert-strategy)