Add ClickHouse data catalog by HamoonDBA · Pull Request #11858 · mindsdb/minds-platform

HamoonDBA · 2025-11-06T20:37:21Z

Description

This PR adds Data Catalog support for the ClickHouse handler, enabling AI agents to automatically read and utilize table and column metadata (including column comments) when generating SQL queries.

Key Changes:

Extended ClickHouseHandler to inherit from MetaDatabaseHandler
Implemented all required Data Catalog methods to read metadata from ClickHouse system tables
Added support for reading column comments set via ALTER TABLE ... MODIFY COLUMN ... COMMENT

Benefits:

Agents can now understand column descriptions and enum values automatically
Better SQL query generation for ClickHouse data sources
Consistent Data Catalog experience across different database handlers

Type of change

⚡ New feature (non-breaking change which adds functionality)
📄 This change requires a documentation update

Verification Process

To ensure the changes are working as expected:

Prerequisites:

Enable Data Catalog in config.json:

{
    "data_catalog": {
        "enabled": true
    }
}

Set up a ClickHouse connection and add column comments:

-- In ClickHouse
ALTER TABLE test_table 
MODIFY COLUMN status String COMMENT 'Status enum: pending, completed, cancelled';

Test Location:

MindsDB instance with ClickHouse integration
Access via http://localhost:47334

Verification Steps:

Connect ClickHouse to MindsDB:

CREATE DATABASE clickhouse_conn
WITH ENGINE = 'clickhouse',
PARAMETERS = {
    "host": "your_host",
    "port": 9000,
    "user": "default",
    "password": "password",
    "database": "your_database"
};

Create a text2sql skill:

CREATE SKILL clickhouse_skill
USING
    type = 'text2sql',
    database = 'clickhouse_conn',
    tables = ['test_table'],
    description = 'Test table with commented columns';

Create an agent to trigger Data Catalog generation:

CREATE AGENT test_agent
USING
    model = 'gpt-4',
    skills = ['clickhouse_skill'];

Verify column metadata is captured:

SELECT * FROM INFORMATION_SCHEMA.META_COLUMNS 
WHERE TABLE_SCHEMA = 'clickhouse_conn'
AND TABLE_NAME = 'test_table';

Expected: The COLUMN_DESCRIPTION field should contain the comments you set in ClickHouse.

Verify table metadata:

SELECT * FROM INFORMATION_SCHEMA.META_TABLES 
WHERE TABLE_SCHEMA = 'clickhouse_conn';

Verify column statistics (optional, may be slow for large tables):

SELECT * FROM INFORMATION_SCHEMA.META_COLUMN_STATISTICS 
WHERE TABLE_SCHEMA = 'clickhouse_conn'
AND TABLE_NAME = 'test_table';

Test with an agent query:
Ask the agent a natural language question about your ClickHouse data and verify it generates accurate SQL using the column descriptions.

Additional Media:

I have attached a brief loom video or screenshots showcasing the new functionality or change.

Checklist:

My code follows the style guidelines (PEP 8) of MindsDB.
I have appropriately commented on my code, especially in complex areas.
Necessary documentation updates are either made or tracked in issues.
Relevant unit and integration tests are updated or added.

Implementation Details

Methods Implemented:

meta_get_tables(): Reads table metadata from system.tables
- Table name, schema, type (engine), description, row count
meta_get_columns(): Reads column metadata from system.columns ✨
- Column name, data type, column comments/descriptions, default values, nullable status
meta_get_column_statistics(): Computes statistics by querying tables
- NULL percentage, distinct value count, min/max values
meta_get_primary_keys(): Reads primary key info from system.columns
meta_get_foreign_keys(): Returns empty DataFrame (ClickHouse doesn't support FK constraints)
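
The metadata reads listed above come down to a few queries against ClickHouse system tables. The following is a minimal sketch of how such queries could be assembled, not the code from this PR: the helper names (`quote_literal`, `build_tables_query`, `build_columns_query`, `build_primary_keys_query`) are hypothetical, and only the `system.tables` and `system.columns` column names are taken from ClickHouse itself.

```python
from typing import List, Optional


def quote_literal(value: str) -> str:
    """Return value as a single-quoted ClickHouse string literal."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def _table_filter(column: str, table_names: Optional[List[str]]) -> str:
    """Build an optional ' AND <column> IN (...)' filter (hypothetical helper)."""
    if not table_names:
        return ""
    names = ", ".join(quote_literal(t) for t in table_names)
    return f" AND {column} IN ({names})"


def build_tables_query(database: str, table_names: Optional[List[str]] = None) -> str:
    """Table name, schema, engine, description and row count from system.tables."""
    return (
        "SELECT name AS table_name, database AS table_schema, "
        "engine AS table_type, comment AS table_description, "
        "total_rows AS row_count "
        f"FROM system.tables WHERE database = {quote_literal(database)}"
        + _table_filter("name", table_names)
    )


def build_columns_query(database: str, table_names: Optional[List[str]] = None) -> str:
    """Column name, type, comment and default expression from system.columns."""
    return (
        "SELECT table AS table_name, name AS column_name, type AS data_type, "
        "comment AS column_description, default_expression AS column_default "
        f"FROM system.columns WHERE database = {quote_literal(database)}"
        + _table_filter("table", table_names)
    )


def build_primary_keys_query(database: str, table_names: Optional[List[str]] = None) -> str:
    """Primary key columns, identified by the is_in_primary_key flag."""
    return (
        "SELECT table AS table_name, name AS column_name "
        f"FROM system.columns WHERE database = {quote_literal(database)} "
        "AND is_in_primary_key = 1"
        + _table_filter("table", table_names)
    )
```

Escaping the database and table names before interpolating them keeps a name that contains a quote from breaking the query.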

Notes:

Computing column statistics may be slow for large tables because it requires scanning the data

Users can set column descriptions in ClickHouse using:

ALTER TABLE table_name MODIFY COLUMN column_name Type COMMENT 'description';

This implementation follows the same pattern as existing handlers (MySQL, PostgreSQL, SQL Server)

Documentation TODO:

Add ClickHouse to the list of Data Catalog supported integrations in /docs/data_catalog/integrations/overview.mdx
Create usage example showing how to set column comments in ClickHouse

github-actions · 2025-11-06T20:37:32Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

entelligence-ai-pr-reviews · 2025-11-06T20:40:23Z

🔒 Entelligence AI Vulnerability Scanner

✅ No security vulnerabilities found!

Your code passed our comprehensive security analysis.

📊 Files Analyzed: 1 file

entelligence-ai-pr-reviews · 2025-11-06T20:40:26Z

+                type as data_type,
+                comment as column_description,
+                default_expression as column_default,
+                CASE WHEN is_in_primary_key = 1 THEN 0 ELSE 1 END as is_nullable


correctness: is_nullable in meta_get_columns is set based on is_in_primary_key, which is incorrect and will mislabel non-primary key columns as nullable even if they are NOT NULL.

🤖 AI Agent Prompt for Cursor/Windsurf

📋 Copy this prompt to your AI coding assistant (Cursor, Windsurf, etc.) to get help fixing this issue

In mindsdb/integrations/handlers/clickhouse_handler/clickhouse_handler.py, lines 222-225, the `meta_get_columns` method incorrectly sets `is_nullable` using `CASE WHEN is_in_primary_key = 1 THEN 0 ELSE 1 END`, which does not accurately reflect column nullability. Replace this with `is_nullable as is_nullable` to correctly report the nullability status from ClickHouse system columns.

📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change

-                type as data_type,
-                comment as column_description,
-                default_expression as column_default,
-                CASE WHEN is_in_primary_key = 1 THEN 0 ELSE 1 END as is_nullable
+                type as data_type,
+                comment as column_description,
+                default_expression as column_default,
+                is_nullable as is_nullable
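
ClickHouse encodes nullability in the column type itself (for example `Nullable(String)`). If `system.columns` in a given ClickHouse version does not expose an `is_nullable` column, one alternative is to derive the flag from the `type` string. A minimal sketch under that assumption; `is_nullable_type` is a hypothetical helper and is not taken from this PR.

```python
def is_nullable_type(type_name: str) -> bool:
    """Return True when a ClickHouse type string allows NULL at the top level."""
    t = type_name.strip()
    # LowCardinality(Nullable(String)) carries the Nullable marker one level down.
    prefix = "LowCardinality("
    if t.startswith(prefix) and t.endswith(")"):
        t = t[len(prefix):-1].strip()
    # Nested Nullable markers, as in Array(Nullable(Int32)), do not make the
    # column itself nullable, so only the outermost wrapper is checked.
    return t.startswith("Nullable(")
```

For example, `is_nullable_type("Nullable(String)")` is true, while `is_nullable_type("Array(Nullable(Int32))")` is false because the array value itself can never be NULL.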

entelligence-ai-pr-reviews · 2025-11-06T20:40:39Z

+    def meta_get_column_statistics_for_table(
+        self, table_name: str, column_names: Optional[List[str]] = None
+    ) -> Response:
+        """
+        Retrieves column statistics for a specific table.
+
+        Args:
+            table_name (str): The name of the table.
+            column_names (Optional[List[str]]): List of column names to retrieve statistics for. 
+                                                  If None, statistics for all columns will be returned.
+
+        Returns:
+            Response: A response object containing the column statistics for the table.
+        """
+        database = self.connection_data['database']
+
+        # Get the list of columns for this table
+        columns_query = f"""
+            SELECT name, type
+            FROM system.columns
+            WHERE database = '{database}' AND table = '{table_name}'
+        """
+
+        if column_names:
+            quoted_names = [f"'{c}'" for c in column_names]
+            columns_query += f" AND name IN ({','.join(quoted_names)})"
+
+        try:
+            columns_result = self.native_query(columns_query)
+
+            if columns_result.resp_type == RESPONSE_TYPE.ERROR or columns_result.data_frame.empty:
+                logger.warning(f"No columns found for table {table_name}")
+                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame())
+
+            # Build statistics query - collect all stats in one query
+            select_parts = []
+            for _, row in columns_result.data_frame.iterrows():
+                col = row['name']
+                # Use backticks to handle special characters in column names
+                select_parts.extend([
+                    f"countIf(`{col}` IS NULL) AS nulls_{col}",
+                    f"uniq(`{col}`) AS distincts_{col}",
+                    f"toString(min(`{col}`)) AS min_{col}",
+                    f"toString(max(`{col}`)) AS max_{col}",
+                ])
+
+            if not select_parts:
+                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame())
+
+            # Build the query to get stats for all columns at once
+            stats_query = f"""
+                SELECT 
+                    count(*) AS total_rows,
+                    {', '.join(select_parts)}
+                FROM `{database}`.`{table_name}`
+            """
+
+            stats_result = self.native_query(stats_query)
+
+            if stats_result.resp_type != RESPONSE_TYPE.TABLE or stats_result.data_frame.empty:
+                logger.warning(f"Could not retrieve stats for table {table_name}")
+                # Return placeholder stats
+                placeholder_data = []
+                for _, row in columns_result.data_frame.iterrows():
+                    placeholder_data.append({
+                        'table_name': table_name,
+                        'column_name': row['name'],
+                        'null_percentage': None,
+                        'distinct_values_count': None,
+                        'most_common_values': None,
+                        'most_common_frequencies': None,
+                        'minimum_value': None,
+                        'maximum_value': None,
+                    })
+                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame(placeholder_data))
+
+            # Parse the stats result
+            stats_data = stats_result.data_frame.iloc[0]
+            total_rows = stats_data.get('total_rows', 0)
+
+            # Build the final statistics DataFrame
+            all_stats = []
+            for _, row in columns_result.data_frame.iterrows():
+                col = row['name']
+                nulls = stats_data.get(f'nulls_{col}', 0)
+                distincts = stats_data.get(f'distincts_{col}', None)
+                min_val = stats_data.get(f'min_{col}', None)
+                max_val = stats_data.get(f'max_{col}', None)
+
+                # Calculate null percentage
+                null_pct = None
+                if total_rows is not None and total_rows > 0:
+                    null_pct = round((nulls / total_rows) * 100, 2)
+
+                all_stats.append({
+                    'table_name': table_name,
+                    'column_name': col,
+                    'null_percentage': null_pct,
+                    'distinct_values_count': distincts,
+                    'most_common_values': None,
+                    'most_common_frequencies': None,
+                    'minimum_value': min_val,
+                    'maximum_value': max_val,
+                })
+
+            return Response(RESPONSE_TYPE.TABLE, pd.DataFrame(all_stats))
+
+        except Exception as e:
+            logger.error(f"Exception while fetching statistics for table {table_name}: {e}")
+            # Return empty stats on error
+            return Response(
+                RESPONSE_TYPE.ERROR, 
+                error_message=f"Could not retrieve statistics for table {table_name}: {str(e)}"
+            )


performance: The meta_get_column_statistics_for_table method (lines 253-366) issues a full table scan for every requested table, which can cause severe performance degradation on large ClickHouse tables due to lack of sampling or row limits.
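
As an illustration of the trade-off raised here, the statistics query can be built so that sampling is opt-in. This is a sketch, not the PR's implementation: `quote_ident`, `build_stats_query`, and `null_percentage` are hypothetical names, and the identifier escaping assumes ClickHouse accepts backslash escapes inside backticks. It also uses positional aliases (`nulls_0`, `nulls_1`, ...) rather than aliases derived from the column name, so column names with spaces or punctuation do not produce invalid aliases. Note that `SAMPLE` only works on MergeTree-family tables that were created with a sampling key, and that counts computed on a sample describe the sample rather than the full table.

```python
from typing import List, Optional


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier (assumes backslash escaping inside backticks)."""
    return "`" + name.replace("\\", "\\\\").replace("`", "\\`") + "`"


def build_stats_query(
    database: str, table: str, columns: List[str], sample: Optional[float] = None
) -> str:
    """Build one query that collects null, distinct, min and max stats per column."""
    parts = ["count(*) AS total_rows"]
    for i, col in enumerate(columns):
        c = quote_ident(col)
        parts += [
            f"countIf({c} IS NULL) AS nulls_{i}",
            f"uniq({c}) AS distincts_{i}",
            f"toString(min({c})) AS min_{i}",
            f"toString(max({c})) AS max_{i}",
        ]
    source = f"{quote_ident(database)}.{quote_ident(table)}"
    if sample is not None:
        # Only valid when the table defines a sampling key.
        source += f" SAMPLE {sample}"
    return f"SELECT {', '.join(parts)} FROM {source}"


def null_percentage(nulls: int, total_rows: int) -> Optional[float]:
    """Percentage of NULL values, or None for an empty table."""
    if not total_rows:
        return None
    return round(nulls / total_rows * 100, 2)
```

Because the null count and the row count come from the same (possibly sampled) scan, their ratio stays meaningful under sampling, while the distinct count and min/max become estimates.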

🤖 AI Agent Prompt for Cursor/Windsurf

📋 Copy this prompt to your AI coding assistant (Cursor, Windsurf, etc.) to get help fixing this issue

Optimize the `meta_get_column_statistics_for_table` method in `mindsdb/integrations/handlers/clickhouse_handler/clickhouse_handler.py` (lines 253-366). The current implementation performs a full table scan for statistics, which is extremely slow on large ClickHouse tables. Refactor the code to use ClickHouse's `SAMPLE` clause (e.g., `SAMPLE 0.1`) to compute statistics on a sample of the data, significantly reducing query time and resource usage for large tables. Ensure the method still returns the same structure and handles errors gracefully.

📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change

    def meta_get_column_statistics_for_table(
        self, table_name: str, column_names: Optional[List[str]] = None
    ) -> Response:
        """
        Retrieves column statistics for a specific table, using sampling for large tables to avoid full scans.
        """
        database = self.connection_data['database']
        columns_query = f"""
            SELECT name, type
            FROM system.columns
            WHERE database = '{database}' AND table = '{table_name}'
        """
        if column_names:
            quoted_names = [f"'{c}'" for c in column_names]
            columns_query += f" AND name IN ({','.join(quoted_names)})"
        try:
            columns_result = self.native_query(columns_query)
            if columns_result.resp_type == RESPONSE_TYPE.ERROR or columns_result.data_frame.empty:
                logger.warning(f"No columns found for table {table_name}")
                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame())
            select_parts = []
            for _, row in columns_result.data_frame.iterrows():
                col = row['name']
                select_parts.extend([
                    f"countIf(`{col}` IS NULL) AS nulls_{col}",
                    f"uniq(`{col}`) AS distincts_{col}",
                    f"toString(min(`{col}`)) AS min_{col}",
                    f"toString(max(`{col}`)) AS max_{col}",
                ])
            if not select_parts:
                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame())
            # Use sampling for large tables
            sample_clause = "SAMPLE 0.1"  # 10% sample, adjust as needed
            stats_query = f"""
                SELECT
                    count(*) AS total_rows,
                    {', '.join(select_parts)}
                FROM `{database}`.`{table_name}` {sample_clause}
            """
            stats_result = self.native_query(stats_query)
            if stats_result.resp_type != RESPONSE_TYPE.TABLE or stats_result.data_frame.empty:
                logger.warning(f"Could not retrieve stats for table {table_name}")
                placeholder_data = []
                for _, row in columns_result.data_frame.iterrows():
                    placeholder_data.append({
                        'table_name': table_name,
                        'column_name': row['name'],
                        'null_percentage': None,
                        'distinct_values_count': None,
                        'most_common_values': None,
                        'most_common_frequencies': None,
                        'minimum_value': None,
                        'maximum_value': None,
                    })
                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame(placeholder_data))
            stats_data = stats_result.data_frame.iloc[0]
            total_rows = stats_data.get('total_rows', 0)
            all_stats = []
            for _, row in columns_result.data_frame.iterrows():
                col = row['name']
                nulls = stats_data.get(f'nulls_{col}', 0)
                distincts = stats_data.get(f'distincts_{col}', None)
                min_val = stats_data.get(f'min_{col}', None)
                max_val = stats_data.get(f'max_{col}', None)
                null_pct = None
                if total_rows is not None and total_rows > 0:
                    null_pct = round((nulls / total_rows) * 100, 2)
                all_stats.append({
                    'table_name': table_name,
                    'column_name': col,
                    'null_percentage': null_pct,
                    'distinct_values_count': distincts,
                    'most_common_values': None,
                    'most_common_frequencies': None,
                    'minimum_value': min_val,
                    'maximum_value': max_val,
                })
            return Response(RESPONSE_TYPE.TABLE, pd.DataFrame(all_stats))
        except Exception as e:
            logger.error(f"Exception while fetching statistics for table {table_name}: {e}")
            return Response(
                RESPONSE_TYPE.ERROR,
                error_message=f"Could not retrieve statistics for table {table_name}: {str(e)}"
            )

HamoonDBA · 2025-11-06T21:49:51Z

I have read the CLA Document and I hereby sign the CLA

StpMax · 2026-03-12T11:59:41Z

+            SELECT 
+                name as table_name,
+                database as table_schema,
+                engine as table_type,


better to use 'BASE TABLE' for table_type
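
One way to act on this suggestion is to map the ClickHouse engine name to an information-schema style table type instead of exposing the engine directly. The sketch below is an assumption about how that could look, not the change that was merged: `table_type_for_engine` is a hypothetical helper, and reporting view engines as 'VIEW' goes beyond what the reviewer asked for.

```python
# Engines that ClickHouse uses for views; everything else is treated as a table.
VIEW_ENGINES = {"View", "MaterializedView", "LiveView", "WindowView"}


def table_type_for_engine(engine: str) -> str:
    """Map a system.tables engine value to 'BASE TABLE' or 'VIEW'."""
    return "VIEW" if engine in VIEW_ENGINES else "BASE TABLE"
```

With this mapping, a `MergeTree` table is reported as 'BASE TABLE', which matches the value other handlers typically return for ordinary tables.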

entelligence-ai-pr-reviews Bot reviewed Nov 6, 2025

github-actions Bot added a commit that referenced this pull request Nov 6, 2025

@HamoonDBA has signed the CLA in #11858

a446b09

sejubar and others added 11 commits November 11, 2025 17:52

Fqe 1675 - additional test for mysql api (mindsdb#11797)

8cbbe60

Co-authored-by: andrew <elkin.andr@gmail.com>

Update datastax.mdx (mindsdb#11873)

1c7eb20

Fixed MSSQL Query Execution Result Conversion (mindsdb#11876)

dba3869

Update documentation for ENV Variables (mindsdb#11667)

01a6b37

Co-authored-by: martyna-mindsdb <109554435+martyna-mindsdb@users.noreply.github.com>

[BUG] - Hubspot Handler - CRUD operations (mindsdb#11874)

a4bc712

[feat] Add ClickHouse data catalog

ee70c35

[fix] Update is_nullable column retrieval in ClickHouse metadata query

0ba0bd6

[feat] Implement nullable column check and optimize statistics retrieval in ClickHouse handler

55603d2

[fix] Simplify column statistics retrieval by removing sampling logic

1927698

[fix] update Clickhouse meta handler info

134b2e7

[fix] Format clickhouse_handler with ruff 0.11.11

d4766c0

HamoonDBA force-pushed the feat/clickhouse-data-catalog branch from f6f1f27 to d4766c0 Compare November 13, 2025 06:34

StpMax self-requested a review March 11, 2026 14:17

StpMax changed the base branch from develop to releases/26.1.0 March 12, 2026 11:47

Merge branch 'releases/26.1.0' into feat/clickhouse-data-catalog

01e44e9

StpMax approved these changes Mar 12, 2026

StpMax merged commit c859bf3 into mindsdb:releases/26.1.0 Mar 12, 2026
27 of 31 checks passed

github-actions Bot locked and limited conversation to collaborators Mar 12, 2026

StpMax merged 12 commits into mindsdb:releases/26.1.0 from HamoonDBA:feat/clickhouse-data-catalog

7 participants
