
Add ClickHouse data catalog #11858

Merged
StpMax merged 12 commits into mindsdb:releases/26.1.0 from HamoonDBA:feat/clickhouse-data-catalog
Mar 12, 2026

@HamoonDBA
Contributor

Description

This PR adds Data Catalog support for the ClickHouse handler, enabling AI agents to automatically read and utilize table and column metadata (including column comments) when generating SQL queries.

Key Changes:

  • Extended ClickHouseHandler to inherit from MetaDatabaseHandler
  • Implemented all required Data Catalog methods to read metadata from ClickHouse system tables
  • Added support for reading column comments set via ALTER TABLE ... MODIFY COLUMN ... COMMENT

Benefits:

  • Agents can now understand column descriptions and enum values automatically
  • Better SQL query generation for ClickHouse data sources
  • Consistent Data Catalog experience across different database handlers

Type of change

  • ⚡ New feature (non-breaking change which adds functionality)
  • 📄 This change requires a documentation update

Verification Process

To ensure the changes are working as expected:

Prerequisites:

  1. Enable Data Catalog in config.json:
{
    "data_catalog": {
        "enabled": true
    }
}
  2. Set up a ClickHouse connection and add column comments:
-- In ClickHouse
ALTER TABLE test_table 
MODIFY COLUMN status String COMMENT 'Status enum: pending, completed, cancelled';

Test Location:

  • MindsDB instance with ClickHouse integration
  • Access via http://localhost:47334

Verification Steps:

  1. Connect ClickHouse to MindsDB:
CREATE DATABASE clickhouse_conn
WITH ENGINE = 'clickhouse',
PARAMETERS = {
    "host": "your_host",
    "port": 9000,
    "user": "default",
    "password": "password",
    "database": "your_database"
};
  2. Create a text2sql skill:
CREATE SKILL clickhouse_skill
USING
    type = 'text2sql',
    database = 'clickhouse_conn',
    tables = ['test_table'],
    description = 'Test table with commented columns';
  3. Create an agent to trigger Data Catalog generation:
CREATE AGENT test_agent
USING
    model = 'gpt-4',
    skills = ['clickhouse_skill'];
  4. Verify column metadata is captured:
SELECT * FROM INFORMATION_SCHEMA.META_COLUMNS 
WHERE TABLE_SCHEMA = 'clickhouse_conn'
AND TABLE_NAME = 'test_table';

Expected: The COLUMN_DESCRIPTION field should contain the comments you set in ClickHouse.

  5. Verify table metadata:
SELECT * FROM INFORMATION_SCHEMA.META_TABLES 
WHERE TABLE_SCHEMA = 'clickhouse_conn';
  6. Verify column statistics (optional, may be slow for large tables):
SELECT * FROM INFORMATION_SCHEMA.META_COLUMN_STATISTICS 
WHERE TABLE_SCHEMA = 'clickhouse_conn'
AND TABLE_NAME = 'test_table';
  7. Test with an agent query:
    Ask the agent a natural language question about your ClickHouse data and verify it generates accurate SQL using the column descriptions.
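
The metadata check in step 4 could also be scripted. The following is a minimal sketch, assuming the MindsDB HTTP API accepts SQL at `/api/sql/query` as a JSON body `{"query": "..."}` and returns `column_names` and `data` fields; the endpoint path, the response shape, and the helper names are assumptions for illustration and are not part of this PR.

```python
import json
import urllib.request

# Assumption: MindsDB listens on the address given under "Test Location"
# and exposes a SQL endpoint at /api/sql/query.
MINDSDB_URL = "http://localhost:47334/api/sql/query"

META_COLUMNS_QUERY = (
    "SELECT * FROM INFORMATION_SCHEMA.META_COLUMNS "
    "WHERE TABLE_SCHEMA = 'clickhouse_conn' AND TABLE_NAME = 'test_table'"
)


def build_request(query, url=MINDSDB_URL):
    """Build (but do not send) a POST request carrying the SQL query as JSON."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def columns_missing_description(column_names, rows):
    """Return the names of columns whose COLUMN_DESCRIPTION is empty or NULL."""
    lowered = [name.lower() for name in column_names]
    name_idx = lowered.index("column_name")
    desc_idx = lowered.index("column_description")
    return [row[name_idx] for row in rows if not row[desc_idx]]


def run_check():
    """Send the query and list columns that did not pick up a comment.

    Assumes the response JSON has 'column_names' and 'data' keys.
    """
    with urllib.request.urlopen(build_request(META_COLUMNS_QUERY), timeout=30) as resp:
        payload = json.load(resp)
    return columns_missing_description(payload["column_names"], payload["data"])
```

Calling `run_check()` against a running instance would return an empty list if every column of `test_table` carries a description.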

Additional Media:

  • I have attached a brief loom video or screenshots showcasing the new functionality or change.

Checklist:

  • My code follows the style guidelines (PEP 8) of MindsDB.
  • I have appropriately commented on my code, especially in complex areas.
  • Necessary documentation updates are either made or tracked in issues.
  • Relevant unit and integration tests are updated or added.

Implementation Details

Methods Implemented:

  1. meta_get_tables(): Reads table metadata from system.tables

    • Table name, schema, type (engine), description, row count
  2. meta_get_columns(): Reads column metadata from system.columns

    • Column name, data type, column comments/descriptions, default values, nullable status
  3. meta_get_column_statistics(): Computes statistics by querying tables

    • NULL percentage, distinct value count, min/max values
  4. meta_get_primary_keys(): Reads primary key info from system.columns

  5. meta_get_foreign_keys(): Returns empty DataFrame (ClickHouse doesn't support FK constraints)

Notes:

  • Column statistics may be slow for large tables as it requires scanning the data
  • Users can set column descriptions in ClickHouse using:
    ALTER TABLE table_name MODIFY COLUMN column_name Type COMMENT 'description';
  • This implementation follows the same pattern as existing handlers (MySQL, PostgreSQL, SQL Server)
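
For users who want to set many column comments at once, the `ALTER TABLE` statement from the notes above can be generated programmatically. This is a minimal sketch: `build_column_comment_sql` and the two quoting helpers are hypothetical names, and the backslash-based escaping is an assumption about what is sufficient for typical identifiers and comments.

```python
def quote_identifier(name):
    """Wrap an identifier in backticks, escaping backslashes and backticks."""
    return "`" + name.replace("\\", "\\\\").replace("`", "\\`") + "`"


def quote_string(value):
    """Wrap a string literal in single quotes, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_column_comment_sql(table, column, column_type, comment):
    """Build an ALTER TABLE ... MODIFY COLUMN ... COMMENT statement.

    column_type is inserted as-is, so it must be a trusted ClickHouse type
    string such as 'String' or 'Nullable(Int32)'.
    """
    return (
        f"ALTER TABLE {quote_identifier(table)} "
        f"MODIFY COLUMN {quote_identifier(column)} {column_type} "
        f"COMMENT {quote_string(comment)};"
    )


print(build_column_comment_sql(
    "test_table", "status", "String",
    "Status enum: pending, completed, cancelled",
))
```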

Documentation TODO:

  • Add ClickHouse to the list of Data Catalog supported integrations in /docs/data_catalog/integrations/overview.mdx
  • Create usage example showing how to set column comments in ClickHouse

@github-actions

github-actions Bot commented Nov 6, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@entelligence-ai-pr-reviews
Contributor

🔒 Entelligence AI Vulnerability Scanner

No security vulnerabilities found!

Your code passed our comprehensive security analysis.

📊 Files Analyzed: 1 files


Comment on lines +222 to +225
type as data_type,
comment as column_description,
default_expression as column_default,
CASE WHEN is_in_primary_key = 1 THEN 0 ELSE 1 END as is_nullable
Contributor

correctness: is_nullable in meta_get_columns is set based on is_in_primary_key, which is incorrect and will mislabel non-primary key columns as nullable even if they are NOT NULL.

🤖 AI Agent Prompt for Cursor/Windsurf

📋 Copy this prompt to your AI coding assistant (Cursor, Windsurf, etc.) to get help fixing this issue

In mindsdb/integrations/handlers/clickhouse_handler/clickhouse_handler.py, lines 222-225, the `meta_get_columns` method incorrectly sets `is_nullable` using `CASE WHEN is_in_primary_key = 1 THEN 0 ELSE 1 END`, which does not accurately reflect column nullability. Replace this with `is_nullable as is_nullable` to correctly report the nullability status from ClickHouse system columns.
📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
comment as column_description,
default_expression as column_default,
is_nullable as is_nullable
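
One caveat on the suggestion above: as far as I am aware, `system.columns` does not expose an `is_nullable` column (that name appears in ClickHouse's `information_schema.columns` view), so this is worth checking against the target ClickHouse version. In ClickHouse, nullability is part of the type itself, for example `Nullable(String)`. The sketch below derives the flag from the `type` string; the function name is hypothetical, and treating `LowCardinality(Nullable(...))` as nullable is an assumption about the desired behavior.

```python
def is_nullable_type(type_name):
    """Return True if a ClickHouse type string denotes a nullable column.

    Only the outermost wrapper matters: Array(Nullable(Int32)) is an array
    that cannot itself be NULL, so it is reported as not nullable.
    """
    t = type_name.strip()
    # LowCardinality may wrap a Nullable type; unwrap one level.
    if t.startswith("LowCardinality(") and t.endswith(")"):
        t = t[len("LowCardinality("):-1].strip()
    return t.startswith("Nullable(")


for example in ("String", "Nullable(String)", "LowCardinality(Nullable(String))"):
    print(example, is_nullable_type(example))
```

The same check could be done in SQL with a pattern match on the `type` column, such as `type LIKE 'Nullable(%'`.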

Comment on lines +253 to +366
    def meta_get_column_statistics_for_table(
        self, table_name: str, column_names: Optional[List[str]] = None
    ) -> Response:
        """
        Retrieves column statistics for a specific table.

        Args:
            table_name (str): The name of the table.
            column_names (Optional[List[str]]): List of column names to retrieve statistics for.
                If None, statistics for all columns will be returned.

        Returns:
            Response: A response object containing the column statistics for the table.
        """
        database = self.connection_data['database']

        # Get the list of columns for this table
        columns_query = f"""
            SELECT name, type
            FROM system.columns
            WHERE database = '{database}' AND table = '{table_name}'
        """

        if column_names:
            quoted_names = [f"'{c}'" for c in column_names]
            columns_query += f" AND name IN ({','.join(quoted_names)})"

        try:
            columns_result = self.native_query(columns_query)

            if columns_result.resp_type == RESPONSE_TYPE.ERROR or columns_result.data_frame.empty:
                logger.warning(f"No columns found for table {table_name}")
                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame())

            # Build statistics query - collect all stats in one query
            select_parts = []
            for _, row in columns_result.data_frame.iterrows():
                col = row['name']
                # Use backticks to handle special characters in column names
                select_parts.extend([
                    f"countIf(`{col}` IS NULL) AS nulls_{col}",
                    f"uniq(`{col}`) AS distincts_{col}",
                    f"toString(min(`{col}`)) AS min_{col}",
                    f"toString(max(`{col}`)) AS max_{col}",
                ])

            if not select_parts:
                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame())

            # Build the query to get stats for all columns at once
            stats_query = f"""
                SELECT
                    count(*) AS total_rows,
                    {', '.join(select_parts)}
                FROM `{database}`.`{table_name}`
            """

            stats_result = self.native_query(stats_query)

            if stats_result.resp_type != RESPONSE_TYPE.TABLE or stats_result.data_frame.empty:
                logger.warning(f"Could not retrieve stats for table {table_name}")
                # Return placeholder stats
                placeholder_data = []
                for _, row in columns_result.data_frame.iterrows():
                    placeholder_data.append({
                        'table_name': table_name,
                        'column_name': row['name'],
                        'null_percentage': None,
                        'distinct_values_count': None,
                        'most_common_values': None,
                        'most_common_frequencies': None,
                        'minimum_value': None,
                        'maximum_value': None,
                    })
                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame(placeholder_data))

            # Parse the stats result
            stats_data = stats_result.data_frame.iloc[0]
            total_rows = stats_data.get('total_rows', 0)

            # Build the final statistics DataFrame
            all_stats = []
            for _, row in columns_result.data_frame.iterrows():
                col = row['name']
                nulls = stats_data.get(f'nulls_{col}', 0)
                distincts = stats_data.get(f'distincts_{col}', None)
                min_val = stats_data.get(f'min_{col}', None)
                max_val = stats_data.get(f'max_{col}', None)

                # Calculate null percentage
                null_pct = None
                if total_rows is not None and total_rows > 0:
                    null_pct = round((nulls / total_rows) * 100, 2)

                all_stats.append({
                    'table_name': table_name,
                    'column_name': col,
                    'null_percentage': null_pct,
                    'distinct_values_count': distincts,
                    'most_common_values': None,
                    'most_common_frequencies': None,
                    'minimum_value': min_val,
                    'maximum_value': max_val,
                })

            return Response(RESPONSE_TYPE.TABLE, pd.DataFrame(all_stats))

        except Exception as e:
            logger.error(f"Exception while fetching statistics for table {table_name}: {e}")
            # Return empty stats on error
            return Response(
                RESPONSE_TYPE.ERROR,
                error_message=f"Could not retrieve statistics for table {table_name}: {str(e)}"
            )
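
A side note on the method above: the column reference is backtick-quoted, but the aliases (`nulls_{col}` and so on) are not, so a column name containing spaces or punctuation would likely produce invalid SQL. One possible way around this, sketched here under the assumption that aliases only need to be unique within the query, is to derive them from the column's position rather than its name. `build_stats_select` and `quote_identifier` are hypothetical helpers, not part of the handler.

```python
def quote_identifier(name):
    """Wrap an identifier in backticks, escaping backslashes and backticks."""
    return "`" + name.replace("\\", "\\\\").replace("`", "\\`") + "`"


def build_stats_select(columns):
    """Build SELECT expressions with index-based aliases.

    Returns (select_parts, aliases), where aliases maps each column name to
    the alias used for each statistic, so results can be read back by name.
    """
    select_parts = []
    aliases = {}
    for i, col in enumerate(columns):
        quoted = quote_identifier(col)
        aliases[col] = {
            "nulls": f"nulls_{i}",
            "distincts": f"distincts_{i}",
            "min": f"min_{i}",
            "max": f"max_{i}",
        }
        select_parts.extend([
            f"countIf({quoted} IS NULL) AS nulls_{i}",
            f"uniq({quoted}) AS distincts_{i}",
            f"toString(min({quoted})) AS min_{i}",
            f"toString(max({quoted})) AS max_{i}",
        ])
    return select_parts, aliases


parts, aliases = build_stats_select(["status", "order id"])
print(",\n".join(parts))
```

When parsing the result row, the handler would then look up `aliases[col]["nulls"]` instead of formatting the column name into the key.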
Contributor

performance: The meta_get_column_statistics_for_table method (lines 253-366) issues a full table scan for every requested table, which can cause severe performance degradation on large ClickHouse tables due to lack of sampling or row limits.

🤖 AI Agent Prompt for Cursor/Windsurf

📋 Copy this prompt to your AI coding assistant (Cursor, Windsurf, etc.) to get help fixing this issue

Optimize the `meta_get_column_statistics_for_table` method in `mindsdb/integrations/handlers/clickhouse_handler/clickhouse_handler.py` (lines 253-366). The current implementation performs a full table scan for statistics, which is extremely slow on large ClickHouse tables. Refactor the code to use ClickHouse's `SAMPLE` clause (e.g., `SAMPLE 0.1`) to compute statistics on a sample of the data, significantly reducing query time and resource usage for large tables. Ensure the method still returns the same structure and handles errors gracefully.
📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change
    def meta_get_column_statistics_for_table(
        self, table_name: str, column_names: Optional[List[str]] = None
    ) -> Response:
        """
        Retrieves column statistics for a specific table, using sampling for large tables to avoid full scans.
        """
        database = self.connection_data['database']
        columns_query = f"""
            SELECT name, type
            FROM system.columns
            WHERE database = '{database}' AND table = '{table_name}'
        """
        if column_names:
            quoted_names = [f"'{c}'" for c in column_names]
            columns_query += f" AND name IN ({','.join(quoted_names)})"
        try:
            columns_result = self.native_query(columns_query)
            if columns_result.resp_type == RESPONSE_TYPE.ERROR or columns_result.data_frame.empty:
                logger.warning(f"No columns found for table {table_name}")
                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame())
            select_parts = []
            for _, row in columns_result.data_frame.iterrows():
                col = row['name']
                select_parts.extend([
                    f"countIf(`{col}` IS NULL) AS nulls_{col}",
                    f"uniq(`{col}`) AS distincts_{col}",
                    f"toString(min(`{col}`)) AS min_{col}",
                    f"toString(max(`{col}`)) AS max_{col}",
                ])
            if not select_parts:
                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame())
            # Use sampling for large tables
            sample_clause = "SAMPLE 0.1"  # 10% sample, adjust as needed
            stats_query = f"""
                SELECT
                    count(*) AS total_rows,
                    {', '.join(select_parts)}
                FROM `{database}`.`{table_name}` {sample_clause}
            """
            stats_result = self.native_query(stats_query)
            if stats_result.resp_type != RESPONSE_TYPE.TABLE or stats_result.data_frame.empty:
                logger.warning(f"Could not retrieve stats for table {table_name}")
                placeholder_data = []
                for _, row in columns_result.data_frame.iterrows():
                    placeholder_data.append({
                        'table_name': table_name,
                        'column_name': row['name'],
                        'null_percentage': None,
                        'distinct_values_count': None,
                        'most_common_values': None,
                        'most_common_frequencies': None,
                        'minimum_value': None,
                        'maximum_value': None,
                    })
                return Response(RESPONSE_TYPE.TABLE, pd.DataFrame(placeholder_data))
            stats_data = stats_result.data_frame.iloc[0]
            total_rows = stats_data.get('total_rows', 0)
            all_stats = []
            for _, row in columns_result.data_frame.iterrows():
                col = row['name']
                nulls = stats_data.get(f'nulls_{col}', 0)
                distincts = stats_data.get(f'distincts_{col}', None)
                min_val = stats_data.get(f'min_{col}', None)
                max_val = stats_data.get(f'max_{col}', None)
                null_pct = None
                if total_rows is not None and total_rows > 0:
                    null_pct = round((nulls / total_rows) * 100, 2)
                all_stats.append({
                    'table_name': table_name,
                    'column_name': col,
                    'null_percentage': null_pct,
                    'distinct_values_count': distincts,
                    'most_common_values': None,
                    'most_common_frequencies': None,
                    'minimum_value': min_val,
                    'maximum_value': max_val,
                })
            return Response(RESPONSE_TYPE.TABLE, pd.DataFrame(all_stats))
        except Exception as e:
            logger.error(f"Exception while fetching statistics for table {table_name}: {e}")
            return Response(
                RESPONSE_TYPE.ERROR,
                error_message=f"Could not retrieve statistics for table {table_name}: {str(e)}"
            )
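
The suggestion above applies `SAMPLE 0.1` unconditionally. In ClickHouse, `SAMPLE` is only accepted on MergeTree-family tables that were created with a `SAMPLE BY` key, so the query would be expected to fail on other tables. A minimal sketch of a guard follows, assuming the caller has already read the table's `sampling_key` value from `system.tables`; the function name and parameters are hypothetical.

```python
def build_stats_query(database, table, select_parts, sampling_key, sample_ratio=0.1):
    """Build the statistics query, adding SAMPLE only when the table supports it.

    sampling_key is assumed to come from system.tables.sampling_key; an empty
    value means the table has no SAMPLE BY clause and must be scanned fully.
    """
    sample_clause = f" SAMPLE {sample_ratio}" if sampling_key else ""
    return (
        "SELECT count(*) AS total_rows, "
        + ", ".join(select_parts)
        + f" FROM `{database}`.`{table}`{sample_clause}"
    )


parts = ["countIf(`status` IS NULL) AS nulls_0"]
print(build_stats_query("db", "events", parts, sampling_key="intHash32(user_id)"))
print(build_stats_query("db", "events", parts, sampling_key=""))
```

With sampling enabled, `total_rows` counts only the sampled rows, so the null percentage stays a valid ratio, while the distinct count and min/max values become approximations.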

@HamoonDBA
Contributor Author

I have read the CLA Document and I hereby sign the CLA

github-actions Bot added a commit that referenced this pull request Nov 6, 2025
@HamoonDBA HamoonDBA force-pushed the feat/clickhouse-data-catalog branch from f6f1f27 to d4766c0 Compare November 13, 2025 06:34
@StpMax StpMax self-requested a review March 11, 2026 14:17
@StpMax StpMax changed the base branch from develop to releases/26.1.0 March 12, 2026 11:47
SELECT
name as table_name,
database as table_schema,
engine as table_type,
Collaborator

better to use 'BASE TABLE' for table_type
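
One way to act on this comment is to normalize the ClickHouse engine name before returning it as `table_type`. The sketch below is an illustration only: the reviewer asks for 'BASE TABLE', while mapping view engines to 'VIEW' and the exact set of engine names are assumptions on my part.

```python
# Assumption: these ClickHouse engines represent views rather than stored tables.
VIEW_ENGINES = {"View", "MaterializedView", "LiveView", "WindowView"}


def normalize_table_type(engine):
    """Map a ClickHouse engine name to an information_schema-style table type."""
    return "VIEW" if engine in VIEW_ENGINES else "BASE TABLE"


for engine in ("MergeTree", "ReplacingMergeTree", "View"):
    print(engine, "->", normalize_table_type(engine))
```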

@StpMax StpMax merged commit c859bf3 into mindsdb:releases/26.1.0 Mar 12, 2026
27 of 31 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Mar 12, 2026


7 participants