Skip to content

Feat: Add starrocks catalog support#8856

Merged
mscolnick merged 14 commits into
marimo-team:mainfrom
chris-celerdata:add-starrocks
Apr 1, 2026
Merged

Feat: Add starrocks catalog support#8856
mscolnick merged 14 commits into
marimo-team:mainfrom
chris-celerdata:add-starrocks

Conversation

@chris-celerdata
Copy link
Copy Markdown
Contributor

📝 Summary

Implements #8765.

🔍 Description of Changes

This PR introduces a new engine which extends SQLAlchemyEngine. When possible (e.g. working with default_catalog), the inherited methods are used; otherwise special "external" methods are used.

Primary overridden functions:

  • get_databases
  • get_tables_in_schema
  • get_table_details

Before
image

After
image

Notes

Integration testing against a real StarRocks instance is not present. There is no "in-memory" equivalent like what DuckDB and SQLite have.

image

Arrays of variable types default greedily to their inner types. There is currently no support for showing complex types with different icons from a standard string (i.e. ARRAY<INT> as not simply an integer). I'm not sure if this is relevant, but I thought I would mention it anyway.

📋 Checklist

  • I have read the contributor guidelines.
  • For large changes, or changes that affect the public API: this change was discussed or approved through an issue, on Discord, or the community discussions (Please provide a link if applicable).
  • Tests have been added for the changes made.
  • Documentation has been updated where applicable, including docstrings for API changes.
  • Pull request title is a good summary of the changes - it will be used in the release notes.

@vercel
Copy link
Copy Markdown

vercel Bot commented Mar 24, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
marimo-docs Ready Ready Preview, Comment Apr 1, 2026 7:42pm

Request Review

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 24, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@chris-celerdata
Copy link
Copy Markdown
Contributor Author

I have read the CLA Document and I hereby sign the CLA

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds first-class StarRocks support to marimo’s SQL datasource discovery by introducing a StarRocksEngine that can enumerate catalogs and introspect external-catalog objects via SHOW .../DESC SQL when SQLAlchemy’s inspector can’t.

Changes:

  • Implement StarRocksEngine(SQLAlchemyEngine) with multi-catalog discovery and external-catalog fallbacks.
  • Extend SQL identifier quoting to support the starrocks dialect (backtick style) and add related tests.
  • Register the new engine in engine detection and add StarRocks optional test dependency / dependency manager entry.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
marimo/_sql/engines/starrocks.py New engine implementing StarRocks catalog/database/table discovery with SQL fallbacks for external catalogs.
marimo/_sql/get_engines.py Adds StarRocksEngine to SUPPORTED_ENGINES so it’s selected before generic SQLAlchemyEngine.
marimo/_sql/sql_quoting.py Adds starrocks to backtick-quoting dialects.
marimo/_sql/sql.py Updates unsupported-engine error message to mention StarRocks.
marimo/_dependencies/dependencies.py Adds DependencyManager.starrocks.
pyproject.toml Adds starrocks to test-optional and bans module-level starrocks imports.
tests/_sql/test_starrocks.py New unit tests for StarRocks engine behavior using mocks.
tests/_sql/test_sql_quoting.py Adds StarRocks quoting/roundtrip cases and qualified-name quoting case.
tests/_sql/test_get_engines.py Adds StarRocks engine selection test and datasource connection mapping assertion.

Comment thread marimo/_sql/sql_quoting.py
Comment thread marimo/_sql/engines/starrocks.py Outdated
Comment thread marimo/_sql/engines/starrocks.py Outdated
Comment thread tests/_sql/test_starrocks.py Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 24, 2026

Bundle Report

Changes will increase total bundle size by 559 bytes (0.0%) ⬆️. This is within the configured threshold ✅

Detailed changes
Bundle name Size Change
marimo-esm 24.81MB 559 bytes (0.0%) ⬆️

Affected Assets, Files, and Routes:

view changes for bundle: marimo-esm

Assets Changed:

Asset Name Size Change Total Size Change (%)
assets/ai-*.js 555 bytes 248.9kB 0.22%
assets/markdown-*.js 4 bytes 5.12kB 0.08%

@dmadisetti dmadisetti added the enhancement New feature or request label Mar 24, 2026
@dmadisetti dmadisetti linked an issue Mar 25, 2026 that may be closed by this pull request
1 task
@Light2Dark
Copy link
Copy Markdown
Collaborator

Light2Dark commented Mar 25, 2026

hi @chris-celerdata, this sounds similar to an issue with other multi-db engines like Snowflake. There is a start of the fix here #8824

Would you mind having a look and seeing if that direction would fix this problem universally, rather than custom support for Starrocks? If not, we can also have a look (wish there was Starrocks cloud haha)

@chris-celerdata
Copy link
Copy Markdown
Contributor Author

chris-celerdata commented Mar 25, 2026

There is a StarRocks cloud. It can also be deployed on Docker. If you want to work with external catalogs, the documentation can be found here. However, the default_catalog being separated in the sidebar should be sufficient for testing.

I took a look at the other PR and it does not fix the problem; SQLAlchemyEngine is essentially untouched and the primary problem is that get_schemas and get_databases expect a singular database. The current implementation of get_databases in sqlalchemy.pywill only return one or zero databases. There is not an easy fix because SQLAlchemy does not have a standardized method for multi-database introspection (i.e. get_database_names or get_catalog_names). Any ideas for a more general approach would be appreciated!

Note: I considered doing something similar to get_default_database. In this function, there is a block that defines what SQL to call.

dialect_queries = {
    "postgresql": "SELECT current_database()",
    "mssql": "SELECT DB_NAME()",
    "timeplus": "SELECT current_database()",
}

This approach could be duplicated into get_databases for the relevant dialects. Would this be a better approach than a new engine?

@Light2Dark
Copy link
Copy Markdown
Collaborator

I think that's the eventual direction. That PR decouples get_schema, so eventually we can call get_databases and get all databases connected.

I'm okay to have this too, but there'll be some code drift once we merge that. In the meantime, would you want to create the frontend UI changes for adding a starrocks connection?

@chris-celerdata
Copy link
Copy Markdown
Contributor Author

I'd love to add StarRocks to the UI. I was referencing #8169, which talked about keeping the UI uncluttered. Has that changed?
Also, to be clear, you are talking about adding StarRocks to this dialog?
image

@Light2Dark
Copy link
Copy Markdown
Collaborator

Light2Dark commented Mar 27, 2026

@chris-celerdata , yeah. But I'm checking with the team on if we're okay to add to the UI.

The PR referenced #8824 has been merged, does this change this PR? Sorry for the back and forth on this

@chris-celerdata
Copy link
Copy Markdown
Contributor Author

I have pushed a commit with changes based on the merge. Any comments would be appreciated!

@Light2Dark
Copy link
Copy Markdown
Collaborator

Thank you! This is likely to get merged in soon, which is closer to global solution we talked about.
#8920

@Light2Dark
Copy link
Copy Markdown
Collaborator

Hey @chris-celerdata! We're good to add StarRocks in the UI if you're still down.

Copy link
Copy Markdown
Contributor

@mscolnick mscolnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry approved too early. i think it makes sense to pull in #8920 to simplify some logic.

we can add the UI changes in another PR / followup

@chris-celerdata
Copy link
Copy Markdown
Contributor Author

I have merged the main branch into this branch and made the changes to sqlalchemy.py in order for StarRocks to work. #8920 does simplify the logic. It should be ready for review!

Copy link
Copy Markdown
Contributor

@mscolnick mscolnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

marimo/_sql/engines/sqlalchemy.py:170

  • _get_inspector executes SET CATALOG ... on a pooled connection but never restores the previous catalog. That session-level state can leak to later operations that reuse the same DBAPI connection (e.g., queries/introspection happening after discovery may run against the last catalog visited). Consider capturing the current catalog at entry and restoring it in a finally block, or otherwise resetting connection state before returning it to the pool.
        from sqlalchemy import inspect, text

        _use_database_dialect_command: dict[str, str] = {
            "snowflake": f"USE DATABASE {self._quote_identifier(database)}",
            "starrocks": f"SET CATALOG {self._quote_identifier(database)}",
        }
        dialect_command = _use_database_dialect_command.get(self.dialect)

        if dialect_command is not None:
            with self._connection.connect() as connection:
                connection.execute(text(dialect_command))

Comment on lines 257 to 263
database_name: Optional[str] = None
dialect_queries = {
"postgresql": "SELECT current_database()",
"mssql": "SELECT DB_NAME()",
"timeplus": "SELECT current_database()",
"starrocks": "SELECT CATALOG()",
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For StarRocks, get_default_database() adds a SELECT CATALOG() fallback, but earlier in the method it returns self._connection.url.database when present. On StarRocks URLs, that value is typically the database/schema, not the catalog, which can make default_database inconsistent with _get_database_names() (catalog list) and break default highlighting/qualification in the UI. Consider skipping the URL-database fast path when dialect == 'starrocks' and always resolving the default catalog via CATALOG() (leaving the URL database to default_schema).

Copilot uses AI. Check for mistakes.
Comment on lines +366 to +367
return [str(row[0]) for row in result.fetchall()]

Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_get_starrocks_database_names() assumes the catalog name is always in row[0]. To make this more robust across dialect/driver variations, consider selecting the column by name (using result.keys() like the Snowflake implementation) and raising a clear error if the expected column is missing.

Suggested change
return [str(row[0]) for row in result.fetchall()]
columns = list(result.keys())
# StarRocks generally returns a "Catalog" column for SHOW CATALOGS,
# but we defensively support a few plausible variants.
candidate_columns = ("Catalog", "catalog_name", "name")
name_col_index: Optional[int] = None
for col_name in candidate_columns:
if col_name in columns:
name_col_index = columns.index(col_name)
break
if name_col_index is None:
raise RuntimeError(
"Unexpected SHOW CATALOGS result: expected one of "
f"{candidate_columns!r} columns, but got {columns!r}"
)
return [str(row[name_col_index]) for row in result.fetchall()]

Copilot uses AI. Check for mistakes.
Comment on lines 356 to +383
@@ -361,8 +376,11 @@
Returns a single-element list with the default database when
the dialect has no dedicated discovery mechanism.
"""
if self.dialect.lower() == "snowflake":
dialect = self.dialect.lower()
if dialect == "snowflake":
return self._get_snowflake_database_names()
if dialect == "starrocks":
return self._get_starrocks_database_names()
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The StarRocks-specific paths (SET CATALOG in _get_inspector, SHOW CATALOGS discovery, and the starrocks branch in _get_database_names) aren’t covered by existing SQLAlchemyEngine tests. Since this file already has unit tests (including Snowflake mocking), consider adding a mocked StarRocks test to assert the emitted SQL and that the discovery branch is exercised.

Copilot uses AI. Check for mistakes.
Comment on lines +356 to +361
def _get_starrocks_database_names(self) -> list[str]:
"""Get catalog names for StarRocks via 'SHOW CATALOGS'.

StarRocks uses a three-level hierarchy (Catalog → Database → Table)
which maps to marimo's (Database → Schema → Table).
"""
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description says it “introduces a new engine which extends SQLAlchemyEngine”, but there doesn’t appear to be a StarRocksEngine implementation in the codebase; instead the StarRocks behavior is implemented directly in SQLAlchemyEngine. Please either update the PR description to match the implementation or add the intended engine subclass + registration (if that’s still the plan).

Copilot uses AI. Check for mistakes.
boto3 = Dependency("boto3")

redshift_connector = Dependency("redshift_connector")
starrocks = Dependency("starrocks")
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DependencyManager.starrocks is introduced but appears unused (no references in the repo). If StarRocks isn’t being explicitly dependency-gated anywhere, consider removing this entry; otherwise, add the corresponding require_many/feature checks so this dependency record has an effect.

Suggested change
starrocks = Dependency("starrocks")

Copilot uses AI. Check for mistakes.
Comment thread pyproject.toml
"clickhouse-connect>=0.8.18",
"redshift-connector[full]>=2.1.7",
# For testing starrocks
"starrocks>=1.3.0",
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

starrocks>=1.3.0 is added to the test-optional dependency group, but the current test changes don’t import or require the StarRocks package (quoting tests are pure string logic). If there aren’t additional StarRocks tests elsewhere in the PR, consider removing this optional dependency until it’s needed, to keep the optional test environment lighter.

Suggested change
"starrocks>=1.3.0",

Copilot uses AI. Check for mistakes.
Comment thread marimo/_sql/sql.py
else:
raise ValueError(
"Unsupported engine. Must be a SQLAlchemy, Ibis, Clickhouse, DuckDB, Redshift or DBAPI 2.0 compatible engine."
"Unsupported engine. Must be a SQLAlchemy, Ibis, Clickhouse, DuckDB, Redshift, StarRocks or DBAPI 2.0 compatible engine."
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated error message lists StarRocks as a separate supported engine type, but StarRocks connections are still surfaced via the generic SQLAlchemy engine in SUPPORTED_ENGINES. Consider rewording to avoid implying a distinct engine class (e.g., “SQLAlchemy (including StarRocks dialect)” or similar).

Suggested change
"Unsupported engine. Must be a SQLAlchemy, Ibis, Clickhouse, DuckDB, Redshift, StarRocks or DBAPI 2.0 compatible engine."
"Unsupported engine. Must be a SQLAlchemy (including StarRocks dialect), Ibis, Clickhouse, DuckDB, Redshift, or DBAPI 2.0 compatible engine."

Copilot uses AI. Check for mistakes.
@mscolnick mscolnick merged commit 14e0e13 into marimo-team:main Apr 1, 2026
47 of 51 checks passed
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 1, 2026

🚀 Development release published. You may be able to view the changes at https://marimo.app?v=0.22.1-dev22

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

New Database: StarRocks

5 participants