Text-to-SQL (NLSQL): Help agents turn natural language into SQL queries #762
Merged
40 commits
f648a33
Text-to-SQL: Help agents turn natural language into SQL queries
amotl a31852c
Text-to-SQL: Remove matters about text embeddings, NLSQL doesn't need it
amotl 53c242e
Text-to-SQL: Make LLM instance name configurable, for Azure
amotl 31981cd
Text-to-SQL: CLI improvements
amotl e7e5ab6
Text-to-SQL: Software tests
amotl 831bb83
Text-to-SQL: Naming things
amotl af3a342
Text-to-SQL: Implement suggestions by CodeRabbit
amotl 05c9561
Text-to-SQL: Add Anthropic provider
amotl 99344cd
Text-to-SQL: Copy editing. Suggestions by CodeRabbit.
amotl b2f881e
Text-to-SQL: Add Mistral provider
amotl aa5d5e3
Text-to-SQL: Copy editing. Suggestions by CodeRabbit.
amotl 4f3b412
Text-to-SQL: Add Hugging Face API provider
amotl f668641
Text-to-SQL: Improve disabling embeddings per context manager
amotl 57d5896
Text-to-SQL: Add Google API provider
amotl 7561394
Text-to-SQL: Improve logging and documentation
amotl 5c3f196
Text-to-SQL: Add llamafile provider
amotl 7999a6e
Text-to-SQL: Add Amazon Bedrock (Converse) provider
amotl 9463166
Text-to-SQL: Add rungpt provider (experimental)
amotl 414dc2d
Text-to-SQL: Refactoring
amotl 0124ca8
Text-to-SQL: Add Runpod serverless provider
amotl 92998d3
Text-to-SQL: Add OpenRouter provider
amotl 5dc27bb
Text-to-SQL: Implement suggestions by CodeRabbit
amotl a53fcfd
Text-to-SQL: Migrate to `llama-index-llms-google-genai` package
amotl f7a6149
Text-to-SQL: Pin models for integration tests. Use `gpt-4o-mini`.
amotl d2c8489
Text-to-SQL: Software tests with OpenRouter
amotl ee8e19b
Text-to-SQL: Separate software tests into different CI workflow
amotl 85f3a53
Text-to-SQL: Only permit SELECT statements by default (sqlgate)
amotl aaa4cd2
Text-to-SQL: Implement suggestions by CodeRabbit
amotl e56a719
Text-to-SQL: Copy editing. This and that.
amotl b8c0325
Text-to-SQL: Improve documentation
amotl fc0d32d
Text-to-SQL: This and that
amotl dcf7b18
Text-to-SQL: Implement suggestions by CodeRabbit
amotl 76f67da
Text-to-SQL: Remove RunGPT
amotl 4aea728
Text-to-SQL: Rename Hugging Face provider identifier
amotl 71467ef
Text-to-SQL: Improve documentation
amotl 01c2661
Text-to-SQL: Add more examples
amotl 8c7a9f7
Text-to-SQL: Implement suggestions by CodeRabbit
amotl 78f9697
Text-to-SQL: Implement suggestions by CodeRabbit
amotl a064cef
Text-to-SQL: Record Amazon Nova Lite problem in backlog
amotl 01c179a
Text-to-SQL: Fix YAML front matter
amotl
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,83 @@ | ||
| --- | ||
| name: "Tests: NLSQL" | ||
|
|
||
| on: | ||
| pull_request: | ||
| paths: | ||
| - '.github/workflows/nlsql.yml' | ||
| - 'cratedb_toolkit/query/nlsql/**' | ||
| - 'tests/query/*nlsql*' | ||
| - 'pyproject.toml' | ||
| push: | ||
| branches: [ main ] | ||
| paths: | ||
| - '.github/workflows/nlsql.yml' | ||
| - 'cratedb_toolkit/query/nlsql/**' | ||
| - 'tests/query/*nlsql*' | ||
| - 'pyproject.toml' | ||
|
|
||
| # Allow job to be triggered manually. | ||
| workflow_dispatch: | ||
|
|
||
| # Run the job each night after CrateDB nightly has been published. | ||
| schedule: | ||
| - cron: '0 3 * * *' | ||
|
|
||
| # Cancel in-progress jobs when pushing to the same branch. | ||
| concurrency: | ||
| cancel-in-progress: true | ||
| group: ${{ github.workflow }}-${{ github.ref }} | ||
|
|
||
| jobs: | ||
|
|
||
| tests: | ||
|
|
||
| runs-on: ${{ matrix.os }} | ||
| strategy: | ||
| fail-fast: false | ||
| matrix: | ||
| os: ["ubuntu-latest"] | ||
| python-version: [ | ||
| "3.10", | ||
| "3.14", | ||
| ] | ||
|
|
||
| env: | ||
| OS: ${{ matrix.os }} | ||
| PYTHON: ${{ matrix.python-version }} | ||
| # Do not tear down Testcontainers | ||
| TC_KEEPALIVE: true | ||
|
|
||
| name: Python ${{ matrix.python-version }} on OS ${{ matrix.os }} | ||
| steps: | ||
|
|
||
| - name: Acquire sources | ||
| uses: actions/checkout@v6 | ||
|
|
||
| - name: Install uv | ||
| uses: astral-sh/setup-uv@v7 | ||
| with: | ||
| activate-environment: 'true' | ||
| cache-suffix: ${{ matrix.python-version }} | ||
| enable-cache: true | ||
| python-version: ${{ matrix.python-version }} | ||
|
|
||
| - name: Set up project | ||
| run: | | ||
| # Install package in editable mode. | ||
| uv pip install --editable='.[nlsql,test]' | ||
|
|
||
| - name: Run software tests | ||
| env: | ||
| ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} | ||
| OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }} | ||
| OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }} | ||
| run: | | ||
| pytest -m nlsql | ||
|
|
||
| - name: Upload coverage to Codecov | ||
| uses: codecov/codecov-action@v6 | ||
| env: | ||
| CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }} | ||
| with: | ||
| fail_ci_if_error: true |
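The workflow hands three provider API keys to `pytest -m nlsql` through repository secrets. GitHub Actions typically does not expose secrets to workflows triggered from forks, in which case the variables arrive as empty strings. A minimal sketch of how a test suite could detect which providers are usable follows; the `available_providers` helper and the provider labels are hypothetical and not part of this pull request, only the environment variable names are taken from the workflow.

```python
import os
from typing import Dict, List, Mapping

# Environment variable names as used in the workflow above.
# The provider labels on the left are illustrative assumptions.
PROVIDER_KEYS: Dict[str, str] = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def available_providers(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the providers whose API key is set to a non-blank value."""
    return [name for name, key in PROVIDER_KEYS.items() if env.get(key, "").strip()]


if __name__ == "__main__":
    # Report which providers could be exercised in the current environment.
    print(available_providers())
```

A test could then skip itself when its provider is missing from that list, rather than failing on an authentication error.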
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,26 +1,9 @@ | ||
| import logging | ||
|
|
||
| import click | ||
| from click_aliases import ClickAliasedGroup | ||
|
|
||
| from ..util.cli import boot_click | ||
| from ..util.app import make_cli | ||
| from .convert.cli import convert_query | ||
| from .mcp.cli import cli as mcp_cli | ||
| from .nlsql.cli import llm_cli | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| @click.group(cls=ClickAliasedGroup) | ||
| @click.option("--verbose", is_flag=True, required=False, help="Turn on logging") | ||
| @click.option("--debug", is_flag=True, required=False, help="Turn on logging with debug level") | ||
| @click.version_option() | ||
| @click.pass_context | ||
| def cli(ctx: click.Context, verbose: bool, debug: bool): | ||
| """ | ||
| Query utilities. | ||
| """ | ||
| return boot_click(ctx, verbose, debug) | ||
|
|
||
|
|
||
| cli = make_cli() | ||
| cli.add_command(convert_query, name="convert") | ||
| cli.add_command(llm_cli, name="nlsql") | ||
| cli.add_command(mcp_cli, name="mcp") |
Empty file.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,99 @@ | ||
| """ | ||
| Use an LLM to query a database in human language using LlamaIndex's NLSQLTableQueryEngine. | ||
| """ | ||
|
|
||
| import contextlib | ||
| import dataclasses | ||
| import logging | ||
| from typing import Optional | ||
|
|
||
| from cratedb_toolkit.query.nlsql.model import DatabaseInfo, ModelInfo | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
| llama_index_import_error: Optional[ImportError] = None | ||
|
|
||
| try: | ||
| from llama_index.core.base.response.schema import RESPONSE_TYPE | ||
| from llama_index.core.llms import LLM | ||
| from llama_index.core.query_engine import NLSQLTableQueryEngine | ||
| from llama_index.core.utilities.sql_wrapper import SQLDatabase | ||
| except ImportError as exc: | ||
| llama_index_import_error = exc | ||
|
|
||
|
|
||
| @dataclasses.dataclass | ||
| class DataQuery: | ||
| """ | ||
| DataQuery helps agents turn natural language into SQL queries. | ||
| It's the little sister of Google's QueryData product. [1] | ||
|
|
||
| We recommend evaluating the Text-to-SQL interface with the Gemma models if you are | ||
| looking at non-frontier variants that need fewer resources for inference. However, | ||
| depending on the complexity of your problem, you may also want to use cutting-edge | ||
| models with your provider of choice, at the cost of higher resource usage. | ||
|
|
||
| Attention: Executing arbitrary SQL queries, as any natural language SQL table query | ||
| engine or Text-to-SQL application may do, can be a security risk. | ||
| It is recommended to take precautions as needed, such as using restricted roles, | ||
| read-only databases, sandboxing, etc. | ||
|
|
||
| [1] https://cloud.google.com/blog/products/databases/introducing-querydata-for-near-100-percent-accurate-data-agents | ||
| [2] https://github.com/kupp0/multi-db-property-search-data-agents | ||
| """ | ||
|
|
||
| db: DatabaseInfo | ||
| model: ModelInfo | ||
| query_engine: Optional["NLSQLTableQueryEngine"] = None | ||
| permit_all_statements: bool = False | ||
|
|
||
| def __post_init__(self): | ||
| """Initialize query engine.""" | ||
| if self.query_engine is None: | ||
| self.setup() | ||
|
|
||
| def setup(self): | ||
| """Configure database connection and query engine.""" | ||
| if llama_index_import_error: | ||
| raise ImportError( | ||
| "NLSQL support requires installing `cratedb-toolkit[nlsql]`" | ||
| ) from llama_index_import_error | ||
|
|
||
| from cratedb_toolkit.query.nlsql.util import configure_llm, disable_embeddings | ||
|
|
||
| # Configure model. | ||
| logger.info("Configuring LLM: provider=%s, name=%s", self.model.provider.name, self.model.name) | ||
| llm: LLM = configure_llm(self.model) | ||
| logger.info("Selected LLM: %s", llm.metadata.model_dump_json()) | ||
|
|
||
| # Configure database. | ||
| self.db.setup() | ||
|
|
||
| # schema = quote_relation_name(self.db.schema) if self.db.schema else None # noqa: ERA001 | ||
|
|
||
| # Configure NLSQL query engine. | ||
| logger.info("Creating query engine") | ||
| sql_database = SQLDatabase( | ||
| self.db.get_engine(), | ||
| schema=self.db.schema, | ||
| ignore_tables=self.db.ignore_tables, | ||
| include_tables=self.db.include_tables, | ||
| ) | ||
| with disable_embeddings(): | ||
| self.query_engine = NLSQLTableQueryEngine( | ||
| sql_database=sql_database, | ||
| llm=llm, | ||
| ) | ||
|
|
||
| def ask(self, question: str) -> "RESPONSE_TYPE": | ||
| """Invoke an inquiry to the LLM.""" | ||
| from cratedb_toolkit.query.nlsql.sqlgate import enable_sql_gateway | ||
|
|
||
| if not self.query_engine: | ||
| raise ValueError("Query engine not configured") | ||
| if self.permit_all_statements: | ||
| sql_gateway = contextlib.nullcontext | ||
| else: | ||
| sql_gateway = enable_sql_gateway | ||
| with sql_gateway(): | ||
| return self.query_engine.query(question) | ||
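`DataQuery.ask` chooses between `contextlib.nullcontext` and `enable_sql_gateway` and then enters whichever it picked. The sketch below isolates that selection pattern. The gate shown here is a hypothetical stand-in: the real `enable_sql_gateway` lives in `cratedb_toolkit.query.nlsql.sqlgate` and is not part of this excerpt, and the `startswith("select")` check is deliberately naive and only serves the illustration.

```python
import contextlib
from typing import Callable, Iterator

# Hypothetical stand-in state for the real SQL gateway, for illustration only.
_gate_active = False


@contextlib.contextmanager
def enable_sql_gateway() -> Iterator[None]:
    """Activate the (stand-in) gate for the duration of the `with` block."""
    global _gate_active
    previous = _gate_active
    _gate_active = True
    try:
        yield
    finally:
        _gate_active = previous


def run_sql(statement: str) -> str:
    """Pretend to execute SQL; reject non-SELECT statements while the gate is active."""
    if _gate_active and not statement.lstrip().lower().startswith("select"):
        raise PermissionError(f"Statement rejected by gateway: {statement!r}")
    return f"executed: {statement}"


def ask(statement: str, permit_all_statements: bool = False) -> str:
    """Mirror the selection logic of `DataQuery.ask`: pick a factory, then enter it."""
    sql_gateway: Callable[[], contextlib.AbstractContextManager] = (
        contextlib.nullcontext if permit_all_statements else enable_sql_gateway
    )
    with sql_gateway():
        return run_sql(statement)
```

Because both candidates are zero-argument callables that return a context manager, the call site stays a single `with sql_gateway():` regardless of which branch was taken.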
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,112 @@ | ||
| import json | ||
| import logging | ||
| import os | ||
| import sys | ||
| from typing import Optional | ||
|
|
||
| import click | ||
| from dotenv import load_dotenv | ||
|
|
||
| from cratedb_toolkit.option import ( | ||
| option_cluster_id, | ||
| option_cluster_name, | ||
| option_cluster_url, | ||
| option_password, | ||
| option_schema, | ||
| option_username, | ||
| ) | ||
| from cratedb_toolkit.query.nlsql.api import DataQuery | ||
| from cratedb_toolkit.query.nlsql.model import DatabaseInfo | ||
| from cratedb_toolkit.util.common import setup_logging | ||
| from cratedb_toolkit.util.data import asbool | ||
|
|
||
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| def help_llm(): | ||
| """ | ||
| Use an LLM to query the database in human language. | ||
|
|
||
| Synopsis | ||
| ======== | ||
|
|
||
| export CRATEDB_CLUSTER_URL=crate://localhost/ | ||
| ctk query nlsql "What is the average value for sensor 1?" | ||
|
|
||
| """ # noqa: E501 | ||
|
|
||
|
|
||
| @click.command() | ||
| @click.argument("question") | ||
| @option_cluster_id | ||
| @option_cluster_name | ||
| @option_cluster_url | ||
| @option_username | ||
| @option_password | ||
| @option_schema | ||
| @click.option("--llm-provider", type=str, required=False, help="LLM provider name") | ||
| @click.option("--llm-endpoint", type=str, required=False, help="LLM endpoint URL") | ||
| @click.option( | ||
| "--llm-instance", type=str, required=False, help="LLM model resource name, e.g. with Azure OpenAI service" | ||
| ) | ||
| @click.option("--llm-name", type=str, required=False, help="LLM model name for completions") | ||
| @click.option("--llm-api-key", type=str, required=False, help="LLM API key") | ||
| @click.option("--llm-api-version", type=str, required=False, help="LLM API version") | ||
| @click.pass_context | ||
| def llm_cli( | ||
| ctx: click.Context, | ||
| question: str, | ||
| cluster_id: str, | ||
| cluster_name: str, | ||
| cluster_url: str, | ||
| username: str, | ||
| password: str, | ||
| schema: str, | ||
| llm_provider: Optional[str], | ||
| llm_endpoint: Optional[str], | ||
| llm_instance: Optional[str], | ||
| llm_name: Optional[str], | ||
| llm_api_key: Optional[str], | ||
| llm_api_version: Optional[str], | ||
| ): | ||
| """ | ||
| Use an LLM to query a database in human language. | ||
| """ | ||
| from cratedb_toolkit.query.nlsql.util import read_llm_options | ||
|
|
||
| setup_logging() | ||
| load_dotenv() | ||
|
|
||
| # Read question. | ||
| if question == "-": | ||
| question = sys.stdin.read().strip() | ||
|
|
||
| schema = schema or "doc" | ||
| permit_all_statements = asbool(os.getenv("NLSQL_PERMIT_ALL_STATEMENTS")) | ||
|
|
||
| # Connect to database and configure LLM. | ||
| dburi = ctx.meta["address"].cluster_url | ||
|
|
||
|
|
||
| # Configure natural language query machinery. | ||
| dataquery = DataQuery( | ||
| db=DatabaseInfo( | ||
| dburi=dburi, | ||
| schema=schema, | ||
| ), | ||
| model=read_llm_options( | ||
| llm_provider=llm_provider, | ||
| llm_name=llm_name, | ||
| llm_endpoint=llm_endpoint, | ||
| llm_instance=llm_instance, | ||
| llm_api_key=llm_api_key, | ||
| llm_api_version=llm_api_version, | ||
| ), | ||
| permit_all_statements=permit_all_statements, | ||
| ) | ||
|
|
||
| # Submit query. | ||
| response = dataquery.ask(question) | ||
| output = {"question": question, "answer": str(response)} | ||
| if response.metadata: | ||
| output.update(next(iter(response.metadata.values()))) | ||
| print(json.dumps(output, indent=2, default=str), file=sys.stdout) # noqa: T201 | ||
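The final step of `llm_cli` builds a JSON document from the question, the answer text, and the first entry of `response.metadata`. The sketch below reproduces that assembly with a stand-in response object so it can run without LlamaIndex. `FakeResponse`, `render`, and the metadata layout (a mapping whose first value is a dict carrying a `sql_query` key) are assumptions for illustration. The `isinstance` guard is an addition of this sketch and does not appear in the code above.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the response object returned by the query engine."""

    text: str
    metadata: Optional[Dict[str, Any]] = None

    def __str__(self) -> str:
        return self.text


def render(question: str, response: FakeResponse) -> str:
    """Serialize question, answer, and the first metadata entry as JSON."""
    output: Dict[str, Any] = {"question": question, "answer": str(response)}
    if response.metadata:
        first = next(iter(response.metadata.values()))
        # Guard added in this sketch: only merge the entry if it is a mapping.
        if isinstance(first, dict):
            output.update(first)
    # `default=str` keeps non-JSON types (e.g. datetimes) from raising.
    return json.dumps(output, indent=2, default=str)


if __name__ == "__main__":
    demo = FakeResponse("42.0", {"node-1": {"sql_query": "SELECT AVG(value) FROM t"}})
    print(render("What is the average value?", demo))
```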