
Commit a50c3dc

Text-to-SQL: Improve documentation
1 parent 87f004b commit a50c3dc

2 files changed

Lines changed: 189 additions & 39 deletions

File tree

doc/query/nlsql/backlog.md

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,35 @@
1+
----
2+
orphan: true
3+
----
4+
5+
# NLSQL backlog
6+
7+
## Iteration +1
8+
9+
- Document `--include-tables`
10+
- Exercise example that uses time ranges.
11+
- Exercise example that needs SQL JOINs.
12+
- Exercise example that uses vector database features.
13+
- Is the machinery using pgvector-specific instructions that should be adjusted for CrateDB?
14+
15+
## Iteration +2
16+
17+
- Add providers: anyscale,openllm,vllm
18+
- Validate providers: Azure, Google, Hugging Face, Mistral, RunGPT
19+
- Tests: When using the vanilla schema `testdrive-data` with `from tests.conftest import TESTDRIVE_DATA_SCHEMA`,
20+
the LLM gets confused, and thinks the table is called `sensor_data`. The error message is:
21+
» The error indicates that the specified table, "sensor_data," is not recognized in the "testdrive-data" schema.
22+
- How to prevent queries like `Who is Shakespeare?`?
23+
24+
## Notes
25+
26+
LlamaIndex provides access to many LLM model inference engines and services via
27+
Python packages available on PyPI prefixed with `llama-index-llms-`.
28+
We've unlocked a few popular ones, but there are certainly many more.
29+
30+
- Inference: anyscale,localai,mistral-rs,openllm,rapid-mlx
31+
- API I: databricks,deepseek,huggingface,ibm,litellm,llama-api,llama-cpp,openai-like
32+
- API II: azure-inference,cortex,grok,groq,meta,minimax,mlx,octoai,perplexity
33+
- Router: cloudflare-ai-gateway,featherlessai,modelscope,nano-gpt,neutrino,ovhcloud
34+
- More I: Dolly, Pythia, Nano-GPT (litellm), DuckDB-NSQL, nsql-llama-2-7B, pip-sql-1.3b-GGUF, SQLCoder-7B, Ellbendls/Qwen-3-4b-Text_to_SQL-GGUF
35+
- More II: kwaipilot/kat-coder-pro-v2, undi95/remm-slerp-l2-13b

doc/query/nlsql/index.md

Lines changed: 154 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -1,53 +1,179 @@
1-
# Text-to-SQL query adapter
1+
(nlsql)=
2+
3+
# Natural language (NLSQL)
4+
5+
:::{div} sd-text-muted
6+
Talk to your data in natural language.
7+
:::
8+
9+
The CrateDB NLSQL package helps agents turn natural language into database queries,
10+
like [Vanna AI] or Google's [QueryData] but tailored to CrateDB.
11+
12+
## About
13+
14+
NLSQL provides a straightforward way to turn natural language into executable
15+
SQL by combining an LLM with explicit database context. It positions itself as
16+
an execution layer for data agents: agents handle reasoning and orchestration,
17+
while the NLSQL layer reliably generates, checks, and runs SQL against
18+
databases, returning results for downstream actions.
19+
20+
The trade-off is explicit: you shift effort from prompt tuning to context
21+
engineering and maintenance, but gain near-100% accuracy, stronger guardrails,
22+
and production reliability, especially for multistep or mission-critical
23+
workflows where probabilistic errors are unacceptable.
224

325
## Install
426

527
```shell
6-
uv pip install 'cratedb-toolkit[nlsql]'
28+
uv pip install --upgrade 'cratedb-toolkit[nlsql]'
729
```
830

31+
## Synopsis
32+
33+
```shell
34+
ctk query nlsql \
35+
--cluster-url="crate://crate@localhost:4200/?ssl=false" \
36+
--llm-provider="<provider-name>" \
37+
--llm-model="<model-name>" \
38+
--llm-api-key="<your-api-key>" \
39+
"What is the average value for sensor 1?"
40+
```
41+
42+
## Coverage
43+
44+
:::{rubric} Providers
45+
:::
46+
47+
Supports a range of providers:
48+
Amazon Bedrock (+ Converse), Anthropic, Azure OpenAI, Google AI,
49+
Hugging Face Inference API, Llamafile, Mistral, Ollama, OpenAI,
50+
OpenRouter, RunGPT, and Runpod Serverless (OpenAI-compatible).
51+
52+
:::{rubric} Models
53+
:::
54+
55+
A wide range of models can be selected from the enumerated providers.
56+
We recommend Gemini, Gemma3, Llama 3.1, Qwen 2.5, or later,
57+
for example Gemma-3-1B, Llama-3.2-1B-Instruct, or Qwen3.5-0.8B.
58+
59+
:::{rubric} Multiple languages
60+
:::
61+
62+
> Q: What is the average value for sensor 1?
63+
>
64+
> A: The average value for sensor 1 is approximately 17.03.
65+
66+
> Q: ¿Cuál es el valor medio del sensor 1?
67+
>
68+
> A: El valor medio del sensor 1 es 17.0333.
69+
70+
> Q: Quelle est la valeur moyenne du capteur 1 ?
71+
>
72+
> A: La valeur moyenne du capteur 1 est de 17,0333.
73+
74+
> Q: Wie lautet der Durchschnittswert für Sensor 1?
75+
>
76+
> A: Der Durchschnittswert für Sensor 1 beträgt 17,0333.
77+
78+
> Q: Qual è il valore medio del sensore 1?
79+
>
80+
> A: Il valore medio del sensore 1 è pari a 17,0333.
81+
82+
## Details
83+
84+
NLSQL works by wrapping a SQL database and exposing a query interface where
85+
plain-language questions are translated into SQL, executed, and returned as
86+
answers. Developers configure the engine with a database connection and a
87+
bounded set of tables, ensuring the model generates queries only within a
88+
known schema and avoids context overflow.
89+
90+
The procedure follows a schema-grounded approach: the engine injects table
91+
structure (and optionally examples or retrieved context) into the prompt so
92+
the LLM can synthesize accurate queries instead of guessing. It can also
93+
integrate with retrieval components to dynamically select relevant tables
94+
or augment prompts at query time for more complex setups.
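
The schema-grounded step described above can be sketched as follows. This is a minimal illustration, not the package's actual prompt builder: it uses an in-memory SQLite database as a stand-in for CrateDB, and the helper names `describe_tables` and `build_prompt`, the table names, and the prompt wording are all assumptions made for the example.

```python
import sqlite3


def describe_tables(conn, tables):
    """Return one 'table(column type, ...)' line per permitted table."""
    lines = []
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, default, pk).
        rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        columns = ", ".join(f"{name} {ctype}" for _, name, ctype, *_ in rows)
        lines.append(f"{table}({columns})")
    return "\n".join(lines)


def build_prompt(question, schema_text):
    """Inject the table structure into the prompt so the model does not guess."""
    return (
        "Given the following tables, write one SQL query that answers the question.\n"
        f"Schema:\n{schema_text}\n"
        f"Question: {question}\n"
        "SQL:"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (sensor_id INTEGER, value REAL)")
conn.execute("CREATE TABLE unrelated (note TEXT)")

# Only the bounded set of tables is described, so `unrelated` never reaches the model.
schema_text = describe_tables(conn, ["sensor_readings"])
prompt = build_prompt("What is the average value for sensor 1?", schema_text)
print(prompt)
```

Restricting the table list keeps the prompt small and limits the model to a known schema, which is the purpose of the bounded table set mentioned above.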
95+
96+
The engine acts as a thin orchestration layer for Text-to-SQL purposes,
97+
and for building NLSQL systems:
98+
it handles prompt construction, query generation, execution,
99+
and result formatting, while leaving control, safety (e.g., read-only
100+
roles), and schema design to the developer.
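
As a rough illustration of that orchestration flow, the sketch below chains prompt construction, query generation, execution, and result formatting. It is not the package's implementation: the LLM is replaced by a hypothetical `fake_llm` stub that returns a fixed query, SQLite stands in for CrateDB, and `answer_question` is a name invented for this example.

```python
import sqlite3
from typing import Callable


def answer_question(conn, question: str, generate_sql: Callable[[str], str]) -> dict:
    """Run the four steps: build prompt, generate SQL, execute, format result."""
    prompt = f"Question: {question}\nSQL:"
    sql = generate_sql(prompt)
    rows = conn.execute(sql).fetchall()
    return {"question": question, "sql": sql, "rows": rows}


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; always returns the same query."""
    return "SELECT AVG(value) FROM sensor_readings WHERE sensor_id = 1"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (sensor_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?)",
    [(1, 16.0), (1, 18.0), (2, 99.0)],
)

result = answer_question(conn, "What is the average value for sensor 1?", fake_llm)
print(result["rows"])
```

Passing the generator as a callable keeps the model exchangeable, while access control still has to come from the database role used for the connection.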
101+
102+
## Security
103+
104+
Any Text-to-SQL application should be aware that executing
105+
arbitrary SQL queries can be a security risk. It is recommended to
106+
take precautions as needed, such as using restricted roles, read-only
107+
databases, sandboxing, etc.
108+
109+
While we recommend using a dedicated read-only user/role to guarantee
110+
100% safety, CrateDB NLSQL also prevents [Prompt-to-SQL Injections] by
111+
default, by classifying the SQL statement and only permitting access
112+
for `SELECT` statements.
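
To make the idea of statement classification concrete, here is a deliberately naive sketch of a `SELECT`-only guard. It is not the classifier that CrateDB NLSQL uses: the function name `is_read_only` is hypothetical, and the keyword check is far simpler than a real SQL parser.

```python
import re

# Matches `-- line comments` and `/* block comments */`.
_COMMENT = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)


def is_read_only(sql: str) -> bool:
    """Return True only for a single statement whose first keyword is SELECT."""
    stripped = _COMMENT.sub(" ", sql).strip().rstrip(";").strip()
    if not stripped or ";" in stripped:
        # Empty input, or more than one statement: reject conservatively.
        return False
    first_keyword = stripped.split(None, 1)[0].upper()
    return first_keyword == "SELECT"


print(is_read_only("SELECT AVG(value) FROM sensor_readings"))
print(is_read_only("SELECT 1; DROP TABLE sensor_readings"))
```

The sketch errs on the side of rejection: it also refuses valid read-only queries that start with `WITH` or contain a semicolon inside a string literal. A guard like this complements a read-only database role but does not replace it.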
113+
114+
The `permit_all_statements` API argument or the `NLSQL_PERMIT_ALL_STATEMENTS`
115+
environment variable can be used to relax that default when set to a boolean
116+
value, to allow all types of statements. Only enable this flag when you are
117+
sure about this behaviour.
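
A boolean environment flag of this kind is commonly read as in the sketch below. The set of accepted spellings is an assumption for illustration; the source does not state which values the toolkit itself accepts, and the helper name `permit_all_statements_from_env` is hypothetical.

```python
import os

# Assumed truthy spellings; the toolkit may accept a different set.
_TRUTHY = {"1", "true", "yes", "on"}


def permit_all_statements_from_env(environ=os.environ) -> bool:
    """Read NLSQL_PERMIT_ALL_STATEMENTS, defaulting to the safe value False."""
    raw = environ.get("NLSQL_PERMIT_ALL_STATEMENTS", "")
    return raw.strip().lower() in _TRUTHY


print(permit_all_statements_from_env({"NLSQL_PERMIT_ALL_STATEMENTS": "true"}))
```

Defaulting to `False` when the variable is unset or unrecognized keeps the `SELECT`-only behaviour unless it is relaxed on purpose.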
118+
9119
## Usage
10120

121+
You can use CrateDB NLSQL from the command line and as a Python API.
122+
11123
### CLI
12124

125+
When using `ctk query nlsql` on the command line, we recommend using
126+
environment variables to configure database and LLM connectivity.
127+
128+
For connecting to CrateDB on localhost, use a connection string like this:
13129
```shell
14-
export CRATEDB_CLUSTER_URL=crate://localhost/
15-
export LLM_PROVIDER=openai
16-
export OPENAI_API_KEY=<YOUR_OPENAI_API_KEY>
130+
export CRATEDB_CLUSTER_URL="crate://crate:crate@localhost:4200/?ssl=false"
17131
```
18-
132+
For connecting to CrateDB Cloud, use a connection string like this:
19133
```shell
20-
export CRATEDB_CLUSTER_URL=crate://localhost/
21-
export LLM_PROVIDER=amazon_bedrock_converse
134+
export CRATEDB_CLUSTER_URL="crate://admin:dZ...6LqB@example.eks1.eu-west-1.aws.cratedb.net:4200/?ssl=true"
22135
```
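
To see how such a connection string breaks down into its parts, the following sketch splits it with Python's standard library. It only inspects the URL and does not connect to a database; the helper name `describe_cluster_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def describe_cluster_url(url: str) -> dict:
    """Split a `crate://` URL into components without revealing the password."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "has_password": parts.password is not None,
        "host": parts.hostname,
        "port": parts.port,
        # Raw value of the `ssl` query parameter, or None when it is absent.
        "ssl": query.get("ssl", [None])[0],
    }


info = describe_cluster_url("crate://crate:crate@localhost:4200/?ssl=false")
print(info)
```

Reporting only whether a password is present makes the output safe to paste into logs or bug reports.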
23136

137+
Configure the LLM provider. Use one of `amazon_bedrock`, `amazon_bedrock_converse`,
138+
`anthropic`, `azure`, `google`, `huggingface_api`, `llamafile`, `mistral`, `ollama`,
139+
`openai`, `openrouter`, `rungpt`, or `runpod_serverless`.
24140
```shell
25-
export CRATEDB_CLUSTER_URL=crate://localhost/
26-
export LLM_PROVIDER=anthropic
27-
export ANTHROPIC_API_KEY=<YOUR_ANTHROPIC_API_KEY>
141+
export LLM_PROVIDER="openai"
28142
```
29-
143+
Configure the LLM model. The label format depends on the provider's conventions.
144+
This setting is optional: by default, each provider selects a standard model
145+
that is suitable for Text-to-SQL, yet cost-effective.
30146
```shell
31-
export CRATEDB_CLUSTER_URL=crate://localhost/
32-
export LLM_PROVIDER=google
33-
export GOOGLE_API_KEY=<YOUR_GOOGLE_API_KEY>
147+
export LLM_NAME="google/gemma-3-4b-it:free"
34148
```
35149

150+
To authenticate with LLM APIs, use the corresponding environment variables, as
151+
outlined below.
36152
```shell
37-
export CRATEDB_CLUSTER_URL=crate://localhost/
38-
export LLM_PROVIDER=huggingface_api
39-
export HF_TOKEN=<YOUR_HUGGINGFACE_API_TOKEN>
153+
export OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"
40154
```
41-
42155
```shell
43-
export CRATEDB_CLUSTER_URL=crate://localhost/
44-
export LLM_PROVIDER=mistral
45-
export MISTRAL_API_KEY=<YOUR_MISTRAL_API_KEY>
156+
export ANTHROPIC_API_KEY="<YOUR_ANTHROPIC_API_KEY>"
157+
```
158+
```shell
159+
export GOOGLE_API_KEY="<YOUR_GOOGLE_API_KEY>"
160+
```
161+
```shell
162+
export HF_TOKEN="<YOUR_HUGGINGFACE_API_TOKEN>"
163+
```
164+
```shell
165+
export MISTRAL_API_KEY="<YOUR_MISTRAL_API_KEY>"
46166
```
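
Before invoking the CLI, a small pre-flight check can report which variables are still unset. The sketch below is not part of the toolkit: the function name `missing_settings` is hypothetical, and the pairing of provider labels with API key variables is inferred from the variable names shown in this section.

```python
import os

# Provider label -> API key variable, inferred from the variable names in this section.
API_KEY_VARIABLES = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "huggingface_api": "HF_TOKEN",
    "mistral": "MISTRAL_API_KEY",
}


def missing_settings(environ=os.environ) -> list:
    """Return the names of expected environment variables that are unset or empty."""
    missing = [
        name
        for name in ("CRATEDB_CLUSTER_URL", "LLM_PROVIDER")
        if not environ.get(name)
    ]
    # Providers without a known key variable (e.g. ollama) add no requirement.
    key_variable = API_KEY_VARIABLES.get(environ.get("LLM_PROVIDER", ""))
    if key_variable and not environ.get(key_variable):
        missing.append(key_variable)
    return missing


print(missing_settings({"CRATEDB_CLUSTER_URL": "crate://localhost/", "LLM_PROVIDER": "openai"}))
```

Providers that are not listed in the mapping, such as self-hosted ones, simply skip the API key check.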
47167

168+
(llm-ollama)=
169+
170+
:::{rubric} Ollama
171+
:::
172+
173+
For connecting to dedicated LLM instances, use the `LLM_ENDPOINT` environment
174+
variable. For example, to connect to a self-managed Ollama instance:
48175
```shell
49-
export CRATEDB_CLUSTER_URL=crate://localhost/
50-
export LLM_PROVIDER=ollama
176+
export LLM_PROVIDER="ollama"
51177
export LLM_ENDPOINT="http://100.83.17.54:11434/"
52178
```
53179
```shell
@@ -114,7 +240,7 @@ from cratedb_toolkit.query.nlsql.model import DatabaseInfo, ModelInfo, ModelProv
114240
engine = sa.create_engine("crate://")
115241
schema = "doc"
116242

117-
# Use Open AI GPT-4.
243+
# Use OpenAI GPT-4.1.
118244
dataquery = DataQuery(
119245
db=DatabaseInfo(engine=engine, schema=schema),
120246
model=ModelInfo(provider=ModelProvider.OPENAI, name="gpt-4.1"),
@@ -205,18 +331,7 @@ ctk query nlsql "What is the average value for sensor 1?"
205331
Answer: The average value for sensor 1 is approximately 17.03.
206332
```
207333

208-
## Local inference
209-
210-
:Llama-3.2-1B-Instruct: License LLaMA 3.2, Size 1.1 GB
211-
:Qwen3.5-0.8B: License Apache 2.0, Size 1.6 GB
212-
213-
## Backlog
214-
215-
LlamaIndex provides access to many LLM models via Python packages available
216-
on PyPI prefixed with `llama-index-llms-`.
217334

218-
- Inference: anyscale,llamafile,localai,mistral-rs,openllm,rapid-mlx,rungpt,vllm
219-
- API I: databricks,deepseek,huggingface,ibm,litellm,llama-api,llama-cpp,openai-like
220-
- API II: azure-inference,cortex,google-genai,grok,groq,meta,minimax,mlx,octoai,perplexity
221-
- Router: cloudflare-ai-gateway,featherlessai,modelscope,nano-gpt,neutrino,ovhcloud
222-
- More: Dolly, Pythia, Nano-GPT (litellm), DuckDB-NSQL, nsql-llama-2-7B, pip-sql-1.3b-GGUF, SQLCoder-7B, Ellbendls/Qwen-3-4b-Text_to_SQL-GGUF
335+
[Prompt-to-SQL Injections]: https://syssec.dpss.inesc-id.pt/papers/pedro_icse25.pdf
336+
[QueryData]: https://cloud.google.com/blog/products/databases/introducing-querydata-for-near-100-percent-accurate-data-agents
337+
[Vanna AI]: https://vanna.ai/
