This document describes the evaluation pipeline for the Analytics Copilot (Text-to-SQL) model, including:
- Internal evaluation on the preprocessed `b-mc2/sql-create-context` validation set.
- Secondary external validation on the Spider dev set using lightweight, portfolio-friendly metrics.
The goal is to provide reproducible, scriptable evaluation that can run both in local development environments (including mock/offline modes) and in GPU-backed Colab sessions with trained adapters.
Internal evaluation uses the Alpaca-style validation file produced by the dataset builder:
`data/processed/val.jsonl`

Each line is a JSON object with at least:

- `instruction`
- `input` – formatted schema + question, e.g.: `### Schema: <CREATE TABLE ...> ### Question: <natural language question>`
- `output` – normalized gold SQL query.
The internal evaluation script computes:
- **Exact Match (EM)** – comparison on normalized SQL:
  - Strips leading/trailing whitespace.
  - Removes trailing semicolons.
  - Collapses runs of whitespace into a single space.
  - Implemented via `text2sql.eval.normalize.normalize_sql`.
- **No-values Exact Match**
  - Builds on `normalize_sql` and additionally replaces:
    - Single-quoted string literals with a placeholder (`'__STR__'`).
    - Numeric literals (integers/decimals, optionally negative) with a placeholder (`__NUM__`).
  - Useful to detect structural matches even when literal values differ.
- **SQL parse success rate**
  - Fraction of predictions that can be parsed by `sqlglot.parse_one`.
  - Provides a lightweight proxy for syntactic validity of generated SQL.
- **Schema adherence rate**
  - Uses the `CREATE TABLE` context from each example and parses it with `sqlglot` to recover:
    - Known tables.
    - Known columns per table.
  - Parses the predicted SQL and extracts referenced table and column names.
  - A prediction is schema-adherent if all referenced tables/columns appear in the context.
  - Implemented via:
    - `text2sql.eval.schema.parse_create_table_context`
    - `text2sql.eval.schema.referenced_identifiers`
    - `text2sql.eval.schema.schema_adherence`
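The normalization and literal-masking steps described above can be sketched roughly as follows. This is a minimal approximation for illustration only; the actual `text2sql.eval.normalize` module may differ in its regex details.

```python
import re

def normalize_sql(sql: str) -> str:
    """Approximate the normalization described above (sketch, not the project's exact code)."""
    s = sql.strip()                   # strip leading/trailing whitespace
    s = re.sub(r";+\s*$", "", s)      # remove trailing semicolons
    s = re.sub(r"\s+", " ", s)        # collapse runs of whitespace into a single space
    return s

def no_values_sql(sql: str) -> str:
    """Replace literals with placeholders so structurally equal queries compare equal."""
    s = normalize_sql(sql)
    s = re.sub(r"'(?:[^']|'')*'", "'__STR__'", s)           # single-quoted string literals
    s = re.sub(r"(?<![\w'])-?\d+(?:\.\d+)?", "__NUM__", s)  # numeric literals (optionally negative)
    return s
```

With this sketch, `SELECT * FROM t WHERE x = 42` and `SELECT * FROM t WHERE x = 7` both normalize to `SELECT * FROM t WHERE x = __NUM__`, which is exactly the structural-match behavior No-values Exact Match is after.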
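To make the schema-adherence idea concrete, here is a deliberately simplified sketch that uses regex parsing instead of `sqlglot` (the real `text2sql.eval.schema` helpers parse properly with `sqlglot`; the keyword list and identifier extraction here are crude stand-ins):

```python
import re

def parse_create_table_context(context: str) -> dict:
    """Very simplified CREATE TABLE parser; the real module uses sqlglot."""
    schema = {}
    for table, cols in re.findall(
        r"CREATE TABLE\s+(\w+)\s*\(([^)]*)\)", context, flags=re.IGNORECASE
    ):
        # First token of each comma-separated column definition is the column name.
        schema[table.lower()] = {
            c.strip().split()[0].lower() for c in cols.split(",") if c.strip()
        }
    return schema

def schema_adherent(pred_sql: str, schema: dict) -> bool:
    """Check that every referenced identifier appears somewhere in the schema."""
    known = set(schema) | {c for cols in schema.values() for c in cols}
    # Crude identifier extraction: words that are not common SQL keywords.
    keywords = {"select", "from", "where", "and", "or", "join", "on", "group",
                "by", "order", "limit", "as", "count", "avg", "sum", "min", "max"}
    idents = {w.lower() for w in re.findall(r"[A-Za-z_]\w*", pred_sql)} - keywords
    return idents <= known
```

A prediction like `SELECT salary FROM head` against a schema that only defines `head(age, name)` fails the check, which is the signal the schema adherence rate aggregates.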
All metrics are aggregated via:

`text2sql.eval.metrics.aggregate_metrics`

which returns:

- `n_examples`
- `exact_match` – `{count, rate}`
- `no_values_em` – `{count, rate}`
- `parse_success` – `{count, rate}`
- `schema_adherence` – `{count, rate}`
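The aggregation shape can be sketched as follows. This is a hypothetical simplification of `aggregate_metrics`, assuming per-example boolean flags as input; the real function's signature may differ.

```python
def aggregate_metrics(flags: list) -> dict:
    """Turn per-example boolean flags into {count, rate} summaries.

    `flags` is a list of dicts like
    {"exact_match": True, "no_values_em": True, "parse_success": True, "schema_adherence": False}.
    """
    n = len(flags)
    result = {"n_examples": n}
    for metric in ("exact_match", "no_values_em", "parse_success", "schema_adherence"):
        count = sum(1 for f in flags if f.get(metric, False))
        result[metric] = {"count": count, "rate": count / n if n else 0.0}
    return result
```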
Mock mode is designed for quick local checks and CI:
```bash
python scripts/evaluate_internal.py --mock \
  --val_path data/processed/val.jsonl \
  --out_dir reports/
```

Behavior:

- Uses the gold SQL (`output`) as the prediction.
- Exercises normalization, parsing, schema adherence, and reporting code.
- Produces:
  - `reports/eval_internal.json`
  - `reports/eval_internal.md`
After fine-tuning with QLoRA (see docs/training.md), you can evaluate the
model using the trained adapters:
```bash
python scripts/evaluate_internal.py \
  --val_path data/processed/val.jsonl \
  --base_model mistralai/Mistral-7B-Instruct-v0.1 \
  --adapter_dir /path/to/outputs/adapters \
  --device auto \
  --max_examples 200 \
  --temperature 0.0 \
  --top_p 0.9 \
  --max_new_tokens 256 \
  --out_dir reports/
```

Notes:

- `--device auto` prefers GPU when available and falls back to CPU otherwise (with a warning).
- By default, when running on CUDA the inference loader will try to load the base model in 4-bit (bitsandbytes) for faster and more memory-efficient evaluation. You can explicitly control this with:
  - `--load_in_4bit` / `--no_load_in_4bit`
  - `--bnb_4bit_quant_type`, `--bnb_4bit_compute_dtype`, and `--bnb_4bit_use_double_quant` for advanced 4-bit configuration.
  - `--dtype` (default `auto`, which maps to `float16` on CUDA and `float32` on CPU)
- `--max_examples` allows you to subsample the validation set for quick runs.
- `--smoke` evaluates only a small handful of validation examples; on CPU-only environments it automatically falls back to `--mock` to avoid loading the large model while still exercising the metrics/reporting pipeline.
- If you have a merged model directory, you can pass it as `--base_model` and omit `--adapter_dir`.
External validation uses two Hugging Face datasets:
- **Spider examples**
  - Dataset: `xlangai/spider`
  - Split: `validation` (configured via `--spider_split`)
  - Provides:
    - `db_id`
    - `question`
    - `query` (gold SQL)
- **Spider schema helper**
  - Dataset: `richardr1126/spider-schema`
  - Provides:
    - `db_id`
    - `create_table_context` – a serialized schema context with `CREATE TABLE` information for all tables in the database.

License: `xlangai/spider` is derived from the original Spider benchmark, and `richardr1126/spider-schema` is licensed under CC BY-SA 4.0. In this project, Spider is used only for evaluation, not for training.
For each Spider example:
- Look up `db_id` in the schema helper dataset to retrieve `create_table_context`.
- Build the schema + question input using the same format as internal evaluation:
  `### Schema: <create_table_context> ### Question: <Spider question>`
- Use the same instruction text as training:
  "Write a SQL query that answers the user's question using ONLY the tables and columns provided in the schema."
- Wrap instruction + input into a full prompt using the training formatter:
  - Implemented in `text2sql.eval.spider.build_spider_prompt`, which internally reuses:
    - `text2sql.data_prep.INSTRUCTION_TEXT`
    - `text2sql.data_prep.build_input_text`
    - `text2sql.training.formatting.build_prompt`
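Assuming an Alpaca-style prompt template (the actual template lives in `text2sql.training.formatting.build_prompt` and may differ), the steps above amount to roughly:

```python
# Instruction text quoted from this document; the prompt wrapper below is a
# hypothetical Alpaca-style sketch, not the project's exact formatter.
INSTRUCTION_TEXT = ("Write a SQL query that answers the user's question using ONLY "
                    "the tables and columns provided in the schema.")

def build_input_text(schema_context: str, question: str) -> str:
    # Same "### Schema: ... ### Question: ..." layout as internal evaluation.
    return f"### Schema: {schema_context} ### Question: {question}"

def build_spider_prompt(create_table_context: str, question: str) -> str:
    input_text = build_input_text(create_table_context, question)
    return (f"### Instruction:\n{INSTRUCTION_TEXT}\n\n"
            f"### Input:\n{input_text}\n\n### Response:\n")
```

Because the same `build_input_text` shape is reused, the model sees Spider schemas in exactly the format it was trained on, which is the point of routing Spider through the training formatter.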
Spider evaluation uses the same metric suite as internal evaluation:
- Exact Match (normalized SQL)
- No-values Exact Match
- SQL parse success rate
- Schema adherence rate
This provides a lightweight generalization check on Spider dev, but it is not a full reproduction of official Spider evaluation. In particular:
- Official Spider metrics include detailed component matching (SELECT, WHERE, GROUP BY, etc.).
- Execution-based evaluation is often used to measure semantic equivalence via query results.
Here we focus on structural/logical-form approximations that are easy to run without database execution, suitable for a portfolio-style baseline.
Mock mode uses small offline fixtures under tests/fixtures/ and does not
require internet:
```bash
python scripts/evaluate_spider_external.py --mock \
  --out_dir reports/
```

Behavior:

- Loads:
  - `tests/fixtures/spider_sample.jsonl`
  - `tests/fixtures/spider_schema_sample.jsonl`
- Uses gold SQL as predictions.
- Produces:
  - `reports/eval_spider.json`
  - `reports/eval_spider.md`
This is ideal for local smoke tests of the Spider pipeline.
With network access and a trained model, you can run full Spider dev evaluation:
```bash
python scripts/evaluate_spider_external.py \
  --base_model mistralai/Mistral-7B-Instruct-v0.1 \
  --adapter_dir /path/to/outputs/adapters \
  --device auto \
  --spider_source xlangai/spider \
  --schema_source richardr1126/spider-schema \
  --spider_split validation \
  --max_examples 200 \
  --temperature 0.0 \
  --top_p 0.9 \
  --max_new_tokens 256 \
  --out_dir reports/
```

Notes:

- By default, when running on CUDA the inference loader will try to load the base model in 4-bit (bitsandbytes) for faster and more memory-efficient evaluation. You can explicitly control this with:
  - `--load_in_4bit` / `--no_load_in_4bit`
  - `--dtype` (default `auto`, which maps to `float16` on CUDA and `float32` on CPU)
- `--max_examples` allows a lighter-weight subset run (e.g., 50–200 examples).
- When `--mock` is not set, the script downloads datasets via `datasets.load_dataset`, so internet access is required.
Both evaluation scripts rely on a shared inference helper:
`src/text2sql/infer.py`
Key functions:
- `load_model_for_inference(base_model, adapter_dir=None, device='auto', load_in_4bit=None, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_compute_dtype='float16', dtype='auto')`
  - Loads a base HF model or local directory.
  - Optionally applies LoRA adapters from `adapter_dir`.
  - Resolves device via:
    - `"auto"` → GPU if available, otherwise CPU (with a warning).
    - `"cuda"` / `"cpu"` for explicit control.
  - When running on CUDA and `load_in_4bit` is not explicitly set, the loader defaults to 4-bit quantization using bitsandbytes (NF4 + double-quant by default). This significantly reduces memory usage and speeds up evaluation on Colab-style GPUs.
  - The `bnb_4bit_*` arguments allow you to tune quantization behavior when needed (e.g. quantization type, compute dtype, and whether double quant is used).
- `generate_sql(prompt, max_new_tokens, temperature, top_p) -> str`
  - Uses the loaded model/tokenizer to generate text.
  - Evaluation scripts post-process the raw text via `text2sql.training.formatting.ensure_sql_only` before metric computation.
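The default-resolution behavior described for `load_model_for_inference` can be sketched as a pure function. The name `resolve_runtime` is hypothetical, and `cuda_available` stands in for `torch.cuda.is_available()`; the real loader implements this inline.

```python
from typing import Optional, Tuple

def resolve_runtime(device: str = "auto",
                    load_in_4bit: Optional[bool] = None,
                    dtype: str = "auto",
                    cuda_available: bool = False) -> Tuple[str, bool, str]:
    """Sketch of the loader's defaults, not the actual implementation."""
    if device == "auto":
        device = "cuda" if cuda_available else "cpu"  # real code warns on CPU fallback
    if load_in_4bit is None:
        load_in_4bit = device == "cuda"  # 4-bit bitsandbytes by default on CUDA
    if dtype == "auto":
        dtype = "float16" if device == "cuda" else "float32"
    return device, load_in_4bit, dtype
```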
This separation keeps the evaluation scripts thin and allows reuse of the inference pipeline in other tools (e.g., a Streamlit demo or interactive notebooks).
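For illustration, the kind of post-processing `ensure_sql_only` performs might look like the following. This is a hypothetical sketch of typical SQL-extraction cleanup; the real helper's rules may differ.

```python
import re

def extract_sql(raw: str) -> str:
    """Strip markdown fences and trailing chatter, keeping the first SQL statement."""
    text = raw.strip()
    # Drop a ```sql ... ``` fence if the model emitted one.
    fence = re.search(r"```(?:sql)?\s*(.*?)```", text, flags=re.DOTALL | re.IGNORECASE)
    if fence:
        text = fence.group(1).strip()
    # Keep only the first statement if the model kept generating afterwards.
    text = text.split(";")[0].strip()
    return text
```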
To keep the test suite lightweight and offline-friendly:
- Fixtures under `tests/fixtures/` provide small synthetic datasets:
  - `eval_internal_sample.jsonl` – mini val-style examples.
  - `spider_sample.jsonl` and `spider_schema_sample.jsonl` – Spider-like examples and schemas.
- Unit tests cover:
  - SQL normalization (`test_normalize_sql.py`).
  - Schema parsing and adherence (`test_schema_adherence.py`).
  - Metric aggregation (`test_metrics_aggregate.py`).
  - Spider prompt construction (`test_prompt_building_spider.py`).
CI and local developers can run:

```bash
pytest -q
```

without requiring internet access or GPU hardware. For full model-based evaluation, see the commands in sections 1.3.2 and 2.4.2 above.
If you see TensorFlow CUDA warnings in Colab logs (e.g. about missing `libcudart`), they can generally be ignored for this project. The evaluation scripts also set `TF_CPP_MIN_LOG_LEVEL=3` to suppress most TensorFlow log noise; you can optionally uninstall TensorFlow entirely if you are not using it elsewhere in your notebook.