This document explains how to fine-tune Mistral-7B-Instruct-v0.1 on the processed Text-to-SQL dataset using QLoRA (4-bit) with Unsloth and bitsandbytes.
We provide:
- A Colab-friendly notebook: `notebooks/finetune_mistral7b_qlora_text2sql.ipynb`
- A reproducible CLI script: `scripts/train_qlora.py`

Both rely on the preprocessed instruction-tuning data produced by `scripts/build_dataset.py`.
Before training, generate the processed JSONL files:

    python scripts/build_dataset.py

This creates:

- `data/processed/train.jsonl`
- `data/processed/val.jsonl`

Each line is an Alpaca-style record with keys: `id`, `instruction`, `input`, `output`, `source`, `meta`.

For details, see docs/dataset.md.
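As a quick sanity check that the processed files match this record layout, a small stdlib-only inspection helper can be used. This is an illustrative sketch, not part of the repo; the helper names (`load_jsonl`, `validate_records`) are hypothetical.

```python
import json

# The six keys every Alpaca-style record is expected to carry.
EXPECTED_KEYS = {"id", "instruction", "input", "output", "source", "meta"}

def load_jsonl(path):
    """Read one JSON record per non-empty line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def validate_records(records):
    """Keep only records that contain all expected Alpaca-style keys."""
    return [r for r in records if EXPECTED_KEYS.issubset(r)]
```

For example, `validate_records(load_jsonl("data/processed/train.jsonl"))` should return the full training set if preprocessing succeeded.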
The training stack uses:

- `torch` (GPU strongly recommended)
- `transformers`
- `accelerate`
- `peft`
- `trl`
- `unsloth`
- `bitsandbytes`

These are included in requirements.txt. Install them with:

    python -m pip install --upgrade pip
    python -m pip install -r requirements.txt

Note: QLoRA training is intended for GPU environments. A 7B model with 4-bit quantization typically requires a GPU with ≥ 16 GB VRAM (depending on sequence length and batch size). The CLI script supports CPU-only `--dry_run` and `--smoke` modes for configuration checks.
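As a rough back-of-the-envelope check on that VRAM guidance, the weight footprint alone can be estimated from the parameter count and bit width. The figures below are illustrative only (the helper name and the ~7.2B parameter count are approximations); activations, LoRA optimizer state, and CUDA overhead add several GB on top, which is why ≥ 16 GB is recommended rather than ~4 GB.

```python
def quantized_weight_gb(n_params: float, bits: int) -> float:
    """Approximate memory (GB) needed to store model weights at a given bit width."""
    return n_params * bits / 8 / 1024**3

N_PARAMS = 7.2e9  # Mistral-7B, approximate

w4 = quantized_weight_gb(N_PARAMS, 4)    # 4-bit weights: roughly 3.4 GB
w16 = quantized_weight_gb(N_PARAMS, 16)  # fp16 weights for comparison: roughly 13 GB
```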
File: notebooks/finetune_mistral7b_qlora_text2sql.ipynb

The notebook is organized into the following sections:

- Overview
  - Explains the Text-to-SQL task, QLoRA, and why providing explicit `CREATE TABLE` schema context reduces hallucinations.
- Environment Setup
  - Installs `unsloth`, `trl`, `bitsandbytes`, etc. (for Colab).
  - Verifies GPU availability and prints device info.
  - Sets random seeds for reproducibility.
- Load Processed Dataset
  - Reads `data/processed/train.jsonl` and `data/processed/val.jsonl`.
  - Inspects a few records (instruction, input, output).
  - Optionally subsamples for a fast dev run (e.g., 512 examples).
- Prompt Formatting
  - Uses the same prompt pattern as the CLI script: `build_prompt(instruction, input)` constructs `### Instruction: <instruction> ### Input: <schema + question> ### Response:`.
  - The final training text is `prompt + sql_output`, where the SQL output is cleaned via `ensure_sql_only`.
  - Shows 2–3 formatted examples end-to-end.
- Model Loading & QLoRA Setup
  - Loads `mistralai/Mistral-7B-Instruct-v0.1` using Unsloth's `FastLanguageModel.from_pretrained` in 4-bit mode.
  - Explains key QLoRA hyperparameters:
    - `r` (rank): how many low-rank dimensions to add.
    - `alpha`: scaling factor for the LoRA updates.
    - `dropout`: regularization on the LoRA adapters.
    - `target_modules`: which projection layers receive LoRA adapters.
- Training Configuration & SFTTrainer
  - Uses TRL's `SFTTrainer` (with `SFTConfig`) to run supervised fine-tuning on the formatted dataset.
  - Key settings: `max_steps`, `per_device_train_batch_size`, `gradient_accumulation_steps`, `learning_rate`, `warmup_steps`, `max_seq_length`.
  - Includes a fast dev run mode:
    - Limits to ~512 training examples.
    - Uses `max_steps=20` to quickly validate the setup.
- Saving Artifacts
  - Saves LoRA adapters under `outputs/adapters/`.
  - Saves training configuration and metrics to `outputs/run_meta.json` and `outputs/metrics.json`.
- Quick Sanity Inference
  - Loads the trained adapters.
  - Runs a few example schema + question pairs through the model.
  - Prints generated SQL for manual inspection.
- Next Steps
  - Describes planned work:
    - External validation on the Spider dev split.
    - Pushing adapters/model to the Hugging Face Hub.
    - Integrating with a Streamlit-based Analytics Copilot UI.
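The prompt formatting described above can be sketched in plain Python. This is an illustrative reimplementation, not the exact code in the notebook or `scripts/train_qlora.py`: the real `build_prompt` and `ensure_sql_only` may differ in whitespace handling and cleanup details.

```python
def build_prompt(instruction: str, input_text: str) -> str:
    """Assemble the Alpaca-style prompt shared by training and inference."""
    return (
        "### Instruction:\n"
        f"{instruction}\n\n"
        "### Input:\n"
        f"{input_text}\n\n"
        "### Response:\n"
    )

def ensure_sql_only(output: str) -> str:
    """Strip markdown code fences and surrounding whitespace so only SQL remains."""
    text = output.strip()
    if text.startswith("```"):
        # Drop a leading ```sql (or bare ```) fence line and the closing ``` line.
        lines = [l for l in text.splitlines() if not l.strip().startswith("```")]
        text = "\n".join(lines).strip()
    return text

def training_text(instruction: str, input_text: str, output: str) -> str:
    """The final training string: prompt followed by the cleaned SQL target."""
    return build_prompt(instruction, input_text) + ensure_sql_only(output)
```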
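To make the `r`/`alpha` explanation concrete, here is the standard LoRA bookkeeping: each adapted weight matrix W (d_out × d_in) stays frozen and gains two trainable low-rank factors B (d_out × r) and A (r × d_in), with the update B·A scaled by alpha / r. The helper names and the 4096×4096 projection shape below are illustrative assumptions, not values taken from the repo.

```python
def lora_extra_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters a rank-r adapter adds to one weight matrix."""
    return r * (d_out + d_in)

def lora_scaling(alpha: float, r: int) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / r

# Example: a 4096x4096 projection with r=16 gains 131,072 trainable params,
# under 1% of the ~16.8M frozen params in the matrix itself.
extra = lora_extra_params(4096, 4096, r=16)
scale = lora_scaling(alpha=16, r=16)  # alpha == r gives a neutral 1.0 scale
```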
File: scripts/train_qlora.py
The script is designed for reproducible runs on a GPU machine. It uses the same prompt formatting as the notebook and can be integrated into automation or scheduled jobs.
Dry run (format a batch and exit; no model required):

    python scripts/train_qlora.py --dry_run

Smoke test (validate dataset + config; skip model loading on CPU-only systems):

    python scripts/train_qlora.py --smoke

Full training example (GPU required):

    python scripts/train_qlora.py \
      --train_path data/processed/train.jsonl \
      --val_path data/processed/val.jsonl \
      --base_model mistralai/Mistral-7B-Instruct-v0.1 \
      --output_dir outputs/ \
      --max_steps 500 \
      --per_device_train_batch_size 1 \
      --gradient_accumulation_steps 8 \
      --learning_rate 2e-4 \
      --warmup_steps 50 \
      --weight_decay 0.0 \
      --max_seq_length 2048 \
      --lora_r 16 \
      --lora_alpha 16 \
      --lora_dropout 0.0 \
      --seed 42

Key flags (see `--help` for the full list):
- `--train_path`, `--val_path`: paths to the processed Alpaca-style JSONL files.
- `--base_model`: base model name; default: `mistralai/Mistral-7B-Instruct-v0.1`.
- `--output_dir`: directory for all outputs (adapters, metrics, meta).
- `--max_steps`, `--per_device_train_batch_size`, `--gradient_accumulation_steps`: control how much training is performed and the effective batch size.
- `--learning_rate`, `--warmup_steps`, `--weight_decay`: standard optimizer hyperparameters.
- `--max_seq_length`: sequence length used for tokenization. Larger values increase VRAM usage.
- `--lora_r`, `--lora_alpha`, `--lora_dropout`: LoRA hyperparameters that define adapter capacity and regularization.
- `--seed`: random seed for reproducibility.
- `--dry_run`: loads and formats the datasets, prints a few sample prompts, and exits; no model is loaded.
- `--smoke`: validates the dataset and configuration. On CPU-only systems it skips model loading; on a GPU it can be extended to attempt a lightweight model load.
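A minimal sketch of how these flags could be wired up with `argparse` follows. The `make_parser` helper is hypothetical and the actual script's parser may define more options and different help text; the defaults shown simply mirror the full training example above.

```python
import argparse

def make_parser() -> argparse.ArgumentParser:
    """Build a parser covering the key training flags discussed above."""
    p = argparse.ArgumentParser(description="QLoRA fine-tuning (sketch)")
    p.add_argument("--train_path", default="data/processed/train.jsonl")
    p.add_argument("--val_path", default="data/processed/val.jsonl")
    p.add_argument("--base_model", default="mistralai/Mistral-7B-Instruct-v0.1")
    p.add_argument("--output_dir", default="outputs/")
    p.add_argument("--max_steps", type=int, default=500)
    p.add_argument("--per_device_train_batch_size", type=int, default=1)
    p.add_argument("--gradient_accumulation_steps", type=int, default=8)
    p.add_argument("--learning_rate", type=float, default=2e-4)
    p.add_argument("--warmup_steps", type=int, default=50)
    p.add_argument("--weight_decay", type=float, default=0.0)
    p.add_argument("--max_seq_length", type=int, default=2048)
    p.add_argument("--lora_r", type=int, default=16)
    p.add_argument("--lora_alpha", type=int, default=16)
    p.add_argument("--lora_dropout", type=float, default=0.0)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--dry_run", action="store_true")
    p.add_argument("--smoke", action="store_true")
    return p
```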
After a full training run (not `--dry_run` / `--smoke`), you should see:

- `outputs/adapters/`: LoRA adapter weights and tokenizer config.
- `outputs/run_meta.json`, containing:
  - Base model name.
  - Dataset paths.
  - Training hyperparameters (`TrainingConfig`).
  - Dataset sizes.
  - Git commit (if available).
  - Run mode (`train`, `smoke`, or `dry_run`).
- `outputs/metrics.json`: training and evaluation metrics, if available from the trainer.

For `--dry_run` and `--smoke`, the script still writes lightweight `run_meta.json` and `metrics.json` files with explanatory notes.
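Writing this metadata only needs the standard library; a sketch of what producing `run_meta.json` might involve is shown below. The `write_run_meta` helper and its exact field names are assumptions that follow the list above, not the actual schema used by `scripts/train_qlora.py`.

```python
import json
from pathlib import Path

def write_run_meta(output_dir, config, mode, train_size, val_size, git_commit):
    """Persist run metadata next to the adapters so runs are auditable."""
    meta = {
        "base_model": config.get("base_model"),
        "train_path": config.get("train_path"),
        "val_path": config.get("val_path"),
        "training_config": config,
        "dataset_sizes": {"train": train_size, "val": val_size},
        "git_commit": git_commit,  # None when the repo state is unavailable
        "mode": mode,              # "train", "smoke", or "dry_run"
    }
    path = Path(output_dir) / "run_meta.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    return path
```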
To use the notebook in Google Colab:

- Upload or clone the repository into Colab.
- Open `notebooks/finetune_mistral7b_qlora_text2sql.ipynb`.
- Run the environment setup cell, which installs `unsloth`, `trl`, `bitsandbytes`, etc.
- Mount Google Drive (optional) to persist `outputs/` and `data/`.
- Run the preprocessing step, or copy `data/processed/*.jsonl` into the environment.
- Enable the fast dev run section first to verify that everything works on a very small subset of the data.
- Once satisfied, disable fast dev mode and run a full training pass.
If you encounter CUDA OOM errors:

- Reduce the effective batch size:
  - Lower `--per_device_train_batch_size`.
  - Increase `--gradient_accumulation_steps` to maintain the same effective batch size.
- Shorten the sequence length:
  - Reduce `--max_seq_length` (e.g., from 2048 to 1024 or 768).
- Close other GPU processes:
  - Free up GPU memory used by other notebooks or services.
- Also consider:
  - Reducing `max_steps` for exploratory runs.
  - Using mixed precision (bf16/fp16) where supported.
  - Ensuring data loading is not a bottleneck.
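The first mitigation works because the optimizer's effective batch size is the per-device batch multiplied by the gradient-accumulation steps (times the number of GPUs, assumed 1 here): shrinking one factor while growing the other keeps training dynamics roughly unchanged while cutting peak activation memory. A quick illustration (helper name is ours, not the script's):

```python
def effective_batch_size(per_device: int, grad_accum: int, n_gpus: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device * grad_accum * n_gpus

# Both configurations feed 8 examples into every optimizer step, but the
# second holds only one example's activations in GPU memory at a time.
before = effective_batch_size(per_device=2, grad_accum=4)
after = effective_batch_size(per_device=1, grad_accum=8)
```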
Training itself requires a GPU. However, you can still:

- Run `python scripts/train_qlora.py --dry_run` to validate dataset formatting and prompt construction.
- Run `python scripts/train_qlora.py --smoke` to validate config and pipeline wiring.
- Use some notebook cells for data inspection and prompt experimentation.
After primary training on b-mc2/sql-create-context, we perform
secondary external validation on the Spider dev split (e.g.,
xlangai/spider), which is significantly more challenging (multi-table,
cross-domain text-to-SQL).
This is implemented via the Spider evaluation pipeline (Task 4). For details on datasets, metrics, and how to run the scripts, see: