ViPERSQL goes beyond simple Exact Match (EM), providing component-wise analysis and error categorization for deeper understanding of model performance.
Enhanced Evaluation Metrics
│
├── Exact Match (EM)
│ └── Binary: SQL query matches 100%
│
├── Component-wise F1
│ ├── SELECT clause F1
│ ├── FROM clause F1
│ ├── WHERE clause F1
│ ├── GROUP BY clause F1
│ ├── ORDER BY clause F1
│ ├── HAVING clause F1
│ └── KEYWORDS F1
│
├── Error Analysis
│ ├── Missing clauses
│ ├── Extra clauses
│ ├── Wrong predicates
│ └── Schema errors
│
└── Query Complexity
├── Simple
├── Moderate
└── Challenging
The predicted SQL query matches the gold query exactly (after normalization),
where normalization includes:
- Lowercase: SELECT → select
- Remove extra whitespace: select  * → select *
- Standardize quotes: "table" → 'table'
- Sort clauses: ensure a consistent order
- Remove trailing semicolon: select * from t; → select * from t
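These normalization rules can be sketched in Python (an illustrative sketch, not ViPERSQL's actual implementation; clause sorting is omitted for brevity):

```python
import re

def normalize_sql(sql: str) -> str:
    """Apply the normalization steps listed above (sketch only)."""
    sql = sql.strip().rstrip(";")      # remove trailing semicolon
    sql = sql.lower()                  # lowercase keywords and identifiers
    sql = re.sub(r"\s+", " ", sql)     # collapse extra whitespace
    sql = sql.replace('"', "'")        # standardize quotes
    return sql

def exact_match(predicted: str, gold: str) -> bool:
    """Binary EM: 1 if the normalized strings are identical, else 0."""
    return normalize_sql(predicted) == normalize_sql(gold)

print(exact_match("SELECT  * FROM t;", "select * from t"))  # True
```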
Limitations:
- ❌ Binary (all-or-nothing)
- ❌ Does not show where the model fails
- ❌ Does not distinguish minor vs major errors
- ❌ Semantically equivalent queries scored as wrong
Example:
Predicted: SELECT name FROM students WHERE age > 18
Gold: SELECT name FROM students WHERE 18 < age
→ EM = 0 (although semantically equivalent)
Evaluates each component of the SQL query separately.
- SELECT - Columns to retrieve
- FROM - Tables involved
- WHERE - Filter conditions
- GROUP BY - Grouping columns
- ORDER BY - Sorting columns
- HAVING - Group filters
- KEYWORDS - SQL operations (DISTINCT, JOIN, etc.)
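Component extraction can be approximated with keyword splitting; a real implementation should use a proper SQL parser, since this regex sketch does not handle nested subqueries or keywords inside string literals:

```python
import re

# Clause keywords in the order they appear in a flat SELECT query.
CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]

def extract_components(sql: str) -> dict:
    """Split a flat query into clause -> set of comma-separated items
    (rough sketch; not robust to subqueries or quoted keywords)."""
    sql = " " + re.sub(r"\s+", " ", sql.strip()) + " "
    pattern = "|".join(re.escape(c) for c in CLAUSES)
    # re.split with a capturing group keeps the matched keywords.
    parts = re.split(rf"\s({pattern})\s", sql, flags=re.IGNORECASE)
    components = {}
    for kw, body in zip(parts[1::2], parts[2::2]):
        items = {item.strip() for item in body.split(",") if item.strip()}
        components[kw.upper()] = items
    return components

print(extract_components("SELECT name, age FROM students WHERE age > 18"))
```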
Example Query:
SELECT DISTINCT s.name, AVG(g.score)
FROM students s
JOIN grades g ON s.id = g.student_id
WHERE s.class = '10A'
GROUP BY s.name
HAVING AVG(g.score) > 8.0
Extracted Components:
{
"SELECT": {"s.name", "AVG(g.score)"},
"FROM": {"students", "grades"},
"WHERE": {"s.class = '10A'"},
"GROUP BY": {"s.name"},
"HAVING": {"AVG(g.score) > 8.0"},
"KEYWORDS": {"DISTINCT", "JOIN", "AVG"}
}
For each component:
Predicted SELECT: {s.name, s.age}
Gold SELECT: {s.name, AVG(g.score)}
Precision = |Predicted ∩ Gold| / |Predicted| = 1/2
Recall = |Predicted ∩ Gold| / |Gold| = 1/2
F1 = 2 × Precision × Recall / (Precision + Recall) = 0.5
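For the example sets above, the overlap is {s.name}, so precision = recall = 1/2 and F1 = 0.5. A minimal sketch of the per-clause computation (function name is illustrative):

```python
def component_f1(predicted: set, gold: set):
    """Set-based F1 over one clause's items (sketch)."""
    if not predicted and not gold:
        return None                      # clause absent in both: not scored
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(component_f1({"s.name", "s.age"}, {"s.name", "AVG(g.score)"}))  # 0.5
```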
Missing Clauses
Definition: Gold query has a clause that the prediction omits.
Example:
Predicted: SELECT name FROM students
Gold: SELECT name FROM students WHERE age > 18
→ Missing WHERE clause
Extra Clauses
Definition: Prediction contains an unnecessary clause.
Example:
Predicted: SELECT name FROM students ORDER BY name
Gold: SELECT name FROM students
→ Extra ORDER BY clause
Wrong Predicates
Definition: Clause is present but its conditions are incorrect.
Example:
Predicted: WHERE age > 20
Gold: WHERE age > 18
→ Wrong predicate value
Schema Errors
Definition: Incorrect table or column names are used.
Example:
Predicted: SELECT student_name FROM pupils
Gold: SELECT name FROM students
→ Wrong table name, wrong column name
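The first three error categories fall out of comparing the extracted component sets; schema errors are omitted here because they require a lookup against the database schema. A sketch (the bucketing rules are assumptions, not ViPERSQL's exact logic):

```python
def categorize_errors(pred: dict, gold: dict) -> dict:
    """Compare extracted components (clause -> set of items) and bucket
    differences into the error types above (sketch; no schema check)."""
    errors = {"missing_clauses": [], "extra_clauses": [], "wrong_predicates": []}
    for clause in set(pred) | set(gold):
        p, g = pred.get(clause, set()), gold.get(clause, set())
        if g and not p:
            errors["missing_clauses"].append(clause)       # gold-only clause
        elif p and not g:
            errors["extra_clauses"].append(clause)         # prediction-only clause
        elif p != g and clause in ("WHERE", "HAVING"):
            errors["wrong_predicates"].append(f"{p} vs {g}")  # condition mismatch
    return errors

pred = {"SELECT": {"name"}, "FROM": {"students"}}
gold = {"SELECT": {"name"}, "FROM": {"students"}, "WHERE": {"age > 18"}}
print(categorize_errors(pred, gold)["missing_clauses"])  # ['WHERE']
```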
{
"missing_clauses": {
"WHERE": 15,
"GROUP BY": 8,
"HAVING": 3
},
"extra_clauses": {
"ORDER BY": 5
},
"wrong_predicates": 22,
"schema_errors": 10
}
Simple
Characteristics:
- Single table
- Basic SELECT
- Simple WHERE (0-2 conditions)
- No JOINs, no aggregations
Example:
SELECT name FROM students WHERE class = '10A'
Moderate
Characteristics:
- 2-3 tables
- Simple JOINs
- Basic aggregations (COUNT, SUM, AVG)
- GROUP BY without HAVING
Example:
SELECT s.class, COUNT(*)
FROM students s
JOIN grades g ON s.id = g.student_id
GROUP BY s.class
Challenging
Characteristics:
- 3+ tables
- Multiple JOINs
- Complex aggregations
- Nested queries
- HAVING clauses
Example:
SELECT s.name, AVG(g.score)
FROM students s
JOIN grades g ON s.id = g.student_id
JOIN courses c ON g.course_id = c.id
WHERE c.department = 'Math'
GROUP BY s.name
HAVING AVG(g.score) > (SELECT AVG(score) FROM grades)
{
"simple": {
"count": 100,
"em_accuracy": 0.85,
"avg_f1": 0.92
},
"moderate": {
"count": 80,
"em_accuracy": 0.65,
"avg_f1": 0.78
},
"challenging": {
"count": 20,
"em_accuracy": 0.30,
"avg_f1": 0.55
}
}
Input: predicted_sql, gold_sql, db_id, schema_path
1. Normalize both queries
2. Check Exact Match
3. Extract components from both
4. Compute component-wise F1
5. Analyze errors
6. Classify complexity
7. Return metrics dict
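Step 6's complexity classification can be sketched as a keyword heuristic over the criteria listed above (the thresholds here are assumptions, not ViPERSQL's exact rules):

```python
def classify_complexity(sql: str) -> str:
    """Heuristic complexity label based on JOIN count, nesting,
    aggregation, and HAVING (sketch; thresholds are assumptions)."""
    s = sql.upper()
    joins = s.count(" JOIN ")
    nested = s.count("SELECT") > 1          # subquery implies a second SELECT
    if joins >= 2 or nested or " HAVING " in s:
        return "challenging"
    if joins == 1 or any(f in s for f in ("COUNT(", "SUM(", "AVG(")):
        return "moderate"
    return "simple"

print(classify_complexity("SELECT name FROM students WHERE class = '10A'"))  # simple
```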
Input: List of predictions, schema_path
1. For each prediction:
Evaluate single query
2. Aggregate metrics:
- Overall EM accuracy
- Average component F1
- Error statistics
- Complexity breakdown
3. Generate report
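Step 2's aggregation reduces to averaging the per-query metric dicts; a minimal sketch over the fields shown in the result examples below:

```python
from statistics import mean

def aggregate(results: list) -> dict:
    """Average per-query metrics into batch-level figures (sketch;
    only the EM and F1 fields from the result examples are covered)."""
    return {
        "total_queries": len(results),
        # bools average to a rate: True=1, False=0
        "exact_match_accuracy": mean(r["exact_match"] for r in results),
        "avg_f1": mean(r["avg_f1"] for r in results),
    }

results = [
    {"exact_match": True, "avg_f1": 1.0},
    {"exact_match": False, "avg_f1": 0.5},
]
print(aggregate(results))  # {'total_queries': 2, 'exact_match_accuracy': 0.5, 'avg_f1': 0.75}
```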
Single Query Result:
{
"request_id": "q_001",
"db_id": "academic",
"predicted_sql": "SELECT name FROM students WHERE age > 18",
"gold_sql": "SELECT name FROM students WHERE age >= 18",
"exact_match": false,
"syntax_valid": true,
"avg_f1": 0.92,
"f1_scores": {
"SELECT": 1.0,
"FROM": 1.0,
"WHERE": 0.67,
"GROUP BY": null,
"ORDER BY": null,
"HAVING": null,
"KEYWORDS": 1.0
},
"details": {
"error_analysis": {
"missing_clauses": [],
"extra_clauses": [],
"wrong_predicates": ["age > 18 vs age >= 18"]
},
"query_complexity": "simple"
}
}
Batch Result:
{
"metadata": {
"total_queries": 200,
"timestamp": "2025-12-01T10:30:00",
"dataset": "ViText2SQL std-level dev"
},
"overall_metrics": {
"exact_match_accuracy": 0.72,
"avg_f1": 0.84,
"syntax_valid_rate": 0.98
},
"component_f1": {
"SELECT": 0.91,
"FROM": 0.93,
"WHERE": 0.78,
"GROUP BY": 0.71,
"ORDER BY": 0.85,
"HAVING": 0.62,
"KEYWORDS": 0.88
},
"error_statistics": {
"missing_clauses": 45,
"extra_clauses": 12,
"wrong_predicates": 38,
"schema_errors": 18
},
"complexity_breakdown": {
"simple": {"count": 120, "em": 0.85, "f1": 0.92},
"moderate": {"count": 60, "em": 0.65, "f1": 0.78},
"challenging": {"count": 20, "em": 0.30, "f1": 0.55}
}
}
# Basic evaluation (automatic)
python vipersql.py --strategy few-shot --samples 100
# Results saved to: results/few-shot-vir2_100_TIMESTAMP/
results/few-shot-vir2_100_20251201_103000/
├── predictions.json # All predictions
├── eval_results.json # Detailed metrics
└── eval_report.txt # Human-readable summary
predictions.json:
{
"predictions": [
{
"db_id": "academic",
"question": "...",
"predicted": "SELECT ...",
"gold": "SELECT ..."
}
]
}
eval_results.json:
- Full metrics (as shown in Batch Result above)
eval_report.txt:
=== Evaluation Report ===
Total Queries: 200
Exact Match Accuracy: 72.0%
Average F1 Score: 84.0%
Component F1 Scores:
- SELECT: 91.0%
- FROM: 93.0%
- WHERE: 78.0%
...
Error Analysis:
- Missing clauses: 45
- Extra clauses: 12
...
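A small helper for consuming these artifacts programmatically (the run directory is the hypothetical example from the output tree above; field names follow the batch result JSON):

```python
import json
from pathlib import Path

def summarize(metrics: dict) -> str:
    """One-line summary from an eval_results.json payload (sketch)."""
    o = metrics["overall_metrics"]
    return f"EM: {o['exact_match_accuracy']:.1%}  F1: {o['avg_f1']:.1%}"

# Hypothetical run directory; file name matches the output tree above.
run_dir = Path("results/few-shot-vir2_100_20251201_103000")
result_file = run_dir / "eval_results.json"
if result_file.exists():
    print(summarize(json.loads(result_file.read_text())))
```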
✅ Excellent - Model performs well
Action: Check error analysis for patterns
❌ Poor performance - Fundamental issues
Actions:
- Check training data quality
- Try different selector
- Improve prompts
- Use better model
High SELECT F1, Low WHERE F1:
- Model good at identifying columns
- Struggles with conditions
Low FROM F1:
- Schema understanding issues
- Need better schema descriptions
Low GROUP BY/HAVING F1:
- Complexity too high
- Need more aggregation examples
- Exact Match (EM) - Standard
- Execution Accuracy (EX) - Run query, compare results (requires DB)
- ✅ Component F1 - Detailed breakdown
- ✅ Error analysis - Understand failures
- ✅ Complexity analysis - Performance by difficulty
- ✅ Syntax validation - Catch parse errors
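Syntax validation can be implemented by asking SQLite to compile the query against a throwaway in-memory schema; EXPLAIN prepares the statement without executing it. A sketch (the DDL is a hypothetical example, and note this approach also flags unknown tables and columns, not just parse errors):

```python
import sqlite3

def syntax_valid(sql: str, schema_ddl: str) -> bool:
    """True if SQLite can compile the query against the given schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_ddl)
    try:
        conn.execute(f"EXPLAIN {sql}")  # compiles without running the query
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

ddl = "CREATE TABLE students (id INT, name TEXT, age INT);"
print(syntax_valid("SELECT name FROM students WHERE age > 18", ddl))  # True
print(syntax_valid("SELEC name FORM students", ddl))                  # False
```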
# .env configuration
ENABLE_COMPONENT_ANALYSIS=true
ENABLE_ERROR_ANALYSIS=true
EVALUATION_TIMEOUT=30
# Specify schema for evaluation
python vipersql.py \
--schema-path dataset/ViText2SQL/std-level/tables.json \
--samples 100
{
"database_breakdown": {
"academic": {"em": 0.75, "f1": 0.85},
"scholar": {"em": 0.68, "f1": 0.80},
"world_1": {"em": 0.72, "f1": 0.83}
}
}
{
"sql_type_breakdown": {
"SELECT_only": {"count": 50, "em": 0.90},
"SELECT_WHERE": {"count": 80, "em": 0.75},
"JOIN": {"count": 40, "em": 0.60},
"AGGREGATION": {"count": 30, "em": 0.55}
}
}
Solutions:
- Reduce samples
- Disable detailed error analysis
- Increase evaluation timeout
Causes:
- Component extraction issues
- Normalization differences
Solutions:
- Check SQL parsing
- Verify schema paths
- Review normalization logic
Solutions:
- Improve schema descriptions
- Add schema examples in prompts
- Check database coverage in training
- Strategies - Different generation approaches
- Selectors - Example selection impact on accuracy
- Configuration - Evaluation parameters
- Quick Start - Running evaluations