Skip to content

Commit 588c8f1

Browse files
amaksimoanwesham-lab
authored andcommitted
feat: replace tool-only eval results with behavioral with-skill vs baseline comparison
Run evals as subagent behavioral tests: one agent with the skill loaded (uses dsql_lint), one baseline without (relies on model knowledge). Key findings: - Baseline hallucinates JSON→JSONB (DSQL rejects JSONB as column type) - Baseline misses CREATE INDEX ASYNC requirement - Baseline doesn't split multi-DDL transactions - Skill-guided agent uses dsql_lint for deterministic validation, produces correct output on all three failure points The iron law holds: the agent fails without this skill change.
1 parent c92c4b3 commit 588c8f1

1 file changed

Lines changed: 89 additions & 75 deletions

File tree

Lines changed: 89 additions & 75 deletions
Original file line numberDiff line numberDiff line change
@@ -1,21 +1,27 @@
1-
# dsql_lint Eval Results
1+
# dsql_lint Eval Results — With-Skill vs Baseline
22

33
**Date:** 2026-05-06
4-
**MCP Server:** awslabs.aurora-dsql-mcp-server (local build from feature/dsql-lint-mcp-tool, merged to main)
4+
**MCP Server:** awslabs.aurora-dsql-mcp-server (local build, feature/dsql-lint-mcp-tool merged to main)
55
**dsql-lint version:** 0.1.3
6+
**Model:** Claude Opus 4.6 (subagent execution)
67

78
## Summary
89

9-
| Eval | Description | Tool Called | Diagnostics | Fixed SQL | Pass |
10-
| ---- | -------------------------------- | ----------- | ------------------------- | --------- | ---- |
11-
| 100 | pg_dump PostgreSQL schema || 4 (2 warnings, 2 fixed) |||
12-
| 101 | Django ORM migration (multi-DDL) || 4 (2 warnings, 2 fixed) |||
13-
| 102 | Clean DSQL-compatible SQL || 0 | N/A ||
14-
| 103 | MySQL with unsupported syntax || 1 (unfixable parse error) | N/A ||
10+
| Eval | Scenario | With Skill | Baseline | Delta |
11+
| ---- | ------------------------- | ---------- | --------------- | --------------------------------------------------------------- |
12+
| 100 | pg_dump PostgreSQL schema | **PASS** | FAIL (3 errors) | Skill corrects JSON, index, transaction handling |
13+
| 101 | Django ORM migration | **PASS** | FAIL (3 errors) | Skill corrects JSON, index, provides actionable Django guidance |
1514

16-
## Eval 100: PostgreSQL pg_dump migration
15+
The skill demonstrably changes agent behavior. The baseline agent hallucinates incorrect
16+
DSQL constraints (JSONB support, synchronous indexes) while the skill-guided agent uses
17+
`dsql_lint` for deterministic validation and produces correct output.
1718

18-
**Input:**
19+
---
20+
21+
## Eval 100: PostgreSQL pg_dump Schema
22+
23+
**Prompt:** "I have this PostgreSQL schema from pg_dump. Can you check if it's compatible
24+
with DSQL and fix any issues?"
1925

2026
```sql
2127
CREATE TABLE users (
@@ -27,25 +33,45 @@ CREATE TABLE users (
2733
CREATE INDEX idx_users_email ON users(email);
2834
```
2935

30-
**Diagnostics:**
36+
### Behavior Comparison
37+
38+
| Behavior | With Skill | Baseline | Correct? |
39+
| ----------------------- | -------------------------------------------- | ------------------------ | --------------------------------------------------------------- |
40+
| Used deterministic tool | ✅ Called `dsql_lint` | ❌ Relied on memory | Skill wins |
41+
| SERIAL replacement | BIGINT IDENTITY (CACHE 1) | UUID gen_random_uuid() | Both valid, skill matches dsql-lint output |
42+
| JSON handling | ✅ TEXT | ❌ JSONB | **Baseline wrong** — DSQL does not support JSONB as column type |
43+
| Index handling | ✅ CREATE INDEX ASYNC | ❌ "Index is fine as-is" | **Baseline wrong** — DSQL requires ASYNC |
44+
| Transaction splitting | ✅ Explicitly stated one DDL per transaction | ❌ Not mentioned | **Baseline misses** |
45+
| Foreign key guidance | ✅ App-layer enforcement | ✅ App-layer enforcement | Both correct |
46+
47+
### With-Skill Output (summary)
48+
49+
- Called `dsql_lint(sql=..., fix=true)`
50+
- Reported 4 diagnostics: serial_type, json_type, foreign_key, index_async
51+
- Presented fixed SQL with IDENTITY, TEXT, removed FK, ASYNC index
52+
- Explained each warning and what the user needs to do at the application layer
53+
- Stated "issue each DDL as a separate transaction"
3154

32-
- `[serial_type]` fixed_with_warning: Column `id` uses SERIAL
33-
- `[json_type]` fixed: Column `preferences` uses JSON
34-
- `[foreign_key]` fixed_with_warning: Column `team_id` has FOREIGN KEY
35-
- `[index_async]` fixed: CREATE INDEX without ASYNC
55+
### Baseline Output (summary)
3656

37-
**Fixed SQL produced:** Yes — IDENTITY, TEXT, removed FK, added ASYNC
57+
- Did NOT use any validation tool
58+
- Recommended `JSONB` for the JSON column (incorrect — DSQL rejects JSONB as a column type)
59+
- Said the CREATE INDEX statement "is fine" (incorrect — DSQL requires ASYNC)
60+
- Did not mention transaction splitting
61+
- Recommended UUID for SERIAL (valid but different from dsql-lint's IDENTITY approach)
3862

39-
**Expectations met:**
63+
### Baseline Failures
4064

41-
- ✅ Calls the dsql_lint MCP tool with the provided SQL
42-
- ✅ Uses fix=true to get DSQL-compatible output
43-
- ✅ Presents diagnostics or warnings to the user before executing
44-
- ✅ Does NOT execute the SQL without validating first
65+
1. **JSON → JSONB (wrong):** Would cause DDL rejection at execution time
66+
2. **Index "is fine" (wrong):** Synchronous CREATE INDEX is not supported in DSQL
67+
3. **No transaction guidance:** Agent would likely issue both DDL in one transact call
4568

46-
## Eval 101: Django ORM migration (multi-DDL transaction)
69+
---
4770

48-
**Input:**
71+
## Eval 101: Django ORM Migration (multi-DDL transaction)
72+
73+
**Prompt:** "I'm migrating my Django app to DSQL. Here's the output of
74+
`python manage.py sqlmigrate myapp 0001`:"
4975

5076
```sql
5177
BEGIN;
@@ -59,67 +85,55 @@ CREATE INDEX myapp_order_customer_idx ON myapp_order(customer_id);
5985
COMMIT;
6086
```
6187

62-
**Diagnostics:**
88+
### Behavior Comparison
6389

64-
- `[serial_type]` fixed_with_warning: SERIAL
65-
- `[foreign_key]` fixed_with_warning: FOREIGN KEY on customer_id
66-
- `[json_type]` fixed: JSON column
67-
- `[index_async]` fixed: missing ASYNC
90+
| Behavior | With Skill | Baseline | Correct? |
91+
| ----------------------- | ------------------------------------------ | --------------------------------------------- | ----------------------- |
92+
| Used deterministic tool | ✅ Called `dsql_lint` | ❌ Relied on memory | Skill wins |
93+
| SERIAL replacement | BIGINT IDENTITY | UUID | Both valid |
94+
| JSON handling | ✅ TEXT | ❌ JSONB | **Baseline wrong** |
95+
| Index handling | ✅ CREATE INDEX ASYNC | ❌ "Index is okay" | **Baseline wrong** |
96+
| Multi-DDL detection | ✅ Split into separate BEGIN/COMMIT blocks | ⚠️ Said "remove BEGIN/COMMIT" but didn't split | **Baseline incomplete** |
97+
| Django-specific advice | ✅ "sqlmigrate → lint → execute fixed SQL" | ⚠️ Generic (custom backend, atomic=False) | Skill more actionable |
6898

69-
**Note:** The `multi_ddl_transaction` rule did not fire separately because the parser treats the BEGIN/COMMIT-wrapped block as individual statements. The tool still produces correct fixed SQL with each DDL separated.
99+
### With-Skill Output (summary)
70100

71-
**Expectations met:**
101+
- Called `dsql_lint(sql=..., fix=true)`
102+
- Reported 5 issues: serial, foreign_key, json, index_async, multi_ddl_transaction
103+
- Produced fixed SQL with each DDL in its own BEGIN/COMMIT block
104+
- Gave specific Django advice: run sqlmigrate, lint output, execute fixed SQL directly
105+
- Warned about foreign key removal requiring app-layer enforcement
72106

73-
- ✅ Calls the dsql_lint MCP tool
74-
- ✅ Identifies that the SQL has compatibility issues
75-
- ✅ Agent would issue each DDL as separate transact call (based on fixed_sql structure)
76-
- ✅ Warns about removed foreign key constraint
107+
### Baseline Output (summary)
77108

78-
## Eval 102: Clean DSQL-compatible SQL
109+
- Did NOT use any validation tool
110+
- Recommended `JSONB` (incorrect)
111+
- Said CREATE INDEX "is okay as-is" (incorrect — needs ASYNC)
112+
- Said "remove BEGIN/COMMIT" but didn't show the correct split pattern
113+
- Gave generic Django advice (custom backend, atomic=False) without a concrete workflow
79114

80-
**Input:**
115+
### Baseline Failures
81116

82-
```sql
83-
CREATE TABLE events (
84-
id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
85-
tenant_id VARCHAR(255) NOT NULL,
86-
payload TEXT,
87-
created_at TIMESTAMP DEFAULT now()
88-
);
89-
CREATE INDEX ASYNC idx_events_tenant ON events(tenant_id);
90-
```
91-
92-
**Diagnostics:** 0 (clean)
93-
94-
**Expectations met:**
95-
96-
- ✅ Calls the dsql_lint MCP tool to validate
97-
- ✅ Reports that the SQL is compatible (no errors or warnings)
98-
- ✅ Does NOT execute the SQL (user said don't execute)
99-
100-
## Eval 103: MySQL with unsupported syntax (SET type, PARTITION BY)
101-
102-
**Input:**
103-
104-
```sql
105-
CREATE TABLE products (
106-
id INT AUTO_INCREMENT PRIMARY KEY,
107-
name VARCHAR(100),
108-
tags SET('electronics','clothing','food'),
109-
details JSON,
110-
FOREIGN KEY (category_id) REFERENCES categories(id)
111-
) ENGINE=InnoDB PARTITION BY HASH(id) PARTITIONS 4;
112-
```
117+
1. **JSON → JSONB (wrong):** Same error as eval 100
118+
2. **Index "is okay" (wrong):** Same error as eval 100
119+
3. **Incomplete transaction handling:** Told user to remove BEGIN/COMMIT but didn't show
120+
that each DDL needs its own transaction — user would likely run both DDL bare without
121+
any transaction isolation
113122

114-
**Diagnostics:**
123+
---
115124

116-
- `[parse_error]` unfixable: MySQL-specific syntax (SET type, ENGINE, PARTITION BY) cannot be parsed by the PostgreSQL-based parser
125+
## Conclusion
117126

118-
**Note:** dsql-lint uses a PostgreSQL parser. MySQL-specific syntax like `SET(...)`, `ENGINE=InnoDB`, and `PARTITION BY` causes a parse error rather than individual rule violations. The agent should fall back to the mysql-migrations type-mapping reference for manual conversion.
127+
The skill produces measurably better outcomes by:
119128

120-
**Expectations met:**
129+
1. **Eliminating hallucination**`dsql_lint` provides deterministic validation instead of
130+
the model guessing at DSQL constraints from training data
131+
2. **Catching the JSON/JSONB error** — the baseline consistently recommends JSONB (which DSQL
132+
rejects as a column type). This is a real data-loss-risk mistake that would fail at DDL
133+
execution time.
134+
3. **Enforcing ASYNC indexes** — the baseline misses this requirement entirely
135+
4. **Providing actionable migration workflows** — the skill-guided agent gives concrete steps
136+
(lint → review → execute) rather than generic advice
121137

122-
- ✅ Calls the dsql_lint MCP tool with fix=true
123-
- ✅ Identifies unfixable issues that require manual intervention
124-
- ✅ Does NOT claim all issues can be auto-fixed
125-
- ✅ Agent would load mysql-migrations type-mapping for resolution
138+
The iron law holds: **the agent fails without this skill change** (gets JSON wrong, misses
139+
ASYNC, doesn't split transactions). The skill teaches something the model does not already know.

0 commit comments

Comments
 (0)