Can AI Map and Transform Education Data?
Findings from Experiments 1 & 2: Evaluating Agentic Components for AI-Assisted Schema Mapping and Transform Expression Generation
| Author: | Olivier Mills (Baobab Tech) |
| Client: | DataKind |
| Date: | January 31, 2026 |
| Version: | 1.1 |
When education organizations share data, they face a fundamental problem: different systems use different formats. A student record in one database might have fields labeled "student_id" and "first_name," while another system expects "Person.Identifier" and "Person.Name.firstName." Before data can flow between systems, someone must create a translation guide, matching each field in the source system to its counterpart in the destination system.
This translation process, called schema mapping, has traditionally required skilled professionals to manually examine hundreds of fields, understand their meaning, and find the correct matches. For a typical dataset with 140 fields, this work takes 8 to 16 hours of expert time and remains prone to human error.
Recent advances in artificial intelligence have produced systems capable of autonomous reasoning and task execution, often called "agentic AI." These systems can read instructions, reason about problems, and take actions to accomplish goals. However, deploying agentic AI in production requires understanding which underlying components work reliably and cost-effectively.
This study evaluates two foundational capabilities that agentic AI systems need for data integration tasks:
- Schema Mapping: Can AI correctly identify which data fields correspond to each other across different systems?
- Code Generation: Can AI write the transformation logic needed to convert data values between formats?
These capabilities form the building blocks for more sophisticated agentic workflows in data engineering.
Our experiments demonstrated that AI systems can correctly identify approximately 70-80% of field matches, compared to only 40-45% for traditional automated methods. More importantly, AI can process an entire dataset in under one minute at a cost of less than 10 cents, compared to hours of professional labor.
For code generation, AI systems successfully produced valid transformation logic 91-98% of the time at a cost of less than one cent per transformation.
AI does not eliminate the need for human review. The remaining 20-30% of mapping cases require expert judgment, and AI occasionally proposes incorrect matches that humans must catch and correct. The practical benefit is transforming the task from "examine every field manually" to "review and verify AI suggestions," reducing total time from 8-16 hours to 1-3 hours.
This finding has implications for agentic AI deployment: even capable systems benefit from human-in-the-loop validation, particularly for consequential decisions.
A surprising finding contradicted common assumptions about AI capabilities. The largest, most expensive AI systems did not perform meaningfully better than smaller, more economical alternatives. In some cases, smaller systems actually produced better results.
For straightforward datasets, the least expensive AI option (costing less than one cent per run) achieved 96% of the accuracy of premium systems costing 24 times more. For code generation tasks, mid-tier systems outperformed premium alternatives while costing 5 to 8 times less.
This suggests that for well-defined agentic tasks, organizations should select AI components based on demonstrated performance rather than assuming that larger models are superior.
For a typical 140-field dataset:
- Traditional manual approach: 8-16 hours of expert time
- AI-assisted approach: Less than $0.20 in AI costs plus 1-3 hours of review time
- Use AI to generate initial field matches and transformation code
- Have human experts review AI suggestions, focusing on low-confidence predictions
- Verify a sample of high-confidence predictions to catch systematic errors
- Test transformation code against sample data before production use
- Human oversight remains essential. Agentic AI accelerates work but does not replace the need for domain expertise and validation.
- Task characteristics matter. Technically complex datasets require different AI configurations than simpler ones.
- Cost efficiency favors appropriately-sized systems. Premium AI services do not provide proportional benefits for well-defined tasks.
- Component reliability varies. Different AI providers show different reliability profiles; selection should be based on empirical testing.
The agentic AI components evaluated in this study can meaningfully assist with data integration tasks in education. The technology reduces costs and accelerates timelines while maintaining accuracy suitable for production use with appropriate human oversight.
Organizations considering agentic AI for data integration should pilot the approach with representative data, establish review procedures appropriate to their accuracy requirements, and select AI components based on demonstrated performance rather than marketing claims about capability.
Education data systems store information in different formats. When data moves between systems (such as from the Postsecondary Data Partnership (PDP) to the Learner Information Framework (LIF)), someone must determine which fields correspond to each other. A field called student_id in one system might map to Person.Identifier.identifier in another. This "schema mapping" process is traditionally manual, time-consuming, and error-prone.
Agentic AI systems, those capable of autonomous reasoning and multi-step task execution, require reliable underlying components. For data integration workflows, two foundational capabilities are essential: (1) semantic understanding to identify field correspondences across schemas, and (2) code generation to produce transformation logic. This study rigorously evaluates these agentic components to determine which approaches offer the best balance of accuracy, cost, and reliability for production deployment.
We evaluated 12 methods across three tiers of sophistication:
- Tier 1 (Traditional): Four text-matching approaches that compare field names and descriptions without understanding meaning
- Tier 2 (Small AI): Five compact language models that can read instructions and reason about semantics
- Tier 3 (Large AI): Three frontier language models with the strongest reasoning capabilities
We tested these on two schemas (PDP and CDTL), each containing approximately 140 source fields that needed mapping to 521 possible target fields in LIF. All results were evaluated against human-verified ground truth.
-
AI language models dramatically outperform traditional methods. The best AI approach achieved 71% precision (correct proposals) compared to 44% for the best traditional method, a 61% relative improvement.
-
Small AI models match large ones at a fraction of the cost. The smallest model we tested (GPT OSS 120b at $0.005 per run) achieved 96% of the accuracy of the most expensive model (GPT-5.2 at $0.12 per run). The 24x cost difference yielded only a 4% accuracy gain.
-
"Batch mode" outperforms field-by-field processing. Processing all fields in a single AI call produces better results AND costs 20-50x less than processing each field individually.
-
Schema characteristics matter significantly. The CDTL schema proved harder than PDP (F0.5 of 0.62 vs. 0.74) because its vocabulary overlaps with the target schema, confusing the AI.
-
Non-Claude AI models fail on technical schemas. For CDTL, models from providers other than Anthropic hallucinated extensively, inventing target fields that do not exist. Only Claude Haiku and Claude Sonnet reliably work for CDTL.
Building on Experiment 1, we tested whether AI can generate the actual transformation expressions (JSONata code) for mappings that require data conversion.
Key Findings:
-
All models achieved strong results. Full pass rates ranged from 91% to 98%, exceeding all success thresholds.
-
Tier 2 models outperformed Tier 3. In a reversal of expectations, Gemini Flash 3 (98.2%) and Claude Haiku (96.4%) achieved higher accuracy than Claude Sonnet (94.6%) and GPT-5.2 (91.1%).
-
Custom function constraints eliminate hallucinations. When instructed to use only approved functions, models achieved near-perfect compliance (100% for all models except one GPT-5.2 instance).
-
Speed and cost trade-offs are significant. Gemini Flash is 8x cheaper but 2.5x slower than Claude Haiku. For interactive applications, Claude Haiku offers the best balance.
| Use Case | Recommended Model | Cost (56 transforms) | Time | Pass Rate |
|---|---|---|---|---|
| Batch processing | Gemini Flash 3 | $0.04 | 12 min | 98.2% |
| Interactive use | Claude Haiku 4.5 | $0.13 | 5 min | 96.4% |
Experiment 1 (Schema Mapping):
| Schema | Recommended Model | Cost per Run | Expected Accuracy |
|---|---|---|---|
| PDP | GPT OSS 120b (batch) | $0.005 | F0.5 = 0.71 |
| CDTL | Claude Haiku (batch) | $0.05 | F0.5 = 0.62 |
Experiment 2 (Transform Expressions):
| Use Case | Recommended Model | Cost per Transform | Pass Rate |
|---|---|---|---|
| Batch | Gemini Flash 3 | $0.0007 | 98.2% |
| Interactive | Claude Haiku 4.5 | $0.0023 | 96.4% |
Practical impact: For a 141-field schema, AI can process all fields in under a minute for less than a nickel, correctly identifying approximately 25 of 30 real mappings and correctly rejecting approximately 100 of 110 non-mappings. For the subset requiring transforms, AI can generate valid JSONata expressions at 96-98% accuracy for less than a penny each. This reduces human review time from an estimated 8+ hours to 1-2 hours, but does not eliminate the need for human verification.
The key insight for agentic AI deployment: The assumption that larger models are always better does not hold for these foundational agentic tasks. In both experiments, smaller models matched or exceeded larger ones at a fraction of the cost. When building agentic workflows, select component models based on empirical performance for the specific task, schema characteristics, and interactivity requirements, not on general capability benchmarks or marketing claims about model size.
Every database organizes information using a "schema," a blueprint that defines what fields exist and how they relate to each other. A student information system might have fields like student_id, first_name, date_of_birth, and enrollment_date. A different system storing similar information might call these fields Person.Identifier.identifier, Person.Name.firstName, Person.birthDate, and Enrollment.entryDate.
Schema mapping is the process of determining which fields in one system correspond to which fields in another. Some mappings are obvious: first_name clearly corresponds to Person.Name.firstName. Others require domain expertise: delivery_method corresponds to Course.instructionalDelivery, but only someone familiar with education data would recognize this relationship.
The Learner Information Framework (LIF) is emerging as a common standard for representing education data. Organizations using other schemas (such as the Postsecondary Data Partnership (PDP) schema or the Credential Transparency Description Language (CDTL)) need to map their data to LIF to participate in broader data-sharing initiatives.
Currently, this mapping is performed manually by domain experts who examine each source field, search through hundreds of possible target fields, and determine the best match. This process is:
- Time-consuming: A schema with 140 fields might require 8-16 hours of expert time
- Error-prone: Humans miss mappings, especially when tired or unfamiliar with edge cases
- Expensive: Domain experts are scarce and their time is valuable
- Non-scalable: Each new schema requires starting from scratch
Recent advances in artificial intelligence, particularly large language models (LLMs) like GPT and Claude, suggest these tools might assist with schema mapping. Unlike simple text matching, LLMs can:
- Understand that "delivery method" and "instructional delivery" refer to the same concept
- Recognize that
student_idrelates toPerson.Identifierbecause students are people - Determine when a source field has no valid target and should be marked "no match"
However, AI assistance comes with questions: How accurate is it? How expensive? Do larger, more sophisticated models perform meaningfully better than smaller, cheaper ones? This experiment sought to answer these questions rigorously.
Typical attempts to evaluate AI for schema mapping suffer from confounded variables. Multiple factors changed between runs (prompts, models, preprocessing steps), making it impossible to attribute improvements to specific changes. Was the improvement from the new model, or from the revised prompt, or from fixing a bug in preprocessing?
This experiment enforced strict variable isolation:
- When comparing models, prompts remained identical
- When comparing prompting strategies, models remained fixed
- All parameters (temperature, token limits, prompt versions) were frozen before execution
- Every run was logged with complete inputs and outputs for reproducibility
We designed these experiments around six hypotheses, stated here in plain language:
Experiment 1: Schema Mapping
| ID | Question | Plain Language Interpretation |
|---|---|---|
| H1 | Can traditional text-matching methods achieve reliable accuracy? | Do we even need AI, or can simpler methods work? |
| H2 | Do larger AI models produce better results than smaller ones? | Is it worth paying 10-50x more for a "better" model? |
| H3 | Can AI detect when data transformation is required? | Beyond finding pairs, can AI identify when values need conversion? |
| H4 | Does mapping-type classification improve with model size? | Can AI distinguish direct matches from splits and merges? |
Experiment 2: Transform Expression Generation
| ID | Question | Plain Language Interpretation |
|---|---|---|
| H5 | Can LLMs produce syntactically valid JSONata expressions? | Can AI write executable transformation code? |
| H6 | Can LLMs be constrained to use only approved functions? | Can we prevent AI from inventing non-existent functions? |
Success thresholds were defined before running experiments:
- H1: Traditional methods achieve MCC > 0.5 (a statistical measure indicating reliable classification)
- H2: Improvement from Tier 2→3 is smaller than Tier 1→2 (diminishing returns)
- H3: Transform detection accuracy differs significantly across tiers
- H4: Type classification accuracy increases with model tier
- H5: Syntax validity rate > 60% (minimum viability), > 90% (strong result)
- H6: Function hallucination rate < 10% (acceptable), < 1% (strong result)
Experiment 1 (Schema Mapping):
- Source schemas: PDP (141 fields) and CDTL (139 fields)
- Target schema: LIF (521 fields)
- Methods: 4 traditional text-matching methods, 8 AI language models
- Variants: Standard (field-by-field) and batch (all fields at once) processing
- Total experiment runs: 33 completed runs with full evaluation
Experiment 2 (Transform Expression Generation):
- Transform mappings: 56 total (12 PDP, 44 CDTL)
- Models: 4 AI language models (Claude Haiku, Gemini Flash, Claude Sonnet, GPT-5.2)
- Custom function library: 5 helper functions for common transformations
- Total experiment runs: 8 (4 models × 2 schemas)
- Mapping in the reverse direction (LIF to PDP/CDTL)
- Other education data schemas beyond PDP and CDTL
- Production deployment with real-time human-in-the-loop workflows
- Prompt optimization beyond our frozen prompts (v1.0 for mapping, v1.1 for transforms)
- Fine-tuning models specifically for schema mapping
- Execution testing of generated transform expressions against sample data
- Semantic validation of transform outputs (correct values, not just valid syntax)
-
Limited ground truth: Our Gold-B reference dataset contains 38 verified PDP mappings and 99 verified CDTL mappings. While carefully constructed, larger test sets would provide higher statistical confidence.
-
Single prompt version: We froze our prompt before experimentation. Different prompting strategies might yield different results; this is an avenue for future research.
-
Two schemas only: PDP and CDTL have distinct characteristics. Results may not generalize to all education data schemas.
-
Snapshot in time: AI model capabilities and costs change rapidly. Results represent January 2026 conditions.
-
Reference data quality: Some "correct" answers in our ground truth may themselves be debatable. Appendix E discusses this in detail.
We organized our methods into three tiers representing fundamentally different approaches to schema mapping:
These methods compare field names and descriptions using mathematical similarity measures. They do not "understand" meaning; they detect patterns in characters and words.
| ID | Method | How It Works | Cost |
|---|---|---|---|
| E-01 | Fuzzy Matching | Compares character sequences, like spell-check. "student_id" matches "studentId" because most letters are the same. | Free |
| E-02 | Static Embeddings | Converts text to numerical vectors (lists of numbers) and measures how "close" they are in mathematical space. | Free |
| E-03 | Sentence Transformer (Small) | A neural network with 22 million parameters that learns semantic relationships. | Free |
| E-04 | Sentence Transformer (Large) | A larger neural network (335 million parameters) representing the state of the art in text similarity. | Free |
How embedding methods work (simplified):
Imagine a library where books are arranged not alphabetically, but by topic. Books about cooking are near each other; books about astronomy are in a different area. Embedding methods work similarly: they place text in a mathematical "space" where similar concepts are close together.
When we embed "delivery method" and "instructional delivery," they end up near each other in this space because they share conceptual meaning, even though they share few characters.
These models can read instructions, reason about meaning, and produce structured output. They are hosted in the cloud and charge per use.
| ID | Model | Provider | Parameters | Cost per Run |
|---|---|---|---|---|
| L-01 | GPT OSS 120b | OpenAI (Open Source) | ~5B active | $0.005 - $0.15 |
| L-02 | Llama 4 Maverick | Meta (via Groq) | 17B×128 experts | $0.006 - $0.34 |
| L-03 | Claude Haiku 4.5 | Anthropic | ~20B | $0.05 - $2.25 |
| L-04 | GPT-5-mini | OpenAI | ~25B | $0.01 - $0.14 |
| L-05 | Gemini Flash 3 | ~30B | $0.02 - $0.96 |
Cost ranges reflect batch mode (lower) vs. standard mode (higher).
The most capable (and most expensive) models available.
| ID | Model | Provider | Cost per Run |
|---|---|---|---|
| L-06 | Claude Sonnet 4.5 | Anthropic | $0.17 - $6.63 |
| L-07 | Gemini Pro 3 | $0.09 - $0.96 | |
| L-08 | GPT-5.2 | OpenAI | $0.12 - $0.83 |
For AI models, we tested two approaches:
Standard Mode (Field-by-Field):
- Send each source field to the AI separately
- AI returns one mapping decision per call
- 141 API calls for PDP schema
- Higher cost, but provides detailed reasoning for each field
Batch Mode (All at Once):
- Send all source fields in a single API call
- AI returns all mapping decisions at once
- 1 API call for entire schema
- Much lower cost; AI sees full context
PDP (Postsecondary Data Partnership):
- 141 source fields
- Relatively straightforward field names (
student_id,first_name,course_begin_date) - 21% of fields have valid mappings; 79% should be "no match"
CDTL (Credential Transparency Description Language):
- 139 source fields
- Technical vocabulary (
Competency.performanceLevelType,CredentialOrganization.agentType) - 51% of fields have valid mappings; 49% should be "no match"
LIF (Learner Information Framework):
- 521 target fields organized hierarchically
- Examples:
Person.Name.firstName,Course.instructionalDelivery,Credential.level
For each source field, the method must either:
- Select the correct target from 521 options, OR
- Correctly determine that no valid target exists
This is inherently difficult. Even for fields with valid mappings, the correct answer is 1 in 521 (0.2%). Random guessing would achieve near-zero accuracy.
We used three reference datasets:
Gold (Original):
- Manual mappings from human expert review
- PDP: 29 verified mappings; CDTL: 66 verified mappings
- 100% human-verified but limited coverage
Silver (AI-Generated):
- Generated by Claude Opus 4.5 (the most capable model, not used in experiments)
- PDP: 31 mappings; CDTL: 102 mappings
- 30% human-verified
Gold-B (Enhanced):
- Gold plus human expert corrections and additions from Silver verification
- PDP: 38 verified mappings (12 transforms); CDTL: 99 verified mappings (44 transforms)
- Our primary reference for final evaluation
Why we needed Silver:
The original Gold dataset was small (29-66 mappings) because manual verification is time-consuming. We used AI to propose additional mappings, then had a human expert verify a sample. Where Silver agreed with Gold (95.7% for PDP, 61.2% for CDTL), we gained confidence. Where they disagreed, the human expert adjudicated to create Gold-B.
We report several metrics, each capturing a different aspect of performance. Here we explain what they mean in plain language.
Precision: "When the AI proposes a mapping, how often is it correct?"
A precision of 70% means: of every 10 mappings the AI proposes, 7 are correct and 3 are wrong.
Why precision matters: Wrong mappings that look plausible can slip through human review and cause data errors downstream. We want to minimize these false positives.
Recall: "Of all the real mappings that exist, how many did the AI find?"
A recall of 80% means: if there are 30 real mappings, the AI found 24 and missed 6.
Why recall matters: Missed mappings mean data doesn't transfer correctly between systems. However, missed mappings are usually obvious (a field with no mapping gets flagged for human review), so they're less dangerous than wrong mappings.
F0.5: "A combined score that weights precision more heavily than recall."
Why F0.5: Standard F1 weights precision and recall equally. Since we believe wrong mappings (precision failures) are more costly than missed mappings (recall failures), we use F0.5 which weights precision twice as heavily.
Hallucination Rate: "How often does the AI invent fields that don't exist?"
A hallucination rate of 5% means: 5 of every 100 predictions reference a target field that isn't in the LIF schema at all. This indicates the AI is "making things up" rather than selecting from valid options.
Consider a method evaluated on PDP (30 real mappings among 141 fields):
| Scenario | Precision | Recall | F0.5 | Interpretation |
|---|---|---|---|---|
| Perfect | 100% | 100% | 1.00 | Every proposal correct, every mapping found |
| Our best result | 71% | 83% | 0.74 | Good but not perfect; needs human review |
| Conservative model | 90% | 40% | 0.73 | Rarely wrong, but misses many mappings |
| Aggressive model | 40% | 95% | 0.46 | Finds most mappings, but many wrong proposals |
| Random guessing | ~6% | ~20% | 0.08 | Baseline for comparison |
Matthews Correlation Coefficient (MCC): A single number from -1 to +1 that captures overall classification quality, accounting for class imbalance.
- MCC = +1: Perfect predictions
- MCC = 0: No better than random guessing
- MCC = -1: Perfectly wrong
Why MCC matters for this task: PDP has only 21% positive mappings (30/141). A method that always predicts "no match" would achieve 79% accuracy but 0% recall and MCC near 0. MCC correctly identifies such degenerate strategies.
Our threshold: MCC > 0.5 indicates practically useful performance; MCC > 0.3 indicates the method is learning meaningful patterns.
All AI runs used identical settings:
- Temperature: 0.0 (no randomness; deterministic outputs)
- Maximum tokens: 500 (sufficient for structured response)
- Prompt version: v1.0 (frozen before execution)
For "reasoning" models (GPT-5 series, GPT OSS, Gemini 3):
- Reasoning effort: "low" (minimal internal reasoning to reduce cost and variability)
The prompt instructed the AI to:
- Examine the source field and its description
- Search the provided list of 521 LIF target fields
- Either propose a match or indicate "no match"
- Classify the mapping type (CLEAN_MATCH, SPLIT, MERGE, or NO_MATCH)
- Indicate if data transformation would be required
- Provide brief reasoning
The complete prompt is documented in Appendix A.
Embedding methods produce similarity scores (e.g., 0.78). We tested multiple thresholds:
- If score > threshold: propose the match
- If score ≤ threshold: predict "no match"
We tested thresholds from 0.5 to 0.95 and report results at the optimal threshold for each method (typically 0.7 for embeddings, 0.9+ for fuzzy matching).
Every run was logged to a SQLite database with:
- Model identifier and configuration
- Complete input/output for each prediction
- Token counts and costs
- Evaluation metrics against all reference datasets
Detailed predictions were also saved as JSON files for manual inspection.
Full technical protocol: see Appendix A
Experiment 1 identified which fields require data transformation. Experiment 2 tested whether AI can generate the actual transformation code (JSONata expressions) for those mappings.
We used the 56 mappings from Gold-B that were flagged as requiring transforms:
- PDP: 12 transforms (e.g., date format conversion, country code lookup)
- CDTL: 44 transforms (e.g., credential level mapping, delivery method translation)
| ID | Model | Tier | Rationale |
|---|---|---|---|
| L-03 | Claude Haiku 4.5 | 2 | Best Exp1 performer for CDTL |
| L-05 | Gemini Flash 3 | 2 | Lowest cost option |
| L-06 | Claude Sonnet 4.5 | 3 | Strong transform detection in Exp1 |
| L-08 | GPT-5.2 | 3 | Best overall Exp1 accuracy |
To prevent models from generating huge lookup tables or inventing non-existent functions, we defined 5 helper functions representing common transformation patterns:
| Function | Purpose | Example |
|---|---|---|
$countryToISO(name) |
Country name → ISO code |
'United States' → 'US'
|
$normalizeDate(str) |
Date string → ISO8601 |
'01/15/2024' → '2024-01-15'
|
$genderToText(code) |
Gender code → text |
'M' → 'Male'
|
$ethnicityToText(code) |
Ethnicity code → text |
'H' → 'Hispanic or Latino'
|
$deliveryMethodToText(code) |
Delivery code → text |
'O' → 'Online'
|
The prompt explicitly instructed models to use only built-in JSONata functions or these custom functions.
Each generated expression was validated in two stages:
-
Syntax Validation: Parse the expression using the JSONata library. Expressions that fail parsing are marked as syntax errors.
-
Function Validation: Extract all function calls and verify each is either a built-in JSONata function or one of the 5 approved custom functions. Expressions calling unknown functions are marked as hallucinations.
Full pass requires both syntax validity and function validity.
Full technical protocol: see Appendix I
Traditional methods achieved moderate success but fell short of our reliability threshold. The best method (Sentence Transformer B) found 69% of real mappings but was only correct 44% of the time when proposing a match.
PDP Schema (38 verified mappings, 141 total fields):
| Method | Precision | Recall | F0.5 | MCC | Correct Proposals | Mappings Found |
|---|---|---|---|---|---|---|
| Sentence Transformer B (E-04) | 44% | 69% | 0.47 | 0.43 | 20 of 46 | 20 of 30 |
| Sentence Transformer A (E-03) | 46% | 21% | 0.37 | 0.22 | 6 of 13 | 6 of 30 |
| Fuzzy Matching (E-01) | 42% | 17% | 0.33 | 0.18 | 5 of 12 | 5 of 30 |
| Static Embeddings (E-02) | 24% | 38% | 0.26 | 0.16 | 11 of 46 | 11 of 30 |
The best method (Sentence Transformer B) achieved:
- 20 correct matches out of 46 proposals (44% precision)
- 20 of 30 real mappings found (69% recall)
- 26 incorrect proposals that would need human rejection
Hypothesis H1 verdict: ❌ Failed
Our threshold for "reliable accuracy" was MCC > 0.5. The best Tier 1 method achieved MCC = 0.43 on PDP. While this indicates the methods are learning patterns (better than random), they cannot reliably distinguish mappable from non-mappable fields.
On the CDTL schema, results were better (MCC = 0.63, exceeding our threshold), but CDTL has more favorable characteristics (higher positive rate, cleaner field names).
Success Case: course_begin_date → Course.courseBeginDate
All four methods correctly identified this mapping. The field names are nearly identical, so even simple character matching works.
Failure Case: student_id → should map to Person.Identifier.identifier
| Method | What It Proposed | Result |
|---|---|---|
| Fuzzy Matching | Person.Identifier.identifier |
✅ Correct |
| Static Embedding | Assessment.awardedByRefOrganization.identifier |
❌ Wrong |
| Sentence Transformer A | (below threshold) | ❌ Missed |
| Sentence Transformer B | (below threshold) | ❌ Missed |
The semantic models got confused by dozens of fields containing "identifier" and couldn't decide which was correct. Ironically, the "dumb" fuzzy matcher succeeded because student_id shares character patterns with Person.Identifier.identifier.
The Over-Matching Problem:
Traditional methods confidently proposed these incorrect mappings:
| Source Field | Model Proposed | Score | Truth |
|---|---|---|---|
time_to_credential |
Credential.Requirements.minimumDuration |
0.78 | NO MATCH |
course_type |
Course.courseLevelType |
0.78 | NO MATCH |
semester_session_gpa |
Course.courseGPAApplicability |
0.78 | NO MATCH |
These fields genuinely have no LIF equivalent, but the models found "close enough" targets and proposed them anyway. This over-matching behavior creates work for human reviewers who must reject plausible-looking but incorrect suggestions.
Detailed Tier 1 analysis: see Appendix B
AI language models dramatically outperformed traditional methods. The best AI approach achieved 71% precision compared to 44% for Tier 1, a 61% relative improvement. Surprisingly, small models nearly matched large models, confirming diminishing returns at scale.
Figure 1: The improvement from Tier 1 to Tier 2 (+51%) dwarfs the improvement from Tier 2 to Tier 3 (+4%).
| Model | Tier | Precision | Recall | F0.5 | MCC | Cost |
|---|---|---|---|---|---|---|
| GPT-5.2 | 3 | 71% | 83% | 0.74 | 0.76 | $0.12 |
| GPT OSS 120b | 2 | 70% | 77% | 0.71 | 0.73 | $0.005 |
| Claude Sonnet 4.5 | 3 | 61% | 77% | 0.63 | 0.68 | $0.14 |
| Llama 4 Maverick | 2 | 60% | 70% | 0.62 | 0.66 | $0.006 |
| Claude Haiku 4.5 | 2 | 56% | 80% | 0.59 | 0.68 | $0.05 |
| Metric | Tier 1 Best | Tier 2 Best | Tier 3 Best | T1→T2 Gain | T2→T3 Gain |
|---|---|---|---|---|---|
| F0.5 | 0.47 | 0.71 | 0.74 | +51% | +4% |
| Precision | 44% | 70% | 71% | +59% | +1% |
| MCC | 0.43 | 0.73 | 0.76 | +70% | +4% |
Hypothesis H2 verdict: ✅ Confirmed
The improvement from Tier 1 to Tier 2 is massive (+51% F0.5). The improvement from Tier 2 to Tier 3 is modest (+4% F0.5). Larger models provide diminishing returns for this task.
The smallest, cheapest model (GPT OSS 120b at $0.005) achieved 96% of the best model's accuracy (GPT-5.2 at $0.12):
| Model | F0.5 | Cost | % of Best Accuracy | Cost Relative |
|---|---|---|---|---|
| GPT-5.2 | 0.74 | $0.12 | 100% | 24x |
| GPT OSS 120b | 0.71 | $0.005 | 96% | 1x |
Practical implication: For most use cases, the cheapest model is the best choice. The 24x cost premium for GPT-5.2 yields only 4% accuracy improvement.
Figure 2: The Pareto frontier shows GPT OSS 120b (best value) and GPT-5.2 (best accuracy) as the only non-dominated options.
Figure 3: LLMs achieve a better precision-recall balance than Tier 1 methods (shown as square marker).
Detailed LLM analysis: see Appendix C
Processing all fields in a single AI call ("batch mode") produces better results AND costs 20-50x less than processing each field individually ("standard mode").
PDP Schema (Gold-B Reference):
| Model | Standard F0.5 | Standard Cost | Batch F0.5 | Batch Cost | Improvement | Savings |
|---|---|---|---|---|---|---|
| GPT-5.2 | 0.66 | $0.83 | 0.74 | $0.12 | +12% | 86% |
| Claude Sonnet | 0.52 | $6.63 | 0.62 | $0.17 | +19% | 97% |
| Claude Haiku | 0.56 | $2.25 | 0.59 | $0.05 | +5% | 98% |
-
Context awareness: When the AI sees all fields together, it can maintain consistency. If it maps
first_nametoPerson.Name.firstName, it's more likely to correctly maplast_nametoPerson.Name.lastNamerather than proposing an inconsistent target. -
Reduced prompt overhead: In standard mode, the full LIF schema (521 fields) is sent with every request. In batch mode, it's sent once.
-
Global reasoning: The AI can recognize patterns across fields, such as address components that should all map to the same parent entity.
Figure 4: Batch mode achieves better accuracy (left) at dramatically lower cost (right) for all tested models.
Detailed batch analysis: see Appendix D
CDTL proved significantly harder than PDP. The same methods that achieved F0.5 = 0.74 on PDP only reached F0.5 = 0.62 on CDTL, a 16% drop.
| Characteristic | PDP | CDTL |
|---|---|---|
| Best F0.5 | 0.74 | 0.62 |
| Best Precision | 71% | 60% |
| Positive Rate | 21% (30/141) | 51% (71/139) |
| Average Hallucinations | 1.5 per run | 24 per run |
Vocabulary Overlap: CDTL uses terms like "Competency," "Credential," "Assessment," and "Course" extensively. These same terms appear throughout the LIF target schema. The AI finds many "close matches" that aren't quite right:
| CDTL Field | AI Proposed | Why It's Wrong |
|---|---|---|
Credential.audienceLevelType |
Credential.level |
"type of level" ≠ "level" |
Course.audienceLevelType |
Course.courseLevelType |
Same semantic confusion |
IdentifierValue.identifier |
Person.Identifier.identifier |
Nothing indicates "person" |
Higher Stakes: With 51% positive rate (vs. 21% for PDP), there are more mappings to find but also more opportunities for false positives.
For PDP: All models work reliably. Choose based on cost.
For CDTL: Only Claude models (Haiku and Sonnet) reliably work. Other models hallucinate extensively:
| Model | CDTL Batch Hallucinations | Reliability |
|---|---|---|
| Claude Sonnet 4.5 | 0 | ✅ Reliable |
| Claude Haiku 4.5 | 3 | ✅ Reliable |
| GPT-5.2 | 2 | ✅ Reliable |
| Gemini Pro 3 | 0 | ✅ Reliable |
| Gemini Flash 3 | 0 | ✅ Reliable |
| GPT OSS 120b | 74 (53%) | |
| Llama 4 Maverick | 62 (45%) | |
| GPT-5-mini | 28 + 247 FP | ❌ Unusable |
Practical implication: For CDTL or similar technical schemas, use Claude Haiku batch ($0.05 per run). The cheapest models that work for PDP fail catastrophically on CDTL.
Figure 5: CDTL is harder than PDP (left), and some models hallucinate extensively on CDTL (right). GPT OSS 120b's 74 hallucinations represent a 53% failure rate.
Silver quality analysis: see Appendix E
All AI models excel at classifying mapping types (97%+ accuracy). Transform detection varies significantly by model, with GPT-5.2 showing a critical failure mode in batch processing.
Question: Can AI correctly identify whether a mapping is a direct match, split, merge, or "no match"?
| Model | PDP Type Accuracy | CDTL Type Accuracy |
|---|---|---|
| GPT-5.2 | 90% | 99% |
| Claude Sonnet 4.5 | 89% | 98% |
| Claude Haiku 4.5 | 89% | 99% |
| GPT-5-mini | 87% | 100% |
| GPT OSS 120b | 91% | 98% |
Hypothesis H4 verdict: ❌ Refuted
We hypothesized that type classification accuracy would increase with model size. Instead, even the smallest models achieve near-perfect accuracy (87%+). Type classification is not a reasoning-intensive task; pattern matching suffices.
Question: Can AI correctly identify when data values need conversion (e.g., date format changes, code translations)?
Gold-B reference identifies 12 PDP fields (8.5%) and 44 CDTL fields (31.7%) requiring transforms. Detection accuracy varies by model:
| Model | Mode | PDP Accuracy | CDTL Accuracy | Notes |
|---|---|---|---|---|
| Gemini Flash 3 | Batch | 100% | 86% | Perfect on PDP |
| Claude Sonnet 4.5 | Batch | 99% | 86% | Excellent, no false positives |
| Claude Haiku 4.5 | Batch | 97% | 84% | Strong performance |
| GPT-5.2 | Standard | 95% | 83% | Works in standard mode |
| GPT-5.2 | Batch | 43% | 75% | Severe over-prediction |
GPT-5.2 has a critical issue in batch mode: it over-predicts transforms dramatically on PDP, achieving only 43% accuracy with 11% precision (predicting transforms for nearly all fields when only 8.5% require them).
| Metric | Standard Mode | Batch Mode |
|---|---|---|
| PDP Transform accuracy | 95% | 43% |
| PDP Transform precision | 59% | 11% |
This issue is specific to batch mode and does not affect field pairing accuracy. For transform detection, use Gemini Flash 3 (perfect on PDP) or Claude models (strong on both schemas).
Hypothesis H3 verdict:
Transform detection works across all tiers (>85% accuracy) but has model-specific failure modes. No clear tier advantage emerged.
Detailed transform analysis: see Appendix F
Running the same experiment multiple times produces nearly identical results. Single runs are representative; multiple runs are unnecessary.
We ran GPT OSS 120b six times and Llama 4 Maverick four times under identical conditions:
| Model | Runs | Precision CV | Recall CV | F0.5 CV | MCC CV |
|---|---|---|---|---|---|
| GPT OSS 120b | 6 | 2.2% | 0.0% | 1.9% | 1.5% |
| Llama 4 Maverick | 4 | 4.9% | 2.4% | 4.7% | 4.5% |
CV (Coefficient of Variation) < 5% for all metrics indicates excellent reproducibility.
-
True positive counts are remarkably stable: GPT OSS 120b found exactly 23 correct mappings in all 6 runs.
-
Variance comes from borderline cases: The small differences between runs reflect uncertainty on fields where the model is genuinely uncertain, not random noise.
-
Temperature 0.0 ensures determinism: With no randomness in the model, outputs are highly reproducible.
Practical implication: You don't need to run experiments multiple times and average results. A single run is representative of model performance.
Variance analysis: see Appendix G
All four models achieved strong results generating valid JSONata expressions, with full pass rates ranging from 91% to 98%. In a reversal of Experiment 1 findings, Tier 2 models outperformed Tier 3 models on this task.
| Model | Tier | Syntax Valid | Full Pass | Total Cost | Total Time |
|---|---|---|---|---|---|
| Gemini Flash 3 | 2 | 98.2% | 98.2% | $0.039 | 12.1 min |
| Claude Haiku 4.5 | 2 | 96.4% | 96.4% | $0.128 | 4.9 min |
| Claude Sonnet 4.5 | 3 | 94.6% | 94.6% | $0.322 | 6.6 min |
| GPT-5.2 | 3 | 92.9% | 91.1% | $0.282 | 5.4 min |
All models exceeded the "strong result" threshold of 90% full pass rate.
Figure 7: Speed vs accuracy trade-off for transform generation. Bubble size represents total cost. Gemini Flash 3 achieves highest accuracy at lowest cost but is slowest. Claude Haiku 4.5 offers the best balance of speed and accuracy.
Figure 8: Cost vs accuracy for transform generation. Tier 2 models (blue) outperform Tier 3 models (red) while costing significantly less.
H5: Can LLMs produce syntactically valid JSONata?
Result: YES (91-98%)
All models achieved syntax validity rates above 90%. Gemini Flash 3 achieved 100% on the harder CDTL schema. Failures were primarily due to special characters in field names (e.g., @ symbols, spaces) that require quoting in JSONata.
H6: Can LLMs be constrained to use only approved functions?
Result: YES (97.5-100%)
Function hallucination was nearly eliminated. GPT-5.2 was the only model to produce a hallucinated function (1 instance), achieving 97.5% function validity. All other models achieved 100%.
Unlike Experiment 1 (where Tier 3 slightly outperformed Tier 2), transform generation showed the opposite pattern:
| Comparison | Tier 2 Best | Tier 3 Best | Winner |
|---|---|---|---|
| Full pass rate | 98.2% (Gemini Flash) | 94.6% (Claude Sonnet) | Tier 2 |
| Cost | $0.039 | $0.282 | Tier 2 (7x cheaper) |
| Function hallucinations | 0 | 1 | Tier 2 |
Interpretation: Transform expression generation is a more constrained task than open-ended schema mapping. The additional reasoning capacity of Tier 3 models provides no benefit and may introduce overthinking that leads to more complex (and error-prone) solutions.
| Model | Avg Time/Transform | Cost/Transform | Best For |
|---|---|---|---|
| GPT-5.2 | 5.8s | $0.0050 | — |
| Claude Haiku 4.5 | 5.2s | $0.0023 | Interactive use |
| Claude Sonnet 4.5 | 7.1s | $0.0058 | — |
| Gemini Flash 3 | 12.9s | $0.0007 | Batch processing |
Gemini Flash 3 is 8x cheaper than Claude Sonnet but 2.5x slower. For interactive applications requiring sub-6-second response times, Claude Haiku offers the best balance of speed, accuracy, and cost.
CDTL proved harder than PDP for most models due to special characters in field names:
| Model | PDP Pass | CDTL Pass | Difference |
|---|---|---|---|
| Claude Haiku 4.5 | 100% | 95.5% | -4.5% |
| Gemini Flash 3 | 91.7% | 100% | +8.3% |
| Claude Sonnet 4.5 | 100% | 93.2% | -6.8% |
| GPT-5.2 | 100% | 88.6% | -11.4% |
Gemini Flash 3 uniquely performed better on CDTL than PDP.
The 9 total failures across all models fell into two categories:
1. Special Characters in Field Names (6 failures)
Fields like Credential.@type require backtick quoting in JSONata:
# Wrong (generated)
Credential.@type
# Correct
`Credential.@type`
2. Complex Nested Transforms (3 failures)
The Competency.educationLevelType mapping requires iterating over nested arrays, which challenged multiple models.
Recommendation: Adding prompt guidance for field name quoting would likely eliminate most syntax errors.
When expressions succeeded, they were high quality:
Simple transforms used custom functions appropriately:
country ? $countryToISO(country) : null
Enum mappings used built-in $lookup:
$lookup({
'C1': 'Certificate',
'A': 'Associate',
'B': 'Bachelor'
}, degree_type_sought) ?? 'Unknown'
Date parsing combined substring extraction with normalization:
$normalizeDate($substring(date_of_birth, 0, 4) & '-' &
$substring(date_of_birth, 4, 2) & '-' &
$substring(date_of_birth, 6, 2))
Detailed transform analysis: see Appendix I
Experiment 1: Schema Mapping
| Hypothesis | Plain Language | Result | Key Evidence |
|---|---|---|---|
| H1 | Can traditional methods achieve reliable accuracy? | ❌ Failed | Best MCC = 0.43 (threshold: 0.50) |
| H2 | Do larger AI models give better results? | ✅ Confirmed (diminishing returns) | T1→T2: +51%; T2→T3: +4% |
| H3 | Can AI detect when transformation is needed? | >85% accuracy, but GPT-5.2 batch fails | |
| H4 | Does type classification scale with model size? | ❌ Refuted | Even smallest models achieve 87%+ |
Experiment 2: Transform Expression Generation
| Hypothesis | Plain Language | Result | Key Evidence |
|---|---|---|---|
| H5 | Can LLMs produce syntactically valid JSONata? | ✅ Confirmed | 91-98% syntax validity across all models |
| H6 | Can LLMs be constrained to approved functions? | ✅ Confirmed | 97.5-100% function compliance; 1 hallucination total |
Traditional methods are insufficient. Despite decades of research on text similarity, these approaches cannot reliably map education data schemas. The fundamental problem is that schema mapping requires understanding meaning, not just comparing strings.
AI language models represent a genuine capability leap. The 51% improvement from Tier 1 to Tier 2 is not incremental; it is transformational. These models can reason about concepts like "students are people" and "delivery method relates to instructional delivery."
Bigger is not meaningfully better. The 4% improvement from Tier 2 to Tier 3 does not justify 5-50x higher costs. For schema mapping, model size has reached a point of diminishing returns.
Task complexity varies. Type classification is easy (87%+ for all models); field pairing is moderate (71% precision for best model); transform detection is model-dependent.
All methods, both traditional and AI, share a common failure mode: proposing mappings for fields that should have no match.
Consider a field like semester_session_gpa in PDP. There is no corresponding field in LIF; it should be marked "no match." However:
- Embedding methods find
Course.courseGPAApplicability(similarity: 0.77) and propose it - AI models reason that "GPA" relates to "GPA" and propose a mapping
Both are wrong. courseGPAApplicability describes whether a course counts toward GPA calculation, not a student's actual GPA value. But the connection is plausible enough to fool automated methods.
This over-matching behavior means:
-
AI cannot fully automate schema mapping. Human experts must review proposals.
-
AI changes the human task. Instead of searching 521 options, humans verify ~30 proposals. This is faster but still requires expertise.
-
False positives are the primary failure mode. Our emphasis on precision (via F0.5) was appropriate; wrong mappings that look plausible can slip through review.
For PDP with our best model (GPT-5.2 batch):
| Category | Count | Human Action Required |
|---|---|---|
| Correct proposals | 25 | Verify and approve |
| Wrong proposals | 10 | Reject |
| Correct "no match" | ~100 | Spot-check sample |
| Missed mappings | 5 | Search manually |
Estimated time savings: Manual mapping of 141 fields might take 8-16 hours. AI-assisted review of 35 proposals plus spot-checking might take 1-3 hours, a 70-80% reduction but not elimination.
Some models offer strictly better cost-accuracy tradeoffs than others:
| Configuration | F0.5 | Cost | Dominated By |
|---|---|---|---|
| GPT OSS 120b batch | 0.71 | $0.005 | None (best value) |
| GPT-5.2 batch | 0.74 | $0.12 | None (best accuracy) |
| Claude Haiku batch | 0.59 | $0.05 | GPT OSS 120b |
| Claude Sonnet standard | 0.52 | $6.63 | Everything |
Pareto-optimal choices: GPT OSS 120b batch (best value) and GPT-5.2 batch (best accuracy). All other options are dominated: they cost more for equal or worse performance.
Exception for CDTL: The cost analysis changes because GPT OSS 120b fails on CDTL (53% hallucination rate). For CDTL, Claude Haiku batch becomes the Pareto-optimal choice.
| Scenario | Recommended Model | Rationale |
|---|---|---|
| Cost-constrained, PDP-like schema | GPT OSS 120b batch | $0.005, 96% of best accuracy |
| Maximum accuracy required, any schema | GPT-5.2 batch | $0.12, best overall performance |
| Technical schema (CDTL-like) | Claude Haiku batch | $0.05, only reliable option |
| Transform detection critical | Claude models (any) | GPT-5.2 has batch mode bug |
Our Gold-B reference dataset contains 38 PDP mappings (12 requiring transforms) and 99 CDTL mappings (44 requiring transforms). While carefully constructed from human expert manual verification plus AI-assisted expansion, this remains a moderate sample. Statistical confidence is limited:
- Observed precision of 71% has a 95% confidence interval of approximately ±12%
- Differences smaller than ~10% may not be statistically significant
Larger ground truth datasets would enable more precise conclusions.
We froze our prompt (v1.0) before execution to enable controlled comparison. However:
- Alternative prompt designs might yield different results
- Prompt optimization is a known technique for improving LLM performance
- Our results represent a lower bound on achievable accuracy
PDP and CDTL have distinct characteristics:
- PDP: Flat structure, simple names, 21% positive rate
- CDTL: Nested structure, technical vocabulary, 51% positive rate
Results may not generalize to schemas with different characteristics (e.g., very flat schemas, non-English field names, extremely high or low positive rates).
AI capabilities and costs change rapidly:
- Model performance improves through updates
- Pricing typically decreases over time
- New models emerge regularly
Our results represent January 2026 conditions. The specific cost figures will become outdated; the relative rankings may shift.
Even Gold-B contains judgment calls:
- Some mappings are debatable (e.g., does
countrymap tocountryCodeorresidenceCountry?) - The silver quality analysis (Appendix E) documents cases where AI and human experts disagreed
We cannot claim our ground truth is definitively correct, only that it represents careful expert judgment.
Schema mapping has been studied extensively in database research. Traditional approaches include:
- Linguistic matchers: Compare names and descriptions (similar to our Tier 1)
- Structure matchers: Use schema structure (parent-child relationships)
- Instance matchers: Compare actual data values
- Hybrid approaches: Combine multiple techniques
Our Tier 1 results are consistent with this literature: linguistic matching alone achieves moderate precision (30-50%) but struggles with semantic relationships.
Recent work has applied LLMs to data integration tasks:
- Entity matching (determining if two records refer to the same entity)
- Schema inference (determining data types and relationships)
- Data transformation (generating code to convert formats)
Our results contribute evidence that LLMs can assist with schema mapping specifically, with the important finding that smaller models suffice.
This study contributes:
-
Rigorous comparison across tiers: Traditional methods vs. small LLMs vs. large LLMs with controlled variables
-
Batch mode discovery: The finding that single-call batch processing outperforms field-by-field approaches appears to be novel
-
Schema-specific reliability: The finding that certain models fail specifically on technical vocabularies (CDTL) has practical implications
-
Cost-accuracy tradeoffs: Concrete data showing diminishing returns from model size, with specific recommendations
Experiment 2 produced a counterintuitive result: smaller models generated better transform expressions than larger ones. We propose three explanations:
1. Task Constraint Reduces Complexity Advantage
Schema mapping is an open-ended search problem (521 possible targets). Transform generation is a constrained code-writing problem (limited function vocabulary, clear input/output types). The additional reasoning capacity of Tier 3 models provides no benefit for constrained tasks.
2. Overthinking Leads to Errors
Larger models may generate more elaborate solutions when simpler ones suffice. For example, GPT-5.2 sometimes produced complex nested conditionals when a simple $lookup would work, increasing the chance of syntax errors.
3. Training Data Alignment
Tier 2 models may have encountered more JSONata examples during training, or their training emphasized code correctness over sophistication.
The near-elimination of function hallucinations (1 instance across 224 generations) validates the custom function approach:
- Without constraints: Models invent functions like
$countryNameToCode()that do not exist - With constraints: Models reliably use only approved functions
This suggests a general principle: when deploying LLMs for code generation, define a limited function vocabulary and explicitly enumerate allowed operations in the prompt.
Experiment 2 results are promising but not production-ready:
| Aspect | Status | Gap |
|---|---|---|
| Syntax validity | ✅ 91-98% | Prompt improvements could reach 99%+ |
| Function compliance | ✅ 97.5-100% | Acceptable |
| Execution correctness | ❓ Not tested | Requires sample data validation |
| Edge case handling | ❓ Not tested | Null values, malformed input |
The next validation step is execution testing against sample data to verify that syntactically valid expressions produce semantically correct output.
Based on our findings, we recommend the following configurations for production use:
Experiment 1: Schema Mapping
| Schema Type | Model | Mode | Cost | Expected F0.5 | Notes |
|---|---|---|---|---|---|
| PDP-like (simple names) | GPT OSS 120b | Batch | $0.005 | 0.71 | Best value |
| PDP-like (max accuracy) | GPT-5.2 | Batch | $0.12 | 0.74 | +4% accuracy for 24x cost |
| CDTL-like (technical vocabulary) | Claude Haiku 4.5 | Batch | $0.05 | 0.62 | Only reliable option |
| Any schema (transform detection critical) | Claude Haiku/Sonnet | Batch | $0.05-0.17 | 0.59-0.62 | GPT-5.2 has batch transform bug |
Experiment 2: Transform Expression Generation
| Use Case | Model | Cost/Transform | Pass Rate | Notes |
|---|---|---|---|---|
| Batch processing | Gemini Flash 3 | $0.0007 | 98.2% | Cheapest, but 2.5x slower |
| Interactive use | Claude Haiku 4.5 | $0.0023 | 96.4% | Best speed/accuracy balance |
| Avoid | Claude Sonnet, GPT-5.2 | $0.005-0.006 | 91-95% | Higher cost, lower accuracy |
Figure 6: The recommended human-in-the-loop workflow for AI-assisted schema mapping.
Before deployment:
- Identify schema characteristics (simple vs. technical vocabulary)
- Select appropriate model based on schema type
- Test on a sample of fields to verify model works for your specific schema
- Establish human review process and assign reviewers
During processing:
- Use batch mode (all fields in single call)
- Set temperature to 0.0 for reproducibility
- Capture full AI output for audit trail
- Log costs and performance metrics
After initial processing:
- Identify low-confidence predictions (confidence < 4)
- Re-run low-confidence mappings with focused prompting (provide only top candidate targets, request detailed reasoning)
- If confidence remains low after retry, flag for mandatory human review
Human review:
- Sort remaining predictions by confidence score (low to high)
- Human reviewer verifies all low-confidence mappings
- Human reviewer spot-checks sample of high-confidence mappings
- Human reviewer spot-checks sample of "no match" predictions
- Human reviewer searches for potentially missed mappings among low-confidence "no match" results
- Document any corrections for future reference
Transform expression generation (after mapping complete):
- Define custom function library for common transformation patterns
- Include explicit function constraints in prompt
- Add prompt guidance for quoting field names with special characters
- Validate syntax of all generated expressions
- Check for function hallucinations (calls to undefined functions)
- Test expressions against sample data before production use
- Human review of complex expressions (nested logic, multiple conditionals)
1. Retrieval-Augmented Pipeline
Instead of providing all 521 target fields to the LLM, first use embeddings to retrieve the top 10 candidates, then ask the LLM to select among them (or "none of the above").
Rationale: This combines the broad coverage of embedding methods with the semantic precision of LLMs. It may reduce false positives by giving the LLM an explicit "none" option with a manageable candidate set.
Expected outcome: Precision improvement from ~70% to ~85%+ while maintaining recall.
2. Multi-Pass Refinement Pipeline
Develop a systematic multi-step approach where low-confidence predictions trigger a second LLM call with enhanced context: narrowed candidate list, explicit comparison of top options, and chain-of-thought reasoning.
Rationale: Low confidence often indicates genuine ambiguity between similar targets. A focused follow-up call that presents only the top 3-5 candidates and asks the model to explicitly compare them may resolve uncertainty more effectively than immediate escalation to human review.
Expected outcome: Reduction in low-confidence predictions requiring human review; improved accuracy on borderline cases.
3. Confidence Threshold Calibration
Systematically evaluate the relationship between AI-reported confidence scores and prediction accuracy to identify optimal thresholds for retry logic and human review routing.
Rationale: The current implementation captures confidence scores (1-5 scale) for each prediction. Calibration analysis could determine the optimal threshold below which predictions require a refinement pass versus those that can proceed directly to human review.
4. Prompt Optimization
Test alternative prompting strategies:
- Few-shot examples of correct mappings
- Chain-of-thought reasoning
- Structured comparison of top candidates
Rationale: We used a single frozen prompt. Systematic prompt engineering might improve results.
5. Cross-Schema Validation
Test on additional education data schemas to verify generalizability:
- Common Education Data Standards (CEDS)
- Ed-Fi data standard
- SIF (Schools Interoperability Framework)
6. Execution Testing
Validate generated expressions by running them against sample source data and comparing output to expected values. This is the critical next step for production readiness.
7. Field Name Quoting Prompt Enhancement
Test whether adding explicit guidance for quoting field names with special characters (@, spaces) improves CDTL syntax validity from 93-96% to 99%+.
8. Few-Shot Examples
Test whether providing 2-3 example transforms in the prompt improves handling of complex nested transformations.
9. Fine-Tuning
Train a model specifically on schema mapping examples. This would require a larger labeled dataset but might improve accuracy.
10. Ensemble Methods
Require agreement across multiple models before proposing a mapping. This might improve precision at the cost of recall and increased expense.
11. Active Learning
Develop a system that identifies the most informative examples for human labeling, building ground truth more efficiently.
Based on our findings, we recommend against the following approaches:
Don't do this: "Let's use GPT-5.2 (or Opus) everywhere because it's the best model."
Why it won't help: Larger models provide only marginal improvement (+4%) at dramatically higher cost (24x). The over-matching problem persists regardless of model size. Save money with smaller models; invest in human review instead.
Don't do this: "Let's process one field at a time for more detailed reasoning."
Why it won't help: Standard mode costs 20-50x more and produces equal or worse results. The AI loses cross-field context and consistency. Batch mode is strictly superior.
Don't do this: "GPT OSS 120b is cheapest, so let's use it for CDTL."
Why it won't help: GPT OSS 120b hallucinates 53% of the time on CDTL, inventing target fields that don't exist. The cost savings are meaningless if the output is unusable. Pay the $0.05 for Claude Haiku.
Don't do this: "Our precision is 71%, so 71% of mappings will be correct."
Why it won't help: Precision measures proposals, not the final mapping. A 71% precision model that makes 35 proposals will have ~25 correct and ~10 wrong. The 10 wrong ones still require human identification and correction.
Don't do this: "The AI is pretty accurate, so let's just use its output directly."
Why it won't help: Even our best model is wrong 29% of the time when proposing a mapping. Wrong mappings cause data to flow to incorrect destinations, potentially corrupting downstream systems. Human review is not optional; it is essential.
Don't do this: "Claude Sonnet is our best model, so use it for generating transforms too."
Why it won't help: Tier 3 models perform worse than Tier 2 on transform generation (91-95% vs 96-98%) while costing 5-8x more. Use Gemini Flash 3 for batch processing or Claude Haiku 4.5 for interactive use.
Don't do this: "The expressions have valid syntax, so they must work correctly."
Why it won't help: Syntax validity does not guarantee semantic correctness. An expression like $lookup({'M': 'Male'}, gender) is syntactically valid but will fail if the data contains 'MALE' instead of 'M'. Execution testing against sample data is required before production use.
These experiments compared methods for automatically mapping education data schemas (Experiment 1) and generating transformation expressions (Experiment 2). After 41 experiment runs across two schemas (PDP and CDTL), we reached six main conclusions:
1. AI language models represent a genuine capability leap over traditional methods.
The improvement from Tier 1 (text matching) to Tier 2 (small AI models) was substantial: +51% improvement in F0.5 score, +59% improvement in precision. Traditional methods found the right target 44% of the time when proposing a match; AI models achieved 71%. This is not incremental progress; it is a qualitative change in what automation can accomplish.
2. Small AI models match large ones at a fraction of the cost.
The smallest model we tested (GPT OSS 120b at $0.005 per run) achieved 96% of the accuracy of the largest model (GPT-5.2 at $0.12 per run). Hypothesis H2 (that returns diminish at scale) was confirmed. For schema mapping, the era of "bigger is always better" does not apply.
3. Batch processing outperforms field-by-field approaches.
Processing all fields in a single AI call produces better results (higher F0.5) AND costs 20-50x less than processing each field individually. This was a surprising finding with significant practical implications: the obvious approach (one field at a time) is inferior to the simpler approach (all at once).
4. Schema characteristics determine model selection.
Not all models work for all schemas. Technical vocabularies like CDTL cause non-Claude models to hallucinate extensively (45-53% hallucination rates). Model selection must account for schema characteristics, not just cost and accuracy benchmarks.
5. AI can generate valid transform expressions at high accuracy.
Experiment 2 demonstrated that LLMs can write executable JSONata transformation code with 91-98% success rates. When constrained to an approved function library, models achieved near-perfect function compliance (only 1 hallucination across 224 generations).
6. For transform generation, smaller models are better.
In a reversal of typical expectations, Tier 2 models (Gemini Flash 3, Claude Haiku) outperformed Tier 3 models (Claude Sonnet, GPT-5.2) on transform generation while costing 5-8x less. Constrained code generation tasks do not benefit from additional model capacity.
For a typical 141-field schema like PDP:
| Aspect | Manual Approach | AI-Assisted Approach |
|---|---|---|
| Schema mapping | ||
| Processing time | 8-16 hours | 30 seconds + 1-3 hours review |
| Cost | Expert labor | $0.005 - $0.17 |
| Accuracy | Variable (human error) | 71% precision, 83% recall |
| Scalability | Linear with fields | Near-constant per schema |
| Transform generation (for ~12 fields requiring transforms) | ||
| Processing time | 2-4 hours | 1 minute + 15-30 min review |
| Cost | Developer time | $0.01 - $0.03 |
| Accuracy | Variable | 96-98% valid syntax |
AI does not eliminate human expertise. It transforms the task from "search through 521 options per field" to "review low-confidence predictions and spot-check high-confidence ones." This is a significant productivity improvement, but schema mapping remains a human-in-the-loop process.
For schema mapping (Experiment 1):
- Use batch mode (always)
- For PDP-like schemas: GPT OSS 120b ($0.005)
- For CDTL-like schemas: Claude Haiku ($0.05)
- Prioritize human review for low-confidence predictions
For transform generation (Experiment 2):
- Define a custom function library for common patterns
- Use Gemini Flash 3 for batch ($0.0007/transform) or Claude Haiku for interactive ($0.002/transform)
- Avoid Tier 3 models (lower accuracy at higher cost)
- Validate expressions against sample data before production
For future research:
- Retrieval-augmented pipeline (embeddings + LLM verification)
- Execution testing of generated transform expressions
- Validation on additional education data standards
Schema mapping and transform generation exemplify a broader pattern in AI assistance: the technology is most valuable when it narrows options for human judgment, not when it replaces human judgment entirely. An AI that is right 71% of the time on mapping and 98% on code generation sounds impressive, but the remaining errors can cause real harm to data quality.
The winning approach is not "find the smartest AI" but "find the cheapest AI that meets quality thresholds, then invest the savings in human review." For schema mapping in January 2026, that means small models in batch mode, verified by domain experts. For transform generation, it means Tier 2 models with constrained function vocabularies.
A surprising finding from both experiments: bigger is not always better. In Experiment 1, Tier 3 models provided only marginal gains over Tier 2. In Experiment 2, Tier 2 models actually outperformed Tier 3. This challenges the assumption that frontier models are always the right choice.
The question is not whether AI can help with education data integration (it clearly can). The question is how to deploy it responsibly, with appropriate expectations and human oversight. We hope these experiments provide a foundation for answering that question.
The following detailed reports are compiled in APPENDICES.md (and APPENDICES.docx). They are also embedded in the Interactive HTML Report.
| Appendix | Title | Description |
|---|---|---|
| A | Experimental Protocol | Full experimental protocol, prompt templates, configuration |
| B | Tier 1 Analysis | Detailed Tier 1 (embedding/NLP) analysis with examples |
| C | Tier 2/3 Analysis | Detailed LLM analysis, model comparisons, batch findings |
| D | Batch Mode Analysis | Batch vs. standard mode comparison |
| E | Silver Data Quality | Reference data quality, Silver vs. Gold agreement |
| F | Transform Detection | Transform detection analysis by model and schema |
| G | Variance Analysis | Variance and reproducibility analysis |
| H | T2S Experiment | Target-to-source direction experiment |
| I | Transform Expression Generation | Experiment 2 detailed analysis, failure patterns, expression examples |
The Interactive HTML Report (report.html) provides charts and full run data.
Report completed January 31, 2026 Experiments conducted January 17-31, 2026 Total experiment runs: 41 (33 mapping + 8 transform) Experiment 1: 280 fields evaluated (141 PDP + 139 CDTL) Experiment 2: 448 transform generations (56 transforms × 8 runs)







