Skip to content

Latest commit

 

History

History
1455 lines (904 loc) · 72.7 KB

File metadata and controls

1455 lines (904 loc) · 72.7 KB

 

 

 

Can AI Map and Transform Education Data?

Findings from Experiments 1 & 2: Evaluating Agentic Components for AI-Assisted Schema Mapping and Transform Expression Generation

 

 

 

 

Author: Olivier Mills (Baobab Tech)
Client: DataKind
Date: January 31, 2026
Version: 1.1

Summary for Decision-Makers

The Challenge

When education organizations share data, they face a fundamental problem: different systems use different formats. A student record in one database might have fields labeled "student_id" and "first_name," while another system expects "Person.Identifier" and "Person.Name.firstName." Before data can flow between systems, someone must create a translation guide, matching each field in the source system to its counterpart in the destination system.

This translation process, called schema mapping, has traditionally required skilled professionals to manually examine hundreds of fields, understand their meaning, and find the correct matches. For a typical dataset with 140 fields, this work takes 8 to 16 hours of expert time and remains prone to human error.

Why This Matters Now: The Rise of Agentic AI

Recent advances in artificial intelligence have produced systems capable of autonomous reasoning and task execution, often called "agentic AI." These systems can read instructions, reason about problems, and take actions to accomplish goals. However, deploying agentic AI in production requires understanding which underlying components work reliably and cost-effectively.

This study evaluates two foundational capabilities that agentic AI systems need for data integration tasks:

  1. Schema Mapping: Can AI correctly identify which data fields correspond to each other across different systems?
  2. Code Generation: Can AI write the transformation logic needed to convert data values between formats?

These capabilities form the building blocks for more sophisticated agentic workflows in data engineering.

What We Found

Agentic Components Perform Well on Structured Tasks

Our experiments demonstrated that AI systems can correctly identify approximately 70-80% of field matches, compared to only 40-45% for traditional automated methods. More importantly, AI can process an entire dataset in under one minute at a cost of less than 10 cents, compared to hours of professional labor.

For code generation, AI systems successfully produced valid transformation logic 91-98% of the time at a cost of less than one cent per transformation.

Human Oversight Remains Essential

AI does not eliminate the need for human review. The remaining 20-30% of mapping cases require expert judgment, and AI occasionally proposes incorrect matches that humans must catch and correct. The practical benefit is transforming the task from "examine every field manually" to "review and verify AI suggestions," reducing total time from 8-16 hours to 1-3 hours.

This finding has implications for agentic AI deployment: even capable systems benefit from human-in-the-loop validation, particularly for consequential decisions.

Larger AI Systems Are Not Necessarily Better

A surprising finding contradicted common assumptions about AI capabilities. The largest, most expensive AI systems did not perform meaningfully better than smaller, more economical alternatives. In some cases, smaller systems actually produced better results.

For straightforward datasets, the least expensive AI option (costing less than one cent per run) achieved 96% of the accuracy of premium systems costing 24 times more. For code generation tasks, mid-tier systems outperformed premium alternatives while costing 5 to 8 times less.

This suggests that for well-defined agentic tasks, organizations should select AI components based on demonstrated performance rather than assuming that larger models are superior.

Practical Implications

Cost Efficiency

For a typical 140-field dataset:

  • Traditional manual approach: 8-16 hours of expert time
  • AI-assisted approach: Less than $0.20 in AI costs plus 1-3 hours of review time

Recommended Approach for Agentic Data Integration

  1. Use AI to generate initial field matches and transformation code
  2. Have human experts review AI suggestions, focusing on low-confidence predictions
  3. Verify a sample of high-confidence predictions to catch systematic errors
  4. Test transformation code against sample data before production use

Key Considerations for Agentic AI Deployment

  • Human oversight remains essential. Agentic AI accelerates work but does not replace the need for domain expertise and validation.
  • Task characteristics matter. Technically complex datasets require different AI configurations than simpler ones.
  • Cost efficiency favors appropriately-sized systems. Premium AI services do not provide proportional benefits for well-defined tasks.
  • Component reliability varies. Different AI providers show different reliability profiles; selection should be based on empirical testing.

Conclusion

The agentic AI components evaluated in this study can meaningfully assist with data integration tasks in education. The technology reduces costs and accelerates timelines while maintaining accuracy suitable for production use with appropriate human oversight.

Organizations considering agentic AI for data integration should pilot the approach with representative data, establish review procedures appropriate to their accuracy requirements, and select AI components based on demonstrated performance rather than marketing claims about capability.

Technical Executive Summary

The Problem

Education data systems store information in different formats. When data moves between systems (such as from the Postsecondary Data Partnership (PDP) to the Learner Information Framework (LIF)), someone must determine which fields correspond to each other. A field called student_id in one system might map to Person.Identifier.identifier in another. This "schema mapping" process is traditionally manual, time-consuming, and error-prone.

Agentic AI Context

Agentic AI systems, those capable of autonomous reasoning and multi-step task execution, require reliable underlying components. For data integration workflows, two foundational capabilities are essential: (1) semantic understanding to identify field correspondences across schemas, and (2) code generation to produce transformation logic. This study rigorously evaluates these agentic components to determine which approaches offer the best balance of accuracy, cost, and reliability for production deployment.

What We Tested

We evaluated 12 methods across three tiers of sophistication:

  • Tier 1 (Traditional): Four text-matching approaches that compare field names and descriptions without understanding meaning
  • Tier 2 (Small AI): Five compact language models that can read instructions and reason about semantics
  • Tier 3 (Large AI): Three frontier language models with the strongest reasoning capabilities

We tested these on two schemas (PDP and CDTL), each containing approximately 140 source fields that needed mapping to 521 possible target fields in LIF. All results were evaluated against human-verified ground truth.

Key Findings

  1. AI language models dramatically outperform traditional methods. The best AI approach achieved 71% precision (correct proposals) compared to 44% for the best traditional method, a 61% relative improvement.

  2. Small AI models match large ones at a fraction of the cost. The smallest model we tested (GPT OSS 120b at $0.005 per run) achieved 96% of the accuracy of the most expensive model (GPT-5.2 at $0.12 per run). The 24x cost difference yielded only a 4% accuracy gain.

  3. "Batch mode" outperforms field-by-field processing. Processing all fields in a single AI call produces better results AND costs 20-50x less than processing each field individually.

  4. Schema characteristics matter significantly. The CDTL schema proved harder than PDP (F0.5 of 0.62 vs. 0.74) because its vocabulary overlaps with the target schema, confusing the AI.

  5. Non-Claude AI models fail on technical schemas. For CDTL, models from providers other than Anthropic hallucinated extensively, inventing target fields that do not exist. Only Claude Haiku and Claude Sonnet reliably work for CDTL.

Experiment 2: Can AI Write the Transform Code?

Building on Experiment 1, we tested whether AI can generate the actual transformation expressions (JSONata code) for mappings that require data conversion.

Key Findings:

  1. All models achieved strong results. Full pass rates ranged from 91% to 98%, exceeding all success thresholds.

  2. Tier 2 models outperformed Tier 3. In a reversal of expectations, Gemini Flash 3 (98.2%) and Claude Haiku (96.4%) achieved higher accuracy than Claude Sonnet (94.6%) and GPT-5.2 (91.1%).

  3. Custom function constraints eliminate hallucinations. When instructed to use only approved functions, models achieved near-perfect compliance (100% for all models except one GPT-5.2 instance).

  4. Speed and cost trade-offs are significant. Gemini Flash is 8x cheaper but 2.5x slower than Claude Haiku. For interactive applications, Claude Haiku offers the best balance.

Use Case Recommended Model Cost (56 transforms) Time Pass Rate
Batch processing Gemini Flash 3 $0.04 12 min 98.2%
Interactive use Claude Haiku 4.5 $0.13 5 min 96.4%

Bottom Line

Experiment 1 (Schema Mapping):

Schema Recommended Model Cost per Run Expected Accuracy
PDP GPT OSS 120b (batch) $0.005 F0.5 = 0.71
CDTL Claude Haiku (batch) $0.05 F0.5 = 0.62

Experiment 2 (Transform Expressions):

Use Case Recommended Model Cost per Transform Pass Rate
Batch Gemini Flash 3 $0.0007 98.2%
Interactive Claude Haiku 4.5 $0.0023 96.4%

Practical impact: For a 141-field schema, AI can process all fields in under a minute for less than a nickel, correctly identifying approximately 25 of 30 real mappings and correctly rejecting approximately 100 of 110 non-mappings. For the subset requiring transforms, AI can generate valid JSONata expressions at 96-98% accuracy for less than a penny each. This reduces human review time from an estimated 8+ hours to 1-2 hours, but does not eliminate the need for human verification.

The key insight for agentic AI deployment: The assumption that larger models are always better does not hold for these foundational agentic tasks. In both experiments, smaller models matched or exceeded larger ones at a fraction of the cost. When building agentic workflows, select component models based on empirical performance for the specific task, schema characteristics, and interactivity requirements, not on general capability benchmarks or marketing claims about model size.

1. Introduction

1.1 Background and Motivation

What Is Schema Mapping?

Every database organizes information using a "schema," a blueprint that defines what fields exist and how they relate to each other. A student information system might have fields like student_id, first_name, date_of_birth, and enrollment_date. A different system storing similar information might call these fields Person.Identifier.identifier, Person.Name.firstName, Person.birthDate, and Enrollment.entryDate.

Schema mapping is the process of determining which fields in one system correspond to which fields in another. Some mappings are obvious: first_name clearly corresponds to Person.Name.firstName. Others require domain expertise: delivery_method corresponds to Course.instructionalDelivery, but only someone familiar with education data would recognize this relationship.

Why This Matters for Education Data

The Learner Information Framework (LIF) is emerging as a common standard for representing education data. Organizations using other schemas (such as the Postsecondary Data Partnership (PDP) schema or the Credential Transparency Description Language (CDTL)) need to map their data to LIF to participate in broader data-sharing initiatives.

Currently, this mapping is performed manually by domain experts who examine each source field, search through hundreds of possible target fields, and determine the best match. This process is:

  • Time-consuming: A schema with 140 fields might require 8-16 hours of expert time
  • Error-prone: Humans miss mappings, especially when tired or unfamiliar with edge cases
  • Expensive: Domain experts are scarce and their time is valuable
  • Non-scalable: Each new schema requires starting from scratch

The Promise of AI Assistance

Recent advances in artificial intelligence, particularly large language models (LLMs) like GPT and Claude, suggest these tools might assist with schema mapping. Unlike simple text matching, LLMs can:

  • Understand that "delivery method" and "instructional delivery" refer to the same concept
  • Recognize that student_id relates to Person.Identifier because students are people
  • Determine when a source field has no valid target and should be marked "no match"

However, AI assistance comes with questions: How accurate is it? How expensive? Do larger, more sophisticated models perform meaningfully better than smaller, cheaper ones? This experiment sought to answer these questions rigorously.

Why This Experiment Was Designed Differently

Typical attempts to evaluate AI for schema mapping suffer from confounded variables. Multiple factors changed between runs (prompts, models, preprocessing steps), making it impossible to attribute improvements to specific changes. Was the improvement from the new model, or from the revised prompt, or from fixing a bug in preprocessing?

This experiment enforced strict variable isolation:

  • When comparing models, prompts remained identical
  • When comparing prompting strategies, models remained fixed
  • All parameters (temperature, token limits, prompt versions) were frozen before execution
  • Every run was logged with complete inputs and outputs for reproducibility

1.2 Research Questions

We designed these experiments around six hypotheses, stated here in plain language:

Experiment 1: Schema Mapping

ID Question Plain Language Interpretation
H1 Can traditional text-matching methods achieve reliable accuracy? Do we even need AI, or can simpler methods work?
H2 Do larger AI models produce better results than smaller ones? Is it worth paying 10-50x more for a "better" model?
H3 Can AI detect when data transformation is required? Beyond finding pairs, can AI identify when values need conversion?
H4 Does mapping-type classification improve with model size? Can AI distinguish direct matches from splits and merges?

Experiment 2: Transform Expression Generation

ID Question Plain Language Interpretation
H5 Can LLMs produce syntactically valid JSONata expressions? Can AI write executable transformation code?
H6 Can LLMs be constrained to use only approved functions? Can we prevent AI from inventing non-existent functions?

Success thresholds were defined before running experiments:

  • H1: Traditional methods achieve MCC > 0.5 (a statistical measure indicating reliable classification)
  • H2: Improvement from Tier 2→3 is smaller than Tier 1→2 (diminishing returns)
  • H3: Transform detection accuracy differs significantly across tiers
  • H4: Type classification accuracy increases with model tier
  • H5: Syntax validity rate > 60% (minimum viability), > 90% (strong result)
  • H6: Function hallucination rate < 10% (acceptable), < 1% (strong result)

1.3 Scope and Limitations

What We Tested

Experiment 1 (Schema Mapping):

  • Source schemas: PDP (141 fields) and CDTL (139 fields)
  • Target schema: LIF (521 fields)
  • Methods: 4 traditional text-matching methods, 8 AI language models
  • Variants: Standard (field-by-field) and batch (all fields at once) processing
  • Total experiment runs: 33 completed runs with full evaluation

Experiment 2 (Transform Expression Generation):

  • Transform mappings: 56 total (12 PDP, 44 CDTL)
  • Models: 4 AI language models (Claude Haiku, Gemini Flash, Claude Sonnet, GPT-5.2)
  • Custom function library: 5 helper functions for common transformations
  • Total experiment runs: 8 (4 models × 2 schemas)

What We Did Not Test

  • Mapping in the reverse direction (LIF to PDP/CDTL)
  • Other education data schemas beyond PDP and CDTL
  • Production deployment with real-time human-in-the-loop workflows
  • Prompt optimization beyond our frozen prompts (v1.0 for mapping, v1.1 for transforms)
  • Fine-tuning models specifically for schema mapping
  • Execution testing of generated transform expressions against sample data
  • Semantic validation of transform outputs (correct values, not just valid syntax)

Known Limitations

  1. Limited ground truth: Our Gold-B reference dataset contains 38 verified PDP mappings and 99 verified CDTL mappings. While carefully constructed, larger test sets would provide higher statistical confidence.

  2. Single prompt version: We froze our prompt before experimentation. Different prompting strategies might yield different results; this is an avenue for future research.

  3. Two schemas only: PDP and CDTL have distinct characteristics. Results may not generalize to all education data schemas.

  4. Snapshot in time: AI model capabilities and costs change rapidly. Results represent January 2026 conditions.

  5. Reference data quality: Some "correct" answers in our ground truth may themselves be debatable. Appendix E discusses this in detail.

2. Methods

2.1 Experimental Design

We organized our methods into three tiers representing fundamentally different approaches to schema mapping:

Tier 1: Traditional Text-Matching Methods

These methods compare field names and descriptions using mathematical similarity measures. They do not "understand" meaning; they detect patterns in characters and words.

ID Method How It Works Cost
E-01 Fuzzy Matching Compares character sequences, like spell-check. "student_id" matches "studentId" because most letters are the same. Free
E-02 Static Embeddings Converts text to numerical vectors (lists of numbers) and measures how "close" they are in mathematical space. Free
E-03 Sentence Transformer (Small) A neural network with 22 million parameters that learns semantic relationships. Free
E-04 Sentence Transformer (Large) A larger neural network (335 million parameters) representing the state of the art in text similarity. Free

How embedding methods work (simplified):

Imagine a library where books are arranged not alphabetically, but by topic. Books about cooking are near each other; books about astronomy are in a different area. Embedding methods work similarly: they place text in a mathematical "space" where similar concepts are close together.

When we embed "delivery method" and "instructional delivery," they end up near each other in this space because they share conceptual meaning, even though they share few characters.

Tier 2: Small AI Language Models

These models can read instructions, reason about meaning, and produce structured output. They are hosted in the cloud and charge per use.

ID Model Provider Parameters Cost per Run
L-01 GPT OSS 120b OpenAI (Open Source) ~5B active $0.005 - $0.15
L-02 Llama 4 Maverick Meta (via Groq) 17B×128 experts $0.006 - $0.34
L-03 Claude Haiku 4.5 Anthropic ~20B $0.05 - $2.25
L-04 GPT-5-mini OpenAI ~25B $0.01 - $0.14
L-05 Gemini Flash 3 Google ~30B $0.02 - $0.96

Cost ranges reflect batch mode (lower) vs. standard mode (higher).

Tier 3: Large AI Language Models

The most capable (and most expensive) models available.

ID Model Provider Cost per Run
L-06 Claude Sonnet 4.5 Anthropic $0.17 - $6.63
L-07 Gemini Pro 3 Google $0.09 - $0.96
L-08 GPT-5.2 OpenAI $0.12 - $0.83

Two Processing Variants

For AI models, we tested two approaches:

Standard Mode (Field-by-Field):

  • Send each source field to the AI separately
  • AI returns one mapping decision per call
  • 141 API calls for PDP schema
  • Higher cost, but provides detailed reasoning for each field

Batch Mode (All at Once):

  • Send all source fields in a single API call
  • AI returns all mapping decisions at once
  • 1 API call for entire schema
  • Much lower cost; AI sees full context

2.2 Data Sources

Source Schemas

PDP (Postsecondary Data Partnership):

  • 141 source fields
  • Relatively straightforward field names (student_id, first_name, course_begin_date)
  • 21% of fields have valid mappings; 79% should be "no match"

CDTL (Credential Transparency Description Language):

  • 139 source fields
  • Technical vocabulary (Competency.performanceLevelType, CredentialOrganization.agentType)
  • 51% of fields have valid mappings; 49% should be "no match"

Target Schema

LIF (Learner Information Framework):

  • 521 target fields organized hierarchically
  • Examples: Person.Name.firstName, Course.instructionalDelivery, Credential.level

The Needle-in-Haystack Challenge

For each source field, the method must either:

  1. Select the correct target from 521 options, OR
  2. Correctly determine that no valid target exists

This is inherently difficult. Even for fields with valid mappings, the correct answer is 1 in 521 (0.2%). Random guessing would achieve near-zero accuracy.

Ground Truth Construction

We used three reference datasets:

Gold (Original):

  • Manual mappings from human expert review
  • PDP: 29 verified mappings; CDTL: 66 verified mappings
  • 100% human-verified but limited coverage

Silver (AI-Generated):

  • Generated by Claude Opus 4.5 (the most capable model, not used in experiments)
  • PDP: 31 mappings; CDTL: 102 mappings
  • 30% human-verified

Gold-B (Enhanced):

  • Gold plus human expert corrections and additions from Silver verification
  • PDP: 38 verified mappings (12 transforms); CDTL: 99 verified mappings (44 transforms)
  • Our primary reference for final evaluation

Why we needed Silver:

The original Gold dataset was small (29-66 mappings) because manual verification is time-consuming. We used AI to propose additional mappings, then had a human expert verify a sample. Where Silver agreed with Gold (95.7% for PDP, 61.2% for CDTL), we gained confidence. Where they disagreed, the human expert adjudicated to create Gold-B.

2.3 Evaluation Metrics

We report several metrics, each capturing a different aspect of performance. Here we explain what they mean in plain language.

Primary Metrics

Precision: "When the AI proposes a mapping, how often is it correct?"

$$\text{Precision} = \frac{\text{Correct proposals}}{\text{Total proposals}}$$

A precision of 70% means: of every 10 mappings the AI proposes, 7 are correct and 3 are wrong.

Why precision matters: Wrong mappings that look plausible can slip through human review and cause data errors downstream. We want to minimize these false positives.

Recall: "Of all the real mappings that exist, how many did the AI find?"

$$\text{Recall} = \frac{\text{Mappings found}}{\text{Total real mappings}}$$

A recall of 80% means: if there are 30 real mappings, the AI found 24 and missed 6.

Why recall matters: Missed mappings mean data doesn't transfer correctly between systems. However, missed mappings are usually obvious (a field with no mapping gets flagged for human review), so they're less dangerous than wrong mappings.

F0.5: "A combined score that weights precision more heavily than recall."

$$F_{0.5} = 1.25 \times \frac{\text{Precision} \times \text{Recall}}{0.25 \times \text{Precision} + \text{Recall}}$$

Why F0.5: Standard F1 weights precision and recall equally. Since we believe wrong mappings (precision failures) are more costly than missed mappings (recall failures), we use F0.5 which weights precision twice as heavily.

Hallucination Rate: "How often does the AI invent fields that don't exist?"

$$\text{Hallucination Rate} = \frac{\text{Predictions with non-existent targets}}{\text{Total predictions}}$$

A hallucination rate of 5% means: 5 of every 100 predictions reference a target field that isn't in the LIF schema at all. This indicates the AI is "making things up" rather than selecting from valid options.

Understanding the Numbers in Context

Consider a method evaluated on PDP (30 real mappings among 141 fields):

Scenario Precision Recall F0.5 Interpretation
Perfect 100% 100% 1.00 Every proposal correct, every mapping found
Our best result 71% 83% 0.74 Good but not perfect; needs human review
Conservative model 90% 40% 0.73 Rarely wrong, but misses many mappings
Aggressive model 40% 95% 0.46 Finds most mappings, but many wrong proposals
Random guessing ~6% ~20% 0.08 Baseline for comparison

Why We Report MCC

Matthews Correlation Coefficient (MCC): A single number from -1 to +1 that captures overall classification quality, accounting for class imbalance.

  • MCC = +1: Perfect predictions
  • MCC = 0: No better than random guessing
  • MCC = -1: Perfectly wrong

Why MCC matters for this task: PDP has only 21% positive mappings (30/141). A method that always predicts "no match" would achieve 79% accuracy but 0% recall and MCC near 0. MCC correctly identifies such degenerate strategies.

Our threshold: MCC > 0.5 indicates practically useful performance; MCC > 0.3 indicates the method is learning meaningful patterns.

2.4 Experimental Protocol

AI Model Configuration

All AI runs used identical settings:

  • Temperature: 0.0 (no randomness; deterministic outputs)
  • Maximum tokens: 500 (sufficient for structured response)
  • Prompt version: v1.0 (frozen before execution)

For "reasoning" models (GPT-5 series, GPT OSS, Gemini 3):

  • Reasoning effort: "low" (minimal internal reasoning to reduce cost and variability)

Prompt Design

The prompt instructed the AI to:

  1. Examine the source field and its description
  2. Search the provided list of 521 LIF target fields
  3. Either propose a match or indicate "no match"
  4. Classify the mapping type (CLEAN_MATCH, SPLIT, MERGE, or NO_MATCH)
  5. Indicate if data transformation would be required
  6. Provide brief reasoning

The complete prompt is documented in Appendix A.

Threshold Selection for Tier 1

Embedding methods produce similarity scores (e.g., 0.78). We tested multiple thresholds:

  • If score > threshold: propose the match
  • If score ≤ threshold: predict "no match"

We tested thresholds from 0.5 to 0.95 and report results at the optimal threshold for each method (typically 0.7 for embeddings, 0.9+ for fuzzy matching).

Run Tracking

Every run was logged to a SQLite database with:

  • Model identifier and configuration
  • Complete input/output for each prediction
  • Token counts and costs
  • Evaluation metrics against all reference datasets

Detailed predictions were also saved as JSON files for manual inspection.

Full technical protocol: see Appendix A

2.5 Experiment 2: Transform Expression Generation

Objective

Experiment 1 identified which fields require data transformation. Experiment 2 tested whether AI can generate the actual transformation code (JSONata expressions) for those mappings.

Test Data

We used the 56 mappings from Gold-B that were flagged as requiring transforms:

  • PDP: 12 transforms (e.g., date format conversion, country code lookup)
  • CDTL: 44 transforms (e.g., credential level mapping, delivery method translation)

Models Tested

ID Model Tier Rationale
L-03 Claude Haiku 4.5 2 Best Exp1 performer for CDTL
L-05 Gemini Flash 3 2 Lowest cost option
L-06 Claude Sonnet 4.5 3 Strong transform detection in Exp1
L-08 GPT-5.2 3 Best overall Exp1 accuracy

Custom Function Library

To prevent models from generating huge lookup tables or inventing non-existent functions, we defined 5 helper functions representing common transformation patterns:

Function Purpose Example
$countryToISO(name) Country name → ISO code 'United States''US'
$normalizeDate(str) Date string → ISO8601 '01/15/2024''2024-01-15'
$genderToText(code) Gender code → text 'M''Male'
$ethnicityToText(code) Ethnicity code → text 'H''Hispanic or Latino'
$deliveryMethodToText(code) Delivery code → text 'O''Online'

The prompt explicitly instructed models to use only built-in JSONata functions or these custom functions.

Validation Pipeline

Each generated expression was validated in two stages:

  1. Syntax Validation: Parse the expression using the JSONata library. Expressions that fail parsing are marked as syntax errors.

  2. Function Validation: Extract all function calls and verify each is either a built-in JSONata function or one of the 5 approved custom functions. Expressions calling unknown functions are marked as hallucinations.

Full pass requires both syntax validity and function validity.

Full technical protocol: see Appendix I

3. Results

3.1 Tier 1: Traditional Text-Matching Methods

Summary

Traditional methods achieved moderate success but fell short of our reliability threshold. The best method (Sentence Transformer B) found 69% of real mappings but was only correct 44% of the time when proposing a match.

Performance Comparison

PDP Schema (38 verified mappings, 141 total fields):

Method Precision Recall F0.5 MCC Correct Proposals Mappings Found
Sentence Transformer B (E-04) 44% 69% 0.47 0.43 20 of 46 20 of 30
Sentence Transformer A (E-03) 46% 21% 0.37 0.22 6 of 13 6 of 30
Fuzzy Matching (E-01) 42% 17% 0.33 0.18 5 of 12 5 of 30
Static Embeddings (E-02) 24% 38% 0.26 0.16 11 of 46 11 of 30

What These Results Mean

The best method (Sentence Transformer B) achieved:

  • 20 correct matches out of 46 proposals (44% precision)
  • 20 of 30 real mappings found (69% recall)
  • 26 incorrect proposals that would need human rejection

Hypothesis H1 verdict:Failed

Our threshold for "reliable accuracy" was MCC > 0.5. The best Tier 1 method achieved MCC = 0.43 on PDP. While this indicates the methods are learning patterns (better than random), they cannot reliably distinguish mappable from non-mappable fields.

On the CDTL schema, results were better (MCC = 0.63, exceeding our threshold), but CDTL has more favorable characteristics (higher positive rate, cleaner field names).

Illustrative Examples

Success Case: course_begin_dateCourse.courseBeginDate

All four methods correctly identified this mapping. The field names are nearly identical, so even simple character matching works.

Failure Case: student_id → should map to Person.Identifier.identifier

Method What It Proposed Result
Fuzzy Matching Person.Identifier.identifier ✅ Correct
Static Embedding Assessment.awardedByRefOrganization.identifier ❌ Wrong
Sentence Transformer A (below threshold) ❌ Missed
Sentence Transformer B (below threshold) ❌ Missed

The semantic models got confused by dozens of fields containing "identifier" and couldn't decide which was correct. Ironically, the "dumb" fuzzy matcher succeeded because student_id shares character patterns with Person.Identifier.identifier.

The Over-Matching Problem:

Traditional methods confidently proposed these incorrect mappings:

Source Field Model Proposed Score Truth
time_to_credential Credential.Requirements.minimumDuration 0.78 NO MATCH
course_type Course.courseLevelType 0.78 NO MATCH
semester_session_gpa Course.courseGPAApplicability 0.78 NO MATCH

These fields genuinely have no LIF equivalent, but the models found "close enough" targets and proposed them anyway. This over-matching behavior creates work for human reviewers who must reject plausible-looking but incorrect suggestions.

Detailed Tier 1 analysis: see Appendix B

3.2 Tier 2/3: AI Language Models

Summary

AI language models dramatically outperformed traditional methods. The best AI approach achieved 71% precision compared to 44% for Tier 1, a 61% relative improvement. Surprisingly, small models nearly matched large models, confirming diminishing returns at scale.

Performance by Tier: The Jump from Embeddings to LLMs is Transformational

Figure 1: The improvement from Tier 1 to Tier 2 (+51%) dwarfs the improvement from Tier 2 to Tier 3 (+4%).

PDP Schema Results (Batch Mode, Gold-B Reference)

Model Tier Precision Recall F0.5 MCC Cost
GPT-5.2 3 71% 83% 0.74 0.76 $0.12
GPT OSS 120b 2 70% 77% 0.71 0.73 $0.005
Claude Sonnet 4.5 3 61% 77% 0.63 0.68 $0.14
Llama 4 Maverick 2 60% 70% 0.62 0.66 $0.006
Claude Haiku 4.5 2 56% 80% 0.59 0.68 $0.05

The Tier Jump

Metric Tier 1 Best Tier 2 Best Tier 3 Best T1→T2 Gain T2→T3 Gain
F0.5 0.47 0.71 0.74 +51% +4%
Precision 44% 70% 71% +59% +1%
MCC 0.43 0.73 0.76 +70% +4%

Hypothesis H2 verdict:Confirmed

The improvement from Tier 1 to Tier 2 is massive (+51% F0.5). The improvement from Tier 2 to Tier 3 is modest (+4% F0.5). Larger models provide diminishing returns for this task.

The Cost-Performance Surprise

The smallest, cheapest model (GPT OSS 120b at $0.005) achieved 96% of the best model's accuracy (GPT-5.2 at $0.12):

Model F0.5 Cost % of Best Accuracy Cost Relative
GPT-5.2 0.74 $0.12 100% 24x
GPT OSS 120b 0.71 $0.005 96% 1x

Practical implication: For most use cases, the cheapest model is the best choice. The 24x cost premium for GPT-5.2 yields only 4% accuracy improvement.

Cost vs. Accuracy: Small Models Offer Best Value

Figure 2: The Pareto frontier shows GPT OSS 120b (best value) and GPT-5.2 (best accuracy) as the only non-dominated options.

Precision vs. Recall across all models

Figure 3: LLMs achieve a better precision-recall balance than Tier 1 methods (shown as square marker).

Detailed LLM analysis: see Appendix C

3.3 Batch Mode vs. Standard Mode

Summary

Processing all fields in a single AI call ("batch mode") produces better results AND costs 20-50x less than processing each field individually ("standard mode").

Direct Comparison

PDP Schema (Gold-B Reference):

Model Standard F0.5 Standard Cost Batch F0.5 Batch Cost Improvement Savings
GPT-5.2 0.66 $0.83 0.74 $0.12 +12% 86%
Claude Sonnet 0.52 $6.63 0.62 $0.17 +19% 97%
Claude Haiku 0.56 $2.25 0.59 $0.05 +5% 98%

Why Batch Mode Works Better

  1. Context awareness: When the AI sees all fields together, it can maintain consistency. If it maps first_name to Person.Name.firstName, it's more likely to correctly map last_name to Person.Name.lastName rather than proposing an inconsistent target.

  2. Reduced prompt overhead: In standard mode, the full LIF schema (521 fields) is sent with every request. In batch mode, it's sent once.

  3. Global reasoning: The AI can recognize patterns across fields, such as address components that should all map to the same parent entity.

Batch vs. Standard Mode Comparison

Figure 4: Batch mode achieves better accuracy (left) at dramatically lower cost (right) for all tested models.

Detailed batch analysis: see Appendix D

3.4 Schema-Specific Findings

Summary

CDTL proved significantly harder than PDP. The same methods that achieved F0.5 = 0.74 on PDP only reached F0.5 = 0.62 on CDTL, a 16% drop.

PDP vs. CDTL Comparison

Characteristic PDP CDTL
Best F0.5 0.74 0.62
Best Precision 71% 60%
Positive Rate 21% (30/141) 51% (71/139)
Average Hallucinations 1.5 per run 24 per run

Why CDTL Is Harder

Vocabulary Overlap: CDTL uses terms like "Competency," "Credential," "Assessment," and "Course" extensively. These same terms appear throughout the LIF target schema. The AI finds many "close matches" that aren't quite right:

CDTL Field AI Proposed Why It's Wrong
Credential.audienceLevelType Credential.level "type of level" ≠ "level"
Course.audienceLevelType Course.courseLevelType Same semantic confusion
IdentifierValue.identifier Person.Identifier.identifier Nothing indicates "person"

Higher Stakes: With 51% positive rate (vs. 21% for PDP), there are more mappings to find but also more opportunities for false positives.

Critical Finding: Model Reliability Varies by Schema

For PDP: All models work reliably. Choose based on cost.

For CDTL: Only Claude models (Haiku and Sonnet) reliably work. Other models hallucinate extensively:

Model CDTL Batch Hallucinations Reliability
Claude Sonnet 4.5 0 ✅ Reliable
Claude Haiku 4.5 3 ✅ Reliable
GPT-5.2 2 ✅ Reliable
Gemini Pro 3 0 ✅ Reliable
Gemini Flash 3 0 ✅ Reliable
GPT OSS 120b 74 (53%) ⚠️ Unreliable
Llama 4 Maverick 62 (45%) ⚠️ Unreliable
GPT-5-mini 28 + 247 FP ❌ Unusable

Practical implication: For CDTL or similar technical schemas, use Claude Haiku batch ($0.05 per run). The cheapest models that work for PDP fail catastrophically on CDTL.

Schema Comparison: PDP vs CDTL

Figure 5: CDTL is harder than PDP (left), and some models hallucinate extensively on CDTL (right). GPT OSS 120b's 74 hallucinations represent a 53% failure rate.

Silver quality analysis: see Appendix E

3.5 Type Classification and Transform Detection

Summary

All AI models excel at classifying mapping types (97%+ accuracy). Transform detection varies significantly by model, with GPT-5.2 showing a critical failure mode in batch processing.

Type Classification Results

Question: Can AI correctly identify whether a mapping is a direct match, split, merge, or "no match"?

Model PDP Type Accuracy CDTL Type Accuracy
GPT-5.2 90% 99%
Claude Sonnet 4.5 89% 98%
Claude Haiku 4.5 89% 99%
GPT-5-mini 87% 100%
GPT OSS 120b 91% 98%

Hypothesis H4 verdict:Refuted

We hypothesized that type classification accuracy would increase with model size. Instead, even the smallest models achieve near-perfect accuracy (87%+). Type classification is not a reasoning-intensive task; pattern matching suffices.

Transform Detection Results

Question: Can AI correctly identify when data values need conversion (e.g., date format changes, code translations)?

Gold-B reference identifies 12 PDP fields (8.5%) and 44 CDTL fields (31.7%) requiring transforms. Detection accuracy varies by model:

Model Mode PDP Accuracy CDTL Accuracy Notes
Gemini Flash 3 Batch 100% 86% Perfect on PDP
Claude Sonnet 4.5 Batch 99% 86% Excellent, no false positives
Claude Haiku 4.5 Batch 97% 84% Strong performance
GPT-5.2 Standard 95% 83% Works in standard mode
GPT-5.2 Batch 43% 75% Severe over-prediction

The GPT-5.2 Transform Bug

GPT-5.2 has a critical issue in batch mode: it over-predicts transforms dramatically on PDP, achieving only 43% accuracy with 11% precision (predicting transforms for nearly all fields when only 8.5% require them).

Metric Standard Mode Batch Mode
PDP Transform accuracy 95% 43%
PDP Transform precision 59% 11%

This issue is specific to batch mode and does not affect field pairing accuracy. For transform detection, use Gemini Flash 3 (perfect on PDP) or Claude models (strong on both schemas).

Hypothesis H3 verdict: ⚠️ Partial

Transform detection works across all tiers (>85% accuracy) but has model-specific failure modes. No clear tier advantage emerged.

Detailed transform analysis: see Appendix F

3.6 Variance and Reproducibility

Summary

Running the same experiment multiple times produces nearly identical results. Single runs are representative; multiple runs are unnecessary.

Multi-Run Pilot Results

We ran GPT OSS 120b six times and Llama 4 Maverick four times under identical conditions:

Model Runs Precision CV Recall CV F0.5 CV MCC CV
GPT OSS 120b 6 2.2% 0.0% 1.9% 1.5%
Llama 4 Maverick 4 4.9% 2.4% 4.7% 4.5%

CV (Coefficient of Variation) < 5% for all metrics indicates excellent reproducibility.

Key Observations

  1. True positive counts are remarkably stable: GPT OSS 120b found exactly 23 correct mappings in all 6 runs.

  2. Variance comes from borderline cases: The small differences between runs reflect uncertainty on fields where the model is genuinely uncertain, not random noise.

  3. Temperature 0.0 ensures determinism: With no randomness in the model, outputs are highly reproducible.

Practical implication: You don't need to run experiments multiple times and average results. A single run is representative of model performance.

Variance analysis: see Appendix G

3.7 Transform Expression Generation (Experiment 2)

Summary

All four models achieved strong results generating valid JSONata expressions, with full pass rates ranging from 91% to 98%. In a reversal of Experiment 1 findings, Tier 2 models outperformed Tier 3 models on this task.

Combined Results (56 Transforms)

Model Tier Syntax Valid Full Pass Total Cost Total Time
Gemini Flash 3 2 98.2% 98.2% $0.039 12.1 min
Claude Haiku 4.5 2 96.4% 96.4% $0.128 4.9 min
Claude Sonnet 4.5 3 94.6% 94.6% $0.322 6.6 min
GPT-5.2 3 92.9% 91.1% $0.282 5.4 min

All models exceeded the "strong result" threshold of 90% full pass rate.

Speed vs Accuracy

Figure 7: Speed vs accuracy trade-off for transform generation. Bubble size represents total cost. Gemini Flash 3 achieves highest accuracy at lowest cost but is slowest. Claude Haiku 4.5 offers the best balance of speed and accuracy.

Cost vs Accuracy

Figure 8: Cost vs accuracy for transform generation. Tier 2 models (blue) outperform Tier 3 models (red) while costing significantly less.

Hypothesis Verdicts

H5: Can LLMs produce syntactically valid JSONata?

Result: YES (91-98%)

All models achieved syntax validity rates above 90%. Gemini Flash 3 achieved 100% on the harder CDTL schema. Failures were primarily due to special characters in field names (e.g., @ symbols, spaces) that require quoting in JSONata.

H6: Can LLMs be constrained to use only approved functions?

Result: YES (97.5-100%)

Function hallucination was nearly eliminated. GPT-5.2 was the only model to produce a hallucinated function (1 instance), achieving 97.5% function validity. All other models achieved 100%.

The Tier Reversal

Unlike Experiment 1 (where Tier 3 slightly outperformed Tier 2), transform generation showed the opposite pattern:

Comparison Tier 2 Best Tier 3 Best Winner
Full pass rate 98.2% (Gemini Flash) 94.6% (Claude Sonnet) Tier 2
Cost $0.039 $0.282 Tier 2 (7x cheaper)
Function hallucinations 0 1 Tier 2

Interpretation: Transform expression generation is a more constrained task than open-ended schema mapping. The additional reasoning capacity of Tier 3 models provides no benefit and may introduce overthinking that leads to more complex (and error-prone) solutions.

Speed vs. Cost Trade-off

Model Avg Time/Transform Cost/Transform Best For
GPT-5.2 5.8s $0.0050
Claude Haiku 4.5 5.2s $0.0023 Interactive use
Claude Sonnet 4.5 7.1s $0.0058
Gemini Flash 3 12.9s $0.0007 Batch processing

Gemini Flash 3 is 8x cheaper than Claude Sonnet but 2.5x slower. For interactive applications requiring sub-6-second response times, Claude Haiku offers the best balance of speed, accuracy, and cost.

Schema Difficulty

CDTL proved harder than PDP for most models due to special characters in field names:

Model PDP Pass CDTL Pass Difference
Claude Haiku 4.5 100% 95.5% -4.5%
Gemini Flash 3 91.7% 100% +8.3%
Claude Sonnet 4.5 100% 93.2% -6.8%
GPT-5.2 100% 88.6% -11.4%

Gemini Flash 3 uniquely performed better on CDTL than PDP.

Failure Patterns

The 9 total failures across all models fell into two categories:

1. Special Characters in Field Names (6 failures)

Fields like Credential.@type require backtick quoting in JSONata:

# Wrong (generated)
Credential.@type

# Correct
`Credential.@type`

2. Complex Nested Transforms (3 failures)

The Competency.educationLevelType mapping requires iterating over nested arrays, which challenged multiple models.

Recommendation: Adding prompt guidance for field name quoting would likely eliminate most syntax errors.

Expression Quality

When expressions succeeded, they were high quality:

Simple transforms used custom functions appropriately:

country ? $countryToISO(country) : null

Enum mappings used built-in $lookup:

$lookup({
  'C1': 'Certificate',
  'A': 'Associate',
  'B': 'Bachelor'
}, degree_type_sought) ?? 'Unknown'

Date parsing combined substring extraction with normalization:

$normalizeDate($substring(date_of_birth, 0, 4) & '-' &
               $substring(date_of_birth, 4, 2) & '-' &
               $substring(date_of_birth, 6, 2))

Detailed transform analysis: see Appendix I

4. Discussion

4.1 Summary of Hypothesis Tests

Experiment 1: Schema Mapping

Hypothesis Plain Language Result Key Evidence
H1 Can traditional methods achieve reliable accuracy? ❌ Failed Best MCC = 0.43 (threshold: 0.50)
H2 Do larger AI models give better results? ✅ Confirmed (diminishing returns) T1→T2: +51%; T2→T3: +4%
H3 Can AI detect when transformation is needed? ⚠️ Partial >85% accuracy, but GPT-5.2 batch fails
H4 Does type classification scale with model size? ❌ Refuted Even smallest models achieve 87%+

Experiment 2: Transform Expression Generation

Hypothesis Plain Language Result Key Evidence
H5 Can LLMs produce syntactically valid JSONata? ✅ Confirmed 91-98% syntax validity across all models
H6 Can LLMs be constrained to approved functions? ✅ Confirmed 97.5-100% function compliance; 1 hallucination total

What This Means

Traditional methods are insufficient. Despite decades of research on text similarity, these approaches cannot reliably map education data schemas. The fundamental problem is that schema mapping requires understanding meaning, not just comparing strings.

AI language models represent a genuine capability leap. The 51% improvement from Tier 1 to Tier 2 is not incremental; it is transformational. These models can reason about concepts like "students are people" and "delivery method relates to instructional delivery."

Bigger is not meaningfully better. The 4% improvement from Tier 2 to Tier 3 does not justify 5-50x higher costs. For schema mapping, model size has reached a point of diminishing returns.

Task complexity varies. Type classification is easy (87%+ for all models); field pairing is moderate (71% precision for best model); transform detection is model-dependent.

4.2 The Over-Matching Problem

All methods, both traditional and AI, share a common failure mode: proposing mappings for fields that should have no match.

How Over-Matching Manifests

Consider a field like semester_session_gpa in PDP. There is no corresponding field in LIF; it should be marked "no match." However:

  • Embedding methods find Course.courseGPAApplicability (similarity: 0.77) and propose it
  • AI models reason that "GPA" relates to "GPA" and propose a mapping

Both are wrong. courseGPAApplicability describes whether a course counts toward GPA calculation, not a student's actual GPA value. But the connection is plausible enough to fool automated methods.

The Human Review Implication

This over-matching behavior means:

  1. AI cannot fully automate schema mapping. Human experts must review proposals.

  2. AI changes the human task. Instead of searching 521 options, humans verify ~30 proposals. This is faster but still requires expertise.

  3. False positives are the primary failure mode. Our emphasis on precision (via F0.5) was appropriate; wrong mappings that look plausible can slip through review.

Quantifying the Review Burden

For PDP with our best model (GPT-5.2 batch):

Category Count Human Action Required
Correct proposals 25 Verify and approve
Wrong proposals 10 Reject
Correct "no match" ~100 Spot-check sample
Missed mappings 5 Search manually

Estimated time savings: Manual mapping of 141 fields might take 8-16 hours. AI-assisted review of 35 proposals plus spot-checking might take 1-3 hours, a 70-80% reduction but not elimination.

4.3 Cost-Effectiveness Analysis

The Pareto Frontier

Some models offer strictly better cost-accuracy tradeoffs than others:

Configuration F0.5 Cost Dominated By
GPT OSS 120b batch 0.71 $0.005 None (best value)
GPT-5.2 batch 0.74 $0.12 None (best accuracy)
Claude Haiku batch 0.59 $0.05 GPT OSS 120b
Claude Sonnet standard 0.52 $6.63 Everything

Pareto-optimal choices: GPT OSS 120b batch (best value) and GPT-5.2 batch (best accuracy). All other options are dominated: they cost more for equal or worse performance.

Exception for CDTL: The cost analysis changes because GPT OSS 120b fails on CDTL (53% hallucination rate). For CDTL, Claude Haiku batch becomes the Pareto-optimal choice.

When to Pay More

Scenario Recommended Model Rationale
Cost-constrained, PDP-like schema GPT OSS 120b batch $0.005, 96% of best accuracy
Maximum accuracy required, any schema GPT-5.2 batch $0.12, best overall performance
Technical schema (CDTL-like) Claude Haiku batch $0.05, only reliable option
Transform detection critical Claude models (any) GPT-5.2 has batch mode bug

4.4 Limitations of This Study

Ground Truth Quality

Our Gold-B reference dataset contains 38 PDP mappings (12 requiring transforms) and 99 CDTL mappings (44 requiring transforms). While carefully constructed from human expert manual verification plus AI-assisted expansion, this remains a moderate sample. Statistical confidence is limited:

  • Observed precision of 71% has a 95% confidence interval of approximately ±12%
  • Differences smaller than ~10% may not be statistically significant

Larger ground truth datasets would enable more precise conclusions.

Single Prompt Version

We froze our prompt (v1.0) before execution to enable controlled comparison. However:

  • Alternative prompt designs might yield different results
  • Prompt optimization is a known technique for improving LLM performance
  • Our results represent a lower bound on achievable accuracy

Schema Generalizability

PDP and CDTL have distinct characteristics:

  • PDP: Flat structure, simple names, 21% positive rate
  • CDTL: Nested structure, technical vocabulary, 51% positive rate

Results may not generalize to schemas with different characteristics (e.g., very flat schemas, non-English field names, extremely high or low positive rates).

Temporal Validity

AI capabilities and costs change rapidly:

  • Model performance improves through updates
  • Pricing typically decreases over time
  • New models emerge regularly

Our results represent January 2026 conditions. The specific cost figures will become outdated; the relative rankings may shift.

Reference Data Uncertainty

Even Gold-B contains judgment calls:

  • Some mappings are debatable (e.g., does country map to countryCode or residenceCountry?)
  • The silver quality analysis (Appendix E) documents cases where AI and human experts disagreed

We cannot claim our ground truth is definitively correct, only that it represents careful expert judgment.

4.5 Comparison to Prior Work

Schema Matching Literature

Schema mapping has been studied extensively in database research. Traditional approaches include:

  • Linguistic matchers: Compare names and descriptions (similar to our Tier 1)
  • Structure matchers: Use schema structure (parent-child relationships)
  • Instance matchers: Compare actual data values
  • Hybrid approaches: Combine multiple techniques

Our Tier 1 results are consistent with this literature: linguistic matching alone achieves moderate precision (30-50%) but struggles with semantic relationships.

LLM Applications to Data Tasks

Recent work has applied LLMs to data integration tasks:

  • Entity matching (determining if two records refer to the same entity)
  • Schema inference (determining data types and relationships)
  • Data transformation (generating code to convert formats)

Our results contribute evidence that LLMs can assist with schema mapping specifically, with the important finding that smaller models suffice.

Novel Contributions

This study contributes:

  1. Rigorous comparison across tiers: Traditional methods vs. small LLMs vs. large LLMs with controlled variables

  2. Batch mode discovery: The finding that single-call batch processing outperforms field-by-field approaches appears to be novel

  3. Schema-specific reliability: The finding that certain models fail specifically on technical vocabularies (CDTL) has practical implications

  4. Cost-accuracy tradeoffs: Concrete data showing diminishing returns from model size, with specific recommendations

4.6 Experiment 2: Transform Expression Generation

Why Tier 2 Outperformed Tier 3

Experiment 2 produced a counterintuitive result: smaller models generated better transform expressions than larger ones. We propose three explanations:

1. Task Constraint Reduces Complexity Advantage

Schema mapping is an open-ended search problem (521 possible targets). Transform generation is a constrained code-writing problem (limited function vocabulary, clear input/output types). The additional reasoning capacity of Tier 3 models provides no benefit for constrained tasks.

2. Overthinking Leads to Errors

Larger models may generate more elaborate solutions when simpler ones suffice. For example, GPT-5.2 sometimes produced complex nested conditionals when a simple $lookup would work, increasing the chance of syntax errors.

3. Training Data Alignment

Tier 2 models may have encountered more JSONata examples during training, or their training emphasized code correctness over sophistication.

The Custom Functions Approach

The near-elimination of function hallucinations (1 instance across 224 generations) validates the custom function approach:

  • Without constraints: Models invent functions like $countryNameToCode() that do not exist
  • With constraints: Models reliably use only approved functions

This suggests a general principle: when deploying LLMs for code generation, define a limited function vocabulary and explicitly enumerate allowed operations in the prompt.

Production Readiness

Experiment 2 results are promising but not production-ready:

Aspect Status Gap
Syntax validity ✅ 91-98% Prompt improvements could reach 99%+
Function compliance ✅ 97.5-100% Acceptable
Execution correctness ❓ Not tested Requires sample data validation
Edge case handling ❓ Not tested Null values, malformed input

The next validation step is execution testing against sample data to verify that syntactically valid expressions produce semantically correct output.

5. Recommendations

5.1 For Production Deployment

Recommended Configurations

Based on our findings, we recommend the following configurations for production use:

Experiment 1: Schema Mapping

Schema Type Model Mode Cost Expected F0.5 Notes
PDP-like (simple names) GPT OSS 120b Batch $0.005 0.71 Best value
PDP-like (max accuracy) GPT-5.2 Batch $0.12 0.74 +4% accuracy for 24x cost
CDTL-like (technical vocabulary) Claude Haiku 4.5 Batch $0.05 0.62 Only reliable option
Any schema (transform detection critical) Claude Haiku/Sonnet Batch $0.05-0.17 0.59-0.62 GPT-5.2 has batch transform bug

Experiment 2: Transform Expression Generation

Use Case Model Cost/Transform Pass Rate Notes
Batch processing Gemini Flash 3 $0.0007 98.2% Cheapest, but 2.5x slower
Interactive use Claude Haiku 4.5 $0.0023 96.4% Best speed/accuracy balance
Avoid Claude Sonnet, GPT-5.2 $0.005-0.006 91-95% Higher cost, lower accuracy

Recommended Workflow

Recommended Schema Mapping Workflow

Figure 6: The recommended human-in-the-loop workflow for AI-assisted schema mapping.

Implementation Checklist

Before deployment:

  • Identify schema characteristics (simple vs. technical vocabulary)
  • Select appropriate model based on schema type
  • Test on a sample of fields to verify model works for your specific schema
  • Establish human review process and assign reviewers

During processing:

  • Use batch mode (all fields in single call)
  • Set temperature to 0.0 for reproducibility
  • Capture full AI output for audit trail
  • Log costs and performance metrics

After initial processing:

  • Identify low-confidence predictions (confidence < 4)
  • Re-run low-confidence mappings with focused prompting (provide only top candidate targets, request detailed reasoning)
  • If confidence remains low after retry, flag for mandatory human review

Human review:

  • Sort remaining predictions by confidence score (low to high)
  • Human reviewer verifies all low-confidence mappings
  • Human reviewer spot-checks sample of high-confidence mappings
  • Human reviewer spot-checks sample of "no match" predictions
  • Human reviewer searches for potentially missed mappings among low-confidence "no match" results
  • Document any corrections for future reference

Transform expression generation (after mapping complete):

  • Define custom function library for common transformation patterns
  • Include explicit function constraints in prompt
  • Add prompt guidance for quoting field names with special characters
  • Validate syntax of all generated expressions
  • Check for function hallucinations (calls to undefined functions)
  • Test expressions against sample data before production use
  • Human review of complex expressions (nested logic, multiple conditionals)

5.2 For Future Research

High-Priority Investigations

1. Retrieval-Augmented Pipeline

Instead of providing all 521 target fields to the LLM, first use embeddings to retrieve the top 10 candidates, then ask the LLM to select among them (or "none of the above").

Rationale: This combines the broad coverage of embedding methods with the semantic precision of LLMs. It may reduce false positives by giving the LLM an explicit "none" option with a manageable candidate set.

Expected outcome: Precision improvement from ~70% to ~85%+ while maintaining recall.

2. Multi-Pass Refinement Pipeline

Develop a systematic multi-step approach where low-confidence predictions trigger a second LLM call with enhanced context: narrowed candidate list, explicit comparison of top options, and chain-of-thought reasoning.

Rationale: Low confidence often indicates genuine ambiguity between similar targets. A focused follow-up call that presents only the top 3-5 candidates and asks the model to explicitly compare them may resolve uncertainty more effectively than immediate escalation to human review.

Expected outcome: Reduction in low-confidence predictions requiring human review; improved accuracy on borderline cases.

3. Confidence Threshold Calibration

Systematically evaluate the relationship between AI-reported confidence scores and prediction accuracy to identify optimal thresholds for retry logic and human review routing.

Rationale: The current implementation captures confidence scores (1-5 scale) for each prediction. Calibration analysis could determine the optimal threshold below which predictions require a refinement pass versus those that can proceed directly to human review.

4. Prompt Optimization

Test alternative prompting strategies:

  • Few-shot examples of correct mappings
  • Chain-of-thought reasoning
  • Structured comparison of top candidates

Rationale: We used a single frozen prompt. Systematic prompt engineering might improve results.

5. Cross-Schema Validation

Test on additional education data schemas to verify generalizability:

  • Common Education Data Standards (CEDS)
  • Ed-Fi data standard
  • SIF (Schools Interoperability Framework)

Experiment 2 Follow-Up

6. Execution Testing

Validate generated expressions by running them against sample source data and comparing output to expected values. This is the critical next step for production readiness.

7. Field Name Quoting Prompt Enhancement

Test whether adding explicit guidance for quoting field names with special characters (@, spaces) improves CDTL syntax validity from 93-96% to 99%+.

8. Few-Shot Examples

Test whether providing 2-3 example transforms in the prompt improves handling of complex nested transformations.

Lower-Priority Investigations

9. Fine-Tuning

Train a model specifically on schema mapping examples. This would require a larger labeled dataset but might improve accuracy.

10. Ensemble Methods

Require agreement across multiple models before proposing a mapping. This might improve precision at the cost of recall and increased expense.

11. Active Learning

Develop a system that identifies the most informative examples for human labeling, building ground truth more efficiently.

5.3 What Won't Help

Based on our findings, we recommend against the following approaches:

Using Larger Models

Don't do this: "Let's use GPT-5.2 (or Opus) everywhere because it's the best model."

Why it won't help: Larger models provide only marginal improvement (+4%) at dramatically higher cost (24x). The over-matching problem persists regardless of model size. Save money with smaller models; invest in human review instead.

Using Standard (Field-by-Field) Mode

Don't do this: "Let's process one field at a time for more detailed reasoning."

Why it won't help: Standard mode costs 20-50x more and produces equal or worse results. The AI loses cross-field context and consistency. Batch mode is strictly superior.

Using Non-Claude Models for Technical Schemas

Don't do this: "GPT OSS 120b is cheapest, so let's use it for CDTL."

Why it won't help: GPT OSS 120b hallucinates 53% of the time on CDTL, inventing target fields that don't exist. The cost savings are meaningless if the output is unusable. Pay the $0.05 for Claude Haiku.

Relying on Precision Alone

Don't do this: "Our precision is 71%, so 71% of mappings will be correct."

Why it won't help: Precision measures proposals, not the final mapping. A 71% precision model that makes 35 proposals will have ~25 correct and ~10 wrong. The 10 wrong ones still require human identification and correction.

Eliminating Human Review

Don't do this: "The AI is pretty accurate, so let's just use its output directly."

Why it won't help: Even our best model is wrong 29% of the time when proposing a mapping. Wrong mappings cause data to flow to incorrect destinations, potentially corrupting downstream systems. Human review is not optional; it is essential.

Using Tier 3 Models for Transform Generation

Don't do this: "Claude Sonnet is our best model, so use it for generating transforms too."

Why it won't help: Tier 3 models perform worse than Tier 2 on transform generation (91-95% vs 96-98%) while costing 5-8x more. Use Gemini Flash 3 for batch processing or Claude Haiku 4.5 for interactive use.

Skipping Execution Testing

Don't do this: "The expressions have valid syntax, so they must work correctly."

Why it won't help: Syntax validity does not guarantee semantic correctness. An expression like $lookup({'M': 'Male'}, gender) is syntactically valid but will fail if the data contains 'MALE' instead of 'M'. Execution testing against sample data is required before production use.

6. Conclusion

Summary of Findings

These experiments compared methods for automatically mapping education data schemas (Experiment 1) and generating transformation expressions (Experiment 2). After 41 experiment runs across two schemas (PDP and CDTL), we reached six main conclusions:

1. AI language models represent a genuine capability leap over traditional methods.

The improvement from Tier 1 (text matching) to Tier 2 (small AI models) was substantial: +51% improvement in F0.5 score, +59% improvement in precision. Traditional methods found the right target 44% of the time when proposing a match; AI models achieved 71%. This is not incremental progress; it is a qualitative change in what automation can accomplish.

2. Small AI models match large ones at a fraction of the cost.

The smallest model we tested (GPT OSS 120b at $0.005 per run) achieved 96% of the accuracy of the largest model (GPT-5.2 at $0.12 per run). Hypothesis H2 (that returns diminish at scale) was confirmed. For schema mapping, the era of "bigger is always better" does not apply.

3. Batch processing outperforms field-by-field approaches.

Processing all fields in a single AI call produces better results (higher F0.5) AND costs 20-50x less than processing each field individually. This was a surprising finding with significant practical implications: the obvious approach (one field at a time) is inferior to the simpler approach (all at once).

4. Schema characteristics determine model selection.

Not all models work for all schemas. Technical vocabularies like CDTL cause non-Claude models to hallucinate extensively (45-53% hallucination rates). Model selection must account for schema characteristics, not just cost and accuracy benchmarks.

5. AI can generate valid transform expressions at high accuracy.

Experiment 2 demonstrated that LLMs can write executable JSONata transformation code with 91-98% success rates. When constrained to an approved function library, models achieved near-perfect function compliance (only 1 hallucination across 224 generations).

6. For transform generation, smaller models are better.

In a reversal of typical expectations, Tier 2 models (Gemini Flash 3, Claude Haiku) outperformed Tier 3 models (Claude Sonnet, GPT-5.2) on transform generation while costing 5-8x less. Constrained code generation tasks do not benefit from additional model capacity.

Practical Impact

For a typical 141-field schema like PDP:

Aspect Manual Approach AI-Assisted Approach
Schema mapping
Processing time 8-16 hours 30 seconds + 1-3 hours review
Cost Expert labor $0.005 - $0.17
Accuracy Variable (human error) 71% precision, 83% recall
Scalability Linear with fields Near-constant per schema
Transform generation (for ~12 fields requiring transforms)
Processing time 2-4 hours 1 minute + 15-30 min review
Cost Developer time $0.01 - $0.03
Accuracy Variable 96-98% valid syntax

AI does not eliminate human expertise. It transforms the task from "search through 521 options per field" to "review low-confidence predictions and spot-check high-confidence ones." This is a significant productivity improvement, but schema mapping remains a human-in-the-loop process.

Recommendations in Brief

For schema mapping (Experiment 1):

  • Use batch mode (always)
  • For PDP-like schemas: GPT OSS 120b ($0.005)
  • For CDTL-like schemas: Claude Haiku ($0.05)
  • Prioritize human review for low-confidence predictions

For transform generation (Experiment 2):

  • Define a custom function library for common patterns
  • Use Gemini Flash 3 for batch ($0.0007/transform) or Claude Haiku for interactive ($0.002/transform)
  • Avoid Tier 3 models (lower accuracy at higher cost)
  • Validate expressions against sample data before production

For future research:

  • Retrieval-augmented pipeline (embeddings + LLM verification)
  • Execution testing of generated transform expressions
  • Validation on additional education data standards

Final Thoughts

Schema mapping and transform generation exemplify a broader pattern in AI assistance: the technology is most valuable when it narrows options for human judgment, not when it replaces human judgment entirely. An AI that is right 71% of the time on mapping and 98% on code generation sounds impressive, but the remaining errors can cause real harm to data quality.

The winning approach is not "find the smartest AI" but "find the cheapest AI that meets quality thresholds, then invest the savings in human review." For schema mapping in January 2026, that means small models in batch mode, verified by domain experts. For transform generation, it means Tier 2 models with constrained function vocabularies.

A surprising finding from both experiments: bigger is not always better. In Experiment 1, Tier 3 models provided only marginal gains over Tier 2. In Experiment 2, Tier 2 models actually outperformed Tier 3. This challenges the assumption that frontier models are always the right choice.

The question is not whether AI can help with education data integration (it clearly can). The question is how to deploy it responsibly, with appropriate expectations and human oversight. We hope these experiments provide a foundation for answering that question.


Appendices

The following detailed reports are compiled in APPENDICES.md (and APPENDICES.docx). They are also embedded in the Interactive HTML Report.

Appendix Title Description
A Experimental Protocol Full experimental protocol, prompt templates, configuration
B Tier 1 Analysis Detailed Tier 1 (embedding/NLP) analysis with examples
C Tier 2/3 Analysis Detailed LLM analysis, model comparisons, batch findings
D Batch Mode Analysis Batch vs. standard mode comparison
E Silver Data Quality Reference data quality, Silver vs. Gold agreement
F Transform Detection Transform detection analysis by model and schema
G Variance Analysis Variance and reproducibility analysis
H T2S Experiment Target-to-source direction experiment
I Transform Expression Generation Experiment 2 detailed analysis, failure patterns, expression examples

The Interactive HTML Report (report.html) provides charts and full run data.


Report completed January 31, 2026 Experiments conducted January 17-31, 2026 Total experiment runs: 41 (33 mapping + 8 transform) Experiment 1: 280 fields evaluated (141 PDP + 139 CDTL) Experiment 2: 448 transform generations (56 transforms × 8 runs)