Compare embedding/NLP methods against LLMs for mapping PDP fields to LIF schema. Determine minimum viable model complexity for reliable schema mapping.
Raw CSV (in /data/schemas/ and /data/gold/):
pdp_schema_20260108.csv- Source schema (141 fields)reduced_lif_schema_20260108.csv- Target schema (521 fields)pdp_to_lif_mapping_20260109_w_manual_check.csv- Ground truth
Reference JSON (in /data/reference/, run python data/scripts/build_reference.py):
/data/reference/
├── pdp_gold_eval.json # 29 positive + 479 invalid targets
├── cdtl_gold_eval.json # 64 positive + 323 invalid targets
└── lif_fields.json # 521 LIF fields with has_source flag
Evaluation data:
| Schema | Positive Mappings | Invalid Targets | Total Eval Points |
|---|---|---|---|
| PDP | 29 (28 CLEAN, 1 SPLIT) | 492 | 521 |
| CDTL | 66 (64 CLEAN, 2 SPLIT) | 454 | 520 |
- Positive mappings: Verified source→target pairs (TP if matched)
- Invalid targets: LIF fields with NO source (FP if predicted)
LIF combined status (lif_fields.json):
has_source=true: 92 fieldshas_source=false: 422 fieldshas_source=null: 7 fields
Path format: Schema.Path.Field (e.g., Person.Name.firstName).
| Type | Definition |
|---|---|
| CLEAN_MATCH | Direct 1:1 correspondence |
| SPLIT | One source → multiple targets |
| MERGE | Multiple sources → one target |
| NO_MATCH | No corresponding target |
| ID | Method | Library | Notes |
|---|---|---|---|
| E-01 | Fuzzy Matching | RapidFuzz | Multi-algorithm ensemble (6 algorithms incl. Jaro-Winkler, best score wins) with component-weighted scoring |
| E-02 | Static Embedding | potion-retrieval-32M | CPU-fast, 512-dim, retrieval-optimized |
| E-03 | Sentence Transformer A | all-MiniLM-L6-v2 | 384-dim, fast |
| E-04 | Sentence Transformer B | bge-large-en-v1.5 | 1024-dim, SOTA |
Protocol (Embeddings E-02 to E-04):
- Field representation: Simplify LIF paths to reduce noise:
- Source: normalize field name (expand abbreviations, split underscores) + description
- Target: schema + leaf field name + description (e.g.,
Person.Identifier.identifier→Person identifier) - Rationale: Deep paths like
CompetencyFramework.Competency.ProficiencyDefinition.identifieradd token noise without semantic value
- Generate embeddings for source and target representations
- Compute cosine similarity matrix
- For each source, rank targets by similarity
- Apply threshold T ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 0.95}
- If top score > T: propose match; else: NO_MATCH
Protocol (Fuzzy Matching E-01):
- Normalize text: split camelCase, expand abbreviations (e.g.,
id→identifier) - Score each source-target pair using 6 RapidFuzz algorithms (ratio, partial_ratio, token_sort_ratio, token_set_ratio, WRatio, Jaro-Winkler), take maximum
- Compute weighted combination: 70% field name match + 15% description + 10% schema context + 5% full text
- Apply threshold (typically 0.90-0.95 for fuzzy vs 0.7-0.8 for embeddings—fuzzy scores reflect character overlap, not semantic similarity)
| ID | Model | Provider | Notes |
|---|---|---|---|
| L-01 | GPT OSS 120b | OpenAI (OSS) | ~5B active params, very fast, cheap |
| L-02 | Llama 4 Maverick | Meta (Groq) | 17Bx128E MoE, 128k context, 594 TPS |
| L-03 | Claude Haiku 4.5 | Anthropic | ~20B, fast, good instruction following |
| L-04 | GPT-5-mini | OpenAI | ~25B, efficient, strong reasoning for size |
| L-05 | Gemini Flash 3 | ~30B, optimized for speed |
| ID | Model | Provider | Notes |
|---|---|---|---|
| L-06 | Claude Sonnet 4.5 | Anthropic | Strong reasoning, balanced cost |
| L-07 | Gemini Pro 3 | Google's SOTA model | |
| L-08 | GPT-5.2 | OpenAI | OpenAI's flagship, strongest GPT reasoning |
LLM Protocol:
- Temperature: 0.0 (not supported by GPT-5 models, omitted for those)
- Max tokens: 500
- Reasoning effort: "low" (for reasoning models: GPT-5, GPT-OSS, Gemini 3)
- Output: Structured JSON
Note on Reasoning Models: GPT-5 series, GPT-OSS-120b, and Gemini 3 are reasoning models that use internal multi-pass reasoning before generating output. We set reasoning_effort="low" to minimize token consumption and variability while still producing reliable output. See OpenAI reasoning docs.
You are a data schema mapping expert. Map the source field to the target schema.
SOURCE SCHEMA: Postsecondary Data Partnership (PDP)
TARGET SCHEMA: Learner Information Framework (LIF)
TARGET SCHEMA FIELDS:
{target_field_list}
SOURCE FIELD:
- Name: {source_field_name}
- Description: {source_field_description}
TASK:
1. Identify the best matching target field(s), or indicate NO_MATCH if none exists
2. Classify the mapping type: CLEAN_MATCH, SPLIT, MERGE, or NO_MATCH
3. Indicate if data transformation is required (beyond simple renaming)
Respond ONLY with valid JSON:
{
"source_field": "{source_field_name}",
"mapping_type": "<CLEAN_MATCH|SPLIT|MERGE|NO_MATCH>",
"target_fields": ["<field1>", "<field2>"] or "NO_MATCH",
"requires_transform": <true|false>,
"transform_description": "<if applicable, otherwise null>",
"confidence": <1-5>,
"reasoning": "<brief explanation>"
}
IMPORTANT:
- target_fields must contain ONLY fields from the TARGET SCHEMA FIELDS list above
- If no match exists, use mapping_type "NO_MATCH" and target_fields "NO_MATCH"
| Metric | Formula | Priority |
|---|---|---|
| Precision | TP / (TP + FP) | High - false positives are dangerous |
| Recall | TP / (TP + FN) | Medium |
| F0.5 | 1.25 × P × R / (0.25P + R) | Primary summary metric |
| Hallucination Rate | Invalid targets / Total predictions | High |
| No-Match Precision | Correct no-match / All no-match predictions | Medium |
These metrics account for the imbalanced dataset (29 positives vs 112 negatives):
| Metric | Formula | What It Measures |
|---|---|---|
| Accuracy | (TP + TN) / Total | Overall correctness including "no match" |
| Specificity | TN / (TN + FP) | How well model rejects non-matches |
| Balanced Accuracy | (Recall + Specificity) / 2 | Average of finding matches AND rejecting non-matches |
| MCC | (TP×TN - FP×FN) / √(...) | Correlation coefficient, -1 to +1 (0 = random) |
Why these matter: A model that always says "no match" achieves 79.4% accuracy but 0% recall and 0 MCC. High accuracy alone doesn't indicate useful performance. MCC and Balanced Accuracy penalize this degenerate strategy.
Interpreting MCC:
- MCC = +1: Perfect prediction
- MCC = 0: No better than random
- MCC = -1: Perfect inverse (always wrong)
For this task, MCC > 0.3 indicates the model is learning meaningful patterns beyond just being conservative.
| Outcome | Definition |
|---|---|
| TP | Correct target proposed |
| FP | Wrong target OR target when gold says NO_MATCH |
| FN | NO_MATCH when gold has target OR no prediction made |
| TN | Correct NO_MATCH identification (source has no valid target, model says NO_MATCH) |
| Hallucination | Target field doesn't exist in schema (subset of FP) |
- Cost per Correct Mapping: Total API cost / TP count
- ACR (Accuracy-Cost Ratio): F0.5 / normalized cost (Opus = 1.0)
Each run produces:
{
"run_id": "E-01-2026-01-15-001",
"method": "all-MiniLM-L6-v2",
"threshold": 0.7,
"timestamp": "2026-01-15T10:00:00Z",
"results": [
{
"source_field": "student_id",
"predicted_target": "Person.identifier",
"predicted_type": "CLEAN_MATCH",
"confidence": 0.89,
"gold_target": "Person.identifier",
"gold_match": true,
"is_hallucination": false
}
],
"metrics": {
"precision": 0.85,
"recall": 0.78,
"f05": 0.83,
"f1": 0.81,
"hallucination_rate": 0.02,
"no_match_precision": 0.70
},
"cost_usd": 0.0,
"latency_ms": 1234
}code/
├── config.py # Paths, model configs, thresholds
├── data_loader.py # Load schemas and gold dataset
├── embeddings.py # Tier 1: embedding methods
├── llm_mapper.py # Tier 2-3: LLM API calls
├── evaluator.py # Compute metrics against gold
├── tracker.py # SQLite experiment tracking
├── run_experiment.py # Main orchestrator
└── utils.py # JSON validation, logging
SQLite Database: outputs/experiment.db
runs: Run metadata (model, schema, timestamp, status)metrics: Evaluation metrics (precision, recall, F0.5, MCC, etc.)costs: LLM costs (tokens, USD)predictions: Individual field predictions
JSON Files: outputs/runs/*.json
- Detailed predictions with raw LLM responses
- Full evaluation results
Use ExperimentTracker for all database operations (create, query, delete runs).
- Phase 1: Run all Tier 1 methods (E-01 to E-04) across all thresholds
- Phase 2: Run Tier 2 LLMs (L-01 to L-05)
- Phase 3: Run Tier 3 LLMs (L-06 to L-08)
- Phase 4: Generate silver dataset with Opus (type classifications)
- Analysis: Compare results, generate visualizations
Total Evaluations:
- Embedding: 4 methods × 5 thresholds × 141 fields = 2,820 evaluations
- LLM: 8 models × 141 fields = 1,128 evaluations
- Silver generation: 141 fields
| Hypothesis | Test |
|---|---|
| H1: Embeddings sufficient for clean matches | Tier 1 MCC > 0.5 (indicates reliable classification beyond chance) |
| H2: Diminishing returns at scale | Δ(F0.5) T1→T2 > Δ(F0.5) T2→T3 |
| H3: Transform detection needs reasoning | Tier 3 >> Tier 1/2 on transform accuracy (see /synthetic/h3_transforms/) |
| H4: Type classification scales with model | Type accuracy increases T1 → T2 → T3 (479 invalid targets for FP detection) |
Sample Size Limitation: The gold dataset contains only 29 PDP mappings (28 CLEAN_MATCH, 1 SPLIT) and 66 CDTL mappings. This limits statistical power.
Why MCC for H1: Precision alone is misleading with imbalanced data—a model can achieve 100% precision by being extremely conservative (few predictions). MCC captures overall classification quality:
- MCC > 0.3: Model learning patterns beyond chance
- MCC > 0.5: Practically useful classification performance
- MCC = 1.0: Perfect classification
Reporting Requirements:
- Point estimates with 95% CI where applicable
- Bootstrap resampling (n=1000) for composite metrics (F0.5, ACR)
- Explicit acknowledgment when differences are not statistically significant
Transform evaluation uses Gold-B, which includes verified transform flags and descriptions.
Gold-B = Silver (Opus-generated) + human expert verification/corrections
| Schema | Transform Fields | Categories |
|---|---|---|
| PDP | 12 | Date conversion, country codes, enum mappings, race parsing |
| CDTL | 44 | Enum mappings, object parsing, narrative extraction |
Reference files: data/reference/{pdp,cdtl}_goldb_eval.json
Generation script: data/scripts/build_goldb_reference.py
Transform data comes from verified CSV files with human expert corrections:
data/silver/pdp_silver_batch_verified_by_gold_tara_additions_*.csvdata/silver/cdtl_silver_batch_verified_by_gold_tara_additions_*.csv
| Model | Input | Output | Cached | Provider |
|---|---|---|---|---|
| Claude Sonnet 4.5 | $3.00 | $15.00 | $0.30 | Anthropic |
| Gemini Pro 3 | $2.00 | $12.00 | $0.20 | |
| GPT-5.2 | $1.75 | $14.00 | $0.175 | OpenAI |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.10 | Anthropic |
| Gemini Flash 3 | $0.50 | $3.00 | $0.05 | |
| GPT-5-mini | $0.25 | $2.00 | $0.025 | OpenAI |
| GPT OSS 120b (Groq) | $0.15 | $0.60 | $0.075 | OpenAI (Open) |
| Llama 4 Maverick (Groq) | $0.11 | $0.34 | $0.055 | Meta (Open) |
# Core
pandas
numpy
# Embeddings
sentence-transformers
rapidfuzz
model2vec # for potion-base-32M
# LLM APIs
anthropic
google-generativeai
openai
# Evaluation
scikit-learn