Experiment 1: Schema Mapping Baseline

Objective

Compare embedding/NLP methods against LLMs for mapping PDP fields to LIF schema. Determine minimum viable model complexity for reliable schema mapping.

Data Files

Raw CSV (in /data/schemas/ and /data/gold/):

pdp_schema_20260108.csv - Source schema (141 fields)
reduced_lif_schema_20260108.csv - Target schema (521 fields)
pdp_to_lif_mapping_20260109_w_manual_check.csv - Ground truth

Reference JSON (in /data/reference/, run python data/scripts/build_reference.py):

/data/reference/
├── pdp_gold_eval.json   # 29 positive + 479 invalid targets
├── cdtl_gold_eval.json  # 64 positive + 323 invalid targets
└── lif_fields.json      # 521 LIF fields with has_source flag

Evaluation data:

Schema	Positive Mappings	Invalid Targets	Total Eval Points
PDP	29 (28 CLEAN, 1 SPLIT)	492	521
CDTL	66 (64 CLEAN, 2 SPLIT)	454	520

Positive mappings: Verified source→target pairs (TP if matched)
Invalid targets: LIF fields with NO source (FP if predicted)

LIF combined status (lif_fields.json):

has_source=true: 92 fields
has_source=false: 422 fields
has_source=null: 7 fields

Path format: Schema.Path.Field (e.g., Person.Name.firstName).

Mapping Types

Type	Definition
CLEAN_MATCH	Direct 1:1 correspondence
SPLIT	One source → multiple targets
MERGE	Multiple sources → one target
NO_MATCH	No corresponding target

Model Tiers to Test

Tier 1: Embedding/NLP (Free, Local)

ID	Method	Library	Notes
E-01	Fuzzy Matching	RapidFuzz	Multi-algorithm ensemble (6 algorithms incl. Jaro-Winkler, best score wins) with component-weighted scoring
E-02	Static Embedding	potion-retrieval-32M	CPU-fast, 512-dim, retrieval-optimized
E-03	Sentence Transformer A	all-MiniLM-L6-v2	384-dim, fast
E-04	Sentence Transformer B	bge-large-en-v1.5	1024-dim, SOTA

Protocol (Embeddings E-02 to E-04):

Field representation: Simplify LIF paths to reduce noise:
- Source: normalize field name (expand abbreviations, split underscores) + description
- Target: schema + leaf field name + description (e.g., Person.Identifier.identifier → Person identifier)
- Rationale: Deep paths like CompetencyFramework.Competency.ProficiencyDefinition.identifier add token noise without semantic value
Generate embeddings for source and target representations
Compute cosine similarity matrix
For each source, rank targets by similarity
Apply threshold T ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 0.95}
If top score > T: propose match; else: NO_MATCH

Protocol (Fuzzy Matching E-01):

Normalize text: split camelCase, expand abbreviations (e.g., id → identifier)
Score each source-target pair using 6 RapidFuzz algorithms (ratio, partial_ratio, token_sort_ratio, token_set_ratio, WRatio, Jaro-Winkler), take maximum
Compute weighted combination: 70% field name match + 15% description + 10% schema context + 5% full text
Apply threshold (typically 0.90-0.95 for fuzzy vs 0.7-0.8 for embeddings—fuzzy scores reflect character overlap, not semantic similarity)

Tier 2: Small LLMs (ordered by size)

ID	Model	Provider	Notes
L-01	GPT OSS 120b	OpenAI (OSS)	~5B active params, very fast, cheap
L-02	Llama 4 Maverick	Meta (Groq)	17Bx128E MoE, 128k context, 594 TPS
L-03	Claude Haiku 4.5	Anthropic	~20B, fast, good instruction following
L-04	GPT-5-mini	OpenAI	~25B, efficient, strong reasoning for size
L-05	Gemini Flash 3	Google	~30B, optimized for speed

Tier 3: Large LLMs

ID	Model	Provider	Notes
L-06	Claude Sonnet 4.5	Anthropic	Strong reasoning, balanced cost
L-07	Gemini Pro 3	Google	Google's SOTA model
L-08	GPT-5.2	OpenAI	OpenAI's flagship, strongest GPT reasoning

LLM Protocol:

Temperature: 0.0 (not supported by GPT-5 models, omitted for those)
Max tokens: 500
Reasoning effort: "low" (for reasoning models: GPT-5, GPT-OSS, Gemini 3)
Output: Structured JSON

Note on Reasoning Models: GPT-5 series, GPT-OSS-120b, and Gemini 3 are reasoning models that use internal multi-pass reasoning before generating output. We set reasoning_effort="low" to minimize token consumption and variability while still producing reliable output. See OpenAI reasoning docs.

Prompt Template (v1.0)

You are a data schema mapping expert. Map the source field to the target schema.

SOURCE SCHEMA: Postsecondary Data Partnership (PDP)
TARGET SCHEMA: Learner Information Framework (LIF)

TARGET SCHEMA FIELDS:
{target_field_list}

SOURCE FIELD:
- Name: {source_field_name}
- Description: {source_field_description}

TASK:
1. Identify the best matching target field(s), or indicate NO_MATCH if none exists
2. Classify the mapping type: CLEAN_MATCH, SPLIT, MERGE, or NO_MATCH
3. Indicate if data transformation is required (beyond simple renaming)

Respond ONLY with valid JSON:
{
  "source_field": "{source_field_name}",
  "mapping_type": "<CLEAN_MATCH|SPLIT|MERGE|NO_MATCH>",
  "target_fields": ["<field1>", "<field2>"] or "NO_MATCH",
  "requires_transform": <true|false>,
  "transform_description": "<if applicable, otherwise null>",
  "confidence": <1-5>,
  "reasoning": "<brief explanation>"
}

IMPORTANT:
- target_fields must contain ONLY fields from the TARGET SCHEMA FIELDS list above
- If no match exists, use mapping_type "NO_MATCH" and target_fields "NO_MATCH"

Metrics

Primary (Tier 1 Evaluation)

Metric	Formula	Priority
Precision	TP / (TP + FP)	High - false positives are dangerous
Recall	TP / (TP + FN)	Medium
F0.5	1.25 × P × R / (0.25P + R)	Primary summary metric
Hallucination Rate	Invalid targets / Total predictions	High
No-Match Precision	Correct no-match / All no-match predictions	Medium

Class-Imbalance Metrics

These metrics account for the imbalanced dataset (29 positives vs 112 negatives):

Metric	Formula	What It Measures
Accuracy	(TP + TN) / Total	Overall correctness including "no match"
Specificity	TN / (TN + FP)	How well model rejects non-matches
Balanced Accuracy	(Recall + Specificity) / 2	Average of finding matches AND rejecting non-matches
MCC	(TP×TN - FP×FN) / √(...)	Correlation coefficient, -1 to +1 (0 = random)

Why these matter: A model that always says "no match" achieves 79.4% accuracy but 0% recall and 0 MCC. High accuracy alone doesn't indicate useful performance. MCC and Balanced Accuracy penalize this degenerate strategy.

Interpreting MCC:

MCC = +1: Perfect prediction
MCC = 0: No better than random
MCC = -1: Perfect inverse (always wrong)

For this task, MCC > 0.3 indicates the model is learning meaningful patterns beyond just being conservative.

Confusion Matrix Definitions

Outcome	Definition
TP	Correct target proposed
FP	Wrong target OR target when gold says NO_MATCH
FN	NO_MATCH when gold has target OR no prediction made
TN	Correct NO_MATCH identification (source has no valid target, model says NO_MATCH)
Hallucination	Target field doesn't exist in schema (subset of FP)

Cost Metrics

Cost per Correct Mapping: Total API cost / TP count
ACR (Accuracy-Cost Ratio): F0.5 / normalized cost (Opus = 1.0)

Output Schema

Each run produces:

{
  "run_id": "E-01-2026-01-15-001",
  "method": "all-MiniLM-L6-v2",
  "threshold": 0.7,
  "timestamp": "2026-01-15T10:00:00Z",
  "results": [
    {
      "source_field": "student_id",
      "predicted_target": "Person.identifier",
      "predicted_type": "CLEAN_MATCH",
      "confidence": 0.89,
      "gold_target": "Person.identifier",
      "gold_match": true,
      "is_hallucination": false
    }
  ],
  "metrics": {
    "precision": 0.85,
    "recall": 0.78,
    "f05": 0.83,
    "f1": 0.81,
    "hallucination_rate": 0.02,
    "no_match_precision": 0.70
  },
  "cost_usd": 0.0,
  "latency_ms": 1234
}

Implementation Modules

code/
├── config.py           # Paths, model configs, thresholds
├── data_loader.py      # Load schemas and gold dataset
├── embeddings.py       # Tier 1: embedding methods
├── llm_mapper.py       # Tier 2-3: LLM API calls
├── evaluator.py        # Compute metrics against gold
├── tracker.py          # SQLite experiment tracking
├── run_experiment.py   # Main orchestrator
└── utils.py            # JSON validation, logging

Data Storage

SQLite Database: outputs/experiment.db

runs: Run metadata (model, schema, timestamp, status)
metrics: Evaluation metrics (precision, recall, F0.5, MCC, etc.)
costs: LLM costs (tokens, USD)
predictions: Individual field predictions

JSON Files: outputs/runs/*.json

Detailed predictions with raw LLM responses
Full evaluation results

Use ExperimentTracker for all database operations (create, query, delete runs).

Run Order

Phase 1: Run all Tier 1 methods (E-01 to E-04) across all thresholds
Phase 2: Run Tier 2 LLMs (L-01 to L-05)
Phase 3: Run Tier 3 LLMs (L-06 to L-08)
Phase 4: Generate silver dataset with Opus (type classifications)
Analysis: Compare results, generate visualizations

Total Evaluations:

Embedding: 4 methods × 5 thresholds × 141 fields = 2,820 evaluations
LLM: 8 models × 141 fields = 1,128 evaluations
Silver generation: 141 fields

Success Criteria

Hypothesis	Test
H1: Embeddings sufficient for clean matches	Tier 1 MCC > 0.5 (indicates reliable classification beyond chance)
H2: Diminishing returns at scale	Δ(F0.5) T1→T2 > Δ(F0.5) T2→T3
H3: Transform detection needs reasoning	Tier 3 >> Tier 1/2 on transform accuracy (see `/synthetic/h3_transforms/`)
H4: Type classification scales with model	Type accuracy increases T1 → T2 → T3 (479 invalid targets for FP detection)

Statistical Considerations

Sample Size Limitation: The gold dataset contains only 29 PDP mappings (28 CLEAN_MATCH, 1 SPLIT) and 66 CDTL mappings. This limits statistical power.

Why MCC for H1: Precision alone is misleading with imbalanced data—a model can achieve 100% precision by being extremely conservative (few predictions). MCC captures overall classification quality:

MCC > 0.3: Model learning patterns beyond chance
MCC > 0.5: Practically useful classification performance
MCC = 1.0: Perfect classification

Reporting Requirements:

Point estimates with 95% CI where applicable
Bootstrap resampling (n=1000) for composite metrics (F0.5, ACR)
Explicit acknowledgment when differences are not statistically significant

Transform Ground Truth (Gold-B)

Transform evaluation uses Gold-B, which includes verified transform flags and descriptions.

Gold-B = Silver (Opus-generated) + human expert verification/corrections

Schema	Transform Fields	Categories
PDP	12	Date conversion, country codes, enum mappings, race parsing
CDTL	44	Enum mappings, object parsing, narrative extraction

Reference files: data/reference/{pdp,cdtl}_goldb_eval.json

Generation script: data/scripts/build_goldb_reference.py

Transform data comes from verified CSV files with human expert corrections:

data/silver/pdp_silver_batch_verified_by_gold_tara_additions_*.csv
data/silver/cdtl_silver_batch_verified_by_gold_tara_additions_*.csv

LLM Pricing (per million tokens, as of January 2026)

Model	Input	Output	Cached	Provider
Claude Sonnet 4.5	$3.00	$15.00	$0.30	Anthropic
Gemini Pro 3	$2.00	$12.00	$0.20	Google
GPT-5.2	$1.75	$14.00	$0.175	OpenAI
Claude Haiku 4.5	$1.00	$5.00	$0.10	Anthropic
Gemini Flash 3	$0.50	$3.00	$0.05	Google
GPT-5-mini	$0.25	$2.00	$0.025	OpenAI
GPT OSS 120b (Groq)	$0.15	$0.60	$0.075	OpenAI (Open)
Llama 4 Maverick (Groq)	$0.11	$0.34	$0.055	Meta (Open)

Dependencies

# Core
pandas
numpy

# Embeddings
sentence-transformers
rapidfuzz
model2vec  # for potion-base-32M

# LLM APIs
anthropic
google-generativeai
openai

# Evaluation
scikit-learn

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Experiment 1: Schema Mapping Baseline

Objective

Data Files

Mapping Types

Model Tiers to Test

Tier 1: Embedding/NLP (Free, Local)

Tier 2: Small LLMs (ordered by size)

Tier 3: Large LLMs

Prompt Template (v1.0)

Metrics

Primary (Tier 1 Evaluation)

Class-Imbalance Metrics

Confusion Matrix Definitions

Cost Metrics

Output Schema

Implementation Modules

Data Storage

Run Order

Success Criteria

Statistical Considerations

Transform Ground Truth (Gold-B)

LLM Pricing (per million tokens, as of January 2026)

Dependencies

FilesExpand file tree

DESIGN.md

Latest commit

History

DESIGN.md

File metadata and controls

Experiment 1: Schema Mapping Baseline

Objective

Data Files

Mapping Types

Model Tiers to Test

Tier 1: Embedding/NLP (Free, Local)

Tier 2: Small LLMs (ordered by size)

Tier 3: Large LLMs

Prompt Template (v1.0)

Metrics

Primary (Tier 1 Evaluation)

Class-Imbalance Metrics

Confusion Matrix Definitions

Cost Metrics

Output Schema

Implementation Modules

Data Storage

Run Order

Success Criteria

Statistical Considerations

Transform Ground Truth (Gold-B)

LLM Pricing (per million tokens, as of January 2026)

Dependencies