Skip to content

Latest commit

 

History

History
334 lines (258 loc) · 12.5 KB

File metadata and controls

334 lines (258 loc) · 12.5 KB

Experiment 1: Schema Mapping Baseline

Objective

Compare embedding/NLP methods against LLMs for mapping PDP fields to LIF schema. Determine minimum viable model complexity for reliable schema mapping.

Data Files

Raw CSV (in /data/schemas/ and /data/gold/):

  • pdp_schema_20260108.csv - Source schema (141 fields)
  • reduced_lif_schema_20260108.csv - Target schema (521 fields)
  • pdp_to_lif_mapping_20260109_w_manual_check.csv - Ground truth

Reference JSON (in /data/reference/, run python data/scripts/build_reference.py):

/data/reference/
├── pdp_gold_eval.json   # 29 positive + 479 invalid targets
├── cdtl_gold_eval.json  # 64 positive + 323 invalid targets
└── lif_fields.json      # 521 LIF fields with has_source flag

Evaluation data:

Schema Positive Mappings Invalid Targets Total Eval Points
PDP 29 (28 CLEAN, 1 SPLIT) 492 521
CDTL 66 (64 CLEAN, 2 SPLIT) 454 520
  • Positive mappings: Verified source→target pairs (TP if matched)
  • Invalid targets: LIF fields with NO source (FP if predicted)

LIF combined status (lif_fields.json):

  • has_source=true: 92 fields
  • has_source=false: 422 fields
  • has_source=null: 7 fields

Path format: Schema.Path.Field (e.g., Person.Name.firstName).

Mapping Types

Type Definition
CLEAN_MATCH Direct 1:1 correspondence
SPLIT One source → multiple targets
MERGE Multiple sources → one target
NO_MATCH No corresponding target

Model Tiers to Test

Tier 1: Embedding/NLP (Free, Local)

ID Method Library Notes
E-01 Fuzzy Matching RapidFuzz Multi-algorithm ensemble (6 algorithms incl. Jaro-Winkler, best score wins) with component-weighted scoring
E-02 Static Embedding potion-retrieval-32M CPU-fast, 512-dim, retrieval-optimized
E-03 Sentence Transformer A all-MiniLM-L6-v2 384-dim, fast
E-04 Sentence Transformer B bge-large-en-v1.5 1024-dim, SOTA

Protocol (Embeddings E-02 to E-04):

  1. Field representation: Simplify LIF paths to reduce noise:
    • Source: normalize field name (expand abbreviations, split underscores) + description
    • Target: schema + leaf field name + description (e.g., Person.Identifier.identifierPerson identifier)
    • Rationale: Deep paths like CompetencyFramework.Competency.ProficiencyDefinition.identifier add token noise without semantic value
  2. Generate embeddings for source and target representations
  3. Compute cosine similarity matrix
  4. For each source, rank targets by similarity
  5. Apply threshold T ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 0.95}
  6. If top score > T: propose match; else: NO_MATCH

Protocol (Fuzzy Matching E-01):

  1. Normalize text: split camelCase, expand abbreviations (e.g., ididentifier)
  2. Score each source-target pair using 6 RapidFuzz algorithms (ratio, partial_ratio, token_sort_ratio, token_set_ratio, WRatio, Jaro-Winkler), take maximum
  3. Compute weighted combination: 70% field name match + 15% description + 10% schema context + 5% full text
  4. Apply threshold (typically 0.90-0.95 for fuzzy vs 0.7-0.8 for embeddings—fuzzy scores reflect character overlap, not semantic similarity)

Tier 2: Small LLMs (ordered by size)

ID Model Provider Notes
L-01 GPT OSS 120b OpenAI (OSS) ~5B active params, very fast, cheap
L-02 Llama 4 Maverick Meta (Groq) 17Bx128E MoE, 128k context, 594 TPS
L-03 Claude Haiku 4.5 Anthropic ~20B, fast, good instruction following
L-04 GPT-5-mini OpenAI ~25B, efficient, strong reasoning for size
L-05 Gemini Flash 3 Google ~30B, optimized for speed

Tier 3: Large LLMs

ID Model Provider Notes
L-06 Claude Sonnet 4.5 Anthropic Strong reasoning, balanced cost
L-07 Gemini Pro 3 Google Google's SOTA model
L-08 GPT-5.2 OpenAI OpenAI's flagship, strongest GPT reasoning

LLM Protocol:

  • Temperature: 0.0 (not supported by GPT-5 models, omitted for those)
  • Max tokens: 500
  • Reasoning effort: "low" (for reasoning models: GPT-5, GPT-OSS, Gemini 3)
  • Output: Structured JSON

Note on Reasoning Models: GPT-5 series, GPT-OSS-120b, and Gemini 3 are reasoning models that use internal multi-pass reasoning before generating output. We set reasoning_effort="low" to minimize token consumption and variability while still producing reliable output. See OpenAI reasoning docs.

Prompt Template (v1.0)

You are a data schema mapping expert. Map the source field to the target schema.

SOURCE SCHEMA: Postsecondary Data Partnership (PDP)
TARGET SCHEMA: Learner Information Framework (LIF)

TARGET SCHEMA FIELDS:
{target_field_list}

SOURCE FIELD:
- Name: {source_field_name}
- Description: {source_field_description}

TASK:
1. Identify the best matching target field(s), or indicate NO_MATCH if none exists
2. Classify the mapping type: CLEAN_MATCH, SPLIT, MERGE, or NO_MATCH
3. Indicate if data transformation is required (beyond simple renaming)

Respond ONLY with valid JSON:
{
  "source_field": "{source_field_name}",
  "mapping_type": "<CLEAN_MATCH|SPLIT|MERGE|NO_MATCH>",
  "target_fields": ["<field1>", "<field2>"] or "NO_MATCH",
  "requires_transform": <true|false>,
  "transform_description": "<if applicable, otherwise null>",
  "confidence": <1-5>,
  "reasoning": "<brief explanation>"
}

IMPORTANT:
- target_fields must contain ONLY fields from the TARGET SCHEMA FIELDS list above
- If no match exists, use mapping_type "NO_MATCH" and target_fields "NO_MATCH"

Metrics

Primary (Tier 1 Evaluation)

Metric Formula Priority
Precision TP / (TP + FP) High - false positives are dangerous
Recall TP / (TP + FN) Medium
F0.5 1.25 × P × R / (0.25P + R) Primary summary metric
Hallucination Rate Invalid targets / Total predictions High
No-Match Precision Correct no-match / All no-match predictions Medium

Class-Imbalance Metrics

These metrics account for the imbalanced dataset (29 positives vs 112 negatives):

Metric Formula What It Measures
Accuracy (TP + TN) / Total Overall correctness including "no match"
Specificity TN / (TN + FP) How well model rejects non-matches
Balanced Accuracy (Recall + Specificity) / 2 Average of finding matches AND rejecting non-matches
MCC (TP×TN - FP×FN) / √(...) Correlation coefficient, -1 to +1 (0 = random)

Why these matter: A model that always says "no match" achieves 79.4% accuracy but 0% recall and 0 MCC. High accuracy alone doesn't indicate useful performance. MCC and Balanced Accuracy penalize this degenerate strategy.

Interpreting MCC:

  • MCC = +1: Perfect prediction
  • MCC = 0: No better than random
  • MCC = -1: Perfect inverse (always wrong)

For this task, MCC > 0.3 indicates the model is learning meaningful patterns beyond just being conservative.

Confusion Matrix Definitions

Outcome Definition
TP Correct target proposed
FP Wrong target OR target when gold says NO_MATCH
FN NO_MATCH when gold has target OR no prediction made
TN Correct NO_MATCH identification (source has no valid target, model says NO_MATCH)
Hallucination Target field doesn't exist in schema (subset of FP)

Cost Metrics

  • Cost per Correct Mapping: Total API cost / TP count
  • ACR (Accuracy-Cost Ratio): F0.5 / normalized cost (Opus = 1.0)

Output Schema

Each run produces:

{
  "run_id": "E-01-2026-01-15-001",
  "method": "all-MiniLM-L6-v2",
  "threshold": 0.7,
  "timestamp": "2026-01-15T10:00:00Z",
  "results": [
    {
      "source_field": "student_id",
      "predicted_target": "Person.identifier",
      "predicted_type": "CLEAN_MATCH",
      "confidence": 0.89,
      "gold_target": "Person.identifier",
      "gold_match": true,
      "is_hallucination": false
    }
  ],
  "metrics": {
    "precision": 0.85,
    "recall": 0.78,
    "f05": 0.83,
    "f1": 0.81,
    "hallucination_rate": 0.02,
    "no_match_precision": 0.70
  },
  "cost_usd": 0.0,
  "latency_ms": 1234
}

Implementation Modules

code/
├── config.py           # Paths, model configs, thresholds
├── data_loader.py      # Load schemas and gold dataset
├── embeddings.py       # Tier 1: embedding methods
├── llm_mapper.py       # Tier 2-3: LLM API calls
├── evaluator.py        # Compute metrics against gold
├── tracker.py          # SQLite experiment tracking
├── run_experiment.py   # Main orchestrator
└── utils.py            # JSON validation, logging

Data Storage

SQLite Database: outputs/experiment.db

  • runs: Run metadata (model, schema, timestamp, status)
  • metrics: Evaluation metrics (precision, recall, F0.5, MCC, etc.)
  • costs: LLM costs (tokens, USD)
  • predictions: Individual field predictions

JSON Files: outputs/runs/*.json

  • Detailed predictions with raw LLM responses
  • Full evaluation results

Use ExperimentTracker for all database operations (create, query, delete runs).

Run Order

  1. Phase 1: Run all Tier 1 methods (E-01 to E-04) across all thresholds
  2. Phase 2: Run Tier 2 LLMs (L-01 to L-05)
  3. Phase 3: Run Tier 3 LLMs (L-06 to L-08)
  4. Phase 4: Generate silver dataset with Opus (type classifications)
  5. Analysis: Compare results, generate visualizations

Total Evaluations:

  • Embedding: 4 methods × 5 thresholds × 141 fields = 2,820 evaluations
  • LLM: 8 models × 141 fields = 1,128 evaluations
  • Silver generation: 141 fields

Success Criteria

Hypothesis Test
H1: Embeddings sufficient for clean matches Tier 1 MCC > 0.5 (indicates reliable classification beyond chance)
H2: Diminishing returns at scale Δ(F0.5) T1→T2 > Δ(F0.5) T2→T3
H3: Transform detection needs reasoning Tier 3 >> Tier 1/2 on transform accuracy (see /synthetic/h3_transforms/)
H4: Type classification scales with model Type accuracy increases T1 → T2 → T3 (479 invalid targets for FP detection)

Statistical Considerations

Sample Size Limitation: The gold dataset contains only 29 PDP mappings (28 CLEAN_MATCH, 1 SPLIT) and 66 CDTL mappings. This limits statistical power.

Why MCC for H1: Precision alone is misleading with imbalanced data—a model can achieve 100% precision by being extremely conservative (few predictions). MCC captures overall classification quality:

  • MCC > 0.3: Model learning patterns beyond chance
  • MCC > 0.5: Practically useful classification performance
  • MCC = 1.0: Perfect classification

Reporting Requirements:

  • Point estimates with 95% CI where applicable
  • Bootstrap resampling (n=1000) for composite metrics (F0.5, ACR)
  • Explicit acknowledgment when differences are not statistically significant

Transform Ground Truth (Gold-B)

Transform evaluation uses Gold-B, which includes verified transform flags and descriptions.

Gold-B = Silver (Opus-generated) + human expert verification/corrections

Schema Transform Fields Categories
PDP 12 Date conversion, country codes, enum mappings, race parsing
CDTL 44 Enum mappings, object parsing, narrative extraction

Reference files: data/reference/{pdp,cdtl}_goldb_eval.json

Generation script: data/scripts/build_goldb_reference.py

Transform data comes from verified CSV files with human expert corrections:

  • data/silver/pdp_silver_batch_verified_by_gold_tara_additions_*.csv
  • data/silver/cdtl_silver_batch_verified_by_gold_tara_additions_*.csv

LLM Pricing (per million tokens, as of January 2026)

Model Input Output Cached Provider
Claude Sonnet 4.5 $3.00 $15.00 $0.30 Anthropic
Gemini Pro 3 $2.00 $12.00 $0.20 Google
GPT-5.2 $1.75 $14.00 $0.175 OpenAI
Claude Haiku 4.5 $1.00 $5.00 $0.10 Anthropic
Gemini Flash 3 $0.50 $3.00 $0.05 Google
GPT-5-mini $0.25 $2.00 $0.025 OpenAI
GPT OSS 120b (Groq) $0.15 $0.60 $0.075 OpenAI (Open)
Llama 4 Maverick (Groq) $0.11 $0.34 $0.055 Meta (Open)

Dependencies

# Core
pandas
numpy

# Embeddings
sentence-transformers
rapidfuzz
model2vec  # for potion-base-32M

# LLM APIs
anthropic
google-generativeai
openai

# Evaluation
scikit-learn