Evaluation Results

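Unless noted otherwise, results are from the v5 model (epoch-30 checkpoint, trained on 276K documents for 30 epochs on an L4 GPU). Sections 6 to 8 report the tree-PE ablations, v6.1, and v7.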

Generated by ./scripts/run_all_tests.sh. To reproduce, download the checkpoint and run:

./scripts/run_all_tests.sh <checkpoint> <vocab>

1. Capability Tests (93/93 pre-training)

Generated by model_tests/test_capabilities.py. 30 capabilities, 121 test cases total.

CAPABILITY COVERAGE SUMMARY

  Pre-training capabilities:
    [   PASS] Parent-child validity: 8/8 (100%)
    [   PASS] Kind conditioning: 8/8 (100%)
    [   PASS] Depth sensitivity: 3/3 (100%)
    [   PASS] Sibling awareness: 3/3 (100%)
    [   PASS] Required fields: 3/3 (100%)
    [   PASS] Cross-kind discrimination: 7/7 (100%)
    [   PASS] Value-context sensitivity: 3/3 (100%)
    [   PASS] Multi-container awareness: 2/2 (100%)
    [   PASS] RBAC structure: 2/2 (100%)
    [   PASS] Volume semantics: 3/3 (100%)
    [   PASS] StatefulSet structure: 2/2 (100%)
    [   PASS] DaemonSet structure: 1/1 (100%)
    [   PASS] Job structure: 2/2 (100%)
    [   PASS] Probe structure: 2/2 (100%)
    [   PASS] Security context: 2/2 (100%)
    [   PASS] Service port structure: 2/2 (100%)
    [   PASS] Scheduling and affinity: 2/2 (100%)
    [   PASS] HPA structure: 2/2 (100%)
    [   PASS] Annotation patterns: 2/2 (100%)
    [   PASS] Kind-specific spec children: 6/6 (100%)
    [   PASS] Same structure different kind: 4/4 (100%)
    [   PASS] Kind embedding preserves valid structures: 5/5 (100%)
    [   PASS] Workload controller distinction: 4/4 (100%)
    [   PASS] ConfigMap vs Secret: 3/3 (100%)
    [   PASS] Container field completeness: 4/4 (100%)
    [   PASS] Ingress structure: 3/3 (100%)
    [   PASS] PV and PVC structure: 3/3 (100%)
    [   PASS] Label and annotation structure: 2/2 (100%)

  Fine-tuning capabilities (requires fine-tuned model):
    [PARTIAL] Invalid structure rejection: 5/6 (83%)
    [PARTIAL] Kind-specific invalid structure rejection: 21/22 (95%)

Pre-training: 28/28 capabilities, 93/93 tests
Fine-tuning:  0/2 capabilities, 26/28 tests

Full per-test output: run model_tests/test_capabilities.py with --verbose.
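
The headline totals can be re-derived from the summary lines, which follow a regular `name: passed/total` pattern. The following is a minimal sketch that assumes the output format shown above; `tally` is an illustrative helper and not part of model_tests/test_capabilities.py, and the `FAIL` status is an assumption since only `PASS` and `PARTIAL` appear in this report.

```python
import re

# Matches lines such as "[   PASS] Kind conditioning: 8/8 (100%)".
# FAIL is assumed to exist as a status; it does not appear in the output above.
LINE_RE = re.compile(r"\[\s*(PASS|PARTIAL|FAIL)\]\s+(.+?):\s+(\d+)/(\d+)")

def tally(summary: str) -> tuple[int, int, int, int]:
    """Return (capabilities fully passed, capabilities, tests passed, tests)."""
    caps_ok = caps = tests_ok = tests = 0
    for line in summary.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip headings and blank lines
        passed, total = int(m.group(3)), int(m.group(4))
        caps += 1
        caps_ok += passed == total  # a capability counts only if every test passed
        tests_ok += passed
        tests += total
    return caps_ok, caps, tests_ok, tests

fine_tuning = """
    [PARTIAL] Invalid structure rejection: 5/6 (83%)
    [PARTIAL] Kind-specific invalid structure rejection: 21/22 (95%)
"""
print(tally(fine_tuning))  # (0, 2, 26, 28)
```

This reproduces the "Fine-tuning: 0/2 capabilities, 26/28 tests" line; feeding it the 28 pre-training lines should give 28/28 and 93/93 in the same way.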

2. Structural Tests (6/9)

Generated by model_tests/test_structural.py. Tests specific structural reasoning.

TEST 1: Kind conditioning
  Deployment: mask 'replicas' → 'replicas' (99.70%)          PASS
  Service:    mask 'type'     → 'type' (100.00%)             PASS

TEST 2: Wrong parent
  'containers' under metadata → 'imagePullSecrets' (87.78%)  PASS (not 'containers')

TEST 3: Depth awareness
  Depth 0: mask 'kind'  → 'kind' (100.00%)                   PASS
  Depth 4: mask 'image' → 'image' (99.92%)                   PASS

TEST 4: spec vs status distinction
  spec.replicas   → '[UNK]' (99.63%)                         FAIL (vocab gap)
  status.replicas → '[UNK]' (99.63%)                         FAIL (vocab gap)

TEST 5: Nonsense YAML — confidence drop
  Valid: 100.00% vs Nonsense: 66.07%                          PASS

TEST 6: Missing required field
  Mask where metadata should be → 'kind' (100.00%)            FAIL (expected 'metadata')

RESULTS: 6/9 tests passed

Test 4 failures are vocab coverage issues (status::replicas not in target vocab). Test 6 is a genuine limitation — the model doesn't infer missing siblings.

3. Document Similarity

Generated by scripts/test_similarity.py. Mean-pooled document embeddings.

Pairwise cosine similarity:
              Deployment     Service         Pod   ConfigMap
  Deployment      1.0000      0.5410      0.5099      0.4784
     Service      0.5410      1.0000      0.5348      0.3492
         Pod      0.5099      0.5348      1.0000      0.3450
   ConfigMap      0.4784      0.3492      0.3450      1.0000

Average off-diagonal similarity: 0.4597
Target: < 0.70 (v1 was 0.84-0.92)

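The model produces discriminative document embeddings: different resource types are clearly separated. Deployment and Service are the most similar pair (0.54), followed by Service and Pod (0.53). Deployment and Pod score 0.51, which is expected because Deployments contain Pod templates. ConfigMap is the least similar to the other three kinds (0.35 to 0.48).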
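
To make the metric concrete, here is a minimal sketch of mean pooling, cosine similarity, and the off-diagonal average, checked against the numbers in the table above. It is written in plain Python for illustration and is not the code in scripts/test_similarity.py; the helper names are assumptions.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into a single document embedding."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy usage: two token vectors pooled into one document vector.
doc = mean_pool([[1.0, 0.0], [0.0, 1.0]])
print(doc)  # [0.5, 0.5]

# Pairwise similarities copied from the table above (upper triangle only).
reported = {
    ("Deployment", "Service"): 0.5410,
    ("Deployment", "Pod"): 0.5099,
    ("Deployment", "ConfigMap"): 0.4784,
    ("Service", "Pod"): 0.5348,
    ("Service", "ConfigMap"): 0.3492,
    ("Pod", "ConfigMap"): 0.3450,
}

avg_off_diagonal = sum(reported.values()) / len(reported)
closest_pair = max(reported, key=reported.get)
print(f"{avg_off_diagonal:.4f}")  # 0.4597
print(closest_pair)               # ('Deployment', 'Service')
```

Because the matrix is symmetric, averaging the six upper-triangle entries gives the same result as averaging all twelve off-diagonal entries.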

4. TPE Claims Verification

Generated by scripts/test_tpe_claims.py. Verifies mathematical claims in tree-positional-encoding-explained.md.

Claim 1: UNIQUENESS
  320 TPE vectors tested (10 depths x 8 siblings x 4 types)
  51,040 pairs checked — all distinct (no cosine > 0.99)
  Depth embeddings:     rank 10/10 (full rank, linearly independent)
  Sibling embeddings:   rank 8/8
  Node type embeddings: rank 4/4
  PASS

Claim 2: DISTANCE-SENSITIVITY
  3 shared components: cosine = 1.0000
  2 shared components: cosine = 0.6386 avg
  1 shared component:  cosine = 0.2871 avg
  0 shared components: cosine = -0.1566
  PASS (monotonic decrease: 3 > 2 > 1 > 0)

Claim 3: DECOMPOSABILITY IN ATTENTION
  28/48 attention heads are depth-specialized (>2x bias)
  1/48 heads are type-specialized
  Max depth bias: 14.50x (Layer 3 Head 4)
  PASS (structural specialization observed)
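
As a reference point for Claims 1 and 2, the sketch below builds an idealized additive encoding in which every depth, sibling, and node-type embedding is a distinct orthonormal axis. This is an assumption made for illustration, not the trained model and not the code in scripts/test_tpe_claims.py. Under that assumption, the cosine between two positions is exactly the number of shared components divided by three.

```python
from itertools import combinations, product

N_DEPTH, N_SIBLING, N_TYPE = 10, 8, 4
DIM = N_DEPTH + N_SIBLING + N_TYPE  # one orthogonal axis per embedding row

def one_hot(index, dim=DIM):
    v = [0.0] * dim
    v[index] = 1.0
    return v

def tpe(depth, sibling, node_type):
    """Additive composition: depth + sibling + node_type."""
    d = one_hot(depth)
    s = one_hot(N_DEPTH + sibling)
    t = one_hot(N_DEPTH + N_SIBLING + node_type)
    return [a + b + c for a, b, c in zip(d, s, t)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm

positions = list(product(range(N_DEPTH), range(N_SIBLING), range(N_TYPE)))
vectors = {p: tpe(*p) for p in positions}
pairs = list(combinations(positions, 2))
print(len(positions), len(pairs))  # 320 51040

# Group pair similarities by how many of (depth, sibling, type) are shared.
by_shared = {0: set(), 1: set(), 2: set()}
for p, q in pairs:
    shared = sum(a == b for a, b in zip(p, q))
    by_shared[shared].add(round(cosine(vectors[p], vectors[q]), 4))
print(by_shared)  # {0: {0.0}, 1: {0.3333}, 2: {0.6667}}
```

The measured averages (0.6386, 0.2871, -0.1566) follow the same ordering but deviate from the ideal 2/3, 1/3, 0, which is what one would expect from learned embeddings that are only approximately orthogonal.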

5. Embedding Structure

Generated by scripts/test_embedding_structure.py.

Depth embeddings:   nearly orthogonal (categorical, not smooth)
                    adjacent similarity: 0.014, distant: -0.015
Sibling embeddings: weak ordinal gradient
                    adjacent similarity: 0.020, distant: -0.010
Node type:          KEY vs LIST_KEY = 0.064 (mild clustering)
                    KEY vs VALUE = 0.038 (weak separation)

Kind embeddings:    0 kinds map to [UNK] (all have distinct embeddings)
                    No case variants found (normalization working)

6. Tree-PE Ablation Study

Compares four positional encoding variants under identical training conditions (same data, same seed, same hyperparameters). Generated by scripts/run_ablations.sh + scripts/eval_ablations.py. Reported metric is the pretrain capability test pass rate.

Variants

Tag	`tree_pos` composition	Positional params
FULL	`depth + sibling + node_type` (additive)	~13K
no_depth	`sibling + node_type`	~9K
no_sibling	`depth + node_type`	~5K
sequential	`learned pos[seq_idx] + node_type`	~132K

All variants share the same key/value/target vocabularies and were trained on byte-identical documents (verified via vocab.json and doc_cache.pkl md5).

Results at 5K docs × 10 epochs (single seed)

Variant	Capability	Pass rate
FULL	85/93	91.4%
no_depth	80/93	86.0%
no_sibling	79/93	84.9%
sequential	79/93	84.9%

At low data, FULL wins by 5–6 tests. Tree-PE looks like a clear advantage.

Results at 25K docs × 10 epochs (two seeds)

Variant	Seed 42	Seed 7	Mean	Range
FULL	92/93	92/93	92.0	0
no_depth	93/93	90/93	91.5	3
no_sibling	91/93	92/93	91.5	1
sequential	92/93	92/93	92.0	0

At moderate data, the 5K gap collapses. All variants land in 90–93/93, within seed-to-seed variation of each other. The seed 42 result in which no_depth beat FULL did not replicate (no_depth dropped to 90 on seed 7).
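
The Mean and Range columns, and the comparison between variant-to-variant and seed-to-seed spread, can be recomputed from the per-seed scores. This is a small illustrative sketch using only the numbers in the table above; it is not scripts/eval_ablations.py.

```python
# Capability tests passed (out of 93) per seed, copied from the 25K table.
results = {
    "FULL": {42: 92, 7: 92},
    "no_depth": {42: 93, 7: 90},
    "no_sibling": {42: 91, 7: 92},
    "sequential": {42: 92, 7: 92},
}

summary = {}
for variant, by_seed in results.items():
    scores = list(by_seed.values())
    summary[variant] = (sum(scores) / len(scores), max(scores) - min(scores))

for variant, (mean, spread) in summary.items():
    print(f"{variant:<11} mean={mean:.1f} range={spread}")

# Gap between the best and worst variant means, versus the largest
# seed-to-seed range observed within any single variant.
means = [mean for mean, _ in summary.values()]
print(max(means) - min(means), max(spread for _, spread in summary.values()))  # 0.5 3
```

The variant means differ by at most 0.5 tests while a single variant moves by up to 3 tests between seeds, so the table cannot rank the variants. With only two seeds per variant, the range itself is a rough estimate.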

Interpretation

Tree-PE is a useful inductive bias for data-efficient training, not a permanent advantage. It accelerates convergence at low data but is absorbed by attention + data at moderate scale.
FULL matches a 10× larger sequential PE (~13K vs ~132K positional parameters) with equivalent capability performance. Compositional structure substitutes for raw parameter capacity.
The capability test suite saturates at 25K (90–93/93 leaves no room to differentiate). Any architectural claim about tree-PE vs alternatives needs harder benchmarks before it can be made with confidence.

Caveats

Parameter-count asymmetry not fully controlled. Sequential has ~132K positional params; FULL has ~13K. A param-matched comparison (e.g., a joint depth × sibling table with ~131K params) would isolate "additive composition" from "tree-aware encoding." Not yet run.
No sinusoidal baseline. Sequential uses learned positional embeddings. A static sinusoidal variant would have 0 positional params and could meaningfully change the parameter-efficiency story.
Single-domain capability tests. OOV-stressing (novel CRDs, annotation keys, rare fields) is not exercised by the current suite.

Reproduce

./scripts/run_ablations.sh                                # default: 5K, 10ep, seed 42
MAX_DOCS=25000 SEEDS="42 7" ./scripts/run_ablations.sh    # full sweep
python scripts/eval_ablations.py output_ablation_*        # tabulate

7. v6.1 — Lever 1 (selective masking)

v6.1 is v5's architecture and hyperparameters trained from scratch with one change: positions whose target encodes to [UNK] are no longer masked (yaml_bert/dataset.py::_getitem_v4). This closes the supervision bug in which v5, having been trained on [UNK] labels, confidently predicted [UNK] (about 99%) on status-side fields.
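
The idea behind Lever 1 can be sketched as a filter on mask candidates. The function below is a hypothetical illustration, not the body of `_getitem_v4`: the name `choose_mask_positions`, the `mask_prob` parameter, and the `parent::key` target strings are assumptions. Only the `skip_unk_targets` flag name comes from the reproduce command in this section.

```python
import random

def choose_mask_positions(targets, target_vocab, mask_prob, rng, skip_unk_targets=True):
    """Pick positions to mask, optionally skipping targets outside the vocab.

    A target missing from `target_vocab` would encode to [UNK]; masking it
    would supervise the model to predict [UNK] at that position.
    """
    chosen = []
    for i, target in enumerate(targets):
        encodes_to_unk = target not in target_vocab
        if skip_unk_targets and encodes_to_unk:
            continue  # Lever 1: never supervise on an [UNK] label
        if rng.random() < mask_prob:
            chosen.append(i)
    return chosen

# Hypothetical example: the status-side key is absent from the target vocab.
target_vocab = {"spec::replicas", "spec::selector", "metadata::name"}
targets = ["metadata::name", "spec::replicas", "status::availableReplicas"]

positions = choose_mask_positions(targets, target_vocab, mask_prob=1.0, rng=random.Random(0))
print(positions)  # [0, 1]
```

With `skip_unk_targets=False` the same call would also select position 2, which corresponds to the v5 behaviour described above.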

Training

Same recipe as v5: 276K HF docs, 30 epochs, batch 32, lr 1e-4, min_freq=100, seed 42, FULL tree-PE variant. Ran on a single L4 GPU.

v6.1 final loss: 0.4916  (kind: 0.0521 | simple: 0.4395)
v5  final loss: ~0.59 at epoch 15 (final epoch-30 number not preserved)

Loss curve: docs/figures/training_loss_v6.1.html (interactive). Total loss drops sharply over epochs 1-5 (2.29 → 0.76), then steady refinement through epoch 30.

Results vs v5

Metric	v5	v6.1	Delta
Pretrain capability tests	93 / 93	92 / 93	−1 (noise; Volume semantics dropped to 2/3)
Fine-tune capability tests	26 / 28	24 / 28	−2 (within noise)
Structural tests	6 / 9	9 / 9	+3
Bigger boat — crd_pollution	4 / 4	4 / 4	0
Bigger boat — annotation_keys	2 / 2	2 / 2	0
Bigger boat — confidence_calib	3 / 3	3 / 3	0
Bigger boat — vocab_gap (status)	0 / 4	0 / 4 †	0 (but failure mode changed)

What the structural-test improvement looks like

The 99% [UNK] failures in v5 are now real-key predictions in v6.1. Example (Deployment.status.replicas, mask availableReplicas):

v5 top-1:   [UNK]                (99.63%)   ← supervised bug
v6.1 top-1: revisionHistoryLimit (92.01%)   ← wrong, but not [UNK]

Same for Pod.status.conditions, HPA.status.desiredReplicas, Service.status.loadBalancer.hostname — all formerly [UNK]-heavy, now predicting plausible spec-side alternatives.

Test 5 (nonsense-YAML calibration): confidence drop sharpened from 100% → 66% (v5) to 100% → 39% (v6.1). v6.1 assigns noticeably lower confidence to out-of-distribution input than v5 did.

† Why bigger-boat vocab_gap is still 0/4

The bigger-boat test asserts both "top-1 is not [UNK]" AND "an expected status key appears in top-5." v6.1 passes the first but not the second — because the status targets are still absent from the target vocab. Lever 1 stops the model from confidently emitting [UNK]; it does not add new vocabulary entries.
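
The two-part assertion can be written as a small predicate. This is a hedged sketch of the check as described in the paragraph above, not the code in model_tests/test_bigger_boat.py; the function name and the example rankings are assumptions. The leading entries of the v6.1-like and v7-like lists come from this report, and the remaining entries are invented for illustration.

```python
UNK = "[UNK]"

def vocab_gap_passes(ranked_keys, expected_keys, k=5):
    """ranked_keys: predicted keys sorted by descending probability."""
    top1_ok = bool(ranked_keys) and ranked_keys[0] != UNK
    topk_ok = any(key in expected_keys for key in ranked_keys[:k])
    return top1_ok and topk_ok

expected = {"replicas", "availableReplicas", "readyReplicas"}

# Illustrative rankings only; the tail entries are made up for the example.
v5_like = [UNK, "replicas", "selector", "template", "strategy"]
v61_like = ["revisionHistoryLimit", "selector", "template", "strategy", "paused"]
v7_like = ["readyReplicas", "updatedReplicas", "replicas", "selector", "template"]

print(vocab_gap_passes(v5_like, expected))   # False: top-1 is [UNK]
print(vocab_gap_passes(v61_like, expected))  # False: no expected key in top-5
print(vocab_gap_passes(v7_like, expected))   # True
```

The second case is the v6.1 situation: the first condition holds, but no status key can reach the top-5 while those keys are absent from the target vocab.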

To actually predict status keys correctly, v6.2 would need to add status targets to the vocab (e.g., per-parent min_freq override) or train on a status-rich corpus. See docs/v6-plan.md.

Interpretation

v6.1 validates that Lever 1 is a real fix for a real bug. It is not, on its own, sufficient to make the model competent at status-side prediction — that requires vocab work. The structural-test arc (6/9 → 9/9) is the strongest signal that the bug fix worked as intended.

Reproduce

# Same architecture as v5, only difference is skip_unk_targets=True (default)
PYTHONPATH=. python scripts/train.py \
    --max-docs 0 --epochs 30 --batch-size 32 \
    --vocab-min-freq 100 --seed 42 \
    --output-dir output_v6.1_lever1_only_seed42

# Eval
python model_tests/test_capabilities.py output_v6.1.../checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v6.1.../vocab.json
python model_tests/test_structural.py    output_v6.1.../checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v6.1.../vocab.json
python model_tests/test_bigger_boat.py   output_v6.1.../checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v6.1.../vocab.json --show-passes

8. v7 — Lever 5 depth cap + status vocab exemption + per-category min_freq

v7 keeps v6.1's architecture and Lever 1 fix, adds three vocabulary-side improvements:

Lever 5 depth cap. Truncate trees at depth 9; deeper nodes (rare, ~1% of corpus) are dropped from training. Reduces noise from edge cases.
Status vocab exemption. Status-side keys (e.g., replicas, conditions, availableReplicas, currentMetrics) bypass the frequency filter — they're always included in the target vocab even if individually rare. Directly addresses v6.1's "status keys absent from vocab" limitation flagged in section 7.
Per-category min_freq. Different thresholds for different target types: keys/values=100, simple_target=5, kind_target=2. Net effect is a much larger target vocabulary (the model grows from ~7.8M to 13.4M parameters).
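
The two vocabulary rules (per-category thresholds and the status exemption) can be sketched together. Everything below is a hypothetical illustration: the function name, the `parent::key` target strings, prefix matching on `status::`, the inclusive `>=` threshold, and all counts are assumptions. Only the threshold values 100, 5, and 2 come from the list above.

```python
from collections import Counter

def build_target_vocab(counts, min_freq, exempt_prefixes=("status::",)):
    """Keep targets at or above min_freq, plus any exempt (status-side) targets."""
    vocab = set()
    for target, freq in counts.items():
        exempt = target.startswith(exempt_prefixes)  # assumed way to spot status keys
        if exempt or freq >= min_freq:
            vocab.add(target)
    return vocab

# Thresholds follow the list above; the counts are made up.
MIN_FREQ = {"key": 100, "simple_target": 5, "kind_target": 2}

simple_counts = Counter({
    "spec::replicas": 9000,
    "spec::rareField": 3,
    "status::availableReplicas": 1,
})
kind_counts = Counter({"Deployment::spec": 5000, "MyCRD::spec": 2, "OneOff::spec": 1})

simple_vocab = build_target_vocab(simple_counts, MIN_FREQ["simple_target"])
kind_vocab = build_target_vocab(kind_counts, MIN_FREQ["kind_target"])
print(sorted(simple_vocab))  # ['spec::replicas', 'status::availableReplicas']
print(sorted(kind_vocab))    # ['Deployment::spec', 'MyCRD::spec']
```

In this toy run the rare status key survives despite a count of 1, while an equally rare spec-side key is filtered out.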

v7 final loss: 0.9722 (kind: 0.4875 | simple: 0.4847). Higher than v6.1 in absolute terms but not directly comparable — v7's target vocab is much larger (more classes to predict over), so the same probability mass spread across more outputs gives higher cross-entropy. Pass-rates on capability and downstream tests are the right comparison.
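
A toy calculation shows why absolute loss values shift with the number of competing targets. It assumes, purely for illustration, a model that is evenly split among k plausible targets at a position; the values of k are hypothetical and are not v6.1's or v7's vocabulary sizes.

```python
import math

def cross_entropy(p_correct):
    """Per-position cross-entropy in nats, given the probability on the true class."""
    return -math.log(p_correct)

# Hypothetical: the model is unsure among k equally plausible targets.
for k in (2, 4, 8):
    print(k, round(cross_entropy(1.0 / k), 4))
# 2 0.6931
# 4 1.3863
# 8 2.0794
```

The model is equally "right" in each case (the true key is among its top candidates), yet the loss grows with k, which is why pass rates are the fairer comparison across vocabularies.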

Results vs v6.1

Metric	v6.1	v7	Delta
Pretrain capability tests	92 / 93	93 / 93	+1
Fine-tune capability tests	24 / 28	27 / 28	+3
Structural tests	9 / 9	8 / 9	−1 (missing-metadata edge case)
Bigger boat — vocab_gap (status)	0 / 4	3 / 4	+3 (status vocab exemption worked)
Bigger boat — crd_pollution	4 / 4	4 / 4	0
Bigger boat — annotation_keys	2 / 2	2 / 2	0
Bigger boat — confidence_calib	3 / 3	2 / 3	−1 (overconfident on ambiguous position)
Doc similarity (off-diagonal)	~0.46	0.46	0
Net (capability tests, pretrain + fine-tune)	116/121	120/121	+4

What the vocab_gap improvement looks like

The bigger-boat vocab_gap test asks the model to predict status keys under various Kubernetes types. v6.1 always failed (those keys weren't in the target vocab). v7 now passes 3 of 4:

Deployment.status.replicas:
  v6.1: no expected status key in top-5 (vocab gap)
  v7:   'readyReplicas' (77.5%), 'updatedReplicas' (10.7%), ...
        → 'replicas' in top-5 ✓

HPA.status.currentMetrics:
  v7:   'desiredReplicas' (53.2%), 'minReplicas' (20.4%), ...
        → 'metrics' / 'currentMetrics' present ✓

Service.status.loadBalancer.ingress:
  v7:   'ip' (43.9%), 'nodePort' (28.6%), ...
        → 'ip' is a valid status-ingress field ✓

The one vocab_gap test that still fails (Pod status.conditions) suggests the model has learned status keys but is still uncertain about which Pod-specific status fields exist — a weaker form of the original gap.

Regressions

Structural test 6 (predict 'metadata' when removed). v6.1 passed; v7 predicts 'kind' with 100% confidence. The expanded output vocab (more competing kind candidates) appears to have shifted the model's default fallback. One edge-case failure.
Bigger boat confidence_calib (low-confidence test). On an ambiguous security-context position, v7 emits 96% confidence on allowPrivilegeEscalation where v6.1 was appropriately uncertain. Calibration degraded slightly with the larger output head.

Deploy decision

Despite the strict "no-regressions" gate originally targeted, v7 was deployed to the HF Space (vimalk78/yaml-bert). The justification was a net +4 on the capability suite (116/121 → 120/121), +3 on the high-priority vocab_gap tests (the main v7 motivation), and only two edge-case regressions. The decision was recorded as "Override gate — deploy v7."

Reproduce

PYTHONPATH=. python scripts/train.py \
    --max-docs 0 --epochs 30 --batch-size 32 \
    --vocab-min-freq 100 \
    --simple-target-min-freq 5 --kind-target-min-freq 2 \
    --seed 42 \
    --output-dir output_v7_seed42

# Eval
python model_tests/test_capabilities.py output_v7_seed42/checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v7_seed42/vocab.json
python model_tests/test_structural.py    output_v7_seed42/checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v7_seed42/vocab.json
python model_tests/test_bigger_boat.py   output_v7_seed42/checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v7_seed42/vocab.json --show-passes
python scripts/test_similarity.py        output_v7_seed42/checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v7_seed42/vocab.json
