All results are from the v5 model (epoch-30 checkpoint; 276K documents, 30 epochs on an L4 GPU).
Generated by ./scripts/run_all_tests.sh. To reproduce, download the checkpoint and run:
./scripts/run_all_tests.sh <checkpoint> <vocab>
Generated by model_tests/test_capabilities.py. 30 capabilities, 121 test cases total.
CAPABILITY COVERAGE SUMMARY
Pre-training capabilities:
[ PASS] Parent-child validity: 8/8 (100%)
[ PASS] Kind conditioning: 8/8 (100%)
[ PASS] Depth sensitivity: 3/3 (100%)
[ PASS] Sibling awareness: 3/3 (100%)
[ PASS] Required fields: 3/3 (100%)
[ PASS] Cross-kind discrimination: 7/7 (100%)
[ PASS] Value-context sensitivity: 3/3 (100%)
[ PASS] Multi-container awareness: 2/2 (100%)
[ PASS] RBAC structure: 2/2 (100%)
[ PASS] Volume semantics: 3/3 (100%)
[ PASS] StatefulSet structure: 2/2 (100%)
[ PASS] DaemonSet structure: 1/1 (100%)
[ PASS] Job structure: 2/2 (100%)
[ PASS] Probe structure: 2/2 (100%)
[ PASS] Security context: 2/2 (100%)
[ PASS] Service port structure: 2/2 (100%)
[ PASS] Scheduling and affinity: 2/2 (100%)
[ PASS] HPA structure: 2/2 (100%)
[ PASS] Annotation patterns: 2/2 (100%)
[ PASS] Kind-specific spec children: 6/6 (100%)
[ PASS] Same structure different kind: 4/4 (100%)
[ PASS] Kind embedding preserves valid structures: 5/5 (100%)
[ PASS] Workload controller distinction: 4/4 (100%)
[ PASS] ConfigMap vs Secret: 3/3 (100%)
[ PASS] Container field completeness: 4/4 (100%)
[ PASS] Ingress structure: 3/3 (100%)
[ PASS] PV and PVC structure: 3/3 (100%)
[ PASS] Label and annotation structure: 2/2 (100%)
Fine-tuning capabilities (requires fine-tuned model):
[PARTIAL] Invalid structure rejection: 5/6 (83%)
[PARTIAL] Kind-specific invalid structure rejection: 21/22 (95%)
Pre-training: 28/28 capabilities, 93/93 tests
Fine-tuning: 0/2 capabilities, 26/28 tests
Full per-test output: run model_tests/test_capabilities.py with --verbose.
Generated by model_tests/test_structural.py. Tests specific structural reasoning.
TEST 1: Kind conditioning
Deployment: mask 'replicas' → 'replicas' (99.70%) PASS
Service: mask 'type' → 'type' (100.00%) PASS
TEST 2: Wrong parent
'containers' under metadata → 'imagePullSecrets' (87.78%) PASS (not 'containers')
TEST 3: Depth awareness
Depth 0: mask 'kind' → 'kind' (100.00%) PASS
Depth 4: mask 'image' → 'image' (99.92%) PASS
TEST 4: spec vs status distinction
spec.replicas → '[UNK]' (99.63%) FAIL (vocab gap)
status.replicas → '[UNK]' (99.63%) FAIL (vocab gap)
TEST 5: Nonsense YAML — confidence drop
Valid: 100.00% vs Nonsense: 66.07% PASS
TEST 6: Missing required field
Mask where metadata should be → 'kind' (100.00%) FAIL (expected 'metadata')
RESULTS: 6/9 tests passed
Test 4 failures are vocab coverage issues (status::replicas not in target vocab). Test 6 is a genuine limitation — the model doesn't infer missing siblings.
Generated by scripts/test_similarity.py. Mean-pooled document embeddings.
Pairwise cosine similarity:
Deployment Service Pod ConfigMap
Deployment 1.0000 0.5410 0.5099 0.4784
Service 0.5410 1.0000 0.5348 0.3492
Pod 0.5099 0.5348 1.0000 0.3450
ConfigMap 0.4784 0.3492 0.3450 1.0000
Average off-diagonal similarity: 0.4597
Target: < 0.70 (v1 was 0.84-0.92)
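The model produces discriminative document embeddings: different resource types are clearly separated. Deployment and Service are the most similar pair (0.54), followed by Service and Pod (0.53). Deployment and Pod remain relatively close (0.51), as expected given that Deployments contain Pod templates.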
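The average reported above can be re-derived from the matrix with a few lines of Python. This is a minimal sketch: the helper names (`mean_pool`, `cosine`, `off_diagonal_mean`) are illustrative and are not taken from scripts/test_similarity.py; only the matrix values come from the output above.

```python
import math

# Pairwise cosine similarities copied from the table above.
labels = ["Deployment", "Service", "Pod", "ConfigMap"]
sim = [
    [1.0000, 0.5410, 0.5099, 0.4784],
    [0.5410, 1.0000, 0.5348, 0.3492],
    [0.5099, 0.5348, 1.0000, 0.3450],
    [0.4784, 0.3492, 0.3450, 1.0000],
]

def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one document vector."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def off_diagonal_mean(matrix):
    """Mean of all entries except the diagonal (self-similarity)."""
    n = len(matrix)
    vals = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(vals) / len(vals)

avg = off_diagonal_mean(sim)
print(f"Average off-diagonal similarity: {avg:.4f}")
```

Because the matrix is symmetric, averaging all twelve off-diagonal entries gives the same value as averaging the six unique pairs.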
Generated by scripts/test_tpe_claims.py. Verifies mathematical claims in tree-positional-encoding-explained.md.
Claim 1: UNIQUENESS
320 TPE vectors tested (10 depths x 8 siblings x 4 types)
51,040 pairs checked — all distinct (no cosine > 0.99)
Depth embeddings: rank 10/10 (full rank, linearly independent)
Sibling embeddings: rank 8/8
Node type embeddings: rank 4/4
PASS
Claim 2: DISTANCE-SENSITIVITY
3 shared components: cosine = 1.0000
2 shared components: cosine = 0.6386 avg
1 shared component: cosine = 0.2871 avg
0 shared components: cosine = -0.1566
PASS (monotonic decrease: 3 > 2 > 1 > 0)
Claim 3: DECOMPOSABILITY IN ATTENTION
28/48 attention heads are depth-specialized (>2x bias)
1/48 heads are type-specialized
Max depth bias: 14.50x (Layer 3 Head 4)
PASS (structural specialization observed)
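The vector and pair counts in Claim 1, and the grouping by shared components in Claim 2, can be illustrated with randomly initialized tables. This is a sketch under stated assumptions: the embedding width `DIM`, the random initialization, and all function names are hypothetical, so the cosine values it produces will not match the trained model's numbers above. Only the grid size (10 depths x 8 siblings x 4 types) and the additive composition come from this document.

```python
import itertools
import math
import random

DIM = 16                 # hypothetical embedding width, not the model's
rng = random.Random(0)   # fixed seed so the sketch is repeatable

def random_table(rows, dim=DIM):
    """Stand-in for a learned embedding table."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]

depth_emb = random_table(10)
sibling_emb = random_table(8)
type_emb = random_table(4)

def tpe(d, s, t):
    """Additive tree-positional encoding: depth + sibling + node type."""
    return [a + b + c for a, b, c in zip(depth_emb[d], sibling_emb[s], type_emb[t])]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

positions = list(itertools.product(range(10), range(8), range(4)))
vectors = {p: tpe(*p) for p in positions}
n_pairs = math.comb(len(vectors), 2)   # number of unordered pairs

# Group pairwise cosines by how many of (depth, sibling, type) the pair shares.
by_shared = {0: [], 1: [], 2: []}
for p, q in itertools.combinations(positions, 2):
    shared = sum(x == y for x, y in zip(p, q))
    by_shared[shared].append(cosine(vectors[p], vectors[q]))

print(len(vectors), "vectors,", n_pairs, "pairs")
for k in (2, 1, 0):
    print(k, "shared components: mean cosine", sum(by_shared[k]) / len(by_shared[k]))
```

Two distinct positions can share at most two components, which is why the "3 shared components" row in the output above is the trivial self-similarity case.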
Generated by scripts/test_embedding_structure.py.
Depth embeddings: nearly orthogonal (categorical, not smooth)
adjacent similarity: 0.014, distant: -0.015
Sibling embeddings: weak ordinal gradient
adjacent similarity: 0.020, distant: -0.010
Node type: KEY vs LIST_KEY = 0.064 (mild clustering)
KEY vs VALUE = 0.038 (weak separation)
Kind embeddings: 0 kinds map to [UNK] (all have distinct embeddings)
No case variants found (normalization working)
Compares four positional encoding variants under identical training conditions
(same data, same seed, same hyperparameters). Generated by
scripts/run_ablations.sh + scripts/eval_ablations.py. Reported metric is the
pretrain capability test pass rate.
| Tag | tree_pos composition | Positional params |
|---|---|---|
| FULL | depth + sibling + node_type (additive) | ~13K |
| no_depth | sibling + node_type | ~9K |
| no_sibling | depth + node_type | ~5K |
| sequential | learned pos[seq_idx] + node_type | ~132K |
All variants share the same key/value/target vocabularies and were trained on
byte-identical documents (verified via vocab.json and doc_cache.pkl md5).
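The positional parameter counts in the variant table are consistent with simple embedding-table arithmetic. The sketch below is a guess at one configuration that reproduces the rounded figures: the model width (256) and the table sizes (16 depths, 32 siblings, 4 node types, 512 sequence positions) are assumptions chosen to fit the table, not values confirmed by this document.

```python
# All sizes below are ASSUMPTIONS picked to match the rounded table entries.
D_MODEL = 256
MAX_DEPTH = 16
MAX_SIBLING = 32
NUM_NODE_TYPES = 4
MAX_SEQ_LEN = 512

def table_params(rows, dim=D_MODEL):
    """Parameter count of one embedding table with `rows` entries."""
    return rows * dim

depth = table_params(MAX_DEPTH)
sibling = table_params(MAX_SIBLING)
node_type = table_params(NUM_NODE_TYPES)
seq_pos = table_params(MAX_SEQ_LEN)

variants = {
    "FULL": depth + sibling + node_type,
    "no_depth": sibling + node_type,
    "no_sibling": depth + node_type,
    "sequential": seq_pos + node_type,
}

# A joint depth x sibling table needs one row per (depth, sibling) combination.
joint_depth_sibling = table_params(MAX_DEPTH * MAX_SIBLING)

for tag, count in variants.items():
    print(f"{tag:<12} {count:>7,} positional params")
print(f"{'joint d x s':<12} {joint_depth_sibling:>7,} positional params")
```

Under these assumed sizes the additive variants grow with the sum of the table sizes, while a joint table grows with their product, which is why additive composition is so much cheaper.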
| Variant | Capability tests passed | Pass rate |
|---|---|---|
| FULL | 85/93 | 91.4% |
| no_depth | 80/93 | 86.0% |
| no_sibling | 79/93 | 84.9% |
| sequential | 79/93 | 84.9% |
At low data (5K documents), FULL wins by 5–6 tests. At this scale, Tree-PE looks like a clear advantage.
| Variant | Seed 42 | Seed 7 | Mean | Range |
|---|---|---|---|---|
| FULL | 92/93 | 92/93 | 92.0 | 0 |
| no_depth | 93/93 | 90/93 | 91.5 | 3 |
| no_sibling | 91/93 | 92/93 | 91.5 | 1 |
| sequential | 92/93 | 92/93 | 92.0 | 0 |
At moderate data, the 5K gap collapses. All variants land in 90–93/93 — within single-seed variance of each other. The seed 42 "no_depth beats FULL" result was a fluke (it dropped to 90 on seed 7).
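The Mean and Range columns follow directly from the per-seed scores. A small sketch of that tabulation (the `summarize` helper is illustrative and is not taken from scripts/eval_ablations.py):

```python
# Tests passed out of 93, per seed, copied from the table above.
scores = {
    "FULL": {42: 92, 7: 92},
    "no_depth": {42: 93, 7: 90},
    "no_sibling": {42: 91, 7: 92},
    "sequential": {42: 92, 7: 92},
}

def summarize(per_seed):
    """Return (mean, range) of the pass counts across seeds."""
    vals = list(per_seed.values())
    return sum(vals) / len(vals), max(vals) - min(vals)

summary = {tag: summarize(per_seed) for tag, per_seed in scores.items()}
for tag, (mean, spread) in summary.items():
    print(f"{tag:<12} mean={mean:.1f} range={spread}")

# Spread between variant means versus the largest within-variant seed range.
means = [m for m, _ in summary.values()]
gap_between_variants = max(means) - min(means)
max_seed_range = max(r for _, r in summary.values())
```

With two seeds, the gap between variant means (0.5 tests) is smaller than the largest seed-to-seed range (3 tests), which is the basis for treating the variants as indistinguishable at this scale.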
- Tree-PE is a useful inductive bias for data-efficient training, not a permanent advantage. It accelerates convergence at low data but is absorbed by attention + data at moderate scale.
- FULL matches a 10× larger sequential PE (~13K vs ~132K positional parameters) with equivalent capability performance. Compositional structure substitutes for raw parameter capacity.
- The capability test suite saturates at 25K (90–93/93 leaves no room to differentiate). Any architectural claim about tree-PE vs alternatives needs harder benchmarks before it can be made with confidence.
- Parameter-count asymmetry not fully controlled. Sequential has ~132K positional params; FULL has ~13K. A param-matched comparison (e.g., a joint depth × sibling table with ~131K params) would isolate "additive composition" from "tree-aware encoding." Not yet run.
- No sinusoidal baseline. Sequential uses learned positional embeddings. A static sinusoidal variant would have 0 positional params and could meaningfully change the parameter-efficiency story.
- Single-domain capability tests. OOV-stressing (novel CRDs, annotation keys, rare fields) is not exercised by the current suite.
./scripts/run_ablations.sh # default: 5K, 10ep, seed 42
MAX_DOCS=25000 SEEDS="42 7" ./scripts/run_ablations.sh # full sweep
python scripts/eval_ablations.py output_ablation_* # tabulate
v6.1 is v5's architecture and hyperparameters trained from scratch with one
change: positions whose target encodes to [UNK] are no longer masked
(yaml_bert/dataset.py::_getitem_v4). Closes the supervised-bug failure
mode where v5 confidently predicted [UNK] at 99% on status-side fields.
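The masking change can be sketched as follows. This is a minimal illustration and not the code in yaml_bert/dataset.py: `choose_mask_positions`, `mask_prob`, and the toy target strings are hypothetical, while `skip_unk_targets` mirrors the flag named in the training command for this run.

```python
import random

def choose_mask_positions(targets, target_vocab, mask_prob=0.15,
                          rng=None, skip_unk_targets=True):
    """Pick positions to mask for the masked-prediction objective.

    With skip_unk_targets=True, a position whose target is missing from the
    target vocabulary (and would therefore encode to [UNK]) is never selected,
    so the model is never trained to emit [UNK] as an answer.
    """
    rng = rng or random.Random()
    chosen = []
    for i, target in enumerate(targets):
        if skip_unk_targets and target not in target_vocab:
            continue  # would be supervised with [UNK]; leave it unmasked
        if rng.random() < mask_prob:
            chosen.append(i)
    return chosen

# Toy example: the status-side target is out of vocabulary.
targets = ["spec::replicas", "status::replicas", "spec::selector"]
target_vocab = {"spec::replicas", "spec::selector"}

v6_style = choose_mask_positions(targets, target_vocab, mask_prob=1.0)
v5_style = choose_mask_positions(targets, target_vocab, mask_prob=1.0,
                                 skip_unk_targets=False)
print("v6.1-style candidates:", v6_style)
print("v5-style candidates:  ", v5_style)
```

Setting `mask_prob=1.0` in the toy example makes the difference deterministic: the only position that changes between the two behaviours is the one with the out-of-vocabulary target.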
Same recipe as v5: 276K HF docs, 30 epochs, batch 32, lr 1e-4, min_freq=100, seed 42, FULL tree-PE variant. Ran on a single L4 GPU.
v6.1 final loss: 0.4916 (kind: 0.0521 | simple: 0.4395)
v5 final loss: ~0.59 at epoch 15 (final epoch-30 number not preserved)
Loss curve: docs/figures/training_loss_v6.1.html (interactive). Total
loss drops sharply over epochs 1-5 (2.29 → 0.76), then steady refinement
through epoch 30.
| Metric | v5 | v6.1 | Delta |
|---|---|---|---|
| Pretrain capability tests | 93 / 93 | 92 / 93 | −1 (noise; Volume semantics dropped to 2/3) |
| Fine-tune capability tests | 26 / 28 | 24 / 28 | −2 (within noise) |
| Structural tests | 6 / 9 | 9 / 9 | +3 |
| Bigger boat — crd_pollution | 4 / 4 | 4 / 4 | 0 |
| Bigger boat — annotation_keys | 2 / 2 | 2 / 2 | 0 |
| Bigger boat — confidence_calib | 3 / 3 | 3 / 3 | 0 |
| Bigger boat — vocab_gap (status) | 0 / 4 | 0 / 4 † | 0 (but failure mode changed) |
The 99% [UNK] failures in v5 are now real-key predictions in v6.1. Example
(Deployment.status.replicas, mask availableReplicas):
v5 top-1: [UNK] (99.63%) ← supervised bug
v6.1 top-1: revisionHistoryLimit (92.01%) ← wrong, but not [UNK]
Same for Pod.status.conditions, HPA.status.desiredReplicas,
Service.status.loadBalancer.hostname — all formerly [UNK]-heavy,
now predicting plausible spec-side alternatives.
Test 5 (nonsense-YAML calibration): the confidence drop from valid to nonsense input sharpened, from 100% → 66% in v5 to 100% → 39% in v6.1. v6.1 is better at signalling out-of-distribution input through low confidence.
The bigger-boat test asserts both "top-1 is not [UNK]" AND "an expected status key appears in top-5." v6.1 passes the first but not the second — because the status targets are still absent from the target vocab. Lever 1 stops the model from confidently emitting [UNK]; it does not add new vocabulary entries.
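The two-part assertion described above can be written as a small predicate. This is a sketch of the stated criterion only: `vocab_gap_check` is a hypothetical name, not the function in model_tests/test_bigger_boat.py, and the prediction lists contain just the top-1 entries quoted in this document because the remaining top-5 entries are not reported.

```python
UNK = "[UNK]"

def vocab_gap_check(top_k, expected_keys):
    """Evaluate both conditions on a ranked list of (token, probability).

    Returns (top1_not_unk, expected_in_top5, passed).
    """
    tokens = [tok for tok, _ in top_k[:5]]
    top1_not_unk = bool(tokens) and tokens[0] != UNK
    expected_in_top5 = any(tok in expected_keys for tok in tokens)
    return top1_not_unk, expected_in_top5, top1_not_unk and expected_in_top5

expected = {"availableReplicas"}

# Only the top-1 predictions are known from the results above.
v5_result = vocab_gap_check([(UNK, 0.9963)], expected)
v61_result = vocab_gap_check([("revisionHistoryLimit", 0.9201)], expected)

print("v5:  ", v5_result)
print("v6.1:", v61_result)
```

The first element of the result flips between v5 and v6.1 while the overall pass flag does not, which matches the "failure mode changed" note in the comparison table.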
To actually predict status keys correctly, v6.2 would need to add status
targets to the vocab (e.g., per-parent min_freq override) or train on a
status-rich corpus. See docs/v6-plan.md.
v6.1 validates that Lever 1 is a real fix for a real bug. It is not, on its own, sufficient to make the model competent at status-side prediction — that requires vocab work. The structural-test arc (6/9 → 9/9) is the strongest signal that the bug fix worked as intended.
# Same architecture as v5, only difference is skip_unk_targets=True (default)
PYTHONPATH=. python scripts/train.py \
--max-docs 0 --epochs 30 --batch-size 32 \
--vocab-min-freq 100 --seed 42 \
--output-dir output_v6.1_lever1_only_seed42
# Eval
python model_tests/test_capabilities.py output_v6.1.../checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v6.1.../vocab.json
python model_tests/test_structural.py output_v6.1.../checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v6.1.../vocab.json
python model_tests/test_bigger_boat.py output_v6.1.../checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v6.1.../vocab.json --show-passes
v7 keeps v6.1's architecture and Lever 1 fix, adds three vocabulary-side improvements:
- Lever 5 depth cap. Truncate trees at depth 9; deeper nodes (rare, ~1% of corpus) are dropped from training. Reduces noise from edge cases.
- Status vocab exemption. Status-side keys (e.g., replicas, conditions, availableReplicas, currentMetrics) bypass the frequency filter: they are always included in the target vocab even if individually rare. Directly addresses v6.1's "status keys absent from vocab" limitation flagged in section 7.
- Per-category min_freq. Different thresholds for different target types: keys/values=100, simple_target=5, kind_target=2. Net effect is a much larger target vocabulary (the model grows from ~7.8M to 13.4M parameters).
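The last two changes amount to a frequency filter with per-category thresholds and an exemption set. The sketch below illustrates that idea under assumptions: `build_vocab`, the `parent::key` target format, and the toy counts are hypothetical and do not reproduce the project's vocabulary builder; only the threshold values come from the list above.

```python
from collections import Counter

# Thresholds from the list above; the category names are illustrative.
MIN_FREQ = {"key": 100, "value": 100, "simple_target": 5, "kind_target": 2}

def build_vocab(counts, category, exempt=frozenset()):
    """Keep tokens that meet the category threshold or are explicitly exempt."""
    threshold = MIN_FREQ[category]
    return sorted(tok for tok, n in counts.items()
                  if n >= threshold or tok in exempt)

# Toy counts (made up): one common spec target and two rare targets.
simple_target_counts = Counter({
    "spec::replicas": 500,
    "status::availableReplicas": 3,
    "spec::rareField": 4,
})
status_exempt = frozenset({"status::availableReplicas"})

without_exemption = build_vocab(simple_target_counts, "simple_target")
with_exemption = build_vocab(simple_target_counts, "simple_target",
                             exempt=status_exempt)
print(without_exemption)
print(with_exemption)
```

In the toy data the rare status target survives only through the exemption, while an equally rare spec-side target is still filtered out.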
v7 final loss: 0.9722 (kind: 0.4875 | simple: 0.4847). Higher than v6.1 in absolute terms but not directly comparable — v7's target vocab is much larger (more classes to predict over), so the same probability mass spread across more outputs gives higher cross-entropy. Pass-rates on capability and downstream tests are the right comparison.
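A toy calculation shows why a larger output vocabulary raises cross-entropy even when the model is no worse. The numbers here are invented for illustration and are not measurements from v6.1 or v7; both helper functions are hypothetical.

```python
import math

def uniform_cross_entropy(num_classes):
    """Loss of a predictor that spreads probability evenly over all classes."""
    return math.log(num_classes)

def shared_mass_cross_entropy(mass, num_candidates):
    """Loss when `mass` is split evenly over plausible candidates,
    exactly one of which is the correct target."""
    return -math.log(mass / num_candidates)

# Same total mass on plausible answers, but more candidates to split it over.
small_head = shared_mass_cross_entropy(0.9, 2)
large_head = shared_mass_cross_entropy(0.9, 4)
print(f"2 candidates: {small_head:.4f}")
print(f"4 candidates: {large_head:.4f}")
print(f"difference:   {large_head - small_head:.4f}")
```

Doubling the number of equally plausible candidates adds ln 2 to the loss regardless of the total mass, so loss values across different vocabulary sizes measure different things.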
| Metric | v6.1 | v7 | Delta |
|---|---|---|---|
| Pretrain capability tests | 92 / 93 | 93 / 93 | +1 |
| Fine-tune capability tests | 24 / 28 | 27 / 28 | +3 |
| Structural tests | 9 / 9 | 8 / 9 | −1 (missing-metadata edge case) |
| Bigger boat — vocab_gap (status) | 0 / 4 | 3 / 4 | +3 (status vocab exemption worked) |
| Bigger boat — crd_pollution | 4 / 4 | 4 / 4 | 0 |
| Bigger boat — annotation_keys | 2 / 2 | 2 / 2 | 0 |
| Bigger boat — confidence_calib | 3 / 3 | 2 / 3 | −1 (overconfident on ambiguous position) |
| Doc similarity (off-diagonal) | ~0.46 | 0.46 | 0 |
| Net (capability tests: pretrain + fine-tune) | 116/121 | 120/121 | +4 |
The bigger-boat vocab_gap test asks the model to predict status keys under various Kubernetes types. v6.1 always failed (those keys weren't in the target vocab). v7 now passes 3 of 4:
Deployment.status.replicas:
v6.1: [UNK] in top-1 (vocab gap)
v7: 'readyReplicas' (77.5%), 'updatedReplicas' (10.7%), ...
→ 'replicas' in top-5 ✓
HPA.status.currentMetrics:
v7: 'desiredReplicas' (53.2%), 'minReplicas' (20.4%), ...
→ 'metrics' / 'currentMetrics' present ✓
Service.status.loadBalancer.ingress:
v7: 'ip' (43.9%), 'nodePort' (28.6%), ...
→ 'ip' is a valid status-ingress field ✓
The one vocab_gap test that still fails (Pod status.conditions) suggests the model has learned status keys but is still uncertain about which Pod-specific status fields exist — a weaker form of the original gap.
- Structural test 6 (predict 'metadata' when removed). v6.1 passed; v7 predicts 'kind' with 100% confidence. The expanded output vocab (more competing kind candidates) appears to have shifted the model's default fallback. One edge-case failure.
- Bigger boat confidence_calib (low-confidence test). On an ambiguous security-context position, v7 emits 96% confidence on allowPrivilegeEscalation where v6.1 was appropriately uncertain. Calibration degraded slightly with the larger output head.
Despite the strict "no-regressions" gate originally targeted, v7 was
deployed to the HF Space (vimalk78/yaml-bert) given net +4 tests, +3
on the high-priority vocab_gap (the main v7 motivation), and only two
edge-case regressions. Decision recorded as "Override gate — deploy v7."
PYTHONPATH=. python scripts/train.py \
--max-docs 0 --epochs 30 --batch-size 32 \
--vocab-min-freq 100 \
--simple-target-min-freq 5 --kind-target-min-freq 2 \
--seed 42 \
--output-dir output_v7_seed42
# Eval
python model_tests/test_capabilities.py output_v7_seed42/checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v7_seed42/vocab.json
python model_tests/test_structural.py output_v7_seed42/checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v7_seed42/vocab.json
python model_tests/test_bigger_boat.py output_v7_seed42/checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v7_seed42/vocab.json --show-passes
python scripts/test_similarity.py output_v7_seed42/checkpoints/yaml_bert_v4_epoch_30.pt --vocab output_v7_seed42/vocab.json