03 Micro-World Semantics (Main Positive Result)

Objective

Evaluate three-way semantic class structure (True/False/Unknown) in a controlled, procedurally generated language setting with held-out lexicons and templates, and compare what the hidden-state representations encode with what the decoder actually outputs.

This track exists because the earlier global-trace topology direction was too confounded by length, verbosity, and cap behavior on natural math traces.

Inputs

  • Procedural dataset generated by scripts/generate_micro_world_dataset.py.
  • Inference manifests from Qwen and Gemma families.
  • Verdict-region hidden-state extracts (verdict_states.npz) and logits summaries (logits_summary.npz).

Why this design was chosen

The micro-world design isolates semantic evaluation while keeping natural-language form:

  • nonce objects/attributes/relations (reduced memorization),
  • generated latent worlds with exact truth conditions,
  • explicit Unknown (non-entailment), not only binary truth/falsity,
  • held-out templates and held-out split lexicons.

This allows testing whether semantic classes are represented internally even when decoder behavior is imperfect.
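The three-way labeling described above can be sketched as follows. This is an illustration only, not the logic of scripts/generate_micro_world_dataset.py: the function names, the nonce words, and the `observed_frac` parameter are all hypothetical. It shows one way a latent world with exact truth conditions can yield an explicit Unknown label when the premises do not settle a claim.

```python
import random

def make_world(rng, objects, attributes, observed_frac=0.6):
    """Give every (object, attribute) pair a hidden truth value, then reveal a subset."""
    latent = {(o, a): rng.random() < 0.5 for o in objects for a in attributes}
    pairs = sorted(latent)
    revealed = rng.sample(pairs, k=int(len(pairs) * observed_frac))
    # Only the revealed facts would appear in the premise text shown to a model.
    return {pair: latent[pair] for pair in revealed}

def gold_label(premises, obj, attr, claimed_value=True):
    """True/False when the premises settle the claim, Unknown when they do not."""
    if (obj, attr) not in premises:
        return "Unknown"  # non-entailment: neither the claim nor its negation follows
    return "True" if premises[(obj, attr)] == claimed_value else "False"

# Hypothetical nonce lexicon; a held-out split would use a disjoint set of words.
objects = ["blick", "dax", "wug"]
attributes = ["zorpy", "fendle"]

rng = random.Random(1729)
premises = make_world(rng, objects, attributes)
labels = [gold_label(premises, o, a) for o in objects for a in attributes]
```

Because the label is computed from the latent world rather than annotated by hand, every Unknown in such a setup is Unknown by construction, which is what makes the class usable as a probe target.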

Artifact generation pipeline

Step 1: dataset generation

python3 scripts/generate_micro_world_dataset.py \
  --out-dir artifacts/micro_world_v1/dataset \
  --seed 1729 \
  --train-worlds 100 \
  --dev-worlds 25 \
  --eval-worlds 100 \
  --props-per-world 9 \
  --paraphrases-per-prop 8

Primary outputs:

  • train.jsonl, dev.jsonl, eval.jsonl
  • audit.csv, world_summary.csv, manifest.json

Step 2: model inference (example command pattern)

python3 scripts/run_micro_world_inference.py \
  --model-id Qwen/Qwen3.5-2B \
  --dataset artifacts/micro_world_v1/dataset/eval.jsonl \
  --artifact-dir artifacts/micro_world_v1/generations/Qwen__Qwen3_5_2B_eval_full \
  --max-new-tokens 4 \
  --resume-skip-existing

Important switches used in controls:

  • --force-raw-prompt
  • --constrained-label-decoding
  • --prompt-variant base_label (for base-model-compatible prompt format)
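To make the constrained-decoding control concrete, here is a minimal sketch of the idea behind restricting the verdict to the label set. It is not the implementation behind --constrained-label-decoding, which is not shown in this document; the function names and the toy logits are hypothetical, and the sketch assumes each label is a single token, which may not hold for a real tokenizer.

```python
LABELS = ("True", "False", "Unknown")

def unconstrained_token(logits_by_token):
    """Plain greedy choice over the whole (toy) vocabulary."""
    return max(logits_by_token, key=logits_by_token.get)

def constrained_label(logits_by_token, labels=LABELS):
    """Greedy choice restricted to label tokens; every other token is ignored."""
    candidates = {tok: score for tok, score in logits_by_token.items() if tok in labels}
    if not candidates:
        raise ValueError("no label token present in logits")
    return max(candidates, key=candidates.get)

# Made-up logits for one verdict step, for illustration only.
step_logits = {"The": 5.1, "True": 3.2, "False": 2.9, "Unknown": 3.0}

free_choice = unconstrained_token(step_logits)   # "The"
label_choice = constrained_label(step_logits)    # "True"
```

The control matters because it separates two failure modes: a model that drifts into non-label text, and a model that prefers the wrong label even when only labels are allowed.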

Each example folder contains:

  • sample.json
  • verdict_states.npz
  • logits_summary.npz

Each run directory also contains a run-level manifest.csv.
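The array names and shapes stored in the .npz files are not listed in this document, so a reasonable first step is to inspect an archive before writing analysis code against it. The helper below is a small sketch using only `numpy.load`; the name `describe_npz` is hypothetical.

```python
import numpy as np

def describe_npz(path_or_file):
    """Map each array name in an .npz archive to its (shape, dtype)."""
    with np.load(path_or_file) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }
```

Calling it on one example's verdict_states.npz and logits_summary.npz shows which keys are present, without assuming any particular layout.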

Step 3: geometry analysis

python3 scripts/analyze_micro_world_geometry.py \
  --manifest artifacts/micro_world_v1/generations/<RUN>/manifest.csv \
  --out-dir artifacts/micro_world_v1/analysis_<RUN>

Outputs include:

  • classification_summary.csv
  • classification_by_label.csv
  • confusion_matrix.csv
  • within_world_geometry_summary.csv
  • sign_test_summary.csv
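The document does not state which quantity feeds sign_test_summary.csv. As a sketch of the general technique, the exact two-sided sign test below takes one signed difference per world (for example, a between-class distance minus a within-class distance, which is an assumption here) and asks whether positive differences outnumber negative ones more than chance would allow.

```python
from math import comb

def sign_test_p_value(differences):
    """Exact two-sided sign test; zero differences (ties) are dropped."""
    pos = sum(d > 0 for d in differences)
    neg = sum(d < 0 for d in differences)
    n = pos + neg
    if n == 0:
        return 1.0
    k = min(pos, neg)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With nine positive differences and one negative, the p-value is 22/1024, about 0.021. The test uses only the sign of each per-world difference, so a few worlds with extreme magnitudes cannot dominate the result.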

Step 4: probe analysis (train worlds -> held-out eval worlds)

python3 scripts/run_micro_world_probe.py \
  --train-manifest artifacts/micro_world_v1/generations/<TRAIN_RUN>/manifest.csv \
  --test-manifest artifacts/micro_world_v1/generations/<EVAL_RUN>/manifest.csv \
  --state-keys final_prompt verdict_token verdict_span_mean \
  --out-dir artifacts/micro_world_v1/probe_<RUN>

Outputs:

  • probe_summary.csv
  • probe_by_label.csv
  • probe_confusion.csv
  • decoder_baseline_eval.csv
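The train-worlds to held-out-eval-worlds protocol can be illustrated with a deliberately simple probe. The sketch below swaps in a nearest-centroid classifier on synthetic vectors; it is not the classifier used by scripts/run_micro_world_probe.py, and `toy_states`, `fit_centroids`, and `predict` are hypothetical names. The point it shows is that the probe is fit on states from training worlds only and scored on states from worlds it has never seen.

```python
import numpy as np

NAMES = ["True", "False", "Unknown"]

def toy_states(rng, n_per_class, dim=16, separation=8.0):
    """Synthetic stand-in for hidden states: one Gaussian cluster per label."""
    means = separation * np.eye(len(NAMES), dim)
    x = np.concatenate([means[i] + rng.normal(size=(n_per_class, dim))
                        for i in range(len(NAMES))])
    y = [NAMES[i] for i in range(len(NAMES)) for _ in range(n_per_class)]
    return x, y

def fit_centroids(states, labels):
    """Return {label: mean state}, computed on training data only."""
    labels = np.asarray(labels)
    return {str(lab): states[labels == lab].mean(axis=0) for lab in np.unique(labels)}

def predict(centroids, states):
    """Assign each state to the label of its nearest centroid."""
    names = list(centroids)
    stacked = np.stack([centroids[n] for n in names])            # (classes, dim)
    dists = np.linalg.norm(states[:, None, :] - stacked[None], axis=2)
    return [names[i] for i in dists.argmin(axis=1)]

rng = np.random.default_rng(0)
train_x, train_y = toy_states(rng, 30)   # plays the role of train worlds
eval_x, eval_y = toy_states(rng, 20)     # plays the role of held-out eval worlds
centroids = fit_centroids(train_x, train_y)
pred = predict(centroids, eval_x)
accuracy = float(np.mean([p == g for p, g in zip(pred, eval_y)]))
```

Comparing a probe's held-out accuracy with the decoder's own accuracy on the same examples is what turns "the information is in the states" into a measurable representation/readout gap.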

Step 5: verdict-step label-logit analysis

python3 scripts/analyze_micro_world_label_logits.py \
  --model-id google/gemma-3-4b-it \
  --manifest artifacts/micro_world_v1/generations/<RUN>/manifest.csv \
  --out-dir artifacts/micro_world_v1/label_logits_<RUN>

Outputs:

  • label_logits_summary.csv
  • unknown_gold_decoder_nonunknown.csv
  • label_logits_by_gold.csv
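One way to quantify how competitive Unknown is at the verdict step is a margin between the Unknown logit and the best rival label logit. The definition below is an assumption made for illustration; the document does not give the formula used by scripts/analyze_micro_world_label_logits.py, and both function names are hypothetical.

```python
def unknown_margin(label_logits):
    """Unknown logit minus the best competing label logit.

    Negative values mean another label outscores Unknown at the verdict step.
    """
    rivals = [v for k, v in label_logits.items() if k != "Unknown"]
    return label_logits["Unknown"] - max(rivals)

def mean_margin_by_gold(rows):
    """rows: iterable of (gold_label, label_logits). Returns {gold: mean margin}."""
    sums, counts = {}, {}
    for gold, logits in rows:
        sums[gold] = sums.get(gold, 0.0) + unknown_margin(logits)
        counts[gold] = counts.get(gold, 0) + 1
    return {gold: sums[gold] / counts[gold] for gold in sums}
```

Grouping by gold label is the informative part: a small negative margin on gold-Unknown items suggests Unknown was a near miss, while a large negative margin suggests the decoder never treated it as a serious candidate.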

Step 6: layer sweep probes

python3 scripts/run_micro_world_layer_sweep_probe.py \
  --model-id google/gemma-3-4b-it \
  --train-manifest artifacts/micro_world_v1/generations/<TRAIN_RUN>/manifest.csv \
  --test-manifest artifacts/micro_world_v1/generations/<EVAL_RUN>/manifest.csv \
  --out-dir artifacts/micro_world_v1/layer_sweep_<RUN>

Outputs:

  • layer_sweep_summary.csv
  • layer_sweep_best.csv
  • layer_sweep_metadata.csv

Step 7: post-hoc readout intervention pilot

python3 scripts/run_readout_intervention.py \
  --out-dir artifacts/micro_world_v1/readout_intervention

Outputs:

  • aggregate_readout_intervention.csv
  • per-run intervention_summary.csv
  • per-run confusion and LOOW parameter files

Step 8: latent residual steering (pre-readout)

python3 scripts/run_latent_readout_steering.py \
  --out-dir artifacts/micro_world_v1/latent_readout_steering

Outputs:

  • aggregate_latent_steering.csv
  • per-run latent_steering_summary.csv
  • per-run unknown_margin_by_alpha.csv
  • per-run unknown_direction.npz, LOOW alpha files
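The arithmetic behind steering along an Unknown direction can be sketched in a few lines. Everything here is an assumption: the difference-of-means construction, the unit normalization, and the function names are illustrative, and the actual script applies its edit inside the model's forward pass, which this sketch does not do.

```python
import numpy as np

def unknown_direction(states, labels):
    """Unit vector from the mean non-Unknown state toward the mean Unknown state."""
    labels = np.asarray(labels)
    is_unknown = labels == "Unknown"
    diff = states[is_unknown].mean(axis=0) - states[~is_unknown].mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Shift every row of `hidden` by alpha along `direction`."""
    return hidden + alpha * direction
```

Since the direction has unit norm, each state's projection onto it grows by exactly alpha, so sweeping alpha gives a one-parameter family of interventions whose effect on an Unknown margin can be tabulated per value.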

Step 9: nonlinear probe sensitivity (shallow MLP)

python3 scripts/run_mlp_probe_sensitivity.py \
  --out-dir artifacts/micro_world_v1/probe_mlp_sensitivity

Outputs:

  • comparison_probe_states_mlp.csv
  • comparison_probe_linear_vs_mlp.csv
  • per-run probe_summary_mlp.csv

Consolidated comparison artifacts

These files aggregate the reported cross-run results:

  • artifacts/micro_world_v1/comparison_decoder_qwen_gemma.csv
  • artifacts/micro_world_v1/comparison_probe_states_qwen_gemma.csv
  • artifacts/micro_world_v1/comparison_decoder_constrained_vs_unconstrained_qwen_gemma.csv
  • artifacts/micro_world_v1/comparison_gemma_base_prompt_rerun.csv
  • artifacts/micro_world_v1/comparison_label_logits_gemma_it_vs_pt_basefmt.csv
  • artifacts/micro_world_v1/comparison_layer_sweep_gemma_it_vs_pt_basefmt.csv
  • artifacts/micro_world_v1/readout_intervention/aggregate_readout_intervention.csv
  • artifacts/micro_world_v1/latent_readout_steering/aggregate_latent_steering.csv
  • artifacts/micro_world_v1/probe_mlp_sensitivity/comparison_probe_linear_vs_mlp.csv

Controls run on this track

  1. Cross-family replication (Qwen and Gemma).
  2. Constrained decoding (True/False/Unknown only).
  3. Prompt-path controls (raw prompt path).
  4. Base-vs-instruct with base-specific format repair.
  5. Verdict-step logit competitiveness analysis.
  6. Layer-sweep probes for Unknown recoverability.
  7. Post-hoc readout intervention pilot.
  8. Pre-readout latent residual steering.
  9. Nonlinear probe sensitivity (MLP vs linear).

What this track supports

  • The decoder under-expresses Unknown: it emits the Unknown label less often than the gold labels warrant.
  • Unknown is recoverable from verdict-region hidden states.
  • The representation/readout gap replicates across families and controls.
  • Minimal latent steering improves aggregate metrics but does not uniformly recover Unknown.
  • Nonlinear probes recover substantially more Unknown signal in settings where linear probes struggle (notably Qwen3.5-4B no-think).
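The first three claims come down to comparing per-label recall for two predictors on the same examples: the decoder's emitted label and a probe's prediction from hidden states. The sketch below shows that comparison on placeholder inputs; the helper names are hypothetical and the toy lists are not results from this track.

```python
from collections import Counter

LABELS = ("True", "False", "Unknown")

def confusion(gold, pred):
    """Counter keyed by (gold label, predicted label)."""
    return Counter(zip(gold, pred))

def recall_by_label(gold, pred, labels=LABELS):
    """Fraction of each gold class that the predictor labels correctly."""
    counts = confusion(gold, pred)
    out = {}
    for lab in labels:
        total = sum(1 for g in gold if g == lab)
        out[lab] = counts[(lab, lab)] / total if total else float("nan")
    return out

def unknown_gap(gold, decoder_pred, probe_pred):
    """Probe recall minus decoder recall on gold-Unknown items."""
    return (recall_by_label(gold, probe_pred)["Unknown"]
            - recall_by_label(gold, decoder_pred)["Unknown"])

# Placeholder inputs for illustration only.
gold = ["True", "False", "Unknown", "Unknown"]
decoder_pred = ["True", "False", "True", "Unknown"]
probe_pred = ["True", "False", "Unknown", "Unknown"]
gap = unknown_gap(gold, decoder_pred, probe_pred)
```

A positive gap on held-out worlds is the pattern the bullets above describe; overall accuracy alone can hide it when Unknown is a minority class.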

What this track does not claim

  • No universal geometry-of-truth scalar.
  • No proof that topology alone is the best predictor.
  • No full causal decomposition of the readout bottleneck.

Conclusion

This track provides the main positive claim: hidden states encode non-entailment signal more strongly than decoder outputs express it.