Evaluate semantic class structure (True/False/Unknown) in a controlled procedural language setting with held-out lexicons/templates, and test representation vs decoder behavior.
This track exists because the earlier global-trace topology direction was too confounded by length, verbosity, and cap behavior on natural math traces.
- Procedural dataset generated by scripts/generate_micro_world_dataset.py.
- Inference manifests from Qwen and Gemma families.
- Verdict-region hidden-state extracts (verdict_states.npz) and logits summaries (logits_summary.npz).
The micro-world design isolates semantic evaluation while keeping natural-language form:
- nonce objects/attributes/relations (reduced memorization),
- generated latent worlds with exact truth conditions,
- explicit Unknown (non-entailment), not only binary truth/falsity,
- held-out templates and held-out split lexicons.
This allows testing whether semantic classes are represented internally even when decoder behavior is imperfect.
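The three-way labeling can be illustrated with a minimal sketch. It is not the generator's actual logic: the fact table, the `evaluate` helper, and the nonce words are all assumptions made up for this example. A world fixes some object/attribute pairs as true or false, and any pair it leaves open yields Unknown.

```python
from typing import Dict, Tuple

# A toy latent world: maps (object, attribute) -> bool.
# Pairs that are absent are undetermined by the world.
Fact = Tuple[str, str]


def evaluate(world: Dict[Fact, bool], obj: str, attr: str) -> str:
    """Return 'True', 'False', or 'Unknown' for the proposition 'obj is attr'."""
    value = world.get((obj, attr))
    if value is None:
        return "Unknown"  # the world neither entails nor contradicts the claim
    return "True" if value else "False"


# Nonce vocabulary, invented here for illustration only.
world = {("blick", "zorpy"): True, ("dax", "zorpy"): False}

labels = [evaluate(world, o, "zorpy") for o in ("blick", "dax", "wug")]
print(labels)  # ['True', 'False', 'Unknown']
```

Because the label is computed from the world rather than annotated by hand, every paraphrase of the same proposition shares one exact gold label.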
python3 scripts/generate_micro_world_dataset.py \
--out-dir artifacts/micro_world_v1/dataset \
--seed 1729 \
--train-worlds 100 \
--dev-worlds 25 \
--eval-worlds 100 \
--props-per-world 9 \
--paraphrases-per-prop 8
Primary outputs:
- train.jsonl, dev.jsonl, eval.jsonl
- audit.csv, world_summary.csv, manifest.json
python3 scripts/run_micro_world_inference.py \
--model-id Qwen/Qwen3.5-2B \
--dataset artifacts/micro_world_v1/dataset/eval.jsonl \
--artifact-dir artifacts/micro_world_v1/generations/Qwen__Qwen3_5_2B_eval_full \
--max-new-tokens 4 \
--resume-skip-existing
Important switches used in controls:
- --force-raw-prompt
- --constrained-label-decoding
- --prompt-variant base_label (for base-model-compatible prompt format)
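As a rough illustration of what constrained label decoding means, the sketch below restricts the verdict choice to the three label strings. It is an assumption-laden toy: the real script presumably works on token ids and model logits, and the `constrained_label` helper and the example scores are invented here.

```python
from typing import Dict, Sequence

LABELS = ("True", "False", "Unknown")


def constrained_label(logits: Dict[str, float], labels: Sequence[str] = LABELS) -> str:
    """Pick the highest-scoring label, ignoring every non-label token."""
    candidates = {tok: score for tok, score in logits.items() if tok in labels}
    if not candidates:
        raise ValueError("no label token present in logits")
    return max(candidates, key=lambda tok: candidates[tok])


# Invented next-token scores at the verdict step.
step_logits = {"The": 5.1, "True": 2.0, "False": 1.2, "Unknown": 2.7}

unconstrained = max(step_logits, key=lambda tok: step_logits[tok])
constrained = constrained_label(step_logits)
print(unconstrained, constrained)  # The Unknown
```

The control separates two failure modes: a model that drifts into non-label text, and a model that ranks the wrong label highest even when only labels are allowed.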
Each example folder contains:
- sample.json
- verdict_states.npz
- logits_summary.npz
The run directory also holds a run-level manifest.csv.
python3 scripts/analyze_micro_world_geometry.py \
--manifest artifacts/micro_world_v1/generations/<RUN>/manifest.csv \
--out-dir artifacts/micro_world_v1/analysis_<RUN>
Outputs include:
- classification_summary.csv
- classification_by_label.csv
- confusion_matrix.csv
- within_world_geometry_summary.csv
- sign_test_summary.csv
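For readers unfamiliar with the test behind a file like sign_test_summary.csv, here is a minimal sketch of a two-sided exact sign test. Which quantities the analysis script actually pairs is not stated in this document; the per-world differences below are an assumption, and `sign_test_p` is a hypothetical helper.

```python
from math import comb
from typing import Iterable


def sign_test_p(diffs: Iterable[float]) -> float:
    """Two-sided exact sign test p-value; zero differences are dropped."""
    signs = [d > 0 for d in diffs if d != 0]
    n = len(signs)
    if n == 0:
        return 1.0
    k = sum(signs)          # number of positive differences
    tail = min(k, n - k)    # size of the smaller side
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


# Invented per-world differences, e.g. one geometry statistic minus another.
diffs = [0.4, 0.1, 0.3, 0.2, 0.5]
print(sign_test_p(diffs))  # 0.0625
```

The sign test uses only the direction of each difference, so it makes no assumption about how the differences are distributed.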
python3 scripts/run_micro_world_probe.py \
--train-manifest artifacts/micro_world_v1/generations/<TRAIN_RUN>/manifest.csv \
--test-manifest artifacts/micro_world_v1/generations/<EVAL_RUN>/manifest.csv \
--state-keys final_prompt verdict_token verdict_span_mean \
--out-dir artifacts/micro_world_v1/probe_<RUN>
Outputs:
- probe_summary.csv
- probe_by_label.csv
- probe_confusion.csv
- decoder_baseline_eval.csv
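The probe and decoder-baseline outputs are most informative when compared label by label. The sketch below computes per-label recall for two sets of predictions; the predictions are invented, and `recall_by_label` is a hypothetical helper, not a function from the repository.

```python
from collections import Counter
from typing import Dict, Sequence


def recall_by_label(gold: Sequence[str], pred: Sequence[str]) -> Dict[str, float]:
    """Fraction of gold items of each label that were predicted as that label."""
    total = Counter(gold)
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    return {label: hits[label] / total[label] for label in total}


# Invented toy predictions, for illustration only.
gold = ["True", "False", "Unknown", "Unknown", "Unknown", "Unknown"]
decoder = ["True", "False", "False", "True", "Unknown", "False"]
probe = ["True", "False", "Unknown", "Unknown", "Unknown", "False"]

decoder_recall = recall_by_label(gold, decoder)
probe_recall = recall_by_label(gold, probe)
gap = probe_recall["Unknown"] - decoder_recall["Unknown"]
print(decoder_recall["Unknown"], probe_recall["Unknown"], gap)  # 0.25 0.75 0.5
```

A positive gap on Unknown means the probe reads the class from hidden states more often than the decoder emits it.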
python3 scripts/analyze_micro_world_label_logits.py \
--model-id google/gemma-3-4b-it \
--manifest artifacts/micro_world_v1/generations/<RUN>/manifest.csv \
--out-dir artifacts/micro_world_v1/label_logits_<RUN>
Outputs:
- label_logits_summary.csv
- unknown_gold_decoder_nonunknown.csv
- label_logits_by_gold.csv
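One way to ask whether Unknown was competitive at the verdict step is a logit margin. The sketch assumes the margin is the Unknown logit minus the larger of the True and False logits; that definition, the row layout, and the near-miss threshold of 1.0 are illustrative assumptions, not taken from the script.

```python
from typing import Dict


def unknown_margin(label_logits: Dict[str, float]) -> float:
    """Unknown logit minus the best competing label logit (negative = Unknown loses)."""
    return label_logits["Unknown"] - max(label_logits["True"], label_logits["False"])


# Invented verdict-step label logits.
rows = [
    {"gold": "Unknown", "logits": {"True": 3.0, "False": 1.0, "Unknown": 2.5}},
    {"gold": "Unknown", "logits": {"True": 0.5, "False": 4.0, "Unknown": 0.0}},
    {"gold": "True", "logits": {"True": 5.0, "False": 1.0, "Unknown": 2.0}},
]

# Margins for gold-Unknown items only.
margins = [unknown_margin(r["logits"]) for r in rows if r["gold"] == "Unknown"]

# Near misses: Unknown lost, but by less than a hypothetical threshold of 1.0.
near_miss = [m for m in margins if -1.0 < m < 0]
print(margins, near_miss)  # [-0.5, -4.0] [-0.5]
```

A small negative margin suggests the decoder nearly chose Unknown, while a large one suggests the label was never in contention.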
python3 scripts/run_micro_world_layer_sweep_probe.py \
--model-id google/gemma-3-4b-it \
--train-manifest artifacts/micro_world_v1/generations/<TRAIN_RUN>/manifest.csv \
--test-manifest artifacts/micro_world_v1/generations/<EVAL_RUN>/manifest.csv \
--out-dir artifacts/micro_world_v1/layer_sweep_<RUN>
Outputs:
- layer_sweep_summary.csv
- layer_sweep_best.csv
- layer_sweep_metadata.csv
python3 scripts/run_readout_intervention.py \
--out-dir artifacts/micro_world_v1/readout_intervention
Outputs:
- aggregate_readout_intervention.csv
- per-run intervention_summary.csv
- per-run confusion and LOOW parameter files
python3 scripts/run_latent_readout_steering.py \
--out-dir artifacts/micro_world_v1/latent_readout_steering
Outputs:
- aggregate_latent_steering.csv
- per-run latent_steering_summary.csv
- per-run unknown_margin_by_alpha.csv
- per-run unknown_direction.npz, LOOW alpha files
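The idea of steering along an Unknown direction can be sketched in a few lines. Everything concrete here is an assumption: the two-dimensional hidden state, the direction vector, the linear readout weights, and the alpha grid are invented, and how the script estimates the direction or selects alpha is not described in this document.

```python
from typing import Dict, List


def steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Shift the hidden state along the direction: h' = h + alpha * d."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def readout(hidden: List[float], weights: Dict[str, List[float]]) -> Dict[str, float]:
    """Linear readout: one dot product per label."""
    return {lab: sum(w * h for w, h in zip(row, hidden)) for lab, row in weights.items()}


# Invented toy state, direction, and readout weights.
hidden = [1.0, 0.0]
direction = [0.0, 1.0]
weights = {"True": [1.0, 0.0], "False": [-1.0, 0.0], "Unknown": [0.0, 1.0]}

margin_by_alpha = {}
for alpha in (0.0, 2.0):
    logits = readout(steer(hidden, direction, alpha), weights)
    margin_by_alpha[alpha] = logits["Unknown"] - max(logits["True"], logits["False"])

print(margin_by_alpha)  # {0.0: -1.0, 2.0: 1.0}
```

In this toy case the margin flips sign between alpha 0 and alpha 2, which is the kind of curve a per-alpha margin table would record.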
python3 scripts/run_mlp_probe_sensitivity.py \
--out-dir artifacts/micro_world_v1/probe_mlp_sensitivity
Outputs:
- comparison_probe_states_mlp.csv
- comparison_probe_linear_vs_mlp.csv
- per-run probe_summary_mlp.csv
These files aggregate the reported cross-run results:
- artifacts/micro_world_v1/comparison_decoder_qwen_gemma.csv
- artifacts/micro_world_v1/comparison_probe_states_qwen_gemma.csv
- artifacts/micro_world_v1/comparison_decoder_constrained_vs_unconstrained_qwen_gemma.csv
- artifacts/micro_world_v1/comparison_gemma_base_prompt_rerun.csv
- artifacts/micro_world_v1/comparison_label_logits_gemma_it_vs_pt_basefmt.csv
- artifacts/micro_world_v1/comparison_layer_sweep_gemma_it_vs_pt_basefmt.csv
- artifacts/micro_world_v1/readout_intervention/aggregate_readout_intervention.csv
- artifacts/micro_world_v1/latent_readout_steering/aggregate_latent_steering.csv
- artifacts/micro_world_v1/probe_mlp_sensitivity/comparison_probe_linear_vs_mlp.csv
- Cross-family replication (Qwen and Gemma).
- Constrained decoding (True/False/Unknown only).
- Prompt-path controls (raw prompt path).
- Base-vs-instruct with base-specific format repair.
- Verdict-step logit competitiveness analysis.
- Layer-sweep probes for Unknown recoverability.
- Post-hoc readout intervention pilot.
- Pre-readout latent residual steering.
- Nonlinear probe sensitivity (MLP vs linear).
- The decoder under-expresses Unknown: it emits the label less often than the gold labels warrant.
- Unknown is recoverable from verdict-region hidden states by probes.
- The representation/readout gap replicates across families and controls.
- Minimal latent steering improves aggregate metrics but does not uniformly recover Unknown.
- Nonlinear probes recover substantially more Unknown signal in hard linear settings (notably Qwen3.5-4B no-think).
- No universal geometry-of-truth scalar.
- No proof that topology alone is the best predictor.
- No full causal decomposition of the readout bottleneck.
This track provides the main positive claim: hidden states encode non-entailment signal more strongly than decoder outputs express it.