03 Micro-World Semantics (Main Positive Result)

Objective

Evaluate semantic class structure (True/False/Unknown) in a controlled procedural language setting with held-out lexicons and templates, and test whether verdict-region hidden states separate these classes better than the decoder's output labels do.

This track exists because the earlier global-trace topology direction was too confounded by length, verbosity, and cap behavior on natural math traces.

Inputs

Procedural dataset generated by scripts/generate_micro_world_dataset.py.
Inference manifests from Qwen and Gemma families.
Verdict-region hidden-state extracts (verdict_states.npz) and logits summaries (logits_summary.npz).

Why this design was chosen

The micro-world design isolates semantic evaluation while keeping natural-language form:

nonce objects/attributes/relations (reduced memorization),
generated latent worlds with exact truth conditions,
explicit Unknown (non-entailment), not only binary truth/falsity,
held-out templates and held-out split lexicons.

This allows testing whether semantic classes are represented internally even when decoder behavior is imperfect.
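As a rough illustration of "exact truth conditions" with an explicit Unknown class, the sketch below labels propositions against a tiny latent world. It is not taken from scripts/generate_micro_world_dataset.py: the world representation, the nonce words, and the function name `evaluate` are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical latent world: known facts map (object, attribute) to a bool.
# Pairs absent from the mapping are not entailed either way.
World = Dict[Tuple[str, str], bool]

def evaluate(world: World, obj: str, attr: str) -> str:
    """Return 'True', 'False', or 'Unknown' for the proposition 'obj is attr'."""
    if (obj, attr) not in world:
        return "Unknown"  # non-entailment: the world is silent on this pair
    return "True" if world[(obj, attr)] else "False"

# Nonce vocabulary, invented here for illustration.
world = {("blick", "zorpy"): True, ("dax", "zorpy"): False}
labels = [evaluate(world, obj, "zorpy") for obj in ("blick", "dax", "wug")]
print(labels)
```

The point of the three-way return value is that Unknown is a first-class gold label rather than a fallback for parse failures.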

Artifact generation pipeline

Step 1: dataset generation

python3 scripts/generate_micro_world_dataset.py \
  --out-dir artifacts/micro_world_v1/dataset \
  --seed 1729 \
  --train-worlds 100 \
  --dev-worlds 25 \
  --eval-worlds 100 \
  --props-per-world 9 \
  --paraphrases-per-prop 8

Primary outputs:

train.jsonl, dev.jsonl, eval.jsonl
audit.csv, world_summary.csv, manifest.json
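A quick sanity check on split sizes can be derived from the flags above. This is a sketch under an assumption the README does not state: that every world yields exactly `--props-per-world` propositions and every proposition is rendered with exactly `--paraphrases-per-prop` paraphrases. If the generator filters or deduplicates, the real row counts will be lower.

```python
# Flag values copied from the Step 1 command.
worlds = {"train": 100, "dev": 25, "eval": 100}
props_per_world = 9
paraphrases_per_prop = 8

# Assumption: full cross product of worlds x propositions x paraphrases.
expected_rows = {
    split: n_worlds * props_per_world * paraphrases_per_prop
    for split, n_worlds in worlds.items()
}
print(expected_rows)  # {'train': 7200, 'dev': 1800, 'eval': 7200}
```

Comparing these numbers against the line counts of train.jsonl, dev.jsonl, and eval.jsonl is a cheap way to catch a truncated generation run.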

Step 2: model inference (example command pattern)

python3 scripts/run_micro_world_inference.py \
  --model-id Qwen/Qwen3.5-2B \
  --dataset artifacts/micro_world_v1/dataset/eval.jsonl \
  --artifact-dir artifacts/micro_world_v1/generations/Qwen__Qwen3_5_2B_eval_full \
  --max-new-tokens 4 \
  --resume-skip-existing

Important switches used in controls:

--force-raw-prompt
--constrained-label-decoding
--prompt-variant base_label (for base-model-compatible prompt format)

Each example folder contains:

sample.json
verdict_states.npz
logits_summary.npz

Each run also writes a run-level manifest.csv.
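For ad hoc inspection, one example folder can be loaded as sketched below. The file names come from the list above; the helper name `load_example` is hypothetical, and the array keys inside each .npz are not documented here, so the sketch simply returns whatever keys the archive contains.

```python
import json
from pathlib import Path

import numpy as np

def load_example(folder):
    """Load the three per-example files into plain Python objects."""
    folder = Path(folder)
    sample = json.loads((folder / "sample.json").read_text())
    # np.load on an .npz returns a lazy archive; copy the arrays out
    # so the file handle can be closed.
    with np.load(folder / "verdict_states.npz") as archive:
        states = {key: archive[key] for key in archive.files}
    with np.load(folder / "logits_summary.npz") as archive:
        logits = {key: archive[key] for key in archive.files}
    return sample, states, logits
```

Printing `states.keys()` and each array's shape is a reasonable first step before running the analysis scripts.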

Step 3: geometry analysis

python3 scripts/analyze_micro_world_geometry.py \
  --manifest artifacts/micro_world_v1/generations/<RUN>/manifest.csv \
  --out-dir artifacts/micro_world_v1/analysis_<RUN>

Outputs include:

classification_summary.csv
classification_by_label.csv
confusion_matrix.csv
within_world_geometry_summary.csv
sign_test_summary.csv
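The file name sign_test_summary.csv suggests a paired sign test, but this README does not say how the script forms its pairs or which variant it uses. As a reference point only, a minimal exact two-sided sign test looks like this (the function name `sign_test_p` is hypothetical):

```python
from math import comb

def sign_test_p(n_pos: int, n_neg: int) -> float:
    """Exact two-sided sign test p-value under H0: P(positive) = 0.5.

    n_pos and n_neg count paired differences with positive and negative
    sign; ties must be dropped before counting.
    """
    n = n_pos + n_neg
    if n == 0:
        return 1.0
    k = min(n_pos, n_neg)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p(8, 2))  # 0.109375
```

With 8 positive and 2 negative differences the one-sided tail is 56/1024, so the two-sided p-value is 0.109375.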

Step 4: probe analysis (train worlds -> held-out eval worlds)

python3 scripts/run_micro_world_probe.py \
  --train-manifest artifacts/micro_world_v1/generations/<TRAIN_RUN>/manifest.csv \
  --test-manifest artifacts/micro_world_v1/generations/<EVAL_RUN>/manifest.csv \
  --state-keys final_prompt verdict_token verdict_span_mean \
  --out-dir artifacts/micro_world_v1/probe_<RUN>

Outputs:

probe_summary.csv
probe_by_label.csv
probe_confusion.csv
decoder_baseline_eval.csv
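The train-worlds to held-out-eval-worlds protocol can be illustrated with a small linear probe. This is a sketch, not the code in scripts/run_micro_world_probe.py: the softmax-regression fit, the hyperparameters, and the synthetic Gaussian "hidden states" are all assumptions made for the example.

```python
import numpy as np

def fit_linear_probe(X, y, n_classes=3, lr=0.1, steps=500):
    """Fit multinomial logistic regression by full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def predict(W, b, X):
    return (X @ W + b).argmax(axis=1)

# Synthetic stand-in for hidden states: one Gaussian blob per label.
rng = np.random.default_rng(0)
centers = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0]])

def make_split(n_per_class):
    X = np.concatenate([c + rng.normal(size=(n_per_class, 2)) for c in centers])
    y = np.repeat(np.arange(3), n_per_class)
    return X, y

X_train, y_train = make_split(100)  # plays the role of train worlds
X_test, y_test = make_split(100)    # plays the role of held-out eval worlds
W, b = fit_linear_probe(X_train, y_train)
acc = (predict(W, b, X_test) == y_test).mean()
print(f"held-out probe accuracy: {acc:.3f}")
```

The key design point is that the probe never sees examples from the evaluation worlds, so its accuracy measures information in the states rather than memorized world-specific facts.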

Step 5: verdict-step label-logit analysis

python3 scripts/analyze_micro_world_label_logits.py \
  --model-id google/gemma-3-4b-it \
  --manifest artifacts/micro_world_v1/generations/<RUN>/manifest.csv \
  --out-dir artifacts/micro_world_v1/label_logits_<RUN>

Outputs:

label_logits_summary.csv
unknown_gold_decoder_nonunknown.csv
label_logits_by_gold.csv
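One simple way to summarize whether Unknown is competitive at the verdict step is a margin between the Unknown logit and the strongest competing label logit. The definition below is an assumption for illustration; the script's own metric is not specified in this README, and the example logits are placeholders rather than measured values.

```python
def unknown_margin(label_logits):
    """Unknown logit minus the best competing label logit.

    A positive margin means Unknown would win a three-way argmax over
    the labels; a negative margin means True or False would win.
    """
    competitor = max(label_logits["True"], label_logits["False"])
    return label_logits["Unknown"] - competitor

# Placeholder logits for illustration only.
example = {"True": 2.1, "False": 0.4, "Unknown": 1.6}
margin = unknown_margin(example)
print(margin < 0)  # True: Unknown loses to the 'True' label here
```

Restricting the analysis to examples whose gold label is Unknown, as the file name unknown_gold_decoder_nonunknown.csv suggests, shows how far Unknown falls short when it should have been chosen.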

Step 6: layer sweep probes

python3 scripts/run_micro_world_layer_sweep_probe.py \
  --model-id google/gemma-3-4b-it \
  --train-manifest artifacts/micro_world_v1/generations/<TRAIN_RUN>/manifest.csv \
  --test-manifest artifacts/micro_world_v1/generations/<EVAL_RUN>/manifest.csv \
  --out-dir artifacts/micro_world_v1/layer_sweep_<RUN>

Outputs:

layer_sweep_summary.csv
layer_sweep_best.csv
layer_sweep_metadata.csv
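Selecting a best layer from a sweep reduces to an argmax over per-layer rows. The sketch below assumes rows shaped like dictionaries with a `layer` field and a metric column named `unknown_recall`; both column names are hypothetical, and the actual selection rule behind layer_sweep_best.csv is not documented here.

```python
def best_layer(rows, metric):
    """Return the row with the highest `metric`; ties go to the earliest row."""
    return max(rows, key=lambda row: row[metric])

# Placeholder numbers for illustration only, not reported results.
rows = [
    {"layer": 8, "unknown_recall": 0.2},
    {"layer": 16, "unknown_recall": 0.6},
    {"layer": 24, "unknown_recall": 0.5},
]
print(best_layer(rows, "unknown_recall")["layer"])  # 16
```

If the best layer is chosen on the same data used to report its score, the reported number is optimistic, so selection is better done on a separate split.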

Step 7: post-hoc readout intervention pilot

python3 scripts/run_readout_intervention.py \
  --out-dir artifacts/micro_world_v1/readout_intervention

Outputs:

aggregate_readout_intervention.csv
per-run intervention_summary.csv
per-run confusion and LOOW parameter files

Step 8: latent residual steering (pre-readout)

python3 scripts/run_latent_readout_steering.py \
  --out-dir artifacts/micro_world_v1/latent_readout_steering

Outputs:

aggregate_latent_steering.csv
per-run latent_steering_summary.csv
per-run unknown_margin_by_alpha.csv
per-run unknown_direction.npz, LOOW alpha files
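The outputs above (an Unknown direction and per-run alpha files) are consistent with additive steering of the form h' = h + alpha * d. The sketch below shows that generic technique under stated assumptions: the direction is taken as a normalized difference of class means, which may differ from how scripts/run_latent_readout_steering.py computes it, and both function names are hypothetical.

```python
import numpy as np

def unknown_direction(states, labels):
    """Unit vector from the non-Unknown class mean to the Unknown class mean."""
    states = np.asarray(states, dtype=float)
    is_unknown = np.asarray(labels) == "Unknown"
    diff = states[is_unknown].mean(axis=0) - states[~is_unknown].mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Shift a hidden state by alpha along the steering direction."""
    return np.asarray(hidden, dtype=float) + alpha * direction

# Toy 2-D states, invented for illustration.
states = [[0.0, 0.0], [2.0, 0.0], [1.0, 4.0]]
labels = ["True", "False", "Unknown"]
d = unknown_direction(states, labels)
print(steer([1.0, 1.0], d, alpha=2.0))
```

In this toy case the direction is (0, 1), so steering with alpha = 2 moves the state from (1, 1) to (1, 3). Sweeping alpha and recording the resulting Unknown margin is the kind of curve a file like unknown_margin_by_alpha.csv would hold.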

Step 9: nonlinear probe sensitivity (shallow MLP)

python3 scripts/run_mlp_probe_sensitivity.py \
  --out-dir artifacts/micro_world_v1/probe_mlp_sensitivity

Outputs:

comparison_probe_states_mlp.csv
comparison_probe_linear_vs_mlp.csv
per-run probe_summary_mlp.csv

Consolidated comparison artifacts

These files aggregate the reported cross-run results:

artifacts/micro_world_v1/comparison_decoder_qwen_gemma.csv
artifacts/micro_world_v1/comparison_probe_states_qwen_gemma.csv
artifacts/micro_world_v1/comparison_decoder_constrained_vs_unconstrained_qwen_gemma.csv
artifacts/micro_world_v1/comparison_gemma_base_prompt_rerun.csv
artifacts/micro_world_v1/comparison_label_logits_gemma_it_vs_pt_basefmt.csv
artifacts/micro_world_v1/comparison_layer_sweep_gemma_it_vs_pt_basefmt.csv
artifacts/micro_world_v1/readout_intervention/aggregate_readout_intervention.csv
artifacts/micro_world_v1/latent_readout_steering/aggregate_latent_steering.csv
artifacts/micro_world_v1/probe_mlp_sensitivity/comparison_probe_linear_vs_mlp.csv

Controls run on this track

Cross-family replication (Qwen and Gemma).
Constrained decoding (True/False/Unknown only).
Prompt-path controls (raw prompt path).
Base-vs-instruct with base-specific format repair.
Verdict-step logit competitiveness analysis.
Layer-sweep probes for Unknown recoverability.
Post-hoc readout intervention pilot.
Pre-readout latent residual steering.
Nonlinear probe sensitivity (MLP vs linear).

What this track supports

The decoder under-expresses Unknown relative to the gold labels.
Unknown is recoverable from verdict-region hidden states.
The representation/readout gap replicates across families and controls.
Minimal latent steering improves aggregate metrics but does not uniformly recover Unknown.
Nonlinear probes recover substantially more Unknown signal in hard linear settings (notably Qwen3.5-4B no-think).

What this track does not claim

No universal geometry-of-truth scalar.
No proof that topology alone is the best predictor.
No full causal decomposition of the readout bottleneck.

Conclusion

This track provides the main positive claim: hidden states encode non-entailment signal more strongly than decoder outputs express it.
