Measure fixed-decoding within-question behavior and test whether token-cap termination is associated with error.
This track was the follow-up after global topology failed to provide a stable correctness signal.
- GSM8K question slice via topology_paper/configs/gsm8k_qwen35.yaml.
- Qwen3.5-2B with fixed decoding.
python3 scripts/run_multi_sample_questions.py \
--model-id Qwen/Qwen3.5-2B \
--question-start 0 \
--question-count 20 \
--samples-per-question 9 \
--temperature 0.7 \
--top-p 0.95 \
--max-new-tokens 384 \
--artifact-dir artifacts/qwen35_2b_within_question

Primary outputs:
- artifacts/qwen35_2b_within_question/run_manifest.csv
- per-question sample folders containing:
  - sample.json
  - hidden.npy
  - logits_summary.npz
python3 scripts/analyze_within_question_basins.py \
--artifact-dir artifacts/qwen35_2b_within_question \
--out-dir artifacts/qwen35_2b_within_question/analysis

Outputs:
- aggregate_pairwise.csv
- aggregate_reference.csv
- question_summary.csv
- sign-test files under the same folder
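
The sign-test files listed above are produced by the analysis script, whose implementation is not shown here. As a rough illustration of what an exact two-sided sign test computes, the sketch below uses only the standard library. The function name `sign_test_two_sided` and the toy differences are hypothetical; which per-question quantity the script actually tests is an assumption left open.

```python
from math import comb


def sign_test_two_sided(differences):
    """Exact two-sided sign test on paired differences.

    Zero differences are dropped, which is the usual convention.
    Returns the p-value under the null that positive and negative
    signs are equally likely.
    """
    pos = sum(1 for d in differences if d > 0)
    neg = sum(1 for d in differences if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0  # no informative pairs
    k = min(pos, neg)
    # Lower tail P(X <= k) for X ~ Binomial(n, 0.5), doubled and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy per-question differences (illustrative only, not results from this run).
toy = [0.2, 0.1, -0.05, 0.3, 0.0, 0.15, 0.4, 0.25]
p_value = sign_test_two_sided(toy)
print(f"two-sided sign test p = {p_value:.4f}")
```

With only 20 questions, a test like this has limited power, which is worth keeping in mind when reading the sign-test outputs.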
python3 scripts/run_cap_sensitivity.py \
--model-id Qwen/Qwen3.5-2B \
--source-artifact-dir artifacts/qwen35_2b_within_question \
--out-dir artifacts/qwen35_2b_cap_sensitivity_640 \
--temperature 0.7 \
--top-p 0.95 \
--max-new-tokens 640

Primary output:
- artifacts/qwen35_2b_cap_sensitivity_640/run_manifest.csv
python3 scripts/analyze_cap_sensitivity.py \
--sensitivity-dir artifacts/qwen35_2b_cap_sensitivity_640 \
--out-dir artifacts/qwen35_2b_cap_sensitivity_640/analysis

Outputs:
- transition_summary.csv
- per_question_transition_summary.csv
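
A transition summary of this kind can be thought of as a count of how each matched sample moves between outcome states when the token cap is raised. The sketch below is a minimal illustration under assumed conventions: each sample is a hypothetical `(hit_cap, correct)` pair before and after the rerun, and the state labels are invented for this example. The real column layout of transition_summary.csv is not shown in this document.

```python
from collections import Counter


def label(hit_cap, correct):
    """Build a readable state label such as 'capped/wrong' (hypothetical naming)."""
    termination = "capped" if hit_cap else "finished"
    outcome = "correct" if correct else "wrong"
    return f"{termination}/{outcome}"


def tabulate_transitions(pairs):
    """Count (state before, state after) transitions over matched samples."""
    counts = Counter()
    for before, after in pairs:
        counts[(label(*before), label(*after))] += 1
    return counts


# Toy matched samples as ((hit_cap, correct) at the lower cap,
#                         (hit_cap, correct) at the higher cap).
# Illustrative values only, not data from this run.
toy_pairs = [
    ((True, False), (False, True)),   # rescued by the larger cap
    ((True, False), (True, False)),   # still capped and wrong
    ((True, False), (False, False)),  # finished, but still wrong
    ((True, False), (False, True)),   # rescued by the larger cap
]

transitions = tabulate_transitions(toy_pairs)
for (src, dst), n in sorted(transitions.items()):
    print(f"{src} -> {dst}: {n}")
```

Separating "still capped" from "finished but wrong" is the design point here: the first suggests the cap is still binding, the second that more tokens alone did not help.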

- artifacts/qwen35_2b_within_question/run_manifest.csv
- artifacts/qwen35_2b_cap_sensitivity_640/analysis/transition_summary.csv

Key counts from those files:
- Fixed-config samples (n=180): strong association between token-cap termination and error.
- Matched capped rerun (n=87): partial rescue, but a large residual failure remains.
Non-convergence is the dominant signal in this branch. Question-conditioned basin effects were weak and underpowered here, and did not scale cleanly under augmentation.
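
The cap/error association can be checked directly from a manifest by comparing error rates between samples that hit the token cap and samples that did not. The following is a minimal sketch, assuming hypothetical manifest columns `hit_cap` and `correct` encoded as `0`/`1`; the actual schema of run_manifest.csv is not documented here, so the column names would need to be adapted. The inline CSV is toy data, not results from this run.

```python
import csv
import io

# Toy manifest with hypothetical column names (illustrative only).
TOY_MANIFEST = """question_id,sample_id,hit_cap,correct
0,0,1,0
0,1,0,1
0,2,0,1
1,0,1,0
1,1,1,1
1,2,0,0
"""


def error_rate_by_cap(rows):
    """Return {capped: error rate} for capped (True) and uncapped (False) samples."""
    totals = {True: 0, False: 0}
    errors = {True: 0, False: 0}
    for row in rows:
        capped = row["hit_cap"] == "1"
        totals[capped] += 1
        if row["correct"] != "1":
            errors[capped] += 1
    return {
        capped: (errors[capped] / totals[capped] if totals[capped] else float("nan"))
        for capped in totals
    }


rows = list(csv.DictReader(io.StringIO(TOY_MANIFEST)))
rates = error_rate_by_cap(rows)
print(f"error rate when capped:   {rates[True]:.3f}")
print(f"error rate when uncapped: {rates[False]:.3f}")
```

To run this against a real manifest, replace the `io.StringIO(TOY_MANIFEST)` source with an opened file handle and adjust the column names to match.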