This repository contains code for our ICML 2026 paper Inference Time Optimization with Confidence Dynamics. It implements Confidence Dynamic Gain (CDG) based voting — a training-free inference-time answer-selection method that exploits how model confidence evolves along reasoning trajectories — and evaluates it on competition-level math reasoning benchmarks (AIME 2024, AIME 2025, HMMT 2025, BRUMO 2025) across four open-source reasoning LLMs (DeepSeek-R1-8B, gpt-oss-20B, Gemma-3-27B, QwQ-32B).
unzip submission.zip -d sampling_credit
cd sampling_credit
git init
git submodule update --init

Alternatively, clone the repository together with its submodules:
git clone --recurse-submodules <repo-url>
cd sampling_credit

| Submodule | Source |
|---|---|
| deepconf | facebookresearch/deepconf |
| gpt-oss | openai/gpt-oss |
sampling_credit/
├── scripts/
│   ├── prepare_data/                     # Dataset preparation scripts
│   ├── inference/                        # Model inference scripts
│   └── eval/                             # Evaluation and figure generation
├── results/
│   ├── cache/                            # Cached experiment results
│   │   ├── voting_methods/               # exp_voting_methods.py cache
│   │   ├── cdg_sweep/                    # exp_cdg_sweep.py cache
│   │   ├── pass_at_1/                    # exp_pass_at_1.py cache
│   │   └── histogram/                    # util_histogram_cache.py cache
│   │       └── significance_test_cache/  # significance_test_confidence_change.py cache
│   └── figures/                          # Generated figures
│       └── appendix/                     # Appendix figures
├── deepconf/                             # [submodule] DeepConf baseline
├── gpt-oss/                              # [submodule] GPT-OSS model support
└── README.md
cd scripts/prepare_data
./prepare_all_datasets.sh

This prepares all datasets (AIME 2024, AIME 2025, BRUMO 2025, HMMT 2025) in JSONL format.
cd scripts/inference
./run.sh --model <model> --dataset <dataset>

See scripts/inference/README.md for detailed inference instructions.
cd scripts/eval
./run.sh --task all

Available tasks:
- cache - Generate all caches (voting, histogram, sweep)
- figures - Generate all figures (requires caches)
- sweep - Run CDG hyperparameter sweep only
- position - Run position ablation experiment
- all - Run everything (cache + figures)
| Script | Description |
|---|---|
| exp_voting_methods.py | Compare voting methods (Majority, Mean, Top10 Tail, CDG) |
| exp_cdg_sweep.py | CDG hyperparameter sweep (alpha, beta, position_pct) |
| exp_pass_at_1.py | Pass@1 baseline accuracy (single randomly sampled trace) |
| util_histogram_cache.py | Extract histogram metrics for correct/wrong distributions |
| Script | Description |
|---|---|
| significance_test_confidence_change.py | Mann-Whitney U / Welch's t-test for confidence change (correct vs wrong traces) |
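
For orientation, the two test statistics named above can be sketched in plain Python. This is an illustrative sketch only, not the code in significance_test_confidence_change.py: the function names `welch_t` and `mann_whitney_u` and the toy inputs are assumptions, and p-values are omitted (a statistics library would normally supply them).

```python
from math import sqrt
from statistics import fmean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    a, b: confidence-change values for two groups (e.g. correct vs wrong
    traces). Each group needs at least two values.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (fmean(a) - fmean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def mann_whitney_u(a, b):
    """U statistic for group a: pairs where a beats b, ties counted as 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Toy data (hypothetical): confidence change for correct vs wrong traces.
correct = [0.3, 0.4, 0.5]
wrong = [0.0, 0.1, 0.2]
t_stat, dof = welch_t(correct, wrong)
u_stat = mann_whitney_u(correct, wrong)
```

Welch's test compares group means without assuming equal variances, while the Mann-Whitney U statistic depends only on the ordering of values, so it is less sensitive to outliers in the confidence changes.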
| Script | Description |
|---|---|
| figure_scaling.py | Main scaling figure (accuracy vs trace count) |
| figure_scaling_appendix.py | Per-model scaling figures for appendix |
| figure_histogram.py | Confidence distribution histograms |
| figure_confidence_curve.py | Confidence calibration curves |
| figure_position_appendix.py | Position_pct ablation figure |
| Dataset | Description |
|---|---|
| aime2024 | AIME 2024 (30 questions) |
| aime2025 | AIME 2025 (30 questions) |
| brumo2025 | BRUMO 2025 |
| hmmt2025 | HMMT February 2025 (30 questions) |
| Model Key | Model Name |
|---|---|
| deepseek8b | DeepSeek-R1-8B |
| gemma3_27b | Gemma-3-27B |
| qwq32b | QwQ-32B |
| gptoss20b | gpt-oss-20B |
All paths and parameters are centralized in config files:
- scripts/eval/config.py - Evaluation configuration
- scripts/inference/config.py - Inference configuration
- scripts/inference/config.sh - Shell configuration for inference
CDG (Confidence Dynamic Gain) augments majority voting with two signals: (1) the mean per-trace confidence, and (2) the dynamic gain — i.e., how confidence evolves from the head to the tail of each reasoning trace. The final score for each candidate answer is:
score = count^alpha * mean(mean_conf + beta * gradient)
Where:
- count = number of traces voting for this answer
- alpha = count dampening factor (default: 0.5)
- beta = gradient weight (model-specific)
- gradient = mean(last P%) - mean(first P%) of confidence trajectory
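
The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's implementation: the names `trace_gradient` and `cdg_vote` are hypothetical, each trace is assumed to be an (answer, list of confidences) pair, P is passed as a fraction `position_pct`, and `beta=1.0` is a placeholder because the source only says beta is model-specific.

```python
from collections import defaultdict
from statistics import fmean


def trace_gradient(conf, position_pct=0.1):
    """mean(last P%) - mean(first P%) of one confidence trajectory."""
    k = max(1, int(len(conf) * position_pct))  # use at least one value
    return fmean(conf[-k:]) - fmean(conf[:k])


def cdg_vote(traces, alpha=0.5, beta=1.0, position_pct=0.1):
    """Return (best_answer, scores) using
    score = count^alpha * mean(mean_conf + beta * gradient).

    traces: list of (answer, confidences) pairs (assumed layout).
    """
    per_answer = defaultdict(list)
    for answer, conf in traces:
        value = fmean(conf) + beta * trace_gradient(conf, position_pct)
        per_answer[answer].append(value)
    scores = {
        answer: len(values) ** alpha * fmean(values)
        for answer, values in per_answer.items()
    }
    return max(scores, key=scores.get), scores


# Toy example (hypothetical data): two flat traces vote "42", while one
# trace with rising confidence votes "7".
traces = [
    ("42", [0.5, 0.5, 0.5, 0.5]),
    ("42", [0.5, 0.5, 0.5, 0.5]),
    ("7", [0.2, 0.4, 0.8, 1.0]),
]
winner, scores = cdg_vote(traces, alpha=0.5, beta=1.0, position_pct=0.25)
baseline, _ = cdg_vote(traces, alpha=0.5, beta=0.0, position_pct=0.25)
```

In this toy case the rising-confidence answer "7" wins when beta is 1.0, whereas setting beta to 0.0 removes the gradient term and the answer with more votes, "42", wins. With alpha below 1, the vote count is dampened, so a confident minority answer can outscore a lukewarm majority.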
- results/cache/voting_methods/cache.json - Voting method comparison results
- results/cache/cdg_sweep/sweep_cache.json - Hyperparameter sweep results
- results/cache/pass_at_1/cache.json - Pass@1 baseline results
- results/cache/histogram/*.json - Per model-dataset histogram metrics
- results/cache/histogram/significance_test_cache/*.json - Significance test raw data

- results/figures/*.pdf - Main paper figures
- results/figures/appendix/*.pdf - Appendix figures
- Python 3.8+
- numpy
- matplotlib
- dynasor (pip install git+https://github.com/hao-ai-lab/Dynasor.git)
For inference:
- vLLM
- PyTorch
- Transformers
If you find this work useful, please cite our ICML 2026 paper:
@inproceedings{wang2026cdg,
title = {Inference Time Optimization with Confidence Dynamics},
author = {Wang, Yu and Liu, Minghao and Wang, Jiayun and Huang, Jinrui and Shah, Ankit and Wei, Wei},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2026}
}