Skip to content

nickforce989/mlci

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MLci: Machine Learning confidence interval

Statistically rigorous ML model evaluation and benchmarking.

Every number you report from an ML experiment is a random variable. This library forces you to treat it that way.


Author

Niccolò Forzano --- github.com/nickforce989 · nic.forz@gmail.com


Installation

Starting from the mlci root directory:

pip install .

For PyTorch integration:

pip install . torch

For Bayesian comparison with full PyMC backend:

pip install ".[bayesian]"

Requirements

  • Python 3.9+
  • numpy >= 1.23
  • scipy >= 1.9
  • scikit-learn >= 1.1
  • matplotlib >= 3.5
  • seaborn >= 0.12
  • pandas >= 1.5
  • joblib >= 1.2
  • tqdm >= 4.64

All dependencies are installed automatically with pip install .. To install manually:

pip install numpy scipy scikit-learn matplotlib seaborn pandas joblib tqdm

To run the examples, no additional dependencies are needed — they use datasets that ship with scikit-learn. The PyTorch example additionally requires pip install torch.


The Problem

Standard ML practice:

RandomForest     accuracy: 0.9631
GradientBoosting accuracy: 0.9631
Difference: 0.0000 — they're equal

But these numbers came from a single run with a single random seed. Run it again and you get different numbers. The variance of accuracy across seeds on a typical benchmark is 0.2–0.5%. A 0.3% difference between two models is often indistinguishable from noise — but almost no paper will tell you that.

mlci makes it trivial to do this correctly.


Quickstart

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from mlci import Experiment
from mlci.stats.bootstrap import summary
from mlci.stats.tests import compare

X, y = load_breast_cancer(return_X_y=True)

rf = Experiment(
    model_factory=lambda seed: RandomForestClassifier(random_state=seed),
    X=X, y=y, metric="accuracy", n_seeds=20, n_splits=5,
    model_name="RandomForest",
)
gb = Experiment(
    model_factory=lambda seed: GradientBoostingClassifier(random_state=seed),
    X=X, y=y, metric="accuracy", n_seeds=20, n_splits=5,
    model_name="GradientBoosting",
)

rf_results = rf.run()
gb_results = gb.run()

Summary with uncertainty:

────────────────────────────────────────────────────
  Model   : RandomForest
  Metric  : accuracy
  Seeds   : 20   Splits: 5
────────────────────────────────────────────────────
  Mean    : 0.9593
  95% CI  : [0.9577, 0.9608]  (±0.0015)
  Seed σ  : 0.0037  (variance from random init)
  Split σ : 0.0061  (variance from data split)
────────────────────────────────────────────────────

Statistical comparison:

compare(rf_results, gb_results, method="corrected_ttest")
────────────────────────────────────────────────────────────
  Comparison : RandomForest  vs  GradientBoosting
  Method     : Corrected Resampled t-test (Nadeau & Bengio, 2003)
  Effect     : +0.0004  [-0.0023, +0.0029]
  P(A > B)   : 0.430
  p-value    : 0.6305  (α=0.05)
  Conclusion : No statistically significant difference (p=0.6305 >= α=0.05).
               Cannot conclude either model is better.
────────────────────────────────────────────────────────────

The standard single-run reported both models at identical accuracy. mlci correctly identifies the difference is noise.


Running the Examples

Seven ready-to-run example scripts are included in the examples/ folder. All use datasets that ship with scikit-learn — no downloads needed.


worked_example.py — start here

The simplest and most focused example. Demonstrates the single most important claim of the library: that the standard practice of reporting a single accuracy number is misleading, and that two models which look identical under standard evaluation can be properly assessed with mlci.

Runs three models first the standard way (one number), then with mlci (20 seeds × 5 folds, confidence intervals, corrected t-test). Ends with a side-by-side summary showing exactly what information was lost.

python examples/worked_example.py
# Runtime: ~1 minute

Run this first — it's the clearest demonstration of why the library exists.


full_example.py — full feature walkthrough

Covers every major feature across nine parts: four classical models, bootstrap summaries, variance decomposition, all three comparison methods, bring-your-own scores, custom metrics, learning curves, five plots, and JSON save/load.

python examples/full_example.py
# Runtime: ~3 minutes

Plots produced (saved to examples/plots/):

  • score_distributions.png — KDE of raw scores per model
  • model_comparison.png — forest plot with CI bars
  • variance_decomposition.png — pie chart + score heatmap
  • learning_curve.png — RF vs LogReg with uncertainty bands
  • bootstrap_distribution.png — bootstrap distribution of the mean

neural_network_example.py — neural network from scratch

Implements a fully-connected MLP from scratch using NumPy only — no PyTorch, no TensorFlow. He initialisation, ReLU activations, mini-batch SGD with momentum, L2 regularisation, early stopping. Wrapped in a sklearn-compatible interface so mlci works on it identically to any other model.

python examples/neural_network_example.py
# Runtime: ~2 minutes

Plots produced (saved to examples/nn_plots/):

  • nn_training_curves.png — loss curves (train vs val) for each architecture
  • nn_score_distributions.png — KDE of raw scores for all models
  • nn_vs_classical_comparison.png — forest plot: MLP vs classical baselines
  • nn_architecture_comparison.png — bar chart comparing network sizes with CI
  • nn_learning_curve.png — MLP vs RandomForest learning curves with uncertainty bands
  • nn_variance_decomposition.png — variance sources for the MLP
  • nn_bootstrap_distribution.png — bootstrap distribution of MLP mean accuracy
  • nn_summary_panel.png — single-figure summary combining five panels

Key finding: on a small tabular dataset (569 samples), LogisticRegression outperforms the MLP (0.979 vs 0.973, p=0.0001). Neural networks need large datasets to outperform well-tuned linear models. mlci makes this conclusion rigorous rather than anecdotal.


sklearn_example.py — sklearn integration showcase

Demonstrates mlci.integrations.sklearn across four parts: defining models with PipelineFactory instead of lambdas, wrapping external scores with wrap_cross_val_scores, running 8 classifiers at once with ModelGrid, and running the same grid across three datasets to produce a cross-dataset comparison heatmap.

python examples/sklearn_example.py
# Runtime: ~2 minutes

Plots produced (saved to examples/sklearn_plots/):

  • sklearn_model_comparison.png — 8 classifiers ranked by accuracy with CI bars
  • sklearn_scaler_comparison.png — does scaler choice matter? (yes, dramatically)
  • sklearn_score_distributions.png — KDE for the top 3 models
  • sklearn_learning_curve.png — LogReg vs RF vs SVM with uncertainty bands
  • sklearn_cross_dataset_heatmap.png — mean accuracy grid: 3 models × 3 datasets
  • sklearn_variance_decomposition.png — variance sources for the best model

Key finding: MinMaxScaler drops LogisticRegression accuracy from 0.977 to 0.937 — a large, statistically significant difference that would be invisible from a single run.


torch_example.py — PyTorch integration

Plugs real PyTorch neural networks into mlci via TorchTrainer — a sklearn-compatible wrapper that handles seeding (torch + numpy + random + CUDA), early stopping, and disk caching (completed runs are saved and skipped on re-run). Compares three architectures (Small, Deep, Wide MLP) against classical baselines with full statistical rigour.

pip install torch   # required
python examples/torch_example.py
# Runtime: ~5 minutes first run; near-instant on re-run due to caching

Plots produced (saved to examples/torch_plots/):

  • torch_model_comparison.png — PyTorch MLPs vs classical baselines
  • torch_score_distribution.png — score distribution for the best architecture
  • torch_learning_curve.png — SmallMLP vs RandomForest with uncertainty bands
  • torch_training_curves.png — loss curves for all three architectures

Key feature — caching: each (seed, split) training run is hashed by architecture + config + data and saved to disk. Kill the process halfway through 50 runs, rerun the script, and it continues from where it left off. Essential when training takes minutes per run.


mde_example.py — minimum detectable effect calculator

Answers a question that comes up before every experiment: "How many seeds do I actually need?" Uses a short pilot run to estimate the score variance of your model, then works backwards from a target effect size and power level to give you a concrete seed budget — or, given a fixed budget, tells you the smallest effect you can reliably detect.

python examples/mde_example.py
# Runtime: ~30 seconds

What it demonstrates:

  • Pilot-based σ estimation (estimate_sigma) — runs 5 seeds to measure how much accuracy varies due to random initialisation alone, without any assumption about the distribution.
  • Seeds-to-detect query (mde_n_seeds) — given a target effect (e.g. 0.5% accuracy difference) and a target power (e.g. 80%), returns the minimum number of seeds required.
  • MDE-from-budget query (mde_effect) — given a fixed seed budget (e.g. 20 seeds), returns the minimum effect size you can detect at the specified power level.
  • Power reference table (quick_summary) — prints a grid of power values across a range of effect sizes and seed counts so you can reason about the tradeoff at a glance.
  • Power curve and heatmap (plot_power_curve, plot_power_heatmap) — visualises how power grows with more seeds, for multiple effect sizes simultaneously.
  • Verification run — actually runs 20 seeds, compares two models with the corrected t-test, and checks whether the observed effect falls above or below the computed MDE. Confirms the power estimate is consistent with real experimental outcomes.

All power calculations are built directly on the Nadeau-Bengio corrected t-test — the same test used by compare() — so the estimates are internally consistent with the rest of the library.

Plots produced (saved to examples/mde_plots/):

  • power_curves.png — power vs number of seeds for multiple effect sizes, with 80% target line
  • power_heatmap.png — effect size × seed count grid coloured by power

calibration_example.py — calibration analysis across seeds

Answers a different question: "When my model says 80% confidence, does that actually mean 80%?" sklearn's calibration_curve gives you a single reliability diagram from one run. This example uses mlci to run the same calibration analysis across many seeds and folds, producing uncertainty bands on the reliability diagram itself — so you can see whether calibration is genuinely stable or just lucky on a particular random split.

python examples/calibration_example.py
# Runtime: ~30 seconds

What it demonstrates:

  • CalibrationExperiment — runs calibration measurement across n_seeds × n_splits and aggregates ECE, MCE, and per-bin reliability statistics with proper uncertainty quantification.
  • ECE and MCE with confidence intervals — Expected Calibration Error and Maximum Calibration Error reported as distributions across seeds, not single numbers. The .summary() method prints the mean, 95% CI, and seed-level standard deviation for both.
  • Reliability diagrams with uncertainty bands (plot_reliability_diagram) — the standard "confidence vs accuracy" plot, but with a 95% CI band around the mean curve and optional faint overlay of every individual seed run. Shows at a glance how much the calibration curve moves between seeds.
  • ECE distribution plots (plot_ece_distribution) — KDE of the ECE values across seeds, one per model. A narrow distribution means calibration is consistent; a wide one means you got lucky (or unlucky) on a specific split.
  • Multi-model comparison (plot_calibration_comparison) — plots all models' reliability diagrams side-by-side with CI bands for direct comparison.
  • Reliability overlay (plot_reliability_overlay) — overlays all models' mean reliability curves on a single axis, making it easy to see which model is best-calibrated across the confidence range.

Three models are compared on the breast cancer dataset: RandomForest (tends to be overconfident at high probabilities), GradientBoosting (better calibrated but still biased), and LogisticRegression (typically the best-calibrated of the three on tabular data).

Plots produced (saved to examples/calibration_plots/):

  • reliability_randomforest.png — reliability diagram with 95% CI band for RandomForest
  • reliability_gradientboosting.png — same for GradientBoosting
  • reliability_logisticregression.png — same for LogisticRegression
  • ece_dist_randomforest.png — ECE distribution across seeds for RandomForest
  • ece_dist_gradientboosting.png — same for GradientBoosting
  • ece_dist_logisticregression.png — same for LogisticRegression
  • calibration_comparison.png — all three reliability diagrams side-by-side with CI bands
  • reliability_overlay.png — all three mean curves on a single axis

Core Features

1. Bootstrap Confidence Intervals

from mlci.stats.bootstrap import bootstrap_ci, summary

ci = bootstrap_ci(results, confidence=0.95)
print(ci)
# 0.9593 [95% CI: 0.9577, 0.9608] (width=0.0031)

Resamples at the seed level — the correct choice, since seeds are independent but folds within a seed share training data and are not.


2. Three Statistical Comparison Methods

from mlci.stats.tests import compare

# Method 1 (default): Corrected Resampled t-test (Nadeau & Bengio, 2003)
compare(a, b, method="corrected_ttest")

# Method 2: Wilcoxon signed-rank test (non-parametric, no distributional assumptions)
compare(a, b, method="wilcoxon")

# Method 3: Bayesian — reports P(A > B) rather than a binary p-value
compare(a, b, method="bayesian")

# Run all three at once
compare(a, b, method="all")

Why the Nadeau-Bengio corrected t-test matters

The standard paired t-test applied to k-fold CV scores is anti-conservative: it systematically overstates significance because the folds share training data, making the observations non-independent. Nadeau & Bengio (2003) derived a corrected variance estimator that accounts for this. It is well-known in the statistical ML literature but almost never used in practice because no mainstream library implements it cleanly.

Reference: Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52(3), 239-281.


3. Variance Decomposition

Where does your score variance come from — random initialisation, or the data split?

from mlci.stats.anova import decompose_variance

print(decompose_variance(results))
────────────────────────────────────────────────────────
  Variance Decomposition: RandomForest (accuracy)
────────────────────────────────────────────────────────
  Total variance  : 0.000435
  Seed variance   : 0.000010  (2.3%)   <- random init
  Split variance  : 0.000016  (3.6%)   <- which data in test set
  Interaction     : 0.000409  (94.1%)  <- seed×split interaction
────────────────────────────────────────────────────────

Uses a two-way balanced ANOVA decomposition. High split variance means your evaluation is sensitive to which portion of your data ends up in the test set — a signal that you may need more folds or a larger dataset.


4. Full Bayesian Hierarchical Comparison (PyMC)

The existing method="bayesian" comparison in compare() estimates P(A > B) analytically from the bootstrap distribution of the mean difference. It is fast and requires no extra dependencies, but it does not model the structure of the (n_seeds × n_splits) score matrix. Folds within a seed share training data and are not independent, which means a flat normal approximation slightly overstates certainty.

The full Bayesian hierarchical comparison uses PyMC to build a proper two-level random-effects model:

diff[s, k]  ~  Normal(μ_s,        σ_within)   # observed per-(seed, split) diff
μ_s         ~  Normal(μ_global,   σ_between)   # per-seed random effect
μ_global    ~  Normal(0, 0.10)                 # global mean difference (prior)
σ_within    ~  HalfNormal(0.05)               # within-seed (fold) variance
σ_between   ~  HalfNormal(0.05)               # between-seed variance

where diff[s, k] = score_A[s, k] − score_B[s, k] and the random effect μ_s absorbs the correlation among folds that share the same random seed. Inference on μ_global is therefore correctly calibrated.

Installation

pip install ".[bayesian]"
# installs: pymc>=5.0, arviz>=0.16

Usage

from mlci.stats.bayesian_hierarchical import bayesian_hierarchical_compare

result = bayesian_hierarchical_compare(
    rf_results, gb_results,
    rope=0.005,        # differences < 0.5 pp are practically irrelevant
    hdi_prob=0.94,     # 94% HDI following Kruschke/McElreath conventions
    draws=2000,
    tune=1000,
    chains=4,
    plot=True,         # saves posterior + trace plots
    plot_path="examples/plots/",
)
print(result)

Or via the unified compare() dispatcher:

from mlci.stats.tests import compare

compare(rf_results, gb_results, method="bayesian_hierarchical", rope=0.005)

Example output

────────────────────────────────────────────────────────────────────
  Bayesian Hierarchical Comparison (PyMC)
────────────────────────────────────────────────────────────────────
  A  :  RandomForest
  B  :  GradientBoosting
  Metric    : accuracy
────────────────────────────────────────────────────────────────────
  μ(A−B)    : +0.0004  [94% HDI: -0.0022, +0.0031]
  P(A > B)  : 0.613
  ROPE      : [-0.0050, +0.0050]
  In ROPE   : 87.4% of posterior
────────────────────────────────────────────────────────────────────
  Decision  : Inconclusive — HDI overlaps the ROPE.
               Collect more data or relax the ROPE threshold.
────────────────────────────────────────────────────────────────────

Interpreting the output

Field Meaning
μ(A−B) Posterior mean of the expected performance gap, in metric units
94% HDI Highest Density Interval: the narrowest interval containing 94% of posterior probability
P(A > B) Probability that model A has higher true performance than model B
ROPE Region Of Practical Equivalence — differences smaller than this are treated as irrelevant
In ROPE Fraction of the posterior inside the ROPE

Decision rules (Kruschke 2018)

Condition Decision
94% HDI lies entirely inside the ROPE Models are practically equivalent
94% HDI lies entirely above the ROPE Model A is practically better
94% HDI lies entirely below the ROPE Model B is practically better
HDI overlaps the ROPE Inconclusive — collect more data

ROPE guidance

The ROPE should be set in the same units as your metric. Reasonable defaults:

Metric Suggested ROPE
accuracy / F1 / ROC-AUC rope=0.005 (±0.5 pp)
RMSE / MAE dataset-dependent; use ~1% of the target range
rope=0.01

You can also pass a custom interval directly:

bayesian_hierarchical_compare(a, b, rope=(-0.01, 0.01))

Plots

When plot=True, two figures are saved / displayed:

  • bayesian_posterior.png — posterior distribution of μ_global with the HDI interval, a vertical line at 0, and the ROPE shaded in red.
  • bayesian_trace.png — MCMC trace and rank plots for μ_global, σ_within, and σ_between; useful for diagnosing sampler pathologies.

Why hierarchical?

A flat model (e.g. a t-test on the per-split differences) treats all n_seeds × n_splits observations as independent. They are not — the n_splits fold scores within a single seed share the same model initialisation and the same train/test split boundaries. The random effect μ_s captures this within-seed dependency explicitly, so the posterior width for μ_global reflects the effective number of independent observations (closer to n_seeds than to n_seeds × n_splits).

This is especially important when n_splits is large relative to n_seeds: without the random effect, the model would incorrectly pool all fold scores as independent evidence, artificially narrowing the posterior and overstating certainty about the performance gap.

5. Sensitivity Analysis

Learning curves with uncertainty bands

from mlci.sensitivity.learning_curve import learning_curve
from mlci.viz.plots import plot_learning_curve

lc = learning_curve(
    model_factory=lambda seed: RandomForestClassifier(random_state=seed),
    X=X, y=y,
    train_fractions=[0.1, 0.2, 0.4, 0.6, 0.8, 1.0],
    n_seeds=10, n_splits=5,
)
fig = plot_learning_curve(lc)

Each point on the curve has a confidence band rather than being a single line — so you can see when two models that look different are actually statistically indistinguishable at smaller data scales.

Hyperparameter sensitivity

How much does each hyperparameter actually move the needle?

from mlci.sensitivity.hyperparameter import hyperparameter_sensitivity

results = hyperparameter_sensitivity(
    model_factory=lambda seed, n_estimators, max_depth:
        RandomForestClassifier(n_estimators=n_estimators,
                               max_depth=max_depth,
                               random_state=seed),
    param_grid={
        "n_estimators": [10, 50, 100, 200],
        "max_depth":    [None, 5, 10],
    },
    X=X, y=y, n_seeds=10, n_splits=5,
)
results.summary_table()
# Most sensitive parameter: n_estimators  σ = 0.0089
# Second most sensitive:    max_depth     σ = 0.0031

Each combination in the grid is evaluated as a full mlci experiment (n_seeds × n_splits), giving you confidence intervals on every point and a sensitivity ranking showing which parameters to prioritise when tuning.

Dataset sensitivity

Which samples in your dataset are genuinely hard to classify?

from mlci.sensitivity.dataset import dataset_sensitivity

results = dataset_sensitivity(
    model_factory=lambda seed: RandomForestClassifier(random_state=seed),
    X=X, y=y, n_seeds=10, n_splits=5,
)
print(results)
# Easy  (rate=0.0): 412 samples  — always correct
# Medium (0<r≤0.5): 134 samples  — sometimes wrong
# Hard  (rate>0.5):  23 samples  — usually wrong

hard_X = X[results.hard_indices]   # the genuinely ambiguous samples

Tracks per-sample misclassification rates across all (seed, split) pairs. Samples that are frequently wrong regardless of seed are genuinely ambiguous — useful for dataset curation and identifying potential labelling errors before publishing a benchmark.


6. sklearn Integrations

PipelineFactory

Build model + scaler pipelines from a config dict, with automatic seed injection:

from mlci.integrations.sklearn import PipelineFactory

factory = PipelineFactory(
    LogisticRegression,
    params={"max_iter": 2000, "C": 1.0},
    scaler="standard",   # or "robust", "minmax", "none"
)

exp = Experiment(model_factory=factory, X=X, y=y, ...)

wrap_cross_val_scores

Already have scores? Bring them in without rerunning:

from mlci.integrations.sklearn import wrap_cross_val_scores
import numpy as np

scores = np.vstack([
    cross_val_score(RandomForestClassifier(random_state=s), X, y, cv=5)
    for s in range(10)
])

results = wrap_cross_val_scores(scores, metric="accuracy", model_name="RF")
summary(results)   # full CI, variance decomposition, comparisons

Works with any source: sklearn, PyTorch training loops, a collaborator's array of numbers — anything that produces scores.

ModelGrid

Run many models in one call:

from mlci.integrations.sklearn import ModelGrid

grid = ModelGrid(
    models={
        "LogisticRegression": PipelineFactory(LogisticRegression, scaler="standard"),
        "RandomForest":       PipelineFactory(RandomForestClassifier, scaler="none"),
        "SVM":                PipelineFactory(SVC, scaler="standard"),
    },
    X=X, y=y, metric="accuracy", n_seeds=15, n_splits=5,
)
all_results = grid.run()
# {model_name: ExperimentResults} — pass directly to plot_comparison(), compare(), etc.

7. PyTorch Integration

Wrap any nn.Module in a sklearn-compatible interface with caching:

import torch.nn as nn
from mlci.integrations.torch import TorchTrainer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(30, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )
    def forward(self, x):
        return self.net(x)

exp = Experiment(
    model_factory=lambda seed: Pipeline([
        ("scaler", StandardScaler()),
        ("net", TorchTrainer(
            model_fn=MyNet,
            lr=1e-3,
            n_epochs=100,
            patience=15,
            random_state=seed,
            cache_dir="./mlci_cache",   # resume interrupted runs
        )),
    ]),
    X=X, y=y, metric="accuracy", n_seeds=10, n_splits=5,
    model_name="MyNet",
)
results = exp.run()
summary(results)

TorchTrainer handles seeding (torch + numpy + random + CUDA), early stopping with validation-based checkpoint restore, and disk caching keyed by architecture + config + data fingerprint. Different architectures always get different cache entries.


8. Multi-Dataset Benchmarking

Run a model grid across multiple datasets with Benjamini-Hochberg multiple testing correction:

from mlci.benchmarks.runner import run_benchmark, print_benchmark_report
from mlci.benchmarks.datasets import load_all_classification

results = run_benchmark(
    models={
        "RF":  lambda seed: RandomForestClassifier(random_state=seed),
        "LR":  PipelineFactory(LogisticRegression, scaler="standard"),
        "SVM": PipelineFactory(SVC, scaler="standard"),
    },
    datasets=load_all_classification(),
    n_seeds=10, n_splits=5,
)
print_benchmark_report(results)

Without multiple testing correction, comparing 3 models on 4 datasets means 12 pairwise tests — at α=0.05 you would expect spurious significant results by chance. Benjamini-Hochberg controls the false discovery rate across all comparisons and is built in automatically.


9. Minimum Detectable Effect (MDE) Calculator

Before running an experiment, use the MDE calculator to determine how many seeds you need — or what the smallest detectable effect is given a fixed seed budget.

from mlci.stats.power import estimate_sigma, mde_n_seeds, mde_effect, quick_summary

# Step 1: estimate score variance from a small pilot run
pilot = Experiment(
    model_factory=lambda seed: RandomForestClassifier(random_state=seed),
    X=X, y=y, metric="accuracy", n_seeds=5, n_splits=5,
).run(verbose=False)

sigma = estimate_sigma(pilot)

# Step 2a: how many seeds to detect a 0.5% difference at 80% power?
result = mde_n_seeds(effect_size=0.005, sigma=sigma, n_splits=5, alpha=0.05, target_power=0.80)
print(result)
# → Need 28 seeds to detect a 0.5% effect at 80% power (α=0.05)

# Step 2b: with a budget of 20 seeds, what's the smallest detectable effect?
mde = mde_effect(n_seeds=20, sigma=sigma, n_splits=5, alpha=0.05, target_power=0.80)
print(mde)
# → With 20 seeds: MDE = 0.572%  (80% power, α=0.05)

Print a power reference table across a range of effects and seed counts:

quick_summary(
    sigma=sigma, n_splits=5, alpha=0.05,
    effects=[0.001, 0.002, 0.005, 0.010, 0.020],
    n_seeds_options=[5, 10, 20, 30, 50],
)

Generate power curves and a heatmap:

from mlci.stats.power import power_analysis, plot_power_curve, plot_power_heatmap
import numpy as np

curve = power_analysis(
    effects=np.linspace(0.001, 0.015, 8),
    n_seeds_range=list(range(5, 60, 5)),
    sigma=sigma, n_splits=5, alpha=0.05,
)
fig = plot_power_curve(curve, target_power=0.80, highlight_effects=[0.005, 0.010])
fig = plot_power_heatmap(curve, target_power=0.80)

All calculations use the Nadeau-Bengio corrected t-test variance formula — the same test used by compare() — so power estimates are directly consistent with your actual experimental outcomes.


10. Calibration Analysis

Measure whether your model's predicted probabilities are reliable, and quantify how stable calibration is across seeds — something sklearn's calibration_curve cannot tell you.

from mlci.calibration import CalibrationExperiment

cal = CalibrationExperiment(
    model_factory=lambda seed: RandomForestClassifier(n_estimators=100, random_state=seed),
    X=X, y=y,
    n_seeds=15, n_splits=5,
    n_bins=10, strategy="uniform",
    model_name="RandomForest",
)
result = cal.run()
print(result.summary())
────────────────────────────────────────────────────────────
  Calibration: RandomForest  (accuracy)
  Seeds: 15   Splits: 5   Bins: 10
────────────────────────────────────────────────────────────
  ECE  mean : 0.0431   95% CI: [0.0389, 0.0472]   σ=0.0042
  MCE  mean : 0.1203   95% CI: [0.1089, 0.1318]   σ=0.0113
────────────────────────────────────────────────────────────

Plot a reliability diagram with uncertainty bands across seeds:

from mlci.calibration import (
    plot_reliability_diagram,
    plot_ece_distribution,
    plot_calibration_comparison,
    plot_reliability_overlay,
)

# Single model: reliability diagram with 95% CI band + individual seed runs
fig = plot_reliability_diagram(result, confidence=0.95, show_individual_runs=True)

# ECE distribution across seeds
fig = plot_ece_distribution(result)

# Compare multiple models side-by-side
fig = plot_calibration_comparison([rf_cal, gb_cal, lr_cal], confidence=0.95)

# Overlay mean curves for all models on a single axis
fig = plot_reliability_overlay([rf_cal, gb_cal, lr_cal])

The key difference from sklearn: calibration_curve gives a single curve from one run. mlci gives you a 95% CI band on that curve across seeds, so you know whether your model is genuinely well-calibrated or whether the result depends heavily on which random split happened to be used.


11. Visualisation

from mlci.viz.plots import (
    plot_score_distribution,      # KDE + CI on the mean
    plot_comparison,              # Forest plot for multiple models
    plot_variance_decomposition,  # Pie chart + score heatmap
    plot_bootstrap_distribution,  # Bootstrap distribution of the mean
    plot_learning_curve,          # Learning curve with uncertainty bands
)

Supported Metrics

Built-in: "accuracy", "f1", "f1_binary", "roc_auc", "rmse", "mae", "r2", "mse"

Or pass any callable (y_true, y_pred) -> float:

Experiment(
    model_factory=...,
    metric=lambda yt, yp: my_custom_metric(yt, yp),
)

Saving and Loading Results

results.save("rf_results.json")
loaded = ExperimentResults.load("rf_results.json")

Results are stored as plain JSON — no pickle, no version lock-in. Share results with collaborators and they can run the full statistical analysis without retraining anything.


Design Principles

Seeds are the unit of resampling. When we bootstrap, we resample at the seed level, not the individual (seed, split) level. Seeds are independent; folds within a seed share training data and are not independent.

Report distributions, not point estimates. The default output of every function is either a distribution or a statistic with a confidence interval.

Correct by default. The default comparison method is the Nadeau-Bengio corrected t-test, not the standard paired t-test, because the standard test is wrong for this use case.

No hidden state. ExperimentResults is a plain data container. You can inspect, serialise, and reconstruct it without going through any library-specific machinery.

Framework-agnostic. The library works with any model that implements fit(X, y) and predict(X). sklearn, PyTorch, XGBoost, your own custom model — all treated identically.


Citation

If you use mlci in published work, please cite it:

@software{forzano2026mlci,
  author    = {Forzano, Niccolo},
  title     = {mlci: Statistically Rigorous ML Model Evaluation and Benchmarking},
  year      = {2026},
  url       = {https://github.com/nickforce989/mlci},
  note      = {Python package}
}

Or in plain text:

Forzano, N. (2026). mlci: Statistically Rigorous ML Model Evaluation and Benchmarking. https://github.com/nickforce989/mlci

Please also cite the underlying statistical method the library is built on:

@article{nadeau2003inference,
  author  = {Nadeau, Claude and Bengio, Yoshua},
  title   = {Inference for the Generalization Error},
  journal = {Machine Learning},
  year    = {2003},
  volume  = {52},
  number  = {3},
  pages   = {239--281}
}

Roadmap

  • Bootstrap confidence intervals
  • Corrected resampled t-test (Nadeau & Bengio, 2003)
  • Wilcoxon and Bayesian comparison methods
  • Variance decomposition (two-way ANOVA)
  • Learning curves with uncertainty bands
  • Hyperparameter sensitivity analysis
  • Dataset sensitivity (per-sample misclassification rates)
  • sklearn integrations (PipelineFactory, ModelGrid, wrap_cross_val_scores)
  • PyTorch integration with caching and early stopping
  • Multi-dataset benchmarking with Benjamini-Hochberg correction
  • PyMC-based full Bayesian hierarchical comparison
  • Minimum Detectable Effect (MDE) calculator with power curves and heatmaps
  • Calibration analysis with ECE/MCE confidence intervals and reliability diagrams
  • Full PyTorch/HuggingFace training loop integration with Slurm support
  • HTML report generation
  • Integration with MLflow / Weights & Biases
  • Support for multi-class and multi-label metrics

License

GPL

About

Statistically rigorous ML model evaluation and benchmarking.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages