Multivariate analysis, designed experiments, and process monitoring for Python. Built for chemometrics, manufacturing, and pharma data - the methods that scikit-learn skips.
process-improve provides production-grade implementations of the methods
practitioners actually use on real plant and lab data:
- PCA with SVD and NIPALS, plus native missing-value handling via Trimmed Score Regression
- PLS regression with a fully sklearn-compatible API, VIP scores, and cross-validated diagnostics
- TPLS - PLS for T-shaped (multi-block) data structures
- Outlier detection combining Hotelling's T² and SPE with an ESD-based test
- Designed experiments - full-factorial, fractional-factorial, and response-surface designs, plus a multi-stage DOE strategy recommender
- Process monitoring - Shewhart, CUSUM, and Holt-Winters control charts
- Batch data analysis - alignment, feature extraction, and multivariate batch monitoring (MBPCA / MBPLS)
- Interactive Plotly diagnostics bound directly to every fitted model
Outputs are pandas-native: scores, loadings, and predictions keep your row
and column labels.
It is the companion package to the online textbook Process Improvement using Data, and powers the statistical engine behind factori.al.
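As a rough illustration of the idea behind the Shewhart chart in the list above, the sketch below computes individuals-chart limits from the average moving range using only the standard library. It does not use process-improve's API; the function name `shewhart_limits` and the sample data are hypothetical.

```python
from statistics import mean


def shewhart_limits(values):
    """Return (lower, center, upper) limits for an individuals chart.

    Sigma is estimated from the average moving range (MR_bar / 1.128),
    so the 3-sigma limits sit at roughly center +/- 2.66 * MR_bar.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    center = mean(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    half_width = 3 * mean(moving_ranges) / 1.128
    return center - half_width, center, center + half_width


# Hypothetical in-control reference data used to set the limits
reference = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7, 10.0, 10.2]
lcl, center, ucl = shewhart_limits(reference)

# Flag new observations that fall outside the limits
new_points = [10.1, 9.9, 11.6]
flags = [not (lcl <= x <= ucl) for x in new_points]
print(flags)
```

Limits are estimated once from data believed to be in control and then applied to later observations, which is why the reference data and the new points are kept separate here.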
scikit-learn answers "what fits the data?" - process-improve answers
"is this batch normal, which variable went off, and how confident am I in the
prediction?" The two libraries are designed to be used together;
process-improve follows sklearn conventions (fit, predict, score, the
_ suffix on fitted attributes) and drops into existing pipelines.
| Capability | scikit-learn | process-improve |
|---|---|---|
| PCA, PLS with sklearn-style API | ✓ | ✓ |
| Missing-data fitting (NIPALS / TSR) | - | ✓ |
| Hotelling's T² + SPE outlier limits | - | ✓ |
| Variable-level score contributions | - | ✓ |
| Cross-validated coefficient confidence intervals | - | ✓ |
| Multi-block models (TPLS) | - | ✓ |
| Designed experiments (DoE) | - | ✓ |
| Control charts (Shewhart / CUSUM / Holt-Winters) | - | ✓ |
| Batch process monitoring (MBPCA / MBPLS) | - | ✓ |
| Plotly diagnostics built in | - | ✓ |
| Labeled DataFrame outputs | partial | ✓ |
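To make the T² and SPE row of the table concrete, here is a minimal NumPy sketch of how the two statistics can be computed from a PCA model. It is an illustration under stated assumptions, not process-improve's implementation: the data are synthetic, the variable names are made up, and no confidence limits are calculated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 observations of 4 correlated variables
latent = rng.normal(size=(50, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(50, 4))

# Mean-center and scale each column to unit variance
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via SVD, keeping A components
A = 2
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
P = Vt[:A].T          # loadings, shape (K, A)
T = Xs @ P            # scores, shape (N, A)

# Hotelling's T2: squared scores, each weighted by its score variance
score_var = T.var(axis=0, ddof=1)
t2 = np.sum(T**2 / score_var, axis=1)

# SPE: squared distance from each observation to the model plane
residuals = Xs - T @ P.T
spe = np.sum(residuals**2, axis=1)

print(t2.shape, spe.shape)
```

T² measures how far an observation sits from the center within the model plane, while SPE measures how far it sits off that plane, so the two statistics catch different kinds of unusual observations.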
pip install process-improve

Requires Python 3.10 or newer. Built on numpy, pandas, scipy,
scikit-learn, statsmodels, plotly, and pyDOE3.
import pandas as pd
from process_improve.multivariate.methods import PCA, MCUVScaler
X = pd.read_csv("your_data.csv", index_col=0)
X_scaled = MCUVScaler().fit_transform(X)
pca = PCA(n_components=3).fit(X_scaled)
print(pca.r2_cumulative_) # cumulative R² per component
pca.score_plot() # interactive Plotly figure
# Flag outliers using combined T² and SPE limits at 95% confidence
outliers = pca.detect_outliers(conf_level=0.95)
# Which variables drove the first observation off?
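contrib = pca.score_contributions(pca.scores_.iloc[0].values)

from process_improve.multivariate.methods import PLS, MCUVScaler
# X and Y: pandas DataFrames of predictors and responses (same rows);
# X_new: new observations with the same columns as X
# Scale X and Y separately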
scaler_x = MCUVScaler().fit(X)
scaler_y = MCUVScaler().fit(Y)
X_s, Y_s = scaler_x.transform(X), scaler_y.transform(Y)
pls = PLS(n_components=3).fit(X_s, Y_s)
print(pls.beta_coefficients_) # regression coefficients (K x M)
print(pls.r2_cumulative_) # cumulative R² for Y
print(pls.vip()) # VIP scores per X variable
# Predict new observations, with diagnostics on the prediction
result = pls.predict(scaler_x.transform(X_new))
result.y_hat # point predictions
result.spe # squared prediction error
result.hotellings_t2 # Hotelling's T² for new observations
# Cross-validated component selection
cv_select = PLS.select_n_components(X_s, Y_s, max_components=6)
print(cv_select.n_components) # recommended number of components
print(cv_select.rmsecv) # RMSECV per component count
# Cross-validation with beta-coefficient confidence intervals
cv = pls.cross_validate(X_s, Y_s, cv="loo")
print(cv.beta_ci_lower, cv.beta_ci_upper) # 95% CI for each beta
print(cv.significant) # betas significantly != 0
print(cv.q_squared) # cross-validated R² (Q²)

from process_improve.experiments.factor import Factor, Response
from process_improve.experiments.strategy import recommend_strategy
factors = [
    Factor(name="Temperature", low=25, high=40, units="degC"),
    Factor(name="pH", low=5.0, high=7.5),
    Factor(name="Glucose", low=10, high=50, units="g/L"),
]
strategy = recommend_strategy(
    factors=factors,
    responses=[Response(name="Yield", goal="maximize", units="g/L")],
    budget=40,
    domain="fermentation",
)
for s in strategy["stages"]:
    print(s["stage_number"], s["design_type"], s["estimated_runs"])

Longer, fully worked versions of each example live in the
Quickstart guide
and the process_improve/notebooks_examples/ folder.
New to designed experiments? The Applied DoE tutorial is an eight-module worked-solution series.
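For intuition about what a two-level full-factorial design looks like, the sketch below enumerates all runs for the three factors in the example above using only the standard library. It does not call process-improve and is not how `recommend_strategy` builds its stages; the helper `to_real` is hypothetical.

```python
from itertools import product

# Factor ranges taken from the example above: name -> (low, high)
factors = {
    "Temperature": (25, 40),
    "pH": (5.0, 7.5),
    "Glucose": (10, 50),
}

# Every combination of low (-1) and high (+1) coded levels: 2**3 = 8 runs
coded_runs = list(product((-1, +1), repeat=len(factors)))


def to_real(coded_row):
    """Map one row of coded levels back to real-world units."""
    return {
        name: low if level == -1 else high
        for (name, (low, high)), level in zip(factors.items(), coded_row)
    }


design = [to_real(row) for row in coded_runs]
for run_number, run in enumerate(design, start=1):
    print(run_number, run)
```

The runs are listed here in standard order; in practice the run order is usually randomized before the experiments are carried out.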
PCA and PLS follow scikit-learn conventions: fit() returns self, fitted
attributes end with a trailing underscore (scores_, loadings_, spe_,
hotellings_t2_, r2_cumulative_, ...), and predict() returns an
sklearn.utils.Bunch with named fields (y_hat, spe, hotellings_t2, ...).
Inputs are accepted as pandas.DataFrame, and index/column labels are
preserved through fit and transform.
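The conventions described above can be sketched with a toy estimator. Everything here is hypothetical and stdlib-only: `MeanModel` is not part of process-improve, and `types.SimpleNamespace` stands in for `sklearn.utils.Bunch` purely to show attribute-style access to named result fields.

```python
from types import SimpleNamespace


class MeanModel:
    """Toy estimator (hypothetical) that predicts the training mean."""

    def fit(self, y):
        # Learned state is stored with a trailing underscore
        self.mean_ = sum(y) / len(y)
        # Returning self allows MeanModel().fit(y) to be chained
        return self

    def predict(self, n_rows):
        # SimpleNamespace stands in here for sklearn.utils.Bunch
        y_hat = [self.mean_] * n_rows
        return SimpleNamespace(y_hat=y_hat)


model = MeanModel().fit([1.0, 2.0, 3.0])
result = model.predict(2)
print(model.mean_, result.y_hat)
```

Returning a container with named fields from `predict()` lets diagnostics travel alongside the point predictions without changing the call signature.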
- API reference & user guide: https://kgdunn.github.io/process-improve/
- Applied DoE tutorial (8 modules): https://kgdunn.github.io/process-improve/applied_doe/index.html
- Companion textbook: Process Improvement using Data
- Hosted experiment-design tool: factori.al
- Local docs build:
cd docs && make html
If you use this package in academic work, please cite it:
@software{dunn_process_improve,
author = {Dunn, Kevin G.},
title = {{process-improve: Multivariate Analysis for Process Improvement}},
year = {2026},
version = {v1.21.4},
url = {https://github.com/kgdunn/process-improve}
}

A CITATION.cff file is included, so GitHub renders a "Cite this
repository" button in the sidebar.
Pull requests are welcome; see CONTRIBUTING.md for development setup, testing, and code style. File bug reports and feature requests on the issue tracker.
MIT - see LICENSE for details.