A PyTorch framework for compressing deep time series classification models via knowledge distillation (KD). The project trains a large teacher network on the UCR Archive and then transfers its knowledge to a smaller student network, allowing the student to approach teacher accuracy with a fraction of the parameters.
Three backbone architectures are supported out of the box: Inception, FCN, and ConvTran.
- Overview
- Repository Structure
- Installation
- Dataset Preparation
- Quick Start
- Configuration
- Architectures
- Knowledge Distillation
- Output Structure
- Command-Line Arguments
- Extending the Framework
Knowledge distillation trains a compact student model to mimic a larger pre-trained teacher using a weighted combination of:
- Student cross-entropy loss against ground-truth labels
- Distillation loss (KL divergence) between softened teacher and student logits at a chosen temperature T
The combined objective is:
L = α · CE(student, labels) + (1 − α) · KL(softmax(teacher/T) ‖ softmax(student/T))
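For concreteness, here is a minimal PyTorch sketch of this objective; the framework's actual implementation lives in `distiller.py`:

```python
import torch.nn.functional as F

def kd_objective(student_logits, teacher_logits, labels, alpha=0.5, T=10.0):
    """Sketch of the combined KD objective; see distiller.py for the real one."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),  # student log-probabilities
        F.softmax(teacher_logits / T, dim=1),      # softened teacher targets
        reduction='batchmean',
    )  # note: no T**2 factor here; see the Knowledge Distillation section
    return alpha * ce + (1 - alpha) * kl
```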
This framework runs three classifier modes:
- `teacher` — train the full-capacity model
- `student_alone` — train the compact model from scratch (KD baseline)
- `student_kd` — train the compact model with the teacher's guidance
Each experiment is repeated over multiple iterations, and the best teacher (lowest training loss) is selected to supervise the students.
```
KD4TSC/
├── config.py              # All hyperparameters and paths
├── main.py                # Experiment orchestrator
├── distiller.py           # DistillationLoss + Distiller trainer
├── utils.py               # Metrics, logging, teacher selection
├── analyze_results.py     # (placeholder for result analysis)
├── data/
│   └── data_utils.py      # UCR loader, z-normalization, DataLoaders
├── models/
│   ├── inception.py       # InceptionTime architecture
│   ├── fcn.py             # Fully Convolutional Network
│   ├── convtran.py        # ConvTran (Conv + Transformer)
│   ├── Attention.py       # Self-attention variants (abs, rel-scalar, rel-vector)
│   └── AbsolutePositionalEncoding.py  # tAPE / APE / learnable positional encodings
└── trainers/
    └── trainer.py         # Unified Trainer for all architectures
```
- Python ≥ 3.8
- PyTorch ≥ 1.10 (with CUDA recommended)
- NumPy, pandas, scikit-learn
- einops (used by attention modules)
```bash
# Clone the project
git clone <your-repo-url> KD4TSC
cd KD4TSC

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install torch numpy pandas scikit-learn einops
```

Verify GPU availability:

```bash
python -c "import torch; print(torch.cuda.is_available())"
```

The framework uses the UCR Archive 2018 (univariate time series). Each dataset is expected at:

```
<PATH_DATA>/<DatasetName>/<DatasetName>_TRAIN.tsv
<PATH_DATA>/<DatasetName>/<DatasetName>_TEST.tsv
```
Each .tsv file contains one sample per row; the first column is the class label and the remaining columns are the time series values.
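For reference, one split can be loaded like this (a standalone sketch; the framework's own loader lives in `data/data_utils.py`):

```python
import numpy as np

def load_ucr_split(path):
    """Read one UCR .tsv split: first column is the label, the rest the series."""
    raw = np.loadtxt(path, delimiter='\t')
    y = raw[:, 0].astype(int)  # class labels
    X = raw[:, 1:]             # series values, shape (n_samples, series_length)
    return X, y

X_train, y_train = load_ucr_split('/path/to/UCRArchive_2018/Coffee/Coffee_TRAIN.tsv')
```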
Update `PATH_DATA` in `config.py` to point to your local UCR Archive root:

```python
PATH_DATA = '/path/to/UCRArchive_2018/'
```

Open `config.py` and set:
- `ARCHITECTURE` → `'inception'`, `'fcn'`, or `'convtran'`
- `CLASSIFIERS` → which modes to run (e.g. `['teacher', 'student_kd', 'student_alone']`)
- `UNIVARIATE_DATASET_NAMES_2018` → list of datasets (defaults to one dataset for a quick test)
- `PATH_DATA` and `PATH_OUT` → input/output paths
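A quick-test configuration might look like this (values are illustrative):

```python
# Illustrative quick-test values for config.py
ARCHITECTURE = 'inception'
CLASSIFIERS = ['teacher', 'student_kd', 'student_alone']
UNIVARIATE_DATASET_NAMES_2018 = ['Coffee']
PATH_DATA = '/path/to/UCRArchive_2018/'
PATH_OUT = '/path/to/results/'
```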
Students need a trained teacher. Run only the teacher mode:
```bash
python main.py --classifiers teacher
```

This will train teachers for every dataset × iteration combination and save checkpoints under `PATH_OUT/<ARCH>/results/teacher/...`.
Once teachers exist:
```bash
python main.py --classifiers student_kd student_alone
```

The framework will automatically pick the best teacher per dataset (lowest training loss) and train students for each (alpha, temperature) combination defined in `config.py`.
To run a custom subset:

```bash
python main.py --datasets ACSF1 Coffee --classifiers student_kd --iterations 3
```

All knobs live in `config.py`. Key sections:
```python
EPOCHS = 1500
BATCH_SIZE = 64
LEARNING_RATE = 0.001
PATIENCE = 50        # ReduceLROnPlateau patience
MIN_LR = 0.0001
LR_FACTOR = 0.5
```

```python
ALPHA_LIST = [0.5]        # weight for the student CE loss
TEMPERATURE_LIST = [10]   # softening temperature
```

A full grid is run over `ALPHA_LIST × TEMPERATURE_LIST × ITERATIONS`.
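Conceptually, the `student_kd` sweep is a plain cartesian product over these lists (a sketch, not the literal loop in `main.py`; `ITERATIONS` is shown just below, and `run_student_kd` is a hypothetical helper):

```python
from itertools import product

for alpha, T in product(ALPHA_LIST, TEMPERATURE_LIST):
    for itr in range(ITERATIONS['student_kd']):
        run_student_kd(alpha=alpha, temperature=T, iteration=itr)  # hypothetical helper
```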
```python
ITERATIONS = {
    'teacher': 5,
    'student_kd': 5,
    'student_alone': 5,
}
```

See the Architectures section below.
The active architecture is set by ARCHITECTURE in config.py. Each one supports independent teacher/student configurations.
Stacked Inception blocks with optional residual connections and a bottleneck layer.
```python
INCEPTION_TEACHER_DEPTH = 6    # teacher has 6 Inception blocks
INCEPTION_STUDENT_DEPTH = 4    # student has 4 (≈33% compression)
INCEPTION_NB_FILTERS = 32
INCEPTION_BOTTLENECK_SIZE = 32
INCEPTION_KERNEL_SIZE = 40
```

Compression dimension: depth (number of blocks).
A Fully Convolutional Network with flexible depth and width.
```python
FCN_TEACHER_FILTERS = [128, 256, 128]   # 3 layers
FCN_STUDENT_FILTERS = [20, 40, 20]      # 3 layers, ≈15% of the filters → heavy width compression
FCN_TEACHER_KERNEL_SIZES = [8, 5, 3]
FCN_STUDENT_KERNEL_SIZES = None         # falls back to defaults
```

Several compression strategies are supported:
| Strategy | Example | Notes |
|---|---|---|
| Width only | `[20, 40, 20]` vs `[128, 256, 128]` | Same depth, fewer filters |
| Depth only | `[128, 256]` vs `[128, 256, 128]` | Fewer layers, full filters |
| Depth + width | `[64, 128]` vs `[128, 256, 128]` | Maximum compression |
| Deeper, narrower | `[64, 96, 128, 96, 64]` | More layers, varied filters |
The main script reports the compression ratio at startup.
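One common way to compute such a ratio, assuming `teacher` and `student` are instantiated `nn.Module`s (the exact figure printed by `main.py` may be derived differently):

```python
def count_params(model):
    """Number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

ratio = count_params(student) / count_params(teacher)
print(f"student keeps {ratio:.1%} of the teacher's parameters")
```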
A hybrid Conv → Transformer architecture with relative positional encoding (eRPE / tAPE).
```python
TEACHER_NUM_HEADS = 8
STUDENT_NUM_HEADS = 6
```

Compression dimension: number of attention heads. Embedding size, feed-forward width, and dropout are fixed inside the model (`emb_size=24`, `dim_ff=256`).
The Distiller class (distiller.py) wraps a frozen teacher and a trainable student:
```python
distiller = Distiller(student, teacher, alpha=0.5, temperature=10.0, device='cuda')
```

For each batch:
- Teacher logits are computed under `torch.no_grad()`.
- Student logits are computed normally.
- `DistillationLoss` combines:
  - `CrossEntropyLoss(student_logits, labels)`
  - `KLDivLoss(log_softmax(student/T), softmax(teacher/T))`
Note that the KL term (computed via `nn.KLDivLoss(reduction='batchmean')`) is not multiplied by T², which differs from the original Hinton et al. formulation. To replicate that paper exactly, scale the distillation term by T².
Results are written to `PATH_OUT/<ARCHITECTURE>/results/`:

```
Results/<ARCH>/results/
├── teacher/
│   └── UCRArchive_2018_itr_<i>/<Dataset>/
│       ├── best_model.pth
│       ├── last_model.pth
│       ├── history.csv
│       ├── df_metrics.csv     # test accuracy, precision, recall, duration
│       ├── df_best_model.csv  # epoch / loss of best checkpoint
│       └── DONE               # marker file (skip if exists)
├── student_alone/
│   └── UCRArchive_2018_itr_<i>/<Dataset>/...
└── student_kd/
    └── alpha_<α>/temperature_<T>/UCRArchive_2018_itr_<i>/<Dataset>/...
```
The DONE marker enables safe resumption — re-running main.py skips completed experiments.
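The resumption logic amounts to a marker check along these lines (a sketch using `pathlib`; the actual code lives in `main.py`):

```python
from pathlib import Path

def already_done(out_dir: Path) -> bool:
    """True if this experiment's output folder carries a DONE marker."""
    return (out_dir / 'DONE').exists()

def mark_done(out_dir: Path) -> None:
    """Create the DONE marker once training and evaluation finish."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / 'DONE').touch()
```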
utils.get_best_teacher() scans the teacher iterations for a given dataset and picks the checkpoint with the lowest training loss (df_best_model.csv). Set BEST_TEACHER_ONLY = True in config.py to enable this (default).
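Its logic boils down to a scan like this (a sketch; the loss column name in `df_best_model.csv` is an assumption):

```python
from pathlib import Path
import pandas as pd

def best_teacher_checkpoint(results_root, dataset, n_iters):
    """Pick the teacher iteration with the lowest recorded training loss."""
    best_loss, best_path = float('inf'), None
    for itr in range(n_iters):
        run_dir = Path(results_root) / 'teacher' / f'UCRArchive_2018_itr_{itr}' / dataset
        df = pd.read_csv(run_dir / 'df_best_model.csv')
        loss = float(df['best_model_train_loss'].iloc[0])  # column name assumed
        if loss < best_loss:
            best_loss, best_path = loss, run_dir / 'best_model.pth'
    return best_path
```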
```
python main.py [--datasets D1 D2 ...]
               [--classifiers C1 C2 ...]
               [--iterations N]
               [--architecture {inception,fcn,convtran}]
```

| Flag | Default | Description |
|---|---|---|
| `--datasets` | All in `UNIVARIATE_DATASET_NAMES_2018` | Subset of datasets to run |
| `--classifiers` | `CLASSIFIERS` from config | One or more of `teacher`, `student_kd`, `student_alone` |
| `--iterations` | From `ITERATIONS` dict | Override iteration count for all classifiers |
| `--architecture` | `ARCHITECTURE` from config | Override the backbone |
Examples:
```bash
# Train teachers on three datasets, three iterations each
python main.py --classifiers teacher --datasets ACSF1 Coffee GunPoint --iterations 3

# Run KD students with FCN architecture
python main.py --architecture fcn --classifiers student_kd

# Full pipeline on a single dataset
python main.py --datasets ACSF1 --classifiers teacher student_kd student_alone
```

To add a new architecture:

- Create `models/<your_model>.py` with a `forward(x) -> logits` model class.
- In `trainers/trainer.py`, add a `_build_<your_arch>_model(...)` method that handles all three model types (`teacher`, `student_alone`, `student_kd`).
- Wire it into `Trainer.__init__` and add a config branch in `main.py` for the architecture-specific parameters.
To customize the distillation loss, subclass `DistillationLoss` in `distiller.py` and plug it into `Distiller`. For example, to add feature-map distillation, expose intermediate activations from your model and add an MSE term between teacher and student features, as sketched below.
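A sketch of such a subclass, assuming `DistillationLoss` accepts `alpha`/`temperature` keywords and that its `forward` takes `(student_logits, teacher_logits, labels)`; adapt both to the actual signatures in `distiller.py`:

```python
import torch.nn.functional as F

from distiller import DistillationLoss  # constructor/forward signatures assumed

class FeatureDistillationLoss(DistillationLoss):
    """Logit-level KD plus an MSE term on intermediate feature maps."""

    def __init__(self, alpha=0.5, temperature=10.0, beta=1.0):
        super().__init__(alpha=alpha, temperature=temperature)
        self.beta = beta  # weight of the feature-matching term

    def forward(self, student_logits, teacher_logits, labels,
                student_feats=None, teacher_feats=None):
        loss = super().forward(student_logits, teacher_logits, labels)
        if student_feats is not None and teacher_feats is not None:
            # Requires both models to expose activations of matching shape
            loss = loss + self.beta * F.mse_loss(student_feats, teacher_feats)
        return loss
```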
To sweep KD hyperparameters, extend `ALPHA_LIST` and `TEMPERATURE_LIST` in `config.py`; the main loop already takes the cartesian product. Results are organized into separate folders per (α, T) pair for easy comparison.
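For example (values are illustrative):

```python
ALPHA_LIST = [0.1, 0.3, 0.5, 0.7, 0.9]
TEMPERATURE_LIST = [2, 5, 10, 20]   # 5 x 4 = 20 (alpha, T) settings per dataset
```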
- `val_loader` is `None` in `Trainer.fit`, so the "validation" columns in `history.csv` are zeros. The best model is selected by training loss. Test-set evaluation happens once at the end via `save_logs`.
- One-hot and integer labels are both handled: labels are converted to indices before cross-entropy.
- Z-normalization is applied per sample inside `data_utils.create_data_loaders`.
- Reproducibility: `torch.manual_seed(42)` is set only for the train DataLoader; for full determinism, also seed NumPy and Python's `random`, and set `torch.backends.cudnn.deterministic = True` (a complete seeding block is shown after this list).
- InceptionTime input channels: the first block assumes 1 input channel (univariate). For multivariate data, adjust `Inception_block.__init__`.