A dense autoencoder trained on MNIST, implemented in three languages: plain C, Python/PyTorch, and Cython/PyTorch. All share the same architecture, hyperparameters, and Adam optimiser so the comparison is fair.
784 → 512 → 128 → 16 → 128 → 512 → 784
All activations are ReLU. The bottleneck compresses each 28×28 image down to 16 values. Loss is pixel-wise mean-squared error (matches nn.MSELoss(reduction='mean')).
| Implementation | Wall-clock | Avg MSE loss (1 epoch) |
|---|---|---|
| C (from scratch) | ~10–13 s | ~0.06 |
| Python / PyTorch | ~13–16 s | ~0.06 |
| Cython / PyTorch | ~9–15 s | ~0.06 |
Benchmarked on 5 000 training images, 1 epoch, batch size 1.
C uses OpenBLAS (cblas_dgemm) for matrix multiply and OpenMP (8 threads) for Adam weight updates.
Written entirely from scratch — no ML framework.
- Matrix layout: flat row-major
double *datablock for cache locality;double **entriesrow-pointer array for ergonomic indexing. - BLAS: OpenBLAS
cblas_dgemmreplaces the hand-rolled matrix multiply. - Adam: bias-corrected (
lr_t = lr × √(1−β₂ᵗ) / (1−β₁ᵗ)) with a single in-place flat loop — no temporary matrix allocations during parameter updates. - Parallelism: OpenMP
#pragma omp parallel foron weight-matrix Adam steps (threshold: >2048 elements); thread count set inconfig.h. - Weight init: Ziggurat algorithm for normal-distributed values.
All hyperparameters live in C/config.h — no other source file needs to be touched to change architecture or learning rate.
PyTorch reference implementation. Serves as the correctness and speed baseline.
Same PyTorch model as Python with Cython type annotations on loop variables and scalar accumulators to reduce interpreter overhead in the training loop. PyTorch tensor ops dominate at runtime so the gain is modest.
# 1. Download MNIST data (CSV format)
bash C/download.sh
# 2. Create virtual environment and install Python dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtsource .venv/bin/activate
python plot.py # 1 epoch (quick sanity check)
python plot.py --epochs 100 # 100 epochs (better reconstructions, ~70 min)Outputs plots/results.png — convergence curves, wall-clock bar chart, and
side-by-side reconstructions from all three implementations.
bash compare.sh# C
cd C && make && ./main # 1 epoch (AE_EPOCHS in config.h)
cd C && make && ./main --epochs 100 # override at runtime
# Python / PyTorch
python Python/autoencoder.py
# Cython / PyTorch
cd Cython
python setup.py build_ext --inplace
python run_cython.py- C implementation forked from mnist-from-scratch by Mark Kraay
- Normal distribution generator: Ziggurat inline C
- PyTorch autoencoder: Implementing an Autoencoder in PyTorch
