This workshop introduces deep learning for image classification using PyTorch. It is based on the ML Zoomcamp Deep Learning module (08-deep-learning) but adapted to use PyTorch instead of TensorFlow/Keras.
Before preparing this workshop, I asked Gemini (inside Google Colab) to translate my Keras code into PyTorch and this is what happened. This workshop is based on the code from this notebook.
In this workshop, you will learn how to build an image classification model using PyTorch and transfer learning. We'll work with a clothing dataset and progressively improve our model through experimentation and optimization.
- Introduction to PyTorch for deep learning
- Loading and preprocessing image data
- Using pre-trained models (MobileNetV2)
- Understanding convolutional neural networks (CNNs)
- Transfer learning: adapting pre-trained models
- Hyperparameter tuning: learning rate optimization
- Model checkpointing: saving the best model
- Adding more layers to improve performance
- Dropout regularization to prevent overfitting
- Data augmentation for better generalization
- Training the final model
- Using the model for predictions
- Exporting models to ONNX format
Tools:
- PyTorch
- torchvision
- PIL (Pillow)
- NumPy
We will use Google Colab, so no setup (like installing libraries or CUDA) is required.
Download the dataset:
git clone https://github.com/alexeygrigorev/clothing-dataset-small.gitThe dataset contains:
- 10 clothing categories (dress, hat, longsleeve, outwear, pants, shirt, shoes, shorts, skirt, t-shirt)
- Training, validation, and test splits
- Pre-organized directory structure
PyTorch is a popular open-source deep learning framework developed by Facebook's AI Research lab. It provides:
- Dynamic computation graphs (define-by-run)
- Pythonic API
- Strong GPU acceleration
- Rich ecosystem of tools and libraries
Key Differences from TensorFlow/Keras:
| TensorFlow/Keras | PyTorch |
|---|---|
model.fit() |
Manual training loop |
ImageDataGenerator |
Dataset + DataLoader + transforms |
keras.layers.Dense() |
nn.Linear() |
keras.Model |
nn.Module |
.h5 or .keras files |
.pth or .pt files |
PyTorch is a popular open-source deep learning framework developed by Facebook's AI Research lab.
Key differences from TensorFlow/Keras:
- Dynamic computation graphs (define-by-run)
- More Pythonic and flexible
- Manual training loops instead of
model.fit() - Explicit device management (CPU/GPU)
Images are represented as 3D arrays:
- Height × Width × Channels
- Channels: RGB (Red, Green, Blue)
- Each channel: 8 bits (0-255 values)
from PIL import Image
import numpy as np
# Load an image
img = Image.open('clothing-dataset-small/train/pants/0098b991-e36e-4ef1-b5ee-4154b21e2a92.jpg')
# Resize to target size
img = img.resize((224, 224))
# Convert to numpy array
x = np.array(img)
print(x.shape) # (224, 224, 3)Instead of training from scratch, we'll use a model pre-trained on ImageNet (1.4M images, 1000 classes).
Why use pre-trained models?
- Already learned to recognize edges, textures, shapes
- Saves training time
- Works well even with small datasets
- Better performance than training from scratch
We'll use MobileNetV2 (in the original tutorial we used Xception):
import torch
import torchvision.models as models
from torchvision import transforms
import numpy as np
# Load pre-trained model
model = models.mobilenet_v2(weights='IMAGENET1K_V1')
model.eval()
# Preprocessing for MobileNetV2
preprocess = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(224),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])Now do the prediction:
img = Image.open('clothing-dataset-small/train/pants/0098b991-e36e-4ef1-b5ee-4154b21e2a92.jpg')
img_t = preprocess(img)
batch_t = torch.unsqueeze(img_t, 0)
# Make prediction
with torch.no_grad():
output = model(batch_t)
# Get top predictions
_, indices = torch.sort(output, descending=True)Let's see what's inside:
!wget https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt -O imagenet_classes.txt
# Load ImageNet class names
with open("imagenet_classes.txt", "r") as f:
categories = [s.strip() for s in f.readlines()]
# Get top 5 predictions
top5_indices = indices[0, :5].tolist()
top5_classes = [categories[i] for i in top5_indices]
print("Top 5 predictions:")
for i, class_name in enumerate(top5_classes):
print(f"{i+1}: {class_name}")Key concepts:
- Input size: MobileNetV2 expects 224×224 images (Xception uses 299×299)
- Normalization: Images scaled with ImageNet mean and std
- Batch size: Number of images processed together
- Batch dimension: Shape (batch_size, channels, height, width) - e.g., (1, 3, 224, 224)
Convolutional Neural Networks (CNNs) are specialized neural networks for processing grid-like data such as images.
Key Components:
-
Convolutional Layer: Extracts features using filters
- Applies filters (e.g., 3×3, 5×5) to detect patterns
- Creates feature maps (one per filter)
- Detects edges, textures, shapes
-
ReLU Activation: Introduces non-linearity
f(x) = max(0, x)- Sets negative values to 0
- Helps network learn complex patterns
-
Pooling Layer: Down-samples feature maps
- Reduces spatial dimensions
- Max pooling: takes maximum value in a region
- Makes features more robust to small translations
-
Fully Connected (Dense) Layer: Final classification
- Flattens 2D feature maps to 1D vector
- Connects to output classes
CNN Workflow:
Input Image → Conv + ReLU → Pooling → Conv + ReLU → Pooling → Flatten → Dense → Output
Transfer Learning reuses a model trained on one task (ImageNet) for a different task (clothing classification).
Approach:
- Load pre-trained model (feature extractor)
- Remove original classification head
- Freeze convolutional layers
- Add custom layers for our task
- Train only the new layers
First, create a PyTorch Dataset to load images:
import os
from torch.utils.data import Dataset
from PIL import Image
class ClothingDataset(Dataset):
def __init__(self, data_dir, transform=None):
self.data_dir = data_dir
self.transform = transform
self.image_paths = []
self.labels = []
self.classes = sorted(os.listdir(data_dir))
self.class_to_idx = {cls: i for i, cls in enumerate(self.classes)}
for label_name in self.classes:
label_dir = os.path.join(data_dir, label_name)
for img_name in os.listdir(label_dir):
self.image_paths.append(os.path.join(label_dir, img_name))
self.labels.append(self.class_to_idx[label_name])
def __len__(self):
return len(self.image_paths)
def __getitem__(self, idx):
img_path = self.image_paths[idx]
image = Image.open(img_path).convert('RGB')
label = self.labels[idx]
if self.transform:
image = self.transform(image)
return image, labelfrom torchvision import transforms
input_size = 224
# ImageNet normalization values
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
# Simple transforms - just resize and normalize
train_transforms = transforms.Compose([
transforms.Resize((input_size, input_size)),
transforms.ToTensor(),
transforms.Normalize(mean=mean, std=std)
])
val_transforms = transforms.Compose([
transforms.Resize((input_size, input_size)),
transforms.ToTensor(),
transforms.Normalize(mean=mean, std=std)
])from torch.utils.data import DataLoader
train_dataset = ClothingDataset(
data_dir='./clothing-dataset-small/train',
transform=train_transforms
)
val_dataset = ClothingDataset(
data_dir='./clothing-dataset-small/validation',
transform=val_transforms
)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)import torch.nn as nn
import torchvision.models as models
class ClothingClassifierMobileNet(nn.Module):
def __init__(self, num_classes=10):
super(ClothingClassifierMobileNet, self).__init__()
# Load pre-trained MobileNetV2
self.base_model = models.mobilenet_v2(weights='IMAGENET1K_V1')
# Freeze base model parameters
for param in self.base_model.parameters():
param.requires_grad = False
# Remove original classifier
self.base_model.classifier = nn.Identity()
# Add custom layers
self.global_avg_pooling = nn.AdaptiveAvgPool2d((1, 1))
self.output_layer = nn.Linear(1280, num_classes)
def forward(self, x):
x = self.base_model.features(x)
x = self.global_avg_pooling(x)
x = torch.flatten(x, 1)
x = self.output_layer(x)
return ximport torch
import torch.optim as optim
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ClothingClassifierMobileNet(num_classes=10)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()Now train it:
# Training loop
num_epochs = 10
for epoch in range(num_epochs):
# Training phase
model.train() # Set the model to training mode
running_loss = 0.0
correct = 0
total = 0
# Iterate over the training data
for inputs, labels in train_loader:
# Move data to the specified device (GPU or CPU)
inputs, labels = inputs.to(device), labels.to(device)
# Zero the parameter gradients to prevent accumulation
optimizer.zero_grad()
# Forward pass
outputs = model(inputs)
# Calculate the loss
loss = criterion(outputs, labels)
# Backward pass and optimize
loss.backward()
optimizer.step()
# Accumulate training loss
running_loss += loss.item()
# Get predictions
_, predicted = torch.max(outputs.data, 1)
# Update total and correct predictions
total += labels.size(0)
correct += (predicted == labels).sum().item()
# Calculate average training loss and accuracy
train_loss = running_loss / len(train_loader)
train_acc = correct / total
# Validation phase
model.eval() # Set the model to evaluation mode
val_loss = 0.0
val_correct = 0
val_total = 0
# Disable gradient calculation for validation
with torch.no_grad():
# Iterate over the validation data
for inputs, labels in val_loader:
# Move data to the specified device (GPU or CPU)
inputs, labels = inputs.to(device), labels.to(device)
# Forward pass
outputs = model(inputs)
# Calculate the loss
loss = criterion(outputs, labels)
# Accumulate validation loss
val_loss += loss.item()
# Get predictions
_, predicted = torch.max(outputs.data, 1)
# Update total and correct predictions
val_total += labels.size(0)
val_correct += (predicted == labels).sum().item()
# Calculate average validation loss and accuracy
val_loss /= len(val_loader)
val_acc = val_correct / val_total
# Print epoch results
print(f'Epoch {epoch+1}/{num_epochs}')
print(f' Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}')
print(f' Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}')It's a lower-level framework, that's why we need to implement some of these things like calculating accuracy on validation.
The line optimizer.zero_grad() is crucial in the training loop.
In PyTorch, gradients are accumulated by default. This means that if you don't zero the gradients before calculating the gradients for the current batch, the gradients from the previous batch will be added to the gradients of the current batch. This would lead to incorrect updates to your model's parameters.
By calling optimizer.zero_grad(), you clear out the old gradients, ensuring that the gradients calculated during the loss.backward() call are only based on the current batch of data. This is essential for the optimizer to take the correct step during optimizer.step().
model.train() and model.eval() are needed to manage the behavior of certain layers during training and evaluation.
model.train() sets the model to training mode. In training mode, layers like Dropout and BatchNorm behave differently. Dropout layers are active (randomly dropping neurons), and BatchNorm layers update their running statistics (mean and variance) based on the current batch.
model.eval() sets the model to evaluation mode. In evaluation mode, Dropout layers are inactive (they pass through all neurons), and BatchNorm layers use their accumulated running statistics instead of the current batch statistics. This ensures consistent behavior during inference and prevents randomness from affecting the evaluation results.
Let's put it inside a function so it's easier for us to call it:
def train_and_evaluate(model, optimizer, train_loader, val_loader, criterion, num_epochs, device):
for epoch in range(num_epochs):
# Training phase
model.train()
running_loss = 0.0
correct = 0
total = 0
for inputs, labels in train_loader:
inputs, labels = inputs.to(device), labels.to(device)
optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
running_loss += loss.item()
_, predicted = torch.max(outputs.data, 1)
total += labels.size(0)
correct += (predicted == labels).sum().item()
train_loss = running_loss / len(train_loader)
train_acc = correct / total
# Validation phase
model.eval()
val_loss = 0.0
val_correct = 0
val_total = 0
with torch.no_grad():
for inputs, labels in val_loader:
inputs, labels = inputs.to(device), labels.to(device)
outputs = model(inputs)
loss = criterion(outputs, labels)
val_loss += loss.item()
_, predicted = torch.max(outputs.data, 1)
val_total += labels.size(0)
val_correct += (predicted == labels).sum().item()
val_loss /= len(val_loader)
val_acc = val_correct / val_total
print(f'Epoch {epoch+1}/{num_epochs}')
print(f' Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f}')
print(f' Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}')The learning rate controls how much to update model weights during training. It's one of the most important hyperparameters.
Analogy: Reading speed
- Too fast: Skip details, poor understanding (may not converge)
- Too slow: Never finish the book (training takes too long)
- Just right: Good comprehension and efficiency
Experimentation approach:
- Try multiple values:
[0.0001, 0.001, 0.01, 0.1] - Train for a few epochs each
- Compare validation accuracy
- Choose the rate with best performance and smallest train/val gap
def make_model(learning_rate=0.01):
model = ClothingClassifierMobileNet(num_classes=10)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
return model, optimizerLet's test different learning rates:
learning_rates = [0.0001, 0.001, 0.01, 0.1]
for lr in learning_rates:
print(f'\n=== Learning Rate: {lr} ===')
model, optimizer = make_model(learning_rate=lr)
train_and_evaluate(model, optimizer, train_loader, val_loader, criterion, num_epochs, device)The best learning rate is 0.001 (accuracy 0.815).
Checkpointing saves the model during training to:
- Keep the best performing model
- Resume training if interrupted
- Avoid losing progress
Update the train function:
def train_and_evaluate_with_checkpointing(model, optimizer, train_loader, val_loader, criterion, num_epochs, device):
best_val_accuracy = 0.0 # Initialize variable to track the best validation accuracy
# exising code
# Checkpoint the model if validation accuracy improved
if val_acc > best_val_accuracy:
best_val_accuracy = val_acc
checkpoint_path = f'mobilenet_v2_{epoch+1:02d}_{val_acc:.3f}.pth'
torch.save(model.state_dict(), checkpoint_path)
print(f'Checkpoint saved: {checkpoint_path}')TensorFlow/Keras equivalent:
- Keras:
ModelCheckpointcallback - PyTorch: Manual saving in training loop
We can add intermediate dense layers between feature extraction and output:
class ClothingClassifierMobileNet(nn.Module):
def __init__(self, size_inner=100, num_classes=10):
super(ClothingClassifierMobileNet, self).__init__()
self.base_model = models.mobilenet_v2(weights='IMAGENET1K_V1')
for param in self.base_model.parameters():
param.requires_grad = False
self.base_model.classifier = nn.Identity()
self.global_avg_pooling = nn.AdaptiveAvgPool2d((1, 1))
self.inner = nn.Linear(1280, size_inner) # New inner layer
self.relu = nn.ReLU()
self.output_layer = nn.Linear(size_inner, num_classes)
def forward(self, x):
x = self.base_model.features(x)
x = self.global_avg_pooling(x)
x = torch.flatten(x, 1)
x = self.inner(x)
x = self.relu(x)
x = self.output_layer(x)
return xUpdate make_model:
def make_model(learning_rate=0.001, size_inner=100):
model = ClothingClassifierMobileNet(
num_classes=10,
size_inner=size_inner
)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
return model, optimizerExperiment with different sizes:
- Try:
size_inner = [10, 100, 1000] - Larger layers: more capacity, may overfit
- Smaller layers: faster, may underfit
Key points:
- Inner layer uses ReLU activation
- Output layer has no activation (logits)
CrossEntropyLossapplies softmax internally
Dropout randomly drops neurons during training to prevent overfitting.
How it works:
- Training: randomly set fraction of activations to 0
- Inference: use all neurons (dropout disabled automatically)
- Creates ensemble effect
Benefits:
- Prevents relying on specific features
- Forces learning robust patterns
- Reduces overfitting
class ClothingClassifierMobileNet(nn.Module):
def __init__(self, size_inner=100, droprate=0.2, num_classes=10):
super(ClothingClassifierMobileNet, self).__init__()
self.base_model = models.mobilenet_v2(weights='IMAGENET1K_V1')
for param in self.base_model.parameters():
param.requires_grad = False
self.base_model.classifier = nn.Identity()
self.global_avg_pooling = nn.AdaptiveAvgPool2d((1, 1))
self.inner = nn.Linear(1280, size_inner)
self.relu = nn.ReLU()
self.dropout = nn.Dropout(droprate) # Add dropout
self.output_layer = nn.Linear(size_inner, num_classes)
def forward(self, x):
x = self.base_model.features(x)
x = self.global_avg_pooling(x)
x = torch.flatten(x, 1)
x = self.inner(x)
x = self.relu(x)
x = self.dropout(x) # Apply dropout
x = self.output_layer(x)
return xUpdate our function:
def make_model(
learning_rate=0.001,
size_inner=100,
droprate=0.2
):
model = ClothingClassifierMobileNet(
num_classes=10,
size_inner=size_inner,
droprate=droprate
)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
return model, optimizerExperimentation:
- Try:
droprate = [0.0, 0.2, 0.5, 0.8] - Typical values: 0.2 to 0.5
- Higher dropout may need more training epochs
(Our case: best droprate is 0.2)
Data Augmentation artificially increases dataset size by applying random transformations to training images.
Common transformations:
- Rotation
- Horizontal/vertical flipping
- Zooming (random cropping)
- Shifting
- Shearing
Important rules:
- ✅ Apply ONLY to training data
- ❌ Never augment validation/test data
# Training transforms WITH augmentation
train_transforms = transforms.Compose([
transforms.RandomRotation(10), # Rotate up to 10 degrees
transforms.RandomResizedCrop(224, scale=(0.9, 1.0)), # Zoom
transforms.RandomHorizontalFlip(), # Horizontal flip
transforms.ToTensor(),
transforms.Normalize(mean=mean, std=std)
])
# Validation transforms - NO augmentation, same as before
val_transforms = transforms.Compose([
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize(mean=mean, std=std)
])When to use augmentation:
- Small datasets
- Risk of overfitting
- Images can appear in different orientations
Tips:
- Choose augmentations that make sense for your data
- Too much augmentation can hurt performance
- Usually requires longer training (more epochs)
- If no improvement after ~20 epochs, don't use it
import glob
# Find best checkpoint
list_of_files = glob.glob('mobilenet_v2_*.pth')
latest_file = max(list_of_files, key=os.path.getctime)
print(f"Loading model from: {latest_file}")
# Load model
model = ClothingClassifierMobileNet(size_inner=32, droprate=0.2, num_classes=10)
model.load_state_dict(torch.load(latest_file))
model.to(device)
model.eval()from keras_image_helper import create_preprocessor
import numpy as np
def preprocess_pytorch_style(X):
# X: shape (1, 224, 224, 3), dtype=float32, values in [0, 255]
X = X / 255.0
mean = np.array([0.485, 0.456, 0.406]).reshape(1, 3, 1, 1)
std = np.array([0.229, 0.224, 0.225]).reshape(1, 3, 1, 1)
# Convert NHWC → NCHW (batch, height, width, channels → batch, channels, height, width)
X = X.transpose(0, 3, 1, 2)
# Normalize
X = (X - mean) / std
return X.astype(np.float32)
preprocessor = create_preprocessor(preprocess_pytorch_style, target_size=(224, 224))
# Predict from URL
url = 'http://bit.ly/mlbookcamp-pants'
X = preprocessor.from_url(url)
X = torch.Tensor(X).to(device)
with torch.no_grad():
pred = model(X).cpu().numpy()[0]
classes = [
"dress", "hat", "longsleeve", "outwear", "pants",
"shirt", "shoes", "shorts", "skirt", "t-shirt"
]
result = dict(zip(classes, pred.tolist()))
print(result)ONNX (Open Neural Network Exchange) is a format for model interoperability.
Benefits:
- Deploy on different platforms
- Use optimized runtimes (ONNX Runtime)
- Better inference performance
- Language-agnostic deployment
# Create dummy input
dummy_input = torch.randn(1, 3, 224, 224).to(device)
# Export to ONNX
onnx_path = "clothing_classifier_mobilenet_v2.onnx"
torch.onnx.export(
model,
dummy_input,
onnx_path,
verbose=True,
input_names=['input'],
output_names=['output'],
dynamic_axes={
'input': {0: 'batch_size'},
'output': {0: 'batch_size'}
}
)
print(f"Model exported to {onnx_path}")We will use it in the Serverless module.
| Concept | TensorFlow/Keras | PyTorch |
|---|---|---|
| Framework | High-level API (Keras) on TensorFlow | Low-level, explicit control |
| Data Loading | ImageDataGenerator |
Dataset + DataLoader |
| Transforms | preprocessing_function |
transforms.Compose() |
| Model | Functional API or Sequential | nn.Module class |
| Layers | keras.layers.Dense() |
nn.Linear() |
| Training | model.fit() |
Manual training loop |
| Loss | CategoricalCrossentropy |
CrossEntropyLoss |
| Optimizer | keras.optimizers.Adam |
optim.Adam |
| Saving | .h5 or .keras |
.pth or .pt |
| Checkpointing | ModelCheckpoint callback |
Manual in training loop |
| Device | Automatic | Explicit .to(device) |
- Transfer Learning: Reuse pre-trained models for new tasks
- CNN Architecture: Conv layers → Pooling → Dense layers
- Hyperparameter Tuning: Learning rate is critical
- Regularization: Dropout prevents overfitting
- Data Augmentation: Increases effective dataset size
- Model Checkpointing: Save best models during training
- PyTorch Workflow: Dataset → DataLoader → Model → Training Loop
- Start with pre-trained models (transfer learning)
- Freeze convolutional layers initially
- Use appropriate normalization (match pre-training)
- Experiment with one hyperparameter at a time
- Monitor train/val gap for overfitting
- Use checkpointing to save best models
- Augment training data only, not validation
- Train longer with dropout and augmentation
- Use GPU when available:
torch.cuda.is_available()
- Try different pre-trained models (ResNet, EfficientNet)
- Experiment with learning rate schedulers
- Fine-tune the entire model (unfreeze convolutional layers)
- Try different optimizers (SGD with momentum, AdamW)
- Deploy the ONNX model (see mlzoomcamp-serverless)
- Explore the original TensorFlow/Keras version
- PyTorch Documentation
- torchvision Models
- ML Zoomcamp Course
- Original Tutorial (TensorFlow/Keras)
- ONNX Documentation
This workshop is based on the ML Zoomcamp Deep Learning module by Alexey Grigorev, adapted to use PyTorch instead of TensorFlow/Keras.