Mini Diffusion

A minimal diffusion model implementation in Rust, built from scratch for educational purposes.

This project demonstrates the core concepts of modern diffusion models (like Stable Diffusion 3 and Flux) while teaching Rust programming patterns.

📋 Table of Contents

Educational Goals
Project Structure
Quick Start
Module Deep Dive
Rust Concepts Explained
Architecture Overview
Configuration
Implementation Status
Learning Resources

🎯 Educational Goals

This project serves two purposes:

Learn Diffusion Models: Understand tensors, neural networks, noise schedules, and sampling algorithms
Learn Rust: See how ML concepts translate to Rust's ownership model, traits, and performance

🏗️ Project Structure

mini-diffusion/
├── Cargo.toml           # Dependencies and project config
├── src/
│   ├── lib.rs           # Library exports
│   ├── tensor.rs        # Tensor implementation (ndarray wrapper)
│   ├── nn.rs            # Neural network layers (Linear, Conv2d, etc.)
│   ├── diffusion.rs     # Noise schedules and forward diffusion
│   ├── unet.rs          # U-Net architecture (DDPM-style)
│   ├── training.rs      # Training loop and optimizer
│   ├── sampling.rs      # DDPM/DDIM sampling
│   ├── vae.rs           # Variational Autoencoder
│   ├── tokenizer.rs     # BPE and Unigram tokenizers
│   ├── clip.rs          # CLIP text encoder
│   ├── t5.rs            # T5 text encoder
│   ├── flow.rs          # Flow matching (SD3-style)
│   ├── joint_attention.rs # Multi-modal attention
│   ├── dit.rs           # Diffusion Transformer
│   └── bin/
│       ├── train.rs     # Training demo
│       ├── generate.rs  # Generation demo  
│       └── demo_sd3.rs  # SD3 components demo

🚀 Quick Start

Prerequisites

Rust 1.70+ (install from rustup.rs)

Build

cd mini-diffusion
cargo build --release

Run Tests

# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

Run Demos

# Generate images with random weights (shows pipeline works)
cargo run --bin generate --release

# Training demo (shows structure, not real training)
cargo run --bin train --release

# SD3 components demo (tokenizers, flow matching, tensors)
cargo run --bin demo_sd3 --release

Example Output

Generate Demo:

Mini Diffusion - Image Generation
==================================

Sampler Configuration:
  - Steps: 50 (out of 1000 training steps)
  - Method: DDIM
  - Eta: 0

Creating model...
  - Parameters: 9572931

Note: Model has random weights (no training)
      Output will look like colored noise.

Generating 4 images of size 32x32...
  DDIM Sampling: [████████████████████████████████████████] 50/50
Generation complete!
  - Output shape: [4, 3, 32, 32]
  - Value range: [-1.000, 1.000]

Saving images...
Saved: generated_0.png
Images saved successfully!

SD3 Demo:

=== Mini-Diffusion: SD3-Style Components Demo ===

--- BPE Tokenizer (CLIP-style) ---
Input: "a photo of a cat"
Tokens: [0, 3, 116, 108, 115, 120, 3, 115, 3, 3, 103, 101, 3, 1]
Vocab size: 260

--- Unigram Tokenizer (T5-style) ---
Input: "a photo of a cat"
Tokens: [70, 56, 68, 70, 60, 2]
Decoded: "a photo of a cat"

--- Flow Matching Scheduler ---
Number of inference steps: 20
Sigma schedule (first 5): [1.0, 0.95, 0.9, 0.85, 0.8]

✅ All component demos completed successfully!

📚 Module Deep Dive

`tensor.rs` - The Foundation

Everything in ML starts with tensors. Our implementation wraps ndarray:

use mini_diffusion::Tensor;

// Create tensors
let zeros = Tensor::zeros(&[2, 3, 4]);      // Shape [2, 3, 4]
let ones = Tensor::ones(&[64, 64]);          // Shape [64, 64]
let noise = Tensor::randn(&[1, 4, 32, 32]); // Gaussian noise

// Operations (element-wise ops need operands of the same shape)
let a = Tensor::randn(&[64, 64]);
let b = Tensor::randn(&[64, 64]);
let sum = a.add(&ones);                      // Element-wise add
let product = a.mul(&b);                     // Element-wise multiply
let mm = a.matmul(&b);                       // Matrix multiplication

// Activations
let relu_out = a.relu();                     // max(0, x)
let silu_out = a.silu();                     // x * sigmoid(x)
let gelu_out = a.gelu();                     // Gaussian Error Linear Unit
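The activation comments can be made concrete on plain `f32` values. This is a minimal sketch that does not use the crate's `Tensor` type; the function names are illustrative, and the GELU uses the common tanh approximation, which may differ from what `tensor.rs` actually implements.

```rust
/// ReLU: max(0, x).
fn relu(x: f32) -> f32 {
    x.max(0.0)
}

/// Logistic sigmoid, used by SiLU.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// SiLU (also called swish): x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x * sigmoid(x)
}

/// GELU, tanh approximation:
/// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
fn gelu(x: f32) -> f32 {
    let c = (2.0 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (c * (x + 0.044715 * x * x * x)).tanh())
}

/// Apply an activation element-wise to a slice.
fn map_activation(xs: &[f32], f: fn(f32) -> f32) -> Vec<f32> {
    xs.iter().map(|&x| f(x)).collect()
}
```

Calling `map_activation(&values, silu)` mirrors what an element-wise tensor method would do over its whole buffer.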

`nn.rs` - Neural Network Layers

Building blocks for neural networks:

use mini_diffusion::{Linear, Conv2d, GroupNorm, LayerNorm};

// Fully connected layer: 768 → 3072
let linear = Linear::new(768, 3072);
let output = linear.forward(&input);  // [B, 768] → [B, 3072]

// 2D Convolution: 3 channels → 64 channels, 3x3 kernel
let conv = Conv2d::new(3, 64, 3, 1, 1);  // in, out, kernel, stride, padding
let features = conv.forward(&image);     // [B, 3, H, W] → [B, 64, H, W]

// Normalization
let gn = GroupNorm::new(32, 256);        // 32 groups, 256 channels
let ln = LayerNorm::new(768);            // Normalize last dim

`diffusion.rs` - The Diffusion Process

The forward process adds noise; the reverse process removes it:

use mini_diffusion::{NoiseScheduler, DiffusionConfig};

let config = DiffusionConfig {
    num_timesteps: 1000,
    beta_start: 1e-4,
    beta_end: 0.02,
    schedule: "cosine".to_string(),
};

let scheduler = NoiseScheduler::new(config);

// Forward diffusion: add noise at timestep t
let noisy = scheduler.add_noise(&clean_image, &noise, t);

// Cumulative alpha product (alpha-bar) at timestep t
let alpha_bar_t = scheduler.alphas_cumprod[t];
// x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
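To see where the cumulative alpha values come from, here is a standalone sketch of a linear beta schedule and the closed-form forward step on flat `f64` buffers. The function names are illustrative and the real `NoiseScheduler` may organize this differently; the coefficient passed to `add_noise` is the cumulative product, i.e. the value read from `alphas_cumprod`.

```rust
/// Linear beta schedule: `num_timesteps` values evenly spaced
/// from `beta_start` to `beta_end` (inclusive).
fn linear_betas(num_timesteps: usize, beta_start: f64, beta_end: f64) -> Vec<f64> {
    (0..num_timesteps)
        .map(|i| {
            let frac = if num_timesteps > 1 {
                i as f64 / (num_timesteps - 1) as f64
            } else {
                0.0
            };
            beta_start + frac * (beta_end - beta_start)
        })
        .collect()
}

/// Running product of alpha_t = 1 - beta_t.
fn alphas_cumprod(betas: &[f64]) -> Vec<f64> {
    let mut acc = 1.0_f64;
    betas
        .iter()
        .map(|b| {
            acc *= 1.0 - b;
            acc
        })
        .collect()
}

/// x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise
fn add_noise(x0: &[f64], noise: &[f64], alpha_bar_t: f64) -> Vec<f64> {
    assert_eq!(x0.len(), noise.len(), "x0 and noise must have the same length");
    let signal = alpha_bar_t.sqrt();
    let sigma = (1.0 - alpha_bar_t).sqrt();
    x0.iter().zip(noise).map(|(x, n)| signal * x + sigma * n).collect()
}
```

With 1000 steps from 1e-4 to 0.02, the cumulative product falls from 0.9999 at the first step to nearly zero at the last, so the final sample is almost pure noise.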

`unet.rs` - U-Net Architecture

The classic noise prediction network:

use mini_diffusion::UNet;

// Create U-Net: 3 input channels, 64 base channels, 3 output channels
let unet = UNet::new(3, 64, 3);
println!("Parameters: {}", unet.num_parameters());  // ~9.6M (9,572,931 in the generate demo)

// Forward pass: predict noise given noisy image and timestep
let predicted_noise = unet.forward(&noisy_image, timestep);

`sampling.rs` - Image Generation

DDPM and DDIM samplers for generating images:

use mini_diffusion::{Sampler, SamplerConfig};

let config = SamplerConfig {
    num_steps: 50,        // DDIM can use fewer steps
    guidance_scale: 1.0,  // CFG scale (1.0 = no guidance)
    use_ddim: true,       // Deterministic sampling
    eta: 0.0,             // DDIM noise (0 = fully deterministic)
};

let sampler = Sampler::new(config, noise_scheduler);

// Generate from pure noise
let generated = sampler.sample(&model, &[4, 3, 32, 32]);
save_images(&generated, "output")?;
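For intuition about what a deterministic DDIM update (`eta: 0.0`) does at each step, the sketch below works on flat `f32` buffers. It is not the crate's `Sampler`; the function names and the evenly strided timestep selection are assumptions made for illustration.

```rust
/// One deterministic DDIM update (eta = 0).
/// `alpha_bar_t` and `alpha_bar_prev` are the cumulative alpha products at the
/// current timestep and at the next, less noisy, timestep.
fn ddim_step(x_t: &[f32], eps: &[f32], alpha_bar_t: f32, alpha_bar_prev: f32) -> Vec<f32> {
    assert_eq!(x_t.len(), eps.len(), "sample and noise prediction must match");
    let sqrt_ab_t = alpha_bar_t.sqrt();
    let sqrt_one_minus_ab_t = (1.0 - alpha_bar_t).sqrt();
    let sqrt_ab_prev = alpha_bar_prev.sqrt();
    let sqrt_one_minus_ab_prev = (1.0 - alpha_bar_prev).sqrt();
    x_t.iter()
        .zip(eps)
        .map(|(&x, &e)| {
            // Clean-sample estimate implied by the predicted noise.
            let x0_pred = (x - sqrt_one_minus_ab_t * e) / sqrt_ab_t;
            // Move that estimate to the previous noise level, reusing the same eps.
            sqrt_ab_prev * x0_pred + sqrt_one_minus_ab_prev * e
        })
        .collect()
}

/// Evenly strided, descending subsequence of training timesteps,
/// e.g. 1000 training steps reduced to 50 sampling steps.
fn ddim_timesteps(num_train: usize, num_steps: usize) -> Vec<usize> {
    assert!(num_steps > 0 && num_steps <= num_train);
    let stride = num_train / num_steps;
    (0..num_steps).rev().map(|i| i * stride).collect()
}
```

Because no fresh noise is injected when eta is 0, the same starting noise always yields the same image, which is what makes DDIM sampling deterministic.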

`tokenizer.rs` - Text Tokenization

Convert text to token IDs:

use mini_diffusion::{BPETokenizer, UnigramTokenizer};

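// CLIP-style BPE
let bpe = BPETokenizer::new(77);  // max 77 tokens
let tokens = bpe.encode("a photo of a cat");
// With the full CLIP vocabulary this would be:
//   [49406, 320, 1125, 539, 320, 2368, 49407]
//   [BOS]   a   photo  of   a   cat   [EOS]
// The built-in 260-entry demo vocabulary yields byte-level IDs instead
// (see the SD3 Demo output under Example Output).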

// T5-style Unigram
let unigram = UnigramTokenizer::new(512);
let tokens = unigram.encode("a beautiful sunset");
let decoded = unigram.decode(&tokens);  // "a beautiful sunset"
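The core of BPE encoding is repeatedly replacing a known pair of adjacent tokens with a single merged token. The following is a simplified byte-level sketch of that idea, not the crate's `BPETokenizer`; the merge table format and function names are hypothetical, and special tokens such as BOS/EOS are left out.

```rust
/// Replace every non-overlapping occurrence of `pair` with `new_id`,
/// scanning left to right.
fn merge_pair(tokens: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

/// Start from raw bytes (IDs 0..=255), then apply the learned merges
/// in the order they were learned.
fn bpe_encode(text: &str, merges: &[((u32, u32), u32)]) -> Vec<u32> {
    let mut tokens: Vec<u32> = text.bytes().map(u32::from).collect();
    for &(pair, new_id) in merges {
        tokens = merge_pair(&tokens, pair, new_id);
    }
    tokens
}
```

A vocabulary of only a few hundred entries has room for very few merges beyond the 256 byte values, which is why a small demo tokenizer emits roughly one token per character.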

`flow.rs` - Flow Matching (SD3-style)

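Flow matching trains the model to predict the velocity along a straight-line path between data and noise, instead of predicting the noise itself: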

use mini_diffusion::Tensor;
use mini_diffusion::flow::{FlowMatchingScheduler, EulerSolver};

let scheduler = FlowMatchingScheduler::new(20);  // 20 inference steps

// Training: interpolate between data and noise at noise level sigma
let noisy = scheduler.add_noise(&data, &noise, sigma);
let velocity = scheduler.get_velocity(&data, &noise);
// Loss = ||model(noisy, t) - velocity||²

// Inference: start from pure noise and integrate with the Euler solver
let solver = EulerSolver::new(20);
let mut x = Tensor::randn(&[1, 4, 32, 32]);
for step in 0..20 {
    let v = model.forward(&x, step);
    x = solver.step(&x, &v, step);
}
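The arithmetic behind these calls is short enough to write out. This sketch assumes the rectified-flow convention in which sigma = 1 is pure noise and sigma = 0 is data, which is consistent with the sigma schedule printed by the SD3 demo; the function names are illustrative and not the API of `flow.rs`.

```rust
/// Linear sigma schedule from 1.0 (pure noise) down to 0.0 (data).
/// Returns num_steps + 1 values so that each step has a start and an end.
fn sigma_schedule(num_steps: usize) -> Vec<f32> {
    assert!(num_steps > 0, "need at least one step");
    (0..=num_steps)
        .map(|i| 1.0 - i as f32 / num_steps as f32)
        .collect()
}

/// x_sigma = (1 - sigma) * data + sigma * noise
fn interpolate(data: &[f32], noise: &[f32], sigma: f32) -> Vec<f32> {
    data.iter()
        .zip(noise)
        .map(|(d, n)| (1.0 - sigma) * d + sigma * n)
        .collect()
}

/// Training target: the derivative of x_sigma with respect to sigma, noise - data.
fn velocity(data: &[f32], noise: &[f32]) -> Vec<f32> {
    data.iter().zip(noise).map(|(d, n)| n - d).collect()
}

/// One Euler step: x <- x + (sigma_next - sigma) * v
fn euler_step(x: &[f32], v: &[f32], sigma: f32, sigma_next: f32) -> Vec<f32> {
    x.iter()
        .zip(v)
        .map(|(x, v)| x + (sigma_next - sigma) * v)
        .collect()
}
```

If the model predicted the velocity perfectly, stepping from sigma = 1 to sigma = 0 would land exactly on the data, because the path is a straight line. That is why flow matching tolerates a small number of inference steps.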

🦀 Rust Concepts Explained

Ownership and Borrowing

Our tensor operations demonstrate Rust's memory safety:

// tensor.rs - Owned data, borrowed operands
pub struct Tensor {
    data: Array<f32, IxDyn>,  // Owns the data
}

impl Tensor {
    // Borrows both operands and returns a new owned Tensor
    pub fn mul(&self, other: &Tensor) -> Tensor {
        let result = &self.data * &other.data;  // Borrow for operation
        Tensor { data: result }  // New owned Tensor
    }
}
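The difference between borrowing (`&self`) and taking ownership (`self`) is easier to see side by side. The sketch below uses a hypothetical `MiniTensor` backed by a `Vec<f32>` rather than the crate's ndarray-backed `Tensor`; the `scale` method is an illustration and not necessarily part of `tensor.rs`.

```rust
#[derive(Debug, Clone, PartialEq)]
struct MiniTensor {
    data: Vec<f32>,
}

impl MiniTensor {
    /// Borrows both operands: neither is consumed, and a new buffer is allocated.
    fn mul(&self, other: &MiniTensor) -> MiniTensor {
        assert_eq!(self.data.len(), other.data.len(), "length mismatch");
        MiniTensor {
            data: self.data.iter().zip(&other.data).map(|(a, b)| a * b).collect(),
        }
    }

    /// Takes ownership of `self` and reuses its buffer in place.
    /// The caller can no longer use the original value afterwards.
    fn scale(mut self, factor: f32) -> MiniTensor {
        for v in &mut self.data {
            *v *= factor;
        }
        self
    }
}
```

After `let c = a.mul(&b);` both `a` and `b` remain usable, whereas after `let d = c.scale(2.0);` any further use of `c` is rejected by the compiler.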

Traits for Abstraction

We use traits to define common interfaces:

// nn.rs - Layer trait for all neural network components
pub trait Layer {
    fn forward(&self, input: &Tensor) -> Tensor;
}

impl Layer for Linear {
    fn forward(&self, x: &Tensor) -> Tensor {
        // Matrix multiply + bias
        x.matmul(&self.weight).add(&self.bias)
    }
}
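One payoff of a shared `Layer` trait is that different layers can be stored together and chained through trait objects. A self-contained sketch on plain slices follows; `Scale`, `Relu` and `Sequential` are hypothetical types for illustration and are not claimed to exist in `nn.rs`.

```rust
trait Layer {
    fn forward(&self, input: &[f32]) -> Vec<f32>;
}

/// Multiplies every element by a constant.
struct Scale(f32);

/// Element-wise max(0, x).
struct Relu;

impl Layer for Scale {
    fn forward(&self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|x| x * self.0).collect()
    }
}

impl Layer for Relu {
    fn forward(&self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|x| x.max(0.0)).collect()
    }
}

/// Runs its layers in order; it is itself a Layer, so it can be nested.
struct Sequential {
    layers: Vec<Box<dyn Layer>>,
}

impl Layer for Sequential {
    fn forward(&self, input: &[f32]) -> Vec<f32> {
        self.layers
            .iter()
            .fold(input.to_vec(), |x, layer| layer.forward(&x))
    }
}
```

`Box<dyn Layer>` uses dynamic dispatch. Generics (`fn run<L: Layer>(layer: &L, ...)`) would avoid the indirection at the cost of not being able to mix layer types in one `Vec`.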

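Configuration Structs

Configuration uses plain structs initialized with struct literals (not a method-chaining builder), so every field is named at the call site: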

// diffusion.rs
let config = DiffusionConfig {
    num_timesteps: 1000,
    beta_start: 1e-4,
    beta_end: 0.02,
    schedule: "cosine".to_string(),
};
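A struct literal requires every field to be spelled out. A common Rust idiom for overriding only a few fields is `Default` plus struct update syntax. The sketch below redefines the struct locally with assumed field types; whether the crate's `DiffusionConfig` implements `Default` is an assumption.

```rust
#[derive(Debug, Clone, PartialEq)]
struct DiffusionConfig {
    num_timesteps: usize,
    beta_start: f32,
    beta_end: f32,
    schedule: String,
}

impl Default for DiffusionConfig {
    fn default() -> Self {
        Self {
            num_timesteps: 1000,
            beta_start: 1e-4,
            beta_end: 0.02,
            schedule: "cosine".to_string(),
        }
    }
}

/// Override only the schedule; the remaining fields come from Default.
fn linear_config() -> DiffusionConfig {
    DiffusionConfig {
        schedule: "linear".to_string(),
        ..Default::default()
    }
}
```

This keeps call sites short while remaining a plain struct, with no separate builder type to maintain.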

Error Handling

Shape mismatches use Result types:

// tensor.rs - Explicit error handling
pub fn reshape(&self, new_shape: &[usize]) -> Result<Tensor, ShapeError> {
    let total_old: usize = self.shape().iter().product();
    let total_new: usize = new_shape.iter().product();
    
    if total_old != total_new {
        return Err(ShapeError::IncompatibleShape);
    }
    // ...
}
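To show how such a `Result` composes with the `?` operator, here is a complete toy version. `ShapedTensor`, `ReshapeError` and `to_matrix` are hypothetical names used only for this sketch; the elided part of the crate's `reshape` is not reproduced here.

```rust
#[derive(Debug, PartialEq)]
enum ReshapeError {
    ElementCountMismatch { from: usize, to: usize },
}

#[derive(Debug, PartialEq)]
struct ShapedTensor {
    data: Vec<f32>,
    shape: Vec<usize>,
}

impl ShapedTensor {
    /// Succeeds only when the old and new shapes hold the same number of elements.
    fn reshape(&self, new_shape: &[usize]) -> Result<ShapedTensor, ReshapeError> {
        let from: usize = self.shape.iter().product();
        let to: usize = new_shape.iter().product();
        if from != to {
            return Err(ReshapeError::ElementCountMismatch { from, to });
        }
        Ok(ShapedTensor {
            data: self.data.clone(),
            shape: new_shape.to_vec(),
        })
    }
}

/// Flatten, then view as rows x cols. `?` returns early with the error, if any.
fn to_matrix(t: &ShapedTensor, rows: usize, cols: usize) -> Result<ShapedTensor, ReshapeError> {
    let flat = t.reshape(&[t.data.len()])?;
    flat.reshape(&[rows, cols])
}
```

Carrying the mismatching element counts inside the error variant makes the failure message actionable without a debugger.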

Type Safety

Types are checked at compile time; shapes are checked at run time:

// The compiler guarantees that every argument and result is a Tensor
fn attention(q: &Tensor, k: &Tensor, v: &Tensor) -> Tensor {
    // Shape compatibility is only validated when the operations run
    let scores = q.matmul(&k.transpose());  // Returns Tensor
    let weights = softmax(&scores);          // Returns Tensor
    weights.matmul(v)                         // Returns Tensor
}
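For reference, the same computation can be written for a single query vector on plain slices. This is a sketch with hypothetical function names, not the crate's attention code. It also divides the scores by sqrt(d), the usual scaling in scaled dot-product attention, which the snippet above omits.

```rust
/// Numerically stable softmax over one row: subtract the max before exponentiating.
fn softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Scaled dot-product attention for one query against a set of keys and values.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    assert!(!keys.is_empty() && keys.len() == values.len());
    let scale = (q.len() as f32).sqrt();
    // One score per key: dot(q, k) / sqrt(d).
    let scores: Vec<f32> = keys
        .iter()
        .map(|k| q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() / scale)
        .collect();
    let weights = softmax(&scores);
    // Weighted sum of the value vectors.
    let mut out = vec![0.0; values[0].len()];
    for (w, v) in weights.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += w * x;
        }
    }
    out
}
```

Subtracting the row maximum leaves the softmax result unchanged mathematically but prevents `exp` from overflowing on large scores.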

📊 Architecture Overview

U-Net (DDPM-style)

The classic diffusion model architecture:

Input Image + Noise → [Encoder] → [Middle] → [Decoder] → Predicted Noise
     ↓                   ↓           ↓           ↓
   32x32              16x16        8x8       16x16→32x32
     └────────────── Skip Connections ──────────────┘

DiT (SD3-style)

Modern transformer-based diffusion:

Text Tokens ─┬── [Joint Attention] ──→ Text Features
             │         ↕
Image Patches ┴── [Joint Attention] ──→ Image Features → Unpatchify
                      ↑
              Timestep Embedding

VAE (Latent Space)

Compress images into a smaller latent space:

Image (512×512×3) → Encoder → Latent (64×64×4) → Decoder → Image
                       │
          8x spatial downsampling (48x fewer values)

🧩 Components Overview

Tensors (`tensor.rs`)

Multi-dimensional arrays with:

Creation (zeros, ones, random, from_vec)
Element-wise operations (add, mul, sub_scalar, pow)
Math functions (sqrt, exp, ln)
Activations (ReLU, SiLU, GELU)
Matrix multiplication and transpose

Neural Network Layers (`nn.rs`)

Linear - Fully connected layers
Conv2d - 2D convolution
GroupNorm - Group normalization (stable for small batches)
LayerNorm - Layer normalization
SelfAttention - Self-attention mechanism
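As a concrete example of what a normalization layer computes, here is layer normalization over a single feature vector. This is a sketch with an illustrative signature; the crate's `LayerNorm` operates on `Tensor` values and may differ in details such as the epsilon it uses.

```rust
/// Normalize `x` to zero mean and unit variance, then apply a learned
/// per-element scale (`gamma`) and shift (`beta`).
fn layer_norm(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    assert!(!x.is_empty(), "input must not be empty");
    assert!(x.len() == gamma.len() && x.len() == beta.len(), "parameter length mismatch");
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    // Biased variance (divide by n), as is conventional for layer norm.
    let var = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(gamma)
        .zip(beta)
        .map(|((v, g), b)| (v - mean) * inv_std * g + b)
        .collect()
}
```

GroupNorm applies the same idea per group of channels instead of over the last dimension, so its statistics do not depend on the batch size.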

Diffusion Process (`diffusion.rs`)

Noise schedules (linear, cosine)
Forward diffusion (adding noise)
Timestep embeddings
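Timestep embeddings in DDPM-style models are commonly sinusoidal, in the style of transformer positional encodings. The sketch below shows one standard layout (sines in the first half, cosines in the second); the exact layout and the base of 10000 are assumptions about `diffusion.rs`, not confirmed details.

```rust
/// Sinusoidal embedding of a timestep `t` into `dim` values.
/// Frequency i is 10000^(-i / half), where half = dim / 2.
fn timestep_embedding(t: f32, dim: usize) -> Vec<f32> {
    assert!(dim >= 2 && dim % 2 == 0, "dim must be even and at least 2");
    let half = dim / 2;
    let freqs: Vec<f32> = (0..half)
        .map(|i| (-(10000.0_f32).ln() * i as f32 / half as f32).exp())
        .collect();
    // First half: sines. Second half: cosines of the same angles.
    let mut emb: Vec<f32> = freqs.iter().map(|f| (t * f).sin()).collect();
    emb.extend(freqs.iter().map(|f| (t * f).cos()));
    emb
}
```

The mix of fast and slow frequencies gives the network a smooth, unique code for every timestep, which works better than feeding the raw integer.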

U-Net (`unet.rs`)

ResBlock - Residual blocks with time conditioning
Downsample - Spatial downsampling
Upsample - Spatial upsampling
UNet - Complete encoder-decoder with skip connections

Training (`training.rs`)

MSE loss for noise prediction
Adam optimizer
Learning rate scheduling (cosine, warmup)
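Two of these pieces are simple enough to write out in full: the MSE loss between predicted and true noise, and a learning rate that warms up linearly and then decays along a cosine curve. This is a sketch on plain slices; the function names and the exact warmup convention are assumptions, not the API of `training.rs`.

```rust
/// Mean squared error between the predicted noise and the target noise.
fn mse_loss(pred: &[f32], target: &[f32]) -> f32 {
    assert_eq!(pred.len(), target.len(), "length mismatch");
    assert!(!pred.is_empty(), "inputs must not be empty");
    pred.iter()
        .zip(target)
        .map(|(p, t)| (p - t).powi(2))
        .sum::<f32>()
        / pred.len() as f32
}

/// Linear warmup to `base_lr` over `warmup_steps`, then cosine decay
/// that reaches zero at `total_steps`.
fn lr_at(step: usize, warmup_steps: usize, total_steps: usize, base_lr: f32) -> f32 {
    if step < warmup_steps {
        return base_lr * (step + 1) as f32 / warmup_steps as f32;
    }
    let decay_steps = total_steps.saturating_sub(warmup_steps).max(1);
    let progress = ((step - warmup_steps) as f32 / decay_steps as f32).min(1.0);
    0.5 * base_lr * (1.0 + (std::f32::consts::PI * progress).cos())
}
```

Warmup avoids large, noisy updates while the optimizer's moment estimates are still near zero; the cosine tail lets training settle gradually.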

Sampling (`sampling.rs`)

DDPM (stochastic, 1000 steps)
DDIM (deterministic, ~50 steps)
Image saving utilities

SD3/Flux-Style Components

vae.rs - VAE encoder/decoder structure
tokenizer.rs - BPE and Unigram tokenizers
clip.rs - CLIP text encoder architecture
t5.rs - T5 text encoder with RMSNorm
flow.rs - Flow matching scheduler and Euler solver
joint_attention.rs - Multi-modal attention with RoPE
dit.rs - Diffusion Transformer blocks

⚙️ Configuration

Diffusion Config

DiffusionConfig {
    num_timesteps: 1000,    // Number of noise levels
    beta_start: 1e-4,       // Starting noise
    beta_end: 0.02,         // Ending noise
    schedule: "cosine".to_string(),  // "linear" or "cosine"
}
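For comparison with the linear option, the cosine schedule from the Improved DDPM paper defines the cumulative alpha product directly and derives the betas from it. The sketch below follows that formulation with its usual offset of 0.008 and a clip at 0.999. In this formulation `beta_start` and `beta_end` are not used; how `diffusion.rs` treats those fields for the cosine option is not shown here.

```rust
/// Cumulative alpha product for the cosine schedule:
/// alpha_bar(t) = f(t) / f(0), f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2).
fn cosine_alpha_bar(t: usize, num_timesteps: usize) -> f64 {
    let s = 0.008;
    let f = |u: f64| ((u + s) / (1.0 + s) * std::f64::consts::FRAC_PI_2).cos().powi(2);
    f(t as f64 / num_timesteps as f64) / f(0.0)
}

/// beta_t = 1 - alpha_bar(t + 1) / alpha_bar(t), clipped to at most 0.999
/// to avoid a singularity at the end of the schedule.
fn cosine_betas(num_timesteps: usize) -> Vec<f64> {
    (0..num_timesteps)
        .map(|t| {
            let beta = 1.0
                - cosine_alpha_bar(t + 1, num_timesteps) / cosine_alpha_bar(t, num_timesteps);
            beta.min(0.999)
        })
        .collect()
}
```

Compared with a linear schedule, this destroys information more slowly in the early timesteps, which tends to help on low-resolution images.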

Training Config

TrainingConfig {
    learning_rate: 1e-4,
    batch_size: 4,
    num_epochs: 100,
    beta1: 0.9,             // Adam first-moment (momentum) decay
    beta2: 0.999,           // Adam second-moment (RMSprop-style) decay
}
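To make the roles of `beta1` and `beta2` concrete, here is a minimal Adam update on a flat parameter buffer. It is a sketch, not the optimizer in `training.rs`: the struct layout and the epsilon of 1e-8 are assumptions, and gradients must be supplied by the caller because the project has no automatic differentiation.

```rust
struct Adam {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    step: i32,
    m: Vec<f32>, // running mean of gradients (first moment)
    v: Vec<f32>, // running mean of squared gradients (second moment)
}

impl Adam {
    fn new(num_params: usize, lr: f32, beta1: f32, beta2: f32) -> Self {
        Adam {
            lr,
            beta1,
            beta2,
            eps: 1e-8,
            step: 0,
            m: vec![0.0; num_params],
            v: vec![0.0; num_params],
        }
    }

    /// Apply one update in place, given the gradient of each parameter.
    fn update(&mut self, params: &mut [f32], grads: &[f32]) {
        assert!(params.len() == grads.len() && params.len() == self.m.len());
        self.step += 1;
        // Bias correction compensates for m and v starting at zero.
        let bias1 = 1.0 - self.beta1.powi(self.step);
        let bias2 = 1.0 - self.beta2.powi(self.step);
        for i in 0..params.len() {
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * grads[i];
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * grads[i] * grads[i];
            let m_hat = self.m[i] / bias1;
            let v_hat = self.v[i] / bias2;
            params[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}
```

On the very first step the update is approximately `lr * sign(gradient)`, regardless of the gradient's magnitude, which is one reason Adam is forgiving about gradient scale.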

Sampler Config

SamplerConfig {
    num_steps: 50,          // DDIM steps
    guidance_scale: 1.0,    // CFG scale
    use_ddim: true,         // DDIM vs DDPM
    eta: 0.0,               // DDIM stochasticity
}
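The `guidance_scale` field refers to classifier-free guidance, which blends a conditional and an unconditional noise prediction. A minimal sketch of the blend on flat buffers (the function name is illustrative; how the crate's `Sampler` obtains the two predictions is not shown here):

```rust
/// Classifier-free guidance:
/// eps = eps_uncond + scale * (eps_cond - eps_uncond).
/// A scale of 1.0 returns the conditional prediction unchanged.
fn apply_guidance(eps_uncond: &[f32], eps_cond: &[f32], scale: f32) -> Vec<f32> {
    assert_eq!(eps_uncond.len(), eps_cond.len(), "length mismatch");
    eps_uncond
        .iter()
        .zip(eps_cond)
        .map(|(u, c)| u + scale * (c - u))
        .collect()
}
```

Scales above 1.0 push the prediction further in the direction the conditioning suggests, trading diversity for prompt adherence, and require two model evaluations per step.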

🔬 What's Implemented

Component	Description	Status
Tensor	Multi-dim arrays, math ops	✅ Working
Linear	Fully connected layers	✅ Working
Conv2d	2D convolution	✅ Working
GroupNorm	Group normalization	✅ Working
Self-Attention	Attention mechanism	✅ Working
U-Net	Encoder-decoder with skips	✅ Working
Noise Scheduler	Linear/cosine schedules	✅ Working
DDPM Sampling	Stochastic 1000-step	✅ Working
DDIM Sampling	Deterministic ~50-step	✅ Working
BPE Tokenizer	CLIP-style tokenization	✅ Working
Unigram Tokenizer	T5-style tokenization	✅ Working
Flow Matching	SD3-style training	✅ Working
VAE	Encoder/Decoder structure	⚠️ Structure
DiT	Transformer blocks	⚠️ Structure
Joint Attention	Multi-modal attention	⚠️ Structure

⚠️ Structure = architecture correct, needs pretrained weights for real use

🚫 What's Not Included

This is an educational implementation. For production, you'd need:

Automatic differentiation - We show forward passes only
GPU acceleration - CPU only currently
Pretrained weights - Random initialization only
Real training - Demo shows structure, not convergence
Optimized operations - Naive implementations for clarity

📖 Learning Resources

Papers

DDPM - Denoising Diffusion Probabilistic Models
DDIM - Denoising Diffusion Implicit Models
Improved DDPM - Better noise schedules
Classifier-Free Guidance - Text conditioning
Latent Diffusion - Stable Diffusion foundation
DiT - Diffusion Transformers
SD3 - Scaling Rectified Flow Transformers

Rust Resources

🤝 Contributing

This is an educational project. Feel free to:

Improve documentation
Add more tests
Optimize implementations
Fix bugs

🧪 Running Tests

# Run all tests
cargo test

# Run specific module tests
cargo test tensor::tests
cargo test nn::tests
cargo test diffusion::tests
cargo test sampling::tests
cargo test flow::tests
cargo test tokenizer::tests

# Run with output
cargo test -- --nocapture

Test Results:

36 tests passing (core components)
9 tests pending (complex integrations need shape tuning)

📄 License

MIT License - see LICENSE file for details.

Built with ❤️ to learn Rust and Diffusion Models together
