A comprehensive CUDA-based image blur implementation with multiple optimization strategies, batch processing capabilities, and extensive performance benchmarking.
- Naive (Global Memory Only)
- Tiled (Shared Memory with Halo)
- Separable (Two 1D Passes)
- Separable + Constant Memory
- Separable + Shared Memory (Best Performance)
- CPU Baseline: Multi-threaded OpenMP implementation for comparison
- Batch Processing: CUDA Streams for overlapping computation and memory transfers
- Image Format Support: PNG, JPG, BMP, PPM (via stb_image)
- Comprehensive Benchmarking: Performance metrics, GFLOPS, bandwidth utilization
/
├── src/
│ ├── main.cu # Main program and benchmarking
│ ├── kernels.cu # All GPU kernel implementations
│ ├── support.cu # Utility functions and timing
│ ├── image_io.cpp # Image loading/saving
│ └── cpu_blur.cpp # CPU baseline implementation
├── include/
│ ├── support.h # Support function headers
│ ├── image_io.h # Image I/O headers
│ ├── kernels.cuh # Kernel declarations
│ ├── stb_image.h # Image loading library
│ └── stb_image_write.h # Image saving library
├── blur_real_images_5674317.out # Sample output file
├── Makefile # Build configuration
├── run_mahti.sh # SLURM batch script for CSC Mahti
└── README.md # This file
make clean
make
./blur_benchmark
# Default: 1024x1024 synthetic image, 7x7 kernel, sigma=2.0, 10 images
# Process a real image (PNG, JPG, BMP, etc.)
./blur_benchmark path/to/image.png [kernel_size] [sigma] [batch_size]
# Examples:
./blur_benchmark photo.jpg 7 2.0 10 # Blur photo.jpg
./blur_benchmark input.png 11 3.0 5 # Strong blur
# Output saved to: output_images/output_blurred.png
./blur_benchmark [width] [height] [kernel_size] [sigma] [batch_size]
# Examples:
./blur_benchmark 512 512 5 1.5 5 # Small test
./blur_benchmark 2048 2048 9 2.5 20 # Large test
./blur_benchmark 3840 2160 7 2.0 10 # 4K test
make test_small # 512x512, kernel=5, batch=5
make test_large # 2048x2048, kernel=9, batch=20
make test_4k # 3840x2160, kernel=7, batch=10
Separable Gaussian blur is implemented using two 1D convolution passes:
- Horizontal Pass: Apply 1D Gaussian kernel along rows
- Vertical Pass: Apply 1D Gaussian kernel along columns
Complexity: O(n) operations per pixel for an n×n kernel (2n taps across the two passes), instead of O(n²) for direct 2D convolution
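The two passes can be sketched on the CPU as follows. This is a minimal illustration, not the project's kernel code: the function names (`gaussian_kernel_1d`, `blur_pass`, `blur_separable`), the single-channel `float` image layout, and the clamp-to-edge border handling are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Build a normalized 1D Gaussian kernel. `size` must be odd.
// (Hypothetical helper, not taken from the project sources.)
std::vector<float> gaussian_kernel_1d(int size, float sigma) {
    assert(size > 0 && size % 2 == 1 && sigma > 0.0f);
    const int radius = size / 2;
    std::vector<float> k(size);
    float sum = 0.0f;
    for (int i = 0; i < size; ++i) {
        const float x = static_cast<float>(i - radius);
        k[i] = std::exp(-(x * x) / (2.0f * sigma * sigma));
        sum += k[i];
    }
    for (float& w : k) w /= sum;  // weights sum to 1, so brightness is preserved
    return k;
}

// One 1D pass over a row-major, single-channel image.
// horizontal == true convolves along rows, false along columns.
// Taps that fall outside the image are clamped to the nearest edge pixel.
std::vector<float> blur_pass(const std::vector<float>& src, int width, int height,
                             const std::vector<float>& k, bool horizontal) {
    const int radius = static_cast<int>(k.size()) / 2;
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int t = -radius; t <= radius; ++t) {
                const int sx = horizontal ? std::min(std::max(x + t, 0), width - 1) : x;
                const int sy = horizontal ? y : std::min(std::max(y + t, 0), height - 1);
                acc += k[t + radius] * src[sy * width + sx];
            }
            dst[y * width + x] = acc;
        }
    }
    return dst;
}

// Horizontal pass followed by vertical pass: 2 * kernel_size taps per pixel.
std::vector<float> blur_separable(const std::vector<float>& src, int width, int height,
                                  int kernel_size, float sigma) {
    const std::vector<float> k = gaussian_kernel_1d(kernel_size, sigma);
    const std::vector<float> tmp = blur_pass(src, width, height, k, true);
    return blur_pass(tmp, width, height, k, false);
}
```

For a 7x7 kernel this reads 14 taps per pixel instead of 49, which is where the O(n) versus O(n²) difference comes from.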
- Shared Memory Tiling: Reduces global memory accesses
- Constant Memory: Stores kernel coefficients for fast access
- Boundary Clamping: Out-of-range coordinates are clamped to the nearest edge pixel
- Coalesced Access: Adjacent threads read adjacent addresses so global memory accesses coalesce
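A rough way to see why tiling with a halo helps is to count global-memory reads per output pixel. The helpers below are a host-side sketch with hypothetical names. The model is deliberately simple: it assumes a direct 2D kernel, ignores hardware caches, and the 16x16 tile in the usage note is an assumed block size, not one taken from the project.

```cpp
#include <algorithm>
#include <cassert>

// Clamp an index into [0, n - 1]; this is the usual clamp-to-edge rule.
int clamp_index(int i, int n) {
    assert(n > 0);
    return std::min(std::max(i, 0), n - 1);
}

// Pixels a thread block must stage for a tile_w x tile_h output tile:
// the tile itself plus a halo of `radius` pixels on every side.
int tile_with_halo_elems(int tile_w, int tile_h, int radius) {
    assert(tile_w > 0 && tile_h > 0 && radius >= 0);
    return (tile_w + 2 * radius) * (tile_h + 2 * radius);
}

// Without tiling, each output pixel reads one global value per kernel tap.
double naive_reads_per_pixel(int kernel_size) {
    return static_cast<double>(kernel_size) * kernel_size;
}

// With tiling, each staged pixel is read from global memory once per block
// and then reused from shared memory by every thread that needs it.
double tiled_reads_per_pixel(int tile_w, int tile_h, int radius) {
    return static_cast<double>(tile_with_halo_elems(tile_w, tile_h, radius)) /
           (static_cast<double>(tile_w) * tile_h);
}
```

Under these assumptions, a 16x16 tile with a 7x7 kernel (radius 3) stages 22 x 22 = 484 pixels, or about 1.89 global reads per output pixel, compared with 49 for the untiled case.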
Batch processing uses multiple streams to overlap:
- Host-to-Device memory transfers
- Kernel execution
- Device-to-Host memory transfers
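The benefit of overlapping these three stages can be estimated with an idealized pipeline model. This is a back-of-the-envelope sketch, not a measurement and not the project's code: `StageTimesMs` and both functions are hypothetical, and the model assumes pinned host memory, independent copy engines for each direction, and no contention between streams.

```cpp
#include <algorithm>
#include <cassert>

// Per-image stage durations in milliseconds (values supplied by the caller).
struct StageTimesMs {
    double h2d;     // Host-to-Device copy
    double kernel;  // Kernel execution
    double d2h;     // Device-to-Host copy
};

// No overlap: every image pays for all three stages back to back.
double serial_time_ms(const StageTimesMs& t, int batch) {
    assert(batch >= 0);
    return batch * (t.h2d + t.kernel + t.d2h);
}

// Ideal three-stage pipeline: the first image fills the pipeline, then each
// additional image costs only as much as the slowest stage.
double pipelined_time_ms(const StageTimesMs& t, int batch) {
    assert(batch >= 0);
    if (batch == 0) return 0.0;
    const double slowest = std::max(t.h2d, std::max(t.kernel, t.d2h));
    return (t.h2d + t.kernel + t.d2h) + (batch - 1) * slowest;
}
```

With hypothetical stage times of 1 ms, 2 ms, and 1 ms and a batch of 10, the model gives 40 ms without overlap and 22 ms with ideal overlap. Real gains depend on the hardware and are usually smaller.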
The benchmark reports:
- Kernel Execution Time: Pure GPU computation time
- Total Time: Including memory transfers
- Speedup vs CPU: GPU performance relative to multi-threaded CPU
- GFLOPS: Billions of floating-point operations per second
- Throughput: Images processed per second
- Verification: Correctness check against CPU baseline
- First run may be slower due to CUDA initialization
- Speedup improves with image size, since larger images utilize the GPU better
- Larger kernels benefit more from separable approach
- Batch processing shows significant benefits with streams
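The reported metrics can be derived from raw timings roughly as sketched below. The function names are hypothetical, and the FLOP count (one multiply and one add per tap, two passes) is an assumed convention; the benchmark may count operations differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// FLOPs for one separable blur of a single-channel image:
// 2 passes x kernel_size taps x (1 multiply + 1 add) per pixel.
double separable_flops(int width, int height, int kernel_size) {
    return 2.0 * 2.0 * kernel_size * static_cast<double>(width) * height;
}

// Billions of floating-point operations per second.
double gflops(double flops, double seconds) {
    assert(seconds > 0.0);
    return flops / seconds / 1e9;
}

// Throughput in images per second for a batch.
double images_per_second(int batch, double seconds) {
    assert(seconds > 0.0);
    return batch / seconds;
}

// Speedup of the GPU run relative to the CPU baseline.
double speedup(double cpu_seconds, double gpu_seconds) {
    assert(gpu_seconds > 0.0);
    return cpu_seconds / gpu_seconds;
}

// Verification: largest absolute per-pixel difference between two results.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

A GPU result would typically be accepted when `max_abs_diff` against the CPU output stays below a small tolerance. The tolerance is a choice left to the caller, since floating-point summation order differs between the CPU and the GPU.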
Course: GPU Programming
Project: GPU Accelerated Image Blurring
Group Members: Behroz Karim, Bhawish Raj, Hasnain Ajmal, Talha Rizwan, Zafeer ul Haq.
The stb_image and stb_image_write libraries were sourced from nothings/stb. AI tools were utilized to research optimization techniques for image blurring and to assist in developing the benchmarking infrastructure.