GPU Accelerated Image Blurring

A CUDA-based image blur implementation with multiple optimization strategies, batch processing, and performance benchmarking.

Multiple GPU Implementations:

Naive (Global Memory Only)
Tiled (Shared Memory with Halo)
Separable (Two 1D Passes)
Separable + Constant Memory
Separable + Shared Memory (Best Performance)

CPU Baseline: Multi-threaded OpenMP implementation for comparison
Batch Processing: CUDA Streams for overlapping computation and memory transfers
Image Format Support: PNG, JPG, BMP, PPM (via stb_image)
Comprehensive Benchmarking: Performance metrics, GFLOPS, bandwidth utilization

Project Structure

/
├── src/
│   ├── main.cu              # Main program and benchmarking
│   ├── kernels.cu           # All GPU kernel implementations
│   ├── support.cu           # Utility functions and timing
│   ├── image_io.cpp         # Image loading/saving
│   └── cpu_blur.cpp         # CPU baseline implementation
├── include/
│   ├── support.h            # Support function headers
│   ├── image_io.h           # Image I/O headers
│   ├── kernels.cuh          # Kernel declarations
│   ├── stb_image.h          # Image loading library
│   └── stb_image_write.h    # Image saving library
├── blur_real_images_5674317.out            # Sample output file
├── Makefile                 # Build configuration
├── run_mahti.sh             # SLURM batch script for CSC Mahti
└── README.md                # This file

Running the Code

Compilation

make clean
make

Run Default Benchmark

./blur_benchmark
# Default: 1024x1024 synthetic image, 7x7 kernel, sigma=2.0, 10 images

Process Your Own Image

# Process a real image (PNG, JPG, BMP, etc.)
./blur_benchmark path/to/image.png [kernel_size] [sigma] [batch_size]

# Examples:
./blur_benchmark photo.jpg 7 2.0 10           # Blur photo.jpg
./blur_benchmark input.png 11 3.0 5           # Strong blur

# Output saved to: output_images/output_blurred.png

Benchmark with Synthetic Images

./blur_benchmark [width] [height] [kernel_size] [sigma] [batch_size]

# Examples:
./blur_benchmark 512 512 5 1.5 5          # Small test
./blur_benchmark 2048 2048 9 2.5 20       # Large test
./blur_benchmark 3840 2160 7 2.0 10       # 4K test

Run Test Suites

make test_small    # 512x512, kernel=5, batch=5
make test_large    # 2048x2048, kernel=9, batch=20
make test_4k       # 3840x2160, kernel=7, batch=10

Implementation Details

Gaussian Blur Algorithm

Separable Gaussian blur is implemented using two 1D convolution passes:

Horizontal Pass: Apply 1D Gaussian kernel along rows
Vertical Pass: Apply 1D Gaussian kernel along columns

Complexity: O(n) operations per pixel for an n×n kernel, instead of O(n²) for a direct 2D convolution
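The two passes can be sketched on the CPU as follows. This is a minimal single-channel illustration, not the code in cpu_blur.cpp or kernels.cu; the function names (make_gaussian_kernel, blur_pass, separable_blur) and the row-major float layout are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Build a normalized 1D Gaussian kernel with an odd number of taps.
std::vector<float> make_gaussian_kernel(int size, float sigma) {
    assert(size > 0 && size % 2 == 1);
    std::vector<float> k(size);
    const int radius = size / 2;
    float sum = 0.0f;
    for (int i = 0; i < size; ++i) {
        const float x = static_cast<float>(i - radius);
        k[i] = std::exp(-(x * x) / (2.0f * sigma * sigma));
        sum += k[i];
    }
    for (float& w : k) w /= sum;  // weights sum to 1, so brightness is preserved
    return k;
}

// Clamp an index into [0, n - 1] so edge pixels are repeated at the border.
inline int clamp_index(int v, int n) { return std::min(std::max(v, 0), n - 1); }

// One 1D pass over a single-channel, row-major image (hypothetical layout).
// horizontal = true samples along the row; false samples along the column.
std::vector<float> blur_pass(const std::vector<float>& in, int width, int height,
                             const std::vector<float>& k, bool horizontal) {
    const int radius = static_cast<int>(k.size()) / 2;
    std::vector<float> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float acc = 0.0f;
            for (int t = -radius; t <= radius; ++t) {
                const int sx = horizontal ? clamp_index(x + t, width) : x;
                const int sy = horizontal ? y : clamp_index(y + t, height);
                acc += k[t + radius] * in[sy * width + sx];
            }
            out[y * width + x] = acc;
        }
    }
    return out;
}

// Horizontal pass followed by vertical pass: 2n taps per pixel instead of n*n.
std::vector<float> separable_blur(const std::vector<float>& in, int width, int height,
                                  int kernel_size, float sigma) {
    const std::vector<float> k = make_gaussian_kernel(kernel_size, sigma);
    return blur_pass(blur_pass(in, width, height, k, true), width, height, k, false);
}
```

With a 7x7 kernel this does 14 multiply-adds per pixel rather than 49, which is where the O(n) versus O(n²) difference comes from.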

Memory Optimizations

Shared Memory Tiling: Reduces global memory accesses
Constant Memory: Stores kernel coefficients for fast access
Boundary Clamping: Handles edge cases efficiently
Coalesced Access: Optimized memory access patterns
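To give a rough sense of why shared memory tiling with a halo reduces global memory traffic, the arithmetic below counts loads for one output tile of the 2D kernel. It is an idealized count that ignores caching, and the helper names and the 16x16 tile size used in the note are assumptions for illustration, not values taken from kernels.cu.

```cpp
#include <cassert>

// Global-memory reads to produce one tile x tile block of output when every
// thread reads its full (2*radius+1)^2 neighbourhood directly (naive kernel).
long long naive_reads_per_tile(int tile, int radius) {
    assert(tile > 0 && radius >= 0);
    const long long k = 2LL * radius + 1;
    return static_cast<long long>(tile) * tile * k * k;
}

// Global-memory reads when the block first stages the tile plus a halo of
// `radius` pixels on each side into shared memory, then convolves from there.
long long tiled_reads_per_tile(int tile, int radius) {
    assert(tile > 0 && radius >= 0);
    const long long side = tile + 2LL * radius;
    return side * side;
}
```

For a 16x16 tile and a 7x7 kernel (radius 3), this gives 12544 reads for the naive version against 484 for the tiled one, roughly a 26x reduction under these assumptions. Real gains are smaller because the hardware caches already absorb part of the naive kernel's redundant loads.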

CUDA Streams

Batch processing uses multiple streams to overlap:

Host-to-Device memory transfers
Kernel execution
Device-to-Host memory transfers
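The benefit of overlapping these three stages can be estimated with a simple pipeline model. The sketch below is an idealization, not a measurement of this project: it assumes every image has the same stage times, that uploads, kernels, and downloads can each run concurrently with the other two (separate copy engines, pinned host memory, enough streams), and the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Per-image time of each stage, in milliseconds (illustrative inputs).
struct StageTimes {
    double h2d_ms;     // Host-to-Device copy
    double kernel_ms;  // kernel execution
    double d2h_ms;     // Device-to-Host copy
};

// No overlap: each image finishes all three stages before the next one starts.
double serialized_ms(const StageTimes& t, int n) {
    assert(n >= 1);
    return n * (t.h2d_ms + t.kernel_ms + t.d2h_ms);
}

// Ideal overlap: after the first image fills the pipeline, one image completes
// every `bottleneck` milliseconds, where bottleneck is the slowest stage.
double pipelined_ms(const StageTimes& t, int n) {
    assert(n >= 1);
    const double bottleneck = std::max({t.h2d_ms, t.kernel_ms, t.d2h_ms});
    return (t.h2d_ms + t.kernel_ms + t.d2h_ms) + (n - 1) * bottleneck;
}
```

For example, with 1 ms upload, 2 ms kernel, and 1 ms download per image, a batch of 10 takes 40 ms serialized and 22 ms in the ideal pipelined case. The model also shows that once transfers dominate, speeding up the kernel alone no longer shortens the batch.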

Performance Metrics

The benchmark reports:

Kernel Execution Time: Pure GPU computation time
Total Time: Including memory transfers
Speedup vs CPU: GPU performance relative to multi-threaded CPU
GFLOPS: Billions of floating-point operations per second
Throughput: Images processed per second
Verification: Correctness check against CPU baseline
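As an illustration of how such metrics can be derived from measured times, here is a small sketch. The FLOP model (one multiply and one add per tap, two passes) is an assumption for the separable variant; the benchmark's own operation count may differ, and all function names here are hypothetical.

```cpp
#include <cassert>

// Assumed FLOP count for a separable blur: 2 passes x kernel_size taps x
// 2 operations (multiply + add) for every pixel of every channel.
double separable_flops(int width, int height, int channels, int kernel_size) {
    assert(width > 0 && height > 0 && channels > 0 && kernel_size > 0);
    return 2.0 * 2.0 * kernel_size * static_cast<double>(width) * height * channels;
}

// Billions of floating-point operations per second for a run of `ms` milliseconds.
double gflops(double flops, double ms) {
    assert(ms > 0.0);
    return flops / (ms * 1e-3) / 1e9;
}

// Images processed per second, given the total time for the whole batch.
double images_per_second(int batch_size, double total_ms) {
    assert(total_ms > 0.0);
    return batch_size / (total_ms * 1e-3);
}

// Speedup of the GPU over the CPU baseline for the same workload.
double speedup(double cpu_ms, double gpu_ms) {
    assert(gpu_ms > 0.0);
    return cpu_ms / gpu_ms;
}
```

Under this model, a single-channel 1024x1024 image with a 7-tap kernel costs about 29.4 million operations, so a 1 ms kernel time would correspond to roughly 29.4 GFLOPS.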

Note

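The first run may be slower due to CUDA initialization.
GPU speedup generally improves with image size, since larger images keep more of the GPU busy.
Larger kernels benefit more from the separable approach, because its per-pixel cost grows linearly with kernel width rather than quadratically.
Batch processing benefits from streams, which overlap memory transfers with kernel execution.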

Authors

Course: GPU Programming

Project: GPU Accelerated Image Blurring

Group Members: Behroz Karim, Bhawish Raj, Hasnain Ajmal, Talha Rizwan, Zafeer ul Haq.

Acknowledgement

The stb_image and stb_image_write libraries were sourced from nothings/stb. AI tools were utilized to research optimization techniques for image blurring and to assist in developing the benchmarking infrastructure.
