AIComputing101
diff --git a/README.md b/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/docker/README.md b/docker/README.md
Lines changed: 2 additions & 2 deletions
diff --git a/modules/module1/README.md b/modules/module1/README.md
Lines changed: 23 additions & 9 deletions
diff --git a/modules/module1/content.md b/modules/module1/content.md
Lines changed: 2 additions & 0 deletions
diff --git a/modules/module1/examples/README.md b/modules/module1/examples/README.md
Lines changed: 39 additions & 37 deletions
diff --git a/modules/module2/README.md b/modules/module2/README.md
Lines changed: 37 additions & 22 deletions
diff --git a/modules/module2/content.md b/modules/module2/content.md
Lines changed: 2 additions & 0 deletions
@@ -217,7 +217,7 @@ cd gpu-programming-101
 
 # Inside container: verify GPU access and start learning
 /workspace/test-gpu.sh
-cd modules/module1 && make && ./01_vector_addition_cuda
+cd modules/module1 && make && ./build/01_vector_addition_cuda
 ```
 
 ### Option 2: Native Installation
 
@@ -73,7 +73,7 @@ docker/
 
 ### ROCm Development Container
 **Image**: `gpu-programming-101:rocm`  
-**Base**: `rocm/dev-ubuntu-22.04:7.0`
+**Base**: `rocm/dev-ubuntu-22.04:7.0-complete`
 
 **Features**:
 - ROCm 7.0 with HIP development environment
@@ -298,7 +298,7 @@ sudo apt update && sudo apt upgrade docker-ce docker-compose
 
 # Check base image availability
 docker pull nvidia/cuda:12.9.1-devel-ubuntu22.04
-docker pull rocm/dev-ubuntu-22.04:7.0
+docker pull rocm/dev-ubuntu-22.04:7.0-complete
 ```
 
 **"Permission denied errors"**
 
@@ -20,25 +20,34 @@ After completing this module, you will be able to:
 
 ### Prerequisites
 - NVIDIA GPU with CUDA support OR AMD GPU with ROCm support
-- CUDA Toolkit 11.0+ or ROCm 4.0+
+- CUDA Toolkit 12.0+ or ROCm 6.0+ (Docker images provide CUDA 12.9.1 and ROCm 7.0)
 - C/C++ compiler (GCC, Clang, or MSVC)
 
+Tip: To skip native installs, use the Docker environment (recommended). From the repository root:
+```bash
+./docker/scripts/run.sh --auto
+```
+
 ### Running Examples
 
 Navigate to the examples directory:
 ```bash
 cd examples/
 ```
 
-Build and run examples:
+Build and run examples (binaries are written to `build/`):
 ```bash
-# Build all examples
+# Build all examples for your detected GPU
 make
 
-# Run specific examples
-./01_vector_addition_cuda
-./04_device_info_cuda
-./05_performance_comparison
+# Run specific examples (CUDA)
+./build/01_vector_addition_cuda
+./build/04_device_info_cuda
+./build/05_performance_comparison_cuda || ./build/05_performance_comparison
+
+# Or HIP versions (cross-platform)
+./build/02_vector_addition_hip
+./build/04_device_info_hip
 ```
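
The `||` fallback in the commands above also fires when the first binary exists but exits with a non-zero status, which may not be the intent. As a minimal sketch, assuming you only want to fall back when a binary is missing, the helper below runs the first candidate that exists. The name `run_first_available` is hypothetical and not part of this repository.

```bash
# Hypothetical helper (not part of the repository): run the first candidate
# that is a regular, executable file and return its exit status.
run_first_available() {
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            "$candidate"
            return $?
        fi
    done
    echo "no built binary found among: $*" >&2
    return 1
}

# Usage, assuming `make` has already written binaries to build/:
# run_first_available ./build/05_performance_comparison_cuda ./build/05_performance_comparison_hip
```

Unlike `a || b`, this does not run the second binary when the first one ran and failed, so a genuine runtime error is not masked by the fallback.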
 
 ## Examples Overview
@@ -48,9 +57,14 @@ make
 | `01_vector_addition_cuda.cu` | Basic CUDA vector addition | Kernels, memory management, error handling |
 | `02_vector_addition_hip.cpp` | Cross-platform HIP version | HIP API, portability |
 | `03_matrix_addition_cuda.cu` | 2D matrix operations | 2D threading, indexing |
+| `03_matrix_addition_hip.cpp` | HIP 2D matrix operations | HIP indexing, portability |
 | `04_device_info_cuda.cu` | GPU properties and capabilities | Device queries, system info |
-| `05_performance_comparison.cu` | CPU vs GPU benchmarking | Performance analysis, timing |
-| `06_debug_example.cu` | Debugging and optimization | Error checking, occupancy |
+| `04_device_info_hip.cpp` | HIP device and platform info | HIP device queries |
+| `05_performance_comparison_cuda.cu` | CPU vs GPU benchmarking (CUDA) | Performance analysis, timing |
+| `05_performance_comparison_hip.cpp` | Benchmarking (HIP) | HIP performance, memory bandwidth |
+| `06_debug_example_cuda.cu` | Debugging and optimization (CUDA) | Error checking, occupancy |
+| `06_debug_example_hip.cpp` | Debugging and optimization (HIP) | HIP debugging |
+| `07_cross_platform_comparison.cpp` | AMD vs NVIDIA comparison | Portability, tuning |
 
 ## Topics Covered
 
 
@@ -1,6 +1,8 @@
 # Module 1: Foundations of GPU Programming with CUDA and HIP
 *Heterogeneous Data Parallel Computing*
 
+> Environment note: Examples are validated in containers using CUDA 12.9.1 (Ubuntu 22.04) and ROCm 7.0 (rocm/dev-ubuntu-22.04:7.0-complete). Using Docker is recommended for a consistent setup.
+
 ## Learning Objectives
 After completing this module, you will be able to:
 - Understand the fundamental differences between CPU and GPU architectures
 
@@ -6,7 +6,7 @@ This directory contains practical examples that accompany Module 1 of the GPU Pr
 
 ### CUDA Examples (NVIDIA)
 | File | Description | Key Concepts |
-|------|-------------||--------------|
+|------|-------------|--------------|
 | `01_vector_addition_cuda.cu` | Basic CUDA vector addition with error handling | Kernels, memory management, error checking |
 | `03_matrix_addition_cuda.cu` | 2D matrix addition with thread indexing | 2D threading, grid configuration |
 | `04_device_info_cuda.cu` | Query and display GPU properties | Device queries, capability checking |
@@ -15,7 +15,7 @@ This directory contains practical examples that accompany Module 1 of the GPU Pr
 
 ### HIP Examples (AMD/NVIDIA Cross-Platform)
 | File | Description | Key Concepts |
-|------|-------------||--------------|
+|------|-------------|--------------|
 | `02_vector_addition_hip.cpp` | Cross-platform vector addition using HIP | HIP API, portability |
 | `03_matrix_addition_hip.cpp` | 2D matrix addition with HIP | Cross-platform 2D threading |
 | `04_device_info_hip.cpp` | HIP device properties and platform detection | HIP device queries, platform abstraction |
@@ -26,14 +26,14 @@ This directory contains practical examples that accompany Module 1 of the GPU Pr
 ## Prerequisites
 
 ### For CUDA Examples
-- NVIDIA GPU with compute capability 3.5+
-- NVIDIA drivers (version 450+)
-- CUDA Toolkit 11.0+
+- NVIDIA GPU with compute capability 5.0+
+- NVIDIA drivers 550+ recommended
+- CUDA Toolkit 12.0+ (Docker uses CUDA 12.9.1)
 - GCC/Clang compiler
 
 ### For HIP Examples
 - AMD GPU with ROCm support OR NVIDIA GPU
-- ROCm 4.0+ (for AMD) or CUDA 11.0+ (for NVIDIA backend)
+- ROCm 6.0+ (for AMD) or CUDA 12.0+ (for NVIDIA backend)
 - HIP compiler (hipcc)
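
The minimum versions above can be checked with a small comparison helper. This is a sketch under stated assumptions: `version_at_least` is a hypothetical name, it relies on `sort -V` from GNU coreutils, and the version string is something you supply yourself (for example, read from the output of `nvcc --version` or `hipcc --version`).

```bash
# Hypothetical helper: succeed if version $1 is at least version $2.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_at_least() {
    have=$1
    need=$2
    # After a version sort, the required version must come first (or be equal).
    lowest=$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n 1)
    [ "$lowest" = "$need" ]
}

# Usage with the versions named in this README:
# version_at_least 12.9.1 12.0 && echo "CUDA toolkit is new enough"
# version_at_least 7.0 6.0 && echo "ROCm is new enough"
```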
 
 ## Quick Start
@@ -59,23 +59,25 @@ make help
 
 ### Manual Compilation
 
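+The Makefile writes binaries to `build/`. When compiling manually with the commands below, create the directory first (`mkdir -p build`), because the compilers do not create a missing output directory.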
+
 **CUDA Examples:**
 ```bash
-nvcc -o vector_add 01_vector_addition_cuda.cu
-nvcc -o matrix_add 03_matrix_addition_cuda.cu
-nvcc -o device_info 04_device_info_cuda.cu
-nvcc -o performance 05_performance_comparison.cu
-nvcc -o debug 06_debug_example.cu
+nvcc -o build/01_vector_addition_cuda 01_vector_addition_cuda.cu
+nvcc -o build/03_matrix_addition_cuda 03_matrix_addition_cuda.cu
+nvcc -o build/04_device_info_cuda 04_device_info_cuda.cu
+nvcc -o build/05_performance_comparison_cuda 05_performance_comparison_cuda.cu
+nvcc -o build/06_debug_example_cuda 06_debug_example_cuda.cu
 ```
 
 **HIP Examples:**
 ```bash
-hipcc -o vector_add_hip 02_vector_addition_hip.cpp
-hipcc -o matrix_add_hip 03_matrix_addition_hip.cpp
-hipcc -o device_info_hip 04_device_info_hip.cpp
-hipcc -o performance_hip 05_performance_comparison_hip.cpp
-hipcc -o debug_hip 06_debug_example_hip.cpp
-hipcc -o cross_platform 07_cross_platform_comparison.cpp
+hipcc -o build/02_vector_addition_hip 02_vector_addition_hip.cpp
+hipcc -o build/03_matrix_addition_hip 03_matrix_addition_hip.cpp
+hipcc -o build/04_device_info_hip 04_device_info_hip.cpp
+hipcc -o build/05_performance_comparison_hip 05_performance_comparison_hip.cpp
+hipcc -o build/06_debug_example_hip 06_debug_example_hip.cpp
+hipcc -o build/07_cross_platform_comparison 07_cross_platform_comparison.cpp
 ```
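
The commands above follow one naming rule: the binary is the source file name without its extension, placed under `build/`. A minimal sketch of that rule as a dry-run helper follows. The name `compile_cmd` is hypothetical, and choosing the compiler by file extension is an assumption that happens to hold for the files listed here.

```bash
# Hypothetical helper: print the manual compile command for one source file.
# Assumes .cu files are CUDA (nvcc) and .cpp files are HIP (hipcc), which holds
# for the examples in this directory but is not a general rule.
compile_cmd() {
    src=$1
    case "$src" in
        *.cu)  compiler=nvcc ;;
        *.cpp) compiler=hipcc ;;
        *)     echo "unrecognized source file: $src" >&2; return 1 ;;
    esac
    # Strip any directory and the extension to get the binary name.
    name=$(basename "$src")
    name=${name%.*}
    printf '%s -o build/%s %s\n' "$compiler" "$name" "$src"
}

# Usage: print the commands for every example in the current directory.
# for f in *.cu *.cpp; do compile_cmd "$f"; done
```

Because the helper only prints commands, you can review them before running anything. The `build/` directory still has to exist before the printed commands are executed.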
 
 ## Example Descriptions
@@ -91,8 +93,8 @@ Demonstrates:
 
 **Usage:**
 ```bash
-make vector_add_cuda
-./vector_add_cuda
+make
+./build/01_vector_addition_cuda
 ```
 
 **Expected Output:**
@@ -116,8 +118,8 @@ Demonstrates:
 
 **Usage:**
 ```bash
-make vector_add_hip
-./vector_add_hip
+make hip
+./build/02_vector_addition_hip
 ```
 
 ### 3. Matrix Addition (CUDA)
@@ -131,8 +133,8 @@ Demonstrates:
 
 **Usage:**
 ```bash
-make matrix_add_cuda
-./matrix_add_cuda
+make
+./build/03_matrix_addition_cuda
 ```
 
 ### 3b. Matrix Addition (HIP)
@@ -146,8 +148,8 @@ Demonstrates:
 
 **Usage:**
 ```bash
-make matrix_add_hip
-./matrix_add_hip
+make hip
+./build/03_matrix_addition_hip
 ```
 
 ### 4. Device Information (CUDA)
@@ -161,8 +163,8 @@ Demonstrates:
 
 **Usage:**
 ```bash
-make device_info_cuda
-./device_info_cuda
+make
+./build/04_device_info_cuda
 ```
 
 ### 4b. Device Information (HIP)
@@ -176,8 +178,8 @@ Demonstrates:
 
 **Usage:**
 ```bash
-make device_info_hip
-./device_info_hip
+make hip
+./build/04_device_info_hip
 ```
 
 ### 5. Performance Comparison (CUDA)
@@ -191,8 +193,8 @@ Demonstrates:
 
 **Usage:**
 ```bash
-make performance_cuda
-./performance_cuda
+make
+./build/05_performance_comparison_cuda
 ```
 
 ### 5b. Performance Comparison (HIP)
@@ -207,8 +209,8 @@ Demonstrates:
 
 **Usage:**
 ```bash
-make performance_hip
-./performance_hip
+make hip
+./build/05_performance_comparison_hip
 ```
 
 ### 6. Debug Example (CUDA)
@@ -222,8 +224,8 @@ Demonstrates:
 
 **Usage:**
 ```bash
-make debug_cuda
-./debug_cuda
+make debug
+./build/06_debug_example_cuda
 ```
 
 ### 6b. Debug Example (HIP)
@@ -238,8 +240,8 @@ Demonstrates:
 
 **Usage:**
 ```bash
-make debug_hip
-./debug_hip
+make debug hip
+./build/06_debug_example_hip
 ```
 
 ### 7. Cross-Platform Comparison
 
@@ -1,7 +1,7 @@
-# Module 2: Multi-Dimensional Data Processing
+# Module 2: Advanced GPU Memory Management
 
 ## Overview
-This module explores multidimensional grid organization, thread mapping to data structures, image processing kernels, and matrix multiplication algorithms.
+This module focuses on GPU memory hierarchy mastery and performance optimization: shared memory tiling, memory coalescing, texture/read-only memory usage, unified memory, and bandwidth optimization.
 
 ## Learning Objectives
 After completing this module, you will be able to:
@@ -12,16 +12,31 @@ After completing this module, you will be able to:
 - Handle boundary conditions in multidimensional algorithms
 
 ## Module Content
-- **[content.md](content.md)** - Complete module content (Coming Soon)
-- **[examples/](examples/)** - Practical code examples (Coming Soon)
+- **[content.md](content.md)** - Complete module content
+- **[examples/](examples/)** - Practical code examples
 
-## Status: 🚧 Under Development
+## Quick Start
 
-This module is currently being developed. Check back soon for:
-- Comprehensive theory and explanations
-- Working code examples  
-- Hands-on exercises
-- Performance benchmarks
+### Prerequisites
+- NVIDIA GPU with CUDA support OR AMD GPU with ROCm support
+- CUDA Toolkit 12.0+ or ROCm 6.0+ (Docker images provide CUDA 12.9.1 and ROCm 7.0)
+- C/C++ compiler (GCC, Clang, or MSVC)
+
+Recommended: use the Docker development environment. From the repository root:
+```bash
+./docker/scripts/run.sh --auto
+```
+
+### Build and Run
+```bash
+cd modules/module2/examples
+make            # auto-detects your GPU and builds accordingly
+
+# Run a few examples (binaries in build/)
+./build/01_shared_memory_transpose_cuda    # or _hip on AMD
+./build/02_memory_coalescing_cuda          # or _hip on AMD
+./build/04_unified_memory_cuda
+```
 
 ## Topics to be Covered
 
@@ -30,20 +45,20 @@ This module is currently being developed. Check back soon for:
 - Grid size calculations for arbitrary data sizes
 - Thread-to-data mapping strategies
 
-### 2. Image Processing Applications
-- Image convolution kernels
-- Color space transformations
-- Image filtering and enhancement
+### 2. Memory Access Patterns
+- Coalesced vs strided access
+- Structure of Arrays vs Array of Structures
+- Read-only/texture cache benefits
 
-### 3. Matrix Operations
-- Matrix multiplication algorithms
-- Tiled matrix multiplication
-- Memory access optimization
+### 3. Shared Memory and Tiling
+- Tiled transpose with bank-conflict avoidance
+- Block-level cooperation and synchronization
+- Padding strategies to avoid bank conflicts
 
-### 4. Advanced Indexing
-- Row-major vs column-major layouts
-- Handling non-square matrices
-- Boundary checking techniques
+### 4. Unified Memory and Bandwidth
+- Unified memory prefetch and advice
+- Measuring and optimizing memory bandwidth
+- Analyzing profiler metrics for memory performance
 
 ---
 **Duration**: 6-8 hours  
 
@@ -1,6 +1,8 @@
 # Module 2: Advanced GPU Memory Management and Optimization
 *Mastering GPU Memory Hierarchies and Performance Optimization*
 
+> Environment note: Examples are tested in Docker containers with CUDA 12.9.1 and ROCm 7.0 (rocm/dev-ubuntu-22.04:7.0-complete). Prefer Docker for reproducible builds.
+
 ## Learning Objectives
 After completing this module, you will be able to:
 - Master GPU memory hierarchy and optimization strategies