<p>The <code class="bg-gray-100 px-1 py-0.5 rounded">gpu-programming-101</code> repo has always focused on <strong>practical progression</strong>, from "hello world" CUDA kernels to production-grade code. Today, we’re releasing a new advanced module: <code class="bg-gray-100 px-1 py-0.5 rounded">04_cuda_optimization</code>, packed with techniques to turn your prototypes into high-performance applications.</p>
<p>Our initial CUDA examples (in <code class="bg-gray-100 px-1 py-0.5 rounded">02_basic_kernels</code>) are great for learning, but they leave significant performance on the table. For example, our naive matrix multiplication kernel:</p>
    __shared__ float As[16][16]; // Shared memory tile for A
    __shared__ float Bs[16][16]; // Shared memory tile for B

    // ... (tile loading logic) ...

    // Reuse tiles for 16x16 output elements
    for (int t = 0; t < N/16; t++) {
        // Load tiles from global memory to shared memory
        As[threadIdx.y][threadIdx.x] = A[...];
        Bs[threadIdx.y][threadIdx.x] = B[...];
        __syncthreads(); // Wait until the whole tile is loaded

        // Compute partial sum using shared memory
        for (int k = 0; k < 16; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads(); // Wait before the tiles are overwritten
    }
}</pre>
</div>
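<p>For readers who want something they can compile, the tiling idea above can be sketched as a complete program. This is a minimal illustration, not the kernel shipped in the repo: the names <code class="bg-gray-100 px-1 py-0.5 rounded">matmul_tiled</code>, <code class="bg-gray-100 px-1 py-0.5 rounded">run_matmul_check</code>, and <code class="bg-gray-100 px-1 py-0.5 rounded">TILE</code> are hypothetical, and it assumes square, row-major matrices whose size is a multiple of 16.</p>

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // Assumed tile width, matching the 16x16 tiles above

// C = A * B for square N x N row-major matrices. Assumes N % TILE == 0.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each tile
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }
    C[row * N + col] = sum;
}

// Runs the kernel on small test matrices and compares with a CPU reference.
bool run_matmul_check(int N = 64) {
    std::vector<float> hA(N * N), hB(N * N), hC(N * N);
    for (int i = 0; i < N * N; ++i) {
        hA[i] = (i % 7) * 0.5f;
        hB[i] = static_cast<float>(i % 5) - 2.0f;
    }

    size_t bytes = static_cast<size_t>(N) * N * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dB, bytes) != cudaSuccess) { cudaFree(dA); return false; }
    if (cudaMalloc(&dC, bytes) != cudaSuccess) { cudaFree(dA); cudaFree(dB); return false; }
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(N / TILE, N / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, N);
    bool ok = (cudaGetLastError() == cudaSuccess);
    // cudaMemcpy on the default stream waits for the kernel to finish
    ok = ok && (cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess);

    for (int r = 0; r < N && ok; ++r) {
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < N; ++k) ref += hA[r * N + k] * hB[k * N + c];
            if (std::fabs(hC[r * N + c] - ref) > 1e-3f) { ok = false; break; }
        }
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return ok;
}

int main() {
    bool ok = run_matmul_check();
    std::printf("tiled matmul %s\n", ok ? "matches CPU reference" : "FAILED");
    return ok ? 0 : 1;
}
```

<p>Each element of A and B is read from global memory once per tile rather than once per multiply, which is where the reuse comes from. Sizes that are not a multiple of the tile width would need bounds checks in the load and store steps.</p>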
<h4>2. Memory Coalescing</h4>
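<p>Our <code class="bg-gray-100 px-1 py-0.5 rounded">memory_coalescing_demo.cu</code> shows how to arrange global memory accesses so that the threads of a warp touch contiguous, aligned addresses and the hardware can combine them into fewer memory transactions, reducing latency by 70% for strided access patterns.</p>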
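<p>To make the difference concrete, here is a minimal sketch that times a coalesced copy against a strided one. It is not the contents of <code class="bg-gray-100 px-1 py-0.5 rounded">memory_coalescing_demo.cu</code>: the kernel names, the <code class="bg-gray-100 px-1 py-0.5 rounded">run_coalescing_demo</code> helper, the array size, and the stride of 32 are all assumptions chosen for illustration, and the timings will vary by GPU.</p>

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Coalesced: consecutive threads read consecutive floats.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` floats apart,
// so a single warp touches many separate memory segments.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = static_cast<int>((static_cast<long long>(i) * stride) % n);
        out[i] = in[j];
    }
}

// Runs both kernels, prints their timings, and returns true if both
// produced the expected output.
bool run_coalescing_demo(int n = 1 << 22, int stride = 32) {
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int block = 256;
    const int grid = (n + block - 1) / block;
    float ms_coalesced = 0.0f, ms_strided = 0.0f;
    bool ok = true;

    cudaEventRecord(start);
    copy_coalesced<<<grid, block>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_coalesced, start, stop);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == h_in[i]);

    cudaEventRecord(start);
    copy_strided<<<grid, block>>>(d_in, d_out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms_strided, start, stop);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) {
        long long j = (static_cast<long long>(i) * stride) % n;
        ok = ok && (h_out[i] == h_in[j]);
    }

    std::printf("coalesced: %.3f ms, strided (stride %d): %.3f ms\n",
                ms_coalesced, stride, ms_strided);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main() {
    return run_coalescing_demo() ? 0 : 1;
}
```

<p>Both kernels move the same number of bytes, so any gap in the printed timings comes from the access pattern alone. The first launch also absorbs one-time startup cost, so a warm-up launch before timing gives a fairer comparison.</p>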
<h4>3. Profiling Workflow</h4>
<p>We’ve added a step-by-step guide to using NVIDIA Visual Profiler (nvvp) to identify bottlenecks, with example profiles from our before/after kernels.</p>
<h3>From Repo to Real-World</h3>
<p>What makes this module unique? It’s tied to real use cases:</p>
<ul class="list-disc pl-6 mb-6 space-y-2">
<li>Image convolution (used in our <code class="bg-gray-100 px-1 py-0.5 rounded">03_image_processing</code> module)</li>
<li>Particle simulation (extended from our <code class="bg-gray-100 px-1 py-0.5 rounded">02_basic_kernels/nbody.cu</code>)</li>
<li>Neural network inference (compatible with our PyTorch-CUDA bridge examples)</li>
</ul>
<p>Update your repo with <code class="bg-gray-100 px-1 py-0.5 rounded">git pull</code> and start optimizing, then share your speedups in the <a href="https://github.com/AIComputing101/gpu-programming-101/issues" target="_blank" class="text-primary hover:underline">issue tracker</a>. We’re especially excited to see how you apply these techniques to your own projects!</p>
</div>
</div>
</section>
<!-- Related Posts -->
<section class="py-12 bg-gray-50">
<div class="container mx-auto px-4">
<div class="max-w-3xl mx-auto">
<h2 class="text-2xl font-semibold mb-8">Related to GPU-101</h2>