<p>Our initial CUDA examples (in <code class="bg-gray-100 px-1 py-0.5 rounded">02_basic_kernels</code>) are great for learning, but they leave significant performance on the table. Our naive matrix multiplication kernel, for example, re-reads every element of A and B from global memory for each output element that uses it.</p>
<h3>1. Shared Memory Tiling</h3>
<p>The tiled kernel below stages 16&times;16 blocks of A and B in shared memory, so each value fetched from global memory is reused across 16 output elements:</p>
<pre class="code-block"><code class="language-cuda">#define TILE 16

// Tiled matrix multiply: C = A * B for square N x N matrices
// (assumes N is a multiple of TILE).
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE]; // Shared memory tile for A
    __shared__ float Bs[TILE][TILE]; // Shared memory tile for B

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Reuse each pair of tiles for 16x16 output elements
    for (int t = 0; t < N / TILE; t++) {
        // Load one tile of A and one tile of B from global to shared memory
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads(); // wait until both tiles are fully loaded

        // Compute the partial sum out of shared memory
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads(); // wait before the next iteration overwrites the tiles
    }
    C[row * N + col] = sum;
}</code></pre>
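<p>A matching launch pairs the 16&times;16 tile with a 16&times;16 thread block. The sketch below is illustrative; the kernel name and the <code class="bg-gray-100 px-1 py-0.5 rounded">d_A</code>-style device pointer names are ours, not necessarily the repo's:</p>
<pre class="code-block"><code class="language-cuda">// Illustrative launch: one 16x16 thread block per 16x16 output tile.
dim3 block(16, 16);
dim3 grid(N / 16, N / 16);                 // assumes N is a multiple of 16
matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();                   // wait for the kernel to finish
</code></pre>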
<h3>2. Memory Coalescing</h3>
<p>Our <code class="bg-gray-100 px-1 py-0.5 rounded">memory_coalescing_demo.cu</code> shows how aligning a warp's global memory accesses to consecutive addresses lets the hardware coalesce them into a few wide transactions, cutting latency by 70% relative to strided access patterns.</p>
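<p>As a minimal sketch of the comparison such a demo makes (the kernel names and the <code class="bg-gray-100 px-1 py-0.5 rounded">stride</code> parameter are illustrative, not necessarily what <code class="bg-gray-100 px-1 py-0.5 rounded">memory_coalescing_demo.cu</code> contains): consecutive threads reading consecutive addresses coalesce into a few wide transactions, while a stride scatters a single warp across many.</p>
<pre class="code-block"><code class="language-cuda">// Illustrative sketch, not the repo's exact demo code.
// Coalesced: thread i reads element i -- a warp's 32 loads fall
// into a handful of contiguous 128-byte memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride -- the same warp now
// touches up to 32 separate transactions, wasting most of each one.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}</code></pre>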
<p>Our original DQN used uniform experience replay, which wastes time on low-impact transitions. @ml-engineer-jane implemented prioritized replay, weighting transitions by their temporal difference (TD) error:</p>
<codeclass="bg-gray-100 px-1 py-0.5 rounded block my-4"># In 03_advanced_agents/dqn_prioritized.py
138
+
139
+
<preclass="code-block"><codeclass="language-python"># In 03_advanced_agents/dqn_prioritized.py