Types of parallelism:
- Task based: do different operations in parallel
- e.g. multiply/add
- e.g. GNU Make (running subtasks in parallel)
- Suitable for multi-core CPU or networks of computers
- Data based: same operations in lock step on different data in parallel
- e.g. image ops on all pixels in an image simultaneously
- Suitable for GPUs
Latency vs Throughput:
- Latency oriented
- Get each result with minimum delay
- Get first result ASAP, even if it slows getting all results
- Throughput oriented
- Get all results back with minimum delay
- Doesn't matter how long to get each result, so long as total time is short as possible
- Large cache to speed up memory access
- Temporal/Spatial locality, Working sets (likely to use same value from
memory again, likely to use value close to used value from memory)
- Ensure each operation has smallest probability of fetching from slow memory
- May be a good idea to pre-fetch data
- Temporal/Spatial locality, Working sets (likely to use same value from
memory again, likely to use value close to used value from memory)
- Complex control units
- Short pipelines, Branch prediction, Data forwarding
- Make each instruction finish ASAP with minimal pipeline stalls
- Short pipelines, Branch prediction, Data forwarding
- Complex, energy expensive ALUs
- Large complex transistor arrangements
- Minimise no. of clock cycles per operation, get result ASAP
- Large complex transistor arrangements
- Small caches
- For staging data
- Get blocks of data for groups of threads to work on
- Avoids separate fetches from each thread
- Not for temporally/spatially located data
- For staging data
- Simple control units
- No branch prediction or data forwarding
- Control shared by multiple threads working on different data
- Simple energy efficient ALU
- Long pipeline
- Large no. of cycles per operation, heavily pipelined
- Long wait for 1st result (filling the pipeline), following results come quickly
- Requires large no. of threads to keep processor occupied
- Conceptually, host CPU and GPU are separate devices, connected by
communication path
- Need to generate separate code for each device. Nvidia compiler for CUDA "nvcc"
- nvcc takes C/C++ with Nvidia extensions, separates and compiles GPU code itself, passes host code to host for compilation
- Result binary contains host and device binary, downloaded to device from host when run
- Have to identify if function runs on host, device or both. Identify where it's
callable from:
- Host:
__host__ void f(...)- Default, can be omitted
- Callable from host only
- Device:
__global__ void f(...)- Special functions called kernel functions
- Callable from host only, how host gets code to run on GPU
- Device:
__device__ void f(...)- Callable from device only, helper functions to kernel functions and other GPU functions
- Both:
__host__ __device__ void f(...)- Generates host and device function, same code can run on both. Host/device cannot call other version
- Host:
- When calling kernel function, specify how threads are organised to execute it
- Every thread executes same kernel function
- Different GPU devices can support different no. of threads
- Making thread structure uniform would sacrifice potential computing power
-
Don't want fixed no. of threads to dictate size of largest vector we can add
-
Don't want to change code to run on different GPU with different thread count
-
Want to be able to organise threads to co-locate groups of threads on sets of processing units, taking advantage of shared caches and synchronisation facilities.
-
Nvidia GPUs organise threads into hierarchical structure
- A Grid is a collection of Blocks
- A Block is a collection of Threads
- A Thread is execution of a kernel on a single processing unit
There is the concept of a warp:
- A set of tightly related threads, must execute fully in lock step with each other
- Not part of the CUDA spec, but a feature of all Nvidia GPUs in low level hardware design
- No. of threads in a warp is a feature of GPU but current GPUs mostly 32
- Low level basis of thread scheduling on a GPU. If a thread is scheduled to execute, so must all other threads in warp
- Executing same instructions in lock step, all threads in warp have same instruction exec timing
- A block can have between 1 and max block size no. of threads for GPU. Is high-level basis for thread scheduling
- Because of warps, block size should be multiple of warp size, otherwise blocks are padded with remaining threads from a warp and many are wasted
- Grids have a large no. of blocks, more than can be executed at once
- Grid = whole problem divided into bitesize blocks
- (Missing notes)
(Missing notes)
- Latency: time from start to end of task
- Work: measure of work to do, i.e. FLOPS, num images processed
- Throughput: work done per time unit
- Speedup: improvement in speed from using 1 computational unit to
Pcomputational unitsS(p) = T1 / TpwhereT1is time taken for one computational unit andTpis time taken forpcomputational units- Perfect linear speedup:
S(p) = p- Means that for every computational unit we add, we get an extra
T1boost to work done in a time unit
- Means that for every computational unit we add, we get an extra
- Efficiency: ratio of
S(p) : pE(p) = T1 / (Tp * p) = S(p) / p- Measures how well computational units are contributing to latency/throughput
- Perfect linear efficiency:
E(p) = 1
T1 = Tser + TperwhereT1is time spent using 1 computational unitTseris time spent doing non-parallelizable workTparis time spent doing parallelizable work
- Therefore, using
pcomputational units gives us:Tp = Tser + (Tpar / S(p))
- Therefore, overall speed up is:
S'(p) = (Tset + Tpar) / (Tser + (Tpar / S(p)))- TODO: Are these equations right?
- We can write
TserandTparusing a timeTand a fraction of it that is parallelizablefS'(p) = ((1 - f)T + fT) / ((1 - f)T + (fT / S(p)))S'(p) = 1 / (1 - f + (f / S(p)))
- If as
ptends to infinity, so doesS(p), the limit of speed isTseror(1 - f)T
- Focuses on workload instead of time
W = (1 - f)W + fWwhereWis the total workload executedfis the fraction of the workload that is parallelizable
- Therefore, with
sas some speedup factor for parallel parts:Ws = (1 - f)W + sfW
- We can measure the ratio between
WsandW, giving us speedup:S = Ws / WS = ((1 - f)W + sfW) / ((1 - f)W + fW)S = 1 - f + fs
- A grid contains a collection of blocks, which contain a collection of threads
- Threads are executed in warps
- e.g. 1024 threads in a block, 32 threads in a warp, 1024/32=32 warps needed to execute a block
- TODO: How do warps relate to streaming multiprocessors?
- Warps execute in lock step, so implicit synchronisation
- For synchronisation in a block, we use
__syncthreads()- All threads in a block will hit this point and wait
- Once all threads reach it, execution will continue
- Can not call
__syncthreads()in a conditional
- Can not synchronise in a grid
- In a 1D block
- Warp 0 has threads 0-31
- Warp 1 has threads 32-63
- In a 2D block
- Let's say block size of
(4, 4, 4)- Thread
(0, 0, 0)is given index 0 - Thread
(0, 0, 1)is given index 1 - Thread
(0, 1, 0)is given index 4 - Thread
(0, 1, 2)is given index 6 - Thread
(1, 0, 0)is given index 16 - etc.
- Thread
- Then we allocate the warps as we did with 1D block, using these new indices
- Let's say block size of
- In a warp, say threads 0-15 take branch A, and threads 16-31 take branch B. This is called divergence.
- They all have to execute the same commands, which means that the complete warp has to process branch A and branch B
- If they all take branch A, branch B does not have to be executed
- We want to minimise divergence, and will design our algorithms so that threads next to each other in terms of index will execute the same instructions
- Apply some operator to a list of data and store the cumulative output
- e.g.
- List is [1, 2, 1, 3]
- Operator is addition
- Output is [1, 3, 4, 7]
- Only works within a single block
- Work through the list, where thread
nadds together:- Items
nandn - 1 - Then items
nandn - 2 - Then items
nandn - 4 - etc.
- Items
- FLOPS:
(N - 1) + (N - 2) + (N - 4) + ...- For 1024, serial has 1023 FLOPS
- For 1024, HSH has 9217 FLOPS
- BAD!
- Two phases, reduction and distribution
- Can optimise through copying block into shared memeory, performing operations, and then copying back (this works for HSH too)
- Hopefully we don't get examined on this because it was in the assignment... imma skip.
- FLOPS:
- Reduction phase:
(N/2) + (N/4) + (N/8)... - Distribution phase:
...(N/8 - 1) + (N/4 - 1) + (N/2 - 1) - For 1024, Blelloch has 2037 FLOPS
- Reduction phase:
- Perform block scan on all blocks in the list
- Extract the last value of each block into new list
- Perform block scan on this list
- Add this list back into the original block scanned list
- Global memory read/writes in transactions of size 32, 64, or 128 bytes
- Each memory transaction takes approx the same amount of time, so writing 8 bytes is equivalent to writing 128 bytes
- If consecutive threads access consecutive bytes, then all reads will be done in one memory transaction
- If consecutive threads access non-consecutive bytes, then all reads will be done in separate memory transactions, which is a lot slower
- In CUDA programming, favour SoA as it helps coalesced memory accesses
- 2 orders of magnitude faster than global memory accesses
- There are 32 banks (since compute 2.0)
- 32 consecutive shared memory accesses are spread across the 32 banks
nreads tondifferent banks can be done in one read operationnreads to1bank has to be done innread operations - must be done serially- With exception:
nreads to the same address in the same bank only takes one operation- TODO: What do these other exceptions mean?
- Sometimes we need to perform an atomic operation, e.g.
array[i] += 1;- This line contains a read, a modification, and a write
- We have some options:
- Use
__syncthreads()to make sure all threads can perform the operation uninterupted - Restructure so that one thread collects all the updates and performs them
- Use an atomic operation:
atomicAdd(&(array[i]), 1);
- Use
- TODO: Is this fast? How does it work? Not mentioned in the slides
- Each SM has a maximum number of blocks and warps that it can run concurrently
- Maximum blocks « maximum warps
- If you have less warps running than the max number of warps, you fail to reach 100% occupancy
- If you have loads of small blocks, then you might not reach the maximum number of warps, but you will reach the maximum number of blocks
- Map
- One-to-one, each thread responsible for converting one element
- Gather
- Read many, write one
- No read/write races still
- Scatter
- Read one, write many
- Need to be careful with writes
- Stencil
- Special case of gather
- Pattern of reads is constant across threads
- Transpose
- AoS → SoA
- Reduce
- All values into one single value
- Scan/sort
- All-to-all (or at least many-to-all)
- Each thread calculates value for some element, and increments histogram bin
- Need to do atomic add
- This doesn't scale well if you have many elements writing to one histogram bin
- Improvement
- Each thread calculates histogram for a subset of the data, no atomic adds here
- Add to final result
- Atomic add final results
- Improvement 2
- Instead of adding final results, perform reduce operation
- Improvement 3
- Sort the keys, and then reduce by key
- Gather
- One-to-one mapping, but which "ones" are included is not uniform
- Exclusive scan
- The scan at index
idoes not contain the value at indexiin the original array
- The scan at index
- Segmented scan
- Only cumulate results across specific ranges
- We want to apply
f()to elements that satisfyp() - While all threads execute
p(), only few will executef(), meaning that there will be a lot of divergence - So we can compact all elements that satisfy
p()into one array, and then iterate over that - call thiscompat()- First, map
xsthroughp()xs = [1, 3, 2, 3, 2, 4], p(x) = iseven(x)bs = [f, f, t, f, t, t]
- Cast booleans from
p()into0/1sis = [0, 0, 1, 0, 1, 1]
- Perform exclusive sum scan
scan = [0, 0, 0, 1, 1, 2]
- Perform scatter with original
xswith indicesscanifbssatisfied for index2 → 0, 2 → 1, 4 → 2[2, 2, 4]
- Perform some arbitrary
f()on the final array
- First, map
- Can use this for clipping 2D triangle into a viewport
- If a triangle is outside the viewport, it maps to 0 triangles
- If a triangle is inside the viewport, it maps to 1 triangle
- If the triangle is partially inside the viewport, it maps to
ndifferent triangles - We can use
compat()to allocate indices for these triangles
Matrix representation of X⋅Y:
[0 b c][x] [bx + cx]
[d e f][y] = [dy + ey + fy]
[0 0 i][z] [iz]
We can represent this as:
V = [b, c, d, e, f, i]: The values in theX, excluding zerosC = [1, 2, 0, 1, 2, 2]: The columns of the values inVR = [0, 2, 5]: The indices of values inVthat start a row
Algorithm:
- Gather from
YusingC:
G = gather([x, y, z], [1, 2, 0, 1, 2, 2])G = [y, z, x, y, z, z]
- Map
V * G
M = map(_*_, zip(V, G))M = [by, cz, dx, ey, fz, iz]
- Run segmented inclusive sum scan on
M
[by + cz, dx + ey + fz, iz]
- In every even step, thread
icompares elements2iand2i + 1, and swaps if out of order - In odd steps, thread
icompares2i + 1and2i + 2 O(n)steps,O(n²)work
- Start with trivially sorted segments of size 1
- Merge pairs of segments at each step
- End when there is only one segment
- Small segments:
- Perform merge on one thread
- Medium segments:
- Use multiple threads to merge segments
- Large segments (bigger than block size)
- Use multiple blocks to handle each merge
- Merging
A[]andB[] - Thread
ilooks at elementA[i] - Binary search for
A[i]inB[], get indexj - We now know to insert into new array at index
i + j
- Split
A[]andB[]into segments of sizek - Take each of the splitting values, and merge into a single list using previous algorithm
- Find each of these splitting values in the other array
- e.g.
- Have splitting values
a₁, a₂, a₃andb₁, b₂, b₃ - Merged, they become
a₁, b₁, b₂, a₂, b₃, a₃ - We take range
0 - a₁from bothA[]andB[](wherea₁is binary searched for inB[]) - We merge these two ranges into the final array
- We do the same for
a₁ - b₁,b₁ - b₂,b₂ - a₂from the previously merged splitters array
- Have splitting values
- A monotonic sequence is where each value is greater than the last, or less
than the last, e.g.
[1, 2, 3, 4] - A bitonic sequence is where the values in the list change once, e.g.
[1, 2, 3, 4, 2, 1] - A bitonic sequence can be shifted circularly, e.g. for the previous example,
[3, 4, 2, 1, 1, 2], and still remain a bitonic sequence
- You can split a bitonic sequence into two bitonic sequences
- This is done by splitting it in half, and comparing swapping elements between the splits so that the first split contains all the lower elements
[1, 2, 3, 4, 2, 1]
=> [1, 2, 3]
[4, 2, 1]
=> [1, 2, 1]
[4, 2, 3]
- We can split the bitonic list repeatedly, until there is only one element in each bitonic list
- Since they only have one element, they are trivially sorted
- Due to the nature of bitonic splits, all the lists concatenated will be sorted
- But first, we need a bitonic list...
- Change vector into lists of two elements
- Each of these is trivially bitonic
- We then turn it into a monotonic list (using above method)
- We then group these lists together, e.g.
[a₁, a₂, a₃, a₄] → [a₁++a₂, a₃++a₄] - These lists are then bitonic, as we've concatenated two monotonic lists, but they are of size 4
- We do this until all elements are merged into a single bitonic list, and then turn the list into a monotonic (i.e. sorted) list
- Algorithm is easily parallelizable
O(nlog²n)steps- Fastest for small lists
- Can sort integers
- Look at each integer's binary representation, and do a stable split based on
whether the least significant bit is 0/1
- Stable means that when you sort, if two elements are equal then they do not change order
- Then do this for the 2nd least significant bit, then 3rd, etc.
- Complexity is
O(kn)wherekis the number of bits, andnis the size of the list - Where to put each element can be done with the
compactoperation previously discussed - Fastest for medium to large lists
- IEEE-754 floating point number has:
- 1 bit sign (
S) - 8 bit exponent field (
E)- Bias of 127, i.e.
exponent = E - 127 - 0 and 255 reserved for special numbers
- Bias of 127, i.e.
- 23 bit mantissa field (
M)
- 1 bit sign (
- For normal numbers, implicit initial 1 bit in the mantissa is assumed
- i.e.
010101 → 1010101
- i.e.
- Special bit patterns in
S, E, Mfor:- Positive/negative zero
- Positive/negative infinity
- NaN
- Other sub-normal numbers
- For normal numbers:
(-1)^S * (1 + (2^-23) * M) * 2^(E - 127)
- For sub-normal numbers:
- Difference between
0 - succ(0)is far greater thansucc(0) - succ(succ(0)) - Non-zero numbers that are smaller than the magnitude of the smallest IEE-754 floating point number
- If
E=0, then do not use the implicit 1 bit in the mantissa
- Difference between
- Given some number
x, what is the shortest distance between the paira, bwherea ≤ x ≤ b - IEEE-754 requires operations round to the nearest representable floating point
number
- i.e. the computed result must be with in 0.5 ULPs of the mathematically correct result
- If we try to sum the list
[1000.0, 0.1, 0.1, 0.1...]the result will always be1000.0because the operation1000.0 + 0.1results in1000.0 - Worst error is
O(n)
- Sort the list as ascending first, then sum
- However, accumulated small values will get significantly larger that later values in the vector
- Accumulate the sum, but calculate a correction term
float kahan_add(float *xs, size_t len) {
float sum = 0;
float correction = 0;
for (size_t i = 0; i < len; i++) {
float corrected_next_term = xs[i] - correction;
float new_sum = sum + corrected_next_term;
correction = (new_sum - sum) - corrected_next_term;
sum = new_sum;
}
return sum;
}- Worst error is
O(1) - Not easy to parallelise
- Sort the list, then thread
iperformsa[i*s] + a[i*s + (s / 2)]- where
sis the stride that starts at 2 and doubles on each iteration
- where
- Starts with adding similar pairs, but difference increases as you go on
- Worst error is
O(log(n)) - Poor thread blocking, memory access patterns
- Could have thread
iperforma[i] + a[i + s]- where
sis the stride that starts atlen(a) / 2and halves on each iteration
- where
- This needs careful original ordering of
a[]
- Could have thread
- Absolute error:
|v - v'|wherevis the true value, andv'is the approximation - Relative error:
|v - v'| / |v|whenv ≠ 0
- Used for image analysis, matrix multiplication, spacial simulations
- Threads are essentially broken down into 1D again, for example with a
(2, 2, 2)dimensional blockthreadIdx = (0, 0, 0) → thread 0threadIdx = (1, 0, 0) → thread 1threadIdx = (0, 1, 0) → thread 2threadIdx = (1, 1, 0) → thread 3- etc.
D[blockIdx.x][blockIdx.y]whereDand block size is2x2- Thread
(0, 0) → 0accessesD + 0 - Thread
(1, 0) → 1accessesD + 2 - Thread
(0, 1) → 2accessesD + 1 - So we're not coalesced
- Thread
D[blockIdx.y][blockIdx.x](we switchxandy)- Thread
(0, 0) → 0accessesD + 0 - Thread
(1, 0) → 1accessesD + 1 - Thread
(0, 1) → 2accessesD + 2 - So we're coalesced!
- Thread
D[x][y]whereDand block size isK x KandKis a multiple of 32- Thread
(0, 0) → 0accesses bank 0 - Thread
(1, 0) → 1Kaccesses bank 0 - Thread
(2, 0) → 2Kaccesses bank 0 - Conflicts!
- Thread
D[y][x](we switchxandy)- Thread
(0, 0) → 0accesses bank 0 - Thread
(1, 0) → 1Kaccesses bank 1 - Thread
(2, 0) → 2Kaccesses bank 2 - No conflicts!
- Thread
- When transposing, we need to access both
D[y][x]andD[x][y], meaning one of the accesses will have uncoalesced global memory accesses, and shared memory bank conflicts - Trick:
- Copy in a element of
Dinto shared memory with good indexing __syncthreads()- Copy out a different element with also good indexing
- Copy in a element of
- Naive: each thread handles an output element, and performs the dot product of
the correct row and column.
- For the "compute to global memory access" ratio, we have two global reads, one multiplication, one addition. This gives us a 1:1 ratio - not good
- Clever:
- Do dot product for tiles of the outputs
- e.g. if we have 64x64 matrices, and 32x32 tiles, we have 2x2 tiles
- To calculate
O[0, 0]:- First do partial dot products of
A[0, 0]andB[0, 0] - Then do partial dot products of
A[1, 0]andB[0, 1] - Sum the results for the final
O[0, 0]values
- First do partial dot products of
- Do dot product for tiles of the outputs

