Starting a new issue here to cover some of what I've found out about performance of spatial interaction calculations that has grown out of my SIMD work (see #586 and associated PRs #578, #587, #588).
The context is that I started work on adding SIMD optimization to InteractionType::FillSparseVectorForReceiverStrengths(), in particular the SpatialKernelType::kExponential and SpatialKernelType::kNormal kernels, and in benchmarking my optimizations I noticed that the codeblock in that function that moves to a separate code path if ((interaction_callbacks.size() == 0) && (spatiality_ == 2)) might actually be slowing things down.
Loosely, when if ((interaction_callbacks.size() == 0) && (spatiality_ == 2)) evaluates to true, spatial interactions are computed one at a time during kd-tree traversal. This means that only a single-pass through all neighbors is needed. However, if that codeblock is commented out, the interactions are computed in a second pass on the sparse vector like:
for (uint32_t col_iter = 0; col_iter < nnz; ++col_iter){
sv_value_t distance = values[col_iter];
values[col_iter] = (sv_value_t)(if_param1_ * (1.0 - distance / max_distance_));
}
Interestingly, with the default compiler optimizations, this two pass strategy seems to be faster, no doubt as a result of the compiler being able to auto-vectorize those loops.
Here are some representative results, first from a benchmark N~5000 individuals with intermediate densities (~2000 neighbors per individual):
| Kernel |
Original |
No kd-branch (Scalar) |
Scalar Δ |
| Fixed |
1.997s |
2.152s |
+7.8% |
| Linear |
2.449s |
2.288s |
-6.6% |
| Exponential |
4.107s |
2.812s |
-31.5% |
| Normal |
3.879s |
2.992s |
-22.9% |
| Cauchy |
2.693s |
2.459s |
-8.7% |
| Student's T |
9.616s |
7.965s |
-17.2% |
| TOTAL |
24.785s |
20.709s |
-16.4% |
Adding my SIMD optimizations to Exponential/Gaussian interactions on top of this lead to further time savings:
| Kernel |
Original |
No kd-branch (Scalar) |
No kd-branch (SIMD) |
Scalar Δ |
SIMD vs Orig |
| Fixed |
1.997s |
2.152s |
2.106s |
+7.8% |
+5.5% |
| Linear |
2.449s |
2.288s |
2.183s |
-6.6% |
-10.9% |
| Exponential |
4.107s |
2.812s |
2.268s |
-31.5% |
-44.8% |
| Normal |
3.879s |
2.992s |
2.285s |
-22.9% |
-41.1% |
| Cauchy |
2.693s |
2.459s |
2.272s |
-8.7% |
-15.6% |
| Student's T |
9.616s |
7.965s |
7.941s |
-17.2% |
-17.4% |
| TOTAL |
24.785s |
20.709s |
19.095s |
-16.4% |
-23.0% |
Savings are also present at lower densities. Here are results for N~5000 individuals with ~200 neighbors per individual:
| Kernel |
Original |
No kd-branch (Scalar) |
No kd-branch (SIMD) |
Scalar Δ |
SIMD vs Orig |
| Fixed |
0.511s |
0.503s |
0.485s |
-1.6% |
-5.1% |
| Linear |
0.562s |
0.517s |
0.494s |
-8.0% |
-12.1% |
| Exponential |
0.752s |
0.587s |
0.506s |
-21.9% |
-32.7% |
| Normal |
0.724s |
0.610s |
0.508s |
-15.7% |
-29.8% |
| Cauchy |
0.582s |
0.537s |
0.504s |
-7.7% |
-13.4% |
| Student's T |
1.483s |
1.267s |
1.253s |
-14.6% |
-15.5% |
| TOTAL |
4.653s |
4.063s |
3.788s |
-12.7% |
-18.6% |
@petrelharp suggested I increase the number of individuals present in the benchmarking simulations. Here are results from N~50,000 individuals with ~2000 neighbors / individual:
| Kernel |
Original |
No kd-branch (Scalar) |
No kd-branch (SIMD) |
Scalar Δ |
SIMD vs Orig |
| Fixed |
31.97s |
36.02s |
32.98s |
+12.7% |
+3.2% |
| Linear |
37.26s |
37.93s |
34.05s |
+1.8% |
-8.6% |
| Exponential |
59.58s |
44.72s |
35.17s |
-24.9% |
-41.0% |
| Normal |
56.37s |
47.19s |
35.31s |
-16.3% |
-37.4% |
| Cauchy |
40.04s |
40.20s |
35.11s |
+0.4% |
-12.3% |
| Student's T |
130.10s |
112.15s |
109.68s |
-13.8% |
-15.7% |
| TOTAL |
356.04s |
318.95s |
282.98s |
-10.4% |
-20.5% |
this reveals some performance regression of the no kd-branch in the Fixed kernel. Finally here is a last benchmark with N~50,000 individuals with ~20 neighbors/individual:
| Kernel |
Original |
No kd-branch (Scalar) |
No kd-branch (SIMD) |
Scalar Δ |
SIMD vs Orig |
| Fixed |
1.856s |
1.804s |
1.766s |
-2.8% |
-4.8% |
| Linear |
1.872s |
1.808s |
1.775s |
-3.4% |
-5.2% |
| Exponential |
2.071s |
1.884s |
1.797s |
-9.0% |
-13.2% |
| Normal |
2.035s |
1.911s |
1.816s |
-6.1% |
-10.8% |
| Cauchy |
1.854s |
1.816s |
1.792s |
-2.0% |
-3.3% |
| Student's T |
2.781s |
2.539s |
2.520s |
-8.7% |
-9.4% |
| TOTAL |
13.20s |
12.51s |
12.21s |
-5.3% |
-7.5% |
So the question is what to do here. Clearly there are benefits of the no-kd + SIMD code path, but perhaps we should consider removing the kd-traversal code path entirely when computing interactions? For added clarity-- by remove the kd-traversal code path, I only mean in calculating the interaction strengths, we still need to traverse it for the distances.
Starting a new issue here to cover some of what I've found out about performance of spatial interaction calculations that has grown out of my SIMD work (see #586 and associated PRs #578, #587, #588).
The context is that I started work on adding SIMD optimization to
InteractionType::FillSparseVectorForReceiverStrengths(), in particular theSpatialKernelType::kExponentialandSpatialKernelType::kNormalkernels, and in benchmarking my optimizations I noticed that the codeblock in that function that moves to a separate code pathif ((interaction_callbacks.size() == 0) && (spatiality_ == 2))might actually be slowing things down.Loosely, when
if ((interaction_callbacks.size() == 0) && (spatiality_ == 2))evaluates to true, spatial interactions are computed one at a time during kd-tree traversal. This means that only a single-pass through all neighbors is needed. However, if that codeblock is commented out, the interactions are computed in a second pass on the sparse vector like:Interestingly, with the default compiler optimizations, this two pass strategy seems to be faster, no doubt as a result of the compiler being able to auto-vectorize those loops.
Here are some representative results, first from a benchmark N~5000 individuals with intermediate densities (~2000 neighbors per individual):
Adding my SIMD optimizations to Exponential/Gaussian interactions on top of this lead to further time savings:
Savings are also present at lower densities. Here are results for N~5000 individuals with ~200 neighbors per individual:
@petrelharp suggested I increase the number of individuals present in the benchmarking simulations. Here are results from N~50,000 individuals with ~2000 neighbors / individual:
this reveals some performance regression of the no kd-branch in the Fixed kernel. Finally here is a last benchmark with N~50,000 individuals with ~20 neighbors/individual:
So the question is what to do here. Clearly there are benefits of the no-kd + SIMD code path, but perhaps we should consider removing the kd-traversal code path entirely when computing interactions? For added clarity-- by remove the kd-traversal code path, I only mean in calculating the interaction strengths, we still need to traverse it for the distances.