optimizing spatial interaction kernel calculation -- during kd-tree traversal or no?

Starting a new issue here to cover some of what I've found out about performance of spatial interaction calculations that has grown out of my SIMD work (see #586 and associated PRs #578, #587, #588). 

The context is that I started work on adding SIMD optimization to `InteractionType::FillSparseVectorForReceiverStrengths()`, in particular the `SpatialKernelType::kExponential` and `SpatialKernelType::kNormal` kernels, and in benchmarking my optimizations I noticed that the codeblock in that function that moves to a separate code path `if ((interaction_callbacks.size() == 0) && (spatiality_ == 2))` might actually be slowing things down. 

Loosely, when `if ((interaction_callbacks.size() == 0) && (spatiality_ == 2))` evaluates to true, spatial interactions are computed one at a time during kd-tree traversal. This means that only a single-pass through all neighbors is needed. However, if that codeblock is commented out, the interactions are computed in a second pass on the sparse vector like:
```C
for (uint32_t col_iter = 0; col_iter < nnz; ++col_iter){
	sv_value_t distance = values[col_iter];
	values[col_iter] = (sv_value_t)(if_param1_ * (1.0 - distance / max_distance_));
}
```

Interestingly, with the default compiler optimizations, this two pass strategy seems to be faster, no doubt as a result of the compiler being able to auto-vectorize those loops. 

Here are some representative results, first from a benchmark N~5000 individuals with intermediate densities (~2000 neighbors per individual):

  | Kernel      | Original | No kd-branch (Scalar) | Scalar Δ | 
  |-------------|----------|--------------|------------|
  | Fixed       | 1.997s   | 2.152s        | +7.8% | 
  | Linear      | 2.449s   | 2.288s        | -6.6%    | 
  | Exponential | 4.107s   | 2.812s        | -31.5%   | 
  | Normal      | 3.879s   | 2.992s         | -22.9%   | 
  | Cauchy      | 2.693s   | 2.459s           | -8.7%    | 
  | Student's T | 9.616s   | 7.965s          | -17.2%   | 
  | TOTAL       | 24.785s  | 20.709s         | -16.4%   | 

Adding my SIMD optimizations to Exponential/Gaussian interactions on top of this lead to further time savings:

  | Kernel      | Original | No kd-branch (Scalar) | No kd-branch (SIMD) | Scalar Δ | SIMD vs Orig |
  |-------------|----------|--------------|------------|----------|--------------|
  | Fixed       | 1.997s   | 2.152s       | 2.106s     | +7.8%   | +5.5%      |
  | Linear      | 2.449s   | 2.288s       | 2.183s     | -6.6%    | -10.9%       |
  | Exponential | 4.107s   | 2.812s       | 2.268s     | -31.5%   | -44.8%       |
  | Normal      | 3.879s   | 2.992s       | 2.285s     | -22.9%   | -41.1%       |
  | Cauchy      | 2.693s   | 2.459s       | 2.272s     | -8.7%    | -15.6%       |
  | Student's T | 9.616s   | 7.965s       | 7.941s     | -17.2%   | -17.4%       |
  | TOTAL       | 24.785s  | 20.709s      | 19.095s    | -16.4%   | -23.0%       |

Savings are also present at lower densities. Here are results for N~5000 individuals with ~200 neighbors per individual:

  | Kernel      | Original | No kd-branch (Scalar) | No kd-branch (SIMD) | Scalar Δ | SIMD vs Orig |
  |-------------|----------|--------------|------------|----------|--------------|
  | Fixed       | 0.511s   | 0.503s       | 0.485s     | -1.6%    | -5.1%        |
  | Linear      | 0.562s   | 0.517s       | 0.494s     | -8.0%    | -12.1%       |
  | Exponential | 0.752s   | 0.587s       | 0.506s     | -21.9%   | -32.7%       |
  | Normal      | 0.724s   | 0.610s       | 0.508s     | -15.7%   | -29.8%       |
  | Cauchy      | 0.582s   | 0.537s       | 0.504s     | -7.7%    | -13.4%       |
  | Student's T | 1.483s   | 1.267s       | 1.253s     | -14.6%   | -15.5%       |
  | TOTAL       | 4.653s   | 4.063s       | 3.788s     | -12.7%   | -18.6%       |

@petrelharp suggested I increase the number of individuals present in the benchmarking simulations. Here are results from N~50,000 individuals with ~2000 neighbors / individual:

  | Kernel      | Original | No kd-branch (Scalar) | No kd-branch (SIMD) | Scalar Δ  | SIMD vs Orig |
  |-------------|----------|--------------|------------|-----------|--------------|
  | Fixed       | 31.97s   | 36.02s       | 32.98s     | +12.7%  | +3.2%        |
  | Linear      | 37.26s   | 37.93s       | 34.05s     | +1.8%     | -8.6%        |
  | Exponential | 59.58s   | 44.72s       | 35.17s     | -24.9%    | -41.0%       |
  | Normal      | 56.37s   | 47.19s       | 35.31s     | -16.3%    | -37.4%       |
  | Cauchy      | 40.04s   | 40.20s       | 35.11s     | +0.4%     | -12.3%       |
  | Student's T | 130.10s  | 112.15s      | 109.68s    | -13.8%    | -15.7%       |
  | TOTAL       | 356.04s  | 318.95s      | 282.98s    | -10.4%    | -20.5%       |

this reveals some performance regression of the no kd-branch in the Fixed kernel. Finally here is a last benchmark with N~50,000 individuals with ~20 neighbors/individual:

  | Kernel      | Original | No kd-branch (Scalar) | No kd-branch (SIMD) | Scalar Δ | SIMD vs Orig |
  |-------------|----------|--------------|------------|----------|--------------|
  | Fixed       | 1.856s   | 1.804s       | 1.766s     | -2.8%    | -4.8%        |
  | Linear      | 1.872s   | 1.808s       | 1.775s     | -3.4%    | -5.2%        |
  | Exponential | 2.071s   | 1.884s       | 1.797s     | -9.0%    | -13.2%       |
  | Normal      | 2.035s   | 1.911s       | 1.816s     | -6.1%    | -10.8%       |
  | Cauchy      | 1.854s   | 1.816s       | 1.792s     | -2.0%    | -3.3%        |
  | Student's T | 2.781s   | 2.539s       | 2.520s     | -8.7%    | -9.4%        |
  | TOTAL       | 13.20s   | 12.51s       | 12.21s     | -5.3%    | -7.5%        |

So the question is what to do here. Clearly there are benefits of the no-kd + SIMD code path, but perhaps we should consider removing the kd-traversal code path entirely when computing interactions?  For added clarity-- by remove the kd-traversal code path, I only mean in calculating the interaction strengths, we still need to traverse it for the distances. 


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimizing spatial interaction kernel calculation -- during kd-tree traversal or no? #589

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Kernel	Original	No kd-branch (Scalar)	Scalar Δ
Fixed	1.997s	2.152s	+7.8%
Linear	2.449s	2.288s	-6.6%
Exponential	4.107s	2.812s	-31.5%
Normal	3.879s	2.992s	-22.9%
Cauchy	2.693s	2.459s	-8.7%
Student's T	9.616s	7.965s	-17.2%
TOTAL	24.785s	20.709s	-16.4%

Kernel	Original	No kd-branch (Scalar)	No kd-branch (SIMD)	Scalar Δ	SIMD vs Orig
Fixed	0.511s	0.503s	0.485s	-1.6%	-5.1%
Linear	0.562s	0.517s	0.494s	-8.0%	-12.1%
Exponential	0.752s	0.587s	0.506s	-21.9%	-32.7%
Normal	0.724s	0.610s	0.508s	-15.7%	-29.8%
Cauchy	0.582s	0.537s	0.504s	-7.7%	-13.4%
Student's T	1.483s	1.267s	1.253s	-14.6%	-15.5%
TOTAL	4.653s	4.063s	3.788s	-12.7%	-18.6%

Kernel	Original	No kd-branch (Scalar)	No kd-branch (SIMD)	Scalar Δ	SIMD vs Orig
Fixed	31.97s	36.02s	32.98s	+12.7%	+3.2%
Linear	37.26s	37.93s	34.05s	+1.8%	-8.6%
Exponential	59.58s	44.72s	35.17s	-24.9%	-41.0%
Normal	56.37s	47.19s	35.31s	-16.3%	-37.4%
Cauchy	40.04s	40.20s	35.11s	+0.4%	-12.3%
Student's T	130.10s	112.15s	109.68s	-13.8%	-15.7%
TOTAL	356.04s	318.95s	282.98s	-10.4%	-20.5%

Kernel	Original	No kd-branch (Scalar)	No kd-branch (SIMD)	Scalar Δ	SIMD vs Orig
Fixed	1.856s	1.804s	1.766s	-2.8%	-4.8%
Linear	1.872s	1.808s	1.775s	-3.4%	-5.2%
Exponential	2.071s	1.884s	1.797s	-9.0%	-13.2%
Normal	2.035s	1.911s	1.816s	-6.1%	-10.8%
Cauchy	1.854s	1.816s	1.792s	-2.0%	-3.3%
Student's T	2.781s	2.539s	2.520s	-8.7%	-9.4%
TOTAL	13.20s	12.51s	12.21s	-5.3%	-7.5%

optimizing spatial interaction kernel calculation -- during kd-tree traversal or no? #589

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions