Skip to content

optimizing spatial interaction kernel calculation -- during kd-tree traversal or no? #589

Description

@andrewkern

Starting a new issue here to cover some of what I've found out about performance of spatial interaction calculations that has grown out of my SIMD work (see #586 and associated PRs #578, #587, #588).

The context is that I started work on adding SIMD optimization to InteractionType::FillSparseVectorForReceiverStrengths(), in particular the SpatialKernelType::kExponential and SpatialKernelType::kNormal kernels, and in benchmarking my optimizations I noticed that the codeblock in that function that moves to a separate code path if ((interaction_callbacks.size() == 0) && (spatiality_ == 2)) might actually be slowing things down.

Loosely, when if ((interaction_callbacks.size() == 0) && (spatiality_ == 2)) evaluates to true, spatial interactions are computed one at a time during kd-tree traversal. This means that only a single-pass through all neighbors is needed. However, if that codeblock is commented out, the interactions are computed in a second pass on the sparse vector like:

for (uint32_t col_iter = 0; col_iter < nnz; ++col_iter){
	sv_value_t distance = values[col_iter];
	values[col_iter] = (sv_value_t)(if_param1_ * (1.0 - distance / max_distance_));
}

Interestingly, with the default compiler optimizations, this two pass strategy seems to be faster, no doubt as a result of the compiler being able to auto-vectorize those loops.

Here are some representative results, first from a benchmark N~5000 individuals with intermediate densities (~2000 neighbors per individual):

Kernel Original No kd-branch (Scalar) Scalar Δ
Fixed 1.997s 2.152s +7.8%
Linear 2.449s 2.288s -6.6%
Exponential 4.107s 2.812s -31.5%
Normal 3.879s 2.992s -22.9%
Cauchy 2.693s 2.459s -8.7%
Student's T 9.616s 7.965s -17.2%
TOTAL 24.785s 20.709s -16.4%

Adding my SIMD optimizations to Exponential/Gaussian interactions on top of this lead to further time savings:

Kernel Original No kd-branch (Scalar) No kd-branch (SIMD) Scalar Δ SIMD vs Orig
Fixed 1.997s 2.152s 2.106s +7.8% +5.5%
Linear 2.449s 2.288s 2.183s -6.6% -10.9%
Exponential 4.107s 2.812s 2.268s -31.5% -44.8%
Normal 3.879s 2.992s 2.285s -22.9% -41.1%
Cauchy 2.693s 2.459s 2.272s -8.7% -15.6%
Student's T 9.616s 7.965s 7.941s -17.2% -17.4%
TOTAL 24.785s 20.709s 19.095s -16.4% -23.0%

Savings are also present at lower densities. Here are results for N~5000 individuals with ~200 neighbors per individual:

Kernel Original No kd-branch (Scalar) No kd-branch (SIMD) Scalar Δ SIMD vs Orig
Fixed 0.511s 0.503s 0.485s -1.6% -5.1%
Linear 0.562s 0.517s 0.494s -8.0% -12.1%
Exponential 0.752s 0.587s 0.506s -21.9% -32.7%
Normal 0.724s 0.610s 0.508s -15.7% -29.8%
Cauchy 0.582s 0.537s 0.504s -7.7% -13.4%
Student's T 1.483s 1.267s 1.253s -14.6% -15.5%
TOTAL 4.653s 4.063s 3.788s -12.7% -18.6%

@petrelharp suggested I increase the number of individuals present in the benchmarking simulations. Here are results from N~50,000 individuals with ~2000 neighbors / individual:

Kernel Original No kd-branch (Scalar) No kd-branch (SIMD) Scalar Δ SIMD vs Orig
Fixed 31.97s 36.02s 32.98s +12.7% +3.2%
Linear 37.26s 37.93s 34.05s +1.8% -8.6%
Exponential 59.58s 44.72s 35.17s -24.9% -41.0%
Normal 56.37s 47.19s 35.31s -16.3% -37.4%
Cauchy 40.04s 40.20s 35.11s +0.4% -12.3%
Student's T 130.10s 112.15s 109.68s -13.8% -15.7%
TOTAL 356.04s 318.95s 282.98s -10.4% -20.5%

this reveals some performance regression of the no kd-branch in the Fixed kernel. Finally here is a last benchmark with N~50,000 individuals with ~20 neighbors/individual:

Kernel Original No kd-branch (Scalar) No kd-branch (SIMD) Scalar Δ SIMD vs Orig
Fixed 1.856s 1.804s 1.766s -2.8% -4.8%
Linear 1.872s 1.808s 1.775s -3.4% -5.2%
Exponential 2.071s 1.884s 1.797s -9.0% -13.2%
Normal 2.035s 1.911s 1.816s -6.1% -10.8%
Cauchy 1.854s 1.816s 1.792s -2.0% -3.3%
Student's T 2.781s 2.539s 2.520s -8.7% -9.4%
TOTAL 13.20s 12.51s 12.21s -5.3% -7.5%

So the question is what to do here. Clearly there are benefits of the no-kd + SIMD code path, but perhaps we should consider removing the kd-traversal code path entirely when computing interactions? For added clarity-- by remove the kd-traversal code path, I only mean in calculating the interaction strengths, we still need to traverse it for the distances.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions