Backends: Execution Models & Performance

This document explains each execution backend, when to use it, and performance characteristics.

Backend Comparison
SerialBackend
ThreadedBackend
DistributedBackend
GPUBackend
AutoBackend
Performance Tuning

Backend Comparison

Backend	Best For	Parallelism	Memory Overhead	Setup Time
Serial	Development, debugging, small data (~10M points)	None	Minimal	Immediate
Threaded	Medium data (10M–500M points), shared-memory systems	Within-node threads	Low	~100ms (OhMyThreads)
Distributed	Large data (>500M points), multi-node clusters	Across nodes	Moderate	~1s (worker startup)
GPU	Very large data (>1B points), if GPU available	GPU device	Moderate	~500ms (kernel compile)
Auto	Unknown resource availability	Adaptive	Low	~100ms (detection)

SerialBackend

Definition

Single-threaded, reference implementation. All computations run on the calling thread.

When to Use

✅ Development & debugging: Deterministic, easy to profile
✅ Small datasets: <10M points (threads don't help much)
✅ Shared environments: Where spawning threads is discouraged
✅ Validation: Reference implementation for comparing other backends

When NOT to Use

❌ Large data (>10M points): Too slow
❌ Multi-CPU available: Wastes resources

Examples

1. Allocating API:

using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT

# Small test dataset
x = (randn(1000), randn(1000))
u = (randn(1000), randn(1000))
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]

# Calculate using SerialBackend explicitly
result = SFC.calculate_structure_function(
    SFT.S2SFType(),
    x, u, bins;
    backend=SFC.SerialBackend(),
    show_progress=true
)

println("Structure Function values: ", result.values)

2. Pre-allocated In-place API:

using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT

x = (randn(1000), randn(1000))
u = (randn(1000), randn(1000))
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]

# Pre-allocate output arrays
n_bins = length(bins)
sums = zeros(Float64, n_bins)
counts = zeros(Float64, n_bins)

# Compute in-place (accumulates directly into provided arrays)
SFC.calculate_structure_function!(
    sums, counts, SFT.S2SFType(),
    x, u, bins;
    backend=SFC.SerialBackend()
)

Performance Notes

O(N²) complexity; for N=1M, expect ~1 sec
Mutating calculate_structure_function! completely avoids allocating temporary arrays, making it ideal for temporal loops.

ThreadedBackend

Definition

Multi-threaded execution using OhMyThreads.jl. Distributes pairwise calculations across Threads.nthreads() worker threads.

When to Use

✅ Medium datasets: 10M–500M points
✅ Shared-memory systems: Single workstation, server
✅ Quick turnaround: Faster than serial, no cluster setup
✅ Memory-constrained: All threads can access same data

When NOT to Use

❌ Only 1 thread available: Falls back to serial (no benefit)
❌ Very large data (>500M): GPU/Distributed faster
❌ Cluster environment: Use DistributedBackend instead

Requirements

# Add to Project.toml if not present
[extras]
OhMyThreads = "67456a42-ebe4-4781-8ad1-67f7eda8d8f7"

Examples

1. Allocating API:

using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT

N = 50_000
x = (randn(N), randn(N))
u = (randn(N), randn(N))
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]

result = SFC.calculate_structure_function(
    SFT.L2SFType(),
    x, u, bins;
    backend=SFC.ThreadedBackend(),
    show_progress=true
)

2. Pre-allocated In-place API:

using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT

N = 50_000
x = (randn(N), randn(N))
u = (randn(N), randn(N))
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]

# Pre-allocate output arrays
n_bins = length(bins)
sums = zeros(Float64, n_bins)
counts = zeros(Float64, n_bins)

# Compute in-place (accumulates directly into provided arrays)
SFC.calculate_structure_function!(
    sums, counts, SFT.L2SFType(),
    x, u, bins;
    backend=SFC.ThreadedBackend()
)

Performance & Memory Efficiency

The modern mutating threaded backend (threaded_calculate_structure_function!) utilizes a chunked reduction strategy via OhMyThreads.chunks to divide point indexes into exactly nthreads() sub-ranges.

Chunked Workspaces: Each task/thread allocates exactly one local buffer pair for its entire chunk (rather than per-point).
Memory Scaling: This reduces the number of thread-local heap allocations to exactly $O(n_{\text{threads}})$, compared to the highly wasteful $O(N_{\text{points}})$ allocation pattern in naive map-reduce implementations.
Cache Locality: This optimization maximizes L1/L2 cache locality while maintaining complete thread safety and task-migration protection.

Thread Safety

ThreadedBackend uses thread-local reduction buffers to avoid race conditions:

Each task computes on its own local chunk workspace.
The results are folded together thread-safely using a parallel tree reduction.
No global locks or atomic conflicts are triggered, maximizing performance.

DistributedBackend

Definition

Multi-process execution using Distributed.jl. Distributes work across multiple Julia processes, potentially on different compute nodes.

When to Use

✅ Very large data: >500M points
✅ Cluster environments: HPC, cloud, multiple machines
✅ Limited per-node memory: Distribute data across nodes
✅ Need fault tolerance: Can add checkpointing

When NOT to Use

❌ Shared-memory system with <10 cores: ThreadedBackend faster
❌ Interactive/exploratory work: Higher latency
❌ Small data: Communication overhead kills performance

Setup

using StructureFunctions
using Distributed

# Start 4 worker processes (can be on different machines)
addprocs(4)

# Ensure StructureFunctions is loaded on all workers
@everywhere using StructureFunctions

# Large dataset (distributed across RAM)
N = 1_000_000_000  # 1 billion points
x = randn(N, 2)
u = randn(N, 2)

backend = DistributedBackend()
bins = 10:10:5000

@time result = calculate_structure_function(
    FullVectorStructureFunction{Float64}(order=2),
    x, u, bins;
    backend=backend,
    verbose=true
)

# Clean up
rmprocs(workers())

Cluster Job Submission

Example SLURM submission script:

#!/bin/bash
#SBATCH --nodes=4           # Request 4 nodes
#SBATCH --cpus-per-task=8   # 8 CPUs per node
#SBATCH --time=01:00:00     # 1 hour

export JULIA_NUM_THREADS=8  # Let each process use 8 threads

srun julia -p $((${SLURM_NNODES} * ${SLURM_CPUS_PER_TASK})) compute_sf.jl

Performance Notes

Communication overhead: ~50–200 ms per calculation (one-time cost)
Scales nearly linearly with process count (for large enough problems)
Best when N >> communication cost (i.e., N > 100M)

GPUBackend

Definition

GPU-accelerated computation using KernelAbstractions.jl. Fully portable kernels run on NVIDIA (CUDA), AMD (ROCm), or CPU fallback.

When to Use

✅ Massive datasets: >1 billion points
✅ GPUs available: NVIDIA A100, RTX 4090, AMD MI200, etc.
✅ Extreme speed needed: 10–100x speedup possible
✅ Research clusters: Many HPC centers provide GPUs

When NOT to Use

❌ No GPU: CPU fallback is actually slower than ThreadedBackend
❌ Data doesn't fit on GPU memory: Requires redistribution logic (future work)
❌ GPU unavailable in environment: Installation complexity

Requirements

# GPU execution requires one of:
# - CUDA.jl (NVIDIA GPUs)
# - AMDGPU.jl (AMD GPUs)
# - Metal.jl (Apple Silicon)

[extras]
KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1f7f1"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"  # For NVIDIA

Example: NVIDIA GPU

using StructureFunctions
using CUDA

# Ensure GPU is available
@assert CUDA.functional() "CUDA not available"

# Large dataset
N = 1_000_000_000  # 1 billion points
x_cpu = randn(Float32, N, 2)  # Float32 for GPU memory efficiency
u_cpu = randn(Float32, N, 2)

# Move to GPU
x_gpu = cu(x_cpu)
u_gpu = cu(u_cpu)

backend = GPUBackend()
bins = 10:10:5000

@time result = calculate_structure_function(
    FullVectorStructureFunction{Float32}(order=2),
    x_gpu, u_gpu, bins;
    backend=backend,
    show_progress=true
)

# Results automatically transferred back to CPU
println(typeof(result.structure_function))  # ::Matrix{Float32}

Performance Comparison

Measured on NVIDIA A100 GPU vs. CPU (48-core Xeon):

N	CPU (s)	A100-40GB (s)	Speedup
100M	50	2.5	20x
500M	250	10	25x
1B	500	18	28x

Notes:

Overhead: Kernel compilation ~500ms (amortized if many calls)
Memory: 1B Float32 points ≈ 4 GB GPU memory
Precision: Float32 sufficient for most applications (precision ~1e-6)

Kernel Details

GPUBackend executes distance-pair calculations in massively parallel kernels:

Each GPU thread processes one distance-pair bin
Tree reduction for combining results
Zero-copy result transfer via unified memory (Ampere+)

AutoBackend

Definition

Automatically selects the best backend based on available resources:

If Distributed.nworkers() > 1 → DistributedBackend
Else if Threads.nthreads() > 1 → ThreadedBackend
Else → SerialBackend

When to Use

✅ Generic libraries: Let code adapt to environment
✅ Unknown deployment: Works on laptop, cluster, or cloud
✅ Single shared script: No backend changes needed
✅ Production pipelines: Automatic resource utilization

Example

using StructureFunctions

# Same code runs on laptop (serial), workstation (threaded),
# or cluster (distributed) without changes!

x = randn(50_000_000, 2)
u = randn(50_000_000, 2)
bins = 10:10:500

result = calculate_structure_function(
    FullVectorStructureFunction{Float64}(order=2),
    x, u, bins;
    backend=AutoBackend(),  # Default—chooses best option
    show_progress=true
)

Algorithm

function select_backend()
    if nworkers() > 1
        return DistributedBackend()
    elseif Threads.nthreads() > 1
        return ThreadedBackend()
    else
        return SerialBackend()
    end
end

Notes

Default behavior (no backend kwarg) also uses AutoBackend
Detection is fast (~1 ms)
Users can override with explicit backend if needed

Performance Tuning

Choice Decision Tree

Data size?
├─ < 10M       → SerialBackend (or ThreadedBackend if multi-core)
├─ 10M–500M    → ThreadedBackend (or AutoBackend to auto-select)
├─ 500M–1B     → DistributedBackend (or GPUBackend if GPU available)
└─ > 1B        → GPUBackend (or DistributedBackend if no GPU)

Memory Considerations

Backend	Memory per Point (approx)	Total for 1B
Serial	20 bytes (buffers only)	20 GB
Threaded	20 bytes (shared data)	20 GB
Distributed	20 bytes (per node)	20 GB / N_nodes
GPU	8 bytes (Float32)	8 GB

Thread/Process Count Tuning

ThreadedBackend: Use Threads.nthreads() = CPU_count - 1 to leave resources for OS

# Start Julia with specific thread count
JULIA_NUM_THREADS=7 julia script.jl  # On 8-core machine

DistributedBackend: Tune addprocs(N) based on cluster resources

# Rule of thumb: N ~= (total_cores / 2) for interactive work
# N ~= total_cores for batch jobs
addprocs(32)  # On 64-core system for batch

Profiling

Use @time or @profile to measure:

@time result = calculate_structure_function(...)

# For detailed timing:
using ProfilingTools
@profile calculate_structure_function(...)

# Generate flame graph
ProfileCanvas.show()

FilesExpand file tree

backends.md

Latest commit

History

backends.md

File metadata and controls

Backends: Execution Models & Performance

Table of Contents

Backend Comparison

SerialBackend

Definition

When to Use

When NOT to Use

Examples

Performance Notes

ThreadedBackend

Definition

When to Use

When NOT to Use

Requirements

Examples

Performance & Memory Efficiency

Thread Safety

DistributedBackend

Definition

When to Use

When NOT to Use

Setup

Cluster Job Submission

Performance Notes

GPUBackend

Definition

When to Use

When NOT to Use

Requirements

Example: NVIDIA GPU

Performance Comparison

Kernel Details

AutoBackend

Definition

When to Use

Example

Algorithm

Notes

Performance Tuning

Choice Decision Tree

Memory Considerations

Thread/Process Count Tuning

Profiling

Related Topics