Skip to content

Latest commit

 

History

History
447 lines (331 loc) · 12.3 KB

File metadata and controls

447 lines (331 loc) · 12.3 KB

Backends: Execution Models & Performance

This document explains each execution backend, when to use it, and performance characteristics.

Table of Contents


Backend Comparison

Backend Best For Parallelism Memory Overhead Setup Time
Serial Development, debugging, small data (~10M points) None Minimal Immediate
Threaded Medium data (10M–500M points), shared-memory systems Within-node threads Low ~100ms (OhMyThreads)
Distributed Large data (>500M points), multi-node clusters Across nodes Moderate ~1s (worker startup)
GPU Very large data (>1B points), if GPU available GPU device Moderate ~500ms (kernel compile)
Auto Unknown resource availability Adaptive Low ~100ms (detection)

SerialBackend

Definition

Single-threaded, reference implementation. All computations run on the calling thread.

When to Use

  • Development & debugging: Deterministic, easy to profile
  • Small datasets: <10M points (threads don't help much)
  • Shared environments: Where spawning threads is discouraged
  • Validation: Reference implementation for comparing other backends

When NOT to Use

  • ❌ Large data (>10M points): Too slow
  • ❌ Multi-CPU available: Wastes resources

Examples

1. Allocating API:

using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT

# Small test dataset
x = (randn(1000), randn(1000))
u = (randn(1000), randn(1000))
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]

# Calculate using SerialBackend explicitly
result = SFC.calculate_structure_function(
    SFT.S2SFType(),
    x, u, bins;
    backend=SFC.SerialBackend(),
    show_progress=true
)

println("Structure Function values: ", result.values)

2. Pre-allocated In-place API:

using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT

x = (randn(1000), randn(1000))
u = (randn(1000), randn(1000))
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]

# Pre-allocate output arrays
n_bins = length(bins)
sums = zeros(Float64, n_bins)
counts = zeros(Float64, n_bins)

# Compute in-place (accumulates directly into provided arrays)
SFC.calculate_structure_function!(
    sums, counts, SFT.S2SFType(),
    x, u, bins;
    backend=SFC.SerialBackend()
)

Performance Notes

  • O(N²) complexity; for N=1M, expect ~1 sec
  • Mutating calculate_structure_function! completely avoids allocating temporary arrays, making it ideal for temporal loops.

ThreadedBackend

Definition

Multi-threaded execution using OhMyThreads.jl. Distributes pairwise calculations across Threads.nthreads() worker threads.

When to Use

  • Medium datasets: 10M–500M points
  • Shared-memory systems: Single workstation, server
  • Quick turnaround: Faster than serial, no cluster setup
  • Memory-constrained: All threads can access same data

When NOT to Use

  • ❌ Only 1 thread available: Falls back to serial (no benefit)
  • ❌ Very large data (>500M): GPU/Distributed faster
  • ❌ Cluster environment: Use DistributedBackend instead

Requirements

# Add to Project.toml if not present
[extras]
OhMyThreads = "67456a42-ebe4-4781-8ad1-67f7eda8d8f7"

Examples

1. Allocating API:

using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT

N = 50_000
x = (randn(N), randn(N))
u = (randn(N), randn(N))
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]

result = SFC.calculate_structure_function(
    SFT.L2SFType(),
    x, u, bins;
    backend=SFC.ThreadedBackend(),
    show_progress=true
)

2. Pre-allocated In-place API:

using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT

N = 50_000
x = (randn(N), randn(N))
u = (randn(N), randn(N))
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]

# Pre-allocate output arrays
n_bins = length(bins)
sums = zeros(Float64, n_bins)
counts = zeros(Float64, n_bins)

# Compute in-place (accumulates directly into provided arrays)
SFC.calculate_structure_function!(
    sums, counts, SFT.L2SFType(),
    x, u, bins;
    backend=SFC.ThreadedBackend()
)

Performance & Memory Efficiency

The modern mutating threaded backend (threaded_calculate_structure_function!) utilizes a chunked reduction strategy via OhMyThreads.chunks to divide point indexes into exactly nthreads() sub-ranges.

  • Chunked Workspaces: Each task/thread allocates exactly one local buffer pair for its entire chunk (rather than per-point).
  • Memory Scaling: This reduces the number of thread-local heap allocations to exactly $O(n_{\text{threads}})$, compared to the highly wasteful $O(N_{\text{points}})$ allocation pattern in naive map-reduce implementations.
  • Cache Locality: This optimization maximizes L1/L2 cache locality while maintaining complete thread safety and task-migration protection.

Thread Safety

ThreadedBackend uses thread-local reduction buffers to avoid race conditions:

  • Each task computes on its own local chunk workspace.
  • The results are folded together thread-safely using a parallel tree reduction.
  • No global locks or atomic conflicts are triggered, maximizing performance.

DistributedBackend

Definition

Multi-process execution using Distributed.jl. Distributes work across multiple Julia processes, potentially on different compute nodes.

When to Use

  • Very large data: >500M points
  • Cluster environments: HPC, cloud, multiple machines
  • Limited per-node memory: Distribute data across nodes
  • Need fault tolerance: Can add checkpointing

When NOT to Use

  • ❌ Shared-memory system with <10 cores: ThreadedBackend faster
  • ❌ Interactive/exploratory work: Higher latency
  • ❌ Small data: Communication overhead kills performance

Setup

using StructureFunctions
using Distributed

# Start 4 worker processes (can be on different machines)
addprocs(4)

# Ensure StructureFunctions is loaded on all workers
@everywhere using StructureFunctions

# Large dataset (distributed across RAM)
N = 1_000_000_000  # 1 billion points
x = randn(N, 2)
u = randn(N, 2)

backend = DistributedBackend()
bins = 10:10:5000

@time result = calculate_structure_function(
    FullVectorStructureFunction{Float64}(order=2),
    x, u, bins;
    backend=backend,
    verbose=true
)

# Clean up
rmprocs(workers())

Cluster Job Submission

Example SLURM submission script:

#!/bin/bash
#SBATCH --nodes=4           # Request 4 nodes
#SBATCH --cpus-per-task=8   # 8 CPUs per node
#SBATCH --time=01:00:00     # 1 hour

export JULIA_NUM_THREADS=8  # Let each process use 8 threads

srun julia -p $((${SLURM_NNODES} * ${SLURM_CPUS_PER_TASK})) compute_sf.jl

Performance Notes

  • Communication overhead: ~50–200 ms per calculation (one-time cost)
  • Scales nearly linearly with process count (for large enough problems)
  • Best when N >> communication cost (i.e., N > 100M)

GPUBackend

Definition

GPU-accelerated computation using KernelAbstractions.jl. Fully portable kernels run on NVIDIA (CUDA), AMD (ROCm), or CPU fallback.

When to Use

  • Massive datasets: >1 billion points
  • GPUs available: NVIDIA A100, RTX 4090, AMD MI200, etc.
  • Extreme speed needed: 10–100x speedup possible
  • Research clusters: Many HPC centers provide GPUs

When NOT to Use

  • ❌ No GPU: CPU fallback is actually slower than ThreadedBackend
  • ❌ Data doesn't fit on GPU memory: Requires redistribution logic (future work)
  • ❌ GPU unavailable in environment: Installation complexity

Requirements

# GPU execution requires one of:
# - CUDA.jl (NVIDIA GPUs)
# - AMDGPU.jl (AMD GPUs)
# - Metal.jl (Apple Silicon)

[extras]
KernelAbstractions = "63c18a36-062a-441e-b654-da1e3ab1f7f1"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"  # For NVIDIA

Example: NVIDIA GPU

using StructureFunctions
using CUDA

# Ensure GPU is available
@assert CUDA.functional() "CUDA not available"

# Large dataset
N = 1_000_000_000  # 1 billion points
x_cpu = randn(Float32, N, 2)  # Float32 for GPU memory efficiency
u_cpu = randn(Float32, N, 2)

# Move to GPU
x_gpu = cu(x_cpu)
u_gpu = cu(u_cpu)

backend = GPUBackend()
bins = 10:10:5000

@time result = calculate_structure_function(
    FullVectorStructureFunction{Float32}(order=2),
    x_gpu, u_gpu, bins;
    backend=backend,
    show_progress=true
)

# Results automatically transferred back to CPU
println(typeof(result.structure_function))  # ::Matrix{Float32}

Performance Comparison

Measured on NVIDIA A100 GPU vs. CPU (48-core Xeon):

N CPU (s) A100-40GB (s) Speedup
100M 50 2.5 20x
500M 250 10 25x
1B 500 18 28x

Notes:

  • Overhead: Kernel compilation ~500ms (amortized if many calls)
  • Memory: 1B Float32 points ≈ 4 GB GPU memory
  • Precision: Float32 sufficient for most applications (precision ~1e-6)

Kernel Details

GPUBackend executes distance-pair calculations in massively parallel kernels:

  • Each GPU thread processes one distance-pair bin
  • Tree reduction for combining results
  • Zero-copy result transfer via unified memory (Ampere+)

AutoBackend

Definition

Automatically selects the best backend based on available resources:

  1. If Distributed.nworkers() > 1 → DistributedBackend
  2. Else if Threads.nthreads() > 1 → ThreadedBackend
  3. Else → SerialBackend

When to Use

  • Generic libraries: Let code adapt to environment
  • Unknown deployment: Works on laptop, cluster, or cloud
  • Single shared script: No backend changes needed
  • Production pipelines: Automatic resource utilization

Example

using StructureFunctions

# Same code runs on laptop (serial), workstation (threaded),
# or cluster (distributed) without changes!

x = randn(50_000_000, 2)
u = randn(50_000_000, 2)
bins = 10:10:500

result = calculate_structure_function(
    FullVectorStructureFunction{Float64}(order=2),
    x, u, bins;
    backend=AutoBackend(),  # Default—chooses best option
    show_progress=true
)

Algorithm

function select_backend()
    if nworkers() > 1
        return DistributedBackend()
    elseif Threads.nthreads() > 1
        return ThreadedBackend()
    else
        return SerialBackend()
    end
end

Notes

  • Default behavior (no backend kwarg) also uses AutoBackend
  • Detection is fast (~1 ms)
  • Users can override with explicit backend if needed

Performance Tuning

Choice Decision Tree

Data size?
├─ < 10M       → SerialBackend (or ThreadedBackend if multi-core)
├─ 10M–500M    → ThreadedBackend (or AutoBackend to auto-select)
├─ 500M–1B     → DistributedBackend (or GPUBackend if GPU available)
└─ > 1B        → GPUBackend (or DistributedBackend if no GPU)

Memory Considerations

Backend Memory per Point (approx) Total for 1B
Serial 20 bytes (buffers only) 20 GB
Threaded 20 bytes (shared data) 20 GB
Distributed 20 bytes (per node) 20 GB / N_nodes
GPU 8 bytes (Float32) 8 GB

Thread/Process Count Tuning

ThreadedBackend: Use Threads.nthreads() = CPU_count - 1 to leave resources for OS

# Start Julia with specific thread count
JULIA_NUM_THREADS=7 julia script.jl  # On 8-core machine

DistributedBackend: Tune addprocs(N) based on cluster resources

# Rule of thumb: N ~= (total_cores / 2) for interactive work
# N ~= total_cores for batch jobs
addprocs(32)  # On 64-core system for batch

Profiling

Use @time or @profile to measure:

@time result = calculate_structure_function(...)

# For detailed timing:
using ProfilingTools
@profile calculate_structure_function(...)

# Generate flame graph
ProfileCanvas.show()

Related Topics