update docs

jbphyswx · jbphyswx · commit d9e692af0a13 · 2026-05-29T03:37:43.000-07:00
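A minimal sketch of the accumulate-then-divide contract and the chunked reduction that the updated docs describe. It uses only `Base.Threads`, not the package. `accumulate_pairs!` and `chunked_accumulate!` are hypothetical stand-ins, and the half-open bin convention and scalar 1D field are assumptions of this sketch rather than the package's behavior.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical stand-in kernel (not the package's implementation): bins squared
# increments of a scalar field `u` at 1D positions `x` by pair separation.
function accumulate_pairs!(sums, counts, x, u, bins, irange)
    n = length(x)
    for i in irange, j in (i + 1):n
        r = abs(x[j] - x[i])
        for (b, (lo, hi)) in enumerate(bins)
            if lo <= r < hi              # half-open bins: an assumption of this sketch
                sums[b] += (u[j] - u[i])^2
                counts[b] += 1
                break                    # leaves only the bin search loop
            end
        end
    end
    return nothing
end

# Chunked reduction: one local buffer pair per chunk, merged into the
# caller's buffers with `.+=` once each task has finished.
function chunked_accumulate!(sums, counts, x, u, bins; nchunks = nthreads())
    n = length(x)
    len = max(1, cld(n, nchunks))
    ranges = [i:min(i + len - 1, n) for i in 1:len:n]
    tasks = map(ranges) do irange
        @spawn begin
            local_sums = zeros(eltype(sums), length(sums))
            local_counts = zeros(eltype(counts), length(counts))
            accumulate_pairs!(local_sums, local_counts, x, u, bins, irange)
            (local_sums, local_counts)
        end
    end
    for t in tasks
        s, c = fetch(t)
        sums .+= s
        counts .+= c
    end
    return nothing
end

x = [0.0, 1.0, 2.0]
u = [1.0, 1.1, 1.2]
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
sums = zeros(3)
counts = zeros(3)

for step in 1:2                          # e.g. a loop over timesteps
    fill!(sums, 0)                       # accumulating API: caller resets buffers
    fill!(counts, 0)
    chunked_accumulate!(sums, counts, x, u, bins)
end

# Guarded division: empty bins become NaN explicitly rather than via 0/0.
sf_values = [c > 0 ? s / c : NaN for (s, c) in zip(sums, counts)]
```

With these inputs no pair separation falls in the first bin, so its value is `NaN`. Dropping the `fill!` calls would instead aggregate both iterations into the same buffers, which is the behavior to expect from an accumulating API.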
diff --git a/README.md b/README.md
@@ -25,6 +25,8 @@ StructureFunctions.jl computes structure functions (SFs) from scattered data, ch
 ## Features
 
 - **Structure Functions**: 1st, 2nd, 3rd order; longitudinal & transverse projections in 1D, 2D, 3D
+- **In-place Mutating API**: Mutating functions (`calculate_structure_function!`) that accumulate into caller-provided buffers, so loops allocate no result arrays per call (the threaded backend still allocates O(n_threads) chunk-local buffers)
+- **2D Joint-Probability Binning**: Accumulates both sums and pair counts on a joint grid of distance bins and structure-function value bins (`StructureFunction2D`)
 - **Typed Backend System**: Serial, Threaded, Distributed, GPU, Auto — choose your parallelization strategy
 - **Type-Stable Dispatch**: No runtime overhead from symbolic dispatch; all paths validated with JET
 - **Extensible Architecture**: Optional extensions for parallelization and GPU acceleration
@@ -60,6 +62,31 @@ if nthreads() > 1
 end
 ```
 
+### Pre-allocated In-place Calculation
+
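+For high-performance loops (e.g. over timesteps), you can pre-allocate the output buffers once and reuse them across calls, so no result arrays are allocated per iteration (the threaded backend still allocates one small buffer pair per chunk):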
+
+```julia
+using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT
+
+x = ([0.0, 1.0, 2.0], [0.0, 0.0, 0.0])
+u = ([1.0, 1.1, 1.2], [0.0, 0.05, 0.1])
+bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
+sf_type = SFT.L2SFType()
+
+# Pre-allocate output arrays
+n_bins = length(bins)
+sums = zeros(Float64, n_bins)
+counts = zeros(Float64, n_bins)
+
+# Compute in-place (accumulates into the provided buffers; re-zero them with fill! before reuse)
+SFC.calculate_structure_function!(sums, counts, sf_type, x, u, bins; backend=SFC.ThreadedBackend())
+
+# Obtain structure function values via division (bins with zero counts give NaN)
+sf_values = sums ./ counts
+```
+
+
 ## Architecture
 
 ### Operator Types ✕ Result Container Pattern
@@ -192,7 +219,9 @@ result = SFC.calculate_structure_function(sf_type, x, u, bins;
 
 ## API Reference
 
-### Main Entry Point
+### Main Entry Points
+
+**1. Standard Allocating API:**
 
 ```julia
 calculate_structure_function(sf_type::AbstractStructureFunctionType,
@@ -206,35 +235,73 @@ calculate_structure_function(sf_type::AbstractStructureFunctionType,
                             show_progress=true) → StructureFunction
 ```
 
-**Arguments**:
-- `sf_type`: Operator instance (e.g., `LongitudinalSecondOrderStructureFunctionType()`)
-- `x`: Position data (Tuple of 1D vectors OR N×M matrix for N dimensions, M points)
-- `u`: Velocity/field data (same shape as `x`)
-- `distance_bins`: Vector of `(r_min, r_max)` tuples defining bins
+**2. 2D Joint-Probability Allocating API:**
+
+```julia
+calculate_structure_function(sf_type::AbstractStructureFunctionType,
+                            x::Union{Tuple, Matrix},
+                            u::Union{Tuple, Matrix},
+                            distance_bins::AbstractVector{<:Tuple},
+                            value_bins::AbstractVector;
+                            backend=SerialBackend(),
+                            distance_metric=Euclidean(),
+                            verbose=true,
+                            show_progress=true) → StructureFunction2D
+```
+
+**3. In-place Mutating API (Zero-Allocation):**
 
-**Returns**: `StructureFunction` result container
+```julia
+calculate_structure_function!(sums::AbstractVector,
+                             counts::AbstractVector,
+                             sf_type::AbstractStructureFunctionType,
+                             x::Union{Tuple, Matrix},
+                             u::Union{Tuple, Matrix},
+                             distance_bins::AbstractVector;
+                             backend=SerialBackend(),
+                             distance_metric=Euclidean(),
+                             verbose=true,
+                             show_progress=true) → Nothing
+```
+
+**4. 2D Joint-Probability Mutating API (Zero-Allocation):**
+
+```julia
+calculate_structure_function!(sums_2d::AbstractMatrix,
+                             counts_2d::AbstractMatrix,
+                             sf_type::AbstractStructureFunctionType,
+                             x::Union{Tuple, Matrix},
+                             u::Union{Tuple, Matrix},
+                             distance_bins::AbstractVector,
+                             value_bins::AbstractVector;
+                             backend=SerialBackend(),
+                             distance_metric=Euclidean(),
+                             verbose=true,
+                             show_progress=true) → Nothing
+```
 
-**See also**: `serial_calculate_structure_function`, `parallel_calculate_structure_function`, `gpu_calculate_structure_function`
+*Note: The mutating APIs accumulate (`+=` and `.+=`) directly into the provided output buffers. The caller is responsible for zeroing the arrays before the first call and again before each independent calculation.*
 
 ### Operator Types
 
-All inherit from `AbstractStructureFunctionType`. Instantiate with `()`:
+All inherit from `AbstractStructureFunctionType`. Instantiate with `()` or use shorthands:
 
 ```julia
 SFT.LongitudinalSecondOrderStructureFunctionType()    # 2nd order, longitudinal
 SFT.TransverseSecondOrderStructureFunctionType()      # 2nd order, transverse
 SFT.LongitudinalThirdOrderStructureFunctionType()     # 3rd order, longitudinal
-SFT.TransverseThirdOrderStructureFunctionType()       # 3rd order, transverse
-# ... and other variants (see docs/theory.md)
+# ... shorthands: L2SFType, T2SFType, L3SFType, T3SFType, S2SFType, S3SFType
 ```
 
 Each operator is **callable** (functors):
 ```julia
-sf_op = SFT.LongitudinalSecondOrderStructureFunctionType()
-sf_op(du, rhat)  # Equivalent to: calculate_structure_function(sf_op, ...)
+sf_op = SFT.L2SFType()
+sf_op(du, rhat)  # Computes L2SF increment value
 ```
 
-### Result Container
+### Result Containers
+
+**1. 1D Structure Function Container (`StructureFunction`):**
 
 ```julia
 struct StructureFunction{FT, OT, BT, VT} <: AbstractStructureFunction
@@ -245,12 +312,27 @@ struct StructureFunction{FT, OT, BT, VT} <: AbstractStructureFunction
 end
 ```
 
+**2. 2D Joint-Probability Container (`StructureFunction2D`):**
+
+```julia
+struct StructureFunction2D{FT, OT, BT, VT, MT} <: AbstractStructureFunction
+    operator::OT                   # AbstractStructureFunctionType
+    distance_bins::BT              # AbstractVector of (r_min, r_max)
+    value_bins::VT                 # AbstractVector of value bin edges
+    sums::MT                       # AbstractMatrix{FT} (distance x value)
+    counts::MT                     # AbstractMatrix{FT} (distance x value)
+end
+```
+
 **Access results**:
 ```julia
-result.values       # SF values, one per bin
+# 1D
+result.values         # SF values, one per bin
 result.distance_bins  # Original input bins
-result.operator     # The SF operator used
-result.order        # Order of the SF
+
+# 2D
+result_2d.sums        # Sum of SF values in each 2D cell
+result_2d.counts      # Count of point pairs in each 2D cell
 ```
 
 ## Theory & References
diff --git a/docs/architecture.md b/docs/architecture.md
@@ -88,18 +88,42 @@ Each operator stores:
 - **Order** (n=2, 3, 4, ...) — which structure function order
 - **Projection** (if applicable) — which component to analyze
 
-### Result Container
+### Result Containers
 
+StructureFunctions.jl decouples raw accumulation, processed 1D structure functions, and 2D joint-probability binning into separate parametric result types inheriting from `AbstractStructureFunction`:
+
+1. **`StructureFunction`**: Stores the final processed structure function values.
+```julia
+struct StructureFunction{FT, OT, BT, VT} <: AbstractStructureFunction
+    operator::OT                   # AbstractStructureFunctionType
+    distance_bins::BT              # AbstractVector of (r_min, r_max)
+    values::VT                     # AbstractVector{FT} — computed SF
+    order::Int                     # 1, 2, 3, ...
+end
+```
+
+2. **`StructureFunctionSumsAndCounts`**: Stores exact computed sums and point counts per bin. Ideal for distributed or chunked temporal aggregation.
+```julia
+struct StructureFunctionSumsAndCounts{FT, OT, BT, VT} <: AbstractStructureFunction
+    operator::OT
+    distance_bins::BT
+    sums::VT                       # Exact computed SF value sums
+    counts::VT                     # Counts of contributing pairs (same array type as sums)
+end
+```
+
+3. **`StructureFunction2D`**: Stores the 2D joint-probability binning grid (separation distance $r$ vs. SF value $v$).
 ```julia
-struct StructureFunction{T}
-    distance::Vector{T}          # Bin centers
-    structure_function::Matrix{T} # S(distance, order)
-    sums::Matrix{T}              # Numerator sums
-    counts::Vector{Int64}        # Counts per bin
+struct StructureFunction2D{FT, OT, BT, VT, MT} <: AbstractStructureFunction
+    operator::OT
+    distance_bins::BT
+    value_bins::VT                 # Value increment bin edges
+    sums::MT                       # 2D matrix of exact sums (distance x value)
+    counts::MT                     # 2D matrix of contribution counts
 end
 ```
 
-Stores **both raw and processed data** so users can customize post-processing.
+All result containers support basic algebraic operations from `Base` (such as `+`, which also makes `a += b` work since Julia lowers it to `a = a + b`), so results can be aggregated across distributed processes or timesteps.
 
 ---
 
diff --git a/docs/backends.md b/docs/backends.md
@@ -40,40 +40,61 @@ Single-threaded, reference implementation. All computations run on the calling t
 - ❌ Large data (>10M points): Too slow
 - ❌ Multi-CPU available: Wastes resources
 
-### Example
+### Examples
+
+**1. Allocating API:**
 
 ```julia
-using StructureFunctions
+using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT
 
 # Small test dataset
-x = randn(1000, 2)  # 1000 points in 2D
-u = randn(1000, 2)  # velocity at each point
+x = (randn(1000), randn(1000))
+u = (randn(1000), randn(1000))
+bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
 
-# Use SerialBackend explicitly
-backend = SerialBackend()
-bins = 10:10:100  # 10 distance bins
-
-result = calculate_structure_function(
-    FullVectorStructureFunction{Float64}(order=2),
+# Calculate using SerialBackend explicitly
+result = SFC.calculate_structure_function(
+    SFT.S2SFType(),
     x, u, bins;
-    backend=backend,
+    backend=SFC.SerialBackend(),
     show_progress=true
 )
 
-println("Structure Function at bin 50: $(result.structure_function[50, 1])")
+println("Structure Function values: ", result.values)
+```
+
+**2. Pre-allocated In-place API:**
+
+```julia
+using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT
+
+x = (randn(1000), randn(1000))
+u = (randn(1000), randn(1000))
+bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
+
+# Pre-allocate output arrays
+n_bins = length(bins)
+sums = zeros(Float64, n_bins)
+counts = zeros(Float64, n_bins)
+
+# Compute in-place (accumulates directly into provided arrays)
+SFC.calculate_structure_function!(
+    sums, counts, SFT.S2SFType(),
+    x, u, bins;
+    backend=SFC.SerialBackend()
+)
 ```
 
 ### Performance Notes
 - O(N²) complexity; for N=1M, expect ~1 sec
-- Light memory footprint (just result container + temporary arrays)
-- Good for validation before scaling up
+- The mutating `calculate_structure_function!` writes into caller-provided buffers and avoids allocating temporary arrays on this backend, which suits loops over timesteps.
 
 ---
 
 ## ThreadedBackend
 
 ### Definition
-Multi-threaded execution using OhMyThreads.jl. Distributes pairwise calculations across Threads.nthreads() worker threads.
+Multi-threaded execution using OhMyThreads.jl. Distributes pairwise calculations across `Threads.nthreads()` worker threads.
 
 ### When to Use
 - ✅ **Medium datasets**: 10M–500M points
@@ -94,56 +115,63 @@ Multi-threaded execution using OhMyThreads.jl. Distributes pairwise calculations
 OhMyThreads = "67456a42-ebe4-4781-8ad1-67f7eda8d8f7"
 ```
 
-### Example
+### Examples
+
+**1. Allocating API:**
 
 ```julia
-using StructureFunctions
-using Base.Threads
+using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT
 
-# Set number of threads before running
-# Either: JULIA_NUM_THREADS=8 julia script.jl
-# Or in REPL: Threads.nthreads() -> check current count
+N = 50_000
+x = (randn(N), randn(N))
+u = (randn(N), randn(N))
+bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
 
-# Medium dataset
-N = 50_000_000  # 50M points
-x = randn(N, 2)
-u = randn(N, 2)
+result = SFC.calculate_structure_function(
+    SFT.L2SFType(),
+    x, u, bins;
+    backend=SFC.ThreadedBackend(),
+    show_progress=true
+)
+```
 
-backend = ThreadedBackend()
-bins = 10:10:1000  # 100 distance bins
+**2. Pre-allocated In-place API:**
 
-result = calculate_structure_function(
-    FullVectorStructureFunction{Float64}(order=2),
+```julia
+using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT
+
+N = 50_000
+x = (randn(N), randn(N))
+u = (randn(N), randn(N))
+bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
+
+# Pre-allocate output arrays
+n_bins = length(bins)
+sums = zeros(Float64, n_bins)
+counts = zeros(Float64, n_bins)
+
+# Compute in-place (accumulates directly into provided arrays)
+SFC.calculate_structure_function!(
+    sums, counts, SFT.L2SFType(),
     x, u, bins;
-    backend=backend,
-    show_progress=true  # Progress bar shows thread work distribution
+    backend=SFC.ThreadedBackend()
 )
-
-# For 8 threads, expect ~2-8x speedup over serial
 ```
 
-### Performance Characteristics
+### Performance & Memory Efficiency
 
-**Scaling** (measured on 4-core system):
+The mutating threaded backend (`threaded_calculate_structure_function!`) uses a **chunked reduction** strategy: `OhMyThreads.chunks` divides the point indices into `nthreads()` sub-ranges, and each sub-range is processed by its own task.
 
-| N | Serial (s) | Threaded (s) | Speedup |
-|---|-----------|------------|---------|
-| 1M | 0.05 | 0.08 | 0.6x (overhead) |
-| 10M | 0.6 | 0.25 | 2.4x |
-| 50M | 3.5 | 1.2 | 2.9x |
-| 100M | 8 | 2.3 | 3.5x |
-
-**Notes**:
-- Speedup is sublinear (not 4x on 4 cores) due to NUMA effects and atomic reductions
-- Optimal for scenarios where data fits in L3 cache per thread
-- Progress bar updates in real-time showing all threads' work
+* **Chunked Workspaces**: Each task allocates exactly **one local buffer pair** (sums and counts) for its entire chunk, rather than one per point.
+* **Memory Scaling**: The number of task-local heap allocations is therefore **$O(n_{\text{threads}})$**, compared to the **$O(N_{\text{points}})$** allocation pattern of a naive per-point map-reduce.
+* **Cache Locality**: Reusing one small buffer pair per chunk improves cache locality while keeping the accumulation thread-safe and robust to task migration.
 
 ### Thread Safety
 
-ThreadedBackend uses **thread-local buffers** to avoid race conditions:
-- Each thread has its own workspace
-- No atomic operations (faster than distributed)
-- Completely safe; no possibility of data races
+ThreadedBackend uses **task-local reduction buffers** to avoid race conditions:
+- Each task accumulates into its own local chunk workspace.
+- The per-task results are then combined using a parallel tree reduction.
+- No global locks or atomic operations are required, which avoids contention between threads.
 
 ---