+- **2D Joint-Probability Binning**: Natively accumulates exact sums and contribution counts over a 2D grid of distance bins and structure-function value bins (`StructureFunction2D`)
 - **Typed Backend System**: Serial, Threaded, Distributed, GPU, Auto; choose your parallelization strategy
 - **Type-Stable Dispatch**: No runtime overhead from symbolic dispatch; all paths validated with JET
 - **Extensible Architecture**: Optional extensions for parallelization and GPU acceleration
@@ -60,6 +62,31 @@ if nthreads() > 1
 end
 ```
 
+### Pre-allocated In-place Calculation
+
+For high-performance loops (e.g. over timesteps), you can pre-allocate the output buffers once and run the mutating calculation so that no new output arrays are allocated on each call:
+
+```julia
+using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT
+
+x = ([0.0, 1.0, 2.0], [0.0, 0.0, 0.0])
+u = ([1.0, 1.1, 1.2], [0.0, 0.05, 0.1])
+bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
+sf_type = SFT.L2SFType()
+
+# Pre-allocate output arrays
+n_bins = length(bins)
+sums = zeros(Float64, n_bins)
+counts = zeros(Float64, n_bins)
+
+# Compute in-place (accumulates into provided buffers)
+SFC.calculate_structure_function!(sums, counts, sf_type, x, u, bins; backend=SFC.ThreadedBackend())
+
+# Obtain structure function values via division (bins with zero counts yield NaN)
+sf_values = sums ./ counts
+```
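Because the in-place call accumulates into the buffers it is given, a loop over timesteps has to choose between resetting them at each step and letting them build a running total. The sketch below illustrates the reset-and-reuse pattern. It assumes a hypothetical stand-in kernel, `accumulate_l2!`, written here only so the snippet runs without the package; it is not the StructureFunctions.jl implementation, and it works on a 1D signal for brevity.

```julia
# Hypothetical stand-in for the package's in-place call: adds squared
# increments of a 1D signal into `sums` and pair counts into `counts`.
function accumulate_l2!(sums, counts, x, u, bins)
    n = length(x)
    for i in 1:n, j in (i + 1):n
        r = abs(x[j] - x[i])
        # Bins are treated as half-open intervals [lo, hi) in this sketch.
        b = findfirst(bin -> bin[1] <= r < bin[2], bins)
        b === nothing && continue
        sums[b] += (u[j] - u[i])^2
        counts[b] += 1
    end
    return sums, counts
end

x = [0.0, 1.0, 2.0]
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
timesteps = [[1.0, 1.1, 1.2], [2.0, 2.2, 2.4]]  # one snapshot per step

# Allocate the output buffers once, outside the loop.
sums = zeros(Float64, length(bins))
counts = zeros(Float64, length(bins))
per_step = Vector{Vector{Float64}}()

for u in timesteps
    # Reset so each step starts from zero; omit this to keep a running total.
    fill!(sums, 0.0)
    fill!(counts, 0.0)
    accumulate_l2!(sums, counts, x, u, bins)
    push!(per_step, sums ./ counts)  # empty bins give NaN (0/0)
end
```

Only the per-step result vector pushed into `per_step` is newly allocated; the accumulation buffers themselves are reused throughout.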
+
+
 ## Architecture
 
 ### Operator Types ✕ Result Container Pattern
@@ -192,7 +219,9 @@ result = SFC.calculate_structure_function(sf_type, x, u, bins;
docs/architecture.md (31 additions & 7 deletions)
@@ -88,18 +88,42 @@ Each operator stores:
 - **Order** (n = 2, 3, 4, ...): which structure function order
 - **Projection** (if applicable): which component to analyze
 
-### Result Container
+### Result Containers
 
+StructureFunctions.jl separates raw accumulation, processed 1D structure functions, and 2D joint-probability binning into distinct parametric result types, each a subtype of `AbstractStructureFunction`:
+
+1. **`StructureFunction`**: Stores the final processed structure function values.
+    sums::MT    # 2D matrix of exact sums (distance x value)
+    counts::MT  # 2D matrix of contribution counts
 end
 ```
 
-Stores **both raw and processed data** so users can customize post-processing.
+All result containers support basic `Base` algebraic operations (such as `+` and `+=`) so that results can be aggregated across distributed processes or timesteps.
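To make the aggregation idea concrete, here is a minimal sketch of how a container can support `+`. The `BinnedAccumulator` type and its fields are hypothetical and exist only for this illustration; the package's own result types may be defined differently. In Julia, `a += b` is syntax for `a = a + b`, so defining `+` is enough for both forms.

```julia
# Hypothetical container, for illustration only; not the package's type.
struct BinnedAccumulator{T<:AbstractVector}
    sums::T
    counts::T
end

# Combining two partial results adds their sums and counts bin by bin.
Base.:+(a::BinnedAccumulator, b::BinnedAccumulator) =
    BinnedAccumulator(a.sums .+ b.sums, a.counts .+ b.counts)

# Partial results, e.g. one per timestep or per worker process.
parts = [
    BinnedAccumulator([0.02, 0.04], [2.0, 1.0]),
    BinnedAccumulator([0.08, 0.16], [2.0, 1.0]),
]

total = reduce(+, parts)  # fold all partial results into one

acc = parts[1]
acc += parts[2]           # rebinds `acc` to `acc + parts[2]`

# Normalize only after aggregation, so every pair is weighted equally.
mean_sf = total.sums ./ total.counts
```

Keeping sums and counts separate until the end is what makes this aggregation exact: averaging per-part means instead would weight parts with few pairs too heavily.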
using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT
 
-# Set number of threads before running
-# Either: JULIA_NUM_THREADS=8 julia script.jl
-# Or in REPL: Threads.nthreads() -> check current count
+N = 50_000
+x = (randn(N), randn(N))
+u = (randn(N), randn(N))
+bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
 
-# Medium dataset
-N = 50_000_000  # 50M points
-x = randn(N, 2)
-u = randn(N, 2)
+result = SFC.calculate_structure_function(
+    SFT.L2SFType(),
+    x, u, bins;
+    backend=SFC.ThreadedBackend(),
+    show_progress=true
+)
+```
 
-backend = ThreadedBackend()
-bins = 10:10:1000  # 100 distance bins
+**2. Pre-allocated In-place API:**
 
-result = calculate_structure_function(
-    FullVectorStructureFunction{Float64}(order=2),
+```julia
+using StructureFunctions: Calculations as SFC, StructureFunctionTypes as SFT
+
+N = 50_000
+x = (randn(N), randn(N))
+u = (randn(N), randn(N))
+bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
+
+# Pre-allocate output arrays
+n_bins = length(bins)
+sums = zeros(Float64, n_bins)
+counts = zeros(Float64, n_bins)
+
+# Compute in-place (accumulates directly into provided arrays)
+SFC.calculate_structure_function!(
+    sums, counts, SFT.L2SFType(),
     x, u, bins;
-    backend=backend,
-    show_progress=true  # Progress bar shows thread work distribution
+    backend=SFC.ThreadedBackend()
 )
-
-# For 8 threads, expect ~2-8x speedup over serial
 ```
 
-### Performance Characteristics
+### Performance & Memory Efficiency
 
-**Scaling** (measured on 4-core system):
+The mutating threaded backend (`threaded_calculate_structure_function!`) uses a **chunked reduction** strategy via `OhMyThreads.chunks` to divide the point indices into exactly `nthreads()` sub-ranges.
 
-| N | Serial (s) | Threaded (s) | Speedup |
-|---|-----------|------------|---------|
-| 1M | 0.05 | 0.08 | 0.6x (overhead) |
-| 10M | 0.6 | 0.25 | 2.4x |
-| 50M | 3.5 | 1.2 | 2.9x |
-| 100M | 8 | 2.3 | 3.5x |
-
-**Notes**:
-- Speedup is sublinear (not 4x on 4 cores) due to NUMA effects and atomic reductions
-- Optimal for scenarios where data fits in L3 cache per thread
-- Progress bar updates in real-time showing all threads' work
+* **Chunked Workspaces**: Each task allocates exactly **one local buffer pair** for its entire chunk, rather than one per point.
+* **Memory Scaling**: The number of task-local heap allocations is therefore $O(n_{\text{threads}})$, compared with the $O(N_{\text{points}})$ allocations of a naive per-point map-reduce.
+* **Cache Locality**: Each task reuses the same small buffers for its whole chunk, which favors L1/L2 cache locality, while task-local ownership keeps the scheme thread-safe and robust to task migration.
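The chunked strategy described above can be sketched with `Base.Threads` alone. This is an illustration under stated assumptions, not the package's code: `accumulate_chunk!` and `chunked_sf` are hypothetical names, the kernel handles a 1D signal only, and `Iterators.partition` stands in for `OhMyThreads.chunks` (it yields at most `ntasks` chunks rather than exactly that many).

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical kernel: accumulate squared increments for the point indices
# in `idxs` only, pairing each point with all later points.
function accumulate_chunk!(sums, counts, idxs, x, u, bins)
    n = length(x)
    for i in idxs, j in (i + 1):n
        r = abs(x[j] - x[i])
        b = findfirst(bin -> bin[1] <= r < bin[2], bins)
        b === nothing && continue
        sums[b] += (u[j] - u[i])^2
        counts[b] += 1
    end
    return sums, counts
end

# Assumes `x` is non-empty.
function chunked_sf(x, u, bins; ntasks = nthreads())
    n = length(x)
    chunks = Iterators.partition(1:n, cld(n, ntasks))
    tasks = map(chunks) do idxs
        @spawn begin
            # One buffer pair per task, allocated once for the whole chunk.
            local_sums = zeros(Float64, length(bins))
            local_counts = zeros(Float64, length(bins))
            accumulate_chunk!(local_sums, local_counts, idxs, x, u, bins)
        end
    end
    results = fetch.(tasks)
    # Sequential fold of the per-task buffers; no shared state was written.
    return sum(first, results), sum(last, results)
end

x = [0.0, 1.0, 2.0]
u = [1.0, 1.1, 1.2]
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
sums, counts = chunked_sf(x, u, bins; ntasks = 2)
```

The number of buffer allocations depends on the number of tasks, not on the number of points, which is the scaling property the bullets describe.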
 
 ### Thread Safety
 
-ThreadedBackend uses **thread-local buffers** to avoid race conditions:
-- Each thread has its own workspace
-- No atomic operations (faster than distributed)
-- Completely safe; no possibility of data races
+`ThreadedBackend` uses **thread-local reduction buffers** to avoid race conditions:
+- Each task computes on its own local chunk workspace.
+- The results are folded together thread-safely using a parallel tree reduction.
+- No global locks or atomic updates are involved, so tasks do not contend with one another.
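As a rough picture of what a parallel tree reduction over per-task buffers looks like, the sketch below folds partial results pairwise. `tree_reduce` is a hypothetical helper written for this note; the backend's actual reduction is presumably provided by OhMyThreads and may differ in detail.

```julia
using Base.Threads: @spawn

# Fold a non-empty vector of partial results pairwise, as a binary tree.
# Each half is reduced independently, so no two tasks touch the same data.
function tree_reduce(op, parts::AbstractVector)
    isempty(parts) && throw(ArgumentError("parts must be non-empty"))
    n = length(parts)
    n == 1 && return parts[1]
    mid = n ÷ 2
    left = @spawn tree_reduce(op, view(parts, 1:mid))  # left half in a new task
    right = tree_reduce(op, view(parts, (mid + 1):n))  # right half in this task
    return op(fetch(left), right)
end

# Three per-task partial sums for two bins.
partial_sums = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
total = tree_reduce(+, partial_sums)
```

Because every task only reads its own slice and returns a new value, no locks or atomics are required to combine the results.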