|
| 1 | +# Spatial Autocorrelation: Moran's I and LISA |
| 2 | + |
| 3 | +**Issue:** #1135 (partial -- Global Moran's I, Local Moran's I, queen/rook contiguity) |
| 4 | +**Date:** 2026-04-01 |
| 5 | + |
| 6 | +## Scope |
| 7 | + |
| 8 | +This spec covers the first increment of #1135: |
| 9 | + |
| 10 | +- Global Moran's I with analytical inference |
| 11 | +- Local Moran's I (LISA) with permutation-based pseudo p-values |
| 12 | +- Queen and rook contiguity weights (3x3 kernel) |
| 13 | + |
| 14 | +Geary's C, join count statistics, and distance-band weights are deferred to follow-up issues. |
| 15 | + |
| 16 | +## Public API |
| 17 | + |
| 18 | +Two functions in `xrspatial/autocorrelation.py`: |
| 19 | + |
| 20 | +### `morans_i` |
| 21 | + |
| 22 | +```python |
| 23 | +def morans_i( |
| 24 | + raster: xr.DataArray, |
| 25 | + contiguity: str = 'queen', |
| 26 | + boundary: str = 'nan', |
| 27 | +) -> xr.DataArray: |
| 28 | +``` |
| 29 | + |
| 30 | +Returns a scalar (0-dimensional) DataArray. The `.item()` value is the I statistic. |
| 31 | +Attrs carry analytical inference results: |
| 32 | + |
| 33 | +| Attr | Type | Description | |
| 34 | +|------|------|-------------| |
| 35 | +| `expected_I` | float | -1/(N-1) | |
| 36 | +| `variance_I` | float | Cliff & Ord analytical variance | |
| 37 | +| `z_score` | float | (I - E[I]) / sqrt(Var[I]) | |
| 38 | +| `p_value` | float | Two-sided, from normal approximation | |
| 39 | +| `N` | int | Count of non-NaN pixels | |
| 40 | +| `S0` | float | Sum of all weights | |
| 41 | +| `contiguity` | str | 'queen' or 'rook' | |
| 42 | + |
| 43 | +### `lisa` |
| 44 | + |
| 45 | +```python |
| 46 | +def lisa( |
| 47 | + raster: xr.DataArray, |
| 48 | + contiguity: str = 'queen', |
| 49 | + n_permutations: int = 999, |
| 50 | + boundary: str = 'nan', |
| 51 | +) -> xr.Dataset: |
| 52 | +``` |
| 53 | + |
| 54 | +Returns a Dataset with three DataVariables: |
| 55 | + |
| 56 | +| Variable | Dims | Dtype | Description | |
| 57 | +|----------|------|-------|-------------| |
| 58 | +| `lisa_values` | (y, x) | float32 | Local I_i per pixel | |
| 59 | +| `p_values` | (y, x) | float32 | Pseudo p-value from permutation | |
| 60 | +| `cluster` | (y, x) | int8 | 0=NS, 1=HH, 2=LL, 3=HL, 4=LH | |
| 61 | + |
| 62 | +Dataset attrs: `n_permutations`, `contiguity`, `global_morans_I`. |
| 63 | + |
| 64 | +Cluster codes use significance threshold p <= 0.05. Pixels with p > 0.05 get code 0 regardless of their quadrant. |
| 65 | + |
| 66 | +## Mathematics |
| 67 | + |
| 68 | +### Global Moran's I |
| 69 | + |
| 70 | +``` |
| 71 | +z = x - mean(x) |
| 72 | +lag = convolve(z, W) # W = queen or rook kernel |
| 73 | +I = (N / S0) * sum(z * lag) / sum(z^2) |
| 74 | +``` |
| 75 | + |
| 76 | +S0 (total weight sum) accounts for border effects and NaN gaps by convolving a non-NaN mask with the weight kernel and summing the result. |
| 77 | + |
| 78 | +Analytical expected value and variance follow Cliff & Ord (1981): |
| 79 | + |
| 80 | +``` |
| 81 | +E[I] = -1 / (N - 1) |
| 82 | +Var[I] uses S0, S1, S2, N, and the kurtosis of z |
| 83 | +``` |
| 84 | + |
| 85 | +Where S1 = (1/2) * sum_ij (w_ij + w_ji)^2 and S2 = sum_i (sum_j w_ij + sum_j w_ji)^2. For symmetric binary weights (queen/rook), S1 = 2 * S0 and S2 simplifies. |
| 86 | + |
| 87 | +### Local Moran's I (LISA) |
| 88 | + |
| 89 | +``` |
| 90 | +I_i = (z_i / var(x)) * sum_j(w_ij * z_j) |
| 91 | +``` |
| 92 | + |
| 93 | +### Permutation pseudo p-value |
| 94 | + |
| 95 | +For each pixel i: |
| 96 | +1. Extract the neighbor values (up to 8 for queen, 4 for rook). |
| 97 | +2. Shuffle the neighbor values n_permutations times (Fisher-Yates). |
| 98 | +3. Recompute I_i with each shuffled set. |
| 99 | +4. p_value = (count(|I_perm| >= |I_obs|) + 1) / (n_permutations + 1) |
| 100 | + |
| 101 | +The +1 correction includes the observed value as one permutation (Davison & Hinkley, 1997). |
| 102 | + |
| 103 | +### Cluster classification |
| 104 | + |
| 105 | +| Code | Label | Condition (when p <= 0.05) | |
| 106 | +|------|-------|---------------------------| |
| 107 | +| 0 | NS | not significant | |
| 108 | +| 1 | HH | z_i > 0 and lag_i > 0 | |
| 109 | +| 2 | LL | z_i < 0 and lag_i < 0 | |
| 110 | +| 3 | HL | z_i > 0 and lag_i < 0 | |
| 111 | +| 4 | LH | z_i < 0 and lag_i > 0 | |
| 112 | + |
| 113 | +### NaN handling |
| 114 | + |
| 115 | +- NaN input pixels produce NaN in lisa_values and p_values, 0 in cluster. |
| 116 | +- NaN neighbors are excluded from lag sums (their weight drops to zero). |
| 117 | +- Constant rasters (zero variance) produce NaN for all statistics. |
| 118 | + |
| 119 | +## Contiguity kernels |
| 120 | + |
| 121 | +Queen (8 neighbors): |
| 122 | +``` |
| 123 | +[[1, 1, 1], |
| 124 | + [1, 0, 1], |
| 125 | + [1, 1, 1]] |
| 126 | +``` |
| 127 | + |
| 128 | +Rook (4 neighbors): |
| 129 | +``` |
| 130 | +[[0, 1, 0], |
| 131 | + [1, 0, 1], |
| 132 | + [0, 1, 0]] |
| 133 | +``` |
| 134 | + |
| 135 | +## Internal Architecture |
| 136 | + |
| 137 | +### Backend dispatch |
| 138 | + |
| 139 | +Both public functions validate input via `_validate_raster`, build the contiguity kernel, then dispatch through `ArrayTypeFunctionMapping`: |
| 140 | + |
| 141 | +``` |
| 142 | +morans_i / lisa |
| 143 | + -> _validate_raster(raster, ndim=2) |
| 144 | + -> kernel = _contiguity_kernel(contiguity) |
| 145 | + -> ArrayTypeFunctionMapping(numpy, cupy, dask, dask_cupy)(raster)(...) |
| 146 | +``` |
| 147 | + |
| 148 | +### Backend implementations |
| 149 | + |
| 150 | +**numpy:** Compute mean/var eagerly. Spatial lag via `_convolve_2d_numpy` (imported from convolution module) or a local implementation. LISA permutation via `@ngjit` loop that iterates pixels, extracts 8 neighbors, runs Fisher-Yates shuffle n_permutations times. |
| 151 | + |
| 152 | +**cupy:** Same structure but GPU arrays. Spatial lag via CuPy convolution. Permutation via `@cuda.jit` kernel where each thread handles one pixel. RNG uses `numba.cuda.random.xoroshiro128p_uniform_float32` for Fisher-Yates shuffle. |
| 153 | + |
| 154 | +**dask+numpy:** Eagerly compute global mean, var, N with `da.compute()`. Pass as scalars to chunk function via `partial()`. Single `map_overlap(depth=1, boundary=...)` call with fused chunk function that computes lag + I_i + permutation + cluster. Returns `(3, H, W)` float32 array (lisa_values, p_values, cluster). Unpacked into Dataset after. |
| 155 | + |
| 156 | +**dask+cupy:** Same structure as dask+numpy but chunk function dispatches to CUDA kernel internally. |
| 157 | + |
| 158 | +### Fused LISA chunk function |
| 159 | + |
| 160 | +```python |
| 161 | +def _lisa_chunk_numpy(chunk, kernel, global_mean, global_var, n_permutations, seed): |
| 162 | + """Single-pass: lag + LISA + permutation + cluster for one chunk.""" |
| 163 | + rows, cols = chunk.shape |
| 164 | + z = chunk - global_mean |
| 165 | + out = np.empty((3, rows, cols), dtype=np.float32) |
| 166 | + # For each pixel: read neighbors, compute lag, permute, classify |
| 167 | + _lisa_fused_ngjit(z, kernel, global_var, n_permutations, seed, out) |
| 168 | + return out |
| 169 | +``` |
| 170 | + |
| 171 | +The `@ngjit` inner function handles the pixel loop with neighbor extraction, lag computation, Fisher-Yates permutation, and cluster assignment in a single pass. |
| 172 | + |
| 173 | +### Global Moran's I backend flow |
| 174 | + |
| 175 | +``` |
| 176 | +numpy: |
| 177 | + z = data - mean |
| 178 | + lag = convolve(z, kernel) # reuse focal convolution |
| 179 | + mask = ~isnan(data) |
| 180 | + S0 = sum(convolve(mask, kernel)) # total weight count |
| 181 | + N = sum(mask) |
| 182 | + I = (N / S0) * nansum(z * lag) / nansum(z^2) |
| 183 | + # analytical variance via S0, S1, S2, N |
| 184 | +
|
| 185 | +dask: |
| 186 | + mean, var, N = da.compute(da.nanmean(data), da.nanvar(data), da.sum(~da.isnan(data))) |
| 187 | + z = data - mean |
| 188 | + lag = map_overlap(_convolve_chunk, depth=1, ...) |
| 189 | + S0 = da.sum(map_overlap(_convolve_mask_chunk, depth=1, ...)) |
| 190 | + I = (N / S0) * da.nansum(z * lag) / da.nansum(z**2) |
| 191 | + I = I.compute() |
| 192 | +``` |
| 193 | + |
| 194 | +## map_overlap specifics |
| 195 | + |
| 196 | +- **depth:** 1 in both dimensions (3x3 kernel, half-size = 1) |
| 197 | +- **boundary:** passed through `_boundary_to_dask(boundary)`, default `np.nan` |
| 198 | +- **meta:** `np.array((), dtype=np.float32)` for numpy, `cupy.array((), dtype=cupy.float32)` for cupy |
| 199 | +- **Fused output:** LISA chunk function receives a 2D chunk and returns a `(3, H, W)` array. To make `map_overlap` accept this shape change, pass `new_axis=0` so dask knows the output gains a leading dimension. After the `map_overlap` call, slice `result[0]`, `result[1]`, `result[2]` to get the three output bands. If `new_axis` with `map_overlap` proves brittle at runtime, fall back to three separate `map_overlap` calls (one per output band) at the cost of 3x neighbor reads. |
| 200 | + |
| 201 | +## Seed handling |
| 202 | + |
| 203 | +Permutation tests need reproducible results for cross-backend testing. |
| 204 | + |
| 205 | +- numpy/dask+numpy: `np.random.SeedSequence(seed)` spawns per-pixel child sequences. In practice, the `@ngjit` function uses a simple LCG seeded with `seed + pixel_linear_index` for Fisher-Yates shuffles over 8 elements. Good enough for 8-element shuffles. |
| 206 | +- cupy/dask+cupy: `xoroshiro128p` states initialized with `create_xoroshiro128p_states(n_threads, seed)`. |
| 207 | +- Public API does not expose seed. Internal backends accept it for testing. Default seed is 0 for determinism within a single call (users wanting different random draws can call twice -- permutation p-values are not sensitive to seed choice for n >= 999). |
| 208 | + |
| 209 | +## Testing |
| 210 | + |
| 211 | +File: `xrspatial/tests/test_autocorrelation.py` |
| 212 | + |
| 213 | +### Known-value tests |
| 214 | + |
| 215 | +| Input | Expected I | Rationale | |
| 216 | +|-------|-----------|-----------| |
| 217 | +| 4x4 checkerboard | I < -0.8 | Strong negative autocorrelation | |
| 218 | +| 4x4 row gradient | I > 0.5 | Positive autocorrelation | |
| 219 | +| 8x8 random (seeded) | -0.3 < I < 0.3 | Near zero | |
| 220 | +| Constant value | NaN | Zero variance | |
| 221 | + |
| 222 | +### LISA tests |
| 223 | + |
| 224 | +- Checkerboard: all I_i negative, all clusters HL or LH, all p-values < 0.05 |
| 225 | +- Gradient: center pixels positive (HH or LL), p-values < 0.05 |
| 226 | +- NaN corners: NaN in output at those positions, valid elsewhere |
| 227 | + |
| 228 | +### Edge cases |
| 229 | + |
| 230 | +- Single-cell raster: returns NaN |
| 231 | +- All-NaN raster: returns NaN |
| 232 | +- Raster with one non-NaN pixel: returns NaN |
| 233 | + |
| 234 | +### Cross-backend parity |
| 235 | + |
| 236 | +- `assert_numpy_equals_dask_numpy` for both functions |
| 237 | +- `assert_numpy_equals_cupy` (skip if no GPU) |
| 238 | +- Fixed seed ensures identical permutation sequences |
| 239 | + |
| 240 | +### Contiguity |
| 241 | + |
| 242 | +- Queen vs rook produce different I values on same input |
| 243 | +- Rook on 4x4 checkerboard: I = -1.0 (perfect negative, all 4 neighbors opposite) |
| 244 | + |
| 245 | +## Documentation |
| 246 | + |
| 247 | +- Add API entry in `docs/source/reference/` (new section or extend focal tools) |
| 248 | +- Add row to README feature matrix under a new "Spatial Statistics" category |
| 249 | + |
| 250 | +## Files changed |
| 251 | + |
| 252 | +| File | Change | |
| 253 | +|------|--------| |
| 254 | +| `xrspatial/autocorrelation.py` | New module | |
| 255 | +| `xrspatial/__init__.py` | Export `morans_i`, `lisa` | |
| 256 | +| `xrspatial/tests/test_autocorrelation.py` | New test file | |
| 257 | +| `docs/source/reference/autocorrelation.rst` | New API docs | |
| 258 | +| `README.md` | Feature matrix row | |
| 259 | +| `examples/user_guide/` | New notebook | |
0 commit comments