Commit 04d313e

Imbruced, jiayuasu, and Copilot
authored
[SEDONA-738] Add moran i autocorrelation. (#1975)
* SEDONA-738 Add moran i autocorrelation.
* SEDONA-738 Fix unit tests.
* SEDONA-738 Fix unit tests.
* SEDONA-738 Fix unit tests.
* SEDONA-738 Fix unit tests.
* SEDONA-738 Fix unit tests.
* SEDONA-738 Fix scala 2.13 issue
* Update spark/common/src/test/scala/org/apache/sedona/stats/autocorellation/AutoCorrelationFixtures.scala

  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix typos
* Update doc

---------

Co-authored-by: Jia Yu <jiayu@apache.org>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 9c07af8 commit 04d313e

13 files changed

Lines changed: 739 additions & 34 deletions

docs/api/stats/sql.md

Lines changed: 114 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -135,3 +135,117 @@ names in parentheses are python variable names
135135
- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
136136

137137
In both cases the output is the input DataFrame with the weights column added to each row.
138+
139+
## Moran I
140+
141+
Moran's I is a measure of spatial autocorrelation that combines the spatial
142+
location of each feature with a non-spatial attribute. A value close to 1
143+
indicates positive spatial correlation: nearby features have similar values.
144+
A value close to 0 indicates no spatial correlation, meaning the data is
145+
randomly distributed. A value close to -1 indicates negative spatial
146+
correlation: nearby features have dissimilar values.
147+
148+
The figure below shows examples of spatial correlation values:
149+
150+
- on the left, the correlation is negative (close to -1)
151+
- in the middle, the correlation is positive (close to 1)
152+
- on the right, the correlation is close to zero and the data is random
153+
154+
![moranI.png](../../image/moranI.png)
155+
156+
The Moran's I statistic is available as Scala/Java and Python functions.
157+
The function takes a weights DataFrame as input, which you can create
158+
with the Apache Sedona weighting functions. The input must contain an id
159+
column that uniquely identifies each feature and a value column holding the
160+
attribute to analyze. The minimal schema required by the Apache Sedona
161+
Moran's I function is:
162+
163+
```
164+
|-- id: integer (nullable = true)
165+
|-- value: double (nullable = true)
166+
|-- weights: array (nullable = false)
167+
| |-- element: struct (containsNull = false)
168+
| | |-- neighbor: struct (nullable = false)
169+
| | | |-- id: integer (nullable = true)
170+
| | | |-- value: double (nullable = true)
171+
| | |-- value: double (nullable = true)
172+
```
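
As a rough illustration of what this schema encodes, the pure-Python sketch below computes the global Moran's I, I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, from rows shaped like the schema above. It does not call Sedona: `morans_i`, `chain_rows`, and the sample data are hypothetical, and the sketch assumes that the inner `value` of each `weights` element is the weight w_ij and that `neighbor.value` is the neighbor's attribute.

```python
from typing import Dict, List


def morans_i(rows: List[Dict]) -> float:
    """Global Moran's I for rows shaped like the minimal schema (hypothetical helper)."""
    n = len(rows)
    mean = sum(row["value"] for row in rows) / n

    # Denominator: total squared deviation of the attribute from its mean.
    total_sq_dev = sum((row["value"] - mean) ** 2 for row in rows)

    weight_sum = 0.0  # W: sum of all weights w_ij
    cross_sum = 0.0   # sum of w_ij * (x_i - mean) * (x_j - mean)
    for row in rows:
        for entry in row["weights"]:
            w_ij = entry["value"]  # assumed to be the weight of this neighbor
            x_j = entry["neighbor"]["value"]
            weight_sum += w_ij
            cross_sum += w_ij * (row["value"] - mean) * (x_j - mean)

    return (n / weight_sum) * (cross_sum / total_sq_dev)


def chain_rows(values: List[float]) -> List[Dict]:
    """Build rows for points on a line where each point neighbors the adjacent ones."""
    rows = []
    for i, value in enumerate(values):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
        weights = [
            {"neighbor": {"id": j, "value": values[j]}, "value": 1.0}
            for j in neighbors
        ]
        rows.append({"id": i, "value": value, "weights": weights})
    return rows


alternating = morans_i(chain_rows([1.0, 0.0, 1.0, 0.0]))  # dissimilar neighbors
clustered = morans_i(chain_rows([1.0, 1.0, 0.0, 0.0]))    # similar neighbors
```

With these toy inputs the alternating values give a Moran's I of about -1, while the clustered values give a positive value of about 0.33.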
173+
174+
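You can change the names of the id and value columns using function parameters.
175+
176+
When you create the weights with the [Apache Sedona weighting functions](#adddistancebandcolumn), pass the id and value columns in the saved attributes parameter (`savedAttributes` in Scala, `saved_attributes` in Python) so that they are kept in the output.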
177+
178+
=== "Scala"
179+
180+
```scala
181+
val weights = Weighting.addDistanceBandColumn(
182+
positiveCorrelationFrame,
183+
1.0,
184+
savedAttributes = Seq("id", "value")
185+
)
186+
187+
val moranResult = Moran.getGlobal(weights, idColumn = "id")
188+
189+
// result fields
190+
moranResult.getPNorm
191+
moranResult.getI
192+
moranResult.getZNorm
193+
```
194+
195+
=== "Python"
196+
197+
```python
198+
from sedona.spark.stats.autocorrelation.moran import Moran
199+
from sedona.spark.stats.weighting import add_binary_distance_band_column
200+
201+
result = add_binary_distance_band_column(
202+
df,
203+
1.0,
204+
saved_attributes=["id", "value"]
205+
)
206+
207+
moran_i_result = Moran.get_global(result)
208+
209+
# result fields
210+
moran_i_result.p_norm
211+
moran_i_result.i
212+
moran_i_result.z_norm
213+
```
214+
215+
The result contains the Moran's I value, the Z norm, and the P norm.
216+
217+
The full signatures of the functions are:
218+
219+
=== "Scala"
220+
221+
```scala
222+
def getGlobal(
223+
dataframe: DataFrame,
224+
twoTailed: Boolean = true,
225+
idColumn: String = ID_COLUMN,
226+
valueColumnName: String = VALUE_COLUMN): MoranResult
227+
228+
// java interface
229+
public interface MoranResult {
230+
public double getI();
231+
public double getPNorm();
232+
public double getZNorm();
233+
}
234+
```
235+
236+
=== "Python"
237+
238+
```python
239+
def get_global(
240+
df: DataFrame,
241+
two_tailed: bool = True,
242+
id_column: str = "id",
243+
value_column: str = "value",
244+
) -> MoranResult
245+
246+
@dataclass
247+
class MoranResult:
248+
i: float
249+
p_norm: float
250+
z_norm: float
251+
```
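
The `two_tailed` flag and the `z_norm` / `p_norm` fields suggest the usual normal-approximation test for Moran's I, although this diff does not show how Sedona computes them. Under that assumption, the relationship between a z-score and its p-value can be sketched as follows; `normal_p_value` is a hypothetical helper and not part of the Sedona API.

```python
import math


def normal_p_value(z: float, two_tailed: bool = True) -> float:
    """p-value of a z-score under a standard normal distribution (hypothetical helper)."""
    # Upper-tail probability P(Z >= |z|), via the complementary error function.
    upper_tail = 0.5 * math.erfc(abs(z) / math.sqrt(2.0))
    # A two-tailed test counts deviations in both directions, doubling the tail.
    return 2.0 * upper_tail if two_tailed else upper_tail


p_two = normal_p_value(1.96)                    # two-tailed, close to 0.05
p_one = normal_p_value(1.96, two_tailed=False)  # one-tailed, close to 0.025
```

A small p-value means the observed Moran's I would be unlikely if the values were randomly distributed in space.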

docs/image/moranI.png

5.29 KB

python/sedona/spark/register/java_libs.py

Lines changed: 1 addition & 0 deletions
@@ -65,6 +65,7 @@ class SedonaJvmLib(Enum):
6565
st_predicates = "org.apache.spark.sql.sedona_sql.expressions.st_predicates"
6666
st_aggregates = "org.apache.spark.sql.sedona_sql.expressions.st_aggregates"
6767
SedonaContext = "org.apache.sedona.spark.SedonaContext"
68+
Moran = "org.apache.sedona.stats.autocorrelation.Moran"
6869

6970
@classmethod
7071
def from_str(cls, geo_lib: str) -> "SedonaJvmLib":
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
1+
# Licensed to the Apache Software Foundation (ASF) under one
2+
# or more contributor license agreements. See the NOTICE file
3+
# distributed with this work for additional information
4+
# regarding copyright ownership. The ASF licenses this file
5+
# to you under the Apache License, Version 2.0 (the
6+
# "License"); you may not use this file except in compliance
7+
# with the License. You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing,
12+
# software distributed under the License is distributed on an
13+
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
# KIND, either express or implied. See the License for the
15+
# specific language governing permissions and limitations
16+
# under the License.
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
1+
# Licensed to the Apache Software Foundation (ASF) under one
2+
# or more contributor license agreements. See the NOTICE file
3+
# distributed with this work for additional information
4+
# regarding copyright ownership. The ASF licenses this file
5+
# to you under the Apache License, Version 2.0 (the
6+
# "License"); you may not use this file except in compliance
7+
# with the License. You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing,
12+
# software distributed under the License is distributed on an
13+
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
# KIND, either express or implied. See the License for the
15+
# specific language governing permissions and limitations
16+
# under the License.
17+
from dataclasses import dataclass
18+
19+
from pyspark.sql import DataFrame
20+
from pyspark.sql import SparkSession
21+
22+
23+
@dataclass
24+
class MoranResult:
25+
i: float
26+
p_norm: float
27+
z_norm: float
28+
29+
30+
class Moran:
31+
32+
@staticmethod
33+
def get_global(
34+
df: DataFrame,
35+
two_tailed: bool = True,
36+
id_column: str = "id",
37+
value_column: str = "value",
38+
) -> MoranResult:
39+
sedona = SparkSession.getActiveSession()
40+
41+
_jvm = sedona._jvm
42+
moran_result = (
43+
sedona._jvm.org.apache.sedona.stats.autocorrelation.Moran.getGlobal(
44+
df._jdf, two_tailed, id_column, value_column
45+
)
46+
)
47+
48+
return MoranResult(
49+
i=moran_result.getI(),
50+
p_norm=moran_result.getPNorm(),
51+
z_norm=moran_result.getZNorm(),
52+
)

python/sedona/spark/stats/hotspot_detection/getis_ord.py

Lines changed: 2 additions & 2 deletions
@@ -21,7 +21,7 @@
2121
Geographical Analysis, 24(3), 189-206. https://doi.org/10.1111/j.1538-4632.1992.tb00261.x
2222
"""
2323

24-
from pyspark.sql import Column, DataFrame, SparkSession
24+
from pyspark.sql import DataFrame, SparkSession
2525

2626
# todo change weights and x type to string
2727

@@ -59,7 +59,7 @@ def g_local(
5959
sedona = SparkSession.getActiveSession()
6060

6161
result_df = sedona._jvm.org.apache.sedona.stats.hotspotDetection.GetisOrd.gLocal(
62-
dataframe, x, weights, permutations, star, island_weight
62+
dataframe._jdf, x, weights, permutations, star, island_weight
6363
)
6464

6565
return DataFrame(result_df, sedona)

python/sedona/spark/stats/weighting.py

Lines changed: 45 additions & 32 deletions
@@ -60,18 +60,21 @@ def add_distance_band_column(
6060
6161
"""
6262
sedona = SparkSession.getActiveSession()
63-
return sedona._jvm.org.apache.sedona.stats.Weighting.addDistanceBandColumn(
64-
dataframe._jdf,
65-
float(threshold),
66-
binary,
67-
float(alpha),
68-
include_zero_distance_neighbors,
69-
include_self,
70-
float(self_weight),
71-
geometry,
72-
use_spheroid,
73-
saved_attributes,
74-
result_name,
63+
return DataFrame(
64+
sedona._jvm.org.apache.sedona.stats.Weighting.addDistanceBandColumnPython(
65+
dataframe._jdf,
66+
float(threshold),
67+
binary,
68+
float(alpha),
69+
include_zero_distance_neighbors,
70+
include_self,
71+
float(self_weight),
72+
geometry,
73+
use_spheroid,
74+
saved_attributes,
75+
result_name,
76+
),
77+
sedona,
7578
)
7679

7780

@@ -110,15 +113,21 @@ def add_binary_distance_band_column(
110113
"""
111114
sedona = SparkSession.getActiveSession()
112115

113-
return sedona._jvm.org.apache.sedona.stats.Weighting.addBinaryDistanceBandColumn(
114-
dataframe._jdf,
115-
float(threshold),
116-
include_zero_distance_neighbors,
117-
include_self,
118-
geometry,
119-
use_spheroid,
120-
saved_attributes,
121-
result_name,
116+
return DataFrame(
117+
sedona._jvm.org.apache.sedona.stats.Weighting.addDistanceBandColumnPython(
118+
dataframe._jdf,
119+
float(threshold),
120+
True,
121+
float(-1.0),
122+
include_zero_distance_neighbors,
123+
include_self,
124+
float(1.0),
125+
geometry,
126+
use_spheroid,
127+
saved_attributes,
128+
result_name,
129+
),
130+
sedona,
122131
)
123132

124133

@@ -161,15 +170,19 @@ def add_weighted_distance_band_column(
161170
"""
162171
sedona = SparkSession.getActiveSession()
163172

164-
return sedona._jvm.org.apache.sedona.stats.Weighting.addBinaryDistanceBandColumn(
165-
dataframe._jdf,
166-
float(threshold),
167-
float(alpha),
168-
include_zero_distance_neighbors,
169-
include_self,
170-
float(self_weight),
171-
geometry,
172-
use_spheroid,
173-
saved_attributes,
174-
result_name,
173+
return DataFrame(
174+
sedona._jvm.org.apache.sedona.stats.Weighting.addDistanceBandColumnPython(
175+
dataframe._jdf,
176+
float(threshold),
177+
False,
178+
alpha,
179+
include_zero_distance_neighbors,
180+
include_self,
181+
self_weight,
182+
geometry,
183+
use_spheroid,
184+
saved_attributes,
185+
result_name,
186+
),
187+
sedona,
175188
)
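
After this change the three Python wrappers share one JVM entry point, `addDistanceBandColumnPython`, and differ only in the fixed arguments they pass. The sketch below mirrors that mapping without Spark. All function names in it are hypothetical stand-ins, and the remaining arguments (`geometry`, `use_spheroid`, `saved_attributes`, and so on) are omitted for brevity.

```python
from typing import Dict


def distance_band_args(threshold: float, binary: bool,
                       alpha: float, self_weight: float) -> Dict:
    """Hypothetical stand-in for the shared JVM call: it only records its arguments."""
    return {
        "threshold": float(threshold),
        "binary": binary,
        "alpha": float(alpha),
        "self_weight": float(self_weight),
    }


def binary_band_args(threshold: float) -> Dict:
    # Binary weights: alpha and self_weight are fixed to -1.0 and 1.0, as in the diff.
    return distance_band_args(threshold, True, -1.0, 1.0)


def weighted_band_args(threshold: float, alpha: float, self_weight: float) -> Dict:
    # Weighted variant: binary is switched off; alpha and self_weight come from the caller.
    return distance_band_args(threshold, False, alpha, self_weight)


binary_call = binary_band_args(1)
weighted_call = weighted_band_args(1, alpha=-2, self_weight=1)
```

Routing every variant through one function keeps the argument order in a single place, which is the kind of mismatch the removed `addBinaryDistanceBandColumn` call in `add_weighted_distance_band_column` appears to have had.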
