
Commit ca9ca30

[SEDONA-725] restructure spark python package (#1930)
* SEDONA-725 Add pyflink to Sedona.
* SEDONA-725 rearrange the spark module
* SEDONA-725 rearrange the spark module
* SEDONA-725 rearrange the spark module
* SEDONA-725 rearrange the spark module
* SEDONA-725 rearrange the spark module
1 parent d8abc45 commit ca9ca30

210 files changed

Lines changed: 1144 additions & 862 deletions
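
For readers migrating their own code, the import moves visible in the documentation diffs of this commit can be collected into a lookup table. The sketch below is not part of the commit: `OLD_TO_NEW` and `rewrite_import` are hypothetical names, the mapping covers only the modules that appear in the hunks shown on this page, and only single-line `from X import Y` statements (without a trailing newline) are handled.

```python
import re

# Old module path -> new module path, copied from the documentation hunks
# shown in this commit. Modules not shown there are deliberately absent.
OLD_TO_NEW = {
    "sedona.raster_utils.SedonaUtils": "sedona.spark",
    "sedona.maps.SedonaKepler": "sedona.spark",
    "sedona.maps.SedonaPyDeck": "sedona.spark",
    "sedona.register": "sedona.spark",
    "sedona.utils": "sedona.spark",
    "sedona.utils.adapter": "sedona.spark",
    "sedona.utils.structured_adapter": "sedona.spark",
    "sedona.utils.geoarrow": "sedona.spark.geoarrow",
    "sedona.core.geom.envelope": "sedona.spark",
    "sedona.core.enums": "sedona.spark",
    "sedona.core.spatialOperator": "sedona.spark",
    "sedona.core.SpatialRDD": "sedona.spark",
    "sedona.stats.clustering.dbscan": "sedona.spark.stats",
    "sedona.stats.outlier_detection.local_outlier_factor": "sedona.spark.stats",
    "sedona.stats.weighting": "sedona.spark.stats",
    "sedona.stats.hotspot_detection.getis_ord": "sedona.spark.stats",
    "sedona.stac.client": "sedona.spark.stac",
    # The docs use both sedona.spark.sql.types and sedona.spark for
    # GeometryType; the shorter form is chosen here as an assumption.
    "sedona.sql.types": "sedona.spark",
    "sedona.sql.functions": "sedona.spark",
}

# Matches "from <module> import <names>", keeping any leading indentation.
_FROM_IMPORT = re.compile(r"^(\s*)from\s+([\w.]+)\s+import\s+(.+)$")


def rewrite_import(line: str) -> str:
    """Return the line with its module path updated, or unchanged if unknown."""
    match = _FROM_IMPORT.match(line)
    if match is None:
        return line
    indent, module, names = match.groups()
    new_module = OLD_TO_NEW.get(module)
    if new_module is None:
        return line
    return f"{indent}from {new_module} import {names}"


if __name__ == "__main__":
    print(rewrite_import("from sedona.core.enums import GridType"))
```

Module paths are matched exactly rather than by prefix, so `sedona.utils` and `sedona.utils.geoarrow` resolve to their own targets.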


.pre-commit-config.yaml

Lines changed: 9 additions & 0 deletions
@@ -125,6 +125,15 @@ repos:
 - --license-filepath
 - .github/workflows/license-templates/LICENSE.txt
 - --fuzzy-match-generates-todo
+- id: insert-license
+name: add license for all Python files
+files: \.py$
+args:
+- --comment-style
+- '|# |'
+- --license-filepath
+- .github/workflows/license-templates/LICENSE.txt
+- --fuzzy-match-generates-todo
 - repo: https://github.com/asottile/pyupgrade
 rev: v3.19.1
 hooks:
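
The new hook is limited to Python sources by `files: \.py$`. pre-commit treats that value as a Python regular expression and searches each candidate path for it, so only paths ending in `.py` are selected. A minimal sketch of that selection, assuming `re.search` semantics; the helper name `hook_applies` and the sample paths are illustrative and not taken from the repository:

```python
import re

# The pattern from the hook's `files` key in the diff above.
FILES_PATTERN = re.compile(r"\.py$")


def hook_applies(path: str) -> bool:
    """Return True if the pattern is found anywhere in the path."""
    return FILES_PATTERN.search(path) is not None


if __name__ == "__main__":
    # Hypothetical paths, used only to exercise the pattern.
    for candidate in ["pkg/module.py", "docs/page.md", "pkg/module.pyc"]:
        print(candidate, hook_applies(candidate))
```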

docs/api/sql/Raster-visualizer.md

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ Output:
 Example:

 ```python
-from sedona.raster_utils.SedonaUtils import SedonaUtils
+from sedona.spark import SedonaUtils

 # Or from sedona.spark import *

docs/api/sql/Visualization_SedonaKepler.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ from sedona.spark import *
 Alternatively it can also be imported using:

 ```python
-from sedona.maps.SedonaKepler import SedonaKepler
+from sedona.spark import SedonaKepler
 ```

 Following are details on all the APIs exposed via SedonaKepler:

docs/api/sql/Visualization_SedonaPyDeck.md

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ from sedona.spark import *
 Alternatively it can also be imported using:

 ```python
-from sedona.maps.SedonaPyDeck import SedonaPyDeck
+from sedona.spark import SedonaPyDeck
 ```

 !!!Note

docs/setup/install-python.md

Lines changed: 2 additions & 2 deletions
@@ -85,8 +85,8 @@ SedonaRegistrator is deprecated in Sedona 1.4.1 and later versions. Please use t

 ```python
 from pyspark.sql import SparkSession
-from sedona.register import SedonaRegistrator
-from sedona.utils import SedonaKryoRegistrator, KryoSerializer
+from sedona.spark import SedonaRegistrator
+from sedona.spark import SedonaKryoRegistrator, KryoSerializer

 spark = (
 SparkSession.builder.appName("appName")

docs/tutorial/concepts/clustering-algorithms.md

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ Here are the contents of the DataFrame:
 Here’s how to run the DBSCAN algorithm:

 ```python
-from sedona.stats.clustering.dbscan import dbscan
+from sedona.spark.stats import dbscan

 dbscan(df, 1.0, 3).orderBy("id").show()
 ```

docs/tutorial/files/stac-sedona-spark.md

Lines changed: 1 addition & 1 deletion
@@ -204,7 +204,7 @@ The Python API allows you to interact with a SpatioTemporal Asset Catalog (STAC)
 #### Initialize the Client

 ```python
-from sedona.stac.client import Client
+from sedona.spark.stac import Client

 # Initialize the client
 client = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

docs/tutorial/geopandas-shapely.md

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@ def create_spatial_dataframe(spark: SparkSession, gdf: gpd.GeoDataFrame) -> Data
 Example:

 ```python
-from sedona.utils.geoarrow import create_spatial_dataframe
+from sedona.spark.geoarrow import create_spatial_dataframe

 create_spatial_dataframe(spark, gdf)
 ```

docs/tutorial/rdd.md

Lines changed: 24 additions & 24 deletions
@@ -51,7 +51,7 @@ Please refer to [Create a Geometry type column](sql.md#create-a-geometry-type-co
 === "Python"

 ```python
-from sedona.utils.structured_adapter import StructuredAdapter
+from sedona.spark import StructuredAdapter

 spatialRDD = StructuredAdapter.toSpatialRdd(spatialDf, "usacounty")
 ```
@@ -165,8 +165,8 @@ Assume you now have a SpatialRDD (typed or generic). You can use the following c
 === "Python"

 ```python
-from sedona.core.geom.envelope import Envelope
-from sedona.core.spatialOperator import RangeQuery
+from sedona.spark import Envelope
+from sedona.spark import RangeQuery

 range_query_window = Envelope(-90.01, -80.01, 30.01, 40.01)
 consider_boundary_intersection = False ## Only return gemeotries fully covered by the window
@@ -179,9 +179,9 @@ Assume you now have a SpatialRDD (typed or generic). You can use the following c

 Example:
 ```python
-from sedona.core.geom.envelope import Envelope
-from sedona.core.spatialOperator import RangeQueryRaw
-from sedona.utils.adapter import Adapter
+from sedona.spark import Envelope
+from sedona.spark import RangeQueryRaw
+from sedona.spark import Adapter

 range_query_window = Envelope(-90.01, -80.01, 30.01, 40.01)
 consider_boundary_intersection = False ## Only return gemeotries fully covered by the window
@@ -283,9 +283,9 @@ To utilize a spatial index in a spatial range query, use the following code:
 === "Python"

 ```python
-from sedona.core.geom.envelope import Envelope
-from sedona.core.enums import IndexType
-from sedona.core.spatialOperator import RangeQuery
+from sedona.spark import Envelope
+from sedona.spark import IndexType
+from sedona.spark import RangeQuery

 range_query_window = Envelope(-90.01, -80.01, 30.01, 40.01)
 consider_boundary_intersection = False ## Only return gemeotries fully covered by the window
@@ -379,7 +379,7 @@ Assume you now have a SpatialRDD (typed or generic). You can use the following c
 === "Python"

 ```python
-from sedona.core.spatialOperator import KNNQuery
+from sedona.spark import KNNQuery
 from shapely.geometry import Point

 point = Point(-84.01, 34.01)
@@ -446,8 +446,8 @@ To utilize a spatial index in a spatial KNN query, use the following code:
 === "Python"

 ```python
-from sedona.core.spatialOperator import KNNQuery
-from sedona.core.enums import IndexType
+from sedona.spark import KNNQuery
+from sedona.spark import IndexType
 from shapely.geometry import Point

 point = Point(-84.01, 34.01)
@@ -518,8 +518,8 @@ Assume you now have two SpatialRDDs (typed or generic). You can use the followin
 === "Python"

 ```python
-from sedona.core.enums import GridType
-from sedona.core.spatialOperator import JoinQuery
+from sedona.spark import GridType
+from sedona.spark import JoinQuery

 consider_boundary_intersection = False ## Only return geometries fully covered by each query window in queryWindowRDD
 using_index = False
@@ -610,9 +610,9 @@ To utilize a spatial index in a spatial join query, use the following code:
 === "Python"

 ```python
-from sedona.core.enums import GridType
-from sedona.core.enums import IndexType
-from sedona.core.spatialOperator import JoinQuery
+from sedona.spark import GridType
+from sedona.spark import IndexType
+from sedona.spark import JoinQuery

 object_rdd.spatialPartitioning(GridType.KDBTREE)
 query_window_rdd.spatialPartitioning(object_rdd.getPartitioner())
@@ -676,10 +676,10 @@ The index should be built on either one of two SpatialRDDs. In general, you shou

 Example:
 ```python
-from sedona.core.SpatialRDD import CircleRDD
-from sedona.core.enums import GridType
-from sedona.core.spatialOperator import JoinQueryRaw
-from sedona.utils.structured_adapter import StructuredAdapter
+from sedona.spark import CircleRDD
+from sedona.spark import GridType
+from sedona.spark import JoinQueryRaw
+from sedona.spark import StructuredAdapter

 object_rdd.analyze()

@@ -743,9 +743,9 @@ Assume you now have two SpatialRDDs (typed or generic). You can use the followin
 === "Python"

 ```python
-from sedona.core.SpatialRDD import CircleRDD
-from sedona.core.enums import GridType
-from sedona.core.spatialOperator import JoinQuery
+from sedona.spark import CircleRDD
+from sedona.spark import GridType
+from sedona.spark import JoinQuery

 object_rdd.analyze()

docs/tutorial/sql.md

Lines changed: 15 additions & 15 deletions
@@ -614,7 +614,7 @@ The first parameter is the dataframe, the next two are the epsilon and min_point
 === "Python"

 ```python
-from sedona.stats.clustering.dbscan import dbscan
+from sedona.spark.stats import dbscan

 dbscan(df, 0.1, 5).show()
 ```
@@ -670,7 +670,7 @@ The first parameter is the dataframe, the next is the number of nearest neighbor
 === "Python"

 ```python
-from sedona.stats.outlier_detection.local_outlier_factor import local_outlier_factor
+from sedona.spark.stats import local_outlier_factor

 local_outlier_factor(df, 20).show()
 ```
@@ -737,8 +737,8 @@ Using Gi involves first generating the neighbors list for each record, then call
 === "Python"

 ```python
-from sedona.stats.weighting import add_binary_distance_band_column
-from sedona.stats.hotspot_detection.getis_ord import g_local
+from sedona.spark.stats import add_binary_distance_band_column
+from sedona.spark.stats import g_local

 distance_radius = 1.0
 weighted_df = addBinaryDistanceBandColumn(df, distance_radius)
@@ -966,7 +966,7 @@ This UDF example takes a geometry type input and returns a primitive type output
 === "Python"

 ```python
-from sedona.sql.types import GeometryType
+from sedona.spark.sql.types import GeometryType
 from pyspark.sql.types import DoubleType

 def lengthPoly(geom: GeometryType()):
@@ -1025,7 +1025,7 @@ This UDF example takes a geometry type input and returns a geometry type output:
 === "Python"

 ```python
-from sedona.sql.types import GeometryType
+from sedona.spark import GeometryType
 from pyspark.sql.types import DoubleType

 def bufferFixed(geom: GeometryType()):
@@ -1083,7 +1083,7 @@ This UDF example takes a geometry type input and a primitive type input and retu
 === "Python"

 ```python
-from sedona.sql.types import GeometryType
+from sedona.spark import GeometryType
 from pyspark.sql.types import DoubleType

 def bufferIt(geom: GeometryType(), distance: DoubleType()):
@@ -1165,7 +1165,7 @@ This UDF example takes a geometry type input and a primitive type input and retu
 === "Python"

 ```python
-from sedona.sql.types import GeometryType
+from sedona.spark import GeometryType
 from pyspark.sql.types import *

 schemaUDF = StructType([
@@ -1230,7 +1230,7 @@ a given geometry.

 ```python
 import shapely.geometry.base as b
-from sedona.sql.functions import sedona_vectorized_udf
+from sedona.spark import sedona_vectorized_udf

 @sedona_vectorized_udf(return_type=GeometryType())
 def vectorized_buffer(geom: b.BaseGeometry) -> b.BaseGeometry:
@@ -1241,8 +1241,8 @@ def vectorized_buffer(geom: b.BaseGeometry) -> b.BaseGeometry:

 ```python
 import geopandas as gpd
-from sedona.sql.functions import sedona_vectorized_udf, SedonaUDFType
-from sedona.sql.types import GeometryType
+from sedona.spark import sedona_vectorized_udf, SedonaUDFType
+from sedona.spark import GeometryType


 @sedona_vectorized_udf(udf_type=SedonaUDFType.GEO_SERIES, return_type=GeometryType())
@@ -1339,7 +1339,7 @@ Use SedonaSQL DataFrame-RDD Adapter to convert a DataFrame to an SpatialRDD.
 === "Python"

 ```python
-from sedona.utils.structured_adapter import StructuredAdapter
+from sedona.spark import StructuredAdapter

 spatialRDD = StructuredAdapter.toSpatialRdd(spatialDf, "usacounty")
 ```
@@ -1365,7 +1365,7 @@ Use SedonaSQL DataFrame-RDD Adapter to convert a DataFrame to an SpatialRDD. Ple
 === "Python"

 ```python
-from sedona.utils.adapter import StructuredAdapter
+from sedona.spark import StructuredAdapter

 spatialDf = StructuredAdapter.toDf(spatialRDD, sedona)
 ```
@@ -1401,7 +1401,7 @@ You can use `StructuredAdapter` and the `spatialRDD.spatialPartitioningWithoutDu
 === "Python"

 ```python
-from sedona.utils.structured_adapter import StructuredAdapter
+from sedona.spark import StructuredAdapter

 spatialRDD.spatialPartitioningWithoutDuplicates(GridType.KDBTREE)
 # Specify the desired number of partitions as 10, though the actual number may vary
@@ -1427,7 +1427,7 @@ PairRDD is the result of a spatial join query or distance join query. SedonaSQL
 === "Python"

 ```python
-from sedona.utils.adapter import StructuredAdapter
+from sedona.spark import StructuredAdapter

 joinResultDf = StructuredAdapter.pairRddToDf(result_pair_rdd, leftDf.schema, rightDf.schema, spark)
 ```
