Skip to content

Commit 1afd1b1

Browse files
authored
[GH-2700] Add 02-vegetation-change notebook: end-to-end raster workflow (#2896)
1 parent cfc80d5 commit 1afd1b1

1 file changed

Lines changed: 257 additions & 0 deletions

File tree

Lines changed: 257 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,257 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"<!--\n",
8+
" Licensed to the Apache Software Foundation (ASF) under one\n",
9+
" or more contributor license agreements. See the NOTICE file\n",
10+
" distributed with this work for additional information\n",
11+
" regarding copyright ownership. The ASF licenses this file\n",
12+
" to you under the Apache License, Version 2.0 (the\n",
13+
" \"License\"); you may not use this file except in compliance\n",
14+
" with the License. You may obtain a copy of the License at\n",
15+
"\n",
16+
" http://www.apache.org/licenses/LICENSE-2.0\n",
17+
"\n",
18+
" Unless required by applicable law or agreed to in writing,\n",
19+
" software distributed under the License is distributed on an\n",
20+
" \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
21+
" KIND, either express or implied. See the License for the\n",
22+
" specific language governing permissions and limitations\n",
23+
" under the License.\n",
24+
"-->\n",
25+
"\n",
26+
"# Did this farmland green up?\n",
27+
"\n",
28+
"A complete raster pipeline. We answer:\n",
29+
"\n",
30+
"> **Between two satellite scenes a season apart, which farm parcels in this AOI greened up the most?**\n",
31+
"\n",
32+
"End-to-end on Sedona's raster surface: load two GeoTIFFs through the new tiling raster data source, compute NDVI per scene with `RS_MapAlgebra`, take their difference for a \"greening\" delta raster, run `RS_ZonalStats` per parcel, rank, clip the delta raster to the top parcel with `RS_Clip`, round-trip through `RS_AsCOG` to prove the result is a valid Cloud-Optimized GeoTIFF, and visualize the four stages as a single matplotlib figure.\n",
33+
"\n",
34+
"The two source scenes are **synthesized** at the top of the notebook (256 \u00d7 256 px, 2 bands each: red + NIR, EPSG:4326) so the notebook is fully reproducible and ships no new bytes. The same code path runs unchanged against real Sentinel-2 chips \u2014 only the raster filenames change.\n",
35+
"\n",
36+
"**Requires Sedona \u2265 1.9.0.** The `raster` data source (auto-tiling GeoTIFF reader, GH-2672) and `RS_AsCOG` (GH-2652) land in 1.9.0; the notebook will fail on older versions of the docker image."
37+
]
38+
},
39+
{
40+
"cell_type": "markdown",
41+
"metadata": {},
42+
"source": [
43+
"## 1. Connect to Spark through SedonaContext"
44+
]
45+
},
46+
{
47+
"cell_type": "code",
48+
"execution_count": null,
49+
"metadata": {},
50+
"outputs": [],
51+
"source": [
52+
"from sedona.spark import SedonaContext\n",
53+
"\n",
54+
"config = SedonaContext.builder().master(\"spark://localhost:7077\").getOrCreate()\n",
55+
"sedona = SedonaContext.create(config)"
56+
]
57+
},
58+
{
59+
"cell_type": "markdown",
60+
"metadata": {},
61+
"source": [
62+
"## 2. Synthesize two scenes a season apart\n",
63+
"\n",
64+
"We build two 256\u00d7256, 2-band GeoTIFFs in `/tmp` representing red and near-infrared reflectance over the same AOI. The \"before\" scene is mostly bare; the \"after\" scene has a circular field of vegetation in the south-west corner with elevated NIR. Pixel values are scaled to the same 0-10000 reflectance range as Sentinel-2 surface-reflectance products so the NDVI math is identical to the real-world case."
65+
]
66+
},
67+
{
68+
"cell_type": "code",
69+
"execution_count": null,
70+
"metadata": {},
71+
"outputs": [],
72+
"source": "import os\n\nimport numpy as np\nimport rasterio\nfrom rasterio.transform import from_bounds\n\nWORK = \"/tmp/veg-change\"\nos.makedirs(WORK, exist_ok=True)\n\nAOI = (-91.10, 41.50, -91.00, 41.60) # (xmin, ymin, xmax, ymax) in EPSG:4326\nW = H = 256\ntransform = from_bounds(*AOI, W, H)\nrng = np.random.default_rng(42)\n\nys, xs = np.mgrid[0:H, 0:W]\nfield_mask = ((xs - 64) ** 2 + (ys - 192) ** 2) < 50**2 # circular field\n\n\ndef synth_scene(green_strength):\n red = (1500 + 200 * rng.standard_normal((H, W))).clip(0, 10000)\n nir = (1800 + 200 * rng.standard_normal((H, W))).clip(0, 10000)\n nir = np.where(field_mask, nir + green_strength, nir)\n return red.astype(\"uint16\"), nir.astype(\"uint16\")\n\n\nfor name, strength in [(\"before\", 200), (\"after\", 4000)]:\n red, nir = synth_scene(strength)\n with rasterio.open(\n f\"{WORK}/scene_{name}.tif\",\n \"w\",\n driver=\"GTiff\",\n tiled=True,\n blockxsize=256,\n blockysize=256,\n height=H,\n width=W,\n count=2,\n dtype=\"uint16\",\n crs=\"EPSG:4326\",\n transform=transform,\n ) as dst:\n dst.write(red, 1)\n dst.write(nir, 2)\n dst.set_band_description(1, \"red\")\n dst.set_band_description(2, \"nir\")\n\nfor f in (f\"{WORK}/scene_before.tif\", f\"{WORK}/scene_after.tif\"):\n print(f\"{f}: {os.path.getsize(f)} bytes\")"
73+
},
74+
{
75+
"cell_type": "markdown",
76+
"metadata": {},
77+
"source": "## 3. Load both scenes with the `raster` data source\n\n`sedona.read.format(\"raster\")` (new in 1.9) auto-tiles GeoTIFFs and yields one `Raster`-typed row per tile. Each row carries `(x, y)` tile-index columns \u2014 keep them; we'll join on those when computing \u0394NDVI so the same query works for single-tile inputs (as here) and for multi-gigabyte rasters that yield thousands of tiles per scene."
78+
},
79+
{
80+
"cell_type": "code",
81+
"execution_count": null,
82+
"metadata": {},
83+
"outputs": [],
84+
"source": "scenes = (\n sedona.read.format(\"raster\")\n .load(f\"{WORK}/scene_*.tif\")\n .selectExpr(\"split(name, '\\\\\\\\.')[0] as scene\", \"x\", \"y\", \"rast\")\n)\nscenes.cache()\nscenes.show(truncate=80)"
85+
},
86+
{
87+
"cell_type": "markdown",
88+
"metadata": {},
89+
"source": [
90+
"## 4. Compute NDVI per scene with `RS_MapAlgebra`\n",
91+
"\n",
92+
"`RS_MapAlgebra(raster, pixelType, script)` runs a single-raster pixel script. The script syntax (`out[0] = ...; rast[i]` indexing) is documented [here](../api/sql/Raster-map-algebra.md). We coerce the result to `D` (double) so the negative side of NDVI survives."
93+
]
94+
},
95+
{
96+
"cell_type": "code",
97+
"execution_count": null,
98+
"metadata": {},
99+
"outputs": [],
100+
"source": "scenes.createOrReplaceTempView(\"scenes\")\n\nndvi = sedona.sql(\"\"\"\n SELECT scene, x, y,\n RS_MapAlgebra(\n rast, 'D',\n 'out[0] = (rast[1] - rast[0]) / (rast[1] + rast[0] + 0.000001);'\n ) AS rast\n FROM scenes\n\"\"\")\nndvi.cache()\nndvi.createOrReplaceTempView(\"ndvi\")\nndvi.selectExpr(\"scene\", \"x\", \"y\", \"RS_BandPixelType(rast, 1) as dtype\").show()"
101+
},
102+
{
103+
"cell_type": "markdown",
104+
"metadata": {},
105+
"source": "## 5. Compute the greening delta raster\n\nTwo-raster `RS_MapAlgebra(rast0, rast1, pixelType, script, noDataValue)` subtracts the before-NDVI from the after-NDVI in a single pass. We join the two NDVI dataframes on the tile coordinates `(x, y)` so the same query works whether each scene produced a single tile (as here) or thousands."
106+
},
107+
{
108+
"cell_type": "code",
109+
"execution_count": null,
110+
"metadata": {},
111+
"outputs": [],
112+
"source": "delta = sedona.sql(\"\"\"\n SELECT a.x, a.y,\n RS_MapAlgebra(\n a.rast, b.rast, 'D',\n 'out[0] = rast0[0] - rast1[0];',\n -9999.0\n ) AS rast\n FROM (SELECT x, y, rast FROM ndvi WHERE scene = 'scene_after') a\n JOIN (SELECT x, y, rast FROM ndvi WHERE scene = 'scene_before') b\n ON a.x = b.x AND a.y = b.y\n\"\"\")\ndelta.cache()\ndelta.createOrReplaceTempView(\"delta\")\ndelta.selectExpr(\n \"x\",\n \"y\",\n \"RS_BandPixelType(rast, 1) as dtype\",\n \"RS_Width(rast) as w\",\n \"RS_Height(rast) as h\",\n).show()"
113+
},
114+
{
115+
"cell_type": "markdown",
116+
"metadata": {},
117+
"source": "## 6. Synthesize parcel polygons and run `RS_ZonalStats`\n\nBuild a 4\u00d74 grid of square parcels covering the AOI and compute the mean \u0394NDVI per parcel. `RS_ZonalStats(raster, zone, statType)` is the canonical raster\u2192vector aggregation: every pixel inside `zone` contributes to the statistic.\n\nFor tiled inputs, the cross-join below produces one row per `(parcel, tile)`; with our single-tile delta it collapses to one row per parcel. To handle truly tiled inputs robustly you would compute `RS_ZonalStats(rast, geom, 'sum')` and `'count'` per tile and aggregate `SUM(sum) / SUM(count)` per parcel \u2014 same idiom, one extra `GROUP BY`."
118+
},
119+
{
120+
"cell_type": "code",
121+
"execution_count": null,
122+
"metadata": {},
123+
"outputs": [],
124+
"source": [
125+
"from pyspark.sql import Row\n",
126+
"\n",
127+
"xmin, ymin, xmax, ymax = AOI\n",
128+
"step_x = (xmax - xmin) / 4\n",
129+
"step_y = (ymax - ymin) / 4\n",
130+
"parcel_rows = []\n",
131+
"for ix in range(4):\n",
132+
" for iy in range(4):\n",
133+
" x0, x1 = xmin + ix * step_x, xmin + (ix + 1) * step_x\n",
134+
" y0, y1 = ymin + iy * step_y, ymin + (iy + 1) * step_y\n",
135+
" wkt = f\"POLYGON(({x0} {y0}, {x1} {y0}, {x1} {y1}, {x0} {y1}, {x0} {y0}))\"\n",
136+
" parcel_rows.append(Row(parcel_id=f\"P{ix}{iy}\", wkt=wkt))\n",
137+
"\n",
138+
"parcels = sedona.createDataFrame(parcel_rows).selectExpr(\n",
139+
" \"parcel_id\", \"ST_GeomFromText(wkt) as geom\"\n",
140+
")\n",
141+
"parcels.createOrReplaceTempView(\"parcels\")\n",
142+
"\n",
143+
"ranked = sedona.sql(\"\"\"\n",
144+
" SELECT p.parcel_id,\n",
145+
" ROUND(RS_ZonalStats(d.rast, p.geom, 'mean'), 4) AS mean_delta_ndvi\n",
146+
" FROM parcels p, delta d\n",
147+
" ORDER BY mean_delta_ndvi DESC\n",
148+
"\"\"\")\n",
149+
"ranked.show()"
150+
]
151+
},
152+
{
153+
"cell_type": "markdown",
154+
"metadata": {},
155+
"source": [
156+
"## 7. Clip the delta raster to the top-greening parcel\n",
157+
"\n",
158+
"`RS_Clip(raster, band, geom)` crops the raster to the polygon's extent. We pick the parcel with the highest mean \u0394NDVI and zoom in on its delta values."
159+
]
160+
},
161+
{
162+
"cell_type": "code",
163+
"execution_count": null,
164+
"metadata": {},
165+
"outputs": [],
166+
"source": [
167+
"top_id = ranked.first()[\"parcel_id\"]\n",
168+
"print(f\"top-greening parcel: {top_id}\")\n",
169+
"\n",
170+
"clipped = sedona.sql(f\"\"\"\n",
171+
" SELECT RS_Clip(d.rast, 1, p.geom) AS rast\n",
172+
" FROM delta d, parcels p\n",
173+
" WHERE p.parcel_id = '{top_id}'\n",
174+
"\"\"\")\n",
175+
"clipped.cache()\n",
176+
"clipped.createOrReplaceTempView(\"clipped\")\n",
177+
"clipped.selectExpr(\n",
178+
" \"RS_Width(rast) as w\",\n",
179+
" \"RS_Height(rast) as h\",\n",
180+
").show()"
181+
]
182+
},
183+
{
184+
"cell_type": "markdown",
185+
"metadata": {},
186+
"source": [
187+
"## 8. Round-trip through `RS_AsCOG`\n",
188+
"\n",
189+
"`RS_AsCOG` returns the raster as a binary Cloud-Optimized GeoTIFF byte array. We persist it to disk, then read it back with the same `raster` data source we used to load the input \u2014 proving the output is a valid GeoTIFF that downstream tools can stream from object storage."
190+
]
191+
},
192+
{
193+
"cell_type": "code",
194+
"execution_count": null,
195+
"metadata": {},
196+
"outputs": [],
197+
"source": "cog_bytes = clipped.selectExpr(\"RS_AsCOG(rast) as cog\").first()[\"cog\"]\nwith open(f\"{WORK}/delta_topparcel_cog.tif\", \"wb\") as fh:\n fh.write(cog_bytes)\nprint(f\"COG size: {len(cog_bytes):,} bytes\")\n\nroundtrip = sedona.read.format(\"raster\").load(f\"{WORK}/delta_topparcel_cog.tif\")\nroundtrip.selectExpr(\n \"RS_Width(rast) as w\",\n \"RS_Height(rast) as h\",\n \"RS_BandPixelType(rast, 1) as dtype\",\n).show()"
198+
},
199+
{
200+
"cell_type": "markdown",
201+
"metadata": {},
202+
"source": [
203+
"## 9. Visualize the pipeline as a 4-panel figure\n",
204+
"\n",
205+
"Read the rasters back via `rasterio` and lay them out side by side: NDVI before, NDVI after, \u0394NDVI across the full AOI with parcel boundaries overlaid, and the top-parcel close-up."
206+
]
207+
},
208+
{
209+
"cell_type": "code",
210+
"execution_count": null,
211+
"metadata": {},
212+
"outputs": [],
213+
"source": "import matplotlib.patches as mpatches\nimport matplotlib.pyplot as plt\n\n\ndef ndvi_arr(path):\n with rasterio.open(path) as ds:\n red = ds.read(1).astype(\"float32\")\n nir = ds.read(2).astype(\"float32\")\n return (nir - red) / (nir + red + 1e-6)\n\n\nndvi_before = ndvi_arr(f\"{WORK}/scene_before.tif\")\nndvi_after = ndvi_arr(f\"{WORK}/scene_after.tif\")\ndelta_arr = ndvi_after - ndvi_before\nwith rasterio.open(f\"{WORK}/delta_topparcel_cog.tif\") as ds:\n top_arr = ds.read(1)\n top_extent = (ds.bounds.left, ds.bounds.right, ds.bounds.bottom, ds.bounds.top)\n\nfig, axes = plt.subplots(1, 4, figsize=(16, 4))\nextent = (AOI[0], AOI[2], AOI[1], AOI[3])\naxes[0].imshow(ndvi_before, vmin=-0.2, vmax=0.8, cmap=\"RdYlGn\", extent=extent)\naxes[0].set_title(\"NDVI before\")\naxes[1].imshow(ndvi_after, vmin=-0.2, vmax=0.8, cmap=\"RdYlGn\", extent=extent)\naxes[1].set_title(\"NDVI after\")\naxes[2].imshow(delta_arr, vmin=-0.5, vmax=0.5, cmap=\"PiYG\", extent=extent)\naxes[2].set_title(\"\u0394NDVI (after \u2212 before) with parcel grid\")\nxmin, ymin, xmax, ymax = AOI\nstep_x = (xmax - xmin) / 4\nstep_y = (ymax - ymin) / 4\nfor ix in range(4):\n for iy in range(4):\n x0 = xmin + ix * step_x\n y0 = ymin + iy * step_y\n axes[2].add_patch(\n mpatches.Rectangle(\n (x0, y0), step_x, step_y, fill=False, edgecolor=\"black\", linewidth=0.6\n )\n )\n axes[2].text(\n x0 + step_x / 2,\n y0 + step_y / 2,\n f\"P{ix}{iy}\",\n ha=\"center\",\n va=\"center\",\n fontsize=7,\n color=\"black\",\n )\naxes[3].imshow(top_arr, vmin=-0.5, vmax=0.5, cmap=\"PiYG\", extent=top_extent)\naxes[3].set_title(f\"Top parcel ({top_id}) \u0394NDVI\")\nfor ax in axes:\n ax.set_xticks([])\n ax.set_yticks([])\nfig.tight_layout()\nfig"
214+
},
215+
{
216+
"cell_type": "markdown",
217+
"metadata": {},
218+
"source": [
219+
"## What just happened?\n",
220+
"\n",
221+
"We turned two synthetic Sentinel-2-style scenes into a ranked parcel-level greening report in nine steps:\n",
222+
"\n",
223+
"1. Synthesized two 256\u00d7256 red+NIR GeoTIFFs in `/tmp`.\n",
224+
"2. Loaded both with `sedona.read.format(\"raster\")` \u2014 the new auto-tiling reader handles GeoTIFFs of any size with the same call shape.\n",
225+
"3. Computed per-pixel NDVI per scene with single-raster `RS_MapAlgebra`.\n",
226+
"4. Subtracted the two NDVI rasters with two-raster `RS_MapAlgebra` to get a \u0394NDVI \"greening\" layer.\n",
227+
"5. Synthesized a 4\u00d74 parcel grid and ranked parcels by mean \u0394NDVI via `RS_ZonalStats`.\n",
228+
"6. Clipped the delta raster to the top-greening parcel with `RS_Clip`.\n",
229+
"7. Encoded the result as a Cloud-Optimized GeoTIFF with `RS_AsCOG` and read it back with the `raster` data source \u2014 proving the output is valid for cloud-hosted streaming.\n",
230+
"8. Plotted the four pipeline stages side by side.\n",
231+
"\n",
232+
"Swap `/tmp/scene_*.tif` for a glob over real Sentinel-2 chips and the same SQL runs on a cluster against terabytes of imagery."
233+
]
234+
}
235+
],
236+
"metadata": {
237+
"kernelspec": {
238+
"display_name": "Python 3",
239+
"language": "python",
240+
"name": "python3"
241+
},
242+
"language_info": {
243+
"codemirror_mode": {
244+
"name": "ipython",
245+
"version": 3
246+
},
247+
"file_extension": ".py",
248+
"mimetype": "text/x-python",
249+
"name": "python",
250+
"nbconvert_exporter": "python",
251+
"pygments_lexer": "ipython3",
252+
"version": "3.11"
253+
}
254+
},
255+
"nbformat": 4,
256+
"nbformat_minor": 4
257+
}

0 commit comments

Comments
 (0)