Commit 0bab161

Add comprehensive data documentation to README
Include study area details (Gezira, Sudan), Sentinel-2 raster metadata (dimensions, CRS, resolution, valid pixel count), spectral band wavelengths, index formulas, training dataset statistics (class distribution, splits, feature ranges), and preprocessing pipeline diagram.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
1 parent f7641f6 commit 0bab161

1 file changed

Lines changed: 165 additions & 6 deletions


README.md

@@ -17,6 +17,10 @@ Pixel-level crop classification from **Sentinel-2** satellite imagery using a **
1717
- [Installation](#installation)
1818
- [Usage](#usage)
1919
- [Data](#data)
20+
  - [Study Area](#study-area)
21+
  - [Sentinel-2 Raster Composite](#sentinel-2-raster-composite)
22+
  - [Training Dataset](#training-dataset)
23+
  - [Preprocessing Pipeline](#preprocessing-pipeline)
2024
- [License](#license)
2125

2226
---
@@ -280,14 +284,169 @@ Applies the trained GCN to the Sentinel-2 composite and produces:
280284

281285
## Data
282286

283-
Training data is derived from Sentinel-2 imagery (2020 Q1) over an agricultural region (EPSG:32636, 10 m resolution). The 24-band composite includes:
287+
### Study Area
284288

285-
| Category | Features |
286-
|:---------|:---------|
287-
| **Spectral bands** | B2, B3, B4, B5, B6, B7, B8, B8A, B11, B12 |
288-
| **Vegetation indices** | NDVI, EVI, SAVI, GNDVI, NDRE, NDRE2, NDWI, MNDWI, BSI, NDTI, CIgreen, CIrededge, MSAVI, GCVI |
289+
The study area is located in the **Gezira agricultural region, Sudan**, one of the largest irrigation schemes in Africa, situated between the Blue Nile and White Nile rivers.
289290

290-
> GCVI is dropped during training (duplicate of CIgreen), leaving **23 features**.
291+
| Property | Value |
292+
|:---------|:------|
293+
| **Location** | Gezira State, Sudan |
294+
| **Center coordinates** | 14.053° N, 32.623° E |
295+
| **Coordinate system** | WGS 84 / UTM Zone 36N (EPSG:32636) |
296+
| **Spatial extent** | 22.62 km x 14.24 km (322 km²) |
297+
| **Temporal period** | Q1 2020 (January -- March) |
298+
| **Satellite** | Sentinel-2 (ESA Copernicus) |
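
The EPSG code in the table follows from the center longitude. A minimal sketch of that arithmetic, assuming the standard 6-degree UTM zone layout (the function name `utm_epsg_from_lonlat` is illustrative and not part of this repository):

```python
import math


def utm_epsg_from_lonlat(lon_deg: float, lat_deg: float) -> int:
    """Return the WGS 84 / UTM EPSG code for a longitude/latitude pair.

    Ignores the special-case zones around Norway and Svalbard.
    """
    # Zones are 6 degrees wide and numbered from 1 starting at 180 W.
    zone = int(math.floor((lon_deg + 180.0) / 6.0)) + 1
    # EPSG 326xx covers the northern hemisphere, 327xx the southern.
    base = 32600 if lat_deg >= 0 else 32700
    return base + zone


# Center coordinates from the table above (longitude first).
print(utm_epsg_from_lonlat(32.623, 14.053))  # 32636
```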
299+
300+
---
301+
302+
### Sentinel-2 Raster Composite
303+
304+
The input raster is a multi-temporal composite derived from Sentinel-2 Level-2A (surface reflectance) imagery.
305+
306+
`S2_composite_24bands_2020_Q1.tif`
307+
308+
| Property | Value |
309+
|:---------|:------|
310+
| **Dimensions** | 2,262 x 1,424 pixels |
311+
| **Total pixels** | 3,221,088 |
312+
| **Valid pixels** | 1,043,855 (32.4%) |
313+
| **Spatial resolution** | 10 m |
314+
| **Bands** | 24 (float32) |
315+
| **Compression** | LZW |
316+
| **File size** | ~107 MB |
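
The raster figures above are mutually consistent, which can be checked from the dimensions and resolution alone. The sketch below only redoes the arithmetic; the constant names are illustrative, and the uncompressed size assumes 4 bytes per float32 value with no file overhead.

```python
WIDTH_PX, HEIGHT_PX = 2262, 1424   # raster dimensions
RESOLUTION_M = 10                  # pixel size in metres
VALID_PIXELS = 1_043_855
N_BANDS, BYTES_PER_VALUE = 24, 4   # float32

total_pixels = WIDTH_PX * HEIGHT_PX
width_km = WIDTH_PX * RESOLUTION_M / 1000
height_km = HEIGHT_PX * RESOLUTION_M / 1000
area_km2 = width_km * height_km
valid_fraction = VALID_PIXELS / total_pixels
raw_bytes = total_pixels * N_BANDS * BYTES_PER_VALUE

print(f"total pixels : {total_pixels:,}")
print(f"extent       : {width_km:.2f} km x {height_km:.2f} km ({area_km2:.0f} km2)")
print(f"valid pixels : {valid_fraction:.1%}")
print(f"uncompressed : {raw_bytes / 1e6:.0f} MB")
```

The uncompressed size comes out near 309 MB, so the ~107 MB file reflects roughly a threefold reduction from LZW compression.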
317+
318+
#### Spectral Bands (10)
319+
320+
Sentinel-2 surface reflectance bands covering visible, red-edge, near-infrared, and short-wave infrared wavelengths:
321+
322+
| Band | Name | Wavelength (nm) | Description |
323+
|:-----|:-----|:---------------:|:------------|
324+
| 1 | **B2** | 490 | Blue |
325+
| 2 | **B3** | 560 | Green |
326+
| 3 | **B4** | 665 | Red |
327+
| 4 | **B5** | 705 | Red Edge 1 |
328+
| 5 | **B6** | 740 | Red Edge 2 |
329+
| 6 | **B7** | 783 | Red Edge 3 |
330+
| 7 | **B8** | 842 | Near Infrared (NIR) |
331+
| 8 | **B8A** | 865 | Narrow NIR |
332+
| 9 | **B11** | 1610 | Short-Wave Infrared 1 (SWIR-1) |
333+
| 10 | **B12** | 2190 | Short-Wave Infrared 2 (SWIR-2) |
334+
335+
#### Spectral Indices (14)
336+
337+
Derived vegetation, water, and soil indices computed from the spectral bands:
338+
339+
| Index | Formula | Purpose |
340+
|:------|:--------|:--------|
341+
| **NDVI** | (NIR - Red) / (NIR + Red) | Vegetation greenness |
342+
| **EVI** | 2.5 * (NIR - Red) / (NIR + 6*Red - 7.5*Blue + 1) | Enhanced vegetation (corrects atmospheric effects) |
343+
| **SAVI** | 1.5 * (NIR - Red) / (NIR + Red + 0.5) | Soil-adjusted vegetation |
344+
| **GNDVI** | (NIR - Green) / (NIR + Green) | Green-band vegetation |
345+
| **NDRE** | (NIR - RedEdge1) / (NIR + RedEdge1) | Red-edge vegetation |
346+
| **NDRE2** | (RedEdge3 - RedEdge1) / (RedEdge3 + RedEdge1) | Narrow red-edge vegetation |
347+
| **NDWI** | (Green - NIR) / (Green + NIR) | Open-water detection (McFeeters form) |
348+
| **MNDWI** | (Green - SWIR1) / (Green + SWIR1) | Modified water index (surface water) |
349+
| **BSI** | ((SWIR1 + Red) - (NIR + Blue)) / ((SWIR1 + Red) + (NIR + Blue)) | Bare soil |
350+
| **NDTI** | (SWIR1 - SWIR2) / (SWIR1 + SWIR2) | Non-photosynthetic vegetation / tillage |
351+
| **CIgreen** | (NIR / Green) - 1 | Chlorophyll index (green) |
352+
| **CIrededge** | (NIR / RedEdge1) - 1 | Chlorophyll index (red edge) |
353+
| **MSAVI** | (2*NIR + 1 - sqrt((2*NIR+1)² - 8*(NIR-Red))) / 2 | Modified soil-adjusted vegetation |
354+
| **GCVI** | (NIR / Green) - 1 | Green chlorophyll vegetation index |
355+
356+
> **Note:** GCVI is identical to CIgreen and is dropped during training, leaving **23 features**.
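
The formulas in the table can be written out per pixel as follows. This is a sketch, not the repository's implementation: the function name `spectral_indices` is hypothetical, NIR is assumed to be B8 (not B8A), inputs are assumed to be surface reflectance in the 0 to 1 range, and no guard against zero denominators is included.

```python
import math


def spectral_indices(b2, b3, b4, b5, b7, b8, b11, b12):
    """Compute the 13 retained indices from reflectance values (GCVI omitted)."""
    blue, green, red = b2, b3, b4
    re1, re3, nir = b5, b7, b8      # assumption: NIR = B8
    swir1, swir2 = b11, b12
    return {
        "NDVI": (nir - red) / (nir + red),
        "EVI": 2.5 * (nir - red) / (nir + 6 * red - 7.5 * blue + 1),
        "SAVI": 1.5 * (nir - red) / (nir + red + 0.5),
        "GNDVI": (nir - green) / (nir + green),
        "NDRE": (nir - re1) / (nir + re1),
        "NDRE2": (re3 - re1) / (re3 + re1),
        "NDWI": (green - nir) / (green + nir),
        "MNDWI": (green - swir1) / (green + swir1),
        "BSI": ((swir1 + red) - (nir + blue)) / ((swir1 + red) + (nir + blue)),
        "NDTI": (swir1 - swir2) / (swir1 + swir2),
        "CIgreen": nir / green - 1,
        "CIrededge": nir / re1 - 1,
        "MSAVI": (2 * nir + 1 - math.sqrt((2 * nir + 1) ** 2 - 8 * (nir - red))) / 2,
    }


# Illustrative reflectance values, not taken from the dataset.
sample = spectral_indices(b2=0.05, b3=0.10, b4=0.10, b5=0.15,
                          b7=0.22, b8=0.30, b11=0.20, b12=0.15)
for name, value in sample.items():
    print(f"{name:10s} {value: .4f}")
```

With the formulas as written, NDWI is the exact negative of GNDVI, so the two features carry the same information up to sign.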
357+
358+
---
359+
360+
### Training Dataset
361+
362+
Labeled ground-truth samples extracted from the raster at known crop field locations.
363+
364+
`crop_training_data_5classes_2020.csv`
365+
366+
| Property | Value |
367+
|:---------|:------|
368+
| **Total samples** | 24,556 |
369+
| **After deduplication** | 24,556 (no duplicates) |
370+
| **Features** | 23 (after dropping GCVI) |
371+
| **Missing values** | 0 |
372+
| **File size** | ~8.6 MB |
373+
374+
#### Class Distribution
375+
376+
| Class ID | Class Name | Samples | Percentage | Category |
377+
|:--------:|:-----------|--------:|:----------:|:---------|
378+
| 0 | **Cotton** | 337 | 1.4% | Minority |
379+
| 1 | **Wheat** | 7,901 | 32.2% | Majority |
380+
| 2 | **Fallow** | 11,150 | 45.4% | Majority |
381+
| 3 | **Grass** | 5,024 | 20.5% | Moderate |
382+
| 4 | **Water** | 144 | 0.6% | Minority |
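
The percentages follow directly from the sample counts. A short sketch that recomputes them and quantifies the imbalance between the largest and smallest class (variable names are illustrative):

```python
CLASS_COUNTS = {
    "Cotton": 337,
    "Wheat": 7901,
    "Fallow": 11150,
    "Grass": 5024,
    "Water": 144,
}

total = sum(CLASS_COUNTS.values())
shares = {name: round(100 * n / total, 1) for name, n in CLASS_COUNTS.items()}
imbalance_ratio = max(CLASS_COUNTS.values()) / min(CLASS_COUNTS.values())

print(total)                      # 24556
print(shares)
print(round(imbalance_ratio, 1))  # 77.4
```

Fallow outnumbers Water by roughly 77 to 1, which is why the split described below is stratified.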
383+
384+
#### Data Split
385+
386+
The dataset is split using stratified random sampling (seed=42) to preserve class proportions:
387+
388+
| Split | Percentage | Samples | Purpose |
389+
|:------|:----------:|--------:|:--------|
390+
| **Train** | 70% | 17,189 | Model training + scaler fitting |
391+
| **Validation** | 15% | 3,684 | Early stopping & hyperparameter selection |
392+
| **Test** | 15% | 3,683 | Final unbiased evaluation |
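
A stratified split shuffles and partitions each class separately, so minority classes such as Water appear in every subset. The dependency-free sketch below illustrates the idea; it is an assumption about the approach, not the repository's code (scikit-learn's `train_test_split` with `stratify=` is the more common tool). Because of per-class rounding, its subset sizes may differ from the table by a sample or two.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to the test set
    return train, val, test


# Synthetic labels with the class counts from the distribution table.
counts = [337, 7901, 11150, 5024, 144]
labels = [cls for cls, n in enumerate(counts) for _ in range(n)]
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```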
393+
394+
#### Feature Value Ranges
395+
396+
| Feature | Min | Max | Mean | Std |
397+
|:--------|----:|----:|-----:|----:|
398+
| B2 | 0.0124 | 0.1406 | 0.0567 | 0.0321 |
399+
| B3 | 0.0336 | 0.1806 | 0.0912 | 0.0378 |
400+
| B4 | 0.0179 | 0.2885 | 0.1053 | 0.0682 |
401+
| B5 | 0.0379 | 0.3036 | 0.1490 | 0.0649 |
402+
| B6 | 0.0152 | 0.3286 | 0.1973 | 0.0792 |
403+
| B7 | 0.0187 | 0.3856 | 0.2172 | 0.0906 |
404+
| B8 | 0.0211 | 0.4196 | 0.2320 | 0.0989 |
405+
| B8A | 0.0173 | 0.3943 | 0.2366 | 0.0945 |
406+
| B11 | 0.0276 | 0.3987 | 0.2073 | 0.0641 |
407+
| B12 | 0.0223 | 0.3001 | 0.1526 | 0.0660 |
408+
| NDVI | -0.3794 | 0.9099 | 0.4226 | 0.2892 |
409+
| EVI | -0.1346 | 0.7345 | 0.1951 | 0.1741 |
410+
| SAVI | -0.0748 | 0.6408 | 0.2633 | 0.1842 |
411+
| GNDVI | -0.1824 | 0.8117 | 0.4384 | 0.2215 |
412+
| NDRE | -0.1618 | 0.4439 | 0.1583 | 0.1397 |
413+
| NDRE2 | -0.1485 | 0.2174 | 0.0792 | 0.0733 |
414+
| NDWI | -0.4701 | 0.5739 | 0.1183 | 0.2075 |
415+
| MNDWI | -0.6554 | 0.3752 | -0.3611 | 0.1686 |
416+
| BSI | -0.4143 | 0.2164 | -0.0017 | 0.1064 |
417+
| NDTI | 0.0034 | 0.2614 | 0.1193 | 0.0414 |
418+
| CIgreen | -0.2674 | 11.6385 | 2.0076 | 2.0345 |
419+
| CIrededge | -0.2386 | 3.8363 | 0.5632 | 0.5991 |
420+
| MSAVI | -0.1033 | 0.5920 | 0.1676 | 0.1568 |
421+
422+
---
423+
424+
### Preprocessing Pipeline
425+
426+
```
427+
Raw CSV (24,556 samples, 28 columns)
428+
|
429+
v
430+
Drop metadata columns (system:index, .geo)
431+
|
432+
v
433+
Drop GCVI (duplicate of CIgreen)
434+
|
435+
v
436+
Remove duplicates (0 found)
437+
|
438+
v
439+
Stratified train/val/test split (70/15/15)
440+
|
441+
v
442+
StandardScaler (fit on train set only)
443+
|
444+
v
445+
KNN Graph Construction (k=8 neighbors)
446+
|
447+
v
448+
PyTorch Geometric Data Object
449+
```
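
The last steps of the pipeline can be sketched without external libraries. The helpers below (`fit_standardizer`, `transform`, `knn_edges`) are hypothetical stand-ins for `StandardScaler` and the KNN graph builder; the sketch assumes neighbors are found by Euclidean distance in standardized feature space, which the diagram does not state. The two returned lists correspond to the rows of an edge index in the `[2, num_edges]` layout that PyTorch Geometric uses.

```python
import math


def fit_standardizer(train_rows):
    """Compute per-feature mean and population std from the training rows only."""
    n = len(train_rows)
    dims = range(len(train_rows[0]))
    means = [sum(row[j] for row in train_rows) / n for j in dims]
    # A zero-variance feature gets scale 1.0 to avoid division by zero.
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in train_rows) / n) or 1.0
        for j in dims
    ]
    return means, stds


def transform(rows, means, stds):
    """Apply training-set statistics to any split (train, val, or test)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def knn_edges(rows, k=8):
    """Brute-force kNN: one directed edge (neighbor -> node) per nearest neighbor."""
    sources, targets = [], []
    for i, a in enumerate(rows):
        dists = sorted((math.dist(a, b), j) for j, b in enumerate(rows) if j != i)
        for _, j in dists[:k]:
            sources.append(j)
            targets.append(i)
    return sources, targets


# Toy data: three training rows with two features, the second one constant.
train = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
means, stds = fit_standardizer(train)
scaled = transform(train, means, stds)
edge_src, edge_dst = knn_edges(scaled, k=1)
print(edge_src, edge_dst)
```

Fitting the statistics on the training rows and reusing them for validation and test data keeps information from the held-out sets out of the scaler. The brute-force search is quadratic in the number of samples, so a tree-based or approximate neighbor search would be preferable at the full dataset size.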
291450

292451
## License
293452
