Commit a1a3726

[BW] switch uuid fallback ids to address hashes
1 parent 7a4a2e5 commit a1a3726

2 files changed

Lines changed: 17 additions & 7 deletions

README.md

Lines changed: 5 additions & 2 deletions
@@ -22,11 +22,11 @@ In details, the IDs are sourced as follows:

 |State| ID-Source                                                                                                    | example-id                                                                 |stable|
 |-----|--------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------|------|
-|BW| DISCH (Dienststellenschlüssel) extracted from email, fallback to WFS UUID when not available                | `BW-04154817` or `BW-UUID-00000a15-a965-4999-b9ad-05895eb0fad2`           |✅ likely (~80% with DISCH, ~20% UUID fallback)|
+|BW| DISCH (Dienststellenschlüssel) extracted from email, fallback to address hash when not available (see below) | `BW-04154817` or `BW-FB-e5c29cbf7215726b4f3515cfad6bee63e2a0bb8ded432a34e9e51c4324ec52ea` |✅ likely (~80% with DISCH, ~20% fallback)|
 |BY| id from the WFS service                                                                                      | `BY-SCHUL_SCHULSTANDORTEGRUNDSCHULEN_2acb7d31-915d-40a9-adcf-27b38251fa48` |❓ unlikely (although we reached out to ask for canonical IDs to be published)|
 |BE| Field `bsn` (Berliner Schulnummer) from the WFS Service                                                      | `BE-02K10`                                                                 |✅ likely|
 |BB| Field `schul_nr` (Schulnummer) from the WFS Service                                                          | `BB-111430`                                                                |✅ likely|
-|HB| Field `snr_txt` (Schulnummer) from the INSPIRE shapefile - official 3-digit ID used in Bremen materials     | `HB-002`                                                                   |✅ likely|
+|HB| Field `snr_txt` (Schulnummer) from the INSPIRE shapefile - official 3-digit ID used in Bremen materials      | `HB-002`                                                                   |✅ likely|
 |HH| Field `schul_id` From the WFS Service                                                                        | `HH-7910-0`                                                                |✅ likely|
 |HE| `school_no` URL query param of the school's details page (identical to the Dienststellennummer)              | `HE-4024`                                                                  |✅ likely|
 |MV| Field `dstnr` from the WFS                                                                                   | `MV-75130302`                                                              |✅ likely|
@@ -38,6 +38,9 @@ In details, the IDs are sourced as follows:
 |ST| `OBJECTID` from the ArcGIS FeatureServer API (prefixed with `ARC`)                                           | `ST-ARC00001`                                                              |❓ unlikely (OBJECTID may change on data reimport)|
 |TH| `Schulnummer` from the WFS service                                                                           | `TH-10601`                                                                 |✅ likely|

+For Baden-Württemberg, not all schools have a Dienststellenschlüssel that we can extract. For those that don't,
+we join name, address, zip and city with a single space (" ") between each part and use the SHA-256 hash of the resulting string as the fallback ID.
+
 ## Geolocations
 When available, we try to use the geolocations provided by the data publishers.

jedeschule/spiders/baden_wuerttemberg.py

Lines changed: 12 additions & 5 deletions
@@ -1,3 +1,5 @@
+import hashlib
+
 import re
 import scrapy
 from scrapy import Item
@@ -32,6 +34,8 @@ def extract_disch(email: str | None) -> str | None:
     match = DISCH_RE.search(email.strip())
     return match.group(1) if match else None

+def create_address_based_fallback(address, city, zip):
+    return hashlib.sha256(f"{address} {zip} {city}".encode("utf-8")).hexdigest()

 class BadenWuerttembergSpider(SchoolSpider):
     name = "baden-wuerttemberg"
@@ -136,13 +140,16 @@ def parse(self, response):

     @staticmethod
     def normalize(item: Item) -> School:
-        # Prefer DISCH (stable government ID) over UUID when available
-        disch = item.get("disch")
-        uuid = item.get("uuid")
-        school_id = f"BW-{disch}" if disch else f"BW-UUID-{uuid}"
+        def id():
+            # Prefer DISCH (stable government ID) when available
+            if disch := item.get('disch'):
+                return f'{disch}'
+            key = " ".join([item.get(key) or "" for key in ['name', 'address', 'zip', 'city']])
+            key_hash = hashlib.sha256(key.encode('utf-8')).hexdigest()
+            return f'FB-{key_hash}'

         return School(
-            id=school_id,
+            id=f"BW-{id()}",
             name=item.get("name"),
             address=item.get("address"),
             zip=item.get("zip"),
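
The ID selection introduced in `normalize` can be tried in isolation. The following is a minimal sketch, assuming a plain `dict` stands in for the Scrapy `Item`; the function names `fallback_school_id` and `school_id` and the sample school are hypothetical and are not part of this commit.

```python
import hashlib


def fallback_school_id(item: dict) -> str:
    """Hash name, address, zip and city (space-joined) into a fallback ID."""
    # Missing or None fields are treated as empty strings, mirroring the
    # `item.get(key) or ""` expression in the diff.
    parts = [item.get(field) or "" for field in ("name", "address", "zip", "city")]
    key = " ".join(parts)
    # sha256 needs bytes, so the joined string is encoded as UTF-8 first.
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"BW-FB-{digest}"


def school_id(item: dict) -> str:
    """Prefer the DISCH when present, otherwise use the address hash."""
    if disch := item.get("disch"):
        return f"BW-{disch}"
    return fallback_school_id(item)


# Hypothetical sample data, for illustration only.
with_disch = {"disch": "04154817", "name": "Beispielschule"}
without_disch = {
    "name": "Beispielschule",
    "address": "Hauptstr. 1",
    "zip": "70173",
    "city": "Stuttgart",
}

print(school_id(with_disch))
print(school_id(without_disch))
```

Because the hash depends only on the four text fields, the fallback ID stays the same across crawls as long as the publisher does not change the name or address, and it changes whenever any of those fields is edited.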

0 commit comments
