[HE] Add geolocation support from OpenStreetMap iframes by tifa365 · Pull Request #220 · Datenschule/jedeschule-scraper

tifa365 · 2025-10-25T12:09:18Z

Adds geolocation support to the Hessen school scraper by extracting coordinates from OpenStreetMap iframes embedded on school detail pages.

Summary

Extracts coordinates from OpenStreetMap iframes and links on school detail pages
Achieves 90.7% coverage (1,863 out of 2,054 schools with coordinates)
Uses only standard library parsing (no new dependencies)
Converts empty strings to None for consistent data handling
Filters out placeholder coordinates (-1.0, -1.0) used by Hessen DB

Implementation details

Added _extract_coords_from_osm_url() method with three fallback strategies:
1. marker parameter (most precise)
2. mlat/mlon parameters
3. bbox center (least precise)
Updated normalize() to include latitude/longitude fields and convert empty strings to None
Added comprehensive code comments for better maintainability
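
The three fallback strategies listed above could be sketched roughly as follows. This is a standalone illustration, not the code from the PR: the function name `extract_coords_from_osm_url`, the assumed `marker=lat,lon` value format, and the sample URL are assumptions made for the example. It uses only `urllib.parse` from the standard library.

```python
from urllib.parse import parse_qs, urlparse


def extract_coords_from_osm_url(url):
    """Return (latitude, longitude) from an OSM URL, or (None, None)."""
    qs = parse_qs(urlparse(url).query)

    # 1. marker parameter, assumed here to be "lat,lon" (most precise)
    if "marker" in qs:
        try:
            lat, lon = map(float, qs["marker"][0].split(","))
            return lat, lon
        except ValueError:
            pass  # malformed value: fall through to the next strategy

    # 2. separate mlat/mlon parameters
    if "mlat" in qs and "mlon" in qs:
        try:
            return float(qs["mlat"][0]), float(qs["mlon"][0])
        except ValueError:
            pass

    # 3. center of the bounding box "west,south,east,north" (least precise)
    if "bbox" in qs:
        try:
            west, south, east, north = map(float, qs["bbox"][0].split(","))
            return (south + north) / 2.0, (west + east) / 2.0
        except ValueError:
            pass

    return None, None


# Hypothetical embed URL: the marker wins over the bbox center.
example = (
    "https://www.openstreetmap.org/export/embed.html"
    "?bbox=8.0,50.0,9.0,51.0&layer=mapnik&marker=50.25,8.75"
)
print(extract_coords_from_osm_url(example))
```

Both a non-numeric value and a wrong number of comma-separated parts raise `ValueError`, so a single specific `except` clause is enough in this sketch.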

Test plan

Tested with small sample (3 schools)
Verified full scraper run (2,054 schools in ~7.5 minutes)
Confirmed 90.7% geolocation coverage through analysis
Verified empty string to None conversion works correctly

Fixes #203

k-nut

Thank you for your work on this.

I left a couple of comments where I think the code can be simplified further.

I would also prefer it if you removed the comments again. I think comments should (mostly) document the why, not the what, and the added comments/docstrings mostly just explain the lines below them.

k-nut · 2025-10-27T12:29:08Z

+        # Try mlat/mlon parameters
+        if "mlat" in qs and "mlon" in qs:
+            try:
+                return float(qs["mlat"][0]), float(qs["mlon"][0])
+            except Exception:
+                pass
+
+        # Fallback: bbox center
+        if "bbox" in qs and qs["bbox"]:
+            try:
+                west, south, east, north = map(float, qs["bbox"][0].split(",", 3))
+                return (south + north) / 2.0, (west + east) / 2.0
+            except Exception:
+                pass
+


I don't think this ever happens. There either is a marker or there is no map at all. Or do you have an example where one of these cases would trigger?

k-nut · 2025-10-27T12:30:29Z

+        if not url or "openstreetmap.org" not in url:
+            return None, None
+


There is no need to check this since you have the url in the selector already and only call this function if a match is found (you could try adding a type annotation to make it explicit that we don't want this to be an optional parameter but always a string).

k-nut · 2025-10-27T12:30:58Z

+        # Fallback: try "Größere Karte" link
+        if latitude is None:
+            osm_link = response.xpath('//a[contains(@href, "openstreetmap.org")]/@href').get()
+            if osm_link:
+                latitude, longitude = self._extract_coords_from_osm_url(osm_link)
+


Do we ever need this? If so, an example link would be great.

k-nut · 2025-10-27T12:31:09Z

+        if latitude == -1.0 and longitude == -1.0:
+            latitude = None
+            longitude = None


It would also be nice to have an example for this case.

tifa365 · Nov 4, 2025

Will add an example as a comment in the code. This seems to be a placeholder value, though it might not be the best choice as a substitute for NA.

https://schul-db.bildung.hessen.de/schul_db.html/details/?school_no=9642
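
A minimal sketch of the placeholder filtering discussed here, assuming the coordinates have already been parsed into floats. The helper name `drop_placeholder` and the constant `PLACEHOLDER` are hypothetical and only serve the illustration.

```python
# Hypothetical constant for the (-1.0, -1.0) pair the Hessen DB uses as a stand-in.
PLACEHOLDER = (-1.0, -1.0)


def drop_placeholder(latitude, longitude):
    """Map the placeholder pair to (None, None); pass real values through."""
    if (latitude, longitude) == PLACEHOLDER:
        return None, None
    return latitude, longitude


print(drop_placeholder(-1.0, -1.0))
print(drop_placeholder(50.5, 8.5))
```

Comparing the pair as a whole keeps a coordinate that merely has one component equal to -1.0 from being discarded.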

k-nut

I still think it would be good to remove most of the added comments again (a notable exception being the one with the example for the (-1.0, -1.0) school). Other than that, looks good to me!

k-nut · 2025-11-05T06:49:45Z

        for school in schools:
            yield scrapy.Request(school, callback=self.parse_details)

+    def _extract_coords_from_osm_url(self, url: str) -> tuple[float | None, float | None]:


nit: We actually know that the signature is like this:

Suggested change

-    def _extract_coords_from_osm_url(self, url: str) -> tuple[float | None, float | None]:
+    def _extract_coords_from_osm_url(self, url: str) -> tuple[float, float] | tuple[None, None]:
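
To illustrate what the suggested annotation expresses, here is a small sketch with a hypothetical helper `parse_marker` (not part of the PR). The return type states that the result is either two floats or two `None` values, never a mix. The `from __future__ import annotations` line is only there so the `|` syntax in the annotation also works on Python versions before 3.10.

```python
from __future__ import annotations


def parse_marker(value: str) -> tuple[float, float] | tuple[None, None]:
    """Parse a "lat,lon" string; return both coordinates or neither."""
    try:
        lat_text, lon_text = value.split(",")
        return float(lat_text), float(lon_text)
    except ValueError:
        # Raised for a wrong number of parts and for non-numeric parts alike.
        return None, None


latitude, longitude = parse_marker("50.5,8.5")
print(latitude, longitude)
```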

k-nut · 2025-11-05T06:51:18Z

 | HB    | ❌ No                 | -                                            |
 | HH    | ✅ Yes                | WFS                                          |
-| HE    | ❌ No                 | -                                            |
+| HE    | ⚠️  Partial (90.7%)    | Extracted from OSM on detail pages (1,863/2,054 schools). The 191 schools without coordinates include both schools with placeholder coordinates (-1.0, -1.0) that are filtered to null and schools with no map data at all. |


Suggested change

-| HE | ⚠️ Partial (90.7%) | Extracted from OSM on detail pages (1,863/2,054 schools). The 191 schools without coordinates include both schools with placeholder coordinates (-1.0, -1.0) that are filtered to null and schools with no map data at all. |
+| HE | ⚠️ Partial (~90%) | Extracted from OSM on detail pages. The schools without coordinates are schools with placeholder coordinates that are filtered out and schools with no map data at all. |

Let's not use concrete numbers since the values might change over time.

k-nut · 2025-11-05T06:53:39Z

-            zip=item.get("plz"),
-            school_type=item.get("schultyp"),
-            id="HE-{}".format(item.get("id")),
+            name=item.get("name") or None,


Why did you add 'or None' for all of these? That should be the default, right? (Unless you want to explicitly coalesce "" to None?)
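
The distinction raised in this comment can be shown with a plain dictionary. The sample `item` and the missing key `"ort"` are made up for the illustration; `name` and `plz` are field names that appear in the diff.

```python
# Made-up scraped item: "name" is present but empty, "ort" is absent.
item = {"name": "", "plz": "60311"}

missing = item.get("ort")              # absent key: dict.get already returns None
empty = item.get("name")               # present but empty: stays ""
coalesced = item.get("name") or None   # "" is falsy, so this becomes None
kept = item.get("plz") or None         # non-empty values pass through unchanged

print(repr(missing), repr(empty), repr(coalesced), repr(kept))
```

So `or None` changes nothing for missing keys and only matters when a field is present but empty. It also turns other falsy values such as `0` into `None`, which is worth keeping in mind for numeric fields.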

tifa365 · Nov 5, 2025

Addressed all 3 issues.

Extract coordinates from OSM iframes and links on school detail pages using standard library parsing (no new dependencies). Achieves 90.7% coverage (1,863/2,054 schools). Fixes Datenschule#215

- Remove redundant URL validation (already in XPath selector)
- Add type annotation to _extract_coords_from_osm_url()
- Remove mlat/mlon and bbox fallbacks (never used)
- Remove link fallback (iframe always present with coordinates)
- Add example URL for placeholder coordinates (-1.0, -1.0)
- Replace broad Exception with specific ValueError/IndexError
- Update README with geolocation statistics (1,863/2,054 schools)

Analysis confirmed Hessen iframes always use 'marker' parameter format. The mlat/mlon format only appears in links, which are not needed.

- Update type annotation to tuple[float, float] | tuple[None, None]
- Simplify README geolocation description (remove concrete numbers)
- Remove superfluous 'or None' from normalize() method

[HE] Add geolocation support from OpenStreetMap iframes

Extract coordinates from OSM iframes and links on school detail pages using standard library parsing (no new dependencies). Currently achieves 90.7% coverage (1,863/2,054 schools).

Co-authored-by: tim <tfangmeyer@gmail.com>

tifa365 mentioned this pull request Oct 26, 2025

Geolocation Data Implementation Status #214

Open

k-nut approved these changes Oct 27, 2025


k-nut approved these changes Nov 5, 2025


tim added 3 commits November 5, 2025 09:08

[HE] Add geolocation support from OpenStreetMap iframes

1145564


Address PR feedback: simplify Hessen coordinate extraction

c6ad7cd


tifa365 force-pushed the feature/hessen-improvements branch from f0b3e20 to c6ad7cd Compare November 5, 2025 08:11

k-nut approved these changes Nov 26, 2025


k-nut merged commit 0b019ff into Datenschule:main Nov 26, 2025
2 checks passed

