[HE] Add geolocation support from OpenStreetMap iframes#220

Merged
k-nut merged 3 commits into Datenschule:main from tifa365:feature/hessen-improvements on Nov 26, 2025

@tifa365
Contributor

@tifa365 tifa365 commented Oct 25, 2025

Adds geolocation support to the Hessen school scraper by extracting coordinates from OpenStreetMap iframes embedded on school detail pages.

Summary

  • Extracts coordinates from OpenStreetMap iframes and links on school detail pages
  • Achieves 90.7% coverage (1,863 out of 2,054 schools with coordinates)
  • Uses only standard library parsing (no new dependencies)
  • Converts empty strings to None for consistent data handling
  • Filters out placeholder coordinates (-1.0, -1.0) used by Hessen DB

Implementation details

  • Added _extract_coords_from_osm_url() method with three fallback strategies:
    1. marker parameter (most precise)
    2. mlat/mlon parameters
    3. bbox center (least precise)
  • Updated normalize() to include latitude/longitude fields and convert empty strings to None
  • Added comprehensive code comments for better maintainability
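
The three fallback strategies listed above could look roughly like the following. This is a minimal sketch using only `urllib.parse`, not the code from the PR: the function name (without the leading underscore and `self`), the example URLs, and the exact exception handling are assumptions made for illustration. It assumes `marker` holds `lat,lon` and `bbox` holds `west,south,east,north`.

```python
from __future__ import annotations

from urllib.parse import parse_qs, urlparse


def extract_coords_from_osm_url(url: str) -> tuple[float | None, float | None]:
    """Return (latitude, longitude) parsed from an OSM URL, or (None, None)."""
    qs = parse_qs(urlparse(url).query)

    # 1. marker=<lat>,<lon> (most precise)
    if "marker" in qs:
        try:
            lat, lon = qs["marker"][0].split(",", 1)
            return float(lat), float(lon)
        except ValueError:
            pass

    # 2. mlat=<lat>&mlon=<lon>
    if "mlat" in qs and "mlon" in qs:
        try:
            return float(qs["mlat"][0]), float(qs["mlon"][0])
        except ValueError:
            pass

    # 3. bbox=<west>,<south>,<east>,<north>: use the center (least precise)
    if "bbox" in qs:
        try:
            west, south, east, north = map(float, qs["bbox"][0].split(","))
            return (south + north) / 2.0, (west + east) / 2.0
        except ValueError:
            pass

    return None, None
```

Each strategy only falls through to the next one when its parameter is missing or malformed, so a URL that carries both `marker` and `bbox` always yields the marker position.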

Test plan

  • Tested with small sample (3 schools)
  • Verified full scraper run (2,054 schools in ~7.5 minutes)
  • Confirmed 90.7% geolocation coverage through analysis
  • Verified empty string to None conversion works correctly
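
The coverage figure can be reproduced with a few lines of Python. The sketch below is an assumption about how such an analysis might be done, not the script used for the PR; `geolocation_coverage` and the `latitude`/`longitude` keys on the scraped items are illustrative names.

```python
def geolocation_coverage(items):
    """Return (items_with_coords, total_items, percent_with_coords)."""
    total = len(items)
    with_coords = sum(
        1
        for item in items
        if item.get("latitude") is not None and item.get("longitude") is not None
    )
    percent = round(100.0 * with_coords / total, 1) if total else 0.0
    return with_coords, total, percent


# Arithmetic check for the numbers quoted in the PR description
print(round(100.0 * 1863 / 2054, 1))  # 90.7
print(2054 - 1863)                    # 191 schools without coordinates
```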

Fixes #203

Member

@k-nut k-nut left a comment

Thank you for your work on this.

I left a couple of comments where I think the code can be simplified further.

I would also prefer it if you removed the comments again. I think comments should (mostly) document the why, not the what, and the added comments/docstrings mostly just explain the lines below them.

Comment thread jedeschule/spiders/hessen.py Outdated
Comment on lines +71 to +85
# Try mlat/mlon parameters
if "mlat" in qs and "mlon" in qs:
    try:
        return float(qs["mlat"][0]), float(qs["mlon"][0])
    except Exception:
        pass

# Fallback: bbox center
if "bbox" in qs and qs["bbox"]:
    try:
        west, south, east, north = map(float, qs["bbox"][0].split(",", 3))
        return (south + north) / 2.0, (west + east) / 2.0
    except Exception:
        pass

Member

I don't think this ever happens. There either is a marker or there is no map at all. Or do you have an example where one of these cases would trigger?

Comment thread jedeschule/spiders/hessen.py Outdated
Comment on lines +58 to +60
if not url or "openstreetmap.org" not in url:
    return None, None

Member

There is no need to check this since you have the url in the selector already and only call this function if a match is found (you could try adding a type annotation to make it explicit that we don't want this to be an optional parameter but always a string).

Comment thread jedeschule/spiders/hessen.py Outdated
Comment on lines +136 to +141
# Fallback: try "Größere Karte" link
if latitude is None:
    osm_link = response.xpath('//a[contains(@href, "openstreetmap.org")]/@href').get()
    if osm_link:
        latitude, longitude = self._extract_coords_from_osm_url(osm_link)

Member

Do we ever need this? If so, an example link would be great.

Comment on lines +143 to +145
if latitude == -1.0 and longitude == -1.0:
    latitude = None
    longitude = None

Member

It would also be nice to have an example for this case.

Contributor Author

Will add an example to the code in the comments. This seems to be a placeholder, though it might not be the best idea to use it instead of NA.

https://schul-db.bildung.hessen.de/schul_db.html/details/?school_no=9642
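
For illustration, the placeholder filter discussed in this thread can be isolated into a small helper. This is a sketch only; `drop_placeholder` is a hypothetical name, and the PR performs the check inline rather than in a separate function.

```python
def drop_placeholder(latitude, longitude):
    """Map the (-1.0, -1.0) placeholder used by the Hessen DB to (None, None)."""
    # Only the exact pair counts as a placeholder; a single -1.0 is left alone.
    if latitude == -1.0 and longitude == -1.0:
        return None, None
    return latitude, longitude
```

Requiring both values to be `-1.0` keeps a real coordinate with one such component intact, although no such point lies anywhere near Hessen.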

Member

@k-nut k-nut left a comment

I still think it would be good to remove most of the added comments again (a notable exception being the one with the example for the (-1.0, -1.0) school). Other than that, looks good to me!

Comment thread jedeschule/spiders/hessen.py Outdated
    for school in schools:
        yield scrapy.Request(school, callback=self.parse_details)

def _extract_coords_from_osm_url(self, url: str) -> tuple[float | None, float | None]:

Member

nit: We actually know that the signature is like this:

Suggested change
- def _extract_coords_from_osm_url(self, url: str) -> tuple[float | None, float | None]:
+ def _extract_coords_from_osm_url(self, url: str) -> tuple[float, float] | tuple[None, None]:

Comment thread README.md Outdated
  | HB | ❌ No | - |
  | HH | ✅ Yes | WFS |
- | HE | ❌ No | - |
+ | HE | ⚠️ Partial (90.7%) | Extracted from OSM on detail pages (1,863/2,054 schools). The 191 schools without coordinates include both schools with placeholder coordinates (-1.0, -1.0) that are filtered to null and schools with no map data at all. |

Member

Suggested change
- | HE | ⚠️ Partial (90.7%) | Extracted from OSM on detail pages (1,863/2,054 schools). The 191 schools without coordinates include both schools with placeholder coordinates (-1.0, -1.0) that are filtered to null and schools with no map data at all. |
+ | HE | ⚠️ Partial (~90%) | Extracted from OSM on detail pages. The schools without coordinates are schools with placeholder coordinates that are filtered out and schools with no map data at all. |

Let's not use concrete numbers since the values might change over time.

Comment thread jedeschule/spiders/hessen.py Outdated
zip=item.get("plz"),
school_type=item.get("schultyp"),
id="HE-{}".format(item.get("id")),
name=item.get("name") or None,

Member

Why did you add `or None` for all of these? That should be the default, right? (unless you want to explicitly coalesce "" to None?)
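
The distinction raised here can be seen in a short example. The dictionary below is made-up sample data, not output from the scraper: `dict.get` already returns `None` for a missing key, so `or None` only changes the result when the key is present with a falsy value such as an empty string.

```python
item = {"name": "", "plz": "60311"}

missing = item.get("fax")                   # key absent -> None
empty_kept = item.get("name")               # key present, empty -> ""
empty_coalesced = item.get("name") or None  # "" coalesced -> None
unchanged = item.get("plz") or None         # non-empty value passes through

print(missing, repr(empty_kept), empty_coalesced, unchanged)
# None '' None 60311
```

Note that `or None` also turns other falsy values (for example `0` or an empty list) into `None`, which may or may not be wanted for a given field.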

Contributor Author

@tifa365 tifa365 commented Nov 5, 2025

Addressed all 3 issues.

tim added 3 commits November 5, 2025 09:08
Extract coordinates from OSM iframes and links on school detail pages
using standard library parsing (no new dependencies). Achieves 90.7%
coverage (1,863/2,054 schools).

Fixes Datenschule#215

- Remove redundant URL validation (already in XPath selector)
- Add type annotation to _extract_coords_from_osm_url()
- Remove mlat/mlon and bbox fallbacks (never used)
- Remove link fallback (iframe always present with coordinates)
- Add example URL for placeholder coordinates (-1.0, -1.0)
- Replace broad Exception with specific ValueError/IndexError
- Update README with geolocation statistics (1,863/2,054 schools)

Analysis confirmed Hessen iframes always use 'marker' parameter format.
The mlat/mlon format only appears in links, which are not needed.

- Update type annotation to tuple[float, float] | tuple[None, None]
- Simplify README geolocation description (remove concrete numbers)
- Remove superfluous 'or None' from normalize() method
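
After the changes described in these commit messages, the extraction reduces to the `marker` parameter alone. The sketch below shows what such a simplified function might look like; it is not the merged code. The function name is hypothetical, and catching `KeyError` for a missing `marker` is an assumption added here (the commit message mentions only `ValueError` and `IndexError`).

```python
from __future__ import annotations

from urllib.parse import parse_qs, urlparse


def extract_marker_coords(url: str) -> tuple[float, float] | tuple[None, None]:
    """Return (latitude, longitude) from the marker parameter, or (None, None)."""
    qs = parse_qs(urlparse(url).query)
    try:
        # marker is assumed to be "<lat>,<lon>"
        lat, lon = qs["marker"][0].split(",", 1)
        return float(lat), float(lon)
    except (KeyError, ValueError, IndexError):
        return None, None
```

The return annotation states that the function yields either two floats or two `None` values, never a mix, which is the point of the reviewer's suggested signature.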
@tifa365 tifa365 force-pushed the feature/hessen-improvements branch from f0b3e20 to c6ad7cd Compare November 5, 2025 08:11
@k-nut k-nut merged commit 0b019ff into Datenschule:main Nov 26, 2025
2 checks passed
k-nut pushed a commit to tifa365/jedeschule-scraper that referenced this pull request Apr 20, 2026

* [HE] Add geolocation support from OpenStreetMap iframes

Extract coordinates from OSM iframes and links on school detail pages
using standard library parsing (no new dependencies). Currently achieves 90.7%
coverage (1,863/2,054 schools).

Co-authored-by: tim <tfangmeyer@gmail.com>

Development

Successfully merging this pull request may close these issues.

[HE]: Scrape geolocation from details page.

2 participants