[HE] Add geolocation support from OpenStreetMap iframes#220
[HE] Add geolocation support from OpenStreetMap iframes#220k-nut merged 3 commits intoDatenschule:mainfrom
Conversation
k-nut
left a comment
There was a problem hiding this comment.
Thank you for your work on this.
I left a couple of comments where I think the code can be simplified further.
I would also prefer it if you removed the comments again. I think that comments should (mostly) document the why not the what and the added comments/docstrings are mostly just an explanation of the lines below.
| # Try mlat/mlon parameters | ||
| if "mlat" in qs and "mlon" in qs: | ||
| try: | ||
| return float(qs["mlat"][0]), float(qs["mlon"][0]) | ||
| except Exception: | ||
| pass | ||
|
|
||
| # Fallback: bbox center | ||
| if "bbox" in qs and qs["bbox"]: | ||
| try: | ||
| west, south, east, north = map(float, qs["bbox"][0].split(",", 3)) | ||
| return (south + north) / 2.0, (west + east) / 2.0 | ||
| except Exception: | ||
| pass | ||
|
|
There was a problem hiding this comment.
I don't think this ever happens. There either is a marker or there is no map at all. Or do you have an example where one of these cases would trigger?
| if not url or "openstreetmap.org" not in url: | ||
| return None, None | ||
|
|
There was a problem hiding this comment.
There is no need to check this since you have the url in the selector already and only call this function if a match is found (you could try adding a type annotation to make it explicit that we don't want this to be an optional parameter but always a string).
| # Fallback: try "Größere Karte" link | ||
| if latitude is None: | ||
| osm_link = response.xpath('//a[contains(@href, "openstreetmap.org")]/@href').get() | ||
| if osm_link: | ||
| latitude, longitude = self._extract_coords_from_osm_url(osm_link) | ||
|
|
There was a problem hiding this comment.
Do we ever need this? If so, an example link would be great.
| if latitude == -1.0 and longitude == -1.0: | ||
| latitude = None | ||
| longitude = None |
There was a problem hiding this comment.
It would also be nice to have an example for this case.
There was a problem hiding this comment.
Will add an example to the code in the comments. This seems to be a placeholder but might not be the best idea to use instead of NA.
https://schul-db.bildung.hessen.de/schul_db.html/details/?school_no=9642
k-nut
left a comment
There was a problem hiding this comment.
I still think it would be good to remove most of the added comments again (a notable exception being the one with the example for the (-1.0, -1.0) school). Other than that, looks good to me!
| for school in schools: | ||
| yield scrapy.Request(school, callback=self.parse_details) | ||
|
|
||
| def _extract_coords_from_osm_url(self, url: str) -> tuple[float | None, float | None]: |
There was a problem hiding this comment.
nit: We actually know that the signature is like this:
| def _extract_coords_from_osm_url(self, url: str) -> tuple[float | None, float | None]: | |
| def _extract_coords_from_osm_url(self, url: str) -> tuple[float, float] | tuple [None, None]: |
| | HB | ❌ No | - | | ||
| | HH | ✅ Yes | WFS | | ||
| | HE | ❌ No | - | | ||
| | HE | ⚠️ Partial (90.7%) | Extracted from OSM on detail pages (1,863/2,054 schools). The 191 schools without coordinates include both schools with placeholder coordinates (-1.0, -1.0) that are filtered to null and schools with no map data at all. | |
There was a problem hiding this comment.
| | HE | ⚠️ Partial (90.7%) | Extracted from OSM on detail pages (1,863/2,054 schools). The 191 schools without coordinates include both schools with placeholder coordinates (-1.0, -1.0) that are filtered to null and schools with no map data at all. | | |
| | HE | ⚠️ Partial (~90%) | Extracted from OSM on detail pages. The schools without coordinates are schools with placeholder coordinates that are filtered out and schools with no map data at all. | |
Let's not use concrete numbers since the values might change over time.
| zip=item.get("plz"), | ||
| school_type=item.get("schultyp"), | ||
| id="HE-{}".format(item.get("id")), | ||
| name=item.get("name") or None, |
There was a problem hiding this comment.
Why did you add or None for all of these? That should be the default, right? (unless you want to explicitly coalesce "" to None?)
There was a problem hiding this comment.
Addressed all 3 issues.
Extract coordinates from OSM iframes and links on school detail pages using standard library parsing (no new dependencies). Achieves 90.7% coverage (1,863/2,054 schools). Fixes Datenschule#215
- Remove redundant URL validation (already in XPath selector) - Add type annotation to _extract_coords_from_osm_url() - Remove mlat/mlon and bbox fallbacks (never used) - Remove link fallback (iframe always present with coordinates) - Add example URL for placeholder coordinates (-1.0, -1.0) - Replace broad Exception with specific ValueError/IndexError - Update README with geolocation statistics (1,863/2,054 schools) Analysis confirmed Hessen iframes always use 'marker' parameter format. The mlat/mlon format only appears in links, which are not needed.
- Update type annotation to tuple[float, float] | tuple[None, None] - Simplify README geolocation description (remove concrete numbers) - Remove superfluous 'or None' from normalize() method
f0b3e20 to
c6ad7cd
Compare
Adds geolocation support to the Hessen school scraper by extracting coordinates from OpenStreetMap iframes embedded on school detail pages.
Summary
Implementation details
_extract_coords_from_osm_url()method with three fallback strategies:markerparameter (most precise)mlat/mlonparametersbboxcenter (least precise)normalize()to include latitude/longitude fields and convert empty strings to NoneTest plan
Fixes #203