  Workflows
  =========

- This page describes the four end-to-end pipelines that make up the openpois
- data processing system, in the order they should be executed. Each pipeline
- is implemented as a series of scripts in the ``scripts/`` directory; the
- scripts call library functions documented in the :doc:`api`.
+ This page describes the four end-to-end pipelines that produce the openpois
+ dataset, in the order they are executed. Each pipeline is implemented as a
+ series of scripts in the ``scripts/`` directory; the scripts call library
+ functions documented in the :doc:`api`.

  All scripts read their configuration from ``config.yaml`` via
  ``config_versioned.Config``. See the individual script docstrings for the
  exact config keys each script uses.

- ---
+
+ Prerequisites
+ -------------
+
+ **Python environment.** Install the conda env from ``environment.yml`` and
+ the package itself in editable mode:
+
+ .. code-block:: bash
+
+     make build_env        # conda env create -f environment.yml (env name: openpois)
+     conda activate openpois
+     make install_package  # pip install -e .
+
+ **Geofabrik OAuth (Pipeline 2 only).** Pipeline 2 downloads full-history
+ PBFs from Geofabrik's OAuth-protected internal server. Any OSM account grants
+ access. Generate a Netscape-format cookie jar by logging in at
+ ``https://osm-internal.download.geofabrik.de/`` and exporting cookies, or by
+ running Geofabrik's ``oauth_cookie_client.py``. Save the cookie jar at the
+ path configured in ``config.yaml`` under ``download.osm.history_cookie_file``.
+
+ **Source Cooperative credentials (publishing only).** Publishing the data
+ back to Source Cooperative requires short-lived AWS-style credentials in a
+ JSON file at the path configured under ``publish.credentials_file`` (default:
+ ``.env.json`` at the repo root). The format is documented in the
+ ``scripts/publish/upload_to_source_coop.py`` docstring. Replicators who do
+ not intend to publish can stop after Pipeline 4 Step 3.
+
+ ----

  Pipeline 1: POI Snapshot Downloads
- ------------------------------------
+ ----------------------------------

  These two scripts are independent and can be run in any order (or in
  parallel). Each downloads a current US-wide POI snapshot from one data
@@ -51,27 +78,29 @@ See :mod:`openpois.io.overture`.
  Reads the first 100 rows of each snapshot without loading the full files,
  saving snippet CSVs to the ``testing/`` directory for column inspection.

- ---
+ ----

  Pipeline 2: OSM Historical Change-Rate Model
  --------------------------------------------

- This pipeline downloads OpenStreetMap element histories for a Seattle-area
- bounding box and fits a Poisson change-rate model to estimate how quickly
- different POI categories become outdated.
+ This pipeline downloads OpenStreetMap full-history PBFs (US + Puerto Rico)
+ and fits a Poisson change-rate model to estimate how quickly different POI
+ categories become outdated.

- **Step 1 — Download OSM element histories**
+ **Step 1 — Download full-history PBFs**

  .. code-block:: bash

-     python scripts/osm_data/download.py
+     python scripts/osm_data/download_history.py

- Queries the Overpass API across a configured date range to collect element IDs
- for each tag key, then fetches the full version history of each element via the
- OSM API. Outputs ``osm_elements.csv``, ``osm_versions.csv``,
- ``osm_changes.csv``, and ``osm_failed_elements.csv``.
+ Requires the Geofabrik OAuth cookie jar described in *Prerequisites* above.
+ Downloads the US-mainland and Puerto Rico full-history extracts, filters
+ each with ``osmium tags-filter`` (POI tag keys only) and ``osmium
+ time-filter`` (the ``download.osm.start_date`` / ``end_date`` window), then
+ parses with pyosmium into per-version and per-change Parquet tables.
+ Outputs: ``osm_versions.parquet`` and ``osm_changes.parquet``.

- See :mod:`openpois.io.osm_history`.
+ See :mod:`openpois.io.osm_history_pbf`.

  **Step 2 — Reformat into observations**

@@ -82,7 +111,7 @@ See :mod:`openpois.io.osm_history`.
  Converts raw version histories into one-row-per-observation records, each
  flagged for whether the configured ``osm_data.tag_key`` changed, then
  assigns a shared taxonomy label and explodes rows for POIs mapping to
- multiple labels. Output: ``osm_observations.csv``.
+ multiple labels. Output: ``osm_observations.parquet``.

  See :mod:`openpois.osm.format_observations`.

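As a rough illustration of the reformatting step described above (not the code in ``openpois.osm.format_observations``, which this diff does not show), the idea can be sketched as pairing each version of an element with its successor and flagging whether the tracked tag changed. The field names ``element_id``, ``timestamp`` and ``tags``, the helper ``to_observations``, and the sample rows are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def to_observations(versions, tag_key):
    """Turn per-version rows into one row per consecutive version pair.

    Hypothetical helper: each row is a dict with 'element_id',
    'timestamp' (ISO date string) and 'tags' (dict).
    """
    observations = []
    # Sort so that versions of the same element are adjacent and in time order.
    ordered = sorted(versions, key=itemgetter("element_id", "timestamp"))
    for element_id, group in groupby(ordered, key=itemgetter("element_id")):
        history = list(group)
        # Each consecutive pair (before, after) becomes one observation.
        for before, after in zip(history, history[1:]):
            observations.append({
                "element_id": element_id,
                "start": before["timestamp"],
                "end": after["timestamp"],
                "changed": before["tags"].get(tag_key) != after["tags"].get(tag_key),
            })
    return observations


# Made-up sample data for illustration only.
versions = [
    {"element_id": 1, "timestamp": "2020-01-01", "tags": {"name": "Cafe A"}},
    {"element_id": 1, "timestamp": "2021-01-01", "tags": {"name": "Cafe B"}},
    {"element_id": 1, "timestamp": "2022-01-01", "tags": {"name": "Cafe B"}},
    {"element_id": 2, "timestamp": "2020-06-01", "tags": {"name": "Shop"}},
]
obs = to_observations(versions, "name")
```

An element with a single version yields no observation in this sketch, because there is no later version to compare against; the real pipeline may treat such elements differently.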
@@ -111,10 +140,10 @@ Produces Kaplan-Meier-style survival curve plots saved to

  See :mod:`openpois.osm.change_plots`.

- ---
+ ----

  Pipeline 3: Rate the OSM Snapshot
- ------------------------------------
+ ---------------------------------

  This pipeline applies the fitted change-rate model (Pipeline 2) to the OSM
  snapshot (Pipeline 1) to assign a confidence score to every POI.
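One way to picture a confidence score under a Poisson change model is the probability that no change has occurred since the POI was last edited, ``exp(-rate * elapsed)``. This is an assumption made for illustration; the scoring actually implemented in ``openpois.models.apply`` is not shown in this diff and may differ. The function name and the example rate are hypothetical.

```python
import math


def prob_unchanged(rate_per_year, years_since_edit):
    """Probability of zero change events in the elapsed time.

    Assumes a constant-rate Poisson process (an illustrative
    simplification, not necessarily the fitted model).
    """
    if rate_per_year < 0 or years_since_edit < 0:
        raise ValueError("rate and elapsed time must be non-negative")
    return math.exp(-rate_per_year * years_since_edit)


# Hypothetical category changing 0.1 times per year, last edited 5 years ago.
score = prob_unchanged(0.1, 5.0)
```

Under this simplification the score decays from 1 toward 0 as the time since the last edit grows, and decays faster for categories with a higher change rate.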
@@ -141,31 +170,30 @@ See :mod:`openpois.models.apply`.

      python scripts/osm_snapshot/format_for_upload.py

- Adds geohash columns and writes a Hive-style partitioned dataset so that the
- web map can fetch only the tiles it needs. Output:
- ``osm_snapshot_partitioned/``.
+ Adds geohash columns and writes a Hive-style partitioned dataset so the web
+ map can fetch only the tiles it needs. Output: ``osm_snapshot_partitioned/``.

  See :mod:`openpois.io.geohash_partition`.

- **Step 3 — Upload to S3**
+ **Step 3 — Build OSM PMTiles**

  .. code-block:: bash

-     python scripts/osm_snapshot/upload_to_s3.py
+     python scripts/osm_snapshot/prepare_pmtiles.py

- Uploads the partitioned dataset to the configured public S3 bucket with
- public-read ACL. Requires AWS credentials (``AWS_ACCESS_KEY_ID`` /
- ``AWS_SECRET_ACCESS_KEY`` env vars or ``~/.aws/credentials``).
+ Generates a single-zoom (z14) PMTiles archive from the partitioned dataset
+ for use by the web map. Output: ``osm_snapshot.pmtiles``.

- See :mod:`openpois.io.s3`.
+ See :mod:`openpois.io.pmtiles`.

- ---
+ ----

- Pipeline 4: Conflation and Upload
- ------------------------------------
+ Pipeline 4: Conflation and Publishing
+ -------------------------------------

- This pipeline conflates the rated OSM snapshot with the Overture Maps snapshot
- into a single unified POI dataset for the web map.
+ This pipeline conflates the rated OSM snapshot with the Overture Maps
+ snapshot into a single unified POI dataset and publishes it to Source
+ Cooperative.

  **Prerequisites:** Pipeline 3 rated OSM snapshot and Pipeline 1 Overture
  snapshot.
@@ -193,23 +221,29 @@ See :mod:`openpois.conflation.match`, :mod:`openpois.conflation.merge`, and
  Produces a summary CSV with match counts and average match scores per
  shared taxonomy label. Output: ``summary_by_label.csv``.

- **Step 3 — Partition for upload**
+ **Step 3 — Partition and build conflated PMTiles**

  .. code-block:: bash

      python scripts/conflation/format_for_upload.py
+     python scripts/conflation/prepare_pmtiles.py

- Adds geohash columns and writes a Hive-style partitioned dataset.
- Output: ``conflated_partitioned/``.
+ Adds geohash columns and writes a Hive-style partitioned dataset, then
+ builds a single-zoom (z14) PMTiles archive of the conflated points.
+ Outputs: ``conflated_partitioned/`` and ``conflated.pmtiles``.

- See :mod:`openpois.io.geohash_partition`.
+ See :mod:`openpois.io.geohash_partition` and :mod:`openpois.io.pmtiles`.

- **Step 4 — Upload to S3**
+ **Step 4 — Publish to Source Cooperative** *(optional)*

  .. code-block:: bash

-     python scripts/conflation/upload_to_s3.py
+     python scripts/publish/upload_to_source_coop.py

- Uploads the partitioned conflated dataset to S3 with public-read ACL.
+ Uploads the partitioned conflated dataset, the partitioned OSM dataset,
+ both PMTiles archives, and a per-version README to Source Cooperative
+ under the ``versions.source_coop`` folder. Requires the credentials file
+ described in *Prerequisites*. Skip this step if you only want the data
+ locally.

- See :mod:`openpois.io.s3`.
+ See :mod:`openpois.io.source_coop` and :mod:`openpois.publish.build_readme`.
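The geohash partitioning used by the ``format_for_upload.py`` steps in Pipelines 3 and 4 can be pictured with the minimal sketch below. It implements the standard geohash encoding and builds a Hive-style ``key=value`` directory path from a geohash prefix. The partition key ``geohash_prefix``, the prefix length, the file name ``part-0.parquet``, and both function names are assumptions for illustration; the layout actually written by ``openpois.io.geohash_partition`` is not shown in this diff.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # geohash interleaves bits, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits map to one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def partition_path(root, lat, lon, prefix_len=2):
    """Build a Hive-style path; the key name is a hypothetical choice."""
    prefix = geohash_encode(lat, lon, prefix_len)
    return f"{root}/geohash_prefix={prefix}/part-0.parquet"


# A point in Seattle lands in the 'c2' partition.
path = partition_path("conflated_partitioned", 47.6062, -122.3321)
```

Because nearby points share geohash prefixes, a reader that knows its map viewport can compute the prefixes it covers and fetch only those partition directories.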