Commit cd4b6e7

Merge pull request #60 from jpolchlo/refactor/process-osm
Import implementation from OSMesa project
2 parents 1d5596c + 12e2da2 commit cd4b6e7

122 files changed

Lines changed: 3625 additions & 3754 deletions

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -18,3 +18,6 @@ target
 
 derby.log
 metastore_db/*
+bench/target/
+idea.sbt
+mainRunner/

README.md

Lines changed: 165 additions & 46 deletions
@@ -1,64 +1,183 @@
-VectorPipe
-==========
+# VectorPipe #
 
 [![Build Status](https://travis-ci.org/geotrellis/vectorpipe.svg?branch=master)](https://travis-ci.org/geotrellis/vectorpipe)
 [![Bintray](https://img.shields.io/bintray/v/azavea/maven/vectorpipe.svg)](https://bintray.com/azavea/maven/vectorpipe)
 [![Codacy Badge](https://api.codacy.com/project/badge/Grade/447170921bc94b3fb494bb2b965c2235)](https://www.codacy.com/app/fosskers/vectorpipe?utm_source=github.com&utm_medium=referral&utm_content=geotrellis/vectorpipe&utm_campaign=Badge_Grade)
 
-A pipeline for mass conversion of Vector data (OpenStreetMap, etc.) into
-Mapbox VectorTiles. Powered by [GeoTrellis](http://geotrellis.io) and
-[Apache Spark](http://spark.apache.org/).
+VectorPipe (VP) is a library for working with OpenStreetMap (OSM) vector
+data. Powered by [GeoTrellis](http://geotrellis.io) and [Apache
+Spark](http://spark.apache.org/).
 
-Overview
---------
+OSM provides a wealth of data which has broad coverage and a deep history.
+This comes at the price of very large size which can make accessing the power
+of OSM difficult. VectorPipe can help by making OSM processing in Apache
+Spark possible, leveraging large computing clusters to churn through the large
+volume of, say, an OSM full history file.
 
-![](docs/pipeline.png)
+For those cases where an application needs to process incoming changes, VP
+also provides streaming Spark `DataSource`s for changesets, OsmChange files,
+and Augmented diffs generated by Overpass.
 
-There are four main stages here which represent the four main shapes of
-data in the pipeline:
+For ease of use, the output of VP imports is a Spark DataFrame containing
+columns of JTS `Geometry` objects, enabled by the user-defined types provided
+by [GeoMesa](https://github.com/locationtech/geomesa). That package also
+provides functions for manipulating those geometries via Spark SQL directives.
 
-1. Unprocessed Vector (geometric) Data
-2. Clipped GeoTrellis `Feature`s organized into a Grid on the earth
-3. VectorTiles organized into a Grid on the earth
-4. Fully processed VectorTiles, output to some target
+The final important contribution is a set of functions for exporting
+geometries to vector tiles. This leans on the `geotrellis-vectortile`
+package.
 
-Of these, Stage 4 is left to the user to leverage GeoTrellis directly in
-their own application. Luckily the `RDD[(SpatialKey, VectorTile)] => Unit`
-operation only requires about 5 lines of code. Stages 1 to 3 then are the
-primary concern of Vectorpipe.
+## Getting Started ##
 
-### Processing Raw Data
+The fastest way to get started with VectorPipe is to invoke `spark-shell` and
+load the package jars from the Bintray repository:
+```bash
+spark-shell --packages com.azavea:vectorpipe_2.11:1.0.0 --repositories http://dl.bintray.com/azavea/maven
+```
 
-For each data source that has first-class support, we expose a
-`vectorpipe.*` module with a matching name. Example: `vectorpipe.osm`. These
-modules expose all the types and functions necessary for transforming the
-raw data into the "Middle Ground" types.
+This will download the required components and set up a REPL with VectorPipe
+available, at which point you may issue
+```scala
+// Make JTS types available to Spark
+import org.locationtech.geomesa.spark.jts._
+spark.withJTS
 
-No first-class support for your favourite data source? Want to write it
-yourself, and maybe even keep it private? That's okay, just provide the
-function `YourData => RDD[Feature[G, D]]` and VectorPipe can handle the
-rest.
+import vectorpipe._
+```
+and begin using the package.
 
-### Clipping Features into a Grid
+#### A Note on Cluster Computing ####
 
-GeoTrellis has a consistent `RDD[(K, V)]` pattern for handling grids of
-tiled data, where `K` is the grid index and `V` is the actual value type.
-Before `RDD[(SpatialKey, VectorTile)]` can be achieved, we need to convert
-our gridless `RDD[Feature[G, D]]` into such a grid, such that each Feature's
-`Geometry` is reasonably clipped to the size of an individual tile. Depending
-on which clipping function you choose (from the `vectorpipe.Clip` object, or
-even your own custom one) the shape of the clipped Geometry will vary. See
-our Scaladocs for more detail on the available options.
+Your local machine is probably insufficient for dealing with very large OSM
+files. We recommend the use of Amazon's Elastic MapReduce (EMR) service to
+provision substantial clusters of computing resources. You'll want to supply
+Spark, Hive, and Hadoop to your cluster, with Spark version 2.3. Creating a
+cluster with EMR version between 5.13 and 5.19 should suffice. From there,
+`ssh` into the master node and run `spark-shell` as above for an interactive
+environment, or use `spark-submit` for batch jobs. (You may submit Steps to
+the EMR cluster using `spark-submit` as well.)
 
-### Collating Feature Groups into a VectorTile
+### Importing Data ###
 
-Once clipped and gridded by `VectorPipe.toGrid`, we have a `RDD[(SpatialKey,
-Iterable[Feature[G, D]])]` that represents all the Geometry fragments
-present at each tiled location on the earth. This is the perfect shape to
-turn into a `VectorTile`. To do so, we need to choose a *Collator* function,
-which determines what VectorTile Layer each `Feature` should be placed into,
-and how (if at all) its corresponding metadata (the `D`) should be
-processed.
+Batch analysis can be performed in a few different ways. Perhaps the fastest
+way is to procure an OSM PBF file from a source such as
+[Geofabrik](https://download.geofabrik.de/index.html), which supplies various
+extracts of OSM, including the full planet worth of data.
 
-Want to write your own Collator? The `Collate.generically` function will be
-of interest to you.
+VectorPipe does not provide the means to directly read these OSM PBF files,
+however, and a conversion to a useful file format will thus be needed. We
+suggest using [`osm2orc`](https://github.com/mojodna/osm2orc) to convert your
+source file to the ORC format which can be read natively via Spark:
+```scala
+val df = spark.read.orc(path)
+```
+The resulting `DataFrame` can be processed with VectorPipe.
+
+It is also possible to read from a cache of
+[OsmChange](https://wiki.openstreetmap.org/wiki/OsmChange) files directly
+rather than convert the PBF file:
+```scala
+import vectorpipe.sources.Source
+val df = spark.read
+  .format(Source.Changes)
+  .options(Map[String, String](
+    Source.BaseURI -> "https://download.geofabrik.de/europe/isle-of-man-updates/",
+    Source.StartSequence -> "2080",
+    Source.EndSequence -> "2174",
+    Source.BatchSize -> "1"))
+  .load
+  .persist // recommended to avoid rereading
+```
+(Note that the start and end sequence will shift over time for Geofabrik.
+Please navigate to the base URI to determine these values, otherwise timeouts
+may occur.) This may issue errors, but should complete. This is much slower
+than using ORC files and is much touchier, but it stands as an option.
+
+[It is also possible to build a dataframe from a stream of changesets in a
+similar manner to the above. Changesets carry additional metadata regarding the
+author of the changes, but none of the geometric information. These tables
+can be joined on `changeset`.]
+
+In either case, a useful place to start is to convert the incoming dataframe
+into a more usable format. We recommend calling
+```scala
+val geoms = OSM.toGeometry(df)
+```
+which will produce a frame consisting of "top-level" entities, which is to say
+nodes that don't participate in a way, ways that don't participate in
+relations, and a subset of the relations from the OSM data. The resulting
+dataframe will represent these entities with JTS geometries in the `geom`
+column.
+
+The `toGeometry` function keeps elements that fit one of the following
+descriptions:
+- points from tagged nodes (including tags that really ought to be dropped, e.g. `source=*`);
+- polygons derived from ways with tags that cause them to be considered as areas;
+- lines from ways lacking area tags;
+- multipolygons from multipolygon or boundary relations; and
+- multilinestrings from route relations.
+
+It is also possible to filter the results based on information in the tags.
+For instance, all buildings can be found as
+```scala
+import vectorpipe.functions.osm._
+val buildings = geoms.filter(isBuilding('tags))
+```
+
+Again, the JTS user-defined types allow for easier manipulation of and
+calculation from geometric types. See
+[here](https://www.geomesa.org/documentation/user/spark/sparksql_functions.html)
+for a list of functions that operate on geometries.
+
+## The `internal` package ##
+
+While most users will rely solely on the features exposed by the `OSM` object,
+finer-grained control of the output of the process (say, if one does not need
+relations) is available through the `vectorpipe.internal`
+package.
+
+There is a significant caveat here: there are two schemas that are
+found in the system when working with imported OSM dataframes. The difference
+is in the type of a sub-field of the `members` list. This can cause errors of
+the form
+```
+java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Byte
+```
+when using the `internal` package methods.
+
+These type problems can be fixed by calling
+`vectorpipe.functions.osm.ensureCompressedMembers` on the input OSM data frame
+before passing to any relation-generating functions, such as
+`reconstructRelationGeometries`. Top-level functions in the `OSM` object
+handle this conversion for you. Note that this only affects the data frames
+carrying the initially imported OSM data.
+
+## Local Development ##
+
+If you are intending to contribute to VectorPipe, you may need to work with a
+development version. If that is the case, instead of loading from Bintray,
+you will need to build a fat jar using
+```bash
+./sbt assembly
+```
+and following that,
+```bash
+spark-shell --jars target/scala-2.11/vectorpipe.jar
+```
+
+### IntelliJ IDEA
+
+When developing with IntelliJ IDEA, the sbt plugin will see Spark dependencies
+as provided, which will prevent them from being indexed properly, resulting in
+errors / warnings within the IDE. To fix this, create `idea.sbt` at the root of
+the project:
+
+```scala
+import Dependencies._
+
+lazy val mainRunner = project.in(file("mainRunner")).dependsOn(RootProject(file("."))).settings(
+  libraryDependencies ++= Seq(
+    sparkSql % Compile
+  )
+)
+```

bench/src/main/scala/vectorpipe/Bench.scala

Lines changed: 1 addition & 8 deletions
@@ -2,7 +2,7 @@ package vectorpipe
 
 import java.util.concurrent.TimeUnit
 
-import geotrellis.vector.{ Extent, MultiLine, Point, Line }
+import geotrellis.vector.{Extent, Line, Point}
 import org.openjdk.jmh.annotations._
 
 // --- //
@@ -21,11 +21,4 @@ class LineBench {
 List.range(4, -100, -2).map(n => Point(n, 1)) ++ List(Point(-3,4), Point(-1,4), Point(2,4), Point(4,4))
 )
 }
-
-// @Benchmark
-// def java: MultiLine = Clip.toNearestPointJava(extent, line)
-
-@Benchmark
-def tailrec: MultiLine = Clip.toNearestPoint(extent, line)
-
 }
