|
1 | | -VectorPipe |
2 | | -========== |
| 1 | +# VectorPipe # |
3 | 2 |
|
4 | 3 | [](https://travis-ci.org/geotrellis/vectorpipe) |
5 | 4 | [](https://bintray.com/azavea/maven/vectorpipe) |
6 | 5 | [](https://www.codacy.com/app/fosskers/vectorpipe?utm_source=github.com&utm_medium=referral&utm_content=geotrellis/vectorpipe&utm_campaign=Badge_Grade) |
7 | 6 |
|
8 | | -A pipeline for mass conversion of Vector data (OpenStreetMap, etc.) into |
9 | | -Mapbox VectorTiles. Powered by [GeoTrellis](http://geotrellis.io) and |
10 | | -[Apache Spark](http://spark.apache.org/). |
| 7 | +VectorPipe (VP) is a library for working with OpenStreetMap (OSM) vector |
| 8 | +data. Powered by [Geotrellis](http://geotrellis.io) and [Apache |
| 9 | +Spark](http://spark.apache.org/). |
11 | 10 |
|
12 | | -Overview |
13 | | --------- |
| 11 | +OSM provides a wealth of data which has broad coverage and a deep history. |
| 12 | +This comes at the price of very large size which can make accessing the power |
| 13 | +of OSM difficult. VectorPipe can help by making OSM processing in Apache |
| 14 | +Spark possible, leveraging large computing clusters to churn through the large |
| 15 | +volume of, say, an OSM full history file. |
14 | 16 |
|
15 | | - |
| 17 | +For those cases where an application needs to process incoming changes, VP |
| 18 | +also provides streaming Spark `DataSource`s for changesets, OsmChange files, |
| 19 | +and Augmented diffs generated by Overpass. |
16 | 20 |
|
17 | | -There are four main stages here which represent the four main shapes of |
18 | | -data in the pipeline: |
| 21 | +For ease of use, the output of VP imports is a Spark DataFrame containing |
| 22 | +columns of JTS `Geometry` objects, enabled by the user-defined types provided |
| 23 | +by [GeoMesa](https://github.com/locationtech/geomesa). That package also |
| 24 | +provides functions for manipulating those geometries via Spark SQL directives. |
19 | 25 |
|
20 | | -1. Unprocessed Vector (geometric) Data |
21 | | -2. Clipped GeoTrellis `Feature`s organized into a Grid on the earth |
22 | | -3. VectorTiles organized into a Grid on the earth |
23 | | -4. Fully processed VectorTiles, output to some target |
| 26 | +The final important contribution is a set of functions for exporting |
| 27 | +geometries to vector tiles. This leans on the `geotrellis-vectortile` |
| 28 | +package. |
24 | 29 |
|
25 | | -Of these, Stage 4 is left to the user to leverage GeoTrellis directly in |
26 | | -their own application. Luckily the `RDD[(SpatialKey, VectorTile)] => Unit` |
27 | | -operation only requires about 5 lines of code. Stages 1 to 3 then are the |
28 | | -primary concern of Vectorpipe. |
| 30 | +## Getting Started ## |
29 | 31 |
|
30 | | -### Processing Raw Data |
| 32 | +The fastest way to get started with VectorPipe is to invoke `spark-shell` and |
| 33 | +load the package jars from the Bintray repository: |
| 34 | +```bash |
| 35 | +spark-shell --packages com.azavea:vectorpipe_2.11:1.0.0 --repositories http://dl.bintray.com/azavea/maven |
| 36 | +``` |
31 | 37 |
|
32 | | -For each data source that has first-class support, we expose a |
33 | | -`vectorpipe.*` module with a matching name. Example: `vectorpipe.osm`. These |
34 | | -modules expose all the types and functions necessary for transforming the |
35 | | -raw data into the "Middle Ground" types. |
| 38 | +This will download the required components and set up a REPL with VectorPipe |
| 39 | +available. At which point, you may issue |
| 40 | +```scala |
| 41 | +// Make JTS types available to Spark |
| 42 | +import org.locationtech.geomesa.spark.jts._ |
| 43 | +spark.withJTS |
36 | 44 |
|
37 | | -No first-class support for your favourite data source? Want to write it |
38 | | -yourself, and maybe even keep it private? That's okay, just provide the |
39 | | -function `YourData => RDD[Feature[G, D]]` and VectorPipe can handle the |
40 | | -rest. |
| 45 | +import vectorpipe._ |
| 46 | +``` |
| 47 | +and begin using the package. |
41 | 48 |
|
42 | | -### Clipping Features into a Grid |
| 49 | +#### A Note on Cluster Computing #### |
43 | 50 |
|
44 | | -GeoTrellis has a consistent `RDD[(K, V)]` pattern for handling grids of |
45 | | -tiled data, where `K` is the grid index and `V` is the actual value type. |
46 | | -Before `RDD[(SpatialKey, VectorTile)]` can be achieved, we need to convert |
47 | | -our gridless `RDD[Feature[G, D]]` into such a grid, such that each Feature's |
48 | | -`Geometry` is reasonably clipped to the size of an individual tile. Depending |
49 | | -on which clipping function you choose (from the `vectorpipe.Clip` object, or |
50 | | -even your own custom one) the shape of the clipped Geometry will vary. See |
51 | | -our Scaladocs for more detail on the available options. |
| 51 | +Your local machine is probably insufficient for dealing with very large OSM |
| 52 | +files. We recommend the use of Amazon's Elastic Map Reduce (EMR) service to |
| 53 | +provision substantial clusters of computing resources. You'll want to supply |
| 54 | +Spark, Hive, and Hadoop to your cluster, with Spark version 2.3. Creating a |
| 55 | +cluster with EMR version between 5.13 and 5.19 should suffice. From there, |
| 56 | +`ssh` into the master node and run `spark-shell` as above for an interactive |
| 57 | +environment, or use `spark-submit` for batch jobs. (You may submit Steps to |
| 58 | +the EMR cluster using `spark-submit` as well.) |
52 | 59 |
|
53 | | -### Collating Feature Groups into a VectorTile |
| 60 | +### Importing Data ### |
54 | 61 |
|
55 | | -Once clipped and gridded by `VectorPipe.toGrid`, we have a `RDD[(SpatialKey, |
56 | | -Iterable[Feature[G, D]])]` that represents all the Geometry fragments |
57 | | -present at each tiled location on the earth. This is the perfect shape to |
58 | | -turn into a `VectorTile`. To do so, we need to choose a *Collator* function, |
59 | | -which determines what VectorTile Layer each `Feature` should be placed into, |
60 | | -and how (if at all) its corresponding metadata (the `D`) should be |
61 | | -processed. |
| 62 | +Batch analysis can be performed in a few different ways. Perhaps the fastest |
| 63 | +way is to procure an OSM PBF file from a source such as |
| 64 | +[GeoFabrik](https://download.geofabrik.de/index.html), which supplies various |
| 65 | +extracts of OSM, including the full planet worth of data. |
62 | 66 |
|
63 | | -Want to write your own Collator? The `Collate.generically` function will be |
64 | | -of interest to you. |
| 67 | +VectorPipe does not provide the means to directly read these OSM PBF files, |
| 68 | +however, and a conversion to a useful file format will thus be needed. We |
| 69 | +suggest using [`osm2orc`](https://github.com/mojodna/osm2orc) to convert your |
| 70 | +source file to the ORC format which can be read natively via Spark: |
| 71 | +```scala |
| 72 | +val df = spark.read.orc(path) |
| 73 | +``` |
| 74 | +The resulting `DataFrame` can be processed with VectorPipe. |
| 75 | + |
| 76 | +It is also possible to read from a cache of |
| 77 | +[OsmChange](https://wiki.openstreetmap.org/wiki/OsmChange) files directly |
| 78 | +rather than convert the PBF file: |
| 79 | +```scala |
| 80 | +import vectorpipe.sources.Source |
| 81 | +val df = spark.read |
| 82 | + .format(Source.Changes) |
| 83 | + .options(Map[String, String]( |
| 84 | + Source.BaseURI -> "https://download.geofabrik.de/europe/isle-of-man-updates/", |
| 85 | + Source.StartSequence -> "2080", |
| 86 | + Source.EndSequence -> "2174", |
| 87 | + Source.BatchSize -> "1")) |
| 88 | + .load |
| 89 | + .persist // recommended to avoid rereading |
| 90 | +``` |
| 91 | +(Note that the start and end sequence will shift over time for Geofabrik. |
| 92 | +Please navigate to the base URI to determine these values, otherwise timeouts |
| 93 | +may occur.) This may issue errors, but should complete. This is much slower |
| 94 | +than using ORC files and is much touchier, but it stands as an option. |
| 95 | + |
| 96 | +[It is also possible to build a dataframe from a stream of changesets in a |
| 97 | +similar manner as above. Changesets carry additional metadata regarding the |
| 98 | +author of the changes, but none of the geometric information. These tables |
| 99 | +can be joined on `changeset`.] |
| 100 | + |
| 101 | +In either case, a useful place to start is to convert the incoming dataframe |
| 102 | +into a more usable format. We recommend calling |
| 103 | +```scala |
| 104 | +val geoms = OSM.toGeometry(df) |
| 105 | +``` |
| 106 | +which will produce a frame consisting of "top-level" entities, which is to say |
| 107 | +nodes that don't participate in a way, ways that don't participate in |
| 108 | +relations, and a subset of the relations from the OSM data. The resulting |
| 109 | +dataframe will represent these entities with JTS geometries in the `geom` |
| 110 | +column. |
| 111 | + |
| 112 | +The `toGeometry` function keeps elements that fit one of the following |
| 113 | +descriptions: |
| 114 | +- points from tagged nodes (including tags that really ought to be dropped—e.g. `source=*`); |
| 115 | +- polygons derived from ways with tags that cause them to be considered as areas; |
| 116 | +- lines from ways lacking area tags; |
| 117 | +- multipolygons from multipolygon or boundary relations; and |
| 118 | +- multilinestrings from route relations. |
| 119 | + |
| 120 | +It is also possible to filter the results based on information in the tags. |
| 121 | +For instance, all buildings can be found as |
| 122 | +```scala |
| 123 | +import vectorpipe.functions.osm._ |
| 124 | +val buildings = geoms.filter(isBuilding('tags)) |
| 125 | +``` |
| 126 | + |
| 127 | +Again, the JTS user defined types allow for easier manipulation of and |
| 128 | +calculation from geometric types. See |
| 129 | +[here](https://www.geomesa.org/documentation/user/spark/sparksql_functions.html) |
| 130 | +for a list of functions that operate on geometries. |
| 131 | + |
| 132 | +## The `internal` package ## |
| 133 | + |
| 134 | +While most users will rely solely on the features exposed by the `OSM` object, |
| 135 | +finer-grained control of the output of the process—say, if one does not need |
| 136 | +relations, for example—is available through the `vectorpipe.internal` |
| 137 | +package. |
| 138 | + |
| 139 | +There is a significant caveat here: there are two schemas that are |
| 140 | +found in the system when working with imported OSM dataframes. The difference |
| 141 | +is in the type of a sub-field of the `members` list. This can cause errors of |
| 142 | +the form |
| 143 | +``` |
| 144 | +java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Byte |
| 145 | +``` |
| 146 | +when using the `internal` package methods. |
| 147 | + |
| 148 | +These type problems can be fixed by calling |
| 149 | +`vectorpipe.functions.osm.ensureCompressedMembers` on the input OSM data frame |
| 150 | +before passing to any relation-generating functions, such as |
| 151 | +`reconstructRelationGeometries`. Top-level functions in the `OSM` object |
| 152 | +handle this conversion for you. Note that this only affects the data frames |
| 153 | +carrying the initially imported OSM data. |
| 154 | + |
| 155 | +## Local Development ## |
| 156 | + |
| 157 | +If you are intending to contribute to VectorPipe, you may need to work with a |
| 158 | +development version. If that is the case, instead of loading from Bintray, |
| 159 | +you will need to build a fat jar using |
| 160 | +```bash |
| 161 | +./sbt assembly |
| 162 | +``` |
| 163 | +and following that, |
| 164 | +```bash |
| 165 | +spark-shell --jars target/scala_2.11/vectorpipe.jar |
| 166 | +``` |
| 167 | + |
| 168 | +### IntelliJ IDEA |
| 169 | + |
| 170 | +When developing with IntelliJ IDEA, the sbt plugin will see Spark dependencies |
| 171 | +as provided, which will prevent them from being indexed properly, resulting in |
| 172 | +errors / warnings within the IDE. To fix this, create `idea.sbt` at the root of |
| 173 | +the project: |
| 174 | + |
| 175 | +```scala |
| 176 | +import Dependencies._ |
| 177 | + |
| 178 | +lazy val mainRunner = project.in(file("mainRunner")).dependsOn(RootProject(file("."))).settings( |
| 179 | + libraryDependencies ++= Seq( |
| 180 | + sparkSql % Compile |
| 181 | + ) |
| 182 | +) |
| 183 | +``` |
0 commit comments