This repository was archived by the owner on Oct 8, 2020. It is now read-only.

Commit 812c079

Merge branch 'release/0.7.1'
2 parents b1fcbff + f07aecf commit 812c079

23 files changed

Lines changed: 913 additions & 214 deletions


.travis.yml

Lines changed: 12 additions & 4 deletions
@@ -1,10 +1,18 @@
1+
env:
2+
global:
3+
- secure: "J3kCgKGX29zpQw0Eu6iVRrSFQkyoqovCaKCZmFtXMRVku9cp5lRSj72YBTvEDDpEljS9i6IJiVFEy9ux45ZnensBiPePt+Uq7frza7KdgFtC4kcpEA6Jn/I8BKOmvg1X3ZMGqi/g7Df/R/Lb6gvCDnYd+1Oqdhy+j9vEiX1yZDbRA52iIO9nmaPDsmrAoHniD3x8vWHbLh3MtBR25oAbFwvGp6vPXS8irWwaJrgRNvh+MVfBKREWdR8UkC4I7PUq+ux/EOtyP5ErU7rJUnHA1IncLudDaTPY4OHlyc4FUSs6oZ076FauG5pas4c/zrZmF/yH30c2Fo9AWEKcoWYIkAl6AOYc2BMqKxq+VQO1mKT6mxD/v3w446AYqB/LQLuwiaRrxTFVSzppimPK6C0tHxIRKN1q73ptsXJZjEjOirMulMuw6t5FQwNSoOhyZpKF5Fbfw1dfWlslOhVw5qceE0fEe726AGmzAQoO+sqMwiNxcgoy7MYotApv+arwsA2ZFngN5EgyMQ5CmDOhibyMIAWSO9t8hdn+0qf003serv9WbnDfuf+DG/ch6EGn5ovPHxMxxOBv2Mo6umbk6dz5CJceMeR/4w1aE2FXLy8OA23CNXRSoZ0dVpEva6RPBX9L+rt71R1QtKOuUINYodgQ57OV4fGD+daxjgoHXiOIjlg="
4+
15
language: scala
26
sudo: false
37
cache:
48
directories:
5-
- $HOME/.m2
9+
- "$HOME/.m2"
610
scala:
7-
- 2.11.11
11+
- 2.11.11
12+
jdk:
13+
- openjdk8
814
script:
9-
- mvn scalastyle:check
10-
- mvn test
15+
- mvn scalastyle:check
16+
- mvn -U test
17+
after_success:
18+
- mvn -U clean test jacoco:report coveralls:report

README.md

Lines changed: 16 additions & 4 deletions
@@ -5,12 +5,14 @@
55
[![Twitter](https://img.shields.io/twitter/follow/SANSA_Stack.svg?style=social)](https://twitter.com/SANSA_Stack)
66

77
## Description
8-
SANSA Query is a library to perform queries directly into [Spark](https://spark.apache.org) or [Flink](https://flink.apache.org). It allows files to reside in HDFS as well as in a local file system and distributes executions across Spark RDDs/DataFrames or Flink DataSets.
8+
SANSA Query is a library to perform SPARQL queries over RDF data using the big data engines [Spark](https://spark.apache.org) and [Flink](https://flink.apache.org). It can query RDF data that resides in HDFS or in a local file system. Queries are executed in a distributed, parallel fashion across Spark RDDs/DataFrames or Flink DataSets. Further, SANSA Query can query non-RDF data stored in databases (e.g., MongoDB, Cassandra, MySQL) or in the Parquet file format, using Spark.
99

10-
SANSA uses vertical partitioning (VP) approach and is designed to support extensible partitioning of RDF data. Instead of dealing with a single three-column table (s, p, o), data is partitioned into multiple tables based on the used RDF predicates, RDF term types and literal datatypes. The first column of these tables is always a string representing the subject. The second column always represents the literal value as a Scala/Java datatype. Tables for storing literals with language tags have an additional third string column for the language tag. Its uses [Sparqlify](https://github.com/AKSW/Sparqlify) as a scalable SPARQL-SQL rewriter.
10+
For RDF data, SANSA uses a vertical partitioning (VP) approach and is designed to support extensible partitioning of RDF data. Instead of dealing with a single triple table (s, p, o), data is partitioned into multiple tables based on the RDF predicates used, RDF term types and literal datatypes. The first column of these tables is always a string representing the subject. The second column always represents the literal value as a Scala/Java datatype. Tables for storing literals with language tags have an additional third string column for the language tag. SANSA Query uses [Sparqlify](https://github.com/AKSW/Sparqlify) as a scalable SPARQL-SQL rewriter.
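The vertical partitioning idea described above can be illustrated with plain Scala collections. This is a minimal sketch, not the SANSA implementation: the `Triple` case class and the `partition` function are hypothetical names introduced here, and the sketch splits only by predicate, ignoring the RDF term types, literal datatypes and language tags that the real partitioner also takes into account.

```scala
// Hypothetical, simplified triple model (assumption; not the Jena/SANSA Triple class).
final case class Triple(s: String, p: String, o: String)

object VerticalPartitioningSketch {
  // Build one two-column (subject, object) table per predicate.
  def partition(triples: Seq[Triple]): Map[String, Seq[(String, String)]] =
    triples.groupBy(_.p).map { case (predicate, ts) =>
      predicate -> ts.map(t => (t.s, t.o))
    }

  def main(args: Array[String]): Unit = {
    val triples = Seq(
      Triple("ex:alice", "foaf:name", "Alice"),
      Triple("ex:bob", "foaf:name", "Bob"),
      Triple("ex:alice", "foaf:knows", "ex:bob")
    )
    // Prints one line per predicate table.
    partition(triples).foreach { case (predicate, rows) =>
      println(s"$predicate -> $rows")
    }
  }
}
```

A query that touches only `foaf:name` then needs to scan just that table rather than the full triple table, which is the motivation for this layout.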
1111

12-
### SANSA Query Spark
13-
On SANSA Query Spark the method for partitioning an `RDD[Triple]` is located in [RdfPartitionUtilsSpark](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/sansa-rdf-spark/src/main/scala/net/sansa_stack/rdf/spark/partition/core/RdfPartitionUtilsSpark.scala). It uses an [RdfPartitioner](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/sansa-rdf-common/src/main/scala/net/sansa_stack/rdf/common/partition/core/RdfPartitioner.scala) which maps a Triple to a single [RdfPartition](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/sansa-rdf-common/src/main/scala/net/sansa_stack/rdf/common/partition/core/RdfPartition.scala) instance.
12+
For heterogeneous data sources (data lake), SANSA uses virtual property tables (PT) partitioning, whereby data relevant to a query is loaded _on the fly_ into Spark DataFrames composed of attributes corresponding to the properties of the query.
13+
14+
### SANSA Query Spark - RDF
15+
On SANSA Query Spark for RDF the method for partitioning an `RDD[Triple]` is located in [RdfPartitionUtilsSpark](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/sansa-rdf-spark/src/main/scala/net/sansa_stack/rdf/spark/partition/core/RdfPartitionUtilsSpark.scala). It uses an [RdfPartitioner](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/sansa-rdf-common/src/main/scala/net/sansa_stack/rdf/common/partition/core/RdfPartitioner.scala) which maps a Triple to a single [RdfPartition](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/sansa-rdf-common/src/main/scala/net/sansa_stack/rdf/common/partition/core/RdfPartition.scala) instance.
1416

1517
* [RdfPartition](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/sansa-rdf-common/src/main/scala/net/sansa_stack/rdf/common/partition/core/RdfPartition.scala) - as the name suggests, represents a partition of the RDF data and defines two methods:
1618
* `matches(Triple): Boolean`: This method is used to test whether a triple fits into a partition.
@@ -22,6 +24,14 @@ On SANSA Query Spark the method for partitioning an `RDD[Triple]` is located in
2224

2325
See the [available layouts](https://github.com/SANSA-Stack/SANSA-RDF/tree/develop/sansa-rdf-common/src/main/scala/net/sansa_stack/rdf/common/partition/layout) for details.
2426

27+
### SANSA Query Spark - Heterogeneous Data Sources
28+
SANSA Query Spark for heterogeneous data sources (data lake) is composed of four main components:
29+
30+
* [Analyser](https://github.com/SANSA-Stack/SANSA-DataLake/tree/develop/sansa-datalake-spark/src/main/scala/net/sansa_stack/datalake/spark): extracts SPARQL triple patterns and groups them by subject; it also extracts any operations on subjects, such as filters, group by, order by, distinct and limit.
31+
* [Planner](https://github.com/SANSA-Stack/SANSA-DataLake/blob/develop/sansa-datalake-spark/src/main/scala/net/sansa_stack/datalake/spark/Planner.scala): extracts joins between subject-based triple pattern groups and generates a join plan accordingly. The join order followed is left-deep.
32+
* [Mapper](https://github.com/SANSA-Stack/SANSA-DataLake/blob/develop/sansa-datalake-spark/src/main/scala/net/sansa_stack/datalake/spark/Mapper.scala): accesses the (RML) mappings and matches the properties of a subject-based triple pattern group against the attributes of individual data sources. If a match exists for every property of the triple pattern group, the respective data source is declared _relevant_ and loaded into a Spark DataFrame. Loading into DataFrames is performed using [Spark Connectors](https://spark-packages.org/).
33+
* [Executor](https://github.com/SANSA-Stack/SANSA-DataLake/blob/develop/sansa-datalake-spark/src/main/scala/net/sansa_stack/datalake/spark/SparkExecutor.scala): analyses the SPARQL query and generates equivalent Spark SQL functions over DataFrames for SELECT, WHERE, GROUP-BY, ORDER-BY and LIMIT. Connections between subject-based triple pattern groups are translated into JOINs between the relevant Spark DataFrames.
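The first two steps in the list above (grouping triple patterns by subject, then finding joins between the groups) can be sketched with plain Scala collections. This is an illustration under stated assumptions, not the SANSA-DataLake code: `TriplePattern`, `stars` and `joins` are hypothetical names, patterns are modelled as plain strings, and a join is assumed to exist whenever the object of a pattern in one group is the subject of another group.

```scala
// Hypothetical, simplified triple pattern model (assumption; not the SANSA-DataLake API).
final case class TriplePattern(subject: String, predicate: String, obj: String)

object StarJoinSketch {
  // Analyser-like step: group triple patterns by their subject.
  def stars(patterns: Seq[TriplePattern]): Map[String, Seq[TriplePattern]] =
    patterns.groupBy(_.subject)

  // Planner-like step: a join links two groups when the object of a pattern
  // in one group is the subject of another group.
  // Each result is (left subject, connecting predicate, right subject).
  def joins(groups: Map[String, Seq[TriplePattern]]): Seq[(String, String, String)] =
    for {
      (subject, tps) <- groups.toSeq
      tp <- tps
      if tp.obj != subject && groups.contains(tp.obj)
    } yield (subject, tp.predicate, tp.obj)

  def main(args: Array[String]): Unit = {
    val patterns = Seq(
      TriplePattern("?person", "ex:name", "?name"),
      TriplePattern("?person", "ex:worksFor", "?company"),
      TriplePattern("?company", "ex:label", "?label")
    )
    val groups = stars(patterns)
    println(joins(groups)) // one join: ?person -[ex:worksFor]-> ?company
  }
}
```

In the described architecture, each group would then be matched against a data source and loaded as a DataFrame, and each detected join would become a JOIN between the corresponding DataFrames.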
34+
2535
## Usage
2636

2737
The following Scala code shows how to query an RDF file using SPARQL syntax (be it a local file or a file residing in HDFS):
@@ -45,5 +55,7 @@ server.join()
4555
```
4656
An overview is given in the [FAQ section of the SANSA project page](http://sansa-stack.net/faq/#sparql-queries). Further documentation about the builder objects can also be found on the [ScalaDoc page](http://sansa-stack.net/scaladocs/).
4757

58+
For querying heterogeneous data sources, refer to the documentation of the dedicated [SANSA-DataLake](https://github.com/SANSA-Stack/SANSA-DataLake) component.
59+
4860
## How to Contribute
4961
We always welcome new contributors to the project! Please see [our contribution guide](http://sansa-stack.net/contributing-to-sansa/) for more details on how to get started contributing to SANSA.

pom.xml

Lines changed: 12 additions & 1 deletion
@@ -9,7 +9,7 @@
 	<parent>
 		<groupId>net.sansa-stack</groupId>
 		<artifactId>sansa-parent</artifactId>
-		<version>0.6.0</version>
+		<version>0.7.1</version>
 	</parent>

 	<packaging>pom</packaging>
@@ -74,6 +74,17 @@
 			</roles>
 			<timezone>0</timezone>
 		</developer>
+		<developer>
+			<id>GezimSejdiu</id>
+			<name>Gezim Sejdiu</name>
+			<url>https://gezimsejdiu.github.io/</url>
+			<organization>SDA</organization>
+			<organizationUrl>http://sda.tech</organizationUrl>
+			<roles>
+				<role>developer</role>
+			</roles>
+			<timezone>0</timezone>
+		</developer>
 	</developers>

 	<modules>

sansa-query-common/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 	<parent>
 		<groupId>net.sansa-stack</groupId>
 		<artifactId>sansa-query-parent_2.11</artifactId>
-		<version>0.6.0</version>
+		<version>0.7.1</version>
 	</parent>

 	<dependencies>

sansa-query-flink/pom.xml

Lines changed: 98 additions & 97 deletions
@@ -1,100 +1,101 @@
11
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
2-
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
3-
<modelVersion>4.0.0</modelVersion>
4-
5-
<artifactId>sansa-query-flink_2.11</artifactId>
6-
7-
<parent>
8-
<groupId>net.sansa-stack</groupId>
9-
<artifactId>sansa-query-parent_2.11</artifactId>
10-
<version>0.6.0</version>
11-
</parent>
12-
13-
14-
<dependencies>
15-
16-
<dependency>
17-
<groupId>${project.groupId}</groupId>
18-
<artifactId>sansa-rdf-flink${scala.version.suffix}</artifactId>
19-
</dependency>
20-
21-
<dependency>
22-
<groupId>${project.groupId}</groupId>
23-
<artifactId>sansa-rdf-spark${scala.version.suffix}</artifactId>
24-
</dependency>
25-
26-
<dependency>
27-
<groupId>${project.groupId}</groupId>
28-
<artifactId>sansa-rdf-common${scala.version.suffix}</artifactId>
29-
</dependency>
30-
31-
<dependency>
32-
<groupId>org.scalatest</groupId>
33-
<artifactId>scalatest_${scala.binary.version}</artifactId>
34-
<scope>test</scope>
35-
</dependency>
36-
37-
38-
<dependency>
39-
<groupId>org.aksw.bsbm</groupId>
40-
<artifactId>bsbm-core</artifactId>
41-
<scope>test</scope>
42-
</dependency>
43-
44-
45-
<dependency>
46-
<groupId>com.typesafe.scala-logging</groupId>
47-
<artifactId>scala-logging_${scala.binary.version}</artifactId>
48-
<version>3.5.0</version>
49-
</dependency>
50-
51-
<dependency>
52-
<groupId>org.scalatest</groupId>
53-
<artifactId>scalatest_${scala.binary.version}</artifactId>
54-
<scope>test</scope>
55-
</dependency>
56-
57-
<dependency>
58-
<groupId>junit</groupId>
59-
<artifactId>junit</artifactId>
60-
<scope>test</scope>
61-
</dependency>
62-
63-
<dependency>
64-
<groupId>org.apache.calcite</groupId>
65-
<artifactId>calcite-core</artifactId>
66-
<version>1.13.0</version>
67-
<scope>test</scope>
68-
</dependency>
69-
70-
<dependency>
71-
<groupId>com.google.protobuf</groupId>
72-
<artifactId>protobuf-java</artifactId>
73-
<version>3.3.1</version>
74-
<scope>runtime</scope>
75-
</dependency>
76-
77-
</dependencies>
78-
79-
<build>
80-
<!-- <sourceDirectory>src/main/java</sourceDirectory> -->
81-
<!-- <sourceDirectory>src/main/scala</sourceDirectory> -->
82-
<plugins>
83-
<plugin>
84-
<groupId>org.apache.maven.plugins</groupId>
85-
<artifactId>maven-compiler-plugin</artifactId>
86-
</plugin>
87-
88-
<plugin>
89-
<groupId>net.alchim31.maven</groupId>
90-
<artifactId>scala-maven-plugin</artifactId>
91-
</plugin>
92-
93-
<plugin>
94-
<groupId>org.scalastyle</groupId>
95-
<artifactId>scalastyle-maven-plugin</artifactId>
96-
</plugin>
97-
</plugins>
98-
</build>
2+
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
3+
<modelVersion>4.0.0</modelVersion>
4+
5+
<artifactId>sansa-query-flink_2.11</artifactId>
6+
7+
<parent>
8+
<groupId>net.sansa-stack</groupId>
9+
<artifactId>sansa-query-parent_2.11</artifactId>
10+
<version>0.7.1</version>
11+
</parent>
12+
13+
14+
<dependencies>
15+
16+
<dependency>
17+
<groupId>${project.groupId}</groupId>
18+
<artifactId>sansa-rdf-flink${scala.version.suffix}</artifactId>
19+
</dependency>
20+
21+
<dependency>
22+
<groupId>${project.groupId}</groupId>
23+
<artifactId>sansa-rdf-spark${scala.version.suffix}</artifactId>
24+
</dependency>
25+
26+
<dependency>
27+
<groupId>${project.groupId}</groupId>
28+
<artifactId>sansa-rdf-common${scala.version.suffix}</artifactId>
29+
</dependency>
30+
31+
<dependency>
32+
<groupId>org.scalatest</groupId>
33+
<artifactId>scalatest_${scala.binary.version}</artifactId>
34+
<scope>test</scope>
35+
</dependency>
36+
37+
38+
<dependency>
39+
<groupId>org.aksw.bsbm</groupId>
40+
<artifactId>bsbm-core</artifactId>
41+
<scope>test</scope>
42+
</dependency>
43+
44+
45+
<dependency>
46+
<groupId>com.typesafe.scala-logging</groupId>
47+
<artifactId>scala-logging_${scala.binary.version}</artifactId>
48+
<version>3.5.0</version>
49+
</dependency>
50+
51+
<dependency>
52+
<groupId>org.scalatest</groupId>
53+
<artifactId>scalatest_${scala.binary.version}</artifactId>
54+
<scope>test</scope>
55+
</dependency>
56+
57+
<dependency>
58+
<groupId>junit</groupId>
59+
<artifactId>junit</artifactId>
60+
<scope>test</scope>
61+
</dependency>
62+
63+
<dependency>
64+
<groupId>org.apache.calcite</groupId>
65+
<artifactId>calcite-core</artifactId>
66+
<version>1.13.0</version>
67+
<scope>test</scope>
68+
</dependency>
69+
70+
<dependency>
71+
<groupId>com.google.protobuf</groupId>
72+
<artifactId>protobuf-java</artifactId>
73+
<version>3.3.1</version>
74+
<scope>runtime</scope>
75+
</dependency>
76+
77+
78+
</dependencies>
79+
80+
<build>
81+
<!-- <sourceDirectory>src/main/java</sourceDirectory> -->
82+
<!-- <sourceDirectory>src/main/scala</sourceDirectory> -->
83+
<plugins>
84+
<plugin>
85+
<groupId>org.apache.maven.plugins</groupId>
86+
<artifactId>maven-compiler-plugin</artifactId>
87+
</plugin>
88+
89+
<plugin>
90+
<groupId>net.alchim31.maven</groupId>
91+
<artifactId>scala-maven-plugin</artifactId>
92+
</plugin>
93+
94+
<plugin>
95+
<groupId>org.scalastyle</groupId>
96+
<artifactId>scalastyle-maven-plugin</artifactId>
97+
</plugin>
98+
</plugins>
99+
</build>
99100

100101
</project>

sansa-query-flink/src/main/scala/net/sansa_stack/query/flink/sparqlify/BasicTableInfoProviderFlink.scala

Lines changed: 8 additions & 5 deletions
@@ -2,8 +2,10 @@ package net.sansa_stack.query.flink.sparqlify

 import java.util.Collections

+import org.aksw.commons.util.strings.StringUtils
+
 import collection.JavaConverters._
-import org.aksw.sparqlify.config.v0_2.bridge.{ BasicTableInfo, BasicTableInfoProvider }
+import org.aksw.sparqlify.config.v0_2.bridge.{BasicTableInfo, BasicTableInfoProvider}
 import org.apache.flink.table.api.scala.BatchTableEnvironment

 /**
@@ -14,10 +16,11 @@ class BasicTableInfoProviderFlink(flinkTable: BatchTableEnvironment)
   override def getBasicTableInfo(queryString: String): BasicTableInfo = {
     val table = flinkTable.sqlQuery(queryString)
     val schema = table.getSchema
-    val types = schema.getTypes
-    val names = schema.getColumnNames
-    val map = (0 until types.length).map { i =>
-      (names(i), types(i).toString)
+    val types = schema.getFieldDataTypes
+    val names = schema.getFieldNames
+    val map = (0 until types.length).map { i => {
+      (names(i), types(i).toString.toLowerCase.capitalize)
+    }
     } toMap

     println(map)
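The changed line normalizes each type name with `toLowerCase.capitalize`. The sketch below isolates that step with plain strings so it can be run without Flink. It assumes, without confirmation from the commit, that the new schema accessor renders type names in upper case (for example `STRING`) and that the consumer of the map expects the capitalized form (`String`). `TypeNameSketch`, `normalize` and `columnTypes` are hypothetical names introduced for illustration.

```scala
object TypeNameSketch {
  // Same normalization as in the changed line: lower-case everything,
  // then upper-case the first character.
  def normalize(typeName: String): String = typeName.toLowerCase.capitalize

  // Pair column names with normalized type names. Unlike the index-based
  // loop in the diff, zip silently truncates if the two sequences differ in length.
  def columnTypes(names: Seq[String], types: Seq[String]): Map[String, String] =
    names.zip(types.map(normalize)).toMap

  def main(args: Array[String]): Unit = {
    println(normalize("STRING")) // prints "String"
    println(columnTypes(Seq("s", "o"), Seq("STRING", "INT")))
  }
}
```

Note that this normalization only capitalizes the first letter, so a multi-word or parameterized type name such as `DECIMAL(10, 2)` becomes `Decimal(10, 2)`; whether that matches what the consumer expects is not stated in the diff.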
