This repository was archived by the owner on Oct 8, 2020. It is now read-only.

Commit c962776

Merge branch 'release/0.1.0'
2 parents 7f0f417 + 6d03565 commit c962776

97 files changed

Lines changed: 1192 additions & 2671 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -13,6 +13,8 @@ buildNumber.properties
1313
.preferences
1414
.project
1515

16+
deptree.txt
17+
1618
### Java template
1719
*.class
1820

README.md

Lines changed: 26 additions & 29 deletions
@@ -1,39 +1,36 @@
1-
Quick start:
2-
* Import the project into your IDE of choice
3-
* Run the class [MainS2RdfExample.scala](s2rdf-example/src/main/scala/org/aksw/s2rdf/example/MainS2RdfExample.scala)
1+
# SANSA Query
42

3+
## Description
4+
SANSA Query is a library for running queries directly on [Spark](https://spark.apache.org) or [Flink](https://flink.apache.org). Input files may reside in HDFS or in a local file system, and query execution is distributed across Spark RDDs/DataFrames or Flink DataSets.
55

6+
SANSA uses a vertical partitioning (VP) approach and is designed to support extensible partitioning of RDF data. Instead of dealing with a single three-column table (s, p, o), data is partitioned into multiple tables based on the RDF predicates used, the RDF term types, and the literal datatypes. The first column of these tables is always a string representing the subject. The second column always represents the literal value as a Scala/Java datatype. Tables for storing literals with language tags have an additional third string column for the language tag. It uses [Sparqlify](https://github.com/AKSW/Sparqlify) as a scalable SPARQL-SQL rewriter.
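The partitioning scheme described above can be sketched on plain Scala collections. This is only an illustration under stated assumptions: the types `Term`, `Iri`, `Lit`, `Triple`, and `PartitionKey` and the object `VerticalPartitioning` are hypothetical and are not part of SANSA, which operates on Jena triples and Spark or Flink datasets instead.

```scala
// Minimal sketch of vertical partitioning; every name below is hypothetical.
sealed trait Term
final case class Iri(value: String) extends Term
final case class Lit(lexical: String, datatype: String, lang: Option[String] = None) extends Term

final case class Triple(s: String, p: String, o: Term)

// One table per combination of predicate, object term type, datatype, and language-tag presence.
final case class PartitionKey(predicate: String, objectIsLiteral: Boolean,
                              datatype: Option[String], langTagged: Boolean)

object VerticalPartitioning {
  def keyOf(t: Triple): PartitionKey = t.o match {
    case Iri(_)           => PartitionKey(t.p, objectIsLiteral = false, None, langTagged = false)
    case Lit(_, dt, lang) => PartitionKey(t.p, objectIsLiteral = true, Some(dt), lang.isDefined)
  }

  // Row layout: subject first, then the object value, then the language tag if present.
  // Only xsd:double literals are converted to a native type in this sketch.
  def rowOf(t: Triple): Product = t.o match {
    case Iri(v)                                     => (t.s, v)
    case Lit(v, _, Some(lang))                      => (t.s, v, lang)
    case Lit(v, dt, None) if dt.endsWith("#double") => (t.s, v.toDouble)
    case Lit(v, _, None)                            => (t.s, v)
  }

  // Group triples by partition key and convert each triple to its row.
  def partition(triples: Seq[Triple]): Map[PartitionKey, Seq[Product]] =
    triples.groupBy(keyOf).map { case (k, ts) => k -> ts.map(rowOf) }
}
```

Partitioning three triples with an IRI object, a double literal, and a language-tagged literal would yield three separate tables, with two, two, and three columns respectively.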
67

7-
== dataset-creator
8-
vp = vertical partitioning -> each predicate to its own table
8+
### SANSA Query Spark
9+
In SANSA Query Spark, the method for partitioning an RDD[Triple] is located in [RdfPartitionUtilsSpark](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/src/main/scala/net/sansa_stack/rdf/spark/partition/core/RdfPartitionUtilsSpark.scala). It uses an [RdfPartitioner](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/src/main/scala/net/sansa_stack/rdf/common/partition/core/RdfPartitioner.scala), which maps a Triple to a single [RdfPartition](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/src/main/scala/net/sansa_stack/rdf/common/partition/core/RdfPartition.scala) instance.
910

10-
runDriver with 3 args:
11-
1: 'working directory: where the indexes are placed; ./VP ./ExtVP'
12-
2: 'inFile - file within the working directory'
13-
3: 'joinType: VP, ss, so, os' (VP must be used first; ss, so and os are then generated under expVP)
14-
4: 'some threshold of when to not create the extVP index - 1 which always creates all extVP; lorenz thinks it was 0.7 in the authors scripts'
11+
* RdfPartition, as the name suggests, represents a partition of the RDF data and defines two methods:
12+
  * matches(Triple): Boolean: This method is used to test whether a triple fits into a partition.
13+
  * layout => TripleLayout: This method returns the [TripleLayout](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/src/main/scala/net/sansa_stack/rdf/common/partition/layout/TripleLayout.scala) associated with the partition, as explained below.
14+
* Furthermore, RdfPartitions are expected to be serializable, and to define equals and hashCode.
15+
* TripleLayout instances are used to obtain framework-agnostic compact tabular representations of triples according to a partition. For this purpose, it defines two methods:
16+
  * fromTriple(triple: Triple): Product: This method must, for a given triple, return its representation as a Product (the common supertype of all Scala tuples).
17+
  * schema: Type: This method must return the exact Scala type of the objects returned by fromTriple, such as typeOf[Tuple2[String, Double]]. Hence, layouts are expected to only yield instances of one specific type.
1518

19+
See the [available layouts](https://github.com/SANSA-Stack/SANSA-RDF/blob/develop/src/main/scala/net/sansa_stack/rdf/common/partition/layout) for details.
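The two contracts described above can be sketched as follows. This is a simplified illustration and not SANSA's actual code: `SimpleTriple`, `StringDoubleLayout`, and `DoublePartition` are hypothetical names, the traits are redeclared locally, and the schema is given as a `ClassTag` from the standard library instead of the reflection `Type` mentioned above, so that the sketch has no dependencies. Unlike `Type`, a `ClassTag` only records the erased class (`Tuple2`), not the type arguments.

```scala
import scala.reflect.{ClassTag, classTag}

// Hypothetical stand-in for a Jena Triple, with all components as strings.
final case class SimpleTriple(s: String, p: String, o: String)

// Simplified local versions of the two contracts described in the text.
trait TripleLayout extends java.io.Serializable {
  def fromTriple(triple: SimpleTriple): Product
  def schema: ClassTag[_ <: Product]
}

trait RdfPartition extends java.io.Serializable {
  def matches(triple: SimpleTriple): Boolean
  def layout: TripleLayout
}

// A layout that always yields (subject, double value) rows.
object StringDoubleLayout extends TripleLayout {
  def fromTriple(t: SimpleTriple): Product = (t.s, t.o.toDouble)
  val schema: ClassTag[_ <: Product] = classTag[(String, Double)]
}

// A case class is serializable and gets equals and hashCode for free.
final case class DoublePartition(predicate: String) extends RdfPartition {
  def matches(t: SimpleTriple): Boolean =
    t.p == predicate && scala.util.Try(t.o.toDouble).isSuccess
  def layout: TripleLayout = StringDoubleLayout
}
```

Modelling the partition as a case class is one straightforward way to satisfy the serializability and equality requirements, since two partitions for the same predicate then compare equal.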
1620

17-
Settings
18-
currently several settings hard coded
21+
## Usage
1922

23+
The following Scala code shows how to query an RDF file using SPARQL (be it a local file or a file residing in HDFS):
24+
```scala
2025

21-
spark reads default properties from
22-
src/main/resources/conf/spark-default.properties
23-
24-
25-
== query translation
26-
query-translator.run.Main
27-
28-
Generates sparql sql - run the main method to see what SQL gets generated
29-
30-
31-
32-
== query execution
33-
runDriver
34-
1: dbDir: base directory
35-
2: qrFile: Query plan file name - gets generated by the query translator (don't write it manually)
36-
Uses some strange format to create a list queries
37-
QueryTranslator/QYagoQUerySet/sql_090/
26+
val graphRdd = NTripleReader.load(sparkSession, new File("path/to/rdf.nt"))
27+
28+
val partitions = RdfPartitionUtilsSpark.partitionGraph(graphRdd)
29+
val rewriter = SparqlifyUtils3.createSparqlSqlRewriter(sparkSession, partitions)
30+
31+
val qef = new QueryExecutionFactorySparqlifySpark(sparkSession, rewriter)
32+
```
33+
An overview is given in the [FAQ section of the SANSA project page](http://sansa-stack.net/faq/#sparql-queries). Further documentation about the builder objects can also be found on the [ScalaDoc page](http://sansa-stack.net/scaladocs/).
3834

35+
3936
