Commit 883241e

Merge pull request #7 from Skarface-/master
Scala Implementation of node2vec with Spark
2 parents 5fe9af6 + 74f264e commit 883241e

10 files changed

Lines changed: 879 additions & 0 deletions

File tree

.gitignore

Lines changed: 13 additions & 0 deletions
@@ -1 +1,14 @@
11
*.pyc
2+
.DS_Store
3+
target
4+
bin
5+
build
6+
.gradle
7+
*.iml
8+
*.ipr
9+
*.iws
10+
*.log
11+
.classpath
12+
.project
13+
.settings
14+
.idea

node2vec_spark/README.md

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
1+
# node2vec on spark
2+
3+
This library is a Scala implementation of *node2vec* that runs on Spark, as described in the paper:
4+
> node2vec: Scalable Feature Learning for Networks.
5+
> Aditya Grover and Jure Leskovec.
6+
> Knowledge Discovery and Data Mining, 2016.
7+
> <Insert paper link>
8+
9+
The *node2vec* algorithm learns continuous representations for nodes in any (un)directed, (un)weighted graph. Please check the [project page](https://snap.stanford.edu/node2vec/) for more details.
10+
11+
12+
### Building node2vec_spark
13+
**In order to build node2vec_spark, use the following:**
14+
15+
```
16+
$ git clone https://github.com/Skarface-/node2vec.git
17+
$ cd node2vec/node2vec_spark
$ mvn clean package
18+
```
19+
20+
**The build requires:**<br/>
21+
Maven 3.0.5 or newer<br/>
22+
Java 7+<br/>
23+
Scala 2.10 or newer.
24+
25+
This will produce a jar file in "node2vec_spark/target/".
26+
27+
### Examples
28+
This library has two functions: *randomwalk* and *embedding*. <br/>
29+
These are described in the papers [node2vec: Scalable Feature Learning for Networks](http://arxiv.org/abs/1607.00653) and [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781).
30+
31+
### Random walk
32+
Example:
33+
34+
./spark-submit --class com.navercorp.Main \
35+
./node2vec_spark/target/node2vec-0.0.1-SNAPSHOT.jar \
36+
--cmd randomwalk --p 100.0 --q 100.0 --walkLength 40 \
37+
--input <input> --output <output>
38+
39+
#### Options
40+
Invoke a command without arguments to list available arguments and their default values:
41+
42+
```
43+
--cmd COMMAND
44+
Function to run: "randomwalk" or "embedding". To run both "randomwalk" and "embedding" sequentially, pass "node2vec". Default is "node2vec".
45+
--input [INPUT]
46+
Input edgelist path. The supported input format is an edgelist: "node1_id_int node2_id_int <weight_float, optional>"
47+
--output [OUTPUT]
48+
Output path for the random paths.
49+
--walkLength WALK_LENGTH
50+
Length of walk per source. Default is 80.
51+
--numWalks NUM_WALKS
52+
Number of walks per source. Default is 10.
53+
--p P
54+
Return hyperparameter. Default is 1.0.
55+
--q Q
56+
In-out hyperparameter. Default is 1.0.
57+
--weighted Boolean
58+
Whether the graph is weighted. Default is true.
59+
--directed Boolean
60+
Whether the graph is directed. Default is false.
61+
--degree UPPER_BOUND_OF_NUMBER_OF_NEIGHBORS
62+
Upper bound on the number of neighbors per node. Default is 30.
63+
--indexed Boolean
64+
Whether the nodes in the edgelist are already indexed. Default is true.
65+
```
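
The `--p` and `--q` options correspond to the return and in-out parameters of the node2vec paper. The Python sketch below illustrates how the paper defines the second-order search bias; it is not taken from the node2vec_spark sources, and the names `search_bias` and `transition_probs` are hypothetical.

```python
def search_bias(p: float, q: float, dist_prev_to_next: int) -> float:
    """Unnormalized node2vec search bias, as defined in the paper.

    dist_prev_to_next is the shortest-path distance (0, 1 or 2) between the
    previously visited node and the candidate next node.
    """
    if dist_prev_to_next == 0:
        # The candidate is the previous node itself: a "return" step.
        return 1.0 / p
    if dist_prev_to_next == 1:
        # The candidate is also a neighbor of the previous node.
        return 1.0
    if dist_prev_to_next == 2:
        # The candidate moves further away from the previous node.
        return 1.0 / q
    raise ValueError("distance must be 0, 1 or 2")


def transition_probs(p, q, candidates):
    """Normalize weight * bias over (edge_weight, dist_prev_to_next) pairs."""
    unnormalized = [w * search_bias(p, q, d) for w, d in candidates]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

With `p = q = 1.0` the bias is 1 for every candidate, so the walk is driven by edge weights alone. A larger `p` makes returning to the previous node less likely, and a larger `q` keeps the walk closer to the previous node.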
66+
67+
* If "indexed" is set to false, *node2vec_spark* indexes the nodes in the input edgelist, for example: <br/>
68+
**unindexed edgelist:**<br/>
69+
node1 node2 1.0<br/>
70+
node2 node7 1.0<br/>
71+
72+
**indexed:**<br/>
73+
1 2 1.0<br/>
74+
2 3 1.0<br/>
75+
76+
1 node1<br/>
77+
2 node2<br/>
78+
3 node7
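
The indexing step shown above can be sketched in Python as follows. This is an illustration only: the function name `index_edgelist` is hypothetical, and assigning ids in first-seen order starting from 1 is an assumption chosen to reproduce the example, not a statement about how node2vec_spark assigns ids on a cluster.

```python
def index_edgelist(lines):
    """Replace string node names with integer ids.

    Returns the indexed edge lines and the name-to-id mapping.
    Ids are assigned in first-seen order, starting from 1 (an assumption).
    """
    node_to_id = {}
    indexed = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        src, dst, *rest = parts  # rest holds the optional weight
        for name in (src, dst):
            if name not in node_to_id:
                node_to_id[name] = len(node_to_id) + 1
        indexed.append(" ".join([str(node_to_id[src]), str(node_to_id[dst]), *rest]))
    return indexed, node_to_id
```

Calling `index_edgelist(["node1 node2 1.0", "node2 node7 1.0"])` yields the indexed edges `1 2 1.0` and `2 3 1.0` together with the mapping of `node1`, `node2` and `node7` to 1, 2 and 3.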
79+
80+
#### Input
81+
The supported input format is an edgelist:
82+
83+
node1_id_int node2_id_int <weight_float, optional>
84+
or
85+
node1_str node2_str <weight_float, optional> (in this case, set the option "indexed" to false)
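
As a rough illustration of the edgelist format, the Python sketch below parses one line into a source, a destination and a weight. The names `Edge` and `parse_edge` are hypothetical, and falling back to a weight of 1.0 when the third field is missing is an assumption; the README does not state which value node2vec_spark uses.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    src: str
    dst: str
    weight: float


def parse_edge(line: str, default_weight: float = 1.0) -> Edge:
    """Parse "src dst [weight]" separated by whitespace."""
    parts = line.split()
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 2 or 3 fields, got {len(parts)}: {line!r}")
    # The weight is optional; default_weight is an assumed fallback.
    weight = float(parts[2]) if len(parts) == 3 else default_weight
    return Edge(parts[0], parts[1], weight)
```

Node ids are kept as strings here so that the same helper covers both the indexed and the unindexed form of the input.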
86+
87+
88+
#### Output
89+
The output file contains (number of nodes)*numWalks random paths in the following format:
90+
91+
src_node_id_int node1_id_int node2_id_int ... noden_id_int
92+
93+
94+
### Embedding random paths
95+
Example:
96+
97+
./spark-submit --class com.navercorp.Main \
98+
./node2vec_spark/target/node2vec-0.0.1-SNAPSHOT.jar \
99+
--cmd embedding --dim 50 --iter 20 \
100+
--input <input> --nodePath <node2id_path> --output <output>
101+
102+
#### Options
103+
Invoke a command without arguments to list available arguments and their default values:
104+
105+
```
106+
--cmd COMMAND
107+
Function to run: "embedding". To run both "randomwalk" and "embedding" sequentially, pass "node2vec". Default is "node2vec".
108+
--input [INPUT]
109+
Input random paths. The supported input format is random paths: "src_node_id_int node1_id_int ... noden_id_int"
110+
--output [OUTPUT]
111+
word2vec model(.bin) and embeddings(.emb).
112+
--nodePath [NODE\_PATH]
113+
Input node2index path. The supported input format: "node1_str node1_id_int"
114+
--iter ITERATION
115+
Number of epochs in SGD. Default 10.
116+
--dim DIMENSION
117+
Number of dimensions. Default is 128.
118+
--window WINDOW_SIZE
119+
Context size for optimization. Default is 10.
120+
121+
```
122+
123+
#### Input
124+
The supported input format is random paths:
125+
126+
src_node_id_int node1_id_int ... noden_id_int
127+
128+
#### Output
129+
The output files are **embeddings and word2vec model.** The embeddings file has the following format:
130+
131+
node1_str dim1 dim2 ... dimd
132+
133+
where dim1, ..., dimd are the components of the d-dimensional representation learned by word2vec.
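
A minimal Python sketch for reading an embeddings file in this format is given below. The function name `load_embeddings` is hypothetical, and the sketch assumes the fields are separated by whitespace, as in the format line above.

```python
def load_embeddings(lines):
    """Map each node name to its list of float components."""
    embeddings = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # The first field is the node name, the rest are dim1 ... dimd.
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings
```

The function accepts any iterable of lines, so an open file handle over the `.emb` output can be passed directly.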
134+
135+
The *word2vec model* output file uses the Spark word2vec model format. For details, see https://spark.apache.org/docs/1.5.2/mllib-feature-extraction.html#word2vec
136+
137+
## References
138+
1. [node2vec: Scalable Feature Learning for Networks](http://arxiv.org/abs/1607.00653)
139+
2. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781)

node2vec_spark/pom.xml

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
1+
<?xml version="1.0" encoding="UTF-8"?>
2+
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
3+
4+
<modelVersion>4.0.0</modelVersion>
5+
6+
<groupId>com.navercorp</groupId>
7+
<artifactId>node2vec</artifactId>
8+
<packaging>jar</packaging>
9+
<version>0.0.1-SNAPSHOT</version>
10+
11+
<name>node2vec_spark</name>
12+
<url>http://snap.stanford.edu/node2vec/</url>
13+
14+
<properties>
15+
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
16+
<shadedClassifier>bin</shadedClassifier>
17+
<maven-shade-plugin.version>2.4.3</maven-shade-plugin.version>
18+
<exec-maven-plugin.version>1.4.0</exec-maven-plugin.version>
19+
<java.version>1.7</java.version>
20+
<scala.binary.version>2.10</scala.binary.version>
21+
</properties>
22+
23+
<build>
24+
<plugins>
25+
<plugin>
26+
<groupId>org.scala-tools</groupId>
27+
<artifactId>maven-scala-plugin</artifactId>
28+
<version>2.15.2</version>
29+
<executions>
30+
<execution>
31+
<goals>
32+
<goal>compile</goal>
33+
</goals>
34+
</execution>
35+
</executions>
36+
</plugin>
37+
<plugin>
38+
<groupId>org.apache.maven.plugins</groupId>
39+
<artifactId>maven-dependency-plugin</artifactId>
40+
<version>2.4</version>
41+
<executions>
42+
<execution>
43+
<id>copy-dependencies</id>
44+
<phase>package</phase>
45+
<goals>
46+
<goal>copy-dependencies</goal>
47+
</goals>
48+
<configuration>
49+
<outputDirectory>${project.build.directory}/lib</outputDirectory>
50+
</configuration>
51+
</execution>
52+
</executions>
53+
</plugin>
54+
<plugin>
55+
<groupId>org.apache.maven.plugins</groupId>
56+
<artifactId>maven-shade-plugin</artifactId>
57+
<version>1.6</version>
58+
<executions>
59+
<execution>
60+
<phase>package</phase>
61+
<goals>
62+
<goal>shade</goal>
63+
</goals>
64+
</execution>
65+
</executions>
66+
</plugin>
67+
<plugin>
68+
<groupId>org.apache.maven.plugins</groupId>
69+
<artifactId>maven-compiler-plugin</artifactId>
70+
<version>2.3.2</version>
71+
<configuration>
72+
<source>1.7</source>
73+
<target>1.7</target>
74+
<encoding>UTF-8</encoding>
75+
</configuration>
76+
</plugin>
77+
<plugin>
78+
<groupId>org.apache.maven.plugins</groupId>
79+
<artifactId>maven-surefire-plugin</artifactId>
80+
<configuration>
81+
<skip>false</skip>
82+
</configuration>
83+
</plugin>
84+
</plugins>
85+
</build>
86+
87+
<dependencies>
88+
<dependency>
89+
<groupId>org.apache.hadoop</groupId>
90+
<artifactId>hadoop-hdfs</artifactId>
91+
<version>2.7.1</version>
92+
</dependency>
93+
<dependency>
94+
<groupId>org.scala-lang</groupId>
95+
<artifactId>scala-library</artifactId>
96+
<version>${scala.binary.version}.5</version>
97+
<scope>provided</scope>
98+
</dependency>
99+
<dependency>
100+
<groupId>org.apache.spark</groupId>
101+
<artifactId>spark-core_${scala.binary.version}</artifactId>
102+
<version>1.6.1</version>
103+
<scope>provided</scope>
104+
</dependency>
105+
<dependency>
106+
<groupId>org.apache.spark</groupId>
107+
<artifactId>spark-mllib_${scala.binary.version}</artifactId>
108+
<version>1.6.1</version>
109+
<scope>provided</scope>
110+
</dependency>
111+
<dependency>
112+
<groupId>com.github.scopt</groupId>
113+
<artifactId>scopt_${scala.binary.version}</artifactId>
114+
<version>3.3.0</version>
115+
<exclusions>
116+
<exclusion>
117+
<groupId>org.scala-lang</groupId>
118+
<artifactId>scala-library</artifactId>
119+
</exclusion>
120+
</exclusions>
121+
</dependency>
122+
<dependency>
123+
<groupId>com.google.guava</groupId>
124+
<artifactId>guava</artifactId>
125+
<version>19.0</version>
126+
</dependency>
127+
</dependencies>
128+
129+
</project>
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
2+
appender.out.type = Console
3+
appender.out.name = out
4+
appender.out.layout.type = PatternLayout
5+
appender.out.layout.pattern = [%30.30t] %-30.30c{1} %-5p %m%n
6+
logger.springframework.name = org.springframework
7+
logger.springframework.level = WARN
8+
rootLogger.level = INFO
9+
rootLogger.appenderRef.out.ref = out
