Commit 7c56ae4

All LSH classes implement Serializable + added an example of serialization...
1 parent f5f576e commit 7c56ae4

4 files changed

Lines changed: 189 additions & 3 deletions

README.md

Lines changed: 88 additions & 1 deletion
@@ -5,6 +5,10 @@ A Java implementation of Locality Sensitive Hashing (LSH).
 
 Locality Sensitive Hashing (LSH) is a family of hashing methods that tend to produce the same hash (or signature) for similar items. Different LSH functions exist, each corresponding to a similarity metric. For example, the MinHash algorithm is designed for Jaccard similarity (the relative number of elements that two sets have in common). For cosine similarity, the traditional LSH algorithm is Random Projection, but others exist, like Super-Bit, that deliver better results.
 
+LSH functions have two main use cases:
+* Compute the signature of large input vectors. These signatures can be used to quickly estimate the similarity between vectors.
+* With a given number of buckets, bin similar vectors together.
+
 This library implements Locality Sensitive Hashing (LSH), as described in Leskovec, Rajaraman & Ullman (2014), "Mining of Massive Datasets", Cambridge University Press.
 
 Are currently implemented:
@@ -24,7 +28,6 @@ Using maven:
 
 Or see the [releases](https://github.com/tdebatty/java-LSH/releases) page.
 
-
 ##MinHash
 
 MinHash is a hashing scheme that tends to produce similar signatures for sets that have a high Jaccard similarity.
@@ -385,3 +388,87 @@ public class MyApp {
 ```
 
 [Read Javadoc...](http://api123.web-d.be/api/java-LSH/head/index.html)
+
+## Serialization
+
+As the parameters of the hashing function are randomly initialized when the LSH object is instantiated:
+* two LSH objects will produce different hashes and signatures for the same input vector;
+* two executions of your program will produce different hashes and signatures for the same input vector;
+* the signatures produced by two different LSH objects cannot be used to estimate the similarity between vectors.
+
+The solution is to serialize your LSH object so you can reuse it:
+
+```java
+import info.debatty.java.lsh.LSHMinHash;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.util.Random;
+
+public class SerializeExample {
+
+    public static void main(String[] args)
+            throws IOException, ClassNotFoundException {
+
+        // Create a single random boolean vector
+        int n = 100;
+        double sparsity = 0.75;
+        boolean[] vector = new boolean[n];
+        Random rand = new Random();
+        for (int j = 0; j < n; j++) {
+            vector[j] = rand.nextDouble() > sparsity;
+        }
+
+        // Create and configure LSH
+        int stages = 2;
+        int buckets = 10;
+        LSHMinHash lsh = new LSHMinHash(stages, buckets, n);
+        println(lsh.hash(vector));
+
+        // Create another LSH object
+        // as the parameters of the hashing function are randomly initialized
+        // these two LSH objects will produce different hashes for the same
+        // input vector!
+        LSHMinHash other_lsh = new LSHMinHash(stages, buckets, n);
+        println(other_lsh.hash(vector));
+
+        // Moreover, signatures produced by different LSH objects cannot
+        // be used to compute estimated similarity!
+        // The solution is to serialize and save the object, so it can be
+        // reused later...
+        File tempfile = File.createTempFile("lshobject", ".ser");
+        FileOutputStream fout = new FileOutputStream(tempfile);
+        ObjectOutputStream oos = new ObjectOutputStream(fout);
+        oos.writeObject(lsh);
+        oos.close();
+        System.out.println(
+                "LSH object serialized to " + tempfile.getAbsolutePath());
+
+        FileInputStream fin = new FileInputStream(tempfile);
+        ObjectInputStream ois = new ObjectInputStream(fin);
+        LSHMinHash saved_lsh = (LSHMinHash) ois.readObject();
+        println(saved_lsh.hash(vector));
+    }
+
+    static void println(int[] array) {
+        System.out.print("[");
+        for (int v : array) {
+            System.out.print("" + v + " ");
+        }
+        System.out.println("]");
+    }
+}
+```
+
+Will produce something like:
+```
+[5 5 ]
+[3 1 ]
+LSH object serialized to /tmp/lshobject5903174677942358274.ser
+[5 5 ]
+```
+
+[Check the examples](https://github.com/tdebatty/java-LSH/tree/master/src/main/java/info/debatty/java/lsh/examples) or [read Javadoc](http://api123.io/api/java-LSH/head/index.html)
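
Review note: the README example leaves its `ObjectInputStream` open and round-trips through a temp file. The following is a minimal sketch of the same round trip using try-with-resources and an in-memory buffer. `RoundTripSketch` and `RandomHasher` are hypothetical names invented for illustration; `RandomHasher` is only a stand-in for an object whose parameters are drawn at random on construction, and is not part of this library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Random;

final class RoundTripSketch {

    /** Hypothetical stand-in for an LSH object: parameters are random per instance. */
    static final class RandomHasher implements Serializable {
        private static final long serialVersionUID = 1L;
        private final long a;
        private final long b;
        private final int buckets;

        RandomHasher(int buckets) {
            Random rand = new Random();
            this.a = 1 + rand.nextInt(1 << 20);
            this.b = rand.nextInt(1 << 20);
            this.buckets = buckets;
        }

        /** Maps a value to a bucket in [0, buckets). */
        int hash(int value) {
            return (int) Math.floorMod(a * value + b, (long) buckets);
        }
    }

    /** Serializes to memory and reads the object back; both streams are closed automatically. */
    static RandomHasher roundTrip(RandomHasher hasher) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(hasher);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (RandomHasher) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RandomHasher original = new RandomHasher(10);
        RandomHasher restored = roundTrip(original);
        // The restored copy carries the same random parameters, so hashes match.
        for (int value = 0; value < 5; value++) {
            System.out.println(original.hash(value) + " == " + restored.hash(value));
        }
    }
}
```

The restored object is a distinct instance that carries the same random parameters, which is the property the README relies on when it reuses a saved LSH object.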

src/main/java/info/debatty/java/lsh/LSH.java

Lines changed: 3 additions & 1 deletion
@@ -1,13 +1,15 @@
 package info.debatty.java.lsh;
 
+import java.io.Serializable;
+
 /**
  * Implementation of Locality Sensitive Hashing (LSH) principle, as described in
  * Leskovec, Rajaraman & Ullman (2014), "Mining of Massive Datasets",
  * Cambridge University Press.
  *
  * @author Thibault Debatty http://www.debatty.info
  */
-public abstract class LSH {
+public abstract class LSH implements Serializable {
 
     protected static final long LARGE_PRIME = 433494437;
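
Review note: this hunk does not show a `serialVersionUID`. When a `Serializable` class does not declare one, the runtime computes a default from the class details, so a later change to the class can make previously saved objects unreadable (`InvalidClassException`). Below is a hedged sketch of pinning the version explicitly. `SerialVersionSketch`, `BaseHasher` and `ConcreteHasher` are hypothetical names that only mirror the abstract-base-plus-subclass shape; they are not the library's classes.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

final class SerialVersionSketch {

    /** Hypothetical abstract base; subclasses inherit Serializable from it. */
    abstract static class BaseHasher implements Serializable {
        private static final long serialVersionUID = 1L;
        // Static fields belong to the class and are not written to the stream.
        protected static final long LARGE_PRIME = 433494437;
    }

    /** Hypothetical concrete subclass; each class pins its own version. */
    static final class ConcreteHasher extends BaseHasher {
        private static final long serialVersionUID = 1L;
        int[] coefficients = {3, 7};
    }

    /** Returns the serialVersionUID the serialization runtime will use for a type. */
    static long versionOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println(versionOf(BaseHasher.class));     // prints 1
        System.out.println(versionOf(ConcreteHasher.class)); // prints 1
    }
}
```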

src/main/java/info/debatty/java/lsh/MinHash.java

Lines changed: 2 additions & 1 deletion
@@ -1,5 +1,6 @@
 package info.debatty.java.lsh;
 
+import java.io.Serializable;
 import java.security.InvalidParameterException;
 import java.util.ArrayList;
 import java.util.Collections;
@@ -25,7 +26,7 @@
  *
  * @author Thibault Debatty http://www.debatty.info
  */
-public class MinHash {
+public class MinHash implements Serializable {
 
     public static double JaccardIndex(Set<Integer> s1, Set<Integer> s2) {
         Set<Integer> intersection = new HashSet<Integer>(s1);
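
Review note: the hunk is cut off after the first line of `JaccardIndex`, so the rest of the method is not visible here. For readers unfamiliar with the metric, this is a self-contained sketch of how a Jaccard index over two integer sets is commonly computed (size of the intersection divided by size of the union). `JaccardSketch` is a hypothetical class, and returning 1.0 for two empty sets is an assumption of this sketch, not something the diff confirms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class JaccardSketch {

    /** |s1 ∩ s2| / |s1 ∪ s2|; two empty sets are treated as identical (assumption). */
    static double jaccardIndex(Set<Integer> s1, Set<Integer> s2) {
        if (s1.isEmpty() && s2.isEmpty()) {
            return 1.0;
        }
        Set<Integer> intersection = new HashSet<>(s1);
        intersection.retainAll(s2);

        Set<Integer> union = new HashSet<>(s1);
        union.addAll(s2);

        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<Integer> s1 = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> s2 = new HashSet<>(Arrays.asList(2, 3, 4));
        // Intersection {2, 3} has 2 elements, union {1, 2, 3, 4} has 4.
        System.out.println(jaccardIndex(s1, s2)); // prints 0.5
    }
}
```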
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
+/*
+ * The MIT License
+ *
+ * Copyright 2015 Thibault Debatty.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+package info.debatty.java.lsh.examples;
+
+import info.debatty.java.lsh.LSHMinHash;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.ObjectInputStream;
+import java.io.ObjectOutputStream;
+import java.util.Random;
+
+/**
+ *
+ * @author Thibault Debatty
+ */
+public class SerializeExample {
+
+    /**
+     * @param args the command line arguments
+     * @throws java.io.IOException
+     * @throws java.lang.ClassNotFoundException
+     */
+    public static void main(String[] args)
+            throws IOException, ClassNotFoundException {
+
+        // Create a single random boolean vector
+        int n = 100;
+        double sparsity = 0.75;
+        boolean[] vector = new boolean[n];
+        Random rand = new Random();
+        for (int j = 0; j < n; j++) {
+            vector[j] = rand.nextDouble() > sparsity;
+        }
+
+        // Create and configure LSH
+        int stages = 2;
+        int buckets = 10;
+        LSHMinHash lsh = new LSHMinHash(stages, buckets, n);
+        println(lsh.hash(vector));
+
+        // Create another LSH object
+        // as the parameters of the hashing function are randomly initialized
+        // these two LSH objects will produce different hashes for the same
+        // input vector!
+        LSHMinHash other_lsh = new LSHMinHash(stages, buckets, n);
+        println(other_lsh.hash(vector));
+
+        // Moreover, signatures produced by different LSH objects cannot
+        // be used to compute estimated similarity!
+        // The solution is to serialize and save the object, so it can be
+        // reused later...
+        File tempfile = File.createTempFile("lshobject", ".ser");
+        FileOutputStream fout = new FileOutputStream(tempfile);
+        ObjectOutputStream oos = new ObjectOutputStream(fout);
+        oos.writeObject(lsh);
+        oos.close();
+        System.out.println(
+                "LSH object serialized to " + tempfile.getAbsolutePath());
+
+        FileInputStream fin = new FileInputStream(tempfile);
+        ObjectInputStream ois = new ObjectInputStream(fin);
+        LSHMinHash saved_lsh = (LSHMinHash) ois.readObject();
+        println(saved_lsh.hash(vector));
+    }
+
+    static void println(int[] array) {
+        System.out.print("[");
+        for (int v : array) {
+            System.out.print("" + v + " ");
+        }
+        System.out.println("]");
+    }
+}
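
Review note: the example's comments state that signatures from different LSH objects cannot be compared. The sketch below illustrates why, under stated assumptions: `SignatureSketch` and `TinyMinHash` are hypothetical, simplified classes (one random linear hash per signature position, non-negative integer elements, explicit seeds so two instances can be told apart), not the library's `MinHash`. With a shared hasher, the fraction of matching signature positions estimates the Jaccard similarity; with two independently initialized hashers, the positions are computed with unrelated hash functions, so that fraction is not a meaningful estimate.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

final class SignatureSketch {

    static final long LARGE_PRIME = 433494437L;

    /** Hypothetical minimal MinHash: one random (a, b) pair per signature position. */
    static final class TinyMinHash {
        private final long[] a;
        private final long[] b;

        TinyMinHash(int size, long seed) {
            Random rand = new Random(seed);
            a = new long[size];
            b = new long[size];
            for (int i = 0; i < size; i++) {
                a[i] = 1 + rand.nextInt((int) LARGE_PRIME - 1);
                b[i] = rand.nextInt((int) LARGE_PRIME);
            }
        }

        /** For each position, keeps the smallest hash over all elements (assumed non-negative). */
        long[] signature(Set<Integer> set) {
            long[] sig = new long[a.length];
            Arrays.fill(sig, Long.MAX_VALUE);
            for (int element : set) {
                for (int i = 0; i < a.length; i++) {
                    long h = (a[i] * element + b[i]) % LARGE_PRIME;
                    sig[i] = Math.min(sig[i], h);
                }
            }
            return sig;
        }
    }

    /** Fraction of positions where two signatures of equal length agree. */
    static double agreement(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }

    public static void main(String[] args) {
        Set<Integer> set = new HashSet<>(Arrays.asList(1, 5, 9, 12, 40));
        TinyMinHash first = new TinyMinHash(64, 1L);
        TinyMinHash second = new TinyMinHash(64, 2L);

        // Same hasher, same set: every position agrees.
        System.out.println(agreement(first.signature(set), first.signature(set)));
        // Different hashers, same set: the value no longer reflects similarity.
        System.out.println(agreement(first.signature(set), second.signature(set)));
    }
}
```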
