11# WordVecSpace
22A high performance pure python module that helps in loading and performing operations on word vector spaces created using Google's Word2vec tool.
33
4- This module has ability to load data into memory using ` WordVecSpaceMem ` or it can also supports performing operations on the data which is on the disk using ` WordVecSpaceAnnoy ` .
4+ This module can load data into memory using `WordVecSpaceMem`, and it can also operate directly on data stored on disk using `WordVecSpaceAnnoy` and `WordVecSpaceDisk`.
55
66## Installation
77> Prerequisites: Python3.5
@@ -63,19 +63,19 @@ by the `wordvecspace` module. You'll first have to convert them
6363to the ` WordVecSpace ` format.
6464
6565``` bash
66- $ wordvecspace convert <input_dir> <output_file>
66+ $ wordvecspace convert <input_dir> <output_dir>
6767
6868# <input_dir> is the directory which has vocab.txt and vectors.bin
69- # <output_file> is the file where you want to put your output file
69+ # <output_dir> is the directory where you want to store your output files.
7070```
7171
7272Example:
7373
7474``` bash
75- $ wordvecspace convert /home/user/bindata /home/user/dc.wvspace
75+ $ wordvecspace convert /home/user/bindata /home/user/output_dir
7676
7777# /home/user/bindata is the directory containing vocab.txt and vectors.bin
78- # dc.wvspace is the output file
78+ # /home/user/output_dir is the output directory which contains wordvecspace data files.
7979```
8080
8181### Importing
@@ -89,59 +89,67 @@ $ wordvecspace convert /home/user/bindata /home/user/dc.wvspace
8989
9090##### Load data
9191``` python
92- >>> wv = WordVecSpaceMem('/home/user/dc.wvspace')
92+ >>> wv = WordVecSpaceMem('/home/user/output_dir')
9393```
9494
9595##### Make get_nearest call
9696``` python
9797>>> wv.get_nearest('india', k=20)
98- [509 , 486 , 523 , 4343 , 14208 , 13942 , 42424 , 25578 , 6212 , 2475 , 3560 , 13508 , 20919 , 3389 , 4484 , 19995 , 8776 , 7012 , 12191 , 16619 ]
99-
98+ [509 , 3389 , 486 , 523 , 7125 , 16619 , 4491 , 12191 , 6866 , 8776 , 15232 , 14208 , 5998 , 21916 , 5226 , 6322 , 4343 , 6212 , 10172 , 6186 ]
10099# k is the number of nearest neighbours to return
101100```
102101
103102#### Types
104- ` wordvecspace ` module can perform operations by loading data into RAM using ` WordVecSpaceMem ` or directly on the data which is on the disk using ` WordVecSpaceAnnoy `
103+ The `wordvecspace` module can perform operations either by loading data into RAM using `WordVecSpaceMem` or directly on data stored on disk using `WordVecSpaceDisk`.
105104
106- ` WordVecSpaceMem ` is a bruteforce algorithm which compares given word with all the words in the vector space
105+ `WordVecSpaceMem` and `WordVecSpaceDisk` use a brute-force algorithm that compares the given word with every word in the vector space.
107106
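As a rough illustration of what "brute force" means here, the sketch below scores a query vector against every vector in a tiny, made-up space and keeps the `k` closest. It is not the module's implementation: `brute_force_nearest`, the toy vectors, and the use of 1 - cosine similarity as the "angular" distance are all assumptions made for the example.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def brute_force_nearest(vectors, query_index, k):
    """Compare the query with every vector and return the k closest indices."""
    query = vectors[query_index]
    scored = [(cosine_distance(query, v), i) for i, v in enumerate(vectors)]
    scored.sort()  # smallest distance first; every vector has been scored
    return [i for _, i in scored[:k]]

# Hypothetical 2-d vectors, for illustration only
toy_vectors = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]

nearest = brute_force_nearest(toy_vectors, 0, 3)
print(nearest)
```

Because every vector is scored on each query, the cost grows with the size of the vocabulary, which is the trade-off an approximate index such as Annoy is meant to avoid.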
108- ` WordVecSpaceAnnoy ` takes wvspace file as input and creates annoy indexes in another file (index file). Using this file ` annoy ` gives approximate results quickly. For better understanding of ` Annoy ` please go through this [ link] ( https://github.com/spotify/annoy )
107+ `WordVecSpaceAnnoy` takes the wordvecspace `output_dir` as input and creates Annoy indexes in a separate index file. Using this file, `annoy` returns approximate results quickly. For a better understanding of `Annoy`, see the [Annoy repository](https://github.com/spotify/annoy).
109108
110- As we have seen how to import ` WordVecSpaceMem ` above, let us look at ` WordVecSpaceAnnoy `
109+ Now that we have seen how to import `WordVecSpaceMem` above, let us look at `WordVecSpaceAnnoy` and `WordVecSpaceDisk`.
111110
112111##### Import
113112``` python
114113>>> from wordvecspace import WordVecSpaceAnnoy
114+ >>> from wordvecspace import WordVecSpaceDisk
115115```
116116
117117##### Load data
118118``` python
119- wv = WordVecSpaceAnnoy(' /home/user/dc.wvspace ' , n_trees, index_fpath)
119+ >>> wv = WordVecSpaceAnnoy('/home/user/output_dir', n_trees, index_fpath)
120120
121121# n_trees = number of trees (more trees give higher precision when querying with get_nearest)
122122# index_fpath = path for annoy index file
123123
124- # n_trees and index_fpath are optional. If those are not given then WordVecSpaceAnnoy uses `1` for n_trees and `/home/user/` (dc.wvspace file directory) directory for index_fpath.
124+ # n_trees and index_fpath are optional. If they are not given, WordVecSpaceAnnoy uses `1` for n_trees and `/home/user/output_dir` (the wordvecspace data directory) for index_fpath.
125+
126+ >>> wv = WordVecSpaceDisk('/home/user/output_dir')
125127```
126128
127129##### Make get_nearest call
128130``` python
129- >> > wv.get_nearest(' india' , k = 20 )
130- [509 , 486 , 523 , 4343 , 14208 , 13942 , 42424 , 25578 , 6212 , 2475 , 3560 , 13508 , 20919 , 3389 , 4484 , 19995 , 8776 , 7012 , 12191 , 16619 ]
131+ >>> wv.get_nearest('india', k=20)  # with WordVecSpaceAnnoy
132+ [509, 3389, 16619, 4491, 6866, 8776, 14208, 5998, 21916, 20919, 2325, 4622, 3546, 24149, 5064, 35704, 25578, 15842, 4137, 6499]
133+
134+ >>> wv.get_nearest('india', k=20)  # with WordVecSpaceDisk
135+ [509, 3389, 486, 523, 7125, 16619, 4491, 12191, 6866, 8776, 15232, 14208, 5998, 21916, 5226, 6322, 4343, 6212, 10172, 6186]
131136```
132137
133138#### Distance calculations
134139`WordVecSpaceAnnoy` supports several types of distance calculation: `"angular"`, `"euclidean"`, `"manhattan"` and `"hamming"`.
135140
136141`WordVecSpaceMem` supports `"angular"` and `"euclidean"` for distance calculations.
137142
138- Both uses ` "angular" ` by default. If you want to change it then you can change at the time of creating object.
143+ `WordVecSpaceDisk` supports `"angular"` and `"euclidean"` for distance calculations.
144+
145+ All of the above use `"angular"` by default. To change the metric, specify it when creating the object.
139146
140147Example:
141148
142149```python
143- wv = WordVecSpaceAnnoy('/path/to/wvspacefile', n_trees, metric="euclidean")
144- wv = WordVecSpaceMem('/path/to/wvspacefile', metric="euclidean")
150+ wv = WordVecSpaceAnnoy('/path/to/output_dir', n_trees, metric="euclidean")
151+ wv = WordVecSpaceMem('/path/to/output_dir', metric="euclidean")
152+ wv = WordVecSpaceDisk('/path/to/output_dir', metric="euclidean")
145153
146154# metric = type of distance calculation
147155```
@@ -150,14 +158,14 @@ WordVecSpaceMem can also supports specifying metric at the time of calculating d
150158
151159Example:
152160```python
153- wv = WordVecSpaceMem('/path/to/wvspacefile', metric="euclidean")
161+ wv = WordVecSpaceMem('/path/to/output_dir', metric="euclidean")
154162
155163wv.get_distance('ap', 'india', metric='angular')
156164```
157165
158166#### Examples of using wordvecspace methods
159167
160- > WordVecSpaceMem and WordVecSpaceAnnoy have the same common methods.
168+ > ` WordVecSpaceMem ` , ` WordVecSpaceAnnoy ` and ` WordVecSpaceDisk ` support the same methods.
161169
162170##### Check if a word exists or not in the word vector space
163171``` python
@@ -238,25 +246,25 @@ None
238246``` python
239247# Get magnitude for the word "hi"
240248>>> print(wv.get_vector_magnitude("hi"))
241- 8.7948
249+ 1.0
242250```
243251
244252##### Get vector magnitude of the words
245253``` python
246254# Get magnitude for the words "hi" and "india"
246254>>> print(wv.get_vector_magnitudes(["hi", "india"]))
248- [ 8.7948 10.303 ]
256+ [1.0, 1.0]
249257```
250258
251259##### Get vector for given word
252260``` python
253261# Get the word vector for a word india
254262>>> print(wv.get_word_vector("india"))
255- [-6.4482 -2.1636 5.7277 -3.7746 3.583 ]
263+ [-0.7871 -0.2993 0.3233 -0.2864 0.323 ]
256264
257265# Get the unit word vector for a word india
258266>>> print(wv.get_word_vector("india", normalized=True))
259- [-0.6259 -0.21 0.5559 -0.3664 0.3478 ]
267+ [-0.7871 -0.2993 0.3233 -0.2864 0.323 ]
260268
260268# Get the word vector for the unknown word "inidia", raising an exception if it is missing
261269>>> print(wv.get_word_vector('inidia', raise_exc=True))
@@ -278,80 +286,80 @@ wordvecspace.exception.UnknownWord: "inidia"
278286##### Get vector for given words
279287``` python
280288>>> print(wv.get_word_vectors(["hi", "india"]))
281- [[ 0.4008 0.3623 -0.013 0.8395 0.0562 ]
282- [-0.4975 -0.134 0.7874 -0.3274 0.0857 ]]
289+ [[ 0.6342 0.2268 -0.3904 0.0368 0.6266 ]
290+ [-0.7871 -0.2993 0.3233 -0.2864 0.323 ]]
283291>>> print(wv.get_word_vectors(["hi", "inidia"]))
284- [[ 0.4008 0.3623 -0.013 0.8395 0.0562 ]
292+ [[ 0.6342 0.2268 -0.3904 0.0368 0.6266 ]
285293 [ 0. 0. 0. 0. 0. ]]
286294```
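The updated outputs above suggest that vectors in the converted data are stored at unit length, which would explain why `get_vector_magnitude` prints `1.0` and why the normalized and unnormalized vectors for `india` match. The snippet below is only a sanity check on the printed, rounded numbers; it does not call `wordvecspace`.

```python
import math

# Vector printed above for "india" (rounded to 4 decimals in the output)
india = [-0.7871, -0.2993, 0.3233, -0.2864, 0.323]

# Euclidean norm: square root of the sum of squared components
magnitude = math.sqrt(sum(x * x for x in india))
print(round(magnitude, 2))
```

The result differs from exactly 1 only by the rounding applied when the vector was printed.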
287295
288296##### Get distance between two words
289297``` python
290298# Get distance between "india", "usa"
291299>>> print(wv.get_distance("india", "usa"))
292- 0.48379534483
300+ 0.37698328495
293301
294302# Get the distance between 250, "india"
295303>>> print(wv.get_distance(250, "india"))
296- 1.16397565603
304+ 1.1418992728
297305
298306# Get the euclidean distance between 250, "india" for WordVecSpaceMem
299307>>> print(wv.get_distance(250, "india", metric='euclidean'))
300- 12.04961109161377
308+ 1.5112241506576538
301309```
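The updated angular and euclidean values above for `250` and `"india"` are consistent with two assumptions: that the vectors are unit length, and that "angular" means 1 - cosine similarity. Under those assumptions the euclidean distance equals the square root of twice the angular distance. The check below verifies only that arithmetic on the printed numbers; it is not a statement about how the library computes either metric.

```python
import math

# Values printed in the examples above
angular = 1.1418992728          # get_distance(250, "india")
euclidean = 1.5112241506576538  # get_distance(250, "india", metric='euclidean')

# For unit vectors u, v: |u - v|^2 = 2 - 2*cos(u, v) = 2 * (1 - cos(u, v))
derived = math.sqrt(2 * angular)
print(round(derived, 4), round(euclidean, 4))
```

If this relationship holds, both metrics rank neighbours in the same order, which matches the identical `get_nearest` results shown for the two metrics in the updated examples.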
302310
303311##### Get distance between list of words
304312
305313``` python
306314>>> print(wv.get_distances("for", ["to", "for", "india"]))
307- [[ 0.381 0 . 0.9561 ]]
315+ [[ 2.7428e-01 5.9605e-08 1.1567e+00 ]]
308316
309317>>> print(wv.get_distances("for", ["to", "for", "inidia"]))
310- [[ 0.381 0 . 1. ]]
318+ [[ 2.7428e-01 5.9605e-08 1.0000e+00 ]]
311319
312320>>> print(wv.get_distances(["india", "for"], ["to", "for", "usa"]))
313- [[ 1.0685 0.9561 0.3251 ]
314- [ 0.381 0 . 1.4781 ]]
321+ [[ 1.1445e+00 1.1567e+00 3.7698e-01 ]
322+ [ 2.7428e-01 5.9605e-08 1.6128e+00 ]]
315323
316324>>> print(wv.get_distances(["india", "usa"]))
317- [[ 1.3853 0.4129 0.3149 ... , 1.1231 1.4595 0.7912 ]
318- [ 1.3742 0.9549 1.0354 ... , 0.5556 1.0847 1.0832 ]]
325+ [[ 1.5464 0.4876 0.3017 ... , 1.2492 1.2451 0.8925 ]
326+ [ 1.0436 0.9995 1.0913 ... , 0.6996 0.8014 1.1608 ]]
319327
320328>>> print(wv.get_distances(["andhra"]))
321- [[ 1.2817 0.6138 0.2995 ... , 0.9945 1.224 0.6137 ]]
329+ [[ 1.5418 0.7153 0.277 ... , 1.1657 1.0774 0.7036 ]]
322330
323331# For WordVecSpaceMem
324332>>> print(wv.get_distances(["andhra"], metric='euclidean'))
325- [[ 9.0035 8.3985 7.1658 ... , 9.2236 9.6078 8.6349 ]]
333+ [[ 1.756 1.1961 0.7443 ... , 1.5269 1.4679 1.1862 ]]
326334```
327335
328336##### Get nearest
329337``` python
330338# Get nearest for given word or index
331339>>> print(wv.get_nearest("india", 20))
332- [509 , 486 , 523 , 4343 , 14208 , 13942 , 42424 , 25578 , 6212 , 2475 , 3560 , 13508 , 20919 , 3389 , 4484 , 19995 , 8776 , 7012 , 12191 , 16619 ]
340+ [509 , 3389 , 486 , 523 , 7125 , 16619 , 4491 , 12191 , 6866 , 8776 , 15232 , 14208 , 5998 , 21916 , 5226 , 6322 , 4343 , 6212 , 10172 , 6186 ]
333341
334342# Get nearest for given words or indices
335343>>> print(wv.get_nearest(["ram", "india"], 5))
336- [[3844 , 38851 , 25381 , 10830 , 17049 ], [509 , 486 , 523 , 4343 , 14208 ]]
344+ [[3844 , 16727 , 15811 , 42731 , 41516 ], [509 , 3389 , 486 , 523 , 7125 ]]
337345
338346# Get nearest using euclidean distance for WordVecSpaceMem
339347>>> print(wv.get_nearest(["ram", "india"], 5, metric='euclidean'))
340- [[3844 , 25381 , 27802 , 17049 , 38851 ], [509 , 486 , 14208 , 523 , 13942 ]]
348+ [[3844 , 16727 , 15811 , 42731 , 41516 ], [509 , 3389 , 486 , 523 , 7125 ]]
341349
342350# Get common nearest neighbors among given words
343351>>> print(wv.get_nearest(['india', 'bosnia'], 10, combination=True))
344- [14208 , 486 , 523 , 4343 , 42424 , 509 ]
352+ [523 , 509 , 486 ]
345353```
346354
347- ### Service
355+ ## Service
348356
349357``` bash
350358# Run wordvecspace as a service (which continuously listens on some port for API requests)
351- $ wordvecspace runserver <type> <input_file> --metric <metric> --port <port> --eargs <eargs>
359+ $ wordvecspace runserver <type> <input_dir> --metric <metric> --port <port> --eargs <eargs>
352360
353- # <type> is for specifying wordvecspace functionality (eg: mem or annoy ).
354- # <input_file > is for wordvecspace file
361+ # <type> selects the wordvecspace implementation (e.g. mem, annoy or disk).
362+ # <input_dir> is the wordvecspace data directory
355363# <metric> is the type of distance calculation
356364# <port> is the port on which wordvecspace listens
357365# <eargs> specifies extra arguments for annoy
@@ -361,10 +369,13 @@ Example:
361369
362370``` bash
363371# For mem
364- $ wordvecspace runserver mem /home/user/dc.wvspace --metric angular --port 8000
372+ $ wordvecspace runserver mem /home/user/output_dir --metric angular --port 8000
373+
374+ # For disk
375+ $ wordvecspace runserver disk /home/user/output_dir --metric angular --port 8000
365376
366377# For annoy
367- $ wordvecspace runserver annoy /home/user/dc.wvspace --metric euclidean --port 8000 --eargs n_trees=1:index_fpath=/tmp
378+ $ wordvecspace runserver annoy /home/user/output_dir --metric euclidean --port 8000 --eargs n_trees=1:index_fpath=/tmp
368379
369380# Extra arguments for annoy are n_trees and index_fpath
370381# - n_trees is the number of trees for annoy
@@ -415,20 +426,24 @@ $ http://localhost:8000/api/v1/get_nearest?words_or_indices=india&k=100&metric=e
415426``` bash
416427# wordvecspace provides a command for interacting with it directly
417428
418- $ wordvecspace interact <type> <input_file> --metric <metric> --eargs <eargs>
429+ $ wordvecspace interact <type> <input_dir> --metric <metric> --eargs <eargs>
419430
420- # <type> is for specifying wordvecspace functionality (eg: mem or annoy).
421- # <input_file > is for wordvecspace file
431+ # <type> selects the wordvecspace implementation (e.g. mem, disk or annoy).
432+ # <input_dir> is the wordvecspace data directory
422433# <metric> is the type of distance calculation
423434# <eargs> specifies extra arguments for annoy
424435```
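The `--eargs` value in the examples uses a `key=value:key=value` layout, for instance `n_trees=1:index_fpath=/tmp`. A minimal sketch of how such a string could be split into a dictionary follows. `parse_eargs` is a hypothetical helper written for illustration and is not taken from the `wordvecspace` source.

```python
def parse_eargs(eargs):
    """Split 'k1=v1:k2=v2' into {'k1': 'v1', 'k2': 'v2'} (values stay strings)."""
    parsed = {}
    for pair in eargs.split(":"):
        if not pair:
            continue  # tolerate empty segments such as a trailing ':'
        key, _, value = pair.partition("=")
        parsed[key] = value
    return parsed

args = parse_eargs("n_trees=1:index_fpath=/tmp")
print(args)
```

With this layout a value that itself contains `:` would be split incorrectly, so a sketch like this only suits simple paths and numbers.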
425436
426437Example:
427438``` bash
428439# For mem
429- $ wordvecspace interact mem /home/user/dc.wvspace --metric euclidean
440+ $ wordvecspace interact mem /home/user/output_dir --metric euclidean
441+
442+ # For Disk
443+ $ wordvecspace interact disk /home/user/output_dir --metric euclidean
430444
431- $ wordvecspace interact annoy /home/user/dc.wvspace --metric angular --eargs n_trees=1:index_fpath=/tmp
445+ # For Annoy
446+ $ wordvecspace interact annoy /home/user/output_dir --metric angular --eargs n_trees=1:index_fpath=/tmp
432447WordVecSpaceAnnoy console (vectors=71291 dims=5)
433448>>> wv.get_nearest('india', 20)
434449[509, 486, 523, 4343, 13942, 42424, 25578, 3389, 12191, 16619, 12088, 6049, 5226, 4137, 41883, 18617, 10172, 35704, 25552, 29059]
@@ -447,7 +462,7 @@ $ wget 'https://s3.amazonaws.com/deepcompute-public-data/wordvecspace/small_test
447462$ tar xvzf small_test_data.tgz
448463
449464# Export the data path as an environment variable
450- $ export WORDVECSPACE_DATAFILE="/home/user/dc.wvspace"
465+ $ export WORDVECSPACE_DATADIR="/home/user/output_dir"
451466
452467# Run tests
453468$ python3 setup.py test