# WordVecSpace
A high performance pure python module that helps in loading and performing operations on word vector spaces created using Google's Word2vec tool.

- This module has ability to load data into memory using `WordVecSpaceMem` and it can also support performing operations on the data which is on the disk using `WordVecSpaceAnnoy` and `WordVecSpaceDisk`.
+ This module has ability to the load data into memory using `WordVecSpaceMem` and it can also support performing operations on the data which is on the disk using `WordVecSpaceAnnoy` and `WordVecSpaceDisk`.

## Installation
- > Prerequisites: Python3.5
+ > Prerequisites: >= Python3.5.2

```bash
- $ sudo apt install libopenblas-base
- $ sudo apt-get install libffi-dev
+ $ sudo apt install libopenblas-base # Optional
$ sudo pip3 install wordvecspace
```
- > Note: wordvecspace is using `/usr/lib/libopenblas.so.0` as a default path for openblas. If the path of openblas is different in your machine then you have to set the environment variable for that path.
- > Ex: For ubuntu-17.04, open blas path is `/usr/lib/x86_64-linux-gnu/libopenblas.so.0`.
-
- Setting the environment variable for openblas
- ```bash
- $ export WORDVECSPACE_BLAS_FPATH=/usr/lib/x86_64-linux-gnu/libopenblas.so.0
- ```

## Usage

@@ -84,12 +76,12 @@ $ wordvecspace convert /home/user/bindata /home/user/output_dir

##### Import
```python
- >>> from wordvecspace import WordVecSpaceMem
+ >>> from wordvecspace import WordVecSpaceDisk
```

##### Load data
```python
- >>> wv = WordVecSpaceMem('/home/user/output_dir')
+ >>> wv = WordVecSpaceDisk('/home/user/output_dir')
```

##### Make get_nearest call
@@ -106,33 +98,35 @@ $ wordvecspace convert /home/user/bindata /home/user/output_dir

`WordVecSpaceAnnoy` takes wordvecspace output_dir as input and creates annoy indexes in another file (index file). Using this file `annoy` gives approximate results quickly. For better understanding of `Annoy` please go through this [link](https://github.com/spotify/annoy)

- As we have seen how to import `WordVecSpaceMem` above, let us look at `WordVecSpaceAnnoy` and `WordVecSpaceDisk`
+ As we have seen how to import `WordVecSpaceDisk` above, let us look at `WordVecSpaceAnnoy` and `WordVecSpaceMem`

##### Import
```python
>>> from wordvecspace import WordVecSpaceAnnoy
- >>> from wordvecspace import WordVecSpaceDisk
+ >>> from wordvecspace import WordVecSpaceMem
```

##### Load data
```python
- >>> wv = WordVecSpaceAnnoy('/home/user/output_dir', n_trees, index_fpath)
+ # WordVecSpaceMem
+ >>> wv = WordVecSpaceMem('/home/user/output_dir')
+
+ # WordVecSpaceAnnoy
+ >>> wv = WordVecSpaceAnnoy('/home/user/output_dir', n_trees=2, index_fpath='/tmp')

# n_trees = number of trees(More trees gives a higher precision when querying for get_nearest)
# index_fpath = path for annoy index file

# n_trees and index_fpath are optional. If those are not given then WordVecSpaceAnnoy uses `1` for n_trees and `/home/user/output_dir` (wordvecspace data directory) directory for index_fpath.
-
- >>> wv = WordVecSpaceDisk('/home/user/output_dir')
```

##### Make get_nearest call
```python
+ >>> wv.get_nearest('india', k=20) (MEM)
+ [509, 3389, 486, 523, 7125, 16619, 4491, 12191, 6866, 8776, 15232, 14208, 5998, 21916, 5226, 6322, 4343, 6212, 10172, 6186]
+
>>> wv.get_nearest('india', k=20) (ANNOY)
[509, 3389, 16619, 4491, 6866, 8776, 14208, 5998, 21916, 20919, 2325, 4622, 3546, 24149, 5064, 35704, 25578, 15842, 4137, 6499]
-
- >>> wv.get_nearest('india', k=20) (DISK)
- [509, 3389, 486, 523, 7125, 16619, 4491, 12191, 6866, 8776, 15232, 14208, 5998, 21916, 5226, 6322, 4343, 6212, 10172, 6186]
```
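
The Annoy result above diverges from the other listing after the first few indices, which fits the README's note that `annoy` returns approximate results. For contrast, the sketch below shows what an exact search amounts to: score every word against the query and sort. It is a toy illustration only. The 2-D vectors are made up, `nearest` is a hypothetical helper, and the use of cosine distance is an assumption rather than a description of wordvecspace's internals.

```python
import math

# Toy vocabulary with made-up 2-D vectors (illustrative data only).
VECTORS = {
    "india": [0.9, 0.1],
    "pakistan": [0.8, 0.2],
    "china": [0.7, 0.4],
    "banana": [-0.2, 0.9],
}


def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest(word, k):
    """Hypothetical helper: the k words closest to `word` by cosine distance."""
    query = unit(VECTORS[word])
    scored = []
    for other, vec in VECTORS.items():
        cosine = sum(a * b for a, b in zip(query, unit(vec)))
        scored.append((1.0 - cosine, other))
    scored.sort()  # smallest distance first; the query word itself comes out on top
    return [other for _, other in scored[:k]]


print(nearest("india", k=3))
```

An exact scan like this touches every vector on each query, which is the cost an approximate index trades away in exchange for occasionally missing a true neighbour.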

#### Distance calculations
@@ -181,10 +175,10 @@ False
>>> print(wv.get_word_index("india"))
509

- >>> print(wv.get_word_index("inidia"))
+ >>> print(wv.get_index("inidia"))
None

- >>> print(wv.get_word_index("inidia", raise_exc=True))
+ >>> print(wv.get_index("inidia", raise_exc=True))
Traceback (most recent call last):
  File "/usr/lib/python3.6/code.py", line 91, in runcode
    exec(code, self.locals)
@@ -196,10 +190,10 @@ wordvecspace.exception.UnknownWord: "inidia"

##### Get the indices of words
```python
- >>> print(wv.get_word_indices(['the', 'deepcompute', 'india']))
+ >>> print(wv.get_indices(['the', 'deepcompute', 'india']))
[1, None, 509]

- >>> print(wv.get_word_indices(['the', 'deepcompute', 'india'], raise_exc=True))
+ >>> print(wv.get_indices(['the', 'deepcompute', 'india'], raise_exc=True))
Traceback (most recent call last):
  File "/usr/lib/python3.6/code.py", line 91, in runcode
    exec(code, self.locals)
@@ -214,60 +208,60 @@ wordvecspace.exception.UnknownWord: "deepcompute"
##### Get Word at Index
```python
# Get word at Index 509
- >>> print(wv.get_word_at_index(509))
+ >>> print(wv.get_word(509))
india
```

##### Get Words at Indices
```python
- >>> print(wv.get_word_at_indices([1, 509, 71190, 72000]))
+ >>> print(wv.get_words([1, 509, 71190, 72000]))
['the', 'india', 'reka', None]
```

##### Get occurrence of the word
```python
# Get occurrences of the word "india"
- >>> print(wv.get_word_occurrence("india"))
+ >>> print(wv.get_occurrence("india"))
3242

# Get occurrences of the word "inidia"
- >>> print(wv.get_word_occurrence("inidia"))
+ >>> print(wv.get_occurrence("inidia"))
None
```

##### Get occurrence of the words
```python
# Get occurrence of the words 'the', 'india' and 'Deepcompute'
- >>> print(wv.get_word_occurrences(["the", "india", "Deepcompute"]))
+ >>> print(wv.get_occurrences(["the", "india", "Deepcompute"]))
[1061396, 3242, None]
```

##### Get vector magnitude of the word
```python
# Get magnitude for the word "hi"
- >>> print(wv.get_vector_magnitude("hi"))
+ >>> print(wv.get_magnitude("hi"))
1.0
```

##### Get vector magnitude of the words
```python
# Get magnitude for the words "hi" and "india"
- >>> print(wv.get_vector_magnitudes(["hi", "india"]))
+ >>> print(wv.get_magnitudes(["hi", "india"]))
[1.0, 1.0]
```
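
Both magnitudes above come out as `1.0`, which would be the case if the vectors are stored in unit-normalized form (an assumption; the README does not say so at this point). As a small sketch of the arithmetic involved, with hypothetical helper names `magnitude` and `normalize` and a made-up input vector:

```python
import math


def magnitude(vec):
    """Euclidean length of a vector: sqrt of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vec))


def normalize(vec):
    """Divide every component by the vector's length (assumes a non-zero vector)."""
    mag = magnitude(vec)
    return [x / mag for x in vec]


raw = [3.0, 4.0]             # made-up example vector
print(magnitude(raw))        # 5.0
unit_vec = normalize(raw)
print(magnitude(unit_vec))   # approximately 1.0, up to floating-point rounding
```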

##### Get vector for given word
```python
# Get the word vector for a word india
- >>> print(wv.get_word_vector("india"))
+ >>> print(wv.get_vector("india"))
[-0.7871 -0.2993  0.3233 -0.2864  0.323 ]

# Get the unit word vector for a word india
- >>> print(wv.get_word_vector("india", normalized=True))
+ >>> print(wv.get_vector("india", normalized=True))
[-0.7871 -0.2993  0.3233 -0.2864  0.323 ]

# Get the word vector for a word inidia.
- >>> print(wv.get_word_vector('inidia', raise_exc=True))
+ >>> print(wv.get_vector('inidia', raise_exc=True))
Traceback (most recent call last):
  File "/usr/lib/python3.6/code.py", line 91, in runcode
    exec(code, self.locals)
@@ -279,16 +273,16 @@ Traceback (most recent call last):
wordvecspace.exception.UnknownWord: "inidia"

# If you don't want to get exception when word is not there, then you can simply discard raise_exc=True
- >>> print(wv.get_word_vector('inidia'))
+ >>> print(wv.get_vector('inidia'))
[ 0.  0.  0.  0.  0.]
```

##### Get vector for given words
```python
- >>> print(wv.get_word_vectors(["hi", "india"]))
+ >>> print(wv.get_vectors(["hi", "india"]))
[[ 0.6342  0.2268 -0.3904  0.0368  0.6266]
 [-0.7871 -0.2993  0.3233 -0.2864  0.323 ]]
- >>> print(wv.get_word_vectors(["hi", "inidia"]))
+ >>> print(wv.get_vectors(["hi", "inidia"]))
[[ 0.6342  0.2268 -0.3904  0.0368  0.6266]
 [ 0.      0.      0.      0.      0.    ]]
```
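
The examples above show two behaviours for an unknown word: a zero vector by default, or an `UnknownWord` exception when `raise_exc=True`. The sketch below mimics that contract with a toy in-memory class so the pattern is easy to see in isolation. `ToySpace` and the local `UnknownWord` class are stand-ins invented for this illustration and are not the library's implementation; the vectors are shortened, made-up values.

```python
class UnknownWord(Exception):
    """Local stand-in for the exception shown in the tracebacks above."""


class ToySpace:
    """Hypothetical in-memory lookup that mimics the raise_exc contract."""

    def __init__(self, vectors, dim):
        self._vectors = vectors
        self._dim = dim

    def get_vector(self, word, raise_exc=False):
        if word in self._vectors:
            return list(self._vectors[word])
        if raise_exc:
            raise UnknownWord(word)
        # Unknown word and no exception requested: fall back to a zero vector.
        return [0.0] * self._dim

    def get_vectors(self, words, raise_exc=False):
        return [self.get_vector(w, raise_exc=raise_exc) for w in words]


space = ToySpace({"hi": [0.6, 0.2], "india": [-0.8, -0.3]}, dim=2)
print(space.get_vectors(["hi", "inidia"]))  # second row is all zeros

try:
    space.get_vector("inidia", raise_exc=True)
except UnknownWord as exc:
    print("unknown word:", exc)
```

A zero-vector default keeps batch calls rectangular, while `raise_exc=True` is the safer choice when a misspelt word should stop processing rather than silently contribute zeros.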
@@ -391,33 +385,33 @@ $ curl "http://localhost:8000/api/v1/does_word_exist?word=india"
```bash
$ http://localhost:8000/api/v1/does_word_exist?word=india

- $ http://localhost:8000/api/v1/get_word_index?word=india
+ $ http://localhost:8000/api/v1/get_index?word=india

- $ http://localhost:8000/api/v1/get_word_indices?words=["india", 22, "hello"]
+ $ http://localhost:8000/api/v1/get_indices?words=["india", 22, "hello"]

- $ http://localhost:8000/api/v1/get_word_at_index?index=509
+ $ http://localhost:8000/api/v1/get_index?index=509

- $ http://localhost:8000/api/v1/get_word_at_indices?indices=[22, 509]
+ $ http://localhost:8000/api/v1/get_indices?indices=[22, 509]

- $ http://localhost:8000/api/v1/get_word_vector?word_or_index=509
+ $ http://localhost:8000/api/v1/get_vector?word_or_index=509

- $ http://localhost:8000/api/v1/get_vector_magnitude?word_or_index=88
+ $ http://localhost:8000/api/v1/get_magnitude?word_or_index=88

- $ http://localhost:8000/api/v1/get_vector_magnitudes?words_or_indices=[88, "india"]
+ $ http://localhost:8000/api/v1/get_magnitudes?words_or_indices=[88, "india"]

- $ http://localhost:8000/api/v1/get_word_occurrence?word_or_index=india
+ $ http://localhost:8000/api/v1/get_occurrence?word_or_index=india

- $ http://localhost:8000/api/v1/get_word_occurrences?words_or_indices=["india", 22]
+ $ http://localhost:8000/api/v1/get_occurrences?words_or_indices=["india", 22]

- $ http://localhost:8000/api/v1/get_word_vectors?words_or_indices=[1, "india"]
+ $ http://localhost:8000/api/v1/get_vectors?words_or_indices=[1, "india"]

$ http://localhost:8000/api/v1/get_distance?word_or_index1=ap&word_or_index2=india

$ http://localhost:8000/api/v1/get_distances?row_words_or_indices=["india", 33]

- $ http://localhost:8000/api/v1/get_nearest?words_or_indices=india&k=100
+ $ http://localhost:8000/api/v1/get_nearest?v_w_i=india&k=100

- $ http://localhost:8000/api/v1/get_nearest?words_or_indices=india&k=100&metric=euclidean
+ $ http://localhost:8000/api/v1/get_nearest?v_w_i=india&k=100&metric=euclidean
```
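
Several of the URLs above carry list arguments containing brackets, quotes and spaces, which generally need percent-encoding before they are sent from code. The sketch below builds such URLs with the standard library only. `build_url` is a hypothetical helper, and encoding lists as JSON is an assumption based on how the examples are written; the method and parameter names are the ones in the list above.

```python
import json
from urllib.parse import urlencode

BASE = "http://localhost:8000/api/v1"


def build_url(method, **params):
    """Hypothetical helper: build a query URL for one API method."""
    encoded = {}
    for name, value in params.items():
        # Lists are sent as JSON text (assumption), e.g. ["india", 22, "hello"].
        if isinstance(value, (list, tuple)):
            value = json.dumps(value)
        encoded[name] = value
    # urlencode percent-encodes brackets, quotes and commas; spaces become "+".
    return "{}/{}?{}".format(BASE, method, urlencode(encoded))


print(build_url("get_nearest", v_w_i="india", k=100))
print(build_url("get_indices", words=["india", 22, "hello"]))
```

The resulting strings can be passed to `curl` or any HTTP client without further quoting.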

> To see all API methods of wordvecspace please run http://localhost:8000/api/v1/apidoc