Commit 141afde

Author: RamanjaneyuluIdavalapati
Made it as usable component. Need to work on test cases, sharding and more
1 parent 36eb3ee commit 141afde

13 files changed

Lines changed: 611 additions & 1156 deletions

README.md

Lines changed: 44 additions & 50 deletions
Original file line number | Diff line number | Diff line change
@@ -1,23 +1,15 @@
11
# WordVecSpace
22
A high performance pure python module that helps in loading and performing operations on word vector spaces created using Google's Word2vec tool.
33

4-
This module has ability to load data into memory using `WordVecSpaceMem` and it can also support performing operations on the data which is on the disk using `WordVecSpaceAnnoy` and `WordVecSpaceDisk`.
4+
This module has ability to the load data into memory using `WordVecSpaceMem` and it can also support performing operations on the data which is on the disk using `WordVecSpaceAnnoy` and `WordVecSpaceDisk`.
55

66
## Installation
7-
> Prerequisites: Python3.5
7+
> Prerequisites: >=Python3.5.2
88
99
```bash
10-
$ sudo apt install libopenblas-base
11-
$ sudo apt-get install libffi-dev
10+
$ sudo apt install libopenblas-base # Optional
1211
$ sudo pip3 install wordvecspace
1312
```
14-
> Note: wordvecspace is using `/usr/lib/libopenblas.so.0` as a default path for openblas. If the path of openblas is different in your machine then you have to set the environment variable for that path.
15-
> Ex: For ubuntu-17.04, open blas path is `/usr/lib/x86_64-linux-gnu/libopenblas.so.0`.
16-
17-
Setting the environment variable for openblas
18-
```bash
19-
$ export WORDVECSPACE_BLAS_FPATH=/usr/lib/x86_64-linux-gnu/libopenblas.so.0
20-
```
2113

2214
## Usage
2315

@@ -84,12 +76,12 @@ $ wordvecspace convert /home/user/bindata /home/user/output_dir
8476

8577
##### Import
8678
```python
87-
>>> from wordvecspace import WordVecSpaceMem
79+
>>> from wordvecspace import WordVecSpaceDisk
8880
```
8981

9082
##### Load data
9183
```python
92-
>>> wv = WordVecSpaceMem('/home/user/output_dir')
84+
>>> wv = WordVecSpaceDisk('/home/user/output_dir')
9385
```
9486

9587
##### Make get_nearest call
@@ -106,33 +98,35 @@ $ wordvecspace convert /home/user/bindata /home/user/output_dir
10698

10799
`WordVecSpaceAnnoy` takes wordvecspace output_dir as input and creates annoy indexes in another file (index file). Using this file `annoy` gives approximate results quickly. For better understanding of `Annoy` please go through this [link](https://github.com/spotify/annoy)
108100

109-
As we have seen how to import `WordVecSpaceMem` above, let us look at `WordVecSpaceAnnoy` and `WordVecSpaceDisk`
101+
As we have seen how to import `WordVecSpaceDisk` above, let us look at `WordVecSpaceAnnoy` and `WordVecSpaceMem`
110102

111103
##### Import
112104
```python
113105
>>> from wordvecspace import WordVecSpaceAnnoy
114-
>>> from wordvecspace import WordVecSpaceDisk
106+
>>> from wordvecspace import WordVecSpaceMem
115107
```
116108

117109
##### Load data
118110
```python
119-
>>> wv = WordVecSpaceAnnoy('/home/user/output_dir', n_trees, index_fpath)
111+
# WordVecSpaceMem
112+
>>> wv = WordVecSpaceMem('/home/user/output_dir')
113+
114+
# WordVecSpaceAnnoy
115+
>>> wv = WordVecSpaceAnnoy('/home/user/output_dir', n_trees=2, index_fpath='/tmp')
120116

121117
# n_trees = number of trees(More trees gives a higher precision when querying for get_nearest)
122118
# index_fpath = path for annoy index file
123119

124120
# n_trees and index_fpath are optional. If those are not given then WordVecSpaceAnnoy uses `1` for n_trees and `/home/user/output_dir` (wordvecspace data directory) directory for index_fpath.
125-
126-
>>> wv = WordVecSpaceDisk('/home/user/output_dir')
127121
```
128122

129123
##### Make get_nearest call
130124
```python
125+
>>> wv.get_nearest('india', k=20) (MEM)
126+
[509, 3389, 486, 523, 7125, 16619, 4491, 12191, 6866, 8776, 15232, 14208, 5998, 21916, 5226, 6322, 4343, 6212, 10172, 6186]
127+
131128
>>> wv.get_nearest('india', k=20) (ANNOY)
132129
[509, 3389, 16619, 4491, 6866, 8776, 14208, 5998, 21916, 20919, 2325, 4622, 3546, 24149, 5064, 35704, 25578, 15842, 4137, 6499]
133-
134-
>>> wv.get_nearest('india', k=20) (DISK)
135-
[509, 3389, 486, 523, 7125, 16619, 4491, 12191, 6866, 8776, 15232, 14208, 5998, 21916, 5226, 6322, 4343, 6212, 10172, 6186]
136130
```
137131

138132
#### Distance calculations
@@ -181,10 +175,10 @@ False
181175
>>> print(wv.get_word_index("india"))
182176
509
183177

184-
>>> print(wv.get_word_index("inidia"))
178+
>>> print(wv.get_index("inidia"))
185179
None
186180

187-
>>> print(wv.get_word_index("inidia", raise_exc=True))
181+
>>> print(wv.get_index("inidia", raise_exc=True))
188182
Traceback (most recent call last):
189183
File "/usr/lib/python3.6/code.py", line 91, in runcode
190184
exec(code, self.locals)
@@ -196,10 +190,10 @@ wordvecspace.exception.UnknownWord: "inidia"
196190

197191
##### Get the indices of words
198192
```python
199-
>>> print(wv.get_word_indices(['the', 'deepcompute', 'india']))
193+
>>> print(wv.get_indices(['the', 'deepcompute', 'india']))
200194
[1, None, 509]
201195

202-
>>> print(wv.get_word_indices(['the', 'deepcompute', 'india'], raise_exc=True))
196+
>>> print(wv.get_indices(['the', 'deepcompute', 'india'], raise_exc=True))
203197
Traceback (most recent call last):
204198
File "/usr/lib/python3.6/code.py", line 91, in runcode
205199
exec(code, self.locals)
@@ -214,60 +208,60 @@ wordvecspace.exception.UnknownWord: "deepcompute"
214208
##### Get Word at Index
215209
```python
216210
# Get word at Index 509
217-
>>> print(wv.get_word_at_index(509))
211+
>>> print(wv.get_word(509))
218212
india
219213
```
220214

221215
##### Get Words at Indices
222216
```python
223-
>>> print(wv.get_word_at_indices([1, 509, 71190, 72000]))
217+
>>> print(wv.get_words([1, 509, 71190, 72000]))
224218
['the', 'india', 'reka', None]
225219
```
226220

227221
##### Get occurrence of the word
228222
```python
229223
# Get occurrences of the word "india"
230-
>>> print(wv.get_word_occurrence("india"))
224+
>>> print(wv.get_occurrence("india"))
231225
3242
232226

233227
# Get occurrences of the word "inidia"
234-
>>> print(wv.get_word_occurrence("inidia"))
228+
>>> print(wv.get_occurrence("inidia"))
235229
None
236230
```
237231

238232
##### Get occurrence of the words
239233
```python
240234
# Get occurrence of the words 'the', 'india' and 'Deepcompute'
241-
>>> print(wv.get_word_occurrences(["the", "india", "Deepcompute"]))
235+
>>> print(wv.get_occurrences(["the", "india", "Deepcompute"]))
242236
[1061396, 3242, None]
243237
```
244238

245239
##### Get vector magnitude of the word
246240
```python
247241
# Get magnitude for the word "hi"
248-
>>> print(wv.get_vector_magnitude("hi"))
242+
>>> print(wv.get_magnitude("hi"))
249243
1.0
250244
```
251245

252246
##### Get vector magnitude of the words
253247
```python
254248
# Get magnitude for the words "hi" and "india"
255-
>>> print(wv.get_vector_magnitudes(["hi", "india"]))
249+
>>> print(wv.get_magnitudes(["hi", "india"]))
256250
[1.0, 1.0]
257251
```
258252

259253
##### Get vector for given word
260254
```python
261255
# Get the word vector for a word india
262-
>>> print(wv.get_word_vector("india"))
256+
>>> print(wv.get_vector("india"))
263257
[-0.7871 -0.2993 0.3233 -0.2864 0.323 ]
264258

265259
# Get the unit word vector for a word india
266-
>>> print(wv.get_word_vector("india", normalized=True))
260+
>>> print(wv.get_vector("india", normalized=True))
267261
[-0.7871 -0.2993 0.3233 -0.2864 0.323 ]
268262

269263
# Get the word vector for a word inidia.
270-
>>> print(wv.get_word_vector('inidia', raise_exc=True))
264+
>>> print(wv.get_vector('inidia', raise_exc=True))
271265
Traceback (most recent call last):
272266
File "/usr/lib/python3.6/code.py", line 91, in runcode
273267
exec(code, self.locals)
@@ -279,16 +273,16 @@ Traceback (most recent call last):
279273
wordvecspace.exception.UnknownWord: "inidia"
280274

281275
# If you don't want to get exception when word is not there, then you can simply discard raise_exc=True
282-
>>> print(wv.get_word_vector('inidia'))
276+
>>> print(wv.get_vector('inidia'))
283277
[ 0. 0. 0. 0. 0.]
284278
```
285279

286280
##### Get vector for given words
287281
```python
288-
>>> print(wv.get_word_vectors(["hi", "india"]))
282+
>>> print(wv.get_vectors(["hi", "india"]))
289283
[[ 0.6342 0.2268 -0.3904 0.0368 0.6266]
290284
[-0.7871 -0.2993 0.3233 -0.2864 0.323 ]]
291-
>>> print(wv.get_word_vectors(["hi", "inidia"]))
285+
>>> print(wv.get_vectors(["hi", "inidia"]))
292286
[[ 0.6342 0.2268 -0.3904 0.0368 0.6266]
293287
[ 0. 0. 0. 0. 0. ]]
294288
```
@@ -391,33 +385,33 @@ $ curl "http://localhost:8000/api/v1/does_word_exist?word=india"
391385
```bash
392386
$ http://localhost:8000/api/v1/does_word_exist?word=india
393387

394-
$ http://localhost:8000/api/v1/get_word_index?word=india
388+
$ http://localhost:8000/api/v1/get_index?word=india
395389

396-
$ http://localhost:8000/api/v1/get_word_indices?words=["india", 22, "hello"]
390+
$ http://localhost:8000/api/v1/get_indices?words=["india", 22, "hello"]
397391

398-
$ http://localhost:8000/api/v1/get_word_at_index?index=509
392+
$ http://localhost:8000/api/v1/get_index?index=509
399393

400-
$ http://localhost:8000/api/v1/get_word_at_indices?indices=[22, 509]
394+
$ http://localhost:8000/api/v1/get_indices?indices=[22, 509]
401395

402-
$ http://localhost:8000/api/v1/get_word_vector?word_or_index=509
396+
$ http://localhost:8000/api/v1/get_vector?word_or_index=509
403397

404-
$ http://localhost:8000/api/v1/get_vector_magnitude?word_or_index=88
398+
$ http://localhost:8000/api/v1/get_magnitude?word_or_index=88
405399

406-
$ http://localhost:8000/api/v1/get_vector_magnitudes?words_or_indices=[88, "india"]
400+
$ http://localhost:8000/api/v1/get_magnitudes?words_or_indices=[88, "india"]
407401

408-
$ http://localhost:8000/api/v1/get_word_occurrence?word_or_index=india
402+
$ http://localhost:8000/api/v1/get_occurrence?word_or_index=india
409403

410-
$ http://localhost:8000/api/v1/get_word_occurrences?words_or_indices=["india", 22]
404+
$ http://localhost:8000/api/v1/get_occurrences?words_or_indices=["india", 22]
411405

412-
$ http://localhost:8000/api/v1/get_word_vectors?words_or_indices=[1, "india"]
406+
$ http://localhost:8000/api/v1/get_vectors?words_or_indices=[1, "india"]
413407

414408
$ http://localhost:8000/api/v1/get_distance?word_or_index1=ap&word_or_index2=india
415409

416410
$ http://localhost:8000/api/v1/get_distances?row_words_or_indices=["india", 33]
417411

418-
$ http://localhost:8000/api/v1/get_nearest?words_or_indices=india&k=100
412+
$ http://localhost:8000/api/v1/get_nearest?v_w_i=india&k=100
419413

420-
$ http://localhost:8000/api/v1/get_nearest?words_or_indices=india&k=100&metric=euclidean
414+
$ http://localhost:8000/api/v1/get_nearest?v_w_i=india&k=100&metric=euclidean
421415
```
422416

423417
> To see all API methods of wordvecspace please run http://localhost:8000/api/v1/apidoc
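
A minimal sketch of building the query URLs listed in this hunk from Python. The base URL and the parameter names (`v_w_i`, `k`, `words`) are taken from the examples above; `build_url` is a hypothetical helper, and JSON-encoding list arguments is an assumption about how the service parses them, not something this diff confirms.

```python
import json
from urllib.parse import urlencode

# Host, port and path prefix as shown in the README examples.
BASE_URL = "http://localhost:8000/api/v1"


def build_url(method, **params):
    """Return the request URL for an API method (hypothetical helper).

    List and dict values are JSON-encoded before URL-encoding; this is an
    assumption about the service, so check the apidoc page before relying on it.
    """
    encoded = {
        key: json.dumps(value) if isinstance(value, (list, dict)) else value
        for key, value in params.items()
    }
    return "%s/%s?%s" % (BASE_URL, method, urlencode(encoded))


if __name__ == "__main__":
    print(build_url("get_nearest", v_w_i="india", k=100))
    print(build_url("get_indices", words=["india", 22, "hello"]))
```

Using `urlencode` keeps brackets, quotes, and spaces in list arguments percent-encoded, which the raw URLs above leave to the shell or browser.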

setup.py

Lines changed: 8 additions & 10 deletions
@@ -1,8 +1,9 @@
11
from setuptools import setup, find_packages
22

3-
version = '0.5.2'
3+
version = '0.5.3'
44
setup(
55
name="wordvecspace",
6+
python_requires='>3.5.1',
67
version=version,
78
description="A high performance pure python module that helps in"
89
" loading and performing operations on word vector spaces"
@@ -16,19 +17,16 @@
1617
download_url="https://github.com/deep-compute/wordvecspace/tarball/%s" % version,
1718
license='MIT License',
1819
install_requires=[
19-
'numpy==1.13.1',
20-
'pandas==0.20.3',
21-
'numba==0.36.2',
22-
'basescript==0.2.1',
2320
'annoy==1.11.4',
2421
'scipy==1.0.0',
25-
'diskarray==0.1.4',
26-
'diskdict==0.1',
27-
'deeputil==0.2.5'
22+
'diskarray==0.1.8',
23+
'diskdict==0.2.2',
24+
'deeputil==0.2.5',
25+
'bottleneck==1.2.1',
2826
],
2927
extras_require={
30-
'cuda': ['pycuda==2017.1.1', 'scikit-cuda==0.5.1'],
31-
'service': ['kwikapi[tornado]==0.2']
28+
'cuda': ['pycuda==2018.1.1', 'scikit-cuda==0.5.1'],
29+
'service': ['kwikapi[tornado]==0.4.5']
3230
},
3331
package_dir={'wordvecspace': 'wordvecspace'},
3432
packages=find_packages('.'),

wordvecspace/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
11
from .command import main
2-
from .base import WordVecSpace
2+
from .base import WordVecSpaceBase
33
from .mem import WordVecSpaceMem
44
from .disk import WordVecSpaceDisk
55
from .annoy import WordVecSpaceAnnoy
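
The README hunks in this commit rename ten accessor methods (for example `get_word_index` becomes `get_index`). Callers written against the old names could be kept working with a forwarding shim along the following lines. This is a sketch only: `LegacyNames` and `RENAMED` are hypothetical names that are not part of wordvecspace, and the mapping is read off the README diff rather than from the library source.

```python
import warnings

# Old -> new method names, as they appear in the README hunks of this commit.
RENAMED = {
    "get_word_index": "get_index",
    "get_word_indices": "get_indices",
    "get_word_at_index": "get_word",
    "get_word_at_indices": "get_words",
    "get_word_occurrence": "get_occurrence",
    "get_word_occurrences": "get_occurrences",
    "get_vector_magnitude": "get_magnitude",
    "get_vector_magnitudes": "get_magnitudes",
    "get_word_vector": "get_vector",
    "get_word_vectors": "get_vectors",
}


class LegacyNames:
    """Hypothetical mixin that forwards pre-rename method names to the new ones."""

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails,
        # so the new names are never intercepted here.
        new_name = RENAMED.get(name)
        if new_name is None:
            raise AttributeError(name)
        warnings.warn(
            "%s is deprecated; use %s" % (name, new_name),
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(self, new_name)
```

A subclass such as `class Space(LegacyNames, WordVecSpaceMem)` would be one way to apply it, assuming the library classes do not define `__getattr__` themselves.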
