Commit 1493956

Merge pull request #33 from prasannababuAddagiri/master
After supporting WordVecSpaceDisk
2 parents 5d64201 + 53aab4e commit 1493956

13 files changed

Lines changed: 612 additions & 516 deletions

.travis.yml

Lines changed: 4 additions & 4 deletions
@@ -4,8 +4,8 @@ python:
 before_install:
 - wget 'https://s3.amazonaws.com/deepcompute-public-data/wordvecspace/small_test_data.tgz'
 && tar xvzf small_test_data.tgz
-- export WORDVECSPACE_DATAFILE='dc.wvspace'
-install:
+- export WORDVECSPACE_DATADIR='small_test_data'
+install:
 - pip install .[service]
 - sudo apt install libopenblas-base
 - sudo apt-get install libffi-dev
@@ -16,8 +16,8 @@ deploy:
 skip_cleanup: true
 api-key:
 secure: LmVvlW+FdYNIDlinjJ4sieONrcx1jaw18J7/mpHBD9ppIWZ+TB6H/iNqkqkh4WvULZttJrTHRYE6rQHXww7KK2UMrjVNE/TVUPaLFDeRRFvLDinAbqJkn+QJia0TuRa/26Bg9cDcvNYTghy7s37xpK2bJTEMF/eCM9b9RHYXilESYy8Z4l8IkFn5vnaDDfT5iV8xjuuOE4lsf4KC3L0xXIkYnKC/LbDVDj3B9h52TpsteL6cZtn/ExAThor5SrVymW7oMR1qrPQv8btNAdxymqJvEbjaP5RUuX7ehihev0Yge47A2X9gvxDRv+a6wM0HOvT4aGsMwCWo++fb0taWH7HUXFxSvkzKhsl74kDMmnE0WarcI/8L/3Q/zRhW1a2vAtj3O0FDHtzS/OK/k3TDk6Fh/LOvk2mTuGD3L34YxJrXxDxnt4tK2ubde8cGeA7pI5jRLNTNQXUip6Dxhr/5ZnMmG2nHI6ujjmDnucE+CHBtUmS1wjBn6ootE4pdoyti0aaA9OrVoGrf39pK7FAG38KJghqn8I3YCLoeapWjI4/DI0WIfq2Vl+v6yQar3Dn9lBLpWFLrjUmZnAx2F1e0P2y0VUg9hl0bINzIIrm2mHw4Zsl2GlMVSR033cwvcbdyeNxKMAfSV3EZBDpNuI6nlkkUZG1O72N/WV+kFRtSdQA=
-name: wordvecspace-0.5.1
-tag_name: 0.5.1
+name: wordvecspace-0.5.2
+tag_name: 0.5.2
 on:
 repo: deep-compute/wordvecspace
 # pypitest

README.md

Lines changed: 70 additions & 55 deletions
@@ -1,7 +1,7 @@
 # WordVecSpace
 A high performance pure python module that helps in loading and performing operations on word vector spaces created using Google's Word2vec tool.

-This module has ability to load data into memory using `WordVecSpaceMem` or it can also supports performing operations on the data which is on the disk using `WordVecSpaceAnnoy`.
+This module has ability to load data into memory using `WordVecSpaceMem` and it can also support performing operations on the data which is on the disk using `WordVecSpaceAnnoy` and `WordVecSpaceDisk`.

 ## Installation
 > Prerequisites: Python3.5
@@ -63,19 +63,19 @@ by the `wordvecspace` module. You'll first have to convert them
 to the `WordVecSpace` format.

 ```bash
-$ wordvecspace convert <input_dir> <output_file>
+$ wordvecspace convert <input_dir> <output_dir>

 # <input_dir> is the directory which has vocab.txt and vectors.bin
-# <output_file> is the file where you want to put your output file
+# <output_dir> is the directory where you want to store your output files.
 ```

 Example:

 ```bash
-$ wordvecspace convert /home/user/bindata /home/user/dc.wvspace
+$ wordvecspace convert /home/user/bindata /home/user/output_dir

 # /home/user/bindata is the directory containing vocab.txt and vectors.bin
-# dc.wvspace is the output file
+# /home/user/output_dir is the output directory which contains wordvecspace data files.
 ```

 ### Importing
@@ -89,59 +89,67 @@ $ wordvecspace convert /home/user/bindata /home/user/dc.wvspace

 ##### Load data
 ```python
->>> wv = WordVecSpaceMem('/home/user/dc.wvspace')
+>>> wv = WordVecSpaceMem('/home/user/output_dir')
 ```

 ##### Make get_nearest call
 ```python
 >>> wv.get_nearest('india', k=20)
-[509, 486, 523, 4343, 14208, 13942, 42424, 25578, 6212, 2475, 3560, 13508, 20919, 3389, 4484, 19995, 8776, 7012, 12191, 16619]
-
+[509, 3389, 486, 523, 7125, 16619, 4491, 12191, 6866, 8776, 15232, 14208, 5998, 21916, 5226, 6322, 4343, 6212, 10172, 6186]
 # k is for getting top k nearest values
 ```

 #### Types
-`wordvecspace` module can perform operations by loading data into RAM using `WordVecSpaceMem` or directly on the data which is on the disk using `WordVecSpaceAnnoy`
+`wordvecspace` module can perform operations by loading data into RAM using `WordVecSpaceMem` or directly on the data which is on the disk using `WordVecSpaceDisk`

-`WordVecSpaceMem` is a bruteforce algorithm which compares given word with all the words in the vector space
+`WordVecSpaceMem` and `WordVecSpaceDisk` is a bruteforce algorithm which compares given word with all the words in the vector space

-`WordVecSpaceAnnoy` takes wvspace file as input and creates annoy indexes in another file (index file). Using this file `annoy` gives approximate results quickly. For better understanding of `Annoy` please go through this [link](https://github.com/spotify/annoy)
+`WordVecSpaceAnnoy` takes wordvecspace output_dir as input and creates annoy indexes in another file (index file). Using this file `annoy` gives approximate results quickly. For better understanding of `Annoy` please go through this [link](https://github.com/spotify/annoy)

-As we have seen how to import `WordVecSpaceMem` above, let us look at `WordVecSpaceAnnoy`
+As we have seen how to import `WordVecSpaceMem` above, let us look at `WordVecSpaceAnnoy` and `WordVecSpaceDisk`

 ##### Import
 ```python
 >>> from wordvecspace import WordVecSpaceAnnoy
+>>> from wordvecspace import WordVecSpaceDisk
 ```

 ##### Load data
 ```python
-wv = WordVecSpaceAnnoy('/home/user/dc.wvspace', n_trees, index_fpath)
+>>> wv = WordVecSpaceAnnoy('/home/user/output_dir', n_trees, index_fpath)

 # n_trees = number of trees(More trees gives a higher precision when querying for get_nearest)
 # index_fpath = path for annoy index file

-# n_trees and index_fpath are optional. If those are not given then WordVecSpaceAnnoy uses `1` for n_trees and `/home/user/` (dc.wvspace file directory) directory for index_fpath.
+# n_trees and index_fpath are optional. If those are not given then WordVecSpaceAnnoy uses `1` for n_trees and `/home/user/output_dir` (wordvecspace data directory) directory for index_fpath.
+
+>>> wv = WordVecSpaceDisk('/home/user/output_dir')
 ```

 ##### Make get_nearest call
 ```python
->>> wv.get_nearest('india', k=20)
-[509, 486, 523, 4343, 14208, 13942, 42424, 25578, 6212, 2475, 3560, 13508, 20919, 3389, 4484, 19995, 8776, 7012, 12191, 16619]
+>>> wv.get_nearest('india', k=20) (ANNOY)
+[509, 3389, 16619, 4491, 6866, 8776, 14208, 5998, 21916, 20919, 2325, 4622, 3546, 24149, 5064, 35704, 25578, 15842, 4137, 6499]
+
+>>> wv.get_nearest('india', k=20) (DISK)
+[509, 3389, 486, 523, 7125, 16619, 4491, 12191, 6866, 8776, 15232, 14208, 5998, 21916, 5226, 6322, 4343, 6212, 10172, 6186]
 ```

 #### Distance calculations
 `WordVecSpaceAnnoy` supports different types of distance calculations such as `"angular"`, `"euclidean"`, `"manhattan"` and `"hamming"`.

 `WordVecSpaceMem` supports `"angular"` and `"euclidean"` for distance calculations.

-Both uses `"angular"` by default. If you want to change it then you can change at the time of creating object.
+`WordVecSpaceDisk` supports `"angular"` and `"euclidean"` for distance calculations.
+
+All of the above uses `"angular"` by default. If you want to change it then you can change at the time of creating object.

 Example:

 ```bash
-wv = WordVecSpaceAnnoy('/path/to/wvspacefile', n_trees, metric="euclidean")
-wv = WordVecSpaceMem('/path/to/wvspacefile', metric="euclidean")
+wv = WordVecSpaceAnnoy('/path/to/output_dir', n_trees, metric="euclidean")
+wv = WordVecSpaceMem('/path/to/output_dir', metric="euclidean")
+wv = WordVecSpaceDisk('/path/to/output_dir', metric="euclidean")

 # metric = type of distance calculation
 ```
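
The README text in the hunk above describes `WordVecSpaceMem` and `WordVecSpaceDisk` as brute-force searches that compare the given word with every word in the vector space, under either the `"angular"` or the `"euclidean"` metric. The following is a minimal sketch of that idea in plain Python, not the library's implementation. It assumes that "angular" means `1 - cosine similarity`, which is consistent with the distance values shown elsewhere in this diff; the function names and the toy data are hypothetical.

```python
import math


def angular_distance(u, v):
    """1 - cosine similarity (assumed meaning of "angular"; non-zero vectors only)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical metric registry, keyed by the metric names used in the README.
METRICS = {"angular": angular_distance, "euclidean": euclidean_distance}


def brute_force_nearest(vectors, query_index, k=5, metric="angular"):
    """Return the indices of the k vectors closest to vectors[query_index].

    Every vector is compared with the query, so the cost grows linearly
    with the vocabulary size. The query itself comes first (distance 0).
    """
    distance = METRICS[metric]
    query = vectors[query_index]
    scored = [(distance(query, vec), idx) for idx, vec in enumerate(vectors)]
    scored.sort()  # ascending distance; ties broken by index
    return [idx for _, idx in scored[:k]]


# Toy 2-D "vector space" with four entries.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(brute_force_nearest(toy, 0, k=3))
print(brute_force_nearest(toy, 0, k=3, metric="euclidean"))
```

An approximate index such as Annoy avoids the full scan by searching a forest of trees, which is why the README describes its results as approximate.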
@@ -150,14 +158,14 @@ WordVecSpaceMem can also supports specifying metric at the time of calculating d

 Example:
 ```bash
-wv = WordVecSpaceMem('/path/to/wvspacefile', metric="euclidean")
+wv = WordVecSpaceMem('/path/to/output_dir', metric="euclidean")

 wv.get_distance('ap', 'india', metric='angular')
 ```

 #### Examples of using wordvecspace methods

-> WordVecSpaceMem and WordVecSpaceAnnoy have the same common methods.
+> `WordVecSpaceMem`, `WordVecSpaceAnnoy` and `WordVecSpaceDisk` support the same methods.

 ##### Check if a word exists or not in the word vector space
 ```python
@@ -238,25 +246,25 @@ None
 ```python
 # Get magnitude for the word "hi"
 >>> print(wv.get_vector_magnitude("hi"))
-8.7948
+1.0
 ```

 ##### Get vector magnitude of the words
 ```python
 # Get magnitude for the words "hi" and "india"
 >>> print(wv.get_vector_magnitudes(["hi", "india"]))
-[ 8.7948 10.303 ]
+[1.0, 1.0]
 ```

 ##### Get vector for given word
 ```python
 # Get the word vector for a word india
 >>> print(wv.get_word_vector("india"))
-[-6.4482 -2.1636 5.7277 -3.7746 3.583 ]
+[-0.7871 -0.2993 0.3233 -0.2864 0.323 ]

 # Get the unit word vector for a word india
 >>> print(wv.get_word_vector("india", normalized=True))
-[-0.6259 -0.21 0.5559 -0.3664 0.3478]
+[-0.7871 -0.2993 0.3233 -0.2864 0.323 ]

 # Get the word vector for a word inidia.
 >>> print(wv.get_word_vector('inidia', raise_exc=True))
@@ -278,80 +286,80 @@ wordvecspace.exception.UnknownWord: "inidia"
 ##### Get vector for given words
 ```python
 >>> print(wv.get_word_vectors(["hi", "india"]))
-[[ 0.4008 0.3623 -0.013 0.8395 0.0562]
-[-0.4975 -0.134 0.7874 -0.3274 0.0857]]
+[[ 0.6342 0.2268 -0.3904 0.0368 0.6266]
+[-0.7871 -0.2993 0.3233 -0.2864 0.323 ]]
 >>> print(wv.get_word_vectors(["hi", "inidia"]))
-[[ 0.4008 0.3623 -0.013 0.8395 0.0562]
+[[ 0.6342 0.2268 -0.3904 0.0368 0.6266]
 [ 0. 0. 0. 0. 0. ]]
 ```

 ##### Get distance between two words
 ```python
 # Get distance between "india", "usa"
 >>> print(wv.get_distance("india", "usa"))
-0.48379534483
+0.37698328495

 # Get the distance between 250, "india"
 >>> print(wv.get_distance(250, "india"))
-1.16397565603
+1.1418992728

 # Get the euclidean distance between 250, "india" for WordvecSpaceMem
 >>> print(wv.get_distance(250, "india", metric='euclidean'))
-12.04961109161377
+1.5112241506576538
 ```

 ##### Get distance between list of words

 ```python
 >>> print(wv.get_distances("for", ["to", "for", "india"]))
-[[ 0.381 0. 0.9561]]
+[[ 2.7428e-01 5.9605e-08 1.1567e+00]]

 >>> print(wv.get_distances("for", ["to", "for", "inidia"]))
-[[ 0.381 0. 1. ]]
+[[ 2.7428e-01 5.9605e-08 1.0000e+00]]

 >>> print(wv.get_distances(["india", "for"], ["to", "for", "usa"]))
-[[ 1.0685 0.9561 0.3251]
-[ 0.381 0. 1.4781]]
+[[ 1.1445e+00 1.1567e+00 3.7698e-01]
+[ 2.7428e-01 5.9605e-08 1.6128e+00]]

 >>> print(wv.get_distances(["india", "usa"]))
-[[ 1.3853 0.4129 0.3149 ..., 1.1231 1.4595 0.7912]
-[ 1.3742 0.9549 1.0354 ..., 0.5556 1.0847 1.0832]]
+[[ 1.5464 0.4876 0.3017 ..., 1.2492 1.2451 0.8925]
+[ 1.0436 0.9995 1.0913 ..., 0.6996 0.8014 1.1608]]

 >>> print(wv.get_distances(["andhra"]))
-[[ 1.2817 0.6138 0.2995 ..., 0.9945 1.224 0.6137]]
+[[ 1.5418 0.7153 0.277 ..., 1.1657 1.0774 0.7036]]

 # For WordVecSpaceMem
 >>> print(wv.get_distances(["andhra"], metric='euclidean'))
-[[ 9.0035 8.3985 7.1658 ..., 9.2236 9.6078 8.6349]]
+[[ 1.756 1.1961 0.7443 ..., 1.5269 1.4679 1.1862]]
 ```

 ##### Get nearest
 ```python
 # Get nearest for given word or index
 >>> print(wv.get_nearest("india", 20))
-[509, 486, 523, 4343, 14208, 13942, 42424, 25578, 6212, 2475, 3560, 13508, 20919, 3389, 4484, 19995, 8776, 7012, 12191, 16619]
+[509, 3389, 486, 523, 7125, 16619, 4491, 12191, 6866, 8776, 15232, 14208, 5998, 21916, 5226, 6322, 4343, 6212, 10172, 6186]

 # Get nearest for given words or indices
 >>> print(wv.get_nearest(["ram", "india"], 5))
-[[3844, 38851, 25381, 10830, 17049], [509, 486, 523, 4343, 14208]]
+[[3844, 16727, 15811, 42731, 41516], [509, 3389, 486, 523, 7125]]

 # Get nearest using euclidean distance for WordVecSpaceMem
 >>> print(wv.get_nearest(["ram", "india"], 5, metric='euclidean'))
-[[3844, 25381, 27802, 17049, 38851], [509, 486, 14208, 523, 13942]]
+[[3844, 16727, 15811, 42731, 41516], [509, 3389, 486, 523, 7125]]

 # Get common nearest neighbors among given words
 >>> print(wv.get_nearest(['india', 'bosnia'], 10, combination=True))
-[14208, 486, 523, 4343, 42424, 509]
+[523, 509, 486]
 ```

-### Service
+## Service

 ```bash
 # Run wordvecspace as a service (which continuously listens on some port for API requests)
-$ wordvecspace runserver <type> <input_file> --metric <metric> --port <port> --eargs <eargs>
+$ wordvecspace runserver <type> <input_dir> --metric <metric> --port <port> --eargs <eargs>

-# <type> is for specifying wordvecspace functionality (eg: mem or annoy).
-# <input_file> is for wordvecspace file
+# <type> is for specifying wordvecspace functionality (eg: mem, annoy or disk).
+# <input_dir> is for wordvecspace data dir
 # <metric> is to specify type for distance calculation
 # <port> is to run wordvecspace in that port
 # <eargs> is for specifying extra arguments for annoy
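
The updated README output in the hunks above reports vector magnitudes of `1.0`, and `get_nearest` returns the same lists for the angular and the euclidean metric. One plausible reading, offered as an assumption rather than something the commit states, is that vectors are now stored as unit vectors. For unit vectors, `||u - v||^2 = 2 - 2(u . v)`, so the euclidean distance equals `sqrt(2 * angular)` when angular means `1 - cosine similarity`, and both metrics rank neighbours identically. The sketch below checks that identity on two hand-made vectors and against the `get_distance(250, "india")` values quoted in the diff. The helper names are illustrative and are not part of the `wordvecspace` API.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def angular(u, v):
    """1 - cosine similarity; for unit vectors the cosine is the dot product."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two arbitrary vectors, normalized to unit length.
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

# Identity for unit vectors: euclidean = sqrt(2 * angular).
assert math.isclose(euclidean(u, v), math.sqrt(2.0 * angular(u, v)))

# Values quoted in the README hunk for get_distance(250, "india").
readme_angular = 1.1418992728
readme_euclidean = 1.5112241506576538

# The two agree to roughly single-precision accuracy.
print(math.sqrt(2.0 * readme_angular), readme_euclidean)
```

Because the square root is monotonic, sorting by either distance gives the same order, which would explain why the two `get_nearest(["ram", "india"], 5, ...)` outputs match after this change.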
@@ -361,10 +369,13 @@ Example:

 ```bash
 # For mem
-$ wordvecspace runserver mem /home/user/dc.wvspace --metric angular --port 8000
+$ wordvecspace runserver mem /home/user/output_dir --metric angular --port 8000
+
+# For disk
+$ wordvecspace runserver disk /home/user/output_dir --metric angular --port 8000

 # For annoy
-$ wordvecspace runserver annoy /home/user/dc.wvspace --metric euclidean --port 8000 --eargs n_trees=1:index_fpath=/tmp
+$ wordvecspace runserver annoy /home/user/output_dir --metric euclidean --port 8000 --eargs n_trees=1:index_fpath=/tmp

 # Extra arguments for annoy are n_trees and index_fpath
 # - n_trees is the number of trees for annoy
@@ -415,20 +426,24 @@ $ http://localhost:8000/api/v1/get_nearest?words_or_indices=india&k=100&metric=e
 ```bash
 # wordvecspace provides command to directly interact with it

-$ wordvecspace interact <type> <input_file> --metric <metric> --eargs <eargs>
+$ wordvecspace interact <type> <input_dir> --metric <metric> --eargs <eargs>

-# <type> is for specifying wordvecspace functionality (eg: mem or annoy).
-# <input_file> is for wordvecspace file
+# <type> is for specifying wordvecspace functionality (eg: mem, disk or annoy).
+# <input_dir> is for wordvecspace data dir
 # <metric> is to specify type for distance calculation
 # <eargs> is for specifying extra arguments for annoy
 ```

 Example:
 ```bash
 # For mem
-$ wordvecspace interact mem /home/user/dc.wvspace --metric euclidean
+$ wordvecspace interact mem /home/user/output_dir --metric euclidean
+
+# For Disk
+$ wordvecspace interact disk /home/user/output_dir --metric euclidean

-$ wordvecspace interact annoy /home/user/dc.wvspace --metric angular --eargs n_trees=1:index_fpath=/tmp
+# For Annoy
+$ wordvecspace interact annoy /home/user/output_dir --metric angular --eargs n_trees=1:index_fpath=/tmp
 WordVecSpaceAnnoy console (vectors=71291 dims=5)
 >>> wv.get_nearest('india', 20)
 [509, 486, 523, 4343, 13942, 42424, 25578, 3389, 12191, 16619, 12088, 6049, 5226, 4137, 41883, 18617, 10172, 35704, 25552, 29059]
@@ -447,7 +462,7 @@ $ wget 'https://s3.amazonaws.com/deepcompute-public-data/wordvecspace/small_test
 $ tar xvzf small_test_data.tgz

 # Export the path of data file to the environment variables
-$ export WORDVECSPACE_DATAFILE="/home/user/dc.wvspace"
+$ export WORDVECSPACE_DATADIR="/home/user/output_dir"

 # Run tests
 $ python3 setup.py test

setup.py

Lines changed: 7 additions & 6 deletions
@@ -1,6 +1,6 @@
 from setuptools import setup, find_packages

-version = '0.5.1'
+version = '0.5.2'
 setup(
 name="wordvecspace",
 version=version,
@@ -19,11 +19,12 @@
 'numpy==1.13.1',
 'pandas==0.20.3',
 'numba==0.36.2',
-'basescript==0.2.0',
-'tables==3.4.2',
+'basescript==0.2.1',
 'annoy==1.11.4',
-'cmph-cffi==0.3.0',
-'scipy==1.0.0'
+'scipy==1.0.0',
+'diskarray==0.1.4',
+'diskdict==0.1',
+'deeputil==0.2.5'
 ],
 extras_require={
 'cuda': ['pycuda==2017.1.1', 'scikit-cuda==0.5.1'],
@@ -38,7 +39,7 @@
 "Intended Audience :: Developers",
 "License :: OSI Approved :: MIT License",
 ],
-test_suite = 'test.suite_test',
+test_suite='test.suite_test',
 entry_points={
 "console_scripts": [
 "wordvecspace = wordvecspace:main",

wordvecspace/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 from .command import main
 from .base import WordVecSpace
 from .mem import WordVecSpaceMem
+from .disk import WordVecSpaceDisk
 from .annoy import WordVecSpaceAnnoy
 from .fileformat import WordVecSpaceFile
