Commit 8a7731b

Pringled and stephantul authored
Add docs (#15)
* Added setup for main docs
* Added results
* Added usage section
* Updated readme
* Updated docs
* Resolved conflict
* Added finalized mteb results
* Updated docs
* Updated docs
* Updated docs
* Updated table of contents
* Updated pyproject for pypi release
* Updated docs and pyproject
* Updated readme
* Updated readme
* Added quickstart
* Small update
* Small update
* Updated readme
* Updated readme
* Update README.md
* Added link to HF
* Updated description

---------

Co-authored-by: Stephan Tulkens <stephantul@gmail.com>
1 parent 16193aa commit 8a7731b

3 files changed

Lines changed: 268 additions & 3 deletions

README.md

Lines changed: 244 additions & 1 deletion
@@ -1 +1,244 @@
1-
# Model2vec
1+
# Model2Vec: Distill a Small Fast Model from any Sentence Transformer
2+
3+
**Model2Vec** is a method to distill a small, fast model from any Sentence Transformer model.
4+
5+
## Table of Contents
6+
- [Main Features](#main-features)
7+
- [Quickstart](#quickstart)
8+
- [What is Model2Vec?](#what-is-model2vec)
9+
- [Who is this for?](#who-is-this-for)
10+
- [Usage](#usage)
11+
- [Distilling a Model2Vec model](#distilling-a-model2vec-model)
12+
- [Inferencing a Model2Vec model](#inferencing-a-model2vec-model)
13+
- [Evaluating a Model2Vec model](#evaluating-a-model2vec-model)
14+
- [Results](#results)
15+
- [Citing](#citing)
16+
17+
## Main Features
18+
- **Small**: Model2Vec can reduce the size of a Sentence Transformer model by a factor of 15 *.
19+
- **Fast distillation**: Model2Vec can distill a Sentence Transformer model in ~5 minutes on CPU *.
20+
- **Fast inference**: Model2Vec creates static embeddings that are up to 500 times * faster than the original model.
21+
- **State-of-the-art static embedding performance**: Model2Vec outperforms traditional static embeddings by a large margin on a number of benchmarks.
22+
- **No data needed**: Distillation happens directly on a token level, so no dataset is needed.
23+
- **Simple to use**: Model2Vec provides an easy-to-use interface for distilling and inferencing Model2Vec models.
24+
- **Bring your own model**: Model2Vec can be applied to any Sentence Transformer model.
25+
- **Bring your own vocabulary**: Model2Vec can be applied to any vocabulary, allowing you to use your own domain-specific vocabulary.
26+
- **Multi-lingual**: Model2Vec can easily be applied to any language.
27+
- **Tightly integrated with HuggingFace hub**: Model2Vec models can be easily shared and loaded from the HuggingFace hub. Our models can be found [here](https://huggingface.co/minishlab).
28+
- **Easy Evaluation**: Model2Vec comes with a set of evaluation tasks to measure the performance of the distilled model.
29+
30+
\* Based on the [bge-base-en-v1.5 model](https://huggingface.co/BAAI/bge-base-en-v1.5).
31+
32+
33+
## Quickstart
34+
35+
Install the package with:
36+
```bash
pip install model2vec
```
39+
40+
The easiest way to get started with Model2Vec is to download one of our [flagship models from the HuggingFace hub](https://huggingface.co/minishlab). These models are pre-trained and ready to use. The following code snippet shows how to load a model and make embeddings:
41+
```python
from model2vec import StaticModel

# Load a model from the HuggingFace hub (in this case the M2V_base_output model)
model_name = "minishlab/M2V_base_output"
model = StaticModel.from_pretrained(model_name)

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everyone."])
```
51+
52+
Alternatively, you can distill your own Model2Vec model from a Sentence Transformer model. The following code snippet shows how to distill a model:
53+
```python
from model2vec.distill import distill

# Choose a Sentence Transformer model
model_name = "BAAI/bge-base-en-v1.5"

# Distill the model
m2v_model = distill(model_name=model_name, pca_dims=256)

# Save the model
m2v_model.save_pretrained("m2v_model")
```
65+
66+
## What is Model2Vec?
67+
Model2Vec is a simple and effective method to distill any Sentence Transformer into static embeddings. It works by passing a vocabulary through the specified Sentence Transformer model, reducing the dimensionality of the resulting embeddings with PCA, weighting them using Zipf weighting, and storing the result in a static format.
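
The steps named above can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the random matrix stands in for real token embeddings, the helper names `pca_reduce` and `zipf_weight` are hypothetical, and both the `log(1 + rank)` weighting formula and the mean pooling at the end are assumptions.

```python
import numpy as np


def pca_reduce(embeddings: np.ndarray, n_dims: int) -> np.ndarray:
    """Project row vectors onto their top `n_dims` principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_dims].T


def zipf_weight(embeddings: np.ndarray) -> np.ndarray:
    """Scale row i by log(1 + rank), with rank starting at 1.

    Assumes rows are sorted from most to least frequent token, so frequent
    tokens receive smaller weights. The exact formula is an assumption.
    """
    ranks = np.arange(1, embeddings.shape[0] + 1)
    return embeddings * np.log1p(ranks)[:, None]


# Stand-in for the vectors a Sentence Transformer would produce per token.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(1000, 64))

static_embeddings = zipf_weight(pca_reduce(token_embeddings, n_dims=16))

# A sentence vector is then a table lookup plus a mean (pooling is assumed).
token_ids = [3, 17, 256]
sentence_vector = static_embeddings[token_ids].mean(axis=0)
print(static_embeddings.shape, sentence_vector.shape)
```

Because the result is a plain lookup table, encoding a sentence needs no forward pass through the original model, which is where the speed-up of static embeddings comes from.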
68+
69+
This technique creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on a number of relevant tasks. It is also much faster to create than traditional static embedding models such as GloVe, and it requires no dataset.
70+
71+
72+
## Who is this for?
73+
Model2Vec allows anyone to create their own static embeddings from any Sentence Transformer model in minutes. It can easily be applied to other languages by using a language-specific Sentence Transformer model and vocabulary. Similarly, it can be applied to specific domains by using a domain-specific model, vocabulary, or both. This makes it an ideal tool for fast prototyping, research, and production use cases where speed and size matter more than peak accuracy.
74+
75+
76+
77+
78+
## Usage
79+
80+
### Distilling a Model2Vec model
81+
82+
Distilling a model from the output embeddings of a Sentence Transformer model:
83+
```python
from model2vec.distill import distill

# Choose a Sentence Transformer model
model_name = "BAAI/bge-base-en-v1.5"

# Distill the model
m2v_model = distill(model_name=model_name, pca_dims=256)

# Save the model
m2v_model.save_pretrained("m2v_model")
```
96+
97+
Distilling with a custom vocabulary:
98+
```python
from model2vec.distill import distill

# Load a vocabulary as a list of strings
vocabulary = ["word1", "word2", "word3"]
# Choose a Sentence Transformer model
model_name = "BAAI/bge-base-en-v1.5"

# Distill the model with the custom vocabulary
m2v_model = distill(model_name=model_name, vocabulary=vocabulary, pca_dims=256)

# Save the model
m2v_model.save_pretrained("m2v_model")
```
112+
113+
Alternatively, the command line interface can be used to distill a model:
114+
```bash
python3 -m model2vec.distill --model-name BAAI/bge-base-en-v1.5 --vocabulary-path vocab.txt --device mps --save-path model2vec_model
```
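
The README does not describe the layout of `vocab.txt`. Assuming a plain-text file with one entry per line, a small helper like the one below could turn such a file into the list of strings used in the Python example; `load_vocabulary` is a hypothetical name, not part of the package.

```python
import tempfile
from pathlib import Path


def load_vocabulary(path: str) -> list[str]:
    """Read one vocabulary entry per line, skipping blanks and duplicates."""
    seen: set[str] = set()
    vocabulary: list[str] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        token = line.strip()
        if token and token not in seen:
            seen.add(token)
            vocabulary.append(token)
    return vocabulary


# Demonstrate on a throwaway file (assumed format: one entry per line).
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = Path(tmp_dir) / "vocab.txt"
    vocab_path.write_text("word1\nword2\n\nword1\nword3\n", encoding="utf-8")
    vocabulary = load_vocabulary(str(vocab_path))

print(vocabulary)  # ['word1', 'word2', 'word3']
```

Dropping duplicates while keeping first-seen order keeps the list deterministic, which matters if row order is later used for rank-based weighting.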
117+
118+
### Inferencing a Model2Vec model
119+
Inferencing with one of our flagship Model2Vec models:
120+
```python
from model2vec import StaticModel

# Load a model from the HuggingFace hub
model_name = "minishlab/M2V_base_output"
model = StaticModel.from_pretrained(model_name)

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everyone."])
```
130+
131+
132+
Inferencing with a saved Model2Vec model:
133+
```python
from model2vec import StaticModel

# Load a saved model
model_name = "m2v_model"
model = StaticModel.from_pretrained(model_name)

# Make embeddings
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everyone."])
```
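
A common next step is to compare the resulting vectors. The sketch below computes pairwise cosine similarity, assuming `encode` returns a 2D NumPy array with one row per sentence (not confirmed by this README). A small hand-written array stands in for real model output, and `cosine_similarity_matrix` is a hypothetical helper.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return pairwise cosine similarities between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against all-zero rows to avoid division by zero.
    normalized = embeddings / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T


# Stand-in for the output of `model.encode([...])`; real values will differ.
embeddings = np.array([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
similarities = cosine_similarity_matrix(embeddings)
print(similarities.round(2))
```

For these two stand-in rows the off-diagonal similarity is 0.5 and the diagonal is 1.0.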
143+
144+
### Evaluating a Model2Vec model
145+
146+
Model2Vec models can be evaluated using our [evaluation package](https://github.com/MinishLab/evaluation). To run this, first install the optional evaluation package:
147+
```bash
# Quote the extra so shells such as zsh do not expand the brackets
pip install "model2vec[evaluation]"
```
150+
151+
Then, the following code snippet shows how to evaluate a Model2Vec model:
152+
```python
from model2vec import StaticModel

from evaluation import CustomMTEB, get_tasks, parse_mteb_results, make_leaderboard, summarize_results
from mteb import ModelMeta

# Get all available tasks
tasks = get_tasks()
# Define the CustomMTEB object with the specified tasks
evaluation = CustomMTEB(tasks=tasks)

# Load the model
model_name = "m2v_model"
model = StaticModel.from_pretrained(model_name)

# Optionally, add model metadata in MTEB format
model.mteb_model_meta = ModelMeta(
    name=model_name, revision="no_revision_available", release_date=None, languages=None
)

# Run the evaluation
results = evaluation.run(model, eval_splits=["test"], output_folder=f"results/{model_name}")

# Parse the results and summarize them
parsed_results = parse_mteb_results(mteb_results=results, model_name=model_name)
task_scores = summarize_results(parsed_results)
# Print the results in a leaderboard format
print(make_leaderboard(task_scores))
```
181+
182+
## Results
183+
184+
### Main Results
185+
186+
Model2Vec is evaluated on MTEB, as well as two additional tasks: PEARL (a phrase representation task) and WordSim (a word similarity task). The results are shown in the table below.
187+
188+
189+
190+
191+
| Model            | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank  | Ret   | STS   | Sum   | PEARL | WordSim |
|------------------|-----------|------------|-------|-------|-----------|-------|-------|-------|-------|-------|---------|
| all-MiniLM-L6-v2 | 56.08     | 56.09      | 62.62 | 41.94 | 82.37     | 58.04 | 41.95 | 78.90 | 30.81 | 60.83 | 49.91   |
| M2V_base_glove   | 48.58     | 47.60      | 61.35 | 30.52 | 75.34     | 48.50 | 29.26 | 70.31 | 31.50 | 50.28 | 54.29   |
| M2V_base_output  | 46.79     | 45.34      | 61.25 | 25.58 | 74.90     | 47.63 | 26.14 | 68.58 | 29.20 | 54.02 | 49.18   |
| GloVe_300d       | 42.84     | 42.36      | 57.31 | 27.66 | 72.48     | 43.30 | 22.78 | 61.90 | 28.81 | 45.65 | 43.05   |
| WL256*           | 48.88     | 49.36      | 58.98 | 33.34 | 74.00     | 52.03 | 33.12 | 73.34 | 29.05 | 48.81 | 45.16   |
198+
199+
<details>
200+
<summary> Task Abbreviations </summary>
201+
202+
For readability, the MTEB task names are abbreviated as follows:
203+
- Class: Classification
204+
- Clust: Clustering
205+
- PairClass: PairClassification
206+
- Rank: Reranking
207+
- Ret: Retrieval
208+
- STS: Semantic Textual Similarity
209+
- Sum: Summarization
210+
</details>
211+
212+
\
213+
\* WL256, introduced in the [WordLlama](https://github.com/dleemiller/WordLlama/tree/main) package, is included for comparison due to its similarities to Model2Vec. However, we believe it is heavily overfit to the MTEB dataset, since it is trained on datasets used in MTEB itself. This shows in the fact that WL256 performs much worse than our models on the non-MTEB tasks (PEARL and WordSim). The results in [Classification and Speed Benchmarks](#classification-and-speed-benchmarks) further support this.
214+
215+
### Classification and Speed Benchmarks
216+
217+
In addition to the MTEB evaluation, Model2Vec is evaluated on a number of classification datasets. These are used as additional analysis to avoid overfitting to the MTEB dataset and to benchmark the speed of the model. The results are shown in the table below.
218+
219+
| model            |   Average |     sst2 |   imdb |     trec |   ag_news |
|:-----------------|----------:|---------:|-------:|---------:|----------:|
| bge-base-en-v1.5 |     90.00 |    91.54 |  91.88 |    85.16 |     91.45 |
| all-MiniLM-L6-v2 |     84.10 |    83.95 |  81.36 |    81.31 |     89.77 |
| M2V_base_output  |     82.23 |    80.92 |  84.56 |    75.27 |     88.17 |
| M2V_base_glove   |     80.76 |    83.07 |  85.24 |    66.12 |     88.61 |
| WL256            |     78.48 |    76.88 |  80.12 |    69.23 |     87.68 |
| GloVe_300d       |     77.77 |    81.68 |  84.00 |    55.67 |     89.71 |
227+
228+
As can be seen, the Model2Vec models outperform the GloVe and WL256 models on average, although not on every individual dataset (for example, GloVe_300d scores higher on ag_news), and are competitive with the all-MiniLM-L6-v2 model while being much faster.
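
The `Average` column is not defined in the text. Assuming it is the unweighted mean of the four dataset scores, the reported values can be checked directly from the table; the figures below are copied from it.

```python
# (reported average, [sst2, imdb, trec, ag_news]) per model, from the table.
scores = {
    "bge-base-en-v1.5": (90.00, [91.54, 91.88, 85.16, 91.45]),
    "all-MiniLM-L6-v2": (84.10, [83.95, 81.36, 81.31, 89.77]),
    "M2V_base_output": (82.23, [80.92, 84.56, 75.27, 88.17]),
    "M2V_base_glove": (80.76, [83.07, 85.24, 66.12, 88.61]),
    "WL256": (78.48, [76.88, 80.12, 69.23, 87.68]),
    "GloVe_300d": (77.77, [81.68, 84.00, 55.67, 89.71]),
}

# Largest gap between the recomputed mean and the reported average.
max_deviation = 0.0
for model, (reported, per_task) in scores.items():
    mean = sum(per_task) / len(per_task)
    max_deviation = max(max_deviation, abs(mean - reported))
    print(f"{model:<18} recomputed={mean:.4f} reported={reported:.2f}")

# Every reported value matches the plain mean to within rounding.
assert max_deviation < 0.01
```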
229+
230+
The scatterplot below shows the relationship between the number of sentences per second and the average classification score. The bubble sizes correspond to the number of parameters in the models (larger = more parameters), and the colors correspond to the sentences per second (greener = more sentences per second). This plot shows that the Model2Vec models are much faster than the other models, while still being competitive in terms of classification performance with the all-MiniLM-L6-v2 model.
231+
232+
![Sentences per second versus average classification score](assets/images/sentences_per_second_vs_average_score.png)
233+
234+
## Citing
235+
236+
If you use Model2Vec in your research, please cite the following:
237+
```bibtex
@software{minishlab2024word2vec,
  author = {Stephan Tulkens and Thomas van Dongen},
  title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec},
}
```

pyproject.toml

Lines changed: 24 additions & 2 deletions
@@ -1,9 +1,11 @@
 [project]
 name = "model2vec"
-description = "Distill your sentencetransformers to small fast models."
-readme = "README.md"
+description = "Distill a Small Fast Model from any Sentence Transformer"
+readme = { file = "README.md", content-type = "text/markdown" }
+license = { file = "LICENSE" }
 version = "0.1.0"
 requires-python = ">=3.10"
+authors = [{ name = "Stéphan Tulkens", email = "stephantul@gmail.com"}, {name = "Thomas van Dongen", email = "thomas123@live.nl"}]
 
 dependencies = [
 "click",
@@ -37,6 +39,26 @@ dev = [
 ]
 evaluation = ["evaluation@git+https://github.com/MinishLab/evaluation@main"]
 
+classifiers = [
+"Development Status :: 4 - Beta",
+"Intended Audience :: Developers",
+"Intended Audience :: Science/Research",
+"Topic :: Scientific/Engineering :: Artificial Intelligence",
+"Topic :: Software Development :: Libraries",
+"License :: OSI Approved :: MIT License",
+"Programming Language :: Python :: 3 :: Only",
+"Programming Language :: Python :: 3.10",
+"Programming Language :: Python :: 3.11",
+"Programming Language :: Python :: 3.12",
+"Natural Language :: English",
+]
+
+[project.urls]
+"Homepage" = "https://github.com/MinishLab"
+"Bug Reports" = "https://github.com/MinishLab/model2vec/issues"
+"Source" = "https://github.com/MinishLab/model2vec"
+
+
 [tool.ruff]
 exclude = [".venv/"]
 line-length = 120
