Commit 0b9f065

Readme reread me (#20)
* Add to readme
* Fix section, add push_to_hub example
* Add related work to ToC
* WIP
* Add finetuning section, license
* Remove finetuning
1 parent ecde199 commit 0b9f065

1 file changed

Lines changed: 55 additions & 50 deletions

README.md

@@ -11,13 +11,13 @@
1111
- [Quickstart](#quickstart)
1212
- [What is Model2Vec?](#what-is-model2vec)
1313
- [Main Features](#main-features)
14-
- [Who is this for?](#who-is-this-for)
1514
- [Usage](#usage)
1615
- [Distilling a Model2Vec model](#distilling-a-model2vec-model)
17-
- [Inferencing a Model2Vec model](#inferencing-a-model2vec-model)
16+
    - [Inference with a Model2Vec model](#inference-with-a-model2vec-model)
1817
- [Evaluating a Model2Vec model](#evaluating-a-model2vec-model)
1918
- [Model List](#model-list)
2019
- [Results](#results)
20+
- [Related Work](#related-work)
2121
- [Citing](#citing)
2222

2323
## Quickstart
@@ -39,7 +39,9 @@ model = StaticModel.from_pretrained(model_name)
3939
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everyone."])
4040
```
4141

42-
Alternatively, you can distill your own Model2Vec model from a Sentence Transformer model. The following code snippet shows how to distill a model:
42+
And that's it. You can use the model to classify texts, to cluster, or to build a RAG system.
43+
44+
Instead of using one of our models, you can distill your own Model2Vec model from a Sentence Transformer model. The following code snippet shows how to distill a model:
4345
```python
4446
from model2vec.distill import distill
4547

@@ -53,37 +55,39 @@ m2v_model = distill(model_name=model_name, pca_dims=256)
5355
m2v_model.save_pretrained("m2v_model")
5456
```
5557

56-
## What is Model2Vec?
57-
Model2Vec is a simple and effective method to distill any sentence transformer into static embeddings. It works by inferencing a vocabulary with the specified Sentence Transformer model, reducing the dimensionality of the embeddings using PCA, weighting the embeddings using zipf weighting, and storing the embeddings in a static format. When a vocabulary is passed, a word-level tokenizer is created on the fly based on the vocabulary. When output embeddings are used, the subword tokenizer from the Sentence Transformer is used.
58+
Distillation is really fast, and only takes about 30 seconds on a 2024 MacBook using the MPS backend. Best of all, distillation requires no training data.
5859

59-
This technique creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on a a number of relevent tasks, while being much faster to create than traditional static embedding models such as GloVe, without need for a dataset.
60+
## What is Model2Vec?
6061

61-
## Main Features
62-
- **Small**: Model2Vec can reduce the size of a Sentence Transformer model by a factor of 15 *.
63-
- **Fast distillation**: Model2Vec can distill a Sentence Transformer model in ~5 minutes on CPU *.
64-
- **Fast inference**: Model2Vec creates static embeddings that are up to 500 times * faster than the original model.
65-
- **State-of-the-art static embedding performance**: Model2Vec outperforms traditional static embeddings by a large margin on a number of benchmarks.
66-
- **No data needed**: Distillation happens directly on a token leven, so no dataset is needed.
67-
- **Simple to use**: Model2Vec provides an easy to use interface for distilling and inferencing Model2Vec models.
68-
- **Bring your own model**: Model2Vec can be applied to any Sentence Transformer model.
69-
- **Bring your own vocabulary**: Model2Vec can be applied to any vocabulary, allowing you to use your own domain-specific vocabulary.
70-
- **Multi-lingual**: Model2Vec can easily be applied to any language.
71-
- **Tightly integrated with HuggingFace hub**: Model2Vec models can be easily shared and loaded from the HuggingFace hub. Our models can be found [here](https://huggingface.co/minishlab).
72-
- **Easy Evaluation**: Model2Vec comes with a set of evaluation tasks to measure the performance of the distilled model.
62+
Model2Vec creates a small, fast, and powerful model that outperforms other static embedding models by a large margin on all tasks we could find, while being much faster to create than traditional static embedding models such as GloVe. Best of all, you don't need _any_ data to distill a model using Model2Vec.
7363

74-
\* Based on the [bge-base-en-v1.5 model](https://huggingface.co/BAAI/bge-base-en-v1.5).
64+
It works by passing a vocabulary through a sentence transformer model, then reducing the dimensionality of the resulting embeddings using PCA, and finally weighting the embeddings using Zipf weighting. During inference, we simply take the mean of all token embeddings occurring in a sentence.
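
The steps described above can be sketched with NumPy. This is an illustration under stated assumptions, not the Model2Vec implementation: the function names are hypothetical, random vectors stand in for real transformer outputs, and the weight `log(1 + rank)` is only one plausible form of Zipf weighting that assumes rows are sorted from most to least frequent token.

```python
import numpy as np


def pca_reduce(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Project row vectors onto their top `dims` principal components."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dims].T


def zipf_weight(embeddings: np.ndarray) -> np.ndarray:
    """Scale row i by log(1 + rank), with rank = i + 1.

    Assumption: rows are sorted from most to least frequent token,
    so frequent tokens receive smaller weights.
    """
    ranks = np.arange(1, embeddings.shape[0] + 1)
    return embeddings * np.log(1 + ranks)[:, None]


def encode(token_ids: list[int], table: np.ndarray) -> np.ndarray:
    """Mean of the static embeddings of the tokens in one sentence."""
    return table[token_ids].mean(axis=0)


rng = np.random.default_rng(0)
raw = rng.normal(size=(1000, 64))  # stand-in for per-token transformer outputs
table = zipf_weight(pca_reduce(raw, dims=16))
sentence = encode([3, 17, 256], table)
print(table.shape, sentence.shape)  # (1000, 16) (16,)
```

Because the table is computed once, encoding a sentence at inference time reduces to a lookup and an average, which is where the speed comes from.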
7565

76-
## Who is this for?
77-
Model2Vec allows anyone to create their own static embeddings from any Sentence Transformer model in minutes. It can easily be applied to other languages by using a language-specific Sentence Transformer model and vocab. Similarly, it can be applied to specific domains by using a domain specific model, vocab, or both. This makes it an ideal tool for fast prototyping, research, and production use cases where speed and size are more important than performance.
66+
Model2Vec has two modes:
67+
- **Output**: behaves much like a real sentence transformer, i.e., it uses a subword tokenizer and encodes all wordpieces. This is really quick to create, very small (30 MB), but might be less performant on some tasks.
68+
- **Vocab**: behaves much like GloVe or regular word2vec vectors, albeit with much better performance. These models are a bit bigger, depending on your vocabulary size, but still very fast, and are useful in situations in which you have a bit more RAM, but still need to go fast.
7869

70+
## Main Features
7971

72+
Model2Vec is:
8073

74+
- **Small**: reduces the size of a Sentence Transformer model by a factor of 15, from 120M params, down to 7.5M (30 MB on disk!).
75+
- **Static, but better**: smaller than GloVe, but much more performant, even with the same vocabulary.
76+
- **Fast distillation**: make your own model in 30 seconds.
77+
- **Fast inference**: up to 500 times faster on CPU than the original model. Go green or go home.
78+
- **No data needed**: Distillation happens directly on the token level, so no dataset is needed.
79+
- **Simple to use**: An easy-to-use interface for distillation and inference.
80+
- **Bring your own model**: Can be applied to any Sentence Transformer model.
81+
- **Bring your own vocabulary**: Can be applied to any vocabulary, allowing you to use your own domain-specific vocabulary. Need biomedical? Just get a medical dictionary and a biomedical model, and distill.
82+
- **Multi-lingual**: Use any language. Need a French model? [Pick one](https://huggingface.co/models?library=sentence-transformers&language=fr&sort=trending). Need multilingual? [Here you go](https://huggingface.co/sentence-transformers/LaBSE).
83+
- **Tightly integrated with HuggingFace hub**: easily share and load models from the HuggingFace hub, using the familiar `from_pretrained` and `push_to_hub`. Our own models can be found [here](https://huggingface.co/minishlab). Feel free to share your own.
84+
- **Easy Evaluation**: evaluate your models on MTEB and some of our own tasks to measure the performance of the distilled model. Model2Vec models work out of the box on [MTEB](https://huggingface.co/spaces/mteb/leaderboard).
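
The size figures in the list above can be sanity-checked with a little arithmetic. The sketch below assumes a 30,522-entry WordPiece vocabulary (the size used by BERT-style English models), 256 dimensions as in the Quickstart's `pca_dims=256`, and float32 storage. These values are assumptions for illustration, not numbers read from the Model2Vec code.

```python
# Assumed values, not taken from the Model2Vec source code.
VOCAB_SIZE = 30_522        # WordPiece vocabulary size of BERT-style English models
PCA_DIMS = 256             # matches pca_dims=256 in the Quickstart
BYTES_PER_FLOAT32 = 4


def static_model_size(vocab_size: int, dims: int, bytes_per_value: int = BYTES_PER_FLOAT32):
    """Return (parameter count, size in bytes) of a vocab_size x dims embedding table."""
    params = vocab_size * dims
    return params, params * bytes_per_value


params, size_bytes = static_model_size(VOCAB_SIZE, PCA_DIMS)
print(f"{params / 1e6:.2f}M parameters, {size_bytes / 1e6:.1f} MB")
# prints "7.81M parameters, 31.3 MB"
```

Under these assumptions the embedding table lands in the same range as the roughly 7.5M parameters and 30 MB quoted above.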
8185

8286
## Usage
8387

8488
### Distilling a Model2Vec model
8589

86-
Distilling a model from the output embeddings of a Sentence Transformer model:
90+
The following distills a model from the output embeddings of a Sentence Transformer model. As mentioned above, this leads to a really small model that might be less performant.
8791
```python
8892
from model2vec.distill import distill
8993

@@ -98,7 +102,7 @@ m2v_model.save_pretrained("m2v_model")
98102

99103
```
100104

101-
Distilling with a custom vocabulary:
105+
If you pass a vocabulary, you get a set of static word embeddings, together with a custom tokenizer for exactly that vocabulary. This is comparable to how you would use GloVe or traditional word2vec, but doesn't actually require a corpus or data.
102106
```python
103107
from model2vec.distill import distill
104108

@@ -112,42 +116,32 @@ m2v_model = distill(model_name=model_name, vocabulary=vocabulary, pca_dims=None)
112116

113117
# Save the model
114118
m2v_model.save_pretrained("m2v_model")
119+
120+
# Or push it to the hub
121+
m2v_model.push_to_hub("my_organization/my_model", token="<it's a secret to everybody>")
115122
```
116123

117-
Alternatively, the command line interface can be used to distill a model:
124+
We also provide a command line interface for distillation. Note that `vocab.txt` should be a file with one word per line.
118125
```bash
119126
python3 -m model2vec.distill --model-name BAAI/bge-base-en-v1.5 --vocabulary-path vocab.txt --device mps --save-path model2vec_model
120127
```
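
A file in the one-word-per-line format can be produced with a few lines of Python. This is a minimal sketch: `write_vocab` and `read_vocab` are hypothetical helpers, not part of Model2Vec, and stripping whitespace and dropping duplicates are choices made here rather than documented requirements of the CLI.

```python
import tempfile
from pathlib import Path


def write_vocab(words, path):
    """Write unique, non-empty words to `path`, one per line, keeping order."""
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


def read_vocab(path):
    """Read a one-word-per-line file back into a list of words."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Round-trip through a temporary file to show the format.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = Path(tmp_dir) / "vocab.txt"
    write_vocab(["apple", " banana ", "", "apple", "cherry"], vocab_path)
    loaded = read_vocab(vocab_path)

print(loaded)  # ['apple', 'banana', 'cherry']
```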
121128

122-
### Inferencing a Model2Vec model
123-
Inferencing with one of our flagship Model2Vec models:
129+
### Inference with a Model2Vec model
130+
Inference works as follows. The example shows one of our own models, but you can also just load a local one, or another one from the hub.
124131
```python
125132
from model2vec import StaticModel
126133

127-
# Load a model from the HuggingFace hub
134+
# Load a model from the HuggingFace hub, or a local one.
128135
model_name = "minishlab/M2V_base_output"
129136
model = StaticModel.from_pretrained(model_name)
130137

131138
# Make embeddings
132-
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everyone."])
133-
```
134-
135-
136-
Inferencing with a saved Model2Vec model:
137-
```python
138-
from model2vec import StaticModel
139-
140-
# Load a saved model
141-
model_name = "m2v_model"
142-
model = StaticModel.from_pretrained(model_name)
143-
144-
# Make embeddings
145-
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everyone."])
139+
embeddings = model.encode(["It's dangerous to go alone!", "It's a secret to everybody."])
146140
```
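
Once you have embeddings, a common next step for clustering or retrieval is to compare them with cosine similarity. The sketch below uses plain Python and made-up three-dimensional vectors in place of real model output. The helper names are hypothetical, and it assumes each embedding can be treated as a flat sequence of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for sentence embeddings.
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(query, candidates)
print(ranking)  # [1, 2, 0]
```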
147141

148142
### Evaluating a Model2Vec model
149143

150-
Model2Vec models can be evaluated using our [evaluation package](https://github.com/MinishLab/evaluation). To run this, first install the optional evaluation package:
144+
Our models can be evaluated using our [evaluation package](https://github.com/MinishLab/evaluation). To run this, first install the optional evaluation package:
151145
```bash
152146
pip install evaluation@git+https://github.com/MinishLab/evaluation@main
153147
```
@@ -195,10 +189,7 @@ print(make_leaderboard(task_scores))
195189

196190
### Main Results
197191

198-
Model2Vec is evaluated on MTEB, as well as two additional tasks: PEARL (a phrase representation task) and WordSim (a word similarity task). The results are shown in the table below.
199-
200-
201-
192+
Model2Vec is evaluated on MTEB, as well as two additional tasks: [PEARL](https://github.com/tigerchen52/PEARL) (a phrase representation task) and WordSim (a collection of _word_ similarity tasks). The results are shown in the table below.
202193

203194
| Model | Avg (All) | Avg (MTEB) | Class | Clust | PairClass | Rank | Ret | STS | Sum | PEARL | WordSim |
204195
|------------------|-------------|------------|-------|-------|-----------|-------|-------|-------|-------|-------|---------|
@@ -222,11 +213,11 @@ For readability, the MTEB task names are abbreviated as follows:
222213
</details>
223214

224215
\
225-
\* WL256, introduced in the [WordLlama](https://github.com/dleemiller/WordLlama/tree/main) package is included for comparison due to its similarities to Model2Vec. However, we believe it is heavily overfit to the MTEB dataset since it is trained on datasets used in MTEB itself. This can be seen by the fact that the WL256 model performs much worse on the non-MTEB tasks (PEARL and WordSim) than our models. The results shown in the [Classification and Speed Benchmarks](#classification-and-speed-benchmarks) further support this.
216+
\* WL256, introduced in the [WordLlama](https://github.com/dleemiller/WordLlama/tree/main) package, is included for comparison due to its similarities to Model2Vec. However, we believe it is heavily overfit to the MTEB dataset since it is trained on datasets used in MTEB itself. This can be seen by the fact that the WL256 model performs much worse on the non-MTEB tasks (PEARL and WordSim) than our models and GloVe. The results shown in the [Classification and Speed Benchmarks](#classification-and-speed-benchmarks) further support this.
226217

227218
### Classification and Speed Benchmarks
228219

229-
In addition to the MTEB evaluation, Model2Vec is evaluated on a number of classification datasets. These are used as additional analysis to avoid overfitting to the MTEB dataset and to benchmark the speed of the model. The results are shown in the table below.
220+
In addition to the MTEB evaluation, we evaluate Model2Vec on a number of classification datasets. These are used as additional evidence to avoid overfitting to the MTEB dataset and to benchmark the speed of the model. The results are shown in the table below.
230221

231222
| model | Average | sst2 | imdb | trec | ag_news |
232223
|:-----------------|----------:|---------:|-------:|---------:|----------:|
@@ -237,14 +228,28 @@ In addition to the MTEB evaluation, Model2Vec is evaluated on a number of classi
237228
| WL256 | 78.48 | 76.88 | 80.12 | 69.23 | 87.68 |
238229
| GloVe_300d | 77.77 | 81.68 | 84.00 | 55.67 | 89.71 |
239230

240-
As can be seen, the Model2Vec models outperforms the GloVe and WL256 models on all classification tasks, and is competitive with the all-MiniLM-L6-v2 model while being much faster.
231+
As can be seen, Model2Vec models outperform the GloVe and WL256 models on all classification tasks, and are competitive with the all-MiniLM-L6-v2 model, while being much faster.
241232

242-
The scatterplot below shows the relationship between the number of sentences per second and the average classification score. The bubble sizes correspond to the number of parameters in the models (larger = more parameters), and the colors correspond to the sentences per second (greener = more sentences per second). This plot shows that the Model2Vec models are much faster than the other models, while still being competitive in terms of classification performance with the all-MiniLM-L6-v2 model.
233+
The figure below shows the relationship between the number of sentences per second and the average classification score. The circle sizes correspond to the number of parameters in the models (larger = more parameters).
234+
This plot shows that the Model2Vec models are much faster than the other models, while still being competitive in terms of classification performance with the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model.
243235

244236
| ![Description](assets/images/speed_vs_accuracy.png) |
245237
|:--:|
246238
|*Figure: The average accuracy over all classification datasets plotted against sentences per second. The circle size indicates model size.*|
247239

240+
## Related work
241+
242+
If you are interested in fast small models, also consider looking at these techniques:
243+
* [BPEmb](https://bpemb.h-its.org/): GloVe embeddings trained on BPE-encoded Wikipedias. Huge inspiration to this project, multilingual, very fast. If you don't find a sentence transformer in the language you need, check this out.
244+
* [fast-sentence-transformers](https://github.com/davidberenstein1957/fast-sentence-transformers): distillation using Model2Vec comes at a cost. If that cost is too steep for you, and you have access to a GPU, this package is for you. It automates the quantization and optimization of sentence transformers without loss of performance.
245+
* [wordllama](https://github.com/dleemiller/WordLlama): Uses the _input_ embeddings of a Llama 2 model and then performs contrastive learning on these embeddings. As we show above, we think this is a bit overfit on MTEB, as the model is trained on MTEB datasets, and only evaluated on MTEB. It provides an interesting point of comparison to Model2Vec, and, fun fact, was invented at the same time.
246+
247+
If you find other related work, please let us know.
248+
249+
## License
250+
251+
MIT
252+
248253
## Citing
249254

250255
If you use Model2Vec in your research, please cite the following:
