Commit 55f7bff

feature: updated alternative embedding options after testing (#196)
1 parent d91a12d commit 55f7bff

3 files changed

Lines changed: 76 additions & 12 deletions

EMBEDDINGS.md

Lines changed: 14 additions & 11 deletions
@@ -68,14 +68,15 @@ This option runs embedding models directly on your machine using the library.

 ### Recommended Models

-These are based on MTEB [datasets](https://huggingface.co/datasets/mteb/results) as of 13-Jun-2026.
+These are based on MTEB [datasets](https://huggingface.co/datasets/mteb/results) as of 15-Jun-2026. All listed models have been verified to work with the `sentence-transformers` provider in `cocoindex-code`.

 | Tier | Model | Params | Code Score | Best For |
 | :--- | :--- | :--- | :--- | :--- |
-| **Micro** | [`Snowflake/arctic-embed-xs`](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) | 22M | 0.67 | Old CPUs, minimal RAM usage. |
-| **Small** | [`ibm-granite/granite-embedding-97m-multilingual-r2`](https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2) | 97M | 0.80 | Modern laptops, multilingual code. |
-| **Medium** | [`jinaai/jina-embeddings-v5-text-nano`](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano) | 239M | **0.90** | **Performance sweet spot.** BERT-based (Fast). |
-| **High** | [`geevec-ai/geevec-embeddings-1.0-lite`](https://huggingface.co/geevec-ai/geevec-embeddings-1.0-lite) | 366M | **0.92** | Maximum local accuracy (needs GPU for speed). |
+| **Default** | [`Snowflake/arctic-embed-xs`](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) | 22M | 0.67 | Default |
+| **Micro** | [`lightonai/LateOn-Code-edge`](https://huggingface.co/lightonai/LateOn-Code-edge) | 17M | 0.82 | **Efficiency King.** Incredible code performance for its size. |
+| **Small** | [`lightonai/LateOn-Code`](https://huggingface.co/lightonai/LateOn-Code) | 149M | 0.85 | Great balance of speed and accuracy on modern laptops. |
+| **Medium** | [`microsoft/harrier-oss-v1-270m`](https://huggingface.co/microsoft/harrier-oss-v1-270m) | 270M | **0.90** | **Performance sweet spot.** High accuracy, runs well on CPUs. |
+| **Multi** | [`ibm-granite/granite-embedding-97m-multilingual-r2`](https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2) | 97M | 0.80 | Multilingual codebases (e.g. Code + Docs in different languages). |

 #### Other Model Options

@@ -190,8 +191,8 @@ envs:

 ## Choosing Based on Your Content

-- **Heavy Source Code**: Use **Jina v5 Nano** (Local) or **Voyage 4 Large** (Cloud). Both score >0.90 on code search benchmarks.
-- **Large Documentation / Files**: Models with large context windows (8k+ tokens) like **Jina v5** (32k) or **OpenAI v3 Large** (8k).
+- **Heavy Source Code**: Use **LateOn-Code** (Micro/Small) or **Harrier 270m** (Medium). Both score >0.85 on code search benchmarks.
+- **Large Documentation / Files**: Models with large context windows like **Voyage 4 Large** (Cloud) or **OpenAI v3 Large** (8k).
 - **Multilingual Projects**: **Granite 97m** (Small Local) or **Cohere Multilingual v3** (Cloud).

 ### Fine-Tuning with `indexing_params` and `query_params`
@@ -210,16 +211,18 @@ embedding:
     input_type: query
 ```

-**Example for Sentence-Transformers (Jina):**
+**Example for Sentence-Transformers (Harrier):**

 ```yaml
 embedding:
   provider: sentence-transformers
-  model: jinaai/jina-embeddings-v5-text-nano
+  model: microsoft/harrier-oss-v1-270m
+  # Most encoder-only models don't require explicit prompts,
+  # but some (like Nomic or BGE) do:
   indexing_params:
-    prompt_name: retrieval.passage
+    prompt_name: null
   query_params:
-    prompt_name: retrieval.query
+    prompt_name: null
 ```

 ---
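
To illustrate what the `prompt_name` settings in this hunk control, here is a minimal sketch. It assumes the `sentence-transformers` provider forwards `indexing_params` and `query_params` as keyword arguments to `SentenceTransformer.encode`; the diff does not confirm that forwarding. The helper names `encode_kwargs` and `embed` are hypothetical and do not come from `cocoindex-code`.

```python
from typing import Any, Mapping, Optional, Sequence


def encode_kwargs(params: Optional[Mapping[str, Any]]) -> dict:
    """Drop null entries so that `prompt_name: null` means "no prompt"."""
    return {key: value for key, value in (params or {}).items() if value is not None}


def embed(texts: Sequence[str], model_name: str,
          params: Optional[Mapping[str, Any]] = None):
    """Encode texts locally (hypothetical helper, not part of cocoindex-code)."""
    # Imported lazily: the library is heavy and downloads the model on first use.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # Assumption: config params map one-to-one onto encode() keyword arguments.
    return model.encode(list(texts), **encode_kwargs(params))


# `prompt_name: null` from the YAML above yields no extra keyword arguments.
harrier_kwargs = encode_kwargs({"prompt_name": None})
# A model that ships named prompts would keep the setting.
prompted_kwargs = encode_kwargs({"prompt_name": "query"})
```

Under this assumption, setting `prompt_name: null` is equivalent to omitting the key, which matches the comment added in the hunk about encoder-only models not needing explicit prompts.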

scripts/MTEB-RANKINGS.md

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,55 @@
1+
# MTEB Model Discovery Report
2+
3+
> **Data Freshness**: MTEB results dataset last updated on `2026-06-15`.
4+
5+
## Top Embedding Models for Code Search
6+
7+
### Tier: Micro (< 50M)
8+
9+
| Model | Code Search Score | General Retrieval Score | Params (M) |
10+
|-------------------------------------------|---------------------|---------------------------|--------------|
11+
| lightonai/LateOn-Code-edge | 0.816549 | nan | 17 |
12+
| lightonai/LateOn-Code-edge-pretrain | 0.791693 | nan | 16.798 |
13+
| thenlper/gte-small | 0.781565 | 0.479423 | 33 |
14+
| avsolatorio/GIST-small-Embedding-v0 | 0.772521 | 0.480646 | 33.36 |
15+
| avsolatorio/NoInstruct-small-Embedding-v0 | 0.770071 | 0.488884 | 33.36 |
16+
17+
### Tier: Small (< 150M)
18+
19+
| Model | Code Search Score | General Retrieval Score | Params (M) |
20+
|---------------------------------------------------|---------------------|---------------------------|--------------|
21+
| lightonai/LateOn-Code | 0.851318 | nan | 149 |
22+
| lightonai/LateOn-Code-pretrain | 0.832574 | nan | 149.016 |
23+
| ibm-granite/granite-embedding-97m-multilingual-r2 | 0.799971 | 0.446515 | 97 |
24+
| avsolatorio/GIST-Embedding-v0 | 0.78981 | 0.503411 | 109.482 |
25+
| thenlper/gte-base | 0.789403 | 0.496155 | 109 |
26+
27+
### Tier: Medium (< 500M)
28+
29+
| Model | Code Search Score | General Retrieval Score | Params (M) |
30+
|-------------------------------------------|---------------------|---------------------------|--------------|
31+
| geevec-ai/geevec-embeddings-1.0-lite | 0.92365 | 0.53474 | 366 |
32+
| jinaai/jina-embeddings-v5-text-nano | 0.90384 | 0.535934 | 239 |
33+
| microsoft/harrier-oss-v1-270m | 0.89605 | 0.425505 | 270 |
34+
| Shuu12121/CodeSearch-ModernBERT-Crow-Plus | 0.892957 | nan | 151.668 |
35+
| codefuse-ai/F2LLM-v2-330M | 0.842182 | 0.475202 | 334 |
36+
37+
### Tier: Large (> 500M)
38+
39+
| Model | Code Search Score | General Retrieval Score | Params (M) |
40+
|------------------------------------------|---------------------|---------------------------|--------------|
41+
| voyageai/voyage-4-large | 0.97726 | nan | nan |
42+
| voyageai/voyage-4-large (embed_dim=2048) | 0.97719 | nan | nan |
43+
| google/gemini-embedding-2-preview | 0.972905 | nan | nan |
44+
| microsoft/harrier-oss-v1-27b | 0.96994 | 0.483455 | 27009.3 |
45+
| Octen/Octen-Embedding-8B-INT8 | 0.967965 | nan | 7567.3 |
46+
47+
---
48+
49+
## How to Regenerate this Report
50+
51+
This report was generated using the `find_best_models.py` script. To update it with the latest live data from MTEB, run:
52+
53+
```bash
54+
uv run scripts/find_best_models.py --clear-cache --output MTEB-RANKINGS.md
55+
```
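
The tier headings in the report imply simple parameter-count cutoffs. The sketch below shows that bucketing, taking the thresholds (50M, 150M, 500M) from the headings. `tier_for_params` is a hypothetical name and is not the script's `categorize` function, whose body is not shown in this diff. Placing a missing parameter count in Large is an assumption, based on the models with `nan` parameter counts that appear in the Large table; so is treating exactly 500M as Large, since the headings leave that boundary undefined.

```python
import math
from typing import Optional


def tier_for_params(params_m: Optional[float]) -> str:
    """Map a parameter count (in millions) to a report tier name."""
    # Assumption: an unknown size is grouped with Large, as the report does
    # for models whose parameter count is listed as nan.
    if params_m is None or math.isnan(params_m):
        return "Large"
    if params_m < 50:
        return "Micro"
    if params_m < 150:
        return "Small"
    if params_m < 500:
        return "Medium"
    # Assumption: exactly 500M counts as Large.
    return "Large"


# Parameter counts taken from the report tables above.
example_tiers = {
    "lightonai/LateOn-Code-edge": tier_for_params(17),
    "lightonai/LateOn-Code": tier_for_params(149),
    "Shuu12121/CodeSearch-ModernBERT-Crow-Plus": tier_for_params(151.668),
    "microsoft/harrier-oss-v1-27b": tier_for_params(27009.3),
    "voyageai/voyage-4-large": tier_for_params(math.nan),
}
```

With these cutoffs, `lightonai/LateOn-Code` at 149M falls just inside Small, while `Shuu12121/CodeSearch-ModernBERT-Crow-Plus` at 151.668M falls into Medium, consistent with where the report lists them.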

scripts/find_best_models.py

Lines changed: 7 additions & 1 deletion
@@ -163,6 +163,12 @@ def main():
 "jinaai/jina-embeddings-v5-text-nano": 239,
 "ibm-granite/granite-embedding-97m-multilingual-r2": 97,
 "geevec-ai/geevec-embeddings-1.0-lite": 366,
+"lightonai/LateOn-Code-edge": 17,
+"lightonai/LateOn-Code": 149,
+"microsoft/harrier-oss-v1-270m": 270,
+"thenlper/gte-small": 33,
+"thenlper/gte-base": 109,
+"codefuse-ai/F2LLM-v2-330M": 334,
 }

 def categorize(size):
@@ -178,7 +184,7 @@ def categorize(size):

 print("Analyzing top candidates to determine hardware tiers...", file=sys.stderr)
 results["max_score"] = results[["score_general", "score_code"]].max(axis=1)
-results = results.sort_values(by="max_score", ascending=False).head(500)
+results = results.sort_values(by="max_score", ascending=False).head(1000)

 results["size_mb"] = results["model_name"].map(known_sizes)

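
The changed line widens the candidate pool from the top 500 to the top 1000 models by best score. For readers without pandas at hand, here is a standard-library sketch of the same ranking step. It relies on pandas defaults: `max(axis=1)` skips NaN, and `sort_values` places NaN last. The function names `best_score` and `top_candidates` are hypothetical, and the sample rows reuse scores from `MTEB-RANKINGS.md`.

```python
import math


def best_score(score_general: float, score_code: float) -> float:
    """Return the larger score, ignoring NaN; NaN only if both are missing."""
    present = [s for s in (score_general, score_code) if not math.isnan(s)]
    return max(present) if present else math.nan


def top_candidates(rows, limit):
    """Rank rows by best score, descending, and keep the first `limit`."""
    scored = [
        {**row, "max_score": best_score(row["score_general"], row["score_code"])}
        for row in rows
    ]
    # Rows without any score sort last; the rest sort from high to low.
    scored.sort(
        key=lambda r: (
            math.isnan(r["max_score"]),
            0.0 if math.isnan(r["max_score"]) else -r["max_score"],
        )
    )
    return scored[:limit]


rows = [
    {"model_name": "thenlper/gte-small",
     "score_general": 0.479423, "score_code": 0.781565},
    {"model_name": "lightonai/LateOn-Code-edge",
     "score_general": math.nan, "score_code": 0.816549},
    {"model_name": "microsoft/harrier-oss-v1-270m",
     "score_general": 0.425505, "score_code": 0.89605},
]
top_two = [row["model_name"] for row in top_candidates(rows, limit=2)]
```

A larger `limit` matters because small, code-specialised models can rank low on this combined score and would otherwise be cut before tiering.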
