Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 14 additions & 11 deletions EMBEDDINGS.md
Original file line number Diff line number Diff line change
Expand Up @@ -68,14 +68,15 @@ This option runs embedding models directly on your machine using the library.

### Recommended Models

These are based on MTEB [datasets](https://huggingface.co/datasets/mteb/results) as of 13-Jun-2026.
These are based on MTEB [datasets](https://huggingface.co/datasets/mteb/results) as of 15-Jun-2026. All listed models have been verified to work with the `sentence-transformers` provider in `cocoindex-code`.

| Tier | Model | Params | Code Score | Best For |
| :--- | :--- | :--- | :--- | :--- |
| **Micro** | [`Snowflake/arctic-embed-xs`](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) | 22M | 0.67 | Old CPUs, minimal RAM usage. |
| **Small** | [`ibm-granite/granite-embedding-97m-multilingual-r2`](https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2) | 97M | 0.80 | Modern laptops, multilingual code. |
| **Medium** | [`jinaai/jina-embeddings-v5-text-nano`](https://huggingface.co/jinaai/jina-embeddings-v5-text-nano) | 239M | **0.90** | **Performance sweet spot.** BERT-based (Fast). |
| **High** | [`geevec-ai/geevec-embeddings-1.0-lite`](https://huggingface.co/geevec-ai/geevec-embeddings-1.0-lite) | 366M | **0.92** | Maximum local accuracy (needs GPU for speed). |
| **Default** | [`Snowflake/arctic-embed-xs`](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs) | 22M | 0.67 | Default |
| **Micro** | [`lightonai/LateOn-Code-edge`](https://huggingface.co/lightonai/LateOn-Code-edge) | 17M | 0.82 | **Efficiency King.** Incredible code performance for its size. |
| **Small** | [`lightonai/LateOn-Code`](https://huggingface.co/lightonai/LateOn-Code) | 149M | 0.85 | Great balance of speed and accuracy on modern laptops. |
| **Medium** | [`microsoft/harrier-oss-v1-270m`](https://huggingface.co/microsoft/harrier-oss-v1-270m) | 270M | **0.90** | **Performance sweet spot.** High accuracy, runs well on CPUs. |
| **Multi** | [`ibm-granite/granite-embedding-97m-multilingual-r2`](https://huggingface.co/ibm-granite/granite-embedding-97m-multilingual-r2) | 97M | 0.80 | Multilingual codebases (e.g. Code + Docs in different languages). |

#### Other Model Options

Expand Down Expand Up @@ -190,8 +191,8 @@ envs:

## Choosing Based on Your Content

- **Heavy Source Code**: Use **Jina v5 Nano** (Local) or **Voyage 4 Large** (Cloud). Both score >0.90 on code search benchmarks.
- **Large Documentation / Files**: Models with large context windows (8k+ tokens) like **Jina v5** (32k) or **OpenAI v3 Large** (8k).
- **Heavy Source Code**: Use **LateOn-Code** (Micro/Small) or **Harrier 270m** (Medium). Both score >0.85 on code search benchmarks.
- **Large Documentation / Files**: Models with large context windows like **Voyage 4 Large** (Cloud) or **OpenAI v3 Large** (8k).
- **Multilingual Projects**: **Granite 97m** (Small Local) or **Cohere Multilingual v3** (Cloud).

### Fine-Tuning with `indexing_params` and `query_params`
Expand All @@ -210,16 +211,18 @@ embedding:
input_type: query
```

**Example for Sentence-Transformers (Jina):**
**Example for Sentence-Transformers (Harrier):**

```yaml
embedding:
provider: sentence-transformers
model: jinaai/jina-embeddings-v5-text-nano
model: microsoft/harrier-oss-v1-270m
# Most encoder-only models don't require explicit prompts,
# but some (like Nomic or BGE) do:
indexing_params:
prompt_name: retrieval.passage
prompt_name: null
query_params:
prompt_name: retrieval.query
prompt_name: null
```

---
Expand Down
55 changes: 55 additions & 0 deletions scripts/MTEB-RANKINGS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
# MTEB Model Discovery Report

> **Data Freshness**: MTEB results dataset last updated on `2026-06-15`.

## Top Embedding Models for Code Search

### Tier: Micro (< 50M)

| Model | Code Search Score | General Retrieval Score | Params (M) |
|-------------------------------------------|---------------------|---------------------------|--------------|
| lightonai/LateOn-Code-edge | 0.816549 | nan | 17 |
| lightonai/LateOn-Code-edge-pretrain | 0.791693 | nan | 16.798 |
| thenlper/gte-small | 0.781565 | 0.479423 | 33 |
| avsolatorio/GIST-small-Embedding-v0 | 0.772521 | 0.480646 | 33.36 |
| avsolatorio/NoInstruct-small-Embedding-v0 | 0.770071 | 0.488884 | 33.36 |

### Tier: Small (< 150M)

| Model | Code Search Score | General Retrieval Score | Params (M) |
|---------------------------------------------------|---------------------|---------------------------|--------------|
| lightonai/LateOn-Code | 0.851318 | nan | 149 |
| lightonai/LateOn-Code-pretrain | 0.832574 | nan | 149.016 |
| ibm-granite/granite-embedding-97m-multilingual-r2 | 0.799971 | 0.446515 | 97 |
| avsolatorio/GIST-Embedding-v0 | 0.78981 | 0.503411 | 109.482 |
| thenlper/gte-base | 0.789403 | 0.496155 | 109 |

### Tier: Medium (< 500M)

| Model | Code Search Score | General Retrieval Score | Params (M) |
|-------------------------------------------|---------------------|---------------------------|--------------|
| geevec-ai/geevec-embeddings-1.0-lite | 0.92365 | 0.53474 | 366 |
| jinaai/jina-embeddings-v5-text-nano | 0.90384 | 0.535934 | 239 |
| microsoft/harrier-oss-v1-270m | 0.89605 | 0.425505 | 270 |
| Shuu12121/CodeSearch-ModernBERT-Crow-Plus | 0.892957 | nan | 151.668 |
| codefuse-ai/F2LLM-v2-330M | 0.842182 | 0.475202 | 334 |

### Tier: Large (> 500M)

| Model | Code Search Score | General Retrieval Score | Params (M) |
|------------------------------------------|---------------------|---------------------------|--------------|
| voyageai/voyage-4-large | 0.97726 | nan | nan |
| voyageai/voyage-4-large (embed_dim=2048) | 0.97719 | nan | nan |
| google/gemini-embedding-2-preview | 0.972905 | nan | nan |
| microsoft/harrier-oss-v1-27b | 0.96994 | 0.483455 | 27009.3 |
| Octen/Octen-Embedding-8B-INT8 | 0.967965 | nan | 7567.3 |

---

## How to Regenerate this Report

This report was generated using the `find_best_models.py` script. To update it with the latest live data from MTEB, run:

```bash
uv run scripts/find_best_models.py --clear-cache --output MTEB-RANKINGS.md
```
8 changes: 7 additions & 1 deletion scripts/find_best_models.py
Original file line number Diff line number Diff line change
Expand Up @@ -163,6 +163,12 @@ def main():
"jinaai/jina-embeddings-v5-text-nano": 239,
"ibm-granite/granite-embedding-97m-multilingual-r2": 97,
"geevec-ai/geevec-embeddings-1.0-lite": 366,
"lightonai/LateOn-Code-edge": 17,
"lightonai/LateOn-Code": 149,
"microsoft/harrier-oss-v1-270m": 270,
"thenlper/gte-small": 33,
"thenlper/gte-base": 109,
"codefuse-ai/F2LLM-v2-330M": 334,
}

def categorize(size):
Expand All @@ -178,7 +184,7 @@ def categorize(size):

print("Analyzing top candidates to determine hardware tiers...", file=sys.stderr)
results["max_score"] = results[["score_general", "score_code"]].max(axis=1)
results = results.sort_values(by="max_score", ascending=False).head(500)
results = results.sort_values(by="max_score", ascending=False).head(1000)

results["size_mb"] = results["model_name"].map(known_sizes)

Expand Down
Loading