add: ogma model results #525

Open
sam-at-axiotic wants to merge 10 commits into embeddings-benchmark:main from sam-at-axiotic:adds-ogma-results
Conversation

@sam-at-axiotic

@sam-at-axiotic sam-at-axiotic commented May 7, 2026

Adds MTEB result JSONs for the Axiotic Ogma models.

AI assistance disclosure: initial file organization was prepared with AI assistance; I reviewed the submitted files and take responsibility for this PR.

I confirm the result-submission integrity declaration is made by me as the human submitter.

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/, this can be as an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR 4620
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset including training splits. If I have, I have disclosed it clearly.

@sam-at-axiotic sam-at-axiotic changed the title from "Adds ogma results" to "add: ogma model results" May 7, 2026
@sam-at-axiotic sam-at-axiotic marked this pull request as ready for review May 7, 2026 17:50
@sam-at-axiotic sam-at-axiotic marked this pull request as draft May 8, 2026 11:57
Renames each axiotic__ogma-* sha subdir to the HF revision now pinned by
mteb PR #4620 (commit 1887fc06), which switched the wrapper to use
OgmaTokenizerFast for canonical special-token handling:

  ogma-micro: d9d323709a... -> c9a793dacd...
  ogma-mini:  300b6184ef... -> 580266301b...
  ogma-small: 9c3f997130... -> 761deba3f4...
  ogma-base:  7524c6e1b2... -> 6c9cd11d41...

Updates revision field in each model_meta.json to match, and harmonizes
n_parameters with the mteb wrapper (off-by-N counting was the only delta:
ogma-small 8596544 -> 8596352, etc).

JSON task scores are unchanged: the encoding fix in mteb produces
bit-identical output to the canonical model.embed() API on the new
revisions, so existing scores remain the canonical numbers.
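
The rename-and-repin step described in this commit message could be scripted roughly as follows. This is only a sketch: it assumes each model directory holds one subdirectory per revision with a model_meta.json inside, the helper name repin_revision is hypothetical, and the revision strings in the demo are placeholders rather than the truncated hashes listed above.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def repin_revision(model_dir: Path, old_rev: str, new_rev: str,
                   n_parameters: Optional[int] = None) -> Path:
    """Rename <model_dir>/<old_rev> to <model_dir>/<new_rev> and sync model_meta.json."""
    old_dir, new_dir = model_dir / old_rev, model_dir / new_rev
    if new_dir.exists():
        raise FileExistsError(f"{new_dir} already exists")
    old_dir.rename(new_dir)  # task-score JSONs move along with the directory

    meta_path = new_dir / "model_meta.json"
    meta = json.loads(meta_path.read_text())
    meta["revision"] = new_rev  # keep the field in step with the directory name
    if n_parameters is not None:
        meta["n_parameters"] = n_parameters
    meta_path.write_text(json.dumps(meta, indent=2) + "\n")
    return new_dir


# Demo on a throwaway directory. "old-rev" and "new-rev" are placeholders,
# not real Hugging Face revision hashes.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "axiotic__ogma-small"
    (model_dir / "old-rev").mkdir(parents=True)
    (model_dir / "old-rev" / "model_meta.json").write_text(
        json.dumps({"revision": "old-rev", "n_parameters": 8596544})
    )
    new_dir = repin_revision(model_dir, "old-rev", "new-rev", n_parameters=8596352)
    updated_meta = json.loads((new_dir / "model_meta.json").read_text())
    old_dir_still_exists = (model_dir / "old-rev").exists()
```

Because only the directory name and two metadata fields change, the task-score files are carried over byte for byte, which matches the claim that the scores themselves are unchanged.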
@sam-at-axiotic sam-at-axiotic marked this pull request as ready for review May 13, 2026 10:44
@github-actions

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: axiotic/ogma-base, axiotic/ogma-micro, axiotic/ogma-mini, axiotic/ogma-small
Tasks: AmazonCounterfactualClassification, AmazonPolarityClassification, AmazonReviewsClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, AskUbuntuDupQuestions, BIOSSES, Banking77Classification, BiorxivClusteringP2P, BiorxivClusteringS2S, CQADupstackAndroidRetrieval, CQADupstackEnglishRetrieval, CQADupstackGamingRetrieval, CQADupstackGisRetrieval, CQADupstackMathematicaRetrieval, CQADupstackPhysicsRetrieval, CQADupstackProgrammersRetrieval, CQADupstackRetrieval, CQADupstackStatsRetrieval, CQADupstackTexRetrieval, CQADupstackUnixRetrieval, CQADupstackWebmastersRetrieval, CQADupstackWordpressRetrieval, ClimateFEVER, DBPedia, EmotionClassification, FEVER, FiQA2018, HotpotQA, ImdbClassification, MSMARCO, MTOPDomainClassification, MTOPIntentClassification, MassiveIntentClassification, MassiveScenarioClassification, MedrxivClusteringP2P, MedrxivClusteringS2S, MindSmallReranking, NFCorpus, NQ, QuoraRetrieval, RedditClustering, RedditClusteringP2P, SCIDOCS, SICK-R, STS12, STS13, STS14, STS15, STS16, STSBenchmark, SciDocsRR, SciFact, SprintDuplicateQuestions, StackExchangeClustering, StackExchangeClusteringP2P, StackOverflowDupQuestions, SummEval, TRECCOVID, Touche2020, ToxicConversationsClassification, TweetSentimentExtractionClassification, TwentyNewsgroupsClustering, TwitterSemEval2015, TwitterURLCorpus

Results for axiotic/ogma-base

task_name axiotic/ogma-base google/gemini-embedding-001 intfloat/multilingual-e5-large Max result Model with max result In Training Data
AmazonCounterfactualClassification 0.7025 0.9200 nan 0.9903 Bytedance/Seed1.6-embedding-1215 False
AmazonPolarityClassification 0.7985 nan 0.9326 0.9774 nvidia/NV-Embed-v2 False
AmazonReviewsClassification 0.3943 nan nan 0.6880 TencentBAC/Conan-embedding-v2 False
ArXivHierarchicalClusteringP2P 0.5583 0.6492 0.5569 0.6869 NovaSearch/jasper_en_vision_language_v1 False
ArXivHierarchicalClusteringS2S 0.5273 0.6384 0.5367 0.6548 Qwen/Qwen3-Embedding-8B False
ArguAna 0.45 0.8644 0.5436 0.8979 voyageai/voyage-3-m-exp False
AskUbuntuDupQuestions 0.5676 0.6424 0.5924 0.7528 IEITYuan/Yuan-embedding-2.0-en False
BIOSSES 0.8415 0.8897 0.8457 0.9692 Gameselo/STS-multilingual-mpnet-base-v2 False
Banking77Classification 0.7856 0.9427 0.7492 0.9427 google/gemini-embedding-001 False
BiorxivClusteringP2P 0.3411 nan 0.355 0.5522 TencentBAC/Conan-embedding-v2 False
BiorxivClusteringS2S 0.2634 nan 0.333 0.5092 TencentBAC/Conan-embedding-v2 False
CQADupstackAndroidRetrieval 0.3728 nan 0.4904 0.7426 voyageai/voyage-3-m-exp False
CQADupstackEnglishRetrieval 0.3491 nan 0.4581 0.6998 voyageai/voyage-3-m-exp False
CQADupstackGamingRetrieval 0.4491 0.7068 0.587 0.8161 IEITYuan/Yuan-embedding-2.0-en False
CQADupstackGisRetrieval 0.2965 nan 0.3695 0.6340 voyageai/voyage-3-m-exp False
CQADupstackMathematicaRetrieval 0.249 nan 0.2818 0.6948 voyageai/voyage-3-m-exp False
CQADupstackPhysicsRetrieval 0.3423 nan 0.4366 0.7371 voyageai/voyage-3-m-exp False
CQADupstackProgrammersRetrieval 0.3353 nan 0.416 0.6587 voyageai/voyage-3-m-exp False
CQADupstackRetrieval 0.3105 nan 0.3967 0.6830 voyageai/voyage-3-m-exp False
CQADupstackStatsRetrieval 0.2666 nan 0.3238 0.6242 voyageai/voyage-3-m-exp False
CQADupstackTexRetrieval 0.2177 nan 0.2836 0.6295 voyageai/voyage-3-m-exp False
CQADupstackUnixRetrieval 0.2957 0.5369 0.3988 0.7198 voyageai/voyage-3-m-exp False
CQADupstackWebmastersRetrieval 0.3133 nan 0.3988 0.6835 voyageai/voyage-3-m-exp False
CQADupstackWordpressRetrieval 0.2386 nan 0.3164 0.5862 voyageai/voyage-3-m-exp False
ClimateFEVER 0.2851 nan 0.2573 0.5693 voyageai/voyage-3-m-exp False
DBPedia 0.3632 nan 0.413 0.5350 nvidia/NV-Embed-v2 False
EmotionClassification 0.4769 nan 0.4758 0.9387 TencentBAC/Conan-embedding-v2 False
FEVER 0.6027 nan 0.8279 0.9628 voyageai/voyage-3-m-exp False
FiQA2018 0.3259 0.6178 0.4381 0.8206 ai-sage/Giga-Embeddings-instruct False
HotpotQA 0.5243 nan 0.7122 0.8696 voyageai/voyage-3-m-exp False
ImdbClassification 0.7359 0.9498 0.8867 0.9737 Qwen/Qwen3-Embedding-8B False
MSMARCO 0.3586 nan 0.437 0.4812 TencentBAC/Conan-embedding-v2 False
MTOPDomainClassification 0.9047 0.9926 0.9097 0.9995 voyageai/voyage-3-m-exp False
MTOPIntentClassification 0.6314 nan nan 0.9551 BAAI/bge-multilingual-gemma2 False
MassiveIntentClassification 0.6838 0.8871 0.6843 0.9194 voyageai/voyage-3-m-exp False
MassiveScenarioClassification 0.7315 0.9191 0.7146 0.9930 voyageai/voyage-3-m-exp False
MedrxivClusteringP2P 0.3202 nan 0.317 0.5153 voyageai/voyage-3-m-exp False
MedrxivClusteringS2S 0.2922 nan 0.2976 0.4969 TencentBAC/Conan-embedding-v2 False
MindSmallReranking 0.3062 0.3295 0.3024 0.3437 Kingsoft-LLM/QZhou-Embedding False
NFCorpus 0.3035 nan 0.3398 0.5575 TencentBAC/Conan-embedding-v2 False
NQ 0.5071 nan 0.6403 0.8248 voyageai/voyage-3-m-exp False
QuoraRetrieval 0.6088 nan 0.8926 0.9235 TencentBAC/Conan-embedding-v2 False
RedditClustering 0.4467 nan 0.4691 0.7716 voyageai/voyage-3-m-exp False
RedditClusteringP2P 0.5367 nan 0.63 0.7527 NovaSearch/stella_en_1.5B_v5 False
SCIDOCS 0.1637 0.2515 0.1745 0.5986 IEITYuan/Yuan-embedding-2.0-en False
SICK-R 0.7981 0.8275 0.8023 0.9465 Gameselo/STS-multilingual-mpnet-base-v2 False
STS12 0.7603 0.8155 0.8002 0.9546 Gameselo/STS-multilingual-mpnet-base-v2 False
STS13 0.8505 0.8989 0.8155 0.9776 Gameselo/STS-multilingual-mpnet-base-v2 False
STS14 0.8097 0.8541 0.7772 0.9753 Gameselo/STS-multilingual-mpnet-base-v2 False
STS15 0.8688 0.9044 0.8931 0.9811 Gameselo/STS-multilingual-mpnet-base-v2 False
STS16 0.833 nan 0.8579 0.9763 Gameselo/STS-multilingual-mpnet-base-v2 False
STSBenchmark 0.8649 0.8908 0.8729 0.9504 Kingsoft-LLM/QZhou-Embedding False
SciDocsRR 0.741 nan 0.8422 0.9114 TencentBAC/Conan-embedding-v2 False
SciFact 0.5942 nan 0.702 0.8660 openbmb/MiniCPM-Embedding False
SprintDuplicateQuestions 0.9491 0.9690 0.9314 0.9838 Kingsoft-LLM/QZhou-Embedding False
StackExchangeClustering 0.5204 nan 0.5837 0.8395 TencentBAC/Conan-embedding-v2 False
StackExchangeClusteringP2P 0.3414 nan 0.329 0.5157 TencentBAC/Conan-embedding-v2 False
StackOverflowDupQuestions 0.4353 nan 0.5014 0.5904 Qwen/Qwen3-Embedding-8B False
SummEval 0.2973 nan 0.2964 0.3360 bigscience/sgpt-bloom-7b1-msmarco False
TRECCOVID 0.6701 0.8631 0.7115 0.9833 IEITYuan/Yuan-embedding-2.0-en False
Touche2020 0.2858 nan 0.2313 0.3939 voyageai/voyage-3-m-exp False
ToxicConversationsClassification 0.6623 0.8875 0.6601 0.9759 voyageai/voyage-3-m-exp False
TweetSentimentExtractionClassification 0.6204 0.6988 0.628 0.8823 voyageai/voyage-3-m-exp False
TwentyNewsgroupsClustering 0.4163 nan 0.394 0.8349 voyageai/voyage-3-m-exp False
TwitterSemEval2015 0.7079 0.7917 0.7528 0.8946 voyageai/voyage-large-2-instruct False
TwitterURLCorpus 0.855 0.8705 0.8583 0.9571 TencentBAC/Conan-embedding-v2 False
Average 0.5191 0.7861 0.5661 0.7736 nan -

Results for axiotic/ogma-micro

task_name axiotic/ogma-micro google/gemini-embedding-001 intfloat/multilingual-e5-large Max result Model with max result In Training Data
AmazonCounterfactualClassification 0.6485 0.9289 nan 0.9893 Bytedance/Seed1.6-embedding-1215 False
AmazonPolarityClassification 0.6763 nan 0.9326 0.9774 nvidia/NV-Embed-v2 False
AmazonReviewsClassification 0.3523 nan nan 0.6880 TencentBAC/Conan-embedding-v2 False
ArXivHierarchicalClusteringP2P 0.5505 0.6492 0.5569 0.6869 NovaSearch/jasper_en_vision_language_v1 False
ArXivHierarchicalClusteringS2S 0.5036 0.6384 0.5367 0.6548 Qwen/Qwen3-Embedding-8B False
ArguAna 0.4194 0.8644 0.5436 0.8979 voyageai/voyage-3-m-exp False
AskUbuntuDupQuestions 0.5594 0.6424 0.5924 0.7528 IEITYuan/Yuan-embedding-2.0-en False
BIOSSES 0.7885 0.8897 0.8457 0.9692 Gameselo/STS-multilingual-mpnet-base-v2 False
Banking77Classification 0.7003 0.9427 0.7492 0.9427 google/gemini-embedding-001 False
BiorxivClusteringP2P 0.3105 nan 0.355 0.5522 TencentBAC/Conan-embedding-v2 False
BiorxivClusteringS2S 0.202 nan 0.333 0.5092 TencentBAC/Conan-embedding-v2 False
CQADupstackAndroidRetrieval 0.2614 nan 0.4904 0.7426 voyageai/voyage-3-m-exp False
CQADupstackEnglishRetrieval 0.1982 nan 0.4581 0.6998 voyageai/voyage-3-m-exp False
CQADupstackGamingRetrieval 0.3592 0.7068 0.587 0.8161 IEITYuan/Yuan-embedding-2.0-en False
CQADupstackGisRetrieval 0.213 nan 0.3695 0.6340 voyageai/voyage-3-m-exp False
CQADupstackMathematicaRetrieval 0.1454 nan 0.2818 0.6948 voyageai/voyage-3-m-exp False
CQADupstackPhysicsRetrieval 0.2806 nan 0.4366 0.7371 voyageai/voyage-3-m-exp False
CQADupstackProgrammersRetrieval 0.2433 nan 0.416 0.6587 voyageai/voyage-3-m-exp False
CQADupstackRetrieval 0.2234 nan 0.3967 0.6830 voyageai/voyage-3-m-exp False
CQADupstackStatsRetrieval 0.2158 nan 0.3238 0.6242 voyageai/voyage-3-m-exp False
CQADupstackTexRetrieval 0.1504 nan 0.2836 0.6295 voyageai/voyage-3-m-exp False
CQADupstackUnixRetrieval 0.2012 0.5369 0.3988 0.7198 voyageai/voyage-3-m-exp False
CQADupstackWebmastersRetrieval 0.2343 nan 0.3988 0.6835 voyageai/voyage-3-m-exp False
CQADupstackWordpressRetrieval 0.1779 nan 0.3164 0.5862 voyageai/voyage-3-m-exp False
ClimateFEVER 0.206 nan 0.2573 0.5693 voyageai/voyage-3-m-exp False
DBPedia 0.2727 nan 0.413 0.5350 nvidia/NV-Embed-v2 False
EmotionClassification 0.3598 nan 0.4758 0.9387 TencentBAC/Conan-embedding-v2 False
FEVER 0.6289 nan 0.8279 0.9628 voyageai/voyage-3-m-exp False
FiQA2018 0.1779 0.6178 0.4381 0.8206 ai-sage/Giga-Embeddings-instruct False
HotpotQA 0.3875 nan 0.7122 0.8696 voyageai/voyage-3-m-exp False
ImdbClassification 0.6525 0.9498 0.8867 0.9737 Qwen/Qwen3-Embedding-8B False
MSMARCO 0.2178 nan 0.437 0.4812 TencentBAC/Conan-embedding-v2 False
MTOPDomainClassification 0.8345 0.9927 0.9097 0.9995 voyageai/voyage-3-m-exp False
MTOPIntentClassification 0.5172 nan nan 0.9551 BAAI/bge-multilingual-gemma2 False
MassiveIntentClassification 0.5875 0.8846 0.6804 0.9194 voyageai/voyage-3-m-exp False
MassiveScenarioClassification 0.6684 0.9208 0.7178 0.9930 voyageai/voyage-3-m-exp False
MedrxivClusteringP2P 0.3043 nan 0.317 0.5153 voyageai/voyage-3-m-exp False
MedrxivClusteringS2S 0.2515 nan 0.2976 0.4969 TencentBAC/Conan-embedding-v2 False
MindSmallReranking 0.301 0.3295 0.3024 0.3437 Kingsoft-LLM/QZhou-Embedding False
NFCorpus 0.2383 nan 0.3398 0.5575 TencentBAC/Conan-embedding-v2 False
NQ 0.2935 nan 0.6403 0.8248 voyageai/voyage-3-m-exp False
QuoraRetrieval 0.4713 nan 0.8926 0.9235 TencentBAC/Conan-embedding-v2 False
RedditClustering 0.3783 nan 0.4691 0.7716 voyageai/voyage-3-m-exp False
RedditClusteringP2P 0.4691 nan 0.63 0.7527 NovaSearch/stella_en_1.5B_v5 False
SCIDOCS 0.1197 0.2515 0.1745 0.5986 IEITYuan/Yuan-embedding-2.0-en False
SICK-R 0.6997 0.8275 0.8023 0.9465 Gameselo/STS-multilingual-mpnet-base-v2 False
STS12 0.6762 0.8155 0.8002 0.9546 Gameselo/STS-multilingual-mpnet-base-v2 False
STS13 0.7693 0.8989 0.8155 0.9776 Gameselo/STS-multilingual-mpnet-base-v2 False
STS14 0.7441 0.8541 0.7772 0.9753 Gameselo/STS-multilingual-mpnet-base-v2 False
STS15 0.8184 0.9044 0.8931 0.9811 Gameselo/STS-multilingual-mpnet-base-v2 False
STS16 0.7759 nan 0.8579 0.9763 Gameselo/STS-multilingual-mpnet-base-v2 False
STSBenchmark 0.7782 0.8908 0.8729 0.9504 Kingsoft-LLM/QZhou-Embedding False
SciDocsRR 0.7162 nan 0.8422 0.9114 TencentBAC/Conan-embedding-v2 False
SciFact 0.4796 nan 0.702 0.8660 openbmb/MiniCPM-Embedding False
SprintDuplicateQuestions 0.9348 0.9690 0.9314 0.9838 Kingsoft-LLM/QZhou-Embedding False
StackExchangeClustering 0.4363 nan 0.5837 0.8395 TencentBAC/Conan-embedding-v2 False
StackExchangeClusteringP2P 0.3344 nan 0.329 0.5157 TencentBAC/Conan-embedding-v2 False
StackOverflowDupQuestions 0.4129 nan 0.5014 0.5904 Qwen/Qwen3-Embedding-8B False
SummEval 0.3177 nan 0.2964 0.3360 bigscience/sgpt-bloom-7b1-msmarco False
TRECCOVID 0.5952 0.8631 0.7115 0.9833 IEITYuan/Yuan-embedding-2.0-en False
Touche2020 0.2328 nan 0.2313 0.3939 voyageai/voyage-3-m-exp False
ToxicConversationsClassification 0.6013 0.8875 0.6601 0.9759 voyageai/voyage-3-m-exp False
TweetSentimentExtractionClassification 0.5453 0.6988 0.628 0.8823 voyageai/voyage-3-m-exp False
TwentyNewsgroupsClustering 0.3159 nan 0.394 0.8349 voyageai/voyage-3-m-exp False
TwitterSemEval2015 0.6003 0.7917 0.7528 0.8946 voyageai/voyage-large-2-instruct False
TwitterURLCorpus 0.8236 0.8705 0.8583 0.9571 TencentBAC/Conan-embedding-v2 False
Average 0.4479 0.7864 0.5661 0.7736 nan -

Results for axiotic/ogma-mini

task_name axiotic/ogma-mini google/gemini-embedding-001 intfloat/multilingual-e5-large Max result Model with max result In Training Data
AmazonCounterfactualClassification 0.6501 0.9200 nan 0.9903 Bytedance/Seed1.6-embedding-1215 False
AmazonPolarityClassification 0.7044 nan 0.9326 0.9774 nvidia/NV-Embed-v2 False
AmazonReviewsClassification 0.3722 nan nan 0.6880 TencentBAC/Conan-embedding-v2 False
ArXivHierarchicalClusteringP2P 0.5434 0.6492 0.5569 0.6869 NovaSearch/jasper_en_vision_language_v1 False
ArXivHierarchicalClusteringS2S 0.4988 0.6384 0.5367 0.6548 Qwen/Qwen3-Embedding-8B False
ArguAna 0.4072 0.8644 0.5436 0.8979 voyageai/voyage-3-m-exp False
AskUbuntuDupQuestions 0.5213 0.6424 0.5924 0.7528 IEITYuan/Yuan-embedding-2.0-en False
BIOSSES 0.8 0.8897 0.8457 0.9692 Gameselo/STS-multilingual-mpnet-base-v2 False
Banking77Classification 0.7293 0.9427 0.7492 0.9427 google/gemini-embedding-001 False
BiorxivClusteringP2P 0.3011 nan 0.355 0.5522 TencentBAC/Conan-embedding-v2 False
BiorxivClusteringS2S 0.2036 nan 0.333 0.5092 TencentBAC/Conan-embedding-v2 False
CQADupstackAndroidRetrieval 0.3175 nan 0.4904 0.7426 voyageai/voyage-3-m-exp False
CQADupstackEnglishRetrieval 0.211 nan 0.4581 0.6998 voyageai/voyage-3-m-exp False
CQADupstackGamingRetrieval 0.4044 0.7068 0.587 0.8161 IEITYuan/Yuan-embedding-2.0-en False
CQADupstackGisRetrieval 0.2396 nan 0.3695 0.6340 voyageai/voyage-3-m-exp False
CQADupstackMathematicaRetrieval 0.1698 nan 0.2818 0.6948 voyageai/voyage-3-m-exp False
CQADupstackPhysicsRetrieval 0.2932 nan 0.4366 0.7371 voyageai/voyage-3-m-exp False
CQADupstackProgrammersRetrieval 0.2728 nan 0.416 0.6587 voyageai/voyage-3-m-exp False
CQADupstackRetrieval 0.2482 nan 0.3967 0.6830 voyageai/voyage-3-m-exp False
CQADupstackStatsRetrieval 0.2259 nan 0.3238 0.6242 voyageai/voyage-3-m-exp False
CQADupstackTexRetrieval 0.1705 nan 0.2836 0.6295 voyageai/voyage-3-m-exp False
CQADupstackUnixRetrieval 0.2314 0.5369 0.3988 0.7198 voyageai/voyage-3-m-exp False
CQADupstackWebmastersRetrieval 0.2497 nan 0.3988 0.6835 voyageai/voyage-3-m-exp False
CQADupstackWordpressRetrieval 0.193 nan 0.3164 0.5862 voyageai/voyage-3-m-exp False
ClimateFEVER 0.2461 nan 0.2573 0.5693 voyageai/voyage-3-m-exp False
DBPedia 0.2958 nan 0.413 0.5350 nvidia/NV-Embed-v2 False
EmotionClassification 0.3907 nan 0.4758 0.9387 TencentBAC/Conan-embedding-v2 False
FEVER 0.6983 nan 0.8279 0.9628 voyageai/voyage-3-m-exp False
FiQA2018 0.2072 0.6178 0.4381 0.8206 ai-sage/Giga-Embeddings-instruct False
HotpotQA 0.4357 nan 0.7122 0.8696 voyageai/voyage-3-m-exp False
ImdbClassification 0.6725 0.9498 0.8867 0.9737 Qwen/Qwen3-Embedding-8B False
MSMARCO 0.2573 nan 0.437 0.4812 TencentBAC/Conan-embedding-v2 False
MTOPDomainClassification 0.8533 0.9926 0.9097 0.9995 voyageai/voyage-3-m-exp False
MTOPIntentClassification 0.5477 nan nan 0.9551 BAAI/bge-multilingual-gemma2 False
MassiveIntentClassification 0.6104 0.8871 0.6843 0.9194 voyageai/voyage-3-m-exp False
MassiveScenarioClassification 0.6948 0.9191 0.7146 0.9930 voyageai/voyage-3-m-exp False
MedrxivClusteringP2P 0.3088 nan 0.317 0.5153 voyageai/voyage-3-m-exp False
MedrxivClusteringS2S 0.2535 nan 0.2976 0.4969 TencentBAC/Conan-embedding-v2 False
MindSmallReranking 0.2968 0.3295 0.3024 0.3437 Kingsoft-LLM/QZhou-Embedding False
NFCorpus 0.2554 nan 0.3398 0.5575 TencentBAC/Conan-embedding-v2 False
NQ 0.3393 nan 0.6403 0.8248 voyageai/voyage-3-m-exp False
QuoraRetrieval 0.5177 nan 0.8926 0.9235 TencentBAC/Conan-embedding-v2 False
RedditClustering 0.395 nan 0.4691 0.7716 voyageai/voyage-3-m-exp False
RedditClusteringP2P 0.4908 nan 0.63 0.7527 NovaSearch/stella_en_1.5B_v5 False
SCIDOCS 0.138 0.2515 0.1745 0.5986 IEITYuan/Yuan-embedding-2.0-en False
SICK-R 0.7183 0.8275 0.8023 0.9465 Gameselo/STS-multilingual-mpnet-base-v2 False
STS12 0.7193 0.8155 0.8002 0.9546 Gameselo/STS-multilingual-mpnet-base-v2 False
STS13 0.7927 0.8989 0.8155 0.9776 Gameselo/STS-multilingual-mpnet-base-v2 False
STS14 0.7596 0.8541 0.7772 0.9753 Gameselo/STS-multilingual-mpnet-base-v2 False
STS15 0.8306 0.9044 0.8931 0.9811 Gameselo/STS-multilingual-mpnet-base-v2 False
STS16 0.7904 nan 0.8579 0.9763 Gameselo/STS-multilingual-mpnet-base-v2 False
STSBenchmark 0.8057 0.8908 0.8729 0.9504 Kingsoft-LLM/QZhou-Embedding False
SciDocsRR 0.6965 nan 0.8422 0.9114 TencentBAC/Conan-embedding-v2 False
SciFact 0.5301 nan 0.702 0.8660 openbmb/MiniCPM-Embedding False
SprintDuplicateQuestions 0.9496 0.9690 0.9314 0.9838 Kingsoft-LLM/QZhou-Embedding False
StackExchangeClustering 0.4464 nan 0.5837 0.8395 TencentBAC/Conan-embedding-v2 False
StackExchangeClusteringP2P 0.3341 nan 0.329 0.5157 TencentBAC/Conan-embedding-v2 False
StackOverflowDupQuestions 0.3813 nan 0.5014 0.5904 Qwen/Qwen3-Embedding-8B False
SummEval 0.3133 nan 0.2964 0.3360 bigscience/sgpt-bloom-7b1-msmarco False
TRECCOVID 0.6142 0.8631 0.7115 0.9833 IEITYuan/Yuan-embedding-2.0-en False
Touche2020 0.2409 nan 0.2313 0.3939 voyageai/voyage-3-m-exp False
ToxicConversationsClassification 0.6123 0.8875 0.6601 0.9759 voyageai/voyage-3-m-exp False
TweetSentimentExtractionClassification 0.5713 0.6988 0.628 0.8823 voyageai/voyage-3-m-exp False
TwentyNewsgroupsClustering 0.3364 nan 0.394 0.8349 voyageai/voyage-3-m-exp False
TwitterSemEval2015 0.6068 0.7917 0.7528 0.8946 voyageai/voyage-large-2-instruct False
TwitterURLCorpus 0.8333 0.8705 0.8583 0.9571 TencentBAC/Conan-embedding-v2 False
Average 0.4659 0.7861 0.5661 0.7736 nan -

Results for axiotic/ogma-small

task_name axiotic/ogma-small google/gemini-embedding-001 intfloat/multilingual-e5-large Max result Model with max result In Training Data
AmazonCounterfactualClassification 0.6964 0.9289 nan 0.9893 Bytedance/Seed1.6-embedding-1215 False
AmazonPolarityClassification 0.7672 nan 0.9326 0.9774 nvidia/NV-Embed-v2 False
AmazonReviewsClassification 0.39 nan nan 0.6880 TencentBAC/Conan-embedding-v2 False
ArXivHierarchicalClusteringP2P 0.5554 0.6492 0.5569 0.6869 NovaSearch/jasper_en_vision_language_v1 False
ArXivHierarchicalClusteringS2S 0.5212 0.6384 0.5367 0.6548 Qwen/Qwen3-Embedding-8B False
ArguAna 0.4232 0.8644 0.5436 0.8979 voyageai/voyage-3-m-exp False
AskUbuntuDupQuestions 0.5508 0.6424 0.5924 0.7528 IEITYuan/Yuan-embedding-2.0-en False
BIOSSES 0.8381 0.8897 0.8457 0.9692 Gameselo/STS-multilingual-mpnet-base-v2 False
Banking77Classification 0.7738 0.9427 0.7492 0.9427 google/gemini-embedding-001 False
BiorxivClusteringP2P 0.3328 nan 0.355 0.5522 TencentBAC/Conan-embedding-v2 False
BiorxivClusteringS2S 0.2541 nan 0.333 0.5092 TencentBAC/Conan-embedding-v2 False
CQADupstackAndroidRetrieval 0.368 nan 0.4904 0.7426 voyageai/voyage-3-m-exp False
CQADupstackEnglishRetrieval 0.3301 nan 0.4581 0.6998 voyageai/voyage-3-m-exp False
CQADupstackGamingRetrieval 0.4501 0.7068 0.587 0.8161 IEITYuan/Yuan-embedding-2.0-en False
CQADupstackGisRetrieval 0.2819 nan 0.3695 0.6340 voyageai/voyage-3-m-exp False
CQADupstackMathematicaRetrieval 0.2165 nan 0.2818 0.6948 voyageai/voyage-3-m-exp False
CQADupstackPhysicsRetrieval 0.3383 nan 0.4366 0.7371 voyageai/voyage-3-m-exp False
CQADupstackProgrammersRetrieval 0.3307 nan 0.416 0.6587 voyageai/voyage-3-m-exp False
CQADupstackRetrieval 0.3004 nan 0.3967 0.6830 voyageai/voyage-3-m-exp False
CQADupstackStatsRetrieval 0.263 nan 0.3238 0.6242 voyageai/voyage-3-m-exp False
CQADupstackTexRetrieval 0.2056 nan 0.2836 0.6295 voyageai/voyage-3-m-exp False
CQADupstackUnixRetrieval 0.2873 0.5369 0.3988 0.7198 voyageai/voyage-3-m-exp False
CQADupstackWebmastersRetrieval 0.3025 nan 0.3988 0.6835 voyageai/voyage-3-m-exp False
CQADupstackWordpressRetrieval 0.2309 nan 0.3164 0.5862 voyageai/voyage-3-m-exp False
ClimateFEVER 0.2861 nan 0.2573 0.5693 voyageai/voyage-3-m-exp False
DBPedia 0.3594 nan 0.413 0.5350 nvidia/NV-Embed-v2 False
EmotionClassification 0.4522 nan 0.4758 0.9387 TencentBAC/Conan-embedding-v2 False
FEVER 0.688 nan 0.8279 0.9628 voyageai/voyage-3-m-exp False
FiQA2018 0.3005 0.6178 0.4381 0.8206 ai-sage/Giga-Embeddings-instruct False
HotpotQA 0.5157 nan 0.7122 0.8696 voyageai/voyage-3-m-exp False
ImdbClassification 0.7249 0.9498 0.8867 0.9737 Qwen/Qwen3-Embedding-8B False
MSMARCO 0.3431 nan 0.437 0.4812 TencentBAC/Conan-embedding-v2 False
MTOPDomainClassification 0.9065 0.9927 0.9097 0.9995 voyageai/voyage-3-m-exp False
MTOPIntentClassification 0.6081 nan nan 0.9551 BAAI/bge-multilingual-gemma2 False
MassiveIntentClassification 0.6636 0.8846 0.6804 0.9194 voyageai/voyage-3-m-exp False
MassiveScenarioClassification 0.7278 0.9208 0.7178 0.9930 voyageai/voyage-3-m-exp False
MedrxivClusteringP2P 0.3196 nan 0.317 0.5153 voyageai/voyage-3-m-exp False
MedrxivClusteringS2S 0.2859 nan 0.2976 0.4969 TencentBAC/Conan-embedding-v2 False
MindSmallReranking 0.3055 0.3295 0.3024 0.3437 Kingsoft-LLM/QZhou-Embedding False
NFCorpus 0.3012 nan 0.3398 0.5575 TencentBAC/Conan-embedding-v2 False
NQ 0.4672 nan 0.6403 0.8248 voyageai/voyage-3-m-exp False
QuoraRetrieval 0.6055 nan 0.8926 0.9235 TencentBAC/Conan-embedding-v2 False
RedditClustering 0.4394 nan 0.4691 0.7716 voyageai/voyage-3-m-exp False
RedditClusteringP2P 0.526 nan 0.63 0.7527 NovaSearch/stella_en_1.5B_v5 False
SCIDOCS 0.1587 0.2515 0.1745 0.5986 IEITYuan/Yuan-embedding-2.0-en False
SICK-R 0.7875 0.8275 0.8023 0.9465 Gameselo/STS-multilingual-mpnet-base-v2 False
STS12 0.7562 0.8155 0.8002 0.9546 Gameselo/STS-multilingual-mpnet-base-v2 False
STS13 0.8404 0.8989 0.8155 0.9776 Gameselo/STS-multilingual-mpnet-base-v2 False
STS14 0.7994 0.8541 0.7772 0.9753 Gameselo/STS-multilingual-mpnet-base-v2 False
STS15 0.8577 0.9044 0.8931 0.9811 Gameselo/STS-multilingual-mpnet-base-v2 False
STS16 0.8252 nan 0.8579 0.9763 Gameselo/STS-multilingual-mpnet-base-v2 False
STSBenchmark 0.8554 0.8908 0.8729 0.9504 Kingsoft-LLM/QZhou-Embedding False
SciDocsRR 0.7355 nan 0.8422 0.9114 TencentBAC/Conan-embedding-v2 False
SciFact 0.6004 nan 0.702 0.8660 openbmb/MiniCPM-Embedding False
SprintDuplicateQuestions 0.953 0.9690 0.9314 0.9838 Kingsoft-LLM/QZhou-Embedding False
StackExchangeClustering 0.5022 nan 0.5837 0.8395 TencentBAC/Conan-embedding-v2 False
StackExchangeClusteringP2P 0.3408 nan 0.329 0.5157 TencentBAC/Conan-embedding-v2 False
StackOverflowDupQuestions 0.4285 nan 0.5014 0.5904 Qwen/Qwen3-Embedding-8B False
SummEval 0.2959 nan 0.2964 0.3360 bigscience/sgpt-bloom-7b1-msmarco False
TRECCOVID 0.6905 0.8631 0.7115 0.9833 IEITYuan/Yuan-embedding-2.0-en False
Touche2020 0.2676 nan 0.2313 0.3939 voyageai/voyage-3-m-exp False
ToxicConversationsClassification 0.6558 0.8875 0.6601 0.9759 voyageai/voyage-3-m-exp False
TweetSentimentExtractionClassification 0.6119 0.6988 0.628 0.8823 voyageai/voyage-3-m-exp False
TwentyNewsgroupsClustering 0.3986 nan 0.394 0.8349 voyageai/voyage-3-m-exp False
TwitterSemEval2015 0.6849 0.7917 0.7528 0.8946 voyageai/voyage-large-2-instruct False
TwitterURLCorpus 0.8494 0.8705 0.8583 0.9571 TencentBAC/Conan-embedding-v2 False
Average 0.5111 0.7864 0.5661 0.7736 nan -


Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.

@Samoed
Member

Samoed commented May 13, 2026

@sam-at-axiotic I tried to run your model, but got an error

mteb run -m axiotic/ogma-base -t ArguAna

    raise TypeError(
        f"RetrievalEvaluator expects a SearchInterface, Encoder, or CrossEncoder, got {type(model)}"
    )
TypeError: RetrievalEvaluator expects a SearchInterface, Encoder, or CrossEncoder, got <class 'mteb.models.model_implementations.axiotic_models.OgmaWrapper'>

It seems your model doesn't have similarity and similarity_pairwise methods, so it didn't pass this check.

@sam-at-axiotic
Author

sam-at-axiotic commented May 14, 2026

Thanks again @Samoed. The mteb fix is in embeddings-benchmark/mteb#4670.

OgmaWrapper now inherits from AbsEncoder, which provides the default similarity / similarity_pairwise driven by ModelMeta.similarity_fn_name (cosine for all Ogma models).

Verified locally with mteb run -m axiotic/ogma-micro -t ArguAna against the released wrapper. This is a pure protocol-conformance fix with no change to encoding behaviour, so the JSONs in this PR are unchanged.
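
For readers unfamiliar with the two methods, here is a minimal pure-Python sketch of what cosine-based similarity and similarity_pairwise are usually taken to mean: an all-pairs score matrix versus row-by-row scores. It is not mteb's implementation, and the assumption that AbsEncoder follows these semantics comes from common usage rather than from this thread.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(queries: Sequence[Vector], docs: Sequence[Vector]) -> List[List[float]]:
    """All-pairs scores: result[i][j] compares queries[i] with docs[j]."""
    return [[cosine(q, d) for d in docs] for q in queries]


def similarity_pairwise(left: Sequence[Vector], right: Sequence[Vector]) -> List[float]:
    """Row-wise scores: result[i] compares left[i] with right[i]."""
    if len(left) != len(right):
        raise ValueError("pairwise similarity needs equally long inputs")
    return [cosine(a, b) for a, b in zip(left, right)]


matrix = similarity([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
pairs = similarity_pairwise([[1.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [1.0, 1.0]])
```

A retrieval evaluator needs the matrix form to rank every document for every query, which is one plausible reason a wrapper without these methods is rejected before any encoding happens.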

@Samoed
Member

Samoed commented May 14, 2026

@sam-at-axiotic I tried to run mteb run -m axiotic/ogma-base -t ArguAna from embeddings-benchmark/mteb#4670, but got only 0.29648, while you report 0.45003

@kieran-axiotic

Hi @Samoed,

I have reproduced from a clean checkout (mteb 2.12.30 + PR #4670 applied as written, axiotic/ogma-base @ 6c9cd11d, CPU, torch 2.12, transformers 5.8.1): ArguAna nDCG@10 = 0.48 via mteb.evaluate(meta, get_task("ArguAna")). Matches our score jsons.

Can you share pip freeze, your torch/transformers versions, and confirm your ~/.cache/huggingface/hub/models--axiotic--ogma-base/snapshots/ only contains 6c9cd11d…? Also worth rm -rf ~/.cache/mteb and re-running — 0.296 isn't reachable by any (task-token × self-mask × dataset) combination we tested.

Many thanks!
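
The snapshot check requested above can be done with a few lines of Python. This is a sketch that assumes the default Hugging Face hub cache layout quoted in the comment (models--<org>--<name>/snapshots/<revision>); the function names are hypothetical, and the cache may live elsewhere if environment variables such as HF_HOME are set.

```python
from pathlib import Path
from typing import List


def snapshot_revisions(hub_dir: Path, repo_id: str) -> List[str]:
    """List the revision directories cached for repo_id under hub_dir."""
    snapshots = hub_dir / ("models--" + repo_id.replace("/", "--")) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p.name for p in snapshots.iterdir() if p.is_dir())


def stale_revisions(hub_dir: Path, repo_id: str, expected_prefix: str) -> List[str]:
    """Return cached revisions that do not start with the expected hash prefix."""
    return [rev for rev in snapshot_revisions(hub_dir, repo_id)
            if not rev.startswith(expected_prefix)]
```

For the case in this thread, the call would be stale_revisions(Path.home() / ".cache" / "huggingface" / "hub", "axiotic/ogma-base", "6c9cd11d"); an empty list means only the pinned revision is cached.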

@Samoed
Member

Samoed commented May 14, 2026

Yes, I reran it and unexpectedly got 0.48 this time, but I think 0.45 vs 0.48 is still a pretty big gap. How did you evaluate your models? Can you rerun the tasks with the mteb implementation?
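
One way to make such a gap explicit across many tasks is to diff the submitted scores against a rerun. The sketch below assumes the scores have already been loaded into plain task-to-score dictionaries; the function name and the 0.005 tolerance are illustrative choices, not an mteb convention.

```python
from typing import Dict


def score_gaps(reported: Dict[str, float], rerun: Dict[str, float],
               tolerance: float = 0.005) -> Dict[str, float]:
    """Return rerun - reported for every shared task whose gap exceeds tolerance."""
    gaps = {}
    for task in sorted(set(reported) & set(rerun)):
        diff = rerun[task] - reported[task]
        if abs(diff) > tolerance:
            gaps[task] = diff
    return gaps


# Values quoted in this thread for axiotic/ogma-base on ArguAna.
gaps = score_gaps({"ArguAna": 0.45003}, {"ArguAna": 0.48})
```

With these inputs the ArguAna entry is flagged with a gap of roughly 0.03, which is the discrepancy being questioned.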

Samoed pushed a commit to embeddings-benchmark/mteb that referenced this pull request May 14, 2026
OgmaWrapper was declared as a bare class, so isinstance(model, EncoderProtocol)
returns False and RetrievalEvaluator rejects it with TypeError. Inheriting from
AbsEncoder picks up the default similarity / similarity_pairwise implementations
driven by ModelMeta.similarity_fn_name (already set to COSINE for all Ogma
models). No change to encoding behaviour.

Reported by @Samoed on embeddings-benchmark/results#525.
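
The failure mode in this commit message can be reproduced with the standard library alone. The sketch below assumes EncoderProtocol behaves like a runtime-checkable typing.Protocol, which the isinstance check implies; EncoderLike and DefaultSimilarity are hypothetical stand-ins for mteb's EncoderProtocol and AbsEncoder, and the dot-product bodies are placeholders for the real cosine defaults.

```python
from typing import List, Protocol, runtime_checkable

Matrix = List[List[float]]


@runtime_checkable
class EncoderLike(Protocol):
    """Stand-in protocol: isinstance() passes only if all three methods exist."""
    def encode(self, sentences: List[str]) -> Matrix: ...
    def similarity(self, a: Matrix, b: Matrix) -> Matrix: ...
    def similarity_pairwise(self, a: Matrix, b: Matrix) -> List[float]: ...


class DefaultSimilarity:
    """Stand-in base class that supplies the two similarity methods."""
    def similarity(self, a: Matrix, b: Matrix) -> Matrix:
        return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]

    def similarity_pairwise(self, a: Matrix, b: Matrix) -> List[float]:
        return [sum(x * y for x, y in zip(u, v)) for u, v in zip(a, b)]


class BareWrapper:
    """Only encode() is defined, as in the wrapper before the fix."""
    def encode(self, sentences: List[str]) -> Matrix:
        return [[float(len(s))] for s in sentences]


class FixedWrapper(DefaultSimilarity):
    """Same encode(), but the similarity methods are inherited."""
    def encode(self, sentences: List[str]) -> Matrix:
        return [[float(len(s))] for s in sentences]


bare_ok = isinstance(BareWrapper(), EncoderLike)
fixed_ok = isinstance(FixedWrapper(), EncoderLike)
```

Here bare_ok is False and fixed_ok is True even though both classes encode identically, which illustrates why a conformance fix of this kind need not alter any embeddings.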