More F2LLM-v2 results #527

Merged
Samoed merged 7 commits into embeddings-benchmark:main from Geralt-Targaryen:main
May 11, 2026

Geralt-Targaryen commented May 9, 2026

Checklist

  • My model has a model sheet, report, or similar
  • My model has a reference implementation in mteb/models/model_implementations/; this can be an API. Instructions on how to add a model can be found here
    • No, but there is an existing PR ___
  • The results submitted are obtained using the reference implementation
  • My model is available, either as a publicly accessible API or publicly on e.g., Huggingface
  • I solemnly swear that for all results submitted I have not trained on the evaluation dataset including training splits. If I have, I have disclosed it clearly.

Add results on MTEB(Law, v1), ChemTEB, RTEB, RAR-b, LongEmbed, NanoBEIR, and BuiltBench(eng). The relevant prompts are added to the implementation in embeddings-benchmark/mteb#4643.

github-actions Bot commented May 9, 2026

Model Results Comparison

Reference models: intfloat/multilingual-e5-large, google/gemini-embedding-001
New models evaluated: codefuse-ai/F2LLM-v2-0.6B, codefuse-ai/F2LLM-v2-1.7B, codefuse-ai/F2LLM-v2-14B, codefuse-ai/F2LLM-v2-160M, codefuse-ai/F2LLM-v2-330M, codefuse-ai/F2LLM-v2-4B, codefuse-ai/F2LLM-v2-80M, codefuse-ai/F2LLM-v2-8B
Tasks: AILACasedocs, ARCChallenge, AlphaNLI, AmazonPolarityClassification, AmazonReviewsClassification, ArxivClusteringP2P, ArxivClusteringS2S, BiorxivClusteringP2P, BiorxivClusteringS2S, BrightRetrieval, BuiltBenchClusteringP2P, BuiltBenchClusteringS2S, BuiltBenchReranking, BuiltBenchRetrieval, CQADupstackAndroidRetrieval, CQADupstackEnglishRetrieval, CQADupstackGisRetrieval, CQADupstackMathematicaRetrieval, CQADupstackPhysicsRetrieval, CQADupstackProgrammersRetrieval, CQADupstackRetrieval, CQADupstackStatsRetrieval, CQADupstackTexRetrieval, CQADupstackWebmastersRetrieval, CQADupstackWordpressRetrieval, ChatDoctorRetrieval, ChemHotpotQARetrieval, ChemNQRetrieval, ClimateFEVER, DBPedia, DS1000Retrieval, EmotionClassification, FEVER, FinQARetrieval, FinanceBenchRetrieval, FreshStackRetrieval, GerDaLIRSmall, HC3FinanceRetrieval, HellaSwag, HotpotQA, HumanEvalRetrieval, LEMBNarrativeQARetrieval, LEMBNeedleRetrieval, LEMBQMSumRetrieval, LEMBSummScreenFDRetrieval, LEMBWikimQARetrieval, LeCaRDv2, LegalBenchConsumerContractsQA, LegalSummarization, MBPPRetrieval, MSMARCO, MTOPIntentClassification, MedrxivClusteringP2P, MedrxivClusteringS2S, NQ, NanoArguAnaRetrieval, NanoClimateFeverRetrieval, NanoDBPediaRetrieval, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQARetrieval, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQRetrieval, NanoQuoraRetrieval, NanoSCIDOCSRetrieval, NanoSciFactRetrieval, NanoTouche2020Retrieval, PIQA, PubChemAISentenceParaphrasePC, PubChemSMILESBitextMining, PubChemSMILESPC, PubChemSynonymPC, PubChemWikiPairClassification, PubChemWikiParagraphsPC, Quail, QuoraRetrieval, RARbCode, RARbMath, RedditClustering, RedditClusteringP2P, SDSEyeProtectionClassification, SDSGlovesClassification, SIQA, STS16, STS22, SciDocsRR, StackExchangeClustering, StackExchangeClusteringP2P, StackOverflowDupQuestions, SummEval, TempReasonL2Context, TempReasonL2Fact, TempReasonL2Pure, TempReasonL3Context, TempReasonL3Fact, TempReasonL3Pure, Touche2020, 
TwentyNewsgroupsClustering, WikiSQLRetrieval, WikipediaBioMetChemClassification, WikipediaBiolumNeurochemClassification, WikipediaChemEngSpecialtiesClassification, WikipediaChemFieldsClassification, WikipediaChemistryTopicsClassification, WikipediaChemistryTopicsClustering, WikipediaCompChemSpectroscopyClassification, WikipediaCryobiologySeparationClassification, WikipediaCrystallographyAnalyticalClassification, WikipediaGreenhouseEnantiopureClassification, WikipediaIsotopesFissionClassification, WikipediaLuminescenceClassification, WikipediaOrganicInorganicClassification, WikipediaSaltsSemiconductorsClassification, WikipediaSolidStateColloidalClassification, WikipediaSpecialtiesInChemistryClustering, WikipediaTheoreticalAppliedClassification

Results for codefuse-ai/F2LLM-v2-0.6B

task_name codefuse-ai/F2LLM-v2-0.6B google/gemini-embedding-001 intfloat/multilingual-e5-large Max result Model with max result In Training Data
AILACasedocs 0.4109 0.4833 0.2643 0.6560 Octen/Octen-Embedding-8B-INT8 False
ARCChallenge 0.1862 nan 0.1083 0.2668 GritLM/GritLM-7B False
AlphaNLI 0.2736 nan 0.1359 0.4393 Alibaba-NLP/gte-Qwen2-7B-instruct False
AmazonPolarityClassification 0.9675 nan 0.9326 0.9774 nvidia/NV-Embed-v2 True
AmazonReviewsClassification 0.5718 nan 0.4312 0.6880 TencentBAC/Conan-embedding-v2 False
ArxivClusteringP2P 0.5396 nan 0.4473 0.6092 TencentBAC/Conan-embedding-v2 True
ArxivClusteringS2S 0.4805 nan 0.3871 0.5520 TencentBAC/Conan-embedding-v2 True
BiorxivClusteringP2P 0.5862 nan 0.355 0.5522 TencentBAC/Conan-embedding-v2 True
BiorxivClusteringS2S 0.5363 nan 0.333 0.5092 TencentBAC/Conan-embedding-v2 True
BrightRetrieval 0.1326 nan nan 0.2720 ByteDance-Seed/Seed1.5-Embedding False
BuiltBenchClusteringP2P 0.6025 nan 0.4869 0.6767 Alibaba-NLP/gte-Qwen2-1.5B-instruct False
BuiltBenchClusteringS2S 0.4772 nan 0.3909 0.5766 Salesforce/SFR-Embedding-2_R False
BuiltBenchReranking 0.6268 nan 0.6236 0.7653 Alibaba-NLP/gte-Qwen2-7B-instruct False
BuiltBenchRetrieval 0.6487 nan 0.6308 0.7687 Linq-AI-Research/Linq-Embed-Mistral False
CQADupstackAndroidRetrieval 0.5421 nan 0.4904 0.7426 voyageai/voyage-3-m-exp False
CQADupstackEnglishRetrieval 0.5125 nan 0.4581 0.6998 voyageai/voyage-3-m-exp False
CQADupstackGisRetrieval 0.4365 nan 0.3695 0.6340 voyageai/voyage-3-m-exp False
CQADupstackMathematicaRetrieval 0.3691 nan 0.2818 0.6948 voyageai/voyage-3-m-exp False
CQADupstackPhysicsRetrieval 0.5350 nan 0.4366 0.7371 voyageai/voyage-3-m-exp False
CQADupstackProgrammersRetrieval 0.4723 nan 0.416 0.6587 voyageai/voyage-3-m-exp False
CQADupstackRetrieval 0.4573 nan 0.3967 0.6830 voyageai/voyage-3-m-exp False
CQADupstackStatsRetrieval 0.4061 nan 0.3238 0.6242 voyageai/voyage-3-m-exp False
CQADupstackTexRetrieval 0.3384 nan 0.2836 0.6295 voyageai/voyage-3-m-exp False
CQADupstackWebmastersRetrieval 0.4398 nan 0.3988 0.6835 voyageai/voyage-3-m-exp False
CQADupstackWordpressRetrieval 0.3654 nan 0.3164 0.5862 voyageai/voyage-3-m-exp False
ChatDoctorRetrieval 0.7649 0.7352 0.5687 0.7722 voyageai/voyage-4-large (embed_dim=2048) False
ChemHotpotQARetrieval 0.8069 nan 0.7979 0.9531 infly/inf-retriever-v1 False
ChemNQRetrieval 0.5939 nan 0.6617 0.7046 intfloat/multilingual-e5-small False
ClimateFEVER 0.4162 nan 0.2573 0.5693 voyageai/voyage-3-m-exp False
DBPedia 0.4142 nan 0.413 0.5350 nvidia/NV-Embed-v2 True
DS1000Retrieval 0.6507 0.6870 nan 0.7149 google/gemini-embedding-2-preview False
EmotionClassification 0.9216 nan 0.4758 0.9387 TencentBAC/Conan-embedding-v2 True
FEVER 0.9075 nan 0.8279 0.9628 voyageai/voyage-3-m-exp True
FinQARetrieval 0.5181 0.6464 nan 0.8897 voyageai/voyage-4-large (embed_dim=2048) False
FinanceBenchRetrieval 0.7673 0.9157 nan 0.9459 Octen/Octen-Embedding-8B False
FreshStackRetrieval 0.3519 0.3979 0.2519 0.5776 Octen/Octen-Embedding-8B False
GerDaLIRSmall 0.3084 nan 0.1572 0.5944 mteb/baseline-bm25s False
HC3FinanceRetrieval 0.6081 0.7758 nan 0.8242 nvidia/NV-Embed-v2 False
HellaSwag 0.2999 nan 0.2735 0.3966 infly/inf-retriever-v1 False
HotpotQA 0.6522 nan 0.7122 0.8696 voyageai/voyage-3-m-exp True
HumanEvalRetrieval 0.9623 0.9910 nan 1.0000 google/gemini-embedding-2-preview False
LEMBNarrativeQARetrieval 0.5195 nan 0.2422 0.7690 lightonai/GTE-ModernColBERT-v1 False
LEMBNeedleRetrieval 0.5875 nan 0.28 0.9325 mteb/baseline-bm25s False
LEMBQMSumRetrieval 0.4522 nan 0.2426 0.8323 mteb/baseline-bm25s False
LEMBSummScreenFDRetrieval 0.9691 nan 0.7112 0.9784 mteb/baseline-bm25s False
LEMBWikimQARetrieval 0.8976 nan 0.568 0.9988 lightonai/GTE-ModernColBERT-v1 False
LeCaRDv2 0.7076 nan 0.5583 0.7777 Mira190/Euler-Legal-Embedding-V1 False
LegalBenchConsumerContractsQA 0.7601 nan 0.733 0.8675 voyageai/voyage-3 False
LegalSummarization 0.6391 0.7122 0.621 0.7921 voyageai/voyage-3.5 False
MBPPRetrieval 0.8848 0.9416 nan 0.9608 voyageai/voyage-4-large (embed_dim=2048) False
MSMARCO 0.4134 nan 0.437 0.4812 TencentBAC/Conan-embedding-v2 True
MTOPIntentClassification 0.9379 nan 0.672 0.9429 BAAI/bge-multilingual-gemma2 True
MedrxivClusteringP2P 0.4821 nan 0.317 0.5153 voyageai/voyage-3-m-exp True
MedrxivClusteringS2S 0.4556 nan 0.2976 0.4969 TencentBAC/Conan-embedding-v2 True
NQ 0.6090 nan 0.6403 0.8248 voyageai/voyage-3-m-exp True
NanoArguAnaRetrieval 0.5816 nan nan 0.7739 infly/inf-retriever-v1-1.5b True
NanoClimateFeverRetrieval 0.4831 nan nan 0.4667 infly/inf-retriever-v1-1.5b False
NanoDBPediaRetrieval 0.6176 nan nan 0.7345 infly/inf-retriever-v1 True
NanoFEVERRetrieval 0.9528 nan nan 0.9759 infly/inf-retriever-v1 True
NanoFiQA2018Retrieval 0.5823 nan nan 0.6972 infly/inf-retriever-v1 True
NanoHotpotQARetrieval 0.7897 nan nan 0.9095 infly/inf-retriever-v1 True
NanoMSMARCORetrieval 0.6912 nan nan 0.7006 infly/inf-retriever-v1 True
NanoNFCorpusRetrieval 0.3777 nan nan 0.4710 infly/inf-retriever-v1 True
NanoNQRetrieval 0.7045 nan nan 0.7831 infly/inf-retriever-v1 True
NanoQuoraRetrieval 0.9609 nan nan 0.9728 intfloat/multilingual-e5-small False
NanoSCIDOCSRetrieval 0.4215 nan nan 0.5333 infly/inf-retriever-v1 False
NanoSciFactRetrieval 0.7561 nan nan 0.8632 infly/inf-retriever-v1 True
NanoTouche2020Retrieval 0.5108 nan nan 0.6953 mteb/baseline-bm25s False
PIQA 0.3234 nan 0.2882 0.4544 nvidia/NV-Embed-v2 False
PubChemAISentenceParaphrasePC 0.9492 nan 0.9664 0.9748 sentence-transformers/multi-qa-mpnet-base-dot-v1 False
PubChemSMILESBitextMining 0.0067 nan 0.0021 0.0074 ICT-TIME-and-Querit/BOOM_4B_v1 False
PubChemSMILESPC 0.1373 nan 0.1077 0.1612 ICT-TIME-and-Querit/BOOM_4B_v1 False
PubChemSynonymPC 0.7339 nan 0.6396 0.7352 openai/text-embedding-3-large False
PubChemWikiPairClassification 0.9639 nan 0.9452 0.9641 bedrock/amazon-titan-embed-text-v2 False
PubChemWikiParagraphsPC 0.4636 nan 0.192 0.5127 openai/text-embedding-3-large False
Quail 0.1592 nan 0.0485 0.2657 Alibaba-NLP/gte-Qwen2-7B-instruct False
QuoraRetrieval 0.8890 nan 0.8926 0.9235 TencentBAC/Conan-embedding-v2 False
RARbCode 0.6932 nan 0.5891 0.9049 Alibaba-NLP/gte-Qwen2-7B-instruct False
RARbMath 0.9489 nan 0.6732 0.9420 voyageai/voyage-3.5 False
RedditClustering 0.6033 nan 0.4691 0.7716 voyageai/voyage-3-m-exp True
RedditClusteringP2P 0.6635 nan 0.63 0.7527 NovaSearch/stella_en_1.5B_v5 True
SDSEyeProtectionClassification 0.7621 nan 0.7115 0.8299 minishlab/potion-multilingual-128M False
SDSGlovesClassification 0.7382 nan 0.6371 0.7533 sentence-transformers/static-similarity-mrl-multilingual-v1 False
SIQA 0.0442 nan 0.0536 0.0836 Alibaba-NLP/gte-Qwen2-7B-instruct False
STS16 0.8504 nan 0.8579 0.9763 Gameselo/STS-multilingual-mpnet-base-v2 False
STS22 0.6644 0.7176 0.6365 0.8314 OrdalieTech/Solon-embeddings-mini-beta-1.1 True
SciDocsRR 0.8510 nan 0.8422 0.9114 TencentBAC/Conan-embedding-v2 False
StackExchangeClustering 0.7389 nan 0.5837 0.8395 TencentBAC/Conan-embedding-v2 True
StackExchangeClusteringP2P 0.4449 nan 0.329 0.5157 TencentBAC/Conan-embedding-v2 True
StackOverflowDupQuestions 0.4713 nan 0.5014 0.5904 Qwen/Qwen3-Embedding-8B True
SummEval 0.3224 nan 0.2964 0.3360 bigscience/sgpt-bloom-7b1-msmarco False
TempReasonL2Context 0.3071 nan 0.2975 0.6405 Alibaba-NLP/gte-Qwen2-7B-instruct False
TempReasonL2Fact 0.3512 nan 0.4296 0.6412 Alibaba-NLP/gte-Qwen2-7B-instruct False
TempReasonL2Pure 0.0265 nan 0.0205 0.1420 GritLM/GritLM-8x7B False
TempReasonL3Context 0.2354 nan 0.2551 0.4766 Alibaba-NLP/gte-Qwen2-7B-instruct False
TempReasonL3Fact 0.2774 nan 0.3821 0.4739 Alibaba-NLP/gte-Qwen2-7B-instruct False
TempReasonL3Pure 0.0617 nan 0.0831 0.1666 Linq-AI-Research/Linq-Embed-Mistral False
Touche2020 0.2660 nan 0.2313 0.3939 voyageai/voyage-3-m-exp False
TwentyNewsgroupsClustering 0.5468 nan 0.394 0.8349 voyageai/voyage-3-m-exp True
WikiSQLRetrieval 0.9834 0.8814 nan 0.9892 Octen/Octen-Embedding-8B False
WikipediaBioMetChemClassification 0.9894 nan 0.9877 0.9980 ICT-TIME-and-Querit/BOOM_4B_v1 False
WikipediaBiolumNeurochemClassification 0.9153 nan 0.9571 0.9847 openai/text-embedding-3-large False
WikipediaChemEngSpecialtiesClassification 0.6524 nan 0.3202 0.7976 bedrock/cohere-embed-english-v3 False
WikipediaChemFieldsClassification 0.5144 nan 0.4876 0.6020 ICT-TIME-and-Querit/BOOM_4B_v1 False
WikipediaChemistryTopicsClassification 0.6884 nan 0.8463 0.9366 openai/text-embedding-3-large False
WikipediaChemistryTopicsClustering 0.3952 nan 0.652 0.7900 openai/text-embedding-3-large False
WikipediaCompChemSpectroscopyClassification 0.7448 nan 0.7466 0.8258 VPLabs/SearchMap_Preview False
WikipediaCryobiologySeparationClassification 0.8026 nan 0.9197 0.9631 bedrock/amazon-titan-embed-text-v1 False
WikipediaCrystallographyAnalyticalClassification 0.9316 nan 0.9296 0.9842 ICT-TIME-and-Querit/BOOM_4B_v1 False
WikipediaGreenhouseEnantiopureClassification 0.9596 nan 0.9737 0.9890 VPLabs/SearchMap_Preview False
WikipediaIsotopesFissionClassification 0.8476 nan 0.9071 0.9333 openai/text-embedding-3-large False
WikipediaLuminescenceClassification 0.9000 nan 0.8793 0.9341 bedrock/amazon-titan-embed-text-v1 False
WikipediaOrganicInorganicClassification 0.8266 nan 0.8856 0.9205 ICT-TIME-and-Querit/BOOM_4B_v1 False
WikipediaSaltsSemiconductorsClassification 0.8152 nan 0.8545 0.9242 VPLabs/SearchMap_Preview False
WikipediaSolidStateColloidalClassification 0.7489 nan 0.7872 0.8550 bedrock/amazon-titan-embed-text-v1 False
WikipediaSpecialtiesInChemistryClustering 0.2173 nan 0.0065 0.4695 VPLabs/SearchMap_Preview False
WikipediaTheoreticalAppliedClassification 0.6869 nan 0.6316 0.6978 ICT-TIME-and-Querit/BOOM_4B_v1 False
Average 0.5873 0.7404 0.5018 0.7101 nan -
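
The Average row appears to be computed over the tasks a model actually has a score for, skipping `nan` entries. As a rough check under that assumption, the sketch below averages the twelve non-`nan` google/gemini-embedding-001 scores from the table above; the two `nan` placeholders stand in for the many rows without a score, and `nan_mean` is a hypothetical helper, not the bot's own code.

```python
import math

nan = float("nan")

# google/gemini-embedding-001 scores copied from the table above.
# The two nan entries are placeholders for the rows with no score.
gemini_scores = [
    0.4833,  # AILACasedocs
    nan,     # (placeholder for a task without a score)
    0.7352,  # ChatDoctorRetrieval
    0.6870,  # DS1000Retrieval
    0.6464,  # FinQARetrieval
    0.9157,  # FinanceBenchRetrieval
    0.3979,  # FreshStackRetrieval
    0.7758,  # HC3FinanceRetrieval
    0.9910,  # HumanEvalRetrieval
    0.7122,  # LegalSummarization
    0.9416,  # MBPPRetrieval
    nan,     # (placeholder for a task without a score)
    0.7176,  # STS22
    0.8814,  # WikiSQLRetrieval
]


def nan_mean(values):
    """Mean over the entries that are not nan (hypothetical helper)."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else nan


print(round(nan_mean(gemini_scores), 4))  # 0.7404, matching the Average row
```

Because each column is averaged over a different subset of tasks, the per-model averages in this row are not directly comparable.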

The model has high performance on these tasks, where its score exceeds the listed max result: RARbMath, BiorxivClusteringP2P, BiorxivClusteringS2S, NanoClimateFeverRetrieval
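
The four tasks listed here are exactly the rows in the table above where the F2LLM-v2-0.6B score is higher than the "Max result" column. A minimal sketch of that comparison follows, assuming a strict greater-than rule; the function name `tasks_above_max` is hypothetical and the bot's actual logic is not shown in this thread. The rows are copied from the table, with two near misses included for contrast.

```python
# (task, F2LLM-v2-0.6B score, previous max result), copied from the table above.
rows = [
    ("RARbMath", 0.9489, 0.9420),
    ("BiorxivClusteringP2P", 0.5862, 0.5522),
    ("BiorxivClusteringS2S", 0.5363, 0.5092),
    ("NanoClimateFeverRetrieval", 0.4831, 0.4667),
    ("ChatDoctorRetrieval", 0.7649, 0.7722),            # below max, not flagged
    ("PubChemWikiPairClassification", 0.9639, 0.9641),  # below max, not flagged
]


def tasks_above_max(rows):
    """Return tasks whose new score strictly exceeds the previous max (assumed rule)."""
    return [task for task, score, previous_max in rows if score > previous_max]


print(tasks_above_max(rows))
```

Two of the four flagged tasks (the Biorxiv clustering tasks) are also marked True under "In Training Data", which is worth keeping in mind when reading the flag.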

Training datasets: ANLI, AmazonCounterfactualClassification, AmazonCounterfactualVNClassification, AmazonPolarityClassification, AmazonPolarityClassification.v2, AmazonPolarityVNClassification, AmazonQA, AmazonReviewClassification, ArXivHierarchicalClusteringP2P, ArXivHierarchicalClusteringS2S, ArguAna, ArguAna-Fa, ArguAna-Fa.v2, ArguAna-NL, ArguAna-NL.v2, ArguAna-PL, ArguAna-VN, ArxivClusteringP2P, ArxivClusteringP2P.v2, ArxivClusteringS2S, Aya, BQ, BactrianXLanguageClassification, BactrianXTranslation, Banking77Classification, Banking77Classification.v2, Banking77VNClassification, BioASQ, BiorxivClusteringP2P, BiorxivClusteringP2P.v2, BiorxivClusteringS2S, BiorxivClusteringS2S.v2, CEDR, CLIRMatrix, CMCQA, CMNLI, CNNDM, COIG, COLIEE, CORD19, CSL, CoLA, CodeFeedbackMT, CodeFeedbackST, CodeSearchNet, CodeSearchNetCCR, CosQA, DBPedia, DBPedia-Fa, DBPedia-NL, DBPedia-PL, DBPedia-PLHardNegatives, DBPedia-VN, DBPediaHardNegatives, DBPediaHardNegatives.v2, DuReader, ELI5, ESCI, EmotionClassification, EmotionClassification.v2, EmotionVNClassification, Europarl, FEVER, FEVER-FaHardNegatives, FEVER-NL, FEVER-VN, FEVERHardNegatives, FEVERHardNegatives.v2, FEVERNL, FiQA-PL, FiQA2018, FiQA2018-Fa, FiQA2018-Fa.v2, FiQA2018-NL, FiQA2018-VN, GooAQ, HUMEArxivClusteringP2P, HUMEEmotionClassification, HUMERedditClusteringP2P, HUMESTS12, HUMESTS22, HUMESTSBenchmark, HUMEToxicConversationsClassification, HUMETweetSentimentExtractionClassification, HealthCareMagic, HotpotQA, HotpotQA-Fa, HotpotQA-FaHardNegatives, HotpotQA-NL, HotpotQA-PL, HotpotQA-PLHardNegatives, HotpotQA-VN, HotpotQAHardNegatives, HotpotQAHardNegatives.v2, HotpotQANL, HuatuoEncQA, HuatuoKGQA, ImdbClassification, ImdbClassification.v2, ImdbVNClassification, InfinityInstruct, KoAlpaca, KoAlpacaRealQA, KoMagpie, LCQMC, LCSTS, LLMRetrievalData, Lawzhidao, M2Lingual, MEDI2, MIRACLJaRetrievalLite, MIRACLReranking, MIRACLRetrieval, MIRACLRetrievalHardNegatives, MIRACLRetrievalHardNegatives.v2, MKQA, MLDR, MLSUMClustering, 
MLSUMRetrieval, MMARCO, MNLI, MQA, MSMARCO, MSMARCO-Fa, MSMARCO-FaHardNegatives, MSMARCO-PL, MSMARCO-PLHardNegatives, MSMARCO-VN, MSMARCOHardNegatives, MSMARCOv2, MSciNLI, MTOPDomainClassification, MTOPDomainVNClassification, MTOPIntentClassification, MTOPIntentVNClassification, MURI, MailruQA, MassiveIntentClassification, MassiveIntentVNClassification, MassiveScenarioClassification, MassiveScenarioVNClassification, MedInstruct, MedMCQA, MedQA, MedQuAD, MedicalFlashcards, MedicalInstruction, MedicalQARu, MedrxivClusteringP2P, MedrxivClusteringP2P.v2, MedrxivClusteringS2S, MedrxivClusteringS2S.v2, MrTidyRetrieval, MrTyDiJaRetrievalLite, MultiAlpaca, MultiCPRECom, MultiCPRMedical, NFCorpus, NFCorpus-Fa, NFCorpus-NL, NFCorpus-NL.v2, NFCorpus-PL, NFCorpus-VN, NQ, NQ-Fa, NQ-FaHardNegatives, NQ-NL, NQ-PL, NQ-PLHardNegatives, NQ-VN, NQHardNegatives, NanoArguAnaRetrieval, NanoDBPedia-VN, NanoDBPediaRetrieval, NanoFEVER-VN, NanoFEVERRetrieval, NanoFiQA2018Retrieval, NanoHotpotQA-VN, NanoHotpotQARetrieval, NanoMSMARCO-VN, NanoMSMARCORetrieval, NanoNFCorpusRetrieval, NanoNQ-VN, NanoNQRetrieval, NanoSciFactRetrieval, NaturalReasoning, NordicClassification, NordicRetrieval, NordicSTS, NordicTextMatching, OASST2, OCNLI, OpenCodeGeneticInstruct, OpenCodeReasoning2, OpenOrca, PAQ, PQuAD, ParSQuAD, ParaCrawl, PawsX, PersianQA, ProCQA, PubMedQA, QBQTC, QQP, RedditClustering, RedditClustering-VN, RedditClustering.v2, RedditClusteringP2P, RedditClusteringP2P-VN, RedditClusteringP2P.v2, RefGPT, RuInstruct, RuSentimentClustering, S2ORC, SIB200, SNLI, SPECTER, SQuAD, STS12, STS22, STS22.v2, STSBenchmark, STSBenchmark-VN, SciFact, SciFact-Fa, SciFact-Fa.v2, SciFact-NL, SciFact-NL.v2, SciFact-PL, SciFact-VN, SentenceCompression, SiberianDataset, SimCLUE, StackExchange, StackExchangeClustering, StackExchangeClustering-VN, StackExchangeClustering.v2, StackExchangeClusteringP2P, StackExchangeClusteringP2P-VN, StackExchangeClusteringP2P.v2, StackExchangeDupQuestions, StackOverflowDupQuestions, 
StackOverflowDupQuestions-VN, StackOverflowQA, SyntheticText2SQL, T2Ranking, THUCNews, TNews, TNews.v2, ToxicConversationsClassification, ToxicConversationsClassification.v2, ToxicConversationsVNClassification, TriviaQA, TweetSentimentExtractionClassification, TweetSentimentExtractionClassification.v2, TweetSentimentExtractionVNClassification, TwentyNewsgroupsClustering, TwentyNewsgroupsClustering-VN, TwentyNewsgroupsClustering.v2, UNPC, Waimai, Waimai.v2, WebFAQ, WikiOmnia, WildChat, XCodeEvalCodeToCode, XCodeEvalNLToCode, XCodeEvalTranslation, XNLI, XSum, YahooAnswers, cMedQAv2, webMedQA
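
The "In Training Data" column in the table above looks consistent with an exact-name lookup of each evaluated task in this training dataset list. The sketch below illustrates that reading on a small subset of names taken from the list and the table. Exact string matching is an assumption here, and the real check may resolve related datasets differently.

```python
# Subset of the training datasets listed above.
training_datasets = {
    "AmazonPolarityClassification",
    "AmazonReviewClassification",  # note: no "s" after "Review"
    "HotpotQA",
    "NanoArguAnaRetrieval",
    "QQP",
    "STS22",
}

# Subset of the evaluated tasks from the results table.
evaluated_tasks = [
    "AmazonPolarityClassification",
    "AmazonReviewsClassification",
    "HotpotQA",
    "NanoClimateFeverRetrieval",
    "QuoraRetrieval",
    "STS22",
]

# Assumed rule: a task is flagged only when its exact name is in the training list.
in_training_data = {task: task in training_datasets for task in evaluated_tasks}

for task, flag in in_training_data.items():
    print(f"{task}: {flag}")
```

Under this rule AmazonReviewsClassification comes out False even though the similarly named AmazonReviewClassification is in the list, which matches the False shown in the table.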


Results for codefuse-ai/F2LLM-v2-1.7B

task_name codefuse-ai/F2LLM-v2-1.7B google/gemini-embedding-001 intfloat/multilingual-e5-large Max result Model with max result In Training Data
AILACasedocs 0.4182 0.4833 0.2643 0.6560 Octen/Octen-Embedding-8B-INT8 False
ARCChallenge 0.2371 nan 0.1083 0.2668 GritLM/GritLM-7B False
AlphaNLI 0.3063 nan 0.1359 0.4393 Alibaba-NLP/gte-Qwen2-7B-instruct False
AmazonPolarityClassification 0.9715 nan 0.9326 0.9774 nvidia/NV-Embed-v2 True
AmazonReviewsClassification 0.5915 nan 0.4312 0.6880 TencentBAC/Conan-embedding-v2 False
ArxivClusteringP2P 0.5515 nan 0.4473 0.6092 TencentBAC/Conan-embedding-v2 True
ArxivClusteringS2S 0.5034 nan 0.3871 0.5520 TencentBAC/Conan-embedding-v2 True
BiorxivClusteringP2P 0.6454 nan 0.355 0.5522 TencentBAC/Conan-embedding-v2 True
BiorxivClusteringS2S 0.6117 nan 0.333 0.5092 TencentBAC/Conan-embedding-v2 True
BrightRetrieval 0.1538 nan nan 0.2720 ByteDance-Seed/Seed1.5-Embedding False
BuiltBenchClusteringP2P 0.6701 nan 0.4869 0.6767 Alibaba-NLP/gte-Qwen2-1.5B-instruct False
BuiltBenchClusteringS2S 0.5148 nan 0.3909 0.5766 Salesforce/SFR-Embedding-2_R False
BuiltBenchReranking 0.6618 nan 0.6236 0.7653 Alibaba-NLP/gte-Qwen2-7B-instruct False
BuiltBenchRetrieval 0.6961 nan 0.6308 0.7687 Linq-AI-Research/Linq-Embed-Mistral False
CQADupstackAndroidRetrieval 0.5648 nan 0.4904 0.7426 voyageai/voyage-3-m-exp False
CQADupstackEnglishRetrieval 0.5435 nan 0.4581 0.6998 voyageai/voyage-3-m-exp False
CQADupstackGisRetrieval 0.4570 nan 0.3695 0.6340 voyageai/voyage-3-m-exp False
CQADupstackMathematicaRetrieval 0.4065 nan 0.2818 0.6948 voyageai/voyage-3-m-exp False
CQADupstackPhysicsRetrieval 0.5687 nan 0.4366 0.7371 voyageai/voyage-3-m-exp False
CQADupstackProgrammersRetrieval 0.4948 nan 0.416 0.6587 voyageai/voyage-3-m-exp False
CQADupstackRetrieval 0.4855 nan 0.3967 0.6830 voyageai/voyage-3-m-exp False
CQADupstackStatsRetrieval 0.4264 nan 0.3238 0.6242 voyageai/voyage-3-m-exp False
CQADupstackTexRetrieval 0.3681 nan 0.2836 0.6295 voyageai/voyage-3-m-exp False
CQADupstackWebmastersRetrieval 0.4651 nan 0.3988 0.6835 voyageai/voyage-3-m-exp False
CQADupstackWordpressRetrieval 0.3876 nan 0.3164 0.5862 voyageai/voyage-3-m-exp False
ChatDoctorRetrieval 0.7860 0.7352 0.5687 0.7722 voyageai/voyage-4-large (embed_dim=2048) False
ChemHotpotQARetrieval 0.8439 nan 0.7979 0.9531 infly/inf-retriever-v1 False
ChemNQRetrieval 0.6560 nan 0.6617 0.7046 intfloat/multilingual-e5-small False
ClimateFEVER 0.4327 nan 0.2573 0.5693 voyageai/voyage-3-m-exp False
DBPedia 0.4274 nan 0.413 0.5350 nvidia/NV-Embed-v2 True
DS1000Retrieval 0.6646 0.6870 nan 0.7149 google/gemini-embedding-2-preview False
EmotionClassification 0.9169 nan 0.4758 0.9387 TencentBAC/Conan-embedding-v2 True
FEVER 0.9107 nan 0.8279 0.9628 voyageai/voyage-3-m-exp True
FinQARetrieval 0.5648 0.6464 nan 0.8897 voyageai/voyage-4-large (embed_dim=2048) False
FinanceBenchRetrieval 0.8155 0.9157 nan 0.9459 Octen/Octen-Embedding-8B False
FreshStackRetrieval 0.3664 0.3979 0.2519 0.5776 Octen/Octen-Embedding-8B False
GerDaLIRSmall 0.3898 nan 0.1572 0.5944 mteb/baseline-bm25s False
HC3FinanceRetrieval 0.6862 0.7758 nan 0.8242 nvidia/NV-Embed-v2 False
HellaSwag 0.3173 nan 0.2735 0.3966 infly/inf-retriever-v1 False
HotpotQA 0.6789 nan 0.7122 0.8696 voyageai/voyage-3-m-exp True
HumanEvalRetrieval 0.9797 0.9910 nan 1.0000 google/gemini-embedding-2-preview False
LEMBNarrativeQARetrieval 0.5908 nan 0.2422 0.7690 lightonai/GTE-ModernColBERT-v1 False
LEMBNeedleRetrieval 0.4700 nan 0.28 0.9325 mteb/baseline-bm25s False
LEMBQMSumRetrieval 0.4829 nan 0.2426 0.8323 mteb/baseline-bm25s False
LEMBSummScreenFDRetrieval 0.9776 nan 0.7112 0.9784 mteb/baseline-bm25s False
LEMBWikimQARetrieval 0.9147 nan 0.568 0.9988 lightonai/GTE-ModernColBERT-v1 False
LeCaRDv2 0.7177 nan 0.5583 0.7777 Mira190/Euler-Legal-Embedding-V1 False
LegalBenchConsumerContractsQA 0.7823 nan 0.733 0.8675 voyageai/voyage-3 False
LegalSummarization 0.6614 0.7122 0.621 0.7921 voyageai/voyage-3.5 False
MBPPRetrieval 0.9022 0.9416 nan 0.9608 voyageai/voyage-4-large (embed_dim=2048) False
MSMARCO 0.4265 nan 0.437 0.4812 TencentBAC/Conan-embedding-v2 True
MTOPIntentClassification 0.9459 nan 0.672 0.9429 BAAI/bge-multilingual-gemma2 True
MedrxivClusteringP2P 0.5232 nan 0.317 0.5153 voyageai/voyage-3-m-exp True
MedrxivClusteringS2S 0.4972 nan 0.2976 0.4969 TencentBAC/Conan-embedding-v2 True
NQ 0.6436 nan 0.6403 0.8248 voyageai/voyage-3-m-exp True
NanoArguAnaRetrieval 0.5546 nan nan 0.7739 infly/inf-retriever-v1-1.5b True
NanoClimateFeverRetrieval 0.4509 nan nan 0.4667 infly/inf-retriever-v1-1.5b False
NanoDBPediaRetrieval 0.6457 nan nan 0.7345 infly/inf-retriever-v1 True
NanoFEVERRetrieval 0.9352 nan nan 0.9759 infly/inf-retriever-v1 True
NanoFiQA2018Retrieval 0.6678 nan nan 0.6972 infly/inf-retriever-v1 True
NanoHotpotQARetrieval 0.8199 nan nan 0.9095 infly/inf-retriever-v1 True
NanoMSMARCORetrieval 0.6664 nan nan 0.7006 infly/inf-retriever-v1 True
NanoNFCorpusRetrieval 0.3622 nan nan 0.4710 infly/inf-retriever-v1 True
NanoNQRetrieval 0.7307 nan nan 0.7831 infly/inf-retriever-v1 True
NanoQuoraRetrieval 0.9670 nan nan 0.9728 intfloat/multilingual-e5-small False
NanoSCIDOCSRetrieval 0.4474 nan nan 0.5333 infly/inf-retriever-v1 False
NanoSciFactRetrieval 0.8154 nan nan 0.8632 infly/inf-retriever-v1 True
NanoTouche2020Retrieval 0.5169 nan nan 0.6953 mteb/baseline-bm25s False
PIQA 0.3478 nan 0.2882 0.4544 nvidia/NV-Embed-v2 False
PubChemAISentenceParaphrasePC 0.9511 nan 0.9664 0.9748 sentence-transformers/multi-qa-mpnet-base-dot-v1 False
PubChemSMILESBitextMining 0.0085 nan 0.0021 0.0074 ICT-TIME-and-Querit/BOOM_4B_v1 False
PubChemSMILESPC 0.1760 nan 0.1077 0.1612 ICT-TIME-and-Querit/BOOM_4B_v1 False
PubChemSynonymPC 0.7425 nan 0.6396 0.7352 openai/text-embedding-3-large False
PubChemWikiPairClassification 0.9720 nan 0.9452 0.9641 bedrock/amazon-titan-embed-text-v2 False
PubChemWikiParagraphsPC 0.5686 nan 0.192 0.5127 openai/text-embedding-3-large False
Quail 0.1875 nan 0.0485 0.2657 Alibaba-NLP/gte-Qwen2-7B-instruct False
QuoraRetrieval 0.8939 nan 0.8926 0.9235 TencentBAC/Conan-embedding-v2 False
RARbCode 0.7251 nan 0.5891 0.9049 Alibaba-NLP/gte-Qwen2-7B-instruct False
RARbMath 0.9645 nan 0.6732 0.9420 voyageai/voyage-3.5 False
RedditClustering 0.6556 nan 0.4691 0.7716 voyageai/voyage-3-m-exp True
RedditClusteringP2P 0.6867 nan 0.63 0.7527 NovaSearch/stella_en_1.5B_v5 True
SDSEyeProtectionClassification 0.8195 nan 0.7115 0.8299 minishlab/potion-multilingual-128M False
SDSGlovesClassification 0.7723 nan 0.6371 0.7533 sentence-transformers/static-similarity-mrl-multilingual-v1 False
SIQA 0.0594 nan 0.0536 0.0836 Alibaba-NLP/gte-Qwen2-7B-instruct False
STS16 0.8520 nan 0.8579 0.9763 Gameselo/STS-multilingual-mpnet-base-v2 False
STS22 0.6670 0.7176 0.6365 0.8314 OrdalieTech/Solon-embeddings-mini-beta-1.1 True
SciDocsRR 0.8636 nan 0.8422 0.9114 TencentBAC/Conan-embedding-v2 False
StackExchangeClustering 0.7764 nan 0.5837 0.8395 TencentBAC/Conan-embedding-v2 True
StackExchangeClusteringP2P 0.4471 nan 0.329 0.5157 TencentBAC/Conan-embedding-v2 True
StackOverflowDupQuestions 0.4912 nan 0.5014 0.5904 Qwen/Qwen3-Embedding-8B True
SummEval 0.3114 nan 0.2964 0.3360 bigscience/sgpt-bloom-7b1-msmarco False
TempReasonL2Context 0.3784 nan 0.2975 0.6405 Alibaba-NLP/gte-Qwen2-7B-instruct False
TempReasonL2Fact 0.4287 nan 0.4296 0.6412 Alibaba-NLP/gte-Qwen2-7B-instruct False
TempReasonL2Pure 0.0425 nan 0.0205 0.1420 GritLM/GritLM-8x7B False
TempReasonL3Context 0.2727 nan 0.2551 0.4766 Alibaba-NLP/gte-Qwen2-7B-instruct False
TempReasonL3Fact 0.3234 nan 0.3821 0.4739 Alibaba-NLP/gte-Qwen2-7B-instruct False
TempReasonL3Pure 0.0839 nan 0.0831 0.1666 Linq-AI-Research/Linq-Embed-Mistral False
Touche2020 0.2684 nan 0.2313 0.3939 voyageai/voyage-3-m-exp False
TwentyNewsgroupsClustering 0.5927 nan 0.394 0.8349 voyageai/voyage-3-m-exp True
WikiSQLRetrieval 0.9884 0.8814 nan 0.9892 Octen/Octen-Embedding-8B False
WikipediaBioMetChemClassification 0.9928 nan 0.9877 0.9980 ICT-TIME-and-Querit/BOOM_4B_v1 False
WikipediaBiolumNeurochemClassification 0.9204 nan 0.9571 0.9847 openai/text-embedding-3-large False
WikipediaChemEngSpecialtiesClassification 0.7065 nan 0.3202 0.7976 bedrock/cohere-embed-english-v3 False
WikipediaChemFieldsClassification 0.5514 nan 0.4876 0.6020 ICT-TIME-and-Querit/BOOM_4B_v1 False
WikipediaChemistryTopicsClassification 0.7677 nan 0.8463 0.9366 openai/text-embedding-3-large False
WikipediaChemistryTopicsClustering 0.4075 nan 0.652 0.7900 openai/text-embedding-3-large False
WikipediaCompChemSpectroscopyClassificat

Note: Content truncated due to GitHub API limits. See the full report in the workflow artifacts.

Samoed merged commit 82a2a14 into embeddings-benchmark:main on May 11, 2026
4 checks passed