machinelearningZH
diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml
Lines changed: 1 addition & 1 deletion
diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml
Lines changed: 1 addition & 1 deletion
diff --git a/.python-version b/.python-version
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 9 additions & 3 deletions
diff --git a/configs/example.yaml b/configs/example.yaml
Lines changed: 27 additions & 10 deletions
diff --git a/pyproject.toml b/pyproject.toml
Lines changed: 11 additions & 9 deletions
diff --git a/semsearcheval/constants.py b/semsearcheval/constants.py
Lines changed: 6 additions & 6 deletions
diff --git a/semsearcheval/metrics.py b/semsearcheval/metrics.py
Lines changed: 54 additions & 10 deletions
@@ -16,7 +16,7 @@ jobs:
       - name: Set up Python
         uses: actions/setup-python@v5
         with:
-          python-version: "3.11"
+          python-version: "3.13"
 
       - name: Install uv
         run: pip3 install uv
 
@@ -16,7 +16,7 @@ jobs:
       - name: Set up Python
         uses: actions/setup-python@v5
         with:
-          python-version: '3.11'
+          python-version: '3.13'
 
       - name: Install uv
         run: pip3 install uv
 
@@ -1 +1 @@
-3.11
+3.13
@@ -3,7 +3,7 @@
 **A framework for evaluating semantic search across custom datasets, metrics, and embedding backends.**
 
 ![GitHub License](https://img.shields.io/github/license/machinelearningZH/semantic-search-eval)
-[![PyPI - Python](https://img.shields.io/badge/python-v3.11+-blue.svg)](https://github.com/machinelearningZH/semantic-search-eval)
+[![PyPI - Python](https://img.shields.io/badge/python-v3.13+-blue.svg)](https://github.com/machinelearningZH/semantic-search-eval)
 [![GitHub Stars](https://img.shields.io/github/stars/machinelearningZH/semantic-search-eval.svg)](https://github.com/machinelearningZH/semantic-search-eval/stargazers)
 [![GitHub Issues](https://img.shields.io/github/issues/machinelearningZH/semantic-search-eval.svg)](https://github.com/machinelearningZH/semantic-search-eval/issues)
 [![GitHub Pull Requests](https://img.shields.io/github/issues-pr/machinelearningZH/semantic-search-eval.svg)](https://img.shields.io/github/issues-pr/machinelearningZH/semantic-search-eval)
@@ -33,7 +33,7 @@
 ## Features
 - **Flexible model integration**: HuggingFace, OpenAI, BM25, and more.
 - **Simple YAML-based configuration**.
-- **Custom evaluation metrics**: e.g., Accuracy@k, Latency.
+- **Custom evaluation metrics**: e.g., Accuracy@k, NDCG@k, Latency.
 - **Integrated visualizations** via `seaborn`/`matplotlib`.
 
 ## Installation
@@ -67,6 +67,9 @@ You need two input files:
 > [!NOTE]
 > Embedding models have a maximum input length - more on this in the next section. If your documents exceed this length, they should be split into smaller chunks before evaluation to ensure compatibility with the models. All preprocessing (e.g., cleaning, tokenization) should be completed before evaluation, as it is not (yet) supported in this toolkit.
 
+> [!NOTE]
+> The current implementation of this toolkit assumes **exactly one relevant document per query**. All metrics (Accuracy@k, NDCG@k, etc.) are designed for this single-answer evaluation scenario. If you have multiple relevant documents per query, the current metrics will not produce meaningful results.
+
 ### Configuration
 Create a YAML config to define datasets, models, and metrics. Use [`configs/example.yaml`](configs/example.yaml) as a template.
 
@@ -76,7 +79,10 @@ Key fields:
 - `docs` and `queries`: paths to your documents and queries in CSV or parquet format
 - `is-public-data`: set to true to use OpenAI query creator if data is public
 - `max-len`: set to the **shortest model limit** to ensure fair evaluation with same input text length for all models
-- `models`: define model backends and options
+- `models`: define model backends and options. Supported backends are `huggingface`, `lexical`, and `open-ai`. HuggingFace models support the following optional parameters (check the usage examples on HuggingFace to see whether a model uses any of these parameters):
+  - `set_builtin_query_prompt` / `set_builtin_passage_prompt`: use a model's built-in named prompt for queries/passages
+  - `set_query_task_prompt` / `set_passage_task_prompt`: pass a task string to the encoder (e.g. for Jina models)
+  - `set_custom_query_prefix` / `set_custom_passage_prefix`: prepend a custom string to each query/passage at inference time (mutually exclusive with the built-in prompt options)
 
 ### OpenAI Key
 To use OpenAI-based features:
 
@@ -15,6 +15,7 @@ metrics:
   - accuracy@1 # Top-1 accuracy: checks if gold doc is rank 1
   - accuracy@5 # Top-5 accuracy: checks if gold doc is in top 5
   - accuracy@10 # Top-10 accuracy: same logic for top 10
+  - ndcg@10 # NDCG@10: rewards higher-ranked gold docs (position-aware)
   - latency # Time taken for full inference per model
 
 # Global max token length for truncation (needs to be based on smallest model max len for fair comparison)
@@ -26,24 +27,40 @@ models:
   lexical:
     bm25: de_core_news_sm # uses spacy model to lemmatize before indexing
 
-  intfloat:
-    intfloat-small: intfloat/multilingual-e5-small # max-len 512
-    intfloat-base: intfloat/multilingual-e5-base # max-len 512
-
-  intfloat-instruct:
-    intfloat-instruct: intfloat/multilingual-e5-large-instruct # max-len 512
-
   huggingface:
     jina-v2: jinaai/jina-embeddings-v2-base-de # max-len 8192
     all-MiniLM-v2: sentence-transformers/all-MiniLM-L6-v2 # max-len 512
     granite: ibm-granite/granite-embedding-278m-multilingual # max-len 512
     nomic: 
       model: nomic-ai/nomic-embed-text-v2-moe # max-len 512
-      use_query_prompt: true
-      use_passage_prompt: true
+      set_builtin_query_prompt: query
+      set_builtin_passage_prompt: passage
     snowflake:
       model: Snowflake/snowflake-arctic-embed-l-v2.0 # max-len 8192
-      use_query_prompt: true
+      set_builtin_query_prompt: query
+    jina-v3:
+      model: jinaai/jina-embeddings-v3 # max-len 8192
+      set_builtin_query_prompt: retrieval.query
+      set_builtin_passage_prompt: retrieval.passage
+      set_passage_task_prompt: retrieval.passage
+      set_query_task_prompt: retrieval.query
+    jina-v5-small:
+      model: jinaai/jina-embeddings-v5-text-small # max-len 32768
+      set_query_task_prompt: retrieval
+      set_passage_task_prompt: retrieval
+      set_builtin_passage_prompt: document
+      set_builtin_query_prompt: query
+    intfloat-small:
+      model: intfloat/multilingual-e5-small # max-len 512
+      set_custom_query_prefix: "query: "
+      set_custom_passage_prefix: "passage: "
+    intfloat-base:
+      model: intfloat/multilingual-e5-base # max-len 512
+      set_custom_query_prefix: "query: "
+      set_custom_passage_prefix: "passage: "
+    intfloat-instruct:
+      model: intfloat/multilingual-e5-large-instruct # max-len 512
+      set_custom_query_prefix: "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: "
 
   # open-ai:
   #   open-ai-3-small: text-embedding-3-small # max-len 8191 
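
Reviewer note on the prompt options in this config: the sketch below illustrates what the `set_custom_*_prefix` keys are described as doing (prepending a string to each text) and how a conflict with the built-in prompt keys could be detected. The helper names `apply_prefix` and `validate_prompt_options` are hypothetical and do not appear in this diff, and the check assumes the exclusivity applies per side (query vs. passage), which the README text does not spell out.

```python
from typing import Dict, List


def apply_prefix(texts: List[str], prefix: str) -> List[str]:
    """Prepend the same prefix to every text (hypothetical helper)."""
    return [prefix + text for text in texts]


def validate_prompt_options(options: Dict[str, str]) -> None:
    """Reject configs that set a built-in prompt and a custom prefix for the same side.

    Assumption: exclusivity is checked separately for queries and passages.
    """
    for side in ("query", "passage"):
        builtin_key = f"set_builtin_{side}_prompt"
        custom_key = f"set_custom_{side}_prefix"
        if builtin_key in options and custom_key in options:
            raise ValueError(f"{builtin_key} and {custom_key} cannot both be set")


# Options copied from the intfloat-small entry in configs/example.yaml
e5_options = {
    "set_custom_query_prefix": "query: ",
    "set_custom_passage_prefix": "passage: ",
}
validate_prompt_options(e5_options)  # passes: only custom prefixes are set

prefixed_queries = apply_prefix(
    ["Wo ist das Rathaus?"], e5_options["set_custom_query_prefix"]
)
```

Keeping the check separate from the prefixing step means a misconfigured model entry could fail at config load time rather than partway through an evaluation run.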
 
@@ -7,22 +7,24 @@ authors = [
 ]
 license = { file = "LICENSE" }
 readme = "README.md"
-requires-python = ">=3.11"
+requires-python = ">=3.13"
 dependencies = [
-    "accelerate>=1.6.0",
-    "einops>=0.8.1",
+    "accelerate>=1.13.0",
+    "einops>=0.8.2",
     "de-core-news-sm",
     "openai>=1.77.0",
     "ordered-set>=4.1.0",
-    "polars>=1.29.0",
-    "pytest-mock>=3.14.0",
+    "polars>=1.40.1",
+    "pytest-mock>=3.15.1",
     "rank-bm25>=0.2.2",
-    "ruamel-yaml>=0.18.10",
+    "ruamel-yaml>=0.19.1",
     "seaborn>=0.13.2",
-    "sentence-transformers>=4.1.0",
-    "spacy>=3.8.5",
-    "tiktoken>=0.9.0",
+    "sentence-transformers<=5.4.1",
+    "transformers<5.0.0",
+    "spacy>=3.8.14",
+    "tiktoken>=0.12.0",
     "dotenv>=0.9.9",
+    "peft>=0.19.1",
 ]
 
 [dependency-groups]
 
@@ -4,23 +4,23 @@
 
 from typing import Dict, Type
 
-from semsearcheval.metrics import Accuracy, Latency, Metric
+from semsearcheval.metrics import NDCG, Accuracy, Latency, Metric
 from semsearcheval.models import (
     BM25Model,
     HuggingFaceModel,
-    IntFloatInstructModel,
-    IntFloatModel,
     Model,
     OpenAIModel,
 )
 
 
 model_registry: Dict[str, Type[Model]] = {
     "huggingface": HuggingFaceModel,
-    "intfloat": IntFloatModel,
-    "intfloat-instruct": IntFloatInstructModel,
     "lexical": BM25Model,
     "open-ai": OpenAIModel,
 }
 
-metric_registry: Dict[str, Type[Metric]] = {"accuracy": Accuracy, "latency": Latency}
+metric_registry: Dict[str, Type[Metric]] = {
+    "accuracy": Accuracy,
+    "latency": Latency,
+    "ndcg": NDCG,
+}
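
Reviewer note on the registry change: configured metric names such as `ndcg@10` carry a cutoff, while the registry keys (`accuracy`, `latency`, `ndcg`) do not. The lookup code is not part of this diff, so the sketch below is only an assumption about how a name might be split into a registry key and an optional `k`. `split_metric_name` is a hypothetical helper, and the registry values are plain strings standing in for the metric classes.

```python
from typing import Optional, Tuple


def split_metric_name(name: str) -> Tuple[str, Optional[int]]:
    """Split 'ndcg@10' into ('ndcg', 10); names without '@' get no cutoff."""
    base, sep, raw_k = name.partition("@")
    if not sep:
        return base, None
    k = int(raw_k)  # raises ValueError for non-numeric cutoffs
    if k <= 0:
        raise ValueError(f"Invalid k value: {k}. Must be a positive integer.")
    return base, k


# Registry keys copied from the diff; values are stand-in labels, not the real classes
registry = {"accuracy": "Accuracy", "latency": "Latency", "ndcg": "NDCG"}

resolved = []
for metric_name in ["accuracy@5", "ndcg@10", "latency"]:
    key, k = split_metric_name(metric_name)
    resolved.append((registry[key], k))
```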
@@ -17,6 +17,15 @@ class Metric(ABC):
     def __init__(self, name: str) -> None:
         self.name = name
 
+    def _parse_k(self, name: str) -> int:
+        """Extracts the top-k cutoff from the metric name."""
+        if "@" not in name:
+            raise ValueError(f"Invalid metric name: {name}. Expected format: metric@k")
+        k = int(name.split("@")[1])
+        if k <= 0:
+            raise ValueError(f"Invalid k value: {k}. Must be a positive integer.")
+        return k
+
     @abstractmethod
     def compute(self, result: Result) -> Tuple[float, str]:
         pass
@@ -34,24 +43,18 @@ def __init__(self, name: str) -> None:
         super().__init__(name)
         self.k = self._parse_k(name)
 
-    def _parse_k(self, name: str) -> None:
-        """Extracts the top-k cutoff from the metric name."""
-        if "@" not in name:
-            raise ValueError(f"Invalid metric name: {name}. Expected format: accuracy@k")
-        k = int(name.split("@")[1])
-        if k <= 0:
-            raise ValueError(f"Invalid k value: {k}. Must be a positive integer.")
-        return k
-
     def compute(self, result: Result) -> Tuple[float, str]:
         """
         Compute top-k accuracy: proportion of queries where the gold document index
         appears in the top-k predicted documents (by similarity score).
         """
         correct_retrieved = 0
         for q_sims, gold_index in zip(result.similarity, result.gold_indices):
-            # Get indices of top-k most similar documents (descending order as higher similarity is better)
+            # Sort similarity scores in descending order (highest similarity first)
+            # and select the indices of the top-k documents
             top_k = np.argsort(q_sims)[::-1][: self.k]
+
+            # Check if the gold (correct) document is among the top-k predictions
             if gold_index in top_k:
                 correct_retrieved += 1
 
@@ -68,3 +71,44 @@ class Latency(Metric):
 
     def compute(self, result: Result) -> Tuple[float, str]:
         return result.time, "s"
+
+
+class NDCG(Metric):
+    """
+    Computes NDCG@k (Normalized Discounted Cumulative Gain) over all queries.
+
+    With a single relevant document per query the ideal DCG is always 1.0,
+    so NDCG simplifies to 1/log2(rank+1) if the gold doc is within the top k,
+    and 0 otherwise. The result is averaged across all queries.
+    The metric name should be in the form: ndcg@k.
+    """
+
+    def __init__(self, name: str) -> None:
+        super().__init__(name)
+        self.k = self._parse_k(name)
+
+    def compute(self, result: Result) -> Tuple[float, str]:
+        """
+        Compute NDCG@k. For each query, find the rank of the gold document
+        within the top k. Score is 1/log2(rank+1) if found, 0 otherwise.
+        """
+        total_ndcg = 0.0
+        for q_sims, gold_index in zip(result.similarity, result.gold_indices):
+            # Sort similarity scores in descending order and select top-k document indices
+            ranked = np.argsort(q_sims)[::-1][: self.k]
+
+            # Find if and where the gold document appears in the top-k ranked results
+            positions = np.where(ranked == gold_index)[0]
+
+            if len(positions) > 0:
+                # Convert 0-based position to 1-based rank
+                rank = positions[0] + 1
+
+                # Apply logarithmic discount: 1/log2(rank+1)
+                # This rewards higher-ranked results more than lower-ranked ones
+                total_ndcg += 1.0 / np.log2(rank + 1)
+            # If gold doc not in top-k, contributes 0 to the score
+
+        # Calculate average NDCG across all queries and convert to percentage
+        avg_ndcg = total_ndcg / result.similarity.shape[0] * 100
+        return avg_ndcg, "%"
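
Reviewer note on the new `NDCG` metric: a small worked example may help confirm the single-gold simplification described in the docstring. The sketch below re-implements the same formula in plain Python, independent of the `Result` class and of NumPy. The function name `ndcg_at_k` and the toy similarity rows are illustrative assumptions, not part of the PR. Tie-breaking may differ from `np.argsort(...)[::-1]`, so the toy data avoids ties.

```python
import math
from typing import Sequence


def ndcg_at_k(
    similarity: Sequence[Sequence[float]],
    gold_indices: Sequence[int],
    k: int,
) -> float:
    """Average single-gold NDCG@k over all queries, returned as a percentage."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not similarity:
        raise ValueError("at least one query is required")
    total = 0.0
    for sims, gold in zip(similarity, gold_indices):
        # Document indices ordered by descending similarity, cut off at k
        ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        if gold in ranked:
            rank = ranked.index(gold) + 1  # convert 0-based position to 1-based rank
            total += 1.0 / math.log2(rank + 1)
        # A gold document outside the top k contributes 0
    return total / len(similarity) * 100


scores = [
    [0.9, 0.1, 0.3],  # gold doc 0 is ranked 1st: 1 / log2(2) = 1.0
    [0.2, 0.8, 0.5],  # gold doc 2 is ranked 2nd: 1 / log2(3), about 0.631
    [0.7, 0.6, 0.1],  # gold doc 2 is ranked 3rd: outside the top 2, so 0
]
ndcg_at_2 = ndcg_at_k(scores, gold_indices=[0, 2, 2], k=2)
ndcg_at_3 = ndcg_at_k(scores, gold_indices=[0, 2, 2], k=3)
```

With `k=3` the third query contributes `1 / log2(4) = 0.5` instead of 0, which shows how the cutoff and the rank discount interact.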