This repository was archived by the owner on May 9, 2024. It is now read-only.

Commit 75e16cf

Added cross encoder re-ranker (#2)

* Added cross encoder re-ranker
* Update README.md
* Update README.md

1 parent a5acfc4, commit 75e16cf

6 files changed: 89 additions and 32 deletions

File tree

- README.md
- assets/e2e-embedding-reranking.png
- src/configs/vse_config_base.yml
- src/configs/vse_config_inc.yml
- src/run_quantize_inc.py
- src/run_query_search.py

README.md

Lines changed: 17 additions and 12 deletions
````diff
@@ -40,9 +40,12 @@ Deploying these techniques, the pipeline for building a semantic vertical search
 In some situations, an organization may also wish to fine-tune and retrain the pre-trained model with their own specialized dataset in order to improve the performance of the model to documents that an organization may have. For example, if an organization's documents are largely financial in nature, it could be useful to fine-tune these models so that they become aware of domain-specific jargon related to financial transactions or common phrases. In this reference kit, we do not demonstrate this process but more information on training and transfer learning techniques can be found at https://www.sbert.net/examples/training/sts/README.html.
 
+Moreover, if companies aim to enhance capabilities centered around the vertical search engine, it can serve as a retriever for custom documentation. The results from this retriever can subsequently be fed into a large language model, enabling context-aware responses for building a high-quality chatbot.
+
+
 ### Re-ranking
 
-In this reference kit, we focus on the document retrieval aspect of building a vertical search engine to obtain an initial list of the top-K most similar documents in the corpus for a given query. Often times, this is sufficient for building a feature rich system. However, in some situations, a 3rd component, the re-ranker, which is not included in this reference kit, could be added to the search pipeline to improve results. In this architecture, for a given query, the *document retrieval* step will use one model to rapidly obtain a list of the top-K documents (as shown in this reference kit), followed by a *re-ranking* step which will use a different model to re-order the list of K retrieved documents before returning to the user. The second re-ranking refinement step has been shown to improve user satisfaction, especially when fine-tuned on a custom corpus, but may be unnecessary as a starting point for building a functional vertical search engine. To extend this reference implementation with re-ranking, we direct you to https://www.sbert.net/examples/applications/retrieve_rerank/README.html for further details on implementation where Intel® oneAPI optimizations can also be applied to speed up re-ranking models.
+In this reference kit, we focus on the document retrieval aspect of building a vertical search engine to obtain an initial list of the top-K most similar documents in the corpus for a given query. Oftentimes, this is sufficient for building a feature-rich system. However, in some situations, a third component, the re-ranker, could be added to the search pipeline to improve results. In this architecture, for a given query, the *document retrieval* step will use one model to rapidly obtain a list of the top-K documents, followed by a *re-ranking* step which will use a different model to re-order the list of K retrieved documents before returning to the user. The second re-ranking refinement step has been shown to improve user satisfaction, especially when fine-tuned on a custom corpus, but may be unnecessary as a starting point for building a functional vertical search engine. To learn more about re-rankers, we direct you to https://www.sbert.net/examples/applications/retrieve_rerank/README.html for further details. In this reference kit, we use the `cross-encoder/ms-marco-MiniLM-L-6-v2` model as the re-ranker. For more details about different re-ranker models, visit https://www.sbert.net/docs/pretrained-models/ce-msmarco.html.
 
 ### Key Implementation Details
 
````
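The retrieve-then-rerank pipeline described above can be sketched in miniature. This is an illustrative sketch only: the toy token-overlap scorers below are hypothetical stand-ins for the bi-encoder (`sentence-transformers/msmarco-distilbert-base-tas-b`) and the cross-encoder (`cross-encoder/ms-marco-MiniLM-L-6-v2`); only the two-stage shape mirrors the kit.

```python
# Minimal retrieve-then-rerank sketch. The scoring functions are toy
# stand-ins for the bi-encoder and cross-encoder models.

def retrieve(query, corpus, top_k=5):
    """Stage 1: fast retrieval -- score every document, keep the top-k."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    ranked = sorted(range(len(corpus)), key=lambda i: overlap(corpus[i]), reverse=True)
    return [{'corpus_id': i, 'score': overlap(corpus[i])} for i in ranked[:top_k]]

def rerank(query, corpus, hits, cross_score):
    """Stage 2: re-score only the k retrieved documents with a slower scorer."""
    for hit in hits:
        hit['cross-score'] = cross_score(query, corpus[hit['corpus_id']])
    return sorted(hits, key=lambda h: h['cross-score'], reverse=True)

corpus = [
    "quarterly financial report for shareholders",
    "financial transactions and wire transfer fees",
    "company picnic schedule",
]
query = "fees for financial transactions"
hits = retrieve(query, corpus, top_k=2)
hits = rerank(query, corpus, hits,
              cross_score=lambda q, d: len(set(q.split()) & set(d.split())))
```

In the real kit, stage 1 compares dense embeddings across the whole corpus, while stage 2 runs the heavier cross-encoder model only on the K retrieved hits.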
````diff
@@ -55,7 +58,7 @@ The reference kit implementation is a reference solution to the described use ca
 
 ### E2E Architecture
 
-![Use_case_flow](assets/e2e-embedding-original.png)
+![Use_case_flow](assets/e2e-embedding-reranking.png)
 
 ### Expected Input-Output
 
````
````diff
@@ -204,14 +207,16 @@ optional arguments:
   --benchmark_mode      toggle to benchmark embedding
   --n_runs N_RUNS       number of iterations to benchmark embedding
   --intel               use intel pytorch extension to optimize model
+  --use_re_ranker       toggle to use cross encoder re-ranker model
+  --input_corpus INPUT_CORPUS path to corpus to embed
 ```
 
 To perform realtime query search using the above set of saved corpus embeddings and the provided configuration file, which points to the saved embeddings file, we can run the commands:
 
 ```shell
 cd src
 conda activate vse_stock
-python run_query_search.py --vse_config configs/vse_config_base.yml --input_queries ../data/test_queries.csv --output_file ../saved_output/rankings.json
+python run_query_search.py --vse_config configs/vse_config_base.yml --input_queries ../data/test_queries.csv --output_file ../saved_output/rankings.json --use_re_ranker --input_corpus ../data/corpus_abbreviated.csv
 cd ..
 ```
 
````
````diff
@@ -256,7 +261,7 @@ This reference kit extends to demonstrate the advantages of using the Intel® Ex
 
 ![Model Quantization](assets/embedding-optimized.png)
 
-### IIntel® Optimized Offline Realtime Query Search Decision Flow
+### Intel® Optimized Offline Realtime Query Search Decision Flow
 
 ![Optimized Execution](assets/realtime-search-optimized.png)
 
````
````diff
@@ -322,7 +327,7 @@ To perform query searches with these additional optimizations and the `ipexrun`
 ```shell
 cd src
 conda activate vse_intel
-ipexrun --use_logical_core --enable_tcmalloc run_query_search.py --vse_config configs/vse_config_base.yml --input_queries ../data/test_queries.csv --output_file ../saved_output/rankings.json --intel
+ipexrun --use_logical_core --enable_tcmalloc run_query_search.py --vse_config configs/vse_config_base.yml --input_queries ../data/test_queries.csv --output_file ../saved_output/rankings.json --intel --use_re_ranker --input_corpus ../data/corpus_abbreviated.csv
 cd ..
 ```
 
````
````diff
@@ -348,15 +353,15 @@ optional arguments:
   --batch_size BATCH_SIZE batch size to use. Defaults to 32.
   --save_model_dir SAVE_MODEL_DIR directory to save the quantized model to
   --inc_config_file INC_CONFIG_FILE INC conf yaml
-
+  --use_re_ranker       toggle to use cross encoder re-ranker model
 ```
 
 which can be used for our models as follows:
 
 ```shell
 cd src
 conda activate vse_intel
-python run_quantize_inc.py --query_file ../data/quant_queries.csv --corpus_file ../data/corpus_quantization.csv --ground_truth_file ../data/ground_truth_quant.csv --vse_config configs/vse_config_inc.yml --save_model_dir ../saved_models/inc_int8 --inc_config_file conf.yml
+python run_quantize_inc.py --query_file ../data/quant_queries.csv --corpus_file ../data/corpus_quantization.csv --ground_truth_file ../data/ground_truth_quant.csv --vse_config configs/vse_config_inc.yml --save_model_dir ../saved_models/inc_int8 --inc_config_file conf.yml --use_re_ranker
 cd ..
 ```
 
````
````diff
@@ -398,7 +403,7 @@ To do realtime query searching, we can run the commands:
 ```shell
 cd src
 conda activate vse_intel
-ipexrun --use_logical_core --enable_tcmalloc run_query_search.py --vse_config configs/vse_config_inc.yml --input_queries ../data/test_queries.csv --output_file ../saved_output/rankings.json --intel
+ipexrun --use_logical_core --enable_tcmalloc run_query_search.py --vse_config configs/vse_config_inc.yml --input_queries ../data/test_queries.csv --output_file ../saved_output/rankings.json --intel --use_re_ranker --input_corpus ../data/corpus_abbreviated.csv
 cd ..
 ```
 
````
````diff
@@ -518,7 +523,7 @@ To replicate the performance experiments described above, do the following:
 python run_document_embedder.py --vse_config configs/vse_config_base.yml --input_corpus ../data/corpus_abbreviated.csv --output_file ../saved_output/embeddings.pkl --batch_size 64
 
 # Run benchmarks on single query search
-python run_query_search.py --vse_config configs/vse_config_base.yml --input_queries ../data/test_queries.csv --benchmark_mode --n_runs 10000 --batch_size 1 --logfile ../logs/stock.log
+python run_query_search.py --vse_config configs/vse_config_base.yml --input_queries ../data/test_queries.csv --benchmark_mode --n_runs 10000 --batch_size 1 --logfile ../logs/stock.log --input_corpus ../data/corpus_abbreviated.csv
 ```
 
 6. For the intel environment, run the following to run and log results to the ../logs/intel.log file
````
````diff
@@ -540,7 +545,7 @@ To replicate the performance experiments described above, do the following:
 ipexrun --use_logical_core --enable_tcmalloc run_document_embedder.py --vse_config configs/vse_config_base.yml --input_corpus ../data/corpus_abbreviated.csv --output_file ../saved_output/embeddings.pkl --batch_size 64 --intel
 
 # Run single query search experiments using IPEX
-ipexrun --use_logical_core --enable_tcmalloc run_query_search.py --vse_config configs/vse_config_base.yml --input_queries ../data/test_queries.csv --benchmark_mode --n_runs 10000 --batch_size 1 --logfile ../logs/intel.log --intel
+ipexrun --use_logical_core --enable_tcmalloc run_query_search.py --vse_config configs/vse_config_base.yml --input_queries ../data/test_queries.csv --benchmark_mode --n_runs 10000 --batch_size 1 --logfile ../logs/intel.log --intel --input_corpus ../data/corpus_abbreviated.csv
 
 # Quantize the model using INC (long run time!)
 python run_quantize_inc.py --query_file ../data/quant_queries.csv --corpus_file ../data/corpus_quantization.csv --ground_truth_file ../data/ground_truth_quant.csv --vse_config configs/vse_config_inc.yml --save_model_dir ../saved_models/inc_int8 --inc_config_file conf.yml
````
````diff
@@ -553,7 +558,7 @@ To replicate the performance experiments described above, do the following:
 ipexrun --use_logical_core --enable_tcmalloc run_document_embedder.py --vse_config configs/vse_config_inc.yml --input_corpus ../data/corpus_abbreviated.csv --logfile ../logs/intel_inc_int8.log --batch_size 128 --benchmark_mode --intel
 
 # Run single query search experiments using INC INT8
-ipexrun --use_logical_core --enable_tcmalloc run_query_search.py --vse_config configs/vse_config_inc.yml --input_queries ../data/test_queries.csv --benchmark_mode --n_runs 10000 --batch_size 1 --logfile ../logs/intel_inc_int8.log --intel
+ipexrun --use_logical_core --enable_tcmalloc run_query_search.py --vse_config configs/vse_config_inc.yml --input_queries ../data/test_queries.csv --benchmark_mode --n_runs 10000 --batch_size 1 --logfile ../logs/intel_inc_int8.log --intel --input_corpus ../data/corpus_abbreviated.csv
 
 ```
 
````
````diff
@@ -570,4 +575,4 @@ To replicate the performance experiments described above, do the following:
 ```bash
 apt install libgl1-mesa-glx
 ```
-
+
````

assets/e2e-embedding-reranking.png

Binary file added (83.2 KB)

src/configs/vse_config_base.yml

Lines changed: 3 additions and 3 deletions
```diff
@@ -5,11 +5,11 @@ version: 1.0
 model:
   format: default # default, inc, pt
   pretrained_model: sentence-transformers/msmarco-distilbert-base-tas-b
+  cross_encoder: cross-encoder/ms-marco-MiniLM-L-6-v2
   max_seq_length: 128
 
 # inference config
-inference:
+inference:
   top_k : 5
   score_function : dot # cos_sim, dot
-  corpus_embeddings_path : ../saved_output/embeddings.pkl
-
+  corpus_embeddings_path : ../saved_output/embeddings.pkl
```

src/configs/vse_config_inc.yml

Lines changed: 2 additions and 1 deletion
```diff
@@ -5,13 +5,14 @@ version: 1.0
 model:
   format: inc # default, inc, pt
   pretrained_model: sentence-transformers/msmarco-distilbert-base-tas-b
+  cross_encoder: cross-encoder/ms-marco-MiniLM-L-6-v2
   max_seq_length: 128
 
 # inc required config parameters
 path: ../saved_models/inc_int8
 
 # inference config
-inference:
+inference:
   top_k : 5
   score_function : dot # cos_sim, dot
   corpus_embeddings_path : ../saved_output/embeddings.pkl
```
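Both configs are consumed after YAML parsing into a nested dict. A sketch of that consumption (a plain dict stands in for the `yaml.safe_load` result, and `pick_score_function` is a hypothetical helper mirroring the kit's dot/cos_sim selection):

```python
# Stand-in for the dict produced by parsing vse_config_base.yml.
conf = {
    'model': {
        'format': 'default',
        'pretrained_model': 'sentence-transformers/msmarco-distilbert-base-tas-b',
        'cross_encoder': 'cross-encoder/ms-marco-MiniLM-L-6-v2',
        'max_seq_length': 128,
    },
    'inference': {
        'top_k': 5,
        'score_function': 'dot',
        'corpus_embeddings_path': '../saved_output/embeddings.pkl',
    },
}

def pick_score_function(conf):
    """Mirror the kit's selection: cos_sim by default, dot when configured."""
    return 'dot_score' if conf['inference']['score_function'] == 'dot' else 'cos_sim'

# The new cross_encoder key names the re-ranker model to load.
model_name = conf['model']['cross_encoder']
score_fn = pick_score_function(conf)
```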

src/run_quantize_inc.py

Lines changed: 25 additions and 5 deletions
```diff
@@ -25,7 +25,7 @@
     PreTrainedTokenizer,
     PreTrainedModel
 )
-
+from sentence_transformers.cross_encoder import CrossEncoder
 from utils.dataloader import (
     load_queries, load_corpus, QueryDataset, CorpusDataset
 )
```
```diff
@@ -35,11 +35,13 @@
 def quantize_model(
         tokenizer: PreTrainedTokenizer,
         embedder: PreTrainedModel,
+        cross_encoder: CrossEncoder,
         queries: QueryDataset,
         corpus: CorpusDataset,
         inc_config_file: str,
         score_func,
         gt,
+        use_re_ranker: bool = True,
         top_k: int = 5,
         max_seq_length: int = 128,
         batch_size: int = 64):
```
```diff
@@ -93,7 +95,16 @@ def evaluate_contains_top_entries(model_q) -> float:
             corpus_embeddings,
             top_k=top_k,
             score_function=score_func)
-
+
+        ### Added Re-Ranker
+        if use_re_ranker:
+            for i in range(len(res)):
+                cross_inp = [[queries[i], corpus[entry['corpus_id']]] for entry in res[i]]
+                cross_scores = cross_encoder.predict(cross_inp)
+                for idx in range(len(cross_scores)):
+                    res[i][idx]['cross-score'] = float(cross_scores[idx])
+                res[i] = sorted(res[i], key=lambda x: x['cross-score'], reverse=True)
+        #######
         correct = 0
         for idx, query_ranking in enumerate(res):
             matches = []
```
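The re-ranking loop added in this hunk can be exercised in isolation. Below, a stub `predict` (a hypothetical stand-in for `CrossEncoder.predict`, which returns one relevance score per query-document pair) drives the same attach-score-and-sort logic:

```python
class StubCrossEncoder:
    """Stand-in for sentence_transformers' CrossEncoder: scores each
    (query, document) pair; here, score = shared-token count."""
    def predict(self, pairs):
        return [len(set(q.split()) & set(d.split())) for q, d in pairs]

queries = ["apple banana", "cherry"]
corpus = ["banana bread", "apple banana smoothie", "cherry pie"]
# Top-k hits per query, as produced by the bi-encoder retrieval stage.
res = [
    [{'corpus_id': 0}, {'corpus_id': 1}],
    [{'corpus_id': 2}],
]

cross_encoder = StubCrossEncoder()
# Same loop as the commit: score each retrieved doc against its query,
# attach 'cross-score', and re-sort hits in descending score order.
for i in range(len(res)):
    cross_inp = [[queries[i], corpus[entry['corpus_id']]] for entry in res[i]]
    cross_scores = cross_encoder.predict(cross_inp)
    for idx in range(len(cross_scores)):
        res[i][idx]['cross-score'] = float(cross_scores[idx])
    res[i] = sorted(res[i], key=lambda x: x['cross-score'], reverse=True)
```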
```diff
@@ -170,7 +181,7 @@ def main(flags) -> None:
         conf['model']['pretrained_model'])
     embedder = AutoModel.from_pretrained(conf['model']['pretrained_model'])
     embedder.eval()
-
+    cross_encoder = CrossEncoder(conf['model']['cross_encoder'])
     score_func = util.cos_sim
     if conf['inference']['score_function'] == 'dot':
         score_func = util.dot_score
```
```diff
@@ -181,14 +192,16 @@ def main(flags) -> None:
     quantized_model = quantize_model(
         tokenizer,
         embedder,
+        cross_encoder,
         query_dataset,
         corpus_dataset,
         flags.inc_config_file,
         score_func,
         ground_truth,
         top_k=conf['inference']['top_k'],
         max_seq_length=conf["model"]["max_seq_length"],
-        batch_size=64)
+        batch_size=64,
+        use_re_ranker=flags.use_re_ranker)
     quantized_model.save(flags.save_model_dir)
 
 
```
```diff
@@ -230,6 +243,13 @@ def main(flags) -> None:
                         required=True
                         )
 
+    parser.add_argument('--use_re_ranker',
+                        required=False,
+                        help="Use cross encoder re-ranking",
+                        action="store_true",
+                        default=True
+                        )
+
     FLAGS = parser.parse_args()
 
-    main(FLAGS)
+    main(FLAGS)
```
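One argparse detail worth noting in the flag added above: with `action="store_true"`, passing the flag stores `True`, and `default` applies only when the flag is absent, so `default=True` makes the parsed value `True` either way, meaning re-ranking is always enabled during quantization. A quick check:

```python
import argparse

# Reproduce the flag definition from the commit.
parser = argparse.ArgumentParser()
parser.add_argument('--use_re_ranker',
                    required=False,
                    help="Use cross encoder re-ranking",
                    action="store_true",
                    default=True)

with_flag = parser.parse_args(['--use_re_ranker'])
without_flag = parser.parse_args([])
# store_true can only set the value to True; combined with default=True,
# the flag can never yield False.
```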

src/run_query_search.py

Lines changed: 42 additions and 11 deletions
```diff
@@ -29,21 +29,23 @@
     PreTrainedTokenizer,
     PreTrainedModel
 )
-
-from utils.dataloader import load_queries
-from utils.embed import encode, batch_encode
-
+from sentence_transformers.cross_encoder import CrossEncoder
+from utils.dataloader import load_queries, load_corpus, CorpusDataset
+from utils.embed import encode, batch_encode
 random.seed(0)
 
 
 def search_query(
         logger: logging.Logger,
         tokenizer: PreTrainedTokenizer,
         embedder: PreTrainedModel,
+        cross_encoder: CrossEncoder,
         queries: List[str],
         corpus_embeddings: np.ndarray,
         idx_to_ids: Dict[int, str],
+        corpus: CorpusDataset,
         top_k: int = 5,
+        use_re_ranker: bool = True,
         max_sequence_length: int = 128,
         batch_size: int = 1,
         n_runs: int = 100,
```
```diff
@@ -65,8 +67,10 @@ def search_query(
             Pre-embedded corpus of documents.
         idx_to_ids: (Dict[int, int]):
             Map of embedding index to document ids.
+        corpus (CorpusDataset):
+            CorpusDataset to embed.
         top_k (int, optional):
-            Number of entries similar corpus documents to return.
+            Number of entries similar corpus documents to return.
         max_sequence_length (int, optional):
             max sequence length. Defaults to 128.
         batch_size (int, optional):
```
```diff
@@ -139,10 +143,19 @@ def search_query(
             corpus_embeddings,
             top_k=top_k,
             score_function=score_func)
-
+
+        if use_re_ranker and cross_encoder!=None:
+            ### Prepare inp for cross_encoder using output of bi_encoder
+            for i in range(len(out)):
+                cross_inp = [[queries[i], corpus[entry['corpus_id']]] for entry in out[i]]
+                cross_scores = cross_encoder.predict(cross_inp)
+                for idx in range(len(cross_scores)):
+                    out[i][idx]['cross-score'] = float(cross_scores[idx])
+                out[i] = sorted(out[i], key=lambda x: x['cross-score'], reverse=True)
+
         # map index based ids to raw corpus_ids
-        for res in out:
-            for entry in res:
+        for i in range(len(out)):
+            for entry in out[i]:
                 entry['corpus_id'] = idx_to_ids[entry['corpus_id']]
 
         if output_file is not None:
```
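The remapping step at the end of this hunk relies on `idx_to_ids = dict(enumerate(ids))`, built later in `main`, which maps an embedding's position back to its raw corpus id. A small self-contained illustration (the ids and scores are made up):

```python
# Raw corpus ids in embedding order, as loaded from the saved embeddings file.
ids = ['doc-101', 'doc-202', 'doc-303']
idx_to_ids = dict(enumerate(ids))  # position -> raw corpus id

# Search results carry positional ids; remap them to raw corpus ids,
# mirroring the loop in the commit.
out = [[{'corpus_id': 2, 'score': 0.9}, {'corpus_id': 0, 'score': 0.4}]]
for i in range(len(out)):
    for entry in out[i]:
        entry['corpus_id'] = idx_to_ids[entry['corpus_id']]
```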
```diff
@@ -186,6 +199,7 @@ def main(flags):
 
         # load the pretrained embedding model
         embedder = AutoModel.from_pretrained(conf['model']['pretrained_model'])
+        cross_encoder = CrossEncoder(conf['model']['cross_encoder'])
 
     elif conf["model"]["format"] == "inc":
 
```
```diff
@@ -194,7 +208,7 @@ def main(flags):
 
         embedder = AutoModel.from_pretrained(conf['model']['pretrained_model'])
         embedder = load(conf["model"]["path"], embedder)
-
+        cross_encoder = CrossEncoder(conf['model']['cross_encoder'])
     # re-establish logger because it breaks from above
     logging.getLogger().handlers.clear()
 
```
```diff
@@ -222,7 +236,6 @@ def main(flags):
     if flags.intel:
         import intel_extension_for_pytorch as ipex
         embedder = ipex.optimize(embedder, dtype=torch.float32)
-
     sample_inputs = tokenizer.batch_decode([
         random.sample(
             range(tokenizer.vocab_size), max_sequence_length) for
```
```diff
@@ -252,16 +265,21 @@ def main(flags):
     corpus_embeddings = saved_embeddings['embeddings']
     idx_to_ids = dict(enumerate(ids))
 
+    # read in corpus dataset
+    corpus = load_corpus(flags.input_corpus)
     search_query(
         logger=logger,
         tokenizer=tokenizer,
         embedder=embedder,
+        cross_encoder=cross_encoder,
         queries=input_file.queries,
         corpus_embeddings=corpus_embeddings,
         idx_to_ids=idx_to_ids,
+        corpus=corpus,
         top_k=conf['inference']['top_k'],
         max_sequence_length=max_sequence_length,
         batch_size=flags.batch_size,
+        use_re_ranker=flags.use_re_ranker,
         n_runs=flags.n_runs,
         score=conf['inference']['score_function'],
         output_file=flags.output_file,
```
```diff
@@ -289,6 +307,12 @@ def main(flags):
                         type=str
                         )
 
+    parser.add_argument('--input_corpus',
+                        required=True,
+                        help="path to corpus to embed",
+                        type=str
+                        )
+
     parser.add_argument('--output_file',
                         required=False,
                         help="file to output top k documents to",
```
```diff
@@ -310,6 +334,13 @@ def main(flags):
                         default=False
                         )
 
+    parser.add_argument('--use_re_ranker',
+                        required=False,
+                        help="Use cross encoder reranking",
+                        action="store_true",
+                        default=False
+                        )
+
     parser.add_argument('--n_runs',
                         required=False,
                         help="number of iterations to benchmark embedding",
```
```diff
@@ -326,4 +357,4 @@ def main(flags):
 
     FLAGS = parser.parse_args()
 
-    main(FLAGS)
+    main(FLAGS)
```
