A Tensor‐Based Multi‐Word Extraction and Embedding Algorithm for Domain‐Specific Contextualization
This paper proposes a novel algorithm combining Named Entity Recognition (NER), TF-IDF/BM25 ranking, and embeddings to identify domain-specific multi-word terms. The extracted terms are represented as tensors to contextualize the full essence of input documents, thereby enhancing downstream information retrieval or natural language understanding tasks. We compare our approach to other existing methods and demonstrate how it efficiently captures domain-relevant phrases while reducing semantic noise.
Multi-word entity extraction and representation are crucial for accurate contextual modeling in domains like law, healthcare, and finance, where precise terminology is essential. Traditional methods for multi-word term extraction often result in incomplete or irrelevant phrases that fail to capture the true essence of the subject. This paper presents an algorithm that combines statistical scoring (TF-IDF, BM25), NER, and embedding averaging to create document-level tensors capable of effectively representing domain-specific contexts. We evaluate the performance of this approach against baselines involving traditional extraction and embedding techniques.
The use of TF-IDF and BM25 for relevance ranking has been well-documented in information retrieval (Robertson & Zaragoza, 2009). Named Entity Recognition (NER) models, both rule-based (e.g., Stanford NER) and modern transformer-based models, are used for extracting important entities (Lample et al., 2016). Multi-prototype word embeddings, proposed by Reisinger & Mooney (2010), and hybrid retrieval models, like BM25-BERT (GitHub, 2020), also serve as inspiration for this work. While these techniques are individually effective, their integration to refine domain-specific phrase relevance is novel.
We define two sets of candidate tokens, Set A and Set B, from an input document ( D ):
- Set A (( A )): Tokens identified by a Named Entity Recognition (NER) model. The model is trained specifically for the domain using:
  - A list of popular abbreviations and domain-specific acronyms.
  - User queries with high click-through rates.
  - Terms extracted from the index section of domain-specific manuals, when applicable.
- Set B (( B )): N-grams selected by statistical scores derived from domain-specific corpora. We use TF-IDF, BM25, or Pointwise Mutual Information (PMI) to assign a relevance score to each n-gram in ( D ). Only n-grams whose score exceeds a predefined threshold ( \tau ) are included in Set B. Full documents are used for scoring, with longer documents split by chapter rather than by page to maintain contextual integrity.
In practice, ( A ) tends to contain well-established named entities, while ( B ) includes phrases that are highly relevant to the domain but may not be recognized by typical NER.
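The construction of Set B can be illustrated with a minimal, standard-library-only sketch. It is not the implementation used in our experiments (which obtain scores from a search index); it assumes whitespace tokenization, a smoothed IDF variant, and the hypothetical helper names `ngrams`, `tfidf_scores`, and `build_set_b`.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=3):
    """Yield every contiguous n-gram (space-joined) for n = 1..n_max."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def tfidf_scores(doc, corpus, n_max=3):
    """Score each n-gram of `doc` by TF-IDF against `corpus` (list of strings).

    Assumption: whitespace tokenization and a smoothed IDF,
    idf = ln((1 + N) / (1 + df)) + 1, chosen only for illustration.
    """
    doc_counts = Counter(ngrams(doc.lower().split(), n_max))
    corpus_sets = [set(ngrams(text.lower().split(), n_max)) for text in corpus]
    total = sum(doc_counts.values())
    scores = {}
    for gram, count in doc_counts.items():
        df = sum(gram in grams for grams in corpus_sets)  # document frequency
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1
        scores[gram] = (count / total) * idf
    return scores


def build_set_b(doc, corpus, tau, n_max=3):
    """Keep only the n-grams whose score strictly exceeds the threshold tau."""
    return {g for g, s in tfidf_scores(doc, corpus, n_max).items() if s > tau}


corpus = [
    "the court granted summary judgment",
    "the patient received treatment",
    "the bank issued a loan",
]
set_b = build_set_b(corpus[0], corpus, tau=0.1)
```

In this toy corpus, "the" occurs in every document and falls below the threshold, while "summary judgment" occurs in only one and is retained. The threshold value 0.1 is arbitrary and would need tuning per corpus.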
To obtain a refined list of highly relevant tokens, we compute the intersection (( A \cap B )) to form a set of core domain-relevant tokens:
- Intersection Set (( A \cap B )): Represents n-grams that are highly relevant based on both statistical scores and NER.
- Union Set Minus Intersection (( A \cup B - A \cap B )): Contains additional tokens that might be contextually important but are not as central as the intersection set.
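The two derived sets can be sketched with Python's built-in set operations. The normalization step (lowercasing and collapsing whitespace) is an assumption added here so that NER output and n-gram strings can be compared; `normalize` and `partition_candidates` are hypothetical names.

```python
def normalize(term):
    """Lowercase and collapse whitespace so both sets use comparable strings."""
    return " ".join(term.lower().split())


def partition_candidates(set_a, set_b):
    """Split candidates into the core set (A intersect B) and the remainder."""
    a = {normalize(t) for t in set_a}
    b = {normalize(t) for t in set_b}
    core = a & b                 # A intersect B
    peripheral = (a | b) - core  # (A union B) minus (A intersect B)
    return core, peripheral


set_a = {"Summary Judgment", "Supreme Court"}          # toy NER output
set_b = {"summary judgment", "breach of contract"}     # toy statistical output
core, peripheral = partition_candidates(set_a, set_b)
```

The second set is the symmetric difference of ( A ) and ( B ), so `a ^ b` would give the same result.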
For each token ( t \in A \cup B ), an embedding vector ( \mathbf{v}_t ) is generated using an embedding model such as BERT (Devlin et al., 2019). We define the document tensor ( T_D ) by stacking these vectors as rows:
[ T_D = \begin{bmatrix} \mathbf{v}_{t_1}^\top \\ \mathbf{v}_{t_2}^\top \\ \vdots \\ \mathbf{v}_{t_n}^\top \end{bmatrix} \in \mathbb{R}^{n \times d} ]
where ( t_1, \ldots, t_n ) are the tokens extracted from ( A \cup B ) and ( n ) is their number. Each ( \mathbf{v}_t ) is a vector of dimension ( d ), making ( T_D ) a matrix of size ( n \times d ). The tensor ( T_D ) captures the semantic relationships between different terms in the document.
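The construction of ( T_D ) can be sketched as follows. The function `toy_embed` is a hypothetical, deterministic stand-in for a real embedding model such as BERT; it produces meaningless pseudo-vectors and exists only to make the shape of the result concrete. Sorting the terms is also an assumption, made because sets have no inherent order.

```python
import hashlib


def toy_embed(term, dim=8):
    """Hypothetical stand-in for an embedding model (NOT a real BERT call).

    Maps a term to a deterministic pseudo-vector in [-1, 1]^dim (dim <= 32).
    """
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return [(byte / 255.0) * 2 - 1 for byte in digest[:dim]]


def build_document_tensor(terms, embed=toy_embed):
    """Stack one embedding per term into an n x d matrix (list of rows)."""
    ordered = sorted(terms)  # fix a row order so the result is reproducible
    return ordered, [embed(t) for t in ordered]


terms = {"summary judgment", "breach of contract", "supreme court"}
row_labels, tensor = build_document_tensor(terms)
n, d = len(tensor), len(tensor[0])  # n = number of terms, d = embedding size
```

Returning the row labels alongside the matrix keeps track of which term each row represents, which is useful when individual rows are inspected later.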
To further summarize the document's essence, an average pooling operation can be performed on the tensor:
[ \bar{\mathbf{v}}_D = \frac{1}{n} \sum_{t \in A \cup B} \mathbf{v}_t ]
This averaged vector ( \bar{\mathbf{v}}_D ) can be used as the document representation for downstream tasks.
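The average pooling step amounts to a column-wise mean over the rows of the ( n \times d ) matrix. A minimal sketch in plain Python, with the hypothetical helper name `mean_pool`:

```python
def mean_pool(tensor):
    """Average the rows of an n x d matrix (list of equal-length rows)."""
    if not tensor:
        raise ValueError("cannot pool an empty tensor")
    n = len(tensor)
    d = len(tensor[0])
    return [sum(row[j] for row in tensor) / n for j in range(d)]


# Two toy 2-dimensional term vectors.
doc_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
```

Every term contributes equally under this scheme; weighting rows differently would be a separate design choice not covered here.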
We evaluate the proposed algorithm on domain-specific corpora, including law and healthcare datasets. The corpora are indexed in Apache Solr to calculate TF-IDF/BM25 scores. We use the spaCy NER model to extract entities for Set A and a BERT-based embedding service to generate token embeddings.
To evaluate the quality of the extracted n-grams, we use precision, recall, and F1 score against manually annotated domain-specific phrases. For document representation quality, we use cosine similarity between averaged document embeddings and reference vectors.
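The metrics named above can be sketched in a few lines. This assumes exact string matching between extracted and annotated phrases, which is one possible matching criterion; the function names and the toy phrases are illustrative only.

```python
import math


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 using exact string matches."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


predicted = {"summary judgment", "court", "the"}
gold = {"summary judgment", "court", "plaintiff", "tort law"}
p, r, f1 = precision_recall_f1(predicted, gold)
```

With these toy sets, two of the three predicted phrases are correct and two of the four annotated phrases are found, giving precision 2/3, recall 1/2, and F1 4/7.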
The intersection set (( A \cap B )) consistently yields higher precision scores compared to using either Set A or Set B alone. Averaging embeddings of the intersection set provides a more accurate document representation, as evidenced by improved cosine similarity with reference vectors.
The proposed algorithm effectively balances statistical scoring and named entity recognition to produce a highly relevant set of domain-specific phrases. Representing these phrases as a tensor allows for a unified embedding that contextualizes the full essence of the document. Compared to traditional extraction and embedding techniques, our approach offers better precision and improved document representation.
We introduced a tensor-based algorithm for multi-word entity extraction and document representation that combines NER, TF-IDF/BM25 ranking, and embeddings. The use of an intersection-based filtering strategy ensures the extracted tokens are both statistically relevant and contextually significant. Future work includes optimizing the embedding aggregation technique to further enhance document-level understanding.
- Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval.
- Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural Architectures for Named Entity Recognition. Proceedings of NAACL.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
- Reisinger, J., & Mooney, R. J. (2010). Multi-Prototype Vector-Space Models of Word Meaning. Proceedings of NAACL-HLT.