Commit 6f78628

Rephrased documentation and readded formulae
1 parent 78a5a2a commit 6f78628

2 files changed

Lines changed: 35 additions & 5 deletions

docs/clustering.md

Lines changed: 27 additions & 3 deletions
@@ -121,11 +121,11 @@ By and large there are two types of methods that can be used for importance esti
121121

122122
| Importance method | Type | Description | Advantages |
123123
| - | - | - | - |
124-
| `linear` **(NEW)** | Semantic | Project words onto the parameter vectors of a linear classifier (LDA). | Topic differences are measured in embedding space and are determined by predictive power, and are therefore accurate and clean. |
125-
| `fighting-words` **(NEW)** | Lexical | Compute word importance based on cluster differences using the Fightin' Words algorithm by Monroe et al. | A theoretically motivated probabilistic model that was explicitly designed for discovering lexical differences in groups of text. |
126-
| `c-tf-idf` | Lexical | Compute how unique terms are in a cluster with a tf-idf style weighting scheme. This is the default in BERTopic. | Very fast, easy to understand and is not affected by cluster shape. |
127124
| `soft-c-tf-idf` *(default)* | Lexical | A c-tf-idf method that can interpret soft cluster assignments. | Can interpret soft cluster assignment in models like Gaussian Mixtures, less sensitive to stop words than vanilla c-tf-idf. |
125+
| `fighting-words` **(NEW)** | Lexical | Compute word importance based on cluster differences using the Fightin' Words algorithm by Monroe et al. | A theoretically motivated probabilistic model that was explicitly designed for discovering lexical differences in groups of text. See [Fightin' Words paper](https://languagelog.ldc.upenn.edu/myl/Monroe.pdf). |
126+
| `c-tf-idf` | Lexical | Compute how unique terms are in a cluster with a tf-idf style weighting scheme. This is the default in BERTopic. | Very fast, easy to understand and is not affected by cluster shape. |
128127
| `centroid` | Semantic | Word importance based on words' proximity to cluster centroid vectors. This is the default in Top2Vec. | Produces clean topics, easily interpretable. |
128+
| `linear` **(NEW, EXPERIMENTAL)** | Semantic | Project words onto the parameter vectors of a linear classifier (LDA). | Topic differences are measured in embedding space and are determined by predictive power, and are therefore accurate and clean. |
129129

130130

131131
!!! quote "Choose a term importance estimation method"
@@ -140,6 +140,30 @@ By and large there are two types of methods that can be used for importance esti
140140
model = ClusteringTopicModel(feature_importance="c-tf-idf")
141141
```
142142

143+
??? info "Click to see formulas"
144+
#### Soft-c-TF-IDF
145+
- Let $X$ be the document-term matrix, where each element $X_{ij}$ is the number of times word $j$ occurs in document $i$.
146+
- Estimate weight of term $j$ for topic $z$: <br>
147+
$tf_{zj} = \frac{t_{zj}}{w_z}$, where
148+
$t_{zj} = \sum_{i \in z} X_{ij}$ is the number of occurrences of a word in a topic and
149+
$w_{z}= \sum_{j} t_{zj}$ is the total number of word occurrences in the topic <br>
150+
- Estimate inverse document/topic frequency for term $j$:
151+
$idf_j = \log\left(\frac{N}{\sum_z |t_{zj}|}\right)$, where
152+
$N$ is the total number of documents.
153+
- Calculate importance of term $j$ for topic $z$:
154+
$\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
155+
156+
#### c-TF-IDF
157+
- Let $X$ be the document-term matrix, where each element $X_{ij}$ is the number of times word $j$ occurs in document $i$.
158+
- $tf_{zj} = \frac{t_{zj}}{w_z}$, where
159+
$t_{zj} = \sum_{i \in z} X_{ij}$ is the number of occurrences of a word in a topic and
160+
$w_{z}= \sum_{j} t_{zj}$ is the total number of word occurrences in the topic <br>
161+
- Estimate inverse document/topic frequency for term $j$:
162+
$idf_j = \log\left(1 + \frac{A}{\sum_z |t_{zj}|}\right)$, where
163+
$A = \frac{\sum_z \sum_j t_{zj}}{Z}$ is the average number of words per topic, and $Z$ is the number of topics.
164+
- Calculate importance of term $j$ for topic $z$:
165+
$\text{c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
166+
143167
=== "Centroid Proximity (Top2Vec)"
144168

145169
```python

turftopic/models/cluster.py

Lines changed: 8 additions & 2 deletions
@@ -33,7 +33,12 @@
3333
ClusterNode,
3434
LinkageMethod,
3535
)
36-
from turftopic.multimodal import Image, ImageRepr, MultimodalEmbeddings, MultimodalModel
36+
from turftopic.multimodal import (
37+
Image,
38+
ImageRepr,
39+
MultimodalEmbeddings,
40+
MultimodalModel,
41+
)
3742
from turftopic.types import VALID_DISTANCE_METRICS, DistanceMetric
3843
from turftopic.utils import safe_binarize
3944
from turftopic.vectorizers.default import default_vectorizer
@@ -453,7 +458,8 @@ def fit_predict(
453458
raw_documents: iterable of str
454459
Documents to fit the model on.
455460
y: None
456-
Originally ignored, in case of a dimensionality reduction that can utilize labels,
461+
Ignored when the dimensionality reduction is TSNE (the default).
462+
If the dimensionality reduction can utilize labels,
457463
you can pass labels to the model to inform the clustering process.
458464
embeddings: ndarray of shape (n_documents, n_dimensions), optional
459465
Precomputed document encodings.
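
The Soft-c-TF-IDF and c-TF-IDF formulas added to `docs/clustering.md` in this commit can be checked numerically with a small NumPy sketch. This is not the library's implementation: the function names, the `membership` matrix (one row per document, one column per topic, one-hot for hard clusters) and the toy data are all assumptions made for illustration. It also assumes every term occurs at least once and every topic contains at least one word, so no division by zero occurs.

```python
import numpy as np


def topic_term_counts(X, membership):
    """t_{zj}: (weighted) occurrences of term j in topic z, shape (n_topics, n_terms)."""
    return membership.T @ X


def soft_c_tf_idf(X, membership):
    """Soft-c-TF-IDF as written in the docs: tf_{zj} * log(N / sum_z |t_{zj}|)."""
    t = topic_term_counts(X, membership)
    w = t.sum(axis=1, keepdims=True)              # w_z: total words per topic
    tf = t / w                                    # tf_{zj}
    n_docs = X.shape[0]                           # N: total number of documents
    idf = np.log(n_docs / np.abs(t).sum(axis=0))  # idf_j
    return tf * idf


def c_tf_idf(X, membership):
    """c-TF-IDF as written in the docs: tf_{zj} * log(1 + A / sum_z |t_{zj}|)."""
    t = topic_term_counts(X, membership)
    w = t.sum(axis=1, keepdims=True)              # w_z
    tf = t / w                                    # tf_{zj}
    avg_words = t.sum() / t.shape[0]              # A: average number of words per topic
    idf = np.log(1 + avg_words / np.abs(t).sum(axis=0))
    return tf * idf


# Hypothetical toy corpus: 4 documents, 3 terms, 2 hard clusters.
X = np.array([[2, 0, 1],
              [1, 0, 0],
              [0, 3, 1],
              [0, 1, 2]], dtype=float)
labels = np.array([0, 0, 1, 1])
membership = np.eye(2)[labels]                    # one-hot document-topic matrix

soft_scores = soft_c_tf_idf(X, membership)
ctfidf_scores = c_tf_idf(X, membership)
print(soft_scores)
print(ctfidf_scores)
```

In this toy corpus the second and third terms each occur four times in total, which equals $N$, so their Soft-c-TF-IDF weight is exactly zero, while the c-TF-IDF variant keeps them positive because of the `1 +` inside the logarithm. Passing a row-stochastic `membership` matrix instead of a one-hot one is one plausible way to represent soft assignments, but that reading is an assumption rather than something the formulas above state.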
