docs/clustering.md
Lines changed: 27 additions & 3 deletions
@@ -121,11 +121,11 @@ By and large there are two types of methods that can be used for importance esti
121
121
122
122
| Importance method | Type | Description | Advantages |
123
123
| - | - | - | - |
124
-
|`linear`**(NEW)**| Semantic | Project words onto the parameter vectors of a linear classifier (LDA). | Topic differences are measured in embedding space and are determined by predictive power, and are therefore accurate and clean. |
125
-
|`fighting-words`**(NEW)**| Lexical | Compute word importance based on cluster differences using the Fightin' Words algorithm by Monroe et al. | A theoretically motivated probabilistic model that was explicitly designed for discovering lexical differences in groups of text. |
126
-
|`c-tf-idf`| Lexical | Compute how unique terms are in a cluster with a tf-idf style weighting scheme. This is the default in BERTopic. | Very fast, easy to understand and is not affected by cluster shape. |
127
124
|`soft-c-tf-idf`*(default)*| Lexical | A c-tf-idf method that can interpret soft cluster assignments. | Can interpret soft cluster assignment in models like Gaussian Mixtures, less sensitive to stop words than vanilla c-tf-idf. |
125
+
|`fighting-words`**(NEW)**| Lexical | Compute word importance based on cluster differences using the Fightin' Words algorithm by Monroe et al. | A theoretically motivated probabilistic model that was explicitly designed for discovering lexical differences in groups of text. See [Fightin' Words paper](https://languagelog.ldc.upenn.edu/myl/Monroe.pdf). |
126
+
|`c-tf-idf`| Lexical | Compute how unique terms are in a cluster with a tf-idf style weighting scheme. This is the default in BERTopic. | Very fast, easy to understand and is not affected by cluster shape. |
128
127
|`centroid`| Semantic | Word importance based on words' proximity to cluster centroid vectors. This is the default in Top2Vec. | Produces clean topics, easily interpretable. |
128
+
|`linear`**(NEW, EXPERIMENTAL)**| Semantic | Project words onto the parameter vectors of a linear classifier (LDA). | Topic differences are measured in embedding space and are determined by predictive power, and are therefore accurate and clean. |
129
129
130
130
131
131
!!! quote "Choose a term importance estimation method"
@@ -140,6 +140,30 @@ By and large there are two types of methods that can be used for importance esti
140
140
model = ClusteringTopicModel(feature_importance="c-tf-idf")
141
141
```
142
142
143
+
??? info "Click to see formulas"
144
+
#### Soft-c-TF-IDF
145
+
- Let $X$ be the document-term matrix, where each element $X_{ij}$ is the number of times word $j$ occurs in document $i$.
146
+
- Estimate weight of term $j$ for topic $z$: <br>
147
+
$tf_{zj} = \frac{t_{zj}}{w_z}$, where
148
+
$t_{zj} = \sum_{i \in z} X_{ij}$ is the number of occurrences of word $j$ in topic $z$ and
149
+
$w_{z}= \sum_{j} t_{zj}$ is the total number of word occurrences in topic $z$ <br>
150
+
- Estimate inverse document/topic frequency for term $j$:
151
+
$idf_j = \log\left(\frac{N}{\sum_z |t_{zj}|}\right)$, where
152
+
$N$ is the total number of documents.
153
+
- Calculate importance of term $j$ for topic $z$:
154
+
$\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
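
The steps above can be sketched in a few lines of plain Python. This is only an illustration of the formulas as written, not the library's implementation: the function name `soft_c_tf_idf`, the toy matrix and the use of one hard topic label per document are all assumptions made for the example.

```python
import math


def soft_c_tf_idf(X, labels, n_topics):
    """Hypothetical helper: term importance following the formulas above.

    X is a list of documents, each a list of term counts.
    labels holds one topic index per document (hard assignment, assumed here).
    """
    n_docs = len(X)
    n_terms = len(X[0])
    # t[z][j]: number of occurrences of term j in topic z
    t = [[0.0] * n_terms for _ in range(n_topics)]
    for row, z in zip(X, labels):
        for j, count in enumerate(row):
            t[z][j] += count
    # idf_j = log(N / sum_z t_zj); terms that never occur get 0 (assumption)
    totals = [sum(t[z][j] for z in range(n_topics)) for j in range(n_terms)]
    idf = [math.log(n_docs / total) if total > 0 else 0.0 for total in totals]
    importance = []
    for z in range(n_topics):
        w_z = sum(t[z])  # total word occurrences in topic z
        tf = [t[z][j] / w_z if w_z > 0 else 0.0 for j in range(n_terms)]
        importance.append([tf[j] * idf[j] for j in range(n_terms)])
    return importance


# Toy corpus: 4 documents, 3 terms, 2 topics
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
labels = [0, 0, 1, 1]
scores = soft_c_tf_idf(X, labels, n_topics=2)
```

In this toy corpus, term 0 only occurs in topic 0 and term 1 only in topic 1, so each is the highest-scoring term for its own topic, while the shared term 2 scores lower in both.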
155
+
156
+
#### c-TF-IDF
157
+
- Let $X$ be the document-term matrix, where each element $X_{ij}$ is the number of times word $j$ occurs in document $i$.
158
+
- $tf_{zj} = \frac{t_{zj}}{w_z}$, where
159
+
$t_{zj} = \sum_{i \in z} X_{ij}$ is the number of occurrences of word $j$ in topic $z$ and
160
+
$w_{z}= \sum_{j} t_{zj}$ is the total number of word occurrences in topic $z$ <br>
161
+
- Estimate inverse document/topic frequency for term $j$:
162
+
$idf_j = \log\left(1 + \frac{A}{\sum_z |t_{zj}|}\right)$, where
163
+
$A = \frac{\sum_z \sum_j t_{zj}}{Z}$ is the average number of words per topic, and $Z$ is the number of topics.
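
As a rough illustration, the term frequency and inverse frequency defined above can be computed as follows. The function name `c_tf_idf` and the toy data are hypothetical, and the final step (multiplying $tf_{zj}$ by $idf_j$) is an assumption carried over from the Soft-c-TF-IDF formula, since this excerpt does not show the combination step for c-TF-IDF.

```python
import math


def c_tf_idf(X, labels, n_topics):
    """Hypothetical helper: c-TF-IDF-style term importance.

    X is a list of documents, each a list of term counts.
    labels holds one topic index per document.
    """
    n_terms = len(X[0])
    # t[z][j]: number of occurrences of term j in topic z
    t = [[0.0] * n_terms for _ in range(n_topics)]
    for row, z in zip(X, labels):
        for j, count in enumerate(row):
            t[z][j] += count
    # A: average number of word occurrences per topic
    A = sum(sum(t_z) for t_z in t) / n_topics
    totals = [sum(t[z][j] for z in range(n_topics)) for j in range(n_terms)]
    # idf_j = log(1 + A / sum_z t_zj); terms that never occur get 0 (assumption)
    idf = [math.log(1 + A / total) if total > 0 else 0.0 for total in totals]
    importance = []
    for z in range(n_topics):
        w_z = sum(t[z])  # total word occurrences in topic z
        tf = [t[z][j] / w_z if w_z > 0 else 0.0 for j in range(n_terms)]
        # Assumed combination step: tf * idf
        importance.append([tf[j] * idf[j] for j in range(n_terms)])
    return importance


# Toy corpus: 4 documents, 3 terms, 2 topics
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
labels = [0, 0, 1, 1]
scores = c_tf_idf(X, labels, n_topics=2)
```

Because of the $1 +$ term inside the logarithm, $idf_j$ stays positive for every term that occurs at least once, unlike the plain $\log(N / \cdot)$ variant.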