Commit 059c656

Merge branch 'main' into logreg_clustering
2 parents d38bcfe + 2f2af9e commit 059c656

5 files changed

Lines changed: 114 additions & 63 deletions

README.md

Lines changed: 38 additions & 7 deletions
@@ -36,13 +36,13 @@ pip install turftopic
 If you intend to use CTMs, make sure to install the package with Pyro as an optional dependency.

 ```bash
-pip install turftopic[pyro-ppl]
+pip install "turftopic[pyro-ppl]"
 ```

 If you want to use clustering models like BERTopic or Top2Vec, install:

 ```bash
-pip install turftopic[umap-learn]
+pip install "turftopic[umap-learn]"
 ```

 ### Fitting a Model
@@ -52,6 +52,8 @@ scikit-learn workflows.

 Here's an example of how you use KeyNMF, one of our models on the 20Newsgroups dataset from scikit-learn.

+> If you are using a Mac, you might have to install the required SSL certificates on your system in order to be able to download the dataset.
+
 ```python
 from sklearn.datasets import fetch_20newsgroups

@@ -68,7 +70,8 @@ Turftopic also comes with interpretation tools that make it easy to display and
 ```python
 from turftopic import KeyNMF

-model = KeyNMF(20).fit(corpus)
+model = KeyNMF(20)
+document_topic_matrix = model.fit_transform(corpus)
 ```

 ### Interpreting Models
@@ -131,6 +134,8 @@ model.print_topic_distribution(

 Turftopic now allows you to automatically assign human readable names to topics using LLMs or n-gram retrieval!

+> You will need to `pip install "turftopic[openai]"` for this to work.
+
 ```python
 from turftopic import KeyNMF
 from turftopic.namers import OpenAITopicNamer
@@ -154,6 +159,8 @@ model.print_topics()

 You can use a set of custom vectorizers for topic modeling over **phrases**, as well as **lemmata** and **stems**.

+> You will need to `pip install "turftopic[spacy]"` for this to work.
+
 ```python
 from turftopic import BERTopic
 from turftopic.vectorizers.spacy import NounPhraseCountVectorizer
@@ -175,10 +182,34 @@ model.print_topics()

 ### Visualization

-Turftopic does not come with built-in visualization utilities, [topicwizard](https://github.com/x-tabdeveloping/topicwizard), an interactive topic model visualization library, is compatible with all models from Turftopic.
+Turftopic comes with a number of visualization and pretty printing utilities for specific models and specific contexts, such as hierarchical or dynamic topic modelling.
+You will find an overview of these in the [Interpreting and Visualizing Models](https://x-tabdeveloping.github.io/turftopic/model_interpretation/) section of our documentation.
+
+```
+pip install "turftopic[datamapplot, openai]"
+```
+
+```python
+from turftopic import ClusteringTopicModel
+from turftopic.namers import OpenAITopicNamer
+
+model = ClusteringTopicModel(feature_importance="centroid").fit(corpus)
+
+namer = OpenAITopicNamer("gpt-4o-mini")
+model.rename_topics(namer)
+
+fig = model.plot_clusters_datamapplot()
+fig.show()
+```
+
+<center>
+<img src="https://github.com/x-tabdeveloping/turftopic/blob/main/docs/images/cluster_datamapplot.png?raw=true" width="70%" style="margin-left: auto;margin-right: auto;">
+</center>
+
+In addition, Turftopic is natively supported in [topicwizard](https://github.com/x-tabdeveloping/topicwizard), an interactive topic model visualization library, is compatible with all models from Turftopic.

 ```bash
-pip install topic-wizard
+pip install "turftopic[topic-wizard]"
 ```

 By far the easiest way to visualize your models for interpretation is to launch the topicwizard web app.
@@ -189,10 +220,10 @@ import topicwizard
 topicwizard.visualize(corpus, model=model)
 ```

-<figure>
+<center>
 <img src="https://x-tabdeveloping.github.io/topicwizard/_images/screenshot_topics.png" width="70%" style="margin-left: auto;margin-right: auto;">
 <figcaption>Screenshot of the topicwizard Web Application</figcaption>
-</figure>
+</center>

 Alternatively you can use the [Figures API](https://x-tabdeveloping.github.io/topicwizard/figures.html) in topicwizard for individual HTML figures.

paper.bib

Lines changed: 13 additions & 9 deletions
@@ -170,13 +170,15 @@ @inproceedings{sentence_transformers
 abstract = "BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) has set a new state-of-the-art performance on sentence-pair regression tasks like semantic textual similarity (STS). However, it requires that both sentences are fed into the network, which causes a massive computational overhead: Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations ({\textasciitilde}65 hours) with BERT. The construction of BERT makes it unsuitable for semantic similarity search as well as for unsupervised tasks like clustering. In this publication, we present Sentence-BERT (SBERT), a modification of the pretrained BERT network that use siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity. This reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT. We evaluate SBERT and SRoBERTa on common STS tasks and transfer learning tasks, where it outperforms other state-of-the-art sentence embeddings methods."
 }

-@software{topicwizard,
-author = {Kardos, Márton},
-month = nov,
-title = {{topicwizard: Pretty and opinionated topic model visualization in Python}},
-url = {https://github.com/x-tabdeveloping/topic-wizard},
-version = {0.5.0},
-year = {2023}
+@misc{topicwizard,
+title={topicwizard -- a Modern, Model-agnostic Framework for Topic Model Visualization and Interpretation},
+author={Márton Kardos and Kenneth C. Enevoldsen and Kristoffer Laigaard Nielbo},
+year={2025},
+eprint={2505.13034},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2505.13034},
+doi="10.48550/arXiv.2505.13034"
 }

 @article{discourse_analysis,
@@ -218,7 +220,8 @@ @InProceedings{content_recommendation
 address="Cham",
 pages="247--263",
 abstract="We propose a plot-based recommendation system, which is based upon an evaluation of similarity between the plot of a video that was watched by a user and a large amount of plots stored in a movie database. Our system is independent from the number of user ratings, thus it is able to propose famous and beloved movies as well as old or unheard movies/programs that are still strongly related to the content of the video the user has watched. The system implements and compares the two Topic Models, Latent Semantic Allocation (LSA) and Latent Dirichlet Allocation (LDA), on a movie database of two hundred thousand plots that has been constructed by integrating different movie databases in a local NoSQL (MongoDB) DBMS. The topic models behaviour has been examined on the basis of standard metrics and user evaluations, performance assessments with 30 users to compare our tool with a commercial system have been conducted.",
-isbn="978-3-319-27030-2"
+isbn="978-3-319-27030-2",
+doi={10.1007/978-3-319-27030-2_16},
 }

 @article{unsupervised_classification,
@@ -248,7 +251,8 @@ @InProceedings{information_retrieval
 address="Berlin, Heidelberg",
 pages="29--41",
 abstract="We explore the utility of different types of topic models for retrieval purposes. Based on prior work, we describe several ways that topic models can be integrated into the retrieval process. We evaluate the effectiveness of different types of topic models within those retrieval approaches. We show that: (1) topic models are effective for document smoothing; (2) more rigorous topic models such as Latent Dirichlet Allocation provide gains over cluster-based models; (3) more elaborate topic models that capture topic dependencies provide no additional gains; (4) smoothing documents by using their similar documents is as effective as smoothing them by using topic models; (5) doing query expansion should utilize topics discovered in the top feedback documents instead of coarse-grained topics from the whole corpus; (6) generally, incorporating topics in the feedback documents for building relevance models can benefit the performance more for queries that have more relevant documents.",
-isbn="978-3-642-00958-7"
+isbn="978-3-642-00958-7",
+doi={10.1007/978-3-642-00958-7_6},
 }

 @misc{data_mixers,

paper.md

Lines changed: 6 additions & 4 deletions
@@ -34,7 +34,8 @@ bibliography: paper.bib

 # Summary

-Turftopic is a topic modelling library including a number of recent topic models that go beyond bag-of-words models and can understand text in context, utilizing representations from transformers.
+Topic models are machine learning techniques that are able to discover themes in a set of documents.
+Turftopic is a topic modelling library including a number of recent developments in topic modelling that go beyond bag-of-words models and can understand text in context, utilizing representations from transformers.
 Turftopic focuses on ease of use, providing a unified interface for a number of different modern topic models, and boasting both model-specific and model-agnostic interpretation and visualization utilities.
 While the user is afforded great flexibility in model choice and customization, the library comes with reasonable defaults, so as not to needlessly overwhelm first-time users.
 In addition, Turftopic allows the user to: a) model topics as they change over time, b) learn topics on-line from a stream of texts, c) find hierarchical structure in topics, d) learning topics in multilingual texts and corpora.
@@ -50,10 +51,11 @@ Some attempts have been made at creating unified packages for modern topic model
 These packages, however, have a focus on neural models and topic model evaluation, have abstract and highly specialized interfaces, and do not include some popular topic models.
 Additionally, while model interpretation is fundamental aspect of topic modelling, the interpretation utilities provided in these libraries are fairly limited, especially in comparison with model-specific packages, like BERTopic.

-Turftopic unifies state-of-the-art contextual topic models under a superset of the `scikit-learn` [@scikit-learn] API, which users are likely already familiar with, and can be readily included in `scikit-learn` workflows and pipelines.
+Turftopic unifies state-of-the-art contextual topic models under a superset of the `scikit-learn` [@scikit-learn] API, which many users may be familiar with, and can be readily included in `scikit-learn` workflows and pipelines.
 We focused on making Turftopic first and foremost an easy-to-use library that does not necessitate expert knowledge or excessive amounts of code to get started with, but gives great flexibility to power users.
-Furthermore, we included an extensive suite of pretty-printing and visualization utilities that aid users in interpreting their results.
-The library also includes three topic models, which to our knowledge only have implementations in Turftopic, these are: KeyNMF [@keynmf], Semantic Signal Separation (S^3^) [@s3], and GMM, a Gaussian Mixture model of document representations with a soft-c-tf-idf term weighting scheme.
+Furthermore, we included an extensive suite of pretty-printing and model-specific visualization utilities that aid users in interpreting their results.
+In addition, we provide native compatibility with `topicwizard` [@topicwizard], a model-agnostic topic model visualization library.
+The library also includes three topic models that, to our knowledge, only have implementations in Turftopic: KeyNMF [@keynmf], Semantic Signal Separation (S^3^) [@s3], and GMM, a Gaussian Mixture model of document representations with a soft-c-tf-idf term weighting scheme.

 # Functionality

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ profile = "black"

 [project]
 name = "turftopic"
-version = "0.17.1"
+version = "0.17.2"
 description = "Topic modeling with contextual representations from sentence transformers."
 authors = [
 { name = "Márton Kardos <power.up1163@gmail.com>", email = "martonkardos@cas.au.dk" }

turftopic/multimodal.py

Lines changed: 56 additions & 42 deletions
@@ -5,10 +5,8 @@

 import numpy as np
 from PIL import Image
-from sentence_transformers import SentenceTransformer

 from turftopic.data import TopicData
-from turftopic.encoders.multimodal import MultimodalEncoder

 UrlStr = str

@@ -42,6 +40,61 @@ class MultimodalEmbeddings(TypedDict):
     document_embeddings: np.ndarray


+def encode_multimodal(
+    encoder, sentences: list[str], images: list[ImageRepr]
+) -> dict[str, np.ndarray]:
+    """Produce multimodal embeddings of the documents passed to the model.
+
+    Parameters
+    ----------
+    encoder
+        MTEB or SentenceTransformer compatible embedding model.
+    sentences: list[str]
+        Textual documents to encode.
+    images: list[ImageRepr]
+        Corresponding images for each document.
+
+    Returns
+    -------
+    MultimodalEmbeddings
+        Text, image and joint document embeddings.
+    """
+    if len(sentences) != len(images):
+        raise ValueError("Images and documents were not the same length.")
+    if hasattr(encoder, "get_text_embeddings"):
+        text_embeddings = np.array(encoder.get_text_embeddings(sentences))
+    else:
+        text_embeddings = encoder.encode(sentences)
+    embedding_size = text_embeddings.shape[1]
+    images = list(_load_images(images))
+    if hasattr(encoder, "get_image_embeddings"):
+        image_embeddings = np.array(encoder.get_image_embeddings(images))
+    else:
+        image_embeddings = []
+        for image in images:
+            if image is not None:
+                image_embeddings.append(encoder.encode(image))
+            else:
+                image_embeddings.append(np.full(embedding_size, np.nan))
+        image_embeddings = np.stack(image_embeddings)
+    if hasattr(encoder, "get_fused_embeddings"):
+        document_embeddings = np.array(
+            encoder.get_fused_embeddings(
+                texts=sentences,
+                images=images,
+            )
+        )
+    else:
+        document_embeddings = _naive_join_embeddings(
+            text_embeddings, image_embeddings
+        )
+    return {
+        "text_embeddings": text_embeddings,
+        "image_embeddings": image_embeddings,
+        "document_embeddings": document_embeddings,
+    }
+
+
 class MultimodalModel:
     """Base model for multimodal topic models."""

@@ -65,46 +118,7 @@ def encode_multimodal(
             Text, image and joint document embeddings.

         """
-        if len(sentences) != len(images):
-            raise ValueError("Images and documents were not the same length.")
-        if hasattr(self.encoder_, "get_text_embeddings"):
-            text_embeddings = np.array(
-                self.encoder_.get_text_embeddings(sentences)
-            )
-        else:
-            text_embeddings = self.encoder_.encode(sentences)
-        embedding_size = text_embeddings.shape[1]
-        images = list(_load_images(images))
-        if hasattr(self.encoder_, "get_image_embeddings"):
-            image_embeddings = np.array(
-                self.encoder_.get_image_embeddings(images)
-            )
-        else:
-            image_embeddings = []
-            for image in images:
-                if image is not None:
-                    image_embeddings.append(self.encoder_.encode(image))
-                else:
-                    image_embeddings.append(np.full(embedding_size, np.nan))
-            image_embeddings = np.stack(image_embeddings)
-        print(image_embeddings)
-        if hasattr(self.encoder_, "get_fused_embeddings"):
-            document_embeddings = np.array(
-                self.encoder_.get_fused_embeddings(
-                    texts=sentences,
-                    images=images,
-                )
-            )
-        else:
-            document_embeddings = _naive_join_embeddings(
-                text_embeddings, image_embeddings
-            )
-
-        return {
-            "text_embeddings": text_embeddings,
-            "image_embeddings": image_embeddings,
-            "document_embeddings": document_embeddings,
-        }
+        return encode_multimodal(self.encoder_, sentences, images)

     @staticmethod
     def validate_embeddings(embeddings: Optional[MultimodalEmbeddings]):
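
The `turftopic/multimodal.py` change moves the encoding logic into a module-level function that checks for optional encoder methods with `hasattr` and falls back to a plain `encode` call. Below is a minimal sketch of that dispatch pattern, not the library's code. `ToyEncoder`, `toy_join` and `encode_pairs` are hypothetical names. `toy_join` is an assumed stand-in for `_naive_join_embeddings`, whose implementation is not shown in this diff. Plain strings stand in for images.

```python
import numpy as np


class ToyEncoder:
    """Hypothetical encoder that only exposes `encode` (not a Turftopic class)."""

    def encode(self, items):
        # A single string yields one vector; a list of strings yields a matrix.
        # Each vector is just the item's length repeated four times.
        if isinstance(items, str):
            return np.full(4, float(len(items)))
        return np.stack([np.full(4, float(len(item))) for item in items])


def toy_join(text_embeddings, image_embeddings):
    """Hypothetical join: average both modalities, skipping missing (NaN) images."""
    return np.nanmean(np.stack([text_embeddings, image_embeddings]), axis=0)


def encode_pairs(encoder, sentences, images):
    """Dispatch on optional encoder methods, falling back to `encode`."""
    if len(sentences) != len(images):
        raise ValueError("Images and documents were not the same length.")
    if hasattr(encoder, "get_text_embeddings"):
        text_embeddings = np.array(encoder.get_text_embeddings(sentences))
    else:
        text_embeddings = encoder.encode(sentences)
    embedding_size = text_embeddings.shape[1]
    if hasattr(encoder, "get_image_embeddings"):
        image_embeddings = np.array(encoder.get_image_embeddings(images))
    else:
        # Missing images become NaN rows so that row indices stay aligned.
        image_embeddings = np.stack(
            [
                encoder.encode(image)
                if image is not None
                else np.full(embedding_size, np.nan)
                for image in images
            ]
        )
    return {
        "text_embeddings": text_embeddings,
        "image_embeddings": image_embeddings,
        "document_embeddings": toy_join(text_embeddings, image_embeddings),
    }


# Strings stand in for images here; the second document has no image.
result = encode_pairs(ToyEncoder(), ["ab", "abcd"], ["xyz", None])
```

Because the function takes the encoder as an argument rather than reading `self.encoder_`, it can be called without constructing a model, which appears to be the point of the refactor.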
