BERTopic Embeddings and Visualizations

This document describes the embedding-based features added to support BERTopic topic models.

Overview

BERTopic models use neural embeddings to represent topics in a high-dimensional semantic space. These embeddings enable advanced visualizations and similarity metrics that are not available for traditional LDA/HLDA models.

Features Implemented

1. Topic Embedding Storage

File: import_tool/analysis/interfaces/bertopic_analysis.py

Topic embeddings are now automatically saved during BERTopic analysis:

Location: analyses/{analysis_name}/topic_embeddings.json
Format: JSON dictionary mapping topic IDs to embedding vectors
Source: BERTopic's topic_embeddings_ attribute (centroid embeddings for each topic)

Implementation:

# Module-level imports used by this method
import io
import json

def _save_topic_embeddings(self, topic_model):
    """Save topic embeddings for visualization and similarity calculations."""
    embeddings_dict = {}
    topic_info = topic_model.get_topic_info()

    # Rows of get_topic_info() are assumed to line up with topic_embeddings_.
    for idx, topic_id in enumerate(topic_info['Topic']):
        if topic_id != -1:  # Skip the outlier topic
            embedding = topic_model.topic_embeddings_[idx].tolist()
            # JSON object keys must be strings, so store the ID as text
            embeddings_dict[str(topic_id)] = embedding

    with io.open(self.topic_embeddings_file, 'w', encoding='utf-8') as f:
        json.dump(embeddings_dict, f)
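
Reading the file back is the mirror image of the save step. The following is a minimal sketch, not the project's own loader: the function name load_topic_embeddings is hypothetical, and the only thing it assumes about the file is the format described above (a JSON dictionary mapping topic IDs to embedding vectors).

```python
import json


def load_topic_embeddings(path):
    """Read topic_embeddings.json and return {topic_id (int): [float, ...]}."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    # JSON object keys are always strings, so convert them back to integers.
    return {int(topic_id): vector for topic_id, vector in raw.items()}


# Example call (the path follows the Location pattern given above):
# embeddings = load_topic_embeddings("analyses/bertopicauto/topic_embeddings.json")
```

Converting the keys back to integers makes it easier to match them against the topic IDs reported by BERTopic.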

2. Embedding Distance Metric

File: import_tool/metric/topic/pairwise/embedding_distance.py

A new pairwise metric measures how semantically far apart two topics are, using cosine distance in embedding space. Smaller values mean more similar topics.

Key Features:

Automatically computed for all BERTopic analyses
Uses cosine distance: 1 - cosine_similarity
Range: 0 (vectors point in the same direction) to 2 (vectors point in opposite directions)
Stored in database as PairwiseTopicMetricValue with metric name "Embedding Distance"
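
The range of the metric follows from the definition of cosine distance. As a short worked form, with u and v standing for two topic embedding vectors and θ for the angle between them (symbols introduced here for illustration only):

```latex
d(u, v) = 1 - \cos\theta = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
-1 \le \cos\theta \le 1 \;\Longrightarrow\; 0 \le d(u, v) \le 2
```

So d = 0 when the vectors point the same way, d = 1 when they are orthogonal, and d = 2 when they point in opposite directions.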

Usage:

# Manually compute for an existing analysis
python import_tool/metric/topic/pairwise/embedding_distance.py \
    -d state_of_the_union \
    -a bertopicauto

Registration: Added to import_tool/metric/topic/pairwise/__init__.py to run automatically during import.

3. BERTopic Native Visualizations

Files:

Backend: visualize/bertopic_viz.py
Frontend: visualize/static/scripts/topic_embeddings_view.js
URL routing: visualize/urls.py

Provides access to BERTopic's built-in Plotly visualizations through the Topical Guide UI.

Available Visualizations:

1. Topics Map (visualize_topics())
   - 2D scatter plot of topics in embedding space
   - Uses UMAP or t-SNE for dimensionality reduction
   - Topics closer together are more semantically similar
   - Interactive: hover for details, click to explore
2. Documents Map (visualize_documents())
   - 2D visualization of individual documents colored by their assigned topics
   - Helps identify document clusters and outliers
   - Shows how well documents fit within their assigned topics
3. Similarity Heatmap (visualize_heatmap())
   - Matrix showing pairwise similarity between all topics
   - Color-coded for easy comparison
   - Interactive tooltips with exact values
4. Topic Hierarchy (visualize_hierarchy())
   - Hierarchical clustering dendrogram
   - Shows how topics group together at different similarity levels
   - Useful for understanding topic structure
5. Top Words (visualize_barchart())
   - Bar charts of most representative words per topic
   - Shows first 10 topics with top 10 words each
   - Side-by-side comparison
6. Term Rank (visualize_term_rank())
   - Shows how c-TF-IDF scores decline as more terms are added to topic representations
   - Useful for determining optimal number of words per topic
   - Helps assess topic quality
7. Topics Over Time (visualize_topics_over_time())
   - Track how topic frequencies change over time
   - Requires temporal data (timestamps for documents)
   - Perfect for datasets like State of the Union speeches
8. Topics Per Class (visualize_topics_per_class())
   - Compare topic representations across different document classes
   - Requires class labels (e.g., by president, party, decade)
   - Shows how different groups approach topics
9. Hierarchical Documents (visualize_hierarchical_documents())
   - View documents across different levels of the topic hierarchy
   - Combines hierarchy and document views

Access: Navigate to Topic Space in the Topical Guide UI (only available for BERTopic analyses).
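
One way a backend could turn a visualization name into HTML is sketched below. This is an illustration under stated assumptions, not the contents of visualize/bertopic_viz.py: render_visualization and VIZ_METHODS are hypothetical names, the keys are the viz_type values listed under API Endpoint, and the values are the BERTopic methods named in the list above. Each of those methods returns a Plotly figure, which can serialize itself with to_html().

```python
# Mapping from viz_type values to BERTopic visualization method names.
VIZ_METHODS = {
    "topics": "visualize_topics",
    "documents": "visualize_documents",
    "heatmap": "visualize_heatmap",
    "hierarchy": "visualize_hierarchy",
    "hierarchical_documents": "visualize_hierarchical_documents",
    "barchart": "visualize_barchart",
    "term_rank": "visualize_term_rank",
    "topics_over_time": "visualize_topics_over_time",
    "topics_per_class": "visualize_topics_per_class",
}


def render_visualization(topic_model, viz_type, **kwargs):
    """Return a standalone HTML page for one visualization (hypothetical helper)."""
    if viz_type not in VIZ_METHODS:
        raise ValueError("Unknown visualization type: %r" % (viz_type,))
    method = getattr(topic_model, VIZ_METHODS[viz_type])
    # Some methods need extra data, for example visualize_documents(docs=...)
    # or visualize_topics_over_time(topics_over_time=...); pass it via kwargs.
    figure = method(**kwargs)
    # Plotly figures can render themselves as a complete HTML document.
    return figure.to_html(full_html=True, include_plotlyjs="cdn")
```

Keeping the mapping in one dictionary means an unsupported viz_type fails early with a clear error instead of an AttributeError deep inside the view.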

4. Embedding Distance in Similar Topics Tab

The "Similar Topics" tab in the single topic view now includes an "Embedding Distance" column alongside existing metrics like:

Document Correlation
Word Correlation

This column shows the semantic distance between the current topic and other topics based on their neural embeddings.

Technical Details

Cosine Distance Calculation

import math


def cosine_distance(vec1, vec2):
    """
    Compute cosine distance between two equal-length vectors.
    Returns a value between 0 (same direction) and 2 (opposite direction).
    """
    dot_product = sum(a * b for a, b in zip(vec1, vec2))
    mag1 = math.sqrt(sum(a * a for a in vec1))
    mag2 = math.sqrt(sum(b * b for b in vec2))

    # Cosine similarity is undefined when either vector has zero magnitude.
    if mag1 == 0 or mag2 == 0:
        return 1.0  # Neutral value: neither similar nor opposite

    cosine_sim = dot_product / (mag1 * mag2)
    return 1.0 - cosine_sim
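
To show how this distance produces a "most similar topics" ranking, here is a small sketch on toy data. The helper nearest_topics and the 2-D vectors are hypothetical and chosen so the results are easy to check by hand; the distance function is repeated only so the block runs on its own.

```python
import math


def cosine_distance(vec1, vec2):
    """Same calculation as above, repeated so this sketch is self-contained."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    mag1 = math.sqrt(sum(a * a for a in vec1))
    mag2 = math.sqrt(sum(b * b for b in vec2))
    if mag1 == 0 or mag2 == 0:
        return 1.0
    return 1.0 - dot / (mag1 * mag2)


def nearest_topics(embeddings, topic_id, limit=5):
    """Rank the other topics by ascending distance from topic_id."""
    target = embeddings[topic_id]
    distances = [
        (other_id, cosine_distance(target, vector))
        for other_id, vector in embeddings.items()
        if other_id != topic_id
    ]
    return sorted(distances, key=lambda pair: pair[1])[:limit]


# Toy 2-D embeddings keyed by topic ID (real embeddings have many more dimensions).
toy = {"0": [1.0, 0.0], "1": [0.0, 1.0], "2": [-1.0, 0.0], "3": [2.0, 0.0]}
ranking = nearest_topics(toy, "0")
# ranking == [("3", 0.0), ("1", 1.0), ("2", 2.0)]
```

Topic 3 ranks first even though its vector is twice as long as topic 0's: cosine distance depends only on direction, not magnitude.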

API Endpoint

URL: /bertopic-viz/<dataset>/<analysis>/<viz_type>/

Parameters:

dataset: Dataset name
analysis: Analysis name (must be a BERTopic analysis)
viz_type: One of topics, documents, heatmap, hierarchy, hierarchical_documents, barchart, term_rank, topics_over_time, topics_per_class

Returns: HTML containing interactive Plotly visualization

Example:

GET /bertopic-viz/state_of_the_union/bertopicauto/topics/
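
The same request can be made from a script. The sketch below uses only the Python standard library; the base URL (for example http://localhost:8000) is an assumption about where the server is running, and build_viz_url and fetch_viz_html are hypothetical helper names.

```python
from urllib.parse import quote
from urllib.request import urlopen

# The viz_type values accepted by the endpoint, as listed above.
VIZ_TYPES = {
    "topics", "documents", "heatmap", "hierarchy", "hierarchical_documents",
    "barchart", "term_rank", "topics_over_time", "topics_per_class",
}


def build_viz_url(base_url, dataset, analysis, viz_type):
    """Build the endpoint URL; base_url depends on your deployment."""
    if viz_type not in VIZ_TYPES:
        raise ValueError("Unsupported viz_type: %r" % (viz_type,))
    path = "/bertopic-viz/%s/%s/%s/" % (
        quote(dataset, safe=""),
        quote(analysis, safe=""),
        viz_type,
    )
    return base_url.rstrip("/") + path


def fetch_viz_html(base_url, dataset, analysis, viz_type, timeout=30):
    """Download the visualization HTML from a running server."""
    url = build_viz_url(base_url, dataset, analysis, viz_type)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For instance, build_viz_url("http://localhost:8000", "state_of_the_union", "bertopicauto", "topics") yields the URL for the request shown above.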

Frontend Integration

The visualization view loads BERTopic visualizations in an iframe:

var url = "/bertopic-viz/" + dataset + "/" + analysis + "/" + vizType + "/";

d3.select("#embeddings-plot")
    .append("iframe")
    .attr("src", url)
    .attr("width", "100%")
    .attr("height", "700px");

Benefits

For Researchers

Semantic Similarity: Understand which topics are semantically related, not just through word overlap
Visual Exploration: Interactive 2D maps make it easy to explore topic structure
Hierarchical Relationships: See how topics group at different levels of granularity
Quality Assessment: Visualizations help assess whether topics are well-separated or overlapping

For Topic Model Comparison

Compare embedding-based distance vs. word-based correlation
Identify topics that are:
- Semantically similar but lexically different (low embedding distance, low word correlation)
- Lexically similar but semantically different (high word correlation, high embedding distance)
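
The two cases above can be expressed as a simple rule over the two metric values for a topic pair. This is only a sketch: classify_pair is a hypothetical helper, and all four thresholds are illustrative placeholders that would need tuning for a real analysis.

```python
def classify_pair(embedding_distance, word_correlation,
                  near=0.3, far=0.7, low_corr=0.2, high_corr=0.6):
    """Label one topic pair; the default thresholds are illustrative, not tuned."""
    if embedding_distance <= near and word_correlation <= low_corr:
        # Close in embedding space but sharing few words.
        return "semantically similar, lexically different"
    if embedding_distance >= far and word_correlation >= high_corr:
        # Sharing many words but far apart in embedding space.
        return "lexically similar, semantically different"
    return "no disagreement flagged"
```

Pairs that land in either of the first two categories are the ones where the embedding-based and word-based metrics disagree, and so are the most informative to inspect by hand.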

Upgrading Existing Analyses

If you have existing BERTopic analyses that were created before this feature was added:

1. Re-run the analysis to generate embeddings:

python tg.py analyze state_of_the_union \
    --analysis-tool BERTopic \
    --number-of-topics 20 \
    --stopwords stopwords/english_all.txt \
    --verbose

2. The embedding distance metric will be computed automatically during import.
3. Visualizations will be available in the UI.

Dependencies

Required packages (already in requirements.txt for BERTopic):

bertopic - Core BERTopic library
sentence-transformers - For generating embeddings
plotly - For interactive visualizations
umap-learn - For dimensionality reduction
hdbscan - For clustering
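
A quick way to confirm that these packages are importable in the current environment is sketched below. The helper missing_packages is hypothetical; the import names are the standard ones for each package (note that umap-learn is imported as umap and sentence-transformers as sentence_transformers).

```python
import importlib.util

# Package name (as installed with pip) -> module name (as imported).
REQUIRED_MODULES = {
    "bertopic": "bertopic",
    "sentence-transformers": "sentence_transformers",
    "plotly": "plotly",
    "umap-learn": "umap",
    "hdbscan": "hdbscan",
}


def missing_packages(required=REQUIRED_MODULES):
    """Return the package names whose import module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


# Example: print(missing_packages()) lists anything that still needs installing.
```

Because find_spec only looks the module up without importing it, the check stays fast even for heavy libraries.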

Future Enhancements

Potential additions:

1. Hierarchical Topics (High Priority): Full data model integration for BERTopic hierarchies
   - Current Status: Visualization available via visualize_hierarchy() and visualize_hierarchical_documents()
   - Needed: Implement get_hierarchy_iterator() in bertopic_analysis.py to populate the database hierarchy
   - BERTopic Method: Use topic_model.hierarchical_topics() to generate the hierarchy
   - Benefit: Enable browsing topic trees similar to HLDA but with embedding-based hierarchies
2. Cross-Dataset Alignment: Compare topics across different datasets using embedding alignment
   - Use embedding space to map equivalent topics across analyses
3. Topic Merging/Splitting: Use embeddings to suggest topic consolidation or division
   - Identify topics that should be merged (high embedding similarity)
   - Identify topics that are too broad (high internal variance)
4. Custom Embedding Models: Allow users to specify different sentence transformer models
   - Currently fixed to "all-MiniLM-L6-v2"
   - Add UI option to select from common models (e.g., "all-mpnet-base-v2", domain-specific models)
5. Save Auxiliary Data for Advanced Visualizations: Store documents, timestamps, and class labels during analysis
   - Currently some visualizations may fail if auxiliary data files don't exist
   - Modify bertopic_analysis.py to save this data automatically

Examples

Example 1: Finding Semantically Similar Topics

In the Topics view:

1. Click on a topic to view its details
2. Go to the "Similar Topics" tab
3. Sort by "Embedding Distance" column (ascending)
4. Topics at the top are most semantically similar

Example 2: Exploring Topic Space

In Topic Space:

1. Select "Topics Map" to see topics in 2D space
2. Hover over points to see topic names
3. Zoom and pan to explore different regions
4. Try "Documents Map" to see how individual documents cluster by topic

Example 3: Understanding Topic Hierarchy

1. Navigate to Topic Space
2. Select "Topic Hierarchy"
3. Examine the dendrogram to see how topics cluster
4. Identify major theme groups and sub-themes

Example 4: Temporal Analysis

1. Navigate to Topic Space
2. Select "Topics Over Time"
3. See how topic frequencies evolve across different time periods
4. Identify trending topics and declining themes

Troubleshooting

"No embedding distances found"

Cause: Analysis was run before embedding support was added.

Solution: Re-run the analysis with the latest code.
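
Before re-running, it can help to confirm that the embeddings file is actually missing or malformed. The sketch below is a hypothetical diagnostic, not part of the project: check_topic_embeddings takes the analysis directory (the analyses/{analysis_name} folder described under Topic Embedding Storage) and assumes only the file name and JSON format given there.

```python
import json
import os


def check_topic_embeddings(analysis_dir):
    """Return a list of problems found with topic_embeddings.json (empty if none)."""
    path = os.path.join(analysis_dir, "topic_embeddings.json")
    if not os.path.isfile(path):
        return ["missing file: " + path]
    try:
        with open(path, encoding="utf-8") as f:
            embeddings = json.load(f)
    except ValueError as error:  # json.JSONDecodeError is a subclass of ValueError
        return ["invalid JSON: %s" % error]
    if not isinstance(embeddings, dict) or not embeddings:
        return ["expected a non-empty JSON object keyed by topic ID"]
    if not all(isinstance(vector, list) for vector in embeddings.values()):
        return ["every topic ID should map to a list of numbers"]
    problems = []
    # All topic vectors come from the same model, so their lengths should match.
    if len({len(vector) for vector in embeddings.values()}) > 1:
        problems.append("embedding vectors have inconsistent lengths")
    return problems
```

An empty result suggests the embeddings were saved correctly, in which case the distances may simply not have been imported yet.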

"BERTopic model file not found"

Cause: Model pickle file is missing or corrupted.

Solution: Re-run the analysis.

Visualization iframe shows error

Cause: Django view error or missing dependencies.

Solution:

- Check server logs for detailed error message
- Ensure all dependencies are installed
- Verify the model file exists and is valid

"Documents file not found" error

Cause: Visualizations like Documents Map and Hierarchical Documents require the original document text.

Solution: These visualizations need auxiliary data files that may not be saved by default. This is a known limitation; see "Save Auxiliary Data for Advanced Visualizations" under Future Enhancements.

"Timestamp data not found" or "Class label data not found"

Cause: Topics Over Time and Topics Per Class require temporal/class metadata.

Solution:

- For temporal analysis, ensure your dataset includes timestamp information
- For class-based analysis, add class labels to your documents
- These are advanced features that may not be available for all datasets
