search.json
7 lines changed: 7 additions & 0 deletions
@@ -83,6 +83,13 @@
"section": "Discussion on evaluating a topic model",
"text": "Discussion on evaluating a topic model\nTopic model evaluation is an active domain of research that goes beyond the scope of this tutorial. We propose an overview of the methods that exist and how to quickly tell if your topic model can be used or need to be refined.\nIn short: quantitative methods are impractical and one should focus more on the qualitative evaluation.\n\nQualitative evaluation\nThere is no one way around qualitatively evaluate your BERTopic, however the point is \n\n\nQuantitative evaluation\nIn this section we introduce different metrics that can be used to evaluate your topic model. However, we mainly included it to warn you of the complexity behind evaluating a topic model and that there is no one-fit-all solution.\n\nFirst, choosing the coherence score by itself can have a large influence on the difference in performance you will find between models. For example, NPMI and UCI may each lead to quite different values. Second, the coherence score only tells a part of the story. Perhaps your purpose is more classification than having the most coherent words or perhaps you want as diverse topics as possible. These use cases require vastly different evaluation metrics to be used.\nResponse to “How to evaluate the performance of the model?” by Maarten Grootendorst source\n\nThere are two types of metrics that you could use:\n\nCluster metrics — ie focus on the group making. There exist a lot of metrics, but few are fit to our situation: unsupervised learning with density based algorithms. In our experience, optimising these metrics result in a sub-optimal solutions as illustrated bellow. Read more\nTopic representation metrics — ie focus on how relevant the keywords are. Although some metrics exist their utility is limited: good score does not necessarily align with what expert consider good topic models, and they are not good scores to optimise (Stammbach et al., 2023). Read more"
"text": "Precompute your embeddings\nPre-computing the embeddings is a good practice as it will prevent from computing them at each run, but also because it allows you to use a broader spectrum of embedding models that could ne necessarily be used with BERTopic15. We retrieve the embeddings and the documents\nds = load_from_disk(\"path/to/file\")\ndocs = np.array(ds[f\"resumes.en\"]) # Number of documents : 6500\nembeddings = np.array(ds[\"embedding\"]) # shape : (6500, 768)\nThe columns in the dataset are the same as before in addition to an embedding column containing the embeddings of the resumes."
<h2class="anchored" data-anchor-id="precompute-your-embeddings">Precompute your embeddings</h2>
<p>Pre-computing the embeddings is a good practice as it will prevent from computing them at each run, but also because it allows you to use a broader spectrum of embedding models that could ne necessarily be used with BERTopic<ahref="#fn15" class="footnote-ref" id="fnref15" role="doc-noteref"><sup>15</sup></a>. We retrieve the embeddings and the documents</p>
<h2class="anchored" data-anchor-id="save-your-instance-locally">Save your instance locally</h2>
<p>For reproducibility purposes, and more generally, to save your work, BERTopic lets you do that with the <code>save</code> method. Two parameters of importance:</p>