Have the offlinifier also translate the redirect stubs.

KubaO · KubaO · commit 1e360a82ca31 · 2026-05-17T18:38:36.000+02:00
diff --git a/docs/Miscellaneous/Documentation Development.md b/docs/Miscellaneous/Documentation Development.md
@@ -201,7 +201,7 @@ To check that none of the internal links in the most recent documentation build
 
     check.bat
 
-This runs [Lychee](https://github.com/lycheeverse/lychee) in offline mode against the built `_site/`.
+This runs three checks: [Lychee](https://github.com/lycheeverse/lychee) in offline mode against `_site/` (the live tree), the same against `_site-offline/` (the file://-browsable mirror), and a small Python pass over `_site-offline/` that flags any surviving `https://docs.twinbasic.com/<path>` link --- the offline mirror should not navigate back to the live docs site.
 
 ### Building and Local Serving
 
diff --git a/docs/_plugins/offlinify.md b/docs/_plugins/offlinify.md
@@ -77,7 +77,7 @@ For each page:
 3. **Detect jekyll-redirect-from stubs** by class-name string check (`page.class.name == "JekyllRedirectFrom::RedirectPage"`). The stubs are tiny HTML files whose meta-refresh, canonical link, `<script>location=`, and fallback `<a>` all reference an absolute `https://<site.url>/<path>` URL produced by `absolute_url`. Online these redirect to the canonical page; offline they would require network access and land on the live site rather than the local file — defeating the offline scenario. Rewrite each `<site.url><path>` occurrence to its resolved page-relative form via the same `compute_relative` the main HTML pass uses, then write the stub. Counted under `rewritten_redirects` in the summary log line. Some source pages (notably `Miscellaneous/Documentation Development.md`) intentionally link via `redirect_from` URLs as a stable-URL pattern, so the rewritten stubs let those source links navigate locally instead of failing. The class-name string check is used rather than `is_a?` so the plugin still loads if jekyll-redirect-from is removed. If `site.url` is unset (empty) the stub is written verbatim — the path-portion targets still resolve under lychee's offline check the same way the main HTML pass's link targets do.
 
 4. **Dispatch on output extension:**
-   - `.html`: dup `page.output`, scan for code-block ranges, run the combined HTML URL rewrite (see [HTML URL rewriting](#html-url-rewriting)), inject the search-setup script tags, write.
+   - `.html`: dup `page.output`, strip the jekyll-seo-tag block (see [SEO block stripping](#seo-block-stripping)), scan for code-block ranges, run the combined HTML URL rewrite (see [HTML URL rewriting](#html-url-rewriting)), inject the search-setup script tags, write.
    - `.css`: dup `page.output`, run the `url()` rewrite (see [CSS `url()` rewriting](#css-url-rewriting)), write.
    - Anything else (XML feeds, JSON, etc.): write `page.output` verbatim.
 
@@ -99,6 +99,14 @@ Fires at `:site, :post_write` — once after Jekyll's WRITE phase has populated
 
 ## Transformation passes
 
+### SEO block stripping
+
+The jekyll-seo-tag plugin emits a ~900-byte block at the top of every page's `<head>`, bracketed by `<!-- Begin Jekyll SEO tag vX.Y.Z -->` and `<!-- End Jekyll SEO tag -->` comments. Inside live a `<title>`, a generator tag, OpenGraph and Twitter Card meta, a `<link rel="canonical">` pointing at the live site, and a JSON-LD structured-data `<script>`. All of it exists for search-engine crawlers and social-media link previewers that never see `_site-offline/`.
+
+The whole block is stripped, except the `<title>` (the browser tab label, the only thing in the block a local reader actually uses). The bracketing comments go away too. On the current ~830-page site, the strip saves roughly 750 KB across the offline tree and removes the `https://docs.twinbasic.com` references that the block adds to every page (the canonical link, the OpenGraph URL, and the JSON-LD `"url"` field).
+
+Runs first in the `.html` branch of `process_page` so the URL rewrite isn't doing work on URLs we're about to delete, and so the code-block scan's byte offsets are valid against the post-strip content.
+
 ### HTML URL rewriting
 
 A single combined regex matches both absolute and page-relative URLs in `href`/`src` attributes:
@@ -292,7 +300,8 @@ The offline build touches the following files:
 | `docs/_config.yml` | `also_build_offline: true` (default-on) and `exclude: [_site-offline]` (keeps Jekyll's watcher from rebuilding on the plugin's own output). |
 | `docs/build.bat` | Plain `bundle exec jekyll build` — produces `_site/`, `_site-offline/`, and (via `pdfify.rb`) `_site-pdf/` in one run. |
 | `docs/serve.bat` | `bundle exec jekyll serve` — watcher-friendly thanks to the exclude. |
-| `docs/check.bat` | Dual lychee — strict on `_site-offline/`, permissive (`--fallback-extensions html`) on `_site/`. |
+| `docs/check.bat` | Local link check (dev-side only; CI runs the two lychee passes directly). Three steps: lychee permissive on `_site/`, lychee strict on `_site-offline/`, and `scripts/check_offline_live_links.py` against `_site-offline/`. Exits non-zero on any failure. |
+| `scripts/check_offline_live_links.py` | Flags any `https://docs.twinbasic.com/<path>` reference that survived offlinify in `_site-offline/` HTML, outside `<code>` / `<pre>` blocks. Skips the bare root (`https://docs.twinbasic.com[/]`) since intentional "go to the live site" links are allowed. Run locally by `check.bat`; not wired into CI. |
 | `docs/.gitignore` | `_site`, `_site-offline`, and `_site-pdf` all excluded from git. |
 | `.github/workflows/jekyll-gh-pages.yml` | CI workflow. Builds, runs lychee against both trees, deploys to Pages, and (on manual dispatch) packages `_site-offline/` as a release artifact. |
 
@@ -322,6 +331,8 @@ The plugin surfaces several conditions in its summary log lines:
 
 - **`_site-offline/` triggering `jekyll serve` rebuilds.** Was a problem; now handled by two things in combination: `exclude: [_site-offline]` in `_config.yml`, and the "clean contents but keep the directory" trick in the wipe step (which keeps all watcher events under `_site-offline/...` where the exclude matches).
 
+- **Surviving live-site links.** The [SEO block stripping](#seo-block-stripping) pass removes the bulk of `https://docs.twinbasic.com` references each page contains (canonical link, OpenGraph URL, JSON-LD `url`). Anything left in `_site-offline/` is a source link that points at the live docs site -- usually a markdown author writing `https://docs.twinbasic.com/<path>` instead of a relative link or `/tB/...` permalink, which would silently navigate the offline reader back online. `scripts/check_offline_live_links.py` (run by `check.bat` after the offline lychee pass) flags these locally; the bare root `https://docs.twinbasic.com[/]` is exempt since intentional "go to the live site" links are allowed. CI does not run this check.
+
 ## Performance
 
 The optimization story is captured in the commit history. Briefly:
@@ -359,6 +370,7 @@ In source order in [`offlinify.rb`](offlinify.rb):
 - `rewrite_html!(content, file_dir, file_segs, site_paths, seg_cache, result_cache, baseurl, code_ranges)` — the combined HTML pass. One `gsub` per file over `HTML_COMBINED_RE`, dispatching on `raw.start_with?("/")`: absolute URLs go through `compute_relative`, page-relative URLs through `compute_rel_url`. Single cache lookup per match.
 - `rewrite_css!(content, file_dir, file_segs, site_paths, seg_cache, result_cache, baseurl)` — the CSS pass. One `gsub` per file over `CSS_URL_RE`, dispatched to `compute_relative` (CSS only carries absolute URLs in this codebase). No code-block handling — CSS has no equivalent concept.
 - `inject_search_setup!(content, file_segs)` — the second HTML transformation. Single regex substitution per file: finds the just-the-docs.js script tag and prepends the two new ones.
+- `strip_seo!(content)` — removes the jekyll-seo-tag plugin's output block from a page's `<head>`, keeping only the `<title>` tag. Runs first in the `.html` branch of `process_page` so the URL rewrite and code-block scan see the post-strip content.
 - `compute_relative(raw, file_segs, site_paths, seg_cache, baseurl)` — the absolute-URL resolver. Strip baseurl, probe candidates, compute LCP, return final URL.
 - `compute_rel_url(raw, file_segs, site_paths)` — the page-relative-URL resolver. Normalise against the current page's dir, probe candidates, return original raw plus matching suffix.
 - `patch_jtd_js!(out_dest)` — does the `navLink()` and `initSearch()` body substitutions.
diff --git a/docs/_plugins/offlinify.rb b/docs/_plugins/offlinify.rb
@@ -211,6 +211,24 @@ module Offlinify
   # percent-encoded byte-by-byte.
   PATH_SAFE_RE = /[^A-Za-z0-9\-_.~!$&'()*+,;=:@]/.freeze
 
+  # Matches the jekyll-seo-tag plugin's output, bracketed by the
+  # `Begin Jekyll SEO tag vX.Y.Z` / `End Jekyll SEO tag` comments
+  # the plugin emits unconditionally. Inside the block live a
+  # `<title>`, generator/OpenGraph/Twitter-Card meta tags, a
+  # `<link rel="canonical">` pointing at the live site, and a
+  # JSON-LD structured-data `<script>`. The whole block is ~900
+  # bytes per page; the contents do nothing offline (search-engine
+  # crawlers and social-media link previewers never see
+  # `_site-offline/`) so all but the `<title>` (the browser tab
+  # label) is stripped. The block is single-line in just-the-docs's
+  # rendered output but the regex uses `.*?` with multiline mode
+  # in case future theme versions reformat it.
+  SEO_BLOCK_RE = /<!-- Begin Jekyll SEO tag.*?<!-- End Jekyll SEO tag -->/m.freeze
+
+  # Matches the `<title>...</title>` tag inside the SEO block,
+  # preserved verbatim into the stripped output.
+  TITLE_RE = /<title>.*?<\/title>/m.freeze
+
   # Path of the just-the-docs JS file relative to the site root.
   JTD_JS_REL = "assets/js/just-the-docs.js"
 
@@ -446,6 +464,7 @@ def self.setup(site)
       rewritten_html: 0,
       rewritten_css: 0,
       rewritten_redirects: 0,
+      seo_stripped: 0,
       copied_files: 0,
       excluded_files: 0,
       unresolved: 0,
@@ -587,6 +606,7 @@ def self.process_page(page)
       case File.extname(dest_path).downcase
       when ".html"
         content = page.output.dup
+        @state[:seo_stripped] += 1 if strip_seo!(content)
         code_ranges = code_block_ranges(content)
         _changed, misses = rewrite_html!(content, file_dir, file_segs, @state[:site_paths], @state[:seg_cache], @state[:result_cache], @state[:baseurl], code_ranges)
         @state[:unresolved] += misses
@@ -640,6 +660,7 @@ def self.finish(site)
 
     summary = "rewrote #{@state[:rewritten_html]} HTML and #{@state[:rewritten_css]} CSS file(s), copied #{@state[:copied_files]} asset(s)"
     summary += ", rewrote #{@state[:rewritten_redirects]} redirect stub(s)" if @state[:rewritten_redirects].positive?
+    summary += ", stripped SEO block from #{@state[:seo_stripped]} page(s)" if @state[:seo_stripped].positive?
     summary += ", excluded #{@state[:excluded_files]} file(s)" if @state[:excluded_files].positive?
     summary += " (#{@state[:unresolved]} unresolved link(s) left as-is)" if @state[:unresolved].positive?
     Jekyll.logger.info "Offlinify:", summary
@@ -733,6 +754,33 @@ def self.inject_search_setup!(content, file_segs)
     true
   end
 
+  # Strip the jekyll-seo-tag plugin's output block, keeping only the
+  # `<title>` tag (the browser tab label). The block is ~900 bytes
+  # per page and the rest -- generator tag, OpenGraph/Twitter Card
+  # meta, canonical link pointing at the live site, JSON-LD
+  # structured data -- exists for search-engine crawlers and
+  # social-media link previewers that never see `_site-offline/`.
+  # Stripping also removes ~3 of the `https://docs.twinbasic.com`
+  # references per page that would otherwise need to be carved
+  # around by any "no live-site links" check.
+  #
+  # Runs before the URL rewrite so the rewrite isn't doing work on
+  # URLs we're about to delete, and before the code-block scan so
+  # the byte offsets it produces are valid against the post-strip
+  # content. Returns true when the strip happened, false when the
+  # block wasn't found (e.g. a page without the layout, or a future
+  # build where the plugin is removed).
+  def self.strip_seo!(content)
+    return false unless content.include?("<!-- Begin Jekyll SEO tag")
+    new_content = content.sub(SEO_BLOCK_RE) do |block|
+      title = block.match(TITLE_RE)
+      title ? title[0] : ""
+    end
+    return false if new_content == content
+    content.replace(new_content)
+    true
+  end
+
   # Convert the rendered `assets/js/search-data.json` into a sibling
   # `assets/js/search-data.js` that assigns the data to a global. The
   # JS file is loaded as a `<script src=>` from each page (see
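For readers who want to try the strip outside Jekyll, here is a minimal Python sketch of the same idea as `strip_seo!`. It is an illustrative port, not the plugin's code: the function name `strip_seo`, the tuple return value, and the sample page head (including its version number) are assumptions made for the example.

```python
import re
from typing import Tuple

# Python counterparts of SEO_BLOCK_RE and TITLE_RE from offlinify.rb.
# re.DOTALL plays the role of Ruby's /m flag (dot also matches newlines).
SEO_BLOCK_RE = re.compile(
    r"<!-- Begin Jekyll SEO tag.*?<!-- End Jekyll SEO tag -->", re.DOTALL
)
TITLE_RE = re.compile(r"<title>.*?</title>", re.DOTALL)


def strip_seo(content: str) -> Tuple[str, bool]:
    """Return (new_content, changed), keeping only <title> from the SEO block."""

    def keep_title(block: "re.Match[str]") -> str:
        title = TITLE_RE.search(block.group(0))
        return title.group(0) if title else ""

    # count=1 mirrors Ruby's String#sub, which replaces the first match only.
    new_content = SEO_BLOCK_RE.sub(keep_title, content, count=1)
    return new_content, new_content != content


# Hypothetical page head, made up for illustration.
SAMPLE = (
    "<head><!-- Begin Jekyll SEO tag v2.8.0 -->\n"
    "<title>Const | twinBASIC</title>\n"
    '<link rel="canonical" href="https://docs.twinbasic.com/tB/Core/Const" />\n'
    '<!-- End Jekyll SEO tag --><meta charset="utf-8"></head>'
)

stripped, changed = strip_seo(SAMPLE)
print(stripped)
```

Unlike the Ruby method, which mutates its argument in place and returns a boolean, the sketch returns the new string alongside the flag because Python strings are immutable.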
diff --git a/docs/check.bat b/docs/check.bat
@@ -1,4 +1,5 @@
-@rem Use lychee to check the links in both build outputs.
+@rem Use lychee to check the links in both build outputs, then scan
+@rem _site-offline/ for live-site links that survived offlinify.
 @rem
 @rem _site/        Online tree. `--fallback-extensions html` mirrors what
 @rem               GitHub Pages does at request time: an extensionless
@@ -10,10 +11,17 @@
 @rem               markdown sources whose permalink shape doesn't match
 @rem               the rendered filename (e.g. `[Foo](Foo/)` when Jekyll
 @rem               wrote `Foo.html`, not `Foo/index.html`).
+@rem live-links    Scans _site-offline/ HTML for any surviving
+@rem               https://docs.twinbasic.com/<path> link outside <code> /
+@rem               <pre> blocks (the bare root URL is exempt). After
+@rem               _plugins/offlinify.rb strips the jekyll-seo-tag block,
+@rem               none should remain -- a hit means a source link goes to
+@rem               the live site instead of the canonical /tB/... permalink.
+@rem               See ../scripts/check_offline_live_links.py.
 @rem
-@rem Both checks always run so you see all errors in one pass; the script
-@rem exits non-zero if either fails (online failure takes precedence in
-@rem the reported code).
+@rem All three checks always run so you see all errors in one pass; the
+@rem script exits non-zero if any fails (earlier failures take precedence
+@rem in the reported code).
 @setlocal
 @set LYCHEE="%~dp0..\.claude\lychee.exe"
 @echo Checking _site/ (online) ...
@@ -28,5 +36,10 @@
 @rem such fallback, and the link is just broken.
 @%LYCHEE% --offline --include-fragments --index-files "index.html" --root-dir ".\_site-offline" ".\_site-offline" %*
 @set EXIT2=%ERRORLEVEL%
+@echo.
+@echo Checking _site-offline/ for live-site links ...
+@python "%~dp0..\scripts\check_offline_live_links.py"
+@set EXIT3=%ERRORLEVEL%
 @if %EXIT1% NEQ 0 exit /b %EXIT1%
-@exit /b %EXIT2%
+@if %EXIT2% NEQ 0 exit /b %EXIT2%
+@exit /b %EXIT3%
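The exit-code rule in `check.bat` (run every check, then report the earliest failure) can be sketched in Python as follows. The function name `reported_exit_code` is hypothetical and exists only for this illustration; the batch file itself uses the chained `@if ... exit /b` lines above.

```python
from typing import Sequence


def reported_exit_code(codes: Sequence[int]) -> int:
    """Return the first non-zero exit code in run order, or 0 if all passed."""
    return next((code for code in codes if code != 0), 0)


# Codes listed in run order: online lychee, offline lychee, live-link scan.
print(reported_exit_code([0, 2, 1]))
print(reported_exit_code([0, 0, 0]))
```

Because all three checks have already run by the time the code is chosen, a failure in an early step never hides the output of a later one; it only decides which number the script returns.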
diff --git a/scripts/check_offline_live_links.py b/scripts/check_offline_live_links.py
@@ -0,0 +1,97 @@
+"""
+Scan docs/_site-offline/ for any https://docs.twinbasic.com/<path>
+reference outside of <code> / <pre> blocks. Exit 1 if any are found,
+2 if _site-offline/ is missing, 0 otherwise.
+
+Run by docs/check.bat after the offline lychee pass. After
+_plugins/offlinify.rb's SEO-block strip, no live-site references
+should remain except:
+
+  * Sample URLs inside <code> / <pre> blocks (tutorial code that
+    legitimately shows live URLs as data, e.g. the VBRUN.Hyperlink
+    `NavigateTo "https://docs.twinbasic.com/"` example). Skipped
+    via the same code-block shape offlinify uses for its URL
+    rewrite.
+  * The bare root URL `https://docs.twinbasic.com` or
+    `https://docs.twinbasic.com/` -- intentional "go to the live
+    docs site" links (e.g. the Documentation entry in the FAQ
+    resource list). Skipped via the tail check below.
+
+Anything else is flagged, whether a deep link
+(`https://docs.twinbasic.com/tB/Core/Const`) or a typo
+(`https://docs.twinbasic.comi`): in the offline copy those navigate
+back to the live site, undermining the local read; in source they
+should be a relative link or a /tB/... permalink that resolves locally.
+
+Run from anywhere:
+    python scripts/check_offline_live_links.py
+"""
+
+import re
+import sys
+from pathlib import Path
+
+SCRIPT_DIR = Path(__file__).resolve().parent
+REPO_ROOT = SCRIPT_DIR.parent
+OFFLINE_TREE = REPO_ROOT / "docs" / "_site-offline"
+
+# Matches a <code>...</code> or <pre>...</pre> block. Same shape as
+# _plugins/offlinify.rb CODE_BLOCK_RE so sample URLs in tutorial
+# code are skipped here too.
+CODE_BLOCK_RE = re.compile(r"<(code|pre)\b[^>]*>.*?</\1>", re.DOTALL)
+
+# Captures the trailing path/typo characters after the domain. An
+# empty tail or `/` means the bare root URL (intentional). Anything
+# else is a deep link or a typo (`.comi`, `.com/tB/...`).
+LIVE_LINK_RE = re.compile(r"https://docs\.twinbasic\.com(?P<tail>[^\s\"'<>]*)")
+
+
+def main() -> int:
+    if not OFFLINE_TREE.is_dir():
+        print(
+            f"_site-offline/ not found at {OFFLINE_TREE} -- run docs/build.bat first."
+        )
+        return 2
+
+    hits = []
+    for html in sorted(OFFLINE_TREE.rglob("*.html")):
+        content = html.read_text(encoding="utf-8")
+        link_matches = list(LIVE_LINK_RE.finditer(content))
+        if not link_matches:
+            continue
+        code_ranges = [(m.start(), m.end()) for m in CODE_BLOCK_RE.finditer(content)]
+        for m in link_matches:
+            tail = m.group("tail")
+            if tail == "" or tail == "/":
+                continue
+            if any(s <= m.start() < e for s, e in code_ranges):
+                continue
+            line_num = content.count("\n", 0, m.start()) + 1
+            start = max(0, m.start() - 60)
+            end = min(len(content), m.start() + 80)
+            snippet = re.sub(r"[\r\n]+", " ", content[start:end])
+            hits.append((html, line_num, snippet))
+
+    if hits:
+        print(
+            f"FAIL: {len(hits)} reference(s) to docs.twinbasic.com in "
+            f"_site-offline/ outside code blocks:"
+        )
+        for path, line_num, snippet in hits:
+            try:
+                rel = path.relative_to(REPO_ROOT)
+            except ValueError:
+                rel = path
+            print(f"  {rel}:{line_num}: ...{snippet}...")
+        print()
+        print(
+            "Update the source markdown to use a relative link or /tB/... "
+            "permalink instead."
+        )
+        return 1
+
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
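To see how the two exemptions interact, the following sketch applies the same two regexes as the script above to an in-memory string instead of walking `_site-offline/`. The helper name `flagged_links` and the sample HTML are assumptions made for the illustration.

```python
import re
from typing import List

# Same patterns as scripts/check_offline_live_links.py.
CODE_BLOCK_RE = re.compile(r"<(code|pre)\b[^>]*>.*?</\1>", re.DOTALL)
LIVE_LINK_RE = re.compile(r"https://docs\.twinbasic\.com(?P<tail>[^\s\"'<>]*)")


def flagged_links(content: str) -> List[str]:
    """Return live-site URLs that are neither the bare root nor inside code."""
    code_ranges = [(m.start(), m.end()) for m in CODE_BLOCK_RE.finditer(content)]
    flagged = []
    for m in LIVE_LINK_RE.finditer(content):
        if m.group("tail") in ("", "/"):
            continue  # bare root: an intentional "go to the live site" link
        if any(start <= m.start() < end for start, end in code_ranges):
            continue  # sample URL shown as data inside <code> / <pre>
        flagged.append(m.group(0))
    return flagged


# Hypothetical page fragment, made up for illustration.
SAMPLE = (
    '<a href="https://docs.twinbasic.com/">Documentation</a>\n'
    '<code>NavigateTo "https://docs.twinbasic.com/tB/Core/Const"</code>\n'
    '<a href="https://docs.twinbasic.com/tB/Core/Const">Const</a>\n'
)

print(flagged_links(SAMPLE))
```

Only the third link is reported: the first is the bare root and the second sits inside a `<code>` block.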