Commit 1e360a8

Have the offlinifier also translate the redirect stubs.
1 parent a250066 commit 1e360a8

5 files changed

Lines changed: 178 additions & 8 deletions

File tree

docs/Miscellaneous/Documentation Development.md

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -201,7 +201,7 @@ To check that none of the internal links in the most recent documentation build
201201

202202
check.bat
203203

204-
This runs [Lychee](https://github.com/lycheeverse/lychee) in offline mode against the built `_site/`.
204+
This runs three checks: [Lychee](https://github.com/lycheeverse/lychee) in offline mode against `_site/` (the live tree), the same against `_site-offline/` (the file://-browsable mirror), and a small Python pass over `_site-offline/` that flags any surviving `https://docs.twinbasic.com/<path>` link --- the offline mirror should not navigate back to the live docs site.
205205

206206
### Building and Local Serving
207207

docs/_plugins/offlinify.md

Lines changed: 14 additions & 2 deletions
@@ -77,7 +77,7 @@ For each page:
7777
3. **Detect jekyll-redirect-from stubs** by class-name string check (`page.class.name == "JekyllRedirectFrom::RedirectPage"`). The stubs are tiny HTML files whose meta-refresh, canonical link, `<script>location=`, and fallback `<a>` all reference an absolute `https://<site.url>/<path>` URL produced by `absolute_url`. Online these redirect to the canonical page; offline they would require network access and land on the live site rather than the local file — defeating the offline scenario. Rewrite each `<site.url><path>` occurrence to its resolved page-relative form via the same `compute_relative` the main HTML pass uses, then write the stub. Counted under `rewritten_redirects` in the summary log line. Some source pages (notably `Miscellaneous/Documentation Development.md`) intentionally link via `redirect_from` URLs as a stable-URL pattern, so the rewritten stubs let those source links navigate locally instead of failing. The class-name string check is used rather than `is_a?` so the plugin still loads if jekyll-redirect-from is removed. If `site.url` is unset (empty) the stub is written verbatim — the path-portion targets still resolve under lychee's offline check the same way the main HTML pass's link targets do.
7878

7979
4. **Dispatch on output extension:**
80-
- `.html`: dup `page.output`, scan for code-block ranges, run the combined HTML URL rewrite (see [HTML URL rewriting](#html-url-rewriting)), inject the search-setup script tags, write.
80+
- `.html`: dup `page.output`, strip the jekyll-seo-tag block (see [SEO block stripping](#seo-block-stripping)), scan for code-block ranges, run the combined HTML URL rewrite (see [HTML URL rewriting](#html-url-rewriting)), inject the search-setup script tags, write.
8181
- `.css`: dup `page.output`, run the `url()` rewrite (see [CSS `url()` rewriting](#css-url-rewriting)), write.
8282
- Anything else (XML feeds, JSON, etc.): write `page.output` verbatim.
8383
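The redirect-stub rewrite described in step 3 can be sketched in Python as follows. This is an illustration only, not the plugin's code: the name `rewrite_stub` and the `https://example.test` site URL are hypothetical, and `posixpath.relpath` stands in for the plugin's `compute_relative`, which also probes candidate paths and is not reproduced here.

```python
import posixpath
import re


def rewrite_stub(html: str, site_url: str, stub_path: str) -> str:
    """Rewrite each `<site_url>/<path>` in a redirect stub to a path
    relative to the stub's own directory (simplified illustration)."""
    if not site_url:
        # Mirrors the documented behaviour: with no site.url the stub
        # is written verbatim.
        return html

    # Anchor both sides at "/" so relpath never consults the process cwd.
    stub_dir = "/" + posixpath.dirname(stub_path)
    pattern = re.compile(re.escape(site_url.rstrip("/")) + r"(/[^\s\"'<>]*)")

    def to_relative(match: re.Match) -> str:
        return posixpath.relpath(match.group(1), start=stub_dir)

    return pattern.sub(to_relative, html)
```

For a stub at `old/Const.html`, a target of `https://example.test/tB/Core/Const.html` becomes `../tB/Core/Const.html`, so the meta-refresh stays inside the local tree. Note that `relpath` normalises away trailing slashes, which the real resolver may treat differently.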

@@ -99,6 +99,14 @@ Fires at `:site, :post_write` — once after Jekyll's WRITE phase has populated
9999

100100
## Transformation passes
101101

102+
### SEO block stripping
103+
104+
The jekyll-seo-tag plugin emits a ~900-byte block at the top of every page's `<head>`, bracketed by `<!-- Begin Jekyll SEO tag vX.Y.Z -->` and `<!-- End Jekyll SEO tag -->` comments. Inside live a `<title>`, a generator tag, OpenGraph and Twitter Card meta, a `<link rel="canonical">` pointing at the live site, and a JSON-LD structured-data `<script>`. All of it exists for search-engine crawlers and social-media link previewers that never see `_site-offline/`.
105+
106+
The whole block is stripped, except the `<title>` (the browser tab label, the only thing in the block a local reader actually uses). The bracketing comments go away too. On the current ~830-page site, the strip saves roughly 750 KB across the offline tree and removes three of the four `https://docs.twinbasic.com` references each page would otherwise contain (the fourth, the JSON-LD `"url"` field, is also inside the SEO block).
107+
108+
Runs first in the `.html` branch of `process_page` so the URL rewrite isn't doing work on URLs we're about to delete, and so the code-block scan's byte offsets are valid against the post-strip content.
109+
102110
### HTML URL rewriting
103111

104112
A single combined regex matches both absolute and page-relative URLs in `href`/`src` attributes:
@@ -292,7 +300,8 @@ The offline build touches the following files:
292300
| `docs/_config.yml` | `also_build_offline: true` (default-on) and `exclude: [_site-offline]` (keeps Jekyll's watcher from rebuilding on the plugin's own output). |
293301
| `docs/build.bat` | Plain `bundle exec jekyll build` — produces `_site/`, `_site-offline/`, and (via `pdfify.rb`) `_site-pdf/` in one run. |
294302
| `docs/serve.bat` | `bundle exec jekyll serve` — watcher-friendly thanks to the exclude. |
295-
| `docs/check.bat` | Dual lychee — strict on `_site-offline/`, permissive (`--fallback-extensions html`) on `_site/`. |
303+
| `docs/check.bat` | Local link check (dev-side only; CI runs the two lychee passes directly). Three steps: lychee permissive on `_site/`, lychee strict on `_site-offline/`, and `scripts/check_offline_live_links.py` against `_site-offline/`. Exits non-zero on any failure. |
304+
| `scripts/check_offline_live_links.py` | Flags any `https://docs.twinbasic.com/<path>` reference that survived offlinify in `_site-offline/` HTML, outside `<code>` / `<pre>` blocks. Skips the bare root (`https://docs.twinbasic.com[/]`) since intentional "go to the live site" links are allowed. Caught locally by `check.bat`; not wired into CI. |
296305
| `docs/.gitignore` | `_site`, `_site-offline`, and `_site-pdf` all excluded from git. |
297306
| `.github/workflows/jekyll-gh-pages.yml` | CI workflow. Builds, runs lychee against both trees, deploys to Pages, and (on manual dispatch) packages `_site-offline/` as a release artifact. |
298307

@@ -322,6 +331,8 @@ The plugin surfaces several conditions in its summary log lines:
322331

323332
- **`_site-offline/` triggering `jekyll serve` rebuilds.** Was a problem; now handled by two things in combination: `exclude: [_site-offline]` in `_config.yml`, and the "clean contents but keep the directory" trick in the wipe step (which keeps all watcher events under `_site-offline/...` where the exclude matches).
324333

334+
- **Surviving live-site links.** The [SEO block stripping](#seo-block-stripping) pass removes the bulk of `https://docs.twinbasic.com` references each page contains (canonical link, OpenGraph URL, JSON-LD `url`). Anything left in `_site-offline/` is a source link that points at the live docs site -- usually a markdown author writing `https://docs.twinbasic.com/<path>` instead of a relative link or `/tB/...` permalink, which would silently navigate the offline reader back online. `scripts/check_offline_live_links.py` (run by `check.bat` after the offline lychee pass) flags these locally; the bare root `https://docs.twinbasic.com[/]` is exempt since intentional "go to the live site" links are allowed. CI does not run this check.
335+
325336
## Performance
326337

327338
The optimization story is captured in the commit history. Briefly:
@@ -359,6 +370,7 @@ In source order in [`offlinify.rb`](offlinify.rb):
359370
- `rewrite_html!(content, file_dir, file_segs, site_paths, seg_cache, result_cache, baseurl, code_ranges)` — the combined HTML pass. One `gsub` per file over `HTML_COMBINED_RE`, dispatching on `raw.start_with?("/")`: absolute URLs go through `compute_relative`, page-relative URLs through `compute_rel_url`. Single cache lookup per match.
360371
- `rewrite_css!(content, file_dir, file_segs, site_paths, seg_cache, result_cache, baseurl)` — the CSS pass. One `gsub` per file over `CSS_URL_RE`, dispatched to `compute_relative` (CSS only carries absolute URLs in this codebase). No code-block handling — CSS has no equivalent concept.
361372
- `inject_search_setup!(content, file_segs)` — the second HTML transformation. Single regex substitution per file: finds the just-the-docs.js script tag and prepends the two new ones.
373+
- `strip_seo!(content)` — removes the jekyll-seo-tag plugin's output block from a page's `<head>`, keeping only the `<title>` tag. Runs first in the `.html` branch of `process_page` so the URL rewrite and code-block scan see the post-strip content.
362374
- `compute_relative(raw, file_segs, site_paths, seg_cache, baseurl)` — the absolute-URL resolver. Strip baseurl, probe candidates, compute LCP, return final URL.
363375
- `compute_rel_url(raw, file_segs, site_paths)` — the page-relative-URL resolver. Normalise against the current page's dir, probe candidates, return original raw plus matching suffix.
364376
- `patch_jtd_js!(out_dest)` — does the `navLink()` and `initSearch()` body substitutions.
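The ordering argument in the SEO block stripping section (strip first so the code-block scan's offsets match the post-strip content) can be demonstrated with a small Python sketch. The sample page, the `docs.example.test` host, and the helper name `code_ranges` are assumptions for the example; the two regexes copy the shapes shown in this commit.

```python
import re

CODE_RE = re.compile(r"<(code|pre)\b[^>]*>.*?</\1>", re.DOTALL)
SEO_RE = re.compile(
    r"<!-- Begin Jekyll SEO tag.*?<!-- End Jekyll SEO tag -->", re.DOTALL
)

# Hypothetical page: an SEO block in <head>, one code block in <body>.
page = (
    "<head><!-- Begin Jekyll SEO tag vX.Y.Z --><title>T</title>"
    '<link rel="canonical" href="https://docs.example.test/x" />'
    "<!-- End Jekyll SEO tag --></head><body><code>a</code></body>"
)


def code_ranges(html: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every <code>/<pre> block."""
    return [(m.start(), m.end()) for m in CODE_RE.finditer(html)]


stale = code_ranges(page)                       # scanned before the strip
stripped = SEO_RE.sub("<title>T</title>", page)
fresh = code_ranges(stripped)                   # scanned after the strip

# Only the post-strip offsets still point at the code block.
start, end = fresh[0]
print(stripped[start:end])
```

Offsets taken before the strip point past the shortened content, which is why the scan has to run on the post-strip string.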

docs/_plugins/offlinify.rb

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -211,6 +211,24 @@ module Offlinify
211211
# percent-encoded byte-by-byte.
212212
PATH_SAFE_RE = /[^A-Za-z0-9\-_.~!$&'()*+,;=:@]/.freeze
213213

214+
# Matches the jekyll-seo-tag plugin's output, bracketed by the
215+
# `Begin Jekyll SEO tag vX.Y.Z` / `End Jekyll SEO tag` comments
216+
# the plugin emits unconditionally. Inside the block live a
217+
# `<title>`, generator/OpenGraph/Twitter-Card meta tags, a
218+
# `<link rel="canonical">` pointing at the live site, and a
219+
# JSON-LD structured-data `<script>`. The whole block is ~900
220+
# bytes per page; the contents do nothing offline (search-engine
221+
# crawlers and social-media link previewers never see
222+
# `_site-offline/`) so all but the `<title>` (the browser tab
223+
# label) is stripped. The block is single-line in just-the-docs's
224+
# rendered output but the regex uses `.*?` with multiline mode
225+
# in case future theme versions reformat it.
226+
SEO_BLOCK_RE = /<!-- Begin Jekyll SEO tag.*?<!-- End Jekyll SEO tag -->/m.freeze
227+
228+
# Matches the `<title>...</title>` tag inside the SEO block,
229+
# preserved verbatim into the stripped output.
230+
TITLE_RE = /<title>.*?<\/title>/m.freeze
231+
214232
# Path of the just-the-docs JS file relative to the site root.
215233
JTD_JS_REL = "assets/js/just-the-docs.js"
216234

@@ -446,6 +464,7 @@ def self.setup(site)
446464
rewritten_html: 0,
447465
rewritten_css: 0,
448466
rewritten_redirects: 0,
467+
seo_stripped: 0,
449468
copied_files: 0,
450469
excluded_files: 0,
451470
unresolved: 0,
@@ -587,6 +606,7 @@ def self.process_page(page)
587606
case File.extname(dest_path).downcase
588607
when ".html"
589608
content = page.output.dup
609+
@state[:seo_stripped] += 1 if strip_seo!(content)
590610
code_ranges = code_block_ranges(content)
591611
_changed, misses = rewrite_html!(content, file_dir, file_segs, @state[:site_paths], @state[:seg_cache], @state[:result_cache], @state[:baseurl], code_ranges)
592612
@state[:unresolved] += misses
@@ -640,6 +660,7 @@ def self.finish(site)
640660

641661
summary = "rewrote #{@state[:rewritten_html]} HTML and #{@state[:rewritten_css]} CSS file(s), copied #{@state[:copied_files]} asset(s)"
642662
summary += ", rewrote #{@state[:rewritten_redirects]} redirect stub(s)" if @state[:rewritten_redirects].positive?
663+
summary += ", stripped SEO block from #{@state[:seo_stripped]} page(s)" if @state[:seo_stripped].positive?
643664
summary += ", excluded #{@state[:excluded_files]} file(s)" if @state[:excluded_files].positive?
644665
summary += " (#{@state[:unresolved]} unresolved link(s) left as-is)" if @state[:unresolved].positive?
645666
Jekyll.logger.info "Offlinify:", summary
@@ -733,6 +754,33 @@ def self.inject_search_setup!(content, file_segs)
733754
true
734755
end
735756

757+
# Strip the jekyll-seo-tag plugin's output block, keeping only the
758+
# `<title>` tag (the browser tab label). The block is ~900 bytes
759+
# per page and the rest -- generator tag, OpenGraph/Twitter Card
760+
# meta, canonical link pointing at the live site, JSON-LD
761+
# structured data -- exists for search-engine crawlers and
762+
# social-media link previewers that never see `_site-offline/`.
763+
# Stripping also removes ~3 of the `https://docs.twinbasic.com`
764+
# references per page that would otherwise need to be carved
765+
# around by any "no live-site links" check.
766+
#
767+
# Runs before the URL rewrite so the rewrite isn't doing work on
768+
# URLs we're about to delete, and before the code-block scan so
769+
# the byte offsets it produces are valid against the post-strip
770+
# content. Returns true when the strip happened, false when the
771+
# block wasn't found (e.g. a page without the layout, or a future
772+
# build where the plugin is removed).
773+
def self.strip_seo!(content)
774+
return false unless content.include?("<!-- Begin Jekyll SEO tag")
775+
new_content = content.sub(SEO_BLOCK_RE) do |block|
776+
title = block.match(TITLE_RE)
777+
title ? title[0] : ""
778+
end
779+
return false if new_content == content
780+
content.replace(new_content)
781+
true
782+
end
783+
736784
# Convert the rendered `assets/js/search-data.json` into a sibling
737785
# `assets/js/search-data.js` that assigns the data to a global. The
738786
# JS file is loaded as a `<script src=>` from each page (see
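For readers more comfortable with Python than Ruby, the `strip_seo!` method added in this file can be mirrored roughly as below. This is a hedged port written for illustration, not code from the repository: the function name `strip_seo` and its `(content, stripped)` return shape are assumptions, since Python strings cannot be mutated in place the way the Ruby version uses `content.replace`.

```python
import re

SEO_BLOCK_RE = re.compile(
    r"<!-- Begin Jekyll SEO tag.*?<!-- End Jekyll SEO tag -->", re.DOTALL
)
TITLE_RE = re.compile(r"<title>.*?</title>", re.DOTALL)


def strip_seo(content: str) -> tuple[str, bool]:
    """Drop the SEO block but keep its <title>; report whether it changed."""
    # Cheap substring test first, as in the Ruby version.
    if "<!-- Begin Jekyll SEO tag" not in content:
        return content, False

    def keep_title(match: re.Match) -> str:
        title = TITLE_RE.search(match.group(0))
        return title.group(0) if title else ""

    # count=1 matches Ruby's `sub` (first occurrence only).
    new_content = SEO_BLOCK_RE.sub(keep_title, content, count=1)
    return new_content, new_content != content
```

A page with a begin marker but no end marker is returned unchanged with `False`, matching the Ruby method's `return false if new_content == content` guard.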

docs/check.bat

Lines changed: 18 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
1-
@rem Use lychee to check the links in both build outputs.
1+
@rem Use lychee to check the links in both build outputs, then scan
2+
@rem _site-offline/ for live-site links that survived offlinify.
23
@rem
34
@rem _site/ Online tree. `--fallback-extensions html` mirrors what
45
@rem GitHub Pages does at request time: an extensionless
@@ -10,10 +11,17 @@
1011
@rem markdown sources whose permalink shape doesn't match
1112
@rem the rendered filename (e.g. `[Foo](Foo/)` when Jekyll
1213
@rem wrote `Foo.html`, not `Foo/index.html`).
14+
@rem live-links Greps _site-offline/ HTML for any surviving
15+
@rem https://docs.twinbasic.com reference outside <code> /
16+
@rem <pre> blocks. After _plugins/offlinify.rb strips the
17+
@rem jekyll-seo-tag block from each page, none should
18+
@rem remain -- a hit means a source link goes to the live
19+
@rem site instead of the canonical /tB/... permalink.
20+
@rem See ../scripts/check_offline_live_links.py.
1321
@rem
14-
@rem Both checks always run so you see all errors in one pass; the script
15-
@rem exits non-zero if either fails (online failure takes precedence in
16-
@rem the reported code).
22+
@rem All three checks always run so you see all errors in one pass; the
23+
@rem script exits non-zero if any fails (earlier failures take precedence
24+
@rem in the reported code).
1725
@setlocal
1826
@set LYCHEE="%~dp0..\.claude\lychee.exe"
1927
@echo Checking _site/ (online) ...
@@ -28,5 +36,10 @@
2836
@rem such fallback, and the link is just broken.
2937
@%LYCHEE% --offline --include-fragments --index-files "index.html" --root-dir ".\_site-offline" ".\_site-offline" %*
3038
@set EXIT2=%ERRORLEVEL%
39+
@echo.
40+
@echo Checking _site-offline/ for live-site links ...
41+
@python "%~dp0..\scripts\check_offline_live_links.py"
42+
@set EXIT3=%ERRORLEVEL%
3143
@if %EXIT1% NEQ 0 exit /b %EXIT1%
32-
@exit /b %EXIT2%
44+
@if %EXIT2% NEQ 0 exit /b %EXIT2%
45+
@exit /b %EXIT3%
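The exit-code policy of the updated `check.bat` (run every check, then report the earliest failure) can be expressed as a short Python sketch. The function name `run_all_checks` and the idea of passing each check as a callable are assumptions for the example; the batch file itself stores `%ERRORLEVEL%` in `EXIT1` to `EXIT3`.

```python
from typing import Callable, Iterable


def run_all_checks(checks: Iterable[Callable[[], int]]) -> int:
    """Run every check, then return the first non-zero exit code (or 0)."""
    # Build the full list first so a failing check never skips later ones.
    exit_codes = [check() for check in checks]
    return next((code for code in exit_codes if code != 0), 0)
```

Collecting all exit codes before choosing one is what lets a developer see every failure in a single pass, while the process still exits with a deterministic code.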
scripts/check_offline_live_links.py

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
1+
"""
2+
Scan docs/_site-offline/ for any https://docs.twinbasic.com/<path>
3+
reference outside of <code> / <pre> blocks. Exit 1 if any found,
4+
0 otherwise.
5+
6+
Run by docs/check.bat after the offline lychee pass. After
7+
_plugins/offlinify.rb's SEO-block strip, no live-site references
8+
should remain except:
9+
10+
* Sample URLs inside <code> / <pre> blocks (tutorial code that
11+
legitimately shows live URLs as data, e.g. the VBRUN.Hyperlink
12+
`NavigateTo "https://docs.twinbasic.com/"` example). Skipped
13+
via the same code-block shape offlinify uses for its URL
14+
rewrite.
15+
* The bare root URL `https://docs.twinbasic.com` or
16+
`https://docs.twinbasic.com/` -- intentional "go to the live
17+
docs site" links (e.g. the Documentation entry in the FAQ
18+
resource list). Skipped via the tail check below.
19+
20+
Anything deeper (`https://docs.twinbasic.com/tB/Core/Const`,
21+
`https://docs.twinbasic.comi`, ...) is flagged: in the offline
22+
copy those navigate back to the live site, undermining the local
23+
read; in source they should be a relative link or a /tB/...
24+
permalink that resolves locally.
25+
26+
Run from anywhere:
27+
python scripts/check_offline_live_links.py
28+
"""
29+
30+
import re
31+
import sys
32+
from pathlib import Path
33+
34+
SCRIPT_DIR = Path(__file__).resolve().parent
35+
REPO_ROOT = SCRIPT_DIR.parent
36+
OFFLINE_TREE = REPO_ROOT / "docs" / "_site-offline"
37+
38+
# Matches a <code>...</code> or <pre>...</pre> block. Same shape as
39+
# _plugins/offlinify.rb CODE_BLOCK_RE so sample URLs in tutorial
40+
# code are skipped here too.
41+
CODE_BLOCK_RE = re.compile(r"<(code|pre)\b[^>]*>.*?</\1>", re.DOTALL)
42+
43+
# Captures the trailing path/typo characters after the domain. An
44+
# empty tail or `/` means the bare root URL (intentional). Anything
45+
# else is a deep link or a typo (`.comi`, `.com/tB/...`).
46+
LIVE_LINK_RE = re.compile(r"https://docs\.twinbasic\.com(?P<tail>[^\s\"'<>]*)")
47+
48+
49+
def main() -> int:
50+
if not OFFLINE_TREE.is_dir():
51+
print(
52+
f"_site-offline/ not found at {OFFLINE_TREE} -- run docs/build.bat first."
53+
)
54+
return 2
55+
56+
hits = []
57+
for html in sorted(OFFLINE_TREE.rglob("*.html")):
58+
content = html.read_text(encoding="utf-8")
59+
link_matches = list(LIVE_LINK_RE.finditer(content))
60+
if not link_matches:
61+
continue
62+
code_ranges = [(m.start(), m.end()) for m in CODE_BLOCK_RE.finditer(content)]
63+
for m in link_matches:
64+
tail = m.group("tail")
65+
if tail == "" or tail == "/":
66+
continue
67+
if any(s <= m.start() < e for s, e in code_ranges):
68+
continue
69+
line_num = content.count("\n", 0, m.start()) + 1
70+
start = max(0, m.start() - 60)
71+
end = min(len(content), m.start() + 80)
72+
snippet = re.sub(r"[\r\n]+", " ", content[start:end])
73+
hits.append((html, line_num, snippet))
74+
75+
if hits:
76+
print(
77+
f"FAIL: {len(hits)} reference(s) to docs.twinbasic.com in "
78+
f"_site-offline/ outside code blocks:"
79+
)
80+
for path, line_num, snippet in hits:
81+
try:
82+
rel = path.relative_to(REPO_ROOT)
83+
except ValueError:
84+
rel = path
85+
print(f" {rel}:{line_num}: ...{snippet}...")
86+
print()
87+
print(
88+
"Update the source markdown to use a relative link or /tB/... "
89+
"permalink instead."
90+
)
91+
return 1
92+
93+
return 0
94+
95+
96+
if __name__ == "__main__":
97+
sys.exit(main())
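To make the script's filtering rules concrete, the sketch below applies the same two regexes to a small in-memory sample instead of walking `_site-offline/`. The helper name `flagged_links` and the sample HTML are assumptions for the example; the skip rules (bare root, inside `<code>` / `<pre>`) follow the script above.

```python
import re

CODE_BLOCK_RE = re.compile(r"<(code|pre)\b[^>]*>.*?</\1>", re.DOTALL)
LIVE_LINK_RE = re.compile(r"https://docs\.twinbasic\.com(?P<tail>[^\s\"'<>]*)")


def flagged_links(content: str) -> list[str]:
    """Return live-site URLs that the checker would report."""
    code_ranges = [(m.start(), m.end()) for m in CODE_BLOCK_RE.finditer(content)]
    flagged = []
    for m in LIVE_LINK_RE.finditer(content):
        if m.group("tail") in ("", "/"):
            continue  # bare root: an intentional link to the live site
        if any(start <= m.start() < end for start, end in code_ranges):
            continue  # sample URL shown as data inside a code block
        flagged.append(m.group(0))
    return flagged


sample = (
    '<a href="https://docs.twinbasic.com/">live site</a>\n'
    '<a href="https://docs.twinbasic.com/tB/Core/Const">Const</a>\n'
    '<code>NavigateTo "https://docs.twinbasic.com/tB/Sample"</code>\n'
)
print(flagged_links(sample))
```

Of the three URLs in the sample, only the deep link in the second anchor is reported: the first is the bare root and the third sits inside a `<code>` block.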
