Skip to content

Fix broken cross-package Haddock links on docs site#1180

Merged
Jimbo4350 merged 5 commits into
masterfrom
issue-601-chap-haddock-links
Apr 30, 2026
Merged

Fix broken cross-package Haddock links on docs site#1180
Jimbo4350 merged 5 commits into
masterfrom
issue-601-chap-haddock-links

Conversation

@Jimbo4350
Copy link
Copy Markdown
Contributor

@Jimbo4350 Jimbo4350 commented Apr 9, 2026

Changelog

- description: |
    Fix broken cross-package Haddock links on the hosted documentation site.
    Links to dependency packages (cardano-ledger-*, plutus-*, cardano-base,
    etc.) were relative paths pointing to directories that don't exist on the
    site, resulting in 404s. A post-processing script now rewrites these to
    absolute URLs pointing at the correct external documentation hosts, or to
    annotated unclickable spans where no upstream doc site exists.
  type:
   - bugfix
   - documentation
  projects:
   - cardano-api

Context

Fixes #601. The hosted Haddock site at cardano-api.cardano.intersectmbo.org only contains docs for packages in this repo. When Haddock generates cross-references to dependency types, it produces relative links like ../cardano-ledger-core/Cardano-Ledger-Credential.html — but those directories don't exist on the site.

The cabal haddock-project --html-location flag only accepts a single URL template, but our dependencies are spread across multiple hosts. The post-processing script scripts/fix-haddock-links.sh instead resolves each cross-package href and rewrites it after haddock generation.

Resolution per CHaP package, first hit wins:

  1. Name-suffix heuristic under *.cardano.intersectmbo.org — strip trailing -token segments and HEAD-probe each candidate's doc-index.html. Catches cardano-ledger-*, plutus-*, ouroboros-*, etc.

    Worked examplecardano-ledger-api:

    • candidate 1: https://cardano-ledger-api.cardano.intersectmbo.org/cardano-ledger-api/doc-index.html → 404 (no such subdomain)
    • candidate 2: https://cardano-ledger.cardano.intersectmbo.org/cardano-ledger-api/doc-index.html200

    Package resolves to base https://cardano-ledger.cardano.intersectmbo.org. Probing stops at the first 200; we never need candidate 3 (cardano).

  2. IOG_DOC_BASES fallback — known doc-site roots for irregular subdomains (e.g. cardano-base lives at base.cardano.intersectmbo.org).

Hrefs that don't resolve become annotated unclickable <span>s with tooltips. Non-CHaP packages (base, bytestring, etc.) are deliberately not linked — Haddock's URL shapes don't line up cleanly with Hackage. Module-level 404s at otherwise-valid sites are rescued via parent-module fallback where possible (Foo-Bar.htmlFoo.html#t:Bar); otherwise also become spans. Full design and CI policy (actionable vs unfixable buckets, KNOWN_UNDOCUMENTED allowlist) documented in the script preamble.

Tracking-issue automation

When the workflow's Fix cross-package Haddock links step fails on master, a follow-up GH Actions step opens (or comments on) a single rolling tracking issue labelled haddock-ci-failure with the PR opener as assignee. One issue per outage period — subsequent failures append a comment instead of duplicating; closing the issue resets the cycle.

Validation on a separate test branch (issue-601-test-tracking) with a synthetic break (typed-protocols removed from IOG_DOC_BASES, master-ref gate temporarily relaxed):

Test branch and test issue will be deleted/closed after merge.

How to trust this PR

  • The script only does sed-style rewrites on the post-haddock HTML output — it cannot affect source code or build artefacts.
  • All rewritten URLs are HEAD-validated; dead ones become unclickable spans, so the published site has zero clickable 404s.
  • Tracking-issue automation validated end-to-end via the runs linked above.

Checklist

  • Commit sequence broadly makes sense and commits have useful messages
  • New tests are added if needed and existing tests are updated
  • Self-reviewed the diff

@Jimbo4350 Jimbo4350 requested a review from newhoggy as a code owner April 9, 2026 16:42
Copilot AI review requested due to automatic review settings April 9, 2026 16:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes broken cross-package Haddock links on the hosted docs site by post-processing generated HTML and rewriting relative dependency links to absolute URLs for the appropriate external documentation hosts.

Changes:

  • Add a scripts/fix-haddock-links.sh post-processing script to rewrite href="../<pkg>/..." links to external doc hosts (ledger, base, plutus, etc., plus selected Hackage packages).
  • Run the link-fixing script in the GitHub Pages workflow after cabal haddock-project generation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
scripts/fix-haddock-links.sh New bash script that rewrites relative Haddock cross-package links to absolute URLs per package/host mapping.
.github/workflows/github-page.yml Invokes the new script as part of the docs build pipeline before deployment.

Comment thread scripts/fix-haddock-links.sh Outdated
Comment on lines +20 to +21
# Mapping of package-name-prefix -> base URL
# Each entry is: "prefix|base_url"
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says this is a mapping of "package-name-prefix -> base URL", but the implementation matches exact package names (no prefix matching). This is misleading for future maintainers adding entries; either adjust the comment (e.g. "package name -> base URL") or change the implementation to do prefix-based matching as documented.

Suggested change
# Mapping of package-name-prefix -> base URL
# Each entry is: "prefix|base_url"
# Mapping of package name -> base URL
# Each entry is: "package_name|base_url"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obsolete — this comment was on an earlier version of the script that had a hardcoded mapping table. Resolution is now via the name-suffix heuristic + IOG_DOC_BASES fallback (see commit 0cc7cf1 and the script preamble).

Comment thread scripts/fix-haddock-links.sh Outdated
Comment on lines +93 to +95
echo "Fixing cross-package Haddock links in $WEBSITE_DIR..."

find "$WEBSITE_DIR" -name '*.html' -print0 | xargs -0 -P "$(nproc)" sed -i "${SED_ARGS[@]}"
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script assumes GNU userland: nproc and sed -i without a backup suffix are not available/compatible on macOS by default, and the repo’s Nix devShell doesn’t appear to provide GNU coreutils/sed. If contributors run this locally on darwin it will fail; consider either (a) using a portable CPU-count fallback and a cross-platform in-place edit approach, or (b) explicitly using/providing GNU sed/coreutils (e.g. via Nix inputs or calling gsed/gnproc).

Suggested change
echo "Fixing cross-package Haddock links in $WEBSITE_DIR..."
find "$WEBSITE_DIR" -name '*.html' -print0 | xargs -0 -P "$(nproc)" sed -i "${SED_ARGS[@]}"
if command -v nproc >/dev/null 2>&1; then
CPU_COUNT="$(nproc)"
elif command -v getconf >/dev/null 2>&1; then
CPU_COUNT="$(getconf _NPROCESSORS_ONLN 2>/dev/null || true)"
elif command -v sysctl >/dev/null 2>&1; then
CPU_COUNT="$(sysctl -n hw.ncpu 2>/dev/null || true)"
else
CPU_COUNT=1
fi
CPU_COUNT="${CPU_COUNT:-1}"
if sed --version >/dev/null 2>&1; then
SED_INPLACE=(-i)
else
SED_INPLACE=(-i '')
fi
echo "Fixing cross-package Haddock links in $WEBSITE_DIR..."
find "$WEBSITE_DIR" -name '*.html' -print0 | xargs -0 -P "$CPU_COUNT" -n 1 sed "${SED_INPLACE[@]}" "${SED_ARGS[@]}"

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — this script runs only on ubuntu-latest from .github/workflows/github-page.yml; cardano-api's CI is Linux-only and the nix devShell is Linux too. macOS portability would be defensive churn for a use case we don't have.

@Jimbo4350 Jimbo4350 force-pushed the issue-601-chap-haddock-links branch from ecc4783 to 1b7422c Compare April 9, 2026 19:54
Comment thread scripts/fix-haddock-links.sh
Comment thread scripts/fix-haddock-links.sh
@Jimbo4350 Jimbo4350 force-pushed the issue-601-chap-haddock-links branch 2 times, most recently from 2ed6d5d to 34323b6 Compare April 17, 2026 13:05
@Jimbo4350 Jimbo4350 force-pushed the issue-601-chap-haddock-links branch 2 times, most recently from 42885bd to dc943d1 Compare April 24, 2026 22:46
@Jimbo4350 Jimbo4350 requested a review from Copilot April 27, 2026 13:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment on lines +42 to +44
- name: Fix cross-package Haddock links
run: |
./scripts/fix-haddock-links.sh ./website
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repo uses Herald changelog fragments in .changes/ (see .herald.yml). This PR appears to add behavior affecting the hosted docs pipeline but does not include a new .changes/*.yml fragment; CI typically enforces this, so please add an appropriate fragment (likely bugfix and/or documentation, project cardano-api).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0396d5f — added .changes/20260428_cardano_api_fix_haddock_links.yml.

[ -d "$WEBSITE_DIR" ] || { echo "Error: $WEBSITE_DIR is not a directory" >&2; exit 1; }

TMPFILES=()
trap 'rm -f "${TMPFILES[@]}"' EXIT
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The EXIT trap runs rm -f "${TMPFILES[@]}". If the script exits before any tmpfiles are created (e.g., invalid WEBSITE_DIR), this expands to rm -f with no operands and can emit an error in the trap. Guard the cleanup (e.g., only call rm when the array is non-empty).

Suggested change
trap 'rm -f "${TMPFILES[@]}"' EXIT
cleanup_tmpfiles() {
if (( ${#TMPFILES[@]} > 0 )); then
rm -f "${TMPFILES[@]}"
fi
}
trap cleanup_tmpfiles EXIT

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skipping — non-issue on this CI: GNU rm -f with no operands silently exits 0, and bash 4.4+ doesn't trip set -u on empty-array expansion. The trap is safe on ubuntu-latest, which is where this workflow runs.

Comment thread scripts/fix-haddock-links.sh Outdated
local code
code=$(curl -sI -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 \
"${1}/${2}/doc-index.html" 2>/dev/null || echo "000")
[[ "$code" == "200" || "$code" == "301" || "$code" == "302" ]]
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HTTP status handling only treats 200/301/302 as success. Several hosts legitimately respond with 307/308 redirects (your PR description even notes 307 on Hackage), so these will be misclassified as missing docs and may fail the pipeline. Consider treating 307/308 as success (or enabling curl -L and checking the final status) consistently across all URL probes/validations in this script.

Suggested change
[[ "$code" == "200" || "$code" == "301" || "$code" == "302" ]]
[[ "$code" == "200" || "$code" == "301" || "$code" == "302" || "$code" == "307" || "$code" == "308" ]]

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0396d5fprobe_site plus the Phase 2b probe, Phase 3 main validator, and Phase 3a rescue probe now all accept 200/301/302/307/308.

Comment on lines +177 to +180
resolve_url() {
local pkg="$1" is_chap="$2" base
if [[ "$is_chap" != "yes" ]]; then echo "HACKAGE"; return; fi
while IFS= read -r base; do
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description says Hackage packages will be rewritten to hackage.haskell.org, but resolve_url returns HACKAGE for non-CHaP packages and Phase 2 explicitly makes those links unclickable instead of rewriting to Hackage. Either update the PR description to match the implemented behavior, or adjust the script to actually rewrite non-CHaP links to Hackage if that was the intended outcome.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed — PR description was updated yesterday to say non-CHaP packages are deliberately not linked, matching the implementation.

@Jimbo4350 Jimbo4350 force-pushed the issue-601-chap-haddock-links branch from 2f842d3 to 731b50a Compare April 27, 2026 17:24
Comment thread .changes/20260428_cardano_api_fix_haddock_links.yml Outdated
Comment thread scripts/fix-haddock-links.sh
# each of these subdirectories.
DOC_SUBDIRS=(api protocols framework)

derive_name_candidates() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strips all the way down to single-token names. derive_name_candidates "cardano-ledger-api" produces cardano.cardano.intersectmbo.org as its last candidate. If that subdomain ever resolves (wildcard DNS, some future deployment), the heuristic wins with the wrong site silently.

Adding a second break would cap it at two-token names:

derive_name_candidates() {
  local name="$1"
  while true; do
    printf '%s\n' "https://${name}.cardano.intersectmbo.org"
    [[ "$name" == *-* ]] || break
    name="${name%-*}"
    [[ "$name" == *-* ]] || break
  done
}

Comment thread .github/workflows/github-page.yml Outdated
Copy link
Copy Markdown
Contributor

@carbolymer carbolymer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jimbo4350 added a commit that referenced this pull request Apr 29, 2026
- Add changelog fragment with single-line description (no hard wraps in
  the YAML block scalar).
- derive_name_candidates: cap at two-token names. The previous loop
  emitted a single-token URL like cardano.cardano.intersectmbo.org as
  its last candidate. If that subdomain ever resolves (wildcard DNS,
  future deploy, catch-all), probe_site would silently accept it and
  rewrite every cardano-* link to the wrong site. An early break after
  the suffix strip prevents the heuristic from ever emitting a URL
  whose subdomain is a single bare token.
- Add --retry 3 --retry-delay 2 --retry-all-errors to all five curl
  sites. Every probe was previously a single-shot attempt (5s connect,
  10s max). Worst case: Phase 3 validation downgrades a perfectly
  valid rewritten URL into an unclickable <span>, silently shipping a
  regression. Other failure modes: probe_site falling through to a
  wrong candidate, or the CHaP index fetch killing the whole build on
  a CDN hiccup.
@Jimbo4350 Jimbo4350 force-pushed the issue-601-chap-haddock-links branch from 0396d5f to 03b5b49 Compare April 29, 2026 20:26
Jimbo4350 added a commit that referenced this pull request Apr 29, 2026
- Changelog (.changes/.../fix_haddock_links.yml): collapse the
  description block scalar to a single line, removing the hard wraps
  Herald renders verbatim (r3159565460).

- derive_name_candidates (scripts/fix-haddock-links.sh): cap at
  two-token names. The previous loop emitted a single-token URL like
  cardano.cardano.intersectmbo.org as its last candidate; if that
  subdomain ever resolves (wildcard DNS, future deploy, catch-all),
  probe_site silently accepts it and rewrites every cardano-* link to
  the wrong site. Add an early break after the suffix strip so the
  heuristic never emits a URL whose subdomain is a single bare token
  (r3159784338).

- curl retries (scripts/fix-haddock-links.sh): every probe was a
  single-shot attempt (5s connect, 10s max) with no retry. Worst case:
  Phase 3 validation downgrades a perfectly valid rewritten URL into
  an unclickable <span>, silently shipping a regression. Other failure
  modes: probe_site falling through to a wrong candidate, or the CHaP
  index fetch killing the whole build on a CDN hiccup. Add
  --retry 3 --retry-delay 2 --retry-all-errors to all five curl sites
  (r3159784102).

- Workflow assignee 422 (.github/workflows/github-page.yml): the
  GitHub API rejects assignees that aren't repo collaborators with a
  422. If an external contributor merges a PR that breaks the haddock-
  links check on master, the rolling-issue workflow step would crash
  there, leaving the issue or comment half-created. Wrap both
  addAssignees calls in try/catch and remove assignees from the
  issues.create payload, doing assignment as a separate best-effort
  call so the issue always lands (r3159784547).
@Jimbo4350 Jimbo4350 force-pushed the issue-601-chap-haddock-links branch from 03b5b49 to 68e0549 Compare April 29, 2026 21:02
Jimbo4350 added a commit that referenced this pull request Apr 29, 2026
- Changelog (.changes/.../fix_haddock_links.yml): collapse the
  description block scalar to a single line, removing the hard wraps
  Herald renders verbatim (r3159565460).

- derive_name_candidates (scripts/fix-haddock-links.sh): cap at
  two-token names. The previous loop emitted a single-token URL like
  cardano.cardano.intersectmbo.org as its last candidate; if that
  subdomain ever resolves (wildcard DNS, future deploy, catch-all),
  probe_site silently accepts it and rewrites every cardano-* link to
  the wrong site. Add an early break after the suffix strip so the
  heuristic never emits a URL whose subdomain is a single bare token
  (r3159784338).

- curl retries (scripts/fix-haddock-links.sh): every probe was a
  single-shot attempt (5s connect, 10s max) with no retry. Worst case:
  Phase 3 validation downgrades a perfectly valid rewritten URL into
  an unclickable <span>, silently shipping a regression. Other failure
  modes: probe_site falling through to a wrong candidate, or the CHaP
  index fetch killing the whole build on a CDN hiccup. Add
  --retry 3 --retry-delay 2 --retry-all-errors to all five curl sites
  (r3159784102).

- Workflow assignee 422 (.github/workflows/github-page.yml): the
  GitHub API rejects assignees that aren't repo collaborators with a
  422. If an external contributor merges a PR that breaks the haddock-
  links check on master, the rolling-issue workflow step would crash
  there, leaving the issue or comment half-created. Wrap both
  addAssignees calls in try/catch and remove assignees from the
  issues.create payload, doing assignment as a separate best-effort
  call so the issue always lands (r3159784547).
@Jimbo4350 Jimbo4350 force-pushed the issue-601-chap-haddock-links branch from 68e0549 to e51ad20 Compare April 29, 2026 21:10
Copy link
Copy Markdown
Contributor

@palas palas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this works, I think it is very positive. I would merge it and give it a go. Maybe iterate if we see issues in the future. It is not critical code, so I don't think it is worth spending a huge amount of time ensuring it is bullet-proof. And bash is super finicky, so I think properly reviewing it would require a PhD in bash and a few days. So here are a couple of potential issues I found

# anywhere, only source on GitHub — e.g. kes-agent).
# c. Haddock-emitted absolute Hackage URLs that lack a package
# version (Hackage's routing requires one). A handful, out
# of our control, treated as noise.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is another potential reason. Haddocks are for master branch, and master branch is linked to releases of other packages, so references may just not match because the symbol referenced may have disappeared, or moved since last release. We are still linking to it, but they won't be in the generated haddock in the repo of the dependency just because the haddock of the repo of the dependency is also built from master (not from the release we are linking, which may not even be the last one)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — this is a fourth unfixable sub-cause that was missing from the doc block. Added in 519aadc as sub-cause (d) alongside the existing (a) umbrella-vs-implementation, (b) KNOWN_UNDOCUMENTED, and (c) Hackage-URL-without-version cases.

import os, re, sys
website = sys.argv[1]
pattern = re.compile(
r'id="(t|v):([^"]+)" class="def">[^<]*</a>\s*<a href="file:///[^"]*?/([a-zA-Z][\w-]*)-\d+\.[^/]*/share/doc/html/src/([^"#]+)\.html'
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to be working on my machine, maybe because I get links like this:

file:///Users/palas/.local/state/cabal/store/ghc-9.14.1-inplace/crdn-ldgr-p-1.13.0.0-9cac6f16/share/doc/html/src/Cardano.Ledger.Api.html

But maybe it works on the GHA runner, which is what matters

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the script should warn against executing on darwin instead of failing silently

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only intended to run in CI on ubuntu-latest. macOS was never a target.

Comment thread scripts/fix-haddock-links.sh Outdated
while IFS= read -r pkg; do CHAP_SET["$pkg"]=1; done < "$CHAP_PKGS_FILE"

# Single HTML scan for all cross-package link targets
DISCOVERED_PKGS=$(grep -rohP 'href="\.\./(\.\./)?\K[a-zA-Z][a-zA-Z0-9_.-]*(?=/)' \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
DISCOVERED_PKGS=$(grep -rohP 'href="\.\./(\.\./)?\K[a-zA-Z][a-zA-Z0-9_.-]*(?=/)' \
DISCOVERED_PKGS=$(grep -rohP --include='*.html' 'href="\.\./(\.\./)?\K[a-zA-Z][a-zA-Z0-9_.-]*(?=/)' \

Otherwise it tries to match things that are not HTML

Comment on lines +438 to +441
awk -F'\t' '{print $4}' "$REEXPORT_CANDIDATES" | sort -u | \
xargs -P 16 -I{} sh -c \
'code=$(curl -sI -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 --retry 3 --retry-delay 2 --retry-all-errors "{}"); if [ "$code" = "200" ] || [ "$code" = "301" ] || [ "$code" = "302" ] || [ "$code" = "307" ] || [ "$code" = "308" ]; then echo "{}"; fi' \
> "$REEXPORT_VALID_FILE" 2>/dev/null
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
awk -F'\t' '{print $4}' "$REEXPORT_CANDIDATES" | sort -u | \
xargs -P 16 -I{} sh -c \
'code=$(curl -sI -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 --retry 3 --retry-delay 2 --retry-all-errors "{}"); if [ "$code" = "200" ] || [ "$code" = "301" ] || [ "$code" = "302" ] || [ "$code" = "307" ] || [ "$code" = "308" ]; then echo "{}"; fi' \
> "$REEXPORT_VALID_FILE" 2>/dev/null
awk -F'\t' '{print $4}' "$REEXPORT_CANDIDATES" | sort -u | \
xargs -P 16 -I{} sh -c \
'code=$(curl -sI -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 --retry 3 --retry-delay 2 --retry-all-errors "$1"); if [ "$code" = "200" ] || [ "$code" = "301" ] || [ "$code" = "302" ] || [ "$code" = "307" ] || [ "$code" = "308" ]; then echo "$1"; fi' _ {} \
> "$REEXPORT_VALID_FILE" 2>/dev/null

This avoids injection, which is not really a worry, but I figure it is probably going to be more reliable too, if we happen to get any weird character in URLs. This issue is in more places

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied in 7d1d631 across all three xargs sh -c probe call sites (Phase 2 dead-URL probe, Phase 3 reexport probe, Phase 3a rescue probe). Apostrophes in Haskell module names (e.g. Foo') would have broken the previous {} substitution by closing the outer `'...' wrapper, so this is a real fix, not just defensive.

Fixes #601. Cross-package hrefs emitted by cabal haddock-project are
relative paths (e.g. href="../cardano-ledger-api-1.2.3-hash/Foo.html")
that don't resolve on the published site — we only host cardano-api's
own output, not its dependencies, so every cross-package reference
404s by default.

Add scripts/fix-haddock-links.sh and wire it into the github-page
workflow between haddock-project and the artifact upload. The script
replaces each cross-package href with either an absolute URL on the
upstream doc site or a tooltip-annotated unclickable <span>, so the
published site has zero clickable 404s.

Pipeline
  Phase 1  Scan filesystem, symlink versioned directories, fetch the
           CHaP index, grep HTML for cross-package link targets.
  Phase 2  For each discovered target, probe candidate doc-site URLs
           and rewrite links (or mark unclickable if unresolvable).
  Phase 2b Rewrite local re-export pages to point at the defining
           upstream package, using Haddock's "Source" cabal-store link
           as ground truth for which package the type lives in.
  Phase 3  HEAD-validate rewritten URLs; rescue dead ones by probing
           doc-site subdirs (api/, protocols/, framework/) and parent
           modules with #t: fragment reconstruction. What can't be
           rescued becomes an annotated <span>.

Doc-site resolution for CHaP packages — two lookups, first hit wins:
  1. Name-suffix heuristic under *.cardano.intersectmbo.org — strip
     trailing '-token' segments of the package name and HEAD-probe
     each candidate's doc-index.html. Covers cardano-ledger-*,
     plutus-*, ouroboros-*, etc.
  2. Fixed fallback against a small IOG_DOC_BASES list — covers
     packages whose subdomain isn't a suffix of the package name
     (e.g. cardano-base at base.cardano.intersectmbo.org).

Non-CHaP packages (bootlibs like base, bytestring, time) are NOT
linked. Haddock's per-module URL structure doesn't line up cleanly
with Hackage's (src/ source views, -inplace version suffixes) so
Hackage rewrites mostly produce 404s, and readers of cardano-api docs
rarely click into bootlib internals. Rendered as unclickable spans,
no outbound link, no validation noise.

Dead-link CI policy
  Actionable (FAILS CI): a CHaP package the probe couldn't resolve to
  any doc site. Usually a gap in IOG_DOC_BASES — fix by adding the
  package's upstream doc base URL or, if genuinely unpublished, adding
  the package to KNOWN_UNDOCUMENTED.

  Unfixable (does NOT fail CI, logged for visibility): module-level
  404s on otherwise-valid upstream sites where upstream only publishes
  umbrella exposed-modules; packages with no published Haddocks
  anywhere; absolute Hackage URLs that lack a package version. All
  outside this repo to fix.

  Escape hatch FIX_HADDOCK_LINKS_ALLOW_DEAD=1 exits 0 even with
  actionable entries; under GitHub Actions, actionable entries emit
  ::warning:: annotations.

Rolling tracking issue
  Post-merge workflow failures on master open or comment on a single
  rolling issue, tagging the PR opener so the breakage lands on
  someone's board instead of going unnoticed. The Deploy step skips on
  failure, so the published site stays at its last good revision until
  the issue is resolved.

Includes a Herald changelog fragment under .changes/.
- Changelog (.changes/.../fix_haddock_links.yml): collapse the
  description block scalar to a single line, removing the hard wraps
  Herald renders verbatim (r3159565460).

- derive_name_candidates (scripts/fix-haddock-links.sh): cap at
  two-token names. The previous loop emitted a single-token URL like
  cardano.cardano.intersectmbo.org as its last candidate; if that
  subdomain ever resolves (wildcard DNS, future deploy, catch-all),
  probe_site silently accepts it and rewrites every cardano-* link to
  the wrong site. Add an early break after the suffix strip so the
  heuristic never emits a URL whose subdomain is a single bare token
  (r3159784338).

- curl retries (scripts/fix-haddock-links.sh): every probe was a
  single-shot attempt (5s connect, 10s max) with no retry. Worst case:
  Phase 3 validation downgrades a perfectly valid rewritten URL into
  an unclickable <span>, silently shipping a regression. Other failure
  modes: probe_site falling through to a wrong candidate, or the CHaP
  index fetch killing the whole build on a CDN hiccup. Add
  --retry 3 --retry-delay 2 --retry-all-errors to all five curl sites
  (r3159784102).

- Workflow assignee 422 (.github/workflows/github-page.yml): the
  GitHub API rejects assignees that aren't repo collaborators with a
  422. If an external contributor merges a PR that breaks the haddock-
  links check on master, the rolling-issue workflow step would crash
  there, leaving the issue or comment half-created. Wrap both
  addAssignees calls in try/catch and remove assignees from the
  issues.create payload, doing assignment as a separate best-effort
  call so the issue always lands (r3159784547).
Phase 1's grep scanned every file under WEBSITE_DIR. With --include='*.html'
the scan only considers HTML, avoiding spurious matches against CSS/JS/font
assets and saving wasted work. Suggested by palas in PR review.
Substituting {} directly into the inner sh -c command string would re-parse
URL characters as shell syntax (e.g. an apostrophe in a Haskell module name
like Foo' would terminate the outer single-quoted command). Pass the URL via
sh -c '...' _ {} so it expands as "$1" data instead. Suggested by palas.
Our haddocks are built from cardano-api master against pinned upstream
releases, but the upstream doc sites we link to publish from their own
master. Symbols moved or removed between the pinned release and current
master surface as 404s through no fault of this script. Pointed out by
palas in PR review.
@Jimbo4350 Jimbo4350 force-pushed the issue-601-chap-haddock-links branch from 519aadc to 1edced2 Compare April 30, 2026 12:01
@Jimbo4350 Jimbo4350 enabled auto-merge April 30, 2026 12:01
@Jimbo4350 Jimbo4350 added this pull request to the merge queue Apr 30, 2026
Merged via the queue into master with commit e194f1a Apr 30, 2026
33 checks passed
@Jimbo4350 Jimbo4350 deleted the issue-601-chap-haddock-links branch April 30, 2026 13:41
Jimbo4350 added a commit that referenced this pull request Apr 30, 2026
Brings in PR #1180 (Haddock cross-package link fix) which landed
on master after the release branch was cut. Docs-site only — no
package code changes.
Jimbo4350 added a commit that referenced this pull request Apr 30, 2026
Brings in PR #1180 (Haddock cross-package link fix) which landed
on master after the release branch was cut. Docs-site only — no
package code changes.
Copilot AI mentioned this pull request May 15, 2026
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] - CHaP packages links in Haddocks are broken

4 participants