Add checks for broken docs urls#6448

Merged
Alek99 merged 4 commits into main from carlos/docs-links-ci
May 5, 2026

Conversation

@carlosabadia
Contributor

No description provided.

@carlosabadia carlosabadia requested review from a team and Alek99 as code owners May 4, 2026 11:36
@carlosabadia carlosabadia added the documentation Improvements or additions to documentation label May 4, 2026
@codspeed-hq

codspeed-hq Bot commented May 4, 2026

Merging this PR will not alter performance

✅ 24 untouched benchmarks
⏩ 2 skipped benchmarks (see footnote 1)


Comparing carlos/docs-links-ci (c7ed339) with main (3702d23)


Footnotes

  1. 2 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

@greptile-apps
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

This PR adds a new GitHub Actions workflow and Python script that validate /docs/* Markdown links against the Reflex app's generated sitemap.xml, catching broken URLs and underscore-in-path violations before they reach production. The implementation is well-structured, correctly strips fragments/query strings before the underscore check, and ships good test coverage — including the fragment-underscore false-positive regression case from prior review.

  • The LINK_RE regex only handles double-quoted Markdown link titles ("..."), not the single-quoted ('...') or parenthesised ((...)) forms. In a link like [text](/docs/foo 'My Title'), the title text would be absorbed into the captured raw URL, so every such link would report a spurious "not found in sitemap" error.
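
To illustrate the gap described above, here is a minimal sketch of a pattern that tolerates all three title forms. It is not the PR's LINK_RE; the names LINK_WITH_TITLE_RE and extract_doc_urls are hypothetical, and the pattern only covers simple inline links whose destination starts with /docs.

```python
import re

# Hypothetical pattern (not the PR's LINK_RE). It captures the destination of
# an inline Markdown link and allows an optional title in double quotes,
# single quotes, or parentheses.
LINK_WITH_TITLE_RE = re.compile(
    r"""
    \[[^\]]*\]                              # link text in square brackets
    \(\s*                                   # opening parenthesis of the destination
    (?P<url>/docs(?:/[^\s)]*)?)             # destination under /docs, no whitespace
    (?:\s+(?:"[^"]*"|'[^']*'|\([^)]*\)))?   # optional title in one of three forms
    \s*\)                                   # closing parenthesis
    """,
    re.VERBOSE,
)


def extract_doc_urls(line: str) -> list[str]:
    """Return the /docs destinations found on one line, without any title text."""
    return [match.group("url") for match in LINK_WITH_TITLE_RE.finditer(line)]
```

With a pattern along these lines, the title never becomes part of the captured URL, so the later sitemap lookup sees only the path.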

Confidence Score: 4/5

Safe to merge after addressing the single-quoted title regex gap; otherwise the tool works correctly.

One P1 logic issue: single-quoted Markdown link titles are not stripped from the captured URL, causing false-positive "not found in sitemap" errors. All other logic (fragment/query stripping for the underscore check, sitemap prefix normalization, skip-dirs) is correct and well-tested.
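
The fragment/query stripping mentioned above can be sketched as follows. This is an assumption about the approach, not the script's actual code; the helper name path_has_underscore is hypothetical, and it relies on urllib.parse.urlsplit to separate the path from any query string or fragment.

```python
from urllib.parse import urlsplit


def path_has_underscore(raw_url: str) -> bool:
    """Check for underscores in the path only, ignoring query and fragment."""
    # urlsplit separates "/docs/foo?a=1#some_anchor" into path, query, fragment,
    # so an underscore in an anchor name does not count as a violation.
    path_only = urlsplit(raw_url).path
    return "_" in path_only
```

Checking only the path avoids the false positive where a valid page is linked with an anchor such as #bar_baz.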

docs/app/scripts/check_doc_links.py — specifically the LINK_RE constant on line 25.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/check_doc_links.yml | New CI workflow that builds the Reflex frontend to generate sitemap.xml, then runs the link-checker script; triggers on docs/**/*.md, the script, and this file itself. |
| docs/app/scripts/check_doc_links.py | New script scanning .md files for /docs/* links and validating them against sitemap.xml; correctly strips fragment/query before underscore check, handles both /docs-prefixed and non-prefixed sitemaps. |
| docs/app/tests/test_doc_links.py | Comprehensive unit tests covering valid links, missing links, underscore detection, fragment handling, skip-dirs, and both sitemap prefix styles; includes the fragment-underscore false-positive regression test. |
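
As a rough illustration of handling both /docs-prefixed and non-prefixed sitemaps, the sketch below parses sitemap XML and normalizes every entry to a /docs-prefixed path. The function names and the exact normalization rules are assumptions for illustration and may differ from the script's load_sitemap_paths.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def normalize_to_docs_path(path: str) -> str:
    """Hypothetical normalization: drop trailing slash, ensure a /docs prefix."""
    path = path.rstrip("/") or "/"
    if path == "/docs" or path.startswith("/docs/"):
        return path
    return "/docs" + ("" if path == "/" else path)


def sketch_load_sitemap_paths(xml_text: str) -> set[str]:
    """Return the set of normalized paths listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {
        normalize_to_docs_path(urlsplit(loc.text.strip()).path)
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }
```

Normalizing both styles to one form means a link such as /docs/getting-started can be looked up directly in the resulting set, whichever prefix style the sitemap uses.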

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GitHub Actions Trigger\npull_request / push to main\nwith docs path filter] --> B[Checkout & Setup Build Env\npython 3.14 + uv sync]
    B --> C[uv run reflex export\n--frontend-only --no-zip\nGenerates .web/public/sitemap.xml]
    C --> D[uv run python\nscripts/check_doc_links.py]
    D --> E[load_sitemap_paths\nParse sitemap.xml → set of normalized paths]
    D --> F[iter_md_files\nrglob *.md, skip SKIP_DIRS]
    F --> G[iter_md_links\nMatch LINK_RE on each line]
    G --> H{For each raw URL}
    H --> I{Underscore in path_only?}
    I -- Yes --> J[Append underscore error]
    I -- No --> K{sitemap_key in valid_paths?}
    J --> K
    K -- No --> L[Append not-found error]
    K -- Yes --> M[OK]
    L --> N{Any errors?}
    J --> N
    M --> N
    N -- Yes --> O[Print errors to stderr\nExit 1 → CI fails]
    N -- No --> P[Print success\nExit 0]
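
The file-discovery step in the flowchart (rglob over *.md, skipping SKIP_DIRS) could look roughly like the sketch below. The directory names in EXAMPLE_SKIP_DIRS and the function name iter_markdown_files are hypothetical; the actual contents of SKIP_DIRS are not shown in this conversation.

```python
from pathlib import Path
from typing import Iterator

# Hypothetical skip list; the real SKIP_DIRS values are not given here.
EXAMPLE_SKIP_DIRS = frozenset({"node_modules", ".web"})


def iter_markdown_files(
    root: Path, skip_dirs: frozenset[str] = EXAMPLE_SKIP_DIRS
) -> Iterator[Path]:
    """Yield every .md file under root whose parent directories are not skipped."""
    for path in sorted(root.rglob("*.md")):
        # Only the directory components relative to root are compared, so a
        # skipped name appearing above root does not hide everything.
        parent_parts = path.relative_to(root).parts[:-1]
        if not any(part in skip_dirs for part in parent_parts):
            yield path
```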

Reviews (2): Last reviewed commit: "updates"

Comment thread docs/app/scripts/check_doc_links.py Outdated
Comment thread docs/app/tests/test_doc_links.py
@masenf
Collaborator

masenf commented May 4, 2026

@greptile-apps re-review

Comment thread docs/app/scripts/check_doc_links.py Outdated
Collaborator

@masenf masenf left a comment

Can the GitHub Actions workflow be an add-on step for the existing reflex-docs regression? Most of the time taken in this workflow is actually building the app, but we already do that in the other workflow, so we basically get the link checking for free.

I also think the script output could be a little more verbose, so you can see all of the links that got checked in CI instead of just "All /docs links resolve against sitemap.xml." and having to trust that it actually checked them and didn't just scan the wrong dir and find no Markdown files.

Finally, love the tests for the test script, very nice.
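
One way the more verbose output suggested above might be sketched: print every checked link, fail loudly when no Markdown files were scanned, and end with a summary line. The function name report_results, its parameters, and the message wording are all hypothetical and not taken from the merged script.

```python
import sys


def report_results(
    checked: list[tuple[str, int, str]],
    errors: list[str],
    files_scanned: int,
) -> int:
    """Print each checked link and return a process exit code (0 ok, 1 failure)."""
    if files_scanned == 0:
        # Guards against silently "passing" after scanning the wrong directory.
        print("No Markdown files were scanned; check the docs root.", file=sys.stderr)
        return 1
    for file_name, line_number, url in checked:
        print(f"checked {file_name}:{line_number} -> {url}")
    for error in errors:
        print(error, file=sys.stderr)
    print(
        f"{files_scanned} files scanned, {len(checked)} links checked, "
        f"{len(errors)} errors"
    )
    return 1 if errors else 0
```

Treating an empty scan as a failure is a design choice in this sketch: it makes a misconfigured path visible in CI rather than reporting success.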

@adhami3310
Member

You can parse the Markdown files instead of using regexes, by the way; reflex-docgen has a transformer that can be used for this.

@Alek99 Alek99 merged commit 0487d9b into main May 5, 2026
69 checks passed
@Alek99 Alek99 deleted the carlos/docs-links-ci branch May 5, 2026 18:27
Labels

documentation Improvements or additions to documentation

4 participants