ci(check-links): exclude non-ASCII bytes when extracting GitHub URLs #5336

Closed
davidkonigsberg wants to merge 1 commit into main from devin/1777896452-fix-check-links-cjk-urls

Conversation

@davidkonigsberg
Contributor

Summary

Follow-up to #5334. After the sitemap fix landed, the next Check Links run still failed — but for a different reason: the local GitHub URL verification step extracted

https://github.com/fern-api/protoc-gen-openapi/releases/tag/v0.1.7)

(note the trailing fullwidth right parenthesis, U+FF09, UTF-8 bytes EF BC 89) from /home/ubuntu/repos/docs/fern/translations/zh/products/sdks/generators/csharp/changelog/2026-03-13.mdx, then tried to verify that a tag literally named v0.1.7) exists in fern-api/protoc-gen-openapi, which of course it doesn't.

The URL terminator class in each grep was [^"')<>[:space:]]+ — i.e. ASCII whitespace and a few markdown delimiters. The Chinese paren is none of those, so it was absorbed into the URL.

Fix: switch the URL extraction greps to grep -P (PCRE) with a terminator class that also excludes any non-ASCII byte (\x80-\xff), and export LC_ALL=C in those steps so PCRE runs in byte-oriented mode. URLs are pure ASCII per RFC 3986, so this is a tighter, correct constraint and is robust to any future CJK / accented punctuation that might land next to a bare URL in translated prose.
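
As a rough illustration of the tightened class, here is a Python sketch of byte-oriented matching with `\x80-\xff` excluded. It is not the workflow's grep -P invocation: the function name `extract_github_urls` and the `https://github\.com/` prefix are hypothetical, chosen only for this example.

```python
import re

# Same terminator class as before, plus every byte in \x80-\xff.
# Matching on UTF-8 bytes mirrors the byte-oriented mode that LC_ALL=C is meant to give grep -P.
# The "https://github\.com/" prefix is an assumption for this sketch.
NEW_URL = re.compile(rb"https://github\.com/[^\"')<>\s\x80-\xff]+")


def extract_github_urls(text: str) -> list[str]:
    """Return GitHub URLs found in text, stopping at the first non-ASCII byte."""
    raw = text.encode("utf-8")
    # Every matched byte is below 0x80, so decoding as ASCII cannot fail.
    return [m.decode("ascii") for m in NEW_URL.findall(raw)]


sample = "(https://github.com/fern-api/protoc-gen-openapi/releases/tag/v0.1.7)"
print(extract_github_urls(sample))
```

Because every byte of a multi-byte UTF-8 character is 0x80 or higher, excluding that range stops the match at the first byte of any CJK or accented punctuation, without listing individual characters.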

Verified locally on the current main:

  • releases/tag extraction: 8 URLs before / 8 after. The only difference is that the previously broken URL is now extracted as …/v0.1.7 instead of …/v0.1.7).
  • blob/tree, issues, compare, commits?, discussions, pull extractions: identical sets before/after.

Review & Testing Checklist for Human

  • Trigger the Check Links workflow manually via workflow_dispatch on this branch and confirm the job completes green (i.e. GitHub URLs missing locally: 0).

Notes

This only fixes the URL extractor. If you'd prefer the source markdown to use ASCII parens (or, better, a real [text](url) link) in the zh translation, that's a separate change in /home/ubuntu/repos/docs/fern/translations/zh/products/sdks/generators/csharp/changelog/2026-03-13.mdx; happy to do that as a follow-up.

The earlier-flagged empty sitemap-en.xml issue (buildwithfern.com/learn/sitemap-en.xml returning an empty <urlset>) is unrelated to this PR and still worth investigating in fern-platform.

Link to Devin session: https://app.devin.ai/sessions/f71d9cbb5efc401799576519b960ef05
Requested by: @davidkonigsberg

The releases/tag and other GitHub URL extraction regexes terminated only
on ASCII whitespace and common markdown delimiters. Translated prose
(e.g., the recently-added zh translations) can place a fullwidth Chinese
right parenthesis (U+FF09) immediately after a URL, which got absorbed
into the extracted URL and made the local tag verification fail.

Switch to grep -P with a terminator class that also excludes any
non-ASCII byte (\x80-\xff). LC_ALL=C is set so PCRE runs in
byte-oriented mode and the byte range matches as expected.

Co-Authored-By: David Konigsberg <davidakonigsberg@gmail.com>
@devin-ai-integration
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

@github-actions
Contributor

github-actions Bot commented May 4, 2026

davidkonigsberg deleted the devin/1777896452-fix-check-links-cjk-urls branch on May 5, 2026 at 11:33