
fix: preserve Chinese characters in permalinks #103

Merged
phernandez merged 2 commits into basicmachines-co:main from andyxinweiminicloud:fix-chinese-permalink
Apr 17, 2025
@andyxinweiminicloud
Contributor

Problem

The current implementation converts all Unicode characters to ASCII using unidecode, which causes Chinese characters in permalinks to be transliterated. This makes it difficult for Chinese users to recognize the permalinks, as they no longer match the original Chinese titles.

Solution

Modified the generate_permalink function in utils.py to preserve non-ASCII characters while still properly handling ASCII characters:

  1. Removed the Unicode to ASCII transliteration step using unidecode
  2. Updated character processing to only lowercase ASCII letters while preserving non-ASCII characters
  3. Modified the regex pattern to retain Chinese character ranges in permalinks
  4. Kept all other functionality (camelCase handling, replacing underscores, etc.) intact
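The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name `generate_permalink_sketch` is hypothetical, the exact order of operations is an assumption, and the character ranges are the ones quoted in the review comment further down this thread.

```python
import re

# Assumed ranges (from the review comment in this thread):
# CJK Unified Ideographs, CJK Symbols and Punctuation, CJK Extension A.
CJK_RANGES = "\u4e00-\u9fff\u3000-\u303f\u3400-\u4dbf"


def generate_permalink_sketch(path: str) -> str:
    """Hypothetical stand-in for generate_permalink; not the project's code."""
    # Split camelCase: hyphen between a lowercase/digit and an uppercase ASCII letter.
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1-\2", path)
    # Lowercase ASCII characters only; non-ASCII characters pass through unchanged.
    text = "".join(ch.lower() if ch.isascii() else ch for ch in text)
    # Replace underscores and whitespace runs with a single hyphen.
    text = re.sub(r"[_\s]+", "-", text)
    # Drop everything except ASCII alphanumerics, the CJK ranges, "/" and "-".
    text = re.sub(rf"[^a-z0-9{CJK_RANGES}/\-]", "", text)
    # Collapse repeated hyphens.
    text = re.sub(r"-+", "-", text)
    # Trim stray hyphens from each path segment.
    return "/".join(segment.strip("-") for segment in text.split("/"))


print(generate_permalink_sketch("docs/My_File Name"))
print(generate_permalink_sketch("中文/测试文档"))
print(generate_permalink_sketch("Hello 世界"))
```

Under these assumptions, ASCII input is normalized as before, while Chinese path segments come out unchanged instead of being transliterated.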

This change allows Chinese characters to be preserved in permalinks, making them more intuitive and readable for Chinese users while maintaining compatibility with existing systems.

Testing

Successfully tested with various Chinese file paths and titles. Both file paths and permalinks correctly preserve Chinese characters while still properly processing English text.

@CLAassistant

CLAassistant commented Apr 16, 2025

CLA assistant check
All committers have signed the CLA.

@phernandez
Member

claude: is this fix safe for permalinks?

@github-actions
Contributor

The code changes in this PR are safe for permalinks for the following reasons:

  1. Maintains Core Functionality: The fix preserves all essential permalink formatting operations (camelCase handling, replacing underscores with hyphens, collapsing multiple hyphens, etc.) while only modifying the character handling approach.

  2. Unicode Handling: The change removes unidecode() which previously converted Chinese characters to ASCII approximations. The new approach preserves non-ASCII characters but still properly handles ASCII characters.

  3. Conservative Approach: The regex pattern specifically includes common Chinese character ranges (\u4e00-\u9fff, \u3000-\u303f, \u3400-\u4dbf) while maintaining compatibility with existing ASCII handling.

  4. Documented Change: The docstring was updated to clearly state that non-ASCII characters are preserved, which is now reflected in both the code and examples.

  5. Backward Compatibility: Existing ASCII-only permalinks will be generated exactly the same way, ensuring no breaking changes for existing content.

The changes align with the codebase's UTF-8 handling approach seen in the test files, which already have comprehensive test coverage for UTF-8 characters throughout the system. Since permalinks are a critical part of the knowledge graph, preserving Chinese characters will improve usability for Chinese users without compromising functionality.
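To make the listed ranges concrete, here is a small check of which characters a keep-list built from them would retain. It assumes the pattern keeps only lowercase ASCII alphanumerics plus the three ranges named in point 3; the actual pattern in utils.py may differ, and `is_kept` is a hypothetical helper.

```python
import re

# Assumption: keep-list = lowercase ASCII alphanumerics + the three ranges from point 3.
KEEP = re.compile("[a-z0-9\u4e00-\u9fff\u3000-\u303f\u3400-\u4dbf]")


def is_kept(ch: str) -> bool:
    """Return True if a single character matches the assumed keep-list."""
    return KEEP.fullmatch(ch) is not None


# a: ASCII letter, 中: U+4E2D, 㐀: U+3400, 。: U+3002,
# é: U+00E9 (accented Latin), あ: U+3042 (Hiragana)
for ch in ["a", "中", "㐀", "。", "é", "あ"]:
    print(f"U+{ord(ch):04X} {ch!r} kept={is_kept(ch)}")
```

With a keep-list like this, accented Latin letters and scripts outside the three ranges would not be retained, which may be why separate handling for non-Chinese Unicode is useful.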

@phernandez changed the title from "Fix: preserve Chinese characters in permalinks" to "fix: preserve Chinese characters in permalinks" on Apr 16, 2025
@phernandez
Member

Hi @andyxinweiminicloud, thanks for the fix. I'm going to merge this PR, then add a bit of logic to handle permalinks with Unicode for Latin characters while keeping your fixes for Chinese characters.

@phernandez phernandez merged commit 03d4e97 into basicmachines-co:main Apr 17, 2025
3 of 5 checks passed