fix: rune-aware truncation in stepName to prevent invalid UTF-8 in JSON-LD #106

Closed
greynewell wants to merge 1 commit into main from fix-stepname-utf8-truncation

@greynewell (Contributor) commented Apr 9, 2026

Summary

  • stepName in internal/archdocs/pssg/schema/jsonld.go used byte-based slicing (step[:77]) to cap long recipe instruction steps
  • When a step contains multi-byte UTF-8 characters (e.g. "sauté", "jalapeño", "crème brûlée"), slicing at byte 77 can land mid-sequence, producing invalid UTF-8 in the generated JSON-LD <script> tag
  • Fix converts to []rune before truncating — same pattern already used in ReadClaudeMD, dotEscape, and other truncation points in the codebase
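The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the code from jsonld.go: the helper names truncateBytes and truncateRunes and the constant maxRunes are hypothetical, and the real stepName may handle its limit or any suffix differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// maxRunes is a hypothetical cap chosen to match the 77 mentioned in the summary.
const maxRunes = 77

// truncateBytes slices by byte offset, so it can cut a multi-byte
// UTF-8 sequence in half.
func truncateBytes(s string, n int) string {
	if len(s) <= n {
		return s
	}
	return s[:n]
}

// truncateRunes converts to []rune first, so the cut always falls
// between whole code points.
func truncateRunes(s string, n int) string {
	r := []rune(s)
	if len(r) <= n {
		return s
	}
	return string(r[:n])
}

func main() {
	// 76 ASCII bytes, then "é" (2 bytes): byte offset 77 falls
	// between the two bytes of "é".
	step := strings.Repeat("a", 76) + "é and more text"

	fmt.Println(utf8.ValidString(truncateBytes(step, maxRunes))) // false
	fmt.Println(utf8.ValidString(truncateRunes(step, maxRunes))) // true
}
```

Converting to []rune allocates a copy of the string, which is a reasonable trade-off for short inputs such as recipe steps.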

Test plan

  • TestStepName_MultiByteUTF8: 81 × "é" (162 bytes, 81 runes) — verifies output is valid UTF-8 and is truncated
  • TestStepName_ShortStep: short step returned unchanged
  • TestStepName_FirstSentence: first-sentence extraction unaffected
  • TestStepName_TruncatesLongASCII: ASCII truncation still works
  • go test ./... passes
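The multi-byte case in the test plan can be sketched as a standalone check. Because "é" occupies 2 bytes, an odd byte offset such as 77 always lands inside a character, which is why a run of 81 × "é" exercises the bug. The function capRunes below is a hypothetical stand-in for the truncation inside stepName, and the limit of 77 runes is an assumption; the real tests in jsonld_test.go are not shown on this page.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// capRunes is a hypothetical stand-in for the truncation step in stepName.
func capRunes(s string, limit int) string {
	r := []rune(s)
	if len(r) <= limit {
		return s
	}
	return string(r[:limit])
}

// checkMultiByte mirrors the idea behind TestStepName_MultiByteUTF8:
// the output must be valid UTF-8 and shorter than the input.
func checkMultiByte(limit int) error {
	in := strings.Repeat("é", 81) // 162 bytes, 81 runes
	out := capRunes(in, limit)
	if !utf8.ValidString(out) {
		return fmt.Errorf("output is not valid UTF-8: %q", out)
	}
	if utf8.RuneCountInString(out) >= utf8.RuneCountInString(in) {
		return fmt.Errorf("output was not truncated: %d runes", utf8.RuneCountInString(out))
	}
	return nil
}

func main() {
	if err := checkMultiByte(77); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```

In a real test file the same assertions would live in a func taking *testing.T and report failures through t.Errorf.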

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed string truncation logic to properly handle multi-byte UTF-8 characters without splitting them mid-character.
  • Tests

    • Added comprehensive unit tests for instruction step string handling, covering ASCII truncation, first-sentence extraction, and UTF-8 character validation scenarios.

stepName used byte-based slicing (step[:77]) to truncate long recipe
instruction steps. When a step contains multi-byte UTF-8 characters
(e.g. "sauté", "jalapeño", "crème"), slicing at byte 77 can land in
the middle of a multi-byte sequence, producing invalid UTF-8 in the
generated JSON-LD structured data.

Fixes by converting to []rune before truncating, matching the same
pattern used elsewhere in the codebase (e.g. ReadClaudeMD, dotEscape).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai (bot) commented Apr 9, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

The code updates UTF-8 string truncation logic in the stepName function. Instead of truncating based on byte position (which could break multi-byte UTF-8 characters), the code now converts strings to runes, truncates at the rune level, and converts back—ensuring complete characters are preserved. Comprehensive tests were added to verify this behavior.

Changes

| Cohort / File(s) | Summary |
|---|---|
| UTF-8 Truncation Fix (internal/archdocs/pssg/schema/jsonld.go) | Updated stepName truncation logic to use rune-based slicing instead of byte-based slicing, preventing corruption when truncating strings with multi-byte UTF-8 characters. |
| Test Coverage (internal/archdocs/pssg/schema/jsonld_test.go) | Added comprehensive unit tests covering short inputs, first-sentence extraction, ASCII truncation, and UTF-8 edge cases to ensure truncation produces valid output. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • jonathanpopham

Poem

🔤 UTF-8 runes now whole, not torn in two,
Where multi-byte chars stay complete and true,
No more broken characters mid-slice,
Truncation done properly—oh, how nice! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: fixing rune-aware truncation in stepName to prevent invalid UTF-8 in JSON-LD output. |
| Description check | ✅ Passed | The description covers all required template sections with clear problem statement, solution explanation, and comprehensive test plan with concrete test cases. |

@greynewell (Contributor, Author) commented

Duplicate of #104 which already fixed this in main.

@greynewell greynewell closed this Apr 9, 2026