Skip to content

Unicode character support in screen tab names#1642

Open
Explorer09 wants to merge 2 commits into
htop-dev:mainfrom
Explorer09:screen-tab
Open

Unicode character support in screen tab names#1642
Explorer09 wants to merge 2 commits into
htop-dev:mainfrom
Explorer09:screen-tab

Conversation

@Explorer09
Copy link
Copy Markdown
Contributor

@Explorer09 Explorer09 commented Mar 18, 2025

I would like to share a feature that I have done as part of a code refactoring process.

These changes are still work in progress, but I think it's good to file it and gather early feedback.

This feature makes the screen tab names display with the correct terminal widths when they contain Unicode characters (in UTF-8 encoding). It replaces the naive strlen calculation with a new function that calculates the display terminal width of a name string.

The PR introduces the following functions:

  • String_makePrintable, a sanitization function with an API similar toxStrndup. It reads the string, converts all unprintable characters to replacement code points (U+FFFD), and allocates and re-encodes the string.
  • EncodePrintableString, an internal routine for String_makePrintable. This routine is designed with room to support RichString buffers
  • String_mbswidth, a function that calculates the terminal width of a string, or calculate the byte length of a substring if it is clipped by a terminal width limit.

This PR is related to #1513.

Comment thread XUtils.c Outdated
Comment thread XUtils.c Outdated
Comment thread XUtils.c
Comment thread XUtils.c Outdated
Comment thread XUtils.c Outdated
@fasterit
Copy link
Copy Markdown
Member

Fails CI:

XUtils.c:426:6: error: 'HAVE_STRNLEN' is not defined, evaluates to 0 [-Werror,-Wundef]
  426 | # if HAVE_STRNLEN
      |      ^
1 error generated.``

@BenBE BenBE added enhancement Extension or improvement to existing feature code quality ♻️ Code quality enhancement labels Mar 18, 2025
@Explorer09
Copy link
Copy Markdown
Contributor Author

Fails CI:

XUtils.c:426:6: error: 'HAVE_STRNLEN' is not defined, evaluates to 0 [-Werror,-Wundef]
  426 | # if HAVE_STRNLEN
      |      ^
1 error generated.``

Noted. It should have been #ifdef.

@BenBE
Copy link
Copy Markdown
Member

BenBE commented Mar 18, 2025

Fails CI:

XUtils.c:426:6: error: 'HAVE_STRNLEN' is not defined, evaluates to 0 [-Werror,-Wundef]
  426 | # if HAVE_STRNLEN
      |      ^
1 error generated.``

Noted. It should have been #ifdef.

IDK, but nowhere else do we check for strnlen, so why here all of a sudden?

@BenBE
Copy link
Copy Markdown
Member

BenBE commented Mar 18, 2025

Any objections to pull in 4ccfc38 to main now and do the Unicode support later? That commit is mostly cleanup and doesn't affect much of the later work in this PR.

@Explorer09
Copy link
Copy Markdown
Contributor Author

Any objections to pull in 4ccfc38 to main now and do the Unicode support later? That commit is mostly cleanup and doesn't affect much of the later work in this PR.

I have just one question with the commit: Is it okay for the constants to be defined in CRT.h or is there a better header to put those two constant names?

BTW, I borrowed the "margin left" and "column gap" naming from CSS. 🙂

@BenBE
Copy link
Copy Markdown
Member

BenBE commented Mar 18, 2025

Any objections to pull in 4ccfc38 to main now and do the Unicode support later? That commit is mostly cleanup and doesn't affect much of the later work in this PR.

I have just one question with the commit: Is it okay for the constants to be defined in CRT.h or is there a better header to put those two constant names?

CRT.h seems to be fine. ScreenManager.h may be another place to put them, but actually I'm not too sure, where's best …

BTW, I borrowed the "margin left" and "column gap" naming from CSS. 🙂

Fine.

@Explorer09
Copy link
Copy Markdown
Contributor Author

I have just one question with the commit: Is it okay for the constants to be defined in CRT.h or is there a better header to put those two constant names?

CRT.h seems to be fine. ScreenManager.h may be another place to put them, but actually I'm not too sure, where's best …

Action.c does not include ScreenManager.h. If I define the constants in ScreenManager.h, then Action.c would need another include line. Hence I avoided that in the first place.

@Explorer09 Explorer09 force-pushed the screen-tab branch 4 times, most recently from 5f64daf to fe32490 Compare March 18, 2025 20:52
Comment thread XUtils.c Outdated
Comment thread XUtils.c Outdated
Comment thread XUtils.h
Comment on lines +29 to +33
typedef struct WCharEncoderState_ {
size_t pos;
void* buf;
mbstate_t mbState;
} WCharEncoderState;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also carry the buffer size, so the internal EncodeWChar routines can perform bound checking.

Also note, that the CGroup code deliberately uses two distinct functions (one for counting the size, another for actually writing) and that these functions return success (character properly counted/written to buffer) or failure (OOB write). Similar could be done here too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I choose to make the size counting and the actual writing the same routine for a reason: I expect the "rules" for filtering characters be tweakable in the future. If there are additional characters besides !iswprint that need to be filtered out (replaced with U+FFFD), people only need to modify one routine for that.

I'm no sure about bound checking yet. I can add it into the struct, but it would be for assertions only, as I always expect the buffer allocated would just fit. (That is, no truncation.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I've changed my mind about the bound checking here. Considering the standard library functions mbsrtowcs(3) and wcsrtombs(3) both allow limits to the destination buffer size, I can add it to the WCharEncoderState as a feature.

I still don't recommend it though. The string buffer would not be null-terminated when the truncation occurs, and I won't fix that (for multi-byte encodings other than UTF-8, it is allowed that wcrtomb(buf, L'\0', ps) returns a length greater than 1).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what "Failure to write to the buffer" could be used for. If the EncodeWChar function sees that the buffer is full and not NUL-terminated, it should return false (AKA failure), which all callees could propagate accordingly. Similar to what the CGROUP compression code¹ does …

¹Actually that code over-allocates by one byte and forces the NUL-termination in that over-allocated byte. Cf. linux/CGroupUtils.c in functions CGroup_filterName and CGroup_filterContainer.

Copy link
Copy Markdown
Contributor Author

@Explorer09 Explorer09 Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenBE Nope. After a bit of consideration, this length check cannot be done in EncodeWChar like in the CGROUP code. I wrote a comment on why instead.

It's a limitation of the wcrtomb(3) interface. When we can detect wcrtomb writing out of bounds we might be too late (access violation already happens). wcrtomb doesn't give us the "length check before writing" we need. Trying to workaround that would make code unnecessarily complicated (wcrtomb called one more time - three times in total).

Also, we cannot put the '\0' byte in arbitrary position. Unlike in the CGROUP code, we can only put '\0' on character boundaries here. So overall there's nothing I can do.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. And with the implementation used in the CGroup code, a false returned from the buffer write callback would immediately terminate processing. That's basically the same semantics used there too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BenBE That means an additional memcpy call, slowing things down slightly. For checking an error that should never happen unless there's a bug in code. I don't really like this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That memcpy is from L1 cache to a buffer that's likely in L1 too. As mentioned elsewhere: If in doubt, I prefer correctness over performance.

And if you want to avoid this, over-allocate by MB_LEN_MAX + 1 and you can always write to ps->buf, if you force the NUL terminator on error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if you want to avoid this, over-allocate by MB_LEN_MAX + 1 and you can always write to ps->buf, if you force the NUL terminator on error.

Over-allocation to solve the (actually non-existent) problem of under-allocation. Sorry, but I'm irritated by this logic.

The whole point of adding the bound check by your suggestion was to correct the case of under-allocation (allocated buffer size is less than required). While I argued first that the problem literally won't exist, I agree on the point that we can assert and detect the error, at least as a precaution. The buffer should be allocated with the right size already, thus any code that attempts to correct or recover from any potential error is useless. I might just make the error fatal instead and call it a day.

if (ps->buf && len <= ps->size - ps->pos) {
   fail();
}

As for why there is no NUL termination for the string when odds happen here: We are using a multibyte encoder wcrtomb that is capable of handling any encoding, not just UTF-8. When interpreting the standard behavior of wcrtomb in the strict sense, it is not necessary for a null wide character to be encoded as 1 byte. The standard wcrtomb will output shift sequences (if any) that restore the shift state to the initial state before outputting the null byte itself. Since that would require extra buffer space (up to MB_CUR_MAX bytes), any truncation of a multibyte string would have to be decided beforehand, and it won't be an escape hatch for any buffer size problem.

Copy link
Copy Markdown
Member

@BenBE BenBE May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK on making the error fatal.

Anyway, a good API makes it impossible to cause issues by abusing it. Thus if wcrtomb may write up to MB_LEN_MAX bytes you better should pass a buffer that can hold MB_LEN_MAX bytes, i.e. ensure the remaining space is large enough. Even if the actual call is known to write less.

And adding MB_LEN_MAX + 1 to the allocation won't cause any noticeable overhead (malloc will just allocate the same and in rare occasions or under memory pressure may delay due to the OOM killer sweeping some swap …

Comment thread configure.ac Outdated
Comment thread XUtils.c Outdated
Comment thread XUtils.c Outdated
Comment thread XUtils.c Outdated
Comment thread XUtils.c Outdated
Comment thread XUtils.c Outdated
@Explorer09 Explorer09 force-pushed the screen-tab branch 3 times, most recently from b2db29d to b61a964 Compare April 16, 2026 16:41
@Explorer09 Explorer09 force-pushed the screen-tab branch 2 times, most recently from 41331a0 to f7a1a80 Compare April 18, 2026 21:20
@Explorer09 Explorer09 marked this pull request as ready for review April 18, 2026 23:11
@Explorer09 Explorer09 force-pushed the screen-tab branch 3 times, most recently from d881071 to d1713e5 Compare May 1, 2026 10:12
Comment thread Action.c
for (unsigned int i = 0; i < settings->nScreens; i++) {
const char* tab = settings->screens[i]->heading;
int width = rem >= bracketWidth ? (int)strnlen(tab, rem - bracketWidth + 1) : 0;
const char* ptr = tab;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why introduce this extra variable? Was this patch written by AI? ;-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically no AI is used :)

The ptr is a temporary variable holding a pointer because of the API design I explained in https://github.com/htop-dev/htop/pull/1642/changes/BASE..6b82bcef3b8f7aecd9be6c83d66e84af152a1d58#r3211053429

Comment thread XUtils.c Outdated
Comment thread XUtils.c Outdated
Comment thread XUtils.c Outdated
return buf;
}

bool String_decodeNextWChar(MBStringDecoderState* ps) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming consistency String_encodeWChar vs. String_decodeNextWChar.

Also, any reason to have this function exported from this TU?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a reason for asymmetric naming here. (You may suggest a better naming.)

encodeWChar() and decodeNextWChar() are for low-level encoder and decoder with states.

For the decoder:

bool String_decodeNextWChar(MBStringDecoderState* ps);

The input string is specified as ps->str and the next character decoded will be outputted in ps->ch.

For the encoder:

/* The typedef is to allow different encoders to coexist. I have thought of an encoder that would present all non-printable characters as "\xXXXX" escape sequence. */
typedef ATTR_NONNULL void (*EncodeWChar)(WCharEncoderState* ps, /*wchar_t or int*/ wc);
void String_encodeWChar(WCharEncoderState* ps, /*wchar_t or int*/ wc);

The next character that would be encoded is specified as wc argument here and will be appended to the end of ps->buf. The WCharEncoderState object does not otherwise know the next wchar_t to encode.

Perhaps the name String_encodeWChar is obscure on what it does. Maybe the name WCharEncoder_appendWChar will be better? I'm not sure yet.

Comment thread XUtils.c
return totalWidth;
}

int String_mbswidth(const char** str, size_t maxLen, int maxWidth) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have the public API of this function modify an input argument to move to the end of the function?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function has two outputs: One is the terminal width (i.e. num of terminal columns) and the other is the end character position after calculating the width.

The second output is useful to tell, for example, how many Unicode characters would it take to fit the number of terminal columns.

This functions has two uses combined in a single API: (1) to count the number of terminal columns for a given multibyte string; (2) to count the number characters it takes to fit the given terminal width.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might seem redundant, but I'd prefer if this output parameter was a separate (NULL-able) argument to the function. This separates concerns (input argument vs. optional additional output) and makes the API clearer. This also allows to just skip the extra local variable in case you don't care AND don't have it unexpectedly be changed.

Copy link
Copy Markdown
Contributor Author

@Explorer09 Explorer09 May 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might seem redundant, but I'd prefer if this output parameter was a separate (NULL-able) argument to the function. This separates concerns (input argument vs. optional additional output) and makes the API clearer. This also allows to just skip the extra local variable in case you don't care AND don't have it unexpectedly be changed.

A static inline function that wraps around this String_mbswidth() for a more convenient output interface? That wouldn't hurt, I guess.

By the way, the "input pointer that would be modified on output" is not my original idea. There are precedents in mbsrtowcs(3) and mbsnrtowcs(3) (compare mbsrtowcs(3) with mbstowcs(3) and I would argue that the const char** argument in the former makes the API more flexible).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave both variants of the API to Copilot to compare (without initially disclosing what the function would do) and except for "if it runs in a loop for parsing" the additional out argument was the preferred variant. This only solidified further once I gave Copilot also some more semantics of the function it just had rated. One interesting aspect was that modern compiler tend to optimize the additional out parameter variant better. So there's several aspects that lean towards that variant of the API.

Using the C standard library's string function API as a baseline for API quality sets a very low bar for quality … ;-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave both variants of the API to Copilot to compare (without initially disclosing what the function would do) and except for "if it runs in a loop for parsing" the additional out argument was the preferred variant.

What variants of the API did you give to Copilot in particular?

One interesting aspect was that modern compiler tend to optimize the additional out parameter variant better. So there's several aspects that lean towards that variant of the API.

Again I'm not sure what variant you are referring to.

Using the C standard library's string function API as a baseline for API quality sets a very low bar for quality … ;-)

I personally value utility of the API over the "quality" which I'm not sure what it is. Convenience of passing the arguments and calling the function? Register and stack usage?

It's not I disagree with changing the API, but I didn't see an API suggestion to compare with, thus I cannot judge anything regarding what Copilot said.

Comment thread XUtils.h
Comment on lines +29 to +33
typedef struct WCharEncoderState_ {
size_t pos;
void* buf;
mbstate_t mbState;
} WCharEncoderState;
Copy link
Copy Markdown
Member

@BenBE BenBE May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK on making the error fatal.

Anyway, a good API makes it impossible to cause issues by abusing it. Thus if wcrtomb may write up to MB_LEN_MAX bytes you better should pass a buffer that can hold MB_LEN_MAX bytes, i.e. ensure the remaining space is large enough. Even if the actual call is known to write less.

And adding MB_LEN_MAX + 1 to the allocation won't cause any noticeable overhead (malloc will just allocate the same and in rare occasions or under memory pressure may delay due to the OOM killer sweeping some swap …

@Explorer09 Explorer09 force-pushed the screen-tab branch 2 times, most recently from ffb876d to 7a186ea Compare May 11, 2026 04:07
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 17, 2026

Review Change Stack

📝 Walkthrough

Overview

Adds UTF-8/Unicode-aware terminal-width handling for screen tab names so multibyte, zero-width and invalid sequences are measured and rendered correctly.

Changes

  • Action.c: replace byte-count-based tab heading width logic with String_mbswidth; tab label slicing now advances by computed display columns and renders the matching byte span.
  • ScreenManager.c: drawTab computes display width with String_mbswidth, renders only the fitting byte span, and advances x by display columns rather than byte count.
  • XUtils.c / XUtils.h: introduce a focused multibyte utility layer:
    • String_mbswidth: measures terminal display columns for a (possibly clipped) UTF-8 byte substring and advances the input pointer.
    • String_lineBreakWidth: separator-aware width measurement up to maxWidth.
    • EncodePrintableString / String_makePrintable: sanitize and re-encode input, replacing invalid/unprintable codepoints with U+FFFD, using a two-pass allocation and callback encoder.
    • MBStringDecoder, WCharEncoderState, EncodeWChar, and MBStringDecoder_nextWChar: stateful decoding helpers; wide-char paths are conditional on HAVE_LIBNCURSESW.

Implementation quality

Good encapsulation: all Unicode/multibyte logic is contained in XUtils with minimal, targeted callsite changes. Two-pass encoding and explicit decoder state handling show care for invalid/incomplete sequences. Conditional compilation is used correctly for wide-char vs byte builds. The new code is nontrivial and requires careful review (high review effort).

Commit structure & flow

Commits are sensibly split: utility/types and helpers first, then callsite changes. This split aids review and bisecting.

Cleanliness & scope

Focused, low-risk callsite changes; APIs are reusable and documented. Overall clean implementation.

Known issues / review notes

  • CI failure: XUtils.c uses # if HAVE_STRNLEN which triggers -Wundef when HAVE_STRNLEN is not defined; change to #ifdef HAVE_STRNLEN.
  • Preprocessor consistency: reviewer noted conditional checking of strnlen is unusual—ensure alignment with project conventions.
  • mbstate_t handling: original approach of zeroing mbstate_t on decode failure is incorrect for stateful encodings (e.g., ISO-2022). Correct pattern (noted by author and reviewer) is to snapshot mbstate_t, call mbrtowc, and restore the saved state when mbrtowc returns (size_t)-1 or (size_t)-2 to preserve shift state. Ensure this pattern is consistently applied before merging.

Walkthrough

This PR adds multibyte/Unicode string utilities to XUtils (encoding, printable filtering, stateful decoding, line-break and mbswidth measurement). Action.c and ScreenManager.c were updated to use String_mbswidth when measuring tab labels so drawing and hit-area calculations use terminal display-column widths and render only the fitting byte span.

Poem

Bytes gave way to columns clear,
Multibyte whispers now appear,
Tabs count glyphs by terminal light,
Cursor and label hold their sight,
Unicode finds its proper cheer.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 7f12d758-cdea-480b-bbce-51bb9eced430

📥 Commits

Reviewing files that changed from the base of the PR and between b83a55b and 42b8e95.

📒 Files selected for processing (4)
  • Action.c
  • ScreenManager.c
  • XUtils.c
  • XUtils.h

Comment thread XUtils.c
Comment on lines +411 to +415
case (size_t)-1:
// Invalid sequence
ps->ch = WEOF;
return false;

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify callers that may loop on String_lineBreakWidth/String_mbswidth
# and require forward progress when malformed bytes are present.
rg -n -C3 '\bString_lineBreakWidth\s*\(|\bString_mbswidth\s*\('

Repository: htop-dev/htop

Length of output: 2394


🏁 Script executed:

# Get the size of XUtils.c first to determine reading strategy
wc -l XUtils.c

Repository: htop-dev/htop

Length of output: 70


🏁 Script executed:

# Read the decoder function around lines 411-415
sed -n '385,420p' XUtils.c

Repository: htop-dev/htop

Length of output: 969


🏁 Script executed:

# Read the String_lineBreakWidth function around lines 442-480
sed -n '442,480p' XUtils.c

Repository: htop-dev/htop

Length of output: 1278


🏁 Script executed:

# Read the full String_decodeNextWChar to understand the complete context
sed -n '370,440p' XUtils.c

Repository: htop-dev/htop

Length of output: 1725


🏁 Script executed:

sed -n '442,533p' XUtils.c

Repository: htop-dev/htop

Length of output: 2586


Consume bytes on invalid multibyte sequences to ensure forward progress.

Invalid UTF-8 at line 411 returns false without advancing ps->str, causing String_lineBreakWidth to return without updating *str (line 495 is unreached). Callers retrying on the same pointer risk infinite loops. The incomplete sequence case (lines 418-421) correctly advances pointers; invalid sequences must do the same.

Proposed fix
   switch (len) {
   case (size_t)-1:
-      // Invalid sequence
-      ps->ch = WEOF;
-      return false;
+      // Invalid sequence: consume one byte to guarantee progress
+      ps->str += 1;
+      ps->maxLen -= 1;
+      ps->ch = WEOF;
+      return false;

   case (size_t)-2:

And in String_lineBreakWidth, update *str on decode failure:

-      if (!String_decodeNextWChar(&state))
+      if (!String_decodeNextWChar(&state)) {
+         *str = state.str;
          break;
+      }

✅ Addressed in commits 1374d7c to 9d9301f

Comment thread XUtils.h
#endif

ATTR_NONNULL_N(1, 4) ATTR_ACCESS2_W(1) ATTR_ACCESS3_R(2, 3)
void EncodePrintableString(WCharEncoderState* ps, const char* src, size_t maxLen, EncodeWChar encodeWChar);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Rename EncodePrintableString to module-prefixed style.

This new public function name breaks the project’s C naming convention and makes the API inconsistent with surrounding String_* utilities.

As per coding guidelines, use ModuleName_functionName() naming convention for C functions.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
XUtils.c (1)

410-420: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guarantee forward progress on malformed multibyte input.

String_decodeNextWChar() leaves the decoder on the same byte for EILSEQ, and String_lineBreakWidth() drops any decoder progress when String_decodeNextWChar() returns false. A caller that retries String_mbswidth() on the returned *str can stall forever on malformed or truncated input.

Proposed fix
   switch (len) {
   case (size_t)-1:
-      // Invalid sequence
+      // Invalid sequence: consume one byte and reset the shift state.
+      ps->str += 1;
+      ps->maxLen -= 1;
+      memset(&ps->mbState, 0, sizeof(ps->mbState));
       ps->ch = WEOF;
       return false;
 
   case (size_t)-2:
       // Incomplete sequence
       ps->str += ps->maxLen;
       ps->maxLen = 0;
       return false;
@@
-      if (!String_decodeNextWChar(&state))
+      if (!String_decodeNextWChar(&state)) {
+         *str = state.str;
          break;
+      }

Also applies to: 475-476


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 06c427dd-1b32-423f-9c68-3f85a38430d2

📥 Commits

Reviewing files that changed from the base of the PR and between 42b8e95 and 9d9301f.

📒 Files selected for processing (4)
  • Action.c
  • ScreenManager.c
  • XUtils.c
  • XUtils.h

@Explorer09
Copy link
Copy Markdown
Contributor Author

@coderabbitai The suggestion of memset(&ps->mbState, 0, sizeof(ps->mbState)); on encountering an invalid sequence is incorrect. On a non-UTF-8 encoding, this would cause the decoder to lose the shift state if any shift state is required to decode the following string.

The safer approach is to restore the mbstate_t object to prior to decoding the invalid sequence.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 19, 2026

Tip

For best results, initiate chat on the files or code changes.

@Explorer09, you're right — the memset approach is wrong. Zeroing mbstate_t unconditionally resets the shift state, which corrupts decoding of subsequent characters in stateful encodings (e.g., ISO-2022 variants).

The correct pattern is to snapshot mbState before the mbrtowc call and restore it on failure:

mbstate_t savedState = ps->mbState;
size_t rc = mbrtowc(&wc, src, remaining, &ps->mbState);
if (rc == (size_t)-1 || rc == (size_t)-2) {
    ps->mbState = savedState; /* restore shift state */
    /* handle invalid/incomplete sequence */
}

This preserves the shift state for the continuation of the string. I'll record this correction.


✏️ Learnings added
Learnt from: Explorer09
URL: https://github.com/htop-dev/htop/pull/1642

Timestamp: 2026-05-19T07:06:43.468Z
Learning: In htop's XUtils.c, when handling invalid or incomplete multibyte sequences in `String_decodeNextWChar` (or similar mbrtowc-based decoders), do NOT reset mbstate_t with memset to zero on error. Instead, save a snapshot of `mbstate_t` before calling `mbrtowc` and restore it on failure. Zeroing mbstate_t discards the shift state, which corrupts decoding in stateful encodings like ISO-2022.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 4cdabc95-6f5b-4929-8efd-c4909a26881e

📥 Commits

Reviewing files that changed from the base of the PR and between 9d9301f and a3a5f28.

📒 Files selected for processing (4)
  • Action.c
  • ScreenManager.c
  • XUtils.c
  • XUtils.h

Comment thread XUtils.c Outdated
* String_makePrintable()
* EncodePrintableString()
* String_lineBreakWidth()
* String_mbswidth()

Signed-off-by: Kang-Che Sung <explorer09@gmail.com>
Compute the terminal widths of screen tab names using
String_mbswidth(). No longer assume each character takes one terminal
column.

Signed-off-by: Kang-Che Sung <explorer09@gmail.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
XUtils.c (1)

478-479: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve pointer progress when decoding fails.

At Line [478], breaking immediately on MBStringDecoder_nextWChar() failure drops decoder progress (notably the incomplete-sequence path that already advanced decoder.str), so callers can reprocess the same bytes.

Proposed fix
-      if (!MBStringDecoder_nextWChar(&decoder))
-         break;
+      if (!MBStringDecoder_nextWChar(&decoder)) {
+         *str = decoder.str;
+         break;
+      }
#!/bin/bash
# Verify failure path behavior and whether caller updates *str on decode failure.
rg -n -C3 '\bbool MBStringDecoder_nextWChar\b|\bString_lineBreakWidth\b|if \(!MBStringDecoder_nextWChar\(&decoder\)\)'

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: d7294559-8c8a-42f8-99f7-04c4f0b544ad

📥 Commits

Reviewing files that changed from the base of the PR and between a3a5f28 and f33122f.

📒 Files selected for processing (4)
  • Action.c
  • ScreenManager.c
  • XUtils.c
  • XUtils.h

@Explorer09
Copy link
Copy Markdown
Contributor Author

Preserve pointer progress when decoding fails.

At Line [478], breaking immediately on MBStringDecoder_nextWChar() failure drops decoder progress (notably the incomplete-sequence path that already advanced decoder.str), so callers can reprocess the same bytes.

The stopping at the first byte of the invalid sequence is intentional. How the invalid byte sequence would be handled is left to the caller.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

code quality ♻️ Code quality enhancement enhancement Extension or improvement to existing feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants