
Fix problem with SetConsoleOutputCP CP_UTF8 on Windows #896

Open
madah81pnz1 wants to merge 3 commits into xiph:master from madah81pnz1:fix-win-unicode

Conversation

@madah81pnz1
Contributor

No description provided.

Also splits up the WriteConsoleW() calls if the buffer is too big,
as this was a problem at least on Windows 7 and earlier.
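The splitting mentioned in the commit message could look roughly like this. This is a minimal, portable sketch, not the PR's actual code: `write_chunked_utf16` and the `write_fn` callback are hypothetical names standing in for a wrapper around WriteConsoleW(). The one subtlety worth showing is that a chunk boundary must never split a UTF-16 surrogate pair:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: split a UTF-16 buffer into chunks of at most
 * max_units code units, taking care never to split a surrogate pair
 * across two writes. write_fn stands in for the WriteConsoleW() call. */
typedef int (*write_fn)(const uint16_t *buf, size_t units, void *ctx);

static int write_chunked_utf16(const uint16_t *buf, size_t units,
                               size_t max_units, write_fn write, void *ctx)
{
    while (units > 0) {
        size_t n = units < max_units ? units : max_units;
        /* If the last unit of this chunk is a high surrogate
         * (0xD800-0xDBFF), its low surrogate would land in the next
         * chunk; back off by one unit so the pair stays together. */
        if (n < units && buf[n - 1] >= 0xD800 && buf[n - 1] <= 0xDBFF)
            n--;
        if (n == 0 || !write(buf, n, ctx))
            return 0;
        buf += n;
        units -= n;
    }
    return 1;
}
```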
@madah81pnz1
Contributor Author

Note this needs more testing, especially piping binary from stdin.
Also this removes the \r\n and \n translation, which might mean some adaptations are needed in case a user is piping a file that does contain \r\n, since the \r will then appear in the input.

Another test binary might also be needed to test Windows console output properly, e.g. one that calls ReadConsoleOutputW() to check what was actually written.

@madah81pnz1
Contributor Author

Note this needs more testing, especially piping binary from stdin.

Reading from stdin turns into mojibake, e.g. with --import-tags-from-file=-.
I guess it has to use the same method as output, ReadConsoleW()/ReadFile(). Using fopen() with stdin will not work, as the (wide) input will then be translated to the current codepage.
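To pass console input through without any codepage translation, the wide characters returned by ReadConsoleW() have to be converted straight to UTF-8. On Windows that conversion is normally done with WideCharToMultiByte(CP_UTF8, ...); the hand-rolled sketch below (hypothetical, not the PR's code) just shows what the conversion involves, surrogate pairs included:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: convert UTF-16 code units (as read by
 * ReadConsoleW()) to UTF-8, bypassing the codepage entirely.
 * Returns the number of bytes written to out; the caller must
 * provide at least 4 bytes of output space per input unit. */
static size_t utf16_to_utf8(const uint16_t *in, size_t units,
                            unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < units; i++) {
        uint32_t c = in[i];
        /* Combine a high/low surrogate pair into one code point. */
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < units &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++;
        }
        if (c < 0x80) {
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (c >> 18));
            out[o++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```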

@madah81pnz1
Contributor Author

This seems to work now, I will clean up the code and push more commits, but it might take some more time.

The behavior is now changed for both stdout and stdin:

  • Console (FILE_TYPE_CHAR) uses ReadConsoleW()/WriteConsoleW() with UTF-16, doesn't use the codepage.
  • Redirect to/from pipe (FILE_TYPE_PIPE) uses the codepage from GetConsoleCP()/GetConsoleOutputCP().
  • Redirect to/from disk (FILE_TYPE_DISK) uses UTF-8 only, ignores the codepage.
  • Command-line arguments use UTF-16, so no codepage conversion is needed there either.
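The three-way dispatch described above can be sketched as follows. On Windows the file type would come from GetFileType(GetStdHandle(STD_OUTPUT_HANDLE)); here it is an enum so the selection logic itself can be shown portably, and the enum and function names are hypothetical, not the PR's:

```c
/* Hypothetical sketch of the stdout/stdin mode selection. The real
 * code would query GetFileType() on the standard handle. */
typedef enum { FT_CHAR, FT_PIPE, FT_DISK } file_type;
typedef enum {
    IO_UTF16_CONSOLE, /* ReadConsoleW()/WriteConsoleW(), codepage ignored */
    IO_CODEPAGE,      /* convert via GetConsoleCP()/GetConsoleOutputCP() */
    IO_UTF8           /* raw UTF-8, codepage ignored */
} io_mode;

static io_mode choose_io_mode(file_type t)
{
    switch (t) {
    case FT_CHAR: return IO_UTF16_CONSOLE; /* FILE_TYPE_CHAR: real console */
    case FT_PIPE: return IO_CODEPAGE;      /* FILE_TYPE_PIPE: piped to a tool */
    default:      return IO_UTF8;          /* FILE_TYPE_DISK: redirected file */
    }
}
```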

All calls to utf8_encode() and utf8_decode() have been removed under #ifdef _WIN32 because of this, since the conversion needs to happen as early as possible, not when handling the tags.

--no-utf8-convert only affects pipe redirects from stdin, not stdout. This is because there is no way to pass down a "raw" flag to vfprintf_utf8 and wprint_console. It could be added with some additional refactoring, but I'm not sure if it is worth it.

fgets() has also been replaced with new functions that handle buffering and any newline combination of \r, \n and \r\n. The reasoning behind this is that it lets flac import files that were created on other platforms; otherwise the \r would become part of the tag on non-Windows platforms, and only Windows would treat \r and \r\n the same.
Newlines are now always a single \n, even on Windows. This works fine even in the Windows XP console, even when piped via more. The downside is if the user opens a redirected file with an editor that can't handle \n, which is the case with Notepad before Windows 10, see https://devblogs.microsoft.com/commandline/extended-eol-in-notepad/
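The core of the newline handling described above could be sketched like this (a hypothetical, simplified stand-in for the actual fgets() replacement; a real streaming version would also have to handle a \r that falls on a buffer boundary, which this glosses over):

```c
#include <stddef.h>

/* Hypothetical sketch: collapse \r\n and a lone \r into a single \n,
 * so tag files written on any platform import the same way.
 * Operates in place and returns the new length. */
static size_t normalize_newlines(char *buf, size_t len)
{
    size_t r = 0, w = 0;
    while (r < len) {
        if (buf[r] == '\r') {
            if (r + 1 < len && buf[r + 1] == '\n')
                r++;          /* \r\n: drop the \r, keep the \n */
            buf[w++] = '\n';  /* a lone \r also becomes \n */
            r++;
        } else {
            buf[w++] = buf[r++];
        }
    }
    return w;
}
```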

The pipe redirect is a bit troublesome though. It seems like an unfixable problem, unless chcp 65001 (CP_UTF8) is used by default for the console.
Since printing any Unicode to the console always works thanks to WriteConsoleW(), it might come as a surprise to the user that characters turn into ? or � when piped via more or find.
But there's no way to make those tools work otherwise; they all have to agree on which codepage to use.

Consider also running flac/metaflac from Python, e.g. with subprocess.run() and capture_output=True: this is a pipe redirect, so everything will be converted to the console output codepage. It is not straightforward to set the console codepage to 65001/CP_UTF8 from Python, but this is true for any console-mode application on Windows.

One way to handle this could be to print a warning on stderr if the current text can't be losslessly decoded/encoded via the console codepage. This could also be done for corrupted UTF-8 when converted into UTF-16 for the console.
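The corrupted-UTF-8 part of that warning could be implemented with a plain well-formedness scan before the text is handed to the UTF-16 conversion. This is a hypothetical sketch, not the PR's code; a caller could print the suggested stderr warning whenever it returns 0, since such bytes would otherwise silently become U+FFFD replacement characters:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: return 0 at the first byte sequence that is not
 * well-formed UTF-8 (bad lead byte, truncated or missing continuation
 * bytes, overlong forms, surrogates, out-of-range code points). */
static int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t n, j;
        uint32_t c, min;
        if (b < 0x80) { i++; continue; }
        if      ((b & 0xE0) == 0xC0) { n = 1; c = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { n = 2; c = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { n = 3; c = b & 0x07; min = 0x10000; }
        else return 0;             /* 0x80-0xBF or 0xF8-0xFF lead byte */
        if (i + n >= len)
            return 0;              /* truncated sequence */
        for (j = 1; j <= n; j++) {
            if ((s[i + j] & 0xC0) != 0x80)
                return 0;          /* missing continuation byte */
            c = (c << 6) | (s[i + j] & 0x3F);
        }
        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return 0;              /* overlong, out of range, surrogate */
        i += n + 1;
    }
    return 1;
}
```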

Testing all this is not so straightforward though. I've used a Python script for the pipe and disk parts, and a C++ program that calls ReadConsoleOutputW() directly for the console output.
However ReadConsoleOutputW() does not handle surrogate pairs at all, and only returns replacement characters (U+FFFD), even though I can see the correct characters in the console, and manually selecting and copying from the console works.
Testing with a very long tag is hard due to the 32K command-line limit, and --import-tags-from-file also has a 64K line limit. I used the libFLAC API directly to set longer tags, but then there's UTF8_BUFFER_SIZE, defined as 32768, which silently truncates all longer strings in vfprintf_utf8().
This makes it hard to test the new buffer mechanism that converts in several steps between UTF-16, UTF-8 and the console codepage. But then, maybe no one really cares about such long tags? Even a full album's lyrics tag wouldn't come close to that limit.

