Malformed UTF-8 input triggers undefined behavior in unicode_autogenerated.hpp #295

@VAHPEM

Environment

toml++ version and/or commit hash:
Commit: 209ba6675bde846e878111e03bb84852a537f7e3

Compiler:
Clang++ 18

C++ standard mode:
C++17

Target arch:
AArch64 (ARM64 Linux)

Library configuration overrides:
None

Relevant compilation flags:
-O1 -g -DNDEBUG -std=c++17 -fsanitize=address,undefined -fno-omit-frame-pointer

Describe the bug

Malformed UTF-8 input can trigger undefined behavior in the Unicode classification logic.

While fuzzing the toml::parse() API using libFuzzer with AddressSanitizer and UBSan enabled, a minimal two-byte input causes execution to reach an unreachable code path inside the Unicode helper implementation.

Minimal crashing input (hex):

cc aa

Escaped representation:

\xcc\xaa

Running the parser with sanitizers produces:

target/include/toml++/impl/unicode_autogenerated.hpp:44:13: runtime error: execution reached an unreachable program point

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior target/include/toml++/impl/unicode_autogenerated.hpp:44:13

Expected behavior: malformed UTF-8 sequences should be rejected cleanly (e.g., via parse_error) instead of triggering undefined behavior.

Steps to reproduce (or a small repro code sample)

Save the following program as repro_unicode.cpp:

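#include <string_view>
#include <toml++/toml.hpp>

int main()
{
    // Two-byte input found by the fuzzer. The array also holds a trailing NUL,
    // so the string_view below is given an explicit length of 2.
    static constexpr char input[] = "\xcc\xaa";
    try
    {
        // The UBSan report is raised inside this call, before any exception is thrown.
        auto tbl = toml::parse(std::string_view{input, 2});
        (void)tbl;
    }
    catch (...)
    {
        // A clean rejection (e.g. toml::parse_error) would land here and be ignored.
    }
}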

Build from the repository root:

clang++ -O1 -g -DNDEBUG -std=c++17 -fsanitize=address,undefined -fno-omit-frame-pointer repro_unicode.cpp src/toml.cpp -I include -o repro_unicode

Run:

ASAN_OPTIONS=detect_leaks=0 ./repro_unicode

Additional information

Continued fuzzing found another minimized input that triggers the same issue:

Hex:
cf aa

Escaped:
\xcf\x9b

Running the parser with this input produces the same UBSan report:

target/include/toml++/impl/unicode_autogenerated.hpp:44:13: runtime error: execution reached an unreachable program point

This suggests the issue is not limited to a single malformed UTF-8 sequence but may affect a broader class of invalid Unicode inputs.
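
For triage, it may help to hand-decode the reported byte pairs and see which code point would reach the classification logic. The sketch below applies the standard two-byte UTF-8 layout (110xxxxx 10xxxxxx). The helper name decode_two_byte_utf8 is hypothetical and is not part of toml++, and whether the library's own decoder produces the same values for these inputs is an assumption that has not been checked against its source.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not part of toml++): decodes exactly one two-byte
// UTF-8 sequence and returns std::nullopt if the pair is not well-formed.
constexpr std::optional<std::uint32_t> decode_two_byte_utf8(unsigned char lead,
                                                            unsigned char trail)
{
    // A two-byte lead byte has the form 110xxxxx. 0xC0 and 0xC1 are excluded
    // because they could only produce overlong encodings.
    if (lead < 0xC2 || lead > 0xDF)
        return std::nullopt;

    // A continuation byte has the form 10xxxxxx.
    if ((trail & 0xC0) != 0x80)
        return std::nullopt;

    // Combine the 5 payload bits of the lead with the 6 bits of the trail.
    return (static_cast<std::uint32_t>(lead & 0x1F) << 6)
         | static_cast<std::uint32_t>(trail & 0x3F);
}

// The byte pairs listed in this report, checked at compile time.
static_assert(decode_two_byte_utf8(0xCC, 0xAA) == 0x032Au); // cc aa
static_assert(decode_two_byte_utf8(0xCF, 0xAA) == 0x03EAu); // cf aa
static_assert(decode_two_byte_utf8(0xCF, 0x9B) == 0x03DBu); // cf 9b
```

Under this layout, each listed pair decodes to a code point rather than being rejected as ill-formed, so the unreachable path may be hit during classification of specific decoded code points rather than during byte-level decoding. Checking how the function at unicode_autogenerated.hpp:44 handles these values would confirm or rule that out.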

Labels: bug