Environment
toml++ version and/or commit hash:
Commit: 209ba6675bde846e878111e03bb84852a537f7e3
Compiler:
Clang++ 18
C++ standard mode:
C++17
Target arch:
AArch64 (ARM64 Linux)
Library configuration overrides:
None
Relevant compilation flags:
-O1 -g -DNDEBUG -std=c++17 -fsanitize=address,undefined -fno-omit-frame-pointer
Describe the bug
Malformed UTF-8 input can trigger undefined behavior in the Unicode classification logic.
Fuzzing the toml::parse() API with libFuzzer, with AddressSanitizer and UBSan enabled, produced a minimal two-byte input that causes execution to reach an unreachable code path inside the Unicode helper implementation.
Minimal crashing input (hex):
cc aa
Escaped representation:
\xcc\xaa
Running the parser with sanitizers produces:
target/include/toml++/impl/unicode_autogenerated.hpp:44:13: runtime error: execution reached an unreachable program point
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior target/include/toml++/impl/unicode_autogenerated.hpp:44:13
Expected behavior: malformed UTF-8 sequences should be rejected cleanly (e.g., via parse_error) instead of triggering undefined behavior.
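As a triage aid, the sketch below checks a two-byte sequence against the standard UTF-8 structural rules and, if the pair is well formed, returns the code point it encodes. It does not use toml++ at all, and `decode_two_byte_utf8` is a hypothetical helper name introduced only for illustration. Under these rules `cc aa` decodes to U+032A, so it may be worth checking whether the unreachable path is reached for particular code points rather than only for structurally invalid byte sequences.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not part of toml++): decodes a two-byte UTF-8 sequence.
// Returns the encoded code point if the pair is structurally well formed,
// std::nullopt otherwise.
inline std::optional<char32_t> decode_two_byte_utf8(std::uint8_t lead, std::uint8_t cont)
{
    // Valid two-byte lead bytes are 0xC2..0xDF; 0xC0 and 0xC1 would be overlong.
    if (lead < 0xC2 || lead > 0xDF)
        return std::nullopt;

    // A continuation byte has the bit pattern 10xxxxxx.
    if ((cont & 0xC0) != 0x80)
        return std::nullopt;

    // Combine the low 5 bits of the lead byte with the low 6 bits of the continuation.
    return static_cast<char32_t>(((lead & 0x1Fu) << 6) | (cont & 0x3Fu));
}
```

For example, `decode_two_byte_utf8(0xCC, 0xAA)` yields U+032A, while `decode_two_byte_utf8(0xC0, 0x80)` yields `std::nullopt` because `c0` is an overlong lead byte.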
Steps to reproduce (or a small repro code sample)
Save the following program as repro_unicode.cpp:
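#include <string_view>
#include <toml++/toml.hpp>

int main()
{
    // Two-byte input found by the fuzzer (the trailing NUL is excluded below).
    static constexpr char input[] = "\xcc\xaa";
    try
    {
        auto tbl = toml::parse(std::string_view{input, 2});
        (void)tbl;
    }
    catch (...)
    {
        // A parse error is acceptable here; the UBSan report is the bug.
    }
}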
Build from the repository root:
clang++ -O1 -g -DNDEBUG -std=c++17 -fsanitize=address,undefined -fno-omit-frame-pointer repro_unicode.cpp src/toml.cpp -I include -o repro_unicode
Run:
ASAN_OPTIONS=detect_leaks=0 ./repro_unicode
Additional information
Continued fuzzing found another minimized input that triggers the same issue:
Hex:
cf aa
Escaped:
\xcf\x9b
Running the parser with this input produces the same UBSan report:
target/include/toml++/impl/unicode_autogenerated.hpp:44:13: runtime error: execution reached an unreachable program point
This suggests the issue is not limited to a single malformed UTF-8 sequence but may affect a broader class of invalid Unicode inputs.