Malformed UTF-8 input triggers undefined behavior in unicode_autogenerated.hpp #295

## Environment
**toml++ version and/or commit hash:**  
Commit: 209ba6675bde846e878111e03bb84852a537f7e3


**Compiler:**  
Clang++ 18

**C++ standard mode:**  
C++17


**Target arch:**  
AArch64 (ARM64 Linux)


**Library configuration overrides:**  
None


**Relevant compilation flags:**  
`-O1 -g -DNDEBUG -std=c++17 -fsanitize=address,undefined -fno-omit-frame-pointer`



## Describe the bug
Malformed UTF-8 input can trigger undefined behavior in the Unicode classification logic.

While fuzzing the `toml::parse()` API using libFuzzer with AddressSanitizer and UBSan enabled, a minimal two-byte input causes execution to reach an unreachable code path inside the Unicode helper implementation.

Minimal crashing input (hex):

cc aa

Escaped representation:

\xcc\xaa

Running the parser with sanitizers produces:

`target/include/toml++/impl/unicode_autogenerated.hpp:44:13: runtime error: execution reached an unreachable program point`

`SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior target/include/toml++/impl/unicode_autogenerated.hpp:44:13`

Expected behavior: malformed UTF-8 sequences should be rejected cleanly (e.g., via `parse_error`) instead of triggering undefined behavior.
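
As an illustration of what "rejected cleanly" could look like at the classification level, here is a minimal sketch of a bucketed range lookup whose fall-through branch returns `false` rather than being marked unreachable. The function name `in_sample_ranges`, the bucket size, and the ranges are assumptions made for this example (it only covers the Combining Diacritical Marks block, U+0300..U+036F); it is not the code in `unicode_autogenerated.hpp`.

```cpp
#include <cstdint>

// Hypothetical classifier, NOT toml++'s generated table.
// Reports whether c lies in U+0300..U+036F using a bucketed switch.
constexpr bool in_sample_ranges(char32_t c) noexcept
{
    // Anything outside the window this sketch knows about is simply "no".
    if (c < 0x300u || c > 0x3FFu)
        return false;

    // Split U+0300..U+03FF into four buckets of 0x40 code points each.
    const auto bucket = (static_cast<std::uint32_t>(c) - 0x300u) / 0x40u;
    switch (bucket)
    {
        case 0: return true;        // U+0300..U+033F: all inside the block
        case 1: return c <= 0x36Fu; // U+0340..U+037F: only up to U+036F
        default: return false;      // U+0380..U+03FF: defined result, no unreachable
    }
}
```

With this shape, a code point that lands in an unexpected bucket yields `false`, and the caller can report an ordinary parse error.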



## Steps to reproduce (or a small repro code sample)
Save the following program as `repro_unicode.cpp`:

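    #include <string_view>
    #include <toml++/toml.hpp>

    int main()
    {
        // The two-byte input from the fuzzer.
        static constexpr char input[] = "\xcc\xaa";
        try
        {
            // Pass an explicit length of 2 so the terminating NUL is excluded.
            auto tbl = toml::parse(std::string_view{input, 2});
            (void)tbl;
        }
        catch (...)
        {
            // A parse error is the expected outcome; swallow it so that
            // only the sanitizer report, if any, is visible.
        }
    }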

Build from the repository root:

`clang++ -O1 -g -DNDEBUG -std=c++17 -fsanitize=address,undefined -fno-omit-frame-pointer repro_unicode.cpp src/toml.cpp -I include -o repro_unicode`

Run:

`ASAN_OPTIONS=detect_leaks=0 ./repro_unicode`


## Additional information
Continued fuzzing found another minimized input that triggers the same issue:

Hex:
cf aa

Escaped:
\xcf\x9b

Running the parser with this input produces the same UBSan report:

`target/include/toml++/impl/unicode_autogenerated.hpp:44:13: runtime error: execution reached an unreachable program point`

This suggests the issue is not limited to a single malformed UTF-8 sequence but may affect a broader class of invalid Unicode inputs.
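
For triage, it may help to know which code points these bytes map to under the standard two-byte UTF-8 layout (`110xxxxx 10xxxxxx`). The sketch below is a standalone decoder written for this report; `decode_utf8_2` is a hypothetical helper, not a toml++ function. Because the second input is listed above both as `cf aa` and as `\xcf\x9b`, all three byte pairs are covered in the usage note.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: decodes one two-byte UTF-8 sequence.
// Returns std::nullopt if the bytes are not a well-formed two-byte sequence.
constexpr std::optional<char32_t> decode_utf8_2(std::uint8_t b0, std::uint8_t b1) noexcept
{
    if ((b0 & 0xE0u) != 0xC0u)
        return std::nullopt; // lead byte is not of the form 110xxxxx
    if (b0 < 0xC2u)
        return std::nullopt; // 0xC0 and 0xC1 would be overlong encodings
    if ((b1 & 0xC0u) != 0x80u)
        return std::nullopt; // second byte is not a continuation byte
    // Five payload bits from the lead byte, six from the continuation byte.
    return static_cast<char32_t>(((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu));
}
```

Under this layout, `decode_utf8_2(0xCC, 0xAA)` yields U+032A, `decode_utf8_2(0xCF, 0xAA)` yields U+03EA, and `decode_utf8_2(0xCF, 0x9B)` yields U+03DB. If that reading is right, the trigger may depend on which code points reach the classification tables; this is an inference from the bytes alone and has not been checked against the toml++ source.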
