NULL Pointer Dereference in PyUpb_PyToUpbEnum "enum value for name" lookup when given malformed utf8 str  #27106

@dynapx

What version of protobuf and what language are you using?
Version: main / v29.x (and likely older versions using the upb backend)
Language: Python (C-Extension / upb implementation)
What supported operating system version are you using (e.g. Linux, Windows)?
Linux (Ubuntu 22.04 / x86_64)
What supported runtime / compiler version are you using (e.g. python version, gcc version)
Python: 3.11+
Compiler: GCC 11.4 / Clang 16.0
What did you do?
Steps to reproduce the behavior:
I attempted to assign a string containing a "lone surrogate" (e.g., "\ud800") to a Protobuf Enum field using the C-extension backend. While "\ud800" is a valid Python str object, it cannot be encoded as UTF-8. The underlying C API PyUnicode_AsUTF8AndSize fails on such strings and returns NULL, but the return value is not checked in python/convert.c.
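To make the distinction concrete, here is a small pure-Python sketch of the string property involved. It does not touch protobuf; the helper name `utf8_encodable` is made up for illustration.

```python
def utf8_encodable(text: str) -> bool:
    """Return True if `text` can be encoded with the strict UTF-8 codec."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


lone = "\ud800"  # a high surrogate with no matching low surrogate

# It is an ordinary one-character str as far as Python is concerned...
one_char = len(lone) == 1

# ...but the strict UTF-8 codec refuses to encode it.
strict_ok = utf8_encodable(lone)

# Only the non-standard "surrogatepass" error handler produces bytes for it.
passed_bytes = lone.encode("utf-8", "surrogatepass")
```

Any str for which `utf8_encodable` returns False is the kind of input described above.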
Sample PoC (Simplified from attached script):
import os

# Force the upb (C extension) implementation; set this before importing protobuf
os.environ['PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION'] = 'upb'
from google.protobuf import descriptor_pool, message_factory, descriptor_pb2

# 1. Define a message with an Enum field
pool = descriptor_pool.Default()
file_proto = descriptor_pb2.FileDescriptorProto(name="poc.proto", syntax="proto3")
msg_type = file_proto.message_type.add(name="PocMessage")
enum_type = msg_type.enum_type.add(name="Color")
enum_type.value.add(name="UNKNOWN", number=0)
# type=14 is TYPE_ENUM
msg_type.field.add(name="favorite_color", number=1, type=14, type_name=".PocMessage.Color")
pool.Add(file_proto)
PocMessage = message_factory.GetMessageClass(pool.FindMessageTypeByName("PocMessage"))

# 2. Trigger the segfault
msg = PocMessage()
msg.favorite_color = "\ud800"  # Lone surrogate: PyUnicode_AsUTF8AndSize returns NULL
What did you expect to see?
The Python interpreter should raise a ValueError or UnicodeEncodeError (indicating that the string is not a valid enum label or cannot be encoded), allowing the application to handle the error gracefully.
What did you see instead?
The Python interpreter crashed immediately with a Segmentation fault (SIGSEGV).
Technical Analysis:
The crash occurs in python/convert.c within PyUpb_PyToUpbEnum.
At line 171, the code calls const char* name = PyUnicode_AsUTF8AndSize(obj, &size);.
When obj contains a lone surrogate, this API returns NULL and sets a Python exception. The code then passes name to upb_EnumDef_FindValueByNameWithSize(e, name, size) at line 173 without a NULL check, so the upb native runtime dereferences the NULL pointer. The resulting hard crash cannot be caught by Python's try...except blocks.
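The failure mode of PyUnicode_AsUTF8AndSize can be observed from Python without involving upb at all. The following is a rough sketch that assumes CPython and calls the C API through `ctypes.pythonapi`; the helper name `utf8_view` is hypothetical. Because `ctypes.pythonapi` checks the interpreter's error flag after each call, the exception set by the C function surfaces as a normal Python exception here, which is what a NULL check in the C code would allow as well.

```python
import ctypes

# Bind the CPython C API function named in the analysis (CPython only).
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
as_utf8.restype = ctypes.c_void_p  # raw address; ctypes maps NULL to None


def utf8_view(text):
    """Return (got_pointer, size, error_name) for a str passed to the C API."""
    size = ctypes.c_ssize_t(-1)
    try:
        ptr = as_utf8(text, ctypes.byref(size))
    except UnicodeEncodeError as exc:
        # The C function failed and set an exception; ctypes re-raised it.
        return False, None, type(exc).__name__
    return ptr is not None, size.value, None


ok_case = utf8_view("UNKNOWN")    # a valid enum label
bad_case = utf8_view("\ud800")    # the lone surrogate from the PoC
```

For the valid label the call yields a non-NULL pointer and a byte length; for the lone surrogate it fails with UnicodeEncodeError instead of yielding a usable pointer.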
Traceback (via faulthandler):
[*] Backend implementation: upb
--- Testing path 1 (Assignment) ---
Fatal Python error: Segmentation fault

Current thread 0x0000743ef3a73740 (most recent call first):
  File "poc.py", line 97 in trigger_poc
  File "poc.py", line 101 in <module>
Segmentation fault (core dumped)
Anything else we should know about your project / environment
This vulnerability was identified using inter-procedural static analysis. The bug is critical as it provides a Remote Denial of Service (DoS) vector: an attacker sending a malformed string in a Protobuf message can crash a server process entirely.
The vulnerability is reachable through multiple core paths:
Scalar Assignment: msg.enum_field = val
Map Insertion: msg.map_field[key] = val (where value is Enum)
Membership Checks: val in msg.repeated_enum_field
I have submitted a fix for this issue in Pull Request #27095
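Until a fixed release is available, callers that accept enum labels from untrusted input could validate them before they reach any of the paths listed above. The sketch below is a caller-side stopgap only, not the change made in the pull request; the function name `checked_enum_label` is hypothetical.

```python
def checked_enum_label(value):
    """Return `value` unchanged, or raise ValueError for a non-UTF-8 str.

    Non-str values (for example integer enum numbers) pass through untouched.
    """
    if isinstance(value, str):
        try:
            value.encode("utf-8")
        except UnicodeEncodeError as exc:
            raise ValueError(
                f"enum label is not valid UTF-8: {value!r}"
            ) from exc
    return value


def try_label(value):
    """Small demo helper: report the outcome instead of raising."""
    try:
        return ("ok", checked_enum_label(value))
    except ValueError:
        return ("rejected", None)


results = [try_label("UNKNOWN"), try_label(0), try_label("\ud800")]
```

With this in place, an assignment would be written as `msg.favorite_color = checked_enum_label(untrusted_value)`, and the same wrapper applies to map insertion and membership checks. The malformed string then produces a catchable ValueError in Python code rather than reaching the C extension.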

Poc1.py
