Performance: Optimize Varint Encoding/Decoding with Loop Unrolling (and potentially SVE2) #26931

@liuyang-664

Description

@liuyang-664

This applies to C++ generated code and core library functions (specifically coded_stream.h).

The current Varint encoding/decoding implementation relies on tight loops that can lead to high branch misprediction rates and suboptimal instruction pipelining, especially on modern AArch64 servers. In high-throughput scenarios, Varint processing becomes a bottleneck.

I have implemented a specialized version of UnsafeVarint that uses explicit length checks and helper functions designed for compiler loop unrolling.

  1. Loop Unrolling: By pre-calculating the required bytes and using a fixed-length loop in EncodeBytes, we allow the compiler to unroll the logic, reducing branch misses significantly.
  2. Performance Gains: Initial benchmarks on ARM64 servers show an average 30%+ improvement in encoding performance for a single core.

Proposed Code Snippet:

// Within google/protobuf/io/coded_stream.h

PROTOBUF_ALWAYS_INLINE static uint8_t* EncodeBytes(uint32_t value, uint8_t* ptr, size_t num) {
  // Emit `num` continuation bytes (low 7 bits, continuation bit set), then
  // the final byte without the continuation bit. Because `num` is a constant
  // at every call site, the compiler can fully unroll this loop.
  for (size_t i = 0; i < num; ++i) {
    ptr[i] = static_cast<uint8_t>(((value >> (7 * i)) & 0x7F) | 0x80);
  }
  ptr[num] = static_cast<uint8_t>(value >> (7 * num));
  return ptr + num + 1;
}

PROTOBUF_ALWAYS_INLINE static uint8_t* UnsafeVarint(uint32_t value, uint8_t* ptr) {
  // Dispatch on the value's magnitude so each branch calls EncodeBytes with
  // a constant byte count, replacing the data-dependent per-byte loop with
  // straight-line code.
  if (value < (1U << 7)) {
    *ptr++ = static_cast<uint8_t>(value);  // single byte, no continuation bit
  } else if (value < (1U << 14)) {
    ptr = EncodeBytes(value, ptr, 1);
  } else if (value < (1U << 21)) {
    ptr = EncodeBytes(value, ptr, 2);
  } else if (value < (1U << 28)) {
    ptr = EncodeBytes(value, ptr, 3);
  } else {
    ptr = EncodeBytes(value, ptr, 4);
  }
  return ptr;
}
// Similar logic for uint64...
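The 64-bit variant would follow the same shape. A minimal self-contained sketch (helper names `EncodeBytes64`/`UnsafeVarint64` are illustrative, not the exact coded_stream.h code, and `PROTOBUF_ALWAYS_INLINE` is replaced with plain `inline` so the snippet compiles standalone; the real version would likely replace the byte-count loop with an explicit ladder or a count-leading-zeros computation):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the 64-bit counterpart of EncodeBytes: `num` continuation
// bytes, then a final byte without the continuation bit.
inline uint8_t* EncodeBytes64(uint64_t value, uint8_t* ptr, size_t num) {
  for (size_t i = 0; i < num; ++i) {
    ptr[i] = static_cast<uint8_t>((value >> (7 * i)) | 0x80);
  }
  ptr[num] = static_cast<uint8_t>(value >> (7 * num));
  return ptr + num + 1;
}

inline uint8_t* UnsafeVarint64(uint64_t value, uint8_t* ptr) {
  // Pre-compute how many extra 7-bit groups are needed (0..9 for uint64_t),
  // then perform one fixed-length encode. Repeated 7-bit shifts avoid any
  // shift amount >= 64, which would be undefined behavior.
  size_t continuation_bytes = 0;
  for (uint64_t v = value >> 7; v != 0; v >>= 7) {
    ++continuation_bytes;
  }
  return EncodeBytes64(value, ptr, continuation_bytes);
}
```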

Describe alternatives you've considered
I am also exploring ARM64 SVE2 (Scalable Vector Extension 2) instructions. Specifically, using BEXT (Bit Extract) and BDEP (Bit Deposit) equivalents in SVE2 could potentially:

  • Increase encoding performance to 2.5x of the current implementation.
  • Improve decoding performance by approximately 65%.
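The core of the BDEP idea is scattering the payload bits into the low 7 bits of consecutive output bytes, so the continuation bits can be OR-ed in with a single constant mask. As a concept illustration only, here is a portable scalar emulation of the bit-deposit step (on SVE2 this single loop would be one BDEP instruction; the helper names are hypothetical and the byte count is taken as a parameter rather than computed):

```cpp
#include <cstdint>

// Scalar emulation of BDEP (bit deposit): deposit the low bits of `value`
// into the positions where `mask` has 1-bits, in ascending order.
inline uint64_t bit_deposit(uint64_t value, uint64_t mask) {
  uint64_t result = 0;
  for (uint64_t bit = 1; mask != 0; bit <<= 1) {
    uint64_t lowest = mask & (0 - mask);  // lowest set bit of the mask
    if (value & bit) result |= lowest;
    mask &= mask - 1;                     // clear that bit
  }
  return result;
}

// Encode up to 5 varint bytes of a uint32_t in one shot: deposit each
// 7-bit group into a byte lane (0x7F per lane), then OR in continuation
// bits on every byte except the last. Result is little-endian byte order.
inline uint64_t varint_bytes_via_bdep(uint32_t value, unsigned num_bytes) {
  uint64_t spread = bit_deposit(value, 0x7F7F7F7F7FULL);
  uint64_t cont = 0x8080808080ULL >> (8 * (6 - num_bytes));
  return spread | cont;
}
```

For example, 300 encoded in two bytes yields `0xAC, 0x02` in memory, i.e. the value `0x02AC` here.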

Question for Maintainers:
Does the Protobuf architecture currently support or welcome architecture-specific intrinsics (like SVE2) for such core operations? If so, I would like to submit a PR including both the general loop-unrolling optimization and the AArch64-specific SIMD enhancements.

Additional context
The benchmarks were conducted on a standard ARM64 cloud instance. I can provide detailed google/benchmark results if required.
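Independent of the timing numbers, any benchmark of this change should first verify the fast path is byte-identical to the baseline. A minimal differential check (here `EncodeLoop` is the textbook per-byte loop and `EncodeUnrolled` restates the proposal above in self-contained form, without the `PROTOBUF_ALWAYS_INLINE` macro):

```cpp
#include <cstddef>
#include <cstdint>

// Baseline: classic varint loop with a data-dependent branch per byte.
inline uint8_t* EncodeLoop(uint32_t value, uint8_t* ptr) {
  while (value >= 0x80) {
    *ptr++ = static_cast<uint8_t>(value | 0x80);
    value >>= 7;
  }
  *ptr++ = static_cast<uint8_t>(value);
  return ptr;
}

// Candidate: length-dispatched encoder, folded into one function here.
inline uint8_t* EncodeUnrolled(uint32_t value, uint8_t* ptr) {
  size_t num = 0;
  if (value >= (1U << 7))  num = 1;
  if (value >= (1U << 14)) num = 2;
  if (value >= (1U << 21)) num = 3;
  if (value >= (1U << 28)) num = 4;
  for (size_t i = 0; i < num; ++i) {
    ptr[i] = static_cast<uint8_t>((value >> (7 * i)) | 0x80);
  }
  ptr[num] = static_cast<uint8_t>(value >> (7 * num));
  return ptr + num + 1;
}

// Differential check: both encoders must emit identical bytes for `value`.
inline bool SameEncoding(uint32_t value) {
  uint8_t a[5], b[5];
  uint8_t* ea = EncodeLoop(value, a);
  uint8_t* eb = EncodeUnrolled(value, b);
  if (ea - a != eb - b) return false;
  for (ptrdiff_t i = 0; i < ea - a; ++i) {
    if (a[i] != b[i]) return false;
  }
  return true;
}
```

Sweeping this check over the length boundaries (127/128, 16383/16384, and so on) plus random inputs would make the google/benchmark results easy to trust.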
