Performance: Optimize Varint Encoding/Decoding with Loop Unrolling (and potentially SVE2) #26931

@liuyang-664

Description

@liuyang-664

This applies to C++ generated code and core library functions (specifically coded_stream.h).

The current Varint encoding/decoding implementation relies on tight loops that can lead to high branch misprediction rates and suboptimal instruction pipelining, especially on modern AArch64 servers. In high-throughput scenarios, Varint processing becomes a bottleneck.

I have implemented a specialized version of UnsafeVarint that uses explicit length checks and helper functions designed for compiler loop unrolling.

  1. Loop Unrolling: By pre-calculating the required bytes and using a fixed-length loop in EncodeBytes, we allow the compiler to unroll the logic, reducing branch misses significantly.
  2. Performance Gains: Initial benchmarks on ARM64 servers show an average 30%+ improvement in encoding performance for a single core.

Proposed Code Snippet:

// Within google/protobuf/io/coded_stream.h

PROTOBUF_ALWAYS_INLINE static uint8_t* EncodeBytes(uint32_t value, uint8_t* ptr, size_t num) {
  // Emit `num` continuation bytes (low 7 bits, continuation bit set), then
  // the final byte without the continuation bit. Because `num` is a constant
  // at every call site, the compiler can fully unroll this loop.
  for (size_t i = 0; i < num; ++i) {
    ptr[i] = static_cast<uint8_t>(((value >> (7 * i)) & 0x7F) | 0x80);
  }
  ptr[num] = static_cast<uint8_t>(value >> (7 * num));
  return ptr + num + 1;
}

PROTOBUF_ALWAYS_INLINE static uint8_t* UnsafeVarint(uint32_t value, uint8_t* ptr) {
  // Dispatch on the value's magnitude so each branch calls EncodeBytes with
  // a constant byte count, replacing the data-dependent per-byte loop with
  // straight-line code.
  if (value < (1U << 7)) {
    *ptr++ = static_cast<uint8_t>(value);  // single byte, no continuation bit
  } else if (value < (1U << 14)) {
    ptr = EncodeBytes(value, ptr, 1);
  } else if (value < (1U << 21)) {
    ptr = EncodeBytes(value, ptr, 2);
  } else if (value < (1U << 28)) {
    ptr = EncodeBytes(value, ptr, 3);
  } else {
    ptr = EncodeBytes(value, ptr, 4);
  }
  return ptr;
}
// Similar logic for uint64...
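The 64-bit variant would follow the same shape. A minimal self-contained sketch (helper names `EncodeBytes64`/`UnsafeVarint64` are illustrative, not the exact coded_stream.h code, and `PROTOBUF_ALWAYS_INLINE` is replaced with plain `inline` so the snippet compiles standalone; the real version would likely replace the byte-count loop with an explicit ladder or a count-leading-zeros computation):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the 64-bit counterpart of EncodeBytes: `num` continuation
// bytes, then a final byte without the continuation bit.
inline uint8_t* EncodeBytes64(uint64_t value, uint8_t* ptr, size_t num) {
  for (size_t i = 0; i < num; ++i) {
    ptr[i] = static_cast<uint8_t>((value >> (7 * i)) | 0x80);
  }
  ptr[num] = static_cast<uint8_t>(value >> (7 * num));
  return ptr + num + 1;
}

inline uint8_t* UnsafeVarint64(uint64_t value, uint8_t* ptr) {
  // Pre-compute how many extra 7-bit groups are needed (0..9 for uint64_t),
  // then perform one fixed-length encode. Repeated 7-bit shifts avoid any
  // shift amount >= 64, which would be undefined behavior.
  size_t continuation_bytes = 0;
  for (uint64_t v = value >> 7; v != 0; v >>= 7) {
    ++continuation_bytes;
  }
  return EncodeBytes64(value, ptr, continuation_bytes);
}
```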

Describe alternatives you've considered
I am also exploring ARM64 SVE2 (Scalable Vector Extension 2) instructions. Specifically, using BEXT (Bit Extract) and BDEP (Bit Deposit) equivalents in SVE2 could potentially:

  • Increase encoding performance to 2.5x of the current implementation.
  • Improve decoding performance by approximately 65%.
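The core of the BDEP idea is scattering the payload bits into the low 7 bits of consecutive output bytes, so the continuation bits can be OR-ed in with a single constant mask. As a concept illustration only, here is a portable scalar emulation of the bit-deposit step (on SVE2 this single loop would be one BDEP instruction; the helper names are hypothetical and the byte count is taken as a parameter rather than computed):

```cpp
#include <cstdint>

// Scalar emulation of BDEP (bit deposit): deposit the low bits of `value`
// into the positions where `mask` has 1-bits, in ascending order.
inline uint64_t bit_deposit(uint64_t value, uint64_t mask) {
  uint64_t result = 0;
  for (uint64_t bit = 1; mask != 0; bit <<= 1) {
    uint64_t lowest = mask & (0 - mask);  // lowest set bit of the mask
    if (value & bit) result |= lowest;
    mask &= mask - 1;                     // clear that bit
  }
  return result;
}

// Encode up to 5 varint bytes of a uint32_t in one shot: deposit each
// 7-bit group into a byte lane (0x7F per lane), then OR in continuation
// bits on every byte except the last. Result is little-endian byte order.
inline uint64_t varint_bytes_via_bdep(uint32_t value, unsigned num_bytes) {
  uint64_t spread = bit_deposit(value, 0x7F7F7F7F7FULL);
  uint64_t cont = 0x8080808080ULL >> (8 * (6 - num_bytes));
  return spread | cont;
}
```

For example, 300 encoded in two bytes yields `0xAC, 0x02` in memory, i.e. the value `0x02AC` here.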

Question for Maintainers:
Does the Protobuf architecture currently support or welcome architecture-specific intrinsics (like SVE2) for such core operations? If so, I would like to submit a PR including both the general loop-unrolling optimization and the AArch64-specific SIMD enhancements.

Additional context
The benchmarks were conducted on a standard ARM64 cloud instance. I can provide detailed google/benchmark results if required.
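Independent of the timing numbers, any benchmark of this change should first verify the fast path is byte-identical to the baseline. A minimal differential check (here `EncodeLoop` is the textbook per-byte loop and `EncodeUnrolled` restates the proposal above in self-contained form, without the `PROTOBUF_ALWAYS_INLINE` macro):

```cpp
#include <cstddef>
#include <cstdint>

// Baseline: classic varint loop with a data-dependent branch per byte.
inline uint8_t* EncodeLoop(uint32_t value, uint8_t* ptr) {
  while (value >= 0x80) {
    *ptr++ = static_cast<uint8_t>(value | 0x80);
    value >>= 7;
  }
  *ptr++ = static_cast<uint8_t>(value);
  return ptr;
}

// Candidate: length-dispatched encoder, folded into one function here.
inline uint8_t* EncodeUnrolled(uint32_t value, uint8_t* ptr) {
  size_t num = 0;
  if (value >= (1U << 7))  num = 1;
  if (value >= (1U << 14)) num = 2;
  if (value >= (1U << 21)) num = 3;
  if (value >= (1U << 28)) num = 4;
  for (size_t i = 0; i < num; ++i) {
    ptr[i] = static_cast<uint8_t>((value >> (7 * i)) | 0x80);
  }
  ptr[num] = static_cast<uint8_t>(value >> (7 * num));
  return ptr + num + 1;
}

// Differential check: both encoders must emit identical bytes for `value`.
inline bool SameEncoding(uint32_t value) {
  uint8_t a[5], b[5];
  uint8_t* ea = EncodeLoop(value, a);
  uint8_t* eb = EncodeUnrolled(value, b);
  if (ea - a != eb - b) return false;
  for (ptrdiff_t i = 0; i < ea - a; ++i) {
    if (a[i] != b[i]) return false;
  }
  return true;
}
```

Sweeping this check over the length boundaries (127/128, 16383/16384, and so on) plus random inputs would make the google/benchmark results easy to trust.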
