asm-bufferutil

WebSocket acceleration in hand-written x86-64 assembly with tiered SIMD dispatch (AVX-512 / AVX2 / SSE2).

A drop-in replacement for bufferutil — the native addon that makes the ws WebSocket library fast. Instead of C, the hot paths are written in NASM assembly with multi-tier SIMD vectorization, non-temporal memory paths for large payloads, and opmask branchless tails on AVX-512.

Why?

The WebSocket protocol (RFC 6455 §5.3) requires that every client-to-server frame be masked with a 4-byte key via XOR. This is a tight loop that runs on every single message. The ws library delegates this to bufferutil, which implements it in C. This project replaces that C with hand-tuned assembly that adapts to your CPU's capabilities at runtime.

Performance

Benchmarked on AMD Ryzen 9 7950X3D (Zen 4, AVX-512) vs bufferutil (C):

Operation	Speedup vs bufferutil	Peak throughput
mask	1.1–1.5×	55 GB/s
unmask	1.1–1.7×	90 GB/s
batch unmask (64B frames)	6–10×	—

Additional operations not in bufferutil:

Operation	vs Node.js built-in	Notes
SHA-1 (SHA-NI)	2.3× vs `crypto.createHash`	Hardware SHA-NI, ≤119 bytes
Base64 encode	—	VBMI2 VPMULTISHIFTQB + VPERMB pipeline
Header search	—	AVX-512 VPCMPEQB first+last byte filter
CRC-32C	—	PCLMULQDQ 4-way parallel folding
UTF-8 validation	—	SIMD ASCII fast path + scalar state machine

Architecture

┌──────────────────────────────────────────────────────────────┐
│  Node.js  (your app, or ws internals)                        │
│                                                              │
│  const util = require('asm-bufferutil');                      │
│  util.mask(source, mask, output, offset, length);            │
│  util.unmask(buffer, mask);                                  │
│  util.sha1(data);            // SHA-NI hardware acceleration │
│  util.base64Encode(data);    // SIMD base64                  │
│  util.findHeader(buf, needle);                               │
│  util.utf8Validate(buf);     // SIMD UTF-8 validation        │
│  util.batchUnmask(data, offsets, lengths, masks, count);     │
└────────────────────┬─────────────────────────────────────────┘
                     │  Direct V8 C++ API (zero N-API dispatch overhead)
┌────────────────────▼─────────────────────────────────────────┐
│  ws_fast_api.cc  —  V8 C++ API bridge                        │
│                                                              │
���  • node::Buffer::Data() for zero-copy pointer extraction     │
│  ��� Batch operations: C++ loop with zero V8 API calls inside  │
│  • Multi-buffer parallel: ws_unmask4/ws_mask4 for 4 frames   │
└─────────────���──────┬─────────────────────────────────────────┘
                     │  System V AMD64 calling convention
┌────────────────────▼─────────────────────────────────────────┐
│  Assembly layer  (NASM, Linux x86-64)                        │
│                                                              │
│  ws_cpu.asm        — CPUID feature detection + cache sizing  │
│                      cpu_tier: 0=scalar, 1=SSE2, 2=AVX2,    │
│                                3=AVX-512F+BW                 │
│                      cpu_features bitmask: GFNI, PCLMULQDQ,  │
│                        BMI2, LZCNT, VBMI, AMD, VBMI2        │
│                      nt_threshold: 50% of L3 cache           │
│                                                              │
│  ws_mask_asm.asm   — Mask/unmask with tiered dispatch:       │
│                      • GPR 4× unroll (< 128B)               │
│                      • AVX-512: 512B/iter, opmask tail,      │
│                        alignment preamble, dual-stream unmask │
│                      • AVX2: 128B/iter, aligned stores       │
│                      • SSE2: 64B/iter fallback               │
│                      • NT-store paths (> L3/2 threshold)     │
│                      • ws_unmask4/ws_mask4: 4-frame parallel  │
│                      • ws_find_header: VPCMPEQB wide search   │
│                      • ws_mask_gfni: GFNI experiment baseline │
│                                                              │
│  ws_base64_asm.asm — Base64 encode with dispatch:            │
│                      • AVX-512 VBMI2: VPMULTISHIFTQB+VPERMB  │
│                      • AVX-512 VBMI: VPERMB LUT              │
│                      • AVX2: Klomp/Muła VPSHUFB method       │
│                      • SSE2 / scalar fallback                │
│                                                              │
│  ws_crc32_asm.asm  — CRC-32C:                               │
│                      • PCLMULQDQ 4-way folding (≥ 256B)      │
│                      • SSE4.2 CRC32 instruction (< 256B)     │
│                                                              │
│  ws_utf8_asm.asm   — UTF-8 validation:                       │
│                      • AVX-512: 64B ASCII fast check          │
│                      • AVX2: 32B ASCII fast check             │
│                      • Scalar state machine (non-ASCII)       │
│                                                              │
│  ws_sha1_ni.c      — SHA-1 via Intel SHA-NI intrinsics       │
│                      (≤ 119 bytes, WebSocket handshake use)   │
│                                                              │
│  ws_fallback.c     — C/SIMD fallback (Windows/macOS)         │
│                      AVX2 → SSE2 → scalar dispatch            │
└──────────────────────────────────────────────────────────────┘

How the masking works

WebSocket masking is deceptively simple. Given a payload and a 4-byte mask key:

masked[i] = payload[i] XOR mask[i % 4]

In pure JavaScript, this is a byte-at-a-time loop. In our assembly:

Broadcast: VPBROADCASTD replicates the 4-byte mask across a 512-bit ZMM register (64 bytes of mask).
Alignment preamble: An opmask partial store aligns the destination to a 64-byte cache line boundary, eliminating split-line penalties.
Bulk XOR: 8× unrolled VPXORD processes 512 bytes per iteration with interleaved load-XOR-store scheduling.
Opmask tail: BZHI builds a bitmask for the remaining 0–63 bytes; vmovdqu8{k1} handles the tail with zero branches.
NT path: For payloads exceeding 50% of L3 cache, switches to non-temporal stores (VMOVNTDQ) to avoid polluting the cache hierarchy.

Build

Prerequisites: Linux x86-64 with nasm for assembly SIMD paths. On Windows and macOS, the C fallback (ws_fallback.c) provides AVX2/SSE2/scalar dispatch automatically.

# Install NASM (Ubuntu/Debian)
sudo apt install nasm

# Build the native addon
npm install
npm run build

# Run tests (89 tests)
npm test

# Run benchmarks
npm run bench

Requires Node.js ≥ 20 and node-gyp.

Usage

Standalone

const bufferUtil = require("asm-bufferutil");
const crypto = require("crypto");

const payload = Buffer.from("Hello WebSocket!");
const mask = crypto.randomBytes(4);
const output = Buffer.alloc(payload.length);

// Mask
bufferUtil.mask(payload, mask, output, 0, payload.length);

// Unmask (in-place)
bufferUtil.unmask(output, mask);
// output now equals payload

// SHA-1 (hardware-accelerated on CPUs with SHA-NI)
const hash = bufferUtil.sha1(Buffer.from("input data"));

// Base64 encode
const b64 = bufferUtil.base64Encode(hash);

// UTF-8 validation
const valid = bufferUtil.utf8Validate(Buffer.from("こんにちは"));

// Batch unmask (packed buffer API — zero V8 overhead in inner loop)
bufferUtil.batchUnmask(data, offsets, lengths, masks, count);

As a drop-in for ws

The ws library checks for bufferutil at startup. You can redirect it:

// Option 1: Package aliasing in package.json
{
  "dependencies": {
    "bufferutil": "npm:asm-bufferutil@1.0.0"
  }
}

// Option 2: Module resolution override
// In your entry point, before requiring ws:
const Module = require('module');
const originalResolve = Module._resolveFilename;
Module._resolveFilename = function(request, ...args) {
  if (request === 'bufferutil') {
    return require.resolve('asm-bufferutil');
  }
  return originalResolve.call(this, request, ...args);
};

const WebSocket = require('ws');
// ws now uses your assembly masking!

How this compares to bufferutil

Aspect	bufferutil	asm-bufferutil
Language	C	x86-64 Assembly (NASM)
SIMD	Compiler decides	Explicit AVX-512 / AVX2 / SSE2 tiering
Masking strategy	32-bit XOR in C loop	512-bit VPXORD (16× wider)
V8 bridge	N-API	Direct V8 C++ API (lower overhead)
Batch API	None	Packed buffer, zero V8 calls in loop
Extra operations	mask, unmask	+ SHA-1, base64, CRC-32C, UTF-8, findHeader
NT stores	No	Auto for payloads > 50% L3
Portability	Any platform with C compiler	Linux x86-64 (C fallback elsewhere)

CPU feature detection

At module load, _init_cpu_features() runs CPUID to detect:

cpu_tier: 0 (scalar), 1 (SSE2), 2 (AVX2), 3 (AVX-512F+BW)
cpu_features: GFNI, PCLMULQDQ, BMI2, LZCNT, VBMI, VBMI2, AMD vendor
nt_threshold: 50% of L3 cache size (auto-detected via cache topology CPUID leaves)

All dispatch is runtime — a single binary adapts to the host CPU.

Relevant to Meteor/DDP

If you're running Meteor.js, every DDP message (method calls, subscriptions, collection updates) goes through WebSocket framing. On a busy system with hundreds of concurrent connections, the masking loop runs thousands of times per second. Shaving microseconds here compounds into meaningful CPU savings.

File structure

asm-bufferutil/
├── src/
│   ├── ws_fast_api.cc          # Direct V8 C++ API bridge
│   ├── ws_napi.c               # Legacy N-API bridge
│   ├── ws_sha1_ni.c            # SHA-1 with Intel SHA-NI intrinsics
│   ├── ws_fallback.c           # C/SIMD fallback (Windows/macOS)
│   ├── ws_mask_asm.asm         # Mask/unmask + findHeader (AVX-512/AVX2/SSE2/GPR)
│   ├── ws_base64_asm.asm       # Base64 encode (VBMI2/VBMI/AVX2/SSE2/scalar)
│   ├── ws_crc32_asm.asm        # CRC-32C (PCLMULQDQ folding / SSE4.2)
│   ├── ws_utf8_asm.asm         # UTF-8 validation (AVX-512/AVX2 ASCII + scalar)
│   ├── ws_cpu.asm              # CPUID feature detection + cache topology
│   └── websocket_server.asm    # WebSocket server-side frame assembly
├── test/
│   ├── index.js                # Correctness test suite (89 tests)
│   ├── crc32.c                 # C-level CRC-32C test harness
│   └── client.html             # Browser WebSocket test client
├── bench/
│   └── index.js                # Time-based benchmarks (all operations)
├── .github/workflows/
│   ├── ci.yml                  # CI: build + test
│   └── benchmark.yml           # Benchmark with artifact upload
├── binding.gyp                 # node-gyp build config (NASM actions)
├── index.js                    # JS entry point with native/fallback dispatch
├── package.json
├── CHANGELOG.md
├── CONTRIBUTING.md
└── README.md

License

MIT

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

asm-bufferutil

Why?

Performance

Architecture

How the masking works

Build

Usage

Standalone

As a drop-in for ws

How this compares to bufferutil

CPU feature detection

Relevant to Meteor/DDP

File structure

License

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 71 Commits
.github/workflows		.github/workflows
bench		bench
src		src
test		test
.gitignore		.gitignore
.npmignore		.npmignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
binding.gyp		binding.gyp
fallback.js		fallback.js
index.js		index.js
package.json		package.json

Folders and files

Latest commit

History

Repository files navigation

asm-bufferutil

Why?

Performance

Architecture

How the masking works

Build

Usage

Standalone

As a drop-in for ws

How this compares to bufferutil

CPU feature detection

Relevant to Meteor/DDP

File structure

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages