
feat(perf): Add fast path for utf16le encoding in stringToBuffer()/bufferToString() #981

Merged: boorad merged 19 commits into margelo:main from wh201906:wh201906/fast-utf16 on Apr 28, 2026

wh201906 commented Apr 26, 2026

The native implementation is much faster:

| name | ratio |
| --- | --- |
| utf16le encode 32B | 2.18x |
| utf16le encode 1MB | 318.79x |
| utf16le encode 32B (ASCII only) | 2.09x |
| utf16le encode 1MB (ASCII only) | 164.03x |
| utf16le decode 32B | 3.39x |
| utf16le decode 1MB | 2005.62x |
| utf16le decode 32B (ASCII only) | 3.27x |
| utf16le decode 1MB (ASCII only) | 886.29x |

[Screenshot: IMG_20260427_003334]

In the current mainstream React Native JavaScript engine, Hermes, strings are internally represented using UTF-16 or ASCII. Therefore, when the native side needs access to the UTF-16 representation of a string, Hermes can provide the underlying data with minimal overhead. However, in the current implementation of Nitro, JavaScript strings are always converted to UTF-8 by default. For UTF-16 data, this introduces unnecessary conversion overhead and may also lead to data loss (e.g., unpaired surrogates) during the conversion process.
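
To make the data-loss point concrete, here is a minimal TypeScript sketch (not code from this PR) that models what a UTF-8 detour does to a string: every unpaired surrogate is replaced with U+FFFD, so the original code unit cannot be recovered afterwards. `replaceLoneSurrogates` is a hypothetical helper written only for illustration.

```typescript
// Hypothetical helper (not part of the PR): models the U+FFFD substitution
// that a UTF-8 conversion applies to unpaired surrogates.
function replaceLoneSurrogates(str: string): string {
  let out = '';
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const isLow = unit >= 0xdc00 && unit <= 0xdfff;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        // Valid surrogate pair: keep both code units and skip the low half.
        out += String.fromCharCode(unit, next);
        i++;
        continue;
      }
    }
    // Lone surrogates become U+FFFD; everything else is copied unchanged.
    out += isHigh || isLow ? '\uFFFD' : String.fromCharCode(unit);
  }
  return out;
}
```

With this model, 'A\uD83DB' and 'A\uDC00B' both collapse to 'A\uFFFDB', which is why a path that never leaves UTF-16 is needed to round-trip them.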

To address this, I bypass the Nitrogen-generated conversion path from JS string to std::string by accessing the jsi::String object directly. For other encodings, the existing Nitrogen-like code path is preserved (it calls jsi::String::utf8(), as Nitro does). For UTF-16 encoding, a lower-level fast path is used whenever possible (it calls jsi::String::getStringData()).

Note: this optimized UTF-16 encoding/decoding path is only available in the Hermes environment and for React Native 0.78+. Therefore, I added conditional checks on both the JavaScript side and the C++ side to selectively enable this feature.
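
A JavaScript-side gate of this kind could look roughly like the sketch below. The names `RuntimeInfo`, `atLeast`, and `canUseNativeUtf16` are assumptions made for illustration, and the minimum version is passed in as a parameter because the required React Native version differs per direction; the PR's actual detection code is not shown in this thread.

```typescript
// Hypothetical description of the host runtime; not an API from the PR.
interface RuntimeInfo {
  engine: 'hermes' | 'jsc' | 'other';
  rnVersion: { major: number; minor: number; patch: number };
}

type Version = RuntimeInfo['rnVersion'];

// Returns true when `v` is greater than or equal to `min`.
function atLeast(v: Version, min: Version): boolean {
  if (v.major !== min.major) return v.major > min.major;
  if (v.minor !== min.minor) return v.minor > min.minor;
  return v.patch >= min.patch;
}

// Enable the native UTF-16 path only on Hermes with a new enough React Native;
// in every other case the caller keeps using the existing path.
function canUseNativeUtf16(info: RuntimeInfo, min: Version): boolean {
  return info.engine === 'hermes' && atLeast(info.rnVersion, min);
}
```

Keeping the decision in one predicate makes it easier to keep the JavaScript side and the C++ side in agreement about when the fast path exists.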

For testing, I added UTF-16LE-related test cases based on Node.js, as well as performance benchmarks for the UTF-16 encoding path.

(text polished by ChatGPT)

wh201906 commented Apr 27, 2026

Test cases from Node.js v24.15.0

Roundtrips ASCII text through utf16le encoding.

Current encoding_tests.ts:

test(SUITE, '[Node.js] Roundtrips ASCII text through utf16le encoding.', () => {
  const str = 'foo';
  const ab = stringToBuffer(str, 'utf16le');
  expect(bufferToString(ab, 'utf16le')).to.equal(str);
});

Original Node.js (test/parallel/test-buffer-tostring.js):

// utf8, ucs2, ascii, latin1, utf16le
for (const encoding of [
  'utf8',
  'utf-8',
  'ucs2',
  'ucs-2',
  'ascii',
  'latin1',
  'binary',
  'utf16le',
  'utf-16le',
].flatMap(e => [e, e.toUpperCase()])) {
  assert.strictEqual(Buffer.from('foo', encoding).toString(encoding), 'foo');
}

Roundtrips UTF-16LE text containing an unpaired high surrogate.

Current encoding_tests.ts:

test(
  SUITE,
  'Roundtrips UTF-16LE text containing an unpaired high surrogate.',
  () => {
    const str = 'A\uD83DB';
    const ab = stringToBuffer(str, 'utf16le');
    expect(toU8(ab)).to.deep.equal(
      new Uint8Array([0x41, 0x00, 0x3d, 0xd8, 0x42, 0x00]),
    );
    expect(bufferToString(ab, 'utf16le')).to.equal(str);
  },
);

Original Node.js:

No direct matching test case was found in Node.js v24.15.0.

Verified Node.js runtime behavior:

const str = 'A\uD83DB';
const buf = Buffer.from(str, 'utf16le');
assert.deepStrictEqual([...buf], [0x41, 0x00, 0x3d, 0xd8, 0x42, 0x00]);
assert.strictEqual(buf.toString('utf16le'), str);

Roundtrips UTF-16LE text containing an unpaired low surrogate.

Current encoding_tests.ts:

test(
  SUITE,
  'Roundtrips UTF-16LE text containing an unpaired low surrogate.',
  () => {
    const str = 'A\uDC00B';
    const ab = stringToBuffer(str, 'utf16le');
    expect(toU8(ab)).to.deep.equal(
      new Uint8Array([0x41, 0x00, 0x00, 0xdc, 0x42, 0x00]),
    );
    expect(bufferToString(ab, 'utf16le')).to.equal(str);
  },
);

Original Node.js:

No direct matching test case was found in Node.js v24.15.0.

Verified Node.js runtime behavior:

const str = 'A\uDC00B';
const buf = Buffer.from(str, 'utf16le');
assert.deepStrictEqual([...buf], [0x41, 0x00, 0x00, 0xdc, 0x42, 0x00]);
assert.strictEqual(buf.toString('utf16le'), str);

UTF-16LE encoding of "über"

Current encoding_tests.ts:

test(SUITE, '[Node.js] UTF-16LE encoding of "über"', () => {
  expect(toU8(stringToBuffer('über', 'utf16le'))).to.deep.equal(
    new Uint8Array([252, 0, 98, 0, 101, 0, 114, 0]),
  );
});

Original Node.js (test/parallel/test-buffer-alloc.js):

['ucs2', 'ucs-2', 'utf16le', 'utf-16le'].forEach(encoding => {
  {
    // Test for proper UTF16LE encoding, length should be 8
    const f = Buffer.from('über', encoding);
    assert.deepStrictEqual(f, Buffer.from([252, 0, 98, 0, 101, 0, 114, 0]));
  }
});

UTF-16LE encoding of "привет"

Current encoding_tests.ts:

test(SUITE, '[Node.js] UTF-16LE encoding of "привет"', () => {
  const encoded = toU8(stringToBuffer('привет', 'utf16le'));
  expect(encoded).to.deep.equal(
    new Uint8Array([63, 4, 64, 4, 56, 4, 50, 4, 53, 4, 66, 4]),
  );
  expect(bufferToString(encoded.buffer as ArrayBuffer, 'utf16le')).to.equal(
    'привет',
  );
});

Original Node.js (test/parallel/test-buffer-alloc.js):

['ucs2', 'ucs-2', 'utf16le', 'utf-16le'].forEach(encoding => {
  {
    // Length should be 12
    const f = Buffer.from('привет', encoding);
    assert.deepStrictEqual(
      f,
      Buffer.from([63, 4, 64, 4, 56, 4, 50, 4, 53, 4, 66, 4]),
    );
    assert.strictEqual(f.toString(encoding), 'привет');
  }
});

UTF-16LE encoding of Thumbs up sign (U+1F44D)

Current encoding_tests.ts:

test(SUITE, '[Node.js] UTF-16LE encoding of Thumbs up sign (U+1F44D)', () => {
  expect(toU8(stringToBuffer('\uD83D\uDC4D', 'utf16le'))).to.deep.equal(
    new Uint8Array([0x3d, 0xd8, 0x4d, 0xdc]),
  );
});

Original Node.js (test/parallel/test-buffer-alloc.js):

{
  const f = Buffer.from('\uD83D\uDC4D', 'utf-16le'); // THUMBS UP SIGN (U+1F44D)
  assert.strictEqual(f.length, 4);
  assert.deepStrictEqual(f, Buffer.from('3DD84DDC', 'hex'));
}

Decodes UTF-16LE bytes back to Japanese text.

Current encoding_tests.ts:

test(SUITE, '[Node.js] Decodes UTF-16LE bytes back to Japanese text.', () => {
  const bytes = new Uint8Array([
    0x42, 0x30, 0x44, 0x30, 0x46, 0x30, 0x48, 0x30, 0x4a, 0x30,
  ]);
  expect(bufferToString(bytes.buffer as ArrayBuffer, 'utf16le')).to.equal(
    'あいうえお',
  );
});

Original Node.js (test/parallel/test-buffer-alloc.js):

['ucs2', 'ucs-2', 'utf16le', 'utf-16le'].forEach(encoding => {
  const b = Buffer.allocUnsafe(10);
  b.write('あいうえお', encoding);
  assert.strictEqual(b.toString(encoding), 'あいうえお');
});

Decodes UTF-16LE bytes correctly from a sliced buffer starting at byte offset 1.

Current encoding_tests.ts:

test(
  SUITE,
  '[Node.js] Decodes UTF-16LE bytes correctly from a sliced buffer starting at byte offset 1.',
  () => {
    const bytes = new Uint8Array([
      0xff, 0x42, 0x30, 0x44, 0x30, 0x46, 0x30, 0x48, 0x30, 0x4a, 0x30,
    ]);
    expect(
      bufferToString(bytes.slice(1).buffer as ArrayBuffer, 'utf16le'),
    ).to.equal('あいうえお');
  },
);

Original Node.js (test/parallel/test-buffer-alloc.js):

['ucs2', 'ucs-2', 'utf16le', 'utf-16le'].forEach(encoding => {
  const b = Buffer.allocUnsafe(11);
  b.write('あいうえお', 1, encoding);
  assert.strictEqual(b.toString(encoding, 1), 'あいうえお');
});

wh201906 commented

This PR is ready for review.

boorad commented Apr 27, 2026

PR #981: Fast UTF-16LE Path — Evaluation

Branch state: Behind main by ~12 commits (based on ab84046 from Apr 20). The git diff main..HEAD looks scary (huge package.json/test deletions), but the actual PR delta is just 8 files (gh pr diff confirms this). It needs a rebase before merging, but the PR's intent is small and focused.

CI: C++ Lint, JS Lint, tsc, Android build → all green. iOS/Android e2e still in progress.

What it does (correctly)
The PR's core insight is solid: Nitrogen converts jsi::String → std::string (UTF-8) before C++ sees it, which (a) wastes time and (b) destroys unpaired surrogates via U+FFFD substitution. By dropping bufferToString/stringToBuffer from the Nitro spec and re-registering them with registerRawHybridMethod, the C++ side gets the raw jsi::String and can:

  • For UTF-16LE encode (bufferToString(), RN 0.79+): call jsi::String::createFromUtf16 directly on the buffer bytes
  • For UTF-16LE decode (stringToBuffer(), RN 0.78+): call getStringData(), which streams ASCII/UTF-16 chunks straight out of Hermes' internal representation
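
As a rough model of the getStringData() direction (string to bytes), the sketch below collects a sequence of ASCII and UTF-16 chunks into one UTF-16LE byte array. The `Chunk` type and `collectUtf16le` function are hypothetical TypeScript stand-ins for the C++ callback, written only to show the shape of the logic; they are not the JSI API and not the PR's code.

```typescript
// Hypothetical model of one chunk: 8-bit ASCII bytes or 16-bit UTF-16 code units.
type Chunk =
  | { ascii: true; data: Uint8Array }
  | { ascii: false; data: Uint16Array };

function collectUtf16le(chunks: Chunk[]): Uint8Array {
  // Each character takes two output bytes, whichever chunk kind it came from.
  let unitCount = 0;
  chunks.forEach((chunk) => {
    unitCount += chunk.data.length;
  });

  const out = new Uint8Array(unitCount * 2); // typed arrays start zero-filled
  const view = new DataView(out.buffer);
  let offset = 0;
  chunks.forEach((chunk) => {
    if (chunk.ascii) {
      chunk.data.forEach((byte) => {
        // Only the low byte is written; the high byte is still 0 from allocation.
        out[offset] = byte;
        offset += 2;
      });
    } else {
      chunk.data.forEach((unit) => {
        view.setUint16(offset, unit, true); // true = little-endian
        offset += 2;
      });
    }
  });
  return out;
}
```

The ASCII branch is the part that depends on zero-initialized storage, which is the same reliance called out for decodeUtf16Le in the checks that follow.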

The reported 1MB speedups (roughly 164x to 2006x) are plausible: the fast path avoids a full UTF-8 transcode in both directions. The test cases preserve unpaired surrogates ('A\uD83DB' ↔ [0x41,0x00,0x3d,0xd8,0x42,0x00]), which only works because of this bypass.

Correctness checks

  • Endian handling: if constexpr (std::endian::native == std::endian::little && sizeof(char16_t) == 2) fast-path with memcpy; falls back to manual byte-swap loop. Both paths look right.
  • Odd-length buffers: len / 2 silently truncates → matches Node.js Buffer semantics.
  • ASCII chunk in decodeUtf16Le: only writes low byte, relies on vector::resize() zero-initializing the high byte. Correct but a comment would help future readers.
  • Test fixtures: Comprehensive — Node.js test vectors for ASCII/multibyte/surrogates/sliced buffers, all attributed in PR comments.
  • std::endian / <bit>: requires C++20 ✓ (project mandates this).
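
The endianness and odd-length points can be illustrated with a small TypeScript reference decoder. This is a sketch for clarity, not the PR's C++ code: it reads each code unit explicitly as little-endian and ignores a trailing odd byte, mirroring the `len / 2` truncation. `decodeUtf16le` is a hypothetical name.

```typescript
// Decode UTF-16LE bytes into a JS string without any UTF-8 detour.
// A trailing odd byte is ignored, mirroring the `len / 2` truncation.
function decodeUtf16le(bytes: Uint8Array): string {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const unitCount = bytes.byteLength >> 1; // integer division by 2
  let out = '';
  for (let i = 0; i < unitCount; i++) {
    // Explicit little-endian read, so the result does not depend on host byte order.
    out += String.fromCharCode(view.getUint16(i * 2, true));
  }
  return out;
}
```

Appending one code unit at a time is slow for large inputs; the sketch favors readability over speed. Because each unit is passed through String.fromCharCode unchanged, unpaired surrogates survive the decode.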

Concerns

  1. Manually edits nitrogen/generated/ files marked "DO NOT MODIFY". This works only because the spec was updated to drop those methods, so re-running nitrogen produces matching output. Fragile if anyone re-runs codegen and the manually-removed bits come back. The pre-commit hook does run bob build though, so it should stay in sync.
  2. Inconsistent naming: bufferToJsiString (camelCase) but JsiStringToBuffer (PascalCase). Both should be camelCase.
  3. Tests run unconditionally: utf16le-specific tests don't gate on Hermes/RN version. On RN<0.78 or JSC, the C++ branch is #if-omitted and would throw Unsupported encoding: utf16le — tests would fail. Not a problem for this repo's CI (modern Hermes), but worth noting.
  4. Behavioral change for non-Hermes/older RN: utf16le now falls through to CraftzdogBuffer.from(input, 'utf16le') polyfill (lossy on surrogates) rather than throwing. This is arguably an improvement but is a silent semantics change.
  5. Reinterpret-cast uint8_t* → char16_t*: assumes 2-byte alignment. ArrayBuffer storage is typically aligned, and Uint8Array.slice() produces a fresh aligned ArrayBuffer, so the test case is fine, but worth a defensive note.
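
The alignment concern has a close analogue on the JavaScript side, which may help illustrate it: a Uint16Array view cannot start at an odd byte offset, whereas a DataView can read 16-bit values from any offset. The sketch below is illustrative only; `tryAliasAsUint16` and `copyUtf16leUnits` are hypothetical helper names.

```typescript
// Try to view the bytes in place as 16-bit units. Returns null when the view
// would start at an odd byte offset (the constructor throws a RangeError).
function tryAliasAsUint16(bytes: Uint8Array): Uint16Array | null {
  try {
    return new Uint16Array(bytes.buffer, bytes.byteOffset, bytes.byteLength >> 1);
  } catch {
    return null;
  }
}

// Alignment-safe alternative: copy the units out through a DataView, which
// accepts any byte offset and reads explicitly as little-endian.
function copyUtf16leUnits(bytes: Uint8Array): Uint16Array {
  const count = bytes.byteLength >> 1;
  const units = new Uint16Array(count); // fresh allocation, aligned by construction
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  for (let i = 0; i < count; i++) {
    units[i] = view.getUint16(i * 2, true);
  }
  return units;
}
```

A `subarray(1)` view keeps the odd offset into the original buffer, while `slice(1)` copies into a new buffer whose offset is 0, which matches the observation that the sliced-buffer test case is unaffected.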

Bottom line

Approve with rebase. The implementation is technically sound, the perf wins are real, and the correctness story (preserving unpaired surrogates) is a meaningful improvement over the existing UTF-8 detour. Suggest:

  • Rebase on current main (12 commits behind, including the security audit phases).
  • Rename JsiStringToBuffer → jsiStringToBuffer for consistency.
  • Add a brief comment in decodeUtf16Le ASCII branch explaining the zero-init reliance.
  • Optional: gate utf16le tests on isHermes && RN >= 0.78 if non-Hermes test envs are on the roadmap.

wh201906 commented Apr 28, 2026

  • Rebase on current main (12 commits behind, including the security audit phases).

I've merged the main branch into this PR. There are no merge conflicts.

  • Rename JsiStringToBuffer → jsiStringToBuffer for consistency.

Finished in 397326f

  • Add a brief comment in decodeUtf16Le ASCII branch explaining the zero-init reliance.

Finished in a0c982c

  • Optional: gate utf16le tests on isHermes && RN >= 0.78 if non-Hermes test envs are on the roadmap.

For benchmarking, I use ab2str()/binaryLikeToArrayBuffer() rather than stringToBuffer()/bufferToString(), so the benchmarks won't throw exceptions.
For the tests in the example app, I intentionally use stringToBuffer()/bufferToString() so users can tell when their RN version/runtime does not support the UTF-16 fast path. I'm not sure whether that's a good idea.

wh201906 commented Apr 28, 2026

Reinterpret-cast uint8_t* → char16_t*: assumes 2-byte alignment.

I forget the exact place, but there are comments in the Hermes source saying that unaligned access is fine on modern CPUs.
https://github.com/facebook/hermes/blob/4fcedcc1b3cb3d16c7944faa6df0a942eef114dd/lib/Support/UTF8.cpp#L199-L201

I guess I could allocate a uint16_t[] as the buffer for alignment and then reinterpret it as uint8_t*, but this requires more code and might be less maintainable.
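
For comparison, the "allocate 16-bit storage first, then view it as bytes" idea looks like this when modeled in TypeScript. This is only a sketch of the approach under discussion, not the PR's C++ code; `hostIsLittleEndian` and `encodeUtf16leAligned` are hypothetical names.

```typescript
// True when the host stores the low byte of a 16-bit value first.
function hostIsLittleEndian(): boolean {
  return new Uint8Array(new Uint16Array([1]).buffer)[0] === 1;
}

// Allocate 16-bit storage first so it is aligned by construction, then expose
// the same memory as bytes. Typed arrays use host byte order, so on a
// big-endian host each unit is byte-swapped before it is stored.
function encodeUtf16leAligned(str: string): Uint8Array {
  const littleEndian = hostIsLittleEndian();
  const units = new Uint16Array(str.length);
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    units[i] = littleEndian ? unit : ((unit & 0xff) << 8) | (unit >> 8);
  }
  return new Uint8Array(units.buffer);
}
```

The extra endianness handling is the kind of additional code the comment above is weighing against simply tolerating unaligned access.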

wh201906 force-pushed the wh201906/fast-utf16 branch from 7b3989e to 02d389d on April 28, 2026 13:11

1. REACT_NATIVE_VERSION_xxx macros were introduced in RN v0.79.0, so RNQC_NATIVE_GET_STRING_DATA did not work as expected in the old implementation.
2. Both QuickCrypto.podspec and android/CMakeLists.txt explicitly specify C++20, so it's safe to use concepts.
3. In the old implementation, it was possible for HybridUtils.cpp to fail to import ReactNativeVersion.h and disable the native UTF-16 paths while conversion.ts still tried to access them. This commit fixes that.

bufferToString() doesn't provide an offset (start) argument

wh201906 commented

I made some changes:

a0c982c: Add comments
02d389d: Fix the benchmark. It is expected to run regardless of RN version/runtime, but bufferToString() for utf16le is only available on Hermes with RN v0.79+
6c052fe: Fixed the detection of jsi::String::getStringData(). The original macro-based detection is not available until RN v0.79.
7a3d503: Clean up the test cases. bufferToString() doesn't support offset/start argument so no need to test it like Buffer.toString(encoding, 1)


It's ready for review.

boorad merged commit 391bc6d into margelo:main on Apr 28, 2026
8 checks passed
wh201906 deleted the wh201906/fast-utf16 branch on April 29, 2026 01:07
wh201906 changed the title from "Add fast path for utf16le encoding in stringToBuffer()/bufferToString()" to "feat: Add fast path for utf16le encoding in stringToBuffer()/bufferToString()" on May 5, 2026
wh201906 changed the title from "feat: Add fast path for utf16le encoding in stringToBuffer()/bufferToString()" to "feat(perf): Add fast path for utf16le encoding in stringToBuffer()/bufferToString()" on May 5, 2026