fix(native-preview): preserve lone surrogate string literals #3518

Open
TorinAsakura wants to merge 9 commits into microsoft:main from TorinAsakura:fix/wtf8-surrogate-literals

@TorinAsakura TorinAsakura commented Apr 23, 2026

Fixes #1701

Summary

  • Encode JS string literal text with WTF-8 when it contains lone UTF-16 surrogate escapes
  • Preserve surrogate escapes for template literal spans as well as string literals
  • Decode WTF-8 string data in native-preview APIs and msgpack strings
  • Keep TextDecoder.decode options consistent across WTF-8 decoded segments
  • Bump the binary AST protocol version from 5 to 6
  • Add Go and native-preview regression coverage for surrogate pairs and lone high/low surrogates
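
For readers unfamiliar with WTF-8, the encoding step in the first bullet can be sketched as follows. This is a minimal illustration only: `encodeWtf8` is a hypothetical name, the PR's actual encoder lives on the Go side and is not shown on this page, and the sketch assumes nothing about how the PR detects escapes in raw source. It shows that a well-formed surrogate pair becomes ordinary 4-byte UTF-8, while a lone surrogate is written as the 3-byte form that strict UTF-8 forbids.

```typescript
// Hypothetical helper for illustration; not taken from the PR.
// Encodes a JS string (a sequence of UTF-16 code units) as WTF-8 bytes.
function encodeWtf8(str: string): Uint8Array {
    const out: number[] = [];
    for (let i = 0; i < str.length; i++) {
        let cp = str.charCodeAt(i);
        // Combine a well-formed surrogate pair into one supplementary code point.
        if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < str.length) {
            const lo = str.charCodeAt(i + 1);
            if (lo >= 0xdc00 && lo <= 0xdfff) {
                cp = 0x10000 + ((cp - 0xd800) << 10) + (lo - 0xdc00);
                i++;
            }
        }
        if (cp < 0x80) {
            out.push(cp);
        }
        else if (cp < 0x800) {
            out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
        }
        else if (cp < 0x10000) {
            // Lone surrogates (U+D800..U+DFFF) fall through to this branch unchanged.
            out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
        }
        else {
            out.push(0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
        }
    }
    return Uint8Array.from(out);
}

const loneHigh = encodeWtf8("\ud800");       // bytes: ED A0 80
const pair = encodeWtf8("\ud83d\ude00");     // bytes: F0 9F 98 80 (U+1F600)
```

Pairs are combined before encoding so that output for well-formed strings is byte-identical to UTF-8; only unpaired surrogates produce the WTF-8-specific sequences.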

Tests

  • npx hereby tsgo:build
  • go test ./internal/api/encoder
  • npm run -w @typescript/native-preview node -- --test test/wtf8.test.ts test/async/api.test.ts test/sync/api.test.ts
  • npx hereby lint
  • TSGO_HEREBY_NOEMBED=true npx hereby lint
Verification logs
$ git rev-parse HEAD
a10df38001ee4c5169683e00710fda35b12d3a82

$ npx hereby tsgo:build
Using ./Herebyfile.mjs to run tsgo:build
Starting tsgo:build
[01:37:04.557] [0] $ go build '-tags=noembed' -o ./built/local/ ./cmd/tsgo
[01:37:20.315] [0] ok (done in 15.7s)
Finished tsgo:build in 15.7s
Completed tsgo:build in 15.7s

$ go test ./internal/api/encoder
ok  	github.com/microsoft/typescript-go/internal/api/encoder	(cached)

$ npm run -w @typescript/native-preview node -- --test test/wtf8.test.ts test/async/api.test.ts test/sync/api.test.ts
> @typescript/native-preview@0.0.0 node
> node --experimental-strip-types --no-warnings --conditions @typescript/source --test test/wtf8.test.ts test/async/api.test.ts test/sync/api.test.ts

ok unicode escapes (121.38975ms)
ok template unicode escapes (112.558709ms)
▶ Wtf8Decoder
  ok decodes standard UTF-8 (1.724417ms)
  ok preserves WTF-8 encoded lone surrogates (3.5715ms)
ok Wtf8Decoder (7.294292ms)
 tests 202
 suites 63
 pass 202
 fail 0
 cancelled 0
 skipped 0
 todo 0
 duration_ms 42181.814

$ npx hereby lint
Using ./Herebyfile.mjs to run lint
Starting lint
[01:37:40.271] [0] $ ./_tools/custom-gcl run
0 issues.
[01:37:50.933] [0] ok (done in 10.6s)
Linting _tools
[01:37:50.936] [1] $ ./_tools/custom-gcl run
0 issues.
[01:37:51.778] [1] ok (done in 843ms)
Finished lint in 11.5s
Completed lint in 11.5s

$ TSGO_HEREBY_NOEMBED=true npx hereby lint
Using ./Herebyfile.mjs to run lint
Starting lint
[01:37:40.270] [0] $ ./_tools/custom-gcl run --build-tags noembed
0 issues.
[01:37:47.083] [0] ok (done in 6.8s)
Linting _tools
[01:37:47.086] [1] $ ./_tools/custom-gcl run --build-tags noembed
0 issues.
[01:37:48.082] [1] ok (done in 997ms)
Finished lint in 7.8s
Completed lint in 7.8s

@TorinAsakura TorinAsakura force-pushed the fix/wtf8-surrogate-literals branch from d573d96 to 5168a98 April 23, 2026 02:27
@TorinAsakura TorinAsakura changed the title from "Preserve lone surrogates in native preview string literals" to "Fix native preview lone surrogate string literals" Apr 23, 2026
@TorinAsakura TorinAsakura changed the title from "Fix native preview lone surrogate string literals" to "fix(native-preview): preserve lone surrogate string literals" Apr 23, 2026
@TorinAsakura TorinAsakura force-pushed the fix/wtf8-surrogate-literals branch from 5168a98 to fcce776 April 23, 2026 02:29
@TorinAsakura

This comment was marked as outdated.

Copilot AI review requested due to automatic review settings April 30, 2026 20:59

Copilot AI left a comment

Pull request overview

This PR addresses lossy handling of JS string literals containing lone UTF-16 surrogate escapes by encoding such strings as WTF-8 in the binary AST protocol, updating native-preview decoding accordingly, and bumping the protocol version.

Changes:

  • Add Go-side reconstruction of literal text from raw source and encode lone surrogates as WTF-8 bytes when emitting binary AST strings.
  • Update native-preview to decode protocol/msgpack strings with a WTF-8-aware decoder and bump protocol version from 5 to 6.
  • Add regression tests in both Go and native-preview for surrogate pairs and lone surrogates.
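
The decoding side described in the second bullet can be illustrated with a short sketch. It is not the PR's `Wtf8Decoder`, whose source is not shown on this page; `decodeWtf8` is a hypothetical name, and the sketch assumes well-formed WTF-8 input with no validation or streaming support.

```typescript
// Hypothetical sketch; the PR's Wtf8Decoder may be structured differently.
// Assumes well-formed WTF-8 input and performs no validation.
function decodeWtf8(bytes: Uint8Array): string {
    let result = "";
    let i = 0;
    while (i < bytes.length) {
        const b0 = bytes[i];
        let cp: number;
        if (b0 < 0x80) {
            cp = b0;
            i += 1;
        }
        else if (b0 < 0xe0) {
            cp = ((b0 & 0x1f) << 6) | (bytes[i + 1] & 0x3f);
            i += 2;
        }
        else if (b0 < 0xf0) {
            // Three-byte sequences include WTF-8 encoded lone surrogates (ED A0..BF xx).
            cp = ((b0 & 0x0f) << 12) | ((bytes[i + 1] & 0x3f) << 6) | (bytes[i + 2] & 0x3f);
            i += 3;
        }
        else {
            cp = ((b0 & 0x07) << 18) | ((bytes[i + 1] & 0x3f) << 12) | ((bytes[i + 2] & 0x3f) << 6) | (bytes[i + 3] & 0x3f);
            i += 4;
        }
        if (cp >= 0x10000) {
            // Supplementary code points become a UTF-16 surrogate pair.
            const offset = cp - 0x10000;
            result += String.fromCharCode(0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff));
        }
        else {
            // BMP code points, including lone surrogates, map to one code unit.
            result += String.fromCharCode(cp);
        }
    }
    return result;
}

const decodedLone = decodeWtf8(Uint8Array.from([0x61, 0xed, 0xa0, 0x80])); // "a\ud800"
```

A standard `TextDecoder` would instead replace the `ED A0 80` sequence with U+FFFD, which is the lossy behavior this PR targets.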

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| internal/api/encoder/literal_text.go | Adds escape decoding + WTF-8 emission for surrogate code units based on raw literal text. |
| internal/api/encoder/encoder.go | Bumps protocol version to 6, documents WTF-8, and uses new literal-text encoding for string/template literals. |
| internal/api/encoder/encoder_test.go | Adds a Go regression test asserting WTF-8 bytes are preserved in encoded string literal text. |
| _packages/native-preview/src/api/node/wtf8.ts | Introduces a WTF-8-capable decoder to preserve lone surrogates when decoding bytes to JS strings. |
| _packages/native-preview/src/api/sync/api.ts | Switches RemoteSourceFile decoding from TextDecoder to Wtf8Decoder. |
| _packages/native-preview/src/api/async/api.ts | Switches RemoteSourceFile decoding from TextDecoder to Wtf8Decoder. |
| _packages/native-preview/src/api/node/node.ts | Uses Wtf8Decoder when decoding node payloads into RemoteSourceFile. |
| _packages/native-preview/src/api/node/msgpack.ts | Uses Wtf8Decoder for msgpack string decoding. |
| _packages/native-preview/src/api/node/protocol.ts | Updates native-preview protocol version constant to 6. |
| _packages/native-preview/test/wtf8.test.ts | Adds direct unit coverage for WTF-8 decoding behavior. |
| _packages/native-preview/test/sync/api.test.ts | Adds sync API regression coverage for a string literal containing lone surrogate escapes. |
| _packages/native-preview/test/async/api.test.ts | Adds async API regression coverage for a string literal containing lone surrogate escapes. |
| _packages/native-preview/test/encoder.test.ts | Updates tests to expect protocol version 6. |
Comments suppressed due to low confidence (1)

_packages/native-preview/src/api/node/msgpack.ts:114

  • MsgpackReader now decodes strings with Wtf8Decoder, but MsgpackWriter still encodes strings with TextEncoder, which replaces lone surrogates with U+FFFD. If any msgpack string payloads can contain lone surrogates (or if callers round-trip data that now preserves them), this will be lossy. Consider introducing a WTF-8 encoder (paired with Wtf8Decoder) and using it in MsgpackWriter.writeString (and other protocol string encoders like the AST StringTable) to keep the protocol symmetric.
const encoder = new TextEncoder();
const decoder = new Wtf8Decoder();

export class MsgpackWriter {
    private buf: Uint8Array;
    private view: DataView;
    private pos: number;

    constructor(initialSize = 256) {
        this.buf = new Uint8Array(initialSize);
        this.view = new DataView(this.buf.buffer);
        this.pos = 0;
    }

    private ensure(n: number): void {
        if (this.pos + n > this.buf.length) {
            let newSize = this.buf.length * 2;
            while (newSize < this.pos + n) newSize *= 2;
            const next = new Uint8Array(newSize);
            next.set(this.buf);
            this.buf = next;
            this.view = new DataView(this.buf.buffer);
        }
    }

    writeArrayHeader(length: number): void {
        if (length <= 0x0f) {
            this.ensure(1);
            this.buf[this.pos++] = 0x90 | length;
        }
        else if (length <= 0xffff) {
            this.ensure(3);
            this.buf[this.pos++] = 0xdc;
            this.view.setUint16(this.pos, length, false);
            this.pos += 2;
        }
        else {
            this.ensure(5);
            this.buf[this.pos++] = 0xdd;
            this.view.setUint32(this.pos, length, false);
            this.pos += 4;
        }
    }

    writeUint(value: number): void {
        if (value <= 0x7f) {
            this.ensure(1);
            this.buf[this.pos++] = value;
        }
        else if (value <= 0xff) {
            this.ensure(2);
            this.buf[this.pos++] = 0xcc;
            this.buf[this.pos++] = value;
        }
        else if (value <= 0xffff) {
            this.ensure(3);
            this.buf[this.pos++] = 0xcd;
            this.view.setUint16(this.pos, value, false);
            this.pos += 2;
        }
        else {
            this.ensure(5);
            this.buf[this.pos++] = 0xce;
            this.view.setUint32(this.pos, value, false);
            this.pos += 4;
        }
    }

    writeString(str: string): void {
        const encoded = encoder.encode(str);
        const len = encoded.length;
        if (len <= 0x1f) {
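
The asymmetry raised in the suppressed comment above can be reproduced in isolation. The snippet below is a small demonstration, not code from the PR: it uses only the standard `TextEncoder` and `TextDecoder` globals and shows that encoding a lone surrogate with `TextEncoder` yields the bytes for U+FFFD, so the original code unit cannot be recovered by any decoder.

```typescript
// Standalone demonstration; variable names are illustrative only.
const loneSurrogate = "\ud800";

// TextEncoder converts its input to well-formed Unicode first,
// so the lone surrogate is replaced before any bytes are written.
const lossyBytes = new TextEncoder().encode(loneSurrogate); // bytes: EF BF BD

// Decoding returns the replacement character, not the original code unit.
const roundTripped = new TextDecoder().decode(lossyBytes); // "\ufffd"
const preserved = roundTripped === loneSurrogate;          // false
```

Whether this matters for `MsgpackWriter` depends on whether any client-to-server string payload can contain a lone surrogate, which this page does not establish.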

Comment thread internal/api/encoder/literal_text.go
Comment thread internal/api/encoder/encoder.go
@TorinAsakura TorinAsakura force-pushed the fix/wtf8-surrogate-literals branch from e7cbc0d to adf8f61 April 30, 2026 21:57
@TorinAsakura TorinAsakura requested a review from Copilot April 30, 2026 22:14

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Comment thread internal/api/encoder/literal_text.go Outdated
Comment thread internal/api/encoder/literal_text.go
Comment thread _packages/native-preview/src/api/node/wtf8.ts Outdated
Comment thread _packages/native-preview/src/api/node/wtf8.ts Outdated
Comment thread _packages/native-preview/src/api/node/wtf8.ts Outdated
Comment thread _packages/native-preview/src/api/node/wtf8.ts Outdated
Comment thread _packages/native-preview/src/api/node/wtf8.ts Outdated

Development

Successfully merging this pull request may close these issues.

surrogate pair and lone surrogate support in stringLiteral

2 participants