Improve TextLine and line table performance by packing existing data into unused bits by CyrusNajmabadi · Pull Request #83000 · dotnet/roslyn

CyrusNajmabadi · 2026-03-31T21:38:41Z

Background

TextLine is a readonly struct returned whenever code asks "what line is position X on?" — which happens constantly during compilation, IDE features, and diagnostics. The line table (SourceText.LineInfo) is consulted for every such lookup.

What this PR does

`TextLine._data` — merge two `int` fields into one `ulong`

Previously TextLine stored _start: int and _endIncludingBreaks: int as separate fields. This PR replaces them with a single ulong _data encoding:

| Bits | Field |
| --- | --- |
| 63–62 | line break length (0, 1, or 2) |
| 61–31 | start position (31 bits) |
| 30–0 | total length = `EndIncludingLineBreak − Start` (31 bits) |

This makes EndIncludingLineBreak a trivial Start + length with no branching, and End a simple EndIncludingLineBreak − breakLen. Previously, End required calling TextUtilities.GetStartAndLengthOfLineBreakEndingAt on every access to determine the line break length.
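For readers who want to see the layout in code, here is a minimal sketch of the 2 + 31 + 31 packing from the table above. It is written as C# top-level statements with local functions; the names `Pack`, `StartShift`, `BreakShift`, and `Mask31` are made up for illustration and are not necessarily what `TextLine` uses.

```csharp
using System;

// Hypothetical constants for the layout in the table above.
const int StartShift = 31;               // start sits directly above the 31-bit length
const int BreakShift = 62;               // break length occupies bits 63-62
const ulong Mask31 = (1UL << 31) - 1;    // low 31 bits set

// Pack (start, length including the break, break length) into one 64-bit value.
ulong Pack(int start, int lengthIncludingBreak, int breakLength)
{
    if (start < 0 || lengthIncludingBreak < 0 ||
        breakLength < 0 || breakLength > 2 || breakLength > lengthIncludingBreak)
    {
        throw new ArgumentOutOfRangeException();
    }

    // Convert through uint so the widening to ulong zero-extends.
    return ((ulong)(uint)breakLength << BreakShift)
         | ((ulong)(uint)start << StartShift)
         | (uint)lengthIncludingBreak;
}

int Start(ulong data) => (int)((data >> StartShift) & Mask31);
int LengthIncludingBreak(ulong data) => (int)(data & Mask31);
int BreakLength(ulong data) => (int)(data >> BreakShift);

// Both end properties are plain arithmetic on the decoded fields, no text inspection.
int EndIncludingLineBreak(ulong data) => Start(data) + LengthIncludingBreak(data);
int End(ulong data) => EndIncludingLineBreak(data) - BreakLength(data);
```

For example, the line `abc\r\n` starting at position 10 would be packed as `Pack(10, 5, 2)`, which decodes to `End == 13` and `EndIncludingLineBreak == 15`.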

`_lineStarts` — pack prior line break length into each entry

SourceText.LineInfo stores line start positions in a SegmentedList. These were plain ints. Since source positions are non-negative ints, they always fit in 31 bits, which leaves the top bit of a uint free.

Each _lineStarts[i] (for i > 0) records where a line started — which is always immediately after a line break. That prior line break is always length 1 or 2, never 0. This allows bias-1 encoding in a single top bit: 0 = length 1, 1 = length 2. The first entry (_lineStarts[0]) always starts at position 0 and its top bit is never read for break length purposes, since the indexer reads break length from _lineStarts[index + 1].

Result: LineInfo's indexer can call TextLine.FromSpanUnsafe(..., lineBreakLength) directly instead of leaving the length unknown and recomputing it on first access.
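A rough sketch of the bias-1 entry encoding and how an indexer could use it. `PackEntry` and `GetLineStart` borrow the helper names from the commit list, but their bodies here are guesses; `GetPriorBreakLength`, `GetLine`, and the plain `uint[]` standing in for the SegmentedList are assumptions for illustration only.

```csharp
using System;

const uint BreakBit = 1u << 31;        // top bit: (prior line break length - 1)
const uint StartMask = BreakBit - 1;   // low 31 bits: line start position

uint PackEntry(int lineStart, int priorLineBreakLength)
{
    if (lineStart < 0)
        throw new ArgumentOutOfRangeException(nameof(lineStart));
    if (priorLineBreakLength is not (1 or 2))
        throw new ArgumentOutOfRangeException(nameof(priorLineBreakLength));

    return (uint)lineStart | (priorLineBreakLength == 2 ? BreakBit : 0u);
}

int GetLineStart(uint entry) => (int)(entry & StartMask);
int GetPriorBreakLength(uint entry) => (int)(entry >> 31) + 1;

// Line i spans [start(i), start(i + 1)); its break length is stored in entry i + 1.
// Assumption: the last line has no following entry, so it ends at the end of the
// text with a break length of 0.
(int Start, int EndIncludingBreak, int BreakLength) GetLine(uint[] lineStarts, int textLength, int index)
{
    int start = GetLineStart(lineStarts[index]);
    if (index == lineStarts.Length - 1)
        return (start, textLength, 0);

    uint next = lineStarts[index + 1];
    return (start, GetLineStart(next), GetPriorBreakLength(next));
}
```

For the text `ab\ncd\r\nef` (length 9) the table would be `{ 0, PackEntry(3, 1), PackEntry(7, 2) }`, and `GetLine` on index 1 would yield start 3, end including break 7, and break length 2.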

No loss of representable values

- TextLine previously used two ints; it now uses one ulong. The 31-bit start and 31-bit length fields cover the full non-negative int range.
- _lineStarts previously stored start positions as 32-bit ints; it now stores them in 31 bits. SourceText lengths and positions are ints, so a text holds at most int.MaxValue (2^31 − 1) characters, and 31 bits cover that range exactly.

Testing

Added a comprehensive test suite (TextLineNewLineTests) that exercises line parsing across all four SourceText implementations (StringText, LargeText, SubText, CompositeText) and every supported line break type (\n, \r, \r\n, \u0085, \u2028, \u2029).

The tests are structured in layers:

- Ground truth tests verify exact Start/End/EndIncludingLineBreak values for simple scenarios (newline at start, middle, end, consecutive newlines, mixed newlines, etc.), parameterized across all text kinds and all newline types, so every assertion runs against every implementation.
- Cross-type consistency tests take 25 interesting content patterns and compare every pair of text kinds against each other, catching any disagreement between implementations even if the expected values above were somehow wrong.
- Implementation-specific edge cases target the tricky boundaries unique to each type:
  - LargeText tests cover every chunk boundary position, including \r\n split across chunks and single-character-per-chunk scenarios.
  - SubText tests exhaustively check every possible substring of several content patterns, verifying against StringText as the reference.
  - CompositeText tests try every possible 2-way and 3-way split point for each content pattern, including cases where \r\n spans segment boundaries.

CyrusNajmabadi · 2026-03-31T22:11:03Z

@ToddGrun to review.

Comprehensive Theory-based tests validating Start, End, EndIncludingLineBreak, Span, and SpanIncludingLineBreak for every supported line break sequence (\n, \r, \r\n, \u0085, \u2028, \u2029) across StringText, LargeText, SubText, and CompositeText. Tests are parameterized over a TextKind enum so every ground-truth assertion is verified against all four SourceText implementations. A cross-type consistency test checks all 16 implementation pairs agree on line structure for a broad set of content patterns. Type-specific edge cases cover LargeText chunk boundaries, SubText CRLF splitting, and CompositeText segment boundaries.

Encodes start (31 bits), line length (30 bits), and line break length (3 bits) into one 64-bit field, making End and EndIncludingLineBreak O(1) for lines created via FromSpan.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The 30-bit length field now always stores EndIncludingLineBreak - Start, matching the natural semantics of both FromSpan and FromSpanUnsafe. Bit 63 is now a standalone known flag with bits 62-61 for break length, making EndIncludingLineBreak unconditionally O(1).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Cast int operands through uint before ulong to prevent sign extension before bitwise-or operations.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
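A small illustration of why the intermediate uint cast matters. Converting an int straight to ulong sign-extends, so a negative value would set all 32 upper bits and clobber neighbouring fields when or-ed in. The warning being fixed is likely the C# compiler's CS0675 (bitwise-or on a sign-extended operand), though the commit does not name it. The function names are hypothetical.

```csharp
using System;

// Direct int -> ulong: the value is sign-extended to 64 bits first.
ulong ViaInt(int value) => unchecked((ulong)value);

// int -> uint -> ulong: the 32-bit pattern is kept and the upper half is zero.
ulong ViaUInt(int value) => unchecked((ulong)(uint)value);

Console.WriteLine($"{ViaInt(-1):X16}");    // upper 32 bits all set
Console.WriteLine($"{ViaUInt(-1):X16}");   // upper 32 bits clear
```

For the non-negative positions and lengths that TextLine actually stores, both conversions produce the same bits, so the cast mostly documents intent and keeps the compiler quiet.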

…ubText
Adds a FromSpanUnsafe overload that accepts a known line break length, avoiding the deferred text inspection for callers that already have the information. Updates CompositeText (derived from the last segment's line) and both SubText call sites (zero for the split-CRLF sentinel line; computed via overlap of the underlying line's break range with the subtext's span for the general case).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
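The "overlap" computation for the general SubText case can be pictured as a plain interval intersection. This is only a sketch of that idea: the function name and parameters are hypothetical, it works in the underlying text's coordinates, and it does not model the split-CRLF sentinel line mentioned above.

```csharp
using System;

// How many characters of the underlying line's break, [lineEnd, lineEndIncludingBreak),
// fall inside the subtext span [spanStart, spanEnd)?
int VisibleBreakLength(int lineEnd, int lineEndIncludingBreak, int spanStart, int spanEnd)
{
    int overlapStart = Math.Max(lineEnd, spanStart);
    int overlapEnd = Math.Min(lineEndIncludingBreak, spanEnd);
    return Math.Max(0, overlapEnd - overlapStart);   // empty intersection means no visible break
}
```

With an underlying line `ab\r\n` (End 2, EndIncludingLineBreak 4), a span of [0, 10) sees the whole break (2), a span of [0, 3) sees only the \r (1), and a span of [0, 2) sees none of it (0).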

Changes _lineStarts from SegmentedList<int> to SegmentedList<uint>. Each entry uses the top 2 bits to store the line break length of the prior line and the bottom 30 bits for the line start position. This lets the LineInfo indexer pass a known line break length to FromSpanUnsafe, avoiding deferred computation on every line access. ParseLineStarts is updated to encode the break length when appending each new line start, including the cross-buffer CR+LF case. IndexOf is updated with a manual binary search that masks the top bits.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
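To show why IndexOf needs a hand-written search: once tag bits live in the top of each entry, the raw uint values are no longer ordered by line start, so comparing raw values would go wrong. A sketch of a masked binary search follows, assuming the final 1 + 31 bit layout rather than the 2 + 30 layout this commit used, with a `uint[]` standing in for the SegmentedList and a hypothetical name `IndexOfLine`.

```csharp
using System;

const uint StartMask = (1u << 31) - 1;   // strips the tag bit, leaving the line start

// Returns the index of the line containing `position`, i.e. the last entry whose
// masked start is <= position. Assumes a non-empty table whose first start is 0
// and a non-negative position.
int IndexOfLine(uint[] lineStarts, int position)
{
    int lo = 0, hi = lineStarts.Length - 1;
    while (lo < hi)
    {
        int mid = lo + (hi - lo + 1) / 2;   // upper middle, so `lo = mid` always makes progress
        if ((int)(lineStarts[mid] & StartMask) <= position)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}
```

For a table `{ 0, 3, 7 | 0x80000000, 20 }`, position 6 maps to line 1 and position 7 maps to line 2, even though the raw value of entry 2 is larger than that of entry 3.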

With the unknown case eliminated, the _data encoding drops from 1+2+31+30 to 2+31+31 bits. Removes KnownShift/KnownMask, unifies StartMask and LengthMask into a single PositionMask, and eliminates the deferred-computation path in LineBreakLength entirely. Equals now compares _data directly; GetHashCode hashes both halves of _data.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-authored-by: Cyrus Najmabadi <cyrus.najmabadi@gmail.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…SubText Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Every line entry at index > 0 is created when a line break is found, so the prior break length is always 1 or 2, never 0. This allows bias-1 encoding: store (length - 1) in a single top bit, and read back with (bit + 1). The freed bit extends line start positions from 30 to 31 bits, matching the full range of non-negative int values and removing any theoretical limit on representable source positions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…eText Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Matches the SourceText encoding: each entry packs the line start in the low 31 bits and the prior line break length (bias-1) in the top bit. Instead of the goto-based pattern that added current-line starts on each break, now adds next-line starts at break time (matching SourceText's approach). For \r, adds a provisional length-1 entry immediately; if \n follows, upgrades the entry in-place to length-2 with the position advanced past the \n. This keeps the fast path clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
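The provisional-entry trick for \r can be sketched roughly as below. This is not the LargeText code: `ParseLineStarts` here is a simplified stand-in that takes chunks as strings, uses a `List<uint>` instead of a SegmentedList, and carries a single flag across chunk boundaries so that \r\n split between two chunks is still treated as one break.

```csharp
using System;
using System.Collections.Generic;

const uint BreakBit = 1u << 31;   // top bit set means the prior break had length 2

uint PackEntry(int lineStart, int priorBreakLength) =>
    (uint)lineStart | (priorBreakLength == 2 ? BreakBit : 0u);

List<uint> ParseLineStarts(IEnumerable<string> chunks)
{
    var starts = new List<uint> { 0u };   // line 0 always starts at position 0
    int position = 0;
    bool lastWasCR = false;               // survives across chunk boundaries

    foreach (string chunk in chunks)
    {
        foreach (char c in chunk)
        {
            position++;                   // now points just past c
            if (c == '\n' && lastWasCR)
            {
                // Upgrade the provisional \r entry in place: length 2, start past the \n.
                starts[starts.Count - 1] = PackEntry(position, 2);
            }
            else if (c is '\r' or '\n' or '\u0085' or '\u2028' or '\u2029')
            {
                // Provisional for \r, final for every other break character.
                starts.Add(PackEntry(position, 1));
            }
            lastWasCR = c == '\r';
        }
    }
    return starts;
}
```

Parsing `a\r\nb\rc\n` gives starts 0, 3 (break length 2), 5, and 7, whether the text arrives as one chunk or is split between the \r and the \n.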

- Restore UTF-8 BOM on SourceText.cs, TextLine.cs, LargeText.cs
- Use >>> (unsigned right shift) instead of >> for clarity
- Add blank line before SubText break-region comment
- Break confusing Math.Max expression into named intermediates
- Fix xUnit1026: use InlineData for chunk boundary tests

CyrusNajmabadi · 2026-04-13T08:34:38Z

could i get reviews on this? @phil-allen-msft ?

CyrusNajmabadi · 2026-04-13T08:34:43Z

@jjonescz ptal :)

The mask strips tag bits to yield the raw value, not a "position".

CyrusNajmabadi · 2026-04-13T15:37:13Z

@dotnet/roslyn-compiler @jcouv for another set of eyes. thanks :)

333fred

One small nit, otherwise looking good.

Co-authored-by: Fred Silberberg <fred@silberberg.xyz>

…/roslyn into textline-encoding

CyrusNajmabadi · 2026-04-16T06:11:19Z

Thanks @333fred . Fixed.

CyrusNajmabadi · 2026-04-16T14:45:20Z

This is good to merge whenever you are happy @333fred

CyrusNajmabadi · 2026-04-16T17:49:42Z

Thanks! I got a nice clean round number :)

…ture
* upstream/main: (77 commits)
  Fix ArgumentNullException in VB Edit and Continue (dotnet#83250)
  Fix property pattern completion filtering out member being edited (dotnet#83230)
  Add branch merge skill (dotnet#83229)
  [main] Source code updates from dotnet/dotnet (dotnet#83215)
  Support MatchPriority comparison in LSP completion (dotnet#83164)
  Have CompleteStatement handle EOF statements (dotnet#83205)
  Minor cleanups related to attributes in VB (dotnet#83206)
  Simplify
  Address additional PR feedback
  Port remaining unit test projects to Linux (dotnet#83153)
  Unsafe evolution: allow unsafe property accessors (dotnet#83115)
  Address PR review feedback
  Allow cohost rename in Razor source-generated docs
  Refine code review skill (dotnet#82666)
  Review feedback
  Allow creation of DocumentUri instances even if System.Uri cannot parse it
  Improve TextLine and line table performance by packing existing data into unused bits (dotnet#83000)
  Skip TestFindReferencesAsync_UsingAlias on non-Windows platforms (dotnet#83188)
  [main] Source code updates from dotnet/dotnet (dotnet#83174)
  fix comment
  ...

github-actions Bot added the Area-Compilers label Mar 31, 2026

dotnet-policy-service Bot added the Community label (the pull request was submitted by a contributor who is not a Microsoft employee) Mar 31, 2026

CyrusNajmabadi commented Mar 31, 2026

Comment thread src/Compilers/Core/Portable/Text/CompositeText.cs Outdated

CyrusNajmabadi commented Mar 31, 2026

Comment thread src/Compilers/Core/Portable/Text/CompositeText.cs

CyrusNajmabadi marked this pull request as ready for review March 31, 2026 22:07

CyrusNajmabadi requested a review from a team as a code owner March 31, 2026 22:07

davidwengier reviewed Mar 31, 2026

Comment thread src/Compilers/Core/Portable/Text/SourceText.cs Outdated

CyrusNajmabadi changed the title ~~Experiments with TextLine encoding.~~ Improve TextLine and line table performance by packing existing data into unused bits Mar 31, 2026

CyrusNajmabadi commented Mar 31, 2026

Comment thread src/Compilers/Core/Portable/Text/SourceText.cs Outdated

CyrusNajmabadi requested a review from davidwengier March 31, 2026 22:34

Cyrus Najmabadi and others added 19 commits April 1, 2026 12:20

Replace magic numbers in TextLine with named constants

9475add

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix sign-extension warnings in _data packing

29aca77

Cast int operands through uint before ulong to prevent sign extension before bitwise-or operations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove original FromSpanUnsafe.

fd34481

Simplify

ae2b332

Apply suggestions from code review

7768e2a

Co-authored-by: Cyrus Najmabadi <cyrus.najmabadi@gmail.com>

Remove .claude/settings.local.json; add .claude/ to .gitignore

4bbb661

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extract GetLineStart helper in LineInfo to reduce repetition

ef9db37

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extract PackEntry helper in LineInfo; add comment on lineBreakLen in …

3796b78

…SubText Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Apply suggestion from @CyrusNajmabadi

345e889

Add internal LineBreakLength property to TextLine; use it in Composit…

6ae1a2a

…eText Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Assert priorLineBreakLength is 1 or 2 in PackEntry

680d7cd

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

CyrusNajmabadi requested a review from ToddGrun April 2, 2026 20:17

Cyrus Najmabadi added 2 commits April 2, 2026 22:19

Restore UTF-8 BOM on SourceText.cs and TextLine.cs

b010ba2

Restore UTF-8 BOM on SubText.cs

4e93a23

jjonescz approved these changes Apr 13, 2026

Comment thread src/Compilers/Core/Portable/Text/TextLine.cs Outdated

Comment thread .gitignore

CyrusNajmabadi commented Apr 13, 2026

Comment thread .gitignore

CyrusNajmabadi and others added 3 commits April 13, 2026 13:31

Apply suggestion from @CyrusNajmabadi

4e0ec2d

Merge remote-tracking branch 'upstream/main' into textline-encoding

db196a8

Rename PositionMask to RawValueMask for clarity

e8869d4

The mask strips tag bits to yield the raw value, not a "position".

333fred reviewed Apr 14, 2026

Comment thread src/Compilers/Core/Portable/Text/SourceText.cs Outdated

CyrusNajmabadi and others added 3 commits April 16, 2026 08:10

Update src/Compilers/Core/Portable/Text/SourceText.cs

1e11358

Co-authored-by: Fred Silberberg <fred@silberberg.xyz>

Merge remote-tracking branch 'upstream/main' into textline-encoding

0e2af6b

CyrusNajmabadi requested a review from 333fred April 16, 2026 06:11

333fred approved these changes Apr 16, 2026

333fred merged commit 0344ff9 into dotnet:main Apr 16, 2026
28 checks passed

dotnet-policy-service Bot added this to the Next milestone Apr 16, 2026

CyrusNajmabadi deleted the textline-encoding branch April 16, 2026 17:49

This was referenced Apr 16, 2026

[release/11.0.1xx-preview3] Source code updates from dotnet/roslyn dotnet/dotnet#5939

Closed

[main] Source code updates from dotnet/roslyn dotnet/dotnet#6091

Merged

dotnet-bot mentioned this pull request Apr 19, 2026

[Automated] PRs inserted in VS build main-11718.31 #83238

Closed

dotnet-bot mentioned this pull request Apr 22, 2026

[Automated] PRs inserted in VS build dev.alanren.whatsnew-ssms-11721.499 #83294

Closed

dotnet-maestro Bot mentioned this pull request Apr 28, 2026

[release/10.0.3xx] Source code updates from dotnet/roslyn dotnet/dotnet#6340

Closed

333fred merged 39 commits into dotnet:main from CyrusNajmabadi:textline-encoding


5 participants
