Improve TextLine and line table performance by packing existing data into unused bits #83000

Merged: 333fred merged 39 commits into dotnet:main from CyrusNajmabadi:textline-encoding on Apr 16, 2026

@CyrusNajmabadi
Contributor

@CyrusNajmabadi CyrusNajmabadi commented Mar 31, 2026

Background

TextLine is a readonly struct returned whenever code asks "what line is position X on?" — which happens constantly during compilation, IDE features, and diagnostics. The line table (SourceText.LineInfo) is consulted for every such lookup.

What this PR does

TextLine._data — merge two int fields into one ulong

Previously TextLine stored _start: int and _endIncludingBreaks: int as separate fields. This PR replaces them with a single ulong _data encoding:

| Bits | Field |
|---|---|
| 63–62 | line break length (0, 1, or 2) |
| 61–31 | start position (31 bits) |
| 30–0 | total length = EndIncludingLineBreak − Start (31 bits) |

This makes EndIncludingLineBreak a trivial Start + length with no branching, and End a simple EndIncludingLineBreak − breakLen. Previously, End required calling TextUtilities.GetStartAndLengthOfLineBreakEndingAt on every access to determine the line break length.
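
A minimal sketch of the layout in the table above, to make the bit arithmetic concrete. The type name `PackedLine`, its constructor, and the constant names are hypothetical stand-ins for illustration and are not taken from the Roslyn source; only the bit positions come from the description.

```csharp
using System;

// Hypothetical stand-in for the packed layout described above; not the Roslyn TextLine.
public readonly struct PackedLine
{
    private const int BreakShift = 62;                 // bits 63-62: line break length
    private const int StartShift = 31;                 // bits 61-31: start position
    private const ulong PositionMask = 0x7FFF_FFFFUL;  // 31 one-bits

    private readonly ulong _data;

    public PackedLine(int start, int lengthIncludingBreak, int breakLength)
    {
        if (start < 0)
            throw new ArgumentOutOfRangeException(nameof(start));
        if (lengthIncludingBreak < 0 || start > int.MaxValue - lengthIncludingBreak)
            throw new ArgumentOutOfRangeException(nameof(lengthIncludingBreak));
        if (breakLength < 0 || breakLength > 2 || breakLength > lengthIncludingBreak)
            throw new ArgumentOutOfRangeException(nameof(breakLength));

        // Cast through uint before ulong: in an unchecked context a direct
        // (ulong) cast of a negative int sign-extends into the upper 32 bits.
        // The values are validated above, so this is purely defensive here.
        _data = ((ulong)(uint)breakLength << BreakShift)
              | ((ulong)(uint)start << StartShift)
              | (ulong)(uint)lengthIncludingBreak;
    }

    public int Start => (int)((_data >> StartShift) & PositionMask);

    public int LineBreakLength => (int)(_data >> BreakShift);

    // Start + length, with no branching and no text inspection.
    public int EndIncludingLineBreak => Start + (int)(_data & PositionMask);

    public int End => EndIncludingLineBreak - LineBreakLength;
}
```

For the first line of "ab\r\ncd" one would construct `new PackedLine(0, 4, 2)`; all four properties are then computed from `_data` alone.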

_lineStarts — pack prior line break length into each entry

SourceText.LineInfo stores line start positions in a SegmentedList. Previously the entries were plain ints (SegmentedList<int>). Since source positions are always non-negative and fit in 31 bits, the top bit of each entry is free once the element type is uint.

Each _lineStarts[i] (for i > 0) records where a line started — which is always immediately after a line break. That prior line break is always length 1 or 2, never 0. This allows bias-1 encoding in a single top bit: 0 = length 1, 1 = length 2. The first entry (_lineStarts[0]) always starts at position 0 and its top bit is never read for break length purposes, since the indexer reads break length from _lineStarts[index + 1].

Result: LineInfo's indexer can call TextLine.FromSpanUnsafe(..., lineBreakLength) directly instead of leaving the length unknown and recomputing it on first access.
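
The bias-1 scheme can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `LineStartEncoding`, `Encode`, `Position`, `PriorBreakLength`, and `GetLine` are hypothetical names, and a plain `List<uint>` stands in for the SegmentedList.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helpers illustrating the entry layout; not the Roslyn source.
public static class LineStartEncoding
{
    private const uint BreakBit = 0x8000_0000u;      // top bit: prior break length - 1
    private const uint PositionMask = 0x7FFF_FFFFu;  // low 31 bits: line start

    public static uint Encode(int lineStart, int priorBreakLength)
    {
        if (lineStart < 0)
            throw new ArgumentOutOfRangeException(nameof(lineStart));
        if (priorBreakLength != 1 && priorBreakLength != 2)
            throw new ArgumentOutOfRangeException(nameof(priorBreakLength));

        return (uint)lineStart | (priorBreakLength == 2 ? BreakBit : 0u);
    }

    public static int Position(uint entry) => (int)(entry & PositionMask);

    // Bias-1 decode: stored bit 0 means length 1, stored bit 1 means length 2.
    public static int PriorBreakLength(uint entry) => (int)(entry >> 31) + 1;

    // Computes the bounds of line `index`, reading the break length from the
    // NEXT entry, so the top bit of entry 0 is never consulted.
    public static (int Start, int End, int EndIncludingLineBreak) GetLine(
        IReadOnlyList<uint> lineStarts, int textLength, int index)
    {
        int start = Position(lineStarts[index]);
        if (index == lineStarts.Count - 1)
        {
            // The last line runs to the end of the text and has no break.
            return (start, textLength, textLength);
        }

        uint next = lineStarts[index + 1];
        int endIncludingBreak = Position(next);
        return (start, endIncludingBreak - PriorBreakLength(next), endIncludingBreak);
    }
}
```

For the 8-character text "ab\r\ncd\ne" the table would hold `0`, `Encode(4, 2)`, and `Encode(7, 1)`, and `GetLine` recovers each line's bounds without looking at the text itself.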

No loss of representable values

  • TextLine previously used two ints; it now uses one ulong. The 31-bit start and 31-bit length fields cover the full non-negative int range.
  • _lineStarts previously used 32-bit int start positions; it now uses 31-bit positions. SourceText lengths and positions are ints, so a source text never exceeds int.MaxValue (2^31 − 1) characters, and 31 bits cover exactly that range.

Testing

Added a comprehensive test suite (TextLineNewLineTests) that exercises line parsing across all four SourceText implementations (StringText, LargeText, SubText, CompositeText) and every supported line break type (\n, \r, \r\n, \u0085, \u2028, \u2029).

The tests are structured in layers:

  1. Ground truth tests verify exact Start/End/EndIncludingLineBreak values for simple scenarios (newline at start, middle, end, consecutive newlines, mixed newlines, etc.), parameterized across all text kinds and all newline types — so every assertion runs against every implementation.

  2. Cross-type consistency tests take 25 interesting content patterns and compare every pair of text kinds against each other, catching any disagreement between implementations even if the expected values above were somehow wrong.

  3. Implementation-specific edge cases target the tricky boundaries unique to each type: LargeText tests every chunk boundary position including splitting \r\n across chunks and single-character-per-chunk scenarios. SubText tests exhaustively check every possible substring of several content patterns, verifying against StringText as reference. CompositeText tests try every possible 2-way and 3-way split point for each content pattern, including cases where \r\n spans segment boundaries.

@dotnet-policy-service dotnet-policy-service Bot added the Community The pull request was submitted by a contributor who is not a Microsoft employee. label Mar 31, 2026
Comment thread src/Compilers/Core/Portable/Text/CompositeText.cs Outdated
Comment thread src/Compilers/Core/Portable/Text/CompositeText.cs
@CyrusNajmabadi CyrusNajmabadi marked this pull request as ready for review March 31, 2026 22:07
@CyrusNajmabadi CyrusNajmabadi requested a review from a team as a code owner March 31, 2026 22:07
@CyrusNajmabadi
Contributor Author

CyrusNajmabadi commented Mar 31, 2026

@ToddGrun to review.

Comment thread src/Compilers/Core/Portable/Text/SourceText.cs Outdated
@CyrusNajmabadi CyrusNajmabadi changed the title Experiments with TextLine encoding. Improve TextLine and line table performance by packing existing data into unused bits Mar 31, 2026
Comment thread src/Compilers/Core/Portable/Text/SourceText.cs Outdated
Cyrus Najmabadi and others added 19 commits April 1, 2026 12:20
Comprehensive Theory-based tests validating Start, End, EndIncludingLineBreak,
Span, and SpanIncludingLineBreak for every supported line break sequence
(\n, \r, \r\n, \u0085, \u2028, \u2029) across StringText, LargeText, SubText,
and CompositeText.

Tests are parameterized over a TextKind enum so every ground-truth assertion
is verified against all four SourceText implementations. A cross-type
consistency test checks all 16 implementation pairs agree on line structure
for a broad set of content patterns. Type-specific edge cases cover LargeText
chunk boundaries, SubText CRLF splitting, and CompositeText segment boundaries.

Encodes start (31 bits), line length (30 bits), and line break length
(3 bits) into one 64-bit field, making End and EndIncludingLineBreak
O(1) for lines created via FromSpan.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The 30-bit length field now always stores EndIncludingLineBreak - Start,
matching the natural semantics of both FromSpan and FromSpanUnsafe.
Bit 63 is now a standalone known flag with bits 62-61 for break length,
making EndIncludingLineBreak unconditionally O(1).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Cast int operands through uint before ulong to prevent sign extension
before bitwise-or operations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ubText

Adds a FromSpanUnsafe overload that accepts a known line break length,
avoiding the deferred text inspection for callers that already have the
information. Updates CompositeText (derived from the last segment's line)
and both SubText call sites (zero for the split-CRLF sentinel line;
computed via overlap of the underlying line's break range with the
subtext's span for the general case).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Changes _lineStarts from SegmentedList<int> to SegmentedList<uint>.
Each entry uses the top 2 bits to store the line break length of the
prior line and the bottom 30 bits for the line start position. This
lets the LineInfo indexer pass a known line break length to
FromSpanUnsafe, avoiding deferred computation on every line access.
ParseLineStarts is updated to encode the break length when appending
each new line start, including the cross-buffer CR+LF case.
IndexOf is updated with a manual binary search that masks the top bits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

With the unknown case eliminated, the _data encoding drops from
1+2+31+30 to 2+31+31 bits. Removes KnownShift/KnownMask, unifies
StartMask and LengthMask into a single PositionMask, and eliminates
the deferred-computation path in LineBreakLength entirely. Equals now
compares _data directly; GetHashCode hashes both halves of _data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Cyrus Najmabadi <cyrus.najmabadi@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…SubText

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Every line entry at index > 0 is created when a line break is found,
so the prior break length is always 1 or 2, never 0. This allows
bias-1 encoding: store (length - 1) in a single top bit, and read
back with (bit + 1). The freed bit extends line start positions from
30 to 31 bits, matching the full range of non-negative int values and
removing any theoretical limit on representable source positions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eText

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Matches the SourceText encoding: each entry packs the line start in the
low 31 bits and the prior line break length (bias-1) in the top bit.
Instead of the goto-based pattern that added current-line starts on each
break, now adds next-line starts at break time (matching SourceText's
approach). For \r, adds a provisional length-1 entry immediately; if \n
follows, upgrades the entry in-place to length-2 with the position
advanced past the \n. This keeps the fast path clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Restore UTF-8 BOM on SourceText.cs, TextLine.cs, LargeText.cs
- Use >>> (unsigned right shift) instead of >> for clarity
- Add blank line before SubText break-region comment
- Break confusing Math.Max expression into named intermediates
- Fix xUnit1026: use InlineData for chunk boundary tests
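
The "provisional entry, upgraded in place" approach mentioned in the LargeText commit message above can be sketched roughly as below. This is an illustration only: `LineStartParser`, `Parse`, and the chunk representation as a sequence of strings are assumptions made for the example, not the actual LargeText code.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of adding next-line starts at break time; not the Roslyn source.
public static class LineStartParser
{
    private const uint BreakBit = 0x8000_0000u; // top bit set: prior break had length 2

    private static bool IsSingleCharBreak(char c) =>
        c == '\n' || c == '\u0085' || c == '\u2028' || c == '\u2029';

    public static List<uint> Parse(IEnumerable<string> chunks)
    {
        var starts = new List<uint> { 0u }; // line 0 always starts at position 0
        int position = 0;
        bool lastWasCR = false;             // survives chunk boundaries

        foreach (string chunk in chunks)
        {
            foreach (char c in chunk)
            {
                position++; // position just past c, i.e. where the next line would start

                if (c == '\n' && lastWasCR)
                {
                    // "\r\n": upgrade the provisional "\r" entry in place to
                    // length 2, with the start advanced past the '\n'.
                    starts[starts.Count - 1] = (uint)position | BreakBit;
                }
                else if (c == '\r' || IsSingleCharBreak(c))
                {
                    // Length-1 break (provisional if c is '\r'): top bit stays clear.
                    starts.Add((uint)position);
                }

                lastWasCR = c == '\r';
            }
        }

        return starts;
    }
}
```

Because `lastWasCR` is carried across chunks, a "\r" at the end of one chunk followed by "\n" at the start of the next still produces a single length-2 entry.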
@CyrusNajmabadi CyrusNajmabadi requested a review from ToddGrun April 2, 2026 20:17
@CyrusNajmabadi
Contributor Author

could i get reviews on this? @phil-allen-msft ?

@CyrusNajmabadi
Contributor Author

@jjonescz ptal :)

Comment thread src/Compilers/Core/Portable/Text/TextLine.cs Outdated
Comment thread .gitignore
Comment thread .gitignore
@CyrusNajmabadi
Contributor Author

CyrusNajmabadi commented Apr 13, 2026

@dotnet/roslyn-compiler @jcouv for another set of eyes. thanks :)

Member

@333fred 333fred left a comment

One small nit, otherwise looking good.

Comment thread src/Compilers/Core/Portable/Text/SourceText.cs Outdated
CyrusNajmabadi and others added 3 commits April 16, 2026 08:10
Co-authored-by: Fred Silberberg <fred@silberberg.xyz>
…/roslyn into textline-encoding
@CyrusNajmabadi CyrusNajmabadi requested a review from 333fred April 16, 2026 06:11
@CyrusNajmabadi
Contributor Author

Thanks @333fred . Fixed.

@CyrusNajmabadi
Contributor Author

This is good to merge whenever you are happy @333fred

@333fred 333fred merged commit 0344ff9 into dotnet:main Apr 16, 2026
28 checks passed
@dotnet-policy-service dotnet-policy-service Bot added this to the Next milestone Apr 16, 2026
@CyrusNajmabadi CyrusNajmabadi deleted the textline-encoding branch April 16, 2026 17:49
@CyrusNajmabadi
Contributor Author

Thanks! I got a nice clean round number :)

333fred added a commit to 333fred/roslyn that referenced this pull request Apr 20, 2026
Labels

Area-Compilers Community The pull request was submitted by a contributor who is not a Microsoft employee.
