Improve TextLine and line table performance by packing existing data into unused bits#83000
Merged
CyrusNajmabadi
commented
Mar 31, 2026
CyrusNajmabadi
commented
Mar 31, 2026
Contributor
Author
@ToddGrun to review.
CyrusNajmabadi
commented
Mar 31, 2026
Comprehensive Theory-based tests validating Start, End, EndIncludingLineBreak, Span, and SpanIncludingLineBreak for every supported line break sequence (\n, \r, \r\n, \u0085, \u2028, \u2029) across StringText, LargeText, SubText, and CompositeText. Tests are parameterized over a TextKind enum so every ground-truth assertion is verified against all four SourceText implementations. A cross-type consistency test checks all 16 implementation pairs agree on line structure for a broad set of content patterns. Type-specific edge cases cover LargeText chunk boundaries, SubText CRLF splitting, and CompositeText segment boundaries.
Encodes start (31 bits), line length (30 bits), and line break length (3 bits) into one 64-bit field, making End and EndIncludingLineBreak O(1) for lines created via FromSpan. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The 30-bit length field now always stores EndIncludingLineBreak - Start, matching the natural semantics of both FromSpan and FromSpanUnsafe. Bit 63 is now a standalone known flag with bits 62-61 for break length, making EndIncludingLineBreak unconditionally O(1). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cast int operands through uint before ulong to prevent sign extension before bitwise-or operations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
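A minimal sketch of the hazard this commit guards against. The class and method names are hypothetical and are not taken from the PR; the sketch only illustrates how C# widens an int to a ulong with and without the intermediate uint cast.

```csharp
using System;

// Hypothetical illustration; not the PR's actual code.
public static class SignExtensionSketch
{
    // Direct int -> ulong conversion sign-extends: a negative int
    // fills the upper 32 bits with ones.
    public static ulong Widen(int value) => unchecked((ulong)value);

    // Going through uint first reinterprets the 32 bits as unsigned,
    // so the later widening to ulong zero-extends.
    public static ulong WidenViaUInt(int value) => unchecked((ulong)(uint)value);

    // Packs two 32-bit values into one ulong. Without the uint casts,
    // a set sign bit in 'low' would overwrite the 'high' half during the bitwise-or.
    public static ulong Pack(int high, int low) =>
        unchecked(((ulong)(uint)high << 32) | (uint)low);
}
```

With these definitions, Widen(-1) yields 0xFFFFFFFFFFFFFFFF, while WidenViaUInt(-1) yields 0x00000000FFFFFFFF, which is the value that can safely be or-ed into a packed field.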
…ubText Adds a FromSpanUnsafe overload that accepts a known line break length, avoiding the deferred text inspection for callers that already have the information. Updates CompositeText (derived from the last segment's line) and both SubText call sites (zero for the split-CRLF sentinel line; computed via overlap of the underlying line's break range with the subtext's span for the general case). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Changes _lineStarts from SegmentedList<int> to SegmentedList<uint>. Each entry uses the top 2 bits to store the line break length of the prior line and the bottom 30 bits for the line start position. This lets the LineInfo indexer pass a known line break length to FromSpanUnsafe, avoiding deferred computation on every line access. ParseLineStarts is updated to encode the break length when appending each new line start, including the cross-buffer CR+LF case. IndexOf is updated with a manual binary search that masks the top bits. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
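The reason a manual binary search is needed can be sketched as follows: once tag bits live in the top of each entry, the raw uint values are no longer sorted by position, so the comparison has to mask them off first. This is a hedged illustration using the 2-bit/30-bit layout described in this commit; the type name, constants, and method names are assumptions rather than the PR's code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; not the PR's actual code.
public static class LineStartSearchSketch
{
    private const int PositionBits = 30;                      // low 30 bits: line start
    private const uint PositionMask = (1u << PositionBits) - 1;

    public static uint Encode(int start, int priorBreakLength) =>
        ((uint)priorBreakLength << PositionBits) | ((uint)start & PositionMask);

    public static int StartOf(uint entry) => (int)(entry & PositionMask);

    public static int PriorBreakLengthOf(uint entry) => (int)(entry >> PositionBits);

    // Returns the index of the last entry whose start is <= position.
    // Assumes entries is non-empty and sorted by masked start, with entries[0] starting at 0.
    public static int IndexOfLine(IReadOnlyList<uint> entries, int position)
    {
        int lo = 0, hi = entries.Count - 1;
        while (lo < hi)
        {
            int mid = lo + (hi - lo + 1) / 2;                 // upper middle so the loop always shrinks
            if (StartOf(entries[mid]) <= position)
                lo = mid;
            else
                hi = mid - 1;
        }
        return lo;
    }
}
```

For the text "a\r\nb\nc" the line starts are 0, 3, and 5, with prior break lengths of 2 and 1. The raw entries are 0x00000000, 0x80000003, and 0x40000005, which are out of order as plain uints, yet IndexOfLine still maps position 4 to line 1 because only the masked starts are compared.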
With the unknown case eliminated, the _data encoding drops from 1+2+31+30 to 2+31+31 bits. Removes KnownShift/KnownMask, unifies StartMask and LengthMask into a single PositionMask, and eliminates the deferred-computation path in LineBreakLength entirely. Equals now compares _data directly; GetHashCode hashes both halves of _data. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
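The 2+31+31 layout could be sketched like this. The field widths come from the commit message above, but the bit positions (break length in the top two bits, then start, then length), the struct name, and the constructor shape are assumptions made for illustration.

```csharp
using System;

// Hypothetical illustration of a 2+31+31 packing; not the PR's actual TextLine.
public readonly struct PackedLineSketch
{
    private const int StartShift = 31;                        // assumed: bits 31..61 hold Start
    private const int BreakShift = 62;                        // assumed: bits 62..63 hold break length
    private const ulong PositionMask = (1UL << 31) - 1;       // one mask shared by start and length

    private readonly ulong _data;

    public PackedLineSketch(int start, int lengthIncludingBreak, int lineBreakLength)
    {
        // Each int goes through uint first so no sign extension can occur.
        _data = ((ulong)(uint)lineBreakLength << BreakShift)
              | ((ulong)(uint)start << StartShift)
              | (uint)lengthIncludingBreak;
    }

    public int Start => (int)((_data >> StartShift) & PositionMask);

    public int LineBreakLength => (int)(_data >> BreakShift);

    // Start + stored length, with no branching.
    public int EndIncludingLineBreak => Start + (int)(_data & PositionMask);

    public int End => EndIncludingLineBreak - LineBreakLength;
}
```

For example, new PackedLineSketch(10, 7, 2) reports Start 10, EndIncludingLineBreak 17, and End 15, with every property computed from shifts and masks alone.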
Co-authored-by: Cyrus Najmabadi <cyrus.najmabadi@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…SubText Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Every line entry at index > 0 is created when a line break is found, so the prior break length is always 1 or 2, never 0. This allows bias-1 encoding: store (length - 1) in a single top bit, and read back with (bit + 1). The freed bit extends line start positions from 30 to 31 bits, matching the full range of non-negative int values and removing any theoretical limit on representable source positions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
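A small sketch of the bias-1 encoding described in this commit, assuming a hypothetical helper class: the top bit stores (length - 1) and the low 31 bits store the line start.

```csharp
using System;

// Hypothetical illustration; not the PR's actual code.
public static class BiasOneSketch
{
    private const uint StartMask = 0x7FFFFFFFu;               // low 31 bits: line start

    public static uint Encode(int start, int priorBreakLength)
    {
        if (start < 0)
            throw new ArgumentOutOfRangeException(nameof(start));
        if (priorBreakLength < 1 || priorBreakLength > 2)
            throw new ArgumentOutOfRangeException(nameof(priorBreakLength));

        // Store (length - 1) in the single top bit: 0 means length 1, 1 means length 2.
        return ((uint)(priorBreakLength - 1) << 31) | (uint)start;
    }

    public static int StartOf(uint entry) => (int)(entry & StartMask);

    // Read back with (bit + 1). Only meaningful for entries at index > 0.
    public static int PriorBreakLengthOf(uint entry) => (int)(entry >> 31) + 1;
}
```

Because the start keeps all 31 low bits, Encode(int.MaxValue, 2) round-trips to a start of int.MaxValue and a break length of 2.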
…eText Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Matches the SourceText encoding: each entry packs the line start in the low 31 bits and the prior line break length (bias-1) in the top bit. Instead of the goto-based pattern that added current-line starts on each break, now adds next-line starts at break time (matching SourceText's approach). For \r, adds a provisional length-1 entry immediately; if \n follows, upgrades the entry in-place to length-2 with the position advanced past the \n. This keeps the fast path clean. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
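The provisional-entry approach for \r could look roughly like the following. This is a simplified sketch over a single string, with hypothetical names, and it ignores the chunked and cross-buffer handling the real implementations need.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; not the PR's actual code.
public static class LineStartScanSketch
{
    private const uint StartMask = 0x7FFFFFFFu;

    private static uint Encode(int start, int priorBreakLength) =>
        ((uint)(priorBreakLength - 1) << 31) | (uint)start;

    private static bool IsSingleCharBreak(char c) =>
        c == '\n' || c == '\u0085' || c == '\u2028' || c == '\u2029';

    public static int StartOf(uint entry) => (int)(entry & StartMask);

    public static int PriorBreakLengthOf(uint entry) => (int)(entry >> 31) + 1;

    public static List<uint> Parse(string text)
    {
        // Line 0 always starts at 0; its top bit is never read as a break length.
        var entries = new List<uint> { 0u };
        bool previousWasCR = false;

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (c == '\r')
            {
                // Provisional: assume a lone \r (length 1) until a \n proves otherwise.
                entries.Add(Encode(i + 1, 1));
                previousWasCR = true;
                continue;
            }

            if (c == '\n' && previousWasCR)
            {
                // Upgrade the provisional entry in place: break length 2, start past the \n.
                entries[entries.Count - 1] = Encode(i + 1, 2);
            }
            else if (IsSingleCharBreak(c))
            {
                entries.Add(Encode(i + 1, 1));
            }

            previousWasCR = false;
        }

        return entries;
    }
}
```

For "a\r\nb\rc\n" this produces line starts 0, 3, 5, and 7, with prior break lengths 2, 1, and 1 for the last three entries. The common case of a non-break character touches only the previousWasCR flag.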
- Restore UTF-8 BOM on SourceText.cs, TextLine.cs, LargeText.cs
- Use >>> (unsigned right shift) instead of >> for clarity
- Add blank line before SubText break-region comment
- Break confusing Math.Max expression into named intermediates
- Fix xUnit1026: use InlineData for chunk boundary tests
Contributor
Author
Could I get reviews on this? @phil-allen-msft?
Contributor
Author
@jjonescz ptal :)
jjonescz
approved these changes
Apr 13, 2026
CyrusNajmabadi
commented
Apr 13, 2026
The mask strips tag bits to yield the raw value, not a "position".
Contributor
Author
@dotnet/roslyn-compiler @jcouv for another set of eyes. Thanks :)
333fred
reviewed
Apr 14, 2026
Member
333fred
left a comment
One small nit, otherwise looking good.
Co-authored-by: Fred Silberberg <fred@silberberg.xyz>
…/roslyn into textline-encoding
Contributor
Author
Thanks @333fred. Fixed.
Contributor
Author
This is good to merge whenever you are happy, @333fred.
333fred
approved these changes
Apr 16, 2026
Contributor
Author
Thanks! I got a nice clean round number :)
This was referenced Apr 16, 2026
333fred
added a commit
to 333fred/roslyn
that referenced
this pull request
Apr 20, 2026
…ture * upstream/main: (77 commits) Fix ArgumentNullException in VB Edit and Continue (dotnet#83250) Fix property pattern completion filtering out member being edited (dotnet#83230) Add branch merge skill (dotnet#83229) [main] Source code updates from dotnet/dotnet (dotnet#83215) Support MatchPriority comparison in LSP completion (dotnet#83164) Have CompleteStatement handle EOF statements (dotnet#83205) Minor cleanups related to attributes in VB (dotnet#83206) Simplify Address additional PR feedback Port remaining unit test projects to Linux (dotnet#83153) Unsafe evolution: allow unsafe property accessors (dotnet#83115) Address PR review feedback Allow cohost rename in Razor source-generated docs Refine code review skill (dotnet#82666) Review feedback Allow creation of DocumentUri instances even if System.Uri cannot parse it Improve TextLine and line table performance by packing existing data into unused bits (dotnet#83000) Skip TestFindReferencesAsync_UsingAlias on non-Windows platforms (dotnet#83188) [main] Source code updates from dotnet/dotnet (dotnet#83174) fix comment ...
Background
TextLine is a readonly struct returned whenever code asks "what line is position X on?", which happens constantly during compilation, IDE features, and diagnostics. The line table (SourceText.LineInfo) is consulted for every such lookup.
What this PR does
TextLine._data: merge two int fields into one ulong
Previously TextLine stored _start: int and _endIncludingBreaks: int as separate fields. This PR replaces them with a single ulong _data encoding:
- line break length (2 bits)
- Start (31 bits)
- EndIncludingLineBreak − Start (31 bits)
This makes EndIncludingLineBreak a trivial Start + length with no branching, and End a simple EndIncludingLineBreak − breakLen. Previously, End required calling TextUtilities.GetStartAndLengthOfLineBreakEndingAt on every access to determine the line break length.
_lineStarts: pack prior line break length into each entry
SourceText.LineInfo stores line start positions in a SegmentedList. These were plain ints. Since source positions are always non-negative and therefore fit in 31 bits, the top bit of a uint was free.
Each _lineStarts[i] (for i > 0) records where a line started, which is always immediately after a line break. That prior line break is always length 1 or 2, never 0. This allows bias-1 encoding in a single top bit: 0 = length 1, 1 = length 2. The first entry (_lineStarts[0]) always starts at position 0 and its top bit is never read for break length purposes, since the indexer reads the break length for line index from _lineStarts[index + 1].
Result: LineInfo's indexer can call TextLine.FromSpanUnsafe(..., lineBreakLength) directly instead of leaving the length unknown and recomputing it on first access.
No loss of representable values
TextLine previously used two ints; it now uses one ulong. The 31-bit start and 31-bit length fields cover the full non-negative int range.
_lineStarts previously used 32-bit int start positions; it now uses 31-bit positions. Source text positions and lengths are ints in the existing SourceText API, so a text is capped at int.MaxValue characters, and 31 bits (maximum value 2^31 − 1 = int.MaxValue) cover the identical range.
Testing
Added a comprehensive test suite (TextLineNewLineTests) that exercises line parsing across all four SourceText implementations (StringText, LargeText, SubText, CompositeText) and every supported line break type (\n, \r, \r\n, \u0085, \u2028, \u2029).
The tests are structured in layers:
Ground truth tests verify exact Start/End/EndIncludingLineBreak values for simple scenarios (newline at start, middle, end, consecutive newlines, mixed newlines, etc.), parameterized across all text kinds and all newline types — so every assertion runs against every implementation.
Cross-type consistency tests take 25 interesting content patterns and compare every pair of text kinds against each other, catching any disagreement between implementations even if the expected values above were somehow wrong.
Implementation-specific edge cases target the tricky boundaries unique to each type: LargeText tests every chunk boundary position including splitting \r\n across chunks and single-character-per-chunk scenarios. SubText tests exhaustively check every possible substring of several content patterns, verifying against StringText as reference. CompositeText tests try every possible 2-way and 3-way split point for each content pattern, including cases where \r\n spans segment boundaries.