Add view-based tokenizer offset mapping #333

Draft
mattt wants to merge 4 commits into main from mattt/offset-mapping

Conversation

mattt (Collaborator) commented Mar 9, 2026

Resolves #307

Copilot AI left a comment

Pull request overview

This PR adds an offsets-aware encoding API to the Tokenizers module to support mapping encoded tokens back to spans in the original input text (targeting BERT-style NER workflows; resolves #307).

Changes:

  • Introduces TokenEncodingView and a new Tokenizer.encodeWithOffsets(text:addSpecialTokens:) API that returns token ids, token strings, and optional source spans.
  • Adds offset-aware tokenization plumbing (pre-tokenization with offsets and post-processing with offsets).
  • Extends tokenizer tests (generic + BERT-specific) to exercise the new API.
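For context, the intended NER-style use of the new API can be sketched with a toy stand-in (the whitespace "tokenizer" below and the property names on TokenEncodingView are assumptions for illustration; the real shapes live in the PR diff):

```swift
// Toy stand-in so the sketch runs on its own: a whitespace "tokenizer" that
// records each word's source range. Real tokenizers come from AutoTokenizer.
struct TokenEncodingView {
    let ids: [Int]
    let tokens: [String]
    let offsets: [Range<String.Index>?]   // nil for special tokens
}

func toyEncodeWithOffsets(_ text: String) -> TokenEncodingView {
    var tokens: [String] = ["[CLS]"]
    var offsets: [Range<String.Index>?] = [nil]
    var index = text.startIndex
    while index < text.endIndex {
        if text[index] == " " { index = text.index(after: index); continue }
        var end = index
        while end < text.endIndex, text[end] != " " { end = text.index(after: end) }
        tokens.append(String(text[index..<end]))
        offsets.append(index..<end)
        index = end
    }
    tokens.append("[SEP]")
    offsets.append(nil)
    return TokenEncodingView(ids: Array(0..<tokens.count), tokens: tokens, offsets: offsets)
}

// Mapping each token back to its exact source substring:
let text = "New York"
let view = toyEncodeWithOffsets(text)
for (token, span) in zip(view.tokens, view.offsets) {
    guard let span else { continue }       // [CLS]/[SEP] carry no span
    assert(String(text[span]) == token)    // the span recovers the original text
}
```

This is the BERT-style NER workflow the PR targets: predictions come back per token, and the spans let a caller highlight the exact substrings of the user's input without detokenizing.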

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

  • Sources/Tokenizers/Tokenizer.swift: Adds TokenEncodingView, encodeWithOffsets, and offset-aware tokenization logic; encode() now delegates to the new path.
  • Sources/Tokenizers/PreTokenizer.swift: Adds PreTokenizedText and preTokenizeWithOffsets to propagate offsets through select pre-tokenizers.
  • Sources/Tokenizers/PostProcessor.swift: Adds PostProcessedToken and postProcessWithOffsets to preserve offsets through post-processing.
  • Tests/TokenizersTests/TokenizerTests.swift: Adds a basic encodeWithOffsets check in the shared tokenizer test suite.
  • Tests/TokenizersTests/BertTokenizerTests.swift: Adds a BERT-specific offset mapping test validating special-token spans and a simple substring extraction.


Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.



Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.



Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.



Comment on lines +460 to +471
/// A tokenizer that can return source spans for encoded tokens.
public protocol OffsetMappingTokenizer: Tokenizer {
    /// Encodes text into a view of token IDs, token strings, and source spans.
    ///
    /// - Parameters:
    ///   - text: The input text to encode
    ///   - addSpecialTokens: Whether to add special tokens (e.g., BOS, EOS)
    /// - Returns: A token encoding view.
    ///   Spans are `nil` for synthetic/special tokens
    ///   or when offset mapping is unavailable.
    func encodeWithOffsets(text: String, addSpecialTokens: Bool) -> TokenEncodingView
}

Copilot AI Mar 9, 2026


encodeWithOffsets is only available via OffsetMappingTokenizer, but AutoTokenizer.from(...) returns Tokenizer (see same file). This means most callers won’t be able to access offset mapping without a downcast, which undermines the public API goal from #307. Consider making encodeWithOffsets a requirement on Tokenizer (with a default implementation returning nil spans when unsupported), or change the AutoTokenizer factory return type to OffsetMappingTokenizer and/or add a bridging API so callers don’t need to know the concrete type. Also, since the protocol requirement lacks a default arg, any OffsetMappingTokenizer values can’t call encodeWithOffsets(text:) without passing addSpecialTokens; adding an extension overload with addSpecialTokens: Bool = true would improve ergonomics.
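The extension-overload suggestion at the end of that comment could look like the sketch below. The surrounding types are minimal stand-ins so the snippet compiles on its own; TokenEncodingView's real shape is in the PR diff.

```swift
// Stand-ins for the PR's types; only the extension at the bottom is the point.
struct TokenEncodingView {
    let ids: [Int]
    let tokens: [String]
}

protocol Tokenizer {}

protocol OffsetMappingTokenizer: Tokenizer {
    func encodeWithOffsets(text: String, addSpecialTokens: Bool) -> TokenEncodingView
}

// The ergonomics fix suggested above: a protocol requirement cannot declare a
// default argument, but an extension overload can supply one, so callers can
// write encodeWithOffsets(text:) and get addSpecialTokens == true.
extension OffsetMappingTokenizer {
    func encodeWithOffsets(text: String) -> TokenEncodingView {
        encodeWithOffsets(text: text, addSpecialTokens: true)
    }
}
```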

Comment on lines +221 to +246
func postProcessWithOffsets(postProcessor: PostProcessor?, tokens: [PostProcessedToken], addSpecialTokens: Bool = true) -> [PostProcessedToken] {
    guard let postProcessor else { return tokens }

    let tokenStrings = tokens.map(\.text)
    let processedStrings = postProcessor.postProcess(tokens: tokenStrings, tokensPair: nil, addSpecialTokens: addSpecialTokens)

    // Map offsets by source token position (not token text) to avoid collisions
    // with inserted special tokens and to preserve order after post-processing.
    var sourceIndex = 0
    return processedStrings.map { token in
        guard sourceIndex < tokens.count else {
            return PostProcessedToken(text: token, offset: nil)
        }

        let sourceToken = tokens[sourceIndex]
        let isDirectMatch = token == sourceToken.text
        let isWhitespaceNormalizedMatch = token.trimmingCharacters(in: .whitespaces) == sourceToken.text.trimmingCharacters(in: .whitespaces)

        if isDirectMatch || isWhitespaceNormalizedMatch {
            sourceIndex += 1
            return PostProcessedToken(text: token, offset: sourceToken.offset)
        }

        // Synthetic/special tokens added by post-processing have no source span.
        return PostProcessedToken(text: token, offset: nil)
    }

Copilot AI Mar 9, 2026


postProcessWithOffsets advances sourceIndex by comparing token text (token == sourceToken.text / trimmed match). This can mis-assign spans when the post-processor inserts special tokens whose string happens to equal a real source token (e.g. user text contains "[CLS]" / "[SEP]" as added tokens). In that case the inserted special token can incorrectly consume the first source token’s offset and shift the rest. A more reliable approach is to drive offset assignment from the post-processor configuration/structure (e.g. for BertProcessing/RobertaProcessing insert known specials with nil offsets at fixed positions; for TemplateProcessing walk the template and consume offsets only for Sequence items).
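The structure-driven alternative described in that comment could be sketched as follows. The helper below is hypothetical, not part of the PR: for a BERT-style processor the output shape is fixed, so specials get nil offsets by position and token text is never compared.

```swift
// Hypothetical positional alternative to text matching: a BERT-style
// post-processor always emits [CLS] + sequence + [SEP], so the inserted
// specials can be given nil offsets at known positions. A source token whose
// text happens to be "[CLS]" keeps its span, which text matching can get wrong.
struct PostProcessedToken: Equatable {
    let text: String
    let offset: Range<Int>?
}

func bertPostProcessWithOffsets(
    tokens: [PostProcessedToken],
    cls: String = "[CLS]",
    sep: String = "[SEP]"
) -> [PostProcessedToken] {
    [PostProcessedToken(text: cls, offset: nil)]
        + tokens
        + [PostProcessedToken(text: sep, offset: nil)]
}
```

A TemplateProcessing equivalent would walk the template items instead, consuming source offsets only for Sequence items, as the comment suggests.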

@mattt mattt marked this pull request as draft March 9, 2026 17:04
mattt (Collaborator, Author) commented Mar 9, 2026

Converting to draft so that I can think about this some more. The current API, with the Python-style `AutoTokenizer(from:)` returning a type-erased `Tokenizer`, combined with the fact that not all tokenizers can support encoding with offsets (...right? or am I missing something), forces us either to adopt a fallback-with-failure model for unsupported cases or to require an awkward `if case let tokenizer as OffsetMappingTokenizer = AutoTokenizer(from: ...)` cast.

I'm probably overlooking something obvious, but I'm too close to the problem and need to put it down for a moment.
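For what it's worth, the fallback-with-failure option mentioned above might look like this sketch (the names and shapes are assumptions, not the PR's code): every Tokenizer gets encodeWithOffsets via a default implementation that returns nil spans, and offset-capable tokenizers override it.

```swift
// Sketch of the fallback-with-failure model: the base protocol provides a
// default that encodes normally but reports no spans, so AutoTokenizer can
// keep returning a type-erased Tokenizer and callers never need a downcast.
// (Tokenizer here is a minimal stand-in, not the real protocol.)
struct TokenEncodingView {
    let ids: [Int]
    let offsets: [Range<Int>?]   // nil when offset mapping is unavailable
}

protocol Tokenizer {
    func encode(text: String) -> [Int]
}

extension Tokenizer {
    func encodeWithOffsets(text: String, addSpecialTokens: Bool = true) -> TokenEncodingView {
        let ids = encode(text: text)
        return TokenEncodingView(ids: ids, offsets: Array(repeating: nil, count: ids.count))
    }
}
```

The trade-off is that unsupported tokenizers fail silently (all-nil spans) rather than at compile time, which is exactly the tension described above.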



Development

Successfully merging this pull request may close these issues.

Add offset mapping support to tokenizer.encode()

2 participants