Add view-based tokenizer offset mapping#333
Conversation
Pull request overview
This PR adds an offsets-aware encoding API to the Tokenizers module to support mapping encoded tokens back to spans in the original input text (targeting BERT-style NER workflows; resolves #307).
Changes:
- Introduces `TokenEncodingView` and a new `Tokenizer.encodeWithOffsets(text:addSpecialTokens:)` API that returns token IDs, token strings, and optional source spans.
- Adds offset-aware tokenization plumbing (pre-tokenization with offsets and post-processing with offsets).
- Extends tokenizer tests (generic + BERT-specific) to exercise the new API.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Sources/Tokenizers/Tokenizer.swift | Adds `TokenEncodingView`, `encodeWithOffsets`, and offset-aware tokenization logic; `encode()` now delegates to the new path. |
| Sources/Tokenizers/PreTokenizer.swift | Adds `PreTokenizedText` and `preTokenizeWithOffsets` to propagate offsets through select pre-tokenizers. |
| Sources/Tokenizers/PostProcessor.swift | Adds `PostProcessedToken` and `postProcessWithOffsets` to preserve offsets through post-processing. |
| Tests/TokenizersTests/TokenizerTests.swift | Adds a basic `encodeWithOffsets` check in the shared tokenizer test suite. |
| Tests/TokenizersTests/BertTokenizerTests.swift | Adds a BERT-specific offset-mapping test validating special-token spans and a simple substring extraction. |
```swift
/// A tokenizer that can return source spans for encoded tokens.
public protocol OffsetMappingTokenizer: Tokenizer {
    /// Encodes text into a view of token IDs, token strings, and source spans.
    ///
    /// - Parameters:
    ///   - text: The input text to encode
    ///   - addSpecialTokens: Whether to add special tokens (e.g., BOS, EOS)
    /// - Returns: A token encoding view. Spans are `nil` for synthetic/special
    ///   tokens or when offset mapping is unavailable.
    func encodeWithOffsets(text: String, addSpecialTokens: Bool) -> TokenEncodingView
}
```
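For illustration, here is a minimal, self-contained sketch of how a caller might use an API shaped like this to recover source substrings. The `SimpleWhitespaceTokenizer`, the field names on `TokenEncodingView`, and the `Range<String.Index>` span type are assumptions for this sketch, not the PR's actual definitions:

```swift
// Hypothetical stand-ins for the PR's types, just for this sketch.
struct TokenEncodingView {
    let ids: [Int]
    let tokens: [String]
    let offsets: [Range<String.Index>?] // nil for special tokens
}

protocol OffsetMappingTokenizer {
    func encodeWithOffsets(text: String, addSpecialTokens: Bool) -> TokenEncodingView
}

// Toy whitespace tokenizer that records each token's span in the input.
struct SimpleWhitespaceTokenizer: OffsetMappingTokenizer {
    func encodeWithOffsets(text: String, addSpecialTokens: Bool) -> TokenEncodingView {
        var tokens: [String] = []
        var offsets: [Range<String.Index>?] = []
        var index = text.startIndex
        while index < text.endIndex {
            if text[index].isWhitespace { index = text.index(after: index); continue }
            var end = index
            while end < text.endIndex, !text[end].isWhitespace { end = text.index(after: end) }
            tokens.append(String(text[index..<end]))
            offsets.append(index..<end)
            index = end
        }
        if addSpecialTokens {
            tokens.insert("[CLS]", at: 0); offsets.insert(nil, at: 0)
            tokens.append("[SEP]"); offsets.append(nil)
        }
        return TokenEncodingView(ids: Array(0..<tokens.count), tokens: tokens, offsets: offsets)
    }
}

let text = "Hello Swift tokenizers"
let view = SimpleWhitespaceTokenizer().encodeWithOffsets(text: text, addSpecialTokens: true)
// Map each non-special token back to its source substring; specials drop out.
let spans = zip(view.tokens, view.offsets).compactMap { token, range in
    range.map { (token, String(text[$0])) }
}
```

This is the NER-style round trip the PR targets: given a token index from a model prediction, the span lets you slice the original text instead of reassembling it from token strings.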
`encodeWithOffsets` is only available via `OffsetMappingTokenizer`, but `AutoTokenizer.from(...)` returns `Tokenizer` (see same file). This means most callers can't access offset mapping without a downcast, which undermines the public API goal from #307. Consider making `encodeWithOffsets` a requirement on `Tokenizer` (with a default implementation returning `nil` spans when unsupported), changing the `AutoTokenizer` factory return type to `OffsetMappingTokenizer`, and/or adding a bridging API so callers don't need to know the concrete type. Also, since the protocol requirement lacks a default argument, `OffsetMappingTokenizer` values can't call `encodeWithOffsets(text:)` without passing `addSpecialTokens`; adding an extension overload with `addSpecialTokens: Bool = true` would improve ergonomics.
```swift
func postProcessWithOffsets(postProcessor: PostProcessor?, tokens: [PostProcessedToken], addSpecialTokens: Bool = true) -> [PostProcessedToken] {
    guard let postProcessor else { return tokens }

    let tokenStrings = tokens.map(\.text)
    let processedStrings = postProcessor.postProcess(tokens: tokenStrings, tokensPair: nil, addSpecialTokens: addSpecialTokens)

    // Map offsets by source token position (not token text) to avoid collisions
    // with inserted special tokens and to preserve order after post-processing.
    var sourceIndex = 0
    return processedStrings.map { token in
        guard sourceIndex < tokens.count else {
            return PostProcessedToken(text: token, offset: nil)
        }

        let sourceToken = tokens[sourceIndex]
        let isDirectMatch = token == sourceToken.text
        let isWhitespaceNormalizedMatch = token.trimmingCharacters(in: .whitespaces) == sourceToken.text.trimmingCharacters(in: .whitespaces)

        if isDirectMatch || isWhitespaceNormalizedMatch {
            sourceIndex += 1
            return PostProcessedToken(text: token, offset: sourceToken.offset)
        }

        // Synthetic/special tokens added by post-processing have no source span.
        return PostProcessedToken(text: token, offset: nil)
    }
}
```
`postProcessWithOffsets` advances `sourceIndex` by comparing token text (`token == sourceToken.text`, or the trimmed match). This can mis-assign spans when the post-processor inserts a special token whose string happens to equal a real source token (e.g. the user's text contains `[CLS]` or `[SEP]` as added tokens). In that case the inserted special token can incorrectly consume the first source token's offset and shift the rest. A more reliable approach is to drive offset assignment from the post-processor's configuration/structure: for `BertProcessing`/`RobertaProcessing`, insert the known specials with `nil` offsets at fixed positions; for `TemplateProcessing`, walk the template and consume offsets only for `Sequence` items.
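One way to realize the structure-driven approach for a BERT-style template (`[CLS]` + sequence + `[SEP]`) is sketched below. `PostProcessedToken` is approximated with `Range<Int>` offsets, and the function name `bertStyleOffsets` is hypothetical; the point is that specials are identified by template position, never by text:

```swift
// Assumed shape of the PR's PostProcessedToken, for this sketch.
struct PostProcessedToken {
    let text: String
    let offset: Range<Int>? // nil for synthetic/special tokens
}

// Assign offsets by template position for a BERT-style post-processor
// ([CLS] + sequence + [SEP]) instead of matching token text, so a source
// token that happens to read "[CLS]" cannot steal a special's slot.
func bertStyleOffsets(source: [PostProcessedToken], processed: [String]) -> [PostProcessedToken] {
    processed.enumerated().map { i, text in
        // First and last positions are the inserted specials in this template.
        if i == 0 || i == processed.count - 1 {
            return PostProcessedToken(text: text, offset: nil)
        }
        // Interior positions map 1:1 onto the source tokens, shifted by the
        // leading [CLS].
        return PostProcessedToken(text: text, offset: source[i - 1].offset)
    }
}

// Adversarial input: the user's text literally contains "[CLS]".
let source = [
    PostProcessedToken(text: "[CLS]", offset: 0..<5),
    PostProcessedToken(text: "hi", offset: 6..<8),
]
let out = bertStyleOffsets(source: source, processed: ["[CLS]", "[CLS]", "hi", "[SEP]"])
```

With text matching, the inserted `[CLS]` at position 0 would consume the source token's `0..<5` span; with position-driven assignment, the user's literal `[CLS]` keeps its offset and the inserted specials stay `nil`.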
Converting to draft so that I can think about this some more. The current API with Python-style I'm probably overlooking something obvious, but I'm too close to the problem and need to put it down for a moment.
Resolves #307