Add view-based tokenizer offset mapping#333
Conversation
Pull request overview
This PR adds an offsets-aware encoding API to the Tokenizers module to support mapping encoded tokens back to spans in the original input text (targeting BERT-style NER workflows; resolves #307).
Changes:
- Introduces `TokenEncodingView` and a new `Tokenizer.encodeWithOffsets(text:addSpecialTokens:)` API that returns token IDs, token strings, and optional source spans.
- Adds offset-aware tokenization plumbing (pre-tokenization with offsets and post-processing with offsets).
- Extends tokenizer tests (generic + BERT-specific) to exercise the new API.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Sources/Tokenizers/Tokenizer.swift | Adds `TokenEncodingView`, `encodeWithOffsets`, and offset-aware tokenization logic; `encode()` now delegates to the new path. |
| Sources/Tokenizers/PreTokenizer.swift | Adds `PreTokenizedText` and `preTokenizeWithOffsets` to propagate offsets through select pre-tokenizers. |
| Sources/Tokenizers/PostProcessor.swift | Adds `PostProcessedToken` and `postProcessWithOffsets` to preserve offsets through post-processing. |
| Tests/TokenizersTests/TokenizerTests.swift | Adds a basic `encodeWithOffsets` check in the shared tokenizer test suite. |
| Tests/TokenizersTests/BertTokenizerTests.swift | Adds a BERT-specific offset-mapping test validating special-token spans and a simple substring extraction. |
```swift
/// A tokenizer that can return source spans for encoded tokens.
public protocol OffsetMappingTokenizer: Tokenizer {
    /// Encodes text into a view of token IDs, token strings, and source spans.
    ///
    /// - Parameters:
    ///   - text: The input text to encode
    ///   - addSpecialTokens: Whether to add special tokens (e.g., BOS, EOS)
    /// - Returns: A token encoding view. Spans are `nil` for synthetic/special
    ///   tokens or when offset mapping is unavailable.
    func encodeWithOffsets(text: String, addSpecialTokens: Bool) -> TokenEncodingView
}
```
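For illustration, here is a minimal, self-contained sketch of how a caller might use an API shaped like this to recover source substrings. The `SimpleWhitespaceTokenizer`, the field names on `TokenEncodingView`, and the `Range<String.Index>` span type are assumptions for this sketch, not the PR's actual definitions:

```swift
// Hypothetical stand-ins for the PR's types, just for this sketch.
struct TokenEncodingView {
    let ids: [Int]
    let tokens: [String]
    let offsets: [Range<String.Index>?] // nil for special tokens
}

protocol OffsetMappingTokenizer {
    func encodeWithOffsets(text: String, addSpecialTokens: Bool) -> TokenEncodingView
}

// Toy whitespace tokenizer that records each token's span in the input.
struct SimpleWhitespaceTokenizer: OffsetMappingTokenizer {
    func encodeWithOffsets(text: String, addSpecialTokens: Bool) -> TokenEncodingView {
        var tokens: [String] = []
        var offsets: [Range<String.Index>?] = []
        var index = text.startIndex
        while index < text.endIndex {
            if text[index].isWhitespace { index = text.index(after: index); continue }
            var end = index
            while end < text.endIndex, !text[end].isWhitespace { end = text.index(after: end) }
            tokens.append(String(text[index..<end]))
            offsets.append(index..<end)
            index = end
        }
        if addSpecialTokens {
            tokens.insert("[CLS]", at: 0); offsets.insert(nil, at: 0)
            tokens.append("[SEP]"); offsets.append(nil)
        }
        return TokenEncodingView(ids: Array(0..<tokens.count), tokens: tokens, offsets: offsets)
    }
}

let text = "Hello Swift tokenizers"
let view = SimpleWhitespaceTokenizer().encodeWithOffsets(text: text, addSpecialTokens: true)
// Map each non-special token back to its source substring; specials drop out.
let spans = zip(view.tokens, view.offsets).compactMap { token, range in
    range.map { (token, String(text[$0])) }
}
```

This is the NER-style round trip the PR targets: given a token index from a model prediction, the span lets you slice the original text instead of reassembling it from token strings.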
`encodeWithOffsets` is only available via `OffsetMappingTokenizer`, but `AutoTokenizer.from(...)` returns `Tokenizer` (see same file). This means most callers can't access offset mapping without a downcast, which undermines the public API goal from #307. Consider making `encodeWithOffsets` a requirement on `Tokenizer` (with a default implementation returning `nil` spans when unsupported), changing the `AutoTokenizer` factory return type to `OffsetMappingTokenizer`, and/or adding a bridging API so callers don't need to know the concrete type. Also, since the protocol requirement lacks a default argument, `OffsetMappingTokenizer` values can't call `encodeWithOffsets(text:)` without passing `addSpecialTokens`; adding an extension overload with `addSpecialTokens: Bool = true` would improve ergonomics.
```swift
func postProcessWithOffsets(postProcessor: PostProcessor?, tokens: [PostProcessedToken], addSpecialTokens: Bool = true) -> [PostProcessedToken] {
    guard let postProcessor else { return tokens }

    let tokenStrings = tokens.map(\.text)
    let processedStrings = postProcessor.postProcess(tokens: tokenStrings, tokensPair: nil, addSpecialTokens: addSpecialTokens)

    // Map offsets by source token position (not token text) to avoid collisions
    // with inserted special tokens and to preserve order after post-processing.
    var sourceIndex = 0
    return processedStrings.map { token in
        guard sourceIndex < tokens.count else {
            return PostProcessedToken(text: token, offset: nil)
        }

        let sourceToken = tokens[sourceIndex]
        let isDirectMatch = token == sourceToken.text
        let isWhitespaceNormalizedMatch = token.trimmingCharacters(in: .whitespaces) == sourceToken.text.trimmingCharacters(in: .whitespaces)

        if isDirectMatch || isWhitespaceNormalizedMatch {
            sourceIndex += 1
            return PostProcessedToken(text: token, offset: sourceToken.offset)
        }

        // Synthetic/special tokens added by post-processing have no source span.
        return PostProcessedToken(text: token, offset: nil)
    }
}
```
`postProcessWithOffsets` advances `sourceIndex` by comparing token text (`token == sourceToken.text`, or the trimmed match). This can mis-assign spans when the post-processor inserts a special token whose string happens to equal a real source token (e.g. the user's text contains `[CLS]` or `[SEP]` as added tokens). In that case the inserted special token can incorrectly consume the first source token's offset and shift the rest. A more reliable approach is to drive offset assignment from the post-processor's configuration/structure: for `BertProcessing`/`RobertaProcessing`, insert the known specials with `nil` offsets at fixed positions; for `TemplateProcessing`, walk the template and consume offsets only for `Sequence` items.
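One way to realize the structure-driven approach for a BERT-style template (`[CLS]` + sequence + `[SEP]`) is sketched below. `PostProcessedToken` is approximated with `Range<Int>` offsets, and the function name `bertStyleOffsets` is hypothetical; the point is that specials are identified by template position, never by text:

```swift
// Assumed shape of the PR's PostProcessedToken, for this sketch.
struct PostProcessedToken {
    let text: String
    let offset: Range<Int>? // nil for synthetic/special tokens
}

// Assign offsets by template position for a BERT-style post-processor
// ([CLS] + sequence + [SEP]) instead of matching token text, so a source
// token that happens to read "[CLS]" cannot steal a special's slot.
func bertStyleOffsets(source: [PostProcessedToken], processed: [String]) -> [PostProcessedToken] {
    processed.enumerated().map { i, text in
        // First and last positions are the inserted specials in this template.
        if i == 0 || i == processed.count - 1 {
            return PostProcessedToken(text: text, offset: nil)
        }
        // Interior positions map 1:1 onto the source tokens, shifted by the
        // leading [CLS].
        return PostProcessedToken(text: text, offset: source[i - 1].offset)
    }
}

// Adversarial input: the user's text literally contains "[CLS]".
let source = [
    PostProcessedToken(text: "[CLS]", offset: 0..<5),
    PostProcessedToken(text: "hi", offset: 6..<8),
]
let out = bertStyleOffsets(source: source, processed: ["[CLS]", "[CLS]", "hi", "[SEP]"])
```

With text matching, the inserted `[CLS]` at position 0 would consume the source token's `0..<5` span; with position-driven assignment, the user's literal `[CLS]` keeps its offset and the inserted specials stay `nil`.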
Converting to draft so that I can think about this some more. The current API with Python-style I'm probably overlooking something obvious, but I'm too close to the problem and need to put it down for a moment.
Resolves #307