Problem
When using BERT-based models for Named Entity Recognition (NER) tasks, I need to map token indices back to character positions in the original text for entity extraction. Currently, there is no public API to obtain character offsets for tokens.
Use Case
A common NER workflow:
- Tokenize text: "John Smith works at Google"
- Run NER model → get entity predictions with token indices
- Need offsets to map tokens back to text → extract "John Smith" at positions [0-10]
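To make the last step concrete, here is a minimal sketch of what I would do once offsets are available. `extractSpan` is a hypothetical helper of mine, not part of any existing API, and the offsets are written by hand, assuming (start, end) character positions with an exclusive end and no special tokens.

```swift
/// Hypothetical helper (not an existing library API): returns the substring of
/// `text` covered by tokens `firstToken...lastToken`, given per-token
/// (start, end) character offsets where `end` is exclusive.
func extractSpan(text: String, offsets: [(Int, Int)], firstToken: Int, lastToken: Int) -> String {
    let start = text.index(text.startIndex, offsetBy: offsets[firstToken].0)
    let end = text.index(text.startIndex, offsetBy: offsets[lastToken].1)
    return String(text[start..<end])
}

let text = "John Smith works at Google"
// Hand-written offsets for the five word tokens (an assumption for illustration).
let offsets = [(0, 4), (5, 10), (11, 16), (17, 19), (20, 26)]

// Suppose the NER model tagged tokens 0 and 1 as a person entity.
let person = extractSpan(text: text, offsets: offsets, firstToken: 0, lastToken: 1)
print(person) // John Smith
```

Without the offsets array, the only option is to re-search the text for the token strings, which breaks as soon as the tokenizer lowercases or normalizes the input.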
Current Limitation
The encode() and tokenize() methods only return token IDs and token strings, not their character positions in the original text. While TokenLatticeNode internally tracks startOffset and length, this information is not exposed publicly.
Example:
// Current API - no offset information
let ids = tokenizer.encode(text: "John Smith works at Google")
// Returns only an array of token IDs, e.g. [101, ..., 102]
// But we need: token positions in the original text
Proposed Solution
Add a new method to the Tokenizer protocol that returns token offsets:
/// Result of encoding text with offset information
public struct EncodingWithOffsets {
    public let ids: [Int]            // Token IDs
    public let tokens: [String]      // Token strings
    public let offsets: [(Int, Int)] // [(start, end), ...] character positions
}
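// New method in the Tokenizer protocol. Protocol requirements cannot carry an
// access modifier or a default argument, so the `addSpecialTokens: Bool = true`
// default would be supplied by a protocol extension.
func encodeWithOffsets(text: String, addSpecialTokens: Bool) -> EncodingWithOffsets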
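To illustrate the kind of output I have in mind, here is a rough sketch that uses a plain whitespace splitter as a stand-in for the real tokenizer. `whitespaceTokensWithOffsets` is a hypothetical name, and a real WordPiece implementation would also have to handle subword pieces, normalization, and special tokens, so please read this only as a picture of the offset bookkeeping, not as a proposed implementation.

```swift
/// Toy stand-in for a real tokenizer (hypothetical, for illustration only):
/// splits on whitespace and records (start, end) character offsets,
/// where `end` is exclusive.
func whitespaceTokensWithOffsets(_ text: String) -> (tokens: [String], offsets: [(Int, Int)]) {
    var tokens: [String] = []
    var offsets: [(Int, Int)] = []
    var current = ""
    var tokenStart = 0

    for (position, character) in text.enumerated() {
        if character.isWhitespace {
            // Whitespace closes the token being built, if there is one.
            if !current.isEmpty {
                tokens.append(current)
                offsets.append((tokenStart, position))
                current = ""
            }
        } else {
            // The first character of a new token fixes its start offset.
            if current.isEmpty { tokenStart = position }
            current.append(character)
        }
    }
    // Flush the final token when the text does not end in whitespace.
    if !current.isEmpty {
        tokens.append(current)
        offsets.append((tokenStart, text.count))
    }
    return (tokens, offsets)
}

let result = whitespaceTokensWithOffsets("John Smith works at Google")
print(result.tokens)
print(result.offsets)
```

The offsets here count `Character` positions in the original string. Whether the real API should report `Character`, UTF-16, or UTF-8 offsets is an open design question that I have not settled in this proposal.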
Benefits
- Consistent with Hugging Face Python API (return_offsets_mapping=True)
- No breaking changes to existing API (new method, not modification)
Related Issues
This aligns with the tokenizer implementation pattern from the Hugging Face transformers library, which provides offset_mapping as a standard feature.
Other example
PS: This is my first time opening an issue, so apologies if it does not follow the correct format. And thank you for this repo, it has helped me a lot in getting started with ML.