Add offset mapping support to tokenizer.encode()

### Problem
When using BERT-based models for Named Entity Recognition (NER) tasks, I need to map token indices back to character positions in the original text for entity extraction. Currently, there is no public API to obtain character offsets for tokens.

### Use Case
A common NER workflow:
1. Tokenize text: "John Smith works at Google"
2. Run NER model → get entity predictions with token indices
3. Use offsets to map tokens back to the text → extract "John Smith" at character range [0, 10)
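
Step 3 above can be sketched as follows. This is only an illustration: the `offsets` array is written by hand (one entry per word, special tokens omitted), the helper `extractSpan` is hypothetical, and offsets are assumed to be `(start, end)` character positions with an exclusive end.

```swift
// Hand-written, hypothetical offsets for "John Smith works at Google":
// one (start, end) pair per token, end exclusive.
let text = "John Smith works at Google"
let offsets: [(Int, Int)] = [(0, 4), (5, 10), (11, 16), (17, 19), (20, 26)]

/// Returns the substring of `text` covered by tokens `firstToken...lastToken`.
/// Hypothetical helper, not part of any existing API.
func extractSpan(from text: String, offsets: [(Int, Int)], firstToken: Int, lastToken: Int) -> String {
    let start = text.index(text.startIndex, offsetBy: offsets[firstToken].0)
    let end = text.index(text.startIndex, offsetBy: offsets[lastToken].1)
    return String(text[start..<end])
}

// Suppose the NER model tagged tokens 0 and 1 as a single PERSON entity.
let person = extractSpan(from: text, offsets: offsets, firstToken: 0, lastToken: 1)
print(person) // John Smith
```

Without the offsets, the only option is to re-search the text for the decoded token strings, which breaks down once normalization or subword splitting changes the surface form.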

### Current Limitation
The `encode()` and `tokenize()` methods only return token IDs and token strings, not their character positions in the original text. While `TokenLatticeNode` internally tracks `startOffset` and `length`, this information is not exposed publicly.

**Example:**
```swift
// Current API - no offset information
let ids = tokenizer.encode(text: "John Smith works at Google")
// Returns: token IDs only, e.g. [101, ..., 102] ([CLS] ... [SEP] for BERT)
// But we need: token positions in the original text
```
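
If the internal `startOffset` and `length` values were exposed, turning them into `(start, end)` pairs would be a small step. The sketch below uses a hypothetical stand-in type, `NodeSpan`, that mirrors only those two field names; the real `TokenLatticeNode` may differ, and it is an assumption that its offsets refer to the original text rather than to a normalized copy.

```swift
// Hypothetical stand-in mirroring only the two fields named in this issue;
// the real TokenLatticeNode is internal and may look different.
struct NodeSpan {
    let startOffset: Int
    let length: Int
}

/// Converts (startOffset, length) pairs into (start, end) offsets, end exclusive.
func toOffsets(_ nodes: [NodeSpan]) -> [(Int, Int)] {
    nodes.map { ($0.startOffset, $0.startOffset + $0.length) }
}

let spans = toOffsets([
    NodeSpan(startOffset: 0, length: 4),  // "John"
    NodeSpan(startOffset: 5, length: 5),  // "Smith"
])
```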

### Proposed Solution
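Add a new method to the `Tokenizer` protocol that returns token offsets:

```swift
/// Result of encoding text with offset information
public struct EncodingWithOffsets {
    public let ids: [Int]                    // Token IDs
    public let tokens: [String]              // Token strings
    public let offsets: [(Int, Int)]         // [(start, end), ...] character positions
}

// New requirement in the Tokenizer protocol. Protocol requirements cannot carry
// access modifiers or default argument values, so the default lives in an extension.
func encodeWithOffsets(text: String, addSpecialTokens: Bool) -> EncodingWithOffsets

public extension Tokenizer {
    // Convenience overload: addSpecialTokens defaults to true
    func encodeWithOffsets(text: String) -> EncodingWithOffsets {
        encodeWithOffsets(text: text, addSpecialTokens: true)
    }
}
```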
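
To make the expected output concrete, here is a toy sketch of what such a method could return. It is not the library's tokenizer: `ToyEncoding` and `toyEncodeWithOffsets` are hypothetical names, the "tokenizer" just splits on spaces, and offsets are counted in Swift `Character`s with an exclusive end. Giving special tokens the offset `(0, 0)` follows the Hugging Face convention and is an assumption for this proposal.

```swift
/// Toy result type for this sketch (the proposed struct would also carry `ids`).
struct ToyEncoding {
    let tokens: [String]
    let offsets: [(Int, Int)]
}

/// Splits on spaces and records each token's (start, end) offsets, end exclusive.
func toyEncodeWithOffsets(text: String, addSpecialTokens: Bool = true) -> ToyEncoding {
    var tokens: [String] = []
    var offsets: [(Int, Int)] = []
    var current = ""
    var tokenStart = 0

    // Closes the token being built, if any, ending just before `position`.
    func flush(at position: Int) {
        guard !current.isEmpty else { return }
        tokens.append(current)
        offsets.append((tokenStart, position))
        current = ""
    }

    for (position, character) in text.enumerated() {
        if character == " " {
            flush(at: position)
        } else {
            if current.isEmpty { tokenStart = position }
            current.append(character)
        }
    }
    flush(at: text.count)

    if addSpecialTokens {
        // Special tokens do not come from the text, so they get an empty span.
        tokens = ["[CLS]"] + tokens + ["[SEP]"]
        offsets = [(0, 0)] + offsets + [(0, 0)]
    }
    return ToyEncoding(tokens: tokens, offsets: offsets)
}

let encoding = toyEncodeWithOffsets(text: "John Smith works at Google")
for (token, span) in zip(encoding.tokens, encoding.offsets) {
    print(token, span)
}
```

One design point worth settling in the real API is the unit of the offsets. Swift `Character` counts (grapheme clusters) can differ from the Unicode code point indices that Python strings use, so the documentation should state which one `offsets` refers to.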

### Benefits
- Consistent with the Hugging Face Python API (`return_offsets_mapping=True`)
- No breaking changes to existing API (new method, not modification)

### Related Issues
This aligns with the tokenizer implementation pattern from the [Hugging Face transformers library](https://huggingface.co/docs/transformers/en/glossary#offsets-mapping), which has `offset_mapping` as a standard feature.

[Other example](https://discuss.huggingface.co/t/return-offsets-mapping-when-decoding/152215)


PS: This is my first time opening an issue, so apologies if it doesn't follow the correct format. And thank you for this repo, it has helped me a lot in getting started with ML.


Add offset mapping support to tokenizer.encode() #307