Add offset mapping support to tokenizer.encode() #307

@avariable2

Problem

When using BERT-based models for Named Entity Recognition (NER) tasks, I need to map token indices back to character positions in the
original text for entity extraction. Currently, there is no public API to obtain character offsets for tokens.

Use Case

A common NER workflow:

  1. Tokenize text: "John Smith works at Google"
  2. Run NER model → get entity predictions with token indices
  3. Use character offsets to map tokens back to the text → extract "John Smith" at character positions 0-10 (end exclusive)
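
Step 3 can be sketched as below. This is only an illustration: the `offsets` array is written by hand (it is exactly the data the current API does not provide), the half-open `(start, end)` convention is an assumption, and `entityText` is a hypothetical helper, not part of the library.

```swift
// Hand-written offsets for illustration only: one half-open (start, end)
// character range per word-level token. A real tokenizer would supply these.
let text = "John Smith works at Google"
let offsets: [(start: Int, end: Int)] = [
    (0, 4),    // John
    (5, 10),   // Smith
    (11, 16),  // works
    (17, 19),  // at
    (20, 26),  // Google
]

/// Hypothetical helper: returns the substring of `text` covered by the
/// tokens `firstToken...lastToken`, using their character offsets.
func entityText(in text: String,
                offsets: [(start: Int, end: Int)],
                firstToken: Int,
                lastToken: Int) -> String {
    // Swift strings are indexed by String.Index, so integer character
    // offsets have to be converted before slicing.
    let lower = text.index(text.startIndex, offsetBy: offsets[firstToken].start)
    let upper = text.index(text.startIndex, offsetBy: offsets[lastToken].end)
    return String(text[lower..<upper])
}

// Suppose the NER model tagged tokens 0...1 as a person and token 4 as an organization.
let person = entityText(in: text, offsets: offsets, firstToken: 0, lastToken: 1)
let org = entityText(in: text, offsets: offsets, firstToken: 4, lastToken: 4)
print(person)
print(org)
```

With real subword tokens the offsets list would also contain entries for word pieces and special tokens, but the mapping step stays the same.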

Current Limitation

The encode() and tokenize() methods only return token IDs and token strings, not their character positions in the original text. While TokenLatticeNode internally tracks startOffset and length, this information is not exposed publicly.
Example:

// Current API - no offset information
let ids = tokenizer.encode(text: "John Smith works at Google")
// Returns token IDs only, e.g. [101, 2054, ..., 3638, 1012, 102]
// But we need: token positions in the original text

Proposed Solution

Add a new method to the Tokenizer protocol that returns token offsets:

/// Result of encoding text with offset information
public struct EncodingWithOffsets {
    public let ids: [Int]                    // Token IDs
    public let tokens: [String]              // Token strings
    public let offsets: [(Int, Int)]         // [(start, end), ...] character positions
}

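// New requirement on the Tokenizer protocol. Swift protocol requirements cannot
// declare default argument values, so the `addSpecialTokens: true` default would
// be supplied by an overload in a protocol extension.
func encodeWithOffsets(text: String, addSpecialTokens: Bool) -> EncodingWithOffsets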
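
To make the expected output concrete, here is a toy sketch of offset tracking. It is not the library's tokenizer: it splits on whitespace only, skips IDs because no vocabulary is assumed, and `ToyEncoding` and `toyEncodeWithOffsets` are hypothetical names used just for this example. Offsets are assumed to be half-open `(start, end)` character positions.

```swift
/// Toy stand-in for the proposed result type. `ids` is left out because
/// this sketch assumes no vocabulary.
struct ToyEncoding {
    let tokens: [String]
    let offsets: [(Int, Int)]   // half-open (start, end) character positions
}

/// Splits `text` on whitespace and records where each token starts and ends.
func toyEncodeWithOffsets(text: String) -> ToyEncoding {
    var tokens: [String] = []
    var offsets: [(Int, Int)] = []
    var current = ""   // characters of the token being built
    var start = 0      // character position where `current` began

    for (position, character) in text.enumerated() {
        if character.isWhitespace {
            // Whitespace closes the token in progress, if there is one.
            if !current.isEmpty {
                tokens.append(current)
                offsets.append((start, position))
                current = ""
            }
        } else {
            if current.isEmpty { start = position }
            current.append(character)
        }
    }
    // Flush the final token when the text does not end in whitespace.
    if !current.isEmpty {
        tokens.append(current)
        offsets.append((start, text.count))
    }
    return ToyEncoding(tokens: tokens, offsets: offsets)
}

let encoding = toyEncodeWithOffsets(text: "John Smith works at Google")
for (token, span) in zip(encoding.tokens, encoding.offsets) {
    print(token, span.0, span.1)
}
```

A real implementation would need to carry the offsets through normalization and WordPiece splitting, and decide how to represent special tokens such as [CLS] and [SEP]; the Hugging Face fast tokenizers report (0, 0) for those.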

Benefits

  • Consistent with Hugging Face Python API (return_offsets_mapping=True)
  • No breaking changes to existing API (new method, not modification)

Related Issues

This aligns with the tokenizer implementation pattern in the Hugging Face transformers library, which provides offset_mapping as a standard feature.
Other example

PS: This is my first time opening an issue, so sorry if it does not fit the expected format. And thank you for this repo, it has helped me a lot in getting started with ML.
