Tokenizing with MLX data
========================

.. currentmodule:: mlx.data

MLX data allows sample transformations with the full flexibility of Python, which means that you could use any Python tokenizer in a :meth:`Buffer.key_transform`. However, such a transform is likely to be subject to the GIL, which means that effectively only one sample can be tokenized at a time.

A better choice is to use an :class:`mlx.data.core.CharTrie` to tokenize your data, taking full advantage of a multicore system. You can build the trie yourself or use one of the provided helpers to build a trie from a SentencePiece model or a plain text vocabulary file.

.. code-block:: python

   from mlx.data.core import CharTrie, Tokenizer

   # We can build a trie ourselves
   trie = CharTrie()
   for t in b"a quick brown fox jumped over the lazy dog".split():
       trie.insert(t)
   trie.insert(b" ")

   tokenizer = Tokenizer(trie)
   print(tokenizer.tokenize_shortest(b"a quick brown fox jumped over the lazy dog"))
   # [0, 9, 1, 9, 2, 9, 3, 9, 4, 9, 5, 9, 6, 9, 7, 9, 8]

   # We can also add all the letters in the trie and then tokenize anything we want
   import string
   for l in string.ascii_letters:
       trie.insert(bytes(l, "utf-8"))

   print(tokenizer.tokenize_shortest(b"This is a quick example"))
   # [54, 16, 17, 27, 9, 17, 27, 9, 0, 9, 1, 9, 13, 32, 0, 21, 24, 20, 13]

   # The more useful option is to read the trie from a file, for instance an spm model
   from mlx.data.tokenizer_helpers import read_trie_from_spm

   trie, weights = read_trie_from_spm("path/to/spm/model")
   tokenizer = Tokenizer(trie, trie_key_scores=weights)
   tokenizer.tokenize_shortest(b"This is some more text to tokenize")
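
To build some intuition for what a trie-based tokenizer does, the following is a minimal pure-Python sketch. It is not the MLX data implementation: ``ToyTrie`` and ``tokenize_fewest`` are hypothetical names introduced only for illustration, and the sketch assumes that token ids are assigned in insertion order and that "shortest" means the segmentation with the fewest tokens.

```python
class ToyTrie:
    """Toy byte-level trie (illustrative only, not mlx.data.core.CharTrie)."""

    def __init__(self):
        self.root = {}
        self.num_keys = 0

    def insert(self, token: bytes) -> int:
        node = self.root
        for byte in token:
            node = node.setdefault(byte, {})
        # The key None never collides with a byte value, so it marks
        # "a token ends here" and stores that token's id.
        if None not in node:
            node[None] = self.num_keys  # assumption: ids follow insertion order
            self.num_keys += 1
        return node[None]


def tokenize_fewest(trie: ToyTrie, text: bytes) -> list:
    """Split `text` into the fewest trie tokens with dynamic programming."""
    n = len(text)
    # best[i] holds (token count, ids) for the best split of text[:i],
    # or None when text[:i] cannot be split at all.
    best = [None] * (n + 1)
    best[0] = (0, [])
    for start in range(n):
        if best[start] is None:
            continue
        count, ids = best[start]
        node = trie.root
        # Walk the trie along text[start:], recording every token that ends.
        for end in range(start, n):
            node = node.get(text[end])
            if node is None:
                break
            if None in node:
                candidate = best[end + 1]
                if candidate is None or count + 1 < candidate[0]:
                    best[end + 1] = (count + 1, ids + [node[None]])
    if best[n] is None:
        raise ValueError("text cannot be tokenized with this trie")
    return best[n][1]


toy = ToyTrie()
for t in b"a quick brown fox jumped over the lazy dog".split():
    toy.insert(t)
toy.insert(b" ")

print(tokenize_fewest(toy, b"a quick brown fox jumped over the lazy dog"))
# [0, 9, 1, 9, 2, 9, 3, 9, 4, 9, 5, 9, 6, 9, 7, 9, 8]
```

The sketch runs on a single thread and only illustrates the data structure. The multicore benefit described above comes from the tokenization happening inside MLX data rather than in Python code.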

.. autosummary::
   :toctree: _autosummary
   :template: data_core_modules.rst
   :recursive:

   core.Tokenizer
   core.CharTrie
   tokenizer_helpers.read_trie_from_vocab
   tokenizer_helpers.read_trie_from_spm
   tokenizer_helpers.read_bpe_from_spm
   tokenizer_helpers.read_bpe_from_hf
   tokenizer_helpers.gpt2_byte_map
