
This subtree contains libraries and utilities for running generative AI, including Large Language Models (LLMs), with ExecuTorch. Below is a list of subfolders.

export

Model preparation code lives in the export folder. The main entry point is the LLMEdgeManager class. It hosts a torch.nn.Module and provides a set of methods to prepare the LLM model for the ExecuTorch runtime. Note that ExecuTorch supports two quantization APIs: eager mode quantization (also known as source-transform-based quantization) and PyTorch 2 Export based quantization (also known as pt2e quantization).

Commonly used methods in this class include:

  • set_output_dir: set the directory where the exported .pte file will be saved.
  • to_dtype: override the data type of the module.
  • source_transform: execute a series of source transform passes. Some transform passes include:
    • weight-only quantization, which can be done at the source (eager mode) level.
    • replacing some torch operators with a custom operator, for example replace_sdpa_with_custom_op.
  • torch.export: get a graph that is ready for pt2e graph-based quantization.
  • pt2e_quantize: quantize the exported graph with the passed-in quantizers.
    • Utility functions in quantizer_lib.py can help construct different quantizers as needed.
  • export_to_edge: export to the edge dialect.
  • to_backend: lower the graph to an acceleration backend.
  • to_executorch: get the ExecuTorch graph with optional optimization passes.
  • save_to_pte: finally, save the lowered and optimized graph to a .pte file for the runtime.
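The steps above typically run in sequence, each returning the manager so calls can be chained. The following is a minimal sketch of that pipeline shape. Note that the class below is a hypothetical stand-in that only records which stages run, in order; the real LLMEdgeManager lives in executorch and its method signatures and arguments may differ.

```python
class FakeManager:
    """Stand-in that records method calls to illustrate the export
    pipeline shape; not the real executorch LLMEdgeManager."""

    def __init__(self):
        self.steps = []

    def __getattr__(self, name):
        def step(*args, **kwargs):
            self.steps.append(name)
            return self  # each stage returns the manager, enabling chaining
        return step


manager = FakeManager()
(manager
    .set_output_dir("out")
    .to_dtype("fp32")
    .source_transform([])
    .export()
    .pt2e_quantize(None)
    .export_to_edge()
    .to_backend(None)
    .to_executorch()
    .save_to_pte("model.pte"))
# manager.steps now lists the stages in the order they ran
```

The order matters: source transforms happen on the eager module, pt2e quantization happens on the exported graph, and backend lowering happens on the edge dialect.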

Example usages of LLMEdgeManager can be found in executorch/examples/models/llama and executorch/examples/models/llava.

Once the .pte file is exported and saved, it can be loaded and run by a runner (see below).

tokenizer

Currently, we support two types of tokenizers: sentencepiece and Tiktoken.

  • In Python:
    • utils.py: get the tokenizer from a model file path, based on the file format.
    • tokenizer.py: rewrite a sentencepiece tokenizer model to a serialization format that the runtime can load.
  • In C++:
    • tokenizer.h: a simple tokenizer interface. Actual tokenizer classes can be implemented based on this. In this folder, we provide two tokenizer implementations:
      • bpe_tokenizer. Note: the BPE tokenizer requires the rewritten tokenizer artifact (see tokenizer.py above) to work.
      • tiktoken, for Llama 3 and Llama 3.1.
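The core contract a tokenizer implementation must satisfy is an encode that maps text to token IDs (optionally adding BOS/EOS markers) and a decode that maps IDs back to text. Below is a toy word-level tokenizer illustrating that interface shape in Python; it is not the real sentencepiece, BPE, or Tiktoken code, and the special-token IDs are made up for the example.

```python
class ToyTokenizer:
    """Toy word-level tokenizer sketching the encode/decode contract
    that a tokenizer interface like tokenizer.h describes."""

    BOS_ID, EOS_ID = 1, 2  # illustrative special-token IDs

    def __init__(self, vocab):
        # Regular token IDs start after the reserved special tokens.
        self.token_to_id = {tok: i + 3 for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, text, bos=True, eos=False):
        ids = [self.token_to_id[w] for w in text.split()]
        if bos:
            ids = [self.BOS_ID] + ids
        if eos:
            ids = ids + [self.EOS_ID]
        return ids

    def decode(self, ids):
        # Skip special tokens when reconstructing text.
        return " ".join(self.id_to_token[i] for i in ids
                        if i not in (self.BOS_ID, self.EOS_ID))


tok = ToyTokenizer(["hello", "world"])
ids = tok.encode("hello world")        # → [1, 3, 4]
assert tok.decode(ids) == "hello world"  # round-trips
```

Real tokenizers differ mainly in how they split text (subword merges for BPE, regex pre-tokenization plus ranked merges for Tiktoken), not in this outer interface.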

sampler

A sampler class in C++ that samples from the logits given some hyperparameters.
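To make the role of those hyperparameters concrete, here is a small reference sampler in Python: greedy (argmax) at temperature 0, otherwise temperature-scaled softmax with optional nucleus (top-p) filtering. The parameter names mirror common LLM samplers, not necessarily the exact C++ API.

```python
import math
import random


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token index from raw logits."""
    if temperature == 0.0:
        # Greedy decoding: always pick the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled, numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus filtering: keep the smallest set of top tokens
    # whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample from the renormalized kept set.
    r = rng.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Lower temperatures sharpen the distribution toward the argmax; smaller top-p values cut off the low-probability tail before sampling.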

custom_ops

Contains custom operators, such as:

  • custom sdpa: implements CPU flash attention and avoids copies by taking the kv cache as one of its arguments.
    • custom_ops.py, op_sdpa_aot.cpp: custom op definition in PyTorch with C++ registration.
    • op_sdpa.cpp: the optimized operator implementation and registration of sdpa_with_kv_cache.out.
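Conceptually, an SDPA op that takes the KV cache as an argument appends the new key/value to the cache in place and attends the query over all cached positions, rather than materializing fresh key/value tensors each step. A minimal single-head, single-query reference in pure Python (illustrative only; the real op is an optimized C++ flash-attention kernel, and this signature is an assumption for the sketch):

```python
import math


def sdpa_with_kv_cache(q, k_new, v_new, k_cache, v_cache):
    """Single-head reference: append the new key/value to the caches,
    then attend the query over every cached position. Mutating the
    caches in place is what lets a real kernel avoid copies."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    scale = 1.0 / math.sqrt(len(q))
    # Scaled dot products between the query and every cached key.
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_cache]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the cached values.
    dim = len(v_cache[0])
    return [sum(w * v[d] for w, v in zip(weights, v_cache)) for d in range(dim)]
```

With a single cached position the softmax weight is 1.0, so the output is just that position's value vector; as decoding proceeds, the caches grow by one entry per step.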

runner

This folder hosts the library components used in a C++ LLM runner. Currently, it hosts stats.h, which tracks runtime statistics such as token counts and latency.
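A typical derived metric from such counters is generation throughput. The sketch below models the idea in Python; the field and method names are illustrative, not the actual stats.h members.

```python
from dataclasses import dataclass


@dataclass
class RunnerStats:
    """Toy analogue of the kind of counters a runner's stats header
    tracks (token counts and latency); names are hypothetical."""
    num_generated_tokens: int = 0
    generation_time_ms: float = 0.0

    def tokens_per_second(self) -> float:
        # Throughput = tokens generated / elapsed seconds.
        if self.generation_time_ms == 0:
            return 0.0
        return self.num_generated_tokens / (self.generation_time_ms / 1000.0)
```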

With the components above, an actual runner can be built for a model or a family of models. An example is in //executorch/examples/models/llama/runner, where C++ runner code is built to run Llama 2, Llama 3, Llama 3.1, and other models sharing the same architecture.

Example usages can also be found in the torchchat repo.