
This subtree contains libraries and utilities for running generative AI, including Large Language Models (LLMs), with ExecuTorch. Below is a list of subfolders.

export

Model preparation code lives in the export folder. The main entry point is the LLMEdgeManager class. It hosts a torch.nn.Module and provides a set of methods to prepare the LLM model for the ExecuTorch runtime. Note that ExecuTorch supports two quantization APIs: eager mode quantization (also known as source-transform-based quantization) and PyTorch 2 Export based quantization (also known as pt2e quantization).

Commonly used methods in this class include:

  • set_output_dir: set the directory where the exported .pte file will be saved.
  • to_dtype: override the data type of the module.
  • source_transform: execute a series of source transform passes. Some transform passes include:
    • weight-only quantization, which can be done at the source (eager mode) level.
    • replacing some torch operators with a custom operator, for example replace_sdpa_with_custom_op.
  • torch.export: get a graph that is ready for pt2e graph-based quantization.
  • pt2e_quantize: quantize the exported graph with the passed-in quantizers.
    • Utility functions in quantizer_lib.py can help construct different quantizers as needed.
  • export_to_edge: export to the edge dialect.
  • to_backend: lower the graph to an acceleration backend.
  • to_executorch: get the ExecuTorch graph with optional optimization passes.
  • save_to_pte: finally, save the lowered and optimized graph to a .pte file for the runtime.
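The steps above typically run in sequence, each returning the manager so calls can be chained. The following is a minimal sketch of that pipeline shape. Note that the class below is a hypothetical stand-in that only records which stages run, in order; the real LLMEdgeManager lives in executorch and its method signatures and arguments may differ.

```python
class FakeManager:
    """Stand-in that records method calls to illustrate the export
    pipeline shape; not the real executorch LLMEdgeManager."""

    def __init__(self):
        self.steps = []

    def __getattr__(self, name):
        def step(*args, **kwargs):
            self.steps.append(name)
            return self  # each stage returns the manager, enabling chaining
        return step


manager = FakeManager()
(manager
    .set_output_dir("out")
    .to_dtype("fp32")
    .source_transform([])
    .export()
    .pt2e_quantize(None)
    .export_to_edge()
    .to_backend(None)
    .to_executorch()
    .save_to_pte("model.pte"))
# manager.steps now lists the stages in the order they ran
```

The order matters: source transforms happen on the eager module, pt2e quantization happens on the exported graph, and backend lowering happens on the edge dialect.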

Example usages of LLMEdgeManager can be found in executorch/examples/models/llama and executorch/examples/models/llava.

Once the .pte file is exported and saved, it can be loaded and run by a runner (see below).

tokenizer

Currently, we support two types of tokenizers: sentencepiece and Tiktoken.

  • In Python:
    • utils.py: get the tokenizer from a model file path, based on the file format.
    • tokenizer.py: rewrite a sentencepiece tokenizer model to a serialization format that the runtime can load.
  • In C++:
    • tokenizer.h: a simple tokenizer interface. Actual tokenizer classes can be implemented based on this. In this folder, we provide two tokenizer implementations:
      • bpe_tokenizer. Note: the BPE tokenizer requires the rewritten tokenizer artifact (see tokenizer.py above) to work.
      • tiktoken, for Llama 3 and Llama 3.1.
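The core contract a tokenizer implementation must satisfy is an encode that maps text to token IDs (optionally adding BOS/EOS markers) and a decode that maps IDs back to text. Below is a toy word-level tokenizer illustrating that interface shape in Python; it is not the real sentencepiece, BPE, or Tiktoken code, and the special-token IDs are made up for the example.

```python
class ToyTokenizer:
    """Toy word-level tokenizer sketching the encode/decode contract
    that a tokenizer interface like tokenizer.h describes."""

    BOS_ID, EOS_ID = 1, 2  # illustrative special-token IDs

    def __init__(self, vocab):
        # Regular token IDs start after the reserved special tokens.
        self.token_to_id = {tok: i + 3 for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, text, bos=True, eos=False):
        ids = [self.token_to_id[w] for w in text.split()]
        if bos:
            ids = [self.BOS_ID] + ids
        if eos:
            ids = ids + [self.EOS_ID]
        return ids

    def decode(self, ids):
        # Skip special tokens when reconstructing text.
        return " ".join(self.id_to_token[i] for i in ids
                        if i not in (self.BOS_ID, self.EOS_ID))


tok = ToyTokenizer(["hello", "world"])
ids = tok.encode("hello world")        # → [1, 3, 4]
assert tok.decode(ids) == "hello world"  # round-trips
```

Real tokenizers differ mainly in how they split text (subword merges for BPE, regex pre-tokenization plus ranked merges for Tiktoken), not in this outer interface.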

sampler

A sampler class in C++ that samples from the logits given some hyperparameters.
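To make the role of those hyperparameters concrete, here is a small reference sampler in Python: greedy (argmax) at temperature 0, otherwise temperature-scaled softmax with optional nucleus (top-p) filtering. The parameter names mirror common LLM samplers, not necessarily the exact C++ API.

```python
import math
import random


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token index from raw logits."""
    if temperature == 0.0:
        # Greedy decoding: always pick the highest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled, numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Nucleus filtering: keep the smallest set of top tokens
    # whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample from the renormalized kept set.
    r = rng.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Lower temperatures sharpen the distribution toward the argmax; smaller top-p values cut off the low-probability tail before sampling.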

custom_ops

Contains custom operators, such as:

  • custom sdpa: implements CPU flash attention and avoids copies by taking the kv cache as one of its arguments.
    • custom_ops.py, op_sdpa_aot.cpp: custom op definition in PyTorch with C++ registration.
    • op_sdpa.cpp: the optimized operator implementation and registration of sdpa_with_kv_cache.out.
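Conceptually, an SDPA op that takes the KV cache as an argument appends the new key/value to the cache in place and attends the query over all cached positions, rather than materializing fresh key/value tensors each step. A minimal single-head, single-query reference in pure Python (illustrative only; the real op is an optimized C++ flash-attention kernel, and this signature is an assumption for the sketch):

```python
import math


def sdpa_with_kv_cache(q, k_new, v_new, k_cache, v_cache):
    """Single-head reference: append the new key/value to the caches,
    then attend the query over every cached position. Mutating the
    caches in place is what lets a real kernel avoid copies."""
    k_cache.append(k_new)
    v_cache.append(v_new)
    scale = 1.0 / math.sqrt(len(q))
    # Scaled dot products between the query and every cached key.
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_cache]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the cached values.
    dim = len(v_cache[0])
    return [sum(w * v[d] for w, v in zip(weights, v_cache)) for d in range(dim)]
```

With a single cached position the softmax weight is 1.0, so the output is just that position's value vector; as decoding proceeds, the caches grow by one entry per step.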

runner

This folder hosts the library components used in a C++ LLM runner. Currently, it hosts stats.h, which tracks runtime statistics such as token counts and latency.
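A typical derived metric from such counters is generation throughput. The sketch below models the idea in Python; the field and method names are illustrative, not the actual stats.h members.

```python
from dataclasses import dataclass


@dataclass
class RunnerStats:
    """Toy analogue of the kind of counters a runner's stats header
    tracks (token counts and latency); names are hypothetical."""
    num_generated_tokens: int = 0
    generation_time_ms: float = 0.0

    def tokens_per_second(self) -> float:
        # Throughput = tokens generated / elapsed seconds.
        if self.generation_time_ms == 0:
            return 0.0
        return self.num_generated_tokens / (self.generation_time_ms / 1000.0)
```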

With the components above, an actual runner can be built for a model or a family of models. An example is in //executorch/examples/models/llama/runner, where C++ runner code is built to run Llama 2, Llama 3, Llama 3.1, and other models sharing the same architecture.

Example usages can also be found in the torchchat repo.