encoder

Scala

The Scala code in this project is a library for accessing models generated by the Python code. It is typically used from within projects like processors, and instructions for incorporating it can be found in the main README file.

Python

Models

The encoder depends on a large number of libraries, which are best kept out of your computer's global Python environment. Instead, use conda or venv to create a local environment for them.

conda create --name env
conda activate env

or

/bin/python3.9 -m venv env
source env/bin/activate
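To confirm that one of the environments above is actually active before installing anything, a small check like the following can help. It is only a sketch: the function names are hypothetical, and it assumes that conda exports the CONDA_DEFAULT_ENV variable on activation and that an environment named "base" should not count as isolated.

```python
import os
import sys
from typing import Mapping, Optional


def in_venv() -> bool:
    """True when the interpreter runs inside a venv/virtualenv environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


def active_conda_env(environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Name of the active conda environment, or None if none is set."""
    # Assumption: conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return environ.get("CONDA_DEFAULT_ENV") or None


def is_isolated(environ: Mapping[str, str] = os.environ) -> bool:
    """True if either a venv or a non-base conda environment is active."""
    conda_env = active_conda_env(environ)
    return in_venv() or (conda_env is not None and conda_env != "base")


if __name__ == "__main__":
    if is_isolated():
        print("Local environment detected; safe to install libraries.")
    else:
        print("No local environment detected; activate one first.")
```

Both checks are needed because a conda environment ships its own interpreter, so the venv test alone does not detect it.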

For the Python part of this encoder subproject, a requirements.txt file is provided. Run something like

pip install -r requirements.txt

from the subproject directory to ensure that you have all the necessary Python modules installed and that their versions match expectations.
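If you want to verify afterwards that the installed versions really match the pinned ones, something along these lines may be useful. This is a minimal sketch, not part of the project: the helper names are hypothetical, and it only understands simple name==version lines of the kind pip freeze writes, skipping everything else.

```python
from importlib import metadata
from pathlib import Path
from typing import Dict, List, Tuple


def parse_pins(text: str) -> Dict[str, str]:
    """Collect name -> version for every simple 'name==version' line."""
    pins: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # blank lines, editable installs, URLs, and so on
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (name, wanted, found) for each pin the environment violates."""
    problems: List[Tuple[str, str, str]] = []
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = "not installed"
        if found != wanted:
            problems.append((name, wanted, found))
    return problems


if __name__ == "__main__":
    # Run from the subproject directory, next to requirements.txt.
    pins = parse_pins(Path("requirements.txt").read_text())
    for name, wanted, found in find_mismatches(pins):
        print(f"{name}: expected {wanted}, found {found}")
```

An empty result means every pinned library is installed at the expected version.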

If you add a library, update requirements.txt with

pip freeze > requirements.txt

To run the tests from the subproject directory, use

pytest

To check the type hints, run the following in the src/main/python directory

mypy *.py

Tokenizers

The Python code in this subproject is also used to download Hugging Face tokenizers and convert them into Rust format for use with the tokenizer subproject. The program that does this is save_pretrained.py. It downloads the specified tokenizers and saves them in a local directory. From there, they can be copied manually to the tokenizer resource directory and published with an updated release of that subproject.
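For orientation, the download-and-save step might look roughly like the sketch below. This is not the contents of save_pretrained.py: the function names, the default "tokenizers" output directory, and the "bert-base-cased" example name are all assumptions made for illustration. It uses AutoTokenizer.from_pretrained and save_pretrained from the transformers package. For fast tokenizers, save_pretrained typically writes a tokenizer.json file, which is the serialization that the Rust tokenizers library reads.

```python
from pathlib import Path


def target_dir(base_dir: str, tokenizer_name: str) -> Path:
    """Local directory for one tokenizer (hypothetical layout)."""
    # Hub names may look like "org/model"; flatten the slash so that
    # each tokenizer lands in a single directory under base_dir.
    return Path(base_dir) / tokenizer_name.replace("/", "_")


def save_tokenizer(tokenizer_name: str, base_dir: str = "tokenizers") -> Path:
    """Download a Hugging Face tokenizer and save it locally."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoTokenizer

    out_dir = target_dir(base_dir, tokenizer_name)
    out_dir.mkdir(parents=True, exist_ok=True)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    tokenizer.save_pretrained(str(out_dir))
    return out_dir


if __name__ == "__main__":
    # "bert-base-cased" is only an example name; substitute the tokenizers you need.
    print(save_tokenizer("bert-base-cased"))
```

The saved directory can then be copied to the tokenizer resource directory as described above.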