The Scala code in this project constitutes a library that is used to access models generated by the Python code. This typically happens inside projects such as processors, and instructions for incorporating the library can be found in the main README file.
The encoder depends on a large number of libraries, which probably should not be installed into the global Python environment of your computer. Instead, use conda or venv to create a local environment for the libraries.
conda create --name env
conda activate env
or
/bin/python3.9 -m venv env
source env/bin/activate
For the Python part of this encoder subproject, a requirements.txt file is provided. Run something like
pip install -r requirements.txt
from the subproject directory to ensure that you have all the necessary Python modules installed and that their versions match expectations.
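If you want to confirm that the active environment really matches requirements.txt, a check along the following lines may help. It is only a sketch and not part of this subproject: the helper names parse_pins and find_mismatches are hypothetical, and it assumes the file contains simple name==version pins such as those written by pip freeze. Entries with extras, URLs, or version ranges are skipped.

```python
from importlib import metadata


def parse_pins(lines):
    """Return {name: version} for lines of the form 'name==version'.

    Hypothetical helper: anything that is not a simple '==' pin is ignored.
    """
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0]   # drop trailing comments
        line = line.split(";", 1)[0]  # drop environment markers
        line = line.strip()
        if "==" not in line:
            continue  # blank lines, options such as -r, unpinned entries
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {name: (expected, found)}; found is None if not installed."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches
```

To use it, open requirements.txt, pass the file object to parse_pins, and print whatever find_mismatches returns. An empty result means every pinned module is installed at the expected version.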
If you add a library, update requirements.txt by running
pip freeze > requirements.txt
To run the tests from the subproject directory, use
pytest
To check the type hinting, run in the src/main/python directory
mypy *.py
The Python code from this subproject is also used to download Hugging Face tokenizers and convert them into Rust format for use with the tokenizer subproject. The program to do that is save_pretrained.py. It will download the specified tokenizers and save them in a local directory. From there they can be manually copied to the tokenizer resource directory and published with an updated release of that subproject.
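The download-and-save step can be pictured roughly as follows. This is a sketch under assumptions, not the contents of save_pretrained.py: the function names save_tokenizers and load_from_hub and the directory naming are hypothetical, and it assumes the Hugging Face transformers package is used for the download. For fast tokenizers, save_pretrained typically writes a tokenizer.json file, which is the format the Rust tokenizers library reads.

```python
from pathlib import Path


def save_tokenizers(names, output_dir, load):
    """Load each named tokenizer with `load` and save it under output_dir.

    Hypothetical helper: `load` is any callable that returns an object
    with a save_pretrained(directory) method.
    """
    saved = []
    for name in names:
        # Replace "/" so that names like "org/model" map to one directory.
        target = Path(output_dir) / name.replace("/", "_")
        target.mkdir(parents=True, exist_ok=True)
        load(name).save_pretrained(str(target))
        saved.append(target)
    return saved


def load_from_hub(name):
    """Download a tokenizer from the Hugging Face Hub (needs network access)."""
    # Imported lazily so save_tokenizers works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(name)
```

Calling save_tokenizers with a list of tokenizer names, a local directory, and load_from_hub would then leave one directory per tokenizer, ready to be copied by hand to the tokenizer resource directory.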