|
1 | 1 | # llm-uservice |
2 | 2 |
|
3 | | -Helm chart for deploying LLM microservice. |
| 3 | +Helm chart for deploying OPEA LLM microservices. |
4 | 4 |
|
5 | | -llm-uservice depends on TGI, you should set TGI_LLM_ENDPOINT as tgi endpoint. |
| 5 | +## Installing the chart |
6 | 6 |
|
7 | | -## (Option1): Installing the chart separately |
| 7 | +`llm-uservice` depends on one of the following inference backend services: |
8 | 8 |
|
9 | | -First, you need to install the tgi chart, please refer to the [tgi](../tgi) chart for more information. |
| 9 | +- TGI: please refer to the [tgi](../tgi) chart for more information |
10 | 10 |
|
11 | | -After you've deployted the tgi chart successfully, please run `kubectl get svc` to get the tgi service endpoint, i.e. `http://tgi`. |
| 11 | +- vLLM: please refer to the [vllm](../vllm) chart for more information |
12 | 12 |
|
13 | | -To install the chart, run the following: |
| 13 | +First, install one of the dependent charts, i.e. the `tgi` or `vllm` Helm chart. |
14 | 14 |
|
15 | | -```console |
16 | | -cd GenAIInfra/helm-charts/common/llm-uservice |
17 | | -export HFTOKEN="insert-your-huggingface-token-here" |
18 | | -export TGI_LLM_ENDPOINT="http://tgi" |
19 | | -helm dependency update |
20 | | -helm install llm-uservice . --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set TGI_LLM_ENDPOINT=${TGI_LLM_ENDPOINT} --wait |
21 | | -``` |
| 15 | +After you've deployed the dependent chart successfully, please run `kubectl get svc` to get the backend inference service endpoint, e.g. `http://tgi`, `http://vllm`. |
22 | 16 |
|
23 | | -## (Option2): Installing the chart with dependencies automatically |
| 17 | +To install the `llm-uservice` chart, run the following: |
24 | 18 |
|
25 | 19 | ```console |
26 | 20 | cd GenAIInfra/helm-charts/common/llm-uservice |
27 | | -export HFTOKEN="insert-your-huggingface-token-here" |
28 | 21 | helm dependency update |
29 | | -helm install llm-uservice . --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set tgi.enabled=true --wait |
| 22 | +export HFTOKEN="insert-your-huggingface-token-here" |
| 23 | +# set backend inference service endpoint URL |
| 24 | +# for tgi |
| 25 | +export LLM_ENDPOINT="http://tgi" |
| 26 | +# for vllm |
| 27 | +# export LLM_ENDPOINT="http://vllm" |
| 28 | + |
| 29 | +# set the same model used by the backend inference service |
| 30 | +export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3" |
| 31 | + |
| 32 | +# install llm-textgen with TGI backend |
| 33 | +helm install llm-uservice . --set TEXTGEN_BACKEND="TGI" --set LLM_ENDPOINT=${LLM_ENDPOINT} --set LLM_MODEL_ID=${LLM_MODEL_ID} --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --wait |
| 34 | + |
| 35 | +# install llm-textgen with vLLM backend |
| 36 | +# helm install llm-uservice . --set TEXTGEN_BACKEND="vLLM" --set LLM_ENDPOINT=${LLM_ENDPOINT} --set LLM_MODEL_ID=${LLM_MODEL_ID} --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --wait |
| 37 | + |
| 38 | +# install llm-docsum with TGI backend |
| 39 | +# helm install llm-uservice . --set image.repository="opea/llm-docsum" --set DOCSUM_BACKEND="TGI" --set LLM_ENDPOINT=${LLM_ENDPOINT} --set LLM_MODEL_ID=${LLM_MODEL_ID} --set MAX_INPUT_TOKENS=2048 --set MAX_TOTAL_TOKENS=4096 --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --wait |
| 40 | + |
| 41 | +# install llm-docsum with vLLM backend |
| 42 | +# helm install llm-uservice . --set image.repository="opea/llm-docsum" --set DOCSUM_BACKEND="vLLM" --set LLM_ENDPOINT=${LLM_ENDPOINT} --set LLM_MODEL_ID=${LLM_MODEL_ID} --set MAX_INPUT_TOKENS=2048 --set MAX_TOTAL_TOKENS=4096 --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --wait |
| 43 | + |
| 44 | +# install llm-faqgen with TGI backend |
| 45 | +# helm install llm-uservice . --set image.repository="opea/llm-faqgen" --set FAQGEN_BACKEND="TGI" --set LLM_ENDPOINT=${LLM_ENDPOINT} --set LLM_MODEL_ID=${LLM_MODEL_ID} --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --wait |
| 46 | + |
| 47 | +# install llm-faqgen with vLLM backend |
| 48 | +# helm install llm-uservice . --set image.repository="opea/llm-faqgen" --set FAQGEN_BACKEND="vLLM" --set LLM_ENDPOINT=${LLM_ENDPOINT} --set LLM_MODEL_ID=${LLM_MODEL_ID} --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --wait |
30 | 49 | ``` |
31 | 50 |
|
32 | 51 | ## Verify |
33 | 52 |
|
34 | 53 | To verify the installation, run the command `kubectl get pod` to make sure all pods are running. |
35 | 54 |
|
36 | | -Then run the command `kubectl port-forward svc/llm-uservice 9000:9000` to expose the llm-uservice service for access. |
| 55 | +Then run the command `kubectl port-forward svc/llm-uservice 9000:9000` to expose the service for access. |
37 | 56 |
|
38 | 57 | Open another terminal and run the following command to verify that the service is working: |
39 | 58 |
|
40 | 59 | ```console |
| 60 | +# for llm-textgen service |
41 | 61 | curl http://localhost:9000/v1/chat/completions \ |
42 | | - -X POST \ |
43 | | - -d '{"query":"What is Deep Learning?","max_tokens":17,"top_k":10,"top_p":0.95,"typical_p":0.95,"temperature":0.01,"repetition_penalty":1.03,"streaming":true}' \ |
44 | | - -H 'Content-Type: application/json' |
| 62 | + -X POST \ |
| 63 | + -d '{"model": "'"${LLM_MODEL_ID}"'", "messages": "What is Deep Learning?", "max_tokens":17}' \ |
| 64 | + -H 'Content-Type: application/json' |
| 65 | + |
| 66 | +# for llm-docsum service |
| 67 | +curl http://localhost:9000/v1/docsum \ |
| 68 | + -X POST \ |
| 69 | + -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.", "max_tokens":32, "language":"en"}' \ |
| 70 | + -H 'Content-Type: application/json' |
| 71 | + |
| 72 | +# for llm-faqgen service |
| 73 | +curl http://localhost:9000/v1/faqgen \ |
| 74 | + -X POST \ |
| 75 | + -d '{"query":"Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5.","max_tokens": 128}' \ |
| 76 | + -H 'Content-Type: application/json' |
45 | 77 | ``` |
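
A note on the llm-textgen request body: the shell does not expand `${LLM_MODEL_ID}` inside single quotes, so the model name has to be substituted before the body reaches `curl`. The following is a minimal sketch of one way to do that, assuming a POSIX shell. `build_textgen_payload` is a hypothetical helper name and is not part of the chart.

```shell
#!/bin/sh
# Hypothetical helper (not part of the chart): builds the llm-textgen
# request body so the model name is substituted by the shell instead of
# being sent as the literal text ${LLM_MODEL_ID}.
build_textgen_payload() {
  model="$1"
  prompt="$2"
  # printf fills the %s placeholders. This sketch does no JSON escaping,
  # so it assumes neither value contains double quotes or backslashes.
  printf '{"model": "%s", "messages": "%s", "max_tokens": 17}' "$model" "$prompt"
}

# Fall back to the chart's default model when the variable is not exported.
LLM_MODEL_ID="${LLM_MODEL_ID:-Intel/neural-chat-7b-v3-3}"
payload=$(build_textgen_payload "$LLM_MODEL_ID" "What is Deep Learning?")
echo "$payload"
```

The resulting string can then be passed to the request with `-d "$payload"` (double quotes, so the variable expands).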
46 | 78 |
|
47 | 79 | ## Values |
48 | 80 |
|
49 | | -| Key | Type | Default | Description | |
50 | | -| ------------------------------- | ------ | ---------------- | ------------------------------- | |
51 | | -| global.HUGGINGFACEHUB_API_TOKEN | string | `""` | Your own Hugging Face API token | |
52 | | -| image.repository | string | `"opea/llm-tgi"` | | |
53 | | -| service.port | string | `"9000"` | | |
54 | | -| TGI_LLM_ENDPOINT | string | `""` | LLM endpoint | |
55 | | -| global.monitoring | bool | `false` | Service usage metrics | |
| 81 | +| Key | Type | Default | Description | |
| 82 | +| ------------------------------- | ------ | ----------------------------- | -------------------------------------------------------------------------------- | |
| 83 | +| global.HUGGINGFACEHUB_API_TOKEN | string | `""` | Your own Hugging Face API token | |
| 84 | +| image.repository | string | `"opea/llm-textgen"` | one of "opea/llm-textgen", "opea/llm-docsum", "opea/llm-faqgen" | |
| 85 | +| LLM_ENDPOINT | string | `""` | backend inference service endpoint | |
| 86 | +| LLM_MODEL_ID | string | `"Intel/neural-chat-7b-v3-3"` | model used by the inference backend | |
| 87 | +| TEXTGEN_BACKEND | string | `"tgi"` | backend inference engine, only valid for llm-textgen image, one of "TGI", "vLLM" | |
| 88 | +| DOCSUM_BACKEND | string | `"tgi"` | backend inference engine, only valid for llm-docsum image, one of "TGI", "vLLM" | |
| 89 | +| FAQGEN_BACKEND | string | `"tgi"` | backend inference engine, only valid for llm-faqgen image, one of "TGI", "vLLM" | |
| 90 | +| global.monitoring | bool | `false` | Service usage metrics | |
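
Each image reads its own backend key (`TEXTGEN_BACKEND`, `DOCSUM_BACKEND`, or `FAQGEN_BACKEND`), so the `--set` flags must match the chosen image. The sketch below shows that pairing; `helm_install_cmd` is a hypothetical helper, and it only prints the command rather than running `helm`. The remaining flags from the install section (`LLM_ENDPOINT`, `LLM_MODEL_ID`, the Hugging Face token, and the token limits for docsum) would still need to be appended.

```shell
#!/bin/sh
# Hypothetical helper (not part of the chart): pairs a service flavor with
# the backend key that image reads, and prints the matching helm command.
# Nothing is installed; the command is only echoed for inspection.
helm_install_cmd() {
  flavor="$1"   # textgen | docsum | faqgen
  backend="$2"  # TGI | vLLM
  case "$flavor" in
    textgen) key="TEXTGEN_BACKEND" ;;
    docsum)  key="DOCSUM_BACKEND" ;;
    faqgen)  key="FAQGEN_BACKEND" ;;
    *) echo "unknown flavor: $flavor" >&2; return 1 ;;
  esac
  printf 'helm install llm-uservice . --set image.repository=opea/llm-%s --set %s=%s\n' \
    "$flavor" "$key" "$backend"
}

helm_install_cmd docsum vLLM
```

Setting a backend key that does not belong to the selected image (for example `TEXTGEN_BACKEND` with `opea/llm-docsum`) has no effect according to the table above, which is why the helper derives the key from the flavor.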