| Navigate to |[Part 2: Improving Resource Utilization](../Part_2-improving_resource_utilization/)|[Documentation: Model Repository](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_repository.md)|[Documentation: Model Configuration](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md)|
Any deep learning inference serving solution needs to tackle two fundamental challenges:
* Managing multiple models.
* Versioning, loading, and unloading models.
+
## Before we begin
+
+
The conceptual guide aims to educate developers about the challenges faced whilst building inference infrastructure for deploying deep learning pipelines. `Part 1 - Part 5` of this guide build towards solving a simple problem: deploying a performant and scalable pipeline for transcribing text from images. This pipeline includes 5 steps:
+
* Pre-process the raw image
+
* Detect "Text boxes" (Text Detection Model)
+
* Crop and pre-process text
+
* Recognize Text (Text Recognition Model)
+
* Final post-processing
+
+
**In `Part 1`, we start by deploying both the models on Triton but the pre/post processing steps are done on client side.**
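To make the client/server split concrete, here is a minimal sketch of the five steps, assuming the two models are reachable through some `infer(model_name, payload)` call. Every name in it (`run_pipeline`, `fake_infer`, the helper functions, and the model names) is hypothetical and only stands in for real image processing and real Triton requests.

```python
# Hypothetical sketch: steps 1, 3 and 5 run on the client, steps 2 and 4 are server calls.

def preprocess(raw_image):
    # Placeholder: a real client would resize and normalize the image here.
    return raw_image

def crop_and_preprocess(image, boxes):
    # Placeholder: cut one crop per detected box (boxes are row ranges here).
    return [image[start:end] for (start, end) in boxes]

def postprocess(texts):
    # Placeholder: join the recognized fragments into one string.
    return " ".join(texts)

def run_pipeline(raw_image, infer):
    image = preprocess(raw_image)                  # 1. client side
    boxes = infer("text_detection", image)         # 2. server side (network call)
    crops = crop_and_preprocess(image, boxes)      # 3. client side
    texts = infer("text_recognition", crops)       # 4. server side (network call)
    return postprocess(texts)                      # 5. client side

# Stand-in for the network calls so the sketch runs without a server.
calls = []

def fake_infer(model_name, payload):
    calls.append(model_name)
    if model_name == "text_detection":
        return [(0, 1), (1, 2)]                    # two fake "text boxes"
    return ["".join(crop) for crop in payload]     # fake recognition: echo each crop

result = run_pipeline(["hello", "world"], fake_infer)
print(result, calls)
```

With this split, each transcription costs two round trips to the server, one per model.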
+
## Managing multiple models
The key challenge around managing multiple models is to build an infrastructure that can cater to the different requirements of different models. For instance, users may need to deploy a PyTorch model and TensorFlow model on the same server, and they have different loads for both the models, need to run them on different hardware devices, and need to independently manage the serving configurations (model queues, versions, caching, acceleration, and more). The Triton Inference Server caters to all of the above and more.
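As a rough illustration of serving models from different frameworks side by side, the sketch below creates an empty model repository skeleton following Triton's documented layout: `<model-name>/config.pbtxt` plus `<model-name>/<version>/<model-file>`. The model names and the helper `make_repository_skeleton` are hypothetical, and the files are empty placeholders rather than real models or configurations.

```python
import tempfile
from pathlib import Path

def make_repository_skeleton(root, models):
    """Create placeholder files for each model and return their relative paths.

    `models` maps a (hypothetical) model name to its model file name.
    """
    created = []
    for model_name, model_file in models.items():
        version_dir = root / model_name / "1"          # version directories are numeric
        version_dir.mkdir(parents=True, exist_ok=True)
        config = root / model_name / "config.pbtxt"    # per-model serving configuration
        config.touch()
        weights = version_dir / model_file             # empty placeholder, not a real model
        weights.touch()
        created += [config, weights]
    return sorted(path.relative_to(root).as_posix() for path in created)

with tempfile.TemporaryDirectory() as tmp:
    layout = make_repository_skeleton(
        Path(tmp),
        {
            "pytorch_model": "model.pt",               # TorchScript default file name
            "tensorflow_model": "model.graphdef",      # TensorFlow GraphDef default file name
        },
    )
    for entry in layout:
        print(entry)
```

Because each model has its own directory and `config.pbtxt`, its versions and serving settings can be managed independently of the others.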
@@ -320,6 +331,6 @@ The policies can also be set via command line arguments whilst launching the ser
# What's next?
-
In this tutorial, we covered the very basics of setting up and querying a Triton Inference Server. This is Part 1 of a 5 part tutorial series that covers the challenges faced in deploying Deep Learning models to production. Part 2 covers `Concurrent Model Execution and Dynamic Batching`. Depending on your workload and experience you might want to jump to Part 5 with covers `Building an Ensemble Pipeline with multiple models, pre and post processing steps, and adding business logic`.
+
In this tutorial, we covered the very basics of setting up and querying a Triton Inference Server. This is Part 1 of a 6 part tutorial series that covers the challenges faced in deploying Deep Learning models to production. [Part 2](../Part_2-improving_resource_utilization/) covers `Concurrent Model Execution and Dynamic Batching`. Depending on your workload and experience you might want to jump to [Part 5](../Part_5-Model_Ensembles/) which covers `Building an Ensemble Pipeline with multiple models, pre and post processing steps, and adding business logic`.
Part-1 of this series introduced the mechanisms to set up a Triton Inference Server. This iteration discusses the concept of dynamic batching and concurrent model execution. These are important features that can be used to reduce latency as well as increase throughput via higher resource utilization.
@@ -269,4 +269,4 @@ This is a perfect example of "simply enabling all the features" isn't a one-size
# What's next?
-
In this tutorial, we covered the two key concepts, `dynamic batching` and `concurrent model execution`, which can be used to improve resource utilization. This is Part 2 of a 5 part tutorial series which covers the challenges faced in deploying Deep Learning models to production. As you may have figured, there are many possible combinations to use the features discussed in this tutorial, especially with nodes having multiple GPUs. Part 3 covers `Model Analyzer`, a tool which helps to find the best possible deployment configuration.
+
In this tutorial, we covered the two key concepts, `dynamic batching` and `concurrent model execution`, which can be used to improve resource utilization. This is Part 2 of a 6 part tutorial series which covers the challenges faced in deploying Deep Learning models to production. As you may have figured, there are many possible combinations to use the features discussed in this tutorial, especially with nodes having multiple GPUs. Part 3 covers `Model Analyzer`, a tool which helps to find the best possible deployment configuration.
Every inference deployment has its unique set of challenges. These challenges may arise from Service Level Agreements about maintaining latency, limited hardware resources, unique requirements of individual models, the nature and the volume of requests, or something completely different. Additionally, the Triton Inference Server has many features which can be leveraged to make tradeoffs between memory consumption and performance.
With the sheer number of features and requirements, finding an optimal configuration for each deployment becomes a task for "Sweeping" through each of the possible configurations to measure performance. This discussion covers:
@@ -130,4 +131,4 @@ Sample reports can be found in the `reports` folder.
# What's next?
-
In this tutorial, we covered the use of Model Analyzer, which is a tool to select the best possible deployment configuration with respect to resource utilization. This is Part 3 of a 5 part tutorial series which covers the challenges faced in deploying Deep Learning models to production. Part 4 covers `Inference Accelerations`, which will talk about framework level optimizations to accelerate your models!
+
In this tutorial, we covered the use of Model Analyzer, which is a tool to select the best possible deployment configuration with respect to resource utilization. This is Part 3 of a 6 part tutorial series which covers the challenges faced in deploying Deep Learning models to production. Part 4 covers `Inference Accelerations`, which will talk about framework level optimizations to accelerate your models!
| Navigate to |[Part 3: Optimizing Triton Configuration](../Part_3-optimizing_triton_configuration/)|[Part 5: Building Model Ensembles](../Part_5-Model_Ensembles/)|
Model acceleration is a complex, nuanced topic. The viability of techniques like graph optimizations for models, pruning, knowledge distillation, quantization, and more depends highly on the structure of the model. Each of these topics is a vast field of research in its own right, and building custom tools requires massive engineering investment.
@@ -225,4 +225,4 @@ The sections above describe converting models and using different accelerators a
# What's next?
-
In this tutorial, we covered a plethora of optimization options available to accelerate models while using the Triton Inference Server. This is Part 4 of a 5 part tutorial series which covers the challenges faced in deploying Deep Learning models to production. Part 5 covers `Building a model ensemble`. Part 3 and Part 4 focus on two different aspects, resource utilizations and framework level model acceleration respectively. Using both of these techniques in conjunction will lead to the best performance possible. Since the specific selections are highly dependent on workloads, models, SLAs, and hardware resources, this process varies for each user. We highly encourage users to experiment with all these features to find our the best deployment configuration for their use case.
+
In this tutorial, we covered a plethora of optimization options available to accelerate models while using the Triton Inference Server. This is Part 4 of a 6 part tutorial series which covers the challenges faced in deploying Deep Learning models to production. Part 5 covers `Building a model ensemble`. Part 3 and Part 4 focus on two different aspects, resource utilization and framework-level model acceleration, respectively. Using both of these techniques in conjunction will lead to the best performance possible. Since the specific selections are highly dependent on workloads, models, SLAs, and hardware resources, this process varies for each user. We highly encourage users to experiment with all these features to find out the best deployment configuration for their use case.
Conceptual_Guide/Part_5-Model_Ensembles/README.md: 3 additions & 0 deletions
@@ -29,6 +29,9 @@
# Executing Multiple Models with Model Ensembles
+
| Navigate to | [Part 4: Accelerating Models](../Part_4-inference_acceleration/) | [Part 6: Using the BLS API to build complex pipelines](../Part_6-building_complex_pipelines/) | [Documentation: Ensembles](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/architecture.html#ensemble-models) |
Modern machine learning systems often involve the execution of several models, whether that is because of pre- and post-processing steps, aggregating the prediction of multiple models, or having different models executing different tasks. In this example, we'll be exploring the use of Model Ensembles for executing multiple models server side with only a single network call. This offers the benefit of reducing the number of times we need to copy data between the client and the server, and eliminating some of the latency inherent to network calls.
To illustrate the process of creating a model ensemble, we'll be reusing the model pipeline first introduced in [Part 1](../Part_1-model_deployment/README.md). In the previous examples, we've executed the text detection and recognition models separately, with our client making two different network calls and performing various processing steps -- such as cropping and resizing images, or decoding tensors into text -- in between. Below is a simplified diagram of the pipeline, with some steps occurring on the client and some on the server.
Conceptual_Guide/Part_6-building_complex_pipelines/README.md: 3 additions & 2 deletions
@@ -28,9 +28,10 @@
# Building Complex Pipelines: Stable Diffusion
-
*Note*: This tutorial aims at demonstrating the ease of deployment and doesn't incorporate all possible optimizations using the NVIDIA ecosystem.
+
| Navigate to |[Part 5: Building Model Ensembles](../Part_5-Model_Ensembles/)|[Documentation: BLS](https://github.com/triton-inference-server/python_backend#business-logic-scripting)|
It is recommended to watch [this explainer video](https://youtu.be/JgP2WgNIq_w) with discusses the pipeline, before proceeding with the example. This example focuses on showcasing two of Triton Inference Server's features:
+
**Watch [this explainer video](https://youtu.be/JgP2WgNIq_w), which discusses the pipeline, before proceeding with the example**. This example focuses on showcasing two of Triton Inference Server's features:
* Using multiple frameworks in the same inference pipeline. Refer to [this for more information](https://github.com/triton-inference-server/backend#where-can-i-find-all-the-backends-that-are-available-for-triton) about supported frameworks.
* Using the Python Backend's [Business Logic Scripting](https://github.com/triton-inference-server/python_backend#business-logic-scripting) API to build complex non-linear pipelines.
* [Using the BLS API to build complex pipelines](Part_6-building_complex_pipelines/README.md)
+
* [Part 1: Model Deployment](Part_1-model_deployment/): This guide talks about deploying and managing multiple models.
+
* [Part 2: Improving Resource Utilization](Part_2-improving_resource_utilization/): This guide discusses two popular features/techniques used to maximize a GPU's utilization whilst deploying models.
+
* [Part 3: Optimizing Triton Configuration](Part_3-optimizing_triton_configuration/): Each deployment has requirements specific to the use case. This guide walks users through the process of tailoring deployment configurations to match the SLAs.
+
* [Part 4: Accelerating Models](Part_4-inference_acceleration/): Another path towards achieving higher throughput is to accelerate the underlying models. This guide covers SDKs and tools which can be used to accelerate the models.
+
* [Part 5: Building Model Ensembles](./Part_5-Model_Ensembles/): Models are rarely used standalone. This guide will cover "how to build a deep learning inference pipeline?"
+
* [Part 6: Using the BLS API to build complex pipelines](Part_6-building_complex_pipelines/): Often times there are scenarios where the pipeline requires control flows. Learn how to work with complex pipelines with models deployed on different backends.
HuggingFace/README.md: 7 additions & 4 deletions
@@ -31,12 +31,15 @@
**Note**: If you are new to the Triton Inference Server, it is recommended to review [Part 1 of the Conceptual Guide](../Conceptual_Guide/Part_1-model_deployment/README.md). This tutorial assumes basic understanding about the Triton Inference Server.
+
|Related Pages | HuggingFace model exporting guide: [ONNX](https://huggingface.co/docs/transformers/serialization), [TorchScript](https://huggingface.co/docs/transformers/torchscript)|
+
| ------------ | --------------- |
+
Developers often work with open source models. HuggingFace is a popular source of many open source models. The discussion in this guide will focus on how a user can deploy almost any model from HuggingFace with the Triton Inference Server. For this example, the [ViT](https://arxiv.org/abs/2010.11929) model available on [HuggingFace](https://huggingface.co/docs/transformers/v4.24.0/en/model_doc/vit#transformers.ViTModel) is being used.
There are two primary methods of deploying a model pipeline on the Triton Inference Server:
-
* **Approach 1:** Deploy the pipeline without explicitly breaking apart model from a pipeline. The core advantage of this approach is that users can quickly deploy their pipeline. This can be achieved with the use of Triton's ["Python Backend"](https://github.com/triton-inference-server/python_backend). Refer [this example](https://github.com/triton-inference-server/python_backend#usage) for more information.
+
* **Approach 1:** Deploy the pipeline without explicitly breaking the model apart from the pipeline. The core advantage of this approach is that users can quickly deploy their pipeline. This can be achieved with the use of Triton's ["Python Backend"](https://github.com/triton-inference-server/python_backend). Refer to [this example](https://github.com/triton-inference-server/python_backend#usage) for more information. In summary, we deploy the model/pipeline using the Python Backend.
-
* **Approach 2:** Break apart the pipeline, use a different backends for pre/post processing and deploying the core model on a framework backend. The advantage in this case is that running the core network on a dedicated framework backend provides higher performance. Additionally, many framework specific optimizations can be leveraged. See [Part 4](../Conceptual_Guide/Part_4-inference_acceleration/README.md) of the conceptual guide for more information. This is achieved with Triton's Ensembles. An explanation for the same can be found in [Part 5](../Conceptual_Guide/Part_5-Model_Ensembles/README.md) of the Conceptual Guide. Refer to the documentation for more [information](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models).
+
* **Approach 2:** Break apart the pipeline, use different backends for pre/post processing, and deploy the core model on a framework backend. The advantage in this case is that running the core network on a dedicated framework backend provides higher performance. Additionally, many framework specific optimizations can be leveraged. See [Part 4](../Conceptual_Guide/Part_4-inference_acceleration/README.md) of the conceptual guide for more information. This is achieved with Triton's Ensembles. An explanation for the same can be found in [Part 5](../Conceptual_Guide/Part_5-Model_Ensembles/README.md) of the Conceptual Guide. Refer to the documentation for more [information](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models). In summary, we build an ensemble with a preprocessing step and the exported model.
@@ -46,7 +49,7 @@ For the purposes of this explanation, the `ViT` model([Link to HuggingFace](http
-
### Deploying on the Python Backend
+
### Deploying on the Python Backend (Approach 1)
Making use of Triton's python backend requires users to define up to three functions of the `TritonPythonModel` class:
* `initialize()`: This function runs when Triton loads the model. It is recommended to use this function to initialize/load any models and/or data objects. Defining this function is optional.
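For orientation, a minimal skeleton of such a class might look like the sketch below. It assumes the three functions are `initialize()`, `execute()`, and `finalize()`, and the tensor names `INPUT0` and `OUTPUT0` are hypothetical placeholders that would have to match the model's configuration. The `triton_python_backend_utils` module is only importable inside Triton's Python backend, so the import is guarded here to keep the skeleton loadable elsewhere.

```python
try:
    # Provided by Triton's Python backend at runtime; not installable on its own.
    import triton_python_backend_utils as pb_utils
except ImportError:
    pb_utils = None  # Outside Triton only empty request batches can be executed.


class TritonPythonModel:
    def initialize(self, args):
        # Runs once when Triton loads the model: load weights or other state here.
        self.model_name = args["model_name"]

    def execute(self, requests):
        # Runs for every batch of requests; must return one response per request.
        responses = []
        for request in requests:
            # "INPUT0" / "OUTPUT0" are hypothetical tensor names for this sketch.
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out_tensor = pb_utils.Tensor("OUTPUT0", in_tensor.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses

    def finalize(self):
        # Runs once when Triton unloads the model: release resources here.
        self.model_name = None
```

This sketch simply echoes its input back as the output; a real model would run its own pre-processing, inference, and post-processing inside `execute()`.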
### Deploying using a Triton Ensemble (Approach 2)
Before the specifics around deploying the models can be discussed, the first step is to download and export the model. It is recommended to run the following inside the [PyTorch container available on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). If this is your first try at setting up a model ensemble in Triton, it is highly recommended to review [this guide](../Conceptual_Guide/Part_5-Model_Ensembles/README.md) before proceeding. The key advantages of breaking down the pipeline are improved performance and access to a multitude of acceleration options. Explore [Part-4](../Conceptual_Guide/Part_4-inference_acceleration/README.md) of the conceptual guide for details about model acceleration.