In this blog, I will deploy a Hugging Face model on the NVIDIA Triton Inference Server.
Prerequisites:
- AWS EC2 p3.2xlarge instance
- Docker
- Conda
I am using an AWS EC2 p3.2xlarge (GPU) instance for this demonstration, so create that node first. It requires the NVIDIA driver and the NVIDIA container toolkit.
Install the NVIDIA driver:
sudo apt-get install linux-headers-$(uname -r)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt update
sudo apt-get -y install cuda-drivers
Validate the GPU driver:
$ nvidia-smi
The output lists the GPU (a Tesla V100-SXM2-16GB on a p3.2xlarge) along with the driver and CUDA versions.
Install the NVIDIA container toolkit:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-container-toolkit
sudo systemctl restart docker
Triton Server supports TensorRT, ONNX, TorchScript, TensorFlow, OpenVINO, Python, and DALI models. I will use a Python model for the Hugging Face deployment. The Triton server expects a standard directory structure for each model type. Here is the Python model directory structure:
$ tree model_repository/ -I '__pycache__'
model_repository/              # ROOT FOLDER (may have many models)
└── sentiment                  # MODEL FOLDER NAME (same as model name)
    ├── 1                      # MODEL VERSION
    │   └── model.py           # MODEL PYTHON SCRIPT
    ├── config.pbtxt           # CONFIG FILE FOR A MODEL
    └── hf-sentiment.tar.gz    # CONDA ENV (all dependencies required for Hugging Face)

2 directories, 3 files
Create an empty directory structure as described above, and then let's go through each file one by one.
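If you prefer to script it, here is a minimal sketch that creates the empty skeleton (the actual file contents are written in the steps below):

from pathlib import Path

root = Path("model_repository/sentiment")
(root / "1").mkdir(parents=True, exist_ok=True)  # model version directory
(root / "1" / "model.py").touch()                # Python backend script (written below)
(root / "config.pbtxt").touch()                  # model configuration (written below)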
1. config.pbtxt is the configuration for a model. It describes the model name, backend, input/output names and types, and execution settings such as GPU or CPU placement, batch size, and more. I will use a minimal configuration:
name: "sentiment"
backend: "python"
input [
{
name: "text"
data_type: TYPE_STRING
dims: [-1]
}
]
output [
{
name: "sentiment"
data_type: TYPE_STRING
dims: [-1]
}
]
parameters: {
key: "EXECUTION_ENV_PATH",
value: {string_value: "/mnt/model_repository/sentiment/hf-sentiment.tar.gz"}
}
instance_group [
{
kind: KIND_GPU
}
]
2. model.py contains the standard TritonPythonModel class with three methods that must be implemented: initialize, execute, and finalize.
Note: I am using the Hugging Face sentiment model "cardiffnlp/twitter-roberta-base-sentiment-latest".
import triton_python_backend_utils as pb_utils
import numpy as np
from transformers import pipeline


class TritonPythonModel:
    def initialize(self, args):
        model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
        self.generator = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)

    def execute(self, requests):
        responses = []
        for request in requests:
            # Decode the byte tensor into text
            input_tensor = pb_utils.get_input_tensor_by_name(request, "text")
            input_text = input_tensor.as_numpy()[0].decode()
            # Call the model pipeline
            pipeline_output = self.generator(input_text)
            sentiment = pipeline_output[0]["label"]
            # Encode the label back into a byte tensor to send back
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[
                    pb_utils.Tensor("sentiment", np.array([sentiment.encode()]))
                ]
            )
            responses.append(inference_response)
        return responses

    def finalize(self, args):
        self.generator = None
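Before packaging the environment, you can sanity-check the same pipeline logic locally (a minimal sketch, run inside the hf-sentiment conda environment created in the next step; it is not part of the Triton model itself):

# Quick local test of the pipeline used in model.py
from transformers import pipeline

model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
generator = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)
print(generator("I really enjoyed this")[0]["label"])  # expected: positive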
3. hf-sentiment.tar.gz is a Conda pack of all the dependencies required for Hugging Face. (You can use a different tar file name.)
conda create -k -y -n hf-sentiment python=3.10
conda activate hf-sentiment
pip install numpy conda-pack
pip install torch==1.13.1
pip install transformers==4.21.3
# Optional: if the Triton server fails with "version `GLIBCXX_3.4.30' not found", install a newer libstdc++ in the environment:
# conda install -c conda-forge gcc=12.1.0
conda pack -o hf-sentiment.tar.gz
Create the files and move them into their respective directories as shown in the model repository tree.
The model repository structure is ready. Triton exposes three ports:
- 8000 -> HTTPService
- 8001 -> GRPCInferenceService
- 8002 -> Metrics Service
It also needs a volume mount for the model repository folder. Let's start the Triton Docker container:
docker run -d --gpus=all --shm-size=10G -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD/model_repository:/mnt/model_repository nvcr.io/nvidia/tritonserver:23.06-py3 tritonserver --model-repository=/mnt/model_repository --log-verbose=1
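Once the container is up, a quick way to confirm that the server and the model are ready is the tritonclient Python package (a small sketch; it assumes tritonclient[http] is pip-installed on the host):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live :", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready :", client.is_model_ready("sentiment"))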
Docker logs:
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
372793f53dda nvcr.io/nvidia/tritonserver:23.06-py3 "/opt/nvidia/nvidia_…" 18 hours ago Up 15 hours 0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp nervous_shannon
$ docker logs -f nervous_shannon
I0803 14:45:38.670657 1 server.cc:630]
+---------+---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+---------+---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {} |
| python | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+---------+---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0803 14:45:38.670695 1 server.cc:673]
+-----------+---------+--------+
| Model | Version | Status |
+-----------+---------+--------+
| sentiment | 1 | READY |
+-----------+---------+--------+
I0803 14:45:38.729720 1 metrics.cc:808] Collecting metrics for GPU 0: Tesla V100-SXM2-16GB
I0803 14:45:38.730009 1 metrics.cc:701] Collecting CPU metrics
I0803 14:45:38.730278 1 tritonserver.cc:2385]
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.35.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /mnt/model_repository |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0803 14:45:38.730913 1 grpc_server.cc:2339]
+----------------------------------------------+---------+
| GRPC KeepAlive Option | Value |
+----------------------------------------------+---------+
| keepalive_time_ms | 7200000 |
| keepalive_timeout_ms | 20000 |
| keepalive_permit_without_calls | 0 |
| http2_max_pings_without_data | 2 |
| http2_min_recv_ping_interval_without_data_ms | 300000 |
| http2_max_ping_strikes | 2 |
+----------------------------------------------+---------+
I0803 14:45:38.731608 1 grpc_server.cc:99] Ready for RPC 'Check', 0
I0803 14:45:38.731646 1 grpc_server.cc:99] Ready for RPC 'ServerLive', 0
I0803 14:45:38.731659 1 grpc_server.cc:99] Ready for RPC 'ServerReady', 0
I0803 14:45:38.731664 1 grpc_server.cc:99] Ready for RPC 'ModelReady', 0
I0803 14:45:38.731677 1 grpc_server.cc:99] Ready for RPC 'ServerMetadata', 0
I0803 14:45:38.731690 1 grpc_server.cc:99] Ready for RPC 'ModelMetadata', 0
I0803 14:45:38.731699 1 grpc_server.cc:99] Ready for RPC 'ModelConfig', 0
I0803 14:45:38.731711 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryStatus', 0
I0803 14:45:38.731724 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryRegister', 0
I0803 14:45:38.731737 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryUnregister', 0
I0803 14:45:38.731748 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryStatus', 0
I0803 14:45:38.731755 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryRegister', 0
I0803 14:45:38.731762 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryUnregister', 0
I0803 14:45:38.731774 1 grpc_server.cc:99] Ready for RPC 'RepositoryIndex', 0
I0803 14:45:38.731786 1 grpc_server.cc:99] Ready for RPC 'RepositoryModelLoad', 0
I0803 14:45:38.731792 1 grpc_server.cc:99] Ready for RPC 'RepositoryModelUnload', 0
I0803 14:45:38.731806 1 grpc_server.cc:99] Ready for RPC 'ModelStatistics', 0
I0803 14:45:38.731818 1 grpc_server.cc:99] Ready for RPC 'Trace', 0
I0803 14:45:38.731825 1 grpc_server.cc:99] Ready for RPC 'Logging', 0
I0803 14:45:38.731868 1 grpc_server.cc:348] Thread started for CommonHandler
I0803 14:45:38.732014 1 infer_handler.cc:693] New request handler for ModelInferHandler, 0
I0803 14:45:38.732064 1 infer_handler.h:1046] Thread started for ModelInferHandler
I0803 14:45:38.732201 1 infer_handler.cc:693] New request handler for ModelInferHandler, 0
I0803 14:45:38.732243 1 infer_handler.h:1046] Thread started for ModelInferHandler
I0803 14:45:38.732369 1 stream_infer_handler.cc:127] New request handler for ModelStreamInferHandler, 0
I0803 14:45:38.732403 1 infer_handler.h:1046] Thread started for ModelStreamInferHandler
I0803 14:45:38.732415 1 grpc_server.cc:2445] Started GRPCInferenceService at 0.0.0.0:8001
I0803 14:45:38.732686 1 http_server.cc:3555] Started HTTPService at 0.0.0.0:8000
I0803 14:45:38.774091 1 http_server.cc:185] Started Metrics Service at 0.0.0.0:8002
I0803 14:49:24.762883 1 http_server.cc:3449] HTTP request: 2 /v2/models/sentiment/infer
I0803 14:49:24.762987 1 infer_request.cc:751] [request id: <id_unknown>] prepared: [0x0x7f4454002ed0] request id: , model: sentiment, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
The Triton server container is up and running. Let's try an inference request using curl:
curl --location --request POST 'http://localhost:8000/v2/models/sentiment/infer' \
--header 'Content-Type: application/json' \
--data-raw '{
"inputs":[
{
"name": "text",
"shape": [1],
"datatype": "BYTES",
"data": ["I really enjoyed this"]
}
]
}'
Inference response:
{
"model_name": "sentiment",
"model_version": "1",
"outputs": [
{
"name": "sentiment",
"datatype": "BYTES",
"shape": [
1
],
"data": [
"positive"
]
}
]
}
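The same request can also be sent from Python instead of curl. Here is a minimal client sketch against the same server URL and model name (it assumes tritonclient[http] and numpy are pip-installed):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the BYTES input tensor named "text", matching config.pbtxt
text = np.array(["I really enjoyed this".encode("utf-8")], dtype=np.object_)
infer_input = httpclient.InferInput("text", [1], "BYTES")
infer_input.set_data_from_numpy(text)

result = client.infer(model_name="sentiment", inputs=[infer_input])
print(result.as_numpy("sentiment")[0].decode())  # expected: positive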
References:
- https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html
- https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest
- https://github.com/triton-inference-server/tutorials/tree/main/HuggingFace
- https://github.com/satendrakumar/huggingface-triton-server