Deploying a Hugging Face Model on NVIDIA Triton Inference Server

In this blog, I will deploy a Hugging Face model on the NVIDIA Triton Inference Server.

Prerequisites:

  • AWS EC2 p3.2xlarge instance
  • Docker
  • Conda

I am using an AWS EC2 p3.2xlarge (GPU) instance for this demonstration, so please create that instance first. It requires the NVIDIA driver and the NVIDIA container toolkit.

Install the NVIDIA driver:

sudo apt-get install linux-headers-$(uname -r)

distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')

wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.0-1_all.deb

sudo dpkg -i cuda-keyring_1.0-1_all.deb

sudo apt update

sudo apt-get -y install cuda-drivers

Validate the GPU driver:

$ nvidia-smi

The nvidia-smi output should report the installed driver version and the Tesla V100 GPU.

Install the NVIDIA container toolkit:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -

curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update

sudo apt-get install nvidia-container-toolkit

sudo systemctl restart docker

Triton Server supports TensorRT, ONNX, TorchScript, TensorFlow, OpenVINO, Python, and DALI models. I will use the Python backend for the Hugging Face deployment. Triton expects a standard directory structure for each model type. Here is the directory structure for a Python model:

$ tree model_repository/ -I '__pycache__'
model_repository/               # ROOT FOLDER (may contain many models)
└── sentiment                   # MODEL FOLDER (same as the model name)
    ├── 1                       # MODEL VERSION
    │   └── model.py            # MODEL PYTHON SCRIPT
    ├── config.pbtxt            # CONFIG FILE FOR THE MODEL
    └── hf-sentiment.tar.gz     # CONDA ENV (all dependencies required for Hugging Face)

2 directories, 3 files

Create the empty directory structure as described above (a small scaffolding sketch follows), and then let's go through each file one by one.
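
Here is a minimal Python sketch for scaffolding this layout, assuming it is run from the directory that should contain model_repository/ (the conda archive is created later with conda-pack):

# Scaffold the empty model repository layout shown above
from pathlib import Path

version_dir = Path("model_repository") / "sentiment" / "1"
version_dir.mkdir(parents=True, exist_ok=True)

# Empty placeholders; their contents are covered in the next steps
(version_dir / "model.py").touch()
(version_dir.parent / "config.pbtxt").touch()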

  1. config.pbtxt is the model configuration file. It describes the model name, backend, input/output names and types, and execution details such as GPU or CPU placement, batch size, and more. I will use a minimal configuration:
name: "sentiment"
backend: "python"
input [
  {
    name: "text"
    data_type: TYPE_STRING
    dims: [-1]
  }
]
output [
  {
    name: "sentiment"
    data_type: TYPE_STRING
    dims: [-1]
  }
]

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "/mnt/model_repository/sentiment/hf-sentiment.tar.gz"}
}

instance_group [
  {
    kind: KIND_GPU
  }
]

2. model.py contains the standard TritonPythonModel class with three methods that need to be implemented: initialize, execute, and finalize.

Note: I am using the Hugging Face sentiment model “cardiffnlp/twitter-roberta-base-sentiment-latest”.

import triton_python_backend_utils as pb_utils
import numpy as np
from transformers import pipeline

class TritonPythonModel:
    def initialize(self, args):
        model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
        self.generator = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)

    def execute(self, requests):
        responses = []
        for request in requests:
            # Decode the byte tensor into text
            input_tensor = pb_utils.get_input_tensor_by_name(request, "text")
            input_text = input_tensor.as_numpy()[0].decode()
            # Call the Model pipeline
            pipeline_output = self.generator(input_text)
            sentiment = pipeline_output[0]["label"]
            # Encode the text to byte tensor to send back
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[
                    pb_utils.Tensor("sentiment", np.array([sentiment.encode()]))]
            )
            responses.append(inference_response)
        return responses

    def finalize(self, args):
        self.generator = None
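
As a quick sanity check, the same Hugging Face pipeline can be exercised locally outside Triton in any environment with transformers and torch installed; this is just a sketch, not part of the deployment:

# Standalone check of the pipeline used in model.py above
from transformers import pipeline

model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
generator = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)

# Prints the predicted label, e.g. "positive"
print(generator("I really enjoyed this")[0]["label"])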

3. hf-sentiment.tar.gz is a conda-pack archive of all the dependencies required for Hugging Face. (You can use a different tar file name.)

conda create -k -y -n hf-sentiment python=3.10

conda activate hf-sentiment

pip install numpy conda-pack

pip install torch==1.13.1

pip install transformers==4.21.3

# Optional: if Triton reports the error "version `GLIBCXX_3.4.30' not found" when loading the model
# conda install -c conda-forge gcc=12.1.0

conda pack -o hf-sentiment.tar.gz

Create the files and move them into their respective directories as shown in the model repository tree.

The model repository structure is ready. Triton exposes three ports:

8000 -> HTTP service

8001 -> gRPC inference service

8002 -> Metrics service

It also needs a volume mount for the model repository folder. Let's start the Triton Docker container:

docker run -d --shm-size=10G -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $PWD/model_repository:/mnt/model_repository nvcr.io/nvidia/tritonserver:23.06-py3 tritonserver --model-repository=/mnt/model_repository --log-verbose=1

Docker logs:

$ docker ps -a
CONTAINER ID   IMAGE                                   COMMAND                  CREATED        STATUS        PORTS                                                           NAMES
372793f53dda   nvcr.io/nvidia/tritonserver:23.06-py3   "/opt/nvidia/nvidia_…"   18 hours ago   Up 15 hours   0.0.0.0:8000-8002->8000-8002/tcp, :::8000-8002->8000-8002/tcp   nervous_shannon

$ docker logs -f nervous_shannon

I0803 14:45:38.670657 1 server.cc:630] 
+---------+---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend | Path                                                    | Config                                                                                                                                                        |
+---------+---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| pytorch | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so | {}                                                                                                                                                            |
| python  | /opt/tritonserver/backends/python/libtriton_python.so   | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+---------+---------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0803 14:45:38.670695 1 server.cc:673] 
+-----------+---------+--------+
| Model     | Version | Status |
+-----------+---------+--------+
| sentiment | 1       | READY  |
+-----------+---------+--------+

I0803 14:45:38.729720 1 metrics.cc:808] Collecting metrics for GPU 0: Tesla V100-SXM2-16GB
I0803 14:45:38.730009 1 metrics.cc:701] Collecting CPU metrics
I0803 14:45:38.730278 1 tritonserver.cc:2385] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.35.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /mnt/model_repository                                                                                                                                                                                           |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0803 14:45:38.730913 1 grpc_server.cc:2339] 
+----------------------------------------------+---------+
| GRPC KeepAlive Option                        | Value   |
+----------------------------------------------+---------+
| keepalive_time_ms                            | 7200000 |
| keepalive_timeout_ms                         | 20000   |
| keepalive_permit_without_calls               | 0       |
| http2_max_pings_without_data                 | 2       |
| http2_min_recv_ping_interval_without_data_ms | 300000  |
| http2_max_ping_strikes                       | 2       |
+----------------------------------------------+---------+

I0803 14:45:38.731608 1 grpc_server.cc:99] Ready for RPC 'Check', 0
I0803 14:45:38.731646 1 grpc_server.cc:99] Ready for RPC 'ServerLive', 0
I0803 14:45:38.731659 1 grpc_server.cc:99] Ready for RPC 'ServerReady', 0
I0803 14:45:38.731664 1 grpc_server.cc:99] Ready for RPC 'ModelReady', 0
I0803 14:45:38.731677 1 grpc_server.cc:99] Ready for RPC 'ServerMetadata', 0
I0803 14:45:38.731690 1 grpc_server.cc:99] Ready for RPC 'ModelMetadata', 0
I0803 14:45:38.731699 1 grpc_server.cc:99] Ready for RPC 'ModelConfig', 0
I0803 14:45:38.731711 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryStatus', 0
I0803 14:45:38.731724 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryRegister', 0
I0803 14:45:38.731737 1 grpc_server.cc:99] Ready for RPC 'SystemSharedMemoryUnregister', 0
I0803 14:45:38.731748 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryStatus', 0
I0803 14:45:38.731755 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryRegister', 0
I0803 14:45:38.731762 1 grpc_server.cc:99] Ready for RPC 'CudaSharedMemoryUnregister', 0
I0803 14:45:38.731774 1 grpc_server.cc:99] Ready for RPC 'RepositoryIndex', 0
I0803 14:45:38.731786 1 grpc_server.cc:99] Ready for RPC 'RepositoryModelLoad', 0
I0803 14:45:38.731792 1 grpc_server.cc:99] Ready for RPC 'RepositoryModelUnload', 0
I0803 14:45:38.731806 1 grpc_server.cc:99] Ready for RPC 'ModelStatistics', 0
I0803 14:45:38.731818 1 grpc_server.cc:99] Ready for RPC 'Trace', 0
I0803 14:45:38.731825 1 grpc_server.cc:99] Ready for RPC 'Logging', 0
I0803 14:45:38.731868 1 grpc_server.cc:348] Thread started for CommonHandler
I0803 14:45:38.732014 1 infer_handler.cc:693] New request handler for ModelInferHandler, 0
I0803 14:45:38.732064 1 infer_handler.h:1046] Thread started for ModelInferHandler
I0803 14:45:38.732201 1 infer_handler.cc:693] New request handler for ModelInferHandler, 0
I0803 14:45:38.732243 1 infer_handler.h:1046] Thread started for ModelInferHandler
I0803 14:45:38.732369 1 stream_infer_handler.cc:127] New request handler for ModelStreamInferHandler, 0
I0803 14:45:38.732403 1 infer_handler.h:1046] Thread started for ModelStreamInferHandler
I0803 14:45:38.732415 1 grpc_server.cc:2445] Started GRPCInferenceService at 0.0.0.0:8001
I0803 14:45:38.732686 1 http_server.cc:3555] Started HTTPService at 0.0.0.0:8000
I0803 14:45:38.774091 1 http_server.cc:185] Started Metrics Service at 0.0.0.0:8002
I0803 14:49:24.762883 1 http_server.cc:3449] HTTP request: 2 /v2/models/sentiment/infer
I0803 14:49:24.762987 1 infer_request.cc:751] [request id: <id_unknown>] prepared: [0x0x7f4454002ed0] request id: , model: sentiment, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0
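
The logs show the HTTP, gRPC, and metrics services started and the sentiment model in the READY state. Optionally, you can confirm this programmatically through Triton's standard HTTP health endpoints; here is a small sketch using the third-party requests library (curl against the same URLs works just as well):

# Optional readiness checks against the HTTP endpoint on port 8000
import requests

base = "http://localhost:8000"
print(requests.get(f"{base}/v2/health/live").status_code)             # 200 when the server is live
print(requests.get(f"{base}/v2/health/ready").status_code)            # 200 when the server is ready
print(requests.get(f"{base}/v2/models/sentiment/ready").status_code)  # 200 when the sentiment model is ready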

The Triton Server container is up and running. Let's try an inference request using curl:

curl --location --request POST 'http://localhost:8000/v2/models/sentiment/infer' \
 --header 'Content-Type: application/json' \
 --data-raw '{
    "inputs":[
    {    
     "name": "text",
     "shape": [1],
     "datatype": "BYTES",
     "data":  ["I really enjoyed this"]
    }
   ]
 }'  

Inference response:

{
  "model_name": "sentiment",
  "model_version": "1",
  "outputs": [
    {
      "name": "sentiment",
      "datatype": "BYTES",
      "shape": [
        1
      ],
      "data": [
        "positive"
      ]
    }
  ]
}
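
The same request can also be made from Python using NVIDIA's tritonclient package (installable with pip install "tritonclient[http]"); here is a minimal sketch:

# Minimal Python client sketch for the sentiment model
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the "text" input as a one-element BYTES tensor, matching config.pbtxt
text = np.array(["I really enjoyed this"], dtype=object)
infer_input = httpclient.InferInput("text", list(text.shape), "BYTES")
infer_input.set_data_from_numpy(text)

result = client.infer(model_name="sentiment", inputs=[infer_input])
print(result.as_numpy("sentiment"))  # e.g. [b'positive']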

References:

  1. https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/getting_started/quickstart.html
  2. https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest
  3. https://github.com/triton-inference-server/tutorials/tree/main/HuggingFace
  4. https://github.com/satendrakumar/huggingface-triton-server