Integration with NVIDIA Triton

This plugin also supports inference via the powerful and flexible NVIDIA Triton server.

This allows a client to request inference on images stored in a remote Cassandra server, with the inference itself running on a separate, GPU-powered remote server.

Cassandra DALI operators

The plugin provides two operators to be used with Triton:

fn.crs4.cassandra_interactive

This operator takes a batch of UUIDs as input, each represented as a pair of uint64 values, and outputs a batch of the raw images stored as BLOBs in the database, optionally paired with the corresponding labels.
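As an illustration of the "pair of uint64" representation, here is a minimal sketch that splits a 128-bit UUID into two unsigned 64-bit integers and reassembles it. The (high, low) ordering and the helper names are assumptions made for this example; check the client scripts shipped with the plugin for the ordering it actually expects.

```python
import uuid
from typing import Tuple

MASK64 = (1 << 64) - 1  # keeps only the lowest 64 bits


def uuid_to_u64_pair(u: uuid.UUID) -> Tuple[int, int]:
    """Split a UUID into (high, low) 64-bit halves (ordering is an assumption)."""
    return (u.int >> 64) & MASK64, u.int & MASK64


def u64_pair_to_uuid(high: int, low: int) -> uuid.UUID:
    """Rebuild the UUID from its two 64-bit halves."""
    return uuid.UUID(int=(high << 64) | low)


example = uuid.UUID("12345678-1234-5678-1234-567812345678")
pair = uuid_to_u64_pair(example)
restored = u64_pair_to_uuid(*pair)
```

A batch of N UUIDs then becomes an N x 2 array of uint64 values, which matches the UUID:2048,2 shape used in the perf_analyzer examples.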

fn.crs4.cassandra_decoupled

The decoupled version of the operator splits the input UUIDs (which, in this case, can form a very long list) into mini-batches and requests the images from the database, using prefetching to increase throughput and hide network latency.
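The idea of mini-batching with prefetching can be sketched in plain Python. This is not the plugin's implementation, only a conceptual model: `fake_fetch` is a hypothetical stand-in for a database query, and the mini-batch size and prefetch depth are arbitrary values chosen for the example.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def split_into_minibatches(items: Sequence, size: int) -> List[list]:
    """Cut a long list into consecutive chunks of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def prefetching_fetch(batches: List[list], fetch: Callable, depth: int = 2) -> Iterator:
    """Yield fetch(batch) results in order, keeping up to `depth` requests in flight."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        for batch in batches:
            pending.append(pool.submit(fetch, batch))
            if len(pending) >= depth:
                # Oldest request is consumed while newer ones keep running.
                yield pending.popleft().result()
        while pending:
            yield pending.popleft().result()


def fake_fetch(batch: list) -> List[bytes]:
    """Hypothetical stand-in for retrieving BLOBs from the database."""
    return [f"blob-for-{key}".encode() for key in batch]


ids = list(range(10))
batches = split_into_minibatches(ids, 4)
results = list(prefetching_fetch(batches, fake_fetch, depth=2))
```

Because later requests are already in flight while an earlier result is being consumed, the time spent waiting on the network overlaps with processing.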

Testing the examples

The directory models contains the following subdirectories, with examples of pipelines using both cassandra_interactive and cassandra_decoupled:

dali_cassandra_interactive

This model retrieves the raw data from the database, decodes it into images, performs normalization and cropping, and returns the images as a tensor. It uses the fn.crs4.cassandra_interactive operator.

dali_cassandra_interactive_stress

This model retrieves the raw data from the database and returns the first byte of every BLOB. It uses the fn.crs4.cassandra_interactive operator.

dali_cassandra_decoupled

This model retrieves the raw data from the database, decodes it into images, performs normalization and cropping, and returns the images as a tensor. It uses the fn.crs4.cassandra_decoupled operator.

dali_cassandra_decoupled_stress

This model retrieves the raw data from the database and returns the first byte of every BLOB. It uses the fn.crs4.cassandra_decoupled operator.

classification_resnet

This model uses a pre-trained ResNet50 to perform ImageNet classification. To download the network, run the runme.py script.

cass_to_inference

This ensemble model connects dali_cassandra_interactive and classification_resnet to load and preprocess images from the database and perform inference on them.

cass_to_inference_decoupled

This ensemble model connects dali_cassandra_decoupled and classification_resnet to load and preprocess images from the database and perform inference on them.

Building and running the docker container

The most convenient way to test the cassandra-dali-plugin with Triton is to use the provided Dockerfile.triton (derived from the NVIDIA Triton Inference Server NGC image), which contains our plugin, NVIDIA Triton, NVIDIA DALI, the Cassandra C++ and Python drivers, and a Cassandra server. To build and run the container, use the following commands:

# Build and run cassandra-dali-triton docker container
$ docker build -t cassandra-dali-triton -f Dockerfile.triton .
$ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
  --gpus all --cap-add=sys_admin --name cass-dali cassandra-dali-triton

Starting and filling the DB

Once the Docker container is running, start the database and populate it with images from the imagenette dataset using the provided script:

./start-and-fill-db.sh  # might take a few minutes

Starting Triton server

After the database is populated, start the Triton server with:

./start-triton.sh
# i.e., tritonserver --model-repository ./models --backend-config dali,plugin_libs=/opt/conda/lib/python3.8/site-packages/libcrs4cassandra.so

Leave this shell open: it displays the logs of the Triton server.
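Before launching the clients, it can help to confirm that the server has finished loading its models. The sketch below polls Triton's standard HTTP readiness endpoint (`/v2/health/ready`). It assumes the server listens on Triton's default HTTP port 8000 on localhost, which may not match the settings in start-triton.sh; the function names are illustrative.

```python
import http.client
import urllib.request


def ready_url(base_url: str) -> str:
    """Build the readiness URL from the server's base address."""
    return base_url.rstrip("/") + "/v2/health/ready"


def triton_is_ready(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a non-2xx reply: treat as "not ready".
        return False


if __name__ == "__main__":
    print("Triton ready:", triton_is_ready())
```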

Testing the inference

To run the clients, start a new shell in the container with the following command:

docker exec -ti cass-dali fish

Now, within the container, run the following commands to test the inference:

python3 client-http-stress.py
python3 client-grpc-stress.py
python3 client-grpc-stream-stress.py
python3 client-http-ensemble.py
python3 client-grpc-ensemble.py
python3 client-grpc-stream-ensemble.py

You can also benchmark the inference performance using NVIDIA's perf_analyzer. For example:

perf_analyzer -m dali_cassandra_interactive_stress --input-data uuids.json -b 256 --concurrency-range 16 -p 10000
perf_analyzer -m dali_cassandra_interactive_stress --input-data uuids.json -b 256 --concurrency-range 16 -p 10000 -i grpc
perf_analyzer -m dali_cassandra_decoupled_stress --input-data uuids_2048.json --shape UUID:2048,2 --concurrency-range 4 -i grpc --streaming -p 10000
perf_analyzer -m cass_to_inference --input-data uuids.json -b 256 --concurrency-range 16 -i grpc
perf_analyzer -m cass_to_inference_decoupled --input-data uuids_2048.json --shape UUID:2048,2 --concurrency-range 4 -i grpc --streaming -p 10000
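The `--input-data` option of perf_analyzer accepts a JSON file with a top-level `data` list, where each entry maps an input name to its values. The sketch below builds such a structure for per-sample UUID inputs. The input name `UUID` is taken from the `--shape UUID:2048,2` option above; the (high, low) ordering of the two uint64 values and the exact layout of the repository's own uuids.json are assumptions, so treat this as an illustration of the format rather than a replacement for the provided files.

```python
import json
import uuid

MASK64 = (1 << 64) - 1  # keeps only the lowest 64 bits


def make_input_data(uuids, input_name="UUID"):
    """Build a perf_analyzer-style payload with one [2]-shaped entry per UUID."""
    entries = []
    for u in uuids:
        high, low = (u.int >> 64) & MASK64, u.int & MASK64  # ordering is an assumption
        entries.append({input_name: {"content": [high, low], "shape": [2]}})
    return {"data": entries}


# Placeholder UUIDs; real runs need UUIDs that exist in the database.
sample_ids = [uuid.UUID(int=i) for i in range(1, 4)]
payload = make_input_data(sample_ids)
text = json.dumps(payload, indent=2)
```

Writing `text` to a file yields something that can be passed to `--input-data`, provided the UUIDs refer to rows that actually exist in the Cassandra table.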