This plugin also supports inference via the NVIDIA Triton Inference Server.
This allows a client to request that images stored in a remote Cassandra server be run through inference on a different remote, GPU-powered server.
The plugin provides two operators to be used with Triton:

- `cassandra_interactive`: expects a batch of UUIDs as input, represented as pairs of uint64, and produces as output a batch containing the raw images stored as BLOBs in the database, possibly paired with the corresponding labels.
- `cassandra_decoupled`: splits the input UUIDs (which, in this case, can form a very long list) into mini-batches and requests the images from the database, using prefetching to increase the throughput and hide the network latencies.
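For illustration, a DALI pipeline built around the interactive operator could be sketched as follows. The plugin's parameter names in the sketch (`cassandra_ips`, `table`, `blob_col`) are assumptions, not the verified plugin signature; the library path is the one used by the Triton setup below:

```python
# Hedged sketch of a DALI pipeline using the interactive operator.
# The plugin parameters below (cassandra_ips, table, blob_col) are
# illustrative assumptions, not the verified plugin signature.
import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali import pipeline_def, plugin_manager

# Load the plugin library (path as used in the Triton container below)
plugin_manager.load_library(
    "/opt/conda/lib/python3.8/site-packages/libcrs4cassandra.so"
)

@pipeline_def(batch_size=256, num_threads=4, device_id=0)
def read_and_decode_pipe():
    # Each sample is one UUID, passed as a pair of uint64
    uuids = fn.external_source(name="UUID", dtype=types.UINT64, ndim=1)
    blobs = fn.crs4.cassandra_interactive(
        uuids,
        cassandra_ips=["127.0.0.1"],    # assumed parameter name
        table="imagenette.data_train",  # assumed parameter name
        blob_col="data",                # assumed parameter name
    )
    images = fn.decoders.image(blobs, device="mixed")
    images = fn.crop_mirror_normalize(
        images, crop=(224, 224), dtype=types.FLOAT, output_layout="CHW"
    )
    return images
```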
The directory `models` contains the following subdirectories, with examples of pipelines using both `cassandra_interactive` and `cassandra_decoupled` (a sketch of how such a model can be serialized for Triton follows the list):
- `dali_cassandra_interactive`: retrieves the raw data from the database, decodes it into images, performs normalization and cropping, and returns the images as a tensor. It uses the `fn.crs4.cassandra_interactive` operator.
- `dali_cassandra_interactive_stress`: retrieves the raw data from the database and returns the first byte of every BLOB. It uses the `fn.crs4.cassandra_interactive` operator.
- `dali_cassandra_decoupled`: retrieves the raw data from the database, decodes it into images, performs normalization and cropping, and returns the images as a tensor. It uses the `fn.crs4.cassandra_decoupled` operator.
- `dali_cassandra_decoupled_stress`: retrieves the raw data from the database and returns the first byte of every BLOB. It uses the `fn.crs4.cassandra_decoupled` operator.
- `classification_resnet`: uses a pre-trained ResNet50 for ImageNet classification to perform inference. To download the network, simply run the `runme.py` file.
- `cass_to_inference`: an ensemble model that connects `dali_cassandra_interactive` and `classification_resnet` to load and preprocess images from the database and run inference on them.
- `cass_to_inference_decoupled`: an ensemble model that connects `dali_cassandra_decoupled` and `classification_resnet` to load and preprocess images from the database and run inference on them.
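Triton's DALI backend serves a serialized pipeline from each DALI model directory. As a hedged sketch of how such a file could be produced, assuming the usual Triton model-repository layout `models/<name>/<version>/model.dali` and reusing the illustrative pipeline defined above:

```python
# Sketch: serialize a DALI pipeline for Triton's DALI backend, assuming
# the model-repository layout models/<name>/<version>/model.dali.
# read_and_decode_pipe is the illustrative pipeline sketched earlier.
pipe = read_and_decode_pipe()
pipe.serialize(filename="models/dali_cassandra_interactive/1/model.dali")
```

Recent DALI versions also provide an autoserialization mechanism for the Triton backend, which removes the explicit serialization step.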
The most convenient way to test the cassandra-dali-plugin with Triton is to use the provided Dockerfile.triton (derived from the NVIDIA Triton Inference Server NGC image), which contains our plugin, NVIDIA Triton, NVIDIA DALI, the Cassandra C++ and Python drivers, as well as a Cassandra server. To build and run the container, use the following commands:
# Build and run cassandra-dali-triton docker container
$ docker build -t cassandra-dali-triton -f Dockerfile.triton .
$ docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
--gpus all --cap-add=sys_admin --name cass-dali cassandra-dali-triton
Once the container is running, you can start the database and populate it with images from the Imagenette dataset using the provided script:
./start-and-fill-db.sh # might take a few minutes
After the database is populated, we can start the Triton server with
./start-triton.sh
# i.e., tritonserver --model-repository ./models --backend-config dali,plugin_libs=/opt/conda/lib/python3.8/site-packages/libcrs4cassandra.so
Now you can leave this shell open, and it will display the logs of the Triton server.
To run the clients, start a new shell in the container with the following command:
docker exec -ti cass-dali fish
Now, within the container, run the following commands to test the inference:
python3 client-http-stress.py
python3 client-grpc-stress.py
python3 client-grpc-stream-stress.py
python3 client-http-ensemble.py
python3 client-grpc-ensemble.py
python3 client-grpc-stream-ensemble.py
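For reference, the essence of such a client is small. Here is a hedged sketch of an HTTP client: the input name `UUID` and per-sample shape `(2,)` follow the perf_analyzer calls below, while the output tensor name `DALI_OUTPUT_0` and the zero-valued UUIDs are placeholders, not values taken from the actual scripts:

```python
# Minimal sketch of a Triton HTTP client sending UUID pairs for inference.
# The output name "DALI_OUTPUT_0" and the dummy UUID values are assumptions;
# the input name "UUID" and per-sample shape (2,) match the models above.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="127.0.0.1:8000")

# Batch of 4 UUIDs, each represented as a pair of uint64
uuids = np.zeros((4, 2), dtype=np.uint64)  # replace with real UUIDs from the DB

inp = httpclient.InferInput("UUID", uuids.shape, "UINT64")
inp.set_data_from_numpy(uuids)

result = client.infer("dali_cassandra_interactive", inputs=[inp])
images = result.as_numpy("DALI_OUTPUT_0")  # assumed output tensor name
print(images.shape)
```

The streaming clients target the decoupled models over gRPC instead, matching the `--streaming` perf_analyzer runs below.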
You can also benchmark the inference performance using NVIDIA's perf_analyzer. For example:
perf_analyzer -m dali_cassandra_interactive_stress --input-data uuids.json -b 256 --concurrency-range 16 -p 10000
perf_analyzer -m dali_cassandra_interactive_stress --input-data uuids.json -b 256 --concurrency-range 16 -p 10000 -i grpc
perf_analyzer -m dali_cassandra_decoupled_stress --input-data uuids_2048.json --shape UUID:2048,2 --concurrency-range 4 -i grpc --streaming -p 10000
perf_analyzer -m cass_to_inference --input-data uuids.json -b 256 --concurrency-range 16 -i grpc
perf_analyzer -m cass_to_inference_decoupled --input-data uuids_2048.json --shape UUID:2048,2 --concurrency-range 4 -i grpc --streaming -p 10000
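The `--input-data` JSON files hold the UUIDs to query, in perf_analyzer's input-data format. A sketch of a generator for such a file, with placeholder (zero-valued) UUIDs, could look like this:

```python
# Sketch: generate a perf_analyzer input-data file with placeholder UUIDs.
# Real benchmarks need UUID pairs that actually exist in the database.
import json

entries = [{"UUID": [0, 0]} for _ in range(16)]  # one uint64 pair per entry

with open("uuids.json", "w") as f:
    json.dump({"data": entries}, f)
```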