TorchServe was designed as a multi-model inference framework. A production-grade inference framework needs APIs to request inferences and APIs to manage models, all while keeping track of logs. TorchServe manages several worker processes that are dynamically assigned to different models. The behavior of those workers is determined by a handler file, and weights are loaded from a model store.
- Frontend: The request/response handling component of TorchServe. It handles requests and responses coming from clients and manages the lifecycles of the models.
- Model Workers: These workers are responsible for running the actual inference on the models.
- Model: Models could be a `script_module` (JIT saved models) or `eager_mode_models`. These models can provide custom pre- and post-processing of data along with any other model artifacts such as state_dicts. Models can be loaded from cloud storage or from local hosts.
- Plugins: These are custom endpoints or authz/authn or batching algorithms that can be dropped into TorchServe at startup time.
- Model Store: This is a directory in which all the loadable models exist.
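
The handler file mentioned above decides what a worker does with a request. As a rough, torch-free sketch of that shape (assuming the usual `initialize` / `preprocess` / `inference` / `postprocess` / `handle` split; the class name `SketchHandler` and the doubling "model" are hypothetical stand-ins, not TorchServe code):

```python
class SketchHandler:
    """Hypothetical handler that mirrors the common handler method layout."""

    def __init__(self):
        self.model = None
        self.initialized = False

    def initialize(self, context):
        # A real handler would use the context to locate the model
        # directory and load weights. Here the "model" is a stand-in
        # function that doubles every input value.
        self.model = lambda values: [2 * v for v in values]
        self.initialized = True

    def preprocess(self, requests):
        # Each request in the batch is a dict; the payload is assumed to
        # sit under "data" or "body".
        inputs = []
        for req in requests:
            payload = req.get("data")
            if payload is None:
                payload = req.get("body")
            inputs.append(float(payload))
        return inputs

    def inference(self, inputs):
        return self.model(inputs)

    def postprocess(self, outputs):
        # Return one response entry per request in the batch.
        return list(outputs)

    def handle(self, data, context):
        return self.postprocess(self.inference(self.preprocess(data)))


handler = SketchHandler()
handler.initialize(context=None)  # no real context is needed in this sketch
responses = handler.handle([{"data": "1.5"}, {"body": 2}], context=None)
```

The point of the split is that a custom handler can override only the step it cares about, for example `preprocess`, and keep the rest.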
- `.github`: CI for docs and nightly builds
- `benchmarks`: tools to benchmark torchserve on reference models
- `binaries`: utilities to create binaries for pypi, conda and docker
- `docker`: user and dev dockerfiles to use torchserve
- `docs`: documentation for pytorch.org/serve
- `examples`: reference examples
- `experimental`: projects with no support or backwards compatibility guarantees
- `frontend`: Core Java engine for TorchServe
- `kubernetes`: how to deploy TorchServe in a K8 cluster
- `model-archiver`: model package CLI
- `plugins`: extend core TorchServe
- `requirements`: requirements.txt
- `serving_sdk`: SDK to support TorchServe in sagemaker
- `test`: tests
- `ts_scripts`: useful utility files that don't fit in any other folder
- `workflow-archiver`: workflow package CLI
Frontend means the Java part of the code (potentially C++), and backend means the Python code (most of the PyTorch-specific stuff).
https://github.com/pytorch/serve/blob/master/ts/arg_parser.py#L64
- The arg parser controls config/not workflow and can also set up a model service worker with a custom socket
https://github.com/pytorch/serve/blob/master/ts/context.py
- Context object of an incoming request; keeps model-relevant worker information
https://github.com/pytorch/serve/blob/master/ts/model_server.py
- The model server opens the pid file and starts TorchServe using the arg parser
- When stopping, it uses psutil.Process(pid).terminate()
- Loads config.properties
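
To make the `config.properties` step concrete, here is a minimal sketch of reading a key=value properties file with the standard library only. The helper `parse_properties` and the sample values are illustrative assumptions; TorchServe's own loading code may handle more cases (escapes, line continuations) than this does.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        # Split on the first "=" only, so values such as URLs stay intact.
        key, sep, value = line.partition("=")
        if not sep:
            continue
        props[key.strip()] = value.strip()
    return props


# Sample content; the paths and ports here are placeholders.
SAMPLE = """\
# addresses the frontend listens on
inference_address=http://127.0.0.1:8080
management_address=http://127.0.0.1:8081
model_store=/tmp/model_store
"""

config = parse_properties(SAMPLE)
```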
https://github.com/pytorch/serve/blob/master/ts/model_loader.py
- Model loader
- Uses manifest file to find handler and envelope and starts the service
- Loads either default handler or custom handler
- Request envelopes make it easier to interact with other systems like Seldon, KFServing, and Google Cloud AI Platform
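
The manifest lookup described above can be sketched roughly as follows. This assumes the manifest is a JSON document with a `model` section holding `handler` and an optional `envelope` key; the helper names and the sample manifest are hypothetical, and the real loader does more (importing the module, falling back to default handlers, starting the service).

```python
import json


def read_handler_and_envelope(manifest):
    """Pull the handler spec and optional envelope name from a manifest dict."""
    model = manifest.get("model", {})
    return model.get("handler"), model.get("envelope")


def split_handler(handler):
    """Split 'file.py:function' into (module_name, function_name).

    A bare name such as 'image_classifier' yields (name, None), which a
    loader could treat as a request for a built-in default handler.
    """
    module_part, _, function_name = handler.partition(":")
    if module_part.endswith(".py"):
        module_part = module_part[:-3]
    return module_part, function_name or None


# Illustrative manifest; the values are made up for this sketch.
SAMPLE_MANIFEST = json.loads("""
{
  "runtime": "python",
  "model": {
    "modelName": "demo",
    "handler": "my_handler.py:handle",
    "modelVersion": "1.0"
  }
}
""")

handler_spec, envelope = read_handler_and_envelope(SAMPLE_MANIFEST)
custom = split_handler(handler_spec)
default = split_handler("image_classifier")
```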
../gradlew startServer
- Takes care of closing workers
- Just an enum of worker states
- Get number of running workers
- Workers are tracked in a concurrent hashmap; the backend group, ports, etc. are all here
- Add worker threads by submitting them to a threadpool Executor Service (create a pool of threads and assign tasks or worker threads to it)
- Batch aggregator
- Puts requests and responses in a list
- Keeps track of workers, batch size, timeout, version and mar name
- Encoding the model state in a JSON and then pulling properties from it
- Keeps track of jobs which are either inference or management requests
- Many metrics are just added using psutil package in Python
- Model registration calling https://github.com/pytorch/serve/blob/8903ca1fb059eab3c1e8eccdee1376d4ff52fb67/frontend/server/src/main/java/org/pytorch/serve/util/ApiUtils.java
- Install model dependencies
- Create model archive
- All configs managed here
- Get GPU usage
- Worker thread has model, aggregator, listener, eventloop, port etc and then a run function which connects it to a request
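
The batch aggregator and thread pool notes above describe Java frontend code. As a loose Python illustration of the same idea (not a port of the Java classes), the sketch below submits one worker to a thread pool; the worker collects jobs into batches bounded by a batch size and a maximum delay. The names `collect_batch` and `run_worker` and all parameter values are assumptions made for this example.

```python
import queue
import time
from concurrent.futures import ThreadPoolExecutor


def collect_batch(jobs, batch_size, max_delay_s):
    """Block for the first job, then add more until full or out of time."""
    batch = [jobs.get()]
    deadline = time.monotonic() + max_delay_s
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(jobs.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


def run_worker(jobs, batch_size, max_delay_s, n_batches):
    """Worker loop: collect a fixed number of batches and return them."""
    return [collect_batch(jobs, batch_size, max_delay_s) for _ in range(n_batches)]


# Five pending requests, batch size 2: expect batches of 2, 2 and 1.
jobs = queue.Queue()
for i in range(5):
    jobs.put(f"req-{i}")

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(run_worker, jobs, 2, 0.05, 3)
    batches = future.result()
```

The last batch is smaller because the delay expires before a second request arrives, which is the trade-off a batch timeout controls: a longer delay gives fuller batches at the cost of latency.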