OpenVINO™ Model Server

Model Server hosts models and makes them accessible to software components over standard network protocols: a client sends a request to the model server, which performs model inference and sends a response back to the client. Model Server offers many advantages for efficient model deployment:

Remote inference enables using lightweight clients with only the necessary functions to perform API calls to edge or cloud deployments.
Applications are independent of the model framework, hardware device, and infrastructure.
Client applications in any programming language that supports REST or gRPC calls can be used to run inference remotely on the model server.
Clients require fewer updates since client libraries change very rarely.
Model topology and weights are not exposed directly to client applications, making it easier to control access to the model.
Ideal architecture for microservices-based applications and deployments in cloud environments – including Kubernetes and OpenShift clusters.
Efficient resource utilization with horizontal and vertical inference scaling.

OpenVINO™ Model Server (OVMS) is a high-performance system for serving models. Implemented in C++ for scalability and optimized for deployment on Intel architectures, the model server uses the same architecture and API as TensorFlow Serving and KServe while applying OpenVINO for inference execution. The inference service is exposed over gRPC or REST API, which makes it easy to deploy new algorithms and AI experiments.
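As a rough illustration of the request flow over REST, the sketch below builds request URLs and JSON bodies in the style of the TensorFlow Serving predict API and the KServe v2 inference API. The host, port, model name `resnet`, and input name `input` are hypothetical placeholders, and nothing is sent over the network. In a real deployment, the endpoint and tensor names should come from the server configuration and model metadata.

```python
import json


def tfs_predict_request(host, port, model_name, instances):
    """Return (url, body) for a TensorFlow Serving-style REST predict call."""
    url = f"http://{host}:{port}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances})
    return url, body


def kserve_infer_request(host, port, model_name, input_name, shape, datatype, data):
    """Return (url, body) for a KServe v2-style REST infer call."""
    url = f"http://{host}:{port}/v2/models/{model_name}/infer"
    body = json.dumps({
        "inputs": [
            {"name": input_name, "shape": shape, "datatype": datatype, "data": data}
        ]
    })
    return url, body


# Hypothetical values, chosen only for illustration.
tfs_url, tfs_body = tfs_predict_request("localhost", 8000, "resnet", [[0.1, 0.2, 0.3]])
kserve_url, kserve_body = kserve_infer_request(
    "localhost", 8000, "resnet", "input", [1, 3], "FP32", [0.1, 0.2, 0.3]
)
```

Either pair can then be sent with any HTTP client as a POST request with a `Content-Type: application/json` header. The response is a JSON document containing the model outputs.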

Models used by the server must be stored locally or hosted remotely in an object storage service. For more details, refer to the Preparing Model Repository documentation. The model server runs in Docker containers, on bare metal, and in Kubernetes environments. Start using OpenVINO Model Server with a fast-forward serving example from the Quickstart guide or explore Model Server features.
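A minimal sketch of how a locally stored model repository might be inspected, assuming the common layout in which each model directory contains numbered version subdirectories. The model name `resnet` and the version numbers are hypothetical, and the directory tree is created in a temporary location purely for demonstration. The Preparing Model Repository documentation remains the authoritative description of the layout.

```python
import tempfile
from pathlib import Path


def list_versions(model_dir):
    """Return the numeric version subdirectories of a model directory, sorted ascending."""
    return sorted(
        int(p.name) for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
    )


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical model with three versions: resnet/1, resnet/2, resnet/10
    model_dir = Path(tmp) / "resnet"
    for version in (1, 2, 10):
        (model_dir / str(version)).mkdir(parents=True)

    versions = list_versions(model_dir)
    latest = versions[-1]  # numeric sort, so 10 ranks above 2
```

Sorting the version names as integers rather than strings matters here: a plain string sort would place `10` before `2`.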

Read the release notes to find out what’s new.

Key features:

support for multiple frameworks, such as Caffe, TensorFlow, MXNet, PaddlePaddle and ONNX
online deployment of new model versions
configuration updates in runtime
support for AI accelerators
works on bare metal hosts as well as in Docker containers
model reshaping in runtime
Directed Acyclic Graph (DAG) Scheduler - connecting multiple models to deploy complex processing solutions and reducing data transfer overhead
custom nodes in DAG pipelines - allowing model inference and data transformations to be implemented with a custom node C/C++ dynamic library
serving stateful models - models that operate on sequences of data and maintain their state between inference requests
binary format of the input data - data can be sent in JPEG or PNG formats to reduce traffic and offload the client applications
model caching - cache the models on first load and re-use models from cache on subsequent loads
metrics - metrics compatible with Prometheus standard
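To illustrate the binary input feature, the sketch below wraps raw image bytes in a REST request body using the `b64` key, which is the TensorFlow Serving REST convention for binary values. Whether a particular model and server configuration accepts this form is an assumption here and should be checked against the binary input documentation. The byte string stands in for a real JPEG file and is not a valid image.

```python
import base64
import json


def binary_instance(image_bytes):
    """Wrap raw encoded-image bytes as a base64 value under the `b64` key."""
    return {"b64": base64.b64encode(image_bytes).decode("ascii")}


# Placeholder bytes standing in for the contents of a JPEG file (hypothetical data).
fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 16

# The client sends the compressed image as-is; no decoding or resizing on the client side.
body = json.dumps({"instances": [binary_instance(fake_jpeg)]})

# Round trip: what a receiver would recover from the request body.
decoded = base64.b64decode(json.loads(body)["instances"][0]["b64"])
```

Because the client sends the compressed file contents rather than a decoded pixel array, the request is typically much smaller, and image decoding happens on the server.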

Note: OVMS has been tested on Red Hat and Ubuntu. The latest publicly released Docker images are based on Ubuntu and UBI. They are stored in:

Run OpenVINO Model Server

A demonstration of how to use OpenVINO Model Server can be found in our quick-start guide. For more information on using Model Server in various scenarios, see the following guides: