Decoding ML Deployment Best Practices
Building a maintainable and affordable platform for your models from scratch? Here are five tips to start off on the right foot.
1) Use Containers & Container Orchestration:
Use Docker & Kubernetes to scale and manage the infrastructure that runs multiple models.
This is efficient because it creates a central place for managing and organizing hardware, and it lets the same hardware be reused across multiple models.
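To make this concrete, here is a minimal sketch using the official kubernetes Python client that submits one model run as a Job requesting a single GPU; the image name my-model:latest, the job name, and the one-GPU request are illustrative assumptions, not a fixed recipe.

```python
from kubernetes import client, config

config.load_kube_config()  # inside the cluster, use config.load_incluster_config()

container = client.V1Container(
    name="model-runner",
    image="my-model:latest",  # hypothetical model image
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "1"}  # reserve one GPU for this run
    ),
)
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="model-job"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        )
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```

Because every model ships as a container image, the same cluster can run any of them on whatever node has capacity.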
2) Use Job Queueing:
Job queueing ensures that data processing happens in an orderly manner no matter how many GPUs are available at any given time.
Note that a GPU typically runs one model at a time, so each model must be shut down once it finishes before the next one can run.
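As a rough sketch of the idea using only Python's standard library (load_model and run_inference are hypothetical stand-ins for your own code):

```python
import queue
import threading

NUM_GPUS = 2  # assumption: two GPUs on this machine

def load_model(name, device):
    ...  # hypothetical: load the named model onto the given GPU

def run_inference(model, data):
    ...  # hypothetical: run the model on one job's data

job_queue: queue.Queue = queue.Queue()

def gpu_worker(gpu_id: int) -> None:
    # Each worker owns one GPU and pulls jobs in FIFO order, so a GPU
    # only ever runs one model at a time.
    while True:
        job = job_queue.get()  # blocks until a job is available
        model = load_model(job["model"], device=f"cuda:{gpu_id}")
        run_inference(model, job["data"])
        del model              # shut the model down so the next one can run
        job_queue.task_done()

for gpu_id in range(NUM_GPUS):
    threading.Thread(target=gpu_worker, args=(gpu_id,), daemon=True).start()

job_queue.put({"model": "sentiment-v2", "data": [1, 2, 3]})  # enqueue work
job_queue.join()  # wait until all queued jobs have been processed
```

Jobs simply wait in the queue when every GPU is busy, so the system degrades gracefully instead of oversubscribing hardware.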
3) Use Event-based Architecture:
An event-based architecture executes work asynchronously, reducing GPU idle time: a GPU is handed an inference job as soon as new data becomes available.
With this architecture, you can define multiple GPU node types within a container orchestrator such as Kubernetes and schedule each model onto the smallest hardware it can run on, right-sizing the hardware for the job.
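A minimal asyncio sketch of the pattern; new_data_events and run_inference_job are hypothetical stand-ins for a real message-bus listener and job dispatcher:

```python
import asyncio

async def new_data_events():
    # Hypothetical event source: in practice this would listen to a message
    # bus or object-store notifications (e.g. S3 upload events).
    while True:
        await asyncio.sleep(1)                  # placeholder wait
        yield {"path": "s3://bucket/new-file"}  # hypothetical payload

async def run_inference_job(event):
    # Hypothetical dispatcher: in practice this would launch a right-sized
    # GPU job (e.g. a Kubernetes Job) for the data the event points at.
    print("dispatching inference for", event["path"])

async def main():
    async for event in new_data_events():
        # Each event is handled as its own task, so one slow job never
        # blocks the next event from being picked up.
        asyncio.create_task(run_inference_job(event))

asyncio.run(main())
```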
4) Batch Process with a Single Model:
A single model is hosted in the cloud on a GPU or CPU for processing a batch of data.
Batch data can be submitted to the model service for processing, and the GPU can then be manually shut down to save costs.
To maximize efficiency, each run should be packed with enough data to keep GPU utilization as high as possible.
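A minimal sketch of the batching loop; predict and the record source are hypothetical placeholders:

```python
from typing import Iterator, List

def batches(items: List, size: int) -> Iterator[List]:
    # Yield fixed-size chunks; the last batch may be smaller.
    for i in range(0, len(items), size):
        yield items[i : i + size]

def predict(batch: List) -> None:
    ...  # hypothetical: run the hosted model on one packed batch

records = list(range(10_000))  # assumption: the accumulated backlog of data

for batch in batches(records, size=256):  # large batches keep the GPU busy
    predict(batch)

# Once the backlog is processed, shut the GPU instance down (manually or via
# your cloud provider's API) to stop incurring cost.
```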
5) Process Real-time Data with a Single Model:
Real-time or streaming data processing needs enough GPU hardware running 24/7 to return results quickly.
To save costs, design the system to scale up and down with the peaks and troughs of activity.
It may not be as cost-effective as batch processing, but it can still result in cost savings.
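As a sketch, a scaling loop might track the request backlog like the one below; get_queue_depth, scale_replicas, and the throughput constant are assumptions, and in a Kubernetes setup an autoscaler such as HPA or KEDA would typically play this role.

```python
import math
import time

MIN_REPLICAS = 1        # always keep one warm replica for quick results
MAX_REPLICAS = 8        # cost ceiling during peaks
REQS_PER_REPLICA = 50   # assumption: sustained throughput of one GPU replica

def get_queue_depth() -> int:
    ...  # hypothetical: pending requests, e.g. from a message-broker metric
    return 0

def scale_replicas(n: int) -> None:
    ...  # hypothetical: resize the serving deployment to n replicas

while True:
    desired = math.ceil(get_queue_depth() / REQS_PER_REPLICA)
    scale_replicas(max(MIN_REPLICAS, min(MAX_REPLICAS, desired)))
    time.sleep(30)  # re-evaluate every 30 seconds
```

Keeping a small warm floor preserves low latency, while the ceiling caps spend when traffic spikes.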