Published on: September 29, 2022
Deploying machine learning models into a production environment can be a long and winding road. A lot can go wrong between the experimentation phase and the production floor, and a successful deployment depends on a complex interaction between data, machine learning model and code. The MLOps methodology is a good way to streamline this interaction. MLOps borrows heavily from DevOps, but there are important differences.
According to a 2019 VentureBeat article, 87% of data science projects never make it to production. Why is it so hard to turn machine learning experiments into applications deployed on the production floor? There are several possible reasons.
One of the inherent challenges of machine learning is that a lot of experimentation is involved. Features, parameters, and models are constantly tweaked along the way. Problems occur when data scientists and machine learning engineers forget to track versions, record parameters, or keep details of the environments in which they run their models. Things quickly become messy, and the results of the experiments become hard to reproduce.
Another challenge is collaboration. Machine learning projects are typically executed by a mixed group of talents, including data engineers, data scientists, and machine learning engineers. For these projects to generate any usable results, it’s essential that everyone is able to collaborate. Unfortunately, working in silos is often the rule. Machine learning experiments are often done in computational notebooks that reside in the cloud or on a data scientist’s personal workstation. More often than not, the evolution of the code is not (sufficiently) exposed to the team, which makes it challenging to reproduce experiments in other computing environments.
In addition to experimentation and a high need for collaboration, machine learning projects are characterized by fast change. Machine learning applications evolve rapidly: state-of-the-art models that are used as a starting point are often refined and extended along the way to better suit their purpose, quickly making the original model obsolete. This makes deployment fairly complex.
To cope with fast change and to avoid rework and inefficiency, machine learning projects need a stable but agile process for bringing applications into production, one that allows teams to adapt quickly to changing conditions. But wait a minute, isn’t that what DevOps is for?
Indeed, DevOps is a methodology in traditional software development that ensures a stable deployment process and a clear lineage of the code. DevOps has helped companies to break down the silos of development and operational teams, and has stimulated collaboration between both teams to come to better results. DevOps works with a never-ending, iterative loop through which engineering teams can continuously improve. This way, teams go through different stages, from planning to monitoring, and then go back to the beginning to start a new loop.
The agile principles of DevOps could work for machine learning, were it not for the fact that machine learning is about more than just code. In machine learning there is a constant interaction between the code, the data and the model. That’s why a traditional software workflow cannot work. But isn’t there a way to apply the DevOps principles to machine learning? Yes, there is. It’s called MLOps.
MLOps is a methodology, based on DevOps, which improves the collaboration between data scientists and operations professionals. Applying this methodology helps teams to deploy machine learning models in large-scale production environments much faster and with much better results. Let’s look at some of the building blocks of the MLOps methodology.
Code versioning is an essential part of DevOps. In MLOps, there is also a need to version the data and the models that are produced. In addition, all these versioning processes need to be correlated. For every model, you need the corresponding datasets and the corresponding versions of the code.
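As a minimal sketch of what that correlation can look like in practice, assuming MLflow is used for run tracking, a single training run can record the code version, the data version and the resulting model together (the tag values and the hyperparameter below are placeholders):

```python
# A hedged sketch: tie the code, data and model versions together in one run.
import subprocess
import mlflow

def current_git_commit() -> str:
    # The git commit hash pins the exact code version used for this run.
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

with mlflow.start_run():
    mlflow.set_tag("git_commit", current_git_commit())
    # Whatever identifier your data versioning tool produces (e.g. a DVC
    # revision); the value below is purely illustrative.
    mlflow.set_tag("data_version", "dvc-rev-abc123")
    mlflow.log_param("learning_rate", 1e-3)
    # ... train the model here, then log it so that this single run links the
    # code version, the data version and the resulting model.
```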
A pipeline connects the development environment to the production environment. This enables us to transfer the trained model into production, where it is used for inference. Computational notebooks need to be converted into executable code, so that they can be versioned and integrated into the deployment pipeline.
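For illustration, a notebook cell that trains a model can be refactored into a small, versionable command-line script along these lines (the function body and arguments are placeholders):

```python
# train.py -- a hedged sketch of notebook code turned into an executable script.
import argparse

def train(data_path: str, epochs: int) -> None:
    # Placeholder: load the data, fit the model, save the resulting artifact.
    print(f"training on {data_path} for {epochs} epochs")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Training entry point")
    parser.add_argument("--data-path", required=True)
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args()
    train(args.data_path, args.epochs)
```

A script like this can be committed, reviewed and called from a deployment pipeline, which is much harder to do with a notebook.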
By training the model in the cloud, you have a centralized, or at least a shared, environment, which facilitates collaboration among data scientists. A centralized environment and automated workflows also offer uniformity and reproducibility, which are prerequisites for the successful delivery of a machine learning project. At Kapernikov, we typically rely on a DevOps platform for storing the code in the cloud. The data can be stored in some cloud storage and be downloaded/mounted at the beginning of the training. For the model, we prefer to host the trained model in a dedicated model registry.
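For example, if MLflow serves as the model registry, a trained model can be promoted to a named, versioned registry entry roughly like this (the run id and model name are placeholders):

```python
# A hedged sketch, assuming an MLflow model registry is available.
import mlflow

run_id = "<run-id-from-a-training-run>"  # placeholder
# Register the model logged under this run as a new version of "defect-detector".
mlflow.register_model(model_uri=f"runs:/{run_id}/model", name="defect-detector")
```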
The MLOps lifecycle offers a complete workflow to connect experimentation with the production environment. As part of this pipeline, short cycles of development can bring new features into production.
Work packages for data engineers, data scientists and machine learning engineers are clearly defined.
Now that your versioned data and models are saved in the cloud, along with your code, it’s much easier to reproduce models. Whenever data scientists need to rework a model, they can reproduce it safely on their machine or in the cloud.
Model training can be automated and scheduled in the cloud, so that the best performing model is always available for production. The cloud also makes collaboration much easier between members of a team working on the same project. No more lost models, no more lost code.
Applying MLOps principles leads to faster and more robust deployment of applications whenever there is a code or data change. Granted, training models in the cloud may be resource-intensive and time-consuming. However, it’s still perfectly possible with ML pipelines to run training locally. For example, you could train on a local machine in the early stage of your project to decrease the latency and to speed up development. Then, following your ML pipeline, you could regularly share your code, data and models in the cloud when you reach key milestones.
MLOps is an ensemble of processes, supported by a wide range of tools and platforms. Some tools, like DVC and MLFlow, will manage part of the process. To do MLOps, you will need a selection of tools, also called a framework. Some MLOps platforms offer most of these processes as one solution.
To perform the training in a remote environment, you will need to clearly define the tasks that lead to the building of the model. This includes preprocessing (data transformation, data augmentation), the training process itself, and the validation process. Depending on the MLOps framework/platform, these tasks are defined at different levels of abstraction:
With code and notebooks as tasks, the frameworks/platforms are language dependent. Dependencies also need to be provided and supported by the solution, which is more difficult for recent libraries. Even when they are supported, there is a risk of failure every time you update these dependencies.
Docker containers as tasks, by contrast, offer a solution where you keep control over the dependencies. A container is a dedicated artifact that offers clean isolation and that you can test functionally. Moreover, Docker containers are interoperable, meaning that you are not tied to one framework/platform.
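Whichever abstraction level you choose, the decomposition itself looks roughly the same: each task has explicit inputs and outputs, so it can run as a notebook step or be wrapped in its own container. A minimal sketch (file names and function bodies are illustrative):

```python
# A hedged sketch of a task decomposition; each function could become the
# entry point of its own Docker container.
from pathlib import Path

def preprocess(raw_dir: Path, out_path: Path) -> None:
    # Placeholder: data transformation and augmentation.
    out_path.write_text("preprocessed data")

def train(data_path: Path, model_path: Path) -> None:
    # Placeholder: fit a model on the preprocessed data and save it.
    model_path.write_text(f"model trained on {data_path.name}")

def validate(model_path: Path, report_path: Path) -> None:
    # Placeholder: evaluate the model and write a validation report.
    report_path.write_text(f"validation report for {model_path.name}")

if __name__ == "__main__":
    preprocess(Path("raw"), Path("data.txt"))
    train(Path("data.txt"), Path("model.txt"))
    validate(Path("model.txt"), Path("report.txt"))
```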
Versioning of code, data and models, a pipeline that connects the development environment to production and training in the cloud: these best practices form a solid foundation for MLOps. Once you have these components in place, you can start extending your workflow with additional or improved components.
The ML pipeline includes all tasks required to build a model: preprocessing, training and validation.
The ML pipeline can be triggered by code changes, or by adding new data, or it can be scheduled. Complex non-linear workflows can be designed.
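As a toy illustration of a non-linear workflow, task dependencies can be expressed as a graph and resolved into an execution order (real MLOps platforms have their own DAG definitions; the task names here are illustrative):

```python
# A minimal sketch using only the standard library (Python 3.9+).
# Each key lists the tasks it depends on.
from graphlib import TopologicalSorter

pipeline = {
    "augment": {"preprocess"},
    "train": {"preprocess", "augment"},
    "validate": {"train"},
}
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # e.g. ['preprocess', 'augment', 'train', 'validate']
```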
The model can be put into production by using automated tools (TFServing, TorchServe, MLServer), but it can also be deployed through a homemade API. This offers end users more flexibility and customization possibilities. In both cases, you get an API that offers an interface between the model and the outside world.
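A homemade inference API can be as small as the sketch below (assuming FastAPI; the model call is replaced by a placeholder):

```python
# A hedged sketch of a minimal inference API; in a real service the registered
# model would be loaded once at startup and used inside the endpoint.
from typing import List
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    features: List[float]

@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    # Placeholder "model": replace with a call like model.predict(request.features).
    return {"prediction": sum(request.features)}
```

Served with an ASGI server such as uvicorn, this gives the model a simple HTTP interface to the outside world.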
When a model is in production, you can collect statistics to get insights. You can also set up alerts that warn you of possible problems. By closely monitoring your model, you can detect model drift in time: the model becomes less accurate because the data it receives in production differs from the data it was trained on. An often-cited tool for performing drift detection is Seldon.
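The core idea can be illustrated without any dedicated tooling: compare the distribution of a feature seen in production with its distribution in the training data (a simplified sketch on synthetic data; Seldon and similar tools offer far more robust detectors):

```python
# A minimal illustration of drift detection on a single feature using a
# two-sample Kolmogorov-Smirnov test; the data here is synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # reference
production_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)  # shifted

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"possible drift detected (KS statistic = {statistic:.3f})")
```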
You can track and share experiments using tools like DVC and MLFlow. Tracking includes the execution of the ML pipeline along with all the produced artifacts (logs, metrics).
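For instance, with MLFlow the parameters, metrics and artifacts of a pipeline run can be recorded in a few lines (the values and file name below are illustrative):

```python
# A hedged sketch of experiment tracking with MLflow.
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("batch_size", 32)
    mlflow.log_metric("val_accuracy", 0.91)
    mlflow.log_artifact("training.log")  # any produced file: logs, plots, reports
```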
MLOps helps teams to deploy machine learning applications into a production environment much faster and with higher quality results. By connecting the experimentation environment to the production environment using a well-defined deployment process, it is much easier for teams to adapt to changes. And by carefully versioning all data, models and code in the cloud, there is no risk of losing valuable work.
MLOps offers a controlled development process, which may require a learning curve for your team, but at Kapernikov, we couldn’t be more convinced of the benefits of this methodology. Need to get your team up to speed with MLOps? Let us know, maybe we can help.