Published on: September 29, 2022
Deploying machine learning models into a production environment can be a long and winding road. A lot can go wrong between the experimentation phase and the production floor, and a successful deployment depends on a complex interaction between data, machine learning model and code. The MLOps methodology is a good way to streamline this interaction. MLOps borrows heavily from DevOps, but there are important differences.
According to a 2019 VentureBeat article, 87% of data science projects never make it to production. Why is it so hard to turn machine learning experiments into applications deployed on the production floor? There are several possible reasons.
One of the inherent challenges of machine learning is that a lot of experimentation is involved. Features, parameters, and models are constantly tweaked along the way. Problems occur when data scientists and machine learning engineers forget to track versions, record parameters, or keep details of the environments in which they run their models. Things quickly become messy, and the results of the experiments become hard to reproduce.
Another challenge is collaboration. Machine learning projects are typically executed by a mixed group of talents, including data engineers, data scientists, and machine learning engineers. For these projects to generate any usable results, it’s essential that everyone is able to collaborate. Unfortunately, working in silos is often the rule. Machine learning experiments are often done in computational notebooks that reside in the cloud or on a data scientist’s personal workstation. More often than not, the evolution of the code is not (sufficiently) exposed to the team, which makes it challenging to reproduce experiments in other computing environments.
In addition to experimentation and a high need for collaboration, machine learning projects are characterized by fast change. Machine learning applications evolve rapidly: state-of-the-art models that are used as a starting point are often refined and extended along the way to better suit their purpose, quickly making the original model obsolete. This makes deployment fairly complex.
To cope with fast change and to avoid rework and inefficiency, machine learning projects need a stable but agile process for bringing applications into production, one that allows teams to adapt quickly to changing conditions. But wait a minute, isn’t that what DevOps is for?
Indeed, DevOps is a methodology in traditional software development that ensures a stable deployment process and a clear lineage of the code. DevOps has helped companies to break down the silos of development and operational teams, and has stimulated collaboration between both teams to come to better results. DevOps works with a never-ending, iterative loop through which engineering teams can continuously improve. This way, teams go through different stages, from planning to monitoring, and then go back to the beginning to start a new loop.
The agile principles of DevOps could work for machine learning, were it not for the fact that machine learning is about more than just code. In machine learning there is a constant interaction between the code, the data and the model. That’s why a traditional software workflow cannot work. But isn’t there a way to apply the DevOps principles to machine learning? Yes, there is. It’s called MLOps.
MLOps is a methodology, based on DevOps, which improves the collaboration between data scientists and operations professionals. Applying this methodology helps teams to deploy machine learning models in large-scale production environments much faster and with much better results. Let’s look at some of the building blocks of the MLOps methodology.
Code versioning is an essential part of DevOps. In MLOps, there is also a need to version the data and the models that are produced. In addition, all these versioning processes need to be correlated. For every model, you need the corresponding datasets and the corresponding versions of the code.
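As a minimal sketch of what that correlation can look like in practice, assuming MLflow is used for run tracking, a single training run can record the code version, the data version and the resulting model together (the tag values and the hyperparameter below are placeholders):

```python
# A hedged sketch: tie the code, data and model versions together in one run.
import subprocess
import mlflow

def current_git_commit() -> str:
    # The git commit hash pins the exact code version used for this run.
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

with mlflow.start_run():
    mlflow.set_tag("git_commit", current_git_commit())
    # Whatever identifier your data versioning tool produces (e.g. a DVC
    # revision); the value below is purely illustrative.
    mlflow.set_tag("data_version", "dvc-rev-abc123")
    mlflow.log_param("learning_rate", 1e-3)
    # ... train the model here, then log it so that this single run links the
    # code version, the data version and the resulting model.
```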
A pipeline connects the development environment to the production environment. This enables us to transfer the trained model into production, where it is used for inference. Computational notebooks need to be converted into executable code, so that they can be versioned and integrated into the deployment pipeline.
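For illustration, a notebook cell that trains a model can be refactored into a small, versionable command-line script along these lines (the function body and arguments are placeholders):

```python
# train.py -- a hedged sketch of notebook code turned into an executable script.
import argparse

def train(data_path: str, epochs: int) -> None:
    # Placeholder: load the data, fit the model, save the resulting artifact.
    print(f"training on {data_path} for {epochs} epochs")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Training entry point")
    parser.add_argument("--data-path", required=True)
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args()
    train(args.data_path, args.epochs)
```

A script like this can be committed, reviewed and called from a deployment pipeline, which is much harder to do with a notebook.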
By training the model in the cloud, you have a centralized, or at least a shared, environment, which facilitates collaboration among data scientists. A centralized environment and automated workflows also offer uniformity and reproducibility, which are prerequisites for the successful delivery of a machine learning project. At Kapernikov, we typically rely on a DevOps platform for storing the code in the cloud. The data can be stored in some cloud storage and be downloaded/mounted at the beginning of the training. For the model, we prefer to host the trained model in a dedicated model registry.
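For example, if MLflow serves as the model registry, a trained model can be promoted to a named, versioned registry entry roughly like this (the run id and model name are placeholders):

```python
# A hedged sketch, assuming an MLflow model registry is available.
import mlflow

run_id = "<run-id-from-a-training-run>"  # placeholder
# Register the model logged under this run as a new version of "defect-detector".
mlflow.register_model(model_uri=f"runs:/{run_id}/model", name="defect-detector")
```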
The MLOps lifecycle offers a complete workflow to connect experimentation with the production environment. As part of this pipeline, short cycles of development can bring new features into production.
Work packages for data engineers, data scientists and machine learning engineers are clearly defined.
Now that your versioned data and models are saved in the cloud, along with your code, it’s much easier to reproduce models. Whenever data scientists need to rework a model, they can reproduce it safely on their machine or in the cloud.
Model training can be automated and scheduled in the cloud, so that the best performing model is always available for production. The cloud also makes collaboration much easier between members of a team working on the same project. No more lost models, no more lost code.
Applying MLOps principles leads to faster and more robust deployment of applications whenever there is a code or data change. Granted, training models in the cloud may be resource-intensive and time-consuming. However, it’s still perfectly possible with ML pipelines to run training locally. For example, you could train on a local machine in the early stage of your project to decrease the latency and to speed up development. Then, following your ML pipeline, you could regularly share your code, data and models in the cloud when you reach key milestones.
MLOps is an ensemble of processes, supported by a wide range of tools and platforms. Some tools, like DVC and MLFlow, will manage part of the process. To do MLOps, you will need a selection of tools, also called a framework. Some MLOps platforms offer most of these processes as one solution.
To perform the training in a remote environment, you will need to clearly define the tasks that lead to the building of the model. This includes preprocessing (data transformation, data augmentation), the training process itself, and the validation process. Depending on the MLOps framework/platform, these tasks are defined at different levels of abstraction:
With code and notebooks as tasks, the frameworks/platforms are language dependent. Dependencies also need to be provided and supported by the solution, which is more difficult for recent libraries. Even when they are supported, there is a risk of failure every time you update these dependencies.
Docker containers as tasks, by contrast, offer a solution where you keep control over the dependencies. A container is a dedicated artifact that offers clean isolation and that you can test functionally. Moreover, Docker containers are interoperable, meaning that you are not tied to one framework/platform.
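Whichever abstraction level you choose, the decomposition itself looks roughly the same: each task has explicit inputs and outputs, so it can run as a notebook step or be wrapped in its own container. A minimal sketch (file names and function bodies are illustrative):

```python
# A hedged sketch of a task decomposition; each function could become the
# entry point of its own Docker container.
from pathlib import Path

def preprocess(raw_dir: Path, out_path: Path) -> None:
    # Placeholder: data transformation and augmentation.
    out_path.write_text("preprocessed data")

def train(data_path: Path, model_path: Path) -> None:
    # Placeholder: fit a model on the preprocessed data and save it.
    model_path.write_text(f"model trained on {data_path.name}")

def validate(model_path: Path, report_path: Path) -> None:
    # Placeholder: evaluate the model and write a validation report.
    report_path.write_text(f"validation report for {model_path.name}")

if __name__ == "__main__":
    preprocess(Path("raw"), Path("data.txt"))
    train(Path("data.txt"), Path("model.txt"))
    validate(Path("model.txt"), Path("report.txt"))
```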
Versioning of code, data and models, a pipeline that connects the development environment to production and training in the cloud: these best practices form a solid foundation for MLOps. Once you have these components in place, you can start extending your workflow with additional or improved components.
The ML pipeline includes all tasks required to build a model: preprocessing, training and validation.
The ML pipeline can be triggered by code changes, or by adding new data, or it can be scheduled. Complex non-linear workflows can be designed.
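As a toy illustration of a non-linear workflow, task dependencies can be expressed as a graph and resolved into an execution order (real MLOps platforms have their own DAG definitions; the task names here are illustrative):

```python
# A minimal sketch using only the standard library (Python 3.9+).
# Each key lists the tasks it depends on.
from graphlib import TopologicalSorter

pipeline = {
    "augment": {"preprocess"},
    "train": {"preprocess", "augment"},
    "validate": {"train"},
}
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # e.g. ['preprocess', 'augment', 'train', 'validate']
```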
The model can be put into production by using automated tools (TFServing, TorchServe, MLServer), but it can also be deployed through a homemade API. This offers end users more flexibility and customization possibilities. In both cases, you get an API that offers an interface between the model and the outside world.
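A homemade inference API can be as small as the sketch below (assuming FastAPI; the model call is replaced by a placeholder):

```python
# A hedged sketch of a minimal inference API; in a real service the registered
# model would be loaded once at startup and used inside the endpoint.
from typing import List
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    features: List[float]

@app.post("/predict")
def predict(request: PredictionRequest) -> dict:
    # Placeholder "model": replace with a call like model.predict(request.features).
    return {"prediction": sum(request.features)}
```

Served with an ASGI server such as uvicorn, this gives the model a simple HTTP interface to the outside world.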
When a model is in production, you can collect statistics to get insights. You can also set up alerts that warn you of possible problems. By closely monitoring your model, you can detect model drift in time: the model becomes less accurate because the data it receives in production differs from the data it was trained on. An often-cited tool for performing drift detection is Seldon.
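The core idea can be illustrated without any dedicated tooling: compare the distribution of a feature seen in production with its distribution in the training data (a simplified sketch on synthetic data; Seldon and similar tools offer far more robust detectors):

```python
# A minimal illustration of drift detection on a single feature using a
# two-sample Kolmogorov-Smirnov test; the data here is synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)    # reference
production_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)  # shifted

statistic, p_value = ks_2samp(training_feature, production_feature)
if p_value < 0.01:
    print(f"possible drift detected (KS statistic = {statistic:.3f})")
```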
You can track and share experiments using tools like DVC and MLFlow. Tracking includes the execution of the ML pipeline along with all the produced artifacts (logs, metrics).
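For instance, with MLFlow the parameters, metrics and artifacts of a pipeline run can be recorded in a few lines (the values and file name below are illustrative):

```python
# A hedged sketch of experiment tracking with MLflow.
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("batch_size", 32)
    mlflow.log_metric("val_accuracy", 0.91)
    mlflow.log_artifact("training.log")  # any produced file: logs, plots, reports
```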
MLOps helps teams to deploy machine learning applications into a production environment much faster and with higher quality results. By connecting the experimentation environment to the production environment using a well-defined deployment process, it is much easier for teams to adapt to changes. And by carefully versioning all data, models and code in the cloud, there is no risk of losing valuable work.
MLOps offers a controlled development process, which may require a learning curve for your team, but at Kapernikov, we couldn’t be more convinced of the benefits of this methodology. Need to get your team up to speed with MLOps? Let us know, maybe we can help.