How we keep track of our data experiments

Published on: January 26, 2022

Experimentation is an essential part of training machine learning models. Feeding a model with new data can lead to new insights and new performance levels. But as data experiments continue, it can become hard to keep track of which model was trained on which dataset. That’s why at Kapernikov, we like to use Data Version Control (DVC), a dedicated software tool for keeping track of data experiments.

DVC is already well integrated into our company workflows. One of the projects where the tool came in handy was the Telraam project, where we developed an algorithm to estimate city traffic on a cost-effective edge computing chip. As our algorithm already generated good detection results during the day, we also wanted to run an experiment with nighttime traffic data. DVC allowed us to easily compare model performance across the daytime and nighttime datasets and check the added value of the new data.

In our machine learning projects, DVC helps us to:

Keep track of our experiments using different datasets
Collaborate and communicate our different data experiments across the team
Maintain a continuous development schedule by adding new data and features, all the while clearly tracking our progress
Reverse-engineer our experiments, so we understand why data is behaving in a certain way
Work faster and avoid costly mistakes
Enable different team members to work on the project independently

Why not use Git?

Now, you may think: ‘this is great, but don’t we already have better-known version control systems like Git to do the same?’ Indeed, for many of us, Git is probably a more familiar version control system. The tool helps developers keep track of different versions of their code and collaborate with other developers. That is no luxury: if you are working with different team members and handling different parts of a software project over a longer period of time, it’s easy to lose track of who did what and where a particular bug came from.

With a tool like Git, you can go back and forth between different versions of your code without being afraid of losing the code you changed. A project can be organized around a central repository and each developer or subteam working on a particular feature can push changes into that repository through a specific branch. And when a mistake is made, you can easily go back and solve the problem without disrupting the project too much.

However, Git is not the tool you need for version control of data projects, for at least two reasons:

Git is not made for huge datasets. Data projects, like the development of machine learning models, often work with large amounts of images, videos or texts. This is not practical for Git, because pushing and pulling massive amounts of data can quickly become a bottleneck.
Git’s comparison tools are built for text. With large amounts of binary data such as images or videos, it becomes hard to review or compare changes between different versions of a dataset.

Git is not made for huge datasets; DVC is.
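To make the idea concrete: a common way around the size problem is to keep the heavy files outside Git and commit only a tiny pointer that records a content hash. The Python sketch below illustrates that principle with the standard library only. It is not DVC's implementation or file format; the `.pointer.json` suffix, the file name and the placeholder bytes are our own assumptions for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_file: Path) -> Path:
    """Write a small JSON pointer next to the data file (hypothetical format)."""
    pointer = {
        "path": data_file.name,
        "sha256": file_sha256(data_file),
        "size": data_file.stat().st_size,
    }
    pointer_file = data_file.with_name(data_file.name + ".pointer.json")
    pointer_file.write_text(json.dumps(pointer, indent=2))
    return pointer_file


def demo() -> dict:
    """Create a 5 MB placeholder file and report data size versus pointer size."""
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "night_traffic.bin"  # hypothetical file name
        data_file.write_bytes(b"\x00" * 5_000_000)  # placeholder bytes, not real video
        pointer_file = write_pointer(data_file)
        return {
            "data_bytes": data_file.stat().st_size,
            "pointer_bytes": pointer_file.stat().st_size,
            "pointer": json.loads(pointer_file.read_text()),
        }


if __name__ == "__main__":
    print(demo())
```

Only the pointer would go into the Git repository. It stays a few hundred bytes at most regardless of how large the data file grows, and its hash changes whenever the data changes, so a new dataset version shows up as a small, reviewable text change.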

Version control for data projects

Machine learning projects are a whole different animal compared to pure software projects. Here, we are not only dealing with code, but also with data and machine learning models. The success of a machine learning project is a complex interplay of these three things.

So, in a machine learning project, you will have different versions of your code, but also different datasets you are experimenting with to train your machine learning model. How then can you keep track of all these different versions and experiments? And how can you make sure you do not lose any previous versions of datasets after you have updated your data?

Training and fine-tuning machine learning models is often a long process of iteration and experimentation. And once a model is released, new data keeps coming in and is used for updates. Models are often trained with different datasets, resulting in different final versions that need to be compared with each other. At a certain point, it may become difficult to unravel which model was trained with which dataset. You may also want to reproduce a data experiment, whether to verify it, to redo it on a data subset, or just to communicate and clarify it to your team members.
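The bookkeeping problem described here can be sketched in a few lines of Python. The example below fingerprints a dataset from its file names and contents and records which model was trained on which fingerprint. It is a simplified illustration of the principle, not how DVC stores this information; the `ExperimentLog` class, the model names and the byte strings standing in for images are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def dataset_fingerprint(files: dict[str, bytes]) -> str:
    """Fingerprint a dataset from file names and contents, independent of ordering."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator between the name and the content hash
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()


@dataclass
class ExperimentLog:
    # Maps a model name to the fingerprint of the dataset it was trained on.
    trained_on: dict[str, str] = field(default_factory=dict)

    def record(self, model_name: str, files: dict[str, bytes]) -> str:
        """Remember which dataset version a model was trained on."""
        fingerprint = dataset_fingerprint(files)
        self.trained_on[model_name] = fingerprint
        return fingerprint

    def models_for(self, files: dict[str, bytes]) -> list[str]:
        """Return every recorded model trained on exactly this dataset."""
        fingerprint = dataset_fingerprint(files)
        return sorted(m for m, f in self.trained_on.items() if f == fingerprint)


if __name__ == "__main__":
    # Placeholder bytes standing in for real image files.
    day = {"day_001.jpg": b"day-frame-1", "day_002.jpg": b"day-frame-2"}
    day_and_night = {**day, "night_001.jpg": b"night-frame-1"}

    log = ExperimentLog()
    log.record("detector-v1", day)
    log.record("detector-v2", day_and_night)
    print(log.models_for(day))
```

Because the fingerprint depends only on content, adding a single nighttime image produces a different dataset version, and the question "which model saw this data?" becomes a simple lookup.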

DVC to the rescue

At Kapernikov, we have been working successfully with Data Version Control (DVC), a management software tool for machine learning projects. With DVC, you can keep track of different versions of your data, reproduce experiments, monitor experiment metrics and much more. DVC consistently maintains a combination of input data, configuration, and the code that was initially used to run an experiment, which makes it easy to reproduce experiments and keep track of everything you have tried in previous experiments. And the additional benefit is that DVC syntax looks a lot like Git, which means that Git users will easily find their way in DVC.
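Why does tying data, configuration and code together make experiments reproducible? One way to picture it: if all three inputs are hashed into a single key, then an identical key means an identical experiment, and any change to one of them yields a new key. The Python sketch below shows that idea under our own assumptions. The `ExperimentCache` class, the `train_fn` callback and the configuration keys are hypothetical and do not reflect DVC's internals.

```python
import hashlib
import json


def experiment_key(data_hash: str, config: dict, code: str) -> str:
    """Combine data version, configuration and code into one deterministic key."""
    payload = json.dumps(
        {
            "data": data_hash,
            "config": config,
            "code": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        },
        sort_keys=True,  # key order in the config must not change the result
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExperimentCache:
    """Re-run an experiment only when data, config or code actually changed."""

    def __init__(self):
        self._results = {}
        self.runs = 0  # number of times training was really executed

    def run(self, data_hash: str, config: dict, code: str, train_fn):
        key = experiment_key(data_hash, config, code)
        if key not in self._results:
            self.runs += 1
            self._results[key] = train_fn(config)
        return self._results[key]


if __name__ == "__main__":
    def fake_train(config: dict) -> dict:
        # Stand-in for a real training run; it only echoes the configuration.
        return {"epochs_run": config["epochs"]}

    cache = ExperimentCache()
    cache.run("data-v1", {"epochs": 5}, "train.py contents", fake_train)
    cache.run("data-v1", {"epochs": 5}, "train.py contents", fake_train)  # reused
    cache.run("data-v2", {"epochs": 5}, "train.py contents", fake_train)  # new data
    print(cache.runs)
```

The second call reuses the stored result because nothing changed, while swapping in a new data version triggers a fresh run. The same key also gives team members an unambiguous way to refer to a specific experiment.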

The constant interplay between code, data and machine learning model can make machine learning projects complex. But by using the right data version control tools, we can stay more organized and be more effective in our development efforts.

Interested to know how you can keep your machine learning projects organized? Get in touch with one of our data experts.

Author

Moustafa Ayoub

Solving real problems Moustafa joined the Kapernikov ranks in June 2021. At that time, he already had two years of machine learning experience under his belt, and before jo ...