Best practices for quality pipelines
At Kapernikov, we frequently work with master data. In this context, “working” means cleansing, analyzing, migrating and integrating it. To do this, we often need to set up transformation and ML pipelines. In other words: ETL.
Over the past years, we have worked with a multitude of technologies to achieve this, from proprietary software to the Python data stack (check out our luigi tutorial). More recently, we decided to look into Apache Spark.
In this article, we would like to share our experience with what we consider to be the basic principles of writing quality ETL code. In a later article, we will try to see how technology can help us and how we can apply these best practices in an Apache Spark pipeline.
Data pipelines often grow to considerable complexity. Intermediate results are stored as tables by some jobs and then re-read by other jobs in a DAG. The different pieces of the pipeline become tightly coupled, and it becomes difficult to isolate parts for testing, reuse, and so on.
We think our high-level building blocks should be composable (high level: we are not talking about “left join” or “filter”, but about things like “get all installations that were in service after a certain date”).
When the same “data” is needed for two or more applications, we could just store it in a table that both applications read. But we prefer to share the logic that produces this data rather than the data itself.
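As a minimal sketch of what we mean (using pandas, with invented table and column names purely for illustration), such a building block can be an ordinary function that every consumer imports, instead of a pre-computed table that every consumer reads:

```python
import pandas as pd

def load_installations() -> pd.DataFrame:
    # In a real pipeline this would read from the source system;
    # here we return a tiny in-memory table for illustration.
    return pd.DataFrame({
        "installation_id": [1, 2, 3],
        "in_service_until": pd.to_datetime(["2018-06-30", "2020-01-15", "2019-11-01"]),
    })

def installations_in_service_after(installations: pd.DataFrame,
                                    date: pd.Timestamp) -> pd.DataFrame:
    """High-level building block: all installations still in service after `date`."""
    return installations[installations["in_service_until"] > date]

# A reporting job and a migration job can both reuse the same logic,
# instead of each reading a shared intermediate table.
recent = installations_in_service_after(load_installations(),
                                        pd.Timestamp("2019-01-01"))
print(recent)
```

Because the logic lives in one place, a change in what “in service” means immediately applies to every application that uses it.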
It is not easy to verify whether a pipeline works correctly. Errors might be subtle and might only come up in certain circumstances. It requires experience, in addition to a set of best practices, to specifically test for this. However, this is no excuse not to cover the basics: in our experience, the vast majority of problems are easy to prevent. Especially under time pressure, we have seen people resort to putting a poorly tested, sometimes even trivially broken, pipeline into production.
We think “the basics” boil down to automatically testing the individual building blocks of the pipeline in isolation, on small, hand-crafted data.
In addition, we need to verify that the whole pipeline produces meaningful results on a representative dataset. This is harder to do and to automate.
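To make the first part concrete, here is a minimal pytest-style sketch of a unit test for the hypothetical building block from the sketch above (the module name `building_blocks` is an assumption; the rows are chosen to hit the boundary case):

```python
import pandas as pd

# Hypothetical import: wherever the building block from the previous sketch lives.
from building_blocks import installations_in_service_after

def test_installations_in_service_after():
    # Hand-crafted input: one installation out of service before the cut-off,
    # one still in service after it, and one ending exactly on the cut-off date.
    installations = pd.DataFrame({
        "installation_id": [1, 2, 3],
        "in_service_until": pd.to_datetime(["2018-12-31", "2019-06-01", "2019-01-01"]),
    })

    result = installations_in_service_after(installations, pd.Timestamp("2019-01-01"))

    # Only the installation still in service strictly after the cut-off remains.
    assert result["installation_id"].tolist() == [2]
```

Such a test runs in milliseconds, so it can run on every change, long before the multi-hour production job does.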
A master data ETL job easily takes up to a couple of hours, so an avoidable error somewhere halfway can cost hours of development cycle time. If you have to retry a 4-hour job 3 times, a full working day has passed. You might be able to do other things in the meantime, but your deadline for this job will approach fast.
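One cheap way to avoid burning those hours on a trivially broken input is to validate your assumptions at the very start of the job, before the expensive transformations run. A minimal sketch (the specific checks and column names are assumptions for illustration, not a prescription):

```python
import pandas as pd

def validate_inputs(installations: pd.DataFrame) -> None:
    """Fail fast, before hours of processing, if the input is obviously broken."""
    required_columns = {"installation_id", "in_service_until"}
    missing = required_columns - set(installations.columns)
    if missing:
        raise ValueError(f"Missing columns in installations table: {missing}")
    if installations.empty:
        raise ValueError("Installations table is empty")
    if installations["installation_id"].duplicated().any():
        raise ValueError("Duplicate installation ids in input")

# Called at the very start of the job, e.g.:
# validate_inputs(load_installations())
```

Failing in the first minute instead of hour three turns a lost afternoon into a quick fix.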
The ability to track the history (using git) of our transformation logic in a detailed way is very important. We need to be able to track multiple changes that happen at the same time (some are simple and urgent, others are more complex and require more testing). Without this level of detail, it is hard to deliver updates quickly in a controlled way.
In the past, we have worked with both graphical ETL tools and code-based tools. We much prefer the latter. However, when stuck with a graphical tool, not all is lost: it is often possible to generate a (usually readable) file from the graphical pipeline that can be stored in an SCM like git. That is useless for branching and merging, but it gives a reliable way to track the pipeline’s evolution.
Remember that time when you had to put a bug fix in production urgently, while you had already half-reworked that same code for a substantial functional adaptation? That’s just one scenario in which you want version management to cover your ass.
In this article, we introduced a few best practices for building quality data processing pipelines. These principles are not tied to a single technology, and at Kapernikov we use various technologies for building data pipelines. We are constantly trying to advance our own state of the art, and we draw inspiration from other resources. Of course, we are very curious about your feedback.
We will follow up on this topic with an article about which technology can help us to apply these principles, and we will conclude with a post in which we will try to apply these principles to a pipeline written in Apache Spark.