Writing a high-quality data pipeline for master data with Apache Spark – Part 1

Published on: October 8, 2019

Best practices for quality pipelines

At Kapernikov, we frequently work with master data. In this context, “working” means cleansing, analyzing, migrating and integrating. In order to do this, we frequently need to set up transformation and ML pipelines. In other words: ETL.

Over the past few years, we have worked with a multitude of technologies to achieve this, from proprietary software to the Python data stack (check out our Luigi tutorial). More recently, we decided to look into Apache Spark.

In this article, we would like to share our experience with what we consider to be the basic principles of writing quality ETL code. In a later article, we will try to see how technology can help us and how we can apply these best practices in an Apache Spark pipeline.

1. Building blocks should be composable

Data pipelines often grow considerably complex. Intermediary results are stored as tables by some jobs and then re-read by other jobs in a DAG. The different pieces of the pipeline become tightly coupled, and it becomes difficult to isolate parts for testing or reuse, for several reasons:

The definition of a job already specifies which other jobs it depends on.
The different jobs are coupled via the name of the datasets they generate and write (e.g. read from table “assets” and write to table “report”).
Storing and loading of data is mixed with transformation logic.

This has several drawbacks:

Reusing parts of a pipeline becomes very difficult.
It makes the whole pipeline very difficult to test. Maintaining a test system means maintaining a full database schema, which tends to lag behind and become unrepresentative of the production system. Testing a job means making sure all input tables are present and validating the output afterwards.
The requirements of a job are not expressed and maybe not even documented (the job needs a table named “assets”, but this table must also have certain columns and satisfy some data quality constraints).
It is difficult to parameterize a pipeline. Imagine the following scenarios:
I usually accept that the data I use for testing is 2 days old, but this time, I really need recent data.
I want to run a complete pipeline twice, but the second time, I want to swap out some data sources.

We think our high-level building blocks should be composable. By high-level, we do not mean operations like “left join” or “filter”, but things like “get all installations that were in service after a certain date”.
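As a minimal sketch of what such a building block could look like, here is a version in plain Python rather than Spark. The names `Installation` and `installations_in_service_after` are hypothetical and only serve as illustration. The point is that the transformation receives its input as an argument and returns its result, without knowing where the data is stored:

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Installation:
    # Hypothetical record type, used only for illustration.
    installation_id: str
    in_service_from: date
    in_service_until: Optional[date]  # None means "still in service"


def installations_in_service_after(
    installations: Iterable[Installation], cutoff: date
) -> List[Installation]:
    """Keep the installations that were still in service after `cutoff`.

    The function does not read or write any table: loading and storing
    are left to the caller.
    """
    return [
        inst
        for inst in installations
        if inst.in_service_until is None or inst.in_service_until > cutoff
    ]


# The caller decides where the data comes from, so a test can pass a
# small in-memory list and a rerun can swap in another source.
sample = [
    Installation("A", date(2010, 1, 1), date(2015, 6, 30)),
    Installation("B", date(2012, 3, 1), None),
    Installation("C", date(2018, 5, 1), date(2019, 1, 31)),
]
recent = installations_in_service_after(sample, date(2016, 1, 1))
print([inst.installation_id for inst in recent])
```

Because the data source is just an argument, the scenarios above (fresher data for one run, or a second run with some sources swapped out) become a matter of calling the same function with different inputs.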

2. Reuse of logic instead of data

When the same “data” is needed for two or more applications, we could just store it in a table that is used by both applications. But we prefer to actually share the logic to produce this data rather than the data itself.

Why?

The requirements for one application might evolve differently from the requirements for the other one. This is much easier to manage in code.
This way, the code expresses the complete transformation, which has great advantages for data lineage / auditability.
Again, it becomes much easier to parameterize for those exceptional cases where we need to do something different (see the previous section).

Does this mean that we need to regenerate the same dataset over and over again? No, but our “saved intermediary results” will be nothing more and nothing less than a cache to speed up computation, and their existence should be an implementation detail.
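One way to picture this is a small helper that wraps the shared logic and only uses a stored result when one is available. The sketch below uses plain Python and a JSON file; the helper `cached`, the function `active_assets` and the file name are all hypothetical, and a real pipeline would also need some way to decide when a cached result is stale.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

Row = Dict[str, object]


def cached(cache_file: Path, compute: Callable[[], List[Row]],
           refresh: bool = False) -> List[Row]:
    """Return the dataset produced by `compute`, reusing `cache_file` if present.

    Callers only depend on the logic in `compute`; whether the result
    came from the cache is an implementation detail.
    """
    if cache_file.exists() and not refresh:
        return json.loads(cache_file.read_text())
    rows = compute()
    cache_file.write_text(json.dumps(rows))
    return rows


calls: List[int] = []  # records how often the logic actually ran


def active_assets() -> List[Row]:
    # Hypothetical shared logic that two applications both need.
    calls.append(1)
    assets = [{"id": 1, "status": "active"}, {"id": 2, "status": "retired"}]
    return [a for a in assets if a["status"] == "active"]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "active_assets.json"
    first = cached(cache_file, active_assets)                # computed, then stored
    second = cached(cache_file, active_assets)               # served from the cache
    fresh = cached(cache_file, active_assets, refresh=True)  # forced recomputation
```

Deleting the cache file changes how long the pipeline takes, but not what it produces.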

3. Verifiability

It is not easy to verify whether a pipeline works correctly. Errors might be subtle and might only come up in certain circumstances. Testing specifically for them requires experience, in addition to a set of best practices. However, this is no excuse not to cover the basics: in our experience, the vast majority of problems are easy to prevent. Especially under time pressure, we have seen people put a poorly tested, sometimes even trivially broken, pipeline into production.

We think “the basics” boil down to:

Verifying that all jobs are executable.
Verifying that external input datasets are compliant (for example, no duplicate keys).
Verifying that a pipeline does not violate the requirements (or “contract”) of any of the jobs.

These basics are actually easy to automate, and automating them pays off. Since we are dealing with master data here, the dataset sizes are not huge and it is often feasible to validate all inputs, every run.
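To give an idea of how little code such a check needs, the following sketch validates a dataset against a very simple contract: a set of required columns and a unique key. It is plain Python on a list of dictionaries; the function `check_dataset` and the column names are hypothetical, and a real contract would likely cover more (types, value ranges, referential integrity).

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence

Row = Dict[str, object]


def check_dataset(rows: Sequence[Row], required_columns: Iterable[str],
                  key: str) -> List[str]:
    """Return a list of readable problems; an empty list means compliant."""
    problems: List[str] = []
    required = set(required_columns)

    # Every row must provide all required columns.
    for index, row in enumerate(rows):
        missing = required - set(row)
        if missing:
            problems.append(f"row {index}: missing columns {sorted(missing)}")

    # The key column must identify each row uniquely.
    key_counts = Counter(row[key] for row in rows if key in row)
    for value, count in key_counts.items():
        if count > 1:
            problems.append(f"duplicate key {key}={value!r} ({count} rows)")
    return problems


assets = [
    {"asset_id": "A1", "type": "pump"},
    {"asset_id": "A2", "type": "valve"},
    {"asset_id": "A1", "type": "pump"},  # duplicate key
    {"asset_id": "A3"},                  # missing "type"
]
problems = check_dataset(assets, required_columns=["asset_id", "type"],
                         key="asset_id")
for problem in problems:
    print(problem)
```

Running a check like this on every input before the first job starts turns a failure halfway through the pipeline into an error message at the very beginning.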

In addition, we need to verify that the whole pipeline produces meaningful results on a representative dataset. This is harder to do and to automate.

A master data ETL job easily takes up to a couple of hours, so an avoidable error somewhere halfway can cost hours of development cycle time. If you have to retry a 4-hour job 3 times, a full working day has passed. You might be able to do other things in the meantime, but your deadline for this job will approach fast.
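A cheap way to catch the trivially broken cases before committing to a long run is to execute every step on a tiny sample first. The sketch below assumes, purely for illustration, that a pipeline is a list of functions that each take and return a list of rows; `smoke_test`, `drop_retired` and `add_label` are hypothetical names.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


def smoke_test(steps: Sequence[Step], sample: List[Row]) -> List[str]:
    """Run every step on a small sample and report the first step that crashes."""
    failures: List[str] = []
    rows = sample
    for step in steps:
        try:
            rows = step(rows)
        except Exception as error:  # broad on purpose: any crash is a finding
            failures.append(f"{step.__name__}: {error!r}")
            break  # later steps depend on this step's output
    return failures


def drop_retired(rows: List[Row]) -> List[Row]:
    return [r for r in rows if r["status"] != "retired"]


def add_label(rows: List[Row]) -> List[Row]:
    # Deliberate bug for the example: the sample has no "name" column.
    return [{**r, "label": str(r["name"]).upper()} for r in rows]


sample = [{"id": 1, "status": "active"}]
failures = smoke_test([drop_retired, add_label], sample)
print(failures)
```

This only shows that the steps are executable on the sample, not that their results are meaningful, but it takes seconds instead of hours.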

4. Version management / SCM

The ability to track the history of our transformation logic in detail (using git) is very important. We need to be able to track multiple changes that are made at the same time (some changes are simple and urgent, while others are more complex and require more testing). Without this level of detail, it will be hard to deliver updates quickly in a controlled way.

In the past, we have worked with both graphical ETL tools and code-based tools. We much prefer the latter. However, when stuck with a graphical tool, not all is lost: it is often possible to generate a (usually readable) file from the graphical pipeline that can be stored in an SCM like git. This is useless for branching and merging, but it gives a reliable way to track the evolution of the pipeline.

Remember that time when you had to put a bug fix in production urgently when you already half-reworked that same code for a substantial functional adaptation? That’s just one scenario in which you want version management to cover your ass.

Wrapping up

In this article, we introduced a few best practices for making quality data processing pipelines. These principles are not really tied to a single technology and at Kapernikov, we use various technologies for building data pipelines. We are constantly trying to advance our state of the art and we get inspired by other resources. Of course, we are very curious about your feedback.

We will follow up on this topic with an article about which technology can help us to apply these principles, and we will conclude with a post in which we will try to apply these principles to a pipeline written in Apache Spark.

Read part 2

Read part 3

Author

Frank Dekervel

Frank is one of the founders of Kapernikov. Together with his business partner, Rein Lemmens, Frank started a web services agency in 2004. But with the addition of a third partner, ...