Data Science icon

Data science

Escape from the land of confusion: quality data for the energy market

When you’re dealing with the energy market, you’re dealing with big data.
In Belgium, customers can easily switch between energy suppliers—and they often do. The companies then have to speak to clearing house Atrias for updated info, drawing from a pure jungle of data.
Enter Kapernikov, to bring order into chaos.

Escape from the land of confusion: quality data for the energy market animatie
Escape from the land of confusion: quality data for the energy market animatie

The Belgian energy market was liberalised in 2007. Since then, gas and electricity suppliers have sprung up like mushrooms and they fiercely compete for every customer. People switch back and forth easily, especially now that sustainability has become an additional motivator. Meanwhile, distribution grid operators also see the climate goals ahead and they are working hard to build a greener and smarter power grid. To tackle the data challenges, they jointly founded the clearing house Atrias.

There are many clients to cater for,
and they all need tailored information.

Atrias runs a Central Market System (CMS) to simplify data exchange between all the parties: the grid operators as well as the energy transport companies and the actual suppliers. That’s a lot of clients to cater to, and they all need their own tailored info. On top of the CMS, Atrias manages an Operational Data Store (ODS), which turns out daily aggregates. For the much-needed data quality control of this ODS, Atrias called in Kapernikov.

Challenge

To fully understand the challenge, try not to be dazzled by all the factors involved.

  • The complexity of the CMS content. Consumers send in readings of their energy metres once every year, which are then verified by grid operators every two years. Customers can switch suppliers at any moment, meaning their data needs to be corrected. This is why it is so hard to have exact figures at all times on who consumed or produced how much, at what time. In fact, it requires looking at data going back as far as five years.
  • We’re dealing with multiple time dimensions. It’s a bit like ‘Back to the Future’: every error or change of contract requires a retroactive update. This is what we call versioning: every version of the information needs to be kept in the system. It requires very careful abstract conceptualisation if you don’t want to get lost in data (or time).

Working with multiple time dimensions
is a bit like ‘Back to the Future’.

  • Over the course of five years, the data stored in the back-end as well as the reporting system easily reaches up to 40-50 TB. It is rare to come across a project with such gigantic volumes of data. And we haven’t seen the end of it. Instead of one metre reading per year, the new smart meters will provide information every 15 minutes of every hour of every day…
  • The energy consumption of the entire market passes through the central application, but that doesn’t mean every user should be able to see all of it. Private customer data needs to be protected, and information cannot fall into competitors’ hands. Proper data segregation is crucial.
  • Technically speaking, a clearing house application can be tough to work with, especially for business users who just need to turn out reports. The data needs to be self-explanatory, easy to locate and available in their own terms. Modelling the data from the huge back end to such a tight fit—that calls for quite a tailor.
  • Most of the data in the CMS is not created by Atrias but by the market parties. Considering that data quality is a challenge for most companies, get ready for inconsistencies when you pile everything up. The big challenge is knowing when your data is bad. Sometimes it is easier to see it at the end of the chain, in your reporting application, than in your source system. But it is in the source system that the bug needs to be fixed…

Role

Kapernikov built a framework that guarantees the quality of the ODS by comparing its actual output with the required one. If that’s a match, the application can be trusted to deliver quality data. Usually, such a comparison would involve lots of manual checks, which is an awful lot of work and only yields superficial results. Remember, we are dealing with billions of records. Automatisation is key. But how? Sit tight, because this is where it gets technical.

Our developers first screened the data from both the CMS and ODS databases, according to Atrias’ logical data model (LDM). The LDM describes how the business is run: the required data (for example, every contract has to be linked to a person or company), the relations between data, etc. The logical model applied to both databases, but there was a big difference between the technical implementation of the two. Which data went into which table in which format? That was just one of many questions.

It’s a match!

To solve the riddle, Kapernikov brought its own new data store into play. Basically, it converts data from both databases to a canonical data model, so they speak the same language. Rather than writing endless amounts of individual SQL queries implementing the same logic, it defines functionalities once and sets up dynamic code generation from there. Easy does it. Any correction only has to be done in one place, rather than in hundreds or thousands. Besides, this approach runs a much lower risk of flaws in the code.

Developers took a shortcut by organising the work in DAGs (Directed Acyclic Graphs – tasks that run in a specific order, building on previous tasks). They applied modern data pipeline practices and used Luigi as the dataflow orchestration tool. To make the flow smoother, the data warehouse itself takes care of all data transformations, expressed in SQL queries. Long story short: it was all about parallelisation of tasks, huge data throughput and logical transparency.

Results

Our approach allowed us to verify billions of records and spot even the tiniest error or inconsistency. From there, all that was required was analysing the root causes and weeding them out with Atrias’ team. The quality of the data in the ODS improved, and even long-lived glitches in the CMS could now be fixed. Both Atrias and the grid operators can rest assured about their reporting system. In the end, it all looks simple.

Applied Solutions & Technologies

Python

Dynamically generated SQL

Luigi

SQL Server