Payment Data Enrichment: From Manual Cleaning to Neural Networks

What is data enrichment in banking? Cleaning payment transaction data started out as a painstaking manual task. Seven years ago, we were still learning to understand the data and searching the web for the missing information. We started with a few thousand enriched transactions a month, but gradually we algorithmised the manual work and built our data engine. Today, we process thousands of times more transactions with a high degree of automation.

What is payment data for anyway?

Payment transaction data is ugly, unstructured data that has virtually no use in its raw form. There is no uniform standard for how to record it, and on a statement you may see data like:

"KAUFLAND THANKS YOU" or "ALBERT 0598".

Our aim is to transform it into something better. That means understanding which specific business the transaction took place at and enriching the record with the exact location, logo and type of purchase.

On the left is an example of raw payment card data; on the right is the cleaned and enriched data we produce from it.

In the beginning, it was manual work

When we started in 2013, we had to clean the data manually. We hired mothers on maternity leave to help us with this. Sometimes it resembled detective work. It's not enough just to identify the keyword "Tesco". You need to understand whether it's a specific Tesco store, a purchase at Tescoma or even a withdrawal from a Tesco ATM. Each entry needs to be properly identified.
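To illustrate why a bare keyword match is not enough, here is a minimal sketch of rule-based disambiguation. The patterns, labels and the `classify` function are hypothetical and only show the idea of checking the most specific rule first; they are not our production rule set.

```python
import re

# Hypothetical rules, ordered from most to least specific.
# Word boundaries (\b) keep "TESCO" from matching inside "TESCOMA".
RULES = [
    (re.compile(r"\bTESCOMA\b"), "tescoma"),
    (re.compile(r"\bTESCO\b.*\bATM\b|\bATM\b.*\bTESCO\b"), "tesco_atm"),
    (re.compile(r"\bTESCO\b"), "tesco_store"),
]


def classify(descriptor: str) -> str:
    """Return the label of the first rule that matches, or "unknown"."""
    text = descriptor.upper()
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return "unknown"
```

With these assumed rules, "TESCOMA PRAHA" and "ATM TESCO BRNO" end up in different categories than "TESCO STORES 0412", even though all three contain the same five letters.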

Of course, this was not a scalable approach. As the business grew, we started to algorithmise and automate the work. What we used to do intuitively, we rewrote into individual technical steps.

But along with the automation of data cleansing, the error rate started to increase. We needed to introduce statistical algorithms and neural networks that could detect abnormalities and algorithm errors. It was necessary to find out why the error occurred and how to prevent it. This is long-term, systematic and consistent work.

For example, it may happen that a retail chain closes a branch in Liberec and moves the payment terminal to a store in Pilsen. But nobody updates the terminal and transactions are still recorded as Shop XY, Liberec. We need to have methods to recognize this. Statistical algorithms analyze the sequence of purchases and alert us to abnormalities.
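One way such a check could work is sketched below. It assumes that a card used at several terminals within a short time was physically in one place, so each terminal collects "votes" for the cities recorded on neighbouring purchases. The `Purchase` record, the `suspect_relocations` function and all thresholds are illustrative assumptions, not a description of our actual algorithms.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Purchase:
    card_id: str
    terminal_id: str
    city: str    # city currently recorded for the terminal
    minute: int  # time of purchase, in minutes from an arbitrary start


def suspect_relocations(purchases, max_gap=30, min_votes=3, min_share=0.8):
    """Return {terminal_id: likely_city} for terminals whose recorded city
    disagrees with terminals used by the same cards shortly before or after."""
    recorded_city = {p.terminal_id: p.city for p in purchases}

    by_card = defaultdict(list)
    for p in purchases:
        by_card[p.card_id].append(p)

    votes = defaultdict(Counter)
    for seq in by_card.values():
        seq.sort(key=lambda p: p.minute)
        for i, a in enumerate(seq):
            for b in seq[i + 1:]:
                if b.minute - a.minute > max_gap:
                    break  # sorted by time, so later purchases are further away
                if a.terminal_id != b.terminal_id:
                    votes[a.terminal_id][b.city] += 1
                    votes[b.terminal_id][a.city] += 1

    flagged = {}
    for terminal, counter in votes.items():
        city, count = counter.most_common(1)[0]
        total = sum(counter.values())
        if (total >= min_votes and city != recorded_city[terminal]
                and count / total >= min_share):
            flagged[terminal] = city
    return flagged
```

A terminal recorded as "Liberec" whose neighbouring purchases almost all happen in Pilsen would be flagged with Pilsen as its likely location. The sketch only raises an alert; confirming the move and correcting the record would still be a separate step.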

Cleaning the data is a lot of small steps

Terminal data is often incomplete and sometimes downright misleading. A typical case is McDonald's restaurants, which operate as franchises. The transaction record shows not the brand but the legal entity that operates the franchise. We then have to laboriously track down who is really behind the business and identify them correctly.
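Once an operator has been tracked down, the result can be stored so that the same legal entity is recognised automatically next time. The sketch below assumes a simple lookup table keyed by a normalised company name; the operator names are made up, and `normalise` and `resolve_brand` are hypothetical helpers, not part of our engine.

```python
import re

# Hypothetical registry: normalised legal-entity name -> brand.
# The operator names are invented for illustration.
OPERATOR_TO_BRAND = {
    "EXAMPLE GASTRO": "McDonald's",
    "SAMPLE FOODS": "McDonald's",
}

# Trailing Czech legal forms such as "s.r.o." or "a.s.", with or without dots.
LEGAL_SUFFIX = re.compile(r"[\s,]+(?:S\.?\s?R\.?\s?O\.?|A\.?\s?S\.?)$")


def normalise(name: str) -> str:
    """Upper-case, collapse whitespace and drop a trailing legal form."""
    text = re.sub(r"\s+", " ", name.strip().upper())
    return LEGAL_SUFFIX.sub("", text)


def resolve_brand(descriptor: str, manual_queue: list):
    """Return the known brand, or None after queueing the record for a human."""
    brand = OPERATOR_TO_BRAND.get(normalise(descriptor))
    if brand is None:
        manual_queue.append(descriptor)
    return brand
```

Records that the table cannot resolve go to a queue for manual assessment, which is where the detective work described above would still happen.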

We still need a human in the process, but the algorithms now handle most cases on their own, and those cases never reach manual assessment.

From thousands to hundreds of millions

For the first few years, the number of automatically identified transactions grew only gradually. In 2018, we were only able to enrich about 100,000 transactions per month.

But as volumes increased, we had to upgrade our data engine. You can see the staircase pattern on the chart, where new algorithms brought jumps of up to 30%. Today, we process over half a billion transactions per month, which is about three times the volume of data in the entire Czech Republic.

There's still room for growth

The infrastructure of payment terminals is constantly changing. The lifetime of a terminal is 3 to 4 years. This means that roughly 25% to 33% of terminals are replaced every year, and we have to correctly identify and reassign them.
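The replacement share follows directly from the lifetime. Assuming a steady state in which terminal ages are spread evenly, about one terminal in every "lifetime" years is replaced each year. The function name below is our own choice for this back-of-the-envelope sketch.

```python
def annual_replacement_share(lifetime_years: float) -> float:
    """Share of terminals replaced per year, assuming evenly spread ages."""
    if lifetime_years <= 0:
        raise ValueError("lifetime must be positive")
    return 1.0 / lifetime_years


for years in (3, 4):
    share = annual_replacement_share(years)
    print(f"{years}-year lifetime: {share:.0%} replaced per year")
```

A 4-year lifetime gives 25% per year and a 3-year lifetime gives about 33%, so a quarter of all terminals is the lower end of the yearly reassignment workload.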

We are improving our engine and inventing new ways to make data cleaning more accurate and even more automated. We've only just started adding really advanced algorithms and we see incredible potential.

At the same time, we're opening up foreign markets where we have to start practically from scratch. We keep discovering new problems there, but ultimately they will push us further.

Our huge advantage is that we not only enrich the data but also work with the outputs. Within the company, we have products that let customers run precisely targeted marketing campaigns based on information about consumer payment behavior.

This allows us to constantly see how valuable the data is and what can be learned from it. The connection is unique.