Data Quality Revisited

Nobody will ever object to striving for good content. However, when it comes to data, it’s a call too often overlooked. Data scientists may derive new angles on existing panoramas, but are they aware whether their efforts are built on mud or on solid ground?

There’s reason to doubt it. The overwhelming attention for ‘Big Data’ has put a fine art in the shade. Fine art? No, I’m not referring to Golden Age painters or fraudulent use of ingredients. It’s about the careful effort that goes into bringing forward an accurate set of data before making any use of it. That is to say: make sure that the data you rely on for your subsequent analysis or reporting represent what you take them for.

Basically speaking, there are two causes of poor data quality. The oldest is about missing fields, bad data entry or ‘strange’ values, i.e. values not meeting the required format. For example, is “Bad-Aibling” different from “Bad Aibling”? In the digital era this kind of non-quality should rapidly fade out. Automated rule-based applications and systematic monitoring can guarantee insight into the accuracy and completeness of data from a single source. The real challenge nowadays is in merging data from multiple sources, each with proprietary characteristics, into one unified set. The once magic term ‘unique match key’ (typically an ID number) is progressively losing validity, and privacy regulation is not the only explanation for it. Instead, the ‘match key’ is evolving into a multidimensional space. A person once baptized “Wilhelm” could today be known on Facebook as “willy”, and while he was born in Bad Aibling, he may actually be a student in Heidelberg sharing a room at the address of Hotel Krokodil.
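To make the idea of a multidimensional match key a bit more tangible, here is a minimal sketch in Python. It is an illustration only, not a description of any production matching engine: the field names, the weights and the helper functions (normalize, similarity, match_score) are all hypothetical, and the similarity measure is simply the one from Python's standard difflib module.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase, strip accents, and treat hyphens and punctuation as spaces."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized values; 0.0 if either is missing."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        return 0.0  # a missing field is no evidence of a match
    return SequenceMatcher(None, na, nb).ratio()


def match_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of per-field similarities (weights are an assumption)."""
    total = sum(weights.values())
    weighted = sum(
        w * similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, w in weights.items()
    )
    return weighted / total


# Hypothetical records describing the same person in two sources.
crm = {"first_name": "Wilhelm", "birthplace": "Bad-Aibling", "city": "Heidelberg"}
social = {"first_name": "willy", "birthplace": "Bad Aibling", "city": "Heidelberg"}

# Illustrative weights only; in practice they follow from your objectives.
weights = {"first_name": 1.0, "birthplace": 1.0, "city": 1.0}

score = match_score(crm, social, weights)
print(f"match score: {score:.2f}")
```

After normalization “Bad-Aibling” and “Bad Aibling” are identical, while “Wilhelm” and “willy” are only partly alike. No single field decides; the combined score does, and where to put the threshold for calling two records a duplicate remains a business decision.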

How to deal with duplicates when combining different sets of data? That depends on your objectives. If your aim is to connect with students, to stick to the example, data on landlines are virtually useless. And when it comes to insurance, few people will expect an offer via Facebook; just remember Kroodle. At the end of the day it comes back to the essential point that there is no business strategy without an integrated data vision. Only by determining what data are needed for success can accurate datasets be composed and later monetized. That’s the basic condition for creating value, as the next example from a very data-driven industry may illustrate.

We started a new credit card operation, with a launch date as sacred as Weihnachten in Germany. However, the system that should support it could not meet the challenge in full. A mere beta version facilitated the issuing and delivery of new plastic as well as fast growth in a competitive market. To that extent the business goal was met. But what should also be key in the origination process, recording all application data for later analysis of e.g. fraud and credit risk, was not entirely in place. What to do? Being strong believers in the importance of sound data, we took a drastic decision. All 50K original forms from the early days were reprocessed and cross-checked up to three times until a stable set of data resulted. The cost ran into hundreds of K€. Did that pay off?

Well, the answer is yes. When comparing the NPV of the scorecards built on the ‘new’ data with some finger exercises on the original dataset, the difference turned out to be over €15M to zero. Rien ne va plus …


By: Herman Huizinga, Principal Consultant Business Intelligence