‘Garbage in, garbage out’ has become a common shorthand for the idea that flawed input data leads to flawed output. Practical examples abound. If a dataset mixes temperature readings in Celsius and Fahrenheit without converting them, any analysis based on that data will be flawed. If the input data for a gift recommender system contains errors in the age attribute of customers, it might accidentally suggest children’s toys to adults.
At a time when more and more companies, organizations and governments base decisions on data analytics, it is essential that data sets are clean and correct. That is what Sebastian Schelter and his colleagues are working on. Schelter is assistant professor at the Informatics Institute of the University of Amsterdam, working in the INtelligent Data Engineering Lab (INDElab). Academic work he published in 2018, when he was working at Amazon, now powers some of Amazon’s data quality services. At UvA he is expanding that work.
What are the biggest problems with data sets?
‘Missing data is one big problem. Think of an Excel sheet where you have to fill in values in each cell, but some cells are empty. Maybe data got lost, maybe data just wasn’t collected. That’s a very common problem. The second big problem is that some data are wrong. Let’s say you have data about the age of people and there appears to be somebody who is a thousand years old.
A third major problem with data sets is data integration errors, which arise from combining different data sets. Very often this leads to duplicates. Think of two companies that merge. They will both have address databases, and the same address may be written in slightly different ways: one database uses ‘street’ and the other uses ‘st.’, or a name is simply spelled differently.
Finally, the fourth major problem is called ‘referential integrity’. If you have data sets that reference each other, you need to make sure that the referencing is done correctly. If a company has a data set with billing data and a bank has a data set with the bank account numbers of its customers, you want every bank account number in the billing data set to reference an existing account at that bank; otherwise it references something that does not exist. Often there are problems with references between two data sets.’
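The four problem types described above can be made concrete with a small sketch in plain Python. Everything in it is an assumption made for illustration: the toy records, the field names, the plausible age range of 0 to 130 and the crude `normalize_address` helper are not taken from Schelter’s work.

```python
from collections import Counter

# Toy records; every name and value here is made up for illustration.
customers = [
    {"id": 1, "age": 34, "address": "12 Main Street"},
    {"id": 2, "age": None, "address": "5 Oak Avenue"},
    {"id": 3, "age": 1000, "address": "12 Main St."},
]
bills = [
    {"bill_id": 10, "customer_id": 1},
    {"bill_id": 11, "customer_id": 7},  # no customer with id 7 exists
]


def normalize_address(address):
    """Crude normalisation: lowercase, drop periods, expand 'st' to 'street'."""
    words = address.lower().replace(".", "").split()
    return " ".join("street" if word == "st" else word for word in words)


# 1. Missing data: records without an age.
missing_age = [c["id"] for c in customers if c["age"] is None]

# 2. Wrong data: ages outside an assumed plausible range.
implausible_age = [
    c["id"] for c in customers
    if c["age"] is not None and not 0 <= c["age"] <= 130
]

# 3. Integration errors: addresses that collide after normalisation.
address_counts = Counter(normalize_address(c["address"]) for c in customers)
duplicate_addresses = [addr for addr, n in address_counts.items() if n > 1]

# 4. Referential integrity: bills that point to a non-existent customer.
known_ids = {c["id"] for c in customers}
dangling_bills = [b["bill_id"] for b in bills if b["customer_id"] not in known_ids]

print(missing_age, implausible_age, duplicate_addresses, dangling_bills)
```

Real address matching is considerably harder than this single substitution rule suggests; the helper only shows why ‘street’ and ‘st.’ need to be reconciled before duplicates become visible.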
How does your research tackle these problems?
‘Data scientists spend a lot of their time cleaning up flawed data sets. The numbers vary, but surveys have shown that it’s up to eighty percent of their time. That’s a big waste of time and talent. To counter this, we have developed open source software called Deequ. Instead of having to write a program that validates the data quality, data scientists can just write down what their data should look like. For example, they can prescribe things like “there shouldn’t be missing data in the column with social security numbers” or “the values in the age column shouldn’t be negative”. Then Deequ runs over the data in an efficient way and reports whether each test is passed or not. Often Deequ also shows the particular data records that violated the test.’
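The declarative idea in this answer, stating what the data should look like instead of writing the validation program, can be sketched in a few lines of plain Python. This is not Deequ’s API: the helpers `is_complete`, `is_non_negative` and `verify`, and the sample rows, are hypothetical names invented for this illustration.

```python
def is_complete(column):
    """Constraint: no record may have a missing value in `column`."""
    return (f"{column} is complete", lambda row: row.get(column) is not None)


def is_non_negative(column):
    """Constraint: values that are present in `column` must be >= 0."""
    return (
        f"{column} is non-negative",
        lambda row: row.get(column) is None or row[column] >= 0,
    )


def verify(rows, constraints):
    """Return a report mapping each constraint name to the violating row indices."""
    return {
        name: [i for i, row in enumerate(rows) if not predicate(row)]
        for name, predicate in constraints
    }


# Made-up sample data for the two rules mentioned in the interview.
rows = [
    {"ssn": "123-45-6789", "age": 41},
    {"ssn": None, "age": 29},
    {"ssn": "987-65-4321", "age": -3},
]

report = verify(rows, [is_complete("ssn"), is_non_negative("age")])
for name, violations in report.items():
    status = "passed" if not violations else f"failed on rows {violations}"
    print(f"{name}: {status}")
```

The user only lists the constraints; the generic `verify` function does the checking and also returns the offending records, which mirrors the two outputs described in the answer.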
How is Deequ used in practice?
‘The original scientific paper was written when I was working at Amazon. Since then, the open source implementation of this work has become pretty popular for all kinds of applications in all kinds of domains. There is a Python version which has more than four million downloads per month. After I left Amazon, the company built two cloud services based on Deequ, one of them called AWS Glue Data Quality. Amazon’s cloud is the most used cloud in the world, so many companies that use it have access to our way of data cleaning.’
What is the current research you are doing to clean up data sets?
‘At the moment we are developing a way to measure the data quality of streaming data in our ICAI lab ‘AI for Retail’, in cooperation with bol.com. Deequ was developed for data at rest, but many use cases have a continuous stream of data. The data might be too big to store, there might be privacy reasons for not storing them, or it might simply be too expensive to store the data. So we built StreamDQ, which can run quality checks on streaming data. A big challenge is that you can’t spend much time on processing the data, otherwise everything will be slowed down too much. So you can only do certain tests, and sometimes you have to use approximations. We have a working prototype, and we are now evaluating it.’
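To illustrate the constraints mentioned in this answer (data that cannot be stored, little time per record, approximate results), here is a minimal sketch of a one-pass quality monitor in plain Python. It does not reflect how StreamDQ is implemented: the class `StreamingColumnMonitor`, its parameters and the choice of reservoir sampling as the approximation technique are assumptions made for this example.

```python
import random


class StreamingColumnMonitor:
    """Tracks quality metrics for one column using constant memory."""

    def __init__(self, sample_size=100, seed=0):
        self.seen = 0          # total number of values observed
        self.missing = 0       # number of missing (None) values
        self.sample = []       # bounded random sample of non-missing values
        self.sample_size = sample_size
        self._rng = random.Random(seed)

    def update(self, value):
        """Process one value from the stream; the value itself is not stored
        unless it happens to land in the bounded sample."""
        self.seen += 1
        if value is None:
            self.missing += 1
            return
        present = self.seen - self.missing
        if len(self.sample) < self.sample_size:
            self.sample.append(value)
        else:
            # Reservoir sampling: each non-missing value ends up in the
            # sample with equal probability.
            j = self._rng.randrange(present)
            if j < self.sample_size:
                self.sample[j] = value

    def completeness(self):
        """Exact fraction of non-missing values seen so far."""
        return 1.0 if self.seen == 0 else 1 - self.missing / self.seen

    def approximate_median(self):
        """Median of the sample, an estimate of the stream's median."""
        if not self.sample:
            return None
        ordered = sorted(self.sample)
        return ordered[len(ordered) // 2]


monitor = StreamingColumnMonitor(sample_size=50)
for value in [21, None, 35, 48, None, 62, 19, 27]:
    monitor.update(value)

print(monitor.completeness())
print(monitor.approximate_median())
```

Completeness can be computed exactly from two counters, whereas a median over an unbounded stream cannot be computed exactly without storing the data, so the sketch falls back on an estimate from a fixed-size sample.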