Data cleaning is an important step in the integration of large amounts of data: it makes the data usable for an application or an analysis. There are several cleaning procedures, each of which fixes a specific category of errors or data quality problems. Usually, several procedures have to be applied one after the other to achieve the desired data quality. Determining the selection and sequence of these procedures is a lengthy and laborious manual process.
The aim of the present project is to propose data cleaning procedures for new data by drawing on previous data cleaning procedures that were successfully carried out on similarly structured and similarly dirty data.
The challenges here are:
- the classification of comparable data quality problems,
- the development of a similarity measure, "Dirtiness Similarity", by which data sets can be compared with regard to their potential cleaning effort,
- the automatic prediction and assessment of the performance of a cleaning algorithm on a new data set; for this purpose, algorithms are implicitly classified according to their connection to previously cleaned data and its profiles, and
- the solution of a multivariate optimization problem: taking into account both the quality of the results and the efficiency, a combination of cleaning algorithms is to be proposed for a new data set.
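The idea of reusing earlier cleaning work can be illustrated with a minimal sketch. It is not the project's actual "Dirtiness Similarity" measure, which is still to be developed. It assumes that each data set is summarized by a small numeric profile (the indicator names `missing_ratio` and `duplicate_ratio`, the procedure names, and the history entries are all hypothetical), uses an inverse Euclidean distance as a stand-in similarity, and simply returns the cleaning sequence of the most similar previously cleaned data set instead of solving the optimization problem.

```python
import math


def dirtiness_similarity(profile_a, profile_b):
    """Stand-in similarity: maps the Euclidean distance between two
    profile vectors into (0, 1], where 1.0 means identical profiles."""
    keys = sorted(set(profile_a) | set(profile_b))
    dist = math.sqrt(
        sum((profile_a.get(k, 0.0) - profile_b.get(k, 0.0)) ** 2 for k in keys)
    )
    return 1.0 / (1.0 + dist)


def propose_pipeline(new_profile, history):
    """Return the cleaning sequence of the most similar earlier data set."""
    best = max(
        history,
        key=lambda entry: dirtiness_similarity(new_profile, entry["profile"]),
    )
    return best["pipeline"]


# Hypothetical record of earlier, successful cleaning runs.
history = [
    {
        "profile": {"missing_ratio": 0.30, "duplicate_ratio": 0.02},
        "pipeline": ["impute_missing", "normalize_formats"],
    },
    {
        "profile": {"missing_ratio": 0.01, "duplicate_ratio": 0.25},
        "pipeline": ["deduplicate"],
    },
]

new_profile = {"missing_ratio": 0.28, "duplicate_ratio": 0.03}
proposal = propose_pipeline(new_profile, history)
print(proposal)
```

A nearest-neighbour lookup like this ignores efficiency and the predicted result quality; weighing both is exactly what the optimization step in the list above is meant to add.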
Our approach is based on existing techniques of "data profiling" and "effort estimation". These techniques are to be tested with regard to the sensible creation of dataset profiles, in order to find out which dataset profiles can be used to describe and compare the data quality of datasets.
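As an illustration of what a very simple dataset profile might contain, the following sketch computes two quality indicators for a small table. The choice of indicators (share of missing cells, share of duplicate rows), the function name `profile_dataset`, and the sample rows are assumptions made for this example; which indicators actually describe data quality well is the open question stated above.

```python
def profile_dataset(rows):
    """Compute a tiny, hypothetical quality profile.

    rows: list of dicts that share the same keys (one dict per record).
    """
    n = len(rows)
    if n == 0:
        return {"missing_ratio": 0.0, "duplicate_ratio": 0.0}

    columns = list(rows[0].keys())
    cells = n * len(columns)

    # A cell counts as missing if it is None or an empty string.
    missing = sum(1 for r in rows for c in columns if r.get(c) in (None, ""))

    # Rows are duplicates if all of their column values are equal.
    distinct_rows = len({tuple(r.get(c) for c in columns) for r in rows})

    return {
        "missing_ratio": missing / cells if cells else 0.0,
        "duplicate_ratio": (n - distinct_rows) / n,
    }


rows = [
    {"name": "Ada", "city": "London"},
    {"name": "Ada", "city": "London"},
    {"name": "Bob", "city": ""},
    {"name": None, "city": "Berlin"},
]

profile = profile_dataset(rows)
print(profile)
```

Profiles of this kind are plain numeric vectors, which is what makes two datasets comparable in the first place; a realistic profile would likely need many more indicators per column.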