INDEPENDENT DE-DUPLICATION IN DATA CLEANING

Ajumobi Udechukwu; Christie Ezeife; Ken Barker

Authors

Ajumobi Udechukwu Dept. of Computer Science, University of Calgary
Christie Ezeife School of Computer Science, University of Windsor
Ken Barker Dept. of Computer Science, University of Calgary

Keywords:

Data cleaning, de-duplication, data quality, field-matching, record linkage

Abstract

Many organizations collect large amounts of data to support their business and decision-making processes. The data originate from a variety of sources that may have inherent data-quality problems. These problems become more pronounced when heterogeneous data sources are integrated (for example, in data warehouses). A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying “equivalent” records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge. A few others propose solutions that are partially domain-independent. This paper identifies two levels of domain-independence in de-duplication: domain-independence at the attribute level and domain-independence at the record level. The paper then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level, and a technique for field weighting by data profiling, which, when used with the positional algorithm, achieves domain-independence at the record level. Experiments show that the proposed techniques achieve more accurate de-duplication than the existing algorithms.