INDEPENDENT DE-DUPLICATION IN DATA CLEANING

Authors

  • Ajumobi Udechukwu, Dept. of Computer Science, University of Calgary
  • Christie Ezeife, School of Computer Science, University of Windsor
  • Ken Barker, Dept. of Computer Science, University of Calgary

Keywords

Data cleaning, de-duplication, data quality, field-matching, record linkage

Abstract

Many organizations collect large amounts of data to support their business and decision-making processes. The data originate from a variety of sources that may have inherent data-quality problems. These problems become more pronounced when heterogeneous data sources are integrated (for example, in data warehouses). A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying “equivalent” records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge. A few others propose solutions that are partially domain-independent. This paper identifies two levels of domain-independence in de-duplication: domain-independence at the attribute level and domain-independence at the record level. The paper then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level, and a technique for field weighting by data profiling which, when used with the positional algorithm, achieves domain-independence at the record level. Experiments show that the proposed techniques achieve more accurate de-duplication than the existing algorithms.

Published

2012-03-15

How to Cite

[1]
A. Udechukwu, C. Ezeife, and K. Barker, “INDEPENDENT DE-DUPLICATION IN DATA CLEANING”, J. inf. organ. sci. (Online), vol. 29, no. 2, Mar. 2012.

Section

Articles