Subjects
-
Tags
Computer Science and Engineering
Abstract
Sharing data between organizations is of growing importance in many data mining projects. Data from various heterogeneous sources often has to be linked and aggregated in order to improve data quality. The importance of data accuracy and quality has increased with the explosion in data size. The first step in ensuring data accuracy is to make sure that each real-world object is represented once and only once in a given dataset, a task called Duplicate Record Detection (DRD). Data inaccuracy problems arise from several factors, including spelling, typographical, and pronunciation variations, dialects, special vowel and consonant distinctions, and other linguistic characteristics, especially in non-Latin languages such as Arabic. In this paper, an English/Arabic-enabled web-based framework is designed and implemented. It treats user interaction (adding new rules, enriching the dictionary, and evaluating results) as an important step in improving the system's behavior. The proposed framework supports processing of both single-language and bilingual datasets. It is implemented and verified empirically in several case studies. The comparison results showed that the proposed system offers substantial improvements over known tools.
DOI
10.21608/erjeng.2015.126816
Keywords
Duplicate Record Detection, Data Cleaning, Indexing, Data Integration, Entity Matching, Soundex, Dictionary Building, Similarity Metrics
Authors
Affiliation
Computer and Control Engineering Department, Faculty of Engineering, Tanta University
Email
-
City
-
Orcid
-
Affiliation
Computer and Control Engineering Department, Faculty of Engineering, Tanta University, Egypt
Email
amany_sarhan@f-eng.tanta.edu.eg
City
-
Orcid
-
Affiliation
Computer and Control Engineering Department, Faculty of Engineering, Tanta University
Email
-
City
-
Orcid
-
Link
https://erjeng.journals.ekb.eg/article_126816.html
Detail API
https://erjeng.journals.ekb.eg/service?article_code=126816
Publication Title
Journal of Engineering Research
Publication Link
https://erjeng.journals.ekb.eg/
MainTitle
WEB-BASED DUPLICATE RECORDS DETECTION WITH ARABIC LANGUAGE ENHANCEMENT