WEB-BASED DUPLICATE RECORDS DETECTION WITH ARABIC LANGUAGE ENHANCEMENT

Article

Last updated: 04 Jan 2025

Subjects

-

Tags

Computer Science and Engineering

Abstract

Sharing data between organizations is of growing importance in many data mining projects. Data from heterogeneous sources often has to be linked and aggregated to improve data quality, and the importance of data accuracy and quality has increased with the explosion in data size. The first step toward ensuring data accuracy is to make sure that each real-world object is represented once and only once in a given dataset, a task called Duplicate Record Detection (DRD). Data inaccuracy problems arise from several factors, including spelling, typographical, and pronunciation variation, dialects, special vowel and consonant distinctions, and other linguistic characteristics, especially in non-Latin languages such as Arabic. In this paper, an English/Arabic-enabled web-based framework is designed and implemented. It treats user interaction (adding new rules, enriching the dictionary, and evaluating results) as an important step in improving the system's behavior. The proposed framework supports processing of both single-language and bilingual datasets. It is implemented and verified empirically in several case studies. The comparison results showed that the proposed system achieves substantial improvements over known tools.
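To make the idea of duplicate record detection on mixed English/Arabic data concrete, the following is a minimal sketch and not the framework described in the abstract. It assumes a small, hypothetical set of Arabic normalization rules (alef variants, alef maqsura, ta marbuta, diacritics) and substitutes Python's standard-library `difflib.SequenceMatcher` ratio for the paper's similarity metrics, which are not specified here. The names `normalize`, `similarity`, `find_duplicates`, and the `0.85` threshold are illustrative assumptions.

```python
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical normalization rules (an assumption, not the paper's rule set).
_ARABIC_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
})
_TATWEEL = "\u0640"  # elongation character, carries no meaning


def normalize(text):
    """Return a comparison key: case-folded, diacritics removed, variants unified."""
    text = unicodedata.normalize("NFKC", text).casefold()
    # Drop combining marks (e.g. Arabic harakat) and the tatweel character.
    text = "".join(
        ch for ch in text
        if not unicodedata.combining(ch) and ch != _TATWEEL
    )
    text = text.translate(_ARABIC_VARIANTS)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def similarity(a, b):
    """Similarity in [0, 1] between two values after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, threshold=0.85):
    """Compare every pair of records; return (i, j, score) for likely duplicates."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    names = [
        "\u0623\u062D\u0645\u062F \u0639\u0644\u064A",  # Arabic name, hamza on alef
        "\u0627\u062D\u0645\u062F \u0639\u0644\u0649",  # same name, bare alef, alef maqsura
        "Mohamed Salah",
        "mohamed  salah",
        "Tarek Hassan",
    ]
    for i, j, score in find_duplicates(names):
        print(i, j, round(score, 2))
```

The pairwise loop is quadratic in the number of records, so a real system would typically add an indexing or blocking step (one of the listed keywords) to limit which pairs are compared; that step is omitted here for brevity.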

DOI

10.21608/erjeng.2015.126816

Keywords

Duplicate Record Detection, Data Cleaning, Indexing, Data Integration, Entity Matching, Soundex, Dictionary Building, Similarity Metrics

Authors

First Name

Azza

Last Name

Higazy

MiddleName

Abd Al-Elah

Affiliation

Computer and Control Engineering Department, Faculty of Engineering, Tanta University

Email

-

City

-

Orcid

-

First Name

Amany

Last Name

Sarhan

MiddleName

M.

Affiliation

Computer and Control Engineering Department, Faculty of Engineering, Tanta University, Egypt

Email

amany_sarhan@f-eng.tanta.edu.eg

City

-

Orcid

-

First Name

Tarek

Last Name

El-Tobely

MiddleName

E.

Affiliation

Computer and Control Engineering Department, Faculty of Engineering, Tanta University

Email

-

City

-

Orcid

-

Volume

1

Article Issue

2015

Related Issue

18753

Issue Date

2015-12-01

Receive Date

2020-12-01

Publish Date

2015-12-01

Page Start

165

Page End

176

Print ISSN

2356-9441

Online ISSN

2735-4873

Link

https://erjeng.journals.ekb.eg/article_126816.html

Detail API

https://erjeng.journals.ekb.eg/service?article_code=126816

Order

14

Type

Research articles

Type Code

1605

Publication Type

Journal

Publication Title

Journal of Engineering Research

Publication Link

https://erjeng.journals.ekb.eg/

Details

Type

Article

Created At

23 Jan 2023