238146

UNCOVERING THE EFFECTS OF DATA VARIATION ON PROTEIN SEQUENCE CLASSIFICATION USING DEEP LEARNING

Article

Last updated: 03 Jan 2025

Subjects

-

Tags

-

Abstract

As the number of known proteins grows, bioinformaticians find it increasingly difficult to analyze and study protein similarity. Protein sequence analysis helps predict protein functions, so the analysis process must be able to categorize proteins correctly based on their sequences. Features can be extracted from protein sequences using a variety of methods. The goal of this study is to investigate how different variations in the data affect the classification performance of a deep learning model that uses 3D data. First, a few research questions were formulated regarding the impact of four criteria on the proposed deep learning classification process: dataset size, intrinsic mode function (IMF) importance, feature size, and preprocessing. Second, comprehensive experiments were conducted to answer the research questions. Six feature extraction methods were used to create 3D features in two sizes (7x7x7 and 9x9x9), which were then fed into a convolutional neural network (CNN). Three datasets that differ in type, size, and class balance were used. Accuracy, precision, recall, and F1-score were used as the standard evaluation metrics. The experimental results support four conclusions. First, the 7x7x7 feature matrix has a positive correlation between its dimensions, which improved the results. Second, using the sum of the first three IMF components had a greater positive impact than using the first IMF component alone. Third, feature normalization did not benefit classification on the small datasets, unlike on the large dataset. Finally, dataset size had a significant impact on training the CNN model, with training accuracy reaching 84.03%.
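The four evaluation metrics named in the abstract can be illustrated with a short sketch. The abstract does not say how the per-class scores were aggregated, so macro averaging is assumed here, and the function name `classification_scores` and the toy labels are hypothetical. This is not the authors' evaluation code.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Assumption: macro averaging (unweighted mean over classes); the
    abstract does not state which averaging scheme was used.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    def safe_div(num, den):
        # Score a class as 0.0 when its denominator is empty.
        return num / den if den else 0.0

    precisions = [safe_div(tp[c], tp[c] + fp[c]) for c in labels]
    recalls = [safe_div(tp[c], tp[c] + fn[c]) for c in labels]
    f1s = [safe_div(2 * p * r, p + r) for p, r in zip(precisions, recalls)]

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with two hypothetical protein classes.
scores = classification_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(scores)
```

On imbalanced datasets, such as some of those the abstract describes, macro averaging weights every class equally, so it can differ noticeably from accuracy.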

DOI

10.21608/ijicis.2022.123177.1168

Keywords

Deep learning, proteins, EMD, IMF, Feature Matrix

Authors

First Name

Farida

Last Name

Mostafa

MiddleName

-

Affiliation

Information Systems Department, Bioinformatics Program, Faculty of Computer and Information Sciences, Ain Shams University.

Email

farida.alaaeldin@cis.asu.edu.eg

City

Nasr City

Orcid

0000-0002-9982-1030

First Name

Yasmine

Last Name

Afify

MiddleName

-

Affiliation

Information Systems, Faculty of Computer and Information Sciences, Ain Shams University

Email

yasmine.afify@cis.asu.edu.eg

City

-

Orcid

0000-0001-6400-8472

First Name

Rasha

Last Name

Ismail

MiddleName

-

Affiliation

Vice Dean for Postgraduate Studies & Research, Faculty of Computer and Information Sciences, Ain Shams University

Email

rashaismail@cis.asu.edu.eg

City

-

Orcid

0000-0003-3581-8112

First Name

Nagwa

Last Name

Badr

MiddleName

-

Affiliation

Department of Information Systems, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, 11566, Egypt

Email

nagwabadr@cis.asu.edu.eg

City

-

Orcid

0000-0002-5382-1385

Volume

22

Article Issue

2

Related Issue

34382

Issue Date

2022-05-01

Receive Date

2022-02-21

Publish Date

2022-05-01

Page Start

112

Page End

125

Print ISSN

1687-109X

Online ISSN

2535-1710

Link

https://ijicis.journals.ekb.eg/article_238146.html

Detail API

https://ijicis.journals.ekb.eg/service?article_code=238146

Order

12

Type

Original Article

Type Code

494

Publication Type

Journal

Publication Title

International Journal of Intelligent Computing and Information Sciences

Publication Link

https://ijicis.journals.ekb.eg/

Details

Type

Article

Created At

22 Jan 2023