81880

Noise-Robust Speech Recognition System based on Multimodal Audio-Visual Approach Using Different Deep Learning Classification Techniques

Article

Last updated: 24 Dec 2024

Subjects

-

Tags

-

Abstract

This paper extends earlier work on a speech recognition system that uses Hidden Markov Model (HMM) classification with the visual modality in addition to the audio modality [1]. Accuracy is improved over traditional HMM-based Automatic Speech Recognition (ASR) by using either an RNN-based or a CNN-based approach. This research makes two contributions. The first is a methodology for choosing the visual features: different visual feature extraction methods, such as Discrete Cosine Transform (DCT), blocked DCT, and Histograms of Oriented Gradients with Local Binary Patterns (HOG+LBP), are compared, and different dimensionality reduction techniques, such as Principal Component Analysis (PCA), auto-encoder, Linear Discriminant Analysis (LDA), and t-distributed Stochastic Neighbor Embedding (t-SNE), are applied to find the most effective feature vector size. The resulting visual features are then combined by early integration with the audio features, which are obtained using Mel Frequency Cepstral Coefficients (MFCCs), and the combined audio-visual feature vector is fed to the classification process. The second contribution is a methodology for developing the classification process using deep learning: different Deep Neural Network (DNN) architectures, such as Bidirectional Long Short-Term Memory (BiLSTM) and Convolutional Neural Network (CNN), are compared with the traditional HMM. The proposed model is evaluated on two multi-speaker AV-ASR datasets, AVletters and GRID, at different signal-to-noise ratio (SNR) levels. The experiments are speaker-independent on the AVletters dataset and speaker-dependent on the GRID dataset.
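The early-integration step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`dct_matrix`, `dct2_lowfreq`, `early_fusion`), the choice to keep the top-left low-frequency DCT block, the nearest-frame alignment between audio and video rates, and the toy data are all assumptions made for illustration. The audio vectors stand in for MFCCs, which are not computed here.

```python
import math

def dct_matrix(size):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    rows = []
    for k in range(size):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / size)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * size))
                     for i in range(size)])
    return rows

def dct2_lowfreq(roi, keep):
    """2D DCT of a grayscale mouth ROI (list of rows).

    Returns the top-left keep x keep coefficients, flattened row by row.
    Keeping this low-frequency block is an assumption of this sketch.
    """
    height, width = len(roi), len(roi[0])
    c_rows, c_cols = dct_matrix(height), dct_matrix(width)
    feats = []
    for u in range(keep):
        for v in range(keep):
            total = 0.0
            for i in range(height):
                for j in range(width):
                    total += c_rows[u][i] * roi[i][j] * c_cols[v][j]
            feats.append(total)
    return feats

def early_fusion(audio_frames, visual_frames):
    """Concatenate each audio frame with the time-aligned visual frame.

    Audio usually has more frames per second than video, so each audio
    frame is paired with the visual frame covering the same relative
    position in the utterance (nearest-frame alignment, an assumption).
    """
    n_audio, n_visual = len(audio_frames), len(visual_frames)
    fused = []
    for t, audio_vec in enumerate(audio_frames):
        v_idx = min(t * n_visual // n_audio, n_visual - 1)
        fused.append(list(audio_vec) + list(visual_frames[v_idx]))
    return fused

# Toy data: two 4x4 video frames and four 2-dimensional audio frames.
mouth_rois = [[[1.0] * 4 for _ in range(4)], [[2.0] * 4 for _ in range(4)]]
visual = [dct2_lowfreq(roi, keep=2) for roi in mouth_rois]
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # stand-in for MFCC vectors
fused = early_fusion(audio, visual)
print(len(fused), len(fused[0]))
```

Each fused vector is the audio vector followed by the visual coefficients, so a classifier sees both modalities in a single input per frame. In practice a dimensionality reduction step such as PCA would typically be applied to the visual features before concatenation.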

DOI

10.21608/ejle.2020.22022.1002

Keywords

AV-ASR, DCT, MFCC, HMM, DNN

Authors

First Name

Eslam

Last Name

ElMaghraby

MiddleName

E

Affiliation

Egypt, El Fayoum, El Mashtal St.

Email

eem00@fayoum.edu.eg

City

Fayoum

Orcid

0000-0003-3914-2818

First Name

Amr

Last Name

Gody

MiddleName

M

Affiliation

Faculty of Engineering, Fayoum University

Email

amg00@fayoum.edu.eg

City

Fayoum

Orcid

0000-0003-2079-9860

First Name

Mohamed

Last Name

Farouk

MiddleName

Hashem

Affiliation

Engineering Math. & Physics Dept., Faculty of Engineering, Cairo University

Email

mhesham@eng.cu.edu.eg

City

Cairo, Egypt

Orcid

0000-0001-8101-6423

Volume

7

Article Issue

1

Related Issue

14114

Issue Date

2020-04-01

Receive Date

2020-02-20

Publish Date

2020-04-01

Page Start

27

Page End

42

Print ISSN

2356-8208

Online ISSN

2356-8216

Link

https://ejle.journals.ekb.eg/article_81880.html

Detail API

https://ejle.journals.ekb.eg/service?article_code=81880

Order

3

Type

Original Article

Type Code

1,039

Publication Type

Journal

Publication Title

The Egyptian Journal of Language Engineering

Publication Link

https://ejle.journals.ekb.eg/

MainTitle

Noise-Robust Speech Recognition System based on Multimodal Audio-Visual Approach Using Different Deep Learning Classification Techniques

Details

Type

Article

Created At

22 Jan 2023