59164

Speech Recognition Using Historian Multimodal Approach

Article

Last updated: 24 Dec 2024

Subjects

-

Tags

-

Abstract

This paper proposes an Audio-Visual Speech Recognition (AVSR) model that uses both audio and visual speech information
to improve recognition accuracy in clean and noisy environments. Mel-frequency cepstral coefficients (MFCC) and the Discrete
Cosine Transform (DCT) are used to extract features from the audio and visual speech signals, respectively.
Classification is performed on the combined feature vector using a Deep Neural Network (DNN)
architecture, Bidirectional Long Short-Term Memory (BiLSTM), in contrast to the traditional Hidden Markov Models (HMMs).
The effectiveness of the proposed model is demonstrated on GRID, a multi-speaker AVSR benchmark dataset. The
experimental results show that early integration of audio and visual features yields a clear improvement in
recognition accuracy and that BiLSTM is a more effective classifier than HMM. With integrated audio-visual
features, the highest recognition accuracy on clean data is 99.07%, an improvement of up to 9.28% over audio-only
recognition. On noisy data, the highest recognition accuracy for integrated audio-visual features is 98.47%, an
improvement of up to 12.05% over audio-only recognition. The main reason for BiLSTM's effectiveness is that it
takes into account the sequential characteristics of the speech signal. The results also show a performance
improvement over the previously reported highest audio-visual recognition accuracies on GRID and demonstrate the
robustness of our AVSR model (BiLSTM-AVSR).
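
The early-integration step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the audio stream has already been converted to per-frame MFCC vectors, that visual features are the low-frequency coefficients of a 2-D DCT over a mouth-region image, and that each audio frame is paired with the nearest-in-time video frame. The function names, the `keep` parameter, and the alignment rule are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def dct2_1d(x):
    """Unnormalized DCT-II of a 1-D sequence."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def dct2_2d(block):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct2_1d(r) for r in block]
    cols = [dct2_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def visual_features(roi, keep=3):
    """Return the top-left keep x keep (low-frequency) DCT coefficients.

    `roi` is a 2-D list of pixel intensities; `keep` is an illustrative
    assumption, not a value taken from the paper.
    """
    coeffs = dct2_2d(roi)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]


def early_fusion(audio_frames, visual_frames):
    """Concatenate audio and visual feature vectors frame by frame.

    Video typically has fewer frames than audio, so each audio frame is
    paired with the visual frame covering the same relative position in
    time (an assumed alignment rule).
    """
    n_a, n_v = len(audio_frames), len(visual_frames)
    fused = []
    for t, a in enumerate(audio_frames):
        j = min(n_v - 1, t * n_v // n_a)
        fused.append(list(a) + list(visual_frames[j]))
    return fused
```

The fused sequence (one combined vector per audio frame) is the kind of input a sequence classifier such as a BiLSTM would consume; the classifier itself is outside the scope of this sketch.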

DOI

10.21608/ejle.2019.59164

Keywords

DCT, MFCC, HMM, BiLSTM, GRID

Authors

First Name

Eslam

Last Name

Elmaghraby

MiddleName

Eid

Affiliation

Communication and Electronics Engineering Department, Faculty of Engineering, Fayoum University

Email

eem00@fayoum.edu.eg

City

Fayoum, Egypt

Orcid

0000-0003-3914-2818

First Name

Amr

Last Name

Gody

MiddleName

Refaat

Affiliation

Faculty of Engineering, Fayoum University

Email

amg00@fayoum.edu.eg

City

Fayoum

Orcid

0000-0003-2079-9860

First Name

Mohamed

Last Name

Farouk

MiddleName

Hashem

Affiliation

Engineering Math. & Physics Dept., Faculty of Engineering, Cairo University

Email

mhesham@eng.cu.edu.eg

City

Cairo, Egypt

Orcid

0000-0001-8101-6423

Volume

6

Article Issue

2

Related Issue

8993

Issue Date

2019-09-01

Receive Date

2019-06-07

Publish Date

2019-09-01

Page Start

44

Page End

58

Print ISSN

2356-8208

Online ISSN

2356-8216

Link

https://ejle.journals.ekb.eg/article_59164.html

Detail API

https://ejle.journals.ekb.eg/service?article_code=59164

Order

4

Type

Original Article

Type Code

1,039

Publication Type

Journal

Publication Title

The Egyptian Journal of Language Engineering

Publication Link

https://ejle.journals.ekb.eg/

MainTitle

Speech Recognition Using Historian Multimodal Approach

Details

Type

Article

Created At

22 Jan 2023