Enhancement Quality and Accuracy of Speech Recognition System Using Multimodal Audio-Visual Speech signal

Article

Last updated: 24 Dec 2024

Subjects

-

Tags

-

Abstract

Most developments in automatic speech recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone suffers from deficiencies that prevent its use in many real-world applications, particularly under adverse conditions. Combining the auditory and visual modalities promises higher recognition accuracy and robustness than can be obtained with a single modality. Multimodal recognition is therefore acknowledged as a vital component of the next generation of spoken language systems. This paper aims to build a connected-word audio-visual automatic speech recognition (AV-ASR) system for the English language that uses both acoustic and visual speech information to improve recognition performance. Mel-frequency cepstral coefficients (MFCCs) are used to extract the audio features from the speech files. For the visual counterpart, Discrete Cosine Transform (DCT) coefficients are used to extract the visual features from the speaker's mouth region, and Principal Component Analysis (PCA) is used for dimensionality reduction. These visual features are then concatenated with the traditional audio features, and the resulting feature vectors are used to train hidden Markov model (HMM) parameters for word-level models. The system was developed using the Hidden Markov Model Toolkit (HTK), which uses HMMs for recognition. The potential of the suggested approach is demonstrated by a preliminary experiment on the GRID sentence database, one of the largest databases available for audio-visual recognition, which contains continuous English voice commands for a small-vocabulary task. The experimental results show that the proposed AV-ASR system achieves a higher recognition rate than an audio-only recognizer and performs robustly. In the speaker-independent test, the success rate of the grammar-based word recognition system increases by 4% over all speakers.
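
The feature-level fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (the keywords suggest the original work used MATLAB and HTK), and every concrete number here is an assumption made for illustration: 39-dimensional audio vectors at 100 frames per second, 32x48 grayscale mouth regions at 25 frames per second, a 6x6 block of low-frequency DCT coefficients, and 10 PCA components. Random arrays stand in for real MFCCs and mouth images, the function names are hypothetical, and linear interpolation is just one possible way to align the two frame rates. HMM training is not shown.

```python
import numpy as np

def dct2_matrix(n):
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)
    return m

def mouth_dct_features(roi, keep=6):
    """2-D DCT of a grayscale mouth region; keep the low-frequency keep x keep block."""
    rows, cols = roi.shape
    coeffs = dct2_matrix(rows) @ roi @ dct2_matrix(cols).T
    return coeffs[:keep, :keep].ravel()

def pca_fit(x, n_components):
    """Fit PCA via SVD; return the mean and the top principal directions."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(x, mean, components):
    """Project centered data onto the principal directions."""
    return (x - mean) @ components.T

def fuse(audio, visual):
    """Interpolate visual features to the audio frame count, then concatenate."""
    t_audio = np.linspace(0.0, 1.0, len(audio))
    t_visual = np.linspace(0.0, 1.0, len(visual))
    upsampled = np.column_stack(
        [np.interp(t_audio, t_visual, visual[:, j]) for j in range(visual.shape[1])]
    )
    return np.hstack([audio, upsampled])

# Synthetic stand-ins for a 3-second utterance (assumed rates: 100 and 25 frames/s).
rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((300, 39))   # placeholder for MFCC vectors
mouth_rois = rng.random((75, 32, 48))          # placeholder for mouth-region images

dct_feats = np.array([mouth_dct_features(roi) for roi in mouth_rois])
mean, components = pca_fit(dct_feats, n_components=10)
visual_feats = pca_transform(dct_feats, mean, components)
fused = fuse(audio_feats, visual_feats)        # one audio-visual vector per audio frame
```

In this sketch each fused vector has 49 dimensions (39 audio plus 10 visual). In a pipeline like the one the abstract describes, such vectors would be written out in a format the HMM toolkit can read and used in place of audio-only features.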

DOI

10.21608/ejle.2017.59430

Keywords

AV-ASR, HMM, HTK, MFCC, DCT, PCA, MATLAB, GRID

Authors

First Name

Eslam

Last Name

Elmaghraby

MiddleName

Eid

Affiliation

Communication and Electronics Engineering Department, Faculty of Engineering, Fayoum University

Email

eem00@fayoum.edu.eg

City

Fayoum, Egypt

Orcid

0000-0003-3914-2818

First Name

Amr

Last Name

Gody

MiddleName

-

Affiliation

Faculty of Engineering, Fayoum University

Email

amg00@fayoum.edu.eg

City

Fayoum

Orcid

0000-0003-2079-9860

First Name

Mohamed

Last Name

Farouk

MiddleName

Hashem

Affiliation

Engineering Math. & Physics Dept., Faculty of Engineering, Cairo University

Email

mhesham@eng.cu.edu.eg

City

Cairo, Egypt

Orcid

0000-0001-8101-6423

Volume

4

Article Issue

2

Related Issue

9019

Issue Date

2017-09-01

Receive Date

2017-05-09

Publish Date

2017-09-01

Page Start

27

Page End

40

Print ISSN

2356-8208

Online ISSN

2356-8216

Link

https://ejle.journals.ekb.eg/article_59430.html

Detail API

https://ejle.journals.ekb.eg/service?article_code=59430

Order

3

Type

Original Article

Type Code

1,039

Publication Type

Journal

Publication Title

The Egyptian Journal of Language Engineering

Publication Link

https://ejle.journals.ekb.eg/

MainTitle

Enhancement Quality and Accuracy of Speech Recognition System Using Multimodal Audio-Visual Speech signal

Details

Type

Article

Created At

22 Jan 2023