184716

Bidirectional Temporal Context Fusion with Bi-Modal Semantic Features using a gating mechanism for Dense Video Captioning

Article

Last updated: 22 Jan 2023

Subjects

-

Tags

-

Abstract

Dense video captioning involves detecting interesting events in an untrimmed video and generating a textual description for each event. Many machine intelligence applications, such as video summarization, search and retrieval, and automatic video subtitling to support blind people, benefit from an automated dense caption generator. Most recent works use an encoder-decoder neural network framework that employs a 3D-CNN as an encoder to represent the frames of a detected event and an RNN as a decoder for caption generation. They follow an attention-based mechanism to learn where to focus in the encoded video frames during caption generation. Although attention-based approaches have achieved excellent results, they link visual features directly to textual captions and ignore the rich intermediate and high-level video concepts such as people, objects, scenes, and actions. In this paper, we first propose to obtain a better event representation, one that discriminates between events ending at nearly the same time, by applying an attention-based fusion in which the hidden states of a bidirectional LSTM sequence video encoder, which encodes the past and future context surrounding a detected event, are fused with the event's visual (R3D) features. Second, we propose to explicitly extract bi-modal semantic concepts (nouns and verbs) from a detected event segment and to balance the contributions of the proposed event representation and the semantic concepts dynamically using a gating mechanism during captioning. Experimental results demonstrate that the proposed attention-based fusion represents an event better for captioning, and that involving semantic concepts improves captioning performance.
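
The gating idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's equations, so everything below is an assumption made for illustration: the function name `gated_fusion`, the parameters `W_gate` and `b_gate`, the requirement that both input vectors have the same size, and the choice of an element-wise sigmoid gate that forms a convex combination of the two inputs. It is not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gated_fusion(event_vec, semantic_vec, W_gate, b_gate):
    """Blend an event representation with a semantic-concept vector.

    Hypothetical formulation (assumed, not taken from the paper):
        gate  = sigmoid(W_gate @ [event; semantic] + b_gate)
        fused = gate * event + (1 - gate) * semantic
    Both inputs must have the same length d; W_gate is d x 2d.
    """
    if len(event_vec) != len(semantic_vec):
        raise ValueError("event_vec and semantic_vec must have equal length")
    concat = list(event_vec) + list(semantic_vec)
    pre_activation = matvec(W_gate, concat)
    gate = [sigmoid(p + b) for p, b in zip(pre_activation, b_gate)]
    fused = [
        g * e + (1.0 - g) * s
        for g, e, s in zip(gate, event_vec, semantic_vec)
    ]
    return fused, gate


# With all-zero gate parameters the gate is 0.5 everywhere,
# so the output is the plain average of the two inputs.
event = [1.0, 2.0]
semantic = [3.0, 6.0]
W_zero = [[0.0] * 4 for _ in range(2)]
b_zero = [0.0, 0.0]
fused, gate = gated_fusion(event, semantic, W_zero, b_zero)
print(gate)   # [0.5, 0.5]
print(fused)  # [2.0, 4.0]
```

In a trained captioner the gate parameters would be learned, and the gate would typically also depend on the decoder state at each word, so the balance between visual context and semantic concepts could change from one generated word to the next.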

DOI

10.21608/ijicis.2021.60216.1055

Keywords

Video events proposal detection, Video to natural language, Attention-Based sentence decoder, Bidirectional LSTM, Deep learning

Authors

First Name

Noorhan

Last Name

Khaled

MiddleName

-

Affiliation

Computer Science Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt

Email

noorhankhaled1994@gmail.com

City

-

Orcid

0000-0003-3372-7495

First Name

M

Last Name

Aref

MiddleName

-

Affiliation

Computer Science Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt

Email

mostafa.aref@cis.asu.edu.eg

City

-

Orcid

0000-0002-1278-0070

First Name

Mohammed

Last Name

Marey

MiddleName

-

Affiliation

Scientific Computing Department, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, Egypt

Email

mohammed.marey@cis.asu.edu.eg

City

-

Orcid

-

Volume

21

Article Issue

2

Related Issue

25765

Issue Date

2021-07-01

Receive Date

2021-01-28

Publish Date

2021-07-18

Page Start

1

Page End

22

Print ISSN

1687-109X

Online ISSN

2535-1710

Link

https://ijicis.journals.ekb.eg/article_184716.html

Detail API

https://ijicis.journals.ekb.eg/service?article_code=184716

Order

1

Type

Original Article

Type Code

494

Publication Type

Journal

Publication Title

International Journal of Intelligent Computing and Information Sciences

Publication Link

https://ijicis.journals.ekb.eg/

MainTitle

-

Details

Type

Article

Created At

22 Jan 2023