386922

Transformer-Based Backbones for Scene Graph Generation: A Comparative Analysis

Article

Last updated: 03 Jan 2025

Subjects

-

Tags

-

Abstract

A scene graph is a structured representation of an image that explicitly describes the scene as a set of objects, their attributes, and the relationships linking the objects. As computer vision has advanced, researchers have turned to tasks that demand more complex reasoning and a higher level of visual scene understanding, such as Visual Question Answering, image generation, and cross-modal retrieval. The scene graph is an effective data structure that highlights the complex visual relationships present in a scene. In this work, we provide a comparative analysis of backbone models for Scene Graph Generation (SGG). We compare Convolutional Neural Network (CNN) backbones with vision transformer-based backbones using the RelTR model. The analysis shows that the SwiftFormer L3 and MiT-B2 transformer backbones improve Recall@50 over the ResNet50 CNN backbone by 2.1% and 2.5%, respectively, when evaluated on the same Visual Genome 50 test split. Visual Genome 50 is a tailored version of the Visual Genome dataset that contains only the 50 most common relationships and the 150 most frequent object classes.
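As an illustration of the terms used in the abstract, the following is a minimal sketch of a scene graph as a set of objects, attributes, and relationship triplets, together with a simplified Recall@K. It is not the authors' code or the RelTR implementation: the names `Triplet`, `SceneGraph`, and `recall_at_k` are hypothetical, and the metric here matches triplets by label only, whereas SGG benchmarks typically also require the predicted boxes to overlap the ground-truth boxes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triplet:
    """One relationship: <subject, predicate, object>. Hypothetical helper type."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Objects, per-object attributes, and relationships between objects."""
    objects: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)   # object name -> set of attributes
    relationships: set = field(default_factory=set)  # set of Triplet

    def add_attribute(self, obj, attribute):
        self.objects.add(obj)
        self.attributes.setdefault(obj, set()).add(attribute)

    def add_relationship(self, subject, predicate, obj):
        self.objects.update((subject, obj))
        self.relationships.add(Triplet(subject, predicate, obj))


def recall_at_k(ranked_predictions, ground_truth, k=50):
    """Fraction of ground-truth triplets found among the top-k predictions.

    Simplifying assumption: a prediction counts as correct when its labels
    match a ground-truth triplet; box localisation is ignored.
    """
    ground_truth = set(ground_truth)
    if not ground_truth:
        return 0.0
    top_k = set(ranked_predictions[:k])
    return len(top_k & ground_truth) / len(ground_truth)


# Toy example: two ground-truth relationships for one image.
graph = SceneGraph()
graph.add_attribute("horse", "brown")
graph.add_relationship("man", "riding", "horse")
graph.add_relationship("horse", "on", "grass")

# Predictions sorted by descending confidence (made-up values for illustration).
predictions = [
    Triplet("man", "riding", "horse"),
    Triplet("man", "on", "grass"),
    Triplet("horse", "on", "grass"),
]

recall_top2 = recall_at_k(predictions, graph.relationships, k=2)
recall_top3 = recall_at_k(predictions, graph.relationships, k=3)
```

In this toy case only one of the two ground-truth triplets appears in the top two predictions, while both appear in the top three, so raising K raises recall. Recall@50 in the abstract applies the same idea with K = 50 over the test split.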

DOI

10.21608/ijicis.2024.301597.1342

Keywords

Scene Graph, Scene Graph Generation, Transformer-Based Backbone, Visual Relationship Detection, Low Resolution

Authors

First Name

Mohammad

Last Name

Essam

MiddleName

-

Affiliation

Faculty of Computer and Information Sciences, Ain Shams University

Email

mohamed97@cis.asu.edu.eg

City

Cairo

Orcid

-

First Name

Dina

Last Name

Khattab

MiddleName

-

Affiliation

Scientific Computing Department, Faculty of Computer & Information Sciences, Ain Shams University, Cairo, Egypt

Email

dina.khattab@cis.asu.edu.eg

City

Cairo

Orcid

-

First Name

Howida

Last Name

Shedeed

MiddleName

-

Affiliation

Faculty of Computer and Information Sciences, Ain Shams University

Email

dr_howida@cis.asu.edu.eg

City

-

Orcid

-

First Name

Mohamed

Last Name

Tolba

MiddleName

-

Affiliation

Department of Scientific Computing, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, 11566, Egypt

Email

fahmytolba@cis.asu.edu.eg

City

-

Orcid

0000-0003-3104-6418

Volume

24

Article Issue

3

Related Issue

50851

Issue Date

2024-09-01

Receive Date

2024-07-04

Publish Date

2024-09-30

Page Start

1

Page End

10

Print ISSN

1687-109X

Online ISSN

2535-1710

Link

https://ijicis.journals.ekb.eg/article_386922.html

Detail API

https://ijicis.journals.ekb.eg/service?article_code=386922

Order

386,922

Type

Original Article

Type Code

494

Publication Type

Journal

Publication Title

International Journal of Intelligent Computing and Information Sciences

Publication Link

https://ijicis.journals.ekb.eg/

MainTitle

Transformer-Based Backbones for Scene Graph Generation: A Comparative Analysis

Details

Type

Article

Created At

23 Dec 2024