Impact of Small Files on Hadoop Performance: Literature Survey and Open Points

Article

Last updated: 25 Dec 2024

Subjects

-

Tags

-

Abstract

Hadoop is an open-source framework written in Java and used for big
data processing. It consists of two main components: the Hadoop
Distributed File System (HDFS) and MapReduce. HDFS stores the data,
while MapReduce distributes an application's tasks and processes them
in a distributed fashion. Recently, several researchers have employed
Hadoop to process big data. Their results indicate that Hadoop
performs well with large files (files larger than the DataNode block
size). Nevertheless, Hadoop performance decreases with small files,
that is, files smaller than its block size. This is because small
files consume the memory of both the DataNode and the NameNode
and increase the execution time of applications (i.e., they decrease
MapReduce performance). In this paper, the small-files problem in
Hadoop is defined, and the existing approaches to solving it are
classified and discussed. In addition, some open points are presented
that must be considered when designing a better approach to improving
Hadoop performance on small files.
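
The memory pressure that small files put on the NameNode can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not taken from the paper. It assumes a commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object and a 128 MiB block size; both figures, and the function names, are illustrative assumptions.

```python
import math

# Assumption: rule-of-thumb heap cost per namespace object (file or block).
# This figure is illustrative and does not come from the article.
BYTES_PER_OBJECT = 150

# Assumption: 128 MiB HDFS block size; real clusters may be configured differently.
BLOCK_SIZE = 128 * 1024 * 1024


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count the file and block objects the NameNode would have to track."""
    total = 0
    for size in file_sizes:
        blocks = math.ceil(size / block_size)  # an empty file has no blocks
        total += 1 + blocks                    # one file object plus its blocks
    return total


def estimated_heap_bytes(file_sizes, block_size=BLOCK_SIZE,
                         bytes_per_object=BYTES_PER_OBJECT):
    """Rough NameNode heap estimate under the assumptions above."""
    return namenode_objects(file_sizes, block_size) * bytes_per_object


GIB = 1024 ** 3
MIB = 1024 ** 2

# The same 1 GiB of data stored two ways.
one_large = [GIB]            # a single 1 GiB file
many_small = [MIB] * 1024    # 1024 files of 1 MiB each

print(namenode_objects(one_large), estimated_heap_bytes(one_large))
print(namenode_objects(many_small), estimated_heap_bytes(many_small))
```

Under these assumptions, the same volume of data costs the NameNode far more objects when it is split into files smaller than the block size, because every small file needs its own file object and its own block object.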

DOI

10.21608/mjeer.2019.62728

Keywords

Hadoop, Big Data, Small Files, Hive, HBase, Amazon EMR, S3DistCp

Authors

First Name

Tharwat

Last Name

EL-SAYED

MiddleName

-

Affiliation

Computer Science & Eng. Dept., Faculty of Electronic Eng., Menoufia University, Menouf 32952, Egypt

Email

-

City

-

Orcid

-

First Name

Mohamed

Last Name

Badawy

MiddleName

-

Affiliation

Computer Science & Eng. Dept., Faculty of Electronic Eng., Menoufia University, Menouf 32952, Egypt

Email

-

City

-

Orcid

-

First Name

Ayman

Last Name

El-Sayed

MiddleName

-

Affiliation

Computer Science & Eng. Dept., Faculty of Electronic Eng., Menoufia University, Menouf 32952, Egypt

Email

-

City

-

Orcid

0000-0002-4437-259X

Volume

28

Article Issue

1

Related Issue

9506

Issue Date

2019-01-01

Receive Date

2018-01-03

Publish Date

2019-01-01

Page Start

109

Page End

120

Print ISSN

1687-1189

Online ISSN

2682-3535

Link

https://mjeer.journals.ekb.eg/article_62728.html

Detail API

https://mjeer.journals.ekb.eg/service?article_code=62728

Order

7

Type

Original Article

Type Code

1,088

Publication Type

Journal

Publication Title

Menoufia Journal of Electronic Engineering Research

Publication Link

https://mjeer.journals.ekb.eg/

Created At

22 Jan 2023