A novel indexing scheme for efficient handling of small files in Hadoop Distributed File System

Bibliographic Details
Published in: 2013 International Conference on Computer Communication and Informatics, pp. 1-8
Main Authors: Chandrasekar, S.; Dakshinamurthy, R.; Seshakumar, P. G.; Prabavathy, B.; Babu, C.
Format: Conference Proceeding, Journal Article
Language: English
Published: IEEE, 01.01.2013
ISBN: 1467329061, 9781467329064
DOI: 10.1109/ICCCI.2013.6466147

Summary: Hadoop Distributed File System (HDFS) is designed for reliable storage and management of very large files. All files in HDFS are managed by a single server, the NameNode. The NameNode stores metadata in its main memory for each file stored in HDFS. As a consequence, HDFS suffers a performance penalty as the number of small files grows. Storing and managing a large number of small files imposes a heavy burden on the NameNode, and the number of files that can be stored in HDFS is constrained by the size of the NameNode's main memory. Further, HDFS does not take the correlation among files into account, and it does not provide any prefetching mechanism to improve I/O performance. To improve the efficiency of storing and accessing small files on HDFS, we propose a solution based on the work of Dong et al., namely Extended Hadoop Distributed File System (EHDFS). In this approach, a set of correlated files, as identified by the client, is combined into a single large file to reduce the file count. An indexing mechanism has been built to access the individual files from the corresponding combined file. Further, index prefetching is provided to improve I/O performance and minimize the load on the NameNode. The experimental results indicate that EHDFS reduces the metadata footprint in the NameNode's main memory by 16% and also improves the efficiency of storing and accessing a large number of small files.
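
The combine-and-index idea in the summary can be illustrated with a minimal sketch. This is not the EHDFS implementation and uses no HDFS API: the combined file is modeled as an in-memory byte string, the index as a dictionary from file name to (offset, length), and the names `combine_files` and `read_file` are hypothetical. Index prefetching is not modeled here.

```python
import io
from typing import Dict, Iterable, Tuple

# Hypothetical types and names for illustration only; not taken from the paper.
Index = Dict[str, Tuple[int, int]]  # file name -> (offset, length)


def combine_files(files: Iterable[Tuple[str, bytes]]) -> Tuple[bytes, Index]:
    """Concatenate small files into one blob and record where each one lives."""
    buffer = io.BytesIO()
    index: Index = {}
    for name, data in files:
        if name in index:
            raise ValueError(f"duplicate file name: {name}")
        # The current write position is the offset of this file in the blob.
        index[name] = (buffer.tell(), len(data))
        buffer.write(data)
    return buffer.getvalue(), index


def read_file(combined: bytes, index: Index, name: str) -> bytes:
    """Fetch one small file from the combined blob using its index entry."""
    offset, length = index[name]
    return combined[offset:offset + length]


# Three small files become a single stored object plus one index.
small_files = [("a.txt", b"alpha"), ("b.txt", b"bravo!"), ("c.txt", b"")]
combined, index = combine_files(small_files)
```

In this toy model, a store that tracks one entry per stored object would hold one entry for `combined` instead of three, which is the file-count reduction the summary describes; the index is what makes the individual files addressable again.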