A novel indexing scheme for efficient handling of small files in Hadoop Distributed File System

Bibliographic Details
Published in	2013 International Conference on Computer Communication and Informatics, pp. 1-8
Main Authors	Chandrasekar, S.; Dakshinamurthy, R.; Seshakumar, P. G.; Prabavathy, B.; Babu, C.
Format	Conference Proceeding; Journal Article
Language	English
Published	IEEE 01.01.2013
Subjects	Client server systems; Computer architecture; Computers; Conferences; Correlation; Counting; extended hdfs; file correlation; File systems; hadoop distributed file system; Indexes; Indexing; Informatics; Merging; Metadata; Prefetching; small file; Storage; Stores
ISBN	1467329061 9781467329064
DOI	10.1109/ICCCI.2013.6466147

Summary:	Hadoop Distributed File System (HDFS) is designed for reliable storage and management of very large files. All files in HDFS are managed by a single server, the NameNode, which keeps the metadata for every stored file in its main memory. As a consequence, HDFS suffers a performance penalty as the number of small files grows. Storing and managing a large number of small files imposes a heavy burden on the NameNode, and the number of files that HDFS can hold is constrained by the size of the NameNode's main memory. Further, HDFS does not take the correlation among files into account, and it does not provide any prefetching mechanism to improve I/O performance. To improve the efficiency of storing and accessing small files on HDFS, we propose a solution based on the work of Dong et al., namely Extended Hadoop Distributed File System (EHDFS). In this approach, a set of correlated files, as identified by the client, is combined into a single large file to reduce the file count. An indexing mechanism is built to access the individual files within the corresponding combined file. Index prefetching is also provided to improve I/O performance and minimize the load on the NameNode. The experimental results indicate that EHDFS reduces the metadata footprint in the NameNode's main memory by 16% and improves the efficiency of storing and accessing a large number of small files.
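
The merge-and-index idea in the summary can be illustrated with a minimal in-memory sketch. It is not the EHDFS implementation and does not use any Hadoop API: the names `CombinedFile`, `merge_small_files`, and `IndexCache` are hypothetical, the index layout (file name mapped to offset and length) is an assumption, and the client-side cache only approximates what index prefetching might look like.

```python
from dataclasses import dataclass, field


@dataclass
class CombinedFile:
    """Hypothetical stand-in for one merged file plus its index (not an HDFS type)."""
    data: bytearray = field(default_factory=bytearray)
    # Assumed index layout: file name -> (offset, length) inside `data`.
    index: dict = field(default_factory=dict)

    def append(self, name: str, payload: bytes) -> None:
        """Append one small file and record where it landed."""
        if name in self.index:
            raise KeyError(f"duplicate file name: {name}")
        self.index[name] = (len(self.data), len(payload))
        self.data.extend(payload)

    def read(self, name: str) -> bytes:
        """Return one small file by slicing the combined data via the index."""
        offset, length = self.index[name]
        return bytes(self.data[offset:offset + length])


def merge_small_files(files: dict) -> CombinedFile:
    """Combine a client-chosen set of correlated files into one CombinedFile."""
    combined = CombinedFile()
    for name, payload in files.items():
        combined.append(name, payload)
    return combined


class IndexCache:
    """Assumed client-side cache: the first lookup copies the whole index,
    so later lookups for files in the same combined file stay local."""

    def __init__(self, combined: CombinedFile) -> None:
        self._combined = combined
        self._cached = None
        self.fetches = 0  # number of times the full index was fetched

    def locate(self, name: str) -> tuple:
        if self._cached is None:
            self._cached = dict(self._combined.index)
            self.fetches += 1
        return self._cached[name]


if __name__ == "__main__":
    combined = merge_small_files({"a.txt": b"hello", "b.txt": b"world!"})
    cache = IndexCache(combined)
    print(cache.locate("b.txt"), combined.read("b.txt"))
```

In this sketch the metadata service would track a single entry for the combined file rather than one per small file, which is the effect the summary attributes to reducing the file count.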