solution for the future: small file management by optimizing Hadoop

  • Abstract

    The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across the machines of a large cluster. It is one of the most widely used distributed file systems and offers high availability and scalability on low-cost hardware. Every Hadoop framework uses HDFS as its storage component. Coupled with MapReduce, the processing component, HDFS has become the standard platform for big data management. By design, HDFS can handle huge numbers of large files, but it may be much less effective when deployed to handle large numbers of small files. This paper puts forward a new strategy for managing small files. The approach consists of two principal phases. The first phase consolidates a client's input files and stores them contiguously in a single allocated block, in SequenceFile format, continuing into subsequent blocks as each one fills. This avoids multiple block allocations for different streams, which reduces requests for available blocks and also reduces the metadata memory used on the NameNode, because a group of small files packaged in a SequenceFile on the same block requires one entry instead of one entry per small file. The second phase analyzes the attributes of the stored small files so that they can be distributed in such a way that the most frequently accessed files are referenced by an additional index, in MapFile format, to speed up reads during random access.
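
    The two phases can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the paper's implementation and not the Hadoop API: the names Container, pack, and build_hot_index are hypothetical, the 128 MiB block size is an assumed configuration value, and per-file access counts are assumed to be the attribute used to pick the most frequently accessed files. It also idealizes NameNode bookkeeping as one entry per packed container.

```python
from dataclasses import dataclass, field

# Assumed block size in bytes; real clusters configure this value.
BLOCK_SIZE = 128 * 1024 * 1024


@dataclass
class Container:
    """Hypothetical stand-in for one SequenceFile that fills a single block."""
    used: int = 0
    offsets: dict = field(default_factory=dict)  # file name -> byte offset


def pack(files, block_size=BLOCK_SIZE):
    """Phase 1 sketch: append (name, size) records back to back and open
    a new container only when the current one cannot hold the next file."""
    containers = [Container()]
    for name, size in files:
        if size > block_size:
            raise ValueError(f"{name} is not a small file for this block size")
        current = containers[-1]
        if current.used + size > block_size:
            current = Container()
            containers.append(current)
        current.offsets[name] = current.used
        current.used += size
    return containers


def build_hot_index(containers, access_counts, top_n):
    """Phase 2 sketch: map the top_n most accessed files to their
    (container number, offset), sorted by name as a MapFile sorts its keys."""
    hot = set(sorted(access_counts, key=access_counts.get, reverse=True)[:top_n])
    index = {}
    for number, container in enumerate(containers):
        for name, offset in container.offsets.items():
            if name in hot:
                index[name] = (number, offset)
    return dict(sorted(index.items()))


# 300 made-up files of 1 MiB each: 300 entries if stored individually.
files = [(f"img_{i:04d}.jpg", 1024 * 1024) for i in range(300)]
containers = pack(files)

# Made-up access counts; only the two most accessed files get indexed.
access_counts = {"img_0130.jpg": 50, "img_0002.jpg": 40, "img_0299.jpg": 5}
index = build_hot_index(containers, access_counts, top_n=2)

print(len(files), "separate files packed into", len(containers), "containers")
print(index)
```

    In this model the number of tracked entries falls from one per file to one per container, and a lookup of an indexed file goes straight to its container and offset instead of scanning the packed records.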


  • Keywords

    Cloud Hadoop, HDFS, Small Files, SequenceFile, MapFile.





Article ID: 10773
DOI: 10.14419/ijet.v7i2.6.10773

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.