Massive Volume of Unstructured Data and Storage Space Optimization: A Review


  • Ranjeet V. Powar
  • B Arunkumar





Deduplication, storage optimization, data compression, unstructured data.


The volume of digital data generated and used by enterprises is increasing at an enormous rate. Survey reports indicate that more than 80% of the data generated in the last two years is unstructured. The storage space required for this large volume of unstructured data is therefore very high, which has drawn attention to large-scale storage systems. Deduplication is a space-efficient method used mainly to solve the storage space optimization problem. This paper focuses on the effect of the massive volume of unstructured data, reviews various storage optimization techniques, and surveys various storage types. In addition, it elaborates on specific challenges of storage optimization using deduplication and on technology that handles huge amounts of unstructured data.
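
To make the idea of deduplication concrete, the following is a minimal sketch in Python, not the method of any system reviewed in the paper. It assumes fixed-size chunking and SHA-256 fingerprints; the names `deduplicate`, `restore`, and `chunk_size` are hypothetical and chosen only for illustration. Each distinct chunk is stored once, and an ordered list of fingerprints records how to rebuild the original data.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Assumption: fixed-size chunking with SHA-256 fingerprints; real systems
    may use variable-size (content-defined) chunking instead.
    """
    store = {}   # fingerprint -> chunk bytes (unique chunks only)
    recipe = []  # ordered fingerprints needed to rebuild the input
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # store only if not seen before
        recipe.append(fingerprint)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


if __name__ == "__main__":
    # Three identical chunks followed by one different chunk.
    data = b"A" * 4096 * 3 + b"B" * 4096
    store, recipe = deduplicate(data)
    stored_bytes = sum(len(chunk) for chunk in store.values())
    print(f"original: {len(data)} bytes, stored: {stored_bytes} bytes")
    assert restore(store, recipe) == data
```

In this sketch the space saving comes entirely from repeated chunks; data with no repeated chunks would occupy the same space plus the overhead of the fingerprint list.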




