A Survey on Duplicate Data Filtering Methods in Big Data

  • Authors

    • S P Godlin Jasil
    • V Ulagamuthalvi
    2018-08-04
    https://doi.org/10.14419/ijet.v7i3.1.16805
  • Keywords: Bloom Filter, Duplicate, Data Filtering, False Positive, Multi-layer Bloom Filter.
  • Abstract

    Big Data analytics is the process of collecting and analyzing huge, heterogeneous sets of data. The data are fetched from different sources and can be in heterogeneous form. Data can arrive in a big data system at rates of gigabytes per second. Because the data volume is so large, redundant data are likely, and they degrade network performance. This article reviews filtering methods and algorithms used for duplicate elimination, such as the Bloom filter, Stable Bloom Filter, multi-layer Bloom filter, and Counting Bloom Filter, along with their disadvantages, such as false positives and false negatives. The aim of this paper is to propose an algorithm for eliminating duplicate data in a large data set by using big data analytics.
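
As an illustration of the simplest method named in the abstract, the following is a minimal sketch of duplicate filtering with a plain Bloom filter. It is not the algorithm proposed in the paper. The names `BloomFilter` and `drop_duplicates`, the default sizes, and the use of salted SHA-256 to derive bit positions are all assumptions made for this example. A plain Bloom filter can report a false positive (a new item treated as a duplicate) but never a false negative; false negatives arise only in variants that clear or decay bits, such as the Stable Bloom Filter.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: k hash positions over an m-slot array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Sizes are illustrative assumptions, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all(self.bits[pos] for pos in self._positions(item))


def drop_duplicates(stream, bloom):
    """Yield items the filter has not seen before.

    A false positive causes a genuinely new item to be dropped.
    """
    for item in stream:
        if item in bloom:
            continue
        bloom.add(item)
        yield item
```

For example, `list(drop_duplicates(records, BloomFilter(num_bits=4096)))` passes each distinct record through once, as long as no false positive occurs. A larger `num_bits` lowers the false positive rate at the cost of memory.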

     

     

  • How to Cite

    P Godlin Jasil, S., & Ulagamuthalvi, V. (2018). A Survey on Duplicate Data Filtering Methods in Big Data. International Journal of Engineering & Technology, 7(3.1), 90-92. https://doi.org/10.14419/ijet.v7i3.1.16805

    Received date: 2018-08-04

    Accepted date: 2018-08-04

    Published date: 2018-08-04