Successive Duplicate Detection in Scalable Datasets in Cloud Database

  • Abstract

    Duplicate detection is the process of identifying multiple representations of the same real-world entity. Today, duplicate detection methods must process ever larger datasets in ever shorter time, which makes maintaining the quality of a dataset increasingly difficult. We present progressive duplicate detection algorithms based on the Progressive Sorted Neighbourhood Method and Progressive Blocking that significantly increase the efficiency of finding duplicates. When execution time is limited, the proposed approach maximizes the gain of the overall process within the available time by reporting most results much earlier than traditional systems. Comprehensive experiments show that our progressive algorithms can double the efficiency over time of standard duplicate detection and substantially improve on related work.
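The core idea of the Progressive Sorted Neighbourhood Method described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the record format, the `similarity` function, and the sort key are all illustrative assumptions. Records are sorted once, and the rank distance between compared pairs then grows progressively from 1 upward, so the most promising (closest-keyed) pairs are compared and reported first.

```python
# Hedged sketch of a Progressive Sorted Neighbourhood Method (PSNM).
# Assumptions: records are dicts with a "name" field; the similarity
# measure and sort key are illustrative, not from the original paper.
from difflib import SequenceMatcher

def similarity(a, b):
    """Illustrative string similarity on the 'name' field."""
    return SequenceMatcher(None, a["name"], b["name"]).ratio()

def progressive_snm(records, sort_key, max_window, threshold=0.9):
    """Yield likely-duplicate pairs, most promising first.

    Instead of comparing all pairs inside one fixed window, the rank
    distance grows from 1 to max_window - 1, so neighbouring pairs
    (most likely duplicates after sorting) are reported early.
    """
    ordered = sorted(records, key=sort_key)
    for distance in range(1, max_window):          # progressive widening
        for i in range(len(ordered) - distance):
            a, b = ordered[i], ordered[i + distance]
            if similarity(a, b) >= threshold:
                yield a, b                         # early result reporting

people = [
    {"id": 1, "name": "John Smith"},
    {"id": 2, "name": "Jon Smith"},
    {"id": 3, "name": "Alice Jones"},
]
dups = list(progressive_snm(people, sort_key=lambda r: r["name"], max_window=3))
```

If execution is cut off after only a few iterations of the outer loop, the pairs already emitted are exactly the most likely duplicates, which is the early-reporting behaviour the abstract claims for progressive methods.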



  • Keywords

    Duplicate Detection, Dataset, Progressive Blocking, Progressive Sorted Neighbourhood Method, Data Cleaning



Article ID: 11167
DOI: 10.14419/ijet.v7i2.4.11167

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.