Duplicate detection and elimination in XML data for a data warehouse

  • Authors

    • Ghaith O. Mahdi, University of Anbar
    • Murtadha M. Hamad, University of Anbar
    2019-05-27
    https://doi.org/10.14419/ijet.v7i4.20419
  • Blocking Technique, Levenshtein Distance, Smith-Waterman Similarity, ANN (Back-Propagation)
  • Due to the significant increase in the volume of data in recent decades, the problem of duplicate data has emerged, because data is collected from multiple sources in different formats. Duplicates arise from these differing representations of the same data, so duplicate data must be cleaned in order to obtain a pure data set. The main concern of this study is cleaning XML data, which is known for its complex hierarchical structure, in a data warehouse. This is achieved by detecting duplicates in large databases in order to increase the efficiency of data mining. The proposed system for eliminating duplicate elements passes through three stages. The first stage (the pre-processing stage) includes two parts. The first part is the elimination of exact matches, which removes many completely identical elements; this saves considerable time and effort by preventing many elements from entering the processing stage, which is the most complex. In the second part, a blocking technique based on Levenshtein distance is used to minimize the number of comparisons and to block elements more accurately than traditional methods. These steps improve the dataset. In the second stage (the processing stage), the similarity ratio between each pair of elements within each block is computed using the Smith-Waterman similarity algorithm. The third stage is the classification stage, in which each element is labelled as duplicate or non-duplicate; an Artificial Neural Network (Back-Propagation) is used for this purpose, with a threshold of 0.65 determined from the results obtained. The efficiency of the proposed system is reflected in an accuracy close to 100%, achieved by reducing the number of false negatives and false positives relative to the true positives.
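The three stages summarised above can be made concrete with short sketches. The first is a minimal sketch of the pre-processing stage: exact-match elimination followed by Levenshtein-distance blocking. The flat string key used here to stand in for an XML element, and the edit-distance cutoff `max_distance=2`, are assumptions made for illustration, not values taken from the paper.

```python
# Sketch of the pre-processing stage: exact-match elimination, then blocking
# by Levenshtein distance against each block's representative element.
# The string keys and the distance cutoff are illustrative assumptions.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


def block_elements(elements, max_distance=2):
    """Drop exact duplicates, then group the remaining elements into blocks
    whose representative (first member) is within max_distance edits."""
    unique = list(dict.fromkeys(elements))       # exact-match elimination
    blocks = []                                  # each block: [representative, members...]
    for element in unique:
        for block in blocks:
            if levenshtein(element, block[0]) <= max_distance:
                block.append(element)
                break
        else:
            blocks.append([element])             # start a new block
    return blocks


if __name__ == "__main__":
    sample = ["John Smith", "John Smith", "Jon Smith", "Mary Jones", "Marry Jones"]
    print(block_elements(sample))
    # [['John Smith', 'Jon Smith'], ['Mary Jones', 'Marry Jones']]
```

Comparing each element only against block representatives is what keeps the number of comparisons down: only pairs that land in the same block go on to the similarity stage.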

     
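The processing stage computes a similarity ratio between every pair of elements inside a block. Below is a sketch of that computation with the Smith-Waterman local-alignment algorithm; the character-level scoring scheme (match = 2, mismatch = -1, gap = -1) and the normalisation of the alignment score to the range 0..1 are illustrative assumptions, since the abstract does not give the paper's parameters.

```python
# Sketch of the processing stage: Smith-Waterman local alignment between two
# element strings, normalised to a 0..1 similarity ratio. Scoring parameters
# and the normalisation are illustrative assumptions.

def smith_waterman_score(a: str, b: str, match=2, mismatch=-1, gap=-1) -> int:
    """Return the best local-alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diagonal, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


def similarity_ratio(a: str, b: str) -> float:
    """Normalise the alignment score by a perfect self-alignment of the
    shorter string, giving a ratio in [0, 1]."""
    if not a or not b:
        return 0.0
    return smith_waterman_score(a, b) / (2 * min(len(a), len(b)))


if __name__ == "__main__":
    print(round(similarity_ratio("Jon Smith", "John Smith"), 3))   # high: near-duplicate
    print(round(similarity_ratio("Jon Smith", "Mary Jones"), 3))   # much lower: distinct
```

In the proposed pipeline these ratios would be computed only for pairs that fall inside the same block produced by the previous stage.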

     
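For the classification stage, the abstract states that a Back-Propagation neural network labels each pair as duplicate or non-duplicate and that a threshold of 0.65 was chosen from the results obtained. The sketch below trains a tiny one-hidden-layer network with plain back-propagation on a single similarity feature and applies the 0.65 cut-off to the network output; the network size, learning rate, toy training data, and the reading of the threshold as an output cut-off are all assumptions made for illustration.

```python
# Sketch of the classification stage: a one-hidden-layer network trained with
# back-propagation on a toy similarity feature, then thresholded at 0.65.
# Architecture, data, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy training set: one feature per pair (e.g. a Smith-Waterman similarity
# ratio), label 1.0 = duplicate, 0.0 = non-duplicate.
X = np.array([[0.95], [0.88], [0.81], [0.74], [0.40], [0.33], [0.25], [0.10]])
y = np.array([[1.0], [1.0], [1.0], [1.0], [0.0], [0.0], [0.0], [0.0]])

# One hidden layer with 4 sigmoid units.
W1 = rng.normal(scale=0.5, size=(1, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros((1, 1))
lr = 1.0

for _ in range(10000):
    # Forward pass.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    # Backward pass: gradients of the mean squared error.
    d_output = (output - y) * output * (1 - output)
    d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)
    # Gradient-descent updates.
    W2 -= lr * hidden.T @ d_output / len(X)
    b2 -= lr * d_output.mean(axis=0)
    W1 -= lr * X.T @ d_hidden / len(X)
    b1 -= lr * d_hidden.mean(axis=0)

def classify(similarity: float, threshold: float = 0.65) -> str:
    """Label a pair as duplicate if the network output exceeds the threshold."""
    out = sigmoid(sigmoid(np.array([[similarity]]) @ W1 + b1) @ W2 + b2)[0, 0]
    return "duplicate" if out >= threshold else "non-duplicate"

print(classify(0.90))   # expected: duplicate
print(classify(0.30))   # expected: non-duplicate
```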



  • How to Cite

Mahdi, G. O., & Hamad, M. M. (2019). Duplicate detection and elimination in XML data for a data warehouse. International Journal of Engineering & Technology, 7(4), 6175–6180. https://doi.org/10.14419/ijet.v7i4.20419