An Enhancement of Progressive Duplicate Detection with Performance Evaluation


  • Ravikanth. M
  • D Vasumathi



Duplicate detection, entity resolution, pay-as-you-go, progressiveness, data cleaning.


Duplicate detection is the process of identifying multiple representations of the same real-world entities. Today, these methods must process ever larger datasets in ever shorter time, and maintaining the quality of a dataset becomes increasingly difficult. Progressive duplicate detection algorithms significantly increase the efficiency of finding duplicates when execution time is limited. They maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly improve upon related work.



