Removing Duplicate URLs based on URL Normalization and Query Parameter

  • Authors

    • Kavita Goel
    • Jay Shankar Prasad
    • Saba Hilal
    2018-07-20
    https://doi.org/10.14419/ijet.v7i3.12.16107
  • URL Normalization, Query Parameter, Categorization, Duplicate URLs, Execution time.
  • Searching is a core requirement of web users, and the quality of search results depends on the crawler. Users rely on search engines to find information in various forms: text, images, sound, and video. A search engine answers queries from an indexed database, and this database is built from the URLs that the crawler fetches. Some URLs lead, directly or indirectly, to the same page, so crawling and indexing URLs with similar content wastes resources. Crawlers produce such results because of a bad crawling algorithm, a poor-quality ranking algorithm, or a low-level user experience. The challenge is to detect and eliminate duplicate and near-duplicate documents in order to improve the performance of a search engine. This paper proposes a web crawler that crawls within a particular category to remove irrelevant URLs and applies URL normalization to remove duplicate URLs within that category. Results are analyzed on the basis of total URLs fetched, duplicate URLs, and query execution time.
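
To make the idea of URL normalization with query-parameter handling concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names `normalize_url` and `remove_duplicates`, the `IGNORED_PARAMS` set, and the specific rules (lowercasing the host, dropping default ports and fragments, trimming a trailing slash, sorting query parameters) are assumptions chosen for illustration. Trimming a trailing slash in particular is not guaranteed to preserve meaning on every server.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of query parameters assumed not to change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

# Default ports that can be dropped without changing the target.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of url under the assumed rules above."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and DEFAULT_PORTS.get(scheme) != parts.port:
        host = f"{host}:{parts.port}"

    # Treat an empty path as "/" and trim a trailing slash (assumption).
    path = parts.path or "/"
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"

    # Drop ignored parameters, then sort so parameter order does not matter.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    query = urlencode(sorted(pairs))

    # The fragment is discarded because it is not sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


def remove_duplicates(urls):
    """Keep the first URL seen for each normalized form."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

With these rules, `http://example.com/a?x=1&y=2` and `http://EXAMPLE.com/a/?y=2&x=1&utm_source=news` map to the same normalized form, so a crawler would fetch only the first. Which parameters are safe to ignore depends on the site, so a real list would need to be derived per category or per host.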

     

     

  • How to Cite

    Goel, K., Shankar Prasad, J., & Hilal, S. (2018). Removing Duplicate URLs based on URL Normalization and Query Parameter. International Journal of Engineering & Technology, 7(3.12), 361-365. https://doi.org/10.14419/ijet.v7i3.12.16107