Menu Recognition Using Rendered Webpage Information and Machine Learning

  • Authors

    • Chan Choi
    • Minwoo Park
    • Geunseong Jung
    • Jaehyuk Cha
  • DOI

    https://doi.org/10.14419/ijet.v8i1.4.25456

  • Keywords

    web menu, machine learning, feature selection, logistic regression
  • Abstract

    Among the many components of a webpage, the web menu organizes the site and conveys information about its primary content. Recognizing the menu within a webpage makes it possible to identify the overall structure of the website from the menu's information, so that crawling and indexing can be performed efficiently without accessing pages unnecessarily. Web menu detection could also enable a variety of other applications. This paper proposes a method to categorize the web menu within a webpage using machine learning. Rendered attribute values of the web document are extracted, together with processed data, to offer more choices when selecting the attributes used for machine learning. The paper also proposes and demonstrates a Chrome extension-based webpage document collector that effectively collects the final rendered form of a webpage in a browser, along with the internal data required when performing machine learning for web menu categorization. Lastly, a demo platform that can detect webpage menus in real time is designed based on the learned result.
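
The categorization step described in the abstract can be illustrated with a minimal sketch. It assumes each candidate block of a rendered page has already been reduced to a few numeric attributes and fits a logistic regression (named in the keywords) with plain gradient descent. The three features (`link_density`, `relative_height`, `relative_top`), the toy training rows, and all function names are hypothetical and chosen only for illustration; they are not the attributes, data, or implementation used in the paper.

```python
import math

# Hypothetical rendered features per page block (illustrative only):
# [link_density, relative_height, relative_top]
# Label 1 = menu block, 0 = non-menu block. All rows are made-up toy data.
TRAIN = [
    ([0.95, 0.05, 0.02], 1),
    ([0.90, 0.08, 0.05], 1),
    ([0.85, 0.06, 0.95], 1),
    ([0.10, 0.60, 0.30], 0),
    ([0.20, 0.40, 0.50], 0),
    ([0.05, 0.70, 0.40], 0),
]


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Estimated probability that a block with these features is a menu."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(samples, epochs=2000, lr=0.5):
    """Fit logistic regression with batch gradient descent on the log loss."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in samples:
            # Gradient of the log loss w.r.t. the score is (p - y).
            error = predict_proba(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


def is_menu(weights, bias, features, threshold=0.5):
    """Classify a block as menu when its probability reaches the threshold."""
    return predict_proba(weights, bias, features) >= threshold


WEIGHTS, BIAS = train(TRAIN)
```

In this sketch a block would be classified with `is_menu(WEIGHTS, BIAS, [0.925, 0.065, 0.035])`. In practice the feature vector would come from attributes read out of the rendered page, and the choice of which attributes to keep is the feature selection problem the keywords refer to.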

  • How to Cite

    Choi, C., Park, M., Jung, G., & Cha, J. (2019). Menu Recognition Using Rendered Webpage Information and Machine Learning. International Journal of Engineering & Technology, 8(1.4), 458-470. https://doi.org/10.14419/ijet.v8i1.4.25456