Building Standard Offline Anti-phishing Dataset for Benchmarking

  • Authors

    • Kang Leng Chiew
    • Ee Hung Chang
    • Choon Lin Tan
    • Johari Abdullah
    • Kelvin Sheng
    • Chek Yong
    2018-12-09
    https://doi.org/10.14419/ijet.v7i4.31.23333
  • Keywords

    Anti-phishing, Dataset for benchmarking, Features, Legitimate and phishing webpages
  • Abstract

    Anti-phishing research is an active field in information security. Because no publicly accessible standard test dataset exists, most researchers run their experiments on their own datasets, which makes benchmarking across different anti-phishing techniques challenging and inefficient. In this paper, we propose and construct a large-scale standard offline dataset that is downloadable, universal and comprehensive. In designing the dataset creation approach, we thoroughly considered the major anti-phishing techniques in the literature to identify their unique requirements. This requirement study identified several factors that influence dataset quality: the type of raw elements, the source of the samples, the sample size, the website category, the category distribution, the language of the website and the support for feature extraction. These factors are at the core of the proposed dataset construction approach, which produced a collection of 30,000 samples of phishing and legitimate webpages, with 50 percent of each type. The dataset is therefore useful for and compatible with a wide range of anti-phishing research, both for benchmarking and for rapid proof-of-concept experiments. Given the rapid development of anti-phishing research to counter the fast evolution of phishing attacks, the need for such a dataset cannot be overemphasised. The complete dataset is available for download at http://www.fcsit.unimas.my/research/legit-phish-set.

  • How to Cite

    Leng Chiew, K., Hung Chang, E., Lin Tan, C., Abdullah, J., Sheng, K., & Yong, C. (2018). Building Standard Offline Anti-phishing Dataset for Benchmarking. International Journal of Engineering & Technology, 7(4.31), 7-14. https://doi.org/10.14419/ijet.v7i4.31.23333

    Received date: 2018-12-07

    Accepted date: 2018-12-07

    Published date: 2018-12-09