Automatic Classification of Arabic News and Column Articles using Machine Learning and Deep Learning Approaches

  • Authors

    • Hanen Himdi, University of Jeddah
    • Bayan Alotaibi
    • Dania Alsahafy
    • Gharam Alghamdi
    • Layan Alghamdi
    2023-12-22
    https://doi.org/10.14419/49cm2a56
  • Abstract

    News is a report of recent events that journalists working for commercial organizations distribute from its point of origin to receivers. Columns, on the other hand, are pieces published in dedicated sections of news platforms; they cover a variety of subjects and are often written by individuals or small groups of authors. Columns tend to present information more subjectively than news reports, which are bound by constraints of objectivity. A major concern is that many columns lack the editorial oversight applied in news production, which can have negative effects, such as the inclusion of incorrect information. In some cases, column articles on news platforms are mistaken for news articles, raising credibility concerns. To tackle this problem, a wide range of recent studies offer suitable techniques for column article classification. However, no such studies exist for Arabic. In this study, we introduce the first Arabic column article dataset, which includes more than 12k articles. We then built several classification models using machine learning and deep learning approaches. A deep learning model, CNN-LSTM trained with BERT, achieved the highest accuracy, reaching 96.6%. Finally, we propose a freely available web platform that classifies news and column articles based solely on their textual content.

  • How to Cite

    Himdi, H., Alotaibi, B., Alsahafy, D., Alghamdi, G., & Alghamdi, L. (2023). Automatic Classification of Arabic News and Column Articles using Machine Learning and Deep Learning Approaches. International Journal of Engineering & Technology, 12(2), 126-137. https://doi.org/10.14419/49cm2a56