Part of Speech Tagging for Arabic Long Sentence


  • Ahmed H. Aliwy
  • Duaa A. Al Raza







Part-of-speech (POS) tagging of Arabic words is a difficult and non-trivial task. It has been studied in detail for the last twenty years, and its performance affects many applications and tasks in natural language processing (NLP). Arabic sentences are very long compared with English sentences. This affects the tagging process for any approach that deals with the complete sentence at once, as in the Hidden Markov Model (HMM) tagger. In this paper, a new approach is suggested for using HMM and n-gram taggers to tag Arabic words in long sentences. The suggested approach is very simple and easy to implement. It was applied to a data set of 1000 documents containing 526321 manually annotated tokens (including punctuation). The results show that the suggested approach has higher accuracy than the HMM and n-gram taggers. The F-measures were 0.888, 0.925, and 0.957 for the n-gram tagger, the HMM tagger, and the suggested approach, respectively.


