A Hybrid Bootstrapping Approach for developing Odiya Named Entity Corpora from Wikipedia

  • Authors

    • Sitanath Biswas
    • Sujata Dash
    2018-12-03
    https://doi.org/10.14419/ijet.v7i4.38.24311
  • Named Entity Recognition, NER, Wikipedia, Machine Translation, Information Extraction, Information Retrieval.
  • Named Entity Recognition (NER) is an influential task in natural language processing, relevant to question answering systems, Machine Translation (MT), Information Extraction (IE), Information Retrieval (IR), and related applications. NER identifies and classifies the different types of proper nouns present in a given text, such as location names, person names, numbers, organization names, and times. Although considerable progress has been made for several Indian languages, NER remains a major problem for Odiya. Odiya is a resource-constrained language, and it is still very difficult to find a large, accurate corpus for training and testing. In this paper, we therefore use Wikipedia to develop a large Odiya corpus of annotated named entities that can serve as a training dataset. In our evaluation, we obtained a promising F-score of 78.89.

     

     

  • How to Cite

    Biswas, S., & Dash, S. (2018). A Hybrid Bootstrapping Approach for developing Odiya Named Entity Corpora from Wikipedia. International Journal of Engineering & Technology, 7(4.38), 11-16. https://doi.org/10.14419/ijet.v7i4.38.24311
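
The F-score reported in the abstract is conventionally the harmonic mean of precision and recall. The abstract does not state the evaluation protocol, so the following is only a minimal sketch that assumes exact-match, entity-level scoring; the function name, the span tuples, and the labels are hypothetical and are not taken from the paper.

```python
def entity_prf(gold, predicted):
    """Return (precision, recall, F1) for exact-match entity scoring.

    Each entity is a (start, end, label) tuple. A prediction counts as
    correct only if its span and label both match a gold entity.
    This protocol is an assumption, not the paper's documented method.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: three gold entities, three predictions,
# one of which has the right span but the wrong label.
gold_entities = {(0, 1, "PER"), (3, 4, "LOC"), (6, 8, "ORG")}
predicted_entities = {(0, 1, "PER"), (3, 4, "ORG"), (6, 8, "ORG")}

precision, recall, f1 = entity_prf(gold_entities, predicted_entities)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Under this scheme a mislabeled span counts as both a false positive and a false negative, which is why the toy example lands at two correct entities out of three for precision and recall alike.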