A Hybrid Bootstrapping Approach for developing Odiya Named Entity Corpora from Wikipedia

Sitanath Biswas; Sujata Dash

doi:10.14419/ijet.v7i4.38.24311


How to Cite

Biswas, S., & Dash, S. (2018). A Hybrid Bootstrapping Approach for developing Odiya Named Entity Corpora from Wikipedia. International Journal of Engineering and Technology, 7(4.38), 11-16. https://doi.org/10.14419/ijet.v7i4.38.24311

Received date: December 18, 2018

Accepted date: December 18, 2018

Published date: December 3, 2018
Keywords: Named Entity Recognition, NER, Wikipedia, Machine Translation, Information Extraction, Information Retrieval.
Abstract

Named Entity Recognition (NER) is an important task in natural language processing, with applications in question answering, Machine Translation (MT), Information Extraction (IE), and Information Retrieval (IR). The task is to identify and classify the proper nouns and related expressions in a given text, such as location names, person names, numbers, organization names, and time expressions. Although considerable progress has been made for several Indian languages, NER remains a hard problem for Odiya. Odiya is a resource-constrained language, and a large, accurate corpus for training and testing is still difficult to find. In this paper, we therefore use Wikipedia to develop a large Odiya corpus of annotated named entities that can serve as training data. In our evaluation, the approach achieves a promising F-score of 78.89.
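
The abstract does not say how the F-score was computed. As a minimal sketch, assuming the common entity-level convention in which a predicted entity counts as correct only when both its span and its type match the gold annotation, the metric could be calculated as follows. The function name `f_score`, the `(start, end, type)` span representation, and the toy annotations are illustrative assumptions, not taken from the paper.

```python
def f_score(gold, predicted, beta=1.0):
    """Return (precision, recall, F-beta) over sets of (start, end, type) spans.

    Assumption: exact-match scoring, i.e. a prediction is a true positive
    only if its boundaries and its entity type both equal a gold span.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical toy annotations: token offsets plus an entity type.
gold = {(0, 2, "PERSON"), (5, 6, "LOCATION"), (9, 11, "ORGANIZATION")}
predicted = {(0, 2, "PERSON"), (5, 6, "ORGANIZATION"), (9, 11, "ORGANIZATION")}

precision, recall, f1 = f_score(gold, predicted)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this toy case two of the three predictions match the gold spans exactly, so precision, recall, and F1 are all 2/3. A score reported as 78.89 would correspond to roughly 0.7889 on this 0 to 1 scale, assuming the paper reports percentages.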