Enriching Tweets for Topic Modeling via Linking to the Wikipedia

  • Abstract

    Twitter is currently one of the essential broadcasting services in the world of multimedia for spreading information, news, and events. Because tweets are short, carry limited contextual information, and are sporadic, noisy, and vague, learning topics from them remains a significant challenge. In this paper, the proposed approach overcomes these challenges by completing the contextual information of each tweet with descriptions attached from Wikipedia. Each tweet is enriched and integrated with its contextual information from Wikipedia in order to convert short posts into longer texts and to remove the deficiencies that arise during training. The procedure follows three steps. First, Twitter Named Entity Recognition (TNER) is used to classify the tweets by their nature and to select specific entities for the next step. Second, a connection is established through the Wikipedia API, the entities of each tweet are linked to Wikipedia, and the retrieved descriptions are appended to the tweet, creating a new dataset that feeds the preprocessing step. Finally, topic modeling with Latent Dirichlet Allocation (LDA) is applied.
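
The enrichment step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `enrich_tweet`, the in-memory table `FAKE_WIKI`, and the helper `fake_lookup` are hypothetical, and the descriptions are placeholder text rather than real Wikipedia content. The sketch assumes the entities have already been extracted by an NER step, and it takes the description lookup as a parameter so that a real Wikipedia API client could be substituted.

```python
from typing import Callable, Iterable, Optional


def enrich_tweet(
    tweet: str,
    entities: Iterable[str],
    lookup: Callable[[str], Optional[str]],
) -> str:
    """Append a description for each entity that the lookup can resolve."""
    descriptions = []
    seen = set()
    for entity in entities:
        key = entity.casefold()
        if key in seen:  # skip repeated mentions of the same entity
            continue
        seen.add(key)
        description = lookup(entity)
        if description:  # unresolved entities add nothing to the tweet
            descriptions.append(description)
    return " ".join([tweet, *descriptions])


# Hypothetical stand-in for Wikipedia: placeholder descriptions only.
FAKE_WIKI = {
    "reuters": "Reuters is a news agency.",
    "cnbc": "CNBC is a business news channel.",
}


def fake_lookup(entity: str) -> Optional[str]:
    """Return the placeholder description for an entity, or None."""
    return FAKE_WIKI.get(entity.casefold())


enriched = enrich_tweet(
    "Reuters and CNBC report on markets",
    ["Reuters", "CNBC", "reuters", "Unknown Corp"],
    fake_lookup,
)
print(enriched)
```

Passing the lookup as an argument keeps the sketch runnable offline and leaves open how descriptions are actually retrieved, which the abstract does not specify beyond naming the Wikipedia API.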

    Moreover, the two datasets (the original tweets and the enriched tweets) were compared in terms of their effect on the modeling representation and the nature of the resulting topics. Both models were also evaluated using perplexity and coherence.

    The Twitter dataset was collected via the API from several news accounts: Fox News, Reuters, and CNBC. The results indicate that the system affects the representation of topics in the topic model. The representation was better for the enriched tweets, and the tokens of each topic were more descriptive and meaningful, as indicated by the higher coherence of the second model, the one trained on the enriched tweets.
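
The two evaluation criteria named in the abstract can be sketched in a few lines of Python. The abstract does not say which coherence measure was used, so the UMass document co-occurrence measure below is an assumption chosen for illustration. The function names `perplexity` and `umass_coherence` and the toy documents are hypothetical. Lower perplexity and higher (closer to zero) UMass coherence are generally read as better.

```python
import math
from itertools import combinations
from typing import Sequence, Set


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def umass_coherence(top_words: Sequence[str], documents: Sequence[Set[str]]) -> float:
    """Sum over word pairs of log((D(w_i, w_j) + 1) / D(w_j)),
    where w_j is ranked above w_i and D counts containing documents."""

    def doc_freq(*words: str) -> int:
        return sum(1 for doc in documents if all(w in doc for w in words))

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):  # j < i
        higher, lower = top_words[j], top_words[i]
        d_higher = doc_freq(higher)
        if d_higher == 0:  # word never occurs: pair is skipped
            continue
        score += math.log((doc_freq(lower, higher) + 1) / d_higher)
    return score


# Toy corpus: each document is the set of tokens it contains (illustrative only).
docs = [
    {"stock", "market", "trade"},
    {"stock", "market"},
    {"stock", "football"},
    {"football", "league"},
]

coherent = umass_coherence(["stock", "market"], docs)
incoherent = umass_coherence(["stock", "league"], docs)
print(coherent > incoherent)

# A model that assigns probability 0.25 to every token.
print(perplexity([math.log(0.25)] * 8))
```

In this toy corpus the pair that co-occurs in documents scores higher than the pair that never appears together, which is the behaviour a coherence comparison between two topic models relies on.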



  • Keywords

    Twitter, Topic modeling, Wikipedia, TNER, NER, and LDA.

Article ID: 27969
DOI: 10.14419/ijet.v7i4.19.27969

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.