Enriching Tweets for Topic Modeling via Linking to the Wikipedia

Ghaidaa A. Al-Sultany; Hiba J. Aleqabie

doi:10.14419/ijet.v7i4.19.27969


About this article

DOI: https://doi.org/10.14419/ijet.v7i4.19.27969

Received: 26-02-2019

Accepted: 26-02-2019

Published: 27-11-2018


Keywords: Twitter, Topic modeling, Wikipedia, TNER, NER, and LDA.

Abstract

Twitter is currently one of the essential broadcasting services in the world of multimedia and a major channel for spreading information, news, and events. Because tweets are short, sporadic, noisy, and vague, and carry limited contextual information, learning topics from them remains a significant challenge. In this paper, the proposed approach addresses these challenges by completing the contextual information of each tweet with descriptions attached from Wikipedia. Each tweet is enriched and integrated with its contextual information from Wikipedia, which converts short posts into longer texts and removes the deficiencies that arise during training. The procedure follows three steps. First, Twitter Named Entity Recognition (TNER) is used to classify the tweets by their nature and to select specific entities for the next step. Second, a connection is established through the Wikipedia API, linking the entities of each tweet to Wikipedia and appending the retrieved descriptions to the tweet; this creates a new dataset that feeds the preprocessing step. Finally, topic modeling with Latent Dirichlet Allocation (LDA) is applied.
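The enrichment step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `extract_entities` is a naive capitalized-word heuristic standing in for TNER, and `STUB_DESCRIPTIONS` is a hypothetical in-memory table standing in for a Wikipedia API lookup, so the example runs offline.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a Wikipedia lookup (assumption, not from the paper).
# A real pipeline would query the Wikipedia API for each entity instead.
STUB_DESCRIPTIONS: Dict[str, str] = {
    "Reuters": "Reuters is an international news agency.",
    "CNBC": "CNBC is an American business news channel.",
}


def extract_entities(tweet: str) -> List[str]:
    """Naive placeholder for TNER: treat capitalized words as candidate entities."""
    # Strip URLs, mentions, and hashtags, which are noise for this toy heuristic.
    cleaned = re.sub(r"(https?://\S+|[@#]\w+)", " ", tweet)
    seen = set()
    entities = []
    for token in re.findall(r"\b[A-Z][a-zA-Z]+\b", cleaned):
        if token not in seen:  # keep first occurrence, preserve order
            seen.add(token)
            entities.append(token)
    return entities


def enrich_tweet(tweet: str, lookup: Callable[[str], Optional[str]]) -> str:
    """Append the description of every linked entity to the tweet text."""
    descriptions = []
    for entity in extract_entities(tweet):
        description = lookup(entity)
        if description:  # entities without a description are skipped
            descriptions.append(description)
    return " ".join([tweet] + descriptions)


tweet = "CNBC and Reuters report rising oil prices https://t.co/x"
enriched = enrich_tweet(tweet, STUB_DESCRIPTIONS.get)
print(enriched)
```

Passing the lookup as a function keeps the sketch independent of any particular Wikipedia client; the enriched strings would then be preprocessed and passed to an LDA implementation.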

Moreover, the two datasets were compared in terms of their effect on the model representation and the nature of the resulting topics. Both models were also evaluated using perplexity and coherence.
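As a rough illustration of the two evaluation criteria, the sketch below computes perplexity from per-token log-likelihoods and a UMass-style coherence from document co-occurrence counts. The abstract does not say which coherence measure was used, so the choice of UMass is an assumption, and the function names and toy corpus are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the negative mean per-token log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def umass_coherence(top_words: List[str], documents: List[List[str]]) -> float:
    """UMass-style coherence for one topic; top_words is ordered by rank."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words: str) -> int:
        # Number of documents that contain all of the given words.
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        higher, lower = top_words[i], top_words[j]  # i < j, so higher-ranked first
        denominator = doc_freq(higher)
        if denominator == 0:
            continue  # word never occurs; skip the pair to avoid division by zero
        score += math.log((doc_freq(higher, lower) + 1) / denominator)
    return score


# Toy corpus (assumption): three tokenized documents.
docs = [["oil", "price"], ["oil", "price"], ["oil", "market"]]
coherent = umass_coherence(["oil", "price"], docs)
incoherent = umass_coherence(["oil", "election"], docs)
print(coherent, incoherent)
```

Lower perplexity indicates a better fit to held-out text, while a higher (less negative) coherence indicates that a topic's top words tend to occur together in the same documents.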

The Twitter dataset was collected via the API from several Twitter accounts belonging to Fox News, Reuters, and CNBC. The results indicate that the system affects how topics are represented by the topic model. The representation was better for the enriched tweets, and the tokens of each topic were more descriptive and meaningful, as indicated by the higher coherence of the second model.



How to Cite

Al-Sultany, G. A., & Aleqabie, H. J. (2018). Enriching Tweets for Topic Modeling via Linking to the Wikipedia. International Journal of Engineering and Technology, 7(4.19), 609-614. https://doi.org/10.14419/ijet.v7i4.19.27969
