Graph-based Representation for Sentence Similarity Measure: A Comparative Analysis

  • Abstract

    Textual data are a rich source of knowledge; hence, sentence comparison has become an important task in text mining. Most previous work on text comparison has been performed at the document level; research suggests that comparing text at the sentence level is a non-trivial problem. One reason is that two sentences can convey the same meaning using entirely different words. This paper presents the results of a comparative analysis of three representation schemes, namely term frequency-inverse document frequency (TF-IDF), Latent Semantic Analysis and graph-based representation, using three similarity measures, namely cosine, Dice coefficient and Jaccard similarity, to compare sentences. The results reveal that the graph-based representation and the Jaccard similarity measure outperform the others in terms of precision, recall and F-measure.
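
    The three similarity measures named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: it assumes naive whitespace tokenization and raw term counts instead of the TF-IDF, Latent Semantic Analysis or graph-based representations that the paper compares, and the function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return sentence.lower().split()


def jaccard(a, b):
    """Jaccard similarity: |A & B| / |A | B| over the sets of distinct tokens."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def dice(a, b):
    """Dice coefficient: 2 * |A & B| / (|A| + |B|) over the sets of distinct tokens."""
    sa, sb = set(a), set(b)
    total = len(sa) + len(sb)
    return 2 * len(sa & sb) / total if total else 0.0


def cosine(a, b):
    """Cosine similarity between raw term-count vectors (not TF-IDF weighted)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical example sentences, chosen only for illustration.
s1 = tokenize("the cat sat on the mat")
s2 = tokenize("the cat lay on the rug")

print(jaccard(s1, s2))  # 3 shared tokens out of 7 distinct tokens
print(dice(s1, s2))     # 2 * 3 shared tokens over 5 + 5 distinct tokens
print(cosine(s1, s2))   # dot product 6 over norm product 8
```

    All three functions score only lexical overlap, so two paraphrases that share no words would score 0.0 under each of them. That is the difficulty the abstract points to when it notes that sentences can convey the same meaning with dissimilar words.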


  • Keywords

    Graph Based Representation, Latent Semantic Analysis, Text Representation, Text Similarity Measure, TF-IDF.





Article ID: 11149
DOI: 10.14419/ijet.v7i2.14.11149

Copyright © 2012-2015 Science Publishing Corporation Inc. All rights reserved.