An novel cluster based feature selection and document classification model on high dimension trec data

Lalitha Kumari; Ch. Satyanarayana

doi:10.14419/ijet.v7i1.1.10146


2017-12-21

https://doi.org/10.14419/ijet.v7i1.1.10146
Keywords: TREC Datasets, Information Retrieval, Document Clustering and Classification.
Abstract

TREC text documents are difficult to analyze with traditional document similarity measures, both in terms of their features and in terms of their relevant similar documents. As TREC repositories grow, finding relevant clustered documents in a large collection of unstructured documents becomes a challenging task. Traditional document similarity and classification models are implemented on homogeneous TREC data to find essential features of document entities that are similar to the TREC documents. Moreover, most traditional models are applicable only to limited text document sets. The main issues with traditional text mining models on TREC repositories are: 1) each document is represented as a vector with many sparse values; 2) they fail to find the semantic similarity of documents within and between clusters; 3) they have a high mean squared error rate. In this paper, a novel feature selection based clustering and classification model is proposed for a large number of different TREC repositories. Traditional latent semantic indexing and document clustering models fail to find topic relevance on large TREC clinical text document sets because of their computational memory and time requirements. The proposed document feature selection and cluster-based classification model is applied to TREC clinical benchmark datasets. The experimental results show that the proposed model is more efficient than the existing models in terms of computational memory, accuracy and error rate.
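To make the sparsity and similarity issues concrete, the following is a minimal sketch in plain Python. It is not the model proposed in the paper: it assumes a basic TF-IDF weighting and cosine similarity as a stand-in for the "traditional" vector representation, and the documents in `DOCS` as well as the function names `tfidf_vectors`, `cosine` and `sparsity` are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy documents; these are NOT taken from any TREC collection.
DOCS = [
    "patient with chest pain and fever",
    "patient reports chest pain after exercise",
    "gene expression in tumor cells",
]


def tfidf_vectors(docs):
    """Return (vectors, vocabulary); each vector is a sparse dict term -> weight."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
            if df[term] < n_docs  # a term found in every document has zero weight
        })
    return vectors, sorted(df)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sparsity(vectors, vocabulary):
    """Fraction of zero entries in the implied document-term matrix."""
    total = len(vectors) * len(vocabulary)
    nonzero = sum(len(vec) for vec in vectors)
    return 1.0 - nonzero / total


vectors, vocabulary = tfidf_vectors(DOCS)
print("sparsity:", round(sparsity(vectors, vocabulary), 3))
print("sim(doc0, doc1):", round(cosine(vectors[0], vectors[1]), 3))
print("sim(doc0, doc2):", round(cosine(vectors[0], vectors[2]), 3))
```

Even on three short documents, most entries of the document-term matrix are zero, and two documents that share no surface terms receive a similarity of exactly zero regardless of their meaning. This is the kind of limitation that motivates feature selection before clustering and classification.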
How to Cite
Kumari, L., & Satyanarayana, C. (2017). An novel cluster based feature selection and document classification model on high dimension trec data. International Journal of Engineering & Technology, 7(1.1), 466-471. https://doi.org/10.14419/ijet.v7i1.1.10146
Received date: 2018-03-14

Accepted date: 2018-03-14

Published date: 2017-12-21