Filter-Based Gene Selection Method for Tissues Classification on Large Scale Gene Expression Data

  • Authors

    • Farzana Kabir Ahmad
    • Yuhanis Yusof
    • Nooraini Yusoff
    2018-04-06
    https://doi.org/10.14419/ijet.v7i2.15.11216
  • Bioinformatics, Feature Selection, High Dimensional Data, Support Vector Machine.
  • DNA microarray technology is an innovative tool that offers a new perspective on cellular systems by measuring the expression of a large number of genes at once. Despite this novelty, most of its results rely on computational intelligence methods to interpret the large volume of data. At present, interpreting large-scale gene expression data remains a challenging problem because of its innate “high dimensional, low sample size” nature: microarray data typically involve thousands of genes, n, measured on a very small number of samples, p. In addition, analyses of these data are prone to overfitting and are easily confounded by the complexity of the data. It is also common for a large number of genes to be uninformative for classification purposes. For this reason, many studies have used feature selection methods to select significant genes that offer the maximum discriminative power between cancerous and normal tissues. In this study, we investigate and compare the effectiveness of four popular filter-based gene selection methods, namely Signal-to-Noise Ratio (SNR), Fisher Criterion (FC), Information Gain (IG) and t-Test, in selecting informative genes that can distinguish cancerous from normal tissues. Two common classifiers, Support Vector Machine (SVM) and Decision Tree (C4.5), are trained on the selected genes. The gene selection methods are tested on three large-scale gene expression datasets: a breast cancer dataset, a colon dataset, and a lung dataset. The results show that IG and SNR are better suited to SVM, while IG is suited to C4.5. On the colon dataset, SVM achieved a specificity of 86% with SNR and 80% with IG. In contrast, C4.5 obtained a specificity of 78% with IG on the same dataset. These results indicate that SVM performed slightly better than C4.5 on IG pre-processed data from the same dataset.
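    The abstract does not give the scoring formulas, so the following is only a minimal sketch of how two of the filter methods named above could rank genes. It assumes the definitions commonly used in the gene selection literature: SNR as the difference of class means divided by the sum of class standard deviations, and FC as the squared difference of class means divided by the sum of class variances. The function names, the toy expression matrix, and the label encoding (1 = cancer, 0 = normal) are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def snr_score(pos, neg):
    """Signal-to-Noise Ratio (assumed form): (mu1 - mu2) / (sigma1 + sigma2)."""
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def fisher_score(pos, neg):
    """Fisher Criterion (assumed form): (mu1 - mu2)^2 / (sigma1^2 + sigma2^2)."""
    return (mean(pos) - mean(neg)) ** 2 / (stdev(pos) ** 2 + stdev(neg) ** 2)


def rank_genes(expression, labels, score_fn, top_k):
    """Return the indices of the top_k genes, ranked by absolute filter score.

    expression: list of samples, each a list of per-gene expression values.
    labels:     one class label per sample (1 = cancer, 0 = normal).
    Note: a gene that is constant within both classes would give a zero
    denominator; real data would need a guard for that case.
    """
    n_genes = len(expression[0])
    scores = []
    for g in range(n_genes):
        pos = [row[g] for row, y in zip(expression, labels) if y == 1]
        neg = [row[g] for row, y in zip(expression, labels) if y == 0]
        scores.append(abs(score_fn(pos, neg)))
    order = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return order[:top_k]


# Hypothetical toy data: 6 samples x 3 genes. Only gene 0 separates the classes.
expression = [
    [5.0, 1.0, 3.0],
    [6.0, 1.2, 2.0],
    [5.5, 0.9, 4.0],
    [1.0, 1.1, 3.5],
    [1.5, 1.0, 2.5],
    [0.5, 1.3, 3.0],
]
labels = [1, 1, 1, 0, 0, 0]

top_snr = rank_genes(expression, labels, snr_score, top_k=1)
top_fc = rank_genes(expression, labels, fisher_score, top_k=1)
```

    Because a filter score is computed for each gene independently of any classifier, the same ranked subset can be passed to either SVM or C4.5, which is what makes a comparison across classifiers like the one described here possible.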

  • How to Cite

    Kabir Ahmad, F., Yusof, Y., & Yusoff, N. (2018). Filter-Based Gene Selection Method for Tissues Classification on Large Scale Gene Expression Data. International Journal of Engineering & Technology, 7(2.15), 68-71. https://doi.org/10.14419/ijet.v7i2.15.11216