A survey of machine learning techniques for genomic diseases and data sets

Manu Phogat; Dr. Dharmender Kumar

doi:10.14419/ijet.v7i4.11016


Authors
- Manu Phogat Guru Jambheshwar University of Science and Technology, Hisar, India, 125001
- Dr. Dharmender Kumar Guru Jambheshwar University of Science and Technology, Hisar, India, 125001
2019-04-07

https://doi.org/10.14419/ijet.v7i4.11016
Keywords: Machine Learning, ANN, KNN, RF, SVM, Genomic, Mutation.
Abstract

Since the earliest days of medical science, practitioners have struggled to visualize and analyze complex biological data. In the current era of genome-wide association studies (GWAS), interest in understanding the genotypes underlying complex diseases is growing rapidly. High-throughput molecular data now provide ample information about the whole genome and have popularized computational tools in genomics. Because genomic data are very large and high-dimensional, conventional techniques cannot analyze them effectively; machine learning instead offers computational techniques that improve with experience and can handle these vast, complex data sets. This article outlines machine learning techniques for analyzing genomic data on diseases, as well as epigenetic and proteomic data.
How to Cite
Phogat, M., & Kumar, D. (2019). A survey of machine learning techniques for genomic diseases and data sets. International Journal of Engineering & Technology, 7(4), 5533-5538. https://doi.org/10.14419/ijet.v7i4.11016
Received date: 2018-04-03

Accepted date: 2018-07-02

Published date: 2019-04-07
