Clustering and multiple imputation of missing data

  • Authors

    • Elsiddig Koko, Sudan University of Science & Technology, Faculty of Science, Department of Statistics
    • Amin Ibrahim Adam Mohamed
    2015-12-10
    https://doi.org/10.14419/ijbas.v5i1.5470
  • Keywords: Cluster Analysis, Missing Data, Multiple Imputation, Two-Step Cluster Analysis.
  • Abstract

    This work addresses the handling of missing values in cluster analysis. Two-Step Cluster Analysis is applied: each participant is classified into one of the identified patterns, and the optimal number of classes is determined using IBM SPSS Statistics. As with other multivariate statistical techniques, cluster analysis excludes any observation with missing data. Therefore, before the cluster analysis is performed, missing values are imputed using multiple imputation (IBM SPSS Statistics). The clustering results are displayed in tables. A further goal of the analysis is to reduce bias arising from the fact that non-respondents may differ from those who participate, and to bring the sample data up to the target population totals.
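
    The impute-then-cluster workflow described above can be illustrated with a minimal Python sketch. It is not the authors' SPSS procedure: as stated assumptions, it replaces SPSS multiple imputation with simple random draws from the observed values of each variable, and replaces Two-Step Cluster Analysis with a basic k-means. The toy data, the function names, and the parameters `k`, `m`, and `seed` are all hypothetical.

    ```python
    import random
    import statistics

    def impute_once(rows, rng):
        """Fill each missing value (None) with a random draw from the
        observed values of the same column (a stand-in for SPSS imputation)."""
        n_cols = len(rows[0])
        observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
        return [
            [v if v is not None else rng.choice(observed[j]) for j, v in enumerate(r)]
            for r in rows
        ]

    def kmeans(rows, k, rng, n_iter=20):
        """Basic k-means on complete data (a stand-in for Two-Step clustering)."""
        centers = rng.sample(rows, k)
        labels = [0] * len(rows)
        for _ in range(n_iter):
            # Assign each row to the nearest center (squared Euclidean distance).
            labels = [
                min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centers[c])))
                for r in rows
            ]
            # Move each center to the mean of its current members.
            for c in range(k):
                members = [r for r, lab in zip(rows, labels) if lab == c]
                if members:
                    centers[c] = [statistics.fmean(col) for col in zip(*members)]
        return labels, centers

    def multiply_impute_and_cluster(rows, k=2, m=5, seed=0):
        """Create m completed datasets and cluster each one separately."""
        rng = random.Random(seed)
        results = []
        for _ in range(m):
            completed = impute_once(rows, rng)
            labels, _ = kmeans(completed, k, rng)
            results.append((completed, labels))
        return results

    # Hypothetical toy data: two variables, None marks a missing value.
    data = [[1.0, 2.0], [1.2, None], [0.9, 1.8], [8.0, 9.0], [None, 9.2], [8.3, 8.8]]
    results = multiply_impute_and_cluster(data, k=2, m=5, seed=0)
    for completed, labels in results:
        print(labels)
    ```

    Clustering each completed dataset separately keeps every observation in the analysis, and comparing the label assignments across the `m` runs gives a rough sense of how sensitive the cluster solution is to the imputed values.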

  • How to Cite

    Koko, E., & Mohamed, A. I. A. (2015). Clustering and multiple imputation of missing data. International Journal of Basic and Applied Sciences, 5(1), 15-29. https://doi.org/10.14419/ijbas.v5i1.5470

    Received date: 2015-10-23

    Accepted date: 2015-12-05

    Published date: 2015-12-10