Missing data treatment method on cluster analysis

  • Authors

    • Elsiddig Elsadig Mohamed Koko, Sudan University of Science & Technology, Faculty of Science, Department of Statistics
    • Amin Ibrahim Adam Mohamed
    2015-10-18
    https://doi.org/10.14419/ijasp.v3i2.5318
  • Cluster Analysis, Missing Data, Multiple Imputation Method, Sudan Household Health Survey (SHHS).
  • Abstract

    Missing data in the household health survey posed a challenge for the researcher because they lead to incomplete analysis. Cluster analysis was applied to the data collected in Sudan's household health survey in 2006.

    The current research focuses on the data analysis, with the objective of dealing with missing values in cluster analysis. Two-Step Cluster Analysis is applied, in which each participant is classified into one of the identified patterns and the optimal number of classes is determined using IBM SPSS Statistics. However, the risk of over-fitting the data must be considered because cluster analysis is a multivariable statistical technique. As with other multivariable statistical techniques, any observation with missing data is excluded from the cluster analysis. Therefore, before performing the cluster analysis, missing values will be imputed using multiple imputation (IBM SPSS Statistics). The clustering results will be displayed in tables. Descriptive statistics and cluster frequencies will be produced for the final cluster model, while the information criterion table will display results for a range of cluster solutions.
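The workflow described in the abstract (impute first, then cluster, then compare an information criterion across a range of cluster solutions) can be illustrated with a small sketch. This is not the authors' SPSS procedure: the toy data, the random hot-deck draw used as a crude stand-in for model-based multiple imputation, the plain k-means used in place of Two-Step clustering, and the rough BIC-style score are all assumptions made for illustration only.

```python
import math
import random

def listwise_delete(rows):
    """Keep only rows with no missing (None) values."""
    return [r for r in rows if None not in r]

def hot_deck_impute(rows, rng):
    """Fill each missing value with a random observed value from the same column.
    Assumption: a crude stand-in for model-based multiple imputation."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [[v if v is not None else rng.choice(observed[j])
             for j, v in enumerate(r)] for r in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(rows, centers):
    """Index of the nearest center for every row."""
    return [min(range(len(centers)), key=lambda c: sq_dist(r, centers[c]))
            for r in rows]

def kmeans(rows, k, rng, n_iter=20):
    """Plain k-means; returns labels and the within-cluster sum of squares."""
    centers = [list(r) for r in rng.sample(rows, k)]
    for _ in range(n_iter):
        labels = assign(rows, centers)
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(rows, centers)
    sse = sum(sq_dist(r, centers[lab]) for r, lab in zip(rows, labels))
    return labels, sse

def bic_like(sse, n, k, d):
    """Rough BIC-style score (lower is better); not the SPSS Two-Step formula."""
    return n * math.log(max(sse, 1e-12) / n) + k * d * math.log(n)

# Hypothetical toy data: two variables, None marks a missing value.
data = [[1.0, 2.0], [1.2, None], [0.8, 1.9], [5.0, 6.1],
        [None, 5.9], [5.2, 6.0], [9.0, 1.0], [9.1, None]]

rng = random.Random(0)
m = 5  # number of imputed datasets
completed = [hot_deck_impute(data, rng) for _ in range(m)]

# Average the criterion over the m completed datasets for each candidate k.
avg_bic = {}
for k in (1, 2, 3):
    scores = [bic_like(kmeans(ds, k, rng)[1], len(ds), k, len(ds[0]))
              for ds in completed]
    avg_bic[k] = sum(scores) / m

best_k = min(avg_bic, key=avg_bic.get)
print("rows kept by listwise deletion:", len(listwise_delete(data)), "of", len(data))
print("average criterion by k:", avg_bic)
print("selected k:", best_k)
```

Listwise deletion keeps only the complete rows, whereas every imputed dataset retains all observations. Averaging the criterion over the imputed datasets is one simple way to choose the number of clusters, and other pooling rules are possible.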

  • Received date: 2015-09-14

    Accepted date: 2015-10-11

    Published date: 2015-10-18