Fast online detection of outliers using least-trimmed squares regression with non-dominated sorting based initial subsets

  • Authors

    • Mehmet Hakan Satman, Istanbul University
  • Published: 2015-04-04
  • DOI: https://doi.org/10.14419/ijasp.v3i1.4439
  • Keywords: Data streams, Initial subset selection, Non-dominated sorting, Outlier detection, Robust regression
  • Abstract: In this paper, a new algorithm is devised for computing the Least Trimmed Squares (LTS) estimator. The algorithm consists of two steps. In the first step, non-dominated sorting is applied to the design matrix of the regression data to select a clean initial subset of observations. In the second step, C-steps are iterated to refine the LTS estimates. The algorithm is fast and precise for small sample sizes; however, the sorting step is computationally inefficient on large datasets. For online data, a fast update mechanism can be used, with only a linear increase in computation time. Some properties of the sorting algorithm under certain transformations are also investigated. Results of applying the algorithm to several well-known datasets, together with Monte Carlo simulations, show that the proposed algorithm is suitable for use in many cases where computation time is the major objective and a moderate level of precision is sufficient.
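The two steps described in the abstract can be sketched in Python. This is an illustrative reading, not the paper's exact procedure: here non-dominated sorting is applied to absolute deviations from the column medians (an assumption; the paper's transformation of the design matrix may differ), successive fronts are peeled until `h` observations are collected, and concentration steps in the sense of Rousseeuw and Van Driessen then refine the fit.

```python
import numpy as np

def nondominated_front(D):
    """Indices of the non-dominated (Pareto-minimal) rows of D: row j
    dominates row i if it is <= in every column and < in at least one.
    Plain O(n^2) pairwise check, adequate for small samples."""
    n = D.shape[0]
    keep = []
    for i in range(n):
        dominated = any(
            j != i and np.all(D[j] <= D[i]) and np.any(D[j] < D[i])
            for j in range(n)
        )
        if not dominated:
            keep.append(i)
    return keep

def nds_initial_subset(D, h):
    """Peel successive non-dominated fronts of D until at least h
    observations are collected; return the first h indices."""
    remaining = list(range(D.shape[0]))
    chosen = []
    while len(chosen) < h and remaining:
        front = nondominated_front(D[remaining])
        front_set = set(front)
        chosen.extend(remaining[i] for i in front)
        remaining = [r for k, r in enumerate(remaining) if k not in front_set]
    return chosen[:h]

def c_steps(X, y, subset, h, n_iter=20):
    """Concentration steps: fit OLS on the current subset, then keep the
    h observations with smallest squared residuals; repeat until the
    subset stabilises. Returns the final coefficients and subset."""
    subset = list(subset)
    for _ in range(n_iter):
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        resid2 = (y - X @ beta) ** 2
        new_subset = list(np.argsort(resid2)[:h])
        if set(new_subset) == set(subset):
            break
        subset = new_subset
    return beta, subset
```

A typical use would compute `D = np.abs(X - np.median(X, axis=0))`, take `subset0 = nds_initial_subset(D, h)`, and pass it to `c_steps(X, y, subset0, h)`; outlying rows land in late fronts and are excluded from the initial subset, so the C-steps start from a largely clean fit.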

  • References

    [1] J. Agullo. New algorithms for computing the least trimmed squares regression estimator. Computational Statistics & Data Analysis, 36(4):425-439, 2001.

      [2] A. Atkinson. Fast very robust methods for the detection of multiple outliers. Journal of the American Statistical Association, 89(428):1329-1339, 1994.

      [3] A. C. Atkinson and T.-C. Cheng. Computing least trimmed squares regression with the forward search. Statistics and Computing, 9(4):251-263, 1999.

      [4] V. Barnett and T. Lewis. Outliers in statistical data. John Wiley & Sons, 1978.

      [5] C. Beguin and B. Hulliger. The BACON-EEM algorithm for multivariate outlier detection in incomplete survey data. Survey Methodology, 34(1):91, 2008.

      [6] N. Billor, S. Chatterjee, and A. S. Hadi. A re-weighted least squares method for robust regression estimation. American journal of mathematical and management sciences, 26(3-4):229-252, 2006.

      [7] N. Billor, A. S. Hadi, and P. F. Velleman. BACON: blocked adaptive computationally efficient outlier nominators. Computational Statistics & Data Analysis, 34(3):279-298, 2000.

      [8] N. Billor and G. Kiral. A comparison of multiple outlier detection methods for regression data. Communications in Statistics - Simulation and Computation, 37(3):521-545, 2008.

      [9] B. Brown. Statistical uses of the spatial median. Journal of the Royal Statistical Society. Series B (Methodological), pages 25-30, 1983.

      [10] K. A. Brownlee. Statistical theory and methodology in science and engineering, volume 150. Wiley, New York, 1965.

      [11] L. Davies. The asymptotics of Rousseeuw's minimum volume ellipsoid estimator. The Annals of Statistics, pages 1828-1843, 1992.

      [12] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan. A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. Lecture Notes in Computer Science, 1917:849-858, 2000.

      [13] D. Eddelbuettel and R. Francois. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):1-18, 2011.

      [14] A. Giloni and M. Padberg. Least trimmed squares regression, least median squares regression, and mathematical programming. Mathematical and Computer Modelling, 35(9):1043-1060, 2002.

      [15] A. S. Hadi and J. S. Simonoff. Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association, 88(424):1264-1272, 1993.

      [16] J. Hardin and D. M. Rocke. Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Computational Statistics & Data Analysis, 44(4):625-638, 2004.

      [17] D. M. Hawkins, D. Bradu, and G. V. Kass. Location of several outliers in multiple-regression data using elemental sets. Technometrics, 26(3):197-208, 1984.

      [18] R. A. Horn. The Hadamard product. In Proc. Symp. Appl. Math, volume 40, pages 87-169, 1990.

      [19] A. M. Leroy and P. J. Rousseeuw. Robust regression and outlier detection. John Wiley & Sons, 1987.

      [20] R. A. Maronna and V. J. Yohai. Robust estimation of multivariate location and scatter. Encyclopedia of Statistical Sciences, 1998.

      [21] R. Nunkesser and O. Morell. An evolutionary algorithm for robust regression. Computational Statistics & Data Analysis, 54(12):3242-3248, 2010.

      [22] D. Peña and V. J. Yohai. The detection of influential subsets in linear regression by using an influence matrix. Journal of the Royal Statistical Society. Series B (Methodological), pages 145-156, 1995.

      [23] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.

      [24] P. J. Rousseeuw and K. Van Driessen. An algorithm for positive-breakdown regression based on concentration steps. In Data Analysis, pages 335-346. Springer, 2000.

      [25] P. J. Rousseeuw and K. Van Driessen. Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12(1):29-45, 2006.

      [26] P. J. Rousseeuw and B. C. Van Zomeren. Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85(411):633-639, 1990.

      [27] M. H. Satman. A genetic algorithm based modification on the LTS algorithm for large data sets. Communications in Statistics - Simulation and Computation, 41(5):644-652, 2012.

      [28] M. H. Satman. A new algorithm for detecting outliers in linear regression. International Journal of Statistics and Probability, 2(3):101, 2013.

      [29] D. M. Sebert, D. C. Montgomery, and D. A. Rollier. A clustering algorithm for identifying multiple outliers in linear regression. Computational Statistics & Data Analysis, 27(4):461-484, 1998.

      [30] N. Srinivas and K. Deb. Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary computation, 2(3):221-248, 1994.

      [31] A. J. Stromberg. Computing the exact least median of squares estimate and stability diagnostics in multiple linear regression. SIAM Journal on Scientific Computing, 14(6):1289-1299, 1993.

      [32] K. Tamura and S. Miura. Necessary and sufficient conditions for local and global nondominated solutions in decision problems with multi-objectives. Journal of Optimization Theory and Applications, 28(4):501-523, 1979.

      [33] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New York, fourth edition, 2002. ISBN 0-387-95457-0.
