Similarity search optimization using recently-biased symbolic representation

  • Authors

    • Tamer Hassan FTSM-UKM-Malaysia
    • Zalinda Othman
    • Abdul Razak Hamdan
    2014-05-09
    https://doi.org/10.14419/ijbas.v3i2.2358
  • Abstract

    Dimension reduction is one of the important requirements for a successful representation to improve the efficiency of extracting the attracting trend patterns on the time series. Furthermore, an efficient and accurate similarity searching on a huge time series data set is a crucial problem in data mining preprocessing. Symbolic representations have proven to be a very effective way to reduce the dimensionality of time series without loss of knowledge. However, symbolic representations suffer from another challenges promoted by the possibility of losing some principal patterns due to the impractical utilization of dealing with the whole data with the same weight. The methodology utilized in this paper is proposed to overcome symbolic representation pattern mismatch. Moreover, the data dimensionality is reduced by keeping more detail on recent-pattern data and less detail on older ones using modified sliding window controlled by the corresponding classification error rate. Experimental results were made on the UCR standard dataset comparing with the state of the art techniques. The proposed techniques showed promising results. Furthermore, practical experiments were made on the Egyptian stock market indices EGX 30, EGX 70 and EGX 100. The discovered patterns showed the accuracy and effectiveness of the proposed approach.

     

    Keywords: Time Series Representation, Dimensionality Reduction, Data Mining, Pattern Discovery, Optimization.

  • References

    1. Phithakkitnukoon, Santi and Ratti, Carlo (2010). A recent-pattern biased dimension-reduction framework for time series data. Journal of Advances in Information Technology, 1(4), pp. 168–180.
    2. J. Aßfalg, H.-P. Kriegel, P. Kr¨oger, P. Kunath, A. Pryakhin, and M. Renz. Similarity search on time series based on threshold queries. In EDBT, 2006.
    3. Y. Chen, M. A. Nascimento, B. C. Ooi, and A. K. H. Tung. SpADe: On Shape-based Pattern Detection in Streaming Time Series. In ICDE, 2007.
    4. E. Keogh, K. Chakrabarti, M. Pazzani and S. Mehrotra,“Dimensionality reduction for fast similarity search in large time series databases”, Journal of Knowledge and Information Systems, Vol. 3, No. 3, 2000, pp. 263-286.
    5. Han, J.: Data mining techniques. In: Proceedings of the 1996 ACM SIGMOD International Conference on Management of data, Montreal, Quebec, Canada, p. 545 (1996).
    6. Y. Cai and R. T. Ng. Indexing spatio-temporal trajectories with chebyshev polynomials. In SIGMOD Conference, 2004.
    7. D. J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, 1994.
    8. J. Lin, E. Keogh, S. Lonardi, and B. Chiu, “A Symbolic Representation of Time Series, with Implications for Streaming Algorithms”, 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discover (DMKD 2003), June 13, 2003, pp. 2-11.
    9. L. Chen, M. T. Ozsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In SIGMOD Conference, 2005.
    10. L. Chen and R. T. Ng. On the marriage of lp-norms and edit distance. In VLDB, 2004.
    11. Q. Chen, L. Chen, X. Lian, Y. Liu, and J. X. Yu.Indexable PLA for Efficient Similarity Search. In VLDB, 2007.
    12. K. pong Chan and A. W.-C. Fu. Efficient Time Series Matching by Wavelets. In ICDE, 1999.
    13. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos.Fast Subsequence Matching in Time-Series Databases.In SIGMOD Conference, 1994.
    14. E. Frentzos, K. Gratsias, and Y. Theodoridis. Index-based most similar trajectory search. In ICDE, 2007.
    15. Jessica, L., Eamonn, K.: A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, California, pp. 2–11 (2003).
    16. John, F.R., Kathleen, H., Myra, S.: An Updated Bibliography of Temporal, Spatial, and Spatio-temporal Data Mining Research. In: Roddick, J., Hornsby, K.S. (eds.) TSDM 2000.LNCS (LNAI), vol. 2007, pp. 147–163. Springer, Heidelberg (2001).
    17. Jiawei Han and Micheline Kamber. Data Mining:Concepts and Techniques.2nd edition. Morgan Kaufmann Publishers, CA, 2006.
    18. Y.-L. Wu, D. Agrawal, and A. E. Abbadi. A Comparison of DFT and DWT based Similarity Search in Time-Series Databases. In CIKM, 2000.
    19. K. Kawagoe and T. Ueda. A Similarity Search Method of Time Series Data with Combination of Fourier and Wavelet Transforms. In TIME, 2002.
    20. Y. Chen, G. Dong, j. Han, B. W. Wah, and J. Wang, ”Multi-Dimensional Regression Analysis of Time Series Data Streams,” Procs. 2002 Int’l Conf. Very Large Data Bases (VLDB’02), 2002.
    21. P. L. Love and M. Simaan. Automatic recognition of primitive changes in manufacturing process signals. Pattern Recognition, 21(4):333-342, 1988.
    22. E. J. Keogh. A Decade of Progress in Indexing and Mining Large Time Series Databases. In VLDB, 2006.
    23. E. J. Keogh, K. Chakrabarti, S. Mehrotra, and M. J. Pazzani. Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. In SIGMOD Conference, 2001.
    24. E. J. Keogh, K. Chakrabarti, M. J. Pazzani, and S. Mehrotra. Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases.Knowl. Inf. Syst., 3(3), 2001.
    25. E. J. Keogh and S. Kasetty. On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration. Data Min. Knowl. Discov, 7(4), 2003.
    26. E. J. Keogh and C. A. Ratanamahatana. Exact indexing of dynamic time warping. Knowl. Inf. Syst., 7(3), 2005.
    27. C. Giannella, J. Han, J. Pei, X. Yan, and P.S. Yu, ”Mining Frequent Patterns in Data Streams at Multiple Time Granularities,”Data Mining: Next Generation Challenges and Future Directions, H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha, eds., AAAI/ MIT Press, 2003.
    28. C. Aggarwal, J. Han, J. Wang, and P. Yu, ”A Framework for Clustering Evolving Data Streams,” Procs. 29th Very Large Data Bases Conference (VLDB’03), pp. 81-92, Sept 2003.
    29. A. Bulut, and A. K. Singh,”SWAT: Hierarchical Stream Summarization in Large Networks,” Procs. 19th Int’l Conf. Data Eng. (ICDE’03), 2003.
    30. J. Lin, E. J. Keogh, L. Wei, and S. Lonardi. Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov., 15(2), 2007.
    31. M. D. Morse and J. M. Patel. An efficient and accurate method for evaluating time series similarity. In SIGMOD Conference, 2007.
    32. M. Vlachos, D. Gunopulos, and G. Kollios.Discovering similar multidimensional trajectories. In ICDE, 2002.
    33. Charest, M and S. Delisle: Ontology-guided intelligent data mining assistance: Combining declarative and procedural knowledge. Artificial Intelligence and Soft Computing 2006: p 9-14.
    34. I. Popivanov and R. J. Miller. Similarity Search over Time-Series Data Using Wavelets. In ICDE, 2002.
    35. Y. Zhao, and S. Zhang, ”Generalized Dimension-Reduction Framework for Recent-Biased Time Series Analysis,” IEEE Trans. on Knowledge and Data Eng., 18(2)231–244, 2006.
    36. M.M.M. Fuad, "Genetic Algorithms-Based Symbolic Aggregate Approximation",in Proc. DaWaK, 2012, pp.105-116.
    37. Muruga, D. Radha Devi, Maheswari, V. & Thambidur, P. 2010 "Similarity Search In Recent Biased Time Series Databases Using Vari-DWT and Polar Wavelets", pp.398-404.
    38. J. Lin, E. J. Keogh, L. Wei, and S. Lonardi, “Experiencing sax: a novel symbolic representation of time series,” Data Min. Knowl. Discov., vol. 15, no. 2, pp. 107–144, 2007.
    39. N. D. Pham, Q. L. Le, and T. K. Dang, “Two novel adaptive symbolic representations for similarity search in time series databases,” in APWeb,2010, pp. 181–187.
    40. T. Rakthanmanon and E. Keogh, “Fast shapelets: A scalable algorithm for discovering time series shapelets,” in SDM, 2013.
    41. E. Keogh, J. Lin, and A. Fu, “Hot sax: Efficiently finding the most unusual time series subsequence,” in Data Mining, Fifth IEEE International Conference on. IEEE, 2005, pp. 8–pp.
    42. B. Lkhagva, Y. Suzuki, and K. Kawagoe, “New time series data representation esax for financial applications,” in ICDE Workshops, 2006, p. 115.
    43. Keogh, E., Kasetty, S.: On the need for time series data mining benchmarks: a survey and empirical demonstration. DMKD, 7, 4, (2003).
    44. Ding, H., Trajcevski, G., Scheuermann, P.,Wang, X., Keogh, E.: Querying and mining of time series data: experimental comparison of representations and distance measures. In Proc. VLDB, 1542–1552 (2008).
    45. Xi, X., Keogh, E., Shelton, C., Wei, L., Ratanamahatana, C.: Fast time series classification using numerosity reduction. In Proc. ICML (2006).
    46. Sakoe, H. Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. on Acoustics, Speech and Signal Processing, 1, 43–49 (1978).
    47. Agrawal, R., Faloutsos, C., Swami, A.: Efficient Similarity Search In Sequence Databases. In Proc. FODO, 69–84 (1993).
    48. Lin, J., Khade, R., Li, Y.: Rotation-invariant similarity in time series using bag-of-patterns representation. J. Intell. Inf. Syst. 39, 2, 287–315 (2012).
    49. L. Ye and E. J. Keogh, “Time series shapelets: a novel technique that allows accurate, interpretable and fast classification,” Data Min. Knowl. Discov, vol. 22, no. 1-2, pp. 149–182, 2011.
    50. Nguyen & Duong T. Anh. 2007. "Combining SAX and Piecewise Linear Approximation to Improve Similarity Search on Financial Time Series" Information Technology Convergence, International Symposium on In Information Technology Convergence, pp.58-62.
  • Downloads

  • How to Cite

    Hassan, T., Othman, Z., & Hamdan, A. R. (2014). Similarity search optimization using recently-biased symbolic representation. International Journal of Basic and Applied Sciences, 3(2), 101-109. https://doi.org/10.14419/ijbas.v3i2.2358

    Received date: 2014-04-04

    Accepted date: 2014-05-02

    Published date: 2014-05-09