Bag-of-words from image to speech: a multi-classifier emotions recognition system

  • Authors

• Mai Ezz-Eldin, Electronics and Communications Department, Minia University, El-Minia
    • Hesham F. A. Hamed
    • Ashraf A. M. Khalaf
    2020-08-30
    https://doi.org/10.14419/ijet.v9i3.30958
• Keywords

    Bag-of-words (BoW), Mel frequency cepstral coefficients (MFCC), RAVDESS database, support vector machine (SVM), K-nearest neighbors (KNN), extreme gradient boosting (XGBoost)
  • Abstract

    Recognizing the emotional content of speech signals has recently received considerable research attention, and systems have been developed to recognize the emotional content of a spoken utterance. Achieving high accuracy in speech emotion recognition remains challenging because of issues related to feature extraction, type, and size. Central to this study is increasing emotion recognition accuracy by porting the bag-of-words (BoW) technique from image processing to speech for feature processing and clustering. The BoW technique is applied to features extracted as Mel frequency cepstral coefficients (MFCC), which enhances feature quality. The study deploys several classification approaches to examine the performance of the embedded BoW approach: support vector machine (SVM), K-nearest neighbors (KNN), naive Bayes (NB), random forest (RF), and extreme gradient boosting (XGBoost). Experiments used the standard RAVDESS audio dataset with eight emotions: angry, calm, happy, surprised, sad, disgusted, fearful, and neutral. The highest per-class accuracy, 85%, was obtained for the angry class using SVM, while the overall accuracy was 80.1%. The empirical results show that BoW achieves better accuracy and processing time than other available methods.
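
    To make the pipeline summarized in the abstract concrete, below is a minimal sketch, assuming librosa for MFCC extraction and scikit-learn for clustering and classification (the paper does not state its tooling). The codebook size, number of MFCC coefficients, train/test split, and helper names (mfcc_frames, bow_histogram, run_experiment) are illustrative assumptions, not the study's reported configuration.

    ```python
    # Sketch of a bag-of-words (BoW) front end over MFCC frames for speech
    # emotion recognition. All parameters below are assumed, not taken from
    # the paper.
    import numpy as np
    import librosa
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    N_MFCC = 13         # assumed number of cepstral coefficients per frame
    CODEBOOK_SIZE = 64  # assumed number of BoW "audio words" (k-means clusters)

    def mfcc_frames(path):
        """Return an (n_frames, N_MFCC) matrix of MFCC vectors for one utterance."""
        y, sr = librosa.load(path, sr=None)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T

    def bow_histogram(frames, codebook):
        """Quantize each MFCC frame to its nearest codeword and count occurrences."""
        words = codebook.predict(frames)
        hist = np.bincount(words, minlength=CODEBOOK_SIZE).astype(float)
        return hist / hist.sum()  # normalize so utterance length does not matter

    def run_experiment(files, labels):
        """files: list of RAVDESS wav paths; labels: matching emotion ids (assumed layout)."""
        all_frames = [mfcc_frames(f) for f in files]
        f_tr, f_te, y_tr, y_te = train_test_split(
            all_frames, labels, test_size=0.2, stratify=labels, random_state=0)

        # Learn the codebook (the "visual vocabulary" ported to audio) from
        # training frames only, so the test set does not leak into clustering.
        codebook = MiniBatchKMeans(n_clusters=CODEBOOK_SIZE, random_state=0)
        codebook.fit(np.vstack(f_tr))

        X_tr = np.array([bow_histogram(f, codebook) for f in f_tr])
        X_te = np.array([bow_histogram(f, codebook) for f in f_te])

        # Two of the classifiers compared in the paper; NB, RF, and XGBoost
        # plug into the same histogram features in the same way.
        for name, clf in [("SVM", SVC(kernel="rbf")),
                          ("KNN", KNeighborsClassifier(n_neighbors=5))]:
            clf.fit(X_tr, y_tr)
            print(name, "accuracy:", accuracy_score(y_te, clf.predict(X_te)))
    ```

    The normalized histograms are classifier-agnostic, which is what makes the BoW step a reusable front end ahead of the multi-classifier comparison described in the abstract.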


  • How to Cite

Ezz-Eldin, M., Hamed, H. F. A., & Khalaf, A. A. M. (2020). Bag-of-words from image to speech: a multi-classifier emotions recognition system. International Journal of Engineering & Technology, 9(3), 770-778. https://doi.org/10.14419/ijet.v9i3.30958