Speech Recognition Using Convolutional Neural Networks


  • D. Nagajyothi
  • P. Siddaiah






Neural Networks (NN), Convolutional Neural Networks (CNN).


Automatic speech recognition (ASR) is the process of converting speech signals into text transcripts. In the present era of the computer revolution, ASR plays a major role in making communication with machines more natural for the user. It removes the need for traditional input devices such as the keyboard and mouse, and lets the user carry out a wide range of tasks, such as controlling devices and interacting with customer care. In this paper, an ASR-based airport enquiry system is presented. The system has been developed natively for the Telugu language. The database is built from the most frequently asked questions in an airport enquiry. Because of its high performance, a Convolutional Neural Network (CNN) has been used for training and testing on the database. The salient features of weight sharing, local connectivity and pooling result in thorough training of the system, and thus in superior testing performance. Experiments performed on wideband speech signals show a significant improvement in system performance over traditional techniques.
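The three CNN properties named in the abstract (local connectivity, weight sharing and pooling) can be illustrated with a minimal sketch. This is not the authors' network: the frame values, the 3-tap kernel, the pooling size and the function names `conv1d_valid` and `max_pool` are all assumptions chosen for illustration, and the example convolves along the frequency bands of a single frame only.

```python
# Illustrative sketch only; sizes, weights and names are hypothetical,
# not taken from the paper.

def conv1d_valid(inputs, kernel):
    """Slide one shared kernel over the input (valid mode, stride 1).

    Local connectivity: each output depends on only len(kernel)
    neighbouring inputs.
    Weight sharing: the same kernel values are reused at every position.
    """
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, inputs[i:i + k]))
        for i in range(len(inputs) - k + 1)
    ]


def max_pool(values, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]


# Hypothetical filter-bank energies for one speech frame (8 bands).
frame = [0.1, 0.4, 0.9, 0.3, 0.2, 0.8, 0.7, 0.1]
kernel = [1.0, 0.0, -1.0]  # hypothetical 3-tap filter

feature_map = conv1d_valid(frame, kernel)  # 8 inputs -> 6 outputs
pooled = max_pool(feature_map, 2)          # 6 values -> 3 values
```

Pooling keeps the strongest response in each pair of neighbouring positions, so a small shift of a spectral pattern along the frequency axis tends to leave the pooled output unchanged.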



