Speech Recognition Using Convolutional Neural Networks
Keywords: Neural Networks (NN), Convolutional Neural Networks (CNN).
Automatic speech recognition (ASR) is the process of converting vocal speech signals into text transcripts. In the present era of the computer revolution, ASR plays a major role in enhancing the user experience by enabling natural communication with machines. It rules out the need for traditional input devices such as the keyboard and mouse, letting the user perform an endless array of tasks, from controlling devices to interacting with customer care. In this paper, an ASR-based airport enquiry system is presented. The system has been developed natively for the Telugu language. The database is created from the questions most frequently asked at an airport enquiry desk. Because of its high performance, a Convolutional Neural Network (CNN) has been used for training and testing on this database. Its salient features of weight sharing, local connectivity, and pooling result in a thorough training of the system, and thus in superior testing performance. Experiments performed on wideband speech signals show a significant improvement in system performance compared with traditional techniques.
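The three CNN properties the abstract names can be illustrated in a few lines. The following is a minimal NumPy sketch, not the paper's actual model: one shared filter slides over local windows of a toy frame-level feature stream (weight sharing and local connectivity), and non-overlapping max-pooling then keeps the strongest response per window, which is what gives a CNN acoustic model tolerance to small temporal shifts. The filter values and input are illustrative only.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D convolution: the SAME kernel (shared weights) is applied
    to every local window of the input, so each output depends only on a
    small neighbourhood (local connectivity)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool(x, size):
    """Non-overlapping max-pooling over time: keep the strongest filter
    response in each window of `size` frames."""
    n = len(x) // size
    return np.array([x[i * size:(i + 1) * size].max() for i in range(n)])

# Toy stand-in for a stream of frame-level speech features.
frames = np.array([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6])
kernel = np.array([0.5, -0.5])      # one shared filter (illustrative values)
feat = conv1d(frames, kernel)       # weight sharing + local connectivity
pooled = max_pool(feat, 2)          # pooling over time
```

In a full model, many such filters run in parallel over time-frequency feature maps, and the pooled outputs feed fully connected layers that predict phone or word labels.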