Dumlupınar Üniversitesi Sosyal Bilimler Dergisi, EYİ 2013 Özel Sayısı (EYİ 2013 Special Issue)

USE OF REGRESSION IN NOISY SPEECH RECOGNITION

Assist. Prof. Dr. Semih ERGİN, Eskisehir Osmangazi Üniversitesi, sergin@ogu.edu.tr
Assist. Prof. Dr. Rifat Aykut ARAPOĞLU, Eskisehir Osmangazi Üniversitesi, arapoglu@ogu.edu.tr

Abstract

In this study, we investigated the contribution of multiple regression to improving the recognition rates in robust noisy speech recognition. In the noisy speech recognition process, an Affine Transformation is first performed to map the feature vectors of noisy speech onto those of clean speech. After the transformation, recognition is carried out using the Common Vector Approach (CVA). We used several multiple linear and non-linear regression models, adding non-linear terms to the model during the affine transformation stage in order to improve the recognition rates. In the experimental study, recognition rates were obtained for noisy speech signals with Signal-to-Noise Ratio (SNR) values of 0 dB, 5 dB, 10 dB, and 20 dB. The noisy speech was generated in MATLAB by adding white Gaussian noise to clean speech taken from the Texas Instruments (TI) Digit Database. Improvements are observed when non-linear terms are introduced into the model.

Keywords: Regression analysis, Affine transformation, Robust speech recognition.

JEL Classification: C39

GÜRÜLTÜLÜ SES TANIMADA REGRESYON KULLANIMI (Use of Regression in Noisy Speech Recognition)

Özet (Abstract, translated from Turkish)

In this study, the contribution of multiple regression analysis to improving noisy speech recognition rates was investigated. In the process of recognizing speech in a noisy environment, an affine transformation that maps the feature vector of noisy speech onto the feature vector of clean speech is used first. After this transformation, the recognition stage is carried out with the Common Vector Approach (CVA). To improve the recognition rates, several linear and non-linear regression models were built during the affine transformation, and both linear and non-linear terms were added to the model. In the experimental studies, recognition rates were obtained for noisy speech signals with Signal-to-Noise Ratio (SNR) values of 0 dB, 5 dB, 10 dB, and 20 dB. The noisy speech with 20, 10, 5, and 0 dB SNR was obtained by adding white Gaussian noise in MATLAB to clean speech taken from the Texas Instruments (TI) Digit Database. Improvements in the recognition rates were observed when non-linear terms were added to the model.

Keywords: Regression analysis, Affine transformation, Robust speech recognition

JEL Code: C39

1. Introduction

In speech recognition, the original speech signals may be corrupted by environmental factors such as ambient noise, leading to misunderstandings on the part of the listener. Scientists have therefore been working on improving the performance of speech recognition systems. Most work in this area has focused on techniques that not only clean the corrupted or noisy speech signals but also transform the parameters of noisy speech into those of clean speech.

Robustness is a crucial issue in speech recognition for real-world applications, and most research in speech recognition has focused on obtaining high levels of robustness (Basbug, 2003; Chien, 2003; Karnjanadecha, 2001; Lee, 2003; Mammone, 1996). Two different approaches to robust speech recognition have been developed to address the problems caused in particular by channel effects and noisy environments. In the first approach, methods such as spectral subtraction, Wiener filtering, and subspace methods are applied before the feature vector extraction stage (Lee, Hyun, Choi, Go, & Lee, 2003). These methods suppress the effects of noise.
The second approach transforms the feature vectors of noisy speech into those of clean speech.

In this study, a modified Affine Transformation, a method of mapping the parameters of noisy speech onto those of clean speech, is examined. The TI Digit Database is used for both the training and the test sets. This database includes the pronunciation of the ten digits (0-9) and an additional 'Oh' for zero. The Affine Transform matrix for each digit is evaluated using the feature vectors of the same digit in the training set. The "root-melcep" parameters are calculated as the elements of the feature vectors. The Affine Transformation method can also be viewed as a linear regression model. This study aims to improve the performance of the traditional Affine Transformation method by adding non-linear terms to the regression model. The recognition rates for both the traditional and the modified Affine Transformations are also given.

2. Affine Transformation

In this study, the noisy speech signal $\phi[n]$ is obtained by adding white Gaussian noise $\tau[n]$ to the clean speech signal $\alpha[n]$, as indicated by Eq. (1):

$$\phi[n] = \alpha[n] + \tau[n]. \tag{1}$$

To express the relationship between noisy and clean speech signals, two matrices are first constructed from the feature vectors of the clean and the noisy speech:

$$\text{clean speech matrix} = C = \begin{bmatrix} c_{11} & c_{21} & \cdots & c_{N1} \\ c_{12} & c_{22} & \cdots & c_{N2} \\ \vdots & \vdots & \ddots & \vdots \\ c_{1p} & c_{2p} & \cdots & c_{Np} \end{bmatrix}, \tag{2}$$

$$\text{noisy speech matrix} = T = \begin{bmatrix} t_{11} & t_{21} & \cdots & t_{N1} \\ t_{12} & t_{22} & \cdots & t_{N2} \\ \vdots & \vdots & \ddots & \vdots \\ t_{1p} & t_{2p} & \cdots & t_{Np} \end{bmatrix}. \tag{3}$$

In Eqns. (2) and (3), $p$ is the number of features in each feature vector extracted from the speech samples of each speaker, and $N$ is the number of speakers in the training set of the TI Digit Database. Each column of $C$ and $T$ therefore holds the feature vector of one speaker. The relationship between the noisy and clean speech matrices is defined as (Mammone et al., 1996):

$$C = A \times T + B. \tag{4}$$

Here, $B$ is the displacement matrix between clean and noisy speech and $A$ is the Affine Transform matrix. The Affine Transform matrix $A$ can be expanded as:

$$A = \begin{bmatrix} a_{11} & \cdots & a_{1p} \\ \vdots & \ddots & \vdots \\ a_{p1} & \cdots & a_{pp} \end{bmatrix}. \tag{5}$$

Some algebraic operations are needed to linearize the relation between noisy and clean speech: the displacement is absorbed into the transform matrix as an extra column, and a row of ones is appended to $T$. After these operations are applied, Eqns. (6) and (7) are obtained:

$$C = A_{new} \times T_{new}, \tag{6}$$

where

$$A_{new} = \begin{bmatrix} a_{11} & \cdots & a_{1p} & b_{1i} \\ \vdots & \ddots & \vdots & \vdots \\ a_{p1} & \cdots & a_{pp} & b_{pi} \end{bmatrix}, \qquad T_{new} = \begin{bmatrix} t_{11} & t_{21} & \cdots & t_{N1} \\ t_{12} & t_{22} & \cdots & t_{N2} \\ \vdots & \vdots & \ddots & \vdots \\ t_{1p} & t_{2p} & \cdots & t_{Np} \\ 1 & 1 & \cdots & 1 \end{bmatrix}. \tag{7}$$

In Eq. (7), $b_{1i}, \ldots, b_{pi}$ are the elements of the $i$th column of the displacement matrix $B$, and the remaining part of $A_{new}$ is the Affine Transform matrix $A$.

After multiplying both sides of Eq. (6) by the transpose of $T_{new}$, Eq. (8) is obtained:

$$C \times (T_{new})^{T} = A_{new} \times \left(T_{new} \times (T_{new})^{T}\right). \tag{8}$$

Using Eq. (8), $A_{new}$ can be calculated as

$$A_{new} = C \times (T_{new})^{T} \times \left(T_{new} \times (T_{new})^{T}\right)^{-1}. \tag{9}$$

As noted for Eq. (7), the last column of $A_{new}$ is the displacement vector $b$, and the remaining part of $A_{new}$ is the Affine Transform matrix $A$.

3. Multiple Regression Approach

In statistics, regression analysis is a technique for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. A large body of techniques for carrying out regression analysis has been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data (David, 2005).
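
As a hedged illustration of Eq. (9), the following NumPy sketch estimates $A_{new}$ from synthetic data and compares it with the output of a standard least-squares solver. The function names `augment_with_ones` and `fit_affine`, the small dimensions, and the random data are assumptions made for this example only; they are not the features or settings used in the experiments.

```python
import numpy as np

def augment_with_ones(T):
    """Append a row of ones to T, giving T_new as in Eq. (7)."""
    return np.vstack([T, np.ones((1, T.shape[1]))])

def fit_affine(C, T):
    """Return A_new = C T_new^T (T_new T_new^T)^{-1}, as in Eq. (9).

    C, T: (p, N) arrays whose columns are clean / noisy feature vectors.
    """
    T_new = augment_with_ones(T)
    return C @ T_new.T @ np.linalg.inv(T_new @ T_new.T)

# Synthetic, noise-free example (hypothetical data, not TI Digit features).
rng = np.random.default_rng(0)
p, N = 4, 50
A_true = rng.normal(size=(p, p))
b_true = rng.normal(size=p)
T = rng.normal(size=(p, N))
C = A_true @ T + b_true[:, None]      # Eq. (4) with identical columns in B

A_new = fit_affine(C, T)
A_hat, b_hat = A_new[:, :-1], A_new[:, -1]   # split into A and b

# The same estimate from ordinary least squares on the transposed data matrix.
coef, *_ = np.linalg.lstsq(augment_with_ones(T).T, C.T, rcond=None)
max_diff = np.max(np.abs(A_new - coef.T))
print("largest difference between the two estimates:", max_diff)
```

The explicit inverse is used here only to mirror Eq. (9) term by term; for ill-conditioned data a least-squares solver such as `np.linalg.lstsq` is generally the numerically safer choice.
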
In this study, Eq. (9) closely resembles the formula for estimating linear regression coefficients by least squares. Eq. (9) differs from the usual least squares estimation technique in two respects. First, ordinary regression analysis estimates a single value for each observation, whereas Eq. (9) estimates a vector. Second, in traditional regression analysis the independent variables are placed in the columns of the data matrix, whereas in this paper they are written in its rows. The second difference is not a significant problem, since the same results are obtained when the data matrix in Eq. (9) is transposed.

In this paper, the simple idea of adding some quadratic terms to the data matrix is proposed, and the Affine Transformation matrices ($A_{new}$) are then recalculated. In this way, the effect of adding quadratic terms on the recognition performance (in other words, of finding more accurate Affine Transformation matrices) was investigated. Based on empirical observations, the number of added quadratic terms was limited to the range 1-10.

4. Experimental Study

In this study, the Texas Instruments (TI) Digit Database was used for both the training and the test sets. Because the digit "0" is recorded in two different ways ("zero" and "oh"), the database includes 11 different digit classes. Each digit is pronounced twice by 223 different speakers, so there are a total of 446 repetitions of each digit in the database.

In the analysis of the TI Digit Database, the silence regions at the beginning and at the end of each digit were first removed using energy and zero-crossing thresholds. Each utterance was then framed. During this process, the frame lengths were adjusted so that each speech signal was divided into ten frames.
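
The framing step can be sketched as follows. This is a minimal example that assumes contiguous, non-overlapping frames of near-equal length; whether the original frames overlap is not stated, and the helper name `split_into_frames` and the utterance lengths are hypothetical.

```python
import numpy as np

def split_into_frames(signal, n_frames=10):
    """Split a 1-D signal into n_frames contiguous frames of near-equal length.

    Assumption: frames do not overlap. The frame length adapts to the
    utterance length, so every utterance yields exactly n_frames frames.
    """
    return np.array_split(np.asarray(signal), n_frames)

# Two hypothetical silence-trimmed utterances of different lengths (in samples).
utterance_a = np.arange(4321)
utterance_b = np.arange(2500)

frames_a = split_into_frames(utterance_a)
frames_b = split_into_frames(utterance_b)

print(len(frames_a), [len(f) for f in frames_a])
print(len(frames_b), [len(f) for f in frames_b])
```

Fixing the number of frames rather than the frame length gives every utterance the same number of frames regardless of its duration, so the per-frame parameters can be stacked into feature vectors of equal dimension.
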
After framing, twenty-five Root Mel Frequency Cepstrum (root-melcep) parameters were calculated for each frame. These parameters were then stacked to construct a 250-dimensional feature vector for each repetition of each digit. In this way, the feature vectors for both the clean and the noisy speech of each digit were obtained.

Before the feature vectors of the noisy speech were transformed into those of clean speech using the Affine Transformation, a band-pass filter was applied to all noisy speech signals. The feature vectors of the filtered noisy speech signals were then transformed by the Affine Transformation matrices found with Eq. (9).

The database was divided into one training set and one test set. The training set was composed of 426 utterances of each digit, and the test set comprised the remaining 20 utterances of each digit. In order to test all speech in the database, the "leave-twenty-out" procedure was followed; the classification scheme was therefore repeated twenty-two times for each digit. The recognition rate was determined using the Common Vector Approach (CVA) (Gulmezoglu, 1999; Gulmezoglu, 2001). The feature vectors obtained from the noisy speech signals with Signal-to-Noise Ratio (SNR) values of 20, 10, 5, and 0 dB were used in the recognition process. An SNR of 0 dB means that the clean speech and the added noise have the same power.

In the second and crucial part of the experimental study, ten additional values were appended to the end of each feature vector. Each appended value was the corresponding one of the first ten values of the feature vector raised to a fixed exponent. Several exponents were selected empirically: (-1/3), (-1/6), (-1/7), (-1/8), and (-1/20). The dimension of each feature vector thus became 260. The transformation and classification schemes described above were then repeated for the noisy speech signals with 20, 10, 5, and 0 dB SNR values.
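
The extension of the feature vectors from 250 to 260 dimensions can be sketched as below. The helper name `augment_features` and the random data are hypothetical, and the sketch assumes that the first ten features are strictly positive, since a negative fractional exponent is not real-valued otherwise; how non-positive values were handled in the experiments is not stated.

```python
import numpy as np

def augment_features(X, exponent, n_terms=10):
    """Append power terms of the first n_terms features to each column of X.

    X: (p, N) array whose columns are feature vectors (p = 250 here).
    Returns a (p + n_terms, N) array.
    """
    X = np.asarray(X, dtype=float)
    head = X[:n_terms, :]
    if np.any(head <= 0):
        # Assumption of this sketch: only strictly positive values are raised
        # to a negative fractional exponent.
        raise ValueError("the first n_terms features must be strictly positive")
    return np.vstack([X, head ** exponent])

# Hypothetical positive features for 8 utterances (not real root-melcep values).
rng = np.random.default_rng(1)
X = rng.uniform(0.5, 2.0, size=(250, 8))

X_aug = augment_features(X, exponent=-1 / 3)
print(X.shape, "->", X_aug.shape)
```

The same augmentation would be applied to both the clean and the noisy feature matrices before Eq. (9) is re-evaluated, so that $A_{new}$ is estimated in the extended feature space.
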
The average recognition rates obtained with and without the added terms are given in Table 1.

Table 1. The average recognition rates (%) of all digits. Columns give the SNR value of the noisy speech.

| Exponent used in quadratic terms | 0 dB | 5 dB | 10 dB | 20 dB |
|---|---|---|---|---|
| No term added | 67.7 | 80.5 | 90.5 | 96.8 |
| (-1/3) | 70.5 | 87.1 | 95.7 | 98.5 |
| (-1/6) | 70.7 | 87.3 | 95.7 | 98.5 |
| (-1/7) | 70.6 | 87.4 | 95.7 | 98.5 |
| (-1/8) | 70.6 | 87.4 | 95.8 | 98.5 |
| (-1/20) | 70.6 | 87.3 | 95.7 | 98.5 |

5. Conclusion

In this study, the effect of multiple regression on robust noisy speech recognition was investigated with the aim of improving the recognition rates. To this end, ten additional quadratic values were appended to the end of each feature vector. The Affine Transformation matrices, which map the feature vectors of noisy speech onto those of clean speech, were then recalculated for each digit. The Common Vector Approach (CVA) was used as the classification scheme. The recognition rates are higher than those of the traditional Affine Transformation methodology, in which no term is added to the feature vectors: according to Table 1, the gains range from 1.7 percentage points at 20 dB SNR to 6.9 percentage points at 5 dB SNR. This indicates that more accurate Affine Transformation matrices can be obtained by adding quadratic values to the feature vectors of both clean and noisy speech. In future work, more discriminative terms in the feature space can be determined and irrelevant terms can be eliminated. The proposed regression analysis will then be applied to this reduced feature space.

REFERENCES

Basbug, F., Swaminathan, K. & Nandkumar, S. (2003). Noise reduction and echo cancellation front-end for speech codecs. IEEE Transactions on Speech and Audio Processing, 11(1), 1-13.

Chien, J.-T. (2003). Linear regression based Bayesian predictive classification for speech recognition. IEEE Transactions on Speech and Audio Processing, 11(1), 70-79.

David, A. F. (2005). Statistical Models: Theory and Practice. Cambridge: Cambridge University Press.

Gulmezoglu, M. B., Dzhafarov, V., Keskin, M. & Barkana, A. (1999). A novel approach to isolated word recognition. IEEE Transactions on Speech and Audio Processing, 7(6), 620-628.

Gulmezoglu, M. B., Dzhafarov, V. & Barkana, A. (2001). The Common Vector Approach and its relation to Principal Component Analysis. IEEE Transactions on Speech and Audio Processing, 9(6), 655-662.

Karnjanadecha, M. & Zahorian, S. A. (2001). Signal modeling for high-performance robust isolated word recognition. IEEE Transactions on Speech and Audio Processing, 9(6), 647-654.

Lee, C., Hyun, D., Choi, E., Go, J. & Lee, C. (2003). Optimizing feature extraction for speech recognition. IEEE Transactions on Speech and Audio Processing, 11(1), 80-87.

Mammone, J. R., Zhang, X. & Ramachandran, R. P. (1996, September). Robust speaker recognition – A feature based approach. IEEE Signal Processing Magazine, 58-71.
