Dumlupınar Üniversitesi Sosyal Bilimler Dergisi EYİ 2013 Özel Sayısı
USE OF REGRESSION IN NOISY SPEECH RECOGNITION
Asst. Prof. Dr. Semih ERGİN
Eskişehir Osmangazi University
[email protected]
Asst. Prof. Dr. Rifat Aykut ARAPOĞLU
Eskişehir Osmangazi University
[email protected]
Abstract
In this study, we investigated how multiple regression can improve recognition rates in robust
noisy speech recognition. In the recognition process, an Affine Transformation is first applied
to map the feature vectors of noisy speech onto those of clean speech. The transformed vectors
are then classified using the Common Vector Approach (CVA). We used several multiple linear and
nonlinear regression models, adding nonlinear terms to the model at the affine transformation
stage, to improve the recognition rates. In the experimental study, recognition rates were
obtained for noisy speech signals with Signal-to-Noise Ratio (SNR) values of 0 dB, 5 dB, 10 dB,
and 20 dB. The noisy speech was generated in MATLAB by adding white Gaussian noise to clean
speech taken from the Texas Instruments (TI) Digit Database. Recognition rates improved when
nonlinear terms were introduced into the model.
Keywords: Regression analysis, Affine transformation, Robust speech recognition.
JEL Classification: C39
GÜRÜLTÜLÜ SES TANIMADA REGRESYON KULLANIMI
Özet (Turkish abstract, in English translation)
In this study, the contribution of multiple regression analysis to improving noisy speech
recognition rates was investigated. In speech recognition in noisy environments, an affine
transformation that maps the feature vector of noisy speech onto the feature vector of clean
speech is applied first. After this transformation, the recognition stage is carried out with the
Common Vector Approach (Ortak Vektör Yaklaşımı, OVY). To improve the recognition rates, several
linear and nonlinear regression models were built at the affine transformation stage, and both
linear and nonlinear terms were added to the model. In the experiments, recognition rates were
obtained for noisy speech signals at Signal-to-Noise Ratio (SNR) values of 0 dB, 5 dB, 10 dB, and
20 dB. The noisy speech at 20, 10, 5, and 0 dB SNR was obtained by adding white Gaussian noise in
MATLAB to clean speech taken from the Texas Instruments (TI) Digit Database. Adding nonlinear
terms to the model improved the recognition rates.
Anahtar Kelimeler (Keywords): Regression analysis, Affine transformation, Robust speech recognition
JEL Kodu (JEL Code): C39
1. Introduction
In speech recognition, the original speech signal may be corrupted by environmental factors such
as ambient noise, leading to misunderstandings on the part of the listener. Researchers have
therefore been working to improve the performance of speech recognition systems. Most work in
this area has focused on techniques that not only clean the corrupted or noisy speech signal,
but also transform the parameters of noisy speech into those of clean speech.
Robustness is a crucial issue in speech recognition for real-world applications. Most research in
speech recognition has focused on obtaining high levels of robustness (Basbug, 2003; Chien,
2003; Karnjanadecha, 2001; Lee, 2003; Mammone, 1996). Two different approaches to robust
speech recognition have been developed to address the problems caused mainly by channel effects
and noisy environments. In the first approach, methods such as spectral subtraction, Wiener
filtering, and subspace methods are applied before the feature vector extraction stage
(Lee, Hyun, Choi, Go, & Lee, 2003). These methods suppress the effects of noise. In the second
approach, the feature vectors of noisy speech are transformed into those of clean speech.
In this study, a modified Affine Transformation, a method of mapping the parameters of
noisy speech onto those of clean speech, is examined. The TI Digit Database is used for both
the training and test sets. This database includes the pronunciation of the ten digits (0-9) and
an additional ‘Oh’ for zero. The Affine Transform matrix for each digit is estimated from the
feature vectors of the same digit in the training set. The “root-melcep” parameters are used
as the elements of the feature vectors.
The Affine Transformation method can also be viewed as a linear regression model. This study
aims to improve the performance of the traditional Affine Transformation method by adding
nonlinear terms to the regression model. The recognition rates for both the traditional and
modified Affine Transformations are given.
2. Affine Transformation
In this study, the noisy speech signal \( \phi[n] \) is obtained by adding white Gaussian noise \( \tau[n] \) to
the clean speech signal \( \alpha[n] \), as indicated by Eq. (1):
\[ \phi[n] = \alpha[n] + \tau[n]. \tag{1} \]
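The noise-addition step in Eq. (1) can be sketched in Python as follows. This is a minimal illustration, not the MATLAB procedure used in the study: the function names `add_white_gaussian_noise` and `snr_db`, the fixed random seed, and the sine-wave stand-in for clean speech are all assumptions made for the example. The noise is rescaled so that the ratio of clean-signal power to noise power matches the requested SNR.

```python
import math
import random


def add_white_gaussian_noise(clean, snr_db_target, rng=None):
    """Return (noisy, noise) where noisy[n] = clean[n] + noise[n], as in Eq. (1)."""
    rng = rng or random.Random(0)  # fixed seed: an assumption, for repeatability
    # Average power of the clean signal alpha[n].
    signal_power = sum(x * x for x in clean) / len(clean)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (snr_db_target / 10.0))
    # Draw unit-variance white Gaussian samples, then rescale them so the
    # realized noise power equals the target noise power.
    raw = [rng.gauss(0.0, 1.0) for _ in clean]
    raw_power = sum(v * v for v in raw) / len(raw)
    scale = math.sqrt(noise_power / raw_power)
    noise = [scale * v for v in raw]
    noisy = [c + t for c, t in zip(clean, noise)]
    return noisy, noise


def snr_db(clean, noise):
    """Measured SNR in dB between a clean signal and a noise sequence."""
    p_signal = sum(x * x for x in clean) / len(clean)
    p_noise = sum(v * v for v in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)


# Placeholder "clean speech": one second of a 440 Hz sine sampled at 8 kHz.
clean = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(8000)]
results = {}
for target in (20, 10, 5, 0):
    noisy, noise = add_white_gaussian_noise(clean, target)
    results[target] = snr_db(clean, noise)
    print(target, "dB requested, measured:", round(results[target], 6))
```

At 0 dB the scaled noise has the same average power as the clean signal, which is the hardest of the four conditions considered here.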
To express the relationship between noisy and clean speech signals, two matrices are first
constructed from the feature vectors of the noisy and clean speech.
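clean speech matrix:
\[
C = \begin{bmatrix}
c_{11} & c_{21} & \cdots & c_{N1} \\
c_{12} & c_{22} & \cdots & c_{N2} \\
\vdots & \vdots & \ddots & \vdots \\
c_{1p} & c_{2p} & \cdots & c_{Np}
\end{bmatrix},
\tag{2}
\]
noisy speech matrix:
\[
T = \begin{bmatrix}
t_{11} & t_{21} & \cdots & t_{N1} \\
t_{12} & t_{22} & \cdots & t_{N2} \\
\vdots & \vdots & \ddots & \vdots \\
t_{1p} & t_{2p} & \cdots & t_{Np}
\end{bmatrix}.
\tag{3}
\]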
In Eqns. (2) and (3), p is the number of features in each feature vector extracted from the speech
samples of each speaker and N is the number of speakers in the training set of the TI-Digit
Database. The relationship between noisy and clean speech matrices is defined as (Mammone et
al., 1996):
\[ C = A \times T + B. \tag{4} \]
Here, B is the displacement matrix between clean and noisy speech and A is the Affine
Transform matrix. The Affine Transform matrix, A, can be expanded as:
\[
A = \begin{bmatrix}
a_{11} & \cdots & a_{1p} \\
\vdots & \ddots & \vdots \\
a_{p1} & \cdots & a_{pp}
\end{bmatrix}.
\tag{5}
\]
Some algebraic operations are needed to linearize the relation between noisy and clean speech.
After those operations are applied, Eqns. (6) and (7) can be obtained.
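\[ C = A_{\mathrm{new}} \times T_{\mathrm{new}}, \tag{6} \]
where
\[
A_{\mathrm{new}} = \begin{bmatrix}
a_{11} & \cdots & a_{1p} & b_{1i} \\
\vdots & \ddots & \vdots & \vdots \\
a_{p1} & \cdots & a_{pp} & b_{pi}
\end{bmatrix},
\qquad
T_{\mathrm{new}} = \begin{bmatrix}
t_{11} & t_{21} & \cdots & t_{N1} \\
t_{12} & t_{22} & \cdots & t_{N2} \\
\vdots & \vdots & \ddots & \vdots \\
t_{1p} & t_{2p} & \cdots & t_{Np} \\
1 & 1 & \cdots & 1
\end{bmatrix}.
\tag{7}
\]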
In Eq. (7), b1i, ..., bpi are the elements of the ith column of the displacement matrix B and the
remaining part of Anew is the Affine Transform matrix A.
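After multiplying both sides of Eq. (6) by the transpose of Tnew, Eq. (8) is obtained:
\[ C \times (T_{\mathrm{new}})^{T} = A_{\mathrm{new}} \times \left( T_{\mathrm{new}} \times (T_{\mathrm{new}})^{T} \right). \tag{8} \]
Using Eq. (8), Anew can be calculated as
\[ A_{\mathrm{new}} = C \times (T_{\mathrm{new}})^{T} \times \left( T_{\mathrm{new}} \times (T_{\mathrm{new}})^{T} \right)^{-1}. \tag{9} \]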
As noted for Eq. (7), the last column of Anew is the displacement vector b, and the
remaining part of Anew is the Affine Transform matrix A.
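As an illustration of Eqs. (7) and (9), the following NumPy sketch estimates Anew from a pair of clean and noisy feature matrices and splits it into A and b. It is a sketch under stated assumptions rather than the implementation used in the study: the function name `estimate_affine`, the small dimensions (p = 4, N = 50), and the synthetic noise-free data are all hypothetical and chosen only so that the recovered matrices can be checked against known values.

```python
import numpy as np


def estimate_affine(C, T):
    """Estimate A_new = [A | b] from clean (C) and noisy (T) feature matrices.

    C and T are p x N, with one column per training vector, as in Eqs. (2)-(3).
    Returns (A, b) where A is p x p and b has length p.
    """
    p, N = T.shape
    # Eq. (7): append a row of ones so the displacement is absorbed into A_new.
    T_new = np.vstack([T, np.ones((1, N))])
    # Eq. (9): A_new = C T_new^T (T_new T_new^T)^{-1}.
    gram = T_new @ T_new.T
    A_new = C @ T_new.T @ np.linalg.inv(gram)
    # Last column is the displacement vector b; the rest is A.
    return A_new[:, :p], A_new[:, p]


# Synthetic check: build C from a known affine map and recover that map.
rng = np.random.default_rng(0)
p, N = 4, 50
A_true = rng.normal(size=(p, p))
b_true = rng.normal(size=p)
T = rng.normal(size=(p, N))
C = A_true @ T + b_true[:, None]

A_hat, b_hat = estimate_affine(C, T)
print("max |A_hat - A_true| =", np.abs(A_hat - A_true).max())
print("max |b_hat - b_true| =", np.abs(b_hat - b_true).max())
```

The explicit inverse mirrors Eq. (9) term by term. For large or poorly conditioned feature matrices, a least squares solver such as `np.linalg.lstsq` would likely be the more stable choice.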
3. Multiple Regression Approach
Regression analysis is a statistical technique for estimating the relationships among
variables. It includes many techniques for modeling and analyzing several variables when the
focus is on the relationship between a dependent variable and one or more independent variables.
More specifically, regression analysis helps one understand how the typical value of the
dependent variable changes when any one of the independent variables is varied while the other
independent variables are held fixed. A large body of techniques for carrying out regression
analysis has been developed. Familiar methods such as linear regression and ordinary least
squares regression are parametric, in that the regression function is defined in terms of a finite
number of unknown parameters that are estimated from the data (David, 2005).
In this study, Eq. (9) closely resembles the least squares estimator of the linear regression
coefficients. It differs from the usual least squares formulation in two respects. First,
ordinary regression analysis estimates a single dependent variable, whereas Eq. (9) estimates a
vector of dependent variables. Second, traditional regression analysis places the independent
variables in the columns of the data matrix, whereas in this paper they are written in its rows.
The second difference does not cause an important problem, since the same results are obtained
when the data matrix in Eq. (9) is transposed. In this paper, we propose the simple idea of adding
some quadratic terms to the data matrix and then recalculating the Affine Transformation matrices
(Anew). The effect of the added quadratic terms on recognition performance (in other words, on
obtaining more accurate Affine Transformation matrices) was then investigated. Based on
empirical observations, the number of added quadratic terms was limited to between 1 and 10.
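The claim that transposing the data matrix recovers the usual least squares formula can be written out in a few lines. In the derivation below, the symbols X, Y, and β are labels introduced only for this illustration; they do not appear elsewhere in the paper.

```latex
% Transpose Eq. (6) so that observations are in rows, as in ordinary regression.
\begin{align*}
C &= A_{\mathrm{new}} \, T_{\mathrm{new}}
  \;\Longrightarrow\;
  C^{T} = (T_{\mathrm{new}})^{T} (A_{\mathrm{new}})^{T}. \\
\intertext{Let $Y = C^{T}$ ($N \times p$), $X = (T_{\mathrm{new}})^{T}$ ($N \times (p+1)$),
and $\beta = (A_{\mathrm{new}})^{T}$, so that $Y = X\beta$. The ordinary least squares estimate is}
\hat{\beta} &= (X^{T} X)^{-1} X^{T} Y. \\
\intertext{Transposing, and using the symmetry of $(X^{T}X)^{-1}$,}
\hat{\beta}^{T} &= Y^{T} X \,(X^{T} X)^{-1}
  = C \,(T_{\mathrm{new}})^{T} \left( T_{\mathrm{new}} (T_{\mathrm{new}})^{T} \right)^{-1}
  = A_{\mathrm{new}},
\end{align*}
% which is Eq. (9).
```

Each column of β (equivalently, each row of Anew) is therefore the coefficient vector of one multiple regression that predicts a single clean-speech feature from all noisy-speech features plus an intercept.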
4. Experimental Study
In this study, the Texas Instruments (TI) Digit Database was used for both the training and test
sets. Because the digit “0” is recorded in two different ways (“zero” and “oh”), the database
includes 11 different digit classes. Each digit is pronounced twice by 223 different speakers,
giving a total of 446 repetitions of each digit in the database. In the analysis of the TI Digit
Database, the silence regions at the beginning and end of each digit were first removed using
energy and zero-crossing thresholds. Each utterance was then divided into frames, with the frame
length adjusted so that every speech signal is divided into exactly ten frames. After framing,
twenty-five Root Mel Frequency Cepstrum (root-melcep) parameters were calculated for each frame.
These parameters were stacked to construct a 250-dimensional feature vector for each repetition
of each digit. In this way, the feature vectors for both the clean and noisy speech of each digit
were obtained. Before the feature vectors of the noisy speech were transformed into those of
clean speech by the Affine Transformation, a band-pass filter was applied to all noisy speech
signals. The feature vectors of the filtered noisy speech signals were then transformed by the
Affine Transformation matrices found using Eq. (9).
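The framing and stacking steps described above can be sketched as follows. Two points are assumptions rather than details from the text: the frames are taken to be contiguous and non-overlapping, and `placeholder_features` is a hypothetical stand-in that only produces 25 numbers per frame. It is not a root-melcep computation, which the text does not specify in enough detail to reproduce here.

```python
def split_into_frames(signal, n_frames=10):
    """Split a signal into n_frames contiguous frames of near-equal length."""
    n = len(signal)
    # Integer boundaries so that the frames tile the whole signal.
    bounds = [(i * n) // n_frames for i in range(n_frames + 1)]
    return [signal[bounds[i]:bounds[i + 1]] for i in range(n_frames)]


def placeholder_features(frame, n_features=25):
    """Hypothetical stand-in for the 25 root-melcep parameters of one frame.

    Returns the mean absolute value of 25 sub-blocks of the frame, only so
    that the stacking step has an input of the right shape.
    """
    blocks = split_into_frames(frame, n_features)
    return [sum(abs(x) for x in b) / len(b) if b else 0.0 for b in blocks]


def utterance_to_feature_vector(signal, n_frames=10, n_features=25):
    """Stack per-frame features into one vector of length n_frames * n_features."""
    vector = []
    for frame in split_into_frames(signal, n_frames):
        vector.extend(placeholder_features(frame, n_features))
    return vector


# Toy utterance of arbitrary length (silence removal is assumed already done).
signal = [((7 * n) % 13) / 13.0 - 0.5 for n in range(3217)]
frames = split_into_frames(signal)
feature_vector = utterance_to_feature_vector(signal)
print(len(frames), "frames;", len(feature_vector), "features")
```

Because the number of frames is fixed and the frame length varies with the utterance, every repetition yields a vector of the same dimension regardless of its duration.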
The database was divided into one training set and one test set. The training set was composed
of 426 utterances of each digit, and the test set comprised the remaining 20 utterances of each
digit. In order to test all the speech in the database, the “leave-twenty-out” procedure was
followed, so the classification scheme was repeated twenty-two times for each digit. The
recognition rate was determined using the Common Vector Approach (CVA)
(Gulmezoglu, 1999; Gulmezoglu, 2001). The feature vectors obtained from the noisy speech
signals with Signal-to-Noise Ratio (SNR) values of 20, 10, 5, and 0 dB were used in the
recognition process. An SNR of 0 dB means that the clean speech and the added noise have the
same energy level.
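The leave-twenty-out procedure can be sketched as a simple index split. The function name `leave_k_out_splits` and the use of consecutive blocks as test sets are assumptions for illustration; the text does not state how the held-out utterances were chosen. Note that 22 rounds of 20 cover 440 of the 446 repetitions, and the text does not say how the remaining 6 were handled. In this sketch they simply stay in the training set in every round.

```python
def leave_k_out_splits(n_items, k=20):
    """Yield (train_indices, test_indices), holding out k consecutive items per round."""
    n_rounds = n_items // k
    for r in range(n_rounds):
        test = list(range(r * k, (r + 1) * k))
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test


# 446 repetitions of one digit, 20 held out per round.
splits = list(leave_k_out_splits(446, k=20))
tested = set()
for train, test in splits:
    tested.update(test)
print(len(splits), "rounds;", len(tested), "utterances tested at least once")
```

In each round the Affine Transformation matrices would be re-estimated from the 426 training utterances only, so that the 20 test utterances never influence the transformation applied to them.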
In the second and crucial part of the experimental study, ten additional values were appended to
the end of each feature vector. These values were obtained by raising the first ten values of the
feature vector to a given exponent. Several exponents were selected empirically: (-1/3), (-1/6),
(-1/7), (-1/8), and (-1/20). The dimension of each feature vector thus became 260. The
transformation and classification schemes described above were then repeated for the noisy
speech signals with 20, 10, 5, and 0 dB SNR values. The average recognition rates obtained with
and without the added terms are given in Table 1.
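The augmentation step can be sketched as below. The function name `augment` is hypothetical, and the sketch assumes that the first ten features are strictly positive: a negative fractional exponent is undefined at zero and yields complex values for negative inputs, and the text does not say how such cases were treated.

```python
def augment(feature_vector, exponent, n_terms=10):
    """Append x**exponent for each of the first n_terms entries of the vector."""
    head = feature_vector[:n_terms]
    # Assumption: inputs are strictly positive, so x**exponent is a real number.
    if any(x <= 0 for x in head):
        raise ValueError("this sketch only handles strictly positive features")
    extra = [x ** exponent for x in head]
    return list(feature_vector) + extra


# Toy 250-dimensional feature vector with positive entries.
vector = [1.0 + 0.01 * i for i in range(250)]
exponents = [-1 / 3, -1 / 6, -1 / 7, -1 / 8, -1 / 20]
augmented = {e: augment(vector, e) for e in exponents}
for e in exponents:
    print(round(e, 4), len(augmented[e]))
```

After augmentation the noisy data matrix has ten extra rows, so the regression in Eq. (9) gains ten additional nonlinear regressors while remaining linear in its coefficients.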
Table 1. The average recognition rates (%) of all digits.

Exponent used in         SNR Values of Noisy Speech
quadratic terms      0 dB     5 dB     10 dB    20 dB
No term added        67.7     80.5     90.5     96.8
(-1/3)               70.5     87.1     95.7     98.5
(-1/6)               70.7     87.3     95.7     98.5
(-1/7)               70.6     87.4     95.7     98.5
(-1/8)               70.6     87.4     95.8     98.5
(-1/20)              70.6     87.3     95.7     98.5
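To make the size of the improvement explicit, the gains over the "No term added" row can be computed directly from the values in Table 1. The short script below is only arithmetic on those reported numbers; the variable names are chosen for this illustration.

```python
# Recognition rates (%) copied from Table 1, keyed by SNR in dB.
baseline = {0: 67.7, 5: 80.5, 10: 90.5, 20: 96.8}
rates = {
    "-1/3": {0: 70.5, 5: 87.1, 10: 95.7, 20: 98.5},
    "-1/6": {0: 70.7, 5: 87.3, 10: 95.7, 20: 98.5},
    "-1/7": {0: 70.6, 5: 87.4, 10: 95.7, 20: 98.5},
    "-1/8": {0: 70.6, 5: 87.4, 10: 95.8, 20: 98.5},
    "-1/20": {0: 70.6, 5: 87.3, 10: 95.7, 20: 98.5},
}

# Gain in percentage points for every exponent and SNR.
gains = {
    exp: {snr: round(row[snr] - baseline[snr], 1) for snr in baseline}
    for exp, row in rates.items()
}
all_gains = [g for row in gains.values() for g in row.values()]

for exp, row in gains.items():
    print(exp, row)
print("smallest gain:", min(all_gains), "largest gain:", max(all_gains))
```

The gains are largest at the intermediate SNR values (5 dB and 10 dB) and vary only slightly with the choice of exponent.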
5. Conclusion
In this study, the effect of multiple regression on robust noisy speech recognition was
investigated with the aim of improving the recognition rates. To this end, ten quadratic values
were appended to the end of each feature vector. The matrices of the Affine Transformation,
which maps the feature vectors of noisy speech onto those of clean speech, were then
recalculated for each digit. The Common Vector Approach (CVA) was used as the classification
scheme.
When the recognition rates are examined, the proposed method outperforms the traditional Affine
Transformation methodology, in which no term is added to the feature vectors. According to
Table 1, the gain ranges from 1.7 percentage points at 20 dB SNR to 6.9 percentage points at
5 dB SNR. This indicates that more accurate Affine Transformation matrices can be obtained by
adding quadratic values to the feature vectors of both clean and noisy speech. In future work,
more discriminative terms in the feature space can be determined and irrelevant terms can be
eliminated. The proposed regression analysis will then be applied to this reduced feature space.
REFERENCES
Basbug, F., Swaminathan, K. & Nandkumar, S. (2003). Noise reduction and echo cancellation
front-end for speech codecs. IEEE Transactions on Speech and Audio Processing, 11(1), 1-13.
Chien, J.-T. (2003). Linear regression based Bayesian predictive classification for speech
recognition. IEEE Transactions on Speech and Audio Processing, 11(1), 70-79.
David, A. F. (2005). Statistical Models: Theory and Practice. Cambridge: Cambridge University
Press.
Gulmezoglu, M. B., Dzhafarov, V., Keskin, M. & Barkana, A. (1999). A novel approach to
isolated word recognition. IEEE Transactions on Speech and Audio Processing, 7(6), 620-628.
Gulmezoglu, M. B., Dzhafarov, V. & Barkana, A. (2001). The Common Vector Approach and
its relation to Principal Component Analysis. IEEE Transactions on Speech and Audio
Processing, 9(6), 655-662.
Karnjanadecha, M. & Zahorian, S. A. (2001). Signal modeling for high-performance robust
isolated word recognition. IEEE Transactions on Speech and Audio Processing, 9(6), 647-654.
Lee, C., Hyun, D., Choi, E., Go, J. & Lee, C. (2003). Optimizing feature extraction for speech
recognition. IEEE Transactions on Speech and Audio Processing, 11(1), 80-87.
Mammone, J. R., Zhang, X. & Ramachandran, R. P. (1996, September). Robust speaker
recognition – A feature based approach. IEEE Signal Processing Magazine, 58-71.