Research Article
 

Modified-VQ Features for Speech Emotion Recognition



Hemanta Kumar Palo and Mihir Narayan Mohanty
 
ABSTRACT

Objective: Signal features play a major role in recognition, classification and detection tasks. Achieving effective recognition with fewer features is the challenge that motivates this work. In this study, a modified Vector Quantized (VQ) feature for emotional speech recognition is proposed. Methodology: The proposed feature is based on statistical VQ and differential VQ statistics of frame-level prosodic features, derived at utterance level. Combinations of frame-level baseline features, VQ based frame-level prosodic features and modified VQ prosodic features at utterance level are compared and analyzed. Neural network based classifiers, namely the multilayer perceptron (MLP) and the Radial Basis Function Network (RBFN), have been tested with the proposed combinations. The standard Berlin emotional (EMO-DB) database and a locally collected emotional speech database have been used to validate the methods. Results: The modified VQ feature combinations outperformed all other feature combinations in terms of classification accuracy and Mean Square Error (MSE). Conclusion: With the modified VQ based feature combination, the highest accuracy on the EMO-DB database was 91.08% with the RBFN classifier and 89.93% with the MLP classifier. By comparison, the accuracy was 90.38% (RBFN) and 88.05% (MLP) with the VQ based prosodic feature combination and 85.79% (RBFN) and 84.04% (MLP) with the frame-level prosodic feature combination.


 
  How to cite this article:

Hemanta Kumar Palo and Mihir Narayan Mohanty, 2016. Modified-VQ Features for Speech Emotion Recognition. Journal of Applied Sciences, 16: 406-418.

DOI: 10.3923/jas.2016.406.418

URL: https://scialert.net/abstract/?doi=jas.2016.406.418
 
Received: May 30, 2016; Accepted: July 15, 2016; Published: August 15, 2016



INTRODUCTION

Human expressions conveyed through faces, physical movements, voices and cultural artifacts can be used for emotion perception and recognition1. Among these, recognition from diverse spoken languages is complex and hence the least analyzed. Many application areas, such as human resource management, criminal investigation, bio-medical engineering, banking and finance, gaming and computer tutoring, increasingly rely on speech emotions and require an effective recognition model to enhance their applicability. However, the absence of distinct and reliable features, a genuine database, a well-defined theoretical framework relating human beings to emotions and an efficient classifier makes these emotional models more complex. Henceforth, developing a reliable recognition system remains an ever-growing challenge for researchers. The claimed accuracy varies widely across the literature depending on these factors2-15, which raises further doubts about the reliability of reported recognition results.

Anger, boredom, happiness, fear, surprise and neutral are the most discussed emotional categories in this area2-4,8-26. A review of the related literature points to the need for features that describe these speech emotions adequately15,22,26. Features extracted either at frame level or at utterance level are often attempted2,9,12,14,16-36. Nevertheless, the recognition accuracy for these speech emotions is below 70% when individual features are tested alone6,15,23-24,28,30,36. This has led speech engineers to explore effective feature reduction and combination mechanisms as an alternative for enhanced performance2-3,15,22,32. A few conventional feature reduction mechanisms, such as statistical techniques, Principal Component Analysis (PCA) and vector quantization, have often been discussed in the literature2,5,11,15-16,18,20,22,25,33-39. PCA based reduced features have been reported to give accuracies of 40.7% (for male) and 36.4% (for female)16, below 73%23 and 79.9%4 for emotional speech recognition based on acoustic features. However, the variables projected by PCA are not guaranteed to have a physical meaning, which makes the selected variables difficult to interpret. Further, features reduced using PCA tend to be less reliable25. Statistical methods have been quite successful in reducing features for effective recognition15,20. Nevertheless, the loss of temporal information in describing speech emotion makes the method less efficient2,36. Vector Quantization (VQ) has been successfully employed for speech coding, speaker recognition and emotional speech detection11,22,39. Further, features reduced using VQ tend to be superior to both linear and non-linear PCA based features5. Hence, the VQ method of feature reduction, together with statistical techniques, is adopted in this study.

Combining effective individual features has been another major source of enhanced accuracy. A prosodic feature combination has provided 62.7% accuracy39; the same authors suggested using the VQ method of reduction along with MFCC, which improved the accuracy to 71.7%. Reported accuracies of 70% with a spectral and prosodic combination6, 83.55% with an acoustic, semantic and prosodic combination37, 95.55% with a VQ based spectral and time-frequency feature combination22, 88.7% with a spectral combination21 and 83.1 and 92% using linguistic and acoustic level fusion31 are a few of the works that motivated the authors in this direction. Further improvement in performance was observed in our earlier effort involving feature reduction and combination together22. However, determining a suitable combination among hundreds or even thousands of available features is quite cumbersome. The literature favors combinations of features that provide complementary information15,22. We have approached our problem based on these factors and explore the benefits of both reduction and combination. The classifier was selected after an extensive study of this area so that it is compatible with our proposed feature extraction methods.

Gaussian Mixture Model (GMM) classifiers are capable of generating an adequate model for emotions involving large feature sets2. However, for reduced feature sets, neural network (NN) based classifiers outperform both GMM and Hidden Markov Model (HMM) classifiers2,22,31. Hence, to maintain compatibility with our proposed reduced features, NN based classifiers have been chosen for this study.

The following issues are investigated in this work. Firstly, VQ is used to reduce the chosen prosodic features. Secondly, VQ is modified with its statistical and differential values to reduce the features further. An individual reduction mechanism alone has not improved the recognition much, as reported in2,6,15,26,33. Hence, a modified reduction mechanism based on the statistics of both VQ and differential VQ is used to derive the proposed features. The use of differential VQ enhances feature robustness by including the temporal dynamics that are lost during statistical analysis. Thirdly, the mechanism is validated on a cross-corpus platform. For this purpose, a locally collected database has been tested along with the standard Berlin EMO-DB database using our proposed features. Fourthly, the reduction and classifier design parameters have been selected iteratively for optimum efficiency. Finally, combinations of features of similar nature are formed at both frame level and utterance level, including our proposed setup.

MATERIALS AND METHODS

Proposed methods: The literature survey makes it clear that a combination of features can enhance the recognition level6,21-22,37,39. It also suggests combining features that bear similar characteristics15,22. The chosen features should provide information that is either complementary or supplementary in nature. Hence, in this study, a few prosodic features, namely Zero Crossing Rate (ZCR), fundamental frequency (F0), autocorrelation coefficients and Short Time Energy (STE), bearing either complementary or supplementary information, are selected for possible combinations. In most cases, researchers have applied different reduction techniques to a single database. In that setting some trends may surface naturally, but a consensus among the designated features is quite difficult to establish32. The reason may be that the reduced features vary among different databases.

This study covers the proposed modified-VQ feature extraction techniques and the classification scheme used for efficient recognition of the selected speech emotions with effective design parameters. The proposed scheme involves the following steps:

Proposed feature extraction technique: In this study, three feature extraction techniques are used and suitable combinations are tested for better accuracy. Finally, all these techniques are compared. They are:

Baseline frame-level feature combination
Reduced VQ based frame-level feature combination
Modified VQ utterance-level feature combination using both statistical VQ and differential VQ feature statistics

Baseline features: The basic prosodic features considered at frame level are Zero Crossing Rate (ZCR), fundamental frequency (F0), autocorrelation coefficients and Short Time Energy (STE). They are extracted from the framed signal.

The energy of the emotional speech signal indicates the arousal level and the presence of higher frequency components in the signal, so it is informative for recognizing speech emotions. To approximate the non-stationary speech signal x(n) as a quasi-periodic signal over short intervals, the STE is computed here. The STE is found from:

(1)

where m = 1, 2,..., M and M is the total number of features in a frame.
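As a rough illustration of the STE computation, the sketch below uses a common textbook definition, the sum of squared windowed samples in one frame. This definition, the helper names `hamming` and `short_time_energy`, the 16 kHz sampling rate and the synthetic sine frames are assumptions made for illustration and may differ in detail from Eq. 1.

```python
import math

def hamming(length):
    """Hamming window coefficients (standard 0.54/0.46 form)."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]

def short_time_energy(frame):
    """Sum of squared windowed samples of a single frame."""
    window = hamming(len(frame))
    return sum((x * w) ** 2 for x, w in zip(frame, window))

# Two synthetic 30 ms frames at an assumed 16 kHz rate (480 samples each):
# the same 200 Hz tone, once loud and once with one eighth of the amplitude.
loud = [0.8 * math.sin(2 * math.pi * 200 * n / 16000) for n in range(480)]
quiet = [s / 8 for s in loud]

energy_loud = short_time_energy(loud)
energy_quiet = short_time_energy(quiet)
```

Scaling the amplitude by 1/8 scales the energy by 1/64, which is why STE tracks the arousal level of the signal so directly.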

To account for the periodicity information of emotions in the speech signal, autocorrelation coefficients (ACF) are quite useful. Further, these features provide complementary energy information for effective feature combination. The coefficients can be estimated with a time lag τ as:

(2)

The F0 has been highly effective in describing human speech emotions2,26,33. The autocorrelation method is used to find F0 in this study since it is robust, simple and reliable33. The ACF attains its maximum value when x(m) = x(m+τ). For a signal x(m) with period T, peaks are obtained at τ = IT, where I is an integer. The peaks of the ACF become lower as τ increases, with the highest value observed at S(τ) = S(0). Therefore, F0 can be computed by locating the peak at τ = T. The F0 feature is extracted from the whole utterance of an emotional class in this study.
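The peak-picking idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the unnormalized form of S(τ), the 60-400 Hz search range, the 16 kHz sampling rate and the function names are all assumptions, and the sketch works on a single synthetic frame rather than a whole utterance.

```python
import math

def autocorrelation(frame, max_lag):
    """S(tau) = sum over m of x(m) * x(m + tau), for tau = 0..max_lag."""
    n = len(frame)
    return [sum(frame[m] * frame[m + tau] for m in range(n - tau))
            for tau in range(max_lag + 1)]

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return sample_rate / T, where T is the lag of the highest ACF peak
    inside an assumed pitch range [f0_min, f0_max]."""
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    acf = autocorrelation(frame, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda tau: acf[tau])
    return sample_rate / best_lag

sample_rate = 16000  # assumed sampling rate
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(480)]

acf = autocorrelation(frame, max_lag=100)
f0 = estimate_f0(frame, sample_rate)
```

For this 200 Hz tone the period is 80 samples, so the ACF peaks at lags 80, 160 and 240 with decreasing height, and the search excludes lag 0, where S(0) is always the global maximum.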

The transitions of the emotional speech signal around the zero axis provide additional information for recognizing the signal. This information is supplementary in nature for efficient feature combination and can be obtained using the ZCR, given by:

(3)
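A minimal sketch of a zero crossing rate computation is given below. It counts sign changes between adjacent samples and normalizes by the number of sample pairs; this normalization, the treatment of zero as positive and the function name are assumptions and may differ from Eq. 3.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0

    def sign(x):
        # Zero is treated as positive (an assumption of this sketch).
        return 1 if x >= 0 else -1

    crossings = sum(1 for a, b in zip(frame, frame[1:]) if sign(a) != sign(b))
    return crossings / (len(frame) - 1)

alternating = [1.0, -1.0, 1.0, -1.0, 1.0]   # crosses zero at every step
rising = [1.0, 2.0, 3.0, 4.0]               # never crosses zero
```

Noise-like unvoiced segments tend to give values closer to the first case, and voiced segments values closer to the second.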

To extract these features, a frame width of 30 ms with an overlap of 10 ms has been chosen. Each frame is windowed with a Hamming window to reduce edge effects and prevent signal loss. This window has approximately twice the bandwidth of a rectangular window and is hence preferred here10. For any baseline feature, the number of features in an utterance is given by:

(4)

where m = 1, 2,..., M indexes the features in a frame of an utterance and n = 1, 2,..., N indexes the frames in an utterance. For U utterances of an emotional class, a total of M×N×U features are obtained. The STE, ZCR and ACF frame-level features of an emotional class thus obtained can be represented as B = {ESTn(u), Zn(u), S(τ)n(u)}, u = 1, 2,..., U.
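The framing step and the resulting feature counts can be illustrated with the sketch below. It assumes that "30 ms overlapped with 10 ms" means a 20 ms hop between frame starts, a 16 kHz sampling rate and a one-second placeholder utterance; it also takes M as the number of values per frame. All of these, and the function name `frame_signal`, are assumptions for illustration.

```python
def frame_signal(signal, sample_rate, frame_ms=30, overlap_ms=10):
    """Split a signal into fixed-length frames that overlap by overlap_ms."""
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * (frame_ms - overlap_ms) // 1000
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames

sample_rate = 16000
signal = [0.0] * sample_rate          # one second of placeholder samples
frames = frame_signal(signal, sample_rate)

M = len(frames[0])                    # values per frame
N = len(frames)                       # frames per utterance
U = 60                                # utterances per emotion (EMO-DB subset)
total_baseline_features = M * N * U   # M x N x U, as stated in the text
```

With these assumed settings every utterance second already contributes tens of thousands of raw values, which is the motivation for the reduction steps that follow.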

Different combinations of these prosodic baseline features have been used. The combined feature sets are represented as B = {B1, B2,..., B11} and are given below:

For the combination mechanism to be effective, the features need to be of similar nature15,22. Hence, prosodic features providing energy information, such as STE and ACF, are combined. Pitch and ZCR provide supplementary information and hence increase the available information. Using these features for combination in our approach enhanced the classifier performance.

Reduced features using VQ: The literature shows VQ to be an effective data compression technique11,18,22,37. The method outperformed both linear and non-linear PCA in removing redundant features5. It is explored here to reduce the features for effective recognition and faster processing, and is formulated as follows.

Consider the ZCR features of an emotional class, represented as Z. The baseline ZCR frame-level features comprise the source feature vectors B = {Z1, Z2,..., ZU} obtained from all U utterances of an emotional class. In this study, each feature vector is k-dimensional, i.e., Zu = {zu,1, zu,2,..., zu,k}, u = 1, 2,..., U. In VQ, these source vectors are mapped onto another vector space consisting of a finite number of regions or clusters, each represented by a code vector. These code vectors form a codebook R = {r1, r2,..., rY}, where Y is the number of code vectors in the codebook. Each k-dimensional code vector is represented by ry = {ry,1, ry,2,..., ry,k}, y = 1, 2,..., Y. The encoding region associated with the code vector ry is denoted qy, y = 1, 2,..., Y, and the partition of the space is C = {q1, q2,..., qY}. If the source feature vector Zu belongs to the encoding region qy, then its approximation is G(Zu) = ry, if Zu∈qy. The deviation of the features from the centroid can be found by estimating the average distortion using a squared-error distortion measure given by:

(5)

For an optimal result, the distortion needs to be minimized. This requires both the nearest neighbor condition and the centroid condition to be fulfilled. The LBG algorithm satisfies these two conditions efficiently and yields an optimum codebook design17; hence it is adopted in this study. The steps for designing the codebook using this algorithm are as follows:

Step 1, compute the centroid: For the input feature set B, let the codebook size be Y = 1 and set the splitting parameter to a small value. The centroid is computed using:

(6)

This 1-vector codebook is obtained by averaging the entire training data B and hence does not require any iteration. Now, the average distortion D* is computed using the relation:

(7)

Step 2, splitting: For y = 1, 2,..., Y, split each code vector into two by perturbing it with the splitting parameter, then double the codebook size, i.e., Y = 2Y

Step 3, iteration: For j = 0, let D(0) = D*
  (i) For u = 1, 2,..., U, compute the minimum squared distance between Zu and the code vectors over all y. Let this minimum be achieved for an index y*; then assign Zu to the code vector with index y*
  (ii) For y = 1, 2,..., Y, the code vector is updated as:

(8)

  While satisfying the above condition, it is essential to include at least one input feature vector in each coding region to prevent the denominator of Eq. 8 from becoming zero
  (iii) Set j = j+1
  (iv) Estimate the average distortion D(j) again using:

(9)

  (v) If the relative decrease in distortion from D(j-1) to D(j) exceeds the threshold, return to step (i)
  (vi) Otherwise, set the current code vectors as the final code vectors

Step 4: Steps 2 and 3 are repeated until the desired number of code vectors is obtained

Codebook sizes of 2³, 2⁴, 2⁵ and 2⁶ have been tested in this experiment. After some experimentation, a codebook size of 2⁴ was chosen as a tradeoff between storage space, computation time and reconstruction quality. In addition, a maximum of 20 iterations for each codebook size has been chosen to suit this tradeoff. Splitting percentages between 0.01 and 0.05 have been tested for the codebook design. A splitting percentage of 0.02 provided reasonable accuracy, with the split size reduced at a rate of 0.75 on completion of each splitting. To maintain the required degree of precision, a threshold of 0.002 is chosen for the distance measure.
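The codebook design procedure can be sketched in plain Python as below. This is a generic LBG sketch under stated assumptions, not the authors' code: the target size of 16 reads the reported codebook size as a power of two, the split uses the usual multiplicative perturbation (1 ± ε), an empty coding region keeps its old code vector, and the clustered 2-D data are synthetic. The splitting value 0.02, decay 0.75, threshold 0.002 and 20 iterations follow the settings reported above.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    k = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(k)]

def avg_distortion(vectors, codebook):
    """Mean squared error per dimension to the nearest code vector."""
    k = len(vectors[0])
    total = sum(min(sq_dist(v, c) for c in codebook) for v in vectors)
    return total / (len(vectors) * k)

def lbg(vectors, size=16, eps=0.02, eps_decay=0.75,
        threshold=0.002, max_iter=20):
    codebook = [centroid(vectors)]            # 1-vector codebook
    while len(codebook) < size:
        # Splitting: perturb each code vector up and down, doubling the size.
        codebook = ([[x * (1 + eps) for x in c] for c in codebook] +
                    [[x * (1 - eps) for x in c] for c in codebook])
        eps *= eps_decay                      # assumed reading of the 0.75 rate
        prev = avg_distortion(vectors, codebook)
        for _ in range(max_iter):
            # Nearest neighbor condition: assign each vector to a region.
            cells = [[] for _ in codebook]
            for v in vectors:
                best = min(range(len(codebook)),
                           key=lambda y: sq_dist(v, codebook[y]))
                cells[best].append(v)
            # Centroid condition: an empty region keeps its old code vector.
            codebook = [centroid(cell) if cell else c
                        for cell, c in zip(cells, codebook)]
            cur = avg_distortion(vectors, codebook)
            if prev == 0 or (prev - cur) / prev <= threshold:
                break
            prev = cur
    return codebook

random.seed(0)
centers = [(1.0, 1.0), (1.0, 5.0), (5.0, 1.0), (5.0, 5.0)]
data = [[random.gauss(cx, 0.3), random.gauss(cy, 0.3)]
        for cx, cy in centers for _ in range(50)]
codebook = lbg(data)
```

Because the codebook doubles at every split, only power-of-two sizes are reachable, and larger codebooks trade storage and computation time for lower distortion.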

The proposed VQ based frame-level features are extracted using the following steps.

Consider the ZCR feature, indicated as Z. The ZCR values of each frame are given by Zm = Z1, Z2,..., ZM, m = 1, 2,..., M, where M is the number of features in a frame. VQ is applied to each frame to extract a single VQ coefficient for that frame, giving VZn = VZ1, VZ2,..., VZN, n = 1, 2,..., N, where N is the number of frames in an utterance

Therefore, the total number of VQ based ZCR features of an utterance is given by:

(10)

In this way, applying VQ reduces the features from M×N to N. For an emotion consisting of u = 1, 2,..., U such utterances, all VQ coefficients are extracted and stored. The VQ based ZCR features extracted for an emotional class can therefore be represented as {VZn (u) = VZn (1), VZn (2),..., VZn (U)}, u = 1, 2,..., U. Hence, the total number of VQ based ZCR features of an emotional class becomes N×U.

Similarly, the corresponding VQ based features are extracted for STE and the autocorrelation coefficients. The VQ based features for STE, ZCR and ACF of an emotional class thus obtained can be represented as V = {VESTn (u), VZn (u), VS(τ)n (u)}
However, F0 is extracted at utterance level in this study

Different combinations of these VQ based features have been used. The combined feature sets are represented as V = {V1, V2,..., V11} and are given below:

Modified VQ features: A modified VQ feature based on statistical VQ and differential VQ based reduction method is proposed. The formulation of the modified feature vector is explained below.

Statistical VQ for reduction: For further reduction of the VQ features, we have adopted the statistical method, since the utterances used are of different durations. Further, statistical values are capable of distinguishing emotions based on arousal level. The mean, standard deviation, minimum, maximum and range of the VQ features are computed. These parameters summarize the VQ features without losing major information content. From the N VQ features in an utterance, the corresponding statistical VQ based ZCR features at utterance level are SVZn (mean), SVZn (minimum), SVZn (maximum), SVZn (SD) and SVZn (range). For an emotional class having U utterances, all such mean values for each feature category are stored and represented for ZCR by SVZn (mean) = {SVZ1 (mean), SVZ2 (mean),..., SVZU (mean)}, u = 1, 2,..., U. The other statistical values of the VQ based ZCR features at utterance level are stored similarly for an emotional class. Correspondingly, the statistical features of the VQ based STE and ACF are extracted and gathered from all utterances of an emotional class.

The statistical VQ based utterance level features for STE, ZCR and ACF of each utterance of an emotional class thus obtained can be represented as

The statistical features are computed using the following generalized formulas, given here for the VQ based ZCR features:

(11)

(12)

(13)

Fig. 1: Proposed feature extraction technique

In this way, the N VQ based features are reduced to five statistical VQ features for each utterance of an emotional class. For U utterances of an emotional class, a total of 5×U statistical VQ features are obtained.
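The statistical reduction can be sketched with the standard library as follows. The population form of the standard deviation, the function name and the example values are assumptions; the text does not state which form of the standard deviation was used.

```python
import statistics

def utterance_statistics(values):
    """Mean, standard deviation, minimum, maximum and range of a sequence."""
    lo, hi = min(values), max(values)
    return {
        "mean": statistics.fmean(values),
        "sd": statistics.pstdev(values),   # population SD (assumed)
        "min": lo,
        "max": hi,
        "range": hi - lo,
    }

# Made-up per-frame VQ values of one utterance (N = 8 frames).
vq_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = utterance_statistics(vq_values)
```

Whatever the number of frames N, the output always has the same five entries, which is what makes utterances of different durations directly comparable.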

Differential VQ for reduction: The accuracy obtained with statistical features suffers because the temporal information among features is completely lost during their extraction2. To compensate for this, the features are reduced by taking the difference between two consecutive frame-level VQ features of an utterance. In this way the temporal dynamics are retained in the reduced features. The resulting features are designated differential_VQ and are extracted for the ZCR features as:

(14)

The number of differential_VQ features of an utterance for each emotional class thus becomes N-1.

To include the statistical nature of the differential_VQ features, their mean, maximum, minimum, standard deviation and range are again computed for each utterance as before. In this way, both the temporal dynamics and the statistical characteristics of the features are involved in deriving the differential VQ statistics. The number of statistical differential VQ based features of each utterance is thus limited to N4 = 5.
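A small sketch may help show why the differential step recovers information that plain statistics discard. The sign convention (next frame minus current frame), the function name and the toy sequences are assumptions for illustration.

```python
def differential(values):
    """Differences between consecutive frame-level values (length N - 1)."""
    return [b - a for a, b in zip(values, values[1:])]

# Two made-up VQ sequences with the same values in a different frame order.
steady = [1.0, 2.0, 3.0, 4.0]
jumpy = [1.0, 3.0, 2.0, 4.0]

diff_steady = differential(steady)   # [1.0, 1.0, 1.0]
diff_jumpy = differential(jumpy)     # [2.0, -1.0, 2.0]
```

Both sequences share the same mean, minimum, maximum and range, so utterance-level statistics alone cannot tell them apart, whereas their differential sequences differ in every statistic except the mean.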

Proposed modification: To derive the modified features, the statistical VQ based features and the statistical differential VQ based features of each utterance of an emotional class are used together. The total number of modified features obtained in this way for an utterance is given by:

(15)

For each emotional class comprising of U utterances, the total modified VQ based ZCR features at utterance level of an emotional class can be estimated as:

(16)

Similar modified features are also extracted for the VQ based STE and ACF features. The modified VQ based utterance level features for STE, ZCR and ACF of each utterance of an emotional class thus obtained can be represented as (MVQ)u = {(MVQ (Z))u, (MVQ (EST))u, (MVQ (S(τ)))u}.
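Putting the two reductions together, a modified feature vector for one feature type of one utterance can be sketched as below. The ordering of the statistics, the population standard deviation, the function names and the example values are assumptions; the sketch simply concatenates the five statistics of the VQ values with the five statistics of their differences.

```python
import statistics

def five_stats(values):
    """[mean, SD, minimum, maximum, range] of a sequence."""
    lo, hi = min(values), max(values)
    return [statistics.fmean(values), statistics.pstdev(values),
            lo, hi, hi - lo]

def modified_vq_features(vq_values):
    """Statistics of the VQ values followed by statistics of their
    frame-to-frame differences (needs at least two frames)."""
    diffs = [b - a for a, b in zip(vq_values, vq_values[1:])]
    return five_stats(vq_values) + five_stats(diffs)

# Made-up per-frame VQ based ZCR values of one utterance.
vq_zcr = [0.12, 0.15, 0.11, 0.20, 0.18]
features = modified_vq_features(vq_zcr)
```

Under these assumptions each feature type contributes ten values per utterance regardless of the utterance length, so a combination such as STE+ZCR+ACF would simply concatenate three such vectors.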

Different combinations of these modified VQ features have been used. The combined feature sets are represented as MVQ = {MVQ1, MVQ2,..., MVQ11} and are given below:

The proposed feature extraction technique has been shown in Fig. 1.

RESULTS AND DISCUSSION

The standard EMO-DB database has been tested with the proposed features, along with a locally collected regional Oriya emotional speech database. As no Oriya database was available, it has been collected from different sources. Emotional speech utterances of the angry, fear, happy and sad states are analyzed.

Table 1: Details of utterances used from the database

Table 2: Impact of dimension reduction with different feature extraction techniques
M: No. of features/frame, N: No. of frames/utterance, U: No. of utterances/emotion

Sixty utterances from the EMO-DB database and 16 utterances from the Oriya database are used for each emotion for feature extraction and classification. The local database is resampled to the sampling rate of the EMO-DB database to make the comparison platform similar. The details of the utterances used are shown in Table 1.

Neural network (NN) based classifiers, namely RBFN and MLP, have been used to model the speech emotions in this study. They have self-learning ability and are hence suitable for emotional speech recognition. Further, they are parallel structures, suitable for speech signals whose frequencies occur in parallel. These classifiers minimize error by adjusting weights and biases and are suitable for reduced features. Conventional classifiers such as HMM and GMM can model emotions adequately if the feature dimension is large, but their performance degrades for reduced features. Since the use of VQ at frame level and modified VQ at utterance level reduces the feature dimension, NNs suit our purpose. Further, the correlation among features is poorly defined by HMM as compared to NNs. Due to these factors, the chosen classifiers can converge to an optimal solution more easily than other conventional classifiers.

For our purpose, optimum results with the MLP have been achieved with a learning rate of 0.01 and a momentum rate of 0.1. The number of epochs has been set to 50 after testing the classifier with 20, 30, 40, 50, 55 and 65 epochs. The number of iterations was chosen by taking into account the speed of response and the computational complexity needed to achieve the desired result. The number of input layer neurons has been kept equal to the number of available input source feature vectors for both the MLP and RBFN classifiers. The source features are collected from all four classes of emotions. The number of output neurons has been set equal to 40% of the source features for each emotional category. Hidden node counts of 10, 15, 20, 30 and 40 have been tested for both classifiers. The highest performance was obtained with 20 hidden nodes for the local database and 40 hidden nodes for the EMO-DB database.

For the RBFN, a single hidden layer is sufficient to model the emotions22 and is hence selected for our purpose. A mean square error goal of 0.0 has been used, with extra neurons added to the hidden layer until the desired MSE is achieved. Spread values of 0.01, 0.5, 1.0, 1.5 and 2.1 have been tested for this network. Optimum smoothing was obtained with a spread of 2.1, which is hence retained.

The classifiers are simulated with training, validation and testing ratios of 70/15/15%, 60/20/20% and 50/25/25%. A ratio of 60/20/20% was found to be the most effective in this experiment. The observations using the proposed feature extraction methods are described below.
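The 60/20/20% partition can be sketched as a shuffled index split. How the original experiments drew the partitions is not stated, so the random shuffle, the fixed seed, the function name and the sample count of 240 (60 EMO-DB utterances for each of four emotions, assuming one sample per utterance) are all assumptions.

```python
import random

def split_indices(n_samples, train=0.6, val=0.2, seed=0):
    """Shuffle sample indices and cut them into train/validation/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(n_samples * train))
    n_val = int(round(n_samples * val))
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

train_idx, val_idx, test_idx = split_indices(240)
```

The other two ratios tested in the text correspond to `split_indices(240, train=0.7, val=0.15)` and `split_indices(240, train=0.5, val=0.25)`.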

In deriving the modified features, we have considered the utterance level statistics of the VQ based features. Statistical features have been applied quite successfully in the field of emotional speech recognition, as they are able to distinguish low arousal emotions from high arousal emotions. Since this work considers emotions at both arousal levels, statistical techniques proved suitable in our case. However, in describing the desired emotions at utterance level, the temporal dynamics between features and between frames are completely lost. Hence, we have considered differential VQ features based on the variation of feature values between frames, so that the rate at which the VQ based features change between frames is taken into account. In this way both the statistical and the dynamic characteristics of the VQ based features are included, which has resulted in enhanced accuracy. The statistics of the differential VQ features are then estimated to reduce the features further in the modified feature set. The reduction of features in the proposed methods helped in the selection of features, so the cross validation executes faster. In turn, this has enhanced the compatibility of the features with the chosen NN classifiers. The impact of dimension reduction using the different feature extraction techniques is described in Table 2.

The RBFN classification accuracy is found to be 85.79% (Table 3), 90.38% (Table 4) and 91.08% (Table 5) with the frame-level feature combination without VQ, the VQ based feature combination and the modified VQ based feature combination, respectively, using the EMO-DB database. Thus, the modified VQ based feature combinations outperformed all other feature combinations, followed by the VQ based prosodic feature combination, in terms of classification accuracy. By comparison, the corresponding MLP classification accuracy is 84.04% (Table 6), 88.05% (Table 7) and 89.93% (Table 8), respectively. These results are obtained with the E+F0+Z+ACF feature combination, which outperformed all other combinations. The results are better than those of similar works by earlier researchers6,29,37.

Table 3: Performances of RBFN classifier with frame level prosodic feature combination using EMO-DB database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

Table 4: Performances of RBFN classifier with VQ based prosodic feature combination using EMO-DB database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

Table 5: Performances of RBFN classifier with modified VQ based feature combination using EMO-DB database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

Table 6: Performances of MLP classifier with frame level prosodic feature combination using EMO-DB database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

Table 7: Performances of MLP classifier with VQ based prosodic feature combination using EMO-DB database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

Table 8: Performances of MLP classifier with modified VQ based feature combination using EMO-DB database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

Similar trends have been observed for the locally collected Oriya database (Tables 9-14). The small difference in classification accuracy between the EMO-DB and the locally collected database supports the validity of the feature extraction techniques proposed in this study.

Of the two NN based classifiers, the RBFN has given better accuracy than the MLP network on both databases with our proposed feature extraction techniques. The RBFN has also been reported to be superior to k-NN and MLP classifiers13. Further, RBFNs have a better training algorithm than the MLP, since at any time only a part of the non-linear activation function is active for a given input vector7,13,22. Furthermore, it is easy to interpret the data using an RBFN. The MLP uses distributed learning, as compared to the localized learning approach adopted by the RBFN, and is hence slower27.

Table 9: Performances of RBFN classifier with frame level prosodic feature combination using Oriya language database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

Table 10: Performances of RBFN classifier with VQ based prosodic feature combination using Oriya language database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

Table 11: Performances of RBFN classifier with modified VQ based feature combination using Oriya language database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

Combining features that carry similar information improved recognition. Features carrying complementary energy information, such as ACF and STE, outperformed other similar combinations. The STE and ZCR combination was the next best: since ZCR distinguishes the voiced parts of the signal from the unvoiced parts, it provides complementary energy information. Among the three-feature combinations, STE+ZCR+ACF provided the highest accuracy for a similar reason. This has been observed with all the proposed feature extraction techniques and with both classifiers.

Table 12: Performances of MLP classifier with frame level prosodic feature combination using Oriya language database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

Table 13: Performances of MLP classifier with VQ based prosodic feature combination using Oriya language database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

Table 14: Performances of MLP classifier with modified VQ based feature combination using Oriya language database
E: Short time energy, F0: Fundamental frequency, ACF: Autocorrelation coefficient, Z: Zero crossing rate

CONCLUSION

The NN based classifiers chosen in this study classify reduced features well, as our results reveal. Improvements are observed when the VQ based frame-level reduced features are reduced further into the modified features. The results also show that feature combination enhances classification accuracy due to the increase in available information. However, the features for a combination need to be selected judiciously, as is clear from our results when features bearing similar information are combined. Further, recognition is better when reduction and feature combination are applied together. These observations demonstrate that it is possible to identify emotions in speech using a limited amount of data if the classifier is compatible with the extracted features. The small difference in classification accuracy between the locally collected database and the standard EMO-DB database supports our claim.

REFERENCES
Albornoz, E. and D. Milone, 2016. Emotion recognition in never-seen languages using a novel ensemble method with emotion profiles. IEEE Trans. Affective Comput. 10.1109/TAFFC.2015.2503757

Chandrasekar, P., S. Chapaneri and D. Jayaswal, 2014. Automatic speech emotion recognition: A survey. Proceedings of the International Conference on Circuits, Systems, Communication and Information Technology Applications, April 4-5, 2014, Mumbai, pp: 341-346.

Cheng, X. and Q. Duan, 2012. Speech emotion recognition using Gaussian mixture model. Proceedings of the 2nd International Conference on Computer Application and System Modeling, July 27-29, 2012, Shanxi, China, pp: 1222-1225.

El Ayadi, M., M.S. Kamel and F. Karray, 2011. Survey on speech emotion recognition: Features, classification schemes and databases. Pattern Recognit., 44: 572-587.

Fodor, I.K., 2002. A survey of dimension reduction techniques. Technical Report, pp: 1-18. https://computation.llnl.gov/casc/sapphire/pubs/148494.pdf.

Fulmare, N.S., P. Chakrabarti and D. Yadav, 2013. Understanding and estimation of emotional expression using acoustic analysis of natural speech. Int. J. Nat. Lang. Comput., 2: 37-46.

Han, K., D. Yu and I. Tashev, 2014. Speech emotion recognition using deep neural network and extreme learning machine. Proceedings of the 15th Annual Conference of the International Speech Communication Association, September 14-18, 2014, Singapore, pp: 223-227.

Haq, S. and P.J.B. Jackson, 2010. Multimodal Emotion Recognition. In: Machine Audition: Principles, Algorithms and Systems, Wang, W. (Ed.). IGI Global Press, USA., ISBN: 978-1615209194, pp: 398-423.

Haykin, S., 2006. Neural Networks: A Comprehensive Foundation. 2nd Edn., Pearson Education, Delhi, India.

Iliou, T. and C.N. Anagnostopoulos, 2010. Classification on speech emotion recognition-a comparative study. Int. J. Adv. Life Sci., 2: 18-28.

Khanna, P. and M.S. Kumar, 2011. Application of Vector Quantization in Emotion Recognition from Human Speech. In: Information Intelligence, Systems, Technology and Management, Dua, S., S. Sahni and D.P. Goyal (Eds.). Springer, Berlin, Germany, ISBN: 978-3-642-19422-1, pp: 118-125.

Kim, J.C. and M.A. Clements, 2015. Multimodal affect classification at various temporal lengths. IEEE Trans. Affective Comput., 6: 371-384.

Kolodyazhniy, V., S.D. Kreibig, J.J. Gross, W.T. Roth and F.H. Wilhelm, 2011. An affective computing approach to physiological emotion specificity: Toward subject-independent and stimulus-independent classification of film-induced emotions. Psychophysiology, 48: 908-922.

Koolagudi, S.G. and K.S. Rao, 2012. Emotion recognition from speech: A review. Int. J. Speech Technol., 15: 99-117.

Kuchibhotla, S., H.D. Vankayalapati, R.S. Vaddi and K.R. Anne, 2014. A comparative analysis of classifiers in emotion recognition through acoustic features. Int. J. Speech Technol., 17: 401-408.

Lee, C.M. and S.S. Narayanan, 2005. Toward detecting emotions in spoken dialogs. IEEE Trans. Speech Audio Process., 13: 293-303.

Li, Y. and Y. Zhao, 1998. Recognizing emotions in speech using short-term and long-term features. Proceedings of the 5th International Conference on Spoken Language Processing, November 30-December 4, 1998, Sydney, Australia.

Linde, Y., A. Buzo and R.M. Gray, 1980. An algorithm for vector quantizer design. IEEE Trans. Commun., 28: 84-95.

Luengo, I., E. Navas and I. Hernaez, 2010. Feature analysis and evaluation for automatic emotion identification in speech. IEEE Trans. Multimedia, 12: 490-501.

Nwe, T.L., S.W. Foo and L.C. De Silva, 2003. Speech emotion recognition using hidden Markov models. Speech Commun., 41: 603-623.

Palo, H.K., M.N. Mohanty and M. Chandra, 2015. Use of different features for emotion recognition using MLP network. Adv. Intell. Syst. Comput., 332: 7-15.

Palo, H.K., M.N. Mohanty and M. Chandra, 2015. Design of Neural Network Model for Emotional Speech Recognition. In: Artificial Intelligence and Evolutionary Algorithms in Engineering Systems, Suresh, L.P., S.S. Dash and B.K. Panigrahi (Eds.). Springer, India, ISBN: 978-81-322-2134-0, pp: 291-300.

Palo, H.K., M.N. Mohanty and M. Chandra, 2016. Efficient feature combination techniques for emotional speech classification. Int. J. Speech Technol., 19: 135-150.

Pao, T.L., Y.T. Chen, J.H. Yeh and W.Y. Liao, 2005. Detecting emotions in Mandarin speech. Comput. Linguist. Chin. Lang. Process., 10: 347-362.

Quan, C., D. Wan, B. Zhang and F. Ren, 2013. Reduce the dimensions of emotional features by principal component analysis for speech emotion recognition. Proceedings of the IEEE/SICE International Symposium on System Integration, December 15-17, 2013, Kobe, Japan, pp: 222-226.

Ramakrishnan, S., 2012. Recognition of Emotion from Speech: A Review. In: Speech Enhancement, Modeling and Recognition-Algorithms and Applications, Ramakrishnan, S. (Ed.). InTech Inc., Croatia, ISBN: 978-953-51-0291-5, pp: 121-138.

Santos, R.B., M. Rupp, S.J. Bonzi and A.M.F. Fileti, 2013. Comparison between multilayer feedforward neural networks and a radial basis function network to detect and locate leaks in pipelines transporting gas. Chem. Eng. Trans., 32: 1375-1380.

Schuller, B., A. Batliner, D. Seppi, S. Steidl and T. Vogt et al., 2007. The relevance of feature type for the automatic classification of emotional user states: Low level descriptors and functionals. Proceedings of the 8th Annual Conference of the International Speech Communication Association, August 27-31, 2007, Antwerp, Belgium, pp: 2253-2256.

Schuller, B., A. Batliner, S. Steidl and D. Seppi, 2011. Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge. Speech Commun., 53: 1062-1087.

Schuller, B., G. Rigoll and M. Lang, 2004. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Volume 1, May 17-21, 2004, Montreal, Canada, pp: I-577-I-580.

Schuller, B., S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Muller and S. Narayanan, 2013. Paralinguistics in speech and language-state-of-the-art and the challenge. Comput. Speech Lang., 27: 4-39.

Tahon, M. and L. Devillers, 2016. Towards a small set of robust acoustic features for emotion recognition: Challenges. IEEE/ACM Trans. Audio Speech Lang. Process., 24: 16-28.

Ververidis, D. and C. Kotropoulos, 2006. Emotional speech recognition: Resources, features and methods. Speech Commun., 48: 1162-1181.

Wang, J.C., Y.H. Chin, B.W. Chen, C.H. Lin and C.H. Wu, 2015. Speech emotion verification using emotion variance modeling and discriminant scale-frequency maps. IEEE/ACM Trans. Audio Speech Lang. Process., 23: 1552-1562.

Wang, K., N. An, B.N. Li, Y. Zhang and L. Li, 2015. Speech emotion recognition using Fourier parameters. IEEE Trans. Affective Comput., 6: 69-75.

Wang, K.C., 2015. Speech emotional classification using texture image information features. Int. J. Signal Process. Syst., 3: 1-7.

Wenjing, H., L. Haifeng and G. Chunyu, 2009. A hybrid speech emotion perception method of VQ-based feature processing and ANN recognition. Proceedings of the WRI Global Congress on Intelligent Systems, Volume 2, May 19-21, 2009, Xiamen, China, pp: 145-149.

Wu, C.H. and W.B. Liang, 2011. Emotion recognition of affective speech based on multiple classifiers using acoustic-prosodic information and semantic labels. IEEE Trans. Affective Comput., 2: 10-21.

Wu, S., T.H. Falk and W.Y. Chan, 2011. Automatic speech emotion recognition using modulation spectral features. Speech Commun., 53: 768-785.
