Performance Analysis of Objective Speech Quality Measures in Mel Domain

Li, Lu

ABSTRACT

To evaluate speech quality effectively and accurately, Mel-SD and Mel-CD are compared and analyzed, with emphasis on feature extraction. The effect of the Mel filter bank structure on both measures is investigated. The results show that Mel-SD performs better than Mel-CD and remains robust to changes in the Mel filter bank. Mel-CD is sensitive to the structure of the Mel filter bank and its performance decreases as the number of filters increases. With the optimal filter bank size, Mel-SD was tested with different compression factors to find the optimal factor for assessing speech quality. Finally, the optimized Mel-SD and Mel-CD were tested by assessing the speech quality of a communication system. The experimental results show that Mel-SD performs well and that the performance of Mel-CD is comparable to that of PESQ.


INTRODUCTION

During the research, design, development and operation of a communication system, the performance of equipment and systems must be monitored so that adjustments, improvements and optimizations can be made. For systems that exchange information by speech, the quality of voice transmission is an important indicator of system performance. Given the new requirements of communication technologies and services, a flexible, reliable and accurate voice quality evaluation system has become a goal pursued by researchers worldwide (Kitawaki et al., 1984).

Mel-CD is an objective voice quality evaluation method in the Mel domain and has been applied in both research and practice (Chen et al., 2001; Fu et al., 2001; Huang et al., 2000; Kubichek, 1993; Wang et al., 1992). Mel-CD uses the Mel Frequency Cepstral Coefficients (MFCC) as the features of the speech signal and computes an objective distortion distance from them. MFCC takes the nonlinear frequency perception of the human ear into account, but MFCC itself is derived from homomorphic deconvolution. When it is used to characterize speech in objective voice quality evaluation, it does not agree well with models of auditory physiology and perception (French and Steinberg, 1947; Voiers, 1977).

To address these problems of Mel-CD, an objective voice quality assessment method based on Mel Frequency Spectral Coefficient (MFSC) features has been proposed: the Mel Spectral Distortion Measure (Mel-SD) (Chen and Jin, 2006).

Objective voice quality evaluation in the Mel domain depends closely on the choice of the Mel domain filter bank. This study investigates the relationship between Mel-SD, Mel-CD and the filter bank and, on this basis, how changes in the nonlinear compression function of MFSC affect the performance of Mel-SD.

METHODOLOGY

Speech evaluation methods
Mel-CD and Mel-SD: A typical input-output-based objective evaluation of voice quality consists of three main parts: speech signal preprocessing, calculation of characteristic parameters and a distortion calculation/judgment model, as shown in Fig. 1. The core parts are the characteristic parameter calculation and the distortion calculation/judgment model; different objective measures differ mainly in these two parts.

Preprocessing: Input-output-based objective assessment of speech quality requires the speech signals to be preprocessed, which includes time alignment, level normalization, pre-emphasis, framing and other steps.

Characteristic parameters: The characteristic parameters are the Mel frequency cepstral coefficients (MFCC) and the Mel frequency spectral coefficients (MFSC). Their calculation processes are shown in Fig. 2a and b, respectively.

FFT and short-time power spectrum: Voice signals are non-stationary, but they are generally regarded as short-time stationary over intervals of 10-25 ms. The short-time power spectrum is therefore calculated on 25 ms speech frames.
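The framing and power spectrum step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 8 kHz sampling rate, the non-overlapping frames, the 1/L scaling and the function names `frame_signal` and `power_spectrum` are all assumptions, and a naive DFT stands in for the FFT so that the example needs only the standard library.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop=None):
    """Split a sample sequence into frames of frame_len samples.

    hop defaults to frame_len (non-overlapping frames); this is an
    assumption, the paper does not state the frame shift.
    """
    hop = frame_len if hop is None else hop
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def power_spectrum(frame):
    """Short-time power spectrum |X(n)|^2 / L for bins 0..L/2 (naive DFT)."""
    length = len(frame)
    spectrum = []
    for n in range(length // 2 + 1):
        x_n = sum(v * cmath.exp(-2j * math.pi * n * t / length)
                  for t, v in enumerate(frame))
        spectrum.append(abs(x_n) ** 2 / length)
    return spectrum


# Hypothetical example: 8 kHz sampling, so a 25 ms frame holds 200 samples.
SAMPLE_RATE = 8000
FRAME_LEN = round(0.025 * SAMPLE_RATE)
tone = [math.cos(2 * math.pi * 400 * t / SAMPLE_RATE)
        for t in range(2 * FRAME_LEN)]
frames = frame_signal(tone, FRAME_LEN)
spectra = [power_spectrum(f) for f in frames]
```

With these assumed settings each bin is 40 Hz wide, so the 400 Hz test tone concentrates its power in bin 10 of every frame.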

Frequency warping: The frequency is converted to the Mel scale domain according to Eq. 1:

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{1}$$
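The warping can be sketched in a few lines, assuming the widely used 2595·log10(1 + f/700) form of the Mel scale. That form reproduces the correspondence between 0-4000 Hz and 0-2146 Mel quoted in the Results section; the function names are illustrative.

```python
import math


def hz_to_mel(f_hz):
    """Warp a linear frequency in Hz onto the Mel scale (assumed formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse warping, from Mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


upper_mel = hz_to_mel(4000.0)  # upper edge of the 0-4000 Hz band, about 2146 Mel
```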

Mel domain filtering: The short-time power spectrum of the k-th frame is passed through the cochlear band-pass filter bank to obtain the filtered power spectrum outputs, as given by Eq. 2:

(2)

where O_j,k is the j-th filter output for the k-th frame, A_j (f) is the transfer function of the j-th filter in the filter bank and N is the number of filters.


Fig. 1:	Voice quality objective assessment block diagram based on the input-output


Fig. 2(a-b):	(a) Mel cepstral calculations and (b) Mel spectrum calculation

The Mel domain filter bank is composed of a given number of triangular band-pass filters whose center frequencies are evenly spaced on the Mel scale over the range corresponding to 0-4000 Hz. Within each triangular filter band, the weight at each linear frequency is determined by Eq. 3:


(3)

where f_j is the center frequency of the j-th filter and A_j (f) is the amplitude-frequency characteristic of the j-th filter.
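A possible construction of such a filter bank is sketched below. It assumes the common layout in which each triangle rises linearly (in Hz) from the previous center to its own center and falls to the next center, and it assumes the 2595·log10(1 + f/700) Mel formula; the exact weights of Eq. 3 may differ, and all function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_edges(n_filters, f_max=4000.0):
    """Return n_filters + 2 frequencies in Hz, equally spaced on the Mel scale.

    Entry j (1-based) is the centre f_j of filter j; entries j - 1 and j + 1
    are assumed to be its lower and upper band edges.
    """
    step = hz_to_mel(f_max) / (n_filters + 1)
    return [mel_to_hz(step * i) for i in range(n_filters + 2)]


def triangular_weight(f_hz, j, edges):
    """Weight A_j(f) of filter j at frequency f_hz (assumed triangular shape)."""
    lower, centre, upper = edges[j - 1], edges[j], edges[j + 1]
    if lower < f_hz <= centre:
        return (f_hz - lower) / (centre - lower)
    if centre < f_hz < upper:
        return (upper - f_hz) / (upper - centre)
    return 0.0


def filter_outputs(power, bin_hz, n_filters, f_max=4000.0):
    """Weighted sums O_j of a power spectrum sampled every bin_hz Hz."""
    edges = filter_edges(n_filters, f_max)
    return [sum(triangular_weight(n * bin_hz, j, edges) * p
                for n, p in enumerate(power))
            for j in range(1, n_filters + 1)]


edges = filter_edges(10)  # band edges and centres for a 10-filter bank
```

Because the spacing is uniform in Mel rather than in Hz, the filters are narrow at low frequencies and widen toward 4000 Hz.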

Logarithm and nonlinear compression transform: Figure 2a and b show that MFCC and MFSC differ only in this stage: MFCC applies a logarithm, whereas MFSC applies a nonlinear compression function φ(•). The logarithm in MFCC stems from homomorphic deconvolution, whereas the nonlinear compression in MFSC models the transformation from intensity to perceived loudness.

The choice of a reasonable compression function in MFSC is based on two considerations:

•	It should comply with the characteristics of auditory perception
•	It should avoid a complicated calculation model

On this basis, the cube root function (Quackenbush et al., 1988) is selected as an approximation of the intensity-loudness relationship.

Discrete Cosine Transform (DCT): In MFCC, the DCT is part of the homomorphic deconvolution itself, but it also decorrelates the coefficients and reduces their dimensionality. In MFSC, the DCT serves to decorrelate the coefficients and to reduce dimensionality where necessary. After the DCT, the coefficient components of MFCC and MFSC are uncorrelated, which satisfies the assumption made in the distortion distance calculation that the components are uncorrelated.
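The difference between the two feature types can be illustrated with a short sketch that takes the Mel filter outputs of one frame as input. The unnormalised DCT-II, the natural logarithm and the function names are assumptions made for illustration; the paper does not specify them.

```python
import math


def dct_ii(values, n_coeffs):
    """Unnormalised DCT-II: c_i = sum_j v_j * cos(pi * i * (j + 0.5) / N)."""
    n = len(values)
    return [sum(v * math.cos(math.pi * i * (j + 0.5) / n)
                for j, v in enumerate(values))
            for i in range(n_coeffs)]


def mfcc(filter_out, n_coeffs):
    """Cepstral features: logarithm of each filter output, then DCT.

    Filter outputs must be strictly positive for the logarithm.
    """
    return dct_ii([math.log(o) for o in filter_out], n_coeffs)


def mfsc(filter_out, n_coeffs, factor=1.0 / 3.0):
    """Spectral features: power-law compression o ** factor, then DCT.

    factor = 1/3 corresponds to the cube root discussed in the text.
    """
    return dct_ii([o ** factor for o in filter_out], n_coeffs)
```

The two functions share everything except the compression step, which mirrors the observation that MFCC and MFSC differ only in the nonlinear stage.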

Relationship between MFCC and MFSC: Although MFCC combines nonlinear frequency perception with Mel domain band-pass filtering, it is in essence based on cepstral analysis by homomorphic deconvolution, whereas MFSC is a feature representation based on auditory perception of both the frequency and the intensity of speech. If this difference in principle is set aside and the logarithm is regarded as one implementation of the nonlinear compression, MFCC can be considered a special case of MFSC. To compare the effects of MFSC and MFCC in objective quality evaluation, however, they are still treated as two different characteristic parameters.

Distortion calculation: The distortion of Mel-SD is calculated in exactly the same way as that of Mel-CD, so only the Mel-SD calculation is described here. The Mel spectral distortion distance of the k-th frame is defined by Eq. 4:

(4)

where MFSCx (i, k) is the i-th order MFSC coefficient of the k-th frame of the input voice signal, MFSCy (i, k) is the i-th order MFSC coefficient of the k-th frame of the distorted voice signal, N is the total number of speech frames and m is the order of the MFSC.

The arithmetic mean of the per-frame Mel spectral distortion distances over the audio file is taken as the total distortion of the distorted speech (Eq. 5):

(5)
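The two-stage calculation (a per-frame distance, then an average over frames) can be sketched as below. The per-frame distance is assumed here to be a plain Euclidean distance over the m coefficients; the exact form and scaling of Eq. 4 are not recoverable from the text, so treat this only as an illustration. The function names are hypothetical.

```python
import math


def frame_distance(coeffs_x, coeffs_y):
    """Distance between the coefficient vectors of one frame.

    Assumption: plain Euclidean distance over the m coefficients.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(coeffs_x, coeffs_y)))


def total_distortion(frames_x, frames_y):
    """Arithmetic mean of the per-frame distances over all frames."""
    distances = [frame_distance(x, y) for x, y in zip(frames_x, frames_y)]
    return sum(distances) / len(distances)
```

Here `frames_x` and `frames_y` are lists of per-frame coefficient vectors for the input and the distorted signal, assumed to be time-aligned and of equal length.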

Judgment model: The distortion is mapped to the corresponding objective voice quality MOS value, also called the predicted MOS value, by a quadratic polynomial fitted under the least squares criterion.
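A least-squares quadratic mapping of this kind can be sketched with the normal equations solved by Gaussian elimination. The function names are hypothetical and the routine is a generic quadratic fit, not the authors' code; it needs at least three distinct distortion values, otherwise the system is singular.

```python
def fit_quadratic(distortions, mos_scores):
    """Least-squares fit of mos ~ a0 + a1*d + a2*d**2; returns [a0, a1, a2]."""
    # Normal equations: sum_c a_c * S[r + c] = T[r], with S[k] = sum of d**k.
    power_sums = [sum(d ** k for d in distortions) for k in range(5)]
    rhs = [sum(m * d ** k for d, m in zip(distortions, mos_scores))
           for k in range(3)]
    rows = [[power_sums[r + c] for c in range(3)] + [rhs[r]] for r in range(3)]

    # Gauss-Jordan elimination with partial pivoting on the 3x4 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(3):
            if r != col:
                ratio = rows[r][col] / rows[col][col]
                rows[r] = [a - ratio * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][3] / rows[i][i] for i in range(3)]


def predict_mos(distortion, coeffs):
    """Map a distortion value to a predicted (objective) MOS."""
    a0, a1, a2 = coeffs
    return a0 + a1 * distortion + a2 * distortion ** 2
```

The coefficients would be fitted on pairs of distortion values and subjective MOS scores and then reused to turn new distortion values into predicted MOS values.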

Relationship between Mel-CD and Mel-SD: From the above analysis, if MFCC is regarded as a special case of MFSC, Mel-CD can be regarded as a special case of Mel-SD. However, because the logarithmic compression does not match the characteristics of auditory perception, Mel-CD and Mel-SD differ in evaluation performance (Chen and Jin, 2006).

Performance indicators of objective speech quality assessment: The performance of an objective voice quality evaluation method is generally measured by the degree of correlation between the objective MOS values and the subjective MOS values and by the prediction error (ITU., 2001). The degree of correlation is described by the Pearson correlation coefficient, Eq. 6:

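$$\rho = \frac{\sum_{i=1}^{M}\left(MOS_o(i)-\overline{MOS_o}\right)\left(MOS_s(i)-\overline{MOS_s}\right)}{\sqrt{\sum_{i=1}^{M}\left(MOS_o(i)-\overline{MOS_o}\right)^2 \sum_{i=1}^{M}\left(MOS_s(i)-\overline{MOS_s}\right)^2}} \tag{6}$$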

The correlation coefficient describes the degree of linear relationship between the objective evaluation and the subjective MOS evaluation. The closer the correlation coefficient is to +1, the more accurately the objective measure predicts the subjective MOS value.

The prediction error is estimated by the standard deviation σ_sse, which is defined in Eq. 7. The smaller σ_sse is, the smaller the prediction error and the better the performance of the objective evaluation measure.

(7)

In Eq. 6 and 7, MOS_o (i) is the objective MOS value of the i-th data point, MOS_s (i) is the corresponding subjective MOS score and M is the number of data points.
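Both indicators can be computed directly from the paired MOS values, as in the sketch below. The Pearson coefficient is standard; for the prediction error, a root of the summed squared errors divided by M - 1 is assumed, and the normalisation actually used in Eq. 7 may differ. The function names are illustrative.

```python
import math


def pearson(mos_o, mos_s):
    """Pearson correlation between objective and subjective MOS values."""
    mean_o = sum(mos_o) / len(mos_o)
    mean_s = sum(mos_s) / len(mos_s)
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(mos_o, mos_s))
    var_o = sum((o - mean_o) ** 2 for o in mos_o)
    var_s = sum((s - mean_s) ** 2 for s in mos_s)
    return cov / math.sqrt(var_o * var_s)


def prediction_error_std(mos_o, mos_s):
    """Standard deviation of the prediction error.

    Assumption: root of the summed squared errors divided by M - 1; the
    normalisation used in the paper's Eq. 7 may differ.
    """
    squared = sum((s - o) ** 2 for o, s in zip(mos_o, mos_s))
    return math.sqrt(squared / (len(mos_o) - 1))
```

A measure that tracks the subjective scores well gives a coefficient near +1 together with a small error deviation.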

RESULTS

Relationship between Mel-SD, Mel-CD and the number of Mel filters: The analysis above shows that Mel-SD and Mel-CD differ in how the speech feature parameters are extracted. MFCC and MFSC are computed identically except for the nonlinear transformation stage. Mel domain filtering is common to both parameters, so the choice of Mel filter bank affects both Mel-SD and Mel-CD. This section examines the impact of the Mel filter bank on the two objective measures.

This study of objective voice quality evaluation targets the telephone band, so a bandwidth slightly wider than the telephone band, 0-4000 Hz, was chosen. For a given number of filters, the Mel filter bank is constructed by arranging the center frequencies of the triangular filters uniformly over the Mel domain range corresponding to 0-4000 Hz (0-2146 Mel). A different number of filters gives a different filter bandwidth and therefore a different filter bank. The impact of the filter bank on measure performance is thus studied as the impact of the number of filters in the bank.

The impact of the filter bank is reflected in the test performance of Mel-CD and Mel-SD. The tests used Mandarin speech material selected from the speech database of standard SJ 20852-2002, which accompanies the MOS speech quality evaluation standard SJ 20771-2000 and is mandatory for MOS testing. The material comprises 72 voice files, each roughly 10 s long and containing three test sentences (constructed to be phonetically balanced according to the statistical characteristics of the Chinese language), recorded from three male and three female speakers. The test experiments covered different communication systems, different interference types and speech distortion conditions at various signal ratios, giving eight different speech data sets in total, labeled conditions 1-8.

Filter banks with 7-25 filters were tested. For each filter bank, Mel-CD and Mel-SD were each evaluated in the eight objective evaluation tests and the correlation between the evaluation results and the subjective MOS was computed. The average of the correlation values over the eight tests is taken as the overall performance of each objective measure for the given number of filters.
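The selection procedure amounts to averaging the per-condition correlations for each filter count and keeping the count with the highest average. A minimal sketch follows; the function names are hypothetical and the numbers are placeholders for illustration only, not results from the paper.

```python
def average_correlation(per_condition):
    """Mean correlation over the test conditions for one filter-bank size."""
    return sum(per_condition) / len(per_condition)


def best_filter_count(correlations):
    """Pick the filter count with the highest average correlation.

    correlations maps a filter count to its list of per-condition values.
    """
    averages = {n: average_correlation(v) for n, v in correlations.items()}
    best = max(averages, key=averages.get)
    return best, averages


# Placeholder numbers for illustration only, not results from the paper.
toy = {7: [0.90, 0.92], 10: [0.93, 0.95], 13: [0.91, 0.92]}
best, averages = best_filter_count(toy)
```

The same averaging can be reused for any other swept parameter, as long as each setting is evaluated on the same set of test conditions.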

The overall performance over the eight tests is shown in Fig. 3. Table 1 lists, for Mel-CD and Mel-SD, the number of filters giving the best overall performance and the corresponding average correlation value.

Figure 3 and Table 1 show that, when the number of filters is between 7 and 13, the correlation value of Mel-CD lies between 0.91 and 0.92 and remains flat with slight fluctuations; the maximum correlation value, 0.9166, is reached with 10 filters. As the number of filters increases further, the performance decreases monotonically. When the number of filters is greater than 15, the correlation value drops below 0.9 and continues to fall as the number of filters increases; with 25 filters, the correlation value is only slightly above 0.85.

As the number of filters increases, the overall performance of Mel-SD remains stable. For 7-25 filters, the correlation value stays between 0.94 and 0.95. The maximum correlation value, 0.9445, is reached with 10 filters; as the number of filters increases, the correlation value decreases slightly but the curve is essentially flat.

Comparing the two measures, Mel-SD performs significantly better than Mel-CD. Over the entire range of filter numbers, the minimum correlation value of Mel-SD is greater than 0.94, which is higher even than the best correlation value of Mel-CD, which is below 0.92.

Table 1:	Optimal number of filters and the best value of objective evaluation


Fig. 3:	Relationship between Mel-CD, Mel-SD performance and the filter number


Fig. 4:	Relationship between compression factor and Mel-SD performance (No. of filters = 10)

The performance of Mel-SD remains flat over the entire range of filter numbers, so it is insensitive to changes in the filter bank. Mel-CD maintains stable performance only for 7-13 filters; as the number of filters increases further, its performance decreases monotonically, showing a lack of robustness to filter bank changes.

The results show that Mel-SD performs better than Mel-CD and is robust to changes in the structure of the filter bank, whereas Mel-CD performs well only when the number of filters is no more than 13. When Mel-CD is used in practice, the number of filters must not be too large if the accuracy and usability of the objective evaluation are to be ensured. Although Mel-SD is robust to changes in the structure of the filter bank, its best performance is obtained with 10 filters, so a small number of filters should also be selected in practice, both to ensure performance and to reduce computational complexity.

Based on the above tests and analysis, a filter bank with a small number of filters should be chosen for both Mel-CD and Mel-SD in practice. The test results show that both measures perform relatively well with 7-13 filters. Both Mel-CD and Mel-SD achieve their best average test performance with 10 filters, so the best number of filters is 10.

Relationship between Mel-SD and compression transform factor: For Mel-SD, Chen and Jin (2006) chose the cube root function to represent the relationship between voice intensity and perceived loudness. This relationship is an approximate expression of results from static measurements in psychoacoustic experiments. Objective evaluation of voice quality, however, involves dynamically changing speech, which raises two questions:

•	Whether this relationship is suitable for voice quality assessment
•	Whether a static relationship is suitable for an evaluation that involves dynamic changes

Table 2:	Mel-SD optimal correlation value and the corresponding compression factor (Filter = 10)

Based on the optimized number of filters, the relationship between the compression transform and Mel-SD performance was studied. A power function is used as the compression function; its exponent is called the compression factor and is required to be less than one. Guided by experimental knowledge and experience, the compression factor was varied over the range 0.20-0.53 to determine how it affects the evaluation performance of Mel-SD. The test conditions are the same as in the previous section and the average performance over the eight tests is used as the overall evaluation performance.

Figure 4 plots the evaluation performance of Mel-SD against the compression factor for a filter bank of 10 filters. For comparison, the figure also shows the performance of Mel-CD with 10 filters as a benchmark. Because Mel-CD does not depend on the compression factor, it appears as a straight line in Fig. 4.

Figure 4 and Table 2 show that, as the compression factor increases, the performance of Mel-SD first increases monotonically, reaches a maximum at a compression factor of 0.27 and then decreases monotonically. Over the tested range of compression factors, the performance varies between 0.935 and 0.95. The correlation value is 0.9445 at a compression factor of 0.27 and 0.9437 at 0.33, so the two differ very little. For compression factors below 0.4, the performance is above 0.94; for compression factors above 0.4, the performance begins to decline significantly. At a compression factor of 0.53, the correlation value is about 0.01 below the maximum but is still better than the Mel-CD value of 0.9219.

This analysis shows that, for the optimized number of filters, Mel-SD has an optimal compression factor. Within a certain range, the compression factor has only a minor influence and Mel-SD always performs better than Mel-CD. The best compression factor, 0.27, is close to the approximate expression derived from static psychoacoustic measurements, which confirms that this basic relationship is suitable for voice quality assessment.

DISCUSSION

Parameter optimization of Mel-SD and Mel-CD for objective speech quality evaluation tests: Based on the analysis in the two preceding sections, a Mel filter bank with 10 filters is used for both Mel-CD and Mel-SD and the compression factor of Mel-SD is set to 0.27 (Barnwell III and Bush, 1978; Barnwell III, 1979, 1980a, b; Barnwell III and Quackenbush, 1982).

The optimized Mel-CD and Mel-SD were used for objective speech quality evaluation of a communication system under interference conditions. For comparison, the evaluation results of the ITU-T P.862 PESQ standard (ITU., 2001) are used as the baseline performance.

Table 3 shows the results of PESQ, Mel-CD and Mel-SD in the eight performance tests and Fig. 5 is a histogram of the evaluation performance of the three measures.

The results in Table 3 and Fig. 5 show that, of the eight conditions tested, Mel-SD and Mel-CD both perform better than PESQ in all but conditions 3 and 4.


Fig. 5:	Correlation values of PESQ and the parameter-optimized Mel-CD and Mel-SD in the evaluation experiments

Table 3:	Correlation values ρ between subjective and objective scores and estimated deviations σ_sse for PESQ and the parameter-optimized Mel-CD and Mel-SD in the evaluation experiments

The test material for conditions 3 and 4 contains delay variations within the speech signal. Unlike PESQ, Mel-SD and Mel-CD do not include an additional re-alignment stage, so PESQ performs better than Mel-SD under conditions 3 and 4. Under conditions 1 and 2, PESQ performs poorly, Mel-CD is better than PESQ and Mel-SD performs well.

Comparing Mel-SD with Mel-CD, Mel-SD is better than Mel-CD in all eight tests. To compare the overall performance of the three objective measures, the averages over the eight tests are given in Table 4.

Table 4 gives an overall comparison of the three measures across the eight tests. The average correlation value of Mel-SD is 0.9445, which is 0.0342 higher than the average correlation value of PESQ, a performance improvement of 3.7%. The estimated average deviation of Mel-SD is 24% lower than that of PESQ.

Table 4:	Average correlation values and estimated average deviations between subjective and objective evaluation for PESQ and the parameter-optimized Mel-CD and Mel-SD

Table 5:	Average correlation values and estimated average deviations between subjective and objective evaluation for Mel-CD and Mel-SD without parameter optimization

The average correlation value of Mel-SD is 0.0279 higher than that of Mel-CD, a performance improvement of 3.0%; the estimated average deviation of Mel-SD is 17.1% lower than that of Mel-CD.
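The relative improvements quoted above can be checked from the absolute figures in the text. In the sketch below the baseline averages are derived by subtraction rather than read from Table 4, and the variable names are illustrative.

```python
mel_sd = 0.9445            # average correlation of Mel-SD (from the text)
gain_over_pesq = 0.0342    # absolute gain over PESQ (from the text)
gain_over_mel_cd = 0.0279  # absolute gain over Mel-CD (from the text)

pesq = mel_sd - gain_over_pesq      # implied PESQ average, about 0.9103
mel_cd = mel_sd - gain_over_mel_cd  # implied Mel-CD average, about 0.9166

rel_pesq = 100.0 * gain_over_pesq / pesq        # about 3.76 %
rel_mel_cd = 100.0 * gain_over_mel_cd / mel_cd  # about 3.04 %
```

The results, roughly 3.76% and 3.04%, agree with the quoted 3.7% and 3.0% if the paper truncated rather than rounded the first figure.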

The average correlation value of Mel-CD is 0.9166, which is comparable to the average correlation value of PESQ; the estimated average deviation of Mel-CD is 8.3% lower than that of PESQ.

The evaluation results show that Mel-SD performs best, better than both Mel-CD and PESQ, and that the performance of Mel-CD is comparable to that of PESQ. Considering that Mel-SD and Mel-CD do not handle the internal misalignment present in conditions 3 and 4, it can be expected that, once such a processing stage is added, the optimized Mel-SD and Mel-CD would likely perform even better relative to PESQ.

Performance comparison of Mel-SD and Mel-CD with and without parameter optimization: To show the influence of parameter optimization on performance, Table 5 gives the results of Mel-CD and Mel-SD without parameter optimization, for which the number of filters in the filter bank is 24 and the compression factor is 0.33.

A comparison of Tables 4 and 5 shows that parameter optimization improves the performance of both Mel-CD and Mel-SD. In particular, the average correlation value of the optimized Mel-CD is 0.0636 higher than that of the non-optimized version, a relative improvement of 7.5%, and the average estimation error is reduced by 20.4%. The optimized Mel-SD performs slightly better than the non-optimized version, but the change is small, which again shows the robustness of Mel-SD, especially with respect to filter bank design. Table 4 shows that the performance of the parameter-optimized Mel-CD is equivalent to that of PESQ, whereas Table 5 shows that the performance of Mel-CD is not ideal if the filter bank is not designed appropriately.

CONCLUSION

In this study, the MFSC feature parameters used by Mel-SD were compared with the MFCC feature parameters used by Mel-CD and their similarities, differences and connections were identified. Because the Mel domain filter bank is an important part of objective measures in the Mel domain, the effect of the Mel filter bank on the performance of the two measures was studied. The tests show that Mel-SD is robust to changes in the structure of the filter bank and performs better than Mel-CD, whereas Mel-CD is more sensitive to changes in the filter structure: once the number of filters exceeds 13, its performance degrades as the number of filters increases. Considering both overall performance and computational complexity, both measures should use between 7 and 13 filters. Ten filters gave the best results for both measures in the tests.

For a given number of filters, Mel-SD has an optimal compression factor. Within a certain range, the compression factor has only a minor influence and the performance remains better than that of Mel-CD. The best compression factor, 0.27, is basically in line with the approximate expression derived from static psychoacoustic measurements, which confirms that the basic intensity-loudness relationship is suitable for speech quality assessment.

The optimized Mel-CD and Mel-SD were applied to objective voice quality evaluation of a communication system under interference conditions. The results show that Mel-SD performs best, better than both Mel-CD and PESQ, and that the performance of Mel-CD is comparable to that of PESQ. A comparison of the results before and after parameter optimization shows that optimization significantly improves the performance of Mel-CD and further confirms the robustness of Mel-SD to parameter changes.

In summary, reasonable parameter optimization of Mel domain voice quality evaluation measures ensures good evaluation performance while avoiding unnecessary computational complexity. With appropriately selected filter parameters, Mel-CD achieves evaluation performance equivalent to that of PESQ and Mel-SD shows good performance and robustness to parameter variation.

REFERENCES

Chen, G., X.L. Hu, Y.Y. Zhang and Y.T. Zhu, 2001. Research advance on objective measures of speech quality. Acta Electronica Sinica, 29: 548-552.
Chen, H. and F. Jin, 2006. Mel-spectral distortion measure based on perception model for objective speech quality assessment. J. Southwest Jiaotong Univ., 41: 723-726.
Fu, Q., K.C. Yi, B. Tian and Z.Y. Zhang, 2001. One-step strategy of speech quality objective assessment. Acta Electronica Sinica, 29: 885-887.
Huang, H.M., Y. Wang, S.W. Zhao and Z.Y. Zhang, 2000. Study of objective quality evaluation for the speech systems. Acta Electronica Sinica, 28: 112-114.
ITU., 2001. Perceptual Evaluation of Speech Quality (PESQ): An objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs. ITU-T Recommendation P.862, International Telecommunication Union. http://www.itu.int/rec/T-REC-P.862-200102-I.
Kitawaki, N., M. Honda and K. Itoh, 1984. Speech-quality assessment methods for speech-coding systems. IEEE Commun. Mag., 22: 26-33.
French, N.R. and J.C. Steinberg, 1947. Factors governing the intelligibility of speech sounds. J. Acoust. Soc. Am., 19: 90-119.
Quackenbush, S.R., T.P. Barnwell III and M.A. Clements, 1988. Objective Measures of Speech Quality. Prentice Hall, Englewood Cliffs, New Jersey, USA., ISBN-13: 978-0136290568, Pages: 377.
Kubichek, R.F., 1993. Mel-cepstral distance measure for objective speech quality assessment. Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, Volume 1, May 19-21, 1993, Victoria, BC., pp: 125-128.
Voiers, W.D., 1977. Diagnostic Evaluation of Speech Intelligibility. In: Speech Intelligibility and Speaker Recognition, Hawley, M.E. (Ed.). Chapter 34, Dowden, Hutchinson and Ross Publisher, Stroudsburg, PA., USA., ISBN-13: 9780879332990, pp: 374-387.
Wang, S., A. Sekey and A. Gersho, 1992. An objective measure for predicting subjective quality of speech coders. IEEE J. Sel. Areas Commun., 10: 819-829.
Barnwell III, T.P. and A. Bush, 1978. Statistical correlation between objective and subjective measures for speech quality. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Volume 3, April 10-12, 1978, Tulsa, Oklahoma, pp: 595-598.
Barnwell III, T.P., 1979. Objective measures for speech quality testing. J. Acoustical Soc. Am., 66: 1658-1663.
Barnwell III, T.P., 1980a. Correlation analysis of subjective and objective measures for speech quality. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Volume 5, April 9-11, 1980, Denver, CO., USA., pp: 706-709.
Barnwell III, T.P., 1980b. A comparison of parametrically different objective speech quality measures using correlation analysis with subjective quality results. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Volume 5, April 9-11, 1980, Denver, CO., USA., pp: 710-713.
Barnwell III, T.P. and S.R. Quackenbush, 1982. An analysis of objectively computable measures for speech quality testing. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Volume 7, May 3-5, 1982, Paris, France, pp: 996-999.

Journal of Software Engineering

Research Article