
Journal of Software Engineering

Year: 2017 | Volume: 11 | Issue: 1 | Page No.: 22-31
DOI: 10.3923/jse.2017.22.31
Strong Robustness Hash Algorithm of Speech Perception Based on Tensor Decomposition Model
Yibo Huang and QiuYu Zhang

Abstract: Background: With the constant progress of modern speech communication technology, speech transmitted in mobile environments is prone to being corrupted by noise or maliciously tampered with. Existing speech authentication algorithms are inefficient and complicated and cannot meet the real-time requirements of speech communication in the mobile computing environment. Materials and Methods: In order to give the speech perception Hash algorithm strong robustness and high authentication efficiency under common background noise, this study proposes a speech perception Hash algorithm based on tensor reconstruction and decomposition. The algorithm analyzes the speech perception features from a three-dimensional perspective and obtains the speech components through wavelet packet decomposition. The MFCC and ΔMFCC features of each speech component are extracted to constitute the speech feature tensor. The feature tensor is then reduced by tensor decomposition to lower its complexity. Speech authentication is performed by generating Hash values from the quantified feature matrix using the mid-value. Results: Experimental results show that the proposed algorithm is robust against content-preserving operations. It is able to resist the background noise attacks commonly encountered during communication. The algorithm is also computationally efficient, so it can meet the real-time requirements of speech communication and complete speech authentication quickly. Conclusion: Compared with common algorithms, this algorithm has better authentication performance: it effectively improves accuracy and real-time performance and can control the tensor size as required. Its model building is flexible and it can realize both speech content authentication and speaker authentication, so the algorithm has high practical value.


How to cite this article
Yibo Huang and QiuYu Zhang, 2017. Strong Robustness Hash Algorithm of Speech Perception Based on Tensor Decomposition Model. Journal of Software Engineering, 11: 22-31.

Keywords: Information security technology, perceptual speech Hash, strong robustness, tensor decomposition model and background noise

INTRODUCTION

During transmission over a channel, the speech signal is easily disturbed. Therefore, when calculating the perceptual speech Hash, it is necessary to consider whether the features can be extracted completely and accurately; the features should be as robust as possible, minimally coupled, easy to compute and suitable for real-time use. The extraction of speech perception feature values is the key to speech perceptual authentication. Current approaches to extracting and computing speech perception feature values are based on the psychoacoustic model of the human ear, drawing on spectrum coefficients1, LPC2, line spectrum frequencies3 and frequency cepstral coefficients4. A previous study5 proposed a speech perception Hash algorithm based on MFCC and NMF: it applies singular value decomposition to obtain the audio information and then performs non-negative matrix factorization (NMF), which reduces errors and yields a satisfactory Hash function. The experiments show that the robustness is improved but the discrimination is relatively poor; because the principal component analysis method is used in the algorithm, its time complexity is relatively large and it cannot meet the requirements of real-time speech authentication.

To sum up, this study introduces the method of tensor decomposition into the speech perception Hash process. The wavelet packet decomposition coefficients of the speech are reconstructed into a tensor; the speech tensor analyzes the phonetic features along three dimensions, capturing the relationship among the speech frame structure, the decomposition scale and the feature parameters, so it has better performance for speech feature analysis. Tensor decomposition of the reconstructed speech tensor is then used to reduce the computational complexity of the feature tensor.

MATERIALS AND METHODS

Tensor decomposition: Over the past 10 years, tensors have been widely used in the field of information processing. A tensor can be considered as a product of vector spaces and is a higher-order generalization of vectors and matrices; an N-order tensor can be expressed as $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. Tensor decomposition includes two kinds of decomposition methods: CANDECOMP/PARAFAC and Tucker6. Tensor decomposition methods are widely used in image processing, pattern recognition, data compression and so on7. Figure 1 illustrates the principle of the third-order Tucker decomposition.
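In the standard notation, which is assumed here and matches the projection-matrix symbols used later, the third-order Tucker decomposition illustrated in Fig. 1 can be written as:

$$\mathcal{X} \approx \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)}$$

where $\mathcal{G} \in \mathbb{R}^{J_1 \times J_2 \times J_3}$ (with $J_n \le I_n$) is the core tensor and $U^{(n)} \in \mathbb{R}^{I_n \times J_n}$ is the mode-n projection matrix.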

CANDECOMP/PARAFAC decomposition and Tucker decomposition are higher-order generalizations of the Singular Value Decomposition (SVD) used in matrix decomposition8.

Wavelet packet transform: The Discrete Wavelet Transform (DWT) has the ability to accurately characterize local details of speech signals9,10. The Wavelet Packet Transform (WPT), as a further expansion of wavelet analysis theory, can adaptively select frequency bands according to the features of the signal being analyzed, which improves the match with the signal spectrum and thus the time-frequency resolution. Therefore, wavelet packet decomposition can reflect the features and nature of the signal and is very suitable for analyzing and processing non-stationary signals such as speech11-13. The principle of K-level wavelet packet decomposition is shown in Fig. 2. Defining the subspace $U_i^m$ as the closure space of $u_m(t)$ and $U_i^{2m}$ as that of $u_{2m}(t)$, the speech signal is decomposed through the recursive wavelet packet equations:
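Assuming the standard two-branch wavelet packet recursion and the filter naming used below ($g(n)$ low-pass, $h(n)$ high-pass), the decomposition equations are:

$$u_{2m}(t) = \sqrt{2}\sum_{n} g(n)\,u_m(2t-n)$$

$$u_{2m+1}(t) = \sqrt{2}\sum_{n} h(n)\,u_m(2t-n)$$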


Fig. 1: Tucker decomposition of third-order tensor

where, h(n) is a high-pass filter group, g(n) is a low-pass filter group, g(n) = (-1)^n h(1-n) and the two filter coefficient sequences are orthogonal to each other. The wavelet function ψ(t) and the scaling function φ(t) satisfy the two-scale equations:
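A standard form of these two-scale relations, again with $g(n)$ as the low-pass and $h(n)$ as the high-pass filter, is assumed here:

$$\varphi(t) = \sqrt{2}\sum_{n} g(n)\,\varphi(2t-n)$$

$$\psi(t) = \sqrt{2}\sum_{n} h(n)\,\varphi(2t-n)$$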

MFCC feature parameter extraction: When phonetic features are extracted in practice, the MFCC14 are mostly used as the feature vector15. The Mel scale describes the non-linear characteristic of the human ear's frequency perception. Its relation with the actual frequency can be approximated as follows:
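The widely used form of the Mel mapping, with the conventional constants, is assumed here:

$$\mathrm{Mel}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700}\right),\qquad 0 \le f \le F_n$$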

where, f is the actual speech frequency, Fn is the Nyquist frequency of the input speech signal, Fn = fmax/2 and fmax is the sampling frequency of the speech signal.

The standard MFCC only reflects the static features of speech. However, human ears are more sensitive to the dynamic features of speech. Therefore, the extracted MFCC features can be differentiated and the differential MFCC is expressed as ΔMFCC:
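Assuming the conventional regression-style definition of the first-order difference, which matches the description of k below, ΔMFCC can be written as:

$$d(n) = \frac{\sum_{i=1}^{k} i\,\bigl(c(n+i) - c(n-i)\bigr)}{2\sum_{i=1}^{k} i^{2}}$$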

where, d(n) signifies the n-th first-order difference coefficient, c(n) signifies the n-th MFCC and k is a constant generally set to 1 or 2; if k is 2, the first-order difference coefficient is a linear combination of the coefficients of the previous two frames and the following two frames.

Establishment of the feature system of the speech tensor: The construction of the speech tensor, which analyzes speech features from a three-dimensional perspective, can better show the relationship among the speech frame structure, the decomposition scale and the feature parameters, so it gives a better analytical performance for speech features. Figure 3 shows the schematic diagram for the construction of the speech tensor, which directly describes its structure.

The speech features are described from three perspectives: the speech frame, the wavelet packet decomposition scale and the MFCC and ΔMFCC feature parameters. The speech frame mainly describes the precedence relationship of speech, that is, the relationship of the speech features on the time scale. The wavelet packet decomposition scale applies wavelet packet decomposition to each frame of the speech signal so as to obtain the approximate and detailed components of each frame at different scales.

Fig. 2: Decomposition graph of K level wavelet packet

Fig. 3: Structure graph of speech tensor

The MFCC and ΔMFCC feature parameters are extracted from the components produced by each wavelet packet decomposition to obtain the component features. A section of speech can thus be constructed into a tensor from the above three perspectives. The constructed tensor is the third-order speech tensor of speech frame × wavelet packet decomposition scale × MFCC and ΔMFCC feature parameters.

The main steps for the construction of speech tensor are shown below:

Step 1: Carry out pre-emphasis, framing and windowing of the speech signal
Step 2: Carry out wavelet packet decomposition for each windowed speech frame
Step 3: Extract the MFCC coefficients of each wavelet component and compute the ΔMFCC
Step 4: Combine the MFCC and ΔMFCC of the components of each frame to obtain the feature matrix of that frame
Step 5: Arrange the feature matrices of the frames in the order of the speech frames to obtain the third-order speech tensor
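As an illustration of these five steps, the following is a minimal Python sketch using the pywt and librosa packages. The frame length, wavelet, decomposition level, MFCC order and the use of per-component MFCC/ΔMFCC averages are illustrative assumptions rather than the exact settings of this study.

import numpy as np
import pywt                      # wavelet packet decomposition
import librosa                   # MFCC and delta-MFCC features

def speech_feature_tensor(x, sr, frame_len=2048, level=3, n_mfcc=6):
    """Build a (frames x wavelet components x MFCC+dMFCC) tensor from 1-D speech x."""
    # Step 1: pre-emphasis, framing and Hamming windowing
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])
    hop, window = frame_len // 2, np.hamming(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]

    tensor = []
    for frame in frames:
        # Step 2: wavelet packet decomposition of the windowed frame
        wp = pywt.WaveletPacket(data=frame, wavelet='db4', maxlevel=level)
        components = [node.data for node in wp.get_level(level, order='natural')]

        # Steps 3 and 4: MFCC and delta-MFCC of every component form the frame's
        # feature matrix (the delta is taken inside each component here, which is
        # a simplification of the frame-to-frame difference described in the text)
        rows = []
        for comp in components:
            mfcc = librosa.feature.mfcc(y=comp, sr=sr, n_mfcc=n_mfcc,
                                        n_fft=128, hop_length=32, n_mels=32)
            dmfcc = librosa.feature.delta(mfcc, width=3)
            rows.append(np.concatenate([mfcc.mean(axis=1), dmfcc.mean(axis=1)]))
        tensor.append(np.array(rows))

    # Step 5: stack the per-frame matrices into the third-order speech tensor
    return np.stack(tensor)      # shape: (frames, 2**level, 2 * n_mfcc)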

Figure 4 shows the construction method for the speech feature tensor adopted in this study. The construction diagram consists of speech signal preprocessing, speech feature extraction and speech feature tensor construction.

Quantization: Reconstruct the core tensor G into a two-dimensional feature matrix and calculate the sum of each column of the matrix:

$$R_h(j) = \sum_{i=1}^{k} g(i,j)$$

where, $g(i,j)$ signifies the feature coefficient in row i and column j of the feature matrix and k is the number of rows of the feature matrix.

Quantize the resulting coefficient row vector $R_h(j)$ to form the Hash value h(j) of the speech segment:

$$h(j) = \begin{cases} 1, & R_h(j) \ge m \\ 0, & R_h(j) < m \end{cases}$$

where, m is the mid-value (median) of $R_h(j)$.

Fig. 4: Structure graph of speech feature tensor

Fig. 5: Flow chart of speech perception Hash authentication based on tensor decomposition

Speech perception Hash authentication scheme: Figure 4 describes the construction of the speech tensor. After the speech tensor is constructed, it is decomposed to obtain the core tensor G and the projection matrices U(1), U(2) and U(3) of the speech frame, decomposition scale and feature parameter modes; U(n) is usually orthogonal. These projection matrices can be considered as the principal components of the speech tensor in each mode. Since the core tensor G is smaller than the original tensor X, G can be considered as a compressed form of the original tensor X. In this algorithm, the core tensor G is used to describe the speech features. The flow chart of speech perception Hash authentication based on tensor decomposition is shown in Fig. 5.
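The decomposition step can be sketched with the tensorly package as follows; the core ranks, the unfolding used to form the two-dimensional feature matrix and the placeholder tensor are illustrative assumptions, not the exact configuration of this study.

import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# X: third-order speech feature tensor (frames x components x features),
# e.g. the output of the speech_feature_tensor() sketch above.
X = tl.tensor(np.random.rand(120, 8, 12))     # placeholder data

# Tucker decomposition: a small core tensor G plus one projection matrix per mode.
# The core ranks (20, 4, 6) are assumptions controlling the compressed size.
core, factors = tucker(X, rank=[20, 4, 6])
G = core                                      # compressed description of the speech
U1, U2, U3 = factors                          # projection matrices U(1), U(2), U(3)

# Reconstruct G into a two-dimensional feature matrix for the quantization step,
# here by unfolding the core along the frame mode (an assumption).
feature_matrix = tl.unfold(G, mode=0)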

The detailed steps of the algorithm are shown below:

Step 1: Preprocessing: Conduct pre-emphasis on the speech in the speech library to be tested to enhance the useful high-frequency spectrum, reduce the edge effect and eliminate noise
Step 2: Framing and windowing: In order to avoid inter-frame loss, frame the speech x(t) with frame length L and frame shift L/2 to obtain s(n), then apply a Hamming window to s(n) to obtain sw(n), where n is the frame number
Step 3: Construction of the speech feature tensor: Carry out wavelet packet decomposition for each speech frame, calculate the MFCC and ΔMFCC coefficients of each frame and construct the feature coefficients into a tensor to obtain the speech feature tensor X
Step 4: Decomposition of the speech tensor: Carry out Tucker decomposition of the feature tensor X to obtain the low-dimensional core tensor G and the projection matrices U(n)
Step 5: Quantization: Reconstruct the core tensor G into a feature matrix to obtain the sequence Rh(j) and quantize Rh(j) to obtain the perception Hash sequence h(j)
Step 6: Calculation and matching of the perception Hash distance: Suppose that there are two speech segments α and β with Hash sequences $h_\alpha(:)$ and $h_\beta(:)$ of length N; define the Hash matching distance Dh(:,:) as the bit error rate between the two Hash sequences:

$$D_h(\alpha,\beta) = \frac{1}{N}\sum_{j=1}^{N}\bigl|h_\alpha(j) - h_\beta(j)\bigr|$$

Match according to the hypothesis testing of the Hash distance Dh(:,:) and Hash sequence h(:) as follows:

K1: If the perception contents of the two speech segments α and β are the same:

$$D_h(\alpha,\beta) < \tau$$

K2: If the perception contents of the two speech segments α and β are different:

$$D_h(\alpha,\beta) \ge \tau$$

where, τ is the matching threshold. The matching threshold can be used to determine whether the perception contents of speech signals are the same, so as to realize the perception Hash authentication of speech signals.
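As a concrete illustration of the quantization and matching steps, the following is a minimal numpy sketch; the function names, the 0/1 mid-value quantization and the default threshold are illustrative assumptions consistent with the description above (the 0.32 default is taken from the lower end of the judging domain reported in the Results).

import numpy as np

def quantize_hash(feature_matrix):
    """Column sums of the core-tensor feature matrix, binarized at the mid-value."""
    r = feature_matrix.sum(axis=0)            # Rh(j): sum of each column
    m = np.median(r)                          # mid-value of Rh(j)
    return (r >= m).astype(np.uint8)          # h(j): binary perceptual Hash sequence

def hash_distance(h_a, h_b):
    """Bit error rate (normalized Hamming distance) between two Hash sequences."""
    return np.mean(h_a != h_b)

def same_content(h_a, h_b, tau=0.32):
    """Hypothesis K1 (same perceptual content) holds when the distance is below tau."""
    return hash_distance(h_a, h_b) < tau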

RESULTS

For the experiment in this study, MATLAB 2010b was used for simulation with the TIMIT speech library and the Noisex-92 noise library. The speech perception Hash function must satisfy the properties of a perception Hash function, including discrimination, uni-directionality and robustness, so the proposed algorithm is evaluated from these aspects. Tensor decomposition can be considered as a higher-order extension of the SVD algorithm and the Tucker decomposition used in this algorithm is solved as an optimization problem, which makes it similar to the NMF algorithm. Therefore, the contrastive algorithms are the MFCC and SVD5 algorithm and the LPC and NMF16 algorithm.

Discrimination: Discrimination is mainly used to evaluate the reliability of the algorithm in distinguishing different speech contents read by different or the same persons. Since the Bit Error Rates (BER) of different speech segments are random variables, this experiment analyzes the discrimination of the algorithm with the probability distribution curve. Every two speech segments in the speech library are compared and the resulting BER normal distribution diagram is shown in Fig. 6.

Fig. 6(a-c): Bit error rate normal distribution graphs of different perception Hashing algorithms, (a) Proposed algorithm, (b) LPC and NMF and (c) MFCC and SVD

When the error rate is used as the distance measure, it should approximately obey the normal distribution. It can be seen from Fig. 6 that the probability curve of the standard normal distribution overlaps the probability distribution of the BER values of this algorithm, so the Hash distance obtained through this algorithm approximately obeys the normal distribution; that is, speech segments with different perceptual content will generate different Hash values.

In the ideal case, any speech segments with different contents will have different perception Hash values and any pair of matched Hash values should have a high bit error rate. However, there are always a few low BER values, possibly lower than the threshold, which are then wrongly judged as the same content. According to Table 1, the False Acceptance Rate (FAR) increases as the BER threshold is enlarged. Compared with the other two algorithms, the algorithm proposed in this study has strong collision resistance. When the threshold value τ = 0.25, the collision probability is about 4 collisions among 10^12 speech segments. When τ = 0.27, it is about 6 collisions among 10^9 speech segments. When τ = 0.30, it is about 2 collisions among 10^8 speech segments. It can be seen from Fig. 6 that about 2 segments among 10^5 speech segments will collide when the threshold value τ = 0.35. As indicated in Table 1, compared with the other two algorithms, this algorithm is very stable in collision resistance. Therefore, this algorithm can correctly identify the authenticated speech segments.
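Because the BER between perceptually different segments approximately follows a normal distribution, the FAR at a threshold τ can be estimated from the fitted mean and standard deviation, as in the following scipy sketch (the BER values below are synthetic placeholders, not the measured statistics of this experiment):

import numpy as np
from scipy.stats import norm

def far_from_normal_fit(ber_values, tau):
    """Estimate the False Acceptance Rate at threshold tau, assuming the BER of
    different-content pairs is approximately normally distributed."""
    mu, sigma = np.mean(ber_values), np.std(ber_values)
    return norm.cdf((tau - mu) / sigma)       # P(BER < tau) under the fitted normal

# Synthetic example (placeholder mean and spread, not the measured values):
rng = np.random.default_rng(0)
fake_ber = rng.normal(loc=0.5, scale=0.03, size=10000)
for tau in (0.25, 0.27, 0.30, 0.35):
    print(tau, far_from_normal_fit(fake_ber, tau))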

Robustness: The speech perception Hash robustness is mainly used to evaluate the reliability of authenticating the same speech after different content-holding operations. Five types of content-holding operations are applied to the speech: a 50% volume increase, a 50% volume decrease, resampling, a 50 dB Gaussian white noise attack and 4 kHz low-pass filtering. As shown in Table 2, -50% volume, +50% volume and resampling do not change the vocal tract features of the speech signal, so they do not affect the accuracy of speech authentication.

Table 1: False acceptance rate of the algorithm at different thresholds

Generally, human speech lies between 300 Hz and 3 kHz, so 4 kHz low-pass filtering will not alter the vocal tract features of the speech but will filter out sharp noise. According to the data in Table 2, the robustness of the proposed algorithm against content-holding operations is generally stronger than that of the other two algorithms.

Based on the BER of the content-holding operations, the False Acceptance Rate (FAR) and False Rejection Rate (FRR) are obtained. The FAR-FRR curves are drawn in Fig. 7.

As indicated by the results in Fig. 7, the algorithm's FRR curve does not intersect the FAR curve; the FRR curve converges significantly and there is a very wide judging domain, with the judging threshold between 0.32 and 0.36. Compared with the other two algorithms, this algorithm has good robustness, can correctly authenticate the same and different speech segments and can authenticate speech segments that have undergone content-holding operations or malicious attacks. Therefore, compared with the other two algorithms, this one has good discrimination and robustness.

Robustness against common background noise: Common background noises encountered during speech communication, including White noise, Pink noise, Factory floor noise 1, Factory floor noise 2, Speech Babble and Volvo noise from the Noisex-92 library, are added to the speech. The signal-to-noise ratios of the added noises are 0, 10, 20, 30, 40 and 50 dB, respectively.
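A minimal numpy sketch of the standard way to scale a noise recording and add it to clean speech at a target SNR is shown below; this illustrates the attack setup and is not necessarily the exact mixing procedure used in the experiment:

import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` and add it to `clean` so the mixture has the requested SNR in dB."""
    noise = np.resize(noise, clean.shape)     # loop or trim the noise to match length
    p_clean = np.mean(clean ** 2)             # signal power
    p_noise = np.mean(noise ** 2)             # noise power before scaling
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Example attack at the SNRs used in the experiment (speech and babble_noise are
# hypothetical 1-D float arrays loaded from TIMIT and Noisex-92):
# for snr in (0, 10, 20, 30, 40, 50):
#     noisy = add_noise_at_snr(speech, babble_noise, snr)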

As shown in Fig. 8, this algorithm has extremely strong robustness against the Gaussian white noise attack and the Babble noise attack; its robustness is obviously stronger than that of the other two algorithms. Its robustness against the other noises is at a middle level. For the Pink noise attack, the passing rates of the three algorithms are all 100%, so the robustness of the algorithms must be evaluated using the mean and span of the matched BER values.

Table 2: Matching rate (%) of speech authentication after content-holding operations

Fig. 7(a-c): FAR-FRR curves of different perceptual Hashing algorithms, (a) Proposed algorithm, (b) LPC and NMF and (c) MFCC and SVD

As shown in Fig. 8, this algorithm has strong robustness against common background noises. In particular, its robustness against Gaussian white noise and Babble noise is significantly higher than that of the other two algorithms. Its robustness against Volvo noise, Factory 1 noise, Factory 2 noise and Pink noise is at a middle level. The passing rate of authentication matching is very high for the various signal-to-noise ratios (SNR). Compared with the NMF algorithm, the SVD decomposition algorithm has stronger robustness against the Pink noise attack and the feature values decomposed through the SVD algorithm have higher stability. Therefore, the algorithm proposed in this study has strong overall robustness against common noises and can meet the practical need of speech matching in daily life.

Efficiency analysis: In order to measure the computational efficiency of the proposed algorithm, 100 speech segments were randomly extracted from the speech library to count the average operating time of the algorithm.

Table 3: Algorithm efficiency (time in sec)

As shown in Table 3, compared with the other two algorithms, this algorithm must conduct wavelet packet decomposition before feature extraction and then reconstruct the extracted features into a tensor, so its operating time overhead is higher. However, on the premise of enhanced robustness, there is no great increase in the overall operating time and the operating efficiency is not greatly affected, so this algorithm can meet the real-time requirements of speech communication.

Fig. 8(a-f): Speech authentication passing rate under common background noise attacks, (a) Volvo noise, (b) Babble noise, (c) Factory 1 noise, (d) Factory 2 noise, (e) Pink noise and (f) Gaussian white noise

CONCLUSION

This study proposed an efficient and robust perception Hash speech authentication algorithm based on the tensor decomposition model. From the experimental discussion and analysis, this algorithm outperforms the earlier methods and the conclusions below can be drawn:
As shown in the speech perception Hash discrimination and collision resistance experiments, the highest speech misidentification probability within the threshold range is 1.9921e-008, while for the two contrast algorithms it is 3.54e-008 for MFCC and SVD and 1.978e-007 for LPC and NMF. This means the algorithm has good collision resistance and can meet the needs of practical application
As shown in the robustness experiment, compared with the other two algorithms, the robustness of the proposed algorithm is improved to a certain extent. After content-holding operations, the algorithm can still correctly match the speech; there is an obvious judging domain in the FAR-FRR curve, the scope of the judging domain is between 0.34 and 0.39 and, unlike the other two algorithms, the FRR and FAR curves do not cross
This study specifically addresses the common background noises of daily communications. As shown in the noise attack experiments, this algorithm has strong robustness against common background noises, so it can meet the needs of daily communication against varied dialogue backgrounds. Compared with the other two algorithms, its robustness against common background noise is more stable
This algorithm can control the tensor size as required and its model building is flexible; besides, it can realize both speech content authentication and speaker authentication, so the algorithm has high practical value

ACKNOWLEDGMENT

The researchers would like to thank the National Natural Science Foundation of China (No. 61363078) and the Natural Science Foundation of Gansu Province of China (No. 1606RJYA274) for their support.

REFERENCES

  • Ozer, H., B. Sankur, N. Memon and E. Anarım, 2005. Perceptual audio hashing functions. EURASIP J. Adv. Signal Proc., 12: 1780-1793.


  • Lotia, P. and D.M. Khan, 2013. Significance of complementary spectral features for speaker recognition. Int. J. Res. Comput. Commun. Technol., 2: 579-588.


  • Nouri, M., N. Farhangian, Z. Zeinolabedini and M. Safarinia, 2012. Conceptual authentication speech hashing base upon hypotrochoid graph. Proceedings of the 6th International Symposium on Telecommunications, November 6-8, 2012, Tehran, pp: 1136-1141.


  • Panagiotou, V. and N. Mitianoudis, 2013. PCA summarization for audio song identification using Gaussian mixture models. Proceedings of the 18th International Conference on Digital Signal Processing, July 1-3, 2013, Fira, pp: 1-6.


  • Chen, N., H.D. Xiao and W. Wan, 2011. Audio hash function based on non-negative matrix factorisation of mel-frequency cepstral coefficients. IET Inform. Secur., 5: 19-25.


  • Bader, B.W. and T.G. Kolda, 2006. Efficient MATLAB computations with sparse and factored tensors. SIAM J. Scient. Comput., 30: 205-231.


  • Li, J., L.X. Jin and G.N. Li, 2013. Hyper-spectral remote sensing image compression based on nonnegative tensor factorizations in discrete wavelet domain. J. Electron. Inform. Technol., 35: 489-493.


  • Salmi, J., A. Richter and V. Koivunen, 2009. Sequential unfolding SVD for tensors with applications in array signal processing. IEEE Trans. Signal Process., 57: 4719-4733.


  • Sharma, R. and V.P. Pyara, 2012. A discrete wavelet transform based analysis of sounds of some musical instruments. Proceedings of the 3rd International Conference on Computing Communication and Networking Technologies, July 26-28, 2012, Coimbatore, pp: 1-4.


  • Ali, S.T., J.P. Antoine and J.P. Gazeau, 2014. Discrete Wavelet Transforms. In: Coherent States, Wavelets and their Generalizations, Ali, S.T., J.P. Antoine and J.P. Gazeau (Eds.). Springer, New York, pp: 379-410


  • Yang, Y. and S. Nagarajaiah, 2014. Blind identification of damage in time-varying systems using independent component analysis with wavelet transform. Mech. Syst. Signal Process., 47: 3-20.


  • Sharma, P., K. Khan and K. Ahmad, 2014. Image denoising using local contrast and adaptive mean in wavelet transform domain. Int. J. Wavelets Multiresolution Inform. Process., Vol. 12, No. 6.


  • Sharma, R. and V.P. Pyara, 2013. A robust denoising algorithm for sounds of musical instruments using wavelet packet transform. Circ. Syst., 4: 459-465.


  • Picone, J.W., 1993. Signal modeling techniques in speech recognition. Proc. IEEE, 81: 1215-1247.


  • Hossan, M.A., S. Memon and M.A. Gregory, 2010. A novel approach for MFCC feature extraction. Proceedings of the 4th International Conference on Signal Processing and Communication Systems, December 13-15, 2010, Gold Coast, QLD., pp: 1-5.


  • Chen, N. and W.G. Wan, 2010. Robust speech hash function. ETRI J., 32: 345-347.
