Gray correlation analysis and speech/music signal identification
Source: Application of Electronic Technology. Authors: Chen Gong, Zhang Xiongwei
Abstract: Gray correlation analysis is applied to the classification and recognition of speech and music signals, and the method and procedure for applying it to audio signals are given. Probability-statistical features of the speech and music signals are used to build the reference data and the comparison data of the targets. Gray correlation analysis is then carried out on the speech and music signals, the criterion for target identification and classification is determined, and the two types of signal are identified. Simulation results show that applying gray correlation analysis to audio signal classification and identification is feasible to a certain extent.
Keywords: gray correlation analysis; features; speech and music identification; simulation
Speech and music are the two most important kinds of audio data. Automatic classification of speech and music matters in content-based audio retrieval, video summarization, speech recognition and many other areas.
At present, speech and music signals are mostly identified, in China and abroad, with pattern recognition techniques based on perceptual features (such as loudness, pitch and harmonicity) and on features such as the zero-crossing rate, the power spectrum and MFCC coefficients. However, when the parameters of the object to be identified are incomplete, these methods cannot give correct and reliable results: some signals cannot be identified at all and others are recognized poorly. Because the environment of speech signals is complex and variable, the parameters of speech and music signals are sometimes difficult to obtain completely, so these methods have certain limitations in practical applications.
Therefore, how to use a small amount of available audio data effectively to classify and identify audio signals automatically and accurately, especially speech and music, is receiving increasing attention as an important means of extracting the semantics and structure of audio content. Gray system theory, and in particular the development of gray correlation analysis, provides a way to solve this problem.
Figure 1. RMS of the speech signal: (a) time-domain waveform; (b) probability distribution
1 Gray correlation analysis method for speech/music signals
Gray system theory belongs to the field of systems theory; "gray" means that information is incomplete. It mainly studies the modeling, prediction, decision-making and control of systems whose model is uncertain, whose behavioral information is incomplete and whose operating mechanism is unclear. In correlation analysis of sequences, a reference sequence must first be determined, and the other sequences are then judged by how close they are to it. The main steps of gray correlation analysis are:
(1) determine the reference sequences and the comparison sequences;
(2) calculate the gray correlation coefficient;
(3) calculate the gray correlation degree;
(4) sort by gray correlation degree.
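The four steps can be sketched in a few lines of Python for the simplest case of a single reference sequence. This is a minimal illustration rather than the authors' implementation: the function name, the toy sequences and the default xi = 0.5 (assumed positive) are assumptions made here, and the extrema are taken over the sequences passed in one call.

```python
import numpy as np

def gray_correlation_ranking(reference, comparisons, xi=0.5):
    """Rank comparison sequences by gray correlation degree with one reference."""
    # Step (1): the reference and comparison sequences are the inputs.
    ref = np.asarray(reference, dtype=float)
    comp = np.asarray(comparisons, dtype=float)   # shape (num_sequences, K)
    diff = np.abs(comp - ref)                     # absolute differences per index k
    d_min, d_max = diff.min(), diff.max()
    if d_max == 0.0:
        coeff = np.ones_like(diff)                # every sequence coincides with the reference
    else:
        # Step (2): gray correlation coefficient at every index k.
        coeff = (d_min + xi * d_max) / (diff + xi * d_max)
    # Step (3): correlation degree = mean coefficient over k.
    degree = coeff.mean(axis=1)
    # Step (4): sort from the largest degree to the smallest.
    order = np.argsort(-degree)
    return degree, order

ref = [1.0, 2.0, 3.0, 4.0]              # made-up reference sequence
comps = [[1.0, 2.1, 2.9, 4.2],          # made-up sequence close to the reference
         [1.0, 3.0, 5.0, 7.0]]          # made-up sequence drifting away from it
degree, order = gray_correlation_ranking(ref, comps)
print(degree, order)
```

The sequence that tracks the reference closely receives the larger degree and is ranked first.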
2 Determination of reference sequences and comparison sequences
A speech signal containing a certain amount of pauses and a music signal are selected as the audio signals to be identified. The essence of audio feature extraction is to represent the time-domain audio signal with fewer dimensions. The characteristics of an audio signal can be regarded as essentially unchanged only within time intervals of 5 to 20 ms. This paper therefore uses the probability statistics of the short-time root-mean-square (RMS) energy to extract the features of the speech and music signals.
Fig. 1(a) and Fig. 2(a) show the time-domain waveforms of the root mean square (RMS) of the speech signal and the music signal, respectively. The sampling frequency is 11025 Hz, the rectangular window length N corresponds to 10 ms, and the duration of each signal is 30 s.
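Formula (1) is the short-time RMS of each frame. Written out in a standard form, assuming non-overlapping frames and a frame index m (the index m and the absence of overlap are assumptions, not stated in the text):

```latex
\mathrm{RMS}(m) = \sqrt{\frac{1}{N} \sum_{n=0}^{N-1} x^{2}(mN + n)},
\qquad m = 0, 1, 2, \ldots \tag{1}
```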
In formula (1), x(n) is the audio signal; the rectangular window moves along the signal frame by frame, and each frame has length N.
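A minimal sketch of this frame-by-frame RMS computation, using the sampling frequency and window length given above. The function name, the non-overlapping framing and the dropping of an incomplete final frame are assumptions made for illustration.

```python
import numpy as np

def short_time_rms(x, sample_rate=11025, frame_ms=10.0):
    """RMS of consecutive, non-overlapping rectangular-window frames."""
    x = np.asarray(x, dtype=float)
    n = int(round(sample_rate * frame_ms / 1000.0))   # samples per frame, N
    num_frames = len(x) // n                          # an incomplete last frame is dropped
    frames = x[:num_frames * n].reshape(num_frames, n)
    return np.sqrt(np.mean(frames ** 2, axis=1))

# Synthetic one-second test tone, not real speech or music.
t = np.arange(11025) / 11025.0
rms = short_time_rms(0.5 * np.sin(2 * np.pi * 440.0 * t))
print(len(rms))
```

At 11025 Hz a 10 ms window is 110 samples after rounding, so a 30 s signal yields roughly 3000 RMS values.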
The probability distributions of the 30 s RMS sequences are shown in Fig. 1(b) and Fig. 2(b). The two distributions differ significantly, and this difference can serve as a basis for distinguishing speech from music signals. Further study shows that both follow generalized chi-square (χ²) distributions with different parameters.
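One way to estimate such a distribution from a sequence of RMS values is a normalized histogram. The sketch below is an assumption about how this could be done, not the authors' procedure: the function name, the choice of 10 equal-width bins and the optional shared value range are hypothetical, and the input is synthetic.

```python
import numpy as np

def rms_distribution(rms, num_bins=10, value_range=None):
    """Empirical probability of RMS values falling into equal-width bins."""
    rms = np.asarray(rms, dtype=float)
    counts, edges = np.histogram(rms, bins=num_bins, range=value_range)
    total = counts.sum()
    if total == 0:
        raise ValueError("no RMS values fall inside the requested range")
    return counts / total, edges

rng = np.random.default_rng(0)
fake_rms = rng.chisquare(df=3, size=3000)   # synthetic stand-in, not real audio
prob, edges = rms_distribution(fake_rms)
print(prob)
```

If two signals are to be compared bin by bin, passing the same value_range to both calls keeps the bins aligned.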
The RMS probability distributions of the 30 s speech and music signals above are selected as the reference sequences, denoted Xj = {xj(k) | k = 1, 2, ..., K}, where X1 is the speech reference sequence and X2 is the music reference sequence. The comparison sequences are denoted Yi = {yi(k) | k = 1, 2, ..., K}, where Y1 is the speech comparison sequence and Y2 is the music comparison sequence. K is the number of features; here K = 10. To test the gray correlation degree of comparison sequences of different lengths, features are extracted from comparison sequences of 0.1 s, 1 s and 10 s. Figure 3 compares their RMS probability distributions with those of the 30 s speech and music reference signals. As can be seen from Fig. 3, the longer the comparison sequence, the closer its probability distribution is to that of the reference sequence; at 10 s the two are almost identical. To ensure that the audio sequences are comparable, the sequences must be initialized before the gray correlation analysis, that is, all data in a sequence are divided by its first value. The new sequence expresses each value of the original sequence as a multiple of the first value.
Figure 2. RMS of the music signal: (a) time-domain waveform; (b) probability distribution
3 Calculating the gray correlation coefficient
In speech/music recognition there are two types of target and hence two reference sequences, so each group of comparison sequences must be related to both groups of reference sequences in order to distinguish the types. If the gray correlation degrees of each group of comparison sequences with the two groups of reference sequences are calculated in a "local environment", that is, with separate extrema for each pair, the degrees obtained under different local conditions lose their comparability. Therefore, to recognize the audio type, the gray correlation degrees of a comparison sequence with every reference sequence must be calculated with the same maximum and minimum differences, which gives the "global environment" gray correlation coefficient.
The gray correlation coefficient in the global environment is calculated as follows:
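In a standard form consistent with the terms defined in the next paragraph, with the extrema taken over all j, i and k, the coefficient can be written as shown here. The symbol \gamma_{ji}(k) for the coefficient is an assumption, chosen to keep it distinct from the resolution coefficient \xi:

```latex
\gamma_{ji}(k) =
\frac{\min_{j}\min_{i}\min_{k} \lvert x_j(k) - y_i(k) \rvert
      + \xi \max_{j}\max_{i}\max_{k} \lvert x_j(k) - y_i(k) \rvert}
     {\lvert x_j(k) - y_i(k) \rvert
      + \xi \max_{j}\max_{i}\max_{k} \lvert x_j(k) - y_i(k) \rvert}
```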
Here j ∈ {1, 2}, i ∈ {1, 2} and k ∈ {1, 2, ..., 10}. The constant ξ is called the resolution coefficient, ξ ∈ [0, 1]; its function is to adjust the size of the comparison environment. The smaller ξ is, the greater the resolution; generally ξ = 0.5 is taken. min min min |xj(k) - yi(k)|, taken over j, i and k, is called the two-level minimum difference; max max max |xj(k) - yi(k)| is called the two-level maximum difference; and |xj(k) - yi(k)| is the absolute difference between Xj and Yi at the k-th index.
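A minimal sketch of the global-environment calculation, assuming every sequence is first divided by its own first value and the minimum and maximum differences are shared by all reference/comparison pairs. The function names and the toy sequences are hypothetical, and xi is assumed positive.

```python
import numpy as np

def initial_value_transform(seq):
    """Divide a sequence by its first value, which must be non-zero."""
    seq = np.asarray(seq, dtype=float)
    if seq[0] == 0.0:
        raise ValueError("first value is zero; the transform is undefined")
    return seq / seq[0]

def global_gray_coefficients(references, comparisons, xi=0.5):
    """Coefficients for every (reference j, comparison i, index k) with shared extrema."""
    x = np.array([initial_value_transform(r) for r in references])    # shape (J, K)
    y = np.array([initial_value_transform(c) for c in comparisons])   # shape (I, K)
    diff = np.abs(x[:, None, :] - y[None, :, :])                       # shape (J, I, K)
    d_min, d_max = diff.min(), diff.max()     # extrema over j, i and k together
    if d_max == 0.0:
        return np.ones_like(diff)             # all sequences coincide
    return (d_min + xi * d_max) / (diff + xi * d_max)

refs = [[2.0, 4.0, 6.0], [2.0, 2.0, 2.0]]     # made-up reference sequences X1, X2
comps = [[1.0, 2.0, 3.3], [3.0, 3.0, 3.3]]    # made-up comparison sequences Y1, Y2
coeff = global_gray_coefficients(refs, comps)
print(coeff.shape)   # (2, 2, 3)
```

Because the extrema are computed once over the whole difference array, the coefficients of Y1 with X1 and with X2 are on the same scale and can be compared directly.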
4 Calculating the gray correlation degree
The essence of gray correlation analysis is a comparison of the geometric relationship between sequence curves. If two sequence curves coincide, the correlation is good: the correlation coefficient is 1 and the correlation degree of the two sequences is also 1. At the same time, two sequence curves can never be perpendicular, that is, completely uncorrelated, so the correlation coefficient is always greater than 0 and the correlation degree is therefore also greater than 0. Because a comparison yields more than one correlation coefficient, their average is taken as the measure of the correlation degree rji of the comparison, namely:
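Written out, with \gamma_{ji}(k) denoting the correlation coefficient between reference sequence X_j and comparison sequence Y_i at index k (the symbol \gamma is an assumption) and K the number of features, the average is:

```latex
r_{ji} = \frac{1}{K} \sum_{k=1}^{K} \gamma_{ji}(k)
```

For instance, coefficients of 1, 0.8 and 0.6 over K = 3 indices would give r_ji = (1 + 0.8 + 0.6) / 3 = 0.8.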
5 Sorting by gray correlation degree
The correlation degrees between the reference sequences Xj and the comparison sequences Yi are sorted from largest to smallest to obtain the gray correlation order. This paper uses the maximum gray correlation degree as the identification criterion.
Figure 3. RMS probability distributions of comparison sequences of different durations compared with the 30 s speech and music reference signals
Table 1 gives the identification results obtained with the maximum gray correlation degree for comparison sequences of 0.1 s, 1 s and 10 s.
Table 1 Correct recognition rates of speech and music signals of different durations
Time length (s)    Speech correct recognition rate    Music correct recognition rate
0.1                62.37%                             76.22%
1                  94.50%                             88.70%
10                 100%                               99.8%
Figure 4 shows the results of 100 Monte Carlo simulations: the gray correlation degrees of the speech and music comparison signals of the three durations with each reference sequence.
As can be seen from Figure 4:
(1) When the duration is 0.1 s, the correlation values of the speech and music signals with the two types of model overlap. This is because the feature values of such a short sequence are incomplete.
(2) When the duration is 1 s, the correlation degree between the speech comparison signal and its own (speech) reference signal is greater than 0.85, while that between the music comparison signal and the speech reference signal lies between 0.6 and 0.95. The correlation degree between the music comparison signal and its own (music) reference signal lies between 0.73 and 0.9, while that between the speech comparison signal and the music reference signal lies between 0.7 and 0.85. Under the speech model, 97% of the speech signals have a correlation value greater than that of the music signals. Under the music model, 92% of the music signals have a correlation value greater than that of the speech signals. Therefore, for each reference signal, a threshold on the correlation value can serve as a basis for identifying speech and music signals.
(3) When the duration is 10 s, the correlation value between the speech comparison signal and the speech model is 20% to 35% higher than that between the music comparison signal and the speech model, and 25% to 30% higher than that between the speech comparison signal and the music model. The correlation value between the music comparison signal and the music model is greater than that between the speech comparison signal and the music model, and 5% to 20% higher than that between the music comparison signal and the speech model. Therefore, when the signal features are fully extracted, the recognition rate reaches up to 100%.
Figure 4. Gray correlation degrees of the speech and music comparison signals over 100 Monte Carlo simulations
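The two decision rules mentioned in this section, the maximum correlation degree and a threshold under a single reference signal, can be sketched as follows. The function names, the example degrees and the threshold of 0.85 are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

# Index 0 corresponds to the speech reference X1, index 1 to the music reference X2.
LABELS = ("speech", "music")

def classify_by_max_degree(degrees):
    """degrees[j] is the correlation degree of one comparison sequence with reference j."""
    return LABELS[int(np.argmax(degrees))]

def classify_by_threshold(degree_with_speech_ref, threshold):
    """Single-reference variant: label as speech when the degree exceeds the threshold."""
    return "speech" if degree_with_speech_ref > threshold else "music"

print(classify_by_max_degree([0.91, 0.78]))          # made-up degrees
print(classify_by_threshold(0.91, threshold=0.85))   # hypothetical threshold
```

The maximum-degree rule needs both reference sequences, whereas the threshold rule needs only one but depends on a threshold chosen from data such as that in Figure 4.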
In fact, the gray correlation degree of audio signals can be regarded as an approximate measure of correlation: the more similar the comparison sequence is to the reference sequence, the greater the correlation value, and the less similar, the smaller.