In recent years, within the field of biometrics, voiceprint recognition has attracted worldwide attention for its convenience, economy, and accuracy, and it is increasingly used as a security verification method in everyday life and work.
Voiceprint recognition is a biometric technology. It automatically identifies a speaker from the parameters of the speech waveform that reflect the speaker's vocal and behavioral characteristics. Unlike speech recognition, voiceprint recognition uses the speaker information in the speech signal and ignores the meaning of the words; it emphasizes what is individual to the speaker. Speech recognition, by contrast, aims to recognize the word content of the signal regardless of who is speaking; it emphasizes what is common across speakers. A voiceprint recognition system has two main parts: feature extraction and pattern matching. Feature extraction selects features that characterize the speaker's identity uniquely, effectively, and stably. Pattern matching measures the similarity between the feature patterns obtained during training and during recognition.
1. Feature extraction
Feature extraction in a voiceprint recognition system extracts the basic features of the speech signal that characterize the speaker. These features should distinguish different speakers effectively while remaining relatively stable for the same speaker. Taking into account the number of features, the quantity of training samples, and the problem of evaluating system performance, current voiceprint recognition systems rely mainly on low-level acoustic features. The speaker features in use generally fall into the following categories:
Spectral envelope parameters. The speech signal is passed through a filter bank, and the filter outputs are sampled at a suitable rate and used as voiceprint recognition features.
Pitch contour, formant frequencies and bandwidths, and their trajectories. These parameters are extracted on the basis of the physiological structure of the vocal organs, the vocal tract, and the nasal cavity.
Linear prediction coefficients. Linear prediction was a major advance in speech signal processing. Parameters derived from it, such as the linear prediction coefficients, autocorrelation coefficients, reflection coefficients, log area ratios, the linear prediction residual, and combinations of these, give good results when used as recognition features. The main reason is that linear prediction agrees well with the vocal tract parameter model.
Parameters reflecting auditory characteristics. These model how the human ear perceives sound frequency, for example the Mel cepstrum and perceptual linear prediction.
In addition, the performance of practical systems is often improved by combining different feature parameters. The combination works well when the parameters are only weakly correlated, because each then reflects a different characteristic of the speech signal.
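As an illustration of the linear prediction parameters listed above, the following is a minimal sketch of the Levinson-Durbin recursion, which derives predictor coefficients and reflection coefficients from the autocorrelation of one frame. The function names, the sign convention (prediction error filter A(z) = 1 + a1 z^-1 + ...), and the log area ratio formula are assumptions chosen for this sketch; other texts use the opposite signs.

```python
import numpy as np

def autocorrelation(frame, order):
    """Return autocorrelation lags r[0..order] of one speech frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve the linear prediction normal equations from autocorrelation r.

    Returns (a, k, err): the prediction error filter a with a[0] == 1,
    the reflection coefficients k, and the final prediction error energy.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        # Correlation of the current predictor with lag i.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        ki = -acc / err
        k[i - 1] = ki
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + ki * a_prev[i - j]
        a[i] = ki
        err *= 1.0 - ki * ki
    return a, k, err

def log_area_ratios(k):
    """One common log area ratio convention (sign conventions vary)."""
    k = np.asarray(k, dtype=float)
    return np.log((1.0 - k) / (1.0 + k))
```

For a frame `x`, a call such as `levinson_durbin(autocorrelation(x, 12), 12)` would give a 12th-order parameter set; the order 12 is only an example value, not one taken from the text.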
2. Pattern matching
Research on pattern matching methods for the various features continues to deepen. These methods fall broadly into the following categories:
Probabilistic statistical method
Speaker information is relatively stable over short periods. By statistically analyzing steady-state features such as pitch, acoustic gain, and low-order reflection coefficients, a classification decision can be made using statistics such as the mean and variance together with probability density functions. The advantage is that the feature parameters do not need to be time-normalized, which makes the method well suited to text-independent speaker recognition.
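The statistical decision described above can be sketched as follows. This is a minimal example that assumes each speaker is modeled by a single diagonal Gaussian over the feature vectors; the function names and the choice of a single Gaussian are assumptions for illustration, not a method taken from the text.

```python
import numpy as np

def fit_speaker_model(features):
    """Estimate per-dimension mean and variance from (frames, dims) features."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    var = features.var(axis=0) + 1e-6  # small floor avoids division by zero
    return mean, var

def average_log_likelihood(features, model):
    """Average per-frame log-likelihood under a diagonal Gaussian model."""
    mean, var = model
    features = np.asarray(features, dtype=float)
    ll = -0.5 * (np.log(2.0 * np.pi * var) + (features - mean) ** 2 / var)
    return ll.sum(axis=1).mean()

def identify_by_statistics(features, models):
    """Return the name of the speaker model with the highest likelihood."""
    return max(models, key=lambda name: average_log_likelihood(features, models[name]))
```

Because the score is an average over frames, no time alignment between the test utterance and the training data is needed, which is the property the paragraph above attributes to this family of methods.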
Dynamic time warping method
Speaker information contains not only stable factors (the structure of the vocal organs and speaking habits) but also time-varying factors (speaking rate, intonation, stress, and rhythm). The test template is aligned in time with the reference template, and the similarity between the two templates is determined by a distance measure. A commonly used method is dynamic time warping (DTW) based on the nearest-neighbor principle.
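A minimal sketch of the template comparison described above, assuming Euclidean frame distance and the basic step pattern (match, insertion, deletion). The function names are hypothetical, and practical systems usually add slope constraints and path-length normalization that are omitted here.

```python
import numpy as np

def _as_frames(x):
    """Convert input to a (frames, dims) array; 1-D input becomes one column."""
    x = np.asarray(x, dtype=float)
    return x.reshape(-1, 1) if x.ndim == 1 else x

def dtw_distance(test, reference):
    """Accumulated DTW distance between two feature sequences."""
    t, r = _as_frames(test), _as_frames(reference)
    n, m = len(t), len(r)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(t[i - 1] - r[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def nearest_template(test, templates):
    """Nearest-neighbor decision: name of the reference with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(test, templates[name]))
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0.0, because the warping path absorbs the repeated frame. This is how the method tolerates differences in speaking rate.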
Vector quantization method
Vector quantization (VQ) originated as a data compression coding technique based on cluster analysis. Helms was the first to apply it to speaker recognition: the specific text of each speaker is encoded into a codebook, the test text is encoded with that codebook at recognition time, and the resulting quantization distortion serves as the decision criterion. Rosenberg and Soong of Bell Labs used VQ to study speaker recognition on isolated-digit text. The method gives high recognition accuracy and fast decisions.
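The codebook-and-distortion decision described above can be sketched as follows. The codebook is trained here with a plain k-means loop, which is an assumption for this sketch (the text does not say which clustering procedure was used), and all function names are hypothetical.

```python
import numpy as np

def train_codebook(features, size, iterations=20, seed=0):
    """Cluster (frames, dims) training features into `size` codewords with k-means."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=size, replace=False)
    codebook = features[start].copy()
    for _ in range(iterations):
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for c in range(size):
            members = features[nearest == c]
            if len(members):  # keep the old codeword if a cluster is empty
                codebook[c] = members.mean(axis=0)
    return codebook

def quantization_distortion(features, codebook):
    """Average distance from each frame to its nearest codeword."""
    features = np.asarray(features, dtype=float)
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def identify_by_vq(features, codebooks):
    """Return the speaker whose codebook encodes the test features with least distortion."""
    return min(codebooks, key=lambda name: quantization_distortion(features, codebooks[name]))
```

The decision is fast because recognition only requires a nearest-codeword search per frame; the clustering cost is paid once during training.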
Hidden Markov model method
The hidden Markov model (HMM) is a stochastic model based on transition probabilities and output (emission) probabilities, and it was first used for speech recognition at CMU and IBM. It treats speech as a random process made up of an observable symbol sequence, where the symbol sequence is the output of the state sequence of the speech production system. When an HMM is used for recognition, a model is built for each speaker, and the state transition probability matrix and the symbol output probability matrix are obtained by training. At recognition time, the maximum probability of the unknown speech over the state transitions is computed, and the decision goes to the model with the highest probability. An HMM needs no time normalization, which saves computation time and storage at decision time, and it is now widely used. Its disadvantage is the large amount of computation required for training.
Artificial neural network method
An artificial neural network models, to some extent, the perceptual characteristics of biological systems. It is a network model with a distributed parallel processing structure that has self-organizing and self-learning abilities, a strong ability to discriminate complex classification boundaries, and robustness to incomplete information; its performance approximates that of an ideal classifier. Its disadvantages are long training time, weak ability to handle dynamic time warping, and a network that may grow too large to train as the number of speakers increases.
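Returning to the hidden Markov model method described earlier in this section, its recognition step (finding the maximum probability over state transitions and choosing the model that gives the highest value) can be sketched with the Viterbi recursion in the log domain. The sketch assumes discrete observation symbols and already trained parameters; training is not shown, and the function names are hypothetical.

```python
import numpy as np

def viterbi_log_score(obs, start, trans, emit):
    """Log-probability of the best state path for a discrete symbol sequence.

    start: (states,) initial probabilities
    trans: (states, states) transition matrix, trans[i, j] = P(j | i)
    emit:  (states, symbols) output matrix, emit[i, s] = P(s | i)
    """
    with np.errstate(divide="ignore"):  # log(0) becomes -inf, which is intended
        log_start = np.log(start)
        log_trans = np.log(trans)
        log_emit = np.log(emit)
    delta = log_start + log_emit[:, obs[0]]
    for symbol in obs[1:]:
        # Best predecessor for each state, then emit the current symbol.
        delta = (delta[:, None] + log_trans).max(axis=0) + log_emit[:, symbol]
    return delta.max()

def identify_by_hmm(obs, models):
    """Return the name of the speaker model with the highest Viterbi score."""
    return max(models, key=lambda name: viterbi_log_score(obs, *models[name]))
```

Here `models` maps a speaker name to a `(start, trans, emit)` tuple. No alignment between test and reference utterances is needed at decision time, which matches the advantage claimed for the HMM method.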
Combining the classification methods above with different features can significantly improve voiceprint recognition performance. For example, T. Matsui and S. Furui of the NTT laboratory used the cepstrum, delta cepstrum, pitch, and delta pitch with a hybrid VQ and HMM method and obtained a speaker verification rate of 99.3%.
For a speaker verification system, the two most important performance measures are the false rejection rate and the false acceptance rate. The former is the error of rejecting the true speaker, and the latter is the error of accepting an impostor; both depend on the setting of the decision threshold. The error rate of a speaker verification system is independent of the number of users, whereas the performance of a speaker identification system depends on the number of users and declines as that number increases.
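The dependence of the two error rates on the threshold can be made concrete with a short sketch. It assumes that each verification trial yields a similarity score and that a trial is accepted when the score is at least the threshold; the function name and the sample scores are made up for illustration.

```python
import numpy as np

def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false_rejection_rate, false_acceptance_rate) at a threshold.

    A trial is accepted when its score is >= threshold.
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    frr = float(np.mean(genuine < threshold))     # true speakers rejected
    far = float(np.mean(impostor >= threshold))   # impostors accepted
    return frr, far

# Illustrative scores only, not measured data.
genuine = [0.9, 0.8, 0.4, 0.7]
impostor = [0.1, 0.5, 0.3, 0.6]
for threshold in (0.5, 0.65, 0.85):
    print(threshold, error_rates(genuine, impostor, threshold))
```

Raising the threshold can only lower the false acceptance rate and raise the false rejection rate, so the threshold is chosen to balance the two for the application.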
In general, a successful speaker recognition system should do the following:
It distinguishes different speakers effectively, yet remains relatively stable when the same speaker's voice changes, for example because of a cold.
It is hard for others to imitate, or it can cope well with imitation by others.
It remains reasonably stable when the acoustic environment changes, that is, it is robust to noise.
Application prospects of voiceprint recognition
Like other biometric technologies such as fingerprint recognition, hand geometry recognition, and iris recognition, a voiceprint cannot be lost or forgotten, requires no memorization, and is convenient to use. In addition, it has the following characteristics:
User acceptance is high, because speaking raises no psychological barrier for users.
Speech is one of the most natural and economical ways to establish identity. Sound input devices are inexpensive, and in some cases (the telephone) add no cost at all, whereas input devices for other biometric technologies tend to be expensive.
It is well suited to identification applications based on telecommunication networks, such as telephone banking, telephone stock trading, and electronic shopping.
Because voiceprint recognition is convenient, accurate, economical, and scalable, it can be widely used in security verification, control, and similar areas, particularly in applications based on telecommunication networks.