The fields involved in speech recognition include signal processing, pattern recognition, probability theory and information theory, the mechanisms of speech production and hearing, artificial intelligence, and more.
Acoustic feature extraction and selection is an important part of speech recognition. Feature extraction is both a substantial data compression and a deconvolution of the signal, whose goal is to make the patterns easier for the classifier to separate. Because the speech signal is time-varying, feature extraction must be performed on short segments of the signal, i.e., short-time analysis. Each segment, assumed to be stationary, is called a frame, and the offset between consecutive frames typically takes 1/2 or 1/3 of the frame length. The signal is usually pre-emphasized to boost the high frequencies, and each frame is windowed to reduce the effect of the segment edges. Some commonly used acoustic features are described below.

Linear prediction coefficients (LPC): linear prediction analysis starts from the human vocal mechanism. By studying a cascaded short-tube model of the vocal tract, the transfer function of the system is taken to have the form of an all-pole digital filter, so that the signal at time n can be estimated as a linear combination of the signals at several previous times. The LPC coefficients are obtained by minimizing the mean squared error (LMS) between the actual speech samples and the linearly predicted samples. Methods for computing LPC include the autocorrelation method (Levinson-Durbin), the covariance method, and the lattice method. The fast and effective computation of this acoustic feature has made it widely used. Acoustic features similar to the LPC prediction-parameter model include the line spectrum pairs (LSP), the reflection coefficients, and so on.

Cepstral coefficients (CEP): using homomorphic processing, the cepstrum is obtained by taking the discrete Fourier transform (DFT) of the speech signal, taking the logarithm, and then applying the inverse transform (IDFT).
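The autocorrelation (Levinson-Durbin) method for LPC mentioned above can be sketched in a few lines. The following is an illustrative implementation, not taken from any particular toolkit; the function name and the order-2 example are my own:

```python
import numpy as np

def lpc(signal, order):
    """Estimate LPC coefficients via the autocorrelation (Levinson-Durbin) method.

    Returns a with a[0] = 1, minimizing the power of the prediction error
    e[n] = x[n] + sum_{j=1}^{order} a[j] * x[n-j].
    """
    n = len(signal)
    # autocorrelation sequence r[0..order]
    r = np.array([np.dot(signal[:n - k], signal[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # reflection coefficient for this order
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        # order-update: a[j] += k * a[i-j] for j = 1..i (sets a[i] = k)
        a[1:i + 1] += k * a[i - 1::-1][:i]
        err *= (1.0 - k * k)
    return a
```

Applied to a synthetic second-order autoregressive signal, the recovered coefficients approach the generating ones, which is the sense in which the predictor "explains" the vocal-tract filter.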
For the LPC cepstrum (LPCC), once the linear prediction coefficients of the filter are obtained, the cepstral coefficients can be computed with a recursive formula. Experiments show that using the cepstrum improves the stability of the feature parameters.

Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP): unlike LPC and related features derived from the study of the human vocal mechanism, MFCC and PLP are acoustic features motivated by research on the human auditory system. Studies of human hearing have found that when two tones of similar frequency sound simultaneously, a person hears only one tone. The critical bandwidth is the frequency-difference boundary at which this perception changes abruptly: when the frequency difference between two tones is smaller than the critical bandwidth, a listener hears the two tones as one, which is known as the masking effect. The Mel scale is one measure of this critical bandwidth.

The MFCC computation first converts the time-domain signal into the frequency domain with an FFT, then convolves its logarithmic energy spectrum with a triangular filter bank distributed on the Mel scale, and finally applies a discrete cosine transform (DCT) to the outputs of the filters, keeping the first N coefficients. PLP still uses the Durbin method to compute LPC parameters, but it also applies a DCT to the logarithm of an auditory-excitation spectrum when computing the autocorrelation parameters.
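As a concrete illustration of the FFT, Mel filter bank, logarithm, and DCT pipeline just described, here is a minimal single-frame MFCC sketch. The filter count, FFT size, and pre-emphasis coefficient are typical choices assumed for illustration, not values prescribed by the text:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale from 0 to sr/2."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fb[i - 1, k] = (k - lo) / max(c - lo, 1)   # rising edge
        for k in range(c, hi):
            fb[i - 1, k] = (hi - k) / max(hi - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13, n_fft=512):
    """MFCC of one frame: pre-emphasis, window, FFT, Mel filters, log, DCT."""
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
    frame = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2 / n_fft
    logmel = np.log(mel_filterbank(n_filters, n_fft, sr) @ power + 1e-10)
    # DCT-II decorrelates the log filter-bank energies; keep the first n_ceps
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ logmel
```

A 25 ms frame at 16 kHz (400 samples) yields the usual 13-dimensional cepstral vector.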
Acoustic models. The model of a speech recognition system is typically composed of two parts, the acoustic model and the language model, corresponding respectively to computing the probability of syllables given the speech and the probability of words given the syllables. This section and the next describe techniques for acoustic models and language models, respectively.

HMM acoustic modeling: a Markov model is a discrete-time finite-state automaton; a hidden Markov model (HMM) is a Markov model whose internal state is not visible, the outside world seeing only the output value at each moment. For speech recognition systems, the output values are usually the acoustic features computed from each frame. Describing the speech signal with an HMM requires two assumptions: first, that a state transition depends only on the previous state; second, that the output value depends only on the current state (or the current state transition). Both assumptions greatly reduce the model's complexity. The algorithms corresponding to HMM scoring, decoding, and training are the forward algorithm, the Viterbi algorithm, and the forward-backward algorithm, respectively. HMMs used in speech recognition usually adopt a left-to-right topology with self-loops and possible skips: a phoneme is an HMM of three to five states, a word is built by concatenating the HMMs of its constituent phonemes, and the whole model for continuous speech recognition is built by concatenating word HMMs and silence HMMs.

Context-dependent modeling: coarticulation refers to the influence of the neighboring sounds before and after a sound, which causes it to change. In terms of the vocal mechanism, the articulators are in transition from one sound to the next, so the spectrum of the latter sound differs from its spectrum under other conditions.
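The forward algorithm for HMM scoring mentioned above follows directly from the two assumptions just stated. A toy discrete-output sketch (a real recognizer would use per-frame acoustic likelihoods rather than a symbol table):

```python
import numpy as np

def forward_prob(pi, A, B, obs):
    """P(observation sequence | HMM) via the forward algorithm.

    pi:  initial state probabilities, shape (N,)
    A:   transition matrix, A[i, j] = P(next state j | state i)
    B:   emission matrix,  B[i, k] = P(symbol k | state i)
    obs: sequence of symbol indices
    """
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        # alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()
```

Summing the result over every possible observation sequence of a fixed length gives 1, a handy sanity check on the recursion.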
The context-dependent modeling method takes this effect into account during modeling, so that the model can describe speech more accurately. A model that considers only the influence of the preceding sound is called a bi-phone; one that considers both the preceding and the following sound is called a tri-phone. Context-dependent modeling in English is usually based on the phoneme as the unit; since the effects of some phonemes on their neighbors are similar, model parameters can be shared by clustering the phoneme states. The result of clustering is called a senone. A decision tree is used to map triphones to senones efficiently: by answering a series of questions about the categories of the preceding and following sounds (vowel/consonant, voiceless/voiced, etc.), it finally determines which senone each HMM state should use. The classification and regression tree (CART) model is also used for word-to-phoneme conversion.

Language models. Language models are mainly divided into two types, rule-based models and statistical models. A statistical language model uses probabilistic and statistical methods to reveal the inherent statistical regularities of language units; among these, the N-gram is simple and effective and is widely used.

N-gram: this model is based on the assumption that the appearance of the Nth word depends only on the preceding N-1 words, so the probability of an entire sentence is the product of the probabilities of its words. These probabilities can be obtained by directly counting co-occurrences of N words in a corpus. The bigram (bi-gram) and trigram (tri-gram) are the most commonly used. The performance of a language model is usually measured with cross-entropy and perplexity. Cross-entropy reflects the difficulty of recognizing text with the model or, from a compression perspective, the average number of bits needed to encode each word. Perplexity reflects the average number of branches the model allows at each word; its reciprocal can be regarded as the average probability of each word.
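A minimal maximum-likelihood bigram estimated from counts, together with the perplexity measure just described. The tiny corpus is invented for illustration, and the unsmoothed model is only defined on word pairs seen in training:

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Maximum-likelihood bigram: P(w2 | w1) = count(w1 w2) / count(w1)."""
    hist = Counter(tokens[:-1])               # counts of w1 as a history
    bi = Counter(zip(tokens, tokens[1:]))     # counts of adjacent pairs
    return lambda w1, w2: bi[(w1, w2)] / hist[w1]

def perplexity(prob, tokens):
    """2 ** cross-entropy: the model's average branching factor on the text."""
    logp = sum(math.log2(prob(w1, w2)) for w1, w2 in zip(tokens, tokens[1:]))
    return 2.0 ** (-logp / (len(tokens) - 1))
```

On its own training text the model's perplexity is at least 1, and lower perplexity means the model finds the text more predictable.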
Smoothing refers to assigning a probability value to N-gram combinations that have never been observed, so as to guarantee that any word sequence can always obtain a probability value from the language model. Commonly used smoothing techniques include Good-Turing estimation, deleted interpolation, Katz smoothing, and Kneser-Ney smoothing.
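Deleted (linear) interpolation, the second technique listed, backs a bigram off to a unigram. A sketch with a fixed weight lambda; in practice lambda is estimated on held-out data rather than hard-coded as here:

```python
from collections import Counter

def interpolated_bigram(tokens, lam=0.7):
    """P(w2 | w1) = lam * P_ML(w2 | w1) + (1 - lam) * P_ML(w2).

    The unigram term gives every in-vocabulary word nonzero probability,
    even after histories where the bigram was never observed.
    """
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    hist = Counter(tokens[:-1])
    n = len(tokens)

    def prob(w1, w2):
        p_bi = bi[(w1, w2)] / hist[w1] if hist[w1] else 0.0
        p_uni = uni[w2] / n
        return lam * p_bi + (1.0 - lam) * p_uni

    return prob
```

An unseen pair such as ("c", "b") below now receives probability mass from the unigram term alone, while seen pairs keep most of theirs.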
Search. In continuous speech recognition, searching means finding the word-model sequence that best describes the input speech signal, thereby obtaining the decoded word sequence. The search is based on combining the acoustic model score and the language model score in one formula. In actual use, it is often necessary to give the language model a high weight and to set a word insertion penalty.

Viterbi: the Viterbi algorithm, based on dynamic programming, computes at each time point, for each state, the probability of the best partial state sequence given the observations so far, retains the maximum-probability path, and records the corresponding state information at each node so that the word sequence can finally be recovered by backtracking. Without losing the optimal solution, the Viterbi algorithm simultaneously solves the nonlinear time alignment between the HMM state sequence and the acoustic observation sequence, word-boundary detection, and word recognition in continuous speech recognition, making it the basic search strategy of speech recognition.

Because what follows the current time point is unpredictable in speech recognition, heuristic pruning based on an objective function is difficult to apply. Thanks to the time-synchronous character of the Viterbi algorithm, all paths at the same time point cover the same observation sequence and are therefore comparable; beam search keeps only the paths whose probability is close to the maximum at each time point, which greatly improves search efficiency. This time-synchronous Viterbi-beam algorithm is the most effective search algorithm in current speech recognition.

N-best search and multi-pass search: to use various knowledge sources in the search, a multi-pass search is usually performed. The first pass uses low-cost knowledge sources to generate a candidate list or a word-candidate lattice; on this basis, a second pass using high-cost knowledge sources obtains the best path.
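A toy time-synchronous Viterbi-beam decoder over a discrete-output HMM. The beam keeps only states whose log score lies within a fixed width of the current best at each frame; the width value is illustrative, and real decoders also prune on hypothesis counts:

```python
import numpy as np

def viterbi_beam(pi, A, B, obs, beam_logwidth=10.0):
    """Best state sequence by dynamic programming, with beam pruning.

    At every time step only states whose log score is within beam_logwidth
    of the current best survive; since all paths at the same time cover the
    same observations, their scores are directly comparable.
    """
    logA, logB = np.log(A), np.log(B)
    score = np.log(pi) + logB[:, obs[0]]
    backptr = []
    for o in obs[1:]:
        cand = score[:, None] + logA                 # cand[i, j]: into j via i
        best_prev = cand.argmax(axis=0)
        score = cand.max(axis=0) + logB[:, o]
        score[score < score.max() - beam_logwidth] = -np.inf   # beam pruning
        backptr.append(best_prev)
    # backtrack from the best final state
    path = [int(score.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

With a wide beam the result matches exhaustive enumeration of all state sequences; a narrow beam trades a small risk of losing the optimum for speed.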
The knowledge sources mentioned earlier, namely the acoustic model, the language model, and the pronunciation dictionary, can all be used in the first-pass search. To achieve more advanced speech recognition or spoken-language understanding, it is often necessary to use higher-cost knowledge sources, such as 4th- or 5th-order N-grams, 4th-order or higher context-dependent models, inter-word correlation models, segment models, or syntactic analysis, for rescoring. Recent real-time large-vocabulary continuous speech recognition systems use this multi-pass search strategy.

N-best search generates a candidate list; if the best N paths are retained at each node, the computational complexity increases to N times the original. A simplified approach is to keep only a few word candidates at each node, at the risk of losing lower-ranked candidates. A compromise is to consider only paths distinguished by their last two words and retain the top K. The word-candidate lattice gives multiple candidates in a more compact form, and lattice-generating algorithms can be obtained by modifying the N-best search algorithm accordingly. The forward-backward search algorithm is an example of multi-pass search: when the forward Viterbi search is applied, the forward probabilities obtained during that pass are used in computing the objective function of the backward search, so that a heuristic A* algorithm can be used for the backward search, economically obtaining the N candidates.
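The second-pass rescoring idea can be shown schematically: first-pass hypotheses carry their acoustic scores, and a more expensive language model re-ranks them after being added with a high weight, as the text notes is common. The hypotheses, scores, and toy LM below are invented placeholders:

```python
def rescore(nbest, lm_logprob, lm_weight=8.0):
    """Re-rank first-pass N-best hypotheses with a second-pass language model.

    nbest:      list of (word_sequence, acoustic_log_score) from the first pass
    lm_logprob: the higher-cost LM, mapping a word sequence to a log probability
    lm_weight:  high weight on the language model score
    """
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_logprob(h[0]))[0]
```

A slightly worse acoustic score can thus win if the stronger language model prefers the hypothesis.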
System implementation. The criterion for selecting the recognition unit of a speech recognition system is that the unit has an accurate definition and that enough data can be obtained to train it. English usually uses context-dependent phoneme modeling; coarticulation in Chinese is not as severe as in English, so modeling can be done at the syllable level. The amount of training data a system requires is related to the complexity of the model. If the model is designed to be so complex that it exceeds the capacity of the available training data, performance drops drastically.

Dictation machine: a large-vocabulary, speaker-independent, continuous speech recognition system is often called a dictation machine. Its architecture is the HMM topology built from the acoustic model and language model described above. During training, the model parameters of each unit are obtained with the forward-backward algorithm; during recognition, the units are concatenated into words, silence models are added between words, and the language model is introduced as the inter-word transition probability, forming a cyclic structure that is decoded with the Viterbi algorithm. Exploiting the ease of segmenting Chinese into syllables, decoding each segment first and then searching for the best word string is a simplification that improves efficiency.

Dialogue system: a system that implements human-machine spoken dialogue is called a dialogue system. Given the current state of the technology, dialogue systems are usually oriented toward a narrow domain with a limited vocabulary; typical applications include travel inquiries, ticket booking, database retrieval, and so on. The front end is a speech recognizer that produces N-best candidates or a word-candidate lattice; a syntactic analyzer extracts semantic information; the dialogue manager then determines the response, which is output by a speech synthesizer. Because current systems often have limited vocabularies, semantic information can also be obtained by keyword spotting.
Adaptation and robustness. The performance of a speech recognition system is affected by many factors, including different speakers, speaking styles, environmental noise, transmission channels, and more. Improving robustness means improving the system's ability to overcome these factors so that it remains stable across different application conditions and environments; the purpose of adaptation is to adjust the system automatically and specifically according to the different sources of influence, gradually improving performance during use. Solutions for the different factors affecting system performance are introduced below. They divide into two categories: methods that act on the speech features (hereafter, feature methods) and methods that adjust the model (hereafter, model methods). The former seek feature parameters that better reflect the speech content, or apply specific processing on top of existing feature parameters. The latter use a small amount of adaptation data to correct or transform the original speaker-independent (SI) model into a speaker-adaptive (SA) model.

Speaker-adaptive feature methods include speaker normalization and speaker-subspace methods; model methods include Bayesian methods, transformation methods, and model-combination methods.

Noise in a speech system includes environmental noise and electronic noise added during recording. Feature methods for improving robustness include speech enhancement and finding features insensitive to noise; model methods include the parallel model combination (PMC) method and adding noise artificially during training. Channel distortion arises from differences in microphone distance, microphones of different sensitivity, preamplifiers of different gain, different filter designs, and so on. Feature methods include subtracting the long-term average from the cepstral vector and RASTA filtering; model methods include cepstral shift.
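Subtracting the long-term average from the cepstral vectors, often called cepstral mean normalization, can be stated in one line. It cancels a stationary channel because a fixed linear channel appears as a constant additive offset in the log-cepstral domain; the array shapes below are an assumption for illustration:

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """cepstra: (n_frames, n_coeffs) array of cepstral vectors for one utterance.

    Subtract the per-coefficient mean over time, removing any component
    that is constant across the utterance, such as a fixed channel.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

Adding the same offset vector to every frame leaves the normalized features unchanged, which is exactly the channel-invariance the technique is used for.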
Summary. The techniques used in the various components of a speech recognition system have been introduced above. These techniques have achieved good results in actual use, but how to overcome the various factors affecting speech still requires deeper analysis. At present, dictation machines cannot yet fully replace keyboard input, but the maturing of recognition technology has promoted research on higher-level speech understanding. Since English and Chinese have different characteristics, how to apply in Chinese the techniques developed for English is also an important research topic, while problems peculiar to Chinese itself, such as its four tones, remain to be solved.