Design for Adding Image Recognition Technology to a Speech Recognition System
Source: Application of Electronic Technology; School of Communication Engineering, Chongqing University; Ye Lingxia
Abstract: Machine vision is used to track the speaking subject: during normal speech, the opening and closing state of the speaker's mouth is extracted, and the working speed of the speech recognizer is adjusted in real time to match the speech rhythm, effectively improving recognition accuracy. The design principles and implementation techniques of the system are discussed, and the design and analysis of the corresponding auxiliary image recognition are introduced.
Keywords: speech recognition; machine vision; image recognition
Speech recognition is short for automatic speech recognition (ASR, Automatic Speech Recognition by Machine).
Speech recognition technology draws on many disciplines, and research results from different fields have all contributed to its development. Getting a machine to recognize speech is, to some extent, like a person with a poor command of a foreign language listening to others: the outcome depends on the speaker, the speaking rate, the content, and the environmental conditions. The characteristics of the speech signal itself are what make recognition difficult; these include variability, dynamics, transience, and continuity.
The process of computer speech recognition is essentially the same as the human speech recognition process. At present, mainstream speech recognition technology is based on the basic theory of statistical pattern recognition. A complete speech recognition system can be roughly divided into three parts:
(1) Speech feature extraction: its purpose is to extract, from the speech waveform, a feature sequence that changes over time.
(2) Acoustic model and pattern matching (the recognition algorithm): the acoustic model is usually produced from the acquired speech features through a learning algorithm. During recognition, the input speech features are matched against the acoustic model (the patterns) to obtain the best recognition result.
(3) Language model and language processing: the language model includes grammars formed from the set of recognizable voice commands, or language models built by statistical methods; language processing can then perform grammatical and semantic analysis. A language processing stage is not required for every recognition task, but for large-vocabulary recognition the language model is particularly important. When classification goes wrong, the result can be corrected according to the linguistic model, the grammatical structure and the semantics; homophones in particular can only be resolved through context. Linguistic theory covers the semantic structure, the grammar rules and the mathematical description models of a language. Currently successful language models are usually either statistical grammar models or models based on rule-based grammar structures. A grammar structure defines the relationships between words, which reduces the search space of the recognition system and thus helps improve its recognition performance. The speech recognition process is really a process of understanding: people listening to speech do not separate the sounds from the grammatical structure of the language, and they use this knowledge to guide understanding when pronunciation is unclear. A recognition system should use the same knowledge, but such grammar and semantics are difficult to describe formally.
The acoustic model is the underlying model of the recognition system and the part most closely tied to its performance. Its purpose is to provide an effective way of computing the distance between the feature vector sequence of the input speech and each pronunciation template. The design of the acoustic model is closely related to the pronunciation characteristics of the language. The size of the acoustic model unit (a word pronunciation model, a semi-syllable model or a phoneme model) strongly affects the amount of training data required, the system recognition rate and the system's flexibility; the size of the recognition unit must be chosen according to the system vocabulary and the characteristics of the language.
Because of these various difficulties, speech recognition systems are usually built as different types according to the restrictions imposed on their use, generally falling into three categories. The first restricts the user's manner of speaking, and divides in turn into isolated-word, connected-word, continuous, and spontaneous speech recognition systems. The second restricts the user's vocabulary range:
(1) Small-vocabulary speech recognition systems: typically recognition systems covering dozens of words.
(2) Medium-vocabulary speech recognition systems: typically recognition systems covering hundreds to thousands of words.
(3) Large-vocabulary speech recognition systems: usually recognition systems covering thousands to tens of thousands of words.
The third restricts who may use the system. These different restrictions also determine the difficulty of building a speech recognition system.
1 Adding image recognition
Today speech recognition technology is gradually maturing, and many systems have achieved high recognition rates; but once the recognition rate reaches a certain level it becomes hard to improve further, and other technologies are needed to assist speech recognition.
Current speech recognition systems work with a single sensor (a sound sensor). When performing recognition, they cannot subdivide the captured voice information; they can only track and recognize at a fixed rate, matching the speech information against templates in the system library. If the rate of the captured speech does not deviate much from the templates stored in the library, the system works normally; otherwise errors occur. In real life, however, people do not maintain a constant speaking rate; it changes constantly, which inevitably increases the error of speech recognition systems and reduces their practicality.
When people communicate by voice, while capturing the voice information they also acquire other information to help understanding, such as facial expressions and gestures. Without this related information, people too will misunderstand one another; for example, when two people talk at a distance, because they cannot see each other clearly, understanding errors often occur. Clearly, relying on a single information channel does increase the difficulty of understanding, so adding an information channel should be considered to solve this problem. This is why the image recognition function is added.
For a speech recognition system, it would be ideal if the machine could judge the speaker's expression as a person does, but this would greatly increase the image recognition workload, possibly even exceeding the cost of the speech recognition itself, which is unreasonable. Moreover, for the same reason, a machine cannot yet have human-level visual capability, so having the machine accurately recognize various expressions is currently unattainable; this approach is therefore not advisable. Further analysis shows that speaking rate has a large effect on the speech recognition system. When people speak normally, the mouth opens and closes continuously; if the machine only judges and tracks the opening and closing of the mouth in real time, it can obtain the speaking-rate information. The machine then only needs to use the identified speaking rate to adjust the matching speed of speech recognition, thereby fitting the speech rhythm, naturally aiding the system's recognition capability and improving recognition accuracy. In this way machine vision provides another practical information channel for the speech system.
2 Implementation techniques and methods
2.1 Overall system design
The system collects two channels of information from the speaker (the speech subject): one is obtained by the sound sensor, and the other by the camera device. After the visual information from the camera is processed and transformed into speaking-rate information, it is sent to the speech recognizer to match against the collected speech, automatically adjusting the recognition speed and completing speech recognition better. The workflow of the system is shown in Figure 1.
2.2 Auxiliary image recognition design
Capturing the speaking rate of the speech subject is the key to good operation of the whole system. To achieve this, a machine-vision-based method of detecting the state of the person's mouth is used. Since high image accuracy is not required, grayscale images can be chosen to improve processing speed. Where speed permits, recognition accuracy with color images will be higher.
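Since the section notes that grayscale images may be chosen for speed, a minimal sketch of the conversion is shown below. The article does not specify a conversion formula, so the common ITU-R BT.601 luma weights are assumed here:

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an (H, W, 3) uint8 RGB frame to grayscale.

    Uses the ITU-R BT.601 luma weights (an assumed, standard choice;
    the article does not fix the conversion).
    """
    weights = np.array([0.299, 0.587, 0.114])
    # Weighted sum over the color axis, truncated back to 8-bit
    return (rgb @ weights).astype(np.uint8)
```

Working on a single intensity channel instead of three color channels roughly triples the per-pixel throughput of the later comparison steps.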
2.2.1 Process Analysis
The purpose of the entire auxiliary image recognition design is to use machine vision to capture and process images and to extract the speaker's speaking-rate information for the speech system.
Taking the grayscale image as an example, its image processing flow is shown in Figure 2.
For color images there are more feature quantities and the processing is more complex, but the basic steps are the same: preprocessing is completed first, then image feature extraction is performed, followed by image recognition and understanding.
When image processing of a frame is complete, the comparison module compares it with the data of the previous frame and determines the change, from which the speaking-rate information is derived statistically and finally output to the speech recognizer. With the speaking rate serving as a second information channel, once the system has locked onto the target speaker it can also effectively reject external noise that is not synchronized with the voice signal while assisting speech recognition, so the system can achieve better recognition performance.
2.2.2 Image processing algorithm design
(1) Image segmentation
Analysis of the RGB pixels of human faces shows that two of the components in face images follow a two-dimensional Gaussian distribution, so the position of the face can be determined from these two components. Once the face is located, since the mouth lies in the lower half of the face, it is relatively easy to determine the approximate position of the mouth, which provides the basis for its precise positioning, as shown in Figure 3.
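Given a located face, the coarse mouth window described above can be derived directly from the face bounding box. The sketch below follows the article's observation that the mouth lies in the lower half of the face; the exact fractions used to trim the sides are illustrative assumptions, not values from the article:

```python
def mouth_roi(face_box):
    """Estimate a coarse mouth search window from a face bounding box.

    face_box: (x, y, w, h), with (x, y) the top-left corner.
    Returns (x, y, w, h) of the lower-middle portion of the face
    (the fractions below are illustrative assumptions).
    """
    x, y, w, h = face_box
    return (x + w // 4,   # trim the sides: keep the middle half
            y + h // 2,   # keep the lower half of the face
            w // 2,
            h // 2)
```

Restricting the later lip segmentation to this window both speeds up processing and reduces false lip-colored detections elsewhere in the frame.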
(2) Extraction of image features
According to the system requirements, grayscale images can meet the need. However, the gray-level difference between lips and skin is small, and grayscale information is strongly affected by illumination conditions and by movement and rotation of the face, so the lip edges in the face image are not obvious; in particular, when shadow areas appear inside the lips, the lip edges become even more blurred. Segmentation based on lip gray levels and edge information therefore cannot achieve high accuracy. To increase the recognition accuracy of the mouth state, color information can be used to determine the shape and position of the mouth.
Studies have found that the main color feature of the lips is their distinctive lip color, and that normalized RGB color is invariant to lighting and to movement and rotation of the face. Therefore, using color information to segment the lip region with pattern classification techniques can overcome the shortcomings of grayscale images. Since the Fisher linear classifier best separates the two classes and is trained offline, which reduces the online computation, a Fisher linear classifier can be used to perform the lip segmentation.
The appearance of the mouth when speaking is obviously different from its normal state: when speaking the mouth opens wide, while normally the mouth is essentially closed (Figure 4). These features can therefore be used to detect the mouth state. It has been found that the maximum width wmax of the mouth region characterizes the degree of opening of the mouth and should be taken as a feature value; the maximum and minimum heights between the upper and lower lips, hmax and hmin, also differ significantly between states and are likewise taken as feature values. These three feature values form a vector that describes the geometry of the mouth in its different states, as shown in the figure.
The feature vector describing the geometry of the mouth region, Zui = (wmax, hmax, hmin), is the input vector for the subsequent discrimination and classification.
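The feature vector can be read off a binary lip mask produced by the segmentation step. The sketch below uses assumed conventions the article does not fix: the "gap" in a column is the run of non-lip pixels between the topmost and bottommost lip pixel, hmax is the largest such gap, and hmin is the gap at the central lip column:

```python
import numpy as np

def mouth_features(lip_mask):
    """Derive Zui = (wmax, hmax, hmin) from a 2-D boolean lip mask.

    Conventions here are illustrative assumptions: the gap in a column
    is the count of non-lip pixels between the top and bottom lip
    pixels; hmin is measured at the central lip column.
    """
    cols = np.flatnonzero(lip_mask.any(axis=0))  # columns with lip pixels
    wmax = int(cols[-1] - cols[0] + 1)           # widest horizontal extent

    def gap(c):
        rows = np.flatnonzero(lip_mask[:, c])
        # Vertical extent minus lip pixels = inner mouth opening
        return int(rows[-1] - rows[0] + 1 - len(rows))

    hmax = max(gap(c) for c in cols)
    hmin = gap(cols[len(cols) // 2])             # gap at the central column
    return wmax, hmax, hmin
```

With these conventions a fully closed mouth yields hmin = 0, which is exactly the property the classification step below relies on.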
It is only necessary to distinguish the two states "open" and "closed". Different sounds during actual pronunciation open the mouth to different degrees, so the "open" state (hmin > 0) has many patterns, which would inevitably increase computation and storage. In contrast, the "closed" state generally has only one pattern (hmin = 0), so the system need only detect the "closed" state of the speaker's mouth and treat all other states as "open", which simplifies the processing.
(3) Image recognition and understanding
Since the requirements on recognition accuracy are not high, a traditional statistical pattern recognition method can be used; the currently popular neural network recognition methods could also be employed, but because the system's real-time requirement is high and neural network methods are computationally intensive, they are not recommended here.
2.3 Extraction of speaking-rate information
Images of the speaker are continuously collected at an appropriate capture frequency, and the data of the current frame are compared with those of the previous frame; from the frequency of the changes, the speaking-rate information can be estimated. In practice, it is not difficult for the rate information obtained this way to meet general requirements.
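The frame-comparison scheme above can be sketched as counting closed-to-open transitions in the per-frame mouth states. Counting opening events per second is an assumed measure of speech tempo; the article does not fix the exact unit:

```python
def speech_rate(states, fps):
    """Estimate speaking rate from a sequence of per-frame mouth states.

    states: sequence of booleans, True = mouth open, sampled at `fps`
    frames per second. Each closed->open transition counts as one
    opening event; the rate returned is events per second (an assumed
    unit, not specified by the article).
    """
    openings = sum(1 for prev, cur in zip(states, states[1:])
                   if cur and not prev)
    duration = len(states) / fps
    return openings / duration if duration else 0.0
```

The recognizer can then scale its template-matching speed by this rate, so that fast speech is matched faster and slow speech slower, as the overall design requires.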
3 Application prospects
Speech recognition is an enabling technology. Many existing human-machine interactions could be improved by adding a speech recognition function. Speech recognition can turn costly, labor-intensive and time-consuming machine operation into something easy and even enjoyable. In the many scenarios where people are busy, have their hands occupied, or simply cannot be bothered, including vehicle cabs, some dangerous industrial settings, and home-appliance control, high-recognition-rate speech recognition systems will make people's work and life more convenient.
Owing to differences in education and knowledge, a considerable number of people in real life cannot enjoy the conveniences of modern living, including information services and other advanced equipment. High-recognition-rate speech recognition technology can help improve this situation, enabling more members of society to share social information resources and modern services, and raising the level of informatization and modernization of society as a whole.
High-recognition-rate speech recognition technology will also drive the development of intelligent robotics. Since robots are equipped with vision systems in any case, this scheme is easy to implement there, and robots' ability to interact with people will improve. In addition, high-recognition-rate speech recognition has broad application prospects in voice input systems, real-time meeting minutes and simultaneous interpretation, reporters' interview equipment, and similar areas.