Application of speech recognition in home appliance remote control
Source: Application of Electronic Technology; Tsinghua University: Zhou Jihua, Shi Yuanyuan, Runsheng
Abstract: This article introduces a speech recognition algorithm suited to home appliance remote control. The algorithm uses dual templates and two-stage endpoint detection, which effectively improve the recognition rate and robustness. It also introduces a new learning remote control built on this technology, which shows the broad prospects of speech recognition in the home appliance field.
Keywords: speech recognition; DTW; FED; FRED; learning remote control
An important direction in the development of household appliances is making the user interface friendlier, more convenient, and more natural, so that elderly and disabled people can also use it without barriers. Voice control is an important way to improve the user interface of home appliances. This article uses a voice-controlled remote control as an example to explain how speech recognition is applied to home appliances.
The structure of an embedded speech recognition system suitable for household appliances is shown in Figure 1. It consists of four parts. The first part is the analog/digital conversion section. On input, it receives the speech signal and converts it into a digital signal for the chip to process; on output, it converts the decoded digital speech signal into an analog audio signal and plays it through the speaker. The second part is the speech recognition section, which analyzes the input digital speech signal and identifies the command it represents; this is usually done by the DSP. The third part is the voice prompt and playback section, which is also generally implemented in the DSP. Its core is the compression coding and decoding of speech signals, and its purpose is to prompt the user and respond to the recognized speech, completing the human-machine voice interaction. The fourth part is the system control section, which converts the recognition result into the corresponding control signal and turns it into a physical-layer operation that performs the specific function. The combination of speech recognition and system control is the key to voice-controlled interaction. The speech recognition algorithm and the system control section of the remote control are discussed in detail below.
1 Speech recognition algorithm
Currently, speech recognition in consumer electronics is usually implemented on a microcontroller or a DSP. Such recognition is mainly isolated-word recognition, and there are two approaches: speaker-independent recognition based on the hidden Markov model (HMM) statistical framework, and speaker-dependent recognition based on the dynamic programming (DP) principle. Each has advantages and disadvantages in practice.
The advantage of HMM speaker-independent recognition is that the user can use it directly, without training, and that it is stable (that is, recognition performance does not degrade over time). However, it also has defects that are hard to overcome. First, a large corpus must be collected in advance to train the recognition models, which greatly raises the up-front cost of applying the technology. Second, speaker-independent recognition has difficulty handling the different dialects and accents of Chinese, so its use is limited by region. Third, the command words used to control a home appliance should not be completely fixed but should follow the habits of the user, which is almost impossible with speaker-independent recognition. This approach is therefore unsuitable for most home appliance remote controls.
The advantages of DP speaker-dependent recognition are that it is simple and needs fewer hardware resources. Its training process is also very simple and does not require a large number of samples in advance, which both reduces the up-front cost and lets the user freely define the command phrase for each controlled function. It is therefore suitable for most home appliance remote controls. The serious disadvantage of DP speaker-dependent recognition is that its robustness is not ideal: the recognition rate is high for some speakers and low for others, or it is high right after training but falls as time passes. These shortcomings inconvenience users. To overcome them, the traditional methods were improved, which significantly raised recognition performance and robustness and gave satisfactory results.
1.1 Endpoint detection method
An important factor affecting isolated-word recognition performance is the accuracy of endpoint detection [4]. In a recognition test on the 10 English digits, an endpoint error of 60 milliseconds reduced the recognition rate by 3%. For speech recognition chip systems in consumer applications, the interference factors are more complex, which makes precise endpoint detection harder. To address this, a two-stage endpoint detection scheme called the FRED (frame-based real-time endpoint detection) algorithm [3] is proposed to improve the accuracy of endpoint detection. The first stage performs a simple real-time endpoint detection on the input speech signal, based on its energy and rate of change, to delimit the time-domain range of the input speech; spectral feature extraction is carried out on that basis. The second stage uses an FFT analysis of the input speech to compute the energy distribution in the high, middle, and low frequency bands, which is used to distinguish unvoiced consonants, voiced consonants, and vowels. Once the vowel and voiced segments have been determined, the search is extended toward the front and rear ends to find the frames that contain the speech endpoints. Because the FRED algorithm detects endpoints from the essential characteristics of speech, it adapts better to interference and changes in the environment and improves the accuracy of endpoint detection.
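The first stage described above can be illustrated with a minimal sketch. It is not the FRED algorithm itself: the frame length, the `ratio` threshold rule, and the `min_frames` check are assumptions chosen for illustration, and the second, spectrum-based stage is not shown.

```python
import math

def frame_energies(samples, frame_len):
    """Short-time energy of consecutive, non-overlapping frames."""
    n_frames = len(samples) // frame_len
    return [
        sum(s * s for s in samples[k * frame_len:(k + 1) * frame_len])
        for k in range(n_frames)
    ]

def detect_endpoints(samples, frame_len=160, ratio=0.1, min_frames=3):
    """Return (start_frame, end_frame) of the speech segment, or None.

    A frame counts as speech when its energy exceeds `ratio` times the
    peak frame energy. This threshold rule is a hypothetical stand-in,
    not the criterion used by FRED.
    """
    energies = frame_energies(samples, frame_len)
    if not energies or max(energies) == 0:
        return None
    threshold = ratio * max(energies)
    active = [k for k, e in enumerate(energies) if e > threshold]
    if len(active) < min_frames:
        return None
    return active[0], active[-1]

def tone(amplitude, n_samples, freq_hz=440.0, rate_hz=8000.0):
    """Synthetic sine tone used as test input."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / rate_hz)
        for n in range(n_samples)
    ]

# 10 quiet frames, 20 loud frames, 10 quiet frames (160 samples per frame).
signal = tone(0.01, 1600) + tone(1.0, 3200) + tone(0.01, 1600)
print(detect_endpoints(signal))
```

A coarse pass like this only bounds the utterance in time; the text's second stage then refines the boundaries using band energies.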
For speaker-dependent recognition, the commonly used FED (fast endpoint detection) algorithm [5] was compared with FRED. Both algorithms were tested on the same database, which contains recordings from 7 speakers; each speaker read 100 personal names, 3 times each. Template training and recognition in the test used the traditional fixed-endpoint dynamic time warping (DTW) template matching algorithm [4]. The recognition rates obtained with the two endpoint detection algorithms are listed in Table 1.
Table 1 Effect of the FED and FRED endpoint detection algorithms on the DTW template matching recognition rate
Endpoint detection  Speaker 1  Speaker 2  Speaker 3  Speaker 4  Speaker 5  Speaker 6  Speaker 7  Average
FED                 92.5%      87%        92.6%      95.6%      96.2%      96.8%      100%       94.4%
FRED                94.3%      89.9%      93.2%      99.4%      99.4%      98.8%      100%       96.4%
The test results show that with the FRED endpoint detection algorithm the recognition rate rises, to varying degrees, for every speaker except the seventh, whose rate was already 100%. The system therefore uses this two-stage endpoint detection scheme.
1.2 Template matching algorithm
DTW is a typical DP speaker-dependent algorithm. To overcome differences in natural speaking rate, it aligns the template feature sequence with the input speech feature sequence by dynamic time warping, compares the distortion between the two, and uses that distortion as the basis for the recognition decision.
Assume that a stored word template contains $M$ frames of spectral features, $R = \{r(m);\ m = 1, 2, \ldots, M\}$, and that the utterance to be recognized contains $N$ frames of spectral features, $T = \{t(n);\ n = 1, 2, \ldots, N\}$. The frame distortion between $t(i)$ and $r(j)$ is defined as $D(i, j) = |t(i) - r(j)|^2$. A dynamic programming search finds the path with the smallest cumulative distortion, which is the optimal match. The symmetric form of DTW is used, in which $S(i, j)$ is the accumulated distortion and $D(i, j)$ is the local distortion.
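For reference, the widely used symmetric DTW recursion (the Sakoe-Chiba form) can be written as follows. This is an assumption: the text does not show which local path constraint the authors actually used.

```latex
S(i, j) = \min
\begin{cases}
S(i-1,\, j) + D(i, j) \\
S(i-1,\, j-1) + 2\,D(i, j) \\
S(i,\, j-1) + D(i, j)
\end{cases}
\qquad S(1, 1) = 2\,D(1, 1)
```

With this weighting, every complete path from $(1, 1)$ to $(N, M)$ has total weight $N + M$, so $S(N, M) / (N + M)$ is a natural normalized distance.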
When the dynamic programming process reaches the fixed end node $(N, M)$, the normalized distance of the template match can be calculated. The recognition result is the template word with the smallest normalized distance: $x = \arg\min_x \{S(N, M_x)\}$.
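The matching and decision steps can be sketched as follows. This is a minimal illustration that assumes the common symmetric local constraint (weights 1, 2, 1) and normalization by the sum of the two sequence lengths; the function names and the toy one-dimensional features are hypothetical and not taken from the source.

```python
def frame_distortion(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dtw_distance(test, template):
    """Normalized symmetric DTW distance between two feature sequences."""
    n, m = len(test), len(template)
    inf = float("inf")
    # s[i][j] is the accumulated distortion at node (i, j), 1-based;
    # row 0 and column 0 are padding so the recursion needs no special cases.
    s = [[inf] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distortion(test[i - 1], template[j - 1])
            s[i][j] = min(
                s[i - 1][j] + d,          # step along the test axis
                s[i - 1][j - 1] + 2 * d,  # diagonal step, double weight
                s[i][j - 1] + d,          # step along the template axis
            )
    return s[n][m] / (n + m)

def recognize(test, templates):
    """Return the word whose template has the smallest normalized distance."""
    return min(templates, key=lambda word: dtw_distance(test, templates[word]))

# Toy one-dimensional "features": the test utterance is a slower "on".
templates = {
    "on": [[0.0], [1.0], [2.0]],
    "off": [[2.0], [1.0], [0.0]],
}
utterance = [[0.0], [0.0], [1.0], [2.0], [2.0]]
print(recognize(utterance, templates))
```

The test utterance is longer than the "on" template but follows the same trajectory, so the warping path absorbs the difference in speaking rate.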
To improve the recognition performance of the DTW algorithm and the stability of the templates, a dual-template strategy is proposed, namely $x = \arg\min \{S(N, M_{2x})\}$, where each word $x$ now has two stored templates. The first training utterance of a word is stored as the first template, and the second utterance of the same word is stored as the second template, with the aim of maintaining high recognition performance through the two stored templates. Using the same database as above (7 speakers, 100 personal names, each read 3 times), the performance of single-template and dual-template DTW was compared. The results are listed in Table 2.
Table 2 Comparison of DTW recognition rates with single and dual templates
DTW              Speaker 1  Speaker 2  Speaker 3  Speaker 4  Speaker 5  Speaker 6  Speaker 7  Average
Single template  94.3%      89.9%      93.2%      99.4%      99.4%      98.8%      100%       96.4%
Dual template    99.4%      96.6%      98.5%      100%       100%       98.8%      100%       99.0%
The test results show that storing two templates raises the average recognition rate from 96.4% to 99.0% and narrows the spread between speakers, so both the recognition performance and the robustness of DTW improve considerably. The dual-template DTW is therefore a simple and effective strategy for a speaker-dependent recognition system.
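The dual-template decision rule can be sketched as below. The sketch assumes that each word is scored by the smaller of its two template distances, which is one reading of the rule above. The `distance` argument stands for any sequence distance, such as a normalized DTW distance; `toy_distance` is a hypothetical stand-in used only to keep the example short.

```python
def recognize_dual(utterance, templates, distance):
    """Pick the word whose best-matching stored template is closest.

    `templates` maps each command word to a list of stored templates
    (two per word in the dual-template strategy).
    """
    best_word, best_score = None, float("inf")
    for word, stored in templates.items():
        score = min(distance(utterance, t) for t in stored)
        if score < best_score:
            best_word, best_score = word, score
    return best_word, best_score

def toy_distance(a, b):
    """Mean absolute difference; a stand-in for a real DTW distance."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

templates = {
    "tv on": [[1.0, 2.0, 3.0], [1.2, 2.4, 3.6]],
    "tv off": [[3.0, 2.0, 1.0], [3.3, 2.1, 0.9]],
}
utterance = [1.2, 2.3, 3.5]
word, score = recognize_dual(utterance, templates, toy_distance)
print(word, round(score, 3))
```

Here the second "tv on" template is closer to the utterance than the first, which illustrates how a second template can absorb variation in how a speaker says the same command.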
In summary, the embedded speech recognition chip system uses the FRED algorithm to improve endpoint detection, 12th-order Mel-frequency cepstral coefficients (MFCC) as feature parameters, and the dual-template strategy for training and recognition. A series of tests shows that the system achieves good speaker-dependent recognition performance and meets the requirements of voice-control applications in household appliances.
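The average rates in Tables 1 and 2 can also be read as error rates, which makes the size of each improvement easier to compare. The short calculation below uses only the reported averages. The labels assume that the single-template row of Table 2 corresponds to the FRED row of Table 1, since the two rows are identical.

```python
# Average recognition rates (%) as reported in Tables 1 and 2.
average_recognition = {
    "FED, single template": 94.4,
    "FRED, single template": 96.4,
    "FRED, dual template": 99.0,
}

def error_rate(recognition_rate):
    """Error rate in percent, rounded to one decimal place."""
    return round(100.0 - recognition_rate, 1)

def relative_reduction(before, after):
    """Fraction by which the error rate falls from `before` to `after`."""
    return (before - after) / before

errors = {name: error_rate(rate) for name, rate in average_recognition.items()}
fred_gain = relative_reduction(errors["FED, single template"],
                               errors["FRED, single template"])
dual_gain = relative_reduction(errors["FRED, single template"],
                               errors["FRED, dual template"])
print(errors)
print(round(fred_gain, 3), round(dual_gain, 3))
```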
2 Voice-controlled remote control design
At present, home remote controls are mainly operated by keys, and there are two types. One is the fixed-code type: each key corresponds to one or several code patterns that are preset in advance and cannot be changed by the user. The other is the learning type, which can learn remote control codes by itself, so the code pattern for each button can be defined by the user. It can merge several remote controls into one, so that a single remote controls multiple appliances, and it can serve as a backup for the original remote control. As modern homes contain more and more appliances, both types end up with too many buttons, and users find it hard to remember what each key does. Applying speech recognition to a learning remote control replaces memorizing and pressing keys with voice commands, eliminates a large number of buttons, and reduces the size of the remote control.
The hardware block diagram of the voice-controlled remote control is shown in Figure 2. It consists of two independent modules: the speech signal processing module and the system control module. The flow chart of the system control software is shown in Figure 3. Before use, the user presses the "Learning" button to enter the learning state, trains the voice commands, and has the remote learn the original remote control code that corresponds to each voice command. In use, the user presses the "Identification" button to enter the speech recognition state, and the remote waits for the speech processing module to return a result. If a correct recognition result is returned, the corresponding remote control code is emitted. For example, suppose the digit key "1" on the original TV remote control corresponds to China Central Television channel 1. The user trains the voice command "Central 1", has the remote learn the code of the digit key "1" on the original remote, and associates that code with the trained command "Central 1".
After that, the user only needs to say "Central 1" into the microphone of the learning remote control, and the TV switches to Central channel 1. The user no longer needs to remember which channel number belongs to which TV station; compared with bare channel numbers, user-defined commands are easier to remember.
The speech signal processing module consists of a DSP, a flash memory (Flash), and a codec (CODEC). The DSP is the core of the whole speech recognition module and is responsible for speech recognition, speech coding and decoding, and Flash read/write control. The advantages of a DSP are fast computation, large memory space, and fast data exchange, which make it possible to implement complex algorithms, raise the recognition rate, and reduce the response delay, giving higher recognition performance. The DSP chip is the Analog Devices AD2186L, which has the following characteristics: (1) a computing speed of up to 40 MIPS, with efficient single-cycle instructions; (2) 40K bytes of internal RAM, of which 8K words (16 bits/word) are data RAM and 8K words (24 bits/word) are program RAM, with expansion to a maximum of 4 megabytes of storage for data or programs; (3) a 3.3 V operating voltage and a variety of power-saving modes. The AD2186L can handle the speech signal processing and is suitable for battery-powered remote controls.
The Flash and the CODEC are also 3.3 V chips. The Flash is the ATMEL AT29LV040A (4 Mbit), which serves as the system memory and mainly stores the following: the parameters required for synthesizing the voice prompts, the codebook data, the DSP system application, the speaker-dependent training templates, and the learned remote control code data. The CODEC is the TI TLV320AC37, which performs A/D and D/A conversion, encoding, and decoding.
The system control module consists of a microcontroller, an infrared receiver and transmitter, and a power management circuit. The microcontroller is the master chip and is responsible for system control of the whole remote control. It scans the keyboard; learns remote control codes according to the instructions entered by the user; controls the DSP to perform voice training, playback, and recognition; and converts the recognition result into the corresponding remote control code, which is emitted through the infrared LED. The microcontroller and the DSP communicate using the standard RS232 serial protocol. If no correct command arrives for 30 seconds, the remote control goes to sleep: the microcontroller has the power management circuit switch off the power to the DSP and the Flash, and the microcontroller itself also sleeps. When the user presses a key, the microcontroller wakes up, has the power management circuit restore power to the DSP and the Flash, and resumes work. The reason is that the DSP has the highest power consumption in the system, so switching off the speech signal processing module significantly reduces the power consumption of the whole system.
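The control flow described above (learn a code, emit it on a correct recognition, sleep after 30 seconds without a correct command) can be sketched as a small state model. Everything in it is illustrative: the class and method names, the infrared code value, and the decision to reset the idle timer on wake-up are assumptions, and the speech module is represented only by the command string it would return.

```python
SLEEP_TIMEOUT_S = 30  # from the text: sleep after 30 s without a correct command

class LearningRemote:
    """Toy model of the microcontroller's control logic."""

    def __init__(self):
        self.codes = {}          # voice command -> learned IR code
        self.dsp_powered = True  # power state of the DSP and Flash
        self.idle_s = 0          # seconds since the last correct command
        self.sent = []           # IR codes emitted (stands in for the IR LED)

    def _press_key(self):
        # Any key press wakes the remote and restores DSP/Flash power.
        if not self.dsp_powered:
            self.dsp_powered = True
            self.idle_s = 0

    def learn(self, command, ir_code):
        """'Learning' state: bind a trained voice command to an IR code."""
        self._press_key()
        self.codes[command] = ir_code
        self.idle_s = 0

    def identify(self, recognized):
        """'Identification' state: `recognized` is the speech module's result,
        or None when nothing was recognized. Returns True if a code was sent."""
        self._press_key()
        code = self.codes.get(recognized)
        if code is None:
            return False
        self.sent.append(code)
        self.idle_s = 0
        return True

    def tick(self, seconds):
        """Advance time; power down the DSP and Flash after the timeout."""
        self.idle_s += seconds
        if self.idle_s >= SLEEP_TIMEOUT_S:
            self.dsp_powered = False

remote = LearningRemote()
remote.learn("central 1", 0x01)   # 0x01 is a hypothetical IR code
remote.identify("central 1")      # emits the learned code
remote.tick(30)                   # no command for 30 s: DSP powered down
print(remote.sent, remote.dsp_powered)
```

Keeping the command-to-code table on the microcontroller side mirrors the division of labor in the text: the DSP only reports which command was recognized, and the microcontroller decides what to transmit.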
In moving a technology from the laboratory to practical products, reliability and cost are the biggest challenges. With dual-template DTW and the two-stage FRED endpoint detection algorithm, the recognition rate and robustness are effectively improved with only a small increase in system resources and response delay. The technology has been successfully applied to a learning remote control, which shows the broad prospects of speech recognition technology in home appliances.