Let the computer speak
By Andy Aaron, Ellen Eide and John F. Pitrelli
Say goodbye to dull electronic voices: new speech synthesis systems sound like real human voices, and they can respond instantly.
When you call a bank or an airline, in most cases you will hear a recording reply rather than a human agent. Such systems handle banking or ticketing matters by splicing together prerecorded phrases. Although the spliced speech sounds unnatural, these systems work well enough within a limited domain, because what needs to be said there can be anticipated. But because every phrase must be recorded in advance, their range of application is limited.
Speech synthesis researchers at IBM are tackling a harder problem: letting the computer say anything a person might say, and making it sound more natural. For example, we have developed a system that can read the news or read e-mail aloud over the phone. Like today's phrase-splicing systems, our latest systems, called Supervoices, are based on recordings of real human speech and can respond in real time. The difference is that they can utter anything, including words that the speaker who provided the voice never read.
What are the immediate applications of this technology? They include delivering the latest news, reading screens aloud for people with disabilities, reading e-mail over the phone, and any system whose vocabulary is large and not predefined and whose text cannot be displayed. In the future, Supervoices could make video and computer games more engaging, enhance portable devices, or be used in movie production. IBM released its latest commercial version at the end of 2002.
Tell me
Scientists' attempts to simulate human speech began in the late 18th century, when Wolfgang von Kempelen built a "speaking machine" from an elaborate set of bellows, reeds, whistles and resonant chambers and used it to produce some basic words. By the 1970s, the first generation of modern text-to-speech systems, built on digital computers, was in wide use. Their designers tried to generate all speech sounds directly from a small number of relevant parameters. Although the result sounded robotic, it could be understood. From the 1990s on, faster computers and cheaper data storage made more advanced speech synthesis possible. It rests on this premise: speech is built from a limited set of phonemes, and recombining those phonemes can produce any word. So, like the lead type in a typesetter's tray, a set of recorded sound samples provides the building blocks of synthetic speech.
Supervoices uses this building-block model. People tend to think of language as a series of letters or words, but the software treats it as a series of phonemes. English has about 40 phonemes. For example, the word "please" is made up of four phonemes: P, L, EE, Z. Supervoices has a phoneme library that contains recorded samples of each phoneme. When it needs to say a word, it strings the appropriate phoneme samples together.
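The building-block idea can be sketched in a few lines of Python. This is only an illustration, not the Supervoices implementation: the names `LEXICON`, `PHONEME_LIBRARY` and `synthesize` are hypothetical, and short lists of numbers stand in for recorded audio.

```python
# Hypothetical pronunciation lexicon: word -> phoneme symbols,
# using the article's "please" example.
LEXICON = {"please": ["P", "L", "EE", "Z"]}

# Hypothetical phoneme library: each phoneme maps to one stored sample.
# The numbers are placeholders for real audio data.
PHONEME_LIBRARY = {
    "P": [0.1, 0.2],
    "L": [0.3, 0.4, 0.5],
    "EE": [0.6, 0.7, 0.8, 0.9],
    "Z": [0.2, 0.1],
}


def synthesize(word):
    """Look up the word's phonemes and concatenate their stored samples."""
    waveform = []
    for phoneme in LEXICON[word.lower()]:
        waveform.extend(PHONEME_LIBRARY[phoneme])
    return waveform


audio = synthesize("please")
print(len(audio))  # total number of placeholder audio values
```

A real system keeps many samples per phoneme and must choose among them, which is where the later steps come in.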
Speech synthesis begins with a human voice, so the task of our audition team is to find the right voice among many candidates. Usually we look for a pleasant, well-rounded voice with clear pronunciation and no strong accent. Sometimes we need a special voice for a special application, such as foreign-accented English or a robot voice for a movie. The selected speaker reads thousands of sentences in a recording studio, which takes a week or even longer. These sentences are carefully chosen to cover varied content, so that we capture as many elements of English speech in as many contexts as possible. The result is thousands of sound files.
Then the software converts the text, a series of words, into the corresponding phonemes. It records the features of each phoneme, including the phonemes before and after it and its position in the sentence. It can also tell verbs from nouns in the text. For example, when the speaker reads "Welcome to my home page", the software transcribes it as follows: W EH L K UH M T OO M I H OO M P AY J.
It also notes, for example, that "page" is a noun, that a given sound is followed by EN, and that J comes at the end of the sentence.
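One way to picture the per-phoneme bookkeeping described above is a small record type. The sketch below is an assumption about how such features could be stored, not the actual Supervoices data model: `PhonemeRecord`, `annotate` and the part-of-speech labels are hypothetical, and the phoneme symbols follow the article's transcription.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PhonemeRecord:
    symbol: str
    previous: Optional[str]   # phoneme before this one, None at sentence start
    following: Optional[str]  # phoneme after this one, None at sentence end
    word: str
    part_of_speech: str
    sentence_final: bool


def annotate(words: List[Tuple[str, str, List[str]]]) -> List[PhonemeRecord]:
    """Flatten (word, part of speech, phonemes) triples into annotated records."""
    flat = [(p, w, pos) for w, pos, phonemes in words for p in phonemes]
    records = []
    for i, (symbol, word, pos) in enumerate(flat):
        records.append(PhonemeRecord(
            symbol=symbol,
            previous=flat[i - 1][0] if i > 0 else None,
            following=flat[i + 1][0] if i + 1 < len(flat) else None,
            word=word,
            part_of_speech=pos,
            sentence_final=(i == len(flat) - 1),
        ))
    return records


# Part-of-speech tags here are illustrative guesses, not from the article.
SENTENCE = [
    ("welcome", "verb", ["W", "EH", "L", "K", "UH", "M"]),
    ("to", "preposition", ["T", "OO"]),
    ("my", "determiner", ["M", "I"]),
    ("home", "noun", ["H", "OO", "M"]),
    ("page", "noun", ["P", "AY", "J"]),
]

records = annotate(SENTENCE)
print(records[-1])  # the final J, marked as a noun phoneme at sentence end
```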
Once the text has been converted, we examine the voice files. We measure three elements of prosody: pitch, duration and loudness. These measurements later help us decide whether a given sample can be used to synthesize a given phrase. Pitch, duration and loudness change continuously, so our measurements vary along each sound file.
Next, using techniques borrowed from speech recognition (which converts sound into text), the software aligns each phoneme in the text with the recording. With this alignment of sound and text, we can go through the recordings and determine where each phoneme begins and ends. This step is critical. Once we can locate and label the phonemes, our software can catalog them and put them in a searchable database.
Each English phoneme has an average of 1,000 samples in our database. At first glance this looks highly redundant. In fact, the pronunciation of a phoneme changes with its context and from one utterance to the next. Take the phoneme OO, as in "smooth". Some OO samples in the database come before an L, as in "pool", and others fall at the end of a word, as in "shampoo". These contexts change how OO is pronounced, and they determine which sample we choose when we synthesize speech later.
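The reason for keeping many samples per phoneme can be illustrated with a context-keyed lookup. This is a minimal sketch under assumptions: the database layout, the `"#"` end-of-word marker, the sample identifiers and the fallback rule are all hypothetical, not details given in the article.

```python
from collections import defaultdict

# Hypothetical database: (phoneme, following context) -> sample identifiers.
# "#" is an assumed marker meaning "end of word".
database = defaultdict(list)


def add_sample(phoneme, following, sample_id):
    database[(phoneme, following)].append(sample_id)


def candidates(phoneme, following):
    """Prefer samples recorded in the same context; otherwise use any sample."""
    exact = database.get((phoneme, following))
    if exact:
        return list(exact)
    return [s for (p, _), ids in database.items() if p == phoneme for s in ids]


add_sample("OO", "TH", "smooth_01")   # OO before TH, as in "smooth"
add_sample("OO", "L", "pool_01")      # OO before L, as in "pool"
add_sample("OO", "#", "shampoo_01")   # OO at the end of a word

print(candidates("OO", "L"))  # only the sample recorded before an L
print(candidates("OO", "K"))  # no exact match, so every OO sample is a candidate
```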
Having a library of voice samples solves only part of the problem. To synthesize a realistic sentence, we also need to decide how it should sound. For example, speakers slow down before a pause, such as at a comma, so we need to pay attention to the words just before a comma. We build a statistical model from each speaker's recordings to capture the regularities of that voice, such as rises and falls in pitch, duration and loudness. The model learns these regularities from the data and is used later to give synthesized speech more natural prosody.
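As a rough illustration of learning such a regularity from recordings, the sketch below averages phoneme durations separately for positions before a pause and elsewhere. It is far simpler than any production prosody model; the function name, the feature choice and the duration values are all made up for the example.

```python
from collections import defaultdict
from statistics import mean


def train_duration_model(observations):
    """Average the observed duration for each (phoneme, before_pause) pair."""
    grouped = defaultdict(list)
    for phoneme, before_pause, duration_ms in observations:
        grouped[(phoneme, before_pause)].append(duration_ms)
    return {key: mean(values) for key, values in grouped.items()}


# Invented measurements: (phoneme, is it just before a pause?, duration in ms).
OBSERVATIONS = [
    ("AY", False, 120.0),
    ("AY", False, 130.0),
    ("AY", True, 180.0),
    ("AY", True, 200.0),
]

model = train_duration_model(OBSERVATIONS)
print(model[("AY", False)], model[("AY", True)])
```

With data like this, the model predicts a longer target duration for a phoneme that precedes a pause, which is the slowing-down effect described above.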
Applications
Now that the system is built, let us try it out. Supervoices responds within milliseconds, so it can converse with people in real time. First we give it a sentence, such as "Can we have lunch today?" We begin by converting the words into phonemes, the building blocks of Supervoices. Our sentence looks like this:
K AE N W EE H AE V L UH N CH T OO D AY
Supervoices then marks the features of the sentence: it is a question, the third word is a verb, and the second syllable of the last word is stressed.
These features are fed into the statistical model. Based on them, Supervoices determines the pitch, duration and loudness to aim for in the steps that follow. For example, the model notices that this is a yes/no question and raises the pitch at the end of the sentence. With this target in hand, we only need to find the phoneme samples in the database that match the target contour and put them in place. But which samples should we use to synthesize our sentence? Our sentence contains 16 phonemes, so there may be as many as 10^64 (that is, 10,000^16) possible combinations, far too many to examine one by one. We use dynamic programming to search the database efficiently and find the best match.
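The dynamic-programming search can be sketched as a Viterbi-style pass over the candidates. The article does not give the cost functions Supervoices uses, so the ones here are assumptions: each candidate is reduced to a single pitch value, the target cost is its distance from the desired pitch, and the join cost is the pitch jump between neighbors. `select_units` and `search_space_size` are hypothetical names.

```python
import math


def search_space_size(candidates):
    """Number of sequences an exhaustive search would have to examine."""
    return math.prod(len(options) for options in candidates)


def select_units(targets, candidates, join_weight=1.0):
    """Return (total_cost, chosen_indices) for the cheapest candidate sequence.

    targets[i] is the desired pitch at position i; candidates[i] lists the
    pitch of each available sample for that position.
    """
    # best[i][j] = (cheapest cost of a path ending in candidate j, back pointer)
    best = [[(abs(c - targets[0]), None) for c in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for c in candidates[i]:
            target_cost = abs(c - targets[i])
            path_cost, back = min(
                (best[i - 1][k][0] + join_weight * abs(c - p), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((target_cost + path_cost, back))
        best.append(row)

    # Trace the back pointers from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(j)
        j = best[i][j][1]
    path.reverse()
    return total, path


targets = [100, 110, 120]
candidates = [[90, 100], [140, 112], [118, 200]]
total, path = select_units(targets, candidates)
print(search_space_size(candidates), total, path)
```

Instead of scoring every full sequence, the search keeps only the cheapest path into each candidate, so the work grows with the number of positions times the square of the candidates per position rather than exponentially.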
Once we have the selected sequence of samples, there remains the problem of smoothing. Although each phoneme has many samples and the selection is careful, the assembled sentence still has some discontinuities: samples may end abruptly, and the pitch may wobble at the joins. We fix this by adjusting the pitch, much as a carpenter uses sandpaper and glue to create a smooth surface. We adjust the pitch of the samples one by one so that they fit together and the whole sentence sounds like live speech.
Prospects
Is the ultimate goal of our work a text-to-speech system that passes the Turing test, so that people cannot tell the real voice from the synthetic one? Perhaps not. Consider that people who believe they are talking to a person, for instance when they call a company's customer service center, may feel uncomfortable when they discover it is actually a machine. And in some cases a natural-sounding voice is not the best choice anyway: use while driving, cartoons, toys, video programs and computer games may call for a somewhat mechanical voice. Moreover, a text-to-speech system can do things an ordinary person cannot. For example, it can speak many languages like a native, and it never gets tired.
The ultimate goal of this technology may instead be a pleasant, expressive voice that is comfortable to listen to without any effort. Or it may be a voice with humanlike social skills, as in this example:
Caller: "I'd like to take a flight to Boston on Tuesday morning."
Computer: "I have two flights to Boston on Tuesday afternoon."
The software stresses the word "afternoon" to keep the exchange simple and smooth. The caller naturally understands that no morning flight is available and that the computer is offering an alternative. A completely inexpressive system, by contrast, could leave the caller thinking he had been misunderstood, and he might hang up.
This remains a huge challenge for Supervoices, even though it already sounds remarkably humanlike. After all, the software does not understand what it is saying. You cannot expect it to match an eighth grader, who can read aloud with rich expressive variation and explain what he or she has read. Giving the system that ability is our long-term task.
- Translated from Scientific American March 17, 2003