
After concluding the computation involved in language analysis, the computer (whether reading text or generating speech) has information about the hierarchical structure of the text, the focus or stress of the different segments, and the correct pronunciation, including lexical stress, of the words in the utterance. The result of the analysis is a string of phonemes annotated with several levels of stress marking and different levels of phrase marking. Once these linguistic units are generated, the computer is ready to synthesize speech.

Synthesis from Linguistic Units

It would seem a trivial task to synthesize speech, by either rule or stored data, once the desired sequence of phonemes is known. However, the computer still lacks information about the timing and pitch of the utterance. These factors may seem unimportant as long as the computer can pronounce the phonemes correctly. Nonetheless, mistakes in timing and pitch are likely to result in unintelligible speech or, at best, the perception that the speaker is a non-native speaker.
We become aware of the role of pitch when actors impersonating a computer in a television commercial or science fiction movie try to speak in a monotone. You will notice that I said try, because they are not really talking in a monotone; if they did, it would sound more like singing than speaking. They do, however, severely restrict the range of the pitch. Humans normally talk with the timing and intonation appropriate to their native language, which they acquired as children by imitating adult speakers. The computer, of course, does not learn by imitation; for the computer to speak correctly, we have to develop the rules for pitch and timing and program it to use them.
The timing of speech events is very complicated. First, phonemes have inherent durations; for example, the vowel in the word had is much longer than the vowel in pit. Yet the durations of phonemes are not invariable. They are affected by the position of the phoneme's syllable in the phrase, the degree of stress on the syllable, the influence of neighboring phonemes, and other factors. For example, the vowel in had is much longer than the vowel in hat, because of the difference between the following consonants, /d/ and /t/. At Bell Laboratories we recently devised a statistics-based analysis scheme that measures the contribution of various factors to phoneme durations and creates algorithms to compute them.
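To make the idea of factor-dependent durations concrete, here is a minimal sketch in Python. It is not the Bell Laboratories scheme itself, whose details are not given here; it simply assumes that each factor scales a vowel's inherent duration multiplicatively. The phoneme labels, the factor names, and every number in the tables are hypothetical values chosen only so that the ordering had > hat > pit described above comes out.

```python
# Toy duration model: inherent duration scaled by context factors.
# ASSUMPTION: factors combine multiplicatively; the text does not say how
# the actual analysis scheme combines them.
# ASSUMPTION: all numbers below are illustrative, not measured values.

# Hypothetical inherent vowel durations in milliseconds.
INHERENT_MS = {
    "ae": 230.0,  # vowel of "had" / "hat"
    "ih": 130.0,  # vowel of "pit"
}

# Hypothetical scaling factors for three of the influences named in the text.
STRESS_FACTOR = {True: 1.0, False: 0.7}                    # stressed syllable?
PHRASE_FINAL_FACTOR = {True: 1.3, False: 1.0}              # syllable ends the phrase?
NEXT_CONSONANT_FACTOR = {"voiced": 1.0, "voiceless": 0.7}  # e.g. /d/ vs /t/


def vowel_duration_ms(vowel, stressed, phrase_final, next_consonant):
    """Return an estimated vowel duration in milliseconds."""
    return (
        INHERENT_MS[vowel]
        * STRESS_FACTOR[stressed]
        * PHRASE_FINAL_FACTOR[phrase_final]
        * NEXT_CONSONANT_FACTOR[next_consonant]
    )


# The three example words from the text, all stressed and phrase-medial.
had = vowel_duration_ms("ae", True, False, "voiced")
hat = vowel_duration_ms("ae", True, False, "voiceless")
pit = vowel_duration_ms("ih", True, False, "voiceless")
```

In a statistics-based approach the table entries would be estimated from measured speech rather than written by hand, and the way the factors combine would itself be something to fit.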
To program rules for the pitch contour of speech, we must first
understand how intonation provides information about the sentence
type, sentence structure, sentence focus, and lexical stress of a
speech signal. We are aware, for example, that the pitch is lower at
the end of a declarative sentence, while in many interrogative
sentences, it rises at the end. At the end of phrases, nonterminal
sentences, and parenthetical statements, we indicate that we will
continue speaking by lowering the pitch and reducing the range. We
also express focus and stress by large pitch variations. All of the
above phenomena must be programmed to make the computer deliver a
message effectively.
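
As a rough illustration of what programming such rules could look like, the following Python sketch assigns one target pitch per syllable. It covers only three of the phenomena above: a lower pitch at the end of a declarative sentence, a rise at the end of an interrogative one, and a large pitch excursion on a focused syllable. The function name, its parameters, the gradual downward drift across the utterance, and all numeric values are assumptions made for this sketch, not rules taken from the text.

```python
def pitch_contour(n_syllables, sentence_type="declarative", focus_index=None,
                  base_hz=120.0, range_hz=40.0):
    """Return a list with one target pitch (Hz) per syllable.

    ASSUMPTION: base_hz, range_hz and the scaling constants below are
    illustrative defaults, not measured values.
    """
    if n_syllables < 1:
        raise ValueError("n_syllables must be at least 1")
    if sentence_type not in ("declarative", "interrogative"):
        raise ValueError("unsupported sentence type: %r" % sentence_type)

    contour = []
    for i in range(n_syllables):
        # Relative position in the utterance, from 0.0 (start) to 1.0 (end).
        pos = i / (n_syllables - 1) if n_syllables > 1 else 1.0
        # ASSUMPTION: pitch drifts gradually downward over the utterance.
        contour.append(base_hz + 0.5 * range_hz * (1.0 - pos))

    # Focus: a large local pitch excursion on the focused syllable.
    if focus_index is not None:
        contour[focus_index] += 0.5 * range_hz

    # Sentence type: final fall for declaratives, final rise for questions.
    if sentence_type == "interrogative":
        contour[-1] += range_hz
    else:
        contour[-1] -= 0.25 * range_hz

    return contour


statement = pitch_contour(5)
question = pitch_contour(5, sentence_type="interrogative")
focused = pitch_contour(5, focus_index=2)
```

A practical system would also need rules for phrase boundaries, parenthetical statements, and lexical stress, and would interpolate between these targets to produce a smooth contour.
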
So far, we have concentrated on aspects of speech synthesis that
convey linguistic information by analyzing the acoustics of speech
sounds, as well as the manifestations of timing and pitch. Another
dimension of human speech, the emotional state of the speaker, is as
important as the linguistic content of the message. I do not explore
computer feelings in this chapter (see chapter 13 for discussion of
this topic). In 1974, however, my interest in computer music led me to
write a computer opera dealing with the intriguing subject of computer
emotion. The opera featured a singing computer.