
Linear Predictive Coding (LPC) Tutorial
Text-to-Speech Synthesis at Bell Labs
At present, two types of parameters are used for the data-driven method of synthesis: stored waveforms and a small set of spectral parameters that is mathematically derived from the speech signal. The spectral parameters are called LPCs (for linear predictive coding) because one of their forms predicts the next set of speech-waveform values from a small set of previously computed waveform values. Although waveform parameters produce high-quality speech, it is impossible to control the spectrum of the stored waveforms independently. Synthesizing with these parameters therefore lacks flexibility for altering the speech spectrum. The LPC parameters also produce high-quality speech, although it sounds somewhat mechanical. Their flexibility makes it easy to alter them to produce connected speech.

When I began working in speech synthesis shortly after the discovery of LPC parameters, I was attracted by their ability to reproduce high-quality speech. My early research involved constructing a synthesizer, using words as the unit of synthesis. By using twelve hundred common words I was able to synthesize many paragraphs of text. Because I used parametrized speech, I could smooth the connection between words and impose an intonation over the utterance to make the speech sound continuous. However, the synthesizer was limited: too many words were not in my inventory. I then turned to the methods introduced by Peterson and his colleagues.

The speech synthesizer I currently use at Bell Laboratories generates speech from stored short utterances of analyzed speech, using LPC-derived parameters. It is not a simple system of diphones, but a complex system that contains many segments larger than diphones, to accommodate phonemes with complex coarticulation effects. For example, to synthesize the word incapable spoken by HAL and shown in figure 6.1, we first transcribe the word into a phonetic notation. Incapable becomes
where /*/ represents silence, /1/ is the neutral vowel schwa, and /U/ is the vowel a as in the word able. The synthesizer then attempts to match the largest string of phonemes from the word to a string in its databank. If two adjacent phonemes do not interact (that is, there is little coarticulation between them, as is the case for /n/ followed by /k/), the synthesizer will not find a diphone. In this case, it will add a silence element of zero duration. When a phoneme is greatly influenced by its neighbors, as in the case of a schwa, a triplet of phonemes will be stored in the database. Thus the word incapable will be synthesized from the following elements:
The resultant speech is intelligible, although it sounds mechanical
and would never be mistaken for a human voice.
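The matching step described above (take the largest stored string of phonemes, and fall back to a zero-duration silence when no diphone exists) can be sketched as a greedy longest-match search. This is only an illustration, not the Bell Labs implementation. The function name `select_units`, the one-character phoneme symbols, the contents of `DEMO_INVENTORY`, and the assumption that neighboring units overlap by one phoneme are all hypothetical. So is the reading that a missing diphone such as /nk/ is replaced by the pair /n*/ and /*k/.

```python
SILENCE = "*"  # the text uses /*/ for silence

# Hypothetical databank: a few diphones plus one triplet around a schwa (/1/).
# A real inventory would be far larger; these entries are invented for the demo.
DEMO_INVENTORY = {"*i", "in", "kU", "Up", "p1b", "bl", "l*"}


def select_units(phonemes, inventory, max_len=3):
    """Cover a phoneme sequence with stored units, longest match first.

    phonemes  -- sequence of one-character phoneme symbols
    inventory -- set of stored units, each a string of 2..max_len symbols
    Neighboring units are assumed to overlap by one phoneme, as diphones do.
    """
    units = []
    i = 0
    while i < len(phonemes) - 1:
        for length in range(max_len, 1, -1):      # try triplets before diphones
            if i + length > len(phonemes):
                continue                          # not enough phonemes left
            candidate = "".join(phonemes[i:i + length])
            if candidate in inventory:
                units.append(candidate)
                i += length - 1                   # keep the last phoneme as overlap
                break
        else:
            # No stored unit spans this pair: bridge it with a zero-duration
            # silence (assumed to be realized as two half-units).
            units.append(phonemes[i] + SILENCE)
            units.append(SILENCE + phonemes[i + 1])
            i += 1
    return units


if __name__ == "__main__":
    # Hypothetical transcription, used only to exercise the search.
    print(select_units(list("*inkUp1bl*"), DEMO_INVENTORY))
```

With this toy inventory, the pair /n/ and /k/ has no stored diphone and is bridged by silence, while the schwa is covered by the stored triplet.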
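The idea behind the LPC parameters mentioned earlier, predicting the next waveform value from a small set of previous values, can also be illustrated in a few lines. The sketch below uses the autocorrelation method with the Levinson-Durbin recursion, a common textbook way to estimate predictor coefficients. The source does not say which estimation method the Bell Labs system uses, so treat that choice, and the names `lpc_coefficients`, `predict_next`, and `make_ar_signal`, as assumptions made for illustration.

```python
def autocorrelation(signal, max_lag):
    """Return r[0..max_lag], where r[k] = sum over n of x[n] * x[n + k]."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(signal, order):
    """Estimate a[1..order] so that x[n] is approximated by
    a[1]*x[n-1] + ... + a[order]*x[n-order] (Levinson-Durbin recursion)."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)            # a[0] is unused
    error = r[0]                       # prediction error energy so far
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error                # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:]


def predict_next(history, coeffs):
    """Predict the next sample from the most recent len(coeffs) samples."""
    return sum(c * history[-k] for k, c in enumerate(coeffs, start=1))


def make_ar_signal(a1, a2, length):
    """Toy test signal: impulse response of x[n] = a1*x[n-1] + a2*x[n-2]."""
    x = []
    for n in range(length):
        value = 1.0 if n == 0 else 0.0
        if n >= 1:
            value += a1 * x[n - 1]
        if n >= 2:
            value += a2 * x[n - 2]
        x.append(value)
    return x


if __name__ == "__main__":
    signal = make_ar_signal(1.2, -0.5, 200)
    coeffs = lpc_coefficients(signal, order=2)
    print(coeffs)                              # close to the generating 1.2 and -0.5
    print(predict_next(signal[:5], coeffs), signal[5])
```

Real speech is analyzed in short frames, typically with a higher predictor order. The toy signal here is chosen only because its generating coefficients are known, so the estimate can be checked against them.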
Speech, a subset of language, is one method humans use to communicate
with each other. The most direct form of language communication
happens when one human, the generator, speaks to one or more
humans, the receptors. This mode of communication is easy for
the generator; he or she need only choose the proper words to
represent an idea and produce the speech sounds that represent the
words. Barring such problems as a noisy environment or language
differences, receptors will usually understand the idea the generator
is trying to transmit.