Chapter 7








For Bell, whose invention of the telephone created the telecommunications revolution, the original goal of easing the isolation of the deaf remained elusive. His insights into separating the speech signal into different frequency components and rendering those components as visible traces were not successfully implemented until Potter, Kopp, and Green designed the spectrogram and Dreyfus-Graf developed the steno-sonograph in the late 1940s. These devices generated interest in the possibility of automatically recognizing speech because they made the invariant features of speech visible for all to see.

The first serious speech recognizer was developed in 1952 by Davis, Biddulph, and Balashek of Bell Labs. Using a simple frequency splitter, it generated plots of the first two formants, which it identified by matching them against prestored patterns in an analog memory. With training, it was reported, the machine achieved 97 percent accuracy on the spoken forms of ten digits.

By the 1950s, researchers began to follow lesson 5 and to use computers for ASR, which allowed for linear time normalization, a concept introduced by Denes and Mathews in 1960. The 1960s saw several successful experiments with discrete word recognition in real time using digital computers; words were spoken in isolation with brief silent pauses between them. Some notable success was also achieved with relatively large vocabularies, although with constrained syntaxes. In 1969, two such systems -- the Vicens system, which accepted a five-hundred-word vocabulary, and the Medress system with its one-hundred-word vocabulary -- were described in Ph.D. dissertations.

That same year, John Pierce wrote a celebrated, caustic letter objecting to the repetitious implementation of small-vocabulary discrete word devices. He argued for attacking more ambitious goals by harnessing different levels of knowledge, including knowledge of speech, language, and task. He argued against real-time devices, anticipating (correctly) that processing speeds would improve dramatically in the near future. Partly in response to the concerns articulated by Pierce, the U.S. Defense Advanced Research Projects Agency began serious funding of ASR research with the ARPA SUR (Speech Understanding Research) project, which began in 1971.

As Allen Newell of Carnegie Mellon University observes in his 1975 paper, there were three ARPA SUR dogmas. First, all sources of knowledge, from acoustics to semantics, should be part of any research system. Second, context and a priori knowledge of the language should supplement analysis of the sound itself. Third, the objective of ASR is, properly, speech understanding, not simply correct identification of words in a spoken message. Systems, therefore, should be evaluated in terms of their ability to respond correctly to spoken messages about such pragmatic problems as travel budget management. (For example, researchers might ask a system "What is the plane fare to Ottawa?") Not surprisingly, this third dogma was the most controversial and remains so today, and different markets have been identified for speech-recognition and speech-understanding systems.

The specific goal of ARPA SUR was a recognition system with 90 percent sentence accuracy for continuous-speech sentences, using thousand-word vocabularies, not in real time. Of the four principal ARPA SUR projects, the only one to meet the stated goal was Carnegie Mellon University's Harpy system, which achieved a 5 percent error rate on continuous speech with a 1,011-word vocabulary. One of the ways the CMU team achieved the goal was clever: they made the task easier by restricting word order; that is, by limiting spoken words to certain sequences in the sentence.

The five-year ARPA SUR project was thoroughly analyzed and debated for at least a decade after its completion. Its legacy was to establish firmly the five lessons I have described. By then it was clear that the best way to reduce the error rate was to build in as much knowledge as possible about speech (lesson 1): how speech sounds are structured, how they are strung together, what determines sequences, the syntactic structure of the language (English, in this case), and the semantics and pragmatics of the subject matter and task -- which for ARPA SUR were far simpler than what HAL had to understand.

