
Playing HAL
To demonstrate today's state of the art in computer speech recognition, we fed part of the soundtrack of 2001 into Kurzweil VOICE for Windows version 2.0 (KV/Win 2.0). KV/Win 2.0 can understand the speech of a person it has not heard speak before and can recognize a vocabulary of up to sixty thousand words (forty thousand in its initial vocabulary, with the ability to add another twenty thousand). The primary limitation of today's technology is that it can handle only discrete speech -- that is, words or brief phrases (such as "thank you") spoken with brief pauses in between.

I played the following dialogue to KV/Win 2.0 to learn whether it could understand Dave as HAL does in the movie:

HAL: Good evening, Dave.
Dave: How you doing, HAL?
HAL: Everything is running smoothly; and you?
Dave: Oh, not too bad.
HAL: Have you been doing some more work?
Dave: Just a few sketches.
HAL: May I see them?
Dave: Sure.
HAL: That's a very nice rendering, Dave. I think you've improved a great deal. Can you hold it a bit closer?
Dave: Sure.
HAL: That's Dr. Hunter, isn't it?
Dave: Hm hmm.
HAL: By the way, do you mind if I ask you a personal question?
Dave: No, not at all.

I trained the system on the phrases "Oh, not too bad" and "No, not at all," but did not train it on Dave's voice. When I did the experiment, KV/Win 2.0 had never heard Dave's voice, and it had to pick out each word or phrase from among forty thousand possibilities. I had the system listen to Dave saying the following discrete words and phrases from the above dialogue:

Dave: Oh, not too bad.
Dave: Sure.
Dave: Sure.
Dave: No, not at all.

KV/Win 2.0 was able to successfully recognize the above utterances
even though it had not been previously exposed to Dave's voice (see
figure 7.4). For good measure, I also had KV/Win 2.0 listen to Dave in
the critical scene in which HAL is betraying him. In this scene, Dave
says the word HAL five times in a row in an increasingly plaintive
voice. KV/Win 2.0 successfully recognized the five utterances, despite
their obvious differences in tone and enunciation (see figure
7.5). Looking at the spectrogram, we can see that these five
utterances, although they are similar in some respects, are really
quite different from one another and demonstrate clearly the
variability of human speech (see figure 7.6). So, except for KV/Win's
restriction to discrete speech, with regard to speech recognition
we've already created HAL!