T2RERC  

Forum Proceedings

Stakeholder Forum on Communication Enhancement

Voice Output and Display Technologies: White Paper

 

Technology Area

Researchers, manufacturers and consumers identified improved voice (e.g. personalization and increased intelligibility), display (e.g. wireless, low-power and sun-reflective) and context recognition (e.g. interlocutor speech recognition, global positioning systems) capabilities for Augmentative and Alternative Communication (AAC) devices as high priority technology needs. Stakeholders have stated that advancements to these technologies would meet significant end-user needs and represent significant business opportunities.

Market Needs

Innovation in areas such as speech output and display is significant to both the user and the clinician. Enhanced output features open the user to increased participation in outdoor activities, in social settings, in the classroom, on the job, and overall in societal interactions. With improvements, clinicians will be able to recommend more suitable options for each client and may find it easier and less time consuming to adapt the system to meet the user's preferences and needs.

The ability to express one's needs and desires is of utmost importance in leading a full and independent lifestyle. Augmentative communication devices should allow users to communicate clearly (e.g. choice of words, good voice quality), efficiently (e.g. easy and intuitive item selection, good communication rate) and effectively (e.g. get their point across, accomplish communication and interpersonal goals). Current systems sound somewhat unnatural, with reduced intelligibility, inappropriate prosody (i.e. varied intonation, stress and rhythm), and inadequate expressiveness (i.e. emotion, tone and feeling). [1] The listener's ability to understand speech depends upon the choice of words, phrases, intonation, and inflection as well as the overall communicative (topic, prior information) and environmental (place, activity) context. [1,2]

Specific needs include the ability to create highly intelligible, eloquent speech with natural intonation and voice personalization, at an acceptable and effective communication rate. The user should be able to easily customize voices to sound young or old, high-pitched or low-pitched, hoarse or smooth, excited or bored, or even whispered. The communicator should be able to inflect their voice (sounding urgent, sarcastic, or serious) and vary intonation (pitch and loudness), with a voice quality that reflects their age, gender and personal preference. Temporal control of speech output is desirable to account for the timing demands of different contexts and of communicative interactions. Communicators should be able to direct their speech to specific listeners and make it unavailable to others. [3] Improving AAC voice output capabilities will significantly improve the clarity, quality, and rate of communication, as well as social interaction, confidence and user safety.

Communication context includes the speaker's words and phrases, the environment (time, place), the activity (work, play) and the words and phrases spoken by the communication partner (the "interlocutor"). AAC devices that are capable of utilizing contextual information should have improved communication efficiency, clarity and literacy (e.g. a more appropriate word may be chosen for the utterance).

The visual display plays an important role in AAC-based communication. The display enables the speaker to see their message as they generate it and may allow their communication partner to read their utterance as they compose or speak it. The display also provides the means by which a user selects items, constructs expressions, and makes corrections to text before it is spoken. Dynamic displays change in response to user input while static displays consist of a fixed set of displayed items.

A need exists for virtual displays (i.e. the illusion of a display screen but smaller and transformable). When a person is in bed or in a vehicle for example, use of a "standard" display may be inconvenient or intrusive. A wireless dual display may be appropriate for a teacher (at the front of the room) to communicate with his or her students (at their desk) or for a patient to communicate (from a hospital room) to their doctor or nurse (at a nursing station). Virtual systems could be a separate wearable device or part of an everyday item such as eyeglasses. To facilitate more natural conversation and ensure user privacy, it should not be necessary for the communication partner to read display information over the shoulder of the augmented communicator.

Display weight and power consumption are also important considerations. Size and eye-to-screen viewing distance are critical. Battery charging should be easy and convenient. The display's weight should not make the technology burdensome for the user to transport. Displays should reflect the visual, postural, and motor capabilities of the user; be convenient and unobtrusive to the user; and be durable through many daily environmental conditions.

When AAC displays are not seen clearly, the ability to communicate is compromised. Glare from daylight or direct sunlight (or generally very bright or very dark lighting) degrades the user's ability to see displayed information. Displays can also be damaged by rain or moisture. AAC displays are needed that are clearly visible in most lighting conditions and impervious to environmental factors.

State-of-the-Practice

SPEECH OUTPUT

High-tech AAC devices employ speech output systems with either digitized or synthesized speech. Digitized speech output is essentially natural speech that has been recorded, stored, and reproduced. The recorded voice belongs to an individual other than the augmented communicator, such as a spouse, Speech Language Pathologist, or other person selected by the user. AAC devices with digitized speech output are recognized in the professional literature as "closed" systems because they reproduce only those words or messages that have been pre-stored for their user. [4] Advances in speech output are already finding applications in reading skill enhancement programs for young children and reading-impaired individuals, telephone and integrated voice response systems, multi-media CD-ROMs, kiosk information systems, talking World Wide Web pages, emergency notification systems, proofreading programs, vehicular guidance systems, and real-time language translators. [5]

Individuals who have the cognitive and linguistic ability to formulate messages independently often utilize AAC devices with synthesized speech output. Speech synthesis is a technology that transforms text input into device-generated speech using linguistic rules - including rules for pronunciation, pronunciation exceptions, voice inflections, and accents of the language. Unlike AAC devices with digitized speech output, there is no pre-recording of specific words, phrases, sentences, or messages by another person, and there is no time limit or message length limit to the speech that can be produced. Because users can construct original messages as their communication needs dictate, synthesized speech AAC devices are described as having "generative speech capability" or as being "open systems".

Presently AAC devices can provide unrestricted text to speech synthesis with somewhat unnatural speech quality, or restricted speech using prerecorded utterances with natural quality. [6] Text-to-speech conversion is done in two steps.

The first step contains the rules that process the input text. In order to provide the information needed for high-quality speech output, a text module must divide the text into sentences, the sentences into phrases, the phrases into words, the words into syllables, and the syllables into phonemes (the basic sounds from which speech is constructed). For accurate phoneme generation in any language, this module must also analyze words into morphs (prefixes, root words, and suffixes). In English, for example, the ed of naked is pronounced very differently from the ed of baked, because in naked -ed is part of the root, while in baked, -ed is the past-tense suffix. Similarly, the th of the two-root word hothouse is realized as two sounds, while the th of the single root mother is realized as a single sound. [7]
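The decomposition described above can be sketched in a few lines of Python. This is a toy illustration only, not the method of any particular synthesizer: the `ROOTS` lexicon and all function names are hypothetical, and a real text module would rely on a large dictionary and far richer rules.

```python
import re

# Hypothetical toy lexicon of root words; a real system uses a large dictionary.
ROOTS = {"bake", "naked", "mother", "hot", "house", "walk"}


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_words(sentence):
    """Return the lowercase words of a sentence."""
    return re.findall(r"[A-Za-z']+", sentence.lower())


def morphs(word):
    """Analyze a word into morphs using the toy lexicon."""
    if word in ROOTS:
        return [word]  # e.g. "naked": the -ed is part of the root
    if word.endswith("ed"):
        stem = word[:-2]
        for candidate in (stem, stem + "e"):
            if candidate in ROOTS:
                return [candidate, "-ed"]  # e.g. "baked" = bake + past tense
    for i in range(1, len(word)):
        if word[:i] in ROOTS and word[i:] in ROOTS:
            return [word[:i], word[i:]]  # e.g. "hothouse" = hot + house
    return [word]  # unknown word: leave unanalyzed


for w in ("naked", "baked", "hothouse", "mother"):
    print(w, morphs(w))
```

Once the morph boundaries are known, later pronunciation rules can treat the -ed of baked as a suffix and the th of hothouse as two separate sounds.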

The second step uses the linguistic structure generated by the text module in producing the speech output. The speech module must generate the overall prosody (intonation, stress and rhythm) of the utterance, as well as the appropriate acoustic patterns for the individual speech sounds. To generate prosody, all systems use rules that manipulate parameters like pitch and duration, although the sophistication of the prosody rules differs dramatically across systems. To generate the acoustics of the speech segments themselves, systems generally use one of two main strategies: concatenative or rule-based. [7]
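As a rough illustration of what a prosody rule that manipulates pitch and duration might look like, the sketch below applies two simple rules to a list of syllables: pitch falls gradually across the utterance, and the final syllable is lengthened. The function name, the default values and the two rules themselves are assumptions chosen for illustration; they are not taken from any of the systems discussed here.

```python
def apply_prosody(syllables, start_hz=130.0, end_hz=100.0,
                  base_ms=180.0, final_stretch=1.5):
    """Assign a pitch and a duration to each syllable (illustrative rules only)."""
    n = len(syllables)
    plan = []
    for i, syllable in enumerate(syllables):
        # Rule 1 (assumed): pitch declines linearly from start_hz to end_hz.
        fraction = i / (n - 1) if n > 1 else 0.0
        pitch = start_hz + (end_hz - start_hz) * fraction
        # Rule 2 (assumed): the last syllable of the utterance is lengthened.
        duration = base_ms * (final_stretch if i == n - 1 else 1.0)
        plan.append({"syllable": syllable, "pitch_hz": pitch, "duration_ms": duration})
    return plan


for entry in apply_prosody(["to", "day", "is", "sun", "ny"]):
    print(entry)
```

A more sophisticated system would add many further rules, for example for stressed syllables, questions and emphasis.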

In concatenative systems, actual voice segments (such as syllables or parts of syllables) are recorded and pieced together to produce the intended utterance. Depending on a variety of factors (how large the segments are, how many segments are stored, how the segments are represented and many others), concatenative systems can produce quite natural-sounding voice quality, capturing much of the voice quality of the original speaker. However, because these systems often depend so critically on the speaker(s) from whom the units were extracted, generating a variety of voices and speech styles (e.g., whispering) can be difficult, and each new voice or style may require extracting an entirely new set of voice segments from the model speaker. Also, depending on the strategy, it is often difficult to express the rules responsible for generating the overall prosody; while a given system may produce quite natural voice quality, it may also produce unnatural prosody. For example, the voice may be clear, yet the intonation and rhythm may make it sound very unnatural. Finally, again depending on the details of the approach, memory requirements for concatenative systems can be excessive (1.5 megabytes of memory for each second of speech), even when only a small number of voices are provided. [7]
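The "piecing together" step can be sketched as follows. Each recorded segment is represented here as a plain list of sample values, and adjacent segments are joined with a short linear crossfade to soften the join. The crossfade is one simple smoothing choice assumed for illustration; the function name and parameters are hypothetical and do not describe any specific product.

```python
def crossfade_concat(units, overlap):
    """Join recorded segments, blending `overlap` samples at each boundary."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        # Never blend more samples than either side actually has.
        n = max(0, min(overlap, len(out), len(unit)))
        start = len(out) - n
        for i in range(n):
            weight = (i + 1) / (n + 1)  # ramps up toward the new segment
            out[start + i] = (1.0 - weight) * out[start + i] + weight * unit[i]
        out.extend(unit[n:])
    return out


quiet = [0.0, 0.0, 0.0, 0.0]
loud = [1.0, 1.0, 1.0, 1.0]
print(crossfade_concat([quiet, loud], overlap=2))
```

With `overlap=0` the segments are simply placed end to end, which is where audible discontinuities are most likely.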

Rule-based synthesizers utilize some variation of formant synthesis (the vocal tract is defined by its resonant frequencies - its "formants"). In rule-based systems, the acoustic parameter values (frequencies, bandwidths, amplitudes) for the utterance are generated entirely by algorithmic means and capture the perceptual cues for reproducing the spoken utterance. A set of voice filters modifies these cues in accordance with the values specified for a number of parameters (i.e. gender, breathiness, roughness, and pitch) to produce the desired voice quality. A synthesizer then generates the final speech waveform from the parameter values. [7]
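To make the idea of generating a waveform from formant parameters concrete, here is a minimal sketch that passes a train of pulses (a crude stand-in for the voice source) through a cascade of second-order digital resonators, one per formant. The formant frequencies and bandwidths in the usage line are illustrative round numbers, not values from any synthesizer named in this paper, and all function names are hypothetical.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients of a second-order resonator y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain at 0 Hz to one
    return a, b, c


def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Filter a signal through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, n_samples, sample_rate):
    """One pulse per pitch period: a very rough voice source."""
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]


def synthesize_vowel(formants, f0_hz=120.0, dur_s=0.05, sample_rate=16000):
    """Cascade one resonator per (frequency, bandwidth) pair over the source."""
    signal = impulse_train(f0_hz, round(dur_s * sample_rate), sample_rate)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, sample_rate)
    return signal


# Illustrative formant values only (frequency in Hz, bandwidth in Hz).
samples = synthesize_vowel([(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)])
print(len(samples))
```

Changing `f0_hz` alters the perceived pitch, and changing the formant list alters the vowel quality, which is why a rule-based system can produce new voices by adjusting parameters rather than recording a new speaker.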

Concatenative and rule-based systems each have their respective advantages. Rule-based approaches require more extensive knowledge and understanding of the sound patterns of speech than do concatenative approaches. While acquiring this knowledge can be expensive and time-consuming, rule-based approaches have the advantage that the system is based on underlying linguistic models.

A number of synthesizers are commercially available. AAC voice output is currently standardized on DECTalk™, a rule-based voice synthesizer that runs on personal computers (PCs) under Windows 95, 98, NT, 2000 and CE. DECTalk™ offers nine different voices (four male, four female, one child), models sentence rhythm and intonation, and takes into consideration surrounding words and their effects on individual pronunciations. [6]

DoubleTalk™ is a rule-based text-to-speech synthesizer that operates on Apple and IBM-compatible personal computers. DoubleTalk™ features 8 voices and includes the option of digitized speech playback. Currently DoubleTalk™ is used to provide computer speech output for individuals who are blind or who have low vision. [8]

ETI-Eloquence™ is another rule-based speech synthesizer used much less commonly in AAC. ETI-Eloquence runs on Windows 95, 98, CE and NT based PCs and UNIX platforms (Solaris, AIX, Linux). A Windows compatible sound card or any audio output device with Windows drivers is required for use. [7]

The Microsoft® Speech SDK 5.0 supports continuous speech recognition, concatenative text-to-speech voice synthesis and Voice Extensible Mark-up Language (VoiceXML, similar to the HyperText Mark-up Language, HTML, used to create web pages). [9] Multiple applications can share speech resources on the computer. Microsoft Speech versions are available for download on the Microsoft Agent website.

Smoothtalker is a concatenation-based system, where all the units being concatenated are diphones (i.e., brief samples of speech spanning the region from the middle of one phoneme to the middle of the subsequent phoneme). Unfortunately, diphone concatenation often sounds very "choppy" because diphones don't always blend together smoothly. [6]
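The diphone idea can be illustrated with a short sketch that turns a phoneme sequence into the list of diphone units a concatenative system would need to look up. The phoneme symbols, the `_` silence marker and the function names are hypothetical choices for illustration and do not reflect Smoothtalker's internal representation.

```python
def to_diphones(phonemes, silence="_"):
    """List the diphones (phoneme-to-phoneme transitions) for an utterance."""
    # Pad with silence so the first and last phonemes also get a transition.
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def max_inventory(n_phonemes):
    """Upper bound on distinct diphones if every pairing can occur."""
    return n_phonemes * n_phonemes


print(to_diphones(["h", "e", "l", "o"]))
print(max_inventory(40))  # assuming a language with roughly 40 phonemes
```

Each diphone in the list would be fetched from the recorded inventory and joined to its neighbours, which is where the audible "choppiness" can arise when adjacent units do not match well.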

CONTEXT RECOGNITION

Voice recognition is an emerging technology with vast potential. Voice recognition is already used to access PC applications, the Internet, telephone answering services, etc. Voice recognition technologies can be classified as speaker dependent (requiring training with the same voice) or speaker independent (able to recognize multiple users without training for each), and as discrete (with pausing between each word) or continuous (without pausing). VoiceExpress, Apple Speech Recognition and ViaVoice perform with continuous speech, while Dragon Dictate performs with only discrete speech. All large-vocabulary dictation systems are speaker dependent, requiring each user to "train" or "calibrate" a personal voice file. In "natural" environments, the ability to isolate the speaker's voice from noise (other voices, wind) is critical. Recently, adaptive beam-forming microphone technology has become available that "tracks" a speaker inside a "directed cone" while reducing sound from all other directions. [10]

Voice Extensible Mark-up Language, or VoiceXML, is an evolving standard for creating speech-enabled applications. This "voice-enabled HTML" defines how a dialogue is constructed and executed between a caller and a computer running speech recognition and/or text-to-speech software. VoiceXML is able to create speech-enabled Web pages accessed by phone, and build large-scale systems such as telephony-based speech recognition call center solutions. [11] VoiceXML has applications to Voice Portals.

A voice portal is an audio version of a web portal such as Yahoo, Excite, Alta Vista and others. It is the interface between a caller and an information source. Instead of typing on a keyboard or clicking on a link, users dial a toll free number and verbally select from several categories of information. Voice portals are speaker-independent and allow telephone users to browse stock quotes, movie schedules, weather reports, local restaurants, horoscopes, traffic, news and other web-like services on a voice-activated system. Instead of reading text or looking at graphics on the web, callers speak commands in response to voice prompts and listen to voice responses through the telephone. The voice portal uses speech recognition software to identify the caller's selection and then retrieves the information from pre-determined web sites that are regularly updated. After matching the text to pre-recorded audio clips of words or numbers, the voice portal program strings together the clips into speech that is relayed through a telephone receiver. [12]
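The last step, matching text to pre-recorded clips and stringing them together, might be sketched as below. The clip file names, the tiny vocabulary and the choice to read numbers digit by digit are all assumptions made for illustration; an actual voice portal would use its own clip library and conventions.

```python
import re

# Hypothetical clip library: one pre-recorded audio file per word or digit.
WORDS = ["the", "temperature", "is", "degrees", "today"]
CLIPS = {word: f"{word}.wav" for word in WORDS}
CLIPS.update({str(d): f"{d}.wav" for d in range(10)})


def text_to_clips(text, clips=CLIPS):
    """Return the ordered list of clips to play for a piece of retrieved text."""
    playlist = []
    # Words are matched whole; numbers are read one digit at a time.
    for token in re.findall(r"[a-z]+|\d", text.lower()):
        if token not in clips:
            raise KeyError(f"no recorded clip for {token!r}")
        playlist.append(clips[token])
    return playlist


print(text_to_clips("The temperature is 72 degrees"))
```

Because only words with a recorded clip can be spoken, this approach shares the "closed system" limitation of digitized speech described earlier.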

The Global Positioning System (GPS) is a worldwide radio navigation system that provides accurate location and time information (i.e. context) to devices equipped with the proper receivers. GPS provides specially coded satellite signals that can be processed in a GPS receiver, enabling the receiver to compute position, velocity and time. [13] GPS is funded and controlled by the U.S. Department of Defense (DOD) and is used worldwide by military and government users and, more recently, by AAC users to determine the location of an individual or device. Examples of GPS systems that are currently being used in assistive devices are Atlas Speaks (talking map) and Strider (GPS access system). These systems incorporate a GPS receiver that lets users learn about the physical layout of a neighborhood, city, or state and navigate from location to location. [11]
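One way a device could turn a GPS position into communication context is to find the nearest known place and offer vocabulary associated with it. The sketch below uses the haversine great-circle formula for distance. The place names, coordinates, word lists and function names are hypothetical, and reading the position from an actual GPS receiver is not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical places: (latitude, longitude, suggested vocabulary).
PLACES = {
    "school": (43.00, -78.79, ["teacher", "homework", "recess"]),
    "clinic": (42.90, -78.87, ["appointment", "nurse", "medication"]),
}


def vocabulary_for(lat, lon, places=PLACES):
    """Pick the nearest known place and return its name and word list."""
    name = min(places, key=lambda k: haversine_km(lat, lon, places[k][0], places[k][1]))
    return name, places[name][2]


print(vocabulary_for(43.001, -78.791))
```

The same lookup could be combined with the time reported by the receiver, for example to prefer mealtime vocabulary around noon.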

DISPLAY TECHNOLOGY

Current display systems vary in complexity, size, layout, and technology utilized. Advancements to display technology have fundamentally improved the quality of the image presented. The most recent electronic displays are thin panels, a significant improvement from traditional Cathode Ray Tube (CRT) monitors in which the screen is a giant vacuum tube. These light and flat panel structures are based on liquid crystals (LCDs) or vacuum fluorescence (VFD) for example. Other displays include microdisplays in which VFD, CRT or LCD technology is utilized but scaled down. Recently, a revolutionary "electronic paper" has been developed as a lightweight, reusable alternative to conventional displays. In addition, advancements in filter technology ("filters" allow some colors or "orientation of light" to pass while blocking others) have come to market due to the need for reduction of glare.

A liquid crystal display (LCD) is the viewing screen found on many communication devices (Lightwriter, Dynavox 3100, Hand Held Voice, DigiCom, E-Talk). An LCD consists of two plates of glass with liquid crystal material between them that controls the dots of color, or "pixels", that combine together to form the screen's image. LCDs require a separate light source (usually light emitting diodes, or LEDs) for illumination. [14,15] There are two main types of LCD computer displays: passive matrix (or dual scan) and active matrix. Passive matrix displays use one transistor at the head of each row or column of pixels in the screen. Like playing the board game "Battleship", when the computer issues an instruction for the transistors on, say, the first row and third column to activate, it's like scoring a hit, and pixel C1 lights up. In active matrix displays, each pixel has its own transistors and electrodes and is individually controlled to turn on, off, or somewhere in between. Typically, you'll only find passive matrix displays on low-end notebooks with 12-inch displays. [15]
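The "Battleship" picture of passive matrix addressing can be sketched in code. Energizing a set of row lines and column lines lights every pixel where they cross, so an arbitrary pattern cannot be shown by driving all of its rows and columns at once; the sketch assumes the common remedy of scanning one row at a time. Function names and the grid labelling are hypothetical, and real display drivers are considerably more involved.

```python
def label_to_index(label):
    """Convert a Battleship-style label such as 'C1' to (row, column) indices."""
    column = ord(label[0].upper()) - ord("A")
    row = int(label[1:]) - 1
    return row, column


def drive_lines(rows_on, cols_on):
    """Passive matrix: every crossing of an active row and column lights up."""
    return {(r, c) for r in rows_on for c in cols_on}


def scan_frame(pixels):
    """Show an arbitrary pattern by activating one row at a time (assumed scheme)."""
    lit = set()
    for row in sorted({r for r, _ in pixels}):
        columns = {c for r, c in pixels if r == row}
        lit |= drive_lines({row}, columns)
    return lit


wanted = {label_to_index("A1"), label_to_index("C2")}
print(sorted(drive_lines({0, 1}, {0, 2})))  # all at once: two extra pixels light up
print(sorted(scan_frame(wanted)))           # row by row: only the wanted pixels
```

An active matrix display avoids this limitation because each pixel has its own transistor and can be set individually.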

A vacuum fluorescent display (VFD) consists of a grid system and is optional on some communication devices such as the LightWRITER. A segment of a VFD grid is activated by applying a voltage across some point on the grid. A layer of phosphor covers the grid, and when a grid point is hit by electrons, it produces a characteristic blue-green emission, or glow. VFDs are used in products including office equipment, domestic appliances, hi-fi equipment and vehicle dashboards. The technology's advantages are its high brightness, wide viewing angle, multi-color capability and mechanical reliability. VFDs are available in a variety of standard formats covering numeric, alpha-numeric and dot matrix fonts. [16,17]

Typical microdisplays measure less than 1.5 inches diagonally. Because of their size, they require optics to enlarge the image for the eye. Microdisplay technology can take one of three forms: emissive (VFD, CRT), transmissive and reflective (LCD). Emissive displays use a layer of electroluminescent phosphors that emit light when stimulated. Transmissive displays create an image by modulating (increasing and decreasing the brightness of) a light source (e.g. an LED) behind the display at the pixel level. Reflective display technologies modulate an external light source by controlling the reflectivity of each pixel. [18] Microdisplays would make laptop personal computers lighter and smaller and extend battery life.

Electronic paper is a digital, wireless display technology with very high resolution, perfect contrast and a wide viewing angle, and it is viewed in reflected light. It will allow one to send email and faxes directly on paper without the need for a computer or tablet. The paper acts as the input screen on which any graphical interface can be printed, erased, and reused. [19] This material has many potential applications in the field of information display, including digital books, low-power portable displays, and wall-sized and fold-up displays. It has advantages over current displays in that it is brighter, uses much less power, is more portable, offers a bigger screen, is flexible and costs less.

All flat-panel displays share the problem of glare. Although removing or minimizing the source of the reflection is the most effective method for reducing the effects of glare on displays, this is not always possible or desirable. A hood or sun-guard over the display often minimizes some reflection problems but is physically cumbersome, unaesthetic and does not always function adequately. Active adaptive luminance is used to suppress direct glare from displays, but problems arise in the positioning and number of sensors needed to accommodate shadows on the display area. [20] Another way to control reflected glare is to use positive polarity displays where possible, since reflections are more noticeable on the larger dark parts of a display (mirror effect). In addition, a variety of filters are available that allow only certain colors and wavelengths of light to pass through. A number of variations exist, including mesh filters, neutral density filters, polarized filters and circular polarizing filters. [20,21]

Issues to Consider

The Need
  • What are the important, unmet (or poorly met) user needs related to voice output, displays and context recognition in AAC devices?
  • In which environments would improvements to these systems be most beneficial for the AAC user?
  • What types of changes should be made to the devices in order to incorporate an improved voice output, display, or context recognition system?
  • What types of display and speech synthesis can best benefit AAC users in their communicative interactions?
State-of-the-Practice
  • What voice output, display and context recognition systems do AAC devices provide now?
  • What are the strengths of these systems in terms of performance, cost, etc.?
  • What are the weaknesses of these systems in terms of performance, cost, etc.?
Future Technology and Products
  • What type of technology needs to be developed in order to improve these systems in AAC devices?
  • What technical barriers must be overcome in order to incorporate these systems in AAC devices?
  • How can devices be made that better incorporate feedback from the user's environment into the device?
  • What breakthrough technologies might better address the identified needs and problems that are currently not on the market?

References

  1. Cahn, Janet E. "The Generation of Affect in Synthesized Speech." MIT Media Technology Laboratory: 1990.
  2. Institute for Communicating and Collaborative Systems. "Abstract - Subsequent Context Affects Word Recognition in Spontaneous Speech." April 12, 2001. [Online: http://www.iccs.informatics.ed.ac.uk/publications/RP/1987/EUCCS-RP-1987-9.html]
  3. Higginbotham, Jeff, Greg Lesher, and Bryan Moulton. "Communication Performance Assessment of Augmentative Communication." [Presentation]
  4. Minnesota Assistive Technology Loan Network. (2000). What is Augmentative Communication. [Online: http://www.admin.state.mn.us/assistivetechnology/loan/aacinfo.htm]
  5. Hertz, Susan R. "A White Paper: The ETI-Eloquence Text-to-Speech System." Ithaca: Eloquent Technology, Inc. Nov. 1997. [Online: http://www.eloq.com/White1297-2.htm]
  6. Yarrington, Debra, Steven R. Hoskins, James Polikoff, H. Timothy Bunnell. "Personalized Synthetic Voices for AAC." April 20, 2001. [Online: http://www.asel.udel.edu/speech/reports/isaac2000/isaac2000.htm]
  7. Hertz, Sue. "The Technology of Text-to-Speech." Speech Technology. CI Publishing. April/May 1997. [Online: http://www.eloq.com/tts.htm]
  8. RC Systems Product Brief: DoubleTalk Voice Synthesizers. [Online: http://www.rcsys.com/dt.htm] February 23, 2001.
  9. "Voice Tools Engine Resource." [Online: http://www.speechsolutions.com/sapitools.htm]
  10. Andrea Electronics. "Digital Far-Field Microphone Technology." [Online: http://andreaelectronics.com/dsdadesk.htm]
  11. Chirokas, Steve. "2001 Speech Odyssey: A Journey Through VoiceXML." Speech Technology Magazine. Mar/Apr 2001. [Online: http://www.speechtechmag.com/st.mag/current/speech_odyssey.shtml]
  12. AT&T Information Resource Center. "Voice Portals." June 2000. [Online: http://www.ipservices.att.com/techviews/whitepapers/VoicePortals.pdf]
  13. Dana, Peter H. "Global Positioning System Overview." University of Texas at Austin. 1994. [Online: http://www.colorado.edu/geography/gcraft/notes/gps/gps_f.html]
  14. "AAC Glossary: Augmentative and Alternative Communication, Second Edition." Paul H. Brookes Publishing Company. [Online: http://www.brookespublishing.com/aac/aacgloss.htm]
  15. Van Winkle, William. "All Things Flat." Computer Source Magazine. Jan 2001. [Online: http://archive.sourcemagazine.com/archive/101/feature5.asp]
  16. Pulseview: Vacuum Fluorescent Display Gas. Meggitt Regisbrook Displays. [Online: http://www.i-way.co.uk/~regadmin/vacuum%20fluorescent%20display.html]
  17. "PIC Vacuum Fluorescent Display Interface." Silicon Junction. [Online: http://www.users.bigpond.com/pbhandary/pic/vfd.html].
  18. Vrana, Greg. "Microdisplays: No Longer A Microcosm." EDN Magazine. March 15, 2001.
  19. Silberman, Steve. The Hot New Medium: Paper. Wired Magazine. Apr. 2001.
  20. "Human Factors for Designers of Equipment." Part 7: Visual Displays. Defense Standard, Ministry of Defense. 20 Dec 1996. [Online: http://dstan.mod.uk/data/00/025/07000200.pdf]
  21. "Polarizing Filters: Applications - Flat Panel Displays." 3M United States. [Online: http://www.3m.com/market/omc/om_html/tech_polar_html/applications/flat-panel.jhtml]
