Eve Andersson is a co-founder of ArsDigita Corporation and one of
the authors of the ArsDigita Community System.
She has degrees from Caltech and U.C. Berkeley and has
built dozens of popular web sites on the public internet. Personal
web site: www.eveandersson.com.
In every computing era, programmers have been responsible for writing the fundamental
application logic. During the desktop application era (1980s), the
attention given to this logic was generally dwarfed by that given to the user interface, event handling, and
graphics code that a programming team needed to write to get a
computer program into the hands of users. Result: very little
innovation at the individual level; most widely used computer programs were written by large
companies.
During the Web era (1990s), the user interface and graphics were
rendered by the Web browser, e.g., Netscape Navigator or Microsoft
Internet Explorer. Programmers were able to deliver a complete system
to end-users after writing only the application logic and some simple
HTML specifying the user interface behavior. Result: a revolution in
innovation, with most Web services written in a few months by a
handful of people.
Suppose that you'd observed that telephones are much more common and
portable than personal computers and Web browsers. Furthermore, you'd
noticed that telephones can be used by almost everyone, whereas many
consumers have little patience for the complexities of the PC. Thus,
you'd want to make your information system accessible to a user with
only a telephone. How would you have done it? In the 1980s, you'd have
rented a telephone line, bought a big specialized box to recognize
utterances, bought another specialized box to talk to the user, and parked
those boxes right next to the main server for your application. In
the 1990s, you'd have had to rent a telephone line, buy specialized
software, and park a standard computer running that software next to
the server running your application. Result in both decades: very
little innovation, with only the largest organizations offering
voice/telephone interfaces to their information systems.
With the advent of today's voice browsers, the coming years promise to
be a period of tremendous innovation in the development of
telephone-accessible Internet applications. With a Web service, you
operate the HTTP server and run the application; someone else runs the
browser. The idea of the voice browser is the same. You operate a
server and the application. Someone else, perhaps the phone company,
runs the telephone lines and voice browser.
Bottom line: voice browsers allow you to build telephone voice
applications with nothing more than an HTTP server.
From this, great innovation shall spring.
Illustration
One weekend in February 2001, Tracy Adams, one of my ArsDigita
co-founders, called me from her cell phone. She had just flown into Los
Angeles and wanted to know the telephone number and address of our
Los Angeles office, as well as the direct number for one of the
employees. I pointed a Web browser to our intranet, looked up the
info, and read it aloud to Tracy.
Feeling inspired, I spent a few hours creating a VoiceXML application:
the ArsDigita Telephone Directory, accessible from any telephone in
the world. You call up and say which office or employee you're
looking for. After searching through some pre-existing Oracle database
tables, it tells you the phone numbers and addresses you want.
Next time Tracy arrives confused in a foreign city, she won't have to
rely on me being at my desk.
What is VoiceXML?
VoiceXML, or VXML, is a markup language like HTML. The difference:
HTML is rendered by your web browser to format content and user-input
forms; VXML is rendered by a voice browser. Your application can
speak to the user via synthesized speech or by pre-recorded audio
files. Your software can receive input from the user via speech or by
the tones from their telephone keypad. If you've ever built a web
application, you're ready to get started with your phone application.
How to make your content telephone-accessible
As in the old days, you can still rent a telephone line and run commercial
voice recognition software and text-to-speech (TTS) conversion
software. However, the most interesting aspect of the VXML revolution
is that you need not actually do so. There are free VXML gateways, such as Tellme
(http://www.tellme.com) and
VoiceGenie (http://www.voicegenie.com).
These take VXML pages from your web server and read them to
your user. If your application needs input from the user, the gateway
will interpret the incoming response and pass that response to your
server in a way that your software can understand.
You use a web form to configure the gateway with the URL of your application, and it
will associate a telephone number with it. In the case of Tellme,
your users call 1-800-555-TELL, dial your 5-digit extension, and now
they're talking to your application.
VoiceXML basics
The format of a VXML document is simple. Here's how to say "Hello,
World" to your visitors:
<vxml>
  <form>
    <block>
      <audio>Hello, World</audio>
      <!-- _home takes you back to the entry point of the call -->
      <goto next="_home"/>
    </block>
  </form>
</vxml>
Every opening tag (e.g., <vxml>) has to be closed, either with a
closing tag like </vxml>, or with a slash (/) as at the end of the
singleton goto tag.
The <vxml> tag specifies that this is a VXML document. Within that is
a <form>, which can either be an interactive element -- requesting
input from the user -- or informational. You can have as many
forms as you want within a VXML document. A <block> is a
container for your executables, meaning that all your tags that
make your application do something, such as <audio> and <goto>, can
be clumped together inside of a block.
<audio>text</audio> will read the text with a TTS converter, whereas
<audio src="wav_file_URL"/> will play a pre-recorded .wav audio file.
<goto> can point to another URL, another form within the same VXML
doc, or _home, meaning the application is finished.
Here's an example that accepts user input and
behaves differently depending on what the user says:
<vxml>
  <form id="animal_questionnaire">
    <field name="favorite_animal">
      <prompt>
        <audio>Which do you like better, dogs or cats?</audio>
      </prompt>
      <grammar>
        <![CDATA[
          [
            [dogs hounds puppies (hound dogs)] {<option "dogs">}
            [cats kitties kittens] {<option "cats">}
          ]
        ]]>
      </grammar>
    </field>
    <!-- if the user gave a valid response, the filled block is executed -->
    <!-- Tellme only accepts one field per dialog, so there is no ambiguity about what field this refers to -->
    <filled>
      <result name="dogs">
        <goto next="#popular_dog_facts"/>
      </result>
      <result name="cats">
        <!-- curly braces around the name of a variable give you its value -->
        <goto next="psychological_evaluation.cgi?affliction={favorite_animal}"/>
      </result>
    </filled>
    <!-- if the user's response doesn't match the grammar, the nomatch block is executed -->
    <nomatch>
      <audio>I'm sorry, I didn't understand what you said.</audio>
      <reprompt/>
    </nomatch>
    <!-- if there is no response for a few seconds, the noinput block is executed -->
    <noinput>
      <audio>I'm sorry, I didn't hear you.</audio>
      <reprompt/>
    </noinput>
  </form>
  <!-- additional forms can go here -->
</vxml>
In this example, we've created a variable called favorite_animal using
the <field> tag. After we've prompted the user for a response, we have
to specify what the user is allowed to answer by defining a grammar.
In our grammar, if the user says
"dogs," "hounds," "puppies," or "hound dogs," the value of favorite_animal
becomes "dogs." If they respond "cats," "kitties," or "kittens,"
favorite_animal will be set to "cats."
That's all there is to getting user input. Now we can use the value of their response in our
program. In this example, if their answer is "dogs," they will be sent to a form named
"popular_dog_facts" within the same VXML document. If they answer "cats,"
they will be sent to a different URL, psychological_evaluation.cgi.
Putting curly braces around a variable name references the value of
the variable, so the query string sent to psychological_evaluation.cgi
will be affliction=cats.
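On the server side, the request for psychological_evaluation.cgi looks like any other HTTP GET. As a minimal sketch in Python (not the article's Tcl code; the function name `evaluation_page` and the wording of the reply are hypothetical), a handler could read `affliction` from the query string and answer with another VXML page:

```python
from urllib.parse import parse_qs
from xml.sax.saxutils import escape


def evaluation_page(query_string: str) -> str:
    """Build a VXML reply from a query string such as 'affliction=cats'."""
    params = parse_qs(query_string)
    # parse_qs maps each name to a list of values; take the first, if any.
    affliction = params.get("affliction", ["unknown"])[0]
    # Escape &, < and > so the value cannot break the markup.
    spoken = escape(f"You said you prefer {affliction}.")
    return (
        "<vxml>\n"
        "  <form>\n"
        "    <block>\n"
        f"      <audio>{spoken}</audio>\n"
        '      <goto next="_home"/>\n'
        "    </block>\n"
        "  </form>\n"
        "</vxml>\n"
    )


print(evaluation_page("affliction=cats"))
```

The only voice-specific part is the markup that comes back; reading the form variable works the same way as in a web application.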
That's the gist of VXML. Excellent reference material can be found
on the Tellme developers' site, http://studio.tellme.com/, including a VXML reference, a grammar reference, a code
library, and a library of reusable grammars.
Case Study 1: Building a Pi Reciter Using Tellme
You can sign up for a free developer account at studio.tellme.com.
With a developer account, you get your own Tellme telephone extension
that you can point to the URL of your VXML application. Having an
account also gives you access to Tellme's handy utilities such as the
Scratchpad, a web form where you can type in some VXML (e.g., the
"Hello, World" example), have its syntax checked, and then call a
phone number to hear how it turned out. Tellme provides a very gentle
slope into the VXML world.
The first step to building the Pi Reciter is to create a page asking the user how many
digits of pi they want to hear. This is a very straightforward VXML
document, similar to the example above
(source code: http://www.arsdigita.com/asj/vxml/pi-index.vxml.txt).
It is convenient to write your VXML in the Tellme Scratchpad first, test it, and
then move it over to your web server.
Try out the Pi Reciter!
Call 1-800-555-TELL.
At the main menu, speak the word "Extensions."
Enter extension 58874.
The next VXML page is a little more complicated because it has to be
dynamically generated based on user input (how many digits
of pi they want to hear).
Your program needs to:
1. understand the user input (n_digits)
2. generate n digits of pi
3. write out a string containing:
<vxml>
...
<audio>3.14159...[the nth digit]</audio>
...
</vxml>
Since form variables are passed in exactly the same manner
whether you're making a VXML application or a web
application, step 1, understanding the user input, is
no problem. In my case, I already had step 2, generating the digits
of pi, covered: I have
more digits of pi stored on my hard drive
than I know what to do with.
Step 3, writing out the VXML, seems like it would
be very straightforward, but it turns out there are
a few subtleties here.
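Steps 1 and 2 can be sketched in Python as follows. This is an illustration under assumptions, not the author's code: the article reads stored digits from disk, whereas this sketch computes them with an unbounded spigot algorithm (Gibbons). The names `parse_n_digits`, `pi_digit_stream`, and `first_pi_digits`, the default of 10, and the cap of 1000 are all hypothetical, and the leading 3 is counted as the first digit.

```python
from itertools import islice
from urllib.parse import parse_qs


def parse_n_digits(query_string: str, default: int = 10, maximum: int = 1000) -> int:
    """Step 1: read n_digits from the query string, clamped to 1..maximum."""
    raw = parse_qs(query_string).get("n_digits", [str(default)])[0]
    try:
        n = int(raw)
    except ValueError:
        n = default
    return max(1, min(n, maximum))


def pi_digit_stream():
    """Step 2: yield the decimal digits of pi (3, 1, 4, 1, 5, ...) one at a time.

    Unbounded spigot algorithm; uses only integer arithmetic.
    """
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet: fold in one more term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


def first_pi_digits(count: int) -> str:
    """Return the first `count` digits of pi as a string without the decimal point."""
    return "".join(str(d) for d in islice(pi_digit_stream(), count))


print(first_pi_digits(parse_n_digits("n_digits=10")))  # 3141592653
```

Clamping the requested count is a design choice of the sketch: a caller who asks for a million digits would otherwise tie up both the server and the phone line.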
256-character word length limit in audio tags
Although I didn't see anything about this in the VXML documentation,
it turns out that Tellme ignores <audio> tags if there are overly long
words (i.e., long strings of digits) between them. The
experimentally-derived character limit is 256.
The solution: break up the audio into multiple tags. If someone wants
500 digits of pi, give it to them in two 250-digit chunks.
Funny number pronunciation
The TTS translator used by Tellme can be rather clever with its
pronunciation. Unfortunately, sometimes this backfires,
as when it pronounces a string of digits in pi, say 3238462, as
"three million, two hundred thirty-eight thousand, four hundred
sixty-two."
The solution: put spaces between the digits.
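Both workarounds, chunking and spacing, fit in a few lines. The following Python sketch is illustrative only (the function name `pi_audio_tags` and its `chunk_size` parameter are hypothetical, and the input is assumed to be a plain string of digits with no decimal point):

```python
def pi_audio_tags(digits: str, chunk_size: int = 250) -> str:
    """Wrap digits in <audio> tags, chunk_size digits per tag, spaced out."""
    tags = []
    for start in range(0, len(digits), chunk_size):
        chunk = digits[start:start + chunk_size]
        # Spaces make the TTS engine read "3 2 3 8" instead of one big number.
        tags.append(f"<audio>{' '.join(chunk)}</audio>")
    return "\n".join(tags)


print(pi_audio_tags("3238462", chunk_size=4))
# <audio>3 2 3 8</audio>
# <audio>4 6 2</audio>
```

With the default `chunk_size` of 250, a request for 500 digits produces two <audio> tags, matching the two-chunk approach described above.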
Tip
Tellme and VoiceGenie don't care what Content-Type you use when writing out your content, so
you might as well choose text/plain instead of application/x-vxml.
It makes debugging easier because you can view VXML source using a
web browser.
Source code: http://www.arsdigita.com/asj/the-digits.tcl.txt
(coded to the AOLserver Tcl API because that's what I had running on
my development server, but the code can easily be translated to run in
Perl, VB, or Java).
Case Study 2: ArsDigita Telephone Directory (including some reusable code)
The user experience:
1. Joe Employee calls up 1-800-555-TELL and dials the ArsDigita Directory extension.
2. For security, he is asked to dial or say the passcode before he can go any further.
3. Joe spells out a few letters of an office name or a person's last name using his keypad.
4. He hears a list of matches (pulled from ArsDigita's intranet database), and chooses the one he wants.
5. An automated voice reads the person's or office's contact info to him.
6. Joe can go back to Step 3 if he wants to hear someone else's contact info.
The source code:
This code can be used as-is if you are running the ArsDigita Community
System Intranet Module (http://www.arsdigita.com/products/modules).
The ArsDigita Community System (ACS) is a free, open-source platform
enabling ecommerce, enterprise coordination, and education. The
Intranet Module allows you to manage employees by keeping track of
salaries, benefits, assignments and reviews, and manage customers
and projects by keeping track of resource allocation, schedules,
tasks, deadlines and status.
Regardless of whether you are running the ACS, it should be easy for
you to adapt the concepts and logic to your programming environment.
The shared Tcl procedures and PL/SQL procedures will be immediately
useful for any application using Oracle or AOLserver.
HTML-encoded characters
If you are using the same database to serve web content and voice
content, you have to be aware that many of the character strings
common in HTML are illegal in VXML, for example, &aacute; (á),
&ccedil; (ç), and &ouml; (ö). Your employee Carl Bjørnsen may enter
his name with an &oslash; (ø) so that it will render correctly in a
web browser, which is fine; just make sure you translate it to the
corresponding iso8859-1 code (&#248;) when you generate your VXML, or
your voice application will crash and burn.
vxml-defs.tcl.txt contains a procedure,
vxml_convert_illegal_characters, which will take care of the
HTML/iso8859-1 conversion for you, while also doing the necessary
conversion of & to &amp;, < to &lt;, and > to &gt;. Any user-entered
data that is going to be embedded in VXML should be filtered through
this procedure first.
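The same kind of filtering can be sketched in Python with the standard library. This is not a port of vxml_convert_illegal_characters, whose source is not reproduced here; the function name `vxml_safe` is hypothetical, and the approach (decode every HTML entity, re-escape the XML markup characters, then emit numeric references for anything outside ASCII) is one possible way to get the effect described above.

```python
import html


def vxml_safe(text: str) -> str:
    """Convert HTML-flavored text into something safe to embed in VXML."""
    # 1. Decode HTML entities (&oslash; -> ø, &amp; -> &) to plain characters.
    decoded = html.unescape(text)
    # 2. Escape the three characters that are markup in XML: &, < and >.
    escaped = html.escape(decoded, quote=False)
    # 3. Replace every non-ASCII character with a numeric reference (ø -> &#248;).
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(vxml_safe("Carl Bj&oslash;rnsen, R&D <Oslo>"))
# Carl Bj&#248;rnsen, R&amp;D &lt;Oslo&gt;
```

Decoding first and escaping second means that text which is already escaped (such as &amp;) is not escaped twice.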
Practical limitations on grammars
When people want to look up a name in the directory application, they have to enter the first few
letters of the name using their keypad. E.g., for "Adams," you would push 23267, or some subset
thereof. I thought it would be much more convenient if people could use their voice to spell out the name they
were looking for: "A D A M S."
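The keypad lookup itself is easy to sketch. In the Python below, the standard letter layout of a telephone keypad is the only fixed fact; the function names `to_keypad` and `matches` and the sample name "Beck" are hypothetical, and a real directory would run the comparison against the database rather than a list.

```python
# Standard telephone keypad letter groups.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {
    letter: digit for digit, letters in KEYPAD.items() for letter in letters
}


def to_keypad(name: str) -> str:
    """Translate a name to the keys a caller would press; skip non-letters."""
    return "".join(LETTER_TO_DIGIT[c] for c in name.lower() if c in LETTER_TO_DIGIT)


def matches(keyed: str, names):
    """Return the names whose keypad spelling starts with the keyed digits."""
    return [n for n in names if to_keypad(n).startswith(keyed)]


print(to_keypad("Adams"))                              # 23267
print(matches("232", ["Adams", "Andersson", "Beck"]))  # ['Adams', 'Beck']
```

Because each key stands for three or four letters, a short prefix such as 232 can match unrelated names, which is why the caller is read a list of matches to choose from.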
To accept keypad digits from the user, you can just use Tellme's pre-defined grammar TM_DTMF_DigitString.
This listens for an arbitrary number of keypad tones. But there is no pre-defined grammar that
listens for an arbitrary number of spoken letters. So I decided to create my own grammar that
would accept all combinations of letters up to 4 letters in length. I wrote a little script that
generated all such combinations and saved them in a file:
eves-first-grammar.gsl:
[
[a] {<option "a">}
[b] {<option "b">}
[c] {<option "c">}
...
[z] {<option "z">}
[(a a)] {<option "aa">}
[(a b)] {<option "ab">}
[(a c)] {<option "ac">}
...
[(z z)] {<option "zz">}
[(a a a)] {<option "aaa">}
[(a a b)] {<option "aab">}
[(a a c)] {<option "aac">}
...
[(z z z)] {<option "zzz">}
[(a a a a)] {<option "aaaa">}
[(a a a b)] {<option "aaab">}
[(a a a c)] {<option "aaac">}
...
[(z z z z)] {<option "zzzz">}
]
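The generating script is not shown in the article. A minimal Python sketch that produces lines in the same format could look like this (the function name `grammar_entries` is hypothetical):

```python
from itertools import product
from string import ascii_lowercase


def grammar_entries(max_len: int):
    """Yield one grammar line per letter sequence of length 1..max_len."""
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            spoken = " ".join(letters)
            value = "".join(letters)
            # Single letters are written bare; sequences go in parentheses.
            phrase = spoken if length == 1 else f"({spoken})"
            yield f'[{phrase}] {{<option "{value}">}}'


# Show the first three lines without generating the whole file.
for line in list(grammar_entries(1))[:3]:
    print(line)

# Number of entries for sequences of up to four letters.
print(sum(26 ** k for k in range(1, 5)))  # 475254
```

For `max_len=4` the generator yields 26 + 676 + 17,576 + 456,976 = 475,254 lines, so the size of the file grows by a factor of 26 with every extra letter allowed.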
I referenced this grammar from within my VXML (<grammar src="eves-first-grammar.gsl"/>),
and ran my application. The Tellme voice browser ground to a halt and then crashed. Apparently it's not good to reference a
450,000-element grammar from within a VXML file. A little testing showed that you can't have a
grammar with more than about 10,000 elements in it, otherwise the Tellme voice browser will crash.
More experimentation showed that it's not reasonable to have more than a few (10? 15?) elements in your
grammar, otherwise the gateway's voice recognition facilities become extremely inaccurate. And even
with only 10 elements, it still makes many more mistakes than a human would.
My advice: use spoken grammars when there are only a couple items to choose from (Yes/No,
Dogs/Cats, Coke/Pepsi). Otherwise, have your user key in their choice.
More funny pronunciation
We have a medical doctor on our staff at ArsDigita (on our sales staff). His name, as stored
in the intranet database, is: Harry Greenspun, MD.
The TTS converter, as clever as always with its pronunciation, reads his name as:
"Harry Greenspun, Maryland." I could have fixed it by adding a space between the M and the D,
just like the trick with the digits of pi, but since he actually lives in Maryland, I didn't bother.
SSL
Make sure that the VXML gateway you use supports SSL. Both Tellme and VoiceGenie do; just
point them to an application URL beginning with https.
With Tellme, all form variable values, including user-entered passwords, appear in the query string
Since query strings show up as an extension of the URL, you typically don't want to have sensitive data
present in the query string. One reason is that URLs are captured in log files,
which are often not heavily protected. You don't want to have a line like this
in your access log:
64.14.68.215 - - [25/Feb/2001:19:10:43 -0500] \
"GET /login.tcl?username=eveander&password=alexisgood HTTP/1.0" \
403 413 "" "Mozilla/1.01 [en] (Win95; I)"
In the HTML world, you can avoid the query string entirely by submitting forms via
method="post" instead of method="get". But what happens when you try to do this using Tellme?
VXML code snippet:
<submit next="passcode-check.tcl" method="post"/>
Corresponding line in the access log:
64.41.140.74 - - [25/Feb/2001:19:13:01 -0500] \
"POST /tellme/passcode-check.tcl?passcode=1111 HTTP/1.0" \
200 0 "" "Tellme/1.0 (I; en-US)"
Bottom line: if you're going to collect sensitive customer data using Tellme, protect your server log.
(Note: VoiceGenie does not have this problem.)
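Besides restricting who can read the log, one way to reduce exposure is to scrub sensitive values before log files are archived or shared. The Python sketch below is only an illustration: the parameter names `password` and `passcode` are taken from the log lines shown above, while the function name `scrub` and the `[redacted]` marker are hypothetical.

```python
import re

SENSITIVE = ("password", "passcode")
# Match name=value, where the value runs to the next '&', whitespace, or quote.
PATTERN = re.compile(r"\b(" + "|".join(SENSITIVE) + r")=[^&\s\"]*")


def scrub(log_line: str) -> str:
    """Replace the values of sensitive query parameters with a marker."""
    return PATTERN.sub(r"\1=[redacted]", log_line)


print(scrub('"POST /tellme/passcode-check.tcl?passcode=1111 HTTP/1.0"'))
# "POST /tellme/passcode-check.tcl?passcode=[redacted] HTTP/1.0"
```

The script name passcode-check.tcl is left alone because the pattern only matches a parameter name followed directly by an equals sign.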
VoiceXML gateway comparison
Which VoiceXML gateway do you choose? Two publicly accessible
gateways I found on the Web were Tellme and VoiceGenie:
| | Tellme | VoiceGenie |
|---|---|---|
| Security | Good. Serves pages via SSL. Unfortunately, sensitive data can appear in the access log, so you have to protect it. | Great. Serves pages via SSL. No sensitive data in the access log as long as forms are submitted via post. |
| Convenience for users | OK. Toll-free telephone number. The downside is that your users have to listen to two advertisements per call, and there is no international telephone number. Users have to go through too many menus to reach your extension. | Good. Non-toll-free telephone number, accessible by everyone. |
| Text-to-speech capabilities | Great. Does an impressive job of pronunciation and phrasing. | Good. Pronunciation is good, but the voice inflection is odd. |
| Documentation and libraries | Great. Extensive well-written documentation. Extensive grammar and code libraries. | Good. Good documentation and code library. Small grammar library. |
| Development environment | Good. Convenient utilities like the Scratchpad for testing code. Can run real-time debugger. Unfortunately, this debugger is necessary because the error messages sent upon application failure are no use ("Failure for unknown reasons" is the most common message). | Good. Helpful error messages upon application failure. Up to 20 telephone extensions per login account. |
| Support | Great. Received a helpful response to my email request within an hour. | Great. Received a helpful response to my email request within minutes, complete with VXML code. |
Switching between gateways
Switching between gateways is non-trivial, so make your choice carefully. Different gateways require slightly
different VXML syntax, and they provide different pre-defined grammars, on which you will find yourself
relying.
Beyond VXML
Will all voice applications be VXML applications? The current syntax
of VXML is geared toward producing a user experience of navigating through
hierarchical menus. State-of-the-art research is moving beyond this towards
conversational systems in which any utterance makes sense at any time and where context
is carried from exchange to exchange. For example, you can call the
MIT Laboratory for Computer Science's server at 1-888-573-8255:
- You: Will it rain tomorrow in Boston?
- JUPITER: To my knowledge, the forecast calls for no rain tomorrow in Boston.
- You: What about Detroit?
- JUPITER: To my knowledge, the forecast calls for no rain tomorrow in Detroit.
- You: I'm interested in floods.
- JUPITER: Flood warnings have been issued for Louisiana and Mississippi.
- You: Will it be sunny in Phoenix?
...
Notice how the system, more fully described at http://www.sls.lcs.mit.edu/sls/whatwedo/applications/jupiter.html,
assumed that you were still interested in rain when asking about
Detroit, context carried over from the Boston question.
In the long run, as these more natural conversational technologies are
perfected, the syntax of VXML will have to grow to accommodate the full
power of speech interpreters or be eclipsed by another standard.