SPEECHSC S. Maes
Internet Draft IBM
Document: draft-maes-speechsc-web-services-00 A. Sakrajda
Category: Informational IBM
Expires: December, 2002 June 23, 2002
Speech Engine Remote Control Protocols by treating Speech Engines
and Audio Sub-systems as Web Services
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026 [1].
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts. Internet-Drafts are draft documents valid for a
maximum of six months and may be updated, replaced, or obsoleted by
other documents at any time. It is inappropriate to use
Internet-Drafts as reference material or to cite them other than as
"work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
Discussion of this and related documents is on the MRCP list. To
subscribe, send the message "subscribe mrcp" to
majordomo@snowshore.com. The public archive is at
http://flyingfox.snowshore.com/mrcp_archive/maillist.html.
NOTE: This mailing list will be superseded by an official working
group mailing list, cats@ietf.org, once the WG is formally
chartered.
1. Abstract
This document proposes the use of the web service framework based
on XML protocols to implement speech engine remote control
protocols (SERCP).
This document is informational. It illustrates how web services
could be used; it is not a detailed specification. A detailed
specification is expected to be the output of the SPEECHSC activity,
if it is decided to go in this direction. This document also
enumerates the requirements that have led to selecting a web service
framework.
Speech engines (speech recognition, speaker recognition, speech
synthesis, recording and playback, NL parsing, and any other speech
processing engines, e.g. speech detection, barge-in detection,
etc.) as well as audio sub-systems (audio input and output
sub-systems) can be considered as web services that can be
described and asynchronously programmed via WSDL (on top of SOAP),
combined in a flow described via WSFL, discovered via UDDI, and
asynchronously controlled via SOAP, which also enables asynchronous
exchanges between the engines.
This solution has the advantage of providing flexibility,
scalability and extensibility while reusing an existing framework
that fits the evolution of the web: web services and XML protocols
[15].
This document proposes using web services as a framework for SERCP.
The proposed framework enables speech applications to control
remote speech engines using the standardized mechanism of web
services. The control messages may be tuned to the controlled
speech engines.
2. Conventions used in this document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
this document are to be interpreted as described in RFC-2119 [2].
3. Introduction
This document uses the terminology SERCP (Speech Engine Remote
Control Protocols) to be consistent with the terminology used in
other documents exchanged at ETSI, 3GPP and OMA while
distinguishing from the detailed specification proposed by MRCP.
SERCP addresses the same set of high level "SPEECHSC" objectives:
the capability to distribute the automatic processing of speech away
from the audio sub-system and the associated controlling speech
application.
The need for SERCP has been identified in different forums.
Originally, the need for SERCP was formulated in the context of the
multimodal architecture proposal at ETSI Aurora STQ [3] and followed
by explicit SERCP requirements in the context of Distributed Speech
Recognition (DSR) [4]. This was followed by two concrete proposals
that suggested relying on web services [5,6].
Later, the IETF initiated the SPEECHSC BOF activity [7] around the
MRCP proposals:
- draft-shanmugham-mrcp-01.txt
- draft-robinson-mrcp-sip-00.txt
that provided additional justifications and requirements for a SERCP
framework.
A preliminary requirement document [16] and use cases [17] have also
been published.
In general, SERCP will support two classes of usage scenarios where
speech processing is distributed away from the audio sub-systems and
the speech engines are controlled:
- By the source of the audio. A typical scenario is a voice
enabled application running on a wireless terminal but using
server side speech recognition. In [3] and [13], this is
exemplified by a fat client MVC multi-modal browser configuration
with use of remote engines.
- By a third party controller (i.e. application). A typical
scenario is a server side application (e.g. VoiceXML browser)
that relies on speech recognition performed elsewhere in the
network. Numerous voice portal or IVR (Interactive Voice
Response) systems rely on such concepts of distribution of the
speech processing resources.
This is consistent with the framework described in [17].
4. Design Requirements
At a high level, a distributed speech recognition framework should
aim at enabling the application developer or service provider to
seamlessly use remote engine:
- The location of the engine SHOULD NOT be important: the system
behaves as if the engine was local to the application runtime.
- The performance of the speech engines SHOULD NOT be affected
by the distribution of the engines and the presence of the network.
- The functionality achievable by the speech engines MUST be at
least equivalent to what can be achieved with local engines.
The rest of this section summarizes and expands on the requirements
identified so far that drive the proposal to rely on web services.
4.1 General considerations
There are numerous challenges to the specification of an appropriate
SERCP framework.
In addition to the MRCP internet drafts, numerous proprietary or
standardized fixed engine APIs have been proposed (e.g. SRAPI,
SVAPI, SAPI, JSAPI, etc ...). None have been significantly adopted
so far!
Besides making strong assumptions about the underlying platform,
such APIs typically provide overly constrained functions. Only very
limited common denominator engine operations are defined. In
particular, it is often difficult to manipulate results and
intermediate results (usually exchanged in proprietary formats). On
the other hand, it would not have been practical to add more
capabilities to these APIs.
Therefore, we propose that:
- SERCP SHOULD NOT be designed as a fixed speech engine API,
but
- SERCP MUST be designed as a rich, flexible and extensible
framework that allows the use of numerous engines with numerous
levels of capabilities.
4.2 Speech engine interoperability: replaceable engines or
common protocols?
The considerations made above raise fundamental issues in terms of
standardization and interoperability. What is the objective of
SPEECHSC?
- (target-1): to enable the replacement of a speech engine
provided by one speech vendor with an engine provided by another,
and still be able to immediately run the same speech application
without any other change,
or
- (target-2): to enable speech applications to control remote
speech engines using a standardized mechanism but messages tuned
to the controlled speech engines?
(target-1) is very difficult to achieve. Today, speech engines are
adapted to particular tasks. Speech data files (acoustic models,
engine configurations and settings, front-end features, internal
algorithms, grammars, etc.) differ significantly from vendor to
vendor. Even for the same vendor, the deployment of well-performing
conversational applications requires numerous engine settings and
data file tunings from task to task.
In addition, conversational applications and engines still
constitute an emerging field, where numerous changes of behavior,
interfaces and capabilities must be supported to enable rapid
introduction of new conversational capabilities (e.g. support of
free flow dialogs, NL parsing etc..).
In any case, in the usage scenarios [17] where SERCP would be used
by a terminal to drive remote engines or by a voice portal to
perform efficient and scalable load balancing, the application /
controller knows exactly which engine it needs to control. The value
of SPEECHSC is to rely on a standardized way to implement this
remote control.
It may be possible to define a framework where a same application
can directly drive engines from different vendors. We prefer to
consider this as a particular case of the (target-2) framework
rather than as (target-1), which would introduce unnecessary usage
limitations on the output of the SPEECHSC activity.
Wireless deployments like 3G will require end-to-end specification
of such a standard framework. At this stage, it is more valuable to
start with an extensible framework (target-2) and, when appropriate,
provide a framework that addresses (target-1).
Therefore, SERCP is designed to focus on (target-2), while providing
mechanisms to achieve (target-1) when it makes sense.
This translates into the following key design requirement for SERCP:
- SERCP MUST provide a standard framework for an application to
remotely control speech engines and audio sub-systems. The
associated SERCP messages MAY be tuned to the particular speech
engine.
- SERCP MUST NOT aim at supporting application interoperability
across different speech engines with no changes of the SERCP
messages.
- SERCP SHOULD aim at distinguishing and defining messages that
are invariant across engine changes from messages that are
engine specific.
As a result, adding support for speech engines from another vendor
MAY require changes to the SERCP messages and therefore changes to
the application or dialog manager to support these new messages. In
the web service framework proposed below, this results in changing
the WSDL (XML) instructions exchanged with the engines. However, it
does not imply any changes other than the adaptation of the XML
files exchanged with the engines (and possibly new speech engine
data files).
4.3 Requirements identified in the context of SPEECHSC
In [16], the following requirements have been proposed:
- SERCP SHOULD reuse existing protocols.
- SERCP MUST maintain integrity of existing protocols.
- SERCP SHOULD avoid duplication of existing protocols.
- SERCP SHOULD satisfy the TTS requirements as described in
draft-burger-mrcp-reqts-00.txt, section 6 (expressed according
to the terminology defined in RFC-2119 [2]).
- SERCP SHOULD satisfy the ASR requirements as described in
draft-burger-mrcp-reqts-00.txt, section 7 (expressed according
to the terminology defined in RFC-2119 [2]).
[7] provides additional considerations in terms of security,
dual-mode usage (speech recognition and synthesis provided by the
same system), etc.
Following [17], we assume from the onset that SERCP will drive
engines that act on uplink (audio sub-system to engine) and downlink
(engine to audio sub-system) speech.
4.4 Requirements identified in the context of ETSI Aurora DSR
In the context of the ETSI Aurora distributed speech recognition
framework, the following requirements have been considered. These
have also driven the design of SERCP.
Note that the DSR framework is not limited to the use of DSR
optimized codecs; it can be used in general to distribute speech
recognition functions over packet-switched networks with any
encoding scheme.
- SERCP MUST control the different speech engines involved in
carrying a dialog with the user. As such:
- SERCP SHOULD NOT distinguish between controlling a single
engine or several engines responsible for processing speech
input and generating speech or audio output.
- SERCP SHOULD NOT be limited to ASR or TTS engines.
- SERCP SHOULD enable control of the audio sub-systems and
additional processors (e.g. control of settings of codecs,
acoustic front-end, handling of voice activity detection,
barge-in, noise subtraction, etc.).
- Audio sub-systems and speech processors MAY be considered
as "engines" that may be controlled by the application using
SERCP messages.
- SERCP MUST support control of speech engines and audio
sub-systems by:
- An application located on the component where audio-system
functions are located (e.g. wireless terminal)
- An application located elsewhere on the network (i.e. not
collocated with speech engines or audio input or output
sub-systems).
- SERCP SHOULD NOT specify call-control and session control
(re-direction etc...) and other platform/network specific
functions based on dialog, load balancing or resource
considerations.
- However SERCP MUST support the request to expect or establish
streaming sessions between target addresses of speech engines
and audio-sub-systems.
- Session establishment and control MUST rely on existing
protocols.
- SERCP MUST NOT address the transport of audio.
- SERCP MAY address the exchange of result messages between
speech engines.
- SERCP MUST support the combination (serial or parallel) of
different engines that will process the incoming audio stream or
post-process recognition results. For example, it should be
possible to specify an ASR system able to provide an N-Best list
followed by another engine able to complete the recognition via
detailed match or to pass raw recognition results to a NL parser
that will tag them before passing the results to the application
dialog manager. More details are provided in [17].
- The framework SHOULD enable engines to advertise their
capabilities, their state or the state of their local system.
This is especially important when the framework is used for
resource management purposes.
- SERCP SHOULD NOT constrain the format, commands or interface
that an engine can or should support.
- SERCP MUST be vendor neutral:
- SERCP MUST support any engine technology and capability
- SERCP MUST provide efficient extensibility mechanisms to
support any type of engine functionality: existing and future.
- SERCP MUST support vendor specific commands, results and
engine combination through a well specified extensible
framework
- SERCP MUST be asynchronous.
- SERCP MUST be able to stop, suspend, resume and reset the
engines.
- SERCP MUST NOT be subject to race conditions. This
requirement is extremely important. It is often difficult, from a
specification or a deployment point of view, to efficiently
handle the race conditions that may occur when hand holding
the engine to load appropriate speech data files (e.g. grammars,
language models, acoustic models, etc.) and to report / handle
error conditions while simultaneously racing with the incoming
audio stream.
It should be noted that if the requirements described above are
satisfied, it would be possible to support the use cases identified
in [17].
4.5 Additional design considerations
Finally, the following considerations have also driven the design:
- Scalability and robustness of the solution
- Simplicity of deployment
- Transmission across firewalls, gateways and wireless networks.
- This implies that the end-to-end specification of SERCP
and the assumed protocols that it may use for transport MUST
be supported by the target deployment infrastructure. This
is especially important for 3G deployments.
- Need to support the exchange of additional meta-information
useful to the application or the speech engines (e.g. speech
activity (speech-no-speech), barge-in messages, end of
utterance, possible DTMF exchanges, front-end setting and noise
compensation parameters, client messages -- settings of
audio-sub-system, client events, externally acquired parameters
--, annotations (e.g. partial results), application specific
messages).
5. Speech engines and audio sub-systems as web services
We propose the framework of web services as an efficient, extensible
and scalable way to implement SERCP that satisfies the different
requirements enumerated in section 4 and supports the use cases
identified in [17].
According to the proposed framework, speech engines (audio
sub-systems, engines, speech processors) are defined as web services
that are characterized by an interface that consists of some of the
following ports:
- "control in" port(s): It sets the engine context, i.e. all the
settings required for a speech engine to run. It may include
addresses where to get or send the streamed audio or results.
- "control out" port(s): It produces the non-audio engine output
(i.e. results and events). It may also involve some session
control exchanges.
- "audio in" port(s): It receives streamed input data.
- "audio out" port(s): It produces streamed output data.
Audio sub-systems can also be treated as web services that can
produce streamed data or play incoming streamed data as specified by
the control parameters.
The "control in" or "control out" messages can be out-of-band or
sent or received interleaved with "audio in or out" data. This can
be determined in the context (setup) of the web services.
Speech engines and audio sub-systems are pre-programmed as web
services and composed into more advanced services. Once programmed
by the application / controller, audio-sub-systems and engines await
an incoming event (established audio session, etc...) to execute the
speech processing that they have been programmed to do and send the
results as programmed.
Speech engines as web services are typically programmed to
completely handle a particular speech processing task, including the
handling of possible errors. For example, a speech engine may be
programmed to perform recognition of the next incoming utterance
with a particular grammar, to send the result to a NL parser and to
contact a particular error recovery process if particular errors
occur.
5.1 Examples of SERCP web services
The following list of services and control types is not exhaustive.
It is provided purely as an illustration. These examples assume that
all control messages are sent as "control in" and "control out". As
explained above, the framework could also support such exchanges
implemented by interleaving with the streamed audio, etc.
The following are examples of SERCP web services:
- Audio input Sub-system - Uplink Signal processing:
- control in: silence detection / barge-in configuration,
codec context (i.e. setup parameters), asynchronous stop
- control out: indication of begin and end of speech,
barge-in, client events, ...
- audio in: bound to platform
- audio out: encoded audio to be streamed to remote speech
engines
- Audio output Sub-Systems - Downlink Signal processing:
- control in: codec / play context, barge-in configuration,
play, ...
- control out: done playing, barge-in events
- audio in: from speech engines (e.g. TTS)
- audio out: to platform
- Speech recognizer (ASR):
- control in: recognition context, asynchronous stop
- control out: recognition result, barge-in events
- audio in: from input sub-system source,
- audio out: none
- Speech synthesizer (TTS) or pre-recorded prompt player:
- control in: annotated text to synthesize, asynchronous
stop
- control out: status (what has been synthesized so far)
- audio in: none
- audio out: audio streamed to audio output sub-system (or
other processor)
- Speaker recognizer (identifier/verifier):
- control in: claimed user id (for verification) and context
- control out: identification/verification result,
enrollment data
- audio in: from audio input sub-system
- audio out: none
- DTMF Transceiver. Note that this example illustrates how web
services can also handle DTMF in a consistent manner.
- control in: how to process (DTMF grammar), expected output
format,...
- control out: appropriately encoded DTMF key or string
(e.g. RFC2833).
- audio in: bound to platform events (possibly programmed by
control-in)
- audio out: None
- Natural language parser:
- control in: combined recognition and DTMF detector results
- control out: natural language results
- audio in: none
- audio out: none
Variations and additional examples of speech engines as web services
can be considered. Pre- and post-processing can also be considered
as other web services.
5.2 Advantages of a web service framework for SERCP
The use of web services enables pre-allocating and pre-programming
the speech engines. This way, the web services framework
automatically handles the race condition issues that may otherwise
occur, especially between the streamed audio and the setting up of
the engines. This is especially critical when engines are remote
controlled across wireless networks, where the control and stream
transport layers may be treated in significantly different manners.
This approach also decouples the handling of streamed audio from
configuration, control and application level exchanges. This
simplifies deployment and increases scalability.
By using the same framework as web services, it is possible to rely
on the numerous tools and services that have been developed to
support authoring, deployment, debugging and management (load
balancing, routing etc...) of web services.
6. Controlling Speech Engines and Audio Sub-Systems
With such a web service view, the specification of SERCP can
directly re-use protocols like SOAP [8], WSDL [9], WSFL [10] and
UDDI [11].
Contexts can be queried via WSDL [9] or advertised via UDDI [11].
Detailed specifications will be provided as this document evolves.
6.1 WSDL
Using WSDL [9], it is possible to asynchronously program each speech
engine and audio sub-system.
To illustrate the proposal, let us consider the case where speech
engines are allocated via an external routing / load balancing
mechanism. A particular engine can be allocated to a particular
terminal, telephony port and task on an utterance or session basis.
Upon allocation, the application sets the context via WSDL. This
includes the addresses of the source or target control and audio
ports.
As an example, consider a speech recognition engine allocated to a
particular application and telephony port. WSDL instructions
program the web service to recognize any incoming audio stream from
that telephony port address with a particular grammar, specifying
what to do in case of error (what event to throw where), how to
notify of barge-in detection, and what to do upon completion of the
recognition (where to send the result and end of recognition
events). Similarly, the telephony port is programmed via WSDL to
stream incoming audio to the audio port of the allocated ASR web
service. When the user speaks, audio is streamed by the port to the
ASR engine, which performs the pre-programmed recognition task and
sends recognition results to the pre-programmed port, for example of
the application (e.g. a VoiceXML browser [12]). The VoiceXML browser
generates a particular prompt and programs its allocated TTS engine
to start generating audio and stream it to the telephony port. The
cycle can continue.
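As a purely illustrative sketch (the portType, operation and message
names below are hypothetical and not part of any specification), the
interface of such an ASR web service could be described in WSDL 1.1
as follows:

   <wsdl:definitions name="AsrService"
       targetNamespace="urn:example:sercp:asr"
       xmlns:tns="urn:example:sercp:asr"
       xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/">
     <wsdl:message name="SetContextInput">
       <wsdl:part name="context" element="tns:recognitionContext"/>
     </wsdl:message>
     <wsdl:portType name="AsrEngine">
       <!-- one-way operation: the engine is programmed and then
            awaits the incoming audio stream -->
       <wsdl:operation name="setRecognitionContext">
         <wsdl:input message="tns:SetContextInput"/>
       </wsdl:operation>
     </wsdl:portType>
   </wsdl:definitions>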
6.2 WSFL
WSFL [10] provides a generic framework for combining web services
through flow composition. We recommend using WSFL to define the flow
of the speech engines as web services and to configure the overall
system. Accordingly, the sources and targets of the web services and
the overall flow can be specified with WSFL.
The use of web services in general, and WSFL in particular, greatly
simplifies the remote configuration and control of chained engines
that process the result of a previous engine, or of engines that
process a same audio stream.
6.3 UDDI
UDDI [11] is a possible way to enable discovery of speech engines.
Other web services approaches can be considered. Speech engines
advertise their capability (context) and availability. Applications
or resource allocation servers interrogate the UDDI repository to
discover available engines that can be allocated for the next
utterance or session.
6.4 SOAP
SERCP transports WSDL and WSFL on top of SOAP [8].
SOAP is also particularly attractive because events and other
messages between controllers and web services, as well as among
speech engine / audio sub-system web services, can be transported
via SOAP. Exchanges of results and events (including stop, resume,
reset, etc.) among speech engine and audio sub-system web services,
and between the web services and the controller or application, can
be done via SOAP.
In the future, more advanced coordination mechanisms can be used,
for example following the frameworks proposed in [14].
SOAP presents the advantage that:
- SOAP is a distributed protocol that is independent of the
platform or language.
- SOAP is a lightweight protocol, requiring a minimal amount of
overhead.
- SOAP runs over HTTP. This allows access through firewalls.
- SOAP can run over multiple transport protocols such as HTTP,
SMTP, and FTP. This should simplify its transport through
wireless networks and gateways.
- SOAP is based on XML which is a highly recognized language
used within the Web community.
- SOAP/XML is gaining increasing popularity in B2B transactions
and other non-telephony applications.
- SOAP/XML is appealing to the Web and IT development community
because it is a current technology that they are familiar with.
- SOAP can carry XML documents.
7. Syntax
7.1 Introduction
The SERCP syntax and semantics should be extensible to satisfy
(target-2).
For these reasons, we propose an XML-based syntax with clear
extensibility guidelines. The web service framework is inherently
extensible and enables the introduction of additional parameters and
capabilities.
The SERCP syntax and semantics are designed to support the widest
possible interoperability between engines by relying on messages
invariant across engine changes, as discussed in section 4.2. This
should minimize the need for extensions in as many situations as
possible.
Existing speech APIs, [5] and the MRCP syntax have been considered
as starting points.
Speech engines as web services are considered to come with internal
contexts, which typically consist of the context beyond the scope of
the invariant-based SERCP syntax and semantics.
As much as possible, the semantics and syntax rely on the W3C Voice
Activity specifications [12] to describe the different speech data
files required by the engines.
The application software requests from a broker a reference to a
SERCP channel and, after obtaining one, all interaction between
SERCP and the user consists of XML requests posted to the SERCP
server followed by result responses. The interface to the broker is
used by the user only once per lifetime of the user process, to bind
to a SERCP channel. All SERCP channels are created equal; they
become qualified as being of a certain type only when the user
attaches itself to one (i.e. they assume an application name). The
connection to the broker is used by the SERCP channel to acquire
speech services as needed--this is hidden from the user.
The following sub-sections present a sketch of possible content of
the Body element of the SOAP messages. It is intended for
illustrative purposes, not as an actual specification.
7.2 sercp Namespace
All messages are defined in the sercp namespace.
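As a purely illustrative sketch (the sercp namespace URI below is a
placeholder, not a registered one), a SERCP message could be carried
as:

   <SOAP-ENV:Envelope
       xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
       xmlns:sercp="urn:example:sercp">
     <SOAP-ENV:Body>
       <!-- one or more sercp elements, e.g. prompt or listen -->
       <sercp:prompt> ... </sercp:prompt>
     </SOAP-ENV:Body>
   </SOAP-ENV:Envelope>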
The content of the SOAP message body consists of a set of tags
defining action to be performed in the control and/or audio
channels. The sequencing and grouping by functionality differs for
different categories:
- audio out--the prompt tag is used with speak, text or audio
elements; each one can be repeated more than once in one message
but they cannot be mixed in the same request, i.e. one request
consists of only one type; speak implies use of the TTS server and
play from socket in the audio sink server; audio without speak
implies play of pre-recorded audio and tones from a URI resolved
in the audio server;
- audio in--the listen tag activates recording in the audio
server and defines the set of speech services and the context of
execution for them; it also contains the prescription for digit
collection and treatment;
- audio in and out--the listen and prompt tags can be submitted
in one request; this implies a single connection to the audio
server and sequencing of play/record/collect dtmf based on
attributes (e.g. barge in);
- stop--an asynchronous stop can be issued at any time and
propagates to all components involved; some requests are atomic
(e.g. playing dtmf) and an asynchronous stop is then just
tolerated but has no real impact.
The following assumptions are made in the remainder of this section:
- audio server (aud-s)--there is a reachable point where audio
stream(s) are present and available for processing, constant for
the duration of the session (i.e. the location can change by means
external to SERCP); an example would be the first host
receiving/sending audio from/to a telephony channel;
- speech servers (asr-s, tts-s, siv-s)--these represent reachable
points capable of establishing a connection to aud-s, allocated
and aggregated by means external to SERCP;
- the XML requests and responses can be passed with mime-like
attachments.
7.3 prompt--Audio Out Element
The prompt element may carry attributes (see the tone, dtmf and mf
attributes below) and may take one of three formats: a speak
element, a text element, or one or more audio elements.
The prompt element defines the content and method of obtaining and
handling the audio to be generated in the outbound channel. The
content following speak MUST contain an SSML message (which implies
a mix of synthesized speech and pre-recorded audio streamed by the
TTS server).
The content following text contains text annotated using OEM
specific notations. The omission of the speak or text tag implies
that it is just pre-recorded audio retrieved and streamed by the
audio server.
The src attribute of the audio element may use the "cid:" scheme;
in this case the audio segment is passed by value as an attachment.
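As purely illustrative sketches (the exact element syntax is not
specified here), the three formats could look like:

   <sercp:prompt>
     <speak> Welcome. <audio src="cid:greeting"/> </speak>
   </sercp:prompt>

   <sercp:prompt>
     <text> OEM-annotated text to synthesize </text>
   </sercp:prompt>

   <sercp:prompt>
     <audio src="http://example.com/prompts/welcome.wav"/>
   </sercp:prompt>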
Sending DTMF digits or call progress tones is a special case of
audio generation dealt with by the audio server. For prompt, the
tone, dtmf and mf attributes are defined with the following syntax:
tone="(dialtone|ringback|busy|reorder|f1[-f2])"
where f1[-f2] is used to specify single or dual frequency, and may
be accompanied by additional attributes:
duration="NNms" // total duration in ms
timeon="NNms" // time on for pulsed tones in ms
timeoff="NNms" // time off for pulsed tones--on, off pattern
is fitted into duration
level="-NNdBm" // level at which the tone should be played
The digits can be sent using attributes
dtmf="555*#" or mf="01KS" // K=KP, S=ST
The nicknamed tones have timeon/timeoff/duration/level pre-defined;
any of these tones can be redefined by supplying a full tone spec
with frequencies and all attributes.
Samples (purely illustrative):
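The following sketches use only the attribute syntax defined above;
the values are examples, not defaults:

   <sercp:prompt tone="busy" duration="4000ms"/>

   <sercp:prompt tone="480-620" duration="3000ms" timeon="500ms"
                 timeoff="500ms" level="-24dBm"/>

   <sercp:prompt dtmf="555*#"/>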
7.4 listen--Audio In Element
The listen element defines the context and method of handling audio
in the inbound channel. It implies activating "recording", i.e.
reading from the audio stream. Handling of the digits embedded in
the input audio is defined through a digits element with a set of
attributes. The speech services to be activated and applied to the
input stream need a context--this is supplied within blocks tagged
by service type: asr, siv, ... (extensible--new services can be
added).
The listen element uses attributes to provide a detailed
prescription for the control of recording and signal processing (an
illustrative sketch is given after the attribute list below). The
meaning of the attributes is as follows:
- bargein--none disables barge in, speech turns on energy based
barge in, asr turns on recognition based barge in (i.e. only
successful recognition stops prompt), default: none
- beep--0 disables beep indicating begin of recording, 1 turns
it on; if barge in is enabled, the setting is ignored,
default: 0
- echocancel--0 disables echo cancellation, 1 turns it on; it is
applied to firmware in the telephony driver and control is
provided primarily for testing/ data collection; if the
underlying hardware is not capable of echo cancellation, the
setting is silently ignored; default: 1
- endprompt--the "pacifier" prompt to be played immediately
after end of recording; it will be stopped on receipt of a
subsequent request; it may be used to cover the time used e.g.
for backend queries, default: "".
- endsilence--end silence duration, silence of this duration
after speech has been seen triggers end of recording, -1 or 0
reserved for infinite, default: infinite
- initialsilence--initial silence duration, expiration of the
timeout stops recording and terminates the request with final
result "silence", -1 or 0 reserved for infinite, default:
infinite
- maxspeech--guard timer, expiration of this timer stops
recording, -1 or 0 reserved for infinite, default: infinite
- minspeech--minimum amount of speech triggering speech
detection, 0 means silence detection is disabled, "start record"
request submitted to audio server starts streaming, default: 0
- retrieve-- requests that the recorded audio in specified mime
type is returned in the response by value. If omitted, audio is
not retrieved.
- save--requests that the recorded audio be saved on the audio
server. The response will contain uri of the saved audio. The
value of the attribute is a mime type, including
x-ep...subtypes marking endpointed pcm. If omitted, audio is
not saved.
- source--uri of the audio source, the location from which the
recipient of the message retrieves audio chunks; it is mandatory.
- steponbeep--on/off flag requesting that detection of speech in
the very first samples of recording (20-50 ms) triggers end of
recording, reporting a "step-on-beep" result; if barge in is on,
the setting is ignored and a value of 0 is used; default: 0
- stopondtmf--requests that the recording stops when a dtmf tone
is detected; default: 1
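A minimal sketch of a listen request, using the attributes defined
above (the source scheme and all values are illustrative only):

   <sercp:listen source="rtp://aud-s.example.com:5004"
                 bargein="speech" beep="0"
                 initialsilence="5000ms" endsilence="800ms"
                 maxspeech="30000ms" save="audio/basic">
     <!-- service-specific context blocks (asr, siv, ...) and an
          optional digits element would appear here -->
   </sercp:listen>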
7.4.1 asr--Speech Recognition listen
The asr tag content implies use of the ASR service and defines the
context of execution for recognition; an illustrative sketch is
given at the end of this sub-section.
The context tag attaches a name to a collection of grammars and/or
vocabularies. The attributes are:
- nbest--max number of results to retrieve
- completetimeout--amount of silence triggering the ASR response
when the ASR engine has a complete "in-grammar" result
- incompletetimeout--amount of silence triggering the ASR response
when the ASR engine has a partial result
A vocabulary is just a flat list of entries with optional
pronunciations and sounds-like spellings; it can be embedded in a
grammar or it can be standalone. Any of the entries (context,
grammar, vocabulary) can be passed by reference by supplying a URI.
The scheme "cid:" is used to specify attachments.
7.4.2 digits--Digits In
The digits element uses attributes to provide a detailed
prescription for digit collection (an illustrative sketch follows
the attribute list). The meaning of the attributes is as follows:
- length--number of digits to collect (if omitted, 1 is assumed)
- firstdigit--how long to wait for first digit (ms); the timer
starts on completion of play (if omitted, configuration defined
value is assumed)
- nextdigit--defines interdigit interval (if omitted,
configuration defined value is assumed)
- termdigit--optional string constructed from
0123456789*#ABCDabcd; any digit in the string detected
terminates digit collection; the default value is ""
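An illustrative digits element (values are examples; the units
follow the attribute definitions above):

   <digits length="4" firstdigit="5000" nextdigit="2000"
           termdigit="#"/>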
7.4.3 prompt Response Tags
The response carries a completion status taking one of the values
done, hangup or error.
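As a purely illustrative sketch (the element and attribute names are
hypothetical), a prompt response could look like:

   <sercp:promptResponse status="done" samplesplayed="48000"/>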
7.4.4 listen Response Tags
The response carries a completion status taking one of the values
done, dtmf, hangup or error, together with the results returned by
the attached speech services (e.g. the n-best recognition results,
each with its text and optional spelling), any collected digits, and
the uri or attachment of saved or retrieved audio.
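An illustrative sketch of such a response (all names hypothetical):

   <sercp:listenResponse status="done">
     <asrResult>
       <result rank="1" text="yes" spelling="y e s"/>
       <result rank="2" text="yeah"/>
     </asrResult>
     <audio src="http://aud-s.example.com/rec/1234.wav"/>
   </sercp:listenResponse>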
8. Usage examples
The elements prompt and listen can be used separately over two
distinct channels--in this case synchronization of the channels
needs to be managed by the user (e.g. stopping play on speech
detected when barge in is enabled). It can be assisted by bits of
control information passed in the audio channel (e.g. audio server
should be capable of stopping tts server through the audio channel).
The server responses carry the results received from the individual
servers.
8.1 Prompt
This sub-section illustrates the use of the prompt element.
8.1.1 Play
Playing pre-recorded audio is possible by delivering the message
directly to the audio server; there is no need to involve TTS. The
audio server can determine when to stop (e.g. on digit or speech).
Playing synthesized audio is possible by delivering a message
containing a speak element.
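Illustrative sketches of the two cases (URIs and text are examples
only):

   <sercp:prompt>
     <audio src="http://app.example.com/prompts/welcome.wav"/>
   </sercp:prompt>

   <sercp:prompt>
     <speak> Welcome to the example service. </speak>
   </sercp:prompt>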
8.2 Listen
The listen tag implies recording. The speech servers, if any,
receive the uri of the audio server, defining the scheme and
location used to retrieve audio.
8.2.1 recording
The listen element can be used to specify just the recording of the
audio. The recording can be returned as an attachment to the
response.
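An illustrative recording-only request (values are examples):

   <sercp:listen source="rtp://aud-s.example.com:5004"
                 maxspeech="10000ms" retrieve="audio/basic"/>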
8.2.2 asr
The recognition request can contain an explicit context, e.g. an
embedded grammar of commands such as "stop", "start" and "go back",
or just a reference to one. The "cid:" scheme can be used to pass
the context by value as an attachment.
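Illustrative sketches of both forms (the inline grammar notation is
not specified here):

   <sercp:listen source="rtp://aud-s.example.com:5004">
     <asr>
       <context name="commands">
         <grammar> stop | start | go back </grammar>
       </context>
     </asr>
   </sercp:listen>

   <sercp:listen source="rtp://aud-s.example.com:5004">
     <asr>
       <context src="cid:commands-context"/>
     </asr>
   </sercp:listen>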
8.2.3 asr and siv
Two speech servers can be attached to a single audio source, e.g. a
recognizer and a speaker identifier.
The mechanism of delivering the end-pointed audio to both servers is
up to the audio server.
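A sketch of a combined request (the siv content model and the
claimedid attribute are hypothetical):

   <sercp:listen source="rtp://aud-s.example.com:5004">
     <asr>
       <context src="http://app.example.com/ctx/commands"/>
     </asr>
     <siv claimedid="user42"/>
   </sercp:listen>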
8.3 Prompt and Listen
Presence of both tags (prompt and listen) in a message implies the
dispatch of one turn consisting of play/record and an optional end
play. The actions are either parallel or sequential--this depends on
the barge in setting--but from the user perspective it is a single
request-response sequence.
8.3.1 Play and Collect
Examples include asking for a choice ("please press 1 for yes, 0
for no"), asking for variable length input ("please enter pin number
followed by a pound"), and asking for a selection ("please press any
digit when you hear your choice; one, two, ...").
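An illustrative sketch of the first case:

   <sercp:prompt>
     <speak> please press 1 for yes, 0 for no </speak>
   </sercp:prompt>
   <sercp:listen source="rtp://aud-s.example.com:5004">
     <digits length="1"/>
   </sercp:listen>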
The response contains the number of samples played which, together
with marker offsets, makes it possible to determine the choice made.
8.3.2 Play and Recognize
Examples include asking for a choice ("please say yes or no"),
asking for variable length input ("please say pin number"), and
asking for a selection ("please say stop when you hear your choice;
one, two, ..."). The dual mode input--speech or dtmf--can be handled
by combining an asr block and a digits element with a prompt such as
"please say or enter from the keypad pin number".
A digit pressed before speech is detected stops the recording and
the prompt (if the asr server is allocated on speech detection, it
is never used), and the dtmf timers take over control of the digit
collection.
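An illustrative sketch of the dual mode case:

   <sercp:prompt>
     <speak> please say or enter from the keypad pin number </speak>
   </sercp:prompt>
   <sercp:listen source="rtp://aud-s.example.com:5004"
                 bargein="speech" stopondtmf="1">
     <asr>
       <context src="http://app.example.com/ctx/pin"/>
     </asr>
     <digits length="4" termdigit="#"/>
   </sercp:listen>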
The asr barge in can be used to ignore the out-of-grammar speech
likely during playback of a long text (e.g. synthesis of a long
e-mail). Detected speech does not stop the prompt; successful
recognition of a command (e.g. "stop") is needed to trigger the
stop. The recording stops at the end of the prompt.
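A sketch of this case, using the bargein="asr" setting defined in
section 7.4 (the context URI is hypothetical):

   <sercp:prompt>
     <speak> ...e-mail text... </speak>
   </sercp:prompt>
   <sercp:listen source="rtp://aud-s.example.com:5004" bargein="asr">
     <asr>
       <context src="http://app.example.com/ctx/reader-commands"/>
     </asr>
   </sercp:listen>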
9. Security Considerations
SERCP may raise several security issues that are to be considered as
OPEN ISSUES:
- Engine remote control may come from non-authorized sources
that may request un-authorized processing (e.g. extraction of
voice prints, modification of the played back text, recording of
a dialog, corruption of the request, re-routing of recognition
results, corrupting recognition), with significant security,
privacy or IP / copyright issues. The SPEECHSC activity SHOULD
address these issues. Web services are confronted with the same
issues, and the same approaches (encryption, request
authentication, content integrity checks, secure architecture,
etc.) can be used with SERCP.
- Engine remote control may enable a third party to request
speech data files (e.g. a grammar or vocabulary) that are
considered proprietary (e.g. a hand-crafted complex grammar) or
that contain private information (e.g. the list of names of the
customers of a bank). The SPEECHSC activity SHOULD address how to
maintain control over the distribution of the speech data files
needed by web services, and therefore not only the authentication
of SERCP exchanges but also that of the target speech engine web
services.
The exchange of encoded audio streams may also raise important
security issues. However, they are not different from those of
conventional voice and VoIP exchanges. These SHOULD be considered
beyond the scope of the SPEECHSC activity.
10. References
[1] Bradner, S., "The Internet Standards Process -- Revision 3", BCP
9, RFC 2026, October 1996.
[2] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997
[3] Maes, S. H., Muthusamy Y. and Wajda W., Multi-modal Browser
Architecture Recommendation, "Clayman proposal" to the ETSI DSR
Application and Protocol Working Group, ETSI, January 16, 2000
[4] Maes, S. H. and Reng R., "Requirements and Recommendations for
Conversational Distributed Protocols and Conversational Engine
Remote Control; Version 0.5", AU/310/01, May 3, 2001.
[5] Maes, S. H., Sakrajda, A., Conversational Engine Remote Control
Protocols, Proposal to ETSI DSR STQ Application and Protocol
Working Group, June 26, 2001
[6] Coles, A., Use of SIP and SOAP as Basis for a Speech Engine
Control Protocol, ETSI STQ Aurora DSR Applications and Protocols
working group, June 28, 2001
[7] E. Burger and D. Oran, Control of ASR and TTS Servers BOF(cats),
http://www.ietf.org/ietf/02mar/cats.txt
[8] Simple Object Access Protocol (SOAP) http://www.w3c.org/2002/ws/
[9] Web Services Description Language (WSDL 1.1), W3C Note 15 March
2001, http://www.w3.org/TR/wsdl.
[10] Leymann, F., Web Service Flow Language, WSFL 1.0, May 2001,
http://www-4.ibm.com/software/solutions/webservices/pdf/WSFL.pdf
[11] UDDI, http://www.uddi.org/specification.html
[12] W3C Voice Activity, http://www.w3c.org/Voice/
[13] S. Maes, Multi-modal and Multi-device Interaction, Input
document to 3GPP T2 and W3C MMI,
http://www.w3.org/2002/mmi/2002/MM-Arch-Maes-20010820.pdf
[14] WSXL - Web Service eXperience Language, submitted to OASIS
WSIA and WSRP.
[15] W3C Web Services, http://www.w3c.org/2002/ws/
[16] Burger, E. and Oran, D., "Requirements for Distributed Control
of ASR, SV and TTS Resources", draft-burger-speechsc-reqts-00,
June 13, 2002.
[17] Maes, S. and Sakrajda, A., "Usage Scenarios for Speech Service
Control", draft-maes-speechsc-use-cases-00.txt, June 23, 2002.
11. Author's Addresses
Stéphane H. Maes
IBM T.J. Watson Research Center
PO Box 218, Yorktown Heights, NY 10598
Phone: +1-914-945-2908
Email: smaes@us.ibm.com
Andrzej Sakrajda
IBM T.J. Watson Research Center
PO Box 218, Yorktown Heights, NY 10598
Phone: +1-914-945-4362
Email: ansa@us.ibm.com