Workshop meeting notes taken by Dave Raggett

Q&A

 * Nuance - Steve Ehrlich (slides)

Mike Robin - asks whether WAP makes use of networked speech?
Nuance is less interested in this approach than the others.

 * Digital Channel Partners - Daniel Applequist (no slides)

What kinds of multimodal services do informational and
transactional service providers want to provide?

Platforms such as WAP, Voice, DTV, Desktop etc.

Talks about how content is generated starting with a database
and flowing into different channels, being tailored as
appropriate to each platform.

What are the issues?

Contextual clues needed for each modality. An authoring step
is required to provide these.

Andrew Scott - how do you see the markup language choice
affecting this?

Daniel replies that he is more interested in working back
from the applications to understand the needs.

Do you want one language for all modalities? No, not in the short
term; sending info for all modalities would waste bandwidth.

Stephane Maes - you can transform to distill down to what is
needed for delivery, hence conserving bandwidth
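
As a rough illustration of this kind of distillation (the record
fields and channel names below are invented, not from the
discussion), a transform can strip a channel-neutral record down
to just what one delivery channel needs:

  # Hypothetical sketch: strip a channel-neutral record down to the
  # fields a given delivery channel actually needs before sending it.
  # The field and channel names are invented for illustration.
  CHANNEL_FIELDS = {
      "wap":   ["title_short", "summary"],           # small screen
      "voice": ["title_spoken", "summary_spoken"],   # text to be read out
      "web":   ["title", "summary", "body", "images"],
  }

  def distill(record, channel):
      """Keep only the fields the target channel will render."""
      wanted = CHANNEL_FIELDS[channel]
      return {k: v for k, v in record.items() if k in wanted}

  story = {
      "title": "Weather forecast for Wednesday",
      "title_short": "Wed forecast",
      "title_spoken": "Weather forecast for Wednesday",
      "summary": "Sunny, 28 degrees",
      "summary_spoken": "It will be sunny with a high of 28 degrees.",
      "body": "...",
      "images": ["map.png"],
  }
  print(distill(story, "wap"))   # only the two small fields are sent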

 * Philips - Eric Hsi (slides)

Trends. Multi-modal content adaptation for universal access.
Would like discussion on the following points:

 - Do we need a new standard or are existing ones sufficient?
 - Synchronization across modalities, what level of granularity
   is needed?
 - Separating presentation from content is hard if not impossible

 * SEC, Sachiko Yoshihama (slides)

SEC is a software vendor from Japan with WAP-related products.
HTML is more popular than HDML/WML in Japan: 9.8 million users for
iMode vs 3 million users for EZWeb, which is based on HDML and WML
1.0. More content, tools, knowledge. SEC encourages convergence of
WAP and W3C. SEC thinks WML and VoiceXML should converge, since
speech is a natural usage of cell phones and complements their
weak input capabilities.

Jim Larson - asks if HTML is more popular than WAP because it
appeared earlier?

Yes, this is one reason.


Charles McCathieNevile - asks if you think content providers
will be comfortable with authoring in WML and HTML?

Scott McGlashan - are there services which are only available
in iMode and not via WAP?

Sachiko - notes that WML has context management features which
are missing in Compact HTML.

Andrew Scott - thinks the data on relative popularity doesn't
show the full picture.

 * Conversa, Mike Robin (slides)

Inventing new languages is generally the wrong thing to do.
Conversa focusses on adding voice features into the client
platform. In other words, powerful clients.

                       Multimodal
IVR (voice+DTMF)  <--------|--------> WML/HTML (Graphics+Clicks)


Talks about architectural choices for distributing work between
client and server.

One issue is security. Need for nested security contexts and
a trust network. Another is recovering from connection
failures.

Ted Wugofski - cell phone with intermittent connection. Does
Mike think consumers will tolerate a reduced vocabulary when
the phone can't reach the server (e.g. via Aurora)?

Dave Bevis - asks Mike to expand on the security problems

Mike - applications crossing many servers.
-----

10am Brainstorming Session

Jim lists issues identified from position papers and
asks for additional issues:

 - Integration of WML and VoiceXML

 - Single authoring for multiple modalities (multiple use)
        - re-use of existing content
        - dialog authoring language
        - dialog differences and synchronization

 - Multiple authoring

 - Authoring systems

 - Architecture Convergence of GUI markup and VoiceXML
        - variations of how to deal with ASR

 - Difficulties in separating content from presentation

 - Ergonomics of switching between listening and watching
   (moving phone from ear to eye). End user oriented issues
        - feature bloat for combined language

 - Multi-modal interaction issues
        - synchronization issues
        - dialog management
        - events

 - Push/Pull issues

 - Contextual cues specific to particular modalities
        - knowing where you are and where you can go

 - Semantic Binding (where and how)

 - Other modes, e.g. video, handwriting ...


Jim labels the groups and invites people to write their
initials against the following set of topics:

 1. Synchronization and Multimodal interaction/dialog management

 2. Ergonomic issues

 3. Push-pull issues

 4. Contextual clues

 5. Multiple authoring

 6. Authoring systems

We break for 30 mins.

 * HP Labs, Marianne Hickey (slides)

Marianne talks about the W3C Voice Browser multi-modal
dialog requirements (in her role as leader of this
work in the Voice Browser working group).

Multi-modal: 3 main approaches

  - modes are used in parallel and you choose
    whether to use speech or keypad at any point
  - complementary use of modalities with different info
    presented via different modalities
  - coordinated input from multiple modalities 
    is seen as lower priority (an area for future study)

Jim Larson: How does the multimodal language you have
defined so far relate to WML, what is the relationship?

Marianne says her examples place the voice dialog in
control with WML or other GUI markup in a subservient role.
Her example uses HTML.

 * PipeBeach, Scott McGlashan (slides)

Scott's presentation covers following topics:
 - voice browser dialog requirements
 - work on transcoding WML to VoiceXML
 - language integration issues

W3C is basing its work on dialog markup on the VoiceXML
submission from the VoiceXML Forum (May 2000).

Transcoding arbitrary HTML into VoiceXML is practically
impossible. Transcoding WML into VoiceXML is more tractable.

WML apps often use abbreviations on account of small display
size. Another issue is support for free text input in WML,
something that is problematic for VoiceXML.
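
To give a flavour of such transcoding, here is a toy sketch (not
Scott's transcoder) that maps a WML select menu onto a VoiceXML
field whose options become the recognition vocabulary:

  # Toy sketch only: a WML <select> menu becomes a VoiceXML <field>
  # whose options double as the recognition vocabulary.  A real
  # transcoder must also handle navigation, variables, free text
  # input and the abbreviation problem noted above.
  import xml.etree.ElementTree as ET

  wml_card = ET.fromstring("""
  <card id="drinks" title="Drinks">
    <p>Choose a drink:
      <select name="drink">
        <option value="tea">Tea</option>
        <option value="coffee">Coffee</option>
      </select>
    </p>
  </card>""")

  vxml = ET.Element("vxml", version="1.0")
  form = ET.SubElement(vxml, "form", id=wml_card.get("id"))
  for sel in wml_card.iter("select"):
      field = ET.SubElement(form, "field", name=sel.get("name"))
      prompt = ET.SubElement(field, "prompt")
      names = [o.text for o in sel.iter("option")]
      prompt.text = "Say " + " or ".join(names)
      for opt in sel.iter("option"):
          vopt = ET.SubElement(field, "option", value=opt.get("value"))
          vopt.text = opt.text          # spoken form taken from the label

  print(ET.tostring(vxml, encoding="unicode"))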

Rather than directly going for an integrated approach, Scott
is interested in extending WML to support speech interaction.

Charles: why not start from a service description?

Scott: starting from the service data is by far the easiest
approach but doesn't address existing content.

Charles: what about XForms?

Scott: the Voice Browser WG is looking at this but we
anticipate this affecting later versions of the dialog
markup language, since XForms is still at a very early stage.

 * Telstra, Andrew Scott (slides)

Developers shouldn't be burdened. Move the burden back up the
chain. (what does he mean by this?)

Authors need to control presentation for different modalities.
Andrew cites "Wednesday" as an example: rendered as "Wed" on WAP
and pronounced as "wendsday" when spoken.

A simple mechanism is needed for authors to express alternatives
for different media/platforms.
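
One purely hypothetical shape such a mechanism could take: the
author supplies the variants once and the delivery layer picks
the one that suits the target modality:

  # Hypothetical sketch: author-supplied variants, selected per
  # modality at delivery time.  All names are invented.
  ALTERNATIVES = {
      "wednesday": {"wap": "Wed", "voice": "Wednesday"},
  }

  def render(token, modality):
      """Return the author-supplied variant for this modality, if any."""
      variants = ALTERNATIVES.get(token.lower(), {})
      return variants.get(modality, token)

  print(render("Wednesday", "wap"))    # -> "Wed"
  print(render("Wednesday", "voice"))  # -> "Wednesday"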

Daniel: authors are unwilling to give the necessary info for
modalities other than the one they are focussing on right now.

Dave Raggett: asks what Andrew meant by "chain".

Ans: users -> content developers -> infrastructure providers,
where up is to the right.

Lunch Break

Summer Palace took 1 hour 25 minutes!

 * NTT DoCoMo, Kazunari Kubota (slides?)

 * IBM, Stephane Maes (slides)

 * NEC, Naoko Ito "XML document navigation language" (slides)

XDNL defines a document flow but not a full dialog model.
It uses the same syntax as XSLT for ease of understanding.

Ted: asks about how XLink is being used to break things up.

The example shows every two titles in a list of titles being
shown as a separate "document". This takes advantage of
the for-each mechanism and the counter-size attribute.
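
Roughly paraphrased in Python rather than XDNL (only the "two
titles per document" grouping comes from the example; the rest
is invented):

  # Rough paraphrase of the effect: the list of titles is split into
  # "documents" of two titles each.  Names and details beyond the
  # "two per document" grouping are assumptions.
  def paginate(titles, size=2):
      return [titles[i:i + size] for i in range(0, len(titles), size)]

  titles = ["News", "Weather", "Sports", "Stocks", "Mail"]
  for n, doc in enumerate(paginate(titles), start=1):
      print("document", n, ":", doc)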

2:30 Brainstorming on single authoring (1 hour)

 * PipeBeach, Scott McGlashan (slides)

Scott talks about the convergence of WAP and VoiceXML
architectures.

VoiceXML browser in the network, WML browser on the
mobile device, where browsers synchronize via a
control document. Push notifications used for
synchronization. Requires WML 1.2 with push gateway.

Pros:
  Independent browsers, no change to the markup languages. Reusable
  standalone.  Simple to create.

Cons:
  no tight synchronization on local transactions, timing ...
  concurrent voice and data requires GPRS/3G
  completely separate services/content
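
A very rough sketch of the coordination idea behind this
architecture (a paraphrase, not PipeBeach's design; all names and
URLs are invented, and real WAP Push and VoiceXML eventing are
not shown):

  # Sketch only: a control document maps each dialog state to the WML
  # deck and VoiceXML document that should be active together; when
  # one browser moves on, the matching URL is pushed to the other.
  # All names and URLs below are invented.
  CONTROL_DOC = {
      "welcome":  {"wml": "http://example.org/app/welcome.wml#main",
                   "vxml": "http://example.org/app/welcome.vxml#main"},
      "checkout": {"wml": "http://example.org/app/checkout.wml#pay",
                   "vxml": "http://example.org/app/checkout.vxml#pay"},
  }

  def on_state_change(new_state, changed_by, push):
      """Notify the *other* browser that the dialog state has changed."""
      targets = CONTROL_DOC[new_state]
      if changed_by == "voice":
          push("wml_browser", targets["wml"])     # push new deck to handset
      else:
          push("voice_browser", targets["vxml"])  # re-point voice browser

  on_state_change("checkout", changed_by="voice",
                  push=lambda dest, url: print("push", url, "->", dest))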


 * Motorola, David Pearce  "DSR ETSI/STQ-Aurora"

Distributed speech recognition. Current ETSI spec Feb 2000,
for the Mel-Cepstrum front end. Ongoing work on an advanced
front-end to halve the error rate in the presence of noise. DSR
works particularly well at weak signal strengths compared with
server-based speech recognition using GSM to convey the audio.
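
To give a feel for what the terminal computes and sends in DSR,
here is a generic mel-cepstrum front end in numpy. This is not
the ETSI Aurora front end (which fixes the frame sizes,
filterbank, compression and error protection); the parameter
values are only illustrative defaults.

  # Generic mel-cepstrum front end (NOT the ETSI Aurora spec); the
  # terminal would send these feature vectors instead of coded audio.
  import numpy as np

  def mel(f_hz):
      return 2595.0 * np.log10(1.0 + f_hz / 700.0)

  def mel_inv(m):
      return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

  def mfcc(signal, rate=8000, frame_len=200, hop=80, n_filt=23, n_ceps=13):
      """One row of cepstral coefficients per hop (10 ms at the defaults)."""
      n_fft = 256
      # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
      edges = mel_inv(np.linspace(mel(0.0), mel(rate / 2.0), n_filt + 2))
      bins = np.floor((n_fft + 1) * edges / rate).astype(int)
      fbank = np.zeros((n_filt, n_fft // 2 + 1))
      for i in range(n_filt):
          lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
          if mid > lo:
              fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / (mid - lo)
          if hi > mid:
              fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / (hi - mid)
      # DCT-II matrix decorrelates the log filterbank energies.
      k = np.arange(n_filt)
      dct = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_filt)

      window = np.hamming(frame_len)
      feats = []
      for start in range(0, len(signal) - frame_len + 1, hop):
          frame = signal[start:start + frame_len] * window
          power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
          feats.append(dct @ np.log(fbank @ power + 1e-10))
      return np.array(feats)

  # e.g. one second of (fake) 8 kHz audio -> about 100 feature vectors
  print(mfcc(np.random.randn(8000)).shape)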

Merge of WML (HTML) and VoiceXML. Complete control in the
terminal. Parallel for voice and visual. Thin clients - all
processing handed off to server. Fat clients do all the work
locally. Intermediate architectures are also possible.

ETSI is keen to promote the merged language approach,
with speech handled as data and transferred in parallel
with markup.

Stephane: voice channel is cheaper than data channel, bit for
bit.

Alastair: in volume the data channel is cheap.

Reports from Break-out sessions:

Charles McCathieNevile - Integration of WAP/VoiceXML (html)



Dave Raggett - Single Authoring (text)

Note: Some authors don't care about modalities, and the authoring
system should be able to fill in for the roles of the graphic
designer etc. for the different modalities.

??? - Architecture Convergence (slides)

Volker Steinbiss - Dialog management, Synchronization, and
Multimodal interaction issues (html or word)

Question about SMIL as a good starting point. SMIL doesn't
really deal with dialog. Another question ...

Jim Larson - Ergonomic and End user oriented issues

 1. When is multimedia really useful versus monomedia?

Handsets are not ideal for multimedia. Handsfree or
distant microphone operation would be helpful.

 2. Social interaction issues:

  - Privacy
      * Eavesdroppers
      * Appearing stupid in front of other people

  - No cell phone allowed in theatres

  - Not structured for a social role

 3. Voice is ephemeral (it doesn't hang around)

  - a record of what was said could be useful

 4.  Is the device a phone, a handheld, or a new thing?

  - what is the migration path to this new thing

 5. Input device problems

  - buttons hard to use with big fingers

  - device is inadvertently turned on

  - thumb fatigue

 6. Solutions

  - encourage research on when multimodal interfaces
    are really useful

  - research on minimizing number of user actions/gestures
    to achieve particular input actions

  - involve social scientists to look at what people like
    to do/not like to do, etc., to get a better feel for
    appropriate usage models and constraints

  - design and implement social protocols for privacy and
    social interaction

  - design and implement a persistent visual channel (kind
    of like a short term memory) as an aide memoire

  - capture and publish use-cases for multimodal devices/apps

  - a roadmap for getting there from here

  - ability to keep going or at least to suspend and recover
    when access to server is temporarily removed

  - cultural sensitivities to use of color etc.

Volker: you may need to change how you interact, e.g. when
you walk into or out of a meeting, as this may affect whether
you want to use aural interaction or not.

Dave Beavis - Authoring and rendering systems; semantic
    binding (slides)

What is an example where the synchronization points would
change depending on the user's preferences/profile?

Ans: Someone with cerebral palsy would take much longer
to respond (Charles). Another example is when you are driving.

The profile might need to include the transport
mechanisms, so that applications can take these into
consideration.
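
Purely as an illustration of that point (all property names are
invented), a delivery-context profile might carry both user and
transport properties, and the application could coarsen its
cross-modal synchronization accordingly:

  # Illustrative sketch: a delivery-context "profile" carrying user
  # and transport properties, used to pick how fine-grained the
  # cross-modal synchronization should be.  Property names invented.
  def sync_granularity(profile):
      slow_user = profile.get("user_response_time") == "slow"
      slow_link = profile.get("transport", {}).get("latency_ms", 0) > 500
      if slow_user or slow_link:
          return "per-page"    # only resynchronize on page/card changes
      return "per-field"       # resynchronize after every field/input

  profile = {
      "user_response_time": "slow",   # e.g. driving, or a motor impairment
      "transport": {"bearer": "GSM-CSD", "latency_ms": 800},
  }
  print(sync_granularity(profile))    # -> "per-page"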