A partnership for a better speech interface
The partnership between Vlingo and AT&T dates from November 2009, when the two companies signed a licensing agreement to incorporate AT&T's WATSON speech engine into Vlingo's speech-enabled interfaces for mobile devices. By spring 2010, Vlingo had updated all its products to use WATSON.
In this interview, Mike Phillips, CTO and co-founder of Vlingo, talks about the partnership.
How is WATSON integrated into the Vlingo system?
Mike Phillips: The WATSON integration work with AT&T has been on a number of different levels. First, we had to ensure that the technology was working as well as or better than the existing engine in our service. This was actually a difficult challenge, since we had already optimized the previous engine for our service and also optimized our service around the capabilities of that engine. So, we started with a pretty intensive effort to optimize the performance–mainly accuracy and speed–for the needs of the Vlingo service. In the end, we were able to outperform our existing technologies–in some cases quite dramatically. Most importantly, we were able to reduce the number of incorrectly recognized words by approximately 20%. This is a pretty big gain and certainly noticeable to end users.
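To make the 20% figure concrete, here is a minimal sketch of how a relative reduction in word errors is computed; the baseline and new error counts below are hypothetical, not numbers from the interview:

```python
# Illustrative only: the interview reports roughly 20% fewer incorrectly
# recognized words; these error counts are hypothetical examples.
baseline_errors = 1000   # misrecognized words with the previous engine
new_errors = 800         # misrecognized words after the WATSON integration

relative_reduction = (baseline_errors - new_errors) / baseline_errors
print(f"Relative error reduction: {relative_reduction:.0%}")  # prints "Relative error reduction: 20%"
```

The point of quoting a relative (rather than absolute) reduction is that it stays meaningful across test sets of different sizes.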
The second area of work was the software integration into our server infrastructure. This was actually quite easy for us, since the APIs exposed by WATSON were already a good match for what we were expecting. We had a few issues to work out with the team at AT&T, but we were quickly able to integrate, and fortunately had no problems with the scalability or robustness of the software.
Now, we have moved on to the more interesting part of the partnership. We are working closely with the researchers at AT&T to make continued improvements to the WATSON technology and how we use it in our applications. This includes things like extending to new languages, making use of improved model training and adaptation tools, and adding functionality like natural language understanding.
When testing WATSON, how did you evaluate performance? What were the criteria?
We do quite extensive accuracy testing at Vlingo. One of the things we believe strongly is that we have to test on real-world data from real users (versus laboratory-type evaluations). So, we make use of digitized recordings from real usage of our system–typically we test on something like 10,000 examples from 3,000-4,000 different users.
The actual testing carefully measures accuracy (the percentage of words correctly recognized) along with the computation speed and memory use for different configurations. Our goal is to find the maximum accuracy within the computation budget we can afford in our data center.
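As a sketch of what "percentage of words correctly recognized" means, the standard approach aligns the recognized transcript against a reference transcript using a minimum word-level edit distance. The implementation below is illustrative only, not Vlingo's actual test harness:

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Word accuracy = 1 - (word-level edit distance / reference word count)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    # (substitutions, insertions, and deletions each cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)

# One deleted word ("a") against a five-word reference -> 80% accuracy.
acc = word_accuracy("find a hotel in chicago", "find hotel in chicago")
print(f"{acc:.0%}")  # prints "80%"
```

Averaging this measure over thousands of real utterances, as described above, is what produces a test-set accuracy figure.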
You’ve been quoted as saying that the joint Vlingo-WATSON solution provides more accuracy than was possible in the past. Does WATSON have better recognition, or is it because WATSON’s plugin architecture enables tighter integration with the Vlingo system?
It’s a combination of the two. We found the “out-of-the-box” accuracy to be better than our previous solutions, and then also found that the tools and integration hooks provided by WATSON allowed us to get even greater gains by adapting to the needs of our system.
Did you have to adapt your system in any way when incorporating WATSON?
The changes were pretty minimal. We did have to add support for the WATSON APIs within our server infrastructure, and then also added some support for new features available through WATSON, but that was about it.
Is there new functionality achievable with WATSON that wasn’t possible or easy before?
The main new functionality we got with WATSON was the ability to handle much larger language models. This has allowed us to scale our offerings in ways we could not in the past. Also, because of some of the details of the WATSON interface, we were able to achieve significant reductions in the end-user latency (the time between when the user is done speaking and when results appear on their screen).
Also, since integrating with WATSON, we have been adding new capabilities in our system such as automatic punctuation and are increasing our focus on Natural Language Understanding.
Your company has a lot of speech expertise. Why not build your own recognition engine?
We’ve considered this at various times, but would prefer to concentrate our resources on the levels above the speech technologies themselves. In this overall industry, it is our view that there have been very significant investments in the core technologies over the years, and this is paying off with high-performance engines such as WATSON. But, there hasn’t been enough focus on how these technologies are actually used. That certainly includes things like application and user interface issues, but at the technology level involves things like model training and adaptation, making use of user feedback to improve the technology, etc. This has been the main technology focus of Vlingo and therefore is the basis of the strong technology partnership with AT&T.
AT&T researchers credit feedback from Vlingo speech scientists for improving the WATSON engine. What were the actual contributions by Vlingo scientists?
These fall into a few categories. One of the things Vlingo has been doing is asking AT&T for a number of new features in the WATSON engine–mainly around the details of handling multiple large language models. AT&T has been very responsive in getting those changes to Vlingo, and I think it is finding that the things we are asking for are generally useful. Also, as part of our agreement, Vlingo provides usage data to AT&T, which is directly useful for creating better models and for other improvements in the engine. Finally, Vlingo has been creating some technology components, which we share back with AT&T, and has also been porting the engine to many new languages–this work is also shared back with AT&T. We are just in the process of deploying WATSON for a number of European languages and have a very significant language roadmap for 2011.
The partnership has been in place for over a year. How is it working?
It’s been great from our point of view. Not only is the technology working better than what we had in the past, but more importantly, the ongoing technical collaboration has been very healthy and is helping both Vlingo and AT&T. We each have things to bring to the partnership–AT&T has deep technical expertise in the core technologies, and Vlingo has a strong focus on bringing speech-enabled products to market. This combination of technical expertise and real-world market focus (and real usage data from deployments) is resulting in very fast progress on what we can bring to end users.
Do you foresee expanding the way in which you use WATSON?
Yes, this is happening in a number of areas. On one hand, we are continuing to expand the capabilities of the mobile phone applications–in addition to continued improvements in performance, we are focusing a lot of effort on what we call “intent modeling”, so not only recognizing what users are saying, but converting that to meaning and then actions. In addition, we are expanding into some new markets, including similar speech-enabled interfaces on tablets, in cars, and in TVs. We think once people get used to the fact that they can speak to their mobile phones and have the phone do what they say, they are going to expect similar functionality from other connected devices.
What hard problems are you working on now for further improving speech interfaces?
It’s mainly in this area I mentioned above around “intent modeling”. When people are mobile, they don’t just want a speech interface into existing applications (such as adding voice to a search engine or messaging application); they want to be able to just speak to their phone and have Vlingo do the thing they want. We are starting to use the “virtual assistant” metaphor because this really is what we are trying to achieve: allow people to interact with Vlingo through whatever means they desire, and then Vlingo should figure out what they want and perform the action for them. As recent examples of this, we have integrated with applications like Foursquare, Kayak, Fandango, and OpenTable so that users can say things like “where are my friends” or “find a hotel for tomorrow night in Chicago” and have Vlingo launch the appropriate application. We want to support this from your mobile phone, from devices built into your car, from your TV set, etc., and allow you to speak it, type it, or interact with a GUI. So, while we make great use of speech technology, the main challenge is really around how we use this within this much broader context.
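The "figure out what they want and perform the action" step can be pictured as routing an utterance to an intent before handing it to an application. The sketch below uses hypothetical intent names and keyword rules purely for illustration; a real system like the one described would use statistical natural language understanding, not keyword matching:

```python
# Hypothetical keyword-based intent router. The intents and trigger
# phrases are invented for illustration; production NLU would be
# statistical rather than rule-based.
INTENT_RULES = {
    "find_friends": ["where are my friends"],
    "find_hotel": ["find a hotel", "book a hotel"],
}
FALLBACK_INTENT = "web_search"  # anything unmatched becomes a search

def route_intent(utterance: str) -> str:
    text = utterance.lower()
    for intent, phrases in INTENT_RULES.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return FALLBACK_INTENT

print(route_intent("Find a hotel for tomorrow night in Chicago"))  # find_hotel
print(route_intent("look for a Mexican restaurant in Boston"))     # web_search
```

The interesting design question, as the interview suggests, is everything around this core: choosing which application to launch, filling in its fields, and doing so consistently whether the user spoke, typed, or tapped.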
What is a Vlingo?
Vlingo is a speech app for smart phones that lets you talk to your phone and just say what you want.
Users who download Vlingo can speak to their phones, and Vlingo does the rest: it parses the meaning of what was said, opens the appropriate app, and converts spoken input into text input, inserting it into the appropriate text boxes.
If you say “look for a Mexican restaurant in Boston,” Vlingo will open a browser and enter the words “Mexican restaurant in Boston” into the search box. Similarly, you can tell your phone to dial a phone number (“call work”), update your status on Facebook or other social sites, or send an email. When emailing, Vlingo opens an email form, fills in the To field, and places the message in the message field. Vlingo works with other applications as well.
Unlike other speech interfaces, Vlingo allows you to speak naturally, no matter the app. (This flexibility is due to Vlingo’s use of AT&T WATSON’s support for hierarchical language models.)
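One way to picture a hierarchical language model is as a top-level model whose placeholder "slots" are expanded by smaller domain-specific sub-models. The templates and vocabulary below are invented for illustration and are not WATSON's actual model structure:

```python
# Illustrative sketch of the hierarchical-language-model idea: a
# top-level model embeds slots that are filled by sub-models. All
# templates and word lists here are hypothetical.
import random

SUB_MODELS = {
    "<contact>": ["work", "mom", "john"],
    "<query>": ["mexican restaurant in boston", "weather in chicago"],
}
TOP_LEVEL = ["call <contact>", "search for <query>"]

def sample_utterance(rng: random.Random) -> str:
    """Sample a sentence the combined model can cover."""
    template = rng.choice(TOP_LEVEL)
    for slot, words in SUB_MODELS.items():
        if slot in template:
            template = template.replace(slot, rng.choice(words))
    return template

print(sample_utterance(random.Random(0)))
```

The practical benefit of this structure is that each sub-model can be swapped or retrained independently (say, updating the contact list) without rebuilding one monolithic model, which is what lets a single interface cover many applications.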
And because Vlingo incorporates text-to-speech, users can hear responses read aloud and listen to emails, texts, and browser results.
Another Vlingo advantage: it gets better the more you use it. Each utterance is saved so the system can learn a user’s speech patterns and most likely requests. (Vlingo does this by utilizing WATSON’s adaptation feature.)