
Groundbreaking tech will make Cortana's hearing as good as ours

We all want to know that the person we're talking to is listening to us. We extend this expectation to our "non-human" AI companions as well. Having to repeat oneself, once we've mentally moved from "expression of thought" to "expectation of response," only to be volleyed back to "expression of thought" because the listener didn't hear us, is an exercise in frustration.

Talking to digital assistants can be frustrating.

This all-too-common pattern of exchange between humans and our AI digital assistants has caused many of us to default to more reliable "non-verbal" exchanges.

Ironically, speaking to our assistants is supposed to be a more natural and quicker way to get things done.

Perhaps science fiction has spoiled us. The ease with which Knight Rider's Michael Knight verbally interacted with his artificially intelligent Trans Am, KITT (and the ease of other fictional AI-human interactions), painted a frictionless picture of verbal discourse. The bar has been set very high.

Microsoft may have gotten us a bit closer to that bar, and to a route out of the pattern of frustration that talking to digital assistants often engenders.

Crystal clear

On Monday, October 17, 2016, Microsoft announced that its latest automated system had reached human parity in speech recognition. In a nutshell, Microsoft's non-human system was just as accurate as professional human transcriptionists at hearing and transcribing conversational speech.

Microsoft's automated system performed as well as professional transcriptionists.

The tests included pairs of strangers discussing an assigned topic and open-ended conversations between family and friends. The automated system matched or surpassed the professional transcriptionists' respective 5.9% and 11.3% word error rates on each test.
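The yardstick here is word error rate: the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. The snippet below is not Microsoft's evaluation code, just a minimal sketch of how that metric is scored.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, computed over words rather than characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# Two errors ("the" -> "a", "lights" -> "light") over five reference words = 40% WER.
# A 5.9% WER works out to roughly six such errors per hundred spoken words.
print(word_error_rate("turn the kitchen lights off", "turn a kitchen light off"))
```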

Harry Shum, executive vice president of Microsoft's Artificial Intelligence and Research Group, praised the accomplishment:

Even five years ago, I wouldn't have thought we could have achieved this. I just wouldn't have thought it would be possible.

This breakthrough is foundational to the more complex AI interactions we have come to believe are just around the corner. As with human interactions, an assistant's ability to accurately hear us is a prerequisite to its understanding us. Understanding is, of course, the next step, and it may or may not take as long to achieve as the trek to human parity in "hearing" has.

Competition is good

The road to human-parity speech recognition began in the 1970s with DARPA, a research division of the US Department of Defense. Microsoft's subsequent decades-long investments led to artificial neural networks, which are patterned after the biological neural networks of animals. The combination of convolutional and long short-term memory (LSTM) neural networks helped Redmond's system become "more human." Xuedong Huang, Microsoft's chief speech scientist, exclaims:

This is an historic achievement.

And, of course, it is. Microsoft is not alone in its efforts to evolve AI understanding of human language, however. Google and Apple have invested in neural networks as well. The boost to Siri's performance is likely attributable to Cupertino's investments. Furthermore, Google's access to a massive repository of data through its search engine and ubiquitous Android OS has helped Mountain View's voice-recognition efforts make tremendous strides.
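For readers who want a concrete picture of the convolutional-plus-LSTM pairing described above, here is a toy acoustic model sketched in PyTorch. The layer sizes, the 40-dimensional log-mel input, and the character-style output layer are illustrative assumptions, not the architecture Microsoft (or anyone else) actually shipped.

```python
import torch
import torch.nn as nn

class ConvLSTMAcousticModel(nn.Module):
    """Toy acoustic model: convolutions over a spectrogram, then an LSTM over time."""

    def __init__(self, n_mels: int = 40, hidden: int = 256, n_outputs: int = 29):
        super().__init__()
        # Convolutions learn local time-frequency patterns in the spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # The LSTM carries context across the utterance, which is what makes
        # free-flowing conversational speech tractable.
        self.lstm = nn.LSTM(32 * n_mels, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_outputs)  # e.g. character scores

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, time, n_mels)
        x = self.conv(spectrogram.unsqueeze(1))   # (batch, 32, time, n_mels)
        x = x.permute(0, 2, 1, 3).flatten(2)      # (batch, time, 32 * n_mels)
        x, _ = self.lstm(x)
        return self.classifier(x)                 # per-frame output scores

model = ConvLSTMAcousticModel()
frames = torch.randn(1, 200, 40)   # two seconds of fake 10 ms log-mel frames
print(model(frames).shape)         # torch.Size([1, 200, 29])
```

The intuition behind the pairing: the convolutions pick out local patterns in how sound energy is distributed, while the LSTM remembers what came earlier in the utterance, letting the model use context to tell similar-sounding words apart.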

Apple is striving to catch up to Microsoft and Google.

Google's and Microsoft's industry-leading progress with deep neural networks, AI, natural language processing, and machine learning has likely spurred Apple's most recent investments in Siri.

Cupertino is looking to make its bounded assistant more competitive, if recent job postings for the Siri team in Cambridge, UK, are any indication:

Join Apple's Siri team…and be part of revolutionizing human-machine interaction! You will be joining a highly talented team of software engineers and machine learning scientists to develop the next generation of Siri. We use world class tools and software engineering practices and push the boundaries of artificial intelligence with a single aim: make a real difference to the lives of the hundreds of millions of Siri users.

This will undoubtedly increase Siri's reliability over time. Still, these advantages will be limited to Siri users within Apple's walled garden of 11.7% of all smartphone users and around 10% of all PC users.

Canvassing the competition

In a world where digital experiences are transient, the unbounded (cross-platform) nature of Cortana and Google Now is arguably an advantage these assistants have over Siri. Combined with the Bing and Google search engine backbones, respectively, Microsoft's and Google's investments in conversation as a canvas position these companies beyond the "iPhone-focused" Apple.

Speech recognition is key to Microsoft's Conversation as a Platform plan.

Microsoft's forward-thinking platform focus is the backdrop that supports its accomplishment of human parity in speech recognition. The achievement is important to Nadella's Conversation as a Platform and human language (written/verbal) as a UI strategy. It's a critical piece of the complex puzzle of making AI and human interaction more natural. Nadella had this to say about Conversation as a Platform:

We're in the very early days of this…It's a simple concept, yet it's very powerful in its impact. It is about taking the power of human language and applying it more pervasively to all of our computing. ...we need to infuse into our computers and computing intelligence, intelligence about us and our context…by doing so…this can have as profound an impact as the previous platform shifts have had, whether it be GUI, whether it be the Web or touch on mobile.

Cortana on PC, Edge, Windows Mobile, iOS, Android and Xbox is a big part of this vision. Shum said this of Redmond's speech recognition achievement and Microsoft's AI assistant:

This will make Cortana more powerful, making a truly intelligent assistant possible.

Going Forward

Customizing Cortana for each region, such as Korea, before launching her there provides a tailored regional experience but slows her global expansion. This is a sore spot for many users but an essential part of Microsoft's personal digital assistant vision. Still, it is a disadvantage Cortana has relative to the more widely distributed Siri and Google Now.

Finally, the lack of a standalone unit such as Alexa or Google Home is an apparent hole in Microsoft's strategic positioning of Cortana. Perhaps Redmond's groundbreaking success in speech recognition will give such a future device a strategic advantage over the competition. Fans are asking. I wonder if Microsoft can hear them.

Jason L Ward is a columnist at Windows Central. He provides unique big picture analysis of the complex world of Microsoft. Jason takes the small clues and gives you an insightful big picture perspective through storytelling that you won't find *anywhere* else. Seriously, this dude thinks outside the box. Follow him on Twitter at @JLTechWord. He's doing the "write" thing!

58 Comments
  • Thanks for reading, folks! As our tech becomes more integrated in our lives, its human-like sensory abilities will continue to evolve. Hearing, vision, cognitive abilities, etc. are being advanced to make our interactions with our tech/AIs more efficient and its interaction with our world more natural. What are your thoughts about Microsoft's advancement in the area of "hearing" we spoke about here? Should Microsoft have a standalone unit like Alexa and Google Home? LET'S TALK!!!
  • This would have worked perfectly with the Xbox streaming device they seem to have canned. I would have loved to get Groove music in my kitchen without the need for a full Xbox, and making that device Cortana- and UWP-app-enabled would have offered something that Amazon and Google do not have yet...
  • I read through some blog posts about this recently. It's really a fascinating leap forward and I'm happy that MSFT Research is behind it. As for a standalone unit for Cortana, I don't think it's the right time yet due to the lack of support for home automation. I'm sure there are a lot of options available, but here in the UK at least the popular kits I see only seem to work well with iOS and Android, and I'm not sure how well they integrate with Siri / Google Now or if any of them even touch base with Cortana on any platform. For me to be interested in a standalone unit I would expect pretty deep integration across a wide range of smart devices, and without that I don't see much benefit for a dedicated box.
  • The rollout to different regions isn't the biggest problem; the main issue is that anywhere outside of the US the development has been slow at best recently. The UK was probably one of the best non-US regions for Cortana at one point, but there's so much missing. We have got Sticky Note insights in the later insider builds for PC, but there's still plenty that UK Cortana can't do: no Wunderlist integration, for example, and searches for a document (like Panos did at the Surface Studio announcement) just aren't anywhere near as slick or natural. It's disappointing it's slowed down in the UK, and I don't see how, once it releases in Korea, they will actually keep it up to date with features.
  • Cortana as a standalone will need quite a few changes - They've focussed heavily on a single user personal assistant. When placed in a shared environment Cortana currently doesn't have a way to know WHO is asking the question. A shared device like the Echo needs to be able to respond to multiple users IMO.
  • Well, she does have a slight understanding that different people sound different; for example, the "only answer to me" option in Cortana.
  • Yeah it really just could do with something to expand that, good example though. Microsoft do have a cloud service for voice recognition so they have the tech (Hopefully they are already working on something like Alexa and won't get left behind)
  • Agree, and it also needs to be able to listen on multiple devices simultaneously (PC, phone, etc.) and decide which one should take action. Take advantage of the audio from all the sources and get an even better understanding of what was said. They actually took out a license for something like this. I have also been wanting them to do a "wireless Kinect" for some years now, as in a device like the Echo, but I want it (or a version of it) to come with the depth-sensing camera. It makes sense if you want to scan a room for VR/AR.
  • In fact, just add more sensors to it: temperature, humidity, etc. A proximity sensor would be redundant with the Kinect camera onboard though. Let it be programmable with Flow...
  • The question is whether the hearing has been improved in normal situations and not just in a controlled sound room. The other is whether they have improved launch reliability.
  • I've only had time to skim the article, but isn't it talking about the actual server-side recognition rather than just better microphones? Like the ability to transcribe it. I don't think they'll make the reliability better on the client side of existing phones without new mic arrays.
  • Yes
  • Yes. We will call her the Cortana *****
    .....
    It needs to be the first highly animated holographic home assistant... That would blow those stupid solid black appliances out of the water.
  • Make Cortana's hearing as good as mine? You wouldn't want that, if you believe my wife.... But really, this is a great advance in tech!
  • My wife and kids would agree with you. About my hearing that is, not yours
  • If only Cortana were useful for languages other than English ...
  • It's getting there...
  • Apple has an 18-month start on many languages. Cortana is nowhere near as easy to use as Google Now or Siri. I am currently using Google Now; prior to that I had a Lumia 950 but had to switch to Android because of the lack of a digital assistant. No crazy requirements; I just wanted to make phone calls while driving hands-free. MS is way behind in serving the world, and Cortana still hasn't much to show in terms of additional competencies over her competitors.
  • Cortana did start off really strong, but the responses got less natural in the Windows 10 version IMO. I think they took the foot off the pedal a little; even if overall it's way better to have Cortana in more places, there was something that got lost in the transition.
  • I don't really need other languages, but the least they could do is make her available everywhere. I can speak English but can't use this because ... reasons ... this thing could actually make my speech better because I get to exercise my English ...
  • I'm pretty sure it's not limited by region anymore since AU; you should be able to set the language to English even if your region isn't the US.
  • Good to hear :)
  • I see what you did there....... No. No, I'm overthinking things again.
  • Love how your articles compare what MS is doing against the wider competitive environment. Thanks for the insights.
  • You're welcome! :-)
  • Yep
  • Thanks
  • A standalone assistant is at major risk of becoming Microsoft's next "too late" technology. The market will be flooded with Google/Android devices.
  • Good point, and also, once again, one where Microsoft doesn't advertise its benefits through marketing. It doesn't matter how great it is if people don't know about its features.
  • Go, team AI of msft!
  • I've been using an Amazon Dot and I'm very impressed with its hearing capability. Heck, the whole Alexa experience is terrific. Cortana should take notice.
  • I have to agree, but part of what makes the Echo and Dot so good at recognition is their limited vocabulary and inflexible command syntax. Also, the microphone array is tuned specifically for the purpose.
  • What search engines do Siri and Echo use?
  • Siri uses Bing as a search engine.
  • Sorry, I didn't catch that... (Opens bing page)
  • Weird to read that Apple is playing catch-up in this arena, while I am pretty sure Siri is a lot more functional than Cortana... I really wholeheartedly disagree with the notion that doing research reflects how good the consumer product is; Cortana just doesn't make a difference. Also, I was able to use Siri in my native language years ago and Cortana is still miles away from the same coverage.
  • /"I was able to use Siri in my native language years ago and Cortana is still miles away from the same coverage"/ Siri was around before Cortana by a few years. Compared to Siri, Cortana is still in infancy. You can't really compare the two with the frame of mind "How could this new software not have all language compatibility already? This years-older software has it, why not Cortana?" That's like saying "My 5yr old dog knows not to make messes in the house... So why is my 1yr old pup doing it?" That being said, I'm not saying Cortana doesn't need improving, as there is a lot of room for improvement.
  • Ya, I somewhat agree with you. Siri may have downsides that Cortana/Google Assistant don't... but Siri, as far as my use of it goes, is way more interactive than Cortana... but we'll have to wait and see how the coming months play out. It's been 30 days since they accomplished this goal; now how do they implement it for the mainstream?
  • Nice, now it will be even easier for every one of the Cortana-enabled devices in my room to all answer me at the same time when I'm trying to tell the Xbox to change the channel! lol :)
  • Yeah, it's pretty annoying that anytime I ask my Xbox to do something, I get a cacophony of Cortanas around the room telling me they can't do that.
  • I said this at the time, but Microsoft missed a huge opportunity not putting an array mic in the Xbox One S and maybe even a small speaker. They had the perfect way to get a large userbase of Cortana users for Alexa-like functionality. Hopefully Microsoft will release a standalone assistant along with a cheaper one that plugs into the Xbox for those that don't have Kinect.
  • Maybe Cortana will be in next year's Xbox One upgrade model.
  • Perhaps an array mic will be in a forthcoming Xbox One C... Don't need Kinect just to have a microphone.
  • Full hands-free interaction with the PC is hopefully coming soon.
  • It is great to see Cortana's improvements but, and it's a big one!!! there is a lack of Cortana availability or capabilities in other languages. I mean, I'm from Mexico and a proud owner of a 950XL, but I can't show off my Lumia with Cortana because it has limited functionality in my country. For example, we can't use "Hey Cortana" in Spanish, while a friend with iOS is able to ask Siri for instructions with voice commands in Spanish with no touch interaction, you know? In Mexico Siri has a similar option to "Hey Cortana" in Spanish. It is just one example; what about Uber integration, still only available in the US... I think that these little details could affect W10 Mobile participation in markets outside the US. So MS, do not forget us!!!
  • Oh boy, such impressive, very performance, of an assistant that is available only to a select number of countries ...
  • First step should be to enable "Hey Cortana" in non-English-speaking countries. Still waiting for that on my Win 10 Mobile device...
  • Yet I still cannot train her to recognize my voice, despite her understanding my request and precisely repeating my words. Then, after failing to get her to recognize me, she comes to life when not asked, and you could only stop it if she understood you.
  • One big thing with this speech recognition milestone that I take a big beef with:
    The transcription errors made by someone doing an actual transcription (speech to text) are largely typos, not a misunderstanding of what word was used. Because they are trying to keep up with a real-time conversation, there are quite a few missed spaces, punctuation errors, misspelled words, etc. that are considered 'acceptable' because the result is generally still legible and coherent. The number of actual transcription errors that are due to a misunderstanding is something around 1:2,000. Computers don't make spelling mistakes, so if all of the errors are in punctuation and misunderstood words, then it is actually still pretty terrible. Better than the others on the market... but nowhere near 'good' yet. Personally, I would like to see an offline version hit the market built into computers. Programs like Dragon NaturallySpeaking are pretty terrible out of the box, but after it has been used for a few weeks the error rate approaches 1:1,000, which is actually pretty damn good. If we could have a local Cortana that runs on our phones and computers, and actually gets to personally know us and do the processing locally, then it would be MUCH much better than what we can get from an online service.
  • Hi CaedanV, thanks for your response. Actually, if you get a chance to check out Microsoft's original blog post on this topic, they give an example of what is pointed out in this article: the issue isn't punctuation but Word Error Rate (WER). I understand your claim, but what I reported here was the actual accomplishment, human/AI parity in accuracy when transcribing a word correctly, not punctuation. Here's an excerpt from Microsoft's blog that helps make this point clearer: "Parity, not perfection
    The research milestone doesn’t mean the computer recognized every word perfectly. In fact, humans don’t do that, either. Instead, it means that the error rate – or the rate at which the computer misheard a word like “have” for “is” or “a” for “the” – is the same as you’d expect from a person hearing the same conversation.
    Zweig attributed the accomplishment to the systematic use of the latest neural network technology in all aspects of the system.
    The push that got the researchers over the top was the use of neural language models in which words are represented as continuous vectors in space, and words like “fast” and “quick” are close together.
    “This lets the models generalize very well from word to word,” Zweig said." Hope this helps to clarify! :-) And thanks for your contribution to the discussion!:-) Read more at http://blogs.microsoft.com/next/2016/10/18/historic-achievement-microsof...
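    To make the "words as continuous vectors" idea in that excerpt concrete, here's a tiny illustration. The numbers are made up for the example; real neural language models learn vectors with hundreds of dimensions from data.
    ```python
    import numpy as np

    # Toy 3-dimensional "word vectors" -- invented numbers, purely for illustration.
    vectors = {
        "fast":  np.array([0.90, 0.10, 0.05]),
        "quick": np.array([0.88, 0.15, 0.07]),
        "slow":  np.array([-0.85, 0.20, 0.10]),
    }

    def cosine(a, b):
        """Cosine similarity: close to 1.0 means the words point the same way."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(vectors["fast"], vectors["quick"]))  # ~0.99 -> "close together"
    print(cosine(vectors["fast"], vectors["slow"]))   # negative -> far apart
    ```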
  • Articles like this are why I keep reading Windows Central. I used to read Android Central constantly, but now the only articles they release are click-bait things and lists and lists of "best things for bla bla bla" and "five things you should do immediately after you get a Google Pixel" (and the comment sections devolve into warzones on every single article, because of fanboys and people who don't like opposing opinions). A change in head editor really changes things, I guess. They still do interesting articles like this, but it feels like it's taken a back seat. Not that Windows Central doesn't do click-bait stuff either, but they seem to focus more on quality content as well as clicks. I always look forward to Jason's articles, because even if I disagree with them, they're very well thought out and informative, as well as interesting to read. Keep up the good work. Will never understand why everyone I know hates it when I say "ok google" or "hey cortana" to get something done (no, I'm not disturbing their peace, I don't do it in a quiet room). I love using them when I can. These amazing services are great tools, and I will use a tool if it's available to me and gets things done faster or better than other tools. Also, Cortana is nice to me :P
  • Thanks! And I'm glad Cortana's nice to you. She's pretty nice to me too. Lol :-)
  • Good read and well written article.
  • While Microsoft is celebrating historical achievements, Cortana is still unable to read me texts while driving. She gives me two options: "read it or ignore it". If I say "read it" she ignores me 95% of the time. Flipping a coin would have a 50% success rate. I'm so happy for Microsoft for achieving great success in a controlled lab environment, but their technology is worthless in the real world, where it really matters.
  • That might weigh heavily on the car you drive; that is quite likely the limiting factor... My Ford hears me with 99%+ accuracy, both in the Sync system and my phone if I "Call Cortana". My Nissan doesn't have a built-in system, but we have an aftermarket Bluetooth adapter. It isn't handy, as you have to flip to AUX to hear anything, but it is still very accurate if I do use it. [Two months later, haha, I know]
  • I'll chime in with "When outside of North America?". Here in Australia, Cortana seems very limited via my 950XL, with most questions resulting in a web page opening on the phone ... absolutely useless while I'm driving. HOWEVER ... change all my regional settings to the US, and suddenly Cortana becomes halfway useful. So what's the problem? Cortana can "hear" you, as long as I convince her I'm in America and put on an American accent? Siri, on my wife's iPhone 6, seems HEAPS more useful ... "she" usually offers up a useful response, where Cortana is left fumbling for a web page ...
  • I'm in Australia, set my 950XL to US ages ago, don't have to speak in an American accent. Works great.
  • Making her understand my words better wouldn't solve my issues. She understands me almost every time I talk to her, but for most things she puts up a Bing search. Not very useful when I ask her about my next flight. But the biggest issue is that I won't ever talk to her in public because it's awkward. And I don't ever see people using Google Now, Siri or Cortana anywhere.