Microsoft’s AI image captioning is more accurate than humans and improves accessibility

Joanna Estrada
October 16, 2020

The power of the cloud continues to impress: Microsoft's AI can now write image captions as well as, or better than, people.

Everyone has, at some point, encountered an automatically generated caption that reads more like robotic gibberish than a description of the photo.

And while that's a notable milestone on its own, Microsoft isn't just keeping this tech to itself. The company is rolling it out as part of Azure's Cognitive Services.
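Developers can reach the captioning capability through the Computer Vision service's image-description endpoint. The sketch below shows what such a call might look like; the endpoint host, API version, and environment-variable names are assumptions, so check your own Azure resource's details before relying on them.

```python
import json
import os
import urllib.request

def build_describe_request(endpoint, key, image_url):
    """Build a POST request to the Computer Vision 'describe' endpoint.

    The v3.1 path and header names follow Azure's Computer Vision REST API;
    the endpoint and key come from your own Azure resource.
    """
    return urllib.request.Request(
        url=f"{endpoint}/vision/v3.1/describe?maxCandidates=1",
        data=json.dumps({"url": image_url}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    # Hypothetical variable names; substitute your resource's values.
    endpoint = os.environ.get("VISION_ENDPOINT", "https://example.cognitiveservices.azure.com")
    key = os.environ.get("VISION_KEY", "")
    req = build_describe_request(endpoint, key, "https://example.com/photo.jpg")
    if key:  # only call the live service when a real key is configured
        with urllib.request.urlopen(req) as resp:
            print(json.load(resp)["description"]["captions"][0]["text"])
```

The response, if the call succeeds, includes a ranked list of candidate captions with confidence scores, which an application can surface as alt text.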

Microsoft is also including the tool in Seeing AI, its app for people with visual impairments. And later this year, the captioning model will improve your presentations in PowerPoint for the web, Windows and Mac. It'll also pop up in Word and Outlook on desktop platforms.

Eric Boyd of Microsoft's Azure AI division says, "[Image captioning] is one of the hardest problems in AI.

"You really need to understand what is going on, you need to know the relationship between objects and actions and you need to summarize and describe it in a natural language sentence", he said.

That's because accurate automatic image captioning is widely used to create so-called "alt text" for images on the internet: the text that screen readers use to describe an image to sight-impaired individuals who rely on these accessibility options online and in certain smartphone apps. For those users, better captions can make navigating the web and software dramatically easier.

The approach comes from the team of Xuedong Huang, a Microsoft technical fellow and the chief technology officer of Azure AI Cognitive Services. (Typically, training automatic captioning models requires corpora that contain annotations provided by human labelers.) Instead, Microsoft's model first builds a visual vocabulary: an embedding space in which features of image regions and tags of semantically similar objects are mapped to vectors that are close to each other (e.g., "person" and "man", "accordion" and "instrument"). In its tests on a novel object captioning benchmark, Microsoft says its AI model managed to describe images as well as humans can. Whether or not that claim holds up in the real world remains to be seen.
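The "close vectors" idea can be illustrated with a toy sketch: in an embedding space, similarity between two tags is typically measured by the cosine of the angle between their vectors. The three-dimensional vectors below are invented purely for demonstration; a real model learns much higher-dimensional embeddings from data.

```python
import math

# Made-up toy embeddings: semantically similar tags get nearby vectors.
EMBEDDINGS = {
    "person":    [0.90, 0.10, 0.00],
    "man":       [0.85, 0.15, 0.05],
    "accordion": [0.10, 0.90, 0.20],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "person" vs. "man" scores much higher than "person" vs. "accordion".
print(cosine(EMBEDDINGS["person"], EMBEDDINGS["man"]))
print(cosine(EMBEDDINGS["person"], EMBEDDINGS["accordion"]))
```

A captioning model can exploit this geometry to describe objects it has rarely seen captioned, by falling back on nearby tags it knows well.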

"This visual vocabulary pre-training essentially is the education needed to train the system; we are trying to educate this motor memory", Huang said.

But while Office 365 and Seeing AI could already caption images automatically better than some AI baselines, Microsoft engineers pursued new techniques to improve them further.

Of course, Microsoft is careful to point out that the system "won't return ideal results every time".
