Yet another AI update from Google, folks. A couple of days ago, we saw how Google’s translation neural networks successfully created their own secret language to talk about things we can’t comprehend. Well, now they’re working to create the most accurate lip-reading software ever, even more advanced than our own ‘human’ skills.
Researchers from Google’s DeepMind AI division and the University of Oxford are working together on this project. To accomplish the task, a cohort of scientists fed thousands of hours of BBC TV footage (5,000 to be precise) to a neural network. It was made to watch six different TV shows, which aired between January 2010 and December 2015. The footage included 118,000 different sentences and some 17,500 unique words.
The primary task of the AI was to annotate the video footage. In the published research paper, the Google and Oxford researchers describe their ambition for the project as follows:
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem – unconstrained natural language sentences, and in the wild videos
As a measure of its progress, the neural network deciphered words with 46.8 percent accuracy, working from an analysis of mouth movements. An accuracy under 50 percent might seem laughable, but let me put things in perspective. When the same set of TV shows was shown to a professional lip-reader, they were able to decipher only 12.4 percent of words without error. On this footage, then, the AI was far more capable than a human expert in that particular field.
But this is not the only transcribing AI to have surfaced recently. This research follows similar work published by a separate group of researchers, also from the University of Oxford. Using similar techniques but different input data, this cohort was able to develop a lip-reading AI called LipNet.
This neural network achieved 93.4 percent accuracy during analysis, compared to 52.3 percent human accuracy. It was able to attain such high numbers because the research group tested the AI on specially recorded footage in which volunteers spoke formulaic sentences.
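To put the two sets of figures side by side, here is a quick back-of-the-envelope sketch in Python. It only divides the AI accuracy by the human accuracy reported above. The function name and labels are made up for illustration, and the two systems were tested on very different footage, so the ratios are not a like-for-like comparison.

```python
def accuracy_ratio(ai_pct: float, human_pct: float) -> float:
    """Return how many times higher the AI accuracy is than the human figure."""
    if human_pct <= 0:
        raise ValueError("human accuracy must be positive")
    return ai_pct / human_pct


# Percentages quoted in the article: (AI accuracy, human accuracy).
# The labels are illustrative, not official names for the test sets.
results = {
    "DeepMind/Oxford (BBC footage)": (46.8, 12.4),
    "LipNet (formulaic sentences)": (93.4, 52.3),
}

for name, (ai_pct, human_pct) in results.items():
    ratio = accuracy_ratio(ai_pct, human_pct)
    print(f"{name}: {ratio:.1f}x the human figure")
```

By this rough measure, the gap between machine and human is wider on the messy TV footage (roughly 3.8 times) than on the scripted recordings (roughly 1.8 times), even though the absolute accuracy there is much lower.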
The two research groups are now looking to expedite their analysis and use materials from each other’s research to truly understand the capabilities of their individual neural networks. Google’s DeepMind researchers have named their neural network “Watch, Listen, Attend, and Spell,” and it could be used for a host of applications. The scientists believe it might help hearing-impaired people understand conversations, and could also be helpful in annotating silent films and in instructing virtual assistants by mouthing words to a camera.