Adding to its growing arsenal of machine learning-powered products, Google has today announced and open-sourced yet another of its projects. The latest addition is an image captioning model called ‘Show and Tell’ that learns how to describe the content of images. This means the AI can interpret an image supplied by the user and describe it with a text caption.
This ‘image-to-text’ project is powered by a deep neural network running on TensorFlow, Google’s second-generation machine learning system, launched about a year ago. It has been developed by research scientists on the company’s Brain Team, who report that the system is 93.9 per cent accurate, an improvement over the previous version, which fell short of expectations.
But how does the ‘Show and Tell’ AI predict text captions for the corresponding images?
To make the captions as accurate as possible, the research team trained both the vision and language frameworks on captions written by real people. This approach to naming objects in a frame reduces redundancy and helps the system piece together a complete, descriptive sentence for the image in question. It also works at a more complex level, synthesizing original captions for previously unseen images.
The core strength of the ‘Show and Tell’ project is its ability to bridge logical gaps, connecting the objects in an image to the corresponding context.
In machine learning terms (research sheet), the Show and Tell project is an example of an encoder-decoder ‘convolutional’ neural network, in which the image is encoded into a fixed-length vector and then decoded into a natural language description. The system is trained to work as a language model conditioned on the image encoding. Text is represented with an embedding model in which each word is also a fixed-length vector, learned during training.
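To make the encode-then-decode idea concrete, here is a minimal sketch in plain Python. It is not Google's implementation: the weights are random and untrained, a single linear projection stands in for the convolutional image encoder, and a simple tanh recurrent step stands in for the real decoder. The vocabulary, the dimensions and every function name are hypothetical, chosen only for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

# Hypothetical toy vocabulary and sizes; the real model uses far larger ones.
VOCAB = ["<start>", "<end>", "a", "dog", "on", "grass"]
EMBED_DIM = 4   # length of every fixed-length vector in this sketch
IMAGE_DIM = 8   # length of the stand-in image feature list


def rand_matrix(rows, cols):
    """Return a rows x cols matrix of small random (untrained) weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Untrained stand-ins for parameters that would normally be learned.
W_IMAGE = rand_matrix(EMBED_DIM, IMAGE_DIM)            # encoder projection
WORD_EMBEDDINGS = rand_matrix(len(VOCAB), EMBED_DIM)   # one vector per word
W_INPUT = rand_matrix(EMBED_DIM, EMBED_DIM)
W_STATE = rand_matrix(EMBED_DIM, EMBED_DIM)
W_OUTPUT = rand_matrix(len(VOCAB), EMBED_DIM)


def encode_image(image_features):
    """Encoder: map image features to one fixed-length vector."""
    return [math.tanh(v) for v in matvec(W_IMAGE, image_features)]


def decoder_step(state, input_vector):
    """One recurrent step: combine the previous state with the current input."""
    mixed = [a + b for a, b in
             zip(matvec(W_STATE, state), matvec(W_INPUT, input_vector))]
    return [math.tanh(v) for v in mixed]


def generate_caption(image_features, max_words=5):
    """Greedy decoding: condition on the image, then emit one word at a time."""
    # Feeding the image encoding first conditions the language model on it.
    state = decoder_step([0.0] * EMBED_DIM, encode_image(image_features))
    word_id = VOCAB.index("<start>")
    caption = []
    for _ in range(max_words):
        state = decoder_step(state, WORD_EMBEDDINGS[word_id])
        scores = matvec(W_OUTPUT, state)
        word_id = scores.index(max(scores))  # pick the highest-scoring word
        if VOCAB[word_id] == "<end>":
            break
        caption.append(VOCAB[word_id])
    return caption


if __name__ == "__main__":
    print(generate_caption([0.1] * IMAGE_DIM))
```

Because nothing here is trained, the words it prints are meaningless. The point is the data flow: image to fixed-length vector, then word vectors fed back in one at a time until an end marker or a length limit is reached.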
It took the team a good one to two weeks to get through the initial training phase, which was conducted on a single machine with an NVIDIA Tesla K20m GPU. The second training phase may take a couple of additional weeks to reach peak performance, but it helps you get reasonable results on each try. The previous version of the image captioning model took an average of 3 seconds per training step, but today’s open-source release goes a notch further and performs the same task in about a quarter of that time: 0.7 seconds.
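As a quick check on those step times, the short calculation below works out the speed-up implied by going from 3 seconds to 0.7 seconds per step, which comes to roughly 4.3 times faster. The step count used for the wall-clock estimate is a hypothetical round number, not a figure from Google.

```python
previous_seconds_per_step = 3.0  # figure quoted for the previous version
current_seconds_per_step = 0.7   # figure quoted for the open-source release

speedup = previous_seconds_per_step / current_seconds_per_step
fraction_of_previous_time = current_seconds_per_step / previous_seconds_per_step

# Hypothetical step count, used only to illustrate the wall-clock effect.
example_steps = 1_000_000
seconds_saved = example_steps * (previous_seconds_per_step - current_seconds_per_step)
hours_saved = seconds_saved / 3600

print(f"speed-up: {speedup:.1f}x")
print(f"fraction of previous time: {fraction_of_previous_time:.0%}")
print(f"hours saved over {example_steps:,} steps: {hours_saved:.0f}")
```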
Once combined with Google’s enormous catalog, this technology could be highly useful for visually impaired users, who could use it to recognize the content of images and interact with that content in ways that were not possible before. Microsoft (COCO) and Facebook are also working on similar technologies to integrate into their platforms. This could also be the same technology that Google Assistant currently uses to help you determine the next smart reply on Allo.
Google has also previously announced a public alpha release of a TensorFlow-based cloud machine learning platform that powers various services offered by the company, including speech recognition in the Google app, search in Google Photos and the Smart Reply feature in Inbox by Gmail. Developers can use this service to build and train custom models for intelligent applications.