Tags such as [Music], [Applause] and [Laughter] will now tell you what that noise in the background is. Sound effects in videos will be described as captions near the bottom of the screen, thanks to advancements in YouTube's automatic captioning system driven by progress in Google's machine learning. The system has become very good at transcribing what people are saying, and it is now taking a step further to describe ambient sounds as well.
Announcing the news, YouTube said:
Since 2009, YouTube has provided automatic caption tracks for videos, focusing heavily on speech transcription in order to make the content hosted more accessible. However, without similar descriptions of the ambient sounds in videos, much of the information and impact of a video is not captured by speech transcription alone.
The company also said that music, applause and laughter were the sounds most easily recognized, which is why it is starting there. Its systems can recognize other sounds as well, but context proved to be a difficulty: if something "rang," the question is exactly what rang and how to caption it. The three sounds Google chose were also among the most frequently labeled.
The model applied here is a DNN, or Deep Neural Network. The company hopes to keep expanding its capabilities until it can offer finer distinctions and produce captions like [Mild Applause], [Raucous Applause] and so on. In the Viterbi decoding that was applied, Google said the predicted segments for each sound effect correspond to the ON state.
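To make that decoding step concrete, here is a minimal sketch of two-state (ON/OFF) Viterbi decoding over per-frame detection probabilities for a single sound effect. The example probabilities, the `p_stay` smoothing parameter and the helper names are illustrative assumptions, not Google's actual model or settings.

```python
import numpy as np

def viterbi_on_off(frame_probs, p_stay=0.9):
    """Decode a binary ON/OFF state sequence from per-frame probabilities.

    frame_probs: array of shape (T,), the per-frame probability that the
                 sound effect is present (ON), e.g. from a frame classifier.
    p_stay:      probability of remaining in the current state between
                 frames (an illustrative smoothing parameter).
    """
    T = len(frame_probs)
    # Emission log-likelihoods for states OFF (0) and ON (1).
    emit = np.log(np.stack([1.0 - frame_probs, frame_probs], axis=1) + 1e-12)
    # Transition log-probabilities that favor staying in the same state.
    trans = np.log(np.array([[p_stay, 1 - p_stay],
                             [1 - p_stay, p_stay]]))

    score = np.zeros((T, 2))
    backptr = np.zeros((T, 2), dtype=int)
    score[0] = np.log(0.5) + emit[0]          # uniform prior over states
    for t in range(1, T):
        for s in range(2):
            candidates = score[t - 1] + trans[:, s]
            backptr[t, s] = np.argmax(candidates)
            score[t, s] = candidates[backptr[t, s]] + emit[t, s]

    # Backtrace the most likely state sequence.
    states = np.zeros(T, dtype=int)
    states[-1] = np.argmax(score[-1])
    for t in range(T - 2, -1, -1):
        states[t] = backptr[t + 1, states[t + 1]]
    return states

def on_segments(states):
    """Collapse consecutive ON frames into (start, end) segments."""
    segments, start = [], None
    for t, s in enumerate(states):
        if s == 1 and start is None:
            start = t
        elif s == 0 and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(states)))
    return segments

# Noisy per-frame "applause" probabilities from a hypothetical classifier.
probs = np.array([0.1, 0.2, 0.8, 0.9, 0.85, 0.3, 0.9, 0.95, 0.2, 0.1])
print(on_segments(viterbi_on_off(probs)))
# -> [(2, 8)]: frames 2-7 decoded as ON, smoothing over the dip at frame 5
```

The point of the decoding is temporal smoothing: rather than flickering a [Applause] caption on and off with every noisy frame prediction, the transition penalty keeps a segment ON through brief dips in the classifier's confidence.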
Google plans to continue along this track for some time to come. It said it has developed a framework for enriching the automatic caption track with sound effects, but added that there is still much to be done.
We hope that this will spur further work and discussion in the community around improving captions using not only automatic techniques, but also around ways to make creator-generated and community-contributed caption tracks richer (including perhaps, starting with the auto-captions) and better to further improve the viewing experience for our users.
You can learn more about the topic in Google's Research Blog post announcing the feature.