This article was published 1 year ago

SANTA CLARA, CA/USA – FEBRUARY 1, 2014: Microsoft corporate building in Santa Clara, California. Microsoft is a multinational corporation that develops, supports and sells computer software and services.

At its annual Ignite conference this year, tech behemoth Microsoft introduced the public preview of Azure AI Speech’s text-to-speech avatar. The technology lets users generate talking avatar videos from text input, much as deepfakes do, though of course it is meant to serve better purposes.

The text-to-speech avatar feature combines advanced vision capabilities with synthetic video creation. Its avatar models are deep neural networks trained on video recordings of humans, and they produce 2D photorealistic avatars. Paired with text-to-speech voice models, these avatars bring visual and auditory elements together.

The result is a lifelike avatar capable of delivering spoken content with the aid of a text-to-speech voice model. This synthesis of vision and voice opens up a range of creative applications: generating talking avatars from text input gives businesses, educators, and content creators an efficient way to convey information, create engaging training materials, and develop immersive presentations.

A significant impetus behind the development of avatars is the simplification of video content creation. Traditional methods, often resource-intensive, demand considerable time and budget for shooting and editing. The text-to-speech avatar disrupts this traditional model, allowing users to input text and effortlessly generate videos tailored to their specific requirements. This efficiency marks a paradigm shift in how businesses and content creators approach video production.

The company gives an overview of how the avatar generates content. First, the text is fed into the text analyzer, which outputs a phoneme sequence. Next, the TTS audio synthesizer predicts the acoustic features of the input text and synthesizes the voice. Finally, the Neural Text-to-Speech Avatar model predicts the lip-synced images from those acoustic features, which produces the synthetic video.
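The three stages described above can be pictured as a simple chain of functions. The following is only a toy sketch to illustrate the data flow, not Microsoft's implementation: every name in it (`TOY_LEXICON`, `analyze_text`, `synthesize_acoustics`, `predict_lip_sync`, `render_avatar_video`) is hypothetical, and the neural models are replaced by lookup tables and fixed rules.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical toy pronunciation table. A real text analyzer would use a
# trained model and a full lexicon; this stands in for it.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}
TOY_VOWELS = {"AH", "OW", "ER"}


@dataclass
class AcousticFrame:
    phoneme: str
    duration_ms: int  # stand-in for real acoustic features


@dataclass
class VideoFrame:
    phoneme: str
    mouth_shape: str  # stand-in for a rendered lip-synced image


def analyze_text(text: str) -> List[str]:
    """Stage 1: text analyzer, turning text into a phoneme sequence."""
    phonemes: List[str] = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        phonemes.extend(TOY_LEXICON.get(word, ["UNK"]))
    return phonemes


def synthesize_acoustics(phonemes: List[str]) -> List[AcousticFrame]:
    """Stage 2: TTS synthesizer, predicting acoustic features per phoneme.

    Here the "prediction" is just a fixed duration: longer for vowels.
    """
    return [AcousticFrame(p, 120 if p in TOY_VOWELS else 60) for p in phonemes]


def predict_lip_sync(frames: List[AcousticFrame]) -> List[VideoFrame]:
    """Stage 3: avatar model, predicting one lip-synced image per frame."""
    return [
        VideoFrame(f.phoneme, "open" if f.duration_ms > 60 else "closed")
        for f in frames
    ]


def render_avatar_video(text: str) -> List[VideoFrame]:
    """Chain the three stages: text -> phonemes -> acoustics -> video frames."""
    return predict_lip_sync(synthesize_acoustics(analyze_text(text)))


video = render_avatar_video("Hello, world!")
print([frame.mouth_shape for frame in video])
```

The point of the sketch is the ordering: the video stage never sees the raw text, only the acoustic features produced by the speech stage, which is what keeps the lips in sync with the audio.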

In an official blog post, Microsoft explains that “a custom text to speech avatar feature enables customers to create a personalized avatar for their product or brand. Customers can upload their own video recording of avatar talent, which the feature uses to train a synthetic video of the custom avatar speaking.”

The company further notes that interested customers can choose either a prebuilt or a custom neural voice for their avatar. If a person's voice and likeness are used for both the custom neural voice and the custom text-to-speech avatar, the avatar will resemble that person. Whether used in advertisements, virtual sales agents, or AI-driven teaching scenarios, the ability to create visually and audibly compelling content can captivate audiences. Virtual assistants and chatbots powered by these avatars may provide more engaging and interactive user experiences.

Prebuilt and custom avatars

Prebuilt avatars are ready-to-use options available on Azure, offering a variety of choices for video content or interactive applications. The custom text-to-speech avatar feature, on the other hand, lets users create personalized avatars by uploading their own video recordings, which suits brands or products seeking a unique representation. Offering both gives brands and businesses creative flexibility: they can pick a ready-made avatar or tailor one to their own identity. Customization also extends beyond visual representation to the choice of specific voices, allowing for more personalized, brand-aligned communication.

The versatility of the text-to-speech avatar is evident in its range of applications. From traditional uses like training videos and product introductions to newer ones such as AI-driven teaching and virtual human resources (HR) assistants, the feature caters to a broad spectrum of industry needs. It can also be used for advertisements and virtual sales agents, and interested users can build conversational agents, virtual assistants, chatbots, and more with it.

In training and education, the text-to-speech avatar can streamline the development of instructional materials. Training videos, product introductions, and educational content can be produced more efficiently, reducing the time and resources traditionally required for video creation. This efficiency is especially valuable in corporate training, where conveying information effectively is crucial.

Microsoft has also implemented guardrails to ensure ethical use of the text-to-speech avatar, particularly in the creation of custom avatars. Access to custom avatars is restricted and available through registration only, with stringent criteria in place to prevent misuse.