Microsoft has launched three new in-house AI models – MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 – marking a major step toward building its own core AI technology instead of relying mainly on OpenAI. The models handle speech-to-text, text-to-speech, and image generation respectively, covering key areas of generative AI. They are being integrated into Microsoft's AI platforms to improve speed, reduce costs, and support enterprise use. The rollout puts Microsoft in more direct competition with Google and Anthropic.
Each of the three models focuses on a different, widely used area of AI. MAI-Transcribe-1 is designed for advanced speech recognition and can process multiple languages while maintaining accuracy even in noisy and low-quality audio environments. This makes it particularly suitable for enterprise scenarios like meeting transcription, call center analytics, compliance recording, and accessibility services.
MAI-Voice-1, focused on text-to-speech generation, brings improvements in both speed and realism. The model is capable of producing near real-time audio output, which is critical for applications like virtual assistants, interactive customer support, and real-time translation. Its ability to generate customizable voices also opens up new possibilities for businesses looking to create branded AI interactions. Industries like media, gaming, education, and entertainment are expected to benefit from such capabilities.
“The model can generate 60 seconds of audio in just a single second, and highly efficient GPU usage delivers that quality and speed affordably,” the tech giant noted.
Meanwhile, MAI-Image-2 enters the highly competitive image generation space, where demand has surged due to the rise of AI-assisted design and content creation tools. The model is intended for both creative professionals and enterprise users, allowing automated generation of visuals for marketing, product design, and digital media. The company claims it is rapidly deploying these models across its products to deliver better quality, speed, and efficiency at competitive pricing, with MAI-Transcribe-1 starting at $0.36 per hour, MAI-Voice-1 at $22 per 1 million characters, and MAI-Image-2 at $5 per 1 million text tokens and $33 per 1 million image tokens.
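To put the quoted prices in perspective, here is a rough cost sketch in Python. The rates are the ones cited above; everything else is an assumption made for illustration. In particular, the function names are hypothetical and not part of any Microsoft SDK, the transcription rate is assumed to apply per hour of audio processed, and the sample workload is invented. Actual billing may involve minimums, tiers, or rounding that are not covered here.

```python
# Launch prices as quoted in the article (USD).
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36        # assumed to mean per hour of audio
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_TEXT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_IMAGE_TOKENS = 33.0

MILLION = 1_000_000


def transcribe_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a number of input characters."""
    return characters / MILLION * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_tokens: int, image_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from text and image token counts."""
    text_part = text_tokens / MILLION * IMAGE_USD_PER_MILLION_TEXT_TOKENS
    image_part = image_tokens / MILLION * IMAGE_USD_PER_MILLION_IMAGE_TOKENS
    return text_part + image_part


if __name__ == "__main__":
    # Hypothetical monthly workload, chosen only to illustrate the arithmetic.
    total = (
        transcribe_cost(audio_hours=100)
        + voice_cost(characters=2_000_000)
        + image_cost(text_tokens=1_000_000, image_tokens=2_000_000)
    )
    print(f"Estimated monthly total: ${total:.2f}")
```

Under these assumptions, transcription is billed by audio duration while the other two models are billed by volume of characters or tokens, so a cost estimate needs a different usage metric for each service.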
A key aspect of this launch is integration. Microsoft is embedding these models into its broader AI infrastructure, particularly within Azure and its enterprise AI platforms. This allows businesses to access these capabilities directly through cloud services, making deployment easier and more scalable. By controlling both the models and the infrastructure, the Satya Nadella-led firm can optimize performance, manage costs more effectively, and ensure tighter data security.
This comes as the software behemoth has already committed over $13 billion toward AI development and partnerships, and continues to expand dedicated teams focused on advanced AI research and deployment. However, it is important to note that Microsoft is not ending its partnerships. Instead, it is adopting a multi-model approach, where its own MAI systems coexist with models from OpenAI and others.
The Tech Portal is published by Blue Box Media Private Limited. Our investors have no influence over our reporting.

Ashutosh is a Senior Writer at The Tech Portal, largely reporting on new tech and the intersection of technology and business. Ashutosh’s career spans nearly a decade of technology writing across multiple platforms and languages.