OpenAI rolls out new real-time audio models

OpenAI has rolled out a new set of real-time audio models focused on making voice AI faster and more useful in live conversations. The release includes three systems – GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper – built to support everything from live multilingual translation and instant speech-to-text transcription to full conversational voice agents that can reason, use tools, and execute actions during ongoing dialogue.

At the center of the release is GPT-Realtime-2, the company’s most advanced voice reasoning model to date. The Sam Altman-led firm describes it as a high-performance system capable of handling complex spoken interactions with reasoning power comparable to its latest large language models. Unlike traditional voice assistants that rely on step-by-step processing, this model is designed to operate on a continuous stream, interpreting speech as it happens and responding without noticeable delay. It also supports a context window of up to 32,000 tokens, letting it maintain long conversations without losing earlier context.
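Even with a 32,000-token window, the application still has to keep a running conversation inside that budget. A minimal sketch of trimming the oldest turns to fit, using a crude character-based token estimate (an assumption for illustration, not OpenAI's actual tokenizer):

```python
# Sketch: keep a rolling conversation history under a fixed token budget.
CONTEXT_LIMIT = 32_000  # tokens supported by GPT-Realtime-2, per the article

def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~1 token per 4 characters (assumption, not exact)."""
    return max(1, len(text) // 4)

def trim_history(turns: list[str], limit: int = CONTEXT_LIMIT) -> list[str]:
    """Drop the oldest turns until the estimated total fits the limit."""
    kept: list[str] = []
    total = 0
    for turn in reversed(turns):      # walk newest-first
        cost = estimate_tokens(turn)
        if total + cost > limit:
            break                     # everything older is discarded
        kept.append(turn)
        total += cost
    return list(reversed(kept))       # restore chronological order

# A huge old turn gets dropped; the recent turns survive.
history = ["a" * 100_000, "recent question", "recent answer"]
print(trim_history(history, limit=100))
```

Real deployments would use the provider's tokenizer and might summarize dropped turns rather than discard them outright, but the budget-enforcement loop is the same shape.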

The model is also engineered for what OpenAI calls ‘agentic behaviour’, meaning it can perform actions during conversations. Through tool integration, GPT-Realtime-2 can interact with external systems such as calendars, booking platforms, databases, and enterprise APIs. Commercially, it is positioned as a premium enterprise product, priced at around $32 per million audio input tokens and $64 per million output tokens, with reduced rates for cached inputs.
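At those rates, a session's cost is straightforward to estimate. A quick sketch, ignoring the cached-input discount and using made-up illustrative token counts:

```python
# Sketch: estimate GPT-Realtime-2 audio cost from the published per-token rates.
INPUT_RATE = 32 / 1_000_000    # $32 per million audio input tokens
OUTPUT_RATE = 64 / 1_000_000   # $64 per million output tokens

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one session, ignoring cached-input discounts."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Illustrative numbers only: 50k input + 20k output tokens
print(f"${session_cost(50_000, 20_000):.2f}")  # → $2.88
```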

The second model, GPT-Realtime-Translate, focuses entirely on live speech translation. It is designed to process spoken input continuously and generate translations in real time without requiring speakers to pause or complete full sentences. OpenAI reports support for over 70 input languages and around 13 output languages. GPT-Realtime-Translate is priced per minute of usage, at around $0.034 per minute of audio processed, making it significantly more accessible than the reasoning-heavy GPT-Realtime-2.

The third model, GPT-Realtime-Whisper, extends OpenAI’s earlier Whisper speech recognition technology into a real-time streaming system. Whisper was widely adopted in earlier AI applications for its strong multilingual transcription accuracy, and this new version is optimized for continuous speech-to-text conversion rather than post-recording analysis. It produces transcriptions live, as words are spoken, enabling near-instant captions and documentation.
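Streaming transcribers typically emit interim hypotheses that are later replaced by final segments. A minimal sketch of assembling a live caption from such a stream; the (text, is_final) event shape is an assumption for illustration, not GPT-Realtime-Whisper's actual wire format:

```python
# Sketch: assemble a live transcript from streamed segments.
# Each event is a (text, is_final) pair, a hypothetical shape.

def assemble(events: list[tuple[str, bool]]) -> str:
    """Committed text plus the latest interim segment, as a caption would show."""
    committed: list[str] = []
    interim = ""
    for text, is_final in events:
        if is_final:
            committed.append(text)   # final segments are appended permanently
            interim = ""
        else:
            interim = text           # interim segments overwrite each other
    return " ".join(committed + ([interim] if interim else []))

events = [("hel", False), ("hello", False), ("hello world", True), ("next", False)]
print(assemble(events))  # → hello world next
```

The overwrite-then-commit pattern is why live captions visibly "correct themselves" mid-sentence before the final text settles.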

Such capability makes GPT-Realtime-Whisper particularly useful for live meetings, newsroom transcription, courtroom documentation, accessibility tools for hearing-impaired users, and enterprise logging systems. Meanwhile, the pricing for GPT-Realtime-Whisper is the lowest among the three models, at about $0.017 per minute of audio processing.
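Because both streaming models bill by the minute, their costs are easy to compare directly. A quick sketch using the article's published rates:

```python
# Sketch: compare per-minute billing for the two streaming audio models.
RATES = {
    "GPT-Realtime-Translate": 0.034,  # dollars per audio minute, per the article
    "GPT-Realtime-Whisper": 0.017,
}

def cost(model: str, minutes: float) -> float:
    """Dollar cost of processing the given minutes of audio on a model."""
    return RATES[model] * minutes

# Illustrative: one hour of audio on each model
for model, rate in RATES.items():
    print(f"{model}: ${cost(model, 60):.2f}/hour")
```

An hour of transcription works out to roughly half the cost of an hour of translation at these rates.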

Technically, all three models mark a shift away from the traditional voice AI architecture, which relied on separate stages for speech recognition, language processing, and speech synthesis. A major focus of the new system is latency reduction. Another key advance is the models' ability to interact with external tools during conversation: instead of functioning only as language generators, they can actively retrieve data, perform operations, and trigger workflows in connected systems.
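On the application side, mid-conversation tool use reduces to routing model-issued calls to local handlers. A minimal sketch; the tool names and call shape here (check_calendar, book_slot) are invented for illustration and are not part of OpenAI's API:

```python
# Sketch: dispatch tool calls issued mid-conversation to local handlers.
# Tool names, arguments, and the call dict shape are hypothetical.

def check_calendar(date: str) -> str:
    return f"2 free slots on {date}"

def book_slot(date: str, time: str) -> str:
    return f"booked {date} {time}"

TOOLS = {"check_calendar": check_calendar, "book_slot": book_slot}

def dispatch(call: dict) -> str:
    """Route one model-issued tool call to its registered handler."""
    handler = TOOLS[call["name"]]
    return handler(**call["arguments"])

print(dispatch({"name": "check_calendar", "arguments": {"date": "2025-03-01"}}))
# → 2 free slots on 2025-03-01
```

The result string would be fed back into the live session so the model can speak it to the user without breaking the conversational stream.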

The Tech Portal is published by Blue Box Media Private Limited. Our investors have no influence over our reporting.