AI & SaaS

OpenAI Realtime API Is Out of Beta: What Voice Builders Get in 2026

June 12, 20266 min readBy SaaS Master

OpenAI's Realtime API left beta on May 12, 2026. The beta interface was deprecated on the same day, meaning any app still using the old endpoint needs to migrate to the GA version now. For builders who sat on the sidelines because the beta felt unfinished, the GA release adds three features that remove the main reasons to wait: SIP phone calling, remote MCP server support, and image inputs.

Key takeaways

The Realtime API is now fully generally available — the beta was deprecated May 12, 2026
New gpt-realtime model is more accurate at following complex instructions and switching languages mid-conversation
SIP integration means voice agents can now make and receive actual phone calls
Remote MCP servers connect voice agents to external tools and data without custom middleware
Audio input pricing is $32 per million tokens; a typical 15-minute conversation costs 30% less than at launch

OpenAI Realtime API pricing breakdown 2026

What changed from beta to GA

The original Realtime API beta launched as a way to build low-latency voice-to-voice applications without transcribing speech to text, passing it through a language model, and converting the output back to audio. That pipeline works, but it adds latency and loses prosody — the emotional tone and pacing of natural speech. The Realtime API keeps voice in the audio domain end-to-end.

The GA release is not just a stability upgrade. OpenAI added three features that change the scope of what you can actually build.

SIP support opens the phone channel

Session Initiation Protocol — SIP — is the standard that powers virtually every commercial phone system. Adding SIP to the Realtime API means a voice agent built on gpt-realtime can now make and receive actual phone calls, not just handle audio in a browser or mobile app. For SaaS companies building customer support bots, appointment schedulers, or outbound sales tools, this removes an entire integration layer that previously required stitching together Twilio or similar providers with the Realtime API through custom bridges.

This is the feature I think will have the biggest downstream commercial impact. The moment your voice agent can answer an inbound phone line directly, the use case list expands to anything a call center agent does today.

Remote MCP servers let voice agents use tools

Model Context Protocol — MCP — is the emerging standard for giving AI models access to external data and tools in a consistent way. The GA Realtime API now supports remote MCP servers, which means a gpt-realtime voice agent can call an external tool mid-conversation: look up a customer record, check inventory, create a calendar event, or query a database — without leaving the live audio session.

Previously, tool access in a voice agent required a separate WebSocket stream or a complex callback architecture. Remote MCP support folds that into the API itself. If you already have MCP servers running for your other AI workflows, those same servers now work with your voice agent.

Image input in a voice session

This one is less obvious but genuinely useful in the right context. A user can share a screenshot, a photo of a product, or a diagram, and the voice agent running on gpt-realtime can see it and discuss it — all without breaking the voice conversation flow. For technical support scenarios where a user is looking at an error screen or a configuration dialog, this matters. The agent can say "I can see the error — that's a missing API key in line 4" rather than asking the user to describe what they see.

Five new voices and better instruction following

The GA release ships five new voices: Ash, Ballad, Coral, Sage, and Verse. These are designed to be more expressive than the original voice set, with tunable emotions, accents, and tones. You can prompt the model to read a disclaimer script word-for-word, repeat alphanumerics clearly, or switch seamlessly between languages mid-sentence. The new gpt-realtime model is also better at interpreting system messages and developer prompts — the kind of edge case that caused frustration in production beta deployments.

What does the Realtime API actually cost?

Pricing at GA is $32 per million audio input tokens and $64 per million audio output tokens. Audio input that hits the cache costs 80% less — dropping to $0.40 per million tokens. Text input to the same model can also be cached at 50% off.

OpenAI reports that a typical 15-minute conversation now costs about 30% less than when the Realtime API first launched in beta. The exact per-session cost depends on how much of the conversation context is cached versus fresh, and how much the model speaks versus listens. For a customer support call where the agent reads from a consistent system prompt, caching can dramatically reduce the effective cost per session.

Who should build with this now?

The GA label matters for procurement and security reviews. Enterprise teams that were waiting for the beta designation to drop before evaluating the Realtime API for production use can move forward now. The SIP integration in particular unlocks a whole category of voice automation use cases that were impractical to build before — anything that needs to call or be called on a regular phone number.

For creators and smaller teams building AI-powered products, the price improvement and the expanded voice set make now a good time to experiment. The five new voices are noticeably more expressive than what shipped in beta, which matters if your product's personality depends on how it sounds.

The creator angle

I have been paying attention to the Realtime API since it launched in beta, mostly because it is the clearest path to building AI assistants that do not feel robotic. The latency problem that plagued early voice AI demos — the half-second pause before every response — is largely solved here. The SIP support is what has me excited for builders. Adding phone calling capability to an AI assistant is the kind of step-change that creates genuinely new product categories, and it is now available without a specialized telephony vendor.

Frequently asked questions

Is the OpenAI Realtime API beta still available?

No. The Realtime API beta was deprecated on May 12, 2026. All production traffic should now use the GA Realtime API endpoint. Apps still using the beta interface need to migrate immediately.

What is the pricing for the OpenAI Realtime API in 2026?

The GA Realtime API charges $32 per million audio input tokens and $64 per million audio output tokens. Cached audio input drops to $0.40 per million tokens — an 80% reduction. A typical 15-minute session costs approximately 30% less than at the beta launch.

Does the Realtime API support phone calls?

Yes. The GA release added Session Initiation Protocol — SIP — support, which allows voice agents built on the gpt-realtime model to make and receive actual phone calls through standard telephone infrastructure, without a separate telephony middleware layer.

OpenAI Realtime API voice AI AI agents SaaS

Was this article helpful?

SaaS Master

Creator behind SaaS Master — tutorials, walkthroughs, reviews, and explainers that help SaaS, AI, and WordPress products get understood and chosen. Writing here about the tools, trends, and tactics that actually move the needle. Work with me →