Azure Speech at Build 2026: Powering Voice Agents with Real-Time and Life-like Experiences
Voice is rapidly becoming the default interface for AI. At Build 2026, Azure Speech in Foundry Tools is making it dramatically easier to ship production-grade voice experiences that feel real, responsive, and global. From agentic, low-latency Voice Live experiences built natively into Foundry Agent Service, to a new generation of LLM-powered speech-to-text models and text-to-speech voices, every layer of the Speech stack is getting faster, more expressive, and more customizable. The latest Foundry update also introduces a unified speech experience that brings playgrounds and self-service fine-tuning to every Speech capability, giving developers a clear path from prototype to production for real-time, multilingual, and truly agentic voice applications.
Build Real-Time Voice Agents with Voice Live and Foundry Agent Service
As developers move beyond traditional chatbots, they are building a new class of real-time agents that can listen, reason, take actions, and respond naturally in live conversations. From customer support agents and virtual assistants to healthcare intake, retail concierge, field operations, in-car assistants and multilingual employee support, voice agents are becoming a key interface for how people interact with AI.
At Build 2026, we’re announcing major updates that support building enterprise-ready voice agents at scale.
Voice Live for Foundry Prompt Agents is now generally available. This is a strong fit for developers who want enterprise-ready capabilities with minimal operational overhead. Voice Live brings the essential pieces of voice interaction into a single API: speech to text, text to speech, turn detection, interruption handling, avatars, and other conversational capabilities. Customers can combine real-time speech-to-speech interaction with managed agent orchestration, knowledge, memory, enterprise governance, observability, and scalable deployment - all within a single developer workflow.
Hosted Agents with Voice Live is available in public preview. Some customers need full control over their agent’s runtime, orchestration framework, and execution model. For those scenarios, Microsoft Foundry supports Hosted Agents with Voice Live, so developers can build with the frameworks they prefer and deploy on managed infrastructure. Whether they use Microsoft Agent Framework, LangChain, or a custom orchestration stack, they can host those agents on Foundry Agent Service and connect them directly to Voice Live. Both the Response API and the Invocations Protocol are supported.
Hosted Agents also add support for real-time voice interfaces such as WebSocket and WebRTC, which allows developers to deploy real-time voice workloads as managed containers while continuing to use frameworks such as Microsoft Voice Live, Pipecat, and LiveKit. These interfaces are bidirectional and full duplex, which makes them well suited to both cascaded pipelines and native speech-to-speech models. (Hosted Agent with Voice Live Demo)
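To make "bidirectional and full duplex" concrete, here is a minimal Python sketch in which one task keeps sending audio chunks while another receives replies at the same time. It does not use the Voice Live API or a real WebSocket: the two `asyncio.Queue` objects stand in for the two directions of a connection, and `fake_agent`, `send_audio`, `receive_audio`, and `run_full_duplex` are hypothetical names introduced only for illustration.

```python
import asyncio


async def fake_agent(inbound, outbound):
    """Hypothetical stand-in for a remote voice agent: replies to each chunk."""
    while True:
        chunk = await inbound.get()
        if chunk is None:  # end-of-stream marker from the client
            await outbound.put(None)
            return
        await outbound.put(b"reply:" + chunk)


async def send_audio(chunks, inbound, log):
    """Upstream direction: push audio chunks without waiting for replies."""
    for chunk in chunks:
        await inbound.put(chunk)
        log.append(("sent", chunk))
        await asyncio.sleep(0)  # yield so the other tasks can make progress
    await inbound.put(None)


async def receive_audio(outbound, log):
    """Downstream direction: consume replies as soon as they arrive."""
    while True:
        reply = await outbound.get()
        if reply is None:
            return
        log.append(("received", reply))


async def run_full_duplex(chunks):
    """Run both directions concurrently over in-memory queues."""
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    log = []
    await asyncio.gather(
        fake_agent(inbound, outbound),
        send_audio(chunks, inbound, log),
        receive_audio(outbound, log),
    )
    return log


if __name__ == "__main__":
    for event in asyncio.run(run_full_duplex([b"chunk-1", b"chunk-2"])):
        print(event)
```

Because sending and receiving are separate tasks, neither direction blocks the other, which is the property that lets a client keep streaming microphone audio while agent audio is already playing back.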
In addition, we are advancing the Voice Live API with the following enhancements that developers can integrate into their agents:
- New all-in-one speech-to-speech models to help developers build highly responsive voice experiences. These include GPT-Realtime 1.5 and the new Azure-Realtime model (public preview), which delivers more natural voice output across multiple languages and accents, including en-US, zh-CN, es-ES, fr-FR, de-DE, hi-IN and more. This is a strong option for customers prioritizing speed, simplicity, and natural conversational quality in multilingual voice experiences (learn more).
- Integration with MAI Transcribe-1 (public preview) for more accurate multilingual speech input, Neural HD V3 voices for a more conversational and realistic voice experience, and four new full-body standard avatars (public preview) to make the voice agent more engaging. More details about the models can be found in the following sections.
- Full integration with speech customization and fine-tuning capabilities in Foundry, including custom speech for better recognition accuracy, custom voice for a branded voice experience, and custom avatar for one-of-a-kind visual representations of the agent. More details about these features can be found in the following sections.
- WebRTC (Web Real-Time Communication) connections in public preview, enabling low-latency, real-time voice interactions directly from web and mobile clients (learn more).
- The call center voice agent accelerator solution template now expands its telephony capabilities by integrating more third-party providers such as Twilio and Infobip, giving customers greater flexibility to connect with their preferred telephony infrastructure.
- A new Voice Live Evaluation Harness gives developers a one-command-deployable pipeline to score their voice agents on 13 Foundry evaluators - intent resolution, task adherence, task completion, response completeness, and more - using pre-recorded multi-turn audio in Push-to-Talk (PTT), Voice-Activity-Detection (VAD), or Foundry Agent mode (learn more).
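For the evaluation harness in the last item above, a common follow-up step is rolling per-turn scores up into one number per evaluator. The Python sketch below shows that aggregation only. It is not the harness itself, and the input shape (one dictionary of evaluator name to numeric score per turn), the evaluator keys, the score values, and the function name `summarize` are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize(turn_scores):
    """Average each evaluator's score across the turns of a conversation.

    turn_scores is an assumed shape: a list with one dict per turn,
    mapping an evaluator name to a numeric score.
    """
    by_evaluator = defaultdict(list)
    for turn in turn_scores:
        for name, score in turn.items():
            by_evaluator[name].append(score)
    return {name: mean(scores) for name, scores in by_evaluator.items()}


if __name__ == "__main__":
    # Hypothetical scores for a two-turn conversation.
    turns = [
        {"intent_resolution": 4, "task_adherence": 5},
        {"intent_resolution": 2, "task_adherence": 5},
    ]
    print(summarize(turns))
```

Running the same aggregation over the same recorded audio in each mode (PTT, VAD, Foundry Agent) would give comparable per-evaluator averages.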
Next Generation Speech Models in Azure Speech
We're advancing Azure Speech to text with a new generation of LLM-powered recognition models that raise accuracy, expand language coverage, and give developers more control across both batch and real-time scenarios.
- The LLM Speech API is now generally available in Azure Speech for LLM-powered transcription and translation of audio files (learn more). It covers 25 languages and 90+ locales with locale hints, uses a refreshed speech-LLM model with better context and entity recognition and reduced hallucination, handles long-form audio of up to 5 hours, supports prompt-tuning with up to 20,000 characters of input and 2,000 phrase-list entries, and is available in more regions. This model achieves industry-leading accuracy, ranking No. 1 across all models on the Open ASR Leaderboard. We also upgraded the MAI-Transcribe model from 1.0 to 1.5, adding phrase list support and a verbatim mode.
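The limits quoted above (5 hours of audio, a 20,000-character prompt, 2,000 phrase-list entries) lend themselves to a client-side pre-flight check before a file is submitted. The Python sketch below is one way to do that. It calls no Azure API, and `TranscriptionJob` and `check_limits` are hypothetical names. Whether the service treats these limits as inclusive is also an assumption.

```python
from dataclasses import dataclass, field

# Limits as quoted in the announcement; treated here as inclusive (assumption).
MAX_AUDIO_SECONDS = 5 * 3600
MAX_PROMPT_CHARS = 20_000
MAX_PHRASE_ENTRIES = 2_000


@dataclass
class TranscriptionJob:
    """Hypothetical client-side description of one transcription request."""
    audio_seconds: float
    prompt: str = ""
    phrase_list: list = field(default_factory=list)


def check_limits(job):
    """Return a list of problems; an empty list means the job fits the limits."""
    problems = []
    if job.audio_seconds > MAX_AUDIO_SECONDS:
        problems.append("audio is longer than 5 hours")
    if len(job.prompt) > MAX_PROMPT_CHARS:
        problems.append("prompt is longer than 20,000 characters")
    if len(job.phrase_list) > MAX_PHRASE_ENTRIES:
        problems.append("phrase list has more than 2,000 entries")
    return problems


if __name__ == "__main__":
    job = TranscriptionJob(audio_seconds=90 * 60, phrase_list=["Foundry", "Voice Live"])
    print(check_limits(job) or "job fits the documented limits")
```

Checking locally avoids uploading a multi-hour file only to have the request rejected for an oversized prompt or phrase list.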
We're also upgrading text to speech (TTS) and TTS avatars in Azure Speech. With flexible instruction controls in the HD voices, upgraded recipes in Personal Voice, and new TTS avatar capabilities, customers can build voice agents that feel real, human-like, and personalized.
- Neural HD V3 (en-US Ava-Preview/Andrew-Preview/Serena-Preview) is now in public preview, delivering best-in-class quality with prompt-level instruction control. We also upgraded MAI-Voice from 1.0 to 2.0 in public preview, with support for 10+ languages. Personal Voice is upgraded with the OmniHD and MAI-Voice-2 models, optimized for conversational AI, creative applications, and long-form narration with emotion and style control.
- Avatar updates: Photo Avatar and Custom Photo Avatar are now generally available (demo). In addition, four new full-body standard avatars are now in public preview in the Foundry Voice Live and Text-to-Speech Avatar playground. Kobie Burrell, Director of Development at Optimal Blue, shares: "The photo avatar and speech service made it incredibly easy for our team to bring our Virtual Economist to life. The photo avatars in particular helped us create something that feels human and intuitive - giving our users the experience of engaging with an economist, not just an interface to a set of powerful models."
Azure Speech & Customization experience in Microsoft Foundry
Speech Playgrounds are now available for every Speech capability in Microsoft Foundry
Every Azure Speech capability now has a hands-on playground in one place. Developers can try and compare different models and prototype across speech-to-text, text-to-speech, avatars, Voice Live, and speech translation - no code required - and go from experimentation to production without ever leaving Foundry. Try it here.
Azure Speech is fine-tunable through the new Foundry experience
For the first time in Microsoft Foundry, custom speech, custom voice, and custom avatar let developers tailor models to their own domain vocabulary, brand identity, and visual presence, so their agents sound, understand, and look distinctly their own.
- Custom Speech: adapt speech-to-text to domain vocabulary and acoustic conditions.
- Custom Voice: train brand voices with Professional Voice, or use zero-shot cloning with Personal Voice, including the Omni and MAI-Voice-1/2 models.
- Custom Avatar: create high-quality avatars from video, or a quick avatar from a single image, using self-service photo avatar creation in the Foundry experience.
Get started today
The easiest way to explore is through the Microsoft Foundry portal and the Foundry Tools catalog. From there you can follow the documentation and Microsoft Learn courses, and start building with Azure Speech using the Azure Speech documentation.