Azure Speech at Build 2026: Powering Voice Agents with Real-Time and Life-like Experiences

Jun 03, 2026 - 17:46

Updated: 26 days ago

Voice is rapidly becoming the default interface for AI. At Build 2026, Azure Speech in Foundry Tools is making it much easier to ship production-grade voice experiences that feel real, responsive, and global. From agentic, low-latency Voice Live experiences built natively into Foundry Agent Service to a new generation of LLM-powered speech-to-text models and text-to-speech voices, every layer of the Speech stack is getting faster, more expressive, and more customizable. A new unified speech experience in the latest Foundry update brings playgrounds and self-service fine-tuning to every Speech capability, giving developers a clear path from prototype to production for real-time, multilingual, and agentic voice applications.

Build Real-Time Voice Agents with Voice Live and Foundry Agent Service

As developers move beyond traditional chatbots, they are building a new class of real-time agents that can listen, reason, take actions, and respond naturally in live conversations. From customer support agents and virtual assistants to healthcare intake, retail concierge, field operations, in-car assistants and multilingual employee support, voice agents are becoming a key interface for how people interact with AI.

At Build 2026, we’re announcing major updates that support building enterprise-ready voice agents at scale.

Voice Live for Foundry Prompt Agents is now generally available. This is a strong fit for developers who want enterprise-ready capabilities with minimal operational overhead. Voice Live brings the essential pieces of voice interaction into a single API: speech to text, text to speech, turn detection, interruption handling, avatars, and other conversational capabilities. Customers can combine real-time speech-to-speech interaction with managed agent orchestration, knowledge, memory, enterprise governance, observability, and scalable deployment, all within a single developer workflow.
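To make "turn detection" concrete, here is a minimal sketch of one common approach: energy-based end-of-turn detection over fixed-size audio frames. It is not how Voice Live implements turn detection, which this announcement does not describe. The function names, the energy threshold, and the silence frame count are all illustrative assumptions.

```python
from typing import Optional, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame (samples in [-1.0, 1.0])."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def detect_turn_end(
    frames: Sequence[Sequence[float]],
    energy_threshold: float = 0.01,   # assumed value, tune per microphone
    silence_frames_needed: int = 3,   # assumed value, trades latency for safety
) -> Optional[int]:
    """Return the index of the frame at which the speaker's turn is judged over.

    A turn ends once speech has been heard and is then followed by
    `silence_frames_needed` consecutive frames below `energy_threshold`.
    Returns None if no end of turn is found.
    """
    heard_speech = False
    silent_run = 0
    for index, frame in enumerate(frames):
        if frame_energy(frame) >= energy_threshold:
            # Speech (or renewed speech) resets the silence counter.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames_needed:
                return index
    return None
```

A lower `silence_frames_needed` makes the agent respond sooner but raises the chance of cutting in during a natural pause, which is the basic trade-off any turn detector has to manage.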

Hosted Agents with Voice Live is available in public preview. Some customers need full control over their agent’s runtime, orchestration framework, and execution model. For those scenarios, Microsoft Foundry supports Hosted Agents with Voice Live, so developers can build with the frameworks they prefer and deploy on managed infrastructure. Whether they use Microsoft Agent Framework, LangChain, or a custom orchestration stack, they can host those agents on Foundry Agent Service and connect them directly to Voice Live. Both Response API and Invocations Protocol are supported.

Hosted Agents also add support for real-time voice interfaces such as WebSocket and WebRTC, which lets developers deploy real-time voice workloads as managed containers while continuing to use frameworks such as Microsoft Voice Live, Pipecat, and LiveKit. These interfaces are bidirectional and full duplex, which makes them well suited to both cascaded pipelines and native speech-to-speech models. (Hosted Agent with Voice Live Demo)
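The value of a bidirectional, full-duplex interface is that the client can keep sending audio while agent audio is already arriving. The sketch below simulates that pattern with in-memory `asyncio` queues instead of a real WebSocket or WebRTC connection, so it makes no claims about the Voice Live or Hosted Agent wire protocol. `fake_agent`, `run_session`, and the `reply-to-` payloads are hypothetical stand-ins.

```python
import asyncio
from typing import List


async def send_audio(uplink: asyncio.Queue, chunks: List[bytes]) -> None:
    """Push microphone chunks upstream, then signal end of stream with None."""
    for chunk in chunks:
        await uplink.put(chunk)
        await asyncio.sleep(0)  # yield so the receiver can run between sends
    await uplink.put(None)


async def fake_agent(uplink: asyncio.Queue, downlink: asyncio.Queue) -> None:
    """Hypothetical stand-in for the remote agent: replies to each chunk as it arrives."""
    while True:
        chunk = await uplink.get()
        if chunk is None:
            await downlink.put(None)
            return
        await downlink.put(b"reply-to-" + chunk)


async def receive_audio(downlink: asyncio.Queue) -> List[bytes]:
    """Collect agent audio while the sender may still be running."""
    received: List[bytes] = []
    while True:
        chunk = await downlink.get()
        if chunk is None:
            return received
        received.append(chunk)


async def run_session(chunks: List[bytes]) -> List[bytes]:
    """Run send and receive concurrently over one simulated connection."""
    uplink: asyncio.Queue = asyncio.Queue()
    downlink: asyncio.Queue = asyncio.Queue()
    _, _, received = await asyncio.gather(
        send_audio(uplink, chunks),
        fake_agent(uplink, downlink),
        receive_audio(downlink),
    )
    return received
```

Calling `asyncio.run(run_session([b"a", b"b"]))` runs the sender and receiver as concurrent tasks. In a real client, the two queues would be replaced by the send and receive sides of a single connection.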

In addition, we are advancing the Voice Live API with the following enhancements that developers can integrate into their agents:

New all-in-one speech-to-speech models to help developers build highly responsive voice experiences. These include GPT-Realtime 1.5 and the new Azure-Realtime model (public preview), which delivers more natural voice output across multiple languages and accents, including en-US, zh-CN, es-ES, fr-FR, de-DE, hi-IN, and more. This is a strong option for customers prioritizing speed, simplicity, and natural conversational quality in multilingual voice experiences (learn more).

Integration with MAI Transcribe-1 (public preview) for more accurate multilingual speech input, Neural HD V3 voices for a more conversational and realistic voice experience, and four new full-body standard avatars (public preview) to make the voice agent more engaging. More details about these models can be found in the following sections.

Full integration with speech customization and fine-tuning capabilities in Foundry, including custom speech for better recognition accuracy, custom voice for a branded voice experience, and custom avatar for one-of-a-kind visual representations of the agent. More details about these features can be found in the following sections.

WebRTC (Web Real-Time Communication) connections in public preview, enabling low-latency, real-time voice interactions directly from web and mobile clients (learn more).

The call center voice agent accelerator, a solution template, now expands its telephony capabilities by integrating more third-party providers such as Twilio and Infobip, giving customers greater flexibility to connect with their preferred telephony infrastructure.

A new Voice Live Evaluation Harness gives developers a pipeline, deployable with one command, to score their voice agents on 13 Foundry evaluators (intent resolution, task adherence, task completion, response completeness, and more) using pre-recorded multi-turn audio in Push-to-Talk (PTT), Voice-Activity-Detection (VAD), or Foundry Agent mode (learn more).
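As an illustration of what working with multi-turn evaluator results can look like, the sketch below averages each evaluator's score across the turns of one conversation. The announcement does not describe the harness's output format, so the one-dict-per-turn layout, the snake_case evaluator keys, and the numeric score scale are all assumptions. `summarize_scores` is a hypothetical helper, not part of the harness.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def summarize_scores(turn_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each evaluator's score across all turns of a conversation.

    `turn_scores` holds one dict per turn, mapping evaluator name to score
    (an assumed layout). An evaluator missing from a turn is skipped for
    that turn rather than counted as zero.
    """
    collected: Dict[str, List[float]] = defaultdict(list)
    for scores in turn_scores:
        for evaluator, score in scores.items():
            collected[evaluator].append(score)
    return {evaluator: mean(values) for evaluator, values in collected.items()}
```

Skipping missing evaluators instead of treating them as zero keeps a turn where an evaluator did not apply from dragging down that evaluator's average.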

Next Generation Speech Models in Azure Speech

We're advancing Azure Speech to text with a new generation of LLM-powered recognition models that raise accuracy, expand language coverage, and give developers more control across both batch and real-time scenarios.

LLM Speech API is now generally available in Azure Speech for LLM-powered transcription and translation of audio files (learn more). It offers 25 languages and 90+ locales with locale hints, a renewed speech-LLM model with better context and entity recognition and reduced hallucination, long-form audio of up to 5 hours, prompt-tuning with 20,000-character input and 2,000 phrase-list entries, and broader regional availability. This model achieves industry-leading accuracy, ranking No. 1 across all models on the Open ASR Leaderboard. We also upgraded the MAI-Transcribe model from 1.0 to 1.5, adding phrase list support and a verbatim mode.
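The limits quoted above (20,000 prompt characters, 2,000 phrase-list entries, 5 hours of audio) are easy to check on the client before submitting a job. The sketch below is one way to do that. `check_request_limits` is a hypothetical helper, not part of any Azure SDK, and it assumes the service counts characters the way Python's `len` does, which may not hold exactly.

```python
from typing import List

# Limits quoted in the announcement for the LLM Speech API.
MAX_PROMPT_CHARS = 20_000
MAX_PHRASE_LIST_ENTRIES = 2_000
MAX_AUDIO_HOURS = 5.0


def check_request_limits(
    prompt: str, phrase_list: List[str], audio_seconds: float
) -> List[str]:
    """Return a list of human-readable limit violations (empty if none)."""
    problems: List[str] = []
    if len(prompt) > MAX_PROMPT_CHARS:
        problems.append(
            f"prompt has {len(prompt)} characters; limit is {MAX_PROMPT_CHARS}"
        )
    if len(phrase_list) > MAX_PHRASE_LIST_ENTRIES:
        problems.append(
            f"phrase list has {len(phrase_list)} entries; "
            f"limit is {MAX_PHRASE_LIST_ENTRIES}"
        )
    if audio_seconds > MAX_AUDIO_HOURS * 3600:
        problems.append(
            f"audio is {audio_seconds / 3600:.2f} hours; limit is {MAX_AUDIO_HOURS}"
        )
    return problems
```

Returning every violation at once, instead of raising on the first, lets a caller report all problems with a request in a single pass.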

We're also upgrading text to speech (TTS) and TTS avatars in Azure Speech. With flexible instruction controls in the HD voices, upgraded recipes in Personal Voice, and new TTS avatar capabilities, customers can build voice agents that feel real, human-like, and personalized.

Neural HD V3 (en-US Ava-Preview/Andrew-Preview/Serena-Preview) is now in public preview, delivering best-in-class quality with prompt-level instruction control. We have also upgraded MAI-Voice from 1.0 to 2.0 in public preview, with support for 10+ languages. Personal Voice is upgraded to the OmniHD and MAI-Voice-2 models, optimized for conversational AI, creative applications, and long-form narration with emotion and style control.

Avatar updates: Photo Avatar and Custom Photo Avatar are now generally available (demo). In addition, four new full-body standard avatars are now in public preview in the Foundry Voice Live and Text-to-Speech Avatar playgrounds. Kobie Burrell, Director of Development at Optimal Blue, shared: "The photo avatar and speech service made it incredibly easy for our team to bring our Virtual Economist to life. The photo avatars in particular helped us create something that feels human and intuitive - giving our users the experience of engaging with an economist, not just an interface to a set of powerful models."

Azure Speech & Customization experience in Microsoft Foundry

Speech Playgrounds are now available for every Speech capability in Microsoft Foundry

Every Azure Speech capability now has a hands-on playground in one place. Developers can try different models, compare them, and prototype across speech-to-text, text-to-speech, avatars, Voice Live, and speech translation with no code required, then go from experimentation to production without leaving Foundry.

Azure Speech is fine-tunable through the new Foundry experience

For the first time in Microsoft Foundry, custom speech, custom voice, and custom avatar let developers tailor models to their own domain vocabulary, brand identity, and visual presence, so their agents sound, understand, and look distinctly their own.

Custom Speech: adapt speech-to-text to domain vocabulary and acoustic conditions.

Custom Voice: train brand voices with Professional Voice, or use zero-shot cloning with Personal Voice, including the Omni and MAI-Voice-1/2 models.

Custom Avatar: create high-quality avatars from video, or a quick avatar from a single image. Photo avatar creation is self-service in the Foundry experience.

Get started today

The easiest way to explore is through the Microsoft Foundry portal and the Foundry Tools catalog. From there you can follow the documentation and Microsoft Learn courses, and start building with Azure Speech using the Azure Speech documentation.

Christopher Holloway

Christopher Holloway is the founder and director of Progressive Robot, a UK-based technology company. A full-stack engineer with more than two decades of experience, he works across PHP development, ecommerce, Linux infrastructure, technical SEO and AI automation, and writes here on technology, AI, hardware and software.
