News · April 12, 2026 · 4 min read

Microsoft Launches MAI Foundation Models for Text, Voice, and Video

Microsoft AI unveils three new foundation models — MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 — signaling its push to build a full multimodal AI stack.

By Agentic Daily · Source: TechCrunch

Our verdict

Verified

Factual accuracy confirmed through independent verification.


The Announcement

Microsoft AI has released three new foundation models covering speech-to-text transcription, voice generation, and video generation. The release is the company's most significant move yet to compete directly with rival AI labs on model development rather than relying solely on its OpenAI partnership.

Model Details

MAI-Transcribe-1: Speech-to-text in 25 languages with near-human accuracy
MAI-Voice-1: Natural-sounding audio generation with emotion control
MAI-Image-2: Video generation from text prompts with temporal consistency

Strategic Significance

This release signals Microsoft's intent to reduce its dependency on OpenAI for core AI capabilities while maintaining the partnership for ChatGPT and enterprise products. The models are available through Azure AI Services.

