Microsoft's new MAI models: what they do and where they fit in enterprise
Microsoft released MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 into Foundry public preview on April 2nd - here is what enterprise teams should actually know about them.
Microsoft released three new proprietary AI foundation models into Foundry public preview on April 2nd. MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 landed with some genuinely interesting benchmarks and a cost story worth examining.
This is also a strategic signal. Microsoft has been heavily dependent on OpenAI for its model stack. These releases are the clearest indication yet that Microsoft is building its own capability in parallel, giving enterprise customers more options without leaving the Foundry ecosystem.
I want to walk through what each model actually does, where I see it fitting in real enterprise workloads, and some practical examples of how to use them.
What Microsoft just released
Three models, three modalities:
- MAI-Transcribe-1 - speech-to-text across 25 languages, 3.9% word error rate on the FLEURS benchmark, outperforming GPT-Transcribe, Gemini 3.1 Flash and Whisper-large-v3. 50% lower GPU cost than leading alternatives. 2.5x faster batch throughput than Microsoft's existing Azure Fast offering. Pricing starts at $0.36 per hour.
- MAI-Voice-1 - text-to-speech that produces 60 seconds of expressive audio in under one second on a single GPU. Supports custom voice creation from just a few seconds of source audio. Pricing starts at $22 per million characters.
- MAI-Image-2 - text-to-image, debuting at number 3 on the Arena.ai leaderboard for image model families. 2x faster than the previous Microsoft image model with improvements in natural lighting, skin tone accuracy and in-image text rendering. Pricing starts at $5 per million text input tokens and $33 per million image output tokens.
All three are available now in public preview through Microsoft Foundry.
MAI-Transcribe-1: the one I am paying attention to
The word error rate numbers are the headline here. 3.9% average on FLEURS puts it ahead of the obvious competitors at a meaningfully lower cost point.
For enterprise, the relevant use cases are:
- Call centre transcription at scale - the cost and speed advantage over alternatives is real at high volumes
- Meeting transcription pipelines - especially if you are already in the Microsoft stack and want clean data residency
- Multilingual support scenarios - 25 languages covers most enterprise footprints
- Compliance and audit workflows - converting spoken interactions into structured records
Here is a basic example using the Azure AI Inference SDK:
```python
from azure.ai.inference import AudioTranscriptionClient
from azure.core.credentials import AzureKeyCredential
import os
from dotenv import load_dotenv

load_dotenv()

client = AudioTranscriptionClient(
    endpoint=os.getenv("AZURE_AI_FOUNDRY_ENDPOINT"),
    credential=AzureKeyCredential(os.getenv("AZURE_AI_FOUNDRY_KEY")),
)

with open("call_recording.wav", "rb") as audio_file:
    result = client.transcribe(
        model="mai-transcribe-1",
        audio=audio_file,
        language="en",
    )

print(result.text)
```

One honest caveat: benchmark accuracy and accuracy on your actual audio are different things. Real call centre recordings have background noise, regional accents and domain-specific vocabulary that benchmarks don't capture. Test it against your own audio before drawing conclusions.
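When you run that test, score it properly rather than eyeballing transcripts. Here is a minimal, dependency-free word error rate function for spot-checking model output against a human-verified reference - the `wer` helper and the sample strings are mine, purely for illustration (production evaluation suites typically also handle normalisation of punctuation and numerals):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[-1][-1] / len(ref)

# Compare the model transcript against a human-verified reference
reference = "please reset my vpn credentials for the london office"
hypothesis = "please reset my vpn credential for the london office"
print(f"WER: {wer(reference, hypothesis):.3f}")  # one substitution in nine words -> 0.111
```

Run this over a representative sample of your own recordings and compare the result to the 3.9% FLEURS figure before assuming the benchmark transfers.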
MAI-Voice-1: the custom voice story
The latency numbers on MAI-Voice-1 are impressive but the custom voice capability is what enterprise teams will actually care about.
Building a branded voice experience - for a customer service IVR, for accessibility tooling, for on-brand training content - used to require a significant project. The ability to create a custom voice from a few seconds of audio materially changes that.
Where I see it fitting:
- Customer-facing IVR systems that need a consistent, branded voice
- Accessibility tooling for reading back content to users
- Voiceover automation for internal training and onboarding materials
- Multilingual content expansion where you need the same voice identity across languages
The $22 per million characters pricing is competitive. A typical customer service interaction involves 500-800 characters of spoken content. At that rate you can process substantial volumes before cost becomes a concern.
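The arithmetic is worth making explicit. A quick back-of-envelope model, using the quoted $22 per million characters and an assumed interaction volume (the 250k figure is illustrative, not from the announcement):

```python
PRICE_PER_MILLION_CHARS = 22.00  # quoted MAI-Voice-1 preview pricing, USD

def voice_cost(chars: int) -> float:
    """Cost in USD for synthesising a given number of characters."""
    return chars / 1_000_000 * PRICE_PER_MILLION_CHARS

# A typical interaction at the upper end of the 500-800 character range
per_interaction = voice_cost(800)
monthly = voice_cost(800 * 250_000)  # assumed 250k interactions per month

print(f"Per interaction: ${per_interaction:.4f}")    # $0.0176
print(f"250k interactions/month: ${monthly:,.2f}")   # $4,400.00
```

Under two cents per interaction means the voice layer is unlikely to be the dominant cost in a customer service pipeline; the LLM and telephony costs around it usually are.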
MAI-Image-2: useful but less enterprise-critical
This is the one where I am most cautious about overstating the relevance.
Text-to-image is genuinely useful for certain enterprise scenarios but the Arena.ai leaderboard ranking measures creative quality, which may not be your primary criterion.
Where it earns its place:
- Marketing teams generating image variants at scale rather than commissioning individual assets
- Product teams creating UI mockups and concept visuals quickly
- Internal comms teams replacing stock image subscriptions
The $33 per million image output token pricing needs to be modelled carefully against your actual usage before committing to it.
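A sketch of that modelling, under one loud assumption: I do not know how many output tokens Foundry bills per generated image, so the 4,000-token figure below is a placeholder - replace it with the real token accounting from your own preview usage before budgeting:

```python
TEXT_IN_PER_M = 5.00     # USD per million text input tokens (quoted)
IMAGE_OUT_PER_M = 33.00  # USD per million image output tokens (quoted)

def image_cost(prompt_tokens: int, image_tokens: int) -> float:
    """Estimated USD cost for one generation request."""
    return (prompt_tokens / 1e6) * TEXT_IN_PER_M + (image_tokens / 1e6) * IMAGE_OUT_PER_M

# Assumption: ~4,000 output tokens per image - verify against real billing data
cost_per_image = image_cost(prompt_tokens=100, image_tokens=4_000)
print(f"Estimated cost per image: ${cost_per_image:.4f}")
```

If the real per-image token count is materially higher, the comparison against a stock image subscription changes quickly, which is exactly why the modelling matters.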
Harrier: the embedding model you should also know about
Alongside the MAI models, Microsoft open-sourced Harrier - a family of multilingual text embedding models that sets a new state of the art on the Multilingual MTEB v2 benchmark.
Harrier comes in three sizes: 270M, 0.6B, and 27B parameters. The 0.6B model scores 69.0 on Multilingual MTEB v2, putting it ahead of proprietary alternatives from OpenAI, Google and Amazon. MIT licence.
This matters for enterprise RAG. Good embeddings have a bigger impact on retrieval accuracy than most teams realise. A weaker embedding model degrades every answer your AI system generates downstream, regardless of how capable your LLM is.
Here is how to use Harrier in a retrieval pipeline:
from azure.ai.inference import EmbeddingsClient
from azure.core.credentials import AzureKeyCredential
import os
from dotenv import load_dotenv
load_dotenv()
client = EmbeddingsClient(
endpoint=os.getenv("AZURE_AI_FOUNDRY_ENDPOINT"),
credential=AzureKeyCredential(os.getenv("AZURE_AI_FOUNDRY_KEY")),
)
# Harrier is instruction-tuned - prepend a task description to customise retrieval behaviour
query_instruction = "Retrieve documents relevant to the following enterprise IT support question:"
query = f"{query_instruction} How do I reset my VPN credentials?"
response = client.embed(
model="harrier-oss-v1-0.6b",
input=[query],
)
query_embedding = response.data[0].embedding
print(f"Embedding dimensions: {len(query_embedding)}")The instruction-tuning is a practical advantage over standard embedding models. Rather than fine-tuning for your specific retrieval task, you can prepend a natural language instruction to shape the embedding behaviour without any training overhead.
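Once you have embeddings for both the query and your document chunks, retrieval is typically cosine similarity ranking. A self-contained sketch with dummy three-dimensional vectors standing in for real Harrier output (in practice both sides come from `client.embed()` over your corpus, and you would use a vector index rather than a linear scan):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Dummy embeddings standing in for Harrier output
query_embedding = [0.9, 0.1, 0.0]
documents = {
    "vpn_reset_guide": [0.8, 0.2, 0.1],
    "expense_policy": [0.1, 0.9, 0.3],
}

ranked = sorted(documents, key=lambda d: cosine(query_embedding, documents[d]), reverse=True)
print(ranked)  # the VPN guide ranks first
```

The instruction prefix changes which documents land at the top of this ranking for the same underlying query text, which is the whole point of instruction-tuned embeddings.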
The strategic picture
These releases are not just product additions. Microsoft is reducing its dependency on OpenAI.
The Microsoft-OpenAI partnership has defined the enterprise AI market for three years. Building proprietary capability across speech, voice, image and embeddings is a clear hedge, and enterprise customers benefit from having more native Microsoft options within Foundry without needing to reach for a third-party model.
The honest question is whether these models are production-ready right now. They are in public preview, which means the API surface and pricing could still change. I would use them for prototyping and benchmarking against existing workflows - but I would not rely on them as the sole solution in a production pipeline yet.
What I am testing right now
The models I am actively benchmarking in client work:
- MAI-Transcribe-1 against Azure Speech on real call centre audio - the benchmark numbers need to hold up against specific audio quality, accents and domain vocabulary before I recommend a switch
- Harrier embeddings in a multilingual RAG scenario where English-first models have been underperforming on non-English queries
The image and voice models are interesting but not in my immediate priority queue for enterprise deployments.
If you are building speech pipelines or RAG applications right now, pull these into your evaluation. The cost and accuracy numbers are genuinely competitive. Just validate them against your actual workload before making any commitments.