Running Phi-4 for enterprise workloads

Small language models that punch above their weight


There is a tendency in enterprise AI projects to reach for the biggest model available. GPT-4o is excellent, but it comes at a cost, and for many tasks you simply don't need that level of capability. Microsoft's Phi-4 is a 14-billion-parameter model that performs surprisingly well on reasoning and structured tasks. I have been testing it in several scenarios and want to share where it makes sense to use it.

You can deploy Phi-4 directly from the Azure AI Foundry model catalog. Once deployed, you call it the same way you would any other Azure OpenAI model:

from openai import AzureOpenAI
import os
from dotenv import load_dotenv
 
load_dotenv()
 
client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-12-01-preview",
)
 
response = client.chat.completions.create(
    model="phi-4",  # your deployment name
    messages=[
        {
            "role": "system",
            "content": "You are a document classification assistant. Classify the document into one of the following categories: Invoice, Contract, Report, Email, Other."
        },
        {
            "role": "user",
            "content": "Dear team, please find attached the Q2 financial results along with a breakdown by region and product line. Regards, Finance team."
        }
    ],
    max_tokens=50,
)
 
print(response.choices[0].message.content)

Where Phi-4 works well

I ran a comparison between Phi-4 and GPT-4o on a set of 500 document classification tasks. Accuracy was close between the two models, but the cost difference was significant.

  1. Document classification - Phi-4 matched GPT-4o accuracy on structured classification tasks
  2. Data extraction from structured text - performed well when the format was consistent
  3. Summarisation of short documents - produced clean, concise summaries
  4. Intent detection - reliable for routing queries to the right handler
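One practical detail with classification prompts like the one above: the model's reply can vary slightly in formatting ("Email", "Category: Email.", "This is an email"), so it is worth normalizing the output before routing on it. A minimal sketch, assuming the five labels from the system prompt; the priority order and fallback are my own choices, not anything Phi-4 guarantees:

```python
# Check the specific labels first; fall back to "Other" for anything unexpected.
PRIORITY = ("Invoice", "Contract", "Report", "Email")

def normalize_label(raw: str) -> str:
    """Map a free-form model reply onto one of the allowed category labels."""
    cleaned = raw.strip().strip(".").lower()
    for label in PRIORITY:
        if label.lower() in cleaned:
            return label
    return "Other"  # never propagate an unexpected string downstream
```

Falling back to "Other" rather than raising keeps the pipeline moving; you can log the raw reply separately if you want to audit how often normalization kicks in.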

Where you still want GPT-4o

  1. Complex reasoning across long documents - Phi-4 starts to struggle with very long contexts
  2. Code generation for complex logic - GPT-4o is noticeably better here
  3. Tasks requiring broad world knowledge - the smaller training set shows in edge cases
  4. Multi-step agent workflows with tools - larger models handle tool calling more reliably

The cost argument

If you are running a high-volume pipeline that classifies or extracts from thousands of documents a day, the savings from using Phi-4 for the simpler steps add up quickly. My approach is to use Phi-4 as the first pass for tasks with clear, structured outputs, and to escalate to GPT-4o only when the task is genuinely complex or confidence is low.
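That escalation pattern can be sketched as a small routing function. Here the two models are injected as callables returning a (label, confidence) pair; in practice they would wrap the Phi-4 and GPT-4o API calls, and the 0.8 threshold is an illustrative placeholder you would tune on your own data:

```python
def classify_with_escalation(doc, cheap_model, strong_model, threshold=0.8):
    """Run the cheap model first; escalate only when its confidence is low.

    cheap_model and strong_model are callables taking a document and
    returning (label, confidence). Returns (label, model_used).
    """
    label, confidence = cheap_model(doc)
    if confidence >= threshold:
        return label, "phi-4"
    # Low confidence: pay for the stronger model on this document only.
    label, _ = strong_model(doc)
    return label, "gpt-4o"
```

Keeping the routing logic separate from the API calls also makes it easy to unit-test the threshold behaviour without touching the network.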

You can also run Phi-4 locally using Ollama if you want to test without any cloud costs:

ollama pull phi4
ollama run phi4

This is useful for development and testing before you move to a cloud deployment.
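Ollama also exposes an OpenAI-compatible endpoint on localhost:11434, so local testing can reuse the same chat-completions request shape as the Azure code above. A stdlib-only sketch; the helper is only defined here and assumes `ollama run phi4` is active before you call it:

```python
import json
from urllib import request

# Ollama's OpenAI-compatible chat endpoint (default local port).
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(prompt, model="phi4", max_tokens=50):
    """Assemble an OpenAI-style chat payload for the local model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask_local_phi4(prompt):
    """Send the prompt to the locally running Ollama server, return the reply text."""
    req = request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the request shape matches the cloud deployment, prompts you tune locally carry over to Azure with only the endpoint and credentials changing.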

The bottom line is that the right model for the job is not always the biggest one. Start with what you need and scale up only when necessary.