Running Phi-4 for enterprise workloads
Small language models that punch above their weight
There is a tendency in enterprise AI projects to reach for the biggest model available. GPT-4o is great, but it comes at a cost, and for many tasks you simply don't need that level of capability. Microsoft's Phi-4 is a 14 billion parameter model that performs surprisingly well on reasoning and structured tasks. I have been testing it on several scenarios and want to share where it makes sense to use it.
You can deploy Phi-4 directly from the Azure AI Foundry model catalog. Once deployed, you call it the same way you would any other Azure OpenAI model:
from openai import AzureOpenAI
import os
from dotenv import load_dotenv

load_dotenv()

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-12-01-preview",
)

response = client.chat.completions.create(
    model="phi-4",  # your deployment name
    messages=[
        {
            "role": "system",
            "content": "You are a document classification assistant. Classify the document into one of the following categories: Invoice, Contract, Report, Email, Other."
        },
        {
            "role": "user",
            "content": "Dear team, please find attached the Q2 financial results along with a breakdown by region and product line. Regards, Finance team."
        }
    ],
    max_tokens=50,
)
print(response.choices[0].message.content)

Where Phi-4 works well
I ran a comparison between Phi-4 and GPT-4o on a set of 500 document classification tasks. The results were quite close on accuracy, but the cost difference was significant.
- Document classification - Phi-4 matched GPT-4o accuracy on structured classification tasks
- Data extraction from structured text - performed well when the format was consistent
- Summarisation of short documents - produced clean, concise summaries
- Intent detection - reliable for routing queries to the right handler
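For the extraction tasks above, I found it helps to ask the model for JSON and then parse the reply defensively, since models sometimes wrap JSON in markdown fences. A minimal sketch (the helper name and sample reply are my own, not from any SDK):

```python
import json

def parse_model_json(raw: str) -> dict:
    """Parse a JSON object from a model reply, tolerating a markdown code fence."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (e.g. ```json) and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else ""
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    return json.loads(text)

# Typical fenced reply from a chat model
raw = '```json\n{"vendor": "Acme Corp", "total": "1,250.00"}\n```'
print(parse_model_json(raw)["vendor"])  # Acme Corp
```

Keeping the parsing tolerant like this makes the pipeline much less brittle when the model's formatting drifts.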
Where you still want GPT-4o
- Complex reasoning across long documents - Phi-4 starts to struggle with very long contexts
- Code generation for complex logic - GPT-4o is noticeably better here
- Tasks requiring broad world knowledge - the smaller training set shows in edge cases
- Multi-step agent workflows with tools - larger models handle tool calling more reliably
The cost argument
If you are running a high-volume pipeline that classifies or extracts from thousands of documents a day, the cost savings from using Phi-4 for the simpler steps add up quickly. My approach is to use Phi-4 as the first pass for tasks with clear, structured outputs, and to escalate to GPT-4o only when the task is genuinely complex or the confidence is low.
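The first-pass-then-escalate idea can be sketched as a small routing function. The wrappers `call_phi4` and `call_gpt4o` are placeholders for your own model calls (assumed to return a label plus a self-reported confidence score), and the threshold is something you would tune on your own data:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed value; tune against a labelled sample

def route(task, call_phi4, call_gpt4o, threshold=CONFIDENCE_THRESHOLD):
    """Run the cheap model first; escalate to the larger model only
    when the cheap model's confidence falls below the threshold."""
    label, confidence = call_phi4(task)
    if confidence >= threshold:
        return label, "phi-4"
    label, _ = call_gpt4o(task)
    return label, "gpt-4o"

# Example with stubbed model calls
label, model = route(
    "Q2 results email",
    call_phi4=lambda t: ("Email", 0.95),
    call_gpt4o=lambda t: ("Email", 0.99),
)
print(label, model)  # Email phi-4
```

In practice you would also log which tier handled each task, so you can check how often escalation fires and whether the threshold is earning its keep.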
You can also run Phi-4 locally using Ollama if you want to test without any cloud costs:
ollama pull phi4
ollama run phi4

This is useful for development and testing before you move to a cloud deployment.
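You can drive the local model programmatically too. This sketch assumes Ollama's default REST endpoint (`/api/chat` on port 11434); the helper names are my own:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default endpoint

def build_payload(text: str) -> dict:
    """Build the same classification request as the Azure example,
    but targeting the local phi4 model."""
    return {
        "model": "phi4",
        "stream": False,
        "messages": [
            {"role": "system",
             "content": "Classify the document as Invoice, Contract, Report, Email, or Other."},
            {"role": "user", "content": text},
        ],
    }

def classify_locally(text: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# classify_locally("Dear team, please find attached the Q2 results.")
# (requires a running `ollama serve` with phi4 pulled)
```

Because the request shape mirrors the chat-completions style, switching between local testing and the Azure deployment is mostly a matter of swapping the transport.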
The bottom line is that the right model for the job is not always the biggest one. Start with what you need and scale up only when necessary.