How to evaluate your AI application output
Stop guessing whether your AI is performing well
One of the biggest gaps I see in enterprise AI projects is the lack of evaluation. Teams spend weeks building the application, do some manual testing, then push it to production and hope for the best. A few months later, users are complaining that the responses are not accurate or relevant, and no one has any data to understand why.
Evaluation is not a one-time thing you do before launch. It is something you need to build into your development process and keep running as your data and usage patterns change.
Azure AI Foundry has built-in evaluation capabilities that I have been using on recent projects. You can run evaluations directly from the SDK which makes it easy to integrate into your CI pipeline.
Install the evaluation package:
pip install azure-ai-evaluation python-dotenv
The simplest evaluation to start with is relevance: is the response actually relevant to what the user asked?
from azure.ai.evaluation import RelevanceEvaluator, AzureOpenAIModelConfiguration
from dotenv import load_dotenv
import os
load_dotenv()
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-12-01-preview",
    azure_deployment="gpt-4o",
)
evaluator = RelevanceEvaluator(model_config=model_config)
result = evaluator(
    query="What is the refund policy for software licenses?",
    response="Software licenses purchased through the enterprise portal can be refunded within 30 days of purchase provided the license has not been activated. Contact procurement to initiate a refund.",
    context="Company policy document: Software licenses are non-refundable after activation. Unactivated licenses may be returned within 30 days.",
)
print(f"Relevance score: {result['relevance']}")
print(f"Explanation: {result['relevance_reason']}")
You can also run groundedness checks to verify the response is based on the context provided and not hallucinated:
from azure.ai.evaluation import GroundednessEvaluator
groundedness_evaluator = GroundednessEvaluator(model_config=model_config)
result = groundedness_evaluator(
    query="What is the refund policy?",
    response="Refunds are available for up to 90 days after purchase.",
    context="Software licenses may be returned within 30 days of purchase if unactivated.",
)
print(f"Groundedness score: {result['groundedness']}")
Running batch evaluations
For a proper evaluation run you want to test against a dataset of questions and expected answers, not just individual examples. Here is how you run a batch evaluation:
from azure.ai.evaluation import evaluate, RelevanceEvaluator, GroundednessEvaluator
import json
# Your test dataset - question, context, and the response from your AI app
test_data = [
    {
        "query": "What is the laptop replacement policy?",
        "context": "Laptops are replaced every 3 years or when they fail. Submit a request via the IT portal.",
        "response": "Laptops are replaced on a 3 year cycle. You can request a replacement through the IT portal.",
    },
    {
        "query": "How many days of annual leave do I get?",
        "context": "Full-time employees receive 25 days of annual leave per year.",
        "response": "You are entitled to 25 days of annual leave each year as a full-time employee.",
    },
]
with open("test_data.jsonl", "w") as f:
    for item in test_data:
        f.write(json.dumps(item) + "\n")
results = evaluate(
    data="test_data.jsonl",
    evaluators={
        "relevance": RelevanceEvaluator(model_config=model_config),
        "groundedness": GroundednessEvaluator(model_config=model_config),
    },
    output_path="evaluation_results.json",
)
print(f"Average relevance: {results['metrics']['relevance.relevance']}")
print(f"Average groundedness: {results['metrics']['groundedness.groundedness']}")
What to do with the results
The scores give you a baseline. Run your evaluation dataset before any change to your system prompt, retrieval pipeline or model version. If the scores drop after a change, you know something went wrong.
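To make a regression fail loudly in CI, you can wrap that baseline check in a small gate script. Here is a minimal sketch, assuming the saved results file contains the same "metrics" dictionary as the in-memory results, with threshold values you would tune against your own baseline runs:

```python
import json

# Minimum acceptable averages - these numbers are placeholders; tune them
# against your own baseline. Metric names match the evaluators used above.
THRESHOLDS = {
    "relevance.relevance": 4.0,
    "groundedness.groundedness": 4.0,
}

def check_scores(path: str) -> bool:
    """Return True only if every tracked metric meets its threshold."""
    with open(path) as f:
        metrics = json.load(f).get("metrics", {})
    ok = True
    for name, minimum in THRESHOLDS.items():
        score = metrics.get(name)
        if score is None or score < minimum:
            print(f"FAIL {name}: {score} (minimum {minimum})")
            ok = False
        else:
            print(f"PASS {name}: {score}")
    return ok
```

In your pipeline, call check_scores("evaluation_results.json") after the evaluation run and exit non-zero when it returns False, so a drop in scores blocks the change from merging.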
I recommend building a small evaluation dataset of at least 50 realistic questions for your use case. Include some edge cases and questions that are out of scope so you can also test that the system handles those gracefully.
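Out-of-scope and edge cases fit in the same JSONL format as the rest of the dataset. For these, the "good" response is one that declines rather than guesses, so a high groundedness score should reflect the refusal staying within the context. A sketch with made-up entries (the questions and responses below are hypothetical examples, not from a real dataset):

```python
import json

# Hypothetical edge-case and out-of-scope entries for the evaluation set.
# For out-of-scope questions, the desired response declines rather than guesses.
extra_cases = [
    {
        "query": "What is the refund policy for hardware?",  # near-miss of a covered topic
        "context": "Software licenses may be returned within 30 days of purchase if unactivated.",
        "response": "The policy I have access to covers software licenses only; I don't have information about hardware refunds.",
    },
    {
        "query": "What's the weather in Sydney today?",  # clearly out of scope
        "context": "Company policy document: Software licenses are non-refundable after activation.",
        "response": "That's outside what I can help with. I can only answer questions about company policies.",
    },
]

# Append to the same file the batch evaluation reads.
with open("test_data.jsonl", "a") as f:
    for item in extra_cases:
        f.write(json.dumps(item) + "\n")
```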
It's not glamorous work, but it is what separates a reliable AI application from one that embarrasses you in front of a client.