
Evaluating AI Responses

Learn how to evaluate the performance of your prompts and the quality of AI responses. Understand the criteria for a good response and how to test your prompts systematically.

How Good is Your Prompt?

Welcome back. You can write detailed, structured prompts. But how do you know if they are actually any good? The only way to know is to test them and evaluate the results.

Evaluation is the process of judging the quality of the AI’s output. It’s a critical skill for any serious prompt engineer, especially in a professional setting where quality and reliability are essential.

Criteria for a Good Response

When you look at an AI’s answer, you should judge it against a set of criteria. Here are some of the most important ones (a sketch of a simple scoring rubric follows the list):

Key Evaluation Criteria

  • Accuracy / Factual Correctness: Is the information correct? (This is where you must fact-check!)
  • Completeness: Did the AI answer all parts of your prompt?
  • Relevance: Is the answer actually relevant to your question? Or did the AI misunderstand and talk about something else?
  • Clarity and Readability: Is the response easy to understand? Is it well-written?
  • Adherence to Constraints: Did the AI follow all of your instructions? (e.g., output format, length, tone, persona).
  • Lack of Bias: Is the response free from harmful stereotypes or unfair assumptions?
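These criteria can be turned into a lightweight scoring rubric. The Python sketch below is one illustrative way to do it: the `RUBRIC` dictionary, the 1–5 scale, and the `score_response` helper are assumptions for this example, not part of any standard tooling.

```python
# A minimal scoring rubric, assuming a 1-5 rating per criterion assigned by a human reviewer.
RUBRIC = {
    "accuracy": "Is the information factually correct?",
    "completeness": "Were all parts of the prompt answered?",
    "relevance": "Does the answer address the actual question?",
    "clarity": "Is the response easy to read and well written?",
    "constraints": "Were format, length, tone, and persona respected?",
    "bias": "Is the response free of stereotypes or unfair assumptions?",
}

def score_response(ratings: dict) -> float:
    """Average the 1-5 ratings given for each criterion of a single response."""
    missing = set(RUBRIC) - set(ratings)
    if missing:
        raise ValueError(f"Missing ratings for: {', '.join(sorted(missing))}")
    return sum(ratings.values()) / len(ratings)

# Example: one reviewer's ratings for one response (average works out to about 4.67).
print(score_response({
    "accuracy": 5, "completeness": 4, "relevance": 5,
    "clarity": 4, "constraints": 5, "bias": 5,
}))
```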

How to Test Your Prompts

In a professional environment, you can’t just “eyeball” it. You need a more structured approach to testing, especially when you are comparing two different versions of a prompt to see which is better.

1. Create an Evaluation Set

Come up with a set of test cases. For example, if you have a prompt that is supposed to summarize customer reviews, gather 10-20 different reviews (positive, negative, neutral, long, short) to test it on. This is your “evaluation set.”
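As a concrete illustration, here is what such an evaluation set might look like in code. This is only a sketch: the `evaluation_set` structure, field names, and the placeholder reviews are invented for this example.

```python
# A sketch of an evaluation set for a review-summarization prompt.
# The reviews are invented placeholders; in practice you would collect
# 10-20 real (anonymized) examples covering the variety you expect in production.
evaluation_set = [
    {"id": 1, "text": "Great product, arrived early and works perfectly.", "kind": "positive"},
    {"id": 2, "text": "The item broke after two days and support never replied.", "kind": "negative"},
    {"id": 3, "text": "It does the job. Nothing special.", "kind": "neutral"},
    # ... more cases: long reviews, short reviews, mixed sentiment, edge cases
]
```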

2. Define Your “Golden Answer”

For each test case, write down what a perfect answer would look like. You don’t have to write the whole thing, but you should define the key points that must be included. This is your “golden answer” or reference standard.
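Continuing the sketch above, one way to record golden answers is to attach the required key points to each test case. The `golden_answer` field and its shape are illustrative assumptions, not a fixed format.

```python
# Extending the earlier sketch: each test case carries a "golden_answer" listing
# the key points that a correct output must include.
evaluation_set = [
    {
        "id": 1,
        "text": "The item broke after two days and support never replied.",
        "golden_answer": {"key_points": ["broke after two days", "support never replied"]},
    },
    {
        "id": 2,
        "text": "Great product, arrived early and works perfectly.",
        "golden_answer": {"key_points": ["arrived early", "works perfectly"]},
    },
]
```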

3. Run the Prompts and Compare

Run your prompt (or different versions of your prompt) on all the test cases in your evaluation set. Then, compare the AI’s output to your “golden answer” for each one, scoring it against your criteria (Accuracy, Completeness, etc.).
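Here is a minimal sketch of that loop. `call_model` is a hypothetical placeholder for whichever AI API you actually use, and `matches_golden` uses a crude keyword check rather than a real semantic comparison; both are assumptions made only for illustration.

```python
def call_model(prompt: str, text: str) -> str:
    """Hypothetical placeholder: send the prompt and the input text to your model and return its reply."""
    raise NotImplementedError("Replace with a call to your AI provider's API.")

def matches_golden(output: str, key_points: list) -> bool:
    """Crude check: does the output mention every required key point?"""
    return all(point.lower() in output.lower() for point in key_points)

def run_eval(prompt: str, evaluation_set: list) -> float:
    """Return the fraction of test cases whose output covered all golden key points."""
    hits = 0
    for case in evaluation_set:
        output = call_model(prompt, case["text"])
        if matches_golden(output, case["golden_answer"]["key_points"]):
            hits += 1
    return hits / len(evaluation_set)
```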

Example: Testing a Summarization Prompt

Goal: Create a prompt that extracts the main complaint from a negative customer email.

  1. Evaluation Set: Gather 15 real (anonymized) negative customer emails.
  2. Golden Answer Definition: For each email, a human writes down the main complaint in one sentence. (e.g., “The customer is complaining about a late delivery.”)
  3. Testing:
    • Prompt A: Summarize this email.
    • Prompt B: You are a customer service agent. Read the following email and identify the customer's primary complaint in a single sentence.
  4. Evaluation: Run both prompts on all 15 emails. Count how many times each prompt’s output matched the golden answer; the prompt with the higher score is the better one (see the sketch after this list).
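A sketch of that comparison, reusing `run_eval` and `evaluation_set` from the earlier sketches (here the evaluation set would hold the 15 anonymized emails, each with its one-sentence golden answer):

```python
PROMPT_A = "Summarize this email."
PROMPT_B = (
    "You are a customer service agent. Read the following email and identify "
    "the customer's primary complaint in a single sentence."
)

# Scores are the fraction of the 15 emails where the output matched the golden answer.
score_a = run_eval(PROMPT_A, evaluation_set)
score_b = run_eval(PROMPT_B, evaluation_set)

print(f"Prompt A matched the golden answer on {score_a:.0%} of emails")
print(f"Prompt B matched the golden answer on {score_b:.0%} of emails")
```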

Why This Matters

This systematic approach is crucial for businesses and developers building applications on top of AI. It allows you to prove that your prompts are reliable and to track improvements over time. It moves prompt engineering from a casual art to a rigorous science.

Key Takeaways from Lecture 18

  • Don’t just assume your prompts are good; test them.
  • Evaluate AI responses based on clear criteria like accuracy, completeness, relevance, and clarity.
  • For systematic testing, create an evaluation set of diverse examples.
  • Define a “golden answer” for each test case to have a benchmark for comparison.
  • This structured process is what separates professional prompt engineering from casual use.

End of Lecture 18. You now know how to measure the quality of your work. In our next lecture, we’ll clarify the difference between prompt engineering and the more complex process of model fine-tuning.

Najeeb Alam

Technical writer specializing in developer content, blogging, and online journalism. I have been working in this field for the last 20 years.