Introduction

Traditional software tests usually run a known input and compare the result with a predictable output. LLM applications are different because the output is generated, can vary between runs, and is sometimes correct in more than one form.

That does not mean LLM apps cannot be tested. It means the test suite needs several layers: deterministic checks where possible, model-graded evaluations where judgment is required, hallucination checks against known context, and CI automation so regressions are caught before release.

The course uses a quiz generator app as the running example. The app builds quiz questions from a small question bank. The tests then check whether the output stays in scope, follows the expected format, refuses unsupported topics, and avoids facts outside the source data.

1) Core idea

The practical goal is to turn LLM behavior into a repeatable release signal.

The course builds this progressively:

  1. create a small LLM app,
  2. add deterministic evals,
  3. add model-graded evals,
  4. add hallucination checks,
  5. run the evals through CircleCI,
  6. preserve eval reports as artifacts for human review.

This is the right mental model: start with cheap tests, then add more judgment-heavy tests only where they are useful.

2) Key concepts

CI for LLM applications

Continuous integration means small changes are built and tested automatically. For LLMOps, that includes code changes, prompt changes, data changes, and evaluation changes.

The notes emphasize that LLM output can change in subtle ways. A prompt edit can preserve syntax but change safety behavior. A data update can make the model answer with unsupported information. CI gives feedback while the change is still fresh.

What to evaluate

The course calls out these evaluation targets:

  • context adherence,
  • context relevance,
  • correctness,
  • bias and toxicity,
  • refusal behavior,
  • output format,
  • hallucination against known facts.

Not every app needs every eval. The evaluation set should match the application’s risk surface.

When to evaluate

Useful checkpoints include:

  • after every change,
  • before deployment,
  • after deployment when business needs require it,
  • before cutting a release branch or merging to production.

For LLM applications, pre-release evals are especially important because they catch behavior that unit tests often miss.

3) Practical workflow

Build a small app around a fixed knowledge bank

The sample app uses a quiz bank with subjects such as Leonardo DaVinci, Paris, Telescopes, Starry Night, and Physics. The prompt instructs the assistant to generate three quiz questions for categories such as Geography, Science, and Art.

The app is built with LangChain:

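# Import paths match the LangChain version used in the course;
# newer releases move ChatOpenAI to the langchain_openai package.
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser

def assistant_chain(
    system_message,
    human_template="{question}",
    llm=None,
    output_parser=None,
):
    # Create the defaults at call time, not at import time, so that
    # importing this module does not construct an OpenAI client.
    if llm is None:
        llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    if output_parser is None:
        output_parser = StrOutputParser()

    # The system message carries the quiz bank; the human slot takes the request.
    chat_prompt = ChatPromptTemplate.from_messages([
        ("system", system_message),
        ("human", human_template),
    ])
    # Compose prompt, model, and parser into a single runnable chain.
    return chat_prompt | llm | output_parser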

The important design choice is that the app has a known source of truth. Without that, many of the evals would have nothing reliable to compare against.
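A minimal sketch of that source of truth, assuming the bank is kept as a plain string and interpolated into the system message. The abbreviated bank entries, the decline sentence, and the build_system_message helper are illustrative placeholders, not the course's exact text.

```python
# Hypothetical, abbreviated quiz bank; the real bank has more subjects and facts.
QUIZ_BANK = """1. Subject: Leonardo DaVinci
   Categories: Art, Science
   Facts:
    - Painted the Mona Lisa
    - Designed a flying machine

2. Subject: Paris
   Categories: Art, Geography
   Facts:
    - Capital of France
    - Location of the Louvre"""

# Hypothetical decline sentence for requests the bank cannot support.
DECLINE_RESPONSE = "I'm sorry, I do not have information about that topic."


def build_system_message(quiz_bank: str, decline_response: str = DECLINE_RESPONSE) -> str:
    """Embed the quiz bank in the system message so it is the only fact source."""
    return (
        "Write a quiz of three questions for the category the user asks about.\n"
        "Use only facts from the quiz bank below.\n"
        f"If no subject matches the category, reply exactly: {decline_response}\n"
        "Format each question as:\n"
        "Question N:#### <question N>?\n\n"
        f"Quiz bank:\n{quiz_bank}"
    )


system_message = build_system_message(QUIZ_BANK)
print(system_message)
```

One thing to watch if the result is passed to ChatPromptTemplate: curly braces in a message are treated as template variables, so the bank text should not contain any.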

Add deterministic evals first

The first eval checks whether at least one of the expected words appears in the answer.

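def eval_expected_words(system_message, question, expected_words):
    # Build the assistant and generate a quiz for the question.
    assistant = assistant_chain(system_message)
    answer = assistant.invoke({"question": question})

    # Pass if at least one expected word appears (case-insensitive).
    answer_lower = answer.lower()
    assert any(word.lower() in answer_lower for word in expected_words), (
        f"Expected the assistant questions to include {expected_words}"
    )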

This is simple, but useful. If a Science quiz never mentions any expected science subject, the app probably drifted.
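The matching logic can also be exercised without calling a model. The sketch below separates the word check into a pure function and runs it on canned answers; contains_expected_word and the sample answers are hypothetical names and data used only for illustration.

```python
def contains_expected_word(answer: str, expected_words: list[str]) -> bool:
    """Return True if at least one expected word occurs in the answer (case-insensitive)."""
    text = answer.lower()
    return any(word.lower() in text for word in expected_words)


# Canned answers stand in for real model output.
on_topic = "Question 1:#### What is a telescope used for?"
drifted = "Question 1:#### What is the capital of France?"
science_words = ["davinci", "telescope", "physics", "curie"]

print(contains_expected_word(on_topic, science_words))  # True
print(contains_expected_word(drifted, science_words))   # False
```

Keeping the comparison separate from the model call makes the eval logic itself testable, which helps when a failure could come from either the model or the check.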

A refusal eval checks unsupported topics:

def evaluate_refusal(system_message, question, decline_response):
    assistant = assistant_chain(system_message)
    answer = assistant.invoke({"question": question})

    assert decline_response.lower() in answer.lower(), (
        f"Expected refusal containing {decline_response}, got {answer}"
    )

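These checks are brittle if used alone, because a paraphrased refusal or a synonym for an expected word fails the substring match. They are still worth having because they are cheap and fast.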
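The brittleness is easy to see on canned text. This sketch assumes a hypothetical is_refusal helper that accepts a list of decline phrases; the phrases and sample answers are made up for illustration.

```python
def is_refusal(answer: str, decline_phrases: list[str]) -> bool:
    """Return True if the answer contains any accepted decline phrase."""
    text = answer.lower()
    return any(phrase.lower() in text for phrase in decline_phrases)


# Canned answers stand in for real model output.
exact = "I'm sorry, I do not have information about that topic."
paraphrase = "Unfortunately, I can't help with that topic."

print(is_refusal(exact, ["I'm sorry"]))                     # True
print(is_refusal(paraphrase, ["I'm sorry"]))                # False
print(is_refusal(paraphrase, ["I'm sorry", "can't help"]))  # True
```

Widening the phrase list reduces false failures, but it never covers every paraphrase, which is one reason to add model-graded checks later.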

Add model-graded format evals

The next layer asks another LLM to judge whether the output looks like a quiz in the expected format.

eval_system_prompt = """You are an assistant that evaluates whether
or not an assistant is producing valid quizzes. The assistant should
produce output in the format Question N:#### <question N>?"""

The evaluator is asked to output Y for valid quiz format and N for invalid format. This is useful when exact string matching is too strict, but the expected structure is still clear.

The key rule from the course is that this evaluator checks format only. It does not decide whether the facts are correct. Keeping evaluator scope narrow makes the result easier to interpret.
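A grader that is told to answer Y or N can still return something else, so the verdict is worth parsing strictly. The sketch below assumes a hypothetical parse_verdict helper; treating any other reply as an error is a design choice of this sketch, not something the course prescribes.

```python
def parse_verdict(grader_response: str) -> bool:
    """Map the grader's reply to pass/fail; anything other than Y or N is an error."""
    verdict = grader_response.strip().upper()
    if verdict == "Y":
        return True
    if verdict == "N":
        return False
    # An off-format reply is a problem with the grader, not a result about the quiz.
    raise ValueError(f"Unexpected grader output: {grader_response!r}")


print(parse_verdict(" y\n"))  # True
print(parse_verdict("N"))     # False
```

Raising on unexpected output keeps a chatty grader from being silently counted as a pass or a fail.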

Add hallucination checks against the question bank

The course then builds a model-graded eval for factual grounding. The evaluator compares the generated quiz against the quiz bank and fails if facts appear outside the source.

The decision prompt is structured as a checklist:

  1. review the question bank,
  2. compare quiz facts to the question bank,
  3. ignore grammar or punctuation,
  4. fail if a fact is not in the question bank.

This is a more meaningful eval than “does the answer look good?” because it connects output quality to a defined source of truth.
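The checklist above can be written down as a pair of prompt templates. This is a sketch under assumptions: the wording, the Y/N verdict, and the render_grounding_prompt helper are illustrative, while the {context} and {agent_response} placeholders match the keys the evaluator is invoked with later in this article.

```python
GROUNDING_SYSTEM_TEMPLATE = """You are evaluating a quiz against a question bank.

Question bank:
{context}

Follow these steps:
1. Review the question bank.
2. Compare the facts in the quiz to the question bank.
3. Ignore differences in grammar or punctuation.
4. If any fact in the quiz is not in the question bank, the quiz fails.

Answer with a single letter: Y if every fact is in the question bank, N otherwise."""

GROUNDING_HUMAN_TEMPLATE = """Quiz to evaluate:
{agent_response}"""


def render_grounding_prompt(context: str, agent_response: str) -> str:
    """Render both templates to plain text for inspection (no model call)."""
    system = GROUNDING_SYSTEM_TEMPLATE.format(context=context)
    human = GROUNDING_HUMAN_TEMPLATE.format(agent_response=agent_response)
    return system + "\n\n" + human


print(render_grounding_prompt(
    "Subject: Paris. Facts: Capital of France",
    "Question 1:#### What is the capital of France?",
))
```

Rendering the prompt offline is a cheap way to review exactly what the grader will see before spending tokens on it.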

Evaluate a dataset, not only one prompt

The course defines a small dataset with supported and unsupported requests.

test_dataset = [
    {
        "input": "I'm trying to learn about science, can you give me a quiz?",
        "response": "science",
        "subjects": ["davinci", "telescope", "physics", "curie"],
    },
    {
        "input": "I'm a geography expert, give a quiz to prove it?",
        "response": "geography",
        "subjects": ["paris", "france", "louvre"],
    },
    {
        "input": "Quiz me about Italy",
        "response": "geography",
        "subjects": ["rome", "alps", "sicily"],
    },
]

Then it runs the assistant and evaluator for each row:

def evaluate_dataset(dataset, quiz_bank, assistant, evaluator):
    eval_results = []

    for row in dataset:
        user_input = row["input"]
        answer = assistant.invoke({"question": user_input})
        grader_response = evaluator.invoke({
            "context": quiz_bank,
            "agent_response": answer,
        })

        eval_results.append({
            "input": user_input,
            "output": answer,
            "grader_response": grader_response,
        })

    return eval_results

The course stores these results as an HTML table so humans can review evaluator behavior later.
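A rough idea of what that report step looks like, written with the standard library only so it runs without extra dependencies (the course itself uses Pandas for this). The results_to_html helper, the sample row, and the eval_results.html filename are assumptions for illustration.

```python
from html import escape

COLUMNS = ["input", "output", "grader_response"]


def results_to_html(eval_results: list[dict]) -> str:
    """Render eval results as a simple HTML table, escaping model text."""
    header = "".join(f"<th>{escape(col)}</th>" for col in COLUMNS)
    body = ""
    for result in eval_results:
        cells = "".join(f"<td>{escape(str(result[col]))}</td>" for col in COLUMNS)
        body += f"<tr>{cells}</tr>"
    return f"<table><tr>{header}</tr>{body}</table>"


# Hypothetical row in the shape returned by evaluate_dataset.
sample_results = [
    {
        "input": "Quiz me about Italy",
        "output": "Question 1:#### What is the capital of <b>Italy</b>?",
        "grader_response": "N",
    },
]

report = results_to_html(sample_results)
with open("eval_results.html", "w", encoding="utf-8") as f:
    f.write(report)
```

Escaping matters here because model output is untrusted text that will be opened in a browser. In CircleCI, a file like this can be kept with a store_artifacts step.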

Run the evals in CircleCI

The CircleCI material breaks the config into jobs, commands, and workflows:

  • jobs define high-level automation tasks,
  • commands define reusable steps,
  • workflows orchestrate jobs.

A minimal eval job is just a Python test run inside a CircleCI executor.

version: 2.1

jobs:
  run-commit-evals:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - run:
          name: Install dependencies
          command: pip install -r requirements.txt
      - run:
          name: Run assistant evals
          command: pytest test_assistant.py

workflows:
  evaluate-app:
    jobs:
      - run-commit-evals

The optional CircleCI notes also cover conditional workflows, scheduled workflows, execution environments, orbs, and contexts for secrets.

4) Technical details worth highlighting

Keep evaluator scope narrow

A model-graded eval should have one job. Format evals should check format. Grounding evals should check source adherence. Refusal evals should check refusal behavior.

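Broad evaluator prompts produce vague failures: when one grader judges format, grounding, and refusal together, a failing verdict does not say which behavior regressed.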
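One way to keep scope visible is to label each eval with the single behavior it checks and report failures by label. The sketch below uses simple deterministic stand-ins for the checks; ScopedEval, failed_behaviors, and both lambda checks are hypothetical and only illustrate the structure.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScopedEval:
    behavior: str                  # the one behavior class this eval covers
    check: Callable[[str], bool]   # returns True when the answer passes


def failed_behaviors(answer: str, evals: list[ScopedEval]) -> list[str]:
    """Return the behavior classes whose check failed for this answer."""
    return [e.behavior for e in evals if not e.check(answer)]


# Stand-in checks; real format and grounding evals may be model-graded.
evals = [
    ScopedEval("format", lambda a: re.search(r"Question \d+:####", a) is not None),
    ScopedEval("grounding", lambda a: "rome" not in a.lower()),
]

answer = "Question 1:#### What is the capital of Italy, Rome?"
print(failed_behaviors(answer, evals))  # ['grounding']
```

The output names the behavior that regressed, which is more actionable than a single combined pass/fail.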

Store artifacts for review

The course creates an evaluation report with Pandas and stores it as a CircleCI artifact. This is useful because model-graded evals can be wrong or sensitive to wording. Human review is part of the system.

Separate commit evals and release evals

Fast checks can run on every commit. Heavier model-graded checks can run on release branches, scheduled workflows, or pre-deployment pipelines.

This avoids making every small edit wait for expensive tests while still protecting release quality.
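The split can also be expressed in the test runner itself. This sketch picks an eval suite from the branch name, reading CircleCI's built-in CIRCLE_BRANCH environment variable; the suite names, the select_evals helper, and the "main" and "release/" branch conventions are assumptions.

```python
import os

# Hypothetical suite names; cheap deterministic checks run everywhere.
COMMIT_EVALS = ["expected_words", "refusal"]
# Release suites add the slower, model-graded checks.
RELEASE_EVALS = COMMIT_EVALS + ["format_graded", "grounding_graded"]


def select_evals(branch: str) -> list[str]:
    """Return the eval suite to run for a given branch name."""
    if branch == "main" or branch.startswith("release/"):
        return RELEASE_EVALS
    return COMMIT_EVALS


# CIRCLE_BRANCH is set by CircleCI; fall back to a local label elsewhere.
branch = os.environ.get("CIRCLE_BRANCH", "local")
print(branch, select_evals(branch))
```

The same decision can be made in the CircleCI config with branch filters or separate workflows; doing it in code keeps local runs and CI runs consistent.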

Context is the boundary for hallucination checks

The quiz bank is the authority. If the generated quiz contains facts outside that bank, the hallucination eval should fail even if the generated fact is true in the real world.

This is an important distinction. The eval is checking application grounding, not global truth.
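A toy illustration of that boundary, using exact string matching in place of the model grader. The fact strings and the facts_outside_bank helper are hypothetical; a real grounding eval compares meaning, not literal strings.

```python
# Hypothetical, lowercased facts taken from the bank.
BANK_FACTS = {"capital of france", "location of the louvre"}


def facts_outside_bank(quiz_facts: list[str], bank_facts: set[str]) -> list[str]:
    """Return quiz facts that do not appear in the bank (exact match, case-insensitive)."""
    return [fact for fact in quiz_facts if fact.lower() not in bank_facts]


# The second fact may be true in the real world, but the bank does not contain it.
quiz_facts = ["Capital of France", "Capital of Italy"]
print(facts_outside_bank(quiz_facts, BANK_FACTS))  # ['Capital of Italy']
```

The second fact is flagged even though it can be correct in the real world, because the check asks only whether the bank supports it.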

5) Common pitfalls or lessons learned

  • Exact word checks are useful but brittle. Do not rely on them alone.
  • Model-graded evals need clear output formats such as Y/N or SAFE/UNSAFE.
  • A judge model can be wrong, so store reports for review.
  • Unsupported-topic tests are as important as happy-path tests.
  • Evaluation datasets should include phrasing variation, not only ideal prompts.
  • CI secrets should live in secure contexts, not in config files.
  • A failing eval should point to a behavior class: format, refusal, grounding, or safety.
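For the secrets point, a small sketch of reading the API key from the environment and failing early with a clear message. The require_api_key helper is hypothetical; OPENAI_API_KEY is the variable name the OpenAI client reads by default.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, or fail with an actionable message."""
    key = os.environ.get(name, "")
    if not key:
        raise RuntimeError(
            f"{name} is not set; provide it through a CircleCI context "
            "or a project environment variable, not the config file"
        )
    return key


# Typical use at the start of an eval run:
#   api_key = require_api_key()
```

Failing before any eval runs makes a missing secret look like a setup problem rather than a wave of eval failures.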

Final thoughts

The strongest pattern in this course is layered testing. Start with deterministic assertions, then add model-graded evals where strict matching breaks down, then preserve reports for human review.

LLM apps do not become reliable because one evaluator says “pass.” They become more reliable when behavior is checked repeatedly, scoped clearly, and connected to the release workflow.


Reference

This article is based on my personal study notes from the Cyber AI Security track.

Full repository: https://github.com/lameiro0x/cyber-ai-security-notes