Introduction
RAG applications are easy to prototype and hard to make reliable. A chatbot can retrieve documents, pass them to an LLM, and answer questions in a few lines of code. The hard part is making sure it does not invent unsupported details, drift off-topic, leak personal data, or violate business rules.
This course frames guardrails as a secondary validation layer around LLM inputs and outputs. Prompting, fine-tuning, RLHF, and RAG help, but they do not remove the need for runtime checks.
This post focuses on the foundational guardrails material: RAG failure modes, the difference between validators, guards, and a guardrails server, and a simple custom validator.
1) Core idea
A guardrail is a check around an LLM call. It validates whether the input or output is acceptable for the specific application context.
Validity depends on the system. For a pizza support chatbot, valid behavior may mean:
- only answer about the pizza business,
- do not discuss competitors,
- do not reveal internal project names,
- do not store or send PII,
- do not invent recipes or policy details.
The main idea is not to make the model perfect. It is to limit worst-case behavior and make violations observable.
2) Key concepts
RAG chatbot workflow
The course builds a RAG chatbot for Alfredo’s Pizza Cafe. The system message says the bot should only answer from the provided knowledge base and should avoid unrelated topics.
The simple workflow is:
- user sends a question,
- vector database retrieves relevant documents,
- retrieved context is passed to the LLM,
- the LLM generates a response.
This improves factual grounding, but it does not guarantee reliability.
Failure modes in RAG applications
The course demonstrates four practical failure modes.
| Failure mode | Example problem |
|---|---|
| Hallucination | The chatbot invents detailed recipe instructions not present in the documents. |
| Unintended app use | The chatbot answers unrelated questions, such as vehicle comparisons. |
| Information leakage | PII is stored in backend message history even if the response is polite. |
| Reputational damage | The chatbot mentions or compares a competitor despite instructions not to. |
These are application-level failures. They are not solved only by saying “answer from the knowledge base” in the system prompt.
Validator
A validator is the core validation logic. It checks whether a value follows a rule.
Examples:
- detect a banned project name,
- detect PII,
- detect unsupported claims,
- detect off-topic content.
Guard
A guard is the application wrapper that applies one or more validators to an input or output.
One guard can contain multiple guardrails. This makes it possible to apply a validation bundle at the right point in the LLM workflow.
Guardrails server
The Guardrails server exposes guarded endpoints compatible with the OpenAI API style. That matters because an application can often switch from a normal OpenAI-compatible client to a guarded client by changing the base URL.
The course uses endpoints shaped like:
from openai import OpenAI
guarded_client = OpenAI(
base_url="http://127.0.0.1:8000/guards/colosseum_guard/openai/v1/"
)
Benefits called out in the material:
- easier deployment,
- containerization,
- independent scaling,
- OpenAI SDK compatible endpoints,
- reusable guards around multiple applications.
3) Practical workflow
Build the unguarded chatbot
The chatbot is created with an OpenAI client, a simple vector database, and a system message.
from openai import OpenAI
from helper import RAGChatWidget, SimpleVectorDB
client = OpenAI()
vector_db = SimpleVectorDB.from_files("shared_data/")
system_message = """You are a customer support chatbot for Alfredo's Pizza Cafe.
Your responses should be based solely on the provided information.
Only answer questions related to Alfredo's Pizza Cafe's menu,
account management, delivery times, and directly relevant topics.
Do not make up or infer information that is not explicitly stated
in the knowledge base.
"""
rag_chatbot = RAGChatWidget(
client=client,
system_message=system_message,
vector_db=vector_db,
)
This is a good prototype, but the course shows that prompt tricks can still bypass instructions.
Add a simple validator
The first custom guardrail detects a banned internal project name.
from typing import Any, Dict
from guardrails import Guard, OnFailAction
from guardrails.validator_base import (
FailResult,
PassResult,
ValidationResult,
Validator,
register_validator,
)
@register_validator(name="detect_colosseum", data_type="string")
class ColosseumDetector(Validator):
def _validate(
self,
value: Any,
metadata: Dict[str, Any] = {},
) -> ValidationResult:
if "colosseum" in value.lower():
return FailResult(
error_message="Colosseum detected",
fix_value=(
"I'm sorry, I can't answer questions "
"about Project Colosseum."
),
)
return PassResult()
This is intentionally simple. It demonstrates the validator contract:
- inspect the value,
- return
FailResultwhen the rule is violated, - return
PassResultotherwise, - optionally provide a
fix_value.
Wrap the validator in a guard
guard = Guard().use(
ColosseumDetector(
on_fail=OnFailAction.EXCEPTION,
),
on="messages",
)
With EXCEPTION, the application sees an error when validation fails.
For better user experience, the notes replace EXCEPTION with FIX so the guard uses the validator’s fix_value.
colosseum_guard_2 = Guard(name="colosseum_guard_2").use(
ColosseumDetector(on_fail=OnFailAction.FIX),
on="messages",
)
That turns a raw guardrail failure into a controlled response.
Run through the Guardrails server
The course installs Guardrails server dependencies, hub guards, and starts a local server.
pip install -r requirements.txt
python -m spacy download en_core_web_trf
guardrails configure
guardrails start --config config.py
The course also installs several hub guardrails:
guardrails hub install hub://guardrails/provenance_llm --no-install-local-models
guardrails hub install hub://guardrails/detect_pii
guardrails hub install hub://tryolabs/restricttotopic --no-install-local-models
guardrails hub install hub://guardrails/competitor_check --no-install-local-models
Once the server is running, the app uses a guarded OpenAI-compatible client:
guarded_client = OpenAI(
base_url="http://127.0.0.1:8000/guards/colosseum_guard/openai/v1/"
)
Then the same RAG widget can run through the guarded endpoint.
4) Technical details worth highlighting
Guardrails complement prompts
The source material shows a useful failure: even when the system prompt forbids a topic, a completion-style prompt can still elicit a response. A validator checks the actual content at runtime instead of relying only on instruction-following.
The validation target matters
The course applies the simple detector to messages. Other validators may be better applied to user input, retrieved context, final output, or streamed chunks.
Choose the location based on what you are trying to prevent.
Be careful with system prompt content
One practical lesson from the notes: if the system message itself contains the banned word, a validator scanning all messages can fail before the user asks anything.
The course removes the project name from the guarded system message for that reason.
Failure handling is product behavior
EXCEPTION is useful during development because it makes failures obvious. FIX can be better for production because it gives the user a controlled response.
The right behavior depends on the risk. Some failures should block hard. Others can be rewritten, anonymized, or redirected.
5) Common pitfalls or lessons learned
- A RAG proof of concept is not reliable by default.
- System instructions do not reliably enforce business rules.
- Validators should be scoped to the right part of the request/response flow.
- A guardrail can fail because of your own system prompt if you scan too broadly.
- Raw exceptions are useful for debugging but poor user experience.
- Guardrails should be logged so violations become measurable.
- Guardrails do not replace evals or monitoring; they add runtime control.
Final thoughts
The most practical takeaway is that guardrails turn application rules into executable checks. That is stronger than hoping the model follows instructions every time.
For production LLM systems, the useful pattern is layered: prompt design, retrieval quality, evals, monitoring, and runtime guardrails. Each layer catches a different class of failure.
Reference
This article is based on my personal study notes from the Cyber AI Security track.
Full repository: https://github.com/lameiro0x/cyber-ai-security-notes