Introduction

The guardrails course becomes most useful when it moves from a simple keyword detector to specialized validators. The material covers four practical controls:

  • hallucination detection with Natural Language Inference,
  • topic restriction with zero-shot classification,
  • PII detection with Microsoft Presidio,
  • competitor mention detection with exact matching, NER, and vector similarity.

Each control protects a different failure mode. Together they show the right engineering pattern: use small, task-specific models and validators around the LLM instead of expecting the LLM to police itself.

1) Core idea

Guardrails should be specific. A generic “be safe” instruction is weak. A validator that checks whether a response is grounded, on-topic, free of PII, or free of competitor names is stronger because it turns policy into code.

The course implements each validator as a runtime check that can fail, raise an exception, fix output, or sit behind an OpenAI-compatible guardrails endpoint.

2) Key concepts

Groundedness

A grounded response contains claims explicitly supported by the provided context. The course treats hallucination as lack of groundedness.

For RAG systems, this is the right framing. The goal is not “is this statement globally true?” The goal is “is this statement supported by the sources retrieved for this application?”

Topic control

Topic control prevents unintended app use. A pizza support chatbot should not answer political or automobile questions just because the user asks persuasively.

The course uses zero-shot classification with facebook/bart-large-mnli.

PII handling

PII should not be sent, stored, or returned accidentally. The course shows a subtle issue: the chatbot does not reveal private order information, but the raw user message containing a name and phone number is still stored in message history.

That is still a privacy problem.

Competitor mention control

The competitor guardrail prevents the assistant from naming or comparing a specific competitor. The implementation combines exact matching, named entity recognition, and vector similarity so it can catch more than one wording form.

3) Practical workflow

Hallucination control with NLI

The course uses an entailment model from Hugging Face:

from transformers import pipeline

entailment_model = "GuardrailsAI/finetuned_nli_provenance"
NLI_PIPELINE = pipeline("text-classification", model=entailment_model)

The basic NLI shape is premise plus hypothesis:

premise = "The sun rises in the east and sets in the west."
hypothesis = "The sun rises in the east."

result = NLI_PIPELINE({
    "text": premise,
    "text_pair": hypothesis,
})

For an LLM response, each sentence becomes a hypothesis. The validator then finds relevant source chunks and checks whether each sentence is entailed.

The course’s validator has four important methods:

  • split_sentences() uses NLTK to split model output,
  • find_relevant_sources() embeds sentences and sources,
  • check_entailment() runs the NLI model,
  • validate() returns failure if any sentence is unsupported.

The relevant-source lookup uses all-MiniLM-L6-v2 embeddings and keeps top sources above a cosine similarity threshold.

Simplified structure:

@register_validator(name="hallucination_detector", data_type="string")
class HallucinationValidation(Validator):
    def validate(self, value: str, metadata=None):
        sentences = self.split_sentences(value)
        relevant_sources = self.find_relevant_sources(
            sentences,
            self.sources,
        )

        hallucinated_sentences = []
        for sentence in sentences:
            if not self.check_entailment(sentence, relevant_sources):
                hallucinated_sentences.append(sentence)

        if hallucinated_sentences:
            return FailResult(
                error_message=(
                    "The following sentences are hallucinated: "
                    f"{hallucinated_sentences}"
                )
            )

        return PassResult()

The guard is then created with explicit sources:

guard = Guard().use(
    HallucinationValidation(
        embedding_model="all-MiniLM-L6-v2",
        entailment_model="GuardrailsAI/finetuned_nli_provenance",
        sources=[
            "The sun rises in the east and sets in the west.",
            "The sun is hot.",
        ],
        on_fail=OnFailAction.EXCEPTION,
    )
)

Applied to the RAG chatbot, the hallucination guard blocks the recipe answer that was not supported by the knowledge base.

Topic control with zero-shot classification

The course revisits an off-topic failure where the chatbot answers a vehicle comparison after the user injects new “system instructions.” The guardrail uses a zero-shot classifier:

from transformers import pipeline

CLASSIFIER = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",
    hypothesis_template=(
        "This sentence above contains discussions "
        "of the folllowing topics: {}."
    ),
    multi_label=True,
)

Topic detection returns labels above a threshold:

def detect_topics(
    text: str,
    topics: list[str],
    threshold: float = 0.8,
) -> list[str]:
    result = CLASSIFIER(text, topics)
    return [
        topic
        for topic, score in zip(result["labels"], result["scores"])
        if score > threshold
    ]

The validator blocks banned topics:

@register_validator(name="constrain_topic", data_type="string")
class ConstrainTopic(Validator):
    def __init__(
        self,
        banned_topics: Optional[list[str]] = ["politics"],
        threshold: float = 0.8,
        **kwargs,
    ):
        self.topics = banned_topics
        self.threshold = threshold
        super().__init__(**kwargs)

    def _validate(self, value: str, metadata=None):
        detected_topics = detect_topics(
            value,
            self.topics,
            self.threshold,
        )
        if detected_topics:
            return FailResult(
                error_message=(
                    "The text contains the following banned topics: "
                    f"{detected_topics}"
                )
            )

        return PassResult()

The course uses banned topics such as politics and automobiles, then runs the guard through a server endpoint named topic_guard.

PII control with Microsoft Presidio

The PII lesson focuses on preventing personal information from being sent to the provider or stored in logs/history.

Presidio provides an analyzer and anonymizer:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

presidio_analyzer = AnalyzerEngine()
presidio_anonymizer = AnonymizerEngine()

The analyzer returns entity positions and confidence scores:

text = (
    "can you tell me what orders i've placed in the last 3 months? "
    "my name is Hank Tate and my phone number is 555-123-4567"
)

analysis = presidio_analyzer.analyze(text, language="en")

The anonymizer can then mask those detected values:

presidio_anonymizer.anonymize(
    text=text,
    analyzer_results=analysis,
)

The custom validator limits detection to selected entities:

def detect_pii(text: str) -> list[str]:
    result = presidio_analyzer.analyze(
        text,
        language="en",
        entities=["PERSON", "PHONE_NUMBER"],
    )
    return [entity.entity_type for entity in result]

Validator:

@register_validator(name="pii_detector", data_type="string")
class PIIDetector(Validator):
    def _validate(
        self,
        value: Any,
        metadata: Dict[str, Any] = {},
    ) -> ValidationResult:
        detected_pii = detect_pii(value)
        if detected_pii:
            return FailResult(
                error_message=f"PII detected: {', '.join(detected_pii)}",
                metadata={"detected_pii": detected_pii},
            )
        return PassResult(message="No PII detected")

The course also shows streaming validation using Guardrails Hub’s DetectPII:

from guardrails.hub import DetectPII

guard = Guard().use(
    DetectPII(
        pii_entities=["PHONE_NUMBER", "EMAIL_ADDRESS"],
        on_fail="fix",
    )
)

This is useful when the model output itself might contain PII and needs to be fixed while streaming.

Competitor mention control

The competitor validator combines three checks:

  1. exact whole-word match,
  2. named entity extraction,
  3. vector similarity between extracted entities and competitor names.

The model setup uses a BERT NER pipeline and sentence embeddings.

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
NER = pipeline("ner", model=model, tokenizer=tokenizer)

The validator precomputes competitor embeddings and sets a similarity threshold of 0.6.

@register_validator(name="check_competitor_mentions", data_type="string")
class CheckCompetitorMentions(Validator):
    def __init__(self, competitors: list[str], **kwargs):
        self.competitors = competitors
        self.competitors_lower = [c.lower() for c in competitors]
        self.ner = NER
        self.sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
        self.competitor_embeddings = self.sentence_model.encode(
            self.competitors
        )
        self.similarity_threshold = 0.6
        super().__init__(**kwargs)

The validation flow is:

def validate(self, value: str, metadata=None):
    exact_matches = self.exact_match(value)
    if exact_matches:
        return FailResult(
            error_message=(
                "Your response directly mentions competitors: "
                f"{', '.join(exact_matches)}"
            )
        )

    entities = self.extract_entities(value)
    similarity_matches = self.vector_similarity_match(entities)
    all_matches = list(set(exact_matches + similarity_matches))

    if all_matches:
        return FailResult(
            error_message=(
                "Your response mentions competitors: "
                f"{', '.join(all_matches)}"
            )
        )

    return PassResult()

This is stronger than simple substring matching because users and models may refer to a competitor indirectly or with slight wording variation.

4) Technical details worth highlighting

Small models can be the right tool

The topic lesson compares using an LLM for classification with using a local zero-shot classifier. The note is practical: smaller task-specific models can be faster, cheaper, and more consistent for narrow validation tasks.

Grounding requires sources

The NLI validator is only as good as the sources it receives. If retrieval misses the right context, the validator may fail correct claims. If irrelevant context is included, it may pass weakly supported claims.

PII is not only response leakage

The PII lesson shows that even a safe response can be paired with unsafe backend storage. Guardrails need to protect logs, history, and provider requests, not only final output.

Exact matches are not enough

The competitor guardrail uses exact matching first, but adds NER and vector similarity. That pattern is useful for many policy checks: start simple, then add semantic detection where exact strings are too brittle.

5) Common pitfalls or lessons learned

  • Hallucination validators need source quality and retrieval quality.
  • Sentence-level NLI can be slow if every response is long.
  • Topic thresholds need calibration. Too low overblocks; too high misses violations.
  • PII detection should decide whether to block, mask, or rewrite.
  • Streaming validation is useful when unsafe output may appear chunk by chunk.
  • Competitor matching can false positive on unrelated named entities if similarity thresholds are weak.
  • Guardrails should produce user-friendly failures, not raw stack traces.

Final thoughts

The most useful engineering lesson is to stop asking one LLM to handle every responsibility. Let the LLM generate, but surround it with smaller validators that check specific rules.

For RAG applications, the practical baseline is clear: check groundedness, constrain topics, protect PII, and enforce business-specific policies such as competitor mentions. Those controls do not make the system perfect, but they make failure modes explicit and much easier to manage.


Reference

This article is based on my personal study notes from the Cyber AI Security track.

Full repository: https://github.com/lameiro0x/cyber-ai-security-notes