LLMOps Pipeline: From Dataset Preparation to Safe Prediction | lameiro0x

Introduction

LLMOps is the operational layer around LLM applications. It is not only model deployment. A useful LLM workflow also needs data preparation, artifact versioning, orchestration, prompt consistency, endpoint management, safety checks, and monitoring.

The main lesson from this course is that an LLM application becomes production-ready only when the surrounding system is controlled. The model is one component. The data pipeline, prompt format, evaluation split, deployment strategy, and response metadata are just as important.

1) Core idea

Classic MLOps focuses on automation and monitoring across integration, testing, release, deployment, and infrastructure management. LLMOps keeps those ideas, but adds LLM-specific concerns:

prompts become part of the production interface,
context and retrieval influence model behavior,
outputs are probabilistic,
safety metadata matters,
model behavior must be checked after deployment, not only before it.

A practical LLMOps pipeline in these notes follows this flow:

collect and filter data,
prepare train and evaluation artifacts,
version the artifacts,
orchestrate tuning with a pipeline,
deploy or select a tuned model endpoint,
send production prompts in the same format used during tuning,
inspect safety attributes and citation metadata.

2) Key concepts

Data warehouse

The course uses BigQuery as the data warehouse for the Stack Overflow public dataset. This matters because tuning and evaluation data usually starts larger than what should be loaded into local memory.

For local or small data, Pandas is enough. For large training sources, the workflow should push filtering and joins into the warehouse before converting anything to a local dataframe.

Dataset format

The notes compare three formats:

Format	Practical use
`JSONL`	Human-readable, simple, good for small and medium-sized tuning datasets.
`TFRecord`	Binary and efficient for training systems.
`Parquet`	Better for large and complex tabular datasets.

For the course workflow, the tuning data is exported as JSONL with one record per line.
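As a minimal sketch of what "one record per line" means, the snippet below serializes a few records to JSONL and parses them back using only the standard library. The helper names `to_jsonl` and `from_jsonl` and the sample records are illustrative and are not part of the course code.

```python
import json

def to_jsonl(records):
    """Serialize an iterable of dicts as JSONL: one JSON object per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "\n".join(json.dumps(record) for record in records)

def from_jsonl(text):
    """Parse JSONL text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]

# Hypothetical sample records, including an answer that spans two lines.
sample = [
    {"input_text": "How do I reverse a list?", "output_text": "Use reversed()."},
    {"input_text": "What is a tuple?", "output_text": "An immutable\nsequence."},
]

jsonl_text = to_jsonl(sample)
round_tripped = from_jsonl(jsonl_text)
```

Because every line is a complete JSON object, a tuning job can stream the file record by record instead of parsing it as a whole.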

Artifact versioning

Versioning is not cosmetic. It gives reproducibility, traceability, and maintainability. The source material uses timestamps in generated filenames so training and evaluation data can be tied back to a specific run.
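One possible way to build such filenames is sketched below. The timestamp layout, the `versioned_filename` helper, and the file prefixes are assumptions chosen for illustration; the course may use a different format.

```python
from datetime import datetime, timezone

def versioned_filename(prefix, when=None, extension="jsonl"):
    """Return a name like 'prefix-20240131T120000Z.jsonl' for a given run time."""
    # Normalize to UTC so names from different machines sort consistently.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = when.strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}-{stamp}.{extension}"

# Use one timestamp per run so train and evaluation files share a version.
run_time = datetime(2024, 1, 31, 12, 0, 0, tzinfo=timezone.utc)
train_name = versioned_filename("tune_data", run_time)
eval_name = versioned_filename("tune_eval_data", run_time)
```

Putting the year first makes the names sort chronologically, which helps when listing a bucket by name.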

Orchestration

Kubeflow Pipelines are used to define and run the workflow. The important concept is that pipeline components are self-contained execution units. Instead of passing large payloads directly between steps, components usually pass locations where artifacts are stored.

That makes the pipeline easier to scale and easier to re-run when only one part changes.
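The idea of passing locations instead of payloads can be shown without any pipeline framework. In this sketch, plain Python functions stand in for components and a temporary directory stands in for a storage bucket; none of this is Kubeflow API, and the function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def prepare_data(output_dir):
    """Step 1: write the artifact to storage and return only its location."""
    path = Path(output_dir) / "train.jsonl"
    rows = [{"input_text": f"question {i}", "output_text": f"answer {i}"} for i in range(3)]
    path.write_text("\n".join(json.dumps(row) for row in rows), encoding="utf-8")
    return str(path)

def count_examples(data_path):
    """Step 2: receive a location, then load the artifact from it."""
    text = Path(data_path).read_text(encoding="utf-8")
    return sum(1 for line in text.split("\n") if line.strip())

with tempfile.TemporaryDirectory() as workdir:
    artifact_path = prepare_data(workdir)  # only a short string crosses the step boundary
    n_examples = count_examples(artifact_path)
```

If `count_examples` changes, it can be re-run against the stored file without regenerating the data.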

3) Practical workflow

Prepare the cloud environment

The course initializes Vertex AI and BigQuery with project credentials and a region.

# `utils` is a local helper module from the course materials, not a published package.
from utils import authenticate
credentials, PROJECT_ID = authenticate()

REGION = "us-central1"

import vertexai
vertexai.init(
    project=PROJECT_ID,
    location=REGION,
    credentials=credentials,
)

For BigQuery:

from google.cloud import bigquery

bq_client = bigquery.Client(
    project=PROJECT_ID,
    credentials=credentials,
)

Query only what the model needs

The course joins Stack Overflow questions and accepted answers, filters for Python questions, and limits the dataset.

SELECT
    CONCAT(q.title, q.body) AS input_text,
    a.body AS output_text
FROM
    `bigquery-public-data.stackoverflow.posts_questions` q
JOIN
    `bigquery-public-data.stackoverflow.posts_answers` a
ON
    q.accepted_answer_id = a.id
WHERE
    q.accepted_answer_id IS NOT NULL
    AND REGEXP_CONTAINS(q.tags, "python")
    AND a.creation_date >= "2020-01-01"
LIMIT
    10000

This is the right shape for a tuning dataset: user-style input in one column and expected answer in another.

Add the instruction before tuning

The material adds an instruction prefix so the model learns the desired response behavior, not only the question-answer pairs.

INSTRUCTION_TEMPLATE = """\
Please answer the following Stackoverflow question on Python.
Answer it like you are a developer answering Stackoverflow questions.

Stackoverflow question:
"""

stack_overflow_df["input_text_instruct"] = (
    INSTRUCTION_TEMPLATE + " " + stack_overflow_df["input_text"]
)

This detail is important later. If production prompts do not match the tuning format, the deployed model is being used differently from how it was trained.

Split train and evaluation data

The default split in the course is 80/20.

from sklearn.model_selection import train_test_split

train, evaluation = train_test_split(
    stack_overflow_df,
    test_size=0.2,
    random_state=42,
)
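What the fixed `random_state` buys is reproducibility: the same rows land in the same split on every run. The sketch below illustrates that with the standard library only. It is not scikit-learn's algorithm and will not produce the same partition as `train_test_split`; the `split_records` helper and the sample records are hypothetical.

```python
import random

def split_records(records, test_size=0.2, seed=42):
    """Shuffle a copy of the records and return (train, evaluation)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded generator gives a repeatable order
    n_eval = round(len(shuffled) * test_size)
    return shuffled[n_eval:], shuffled[:n_eval]

records = [{"id": i} for i in range(10)]
train_rows, eval_rows = split_records(records)

# Running the split again with the same seed yields the same partition.
train_again, eval_again = split_records(records)
```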

Then export only the input and output fields needed for tuning.

cols = ["input_text_instruct", "output_text"]
tune_jsonl = train[cols].to_json(orient="records", lines=True)

Build a simple Kubeflow pipeline

The simple example shows the mechanics: components are functions wrapped with @dsl.component, and the pipeline defines their order.

from kfp import dsl, compiler

@dsl.component
def say_hello(name: str) -> str:
    return f"Hello, {name}!"

@dsl.component
def how_are_you(hello_text: str) -> str:
    return f"{hello_text}. How are you?"

@dsl.pipeline
def hello_pipeline(recipient: str) -> str:
    hello_task = say_hello(name=recipient)
    how_task = how_are_you(hello_text=hello_task.output)
    return how_task.output

compiler.Compiler().compile(hello_pipeline, "pipeline.yaml")

The practical value is not the hello-world example. It is learning the boundary between pipeline definition, compiled artifact, runtime parameters, and execution.

Run a supervised tuning pipeline

The real workflow reuses a provided tuning pipeline template and passes arguments such as model name, region, base model, training steps, dataset URI, evaluation interval, and evaluation data URI.

pipeline_arguments = {
    "model_display_name": MODEL_NAME,
    "location": REGION,
    "large_model_reference": "text-bison@001",
    "project": PROJECT_ID,
    "train_steps": 200,
    "dataset_uri": TRAINING_DATA_URI,
    "evaluation_interval": 20,
    "evaluation_data_uri": EVALUATION_DATA_URI,
}

One useful implementation detail is caching. With enable_caching=True, unchanged components can reuse prior outputs instead of running every time.
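A sketch of how the arguments and the caching flag might be handed to a Vertex AI pipeline run is shown below. `PipelineJob` and its `submit()` method come from the `google-cloud-aiplatform` SDK; the helper functions, the template filename, the bucket path, and the display name are hypothetical placeholders. Submitting requires valid credentials, so the sketch only builds the settings and leaves submission in a function that is not called here.

```python
def build_pipeline_job_kwargs(template_path, pipeline_root, parameter_values,
                              location, display_name, enable_caching=True):
    """Collect the settings for one pipeline run in an inspectable dict."""
    return {
        "template_path": template_path,        # compiled pipeline definition
        "display_name": display_name,
        "parameter_values": dict(parameter_values),  # runtime parameters
        "location": location,
        "pipeline_root": pipeline_root,        # where step outputs are stored
        "enable_caching": enable_caching,      # reuse outputs of unchanged steps
    }

def submit_pipeline(job_kwargs):
    """Submit the run. Needs google-cloud-aiplatform and valid credentials."""
    from google.cloud import aiplatform  # imported lazily; not needed to build kwargs
    job = aiplatform.PipelineJob(**job_kwargs)
    job.submit()
    return job

# Hypothetical values for illustration only.
job_kwargs = build_pipeline_job_kwargs(
    template_path="tuning_pipeline.yaml",
    pipeline_root="gs://example-bucket/pipeline-root",
    parameter_values={"train_steps": 200, "evaluation_interval": 20},
    location="us-central1",
    display_name="tuning-run",
)
```

Keeping the settings in a plain dict makes it easy to log exactly what a run was started with.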

4) Technical details worth highlighting

Prompt management is deployment logic

The tuned model should receive production prompts in the same shape as the training examples.

# Same wording and layout as the INSTRUCTION_TEMPLATE used to build the tuning data.
INSTRUCTION = """\
Please answer the following Stackoverflow question on Python.
Answer it like you are a developer answering Stackoverflow questions.

Stackoverflow question:
"""

QUESTION = "How can I store my TensorFlow checkpoint on Google Cloud Storage?"
PROMPT = f"{INSTRUCTION} {QUESTION}"

This is not just prompt engineering. It is interface compatibility between training and inference.
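One way to keep the two formats from drifting apart is to route both tuning data and production requests through a single function. This is a sketch under that assumption; `build_prompt` and `tuning_record` are hypothetical helpers, and the template text is the one used for the tuning data in these notes.

```python
INSTRUCTION_TEMPLATE = """\
Please answer the following Stackoverflow question on Python.
Answer it like you are a developer answering Stackoverflow questions.

Stackoverflow question:
"""

def build_prompt(question):
    """Format a question exactly as the tuning examples are formatted."""
    return INSTRUCTION_TEMPLATE + " " + question

def tuning_record(question, answer):
    """Build one tuning example using the same prompt builder as production."""
    return {"input_text_instruct": build_prompt(question), "output_text": answer}

production_prompt = build_prompt(
    "How can I store my TensorFlow checkpoint on Google Cloud Storage?"
)
```

With one builder, a change to the instruction text is a visible code change that affects both sides at once, which signals that the model may need re-tuning.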

Load balancing can be simple

The notes select randomly from a list of tuned model names.

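import random
from vertexai.language_models import TextGenerationModel

# Base model whose tuned variants are listed.
model = TextGenerationModel.from_pretrained("text-bison@001")
list_tuned_models = model.list_tuned_model_names()

# Pick one tuned model at random for this request.
tuned_model_select = random.choice(list_tuned_models)
deployed_model = TextGenerationModel.get_tuned_model(tuned_model_select)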

This is not a full production load balancer, but it illustrates the concept: multiple deployed copies can distribute traffic and reduce pressure on a single endpoint.
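Random choice spreads traffic evenly only on average. A round-robin selector is a common alternative that rotates through the names in order. The sketch below shows both with the standard library; the selector functions and the model names are hypothetical, and round-robin is a swapped-in technique, not what the notes use.

```python
import itertools
import random

def random_selector(model_names, seed=None):
    """Return a function that picks a model name at random on each call."""
    if not model_names:
        raise ValueError("model_names must not be empty")
    rng = random.Random(seed)
    return lambda: rng.choice(model_names)

def round_robin_selector(model_names):
    """Return a function that cycles through the model names in order."""
    if not model_names:
        raise ValueError("model_names must not be empty")
    cycle = itertools.cycle(model_names)
    return lambda: next(cycle)

names = ["tuned-model-a", "tuned-model-b", "tuned-model-c"]
pick = round_robin_selector(names)
first_six = [pick() for _ in range(6)]  # each name appears exactly twice
```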

Safety attributes are part of the response

The Vertex AI response contains safety metadata. The course inspects whether a response was blocked and which safety attributes were assigned.

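# `response` is the object returned by deployed_model.predict(PROMPT).
# _prediction_response is a private attribute, so its layout may change between SDK versions.
safety_attributes = response._prediction_response[0][0]["safetyAttributes"]
blocked = safety_attributes["blocked"]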

The important distinction is probability versus severity: each score estimates how likely the text is to belong to a category, not how harmful it is. A phrase can look risky in isolation while being harmless in context, so a production system should treat safety metadata as a signal to inspect, not as a complete risk model by itself.
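A small sketch of turning that metadata into a review signal follows. It assumes the `safetyAttributes` dictionary carries a `blocked` flag plus parallel `categories` and `scores` lists; the helper names, the 0.5 threshold, and the sample values are illustrative assumptions, not recommendations.

```python
def flag_categories(safety_attributes, threshold=0.5):
    """Return {category: score} for every score at or above the threshold."""
    categories = safety_attributes.get("categories", [])
    scores = safety_attributes.get("scores", [])
    return {c: s for c, s in zip(categories, scores) if s >= threshold}

def needs_review(safety_attributes, threshold=0.5):
    """A response is queued for review if it was blocked or any score is high."""
    blocked = bool(safety_attributes.get("blocked", False))
    return blocked or bool(flag_categories(safety_attributes, threshold))

# Hypothetical metadata, shaped as assumed above.
sample = {"blocked": False, "categories": ["Finance", "Insult"], "scores": [0.1, 0.7]}
flagged = flag_categories(sample)
```

The output is a prompt for a human or a downstream check to look closer, which matches treating the scores as likelihoods and not verdicts.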

Citation metadata helps inspect originality

The course also checks citationMetadata.

citations = response._prediction_response[0][0]["citationMetadata"]["citations"]

The notes do not build a full citation workflow, but the point is useful: response metadata can help detect whether generated text is leaning heavily on existing material.
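As a rough illustration of "leaning heavily on existing material", the sketch below estimates what fraction of a response is covered by citation spans. It assumes each citation is a dictionary with `startIndex` and `endIndex` character offsets and treats `endIndex` as exclusive; both points are assumptions to verify against real responses, and `cited_fraction` is a hypothetical helper.

```python
def cited_fraction(text, citations):
    """Return the share of characters in `text` covered by any citation span."""
    if not text:
        return 0.0
    covered = [False] * len(text)
    for citation in citations:
        # Clamp offsets so malformed spans cannot index outside the text.
        start = max(0, int(citation.get("startIndex", 0)))
        end = min(len(text), int(citation.get("endIndex", 0)))
        for i in range(start, end):
            covered[i] = True  # overlapping spans are counted once
    return sum(covered) / len(text)

# Hypothetical response text and overlapping citation spans.
generated = "abcdefghij"
spans = [{"startIndex": 0, "endIndex": 4}, {"startIndex": 2, "endIndex": 5}]
fraction = cited_fraction(generated, spans)
```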

5) Common pitfalls or lessons learned

Loading the full dataset into memory too early fails quickly. Push filtering and joins into BigQuery first.
Tuning data should include instructions, not only raw input/output pairs.
Training and production prompt formats must stay aligned.
Pipeline components should exchange artifact locations when data is large.
Safety attributes are useful, but they need interpretation and thresholds.
Artifact names should encode version or timestamp information so runs can be reproduced.
Deployment is not the end of LLMOps. Monitoring, evaluation, and safety review continue after release.

Final thoughts

The strongest practical idea in this course is that LLMOps is a lifecycle, not a deployment step. A good pipeline preserves control over data, prompt format, orchestration, tuned model selection, and response metadata.

If any of those parts are treated informally, the system becomes harder to debug and harder to trust in production.

Reference

This article is based on my personal study notes from the Cyber AI Security track.

Full repository: https://github.com/lameiro0x/cyber-ai-security-notes
