AI Agent Evaluation

While working on yet another AI agent, I realized something interesting.
In 2025, almost no one starts building an agent by asking the question:

“How are we going to evaluate it?”

Without systematic evaluation, we cannot determine:

  • whether the agent behaves consistently over time,
  • whether it answers the same questions in the same way,
  • or whether adding a new tool or a new dataset changes its behavior, and by how much.
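
Here is a minimal sketch of what the first two checks could look like in practice. The agent call, the sample questions, and the exact-string comparison are all placeholder assumptions; a real harness would compare answers with a semantic metric or an LLM judge rather than literal equality.

```python
# Minimal repeatability check: ask each question several times and see how
# stable the answers are. `run_agent` is a stub standing in for your agent.
from collections import Counter

QUESTIONS = [
    "How do I reset my password?",
    "What is the refund policy for annual plans?",
]

def run_agent(question: str) -> str:
    # Placeholder: replace with a real call to your agent. Hard-coded here
    # so the sketch runs end to end.
    return f"(stubbed answer for: {question})"

def consistency_report(questions: list[str], runs: int = 5) -> dict:
    """For each question, report how many distinct answers appeared and
    how often the most common answer was repeated."""
    report = {}
    for q in questions:
        answers = [run_agent(q) for _ in range(runs)]
        counts = Counter(answers)
        top_count = counts.most_common(1)[0][1]
        report[q] = {
            "distinct_answers": len(counts),
            "agreement_rate": top_count / runs,
        }
    return report

if __name__ == "__main__":
    for question, stats in consistency_report(QUESTIONS).items():
        print(question, stats)
```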

This is not a minor detail.
This is critical, because without evaluation we have no idea what exactly our change did to the agent's behavior.

And here is the core point:
With a new technology, we should start by defining how we will measure its behavior.
First the method of observing and reasoning, then development.

Because evaluation in AI is not simply a matter of transferring test practices from classical software development.

A proper approach to evaluation allows us to actually understand how the agent works — and what real impact each change has: new tools, new datasets, new prompt chains.
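
To make that concrete, here is a sketch of comparing two evaluation runs: the same test cases scored before and after a change. The case IDs and scores are invented for illustration; in practice the scores would come from whatever graders the evaluation uses (exact match, rubrics, an LLM judge).

```python
# Compare a baseline evaluation run against a candidate run and report
# which cases improved, which regressed, and the average score change.
baseline = {"case-001": 1.0, "case-002": 0.5, "case-003": 1.0}   # before the change
candidate = {"case-001": 1.0, "case-002": 1.0, "case-003": 0.5}  # after the change

def compare_runs(baseline: dict[str, float], candidate: dict[str, float]) -> dict:
    improved, regressed = [], []
    for case_id, old_score in baseline.items():
        new_score = candidate.get(case_id, 0.0)
        if new_score > old_score:
            improved.append(case_id)
        elif new_score < old_score:
            regressed.append(case_id)
    avg_delta = (sum(candidate.values()) / len(candidate)
                 - sum(baseline.values()) / len(baseline))
    return {"improved": improved, "regressed": regressed, "avg_delta": round(avg_delta, 3)}

if __name__ == "__main__":
    print(compare_runs(baseline, candidate))
    # Here the average score is unchanged, yet one case regressed outright:
    # exactly the kind of detail a single aggregate number would hide.
```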

Additionally, I see the role of the engineer evolving.
An agent engineer can no longer be only a “pipeline assembler.”
They must communicate with the people who previously performed these tasks manually: helpdesk, service teams, business domain experts.
Without this, we cannot even define what should be evaluated — and why.
It requires a shared language with the business, not just adding new logic nodes to the code.

Without this, we are not building consciously.

We are just guessing.
