While working on yet another AI agent, I realized something interesting.
In 2025, almost no one starts building an agent by asking the question:
“How are we going to evaluate it?”
Without systematic evaluation, we cannot determine:
- whether the agent behaves consistently over time,
- whether it answers the same questions in the same way,
- or whether adding a new tool or dataset changes its behavior, and by how much (a minimal check of the first two points is sketched below).
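The sketch below is only illustrative: `run_agent` is a placeholder for whatever entry point your agent exposes, and exact string matching stands in for whatever comparison actually fits your domain (semantic similarity, an LLM judge, rule-based checks). It simply runs the agent several times on the same questions and reports how often the answers agree.

```python
# Minimal repeatability check: run the agent several times on the same
# questions and report how often the answers agree exactly.
from collections import Counter

QUESTIONS = [
    "How do I reset my VPN password?",
    "Which plan includes priority support?",
]

def run_agent(question: str) -> str:
    # Placeholder: replace with a real call to your agent.
    return f"stub answer for: {question}"

def consistency_report(questions: list[str], runs: int = 5) -> None:
    for question in questions:
        answers = [run_agent(question) for _ in range(runs)]
        _, agreeing = Counter(answers).most_common(1)[0]
        print(f"{question!r}: {agreeing}/{runs} runs gave the same answer")

if __name__ == "__main__":
    consistency_report(QUESTIONS)
```

Even a check this crude already gives you a number to watch when something changes.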
This is not a minor detail.
It is critical, because without evaluation we have no idea what our changes actually did to the agent's behavior.
And here is the core point:
With new technologies, we should start by defining how we will measure their behavior.
First the method of observing and reasoning about the system, then the development.
Because evaluation in AI is not simply a transfer of test practices from classical software development.
A proper approach to evaluation allows us to actually understand how the agent works — and what real impact each change has: new tools, new datasets, new prompt chains.
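As a sketch of what measuring the real impact of a change could look like in practice: store the answers of the current agent version as a baseline, then rerun the same questions after adding a tool, dataset, or prompt change, and diff the results. The file name, baseline format, and `run_agent` below are assumptions rather than a prescribed setup, and exact string comparison would normally be replaced by a semantic or judge-based comparison.

```python
# Minimal before/after comparison against a stored baseline.
# Baseline format (assumed): a JSON object mapping question -> answer.
import json
from pathlib import Path

BASELINE_PATH = Path("eval_baseline.json")

def run_agent(question: str) -> str:
    # Placeholder: replace with a real call to the current agent version.
    return f"stub answer for: {question}"

def compare_to_baseline(path: Path = BASELINE_PATH) -> None:
    baseline = json.loads(path.read_text())
    changed = [q for q, old in baseline.items() if run_agent(q) != old]
    for question in changed:
        print(f"CHANGED: {question!r}")
    print(f"{len(changed)}/{len(baseline)} answers differ from the baseline")

if __name__ == "__main__":
    compare_to_baseline()
```

The point is not the script itself but the habit: every new tool, dataset, or prompt chain gets compared against a known reference instead of being judged by a few manual spot checks.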
Additionally, I see the role of the engineer evolving.
An agent engineer can no longer be only a “pipeline assembler.”
They must communicate with the people who previously performed these tasks manually: helpdesk, service teams, business domain experts.
Without this, we cannot even define what should be evaluated — and why.
This requires a shared language with the business, not just adding new logic nodes to the code.
Without this, we are not building deliberately.
We are just guessing.
