  • AI Agent Evaluation

    While working on yet another AI agent, I realized something interesting.
    In 2025, almost no one starts building an agent by asking the question:

    “How are we going to evaluate it?”

    Without systematic evaluation, we cannot determine:

    • whether the agent behaves consistently over time,
    • whether it answers the same questions in the same way,
    • or whether adding a new tool or a new dataset changes its behavior, and by how much (a minimal check is sketched below).

    This is not a minor detail.
    It is critical: without evaluation, we cannot tell what effect a change actually had.
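
    To make this concrete, here is a minimal sketch of such a check in Python. It assumes a hypothetical run_agent() entry point and an eval_set.json file of question/reference pairs; the token-overlap score is a deliberately crude stand-in for whatever metric fits your task.

    ```python
    import json
    import statistics
    from pathlib import Path

    def run_agent(question: str) -> str:
        # Hypothetical entry point for the agent under test; wire this up yourself.
        raise NotImplementedError

    def score(answer: str, reference: str) -> float:
        # Crude token-overlap score in [0, 1]; swap in an LLM judge
        # or a task-specific metric for real use.
        a, r = set(answer.lower().split()), set(reference.lower().split())
        return len(a & r) / len(a | r) if a | r else 1.0

    def evaluate(dataset: list[dict], runs: int = 3) -> dict:
        # Ask every question several times: we want to surface inconsistency,
        # not just average quality.
        report = {}
        for case in dataset:
            scores = [score(run_agent(case["question"]), case["reference"])
                      for _ in range(runs)]
            report[case["question"]] = {
                "mean": statistics.mean(scores),
                # High spread means the agent answers the same question differently.
                "spread": max(scores) - min(scores),
            }
        return report

    if __name__ == "__main__":
        dataset = json.loads(Path("eval_set.json").read_text())
        for question, stats in evaluate(dataset).items():
            print(f"{stats['mean']:.2f}  spread={stats['spread']:.2f}  {question[:60]}")
    ```

    Stored per agent version, such reports become the baseline that every later change is measured against.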

    And here is the core point:
    with a new technology, we should start by defining how we will measure its behavior.
    First the method of observing and reasoning, then the development.

    Because evaluation in AI is not simply a transfer of testing practices from classical software development: a classical test asserts one deterministic output, while an agent's answers vary from run to run, so we measure distributions of behavior instead.

    A proper approach to evaluation lets us actually understand how the agent works, and what real impact each change has: new tools, new datasets, new prompt chains.
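
    Measuring that impact can then be as simple as diffing two reports produced by the evaluate() sketch above, one per agent version. The 0.05 threshold is an arbitrary placeholder, not a recommendation:

    ```python
    def diff_reports(baseline: dict, candidate: dict) -> None:
        # Per-question score deltas make a change's impact visible question
        # by question, instead of hiding it in one blended number.
        for question, before in baseline.items():
            delta = candidate[question]["mean"] - before["mean"]
            flag = "REGRESSION" if delta < -0.05 else ""
            print(f"{delta:+.2f} {flag:>10} {question[:60]}")
    ```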

    Additionally, I see the role of the engineer evolving.
    An agent engineer can no longer be only a “pipeline assembler.”
    They must talk to the people who previously did this work by hand: helpdesk, service teams, business domain experts.
    Without that, we cannot even define what should be evaluated, or why.
    It requires a shared language with the business, not just adding new logic nodes to the code.
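
    In practice, that shared language can crystallize into eval cases written together with the domain experts. A hypothetical example, with illustrative field names rather than any fixed schema:

    ```python
    # The expert supplies the question and the acceptance criteria;
    # the engineer only encodes them.
    eval_case = {
        "question": "A customer reports a double charge on invoice 1042. What do we do?",
        "reference": "Verify the duplicate in billing, refund the second charge, "
                     "and confirm with the customer within 24 hours.",
        "source": "helpdesk runbook, refunds section",  # who owns this expectation
    }
    ```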

    Without this, we are not building deliberately.

    We are just guessing.