  • AI Agent Evaluation

    While working on yet another AI agent, I realized something interesting.
    In 2025, almost no one starts building an agent by asking the question:

    “How are we going to evaluate it?”

    Without systematic evaluation, we cannot determine:

    • whether the agent behaves consistently over time,
    • whether it answers the same questions in the same way,
    • or whether adding a new tool or a new dataset changes its behavior, and by how much (a minimal sketch of such a check follows this list).
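
    To make this concrete, here is a minimal sketch of such a drift check in Python. Everything in it is an assumption for illustration: run_agent stands in for whatever invokes the agent, and golden_answers.json is a frozen file of previously approved question/answer pairs. Exact string matching is a deliberately crude baseline; in practice you would swap in a semantic or rubric-based comparison.

        import json
        from pathlib import Path
        from typing import Callable

        def check_drift(run_agent: Callable[[str], str],
                        golden_path: str = "golden_answers.json") -> list[str]:
            """Re-ask every approved question and report which answers changed."""
            golden = json.loads(Path(golden_path).read_text())  # question -> approved answer
            drifted = [question for question, approved in golden.items()
                       if run_agent(question).strip() != approved.strip()]
            print(f"{len(golden) - len(drifted)}/{len(golden)} answers unchanged")
            return drifted

    Run it once before a change and once after; the difference between the two reports shows exactly which behaviors moved.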

    This is not a minor detail.
    This is critical, because without evaluation we have no idea what our change actually did to the agent’s behavior.

    And here is the core point:
    With new technologies, we should start by defining how we will measure their behavior.
    First the method of observing and reasoning, then development.

    Because evaluation in AI is not simply a matter of transferring testing practices from classical software development.

    A proper approach to evaluation allows us to actually understand how the agent works — and what real impact each change has: new tools, new datasets, new prompt chains.

    Additionally, I see the role of the engineer evolving.
    An agent engineer can no longer be only a “pipeline assembler.”
    They must talk to the people who previously performed these tasks manually: helpdesk staff, service teams, business domain experts.
    Without that, we cannot even define what should be evaluated, or why.
    It requires a shared language with the business, not just adding new logic nodes to the code.

    Without this, we are not building consciously.

    We are just guessing.

  • Hallucinations in Isolation: A Shared Mechanism in Minds and Models

    Introduction

    Language models hallucinate. We hear this phrase more and more often—sometimes with a smile, sometimes as a criticism, sometimes with indulgence. As if their flaw were a charming personality quirk. But what if it isn’t a flaw at all? What if it’s a sign of… kinship?

    As someone who has spent hundreds of hours on meditation retreats in silence and isolation, I know a thing or two about hallucinations. In my life, I’ve attended over twenty such retreats, most of them lasting ten days or more. Each day involved ten hours of meditation, no talking, no stimulation, just me and my mind. And that experience has taught me one thing:

    The mind hallucinates when it has nothing to hold on to. When it loses orientation. And language models today are in precisely that state.

    (more…)
  • How I Simplified a Complex Admin Panel

    A Case Study in Fast-Paced Product Environments


    The Challenge

    In one of my previous projects, I worked with a fast-scaling product team that operated in a dynamic, consumer-facing industry. Part of my responsibility was maintaining and developing features within an internal admin panel used by customer support agents. Over time, the panel grew organically as new features were added on the fly, and while that’s often necessary in high-velocity environments, it eventually led to friction in daily operations.

    Some of the recurring issues included:

    • Support staff constantly switching between multiple views, which increased the time needed to resolve cases.
    • Inconsistent UI structure — elements related to the same workflow were sometimes far apart, or scattered across different pages.
    • Each view had a slightly different layout or logic, making orientation harder.
    • There was no logical, user-defined way to jump between related sections without retracing steps or scrolling.

    The system wasn’t broken, but it had become inefficient — and it was clear something could be improved.

    (more…)
  • Improving Product Comparison with LLMs and Granular Indexing

    I recently published a case study on Siili’s blog, where I share insights from an AI-powered product advisor I built as a Proof of Concept. The goal? To let users simply describe what they need—in natural language—and get accurate, relevant product recommendations without sifting through complex specs or filters.

    In the article, I dive into data indexing strategies with Azure AI Search, controlling context to avoid hallucinations, and the challenge of balancing human-like communication with the precision we expect from AI.
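
    As a rough illustration of the granular-indexing idea, here is a minimal sketch using the azure-search-documents Python SDK: one search document per product attribute rather than one blob per product, so retrieval can return a single precise fact as context. The index name, field names, and sample product are assumptions for illustration, not necessarily the exact setup described in the article.

        from azure.core.credentials import AzureKeyCredential
        from azure.search.documents import SearchClient

        # Assumed index with fields: id (key), product_name, attribute, value.
        client = SearchClient(
            endpoint="https://<your-service>.search.windows.net",
            index_name="product-attributes",
            credential=AzureKeyCredential("<api-key>"),
        )

        product = {
            "id": "vac-042",
            "name": "Example Vacuum X1",
            "specs": {"noise_level": "58 dB", "weight": "2.4 kg"},
        }

        # One granular document per spec instead of one blob per product,
        # so a query like "quiet vacuum" can surface just the noise_level fact.
        docs = [
            {
                "id": f"{product['id']}-{attr}",
                "product_name": product["name"],
                "attribute": attr,
                "value": value,
            }
            for attr, value in product["specs"].items()
        ]
        client.upload_documents(documents=docs)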

    Read the full story here: https://www.siili.com/stories/ai-advisor-for-product-discovery-and-comparison