You evaluate an AI agent by measuring it against an outcome agreed up front — task success, tool-call fidelity, cost, and latency on real data, run repeatably — rather than by reading a few outputs and deciding they look right. The standard we hold to is that an agent should ship measured, not asserted: if you cannot put a defensible number on it, you have not evaluated it.
Evaluation is also the part most teams skip. A demo proves an agent can work once; evaluation tells you whether it works reliably enough to run a real workflow every day, on inputs nobody hand-picked. Closing that distance (evaluation, tool-use fidelity, retrieval that holds up on real data) is most of the work of getting to production.
Why agents are harder to evaluate than ordinary software
Conventional software is deterministic: the same input gives the same output, and a test asserts it. An agent breaks both assumptions. Its output varies run to run, and many of the tasks worth giving an agent — researching, drafting, classifying open-ended inputs — have no single correct answer. You cannot write assertEquals against a well-written paragraph.
So agent evaluation works differently. Where a known-correct answer exists, you check against it. Where it does not, you score against checkable properties instead: did the agent cite its sources, did the output match the required structure, did it stay inside the rules it was given, did its tool calls validate? These property checks can be automated, which is what lets you evaluate an agent that produces open-ended work without a human grading every run.
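As a rough illustration of property checks, here is a minimal Python sketch. It assumes, purely for the example, that the agent returns JSON with a `summary` string and a `sources` list of URLs, and that the rules it must follow can be expressed as a list of banned phrases. The schema, the function name `check_properties`, and the rule format are all hypothetical; real checks would follow whatever structure and rules you have agreed for your own agent.

```python
import json

# Hypothetical output schema for this sketch: a summary plus a list of source URLs.
REQUIRED_KEYS = {"summary", "sources"}


def check_properties(raw_output: str, banned_phrases: list[str]) -> dict[str, bool]:
    """Score one agent output against checkable properties rather than one right answer."""
    try:
        doc = json.loads(raw_output)
    except json.JSONDecodeError:
        doc = None

    # Property 1: the output matches the required structure.
    if not (isinstance(doc, dict) and REQUIRED_KEYS.issubset(doc)):
        return {"structure": False, "cites_sources": False, "follows_rules": False}

    # Property 2: the agent cited at least one source, and each looks like a URL.
    sources = doc["sources"]
    cites_sources = (
        isinstance(sources, list)
        and len(sources) > 0
        and all(isinstance(s, str) and s.startswith(("http://", "https://")) for s in sources)
    )

    # Property 3: the output stays inside the rules (here, no banned phrases).
    summary = str(doc["summary"]).lower()
    follows_rules = not any(phrase.lower() in summary for phrase in banned_phrases)

    return {"structure": True, "cites_sources": cites_sources, "follows_rules": follows_rules}
```

Each property is a plain boolean, so the same function can run over every output in an evaluation set and be aggregated into pass rates without a human reading each one.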
What good evaluation actually measures
Four signals carry the weight, and the discipline is in weighing them together rather than optimising one.
- Task success against the outcome you defined first — by exact check where possible, by property check where not.
- Tool-call fidelity — the right tool, valid inputs, valid outputs — because an agent earns its keep by calling real systems, and a typed tool layer makes this directly measurable.
- Cost per task, treated as a first-class metric. An accurate agent that is too expensive to run is not production-ready, and rising cost is a regression even when the answers still look fine.
- Latency, per task and per step, because a correct answer that arrives too late fails the workflow it was meant to serve.
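One way to keep the four signals side by side is to record them per run and summarise them together. The sketch below is illustrative only: the `RunRecord` fields, the metric names, and the choice of a nearest-rank 95th percentile for latency are assumptions made for the example, not a prescribed format.

```python
import math
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One evaluated agent run (hypothetical record shape)."""
    success: bool            # task met the agreed outcome
    valid_tool_calls: int    # tool calls that passed input/output validation
    total_tool_calls: int
    cost_usd: float
    latency_s: float


def summarise(runs: list[RunRecord]) -> dict[str, float]:
    """Report all four signals together so none is optimised in isolation."""
    if not runs:
        raise ValueError("need at least one run to summarise")
    n = len(runs)
    total_calls = sum(r.total_tool_calls for r in runs)
    valid_calls = sum(r.valid_tool_calls for r in runs)
    latencies = sorted(r.latency_s for r in runs)
    p95_index = math.ceil(0.95 * n) - 1  # nearest-rank percentile
    return {
        "success_rate": sum(r.success for r in runs) / n,
        # With no tool calls at all there is nothing to get wrong, so report 1.0.
        "tool_fidelity": valid_calls / total_calls if total_calls else 1.0,
        "cost_per_task_usd": sum(r.cost_usd for r in runs) / n,
        "p95_latency_s": latencies[p95_index],
    }
```

Returning one dictionary rather than four separate reports is a deliberate choice here: a change that improves success rate while doubling cost shows up in the same place.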
What ties the four together is sequence: you agree the measure before you build, not after. Score an agent against a target nobody signed off on and you have not evaluated it — you have justified it after the fact. So we start by agreeing the number, then prove we moved it. The measure is the contract.
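To make "the measure is the contract" concrete, the agreed targets can be written down as data before any agent code exists and then checked mechanically. This is a minimal sketch under assumptions: the `Target` type, the metric names, and the thresholds in the usage example are hypothetical placeholders, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One agreed-up-front target (hypothetical shape)."""
    metric: str
    threshold: float
    higher_is_better: bool


def unmet_targets(metrics: dict[str, float], targets: list[Target]) -> list[str]:
    """Return a description of every agreed target the measured metrics fail to meet."""
    failures = []
    for t in targets:
        value = metrics.get(t.metric)
        if value is None:
            # A target with no measurement counts as unmet, not as a pass.
            failures.append(f"{t.metric}: not measured")
        elif t.higher_is_better and value < t.threshold:
            failures.append(f"{t.metric}: {value} is below {t.threshold}")
        elif not t.higher_is_better and value > t.threshold:
            failures.append(f"{t.metric}: {value} is above {t.threshold}")
    return failures
```

For example, with illustrative numbers:

```python
targets = [
    Target("success_rate", 0.90, higher_is_better=True),
    Target("cost_per_task_usd", 0.05, higher_is_better=False),
]
print(unmet_targets({"success_rate": 0.93, "cost_per_task_usd": 0.08}, targets))
```

Treating an unmeasured metric as a failure keeps the contract honest: a target nobody measured has not been met.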
Evaluation is continuous, not a launch gate
The common mistake is to treat evaluation as a one-time check before go-live. It is not. An agent that passed last month can quietly degrade: the underlying data shifts and retrieval quality drops, a model or prompt change alters behaviour, a tool’s API changes shape and fidelity falls. Evaluation has to be repeatable so you can re-run it on every change, and it has to be paired with observability so a live agent stays watched in production. Evaluation sets the bar; observability tells you when a deployed agent slips below it.
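A repeatable evaluation makes regression checks straightforward: re-run the same evaluation set after each change and compare against a stored baseline. The sketch below assumes metrics are kept as simple name-to-number dictionaries and flags any relative move in the wrong direction beyond a tolerance. The function name, the metric names, and the 5% default tolerance are illustrative assumptions.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    lower_is_better: set[str],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return metrics that moved in the wrong direction by more than `tolerance` (relative)."""
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is None or old == 0:
            # Skip metrics that are missing or have no usable baseline in this sketch.
            continue
        change = (new - old) / old
        if name in lower_is_better:
            worse = change > tolerance      # cost or latency went up
        else:
            worse = change < -tolerance     # success or fidelity went down
        if worse:
            regressions[name] = change
    return regressions
```

Under these assumptions, a run where success rate barely moves but cost per task rises by 30% is still reported as a regression, which matches the point that rising cost counts even when the answers look fine.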
Our point of view
At Agent Foundry Labs, evaluation is one of the composable layers every agent runs on (a built-in eval harness, not a separate testing phase), and it is the layer that most directly carries our “not demos” thesis. You can see it in our in-house outreach engine, a compliance-first research agent we built and run ourselves. From one declarative profile definition it researched a batch of leads and fully drafted a message for each, with hard compliance rules enforced on every message and zero account risk. Because the agent was measured and traced rather than trusted on faith, we could engineer its running cost down materially while holding the compliance line. That is what evaluation buys you: not a one-off result, but the ability to keep improving an agent without losing track of whether it still works.
There is a simple test for any agent in your business: can you say, with a number, how often it succeeds, what it costs to run, and how you would know if it got worse? If not, the evaluation gap is the first thing worth closing. Book a 30-minute call and we will start by agreeing the measure.