You Should Be Running Evaluations on Your Solace Agent Mesh Agents

Why and When You Should Be Running Evaluations on Your Solace Agent Mesh Agents

You’ve built an agent in Solace Agent Mesh. It seems to work. Someone on your team said “looks good” in a Slack thread. Are you ready to ship it?

Probably not, and this post is about why. Specifically, it’s about when and why you should be running evaluations (evals) on your agents before you put them in front of users, and what kinds of failures evals exist to catch.


What is an evaluation?

An evaluation, or eval, is a structured, repeatable test of an AI system’s behavior against defined expectations. It’s the AI equivalent of an integration test. Instead of asserting that a function returns true, you’re asserting that an agent did the right thing given a realistic input.

At its core, an eval answers a single question:

Given this input, does the agent behave the way I expect?

That might mean:

  • Did the agent respond with a greeting when I said hello?
  • Did it call the right tool to process a file?
  • Did it delegate to the right peer agent to complete a task?
  • Was the final response accurate and complete?

Solace Agent Mesh has an evaluation framework built into its CLI. You write JSON test cases, group them into a suite, and run sam eval. The framework runs your tests against real agents connected via a real broker (no mocked simulations) and scores the results.
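To make the workflow concrete, here is a rough, hypothetical sketch of a suite: a config that points at a set of test-case files. The field names below are assumptions for illustration, not the real schema; the evaluation docs linked at the end of this post have the actual format.

{
  "_comment": "illustrative sketch only; field names are assumptions, see the evaluation docs for the real schema",
  "suite_name": "orchestrator_smoke_tests",
  "test_cases": [
    "test_cases/hello_world.json",
    "test_cases/file_processing.json"
  ]
}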


Why agents need evals, not just traditional unit tests

AI agents fail differently from traditional software. There’s no stack trace when an agent gives a mediocre answer. No alarm fires when a model update quietly changes how your agent reasons about a task. The kinds of failures you’re up against:

Silent degradation. Model updates, tool changes, and config drift can all shift agent behavior without breaking anything in an obvious way. The system “still works.” It just works worse.

Unmeasured variance. LLMs are probabilistic. The same prompt produces different outputs across runs. Without evals, you have no way to distinguish acceptable variance from dangerous variance (different tool calls, wrong decisions, dropped steps).

Multi-hop complexity. Enterprise agents invoke tools, delegate to peer agents, and process artifacts. Every hop is a potential failure point. Manual testing scales poorly across this surface area.

No shared baseline. Without a repeatable test, “it works” means something different to every person on your team. Evals give everyone the same measuring stick.

Evals are how you move from vibes to evidence.


When to start running evals

The honest answer: earlier than you think. Here are the moments where evals start paying off.

As soon as you have one working agent

The very first eval to run is the canary: does the agent respond to a basic greeting? It sounds trivial, but if hello_world fails, something is fundamentally broken with your setup, and you want to know that before any of the more complex tests muddy the picture.

{
  "test_case_id": "hello_world",
  "category": "Content Generation",
  "query": "Hello, world!",
  "target_agent": "OrchestratorAgent",
  "wait_time": 30,
  "evaluation": {
    "expected_tools": [],
    "expected_response": "Hello! How can I help you today?",
    "criterion": "Evaluate if the agent provides a standard greeting."
  }
}
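Save that as a test-case file, add it to a suite, and run sam eval against it. The exact invocation and suite layout are covered in the evaluation docs linked at the end of this post.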

Before changing anything that affects behavior

Any time you modify agent instructions, swap out a model, add a tool, or change how an agent delegates, your evals should be the gate. A drop in tool match is a reliable early warning that your instruction changes broke the agent’s reasoning. A drop in LLM eval scores points to response quality degradation. Catching that pre-deploy is much cheaper than catching it post-deploy.
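To make that concrete, here is a sketch of a tool-focused test case reusing the fields from the hello_world example above. The tool name is made up for illustration, and expected_response is omitted on the assumption it’s optional; check the docs for the exact schema.

{
  "_comment": "illustrative sketch; the tool name is hypothetical",
  "test_case_id": "summarize_report",
  "category": "Tool Usage",
  "query": "Summarize the attached quarterly report.",
  "target_agent": "OrchestratorAgent",
  "wait_time": 60,
  "evaluation": {
    "expected_tools": ["summarize_document"],
    "criterion": "Evaluate if the agent calls the document summarization tool and returns an accurate summary."
  }
}

If an instruction change breaks the agent’s reasoning, this is where it shows up: the expected tool stops being called, and the tool match score drops.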

When you’re choosing between models

Model choice is rarely a one-time decision. Models get updated, costs change, new models ship. The Solace Agent Mesh eval framework can run your full suite against multiple models you list in the suite config and render a side-by-side comparison in the report. This is one of the strongest reasons to invest in evals early: you can make model decisions based on numbers, not vibes.
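Building on the hypothetical suite sketch earlier, listing more than one model might look something like this; the field name and model identifiers are assumptions, not the real schema.

{
  "_comment": "illustrative only; check the suite config schema in the docs",
  "models": [
    "provider-a/model-x",
    "provider-b/model-y"
  ]
}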

When a model provider pushes an update

Models change under you. If your provider pushes a new version of a model you’re running in production, you don’t actually know it still does what you need until you re-run your suite. We’ve seen real cases where a new model greets correctly but completely fails at tool delegation. Without evals, you wouldn’t know that until a user complains.

Before any production deploy, on a cadence

Once you have a stable suite, run it consistently: after every agent config change, after a model version update, and ideally as part of a CI pipeline. The point of evals isn’t a one-time check. It’s an ongoing baseline.


What evals are not

A few things worth being clear about.

Evals are not unit tests. They run real requests through real agents on a real broker. They’re closer to integration tests in spirit, and they take real time and (depending on your scoring methods) real LLM API budget to run.

Evals are not a substitute for human review. A high LLM evaluator score on a complex task doesn’t mean your agent is producing customer-ready output. It means it’s behaving consistently against the criterion you wrote. Read your results.

Evals are not just for production agents. They’re useful from day one. The earlier you start writing them, the more valuable they get, because each test case becomes part of your regression suite as your agents evolve.

If you want to skip ahead, the framework lives in the Solace Agent Mesh repo at tests/evaluation/, and the documentation is at solacelabs.github.io/solace-agent-mesh/docs/documentation/developing/evaluations.

Start with hello_world. It will give you an idea of how eval test cases and test suites work.
