Review of the Three Scoring Methods in Solace Agent Mesh Evaluations

When you run sam eval against a test suite, the framework can score every run with three different methods. Each one tells you something the others can’t, and the real value comes from reading them together. Let’s examine each method in turn: how to configure it, where it shines, and where it can mislead you if used in isolation.


The three methods at a glance

Method         | How it works                                                       | Best for
------------------------------------------------------------------------------------------------------------------------------------------------
Tool Match     | Checks whether the agent called the tools listed in expected_tools | Verifying correct tool usage and peer agent delegation
Response Match | ROUGE score comparing the actual response to expected_response     | Factual responses, extraction tasks
LLM Evaluator  | A separate LLM judges the full interaction against your criterion  | Holistic quality, complex orchestration

You enable any combination of them in your suite’s evaluation_settings:

"evaluation_settings": {
  "tool_match": { "enabled": true },
  "response_match": { "enabled": true },
  "llm_evaluator": {
    "enabled": true,
    "env": {
      "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/gemini-3-pro-preview",
      "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
      "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
    }
  }
}
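
As a quick sanity check before a run, the settings block can be read back to see which scorers are switched on. This is a minimal sketch, assuming the fragment above sits under an evaluation_settings key in a JSON suite file. The helper name enabled_methods and the inline SUITE_JSON sample are hypothetical and not part of sam eval.

```python
import json

# Hypothetical suite fragment mirroring the evaluation_settings shown above.
SUITE_JSON = """
{
  "evaluation_settings": {
    "tool_match": { "enabled": true },
    "response_match": { "enabled": false },
    "llm_evaluator": { "enabled": true, "env": {} }
  }
}
"""


def enabled_methods(suite: dict) -> list[str]:
    """Return the names of scoring methods whose "enabled" flag is true."""
    settings = suite.get("evaluation_settings", {})
    return [name for name, cfg in settings.items() if cfg.get("enabled", False)]


suite = json.loads(SUITE_JSON)
print(enabled_methods(suite))
```

With response_match disabled in the sample, only the other two method names are listed.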

1. Tool Match: did the agent actually do the work?

Tool Match checks whether the agent invoked the tools listed in your test case’s expected_tools field during a run. It’s binary at the per-tool level: either the tool was called, or it wasn’t. There’s no partial credit.

This includes peer agent delegations. When an orchestrator hands work to another agent, that delegation shows up as a tool call with a peer_ prefix. Tool Match treats those exactly like any other tool call, which makes it a powerful way to verify multi-agent orchestration paths.

Configuring Tool Match

In the test case:

{
  "test_case_id": "convert_pdf_to_md",
  "category": "Orchestration",
  "target_agent": "OrchestratorAgent",
  "query": "Please convert the attached PDF file to markdown using the Markitdown Agent.",
  "artifacts": [
    { "type": "file", "path": "artifacts/sample.pdf" }
  ],
  "wait_time": 120,
  "evaluation": {
    "expected_tools": ["peer_MarkitdownAgent"],
    "expected_response": "I have converted the PDF file to markdown and attached it.",
    "criterion": "Evaluate if the agent successfully uses the MarkitdownAgent to convert the PDF file to a markdown file and confirms task completion."
  }
}

peer_MarkitdownAgent is the orchestrator delegating to its peer. If the orchestrator never makes that delegation, Tool Match returns 0.

An empty list ("expected_tools": []) is valid and means no tool calls are expected, which is typical for conversational responses.
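
The check described above can be sketched in a few lines. This is an illustration, not the framework's actual code: the function name tool_match_score is hypothetical, and two behaviours are assumptions, namely that several expected tools are aggregated as the fraction found, and that an empty expected_tools list scores 1.0 (which is consistent with the hello_world results reported in this article).

```python
def tool_match_score(expected_tools: list[str], called_tools: list[str]) -> float:
    """Fraction of expected tools that appear at least once in the run."""
    expected = set(expected_tools)
    if not expected:
        # Assumption: nothing was expected, so there is nothing to miss.
        return 1.0
    called = set(called_tools)
    # Each expected tool is either present or absent; arguments and call
    # counts are ignored, matching the per-tool binary check described above.
    return len(expected & called) / len(expected)


# The orchestrator delegated to its peer: full score.
print(tool_match_score(["peer_MarkitdownAgent"], ["peer_MarkitdownAgent"]))
# The orchestrator answered without delegating: zero.
print(tool_match_score(["peer_MarkitdownAgent"], []))
```

Note that calling a tool fifteen times scores the same as calling it once, which is exactly the blind spot discussed later in this section.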

Why Tool Match catches things the other scores miss

Multi-hop agent orchestration introduces a specific failure mode: the agent produces a plausible-looking final response without actually doing the underlying work. Tool Match is the cleanest way to catch that.

Real example from a multi-model run:

Model                     | Test Case          | Tool Match | Response Match | LLM Eval
----------------------------------------------------------------------------------------
bedrock-claude-4-5-sonnet | hello_world        | 1.00       | 0.00           | 0.00
bedrock-claude-4-5-sonnet | convert_pdf_to_md  | 0.00       | 0.04           | 0.00

A Tool Match of 1.00 on hello_world is expected, since that test lists no tools. The 0.00 on the PDF test is the signal that matters: the orchestrator never delegated to peer_MarkitdownAgent, so the conversion never actually happened.

What Tool Match can’t tell you

Tool Match is binary and structural. It can’t tell you:

  • Whether the tool was called with the right arguments
  • Whether the tool’s output was actually used in the final response
  • Whether the agent called the tool once or fifteen times when once was enough

For those, you can use the LLM Evaluator.


2. Response Match: ROUGE for cheap factual sanity checking

Response Match uses ROUGE (Recall-Oriented Understudy for Gisting Evaluation) to measure word overlap between the agent’s actual response and the expected_response you defined. It’s fast, deterministic, and free to compute. No LLM calls, no API cost, no variance between runs.

Where ROUGE shines

Response Match is most useful for tests with a single, factual, deterministic answer:

  • Extraction tasks (pull this field out of this document)
  • Calculation results
  • Lookup tasks
  • Yes/no responses where the keyword is what matters

For the CSV filtering test where the right answer is “John Doe,” Response Match will reward responses that contain those words and penalize ones that don’t. Cheap and effective.
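
To make the word-overlap idea concrete, here is a minimal sketch of a unigram ROUGE score (ROUGE-1 F1) in plain Python. Which ROUGE variant, tokenizer, and averaging the framework uses is not stated here, so treat the function rouge1_f1 and its numbers as an illustration of the technique rather than a reproduction of Response Match. The example sentences are made up.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(expected: str, actual: str) -> float:
    """Unigram-overlap F1 between an expected and an actual response."""
    ref = Counter(tokenize(expected))
    cand = Counter(tokenize(actual))
    overlap = sum((ref & cand).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Both expected words are present, but four extra words lower precision.
print(round(rouge1_f1("John Doe", "The matching employee is John Doe."), 2))  # 0.5
# No shared words at all.
print(rouge1_f1("John Doe", "No employees matched the filter."))  # 0.0
```

Even a correct answer scores only 0.5 in this sketch once it is wrapped in a full sentence, which is why short expected responses and verbose agents tend to produce modest scores.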

Where ROUGE will lie to you

Real example. The hello_world test has this expected response:

"Hello! How can I help you today?"

GPT-4.1 replied with:

"Hello! How can I assist you today?"

Response Match: 0.53. The LLM Evaluator score for the same response: 0.67. The swap from help to assist tanked ROUGE because it measures word overlap, not meaning. A correct response, phrased differently, gets punished.

Another example: the convert_pdf_to_md test often produces a Response Match around 0.25 (because the agent’s confirmation wording differs from the expected text) but an LLM Evaluator score of 1.00, because the judge can read the message trace and see that the conversion actually happened.

Warning: Low Response Match doesn’t always mean failure
A 0.25 Response Match alongside a 1.00 LLM Evaluator score is normal. The agent returned a confirmation message worded differently from your expected_response, but the LLM judge correctly recognized that the task completed.

How to use Response Match well

  • Write expected_response to capture the spirit of the answer, not the exact phrasing.
  • Always pair Response Match with the LLM Evaluator. ROUGE alone is a misleading signal on free-form responses.
  • Treat anything below 0.5 as a prompt to look at the LLM Evaluator score before drawing conclusions.
  • Use it as a regression detector: once you have a working baseline, sudden drops are useful early warnings.

3. LLM Evaluator: an LLM-as-judge for meaning and quality

The LLM Evaluator is your “LLM as judge” scorer. After the agent under test produces its response, the framework sends the full interaction (user query, tool calls, intermediate messages, final response) to a separate LLM along with a criterion written in plain English. The judge returns a score between 0.0 and 1.0 plus written reasoning.

Two things make it worth the API cost:

  1. It evaluates meaning, not word overlap.
  2. It produces reasoning, which you can read in the results file when a score surprises you.
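
The judge round trip can be pictured roughly as follows. This is a sketch under stated assumptions: the prompt layout, the JSON reply format, and the function names build_judge_prompt and parse_judge_reply are all hypothetical, and the framework's real prompt and parsing are not shown in this article. No model is called; the reply string is a stand-in.

```python
import json


def build_judge_prompt(query: str, tool_calls: list[str],
                       response: str, criterion: str) -> str:
    """Assemble a plain-text prompt for the judge model (hypothetical layout)."""
    calls = "\n".join(f"- {name}" for name in tool_calls) or "- (none)"
    return (
        f"Criterion: {criterion}\n\n"
        f"User query: {query}\n\n"
        f"Tool calls:\n{calls}\n\n"
        f"Final response: {response}\n\n"
        'Reply with JSON: {"score": <0.0-1.0>, "reasoning": "<text>"}'
    )


def parse_judge_reply(reply: str) -> tuple[float, str]:
    """Extract the score (clamped to 0.0-1.0) and the reasoning text."""
    data = json.loads(reply)
    score = min(1.0, max(0.0, float(data["score"])))
    return score, str(data.get("reasoning", ""))


prompt = build_judge_prompt(
    query="Which employees match the filter?",
    tool_calls=["extract_content_from_artifact"],
    response="John Doe",
    criterion="Evaluate if the agent correctly filters the CSV data.",
)
# Stand-in for what a judge model might return.
reply = '{"score": 0.6, "reasoning": "Correct answer, but the first sentence is wrong."}'
score, reasoning = parse_judge_reply(reply)
print(score, reasoning)
```

Clamping the score is a defensive choice in this sketch, since a judge model can occasionally return a value outside the requested range.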

Configuring the LLM Evaluator

The test case’s criterion is the most important field. It’s the rubric:

"criterion": "Evaluate if the agent correctly filters the CSV data."

In the suite config, point the evaluator at a model:

"llm_evaluator": {
  "enabled": true,
  "env": {
    "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/gemini-3-pro-preview",
    "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
    "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
  }
}

Important: The judge should be a different model from the ones being tested
If you’re comparing Claude Sonnet against Claude Opus and your judge is Claude, you’ve introduced bias. Pick a stable, strong model that isn’t in your llm_models list. Gemini 2.5 Pro and GPT-4o are both reasonable picks.

Why this score catches what the others miss

For filter_csv_employees_by_age_and_country, GPT-4.1 averaged 0.30 on the LLM Evaluator across three runs. Looking only at Response Match (which was higher), you might have assumed the agent was producing roughly correct text. The judge’s reasoning on the run that scored 0.6 tells the actual story:

“The agent ultimately provides the correct answer (John Doe). However, the response is confusing and initially incorrect. The agent’s first sentence is factually wrong… it then contradicts itself…”

The agent eventually got to the right answer, but only after stating a wrong one. That’s a fail. ROUGE couldn’t catch that. Tool Match couldn’t catch that. The LLM Evaluator did.

Tips for clean LLM Evaluator scores

  • Write specific criteria. “Evaluate if the agent answered correctly” is too vague. “Evaluate if the agent correctly filtered the CSV by age and country and returned only the matching names” is much better.
  • Run at least 3 times. LLM judges have variance. A single judge call is noisy. Configure runs: 3 (or higher) so you can read distributions.
  • Track variance, not just means. A 1.0, 1.0, 0.0 distribution averages to 0.67 but the 0.0 run is the real story. The results.json file stores min, Q1, median, Q3, and max.
  • Don’t let the judge be one of the models under test. It biases the comparison.
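
The point about distributions is easy to check by hand. The sketch below summarizes a list of per-run judge scores with Python's standard statistics module. The function name summarize is hypothetical, and the quartile method used here ("inclusive") is an assumption, since how results.json computes its quartiles is not described in this article.

```python
import statistics


def summarize(scores: list[float]) -> dict[str, float]:
    """Mean plus a five-number summary for a list of per-run scores."""
    # statistics.quantiles needs at least two data points on older Pythons.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "mean": statistics.mean(scores),
        "min": min(scores),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(scores),
    }


summary = summarize([1.0, 1.0, 0.0])
print(round(summary["mean"], 2))          # 0.67
print(summary["min"], summary["median"])  # 0.0 1.0
```

The mean alone looks acceptable, while the minimum shows that one run in three failed outright.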

Reading the three together

The real value of the framework comes from reading the three scores side by side. Each catches a different category of failure.

Model    | Test Case                               | Tool Match | Response Match | LLM Eval
-------------------------------------------------------------------------------------------
gpt-4-1  | filter_csv_employees_by_age_and_country | 0.00       | 0.34           | 0.30
gpt-4-1  | hello_world                             | 1.00       | 0.53           | 0.67
gpt-4-1  | convert_pdf_to_md                       | 1.00       | 0.25           | 1.00

A few things to notice:

  • hello_world: Response Match (0.53) lower than LLM Eval (0.67). Classic ROUGE-vs-meaning gap. Wording differs, meaning is fine.
  • convert_pdf_to_md: Response Match (0.25) much lower than LLM Eval (1.00). The agent did the job but worded its confirmation differently. Not a problem.
  • filter_csv_...: Tool Match is 0.00. The agent never called extract_content_from_artifact. Whatever the LLM Eval reasoning says, this test failed at the structural level.

A rough rule of thumb when reading results:

  • Tool Match low → the agent isn’t doing the work. Investigate first.
  • Response Match low, LLM Eval high → fine. The agent reworded a correct response.
  • Response Match high, LLM Eval low → suspicious. The agent used the right words but the LLM judge spotted a problem in reasoning or completeness.
  • All three low → broken.
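
The rule of thumb above can be written down as a small triage helper. This is only a sketch: the function name triage and the cut-offs are assumptions chosen for illustration (Tool Match counts as low when it is below 1.0, and the other two scores count as low below 0.5), so adjust them to your own thresholds. The case where both content scores are low but Tool Match is fine is not covered by the rule of thumb and is flagged for review here.

```python
def triage(tool_match: float, response_match: float, llm_eval: float,
           low: float = 0.5) -> str:
    """Map the three scores to a rough diagnosis (thresholds are assumptions)."""
    tool_low = tool_match < 1.0
    resp_low = response_match < low
    llm_low = llm_eval < low

    if tool_low and resp_low and llm_low:
        return "broken"
    if tool_low:
        return "investigate tool usage first"
    if resp_low and not llm_low:
        return "fine: reworded correct response"
    if llm_low and not resp_low:
        return "suspicious: right words, judge found a problem"
    if resp_low and llm_low:
        # Not covered by the rule of thumb above; flagged for manual review.
        return "review: both content scores low"
    return "ok"


# The three gpt-4-1 rows from the table above.
print(triage(0.00, 0.34, 0.30))  # broken
print(triage(1.00, 0.53, 0.67))  # ok
print(triage(1.00, 0.25, 1.00))  # fine: reworded correct response
```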

Note: What “good enough” looks like
For tool-dependent tasks, aim for Tool Match 1.00 consistently. For LLM Eval, a score above 0.8 averaged across 3 runs is a reasonable production bar for most tasks (set your own thresholds based on criticality). For Response Match, treat anything below 0.5 as a prompt to check the LLM Eval score before drawing conclusions.

Example test cases live in the Solace Agent Mesh GitHub repo at tests/evaluation/test_cases/.