Gen AI Evaluation Service - Agent Metrics


In my previous Gen AI Evaluation Service - Tool-Use Metrics post, we talked about LLMs calling external tools and how you can use tool-use metrics to evaluate how good those tool calls are. In today’s fifth post of my Vertex AI Gen AI Evaluation Service blog post series, we will talk about a related topic: agents and agent metrics.

What are agents?

There are many definitions of an agent, but an agent is essentially a piece of software that acts autonomously to achieve specific goals. Agents use LLMs to reason about tasks, call external tools, coordinate with other agents, and ultimately produce a response for the user.

An agent is similar to an LLM using a tool, but it tends to use multiple tools and adds reasoning logic to coordinate those tools (and sometimes other agents) to produce a response. As such, both the final response and the path of tool calls taken to reach it are important to measure in evaluations.

List of metrics

You have the following options to evaluate your agent:

  • Response evaluation: Evaluate the final output of an agent (whether or not the agent achieved its goal).
  • Trajectory evaluation: Evaluate the path (sequence of tool calls) the agent took to reach the final response.

We’ll get to response evaluation later, but these are the out-of-the-box trajectory evaluation metrics in the Gen AI evaluation service:

  • trajectory_exact_match measures whether the predicted trajectory is identical to the reference trajectory.
  • trajectory_in_order_match measures whether the predicted trajectory contains all the tool calls from the reference trajectory in the same order, and may also have extra tool calls.
  • trajectory_any_order_match measures whether the predicted trajectory contains all the tool calls from the reference trajectory in any order; the predicted trajectory may also contain extra tool calls.
  • trajectory_precision measures how many of the tool calls in the predicted trajectory are actually relevant or correct according to the reference trajectory.
  • trajectory_recall measures how many of the essential tool calls from the reference trajectory are actually captured in the predicted trajectory.
  • trajectory_single_tool_use measures whether a specific tool that is specified in the metric spec is used in the predicted trajectory. It doesn’t check the order of tool calls or how many times the tool is used, just whether it’s present or not.

Precision and recall return float values between 0 and 1, while all other metrics return either 0 (no match) or 1 (match).
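
To make precision and recall more concrete, here’s a rough sketch of the idea computed over tool names only (my own illustration, not the service’s actual implementation):

def trajectory_precision_recall(predicted, reference):
    predicted_names = [call["tool_name"] for call in predicted]
    reference_names = [call["tool_name"] for call in reference]
    # Precision: fraction of predicted tool calls that appear in the reference
    precision = sum(name in reference_names for name in predicted_names) / len(predicted_names)
    # Recall: fraction of reference tool calls that appear in the prediction
    recall = sum(name in predicted_names for name in reference_names) / len(reference_names)
    return precision, recall

# One extra, irrelevant tool call in the predicted trajectory:
predicted = [{"tool_name": "loc_to_lat_long"}, {"tool_name": "get_time"}, {"tool_name": "lat_long_to_weather"}]
reference = [{"tool_name": "loc_to_lat_long"}, {"tool_name": "lat_long_to_weather"}]
print(trajectory_precision_recall(predicted, reference))  # (~0.67, 1.0): precision penalized, recall perfect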

You can choose to evaluate just the response, just the trajectory, or (more realistically) both. You can also choose to run the evaluation with saved responses from the agent or by calling the agent live.

Let’s take a look at some concrete examples.

Response evaluation

Response evaluation for agents is no different from the model-based pointwise evaluation of LLM responses we already discussed in my Gen AI Evaluation Service - Model-Based Metrics post. It involves picking a metric (e.g. pointwise fluency), defining prompts and responses, and running an evaluation.

Take a look at response_model_based.py for details on how to achieve that.
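
If you want a quick sense of what that looks like inline, here’s a minimal sketch assuming the vertexai.evaluation module and its built-in pointwise fluency template; the project, prompt, response, and experiment name are placeholders:

import pandas
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

vertexai.init(project="your-project-id", location="us-central1")

# Saved agent responses to evaluate
eval_dataset = pandas.DataFrame(
    {
        "prompt": ["What's the weather in London?"],
        "response": ["It's rainy and 10 degrees in London."],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[MetricPromptTemplateExamples.Pointwise.FLUENCY],
    experiment="agent-response-eval",
)
eval_result = eval_task.evaluate()
print(eval_result.summary_metrics)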

Trajectory evaluation

Trajectory evaluation comes in two flavours: computation-based or model-based. You’ll most likely use computation-based trajectory evaluation for determinism, but a more flexible and customizable model-based trajectory evaluation is also possible. Additionally, you can define a custom computation-based trajectory metric if the standard ones are not sufficient.

Computation-based (standard)

In computation-based trajectory evaluation, you first define a reference (i.e. expected) trajectory for your agent with the tool calls, their parameters, and their order. For example, here’s a reference trajectory to call two functions to get the weather of a location:

reference_trajectory = [
    [
        {
            "tool_name": "loc_to_lat_long",
            "tool_input": {
                "location": "London"
            }
        },
        {
            "tool_name": "lat_long_to_weather",
            "tool_input": {
                "longitude": "-0.12574",
                "latitude": "51.50853"
            }
        }
    ],
]

You then get and save the actual trajectory from your agent. For example, here the agent calls the right tools but with the wrong location:

predicted_trajectory = [
    # location mismatch
    [
        {
            "tool_name": "loc_to_lat_long",
            "tool_input": {
                "location": "Paris"
            }
        },
        {
            "tool_name": "lat_long_to_weather",
            "tool_input": {
                "longitude": "2.3488",
                "latitude": "48.85341"
            }
        }
    ],
]

Next, create an evaluation dataset with the predicted and reference trajectories, along with the desired metric:

eval_dataset = pandas.DataFrame(
    {
        "predicted_trajectory": predicted_trajectory,
        "reference_trajectory": reference_trajectory,
    }
)

metrics = [
    Metric.TRAJECTORY_EXACT_MATCH,
]
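
For reference, wiring the dataset and metric into a run looks like any other EvalTask. Here’s a minimal sketch assuming the string form of the trajectory metric name accepted by the SDK (the sample script wraps this in its own Metric constants and helpers):

from vertexai.evaluation import EvalTask

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["trajectory_exact_match"],
    experiment="trajectory-eval-demo",  # illustrative experiment name
)
eval_result = eval_task.evaluate()
print(eval_result.summary_metrics)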

Run the evaluation:

python trajectory_computation_based.py


Metrics table
predicted_trajectory:
[{'tool_name': 'loc_to_lat_long', 'tool_input': {'location': 'Paris'}}, {'tool_name': 'lat_long_to_weather', 'tool_input': {'longitude': '2.3488',...}]

reference_trajectory:
[{'tool_name': 'loc_to_lat_long', 'tool_input': {'location': 'London'}}, {'tool_name': 'lat_long_to_weather', 'tool_input': {'longitude': '-0.1257...}]

trajectory_exact_match/score
0

You can see that the trajectory_exact_match metric has a score of 0.

See trajectory_computation_based.py for more metrics and details.

Computation-based (custom)

If the standard computation-based metrics are not sufficient, you can create your own custom metrics. For example, let’s say you want to define a custom metric to make sure all essential tools are present in the trajectory.

First, you define an essential_tools_present function that takes an instance parameter:

def essential_tools_present(instance, required_tools=["loc_to_lat_long", "lat_long_to_weather"]):
    trajectory = instance["predicted_trajectory"]
    tools_present = [tool_used["tool_name"] for tool_used in trajectory]
    if len(required_tools) == 0:
        return {"custom_essential_tools_present_metric": 1}
    score = 0
    for tool in required_tools:
        if tool in tools_present:
            score += 1
    return {
        "custom_essential_tools_present_metric": score / len(required_tools),
    }

Then, you create a custom metric with the metric_function pointing to the function you just defined:

custom_metric = CustomMetric(
    name="custom_essential_tools_present_metric",
    metric_function=essential_tools_present)

And use that metric in the evaluation:

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[custom_metric],
    experiment=get_experiment_name(__file__)
)

See trajectory_computation_based_custom.py for details.

Model-based (custom)

If you need a more flexible or customized trajectory evaluation, you can use a model-based trajectory evaluation.

For example, here’s a custom prompt template and a custom metric that you can define:

custom_trajectory_prompt_template = PointwiseMetricPromptTemplate(
    criteria={
        "Follows trajectory": (
            "Evaluate whether the agent's response logically follows from the "
            "sequence of actions it took. Consider these sub-points:\n"
            "  - Does the response reflect the information gathered during the trajectory?\n"
            "  - Is the response consistent with the goals and constraints of the task?\n"
            "  - Are there any unexpected or illogical jumps in reasoning?\n"
            "Provide specific examples from the trajectory and response to support your evaluation."
        )
    },
    rating_rubric={
        "1": "Follows trajectory",
        "0": "Does not follow trajectory",
    },
    input_variables=["predicted_trajectory"],
)

custom_trajectory_metric = PointwiseMetric(
    metric="custom_trajectory_metric",
    metric_prompt_template=custom_trajectory_prompt_template,
)

Then you can use the custom metric in your evaluation.
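
As a sketch, the wiring is the same as before; eval_dataset is assumed to contain the predicted_trajectory column that the prompt template references:

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[custom_trajectory_metric],
    experiment=get_experiment_name(__file__),
)
eval_result = eval_task.evaluate()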

See trajectory_model_based_custom.py for details.

Trajectory and response evaluation with runnable interface

So far, we used saved trajectories and responses for agent evaluation. You can also call your agent live from your evaluation, transform the trajectory and response into the format that the Gen AI Evaluation Service expects, and run the evaluation. Let’s see how this works.

First, when you run evaluate, you specify a runnable function:

eval_result = eval_task.evaluate(
    runnable=agent_parsed_outcome
)
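
For context, the EvalTask behind that call only needs prompts (plus reference trajectories if you also use trajectory metrics); the runnable produces the responses and predicted trajectories at evaluation time. A minimal sketch with illustrative data and metric mix:

eval_dataset = pandas.DataFrame(
    {
        "prompt": ["What's the weather in London?"],
        "reference_trajectory": reference_trajectory,
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["trajectory_exact_match", MetricPromptTemplateExamples.Pointwise.COHERENCE],
    experiment="agent-runnable-eval",
)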

The runnable function (agent_parsed_outcome here) receives the prompt from the evaluation dataset and initiates the agent run; inside it, you set up and run your agent code. The exact code depends on the agent framework you are using:

def agent_parsed_outcome(prompt):
    print(f"Prompt: {prompt}")

    # Setup and run your agent code here

    # Parse the agent response and trajectory
    return parse_agent_output_to_dictionary()

You’d get the agent response and trajectory and convert them into the format that the Gen AI evaluation service expects. In this case, we’re simply constructing a dummy response to show the format:

def parse_agent_output_to_dictionary():
    # Parse agent response and trajectory and convert into the format the eval service expects

    # Returning dummy response here
    final_output = {
        "response": "It's rainy and 10 degrees",
        "predicted_trajectory":
            [
                {
                    "tool_name": "loc_to_lat_long",
                    "tool_input": {
                        "location": "London"
                    }
                },
                {
                    "tool_name": "lat_long_to_weather",
                    "tool_input": {
                        "longitude": "-0.12574",
                        "latitude": "51.50853"
                    }
                }
            ]
    }
    return final_output

For example, if you want to run this against Google’s Agent Development Kit (ADK), you’d have the following agent_parsed_outcome function:

async def agent_parsed_outcome(query):
    app_name = "product_research_app"
    user_id = "user_one"
    session_id = "session_one"

    product_research_agent = Agent(
        name="ProductResearchAgent",
        model=model,
        description="An agent that performs product research.",
        instruction=f"""
        Analyze this user request: '{query}'.
        If the request is about price, use get_product_price tool.
        Otherwise, use get_product_details tool to get product information.
        """,
        tools=[get_product_details, get_product_price],
    )

    session_service = InMemorySessionService()
    await session_service.create_session(
        app_name=app_name, user_id=user_id, session_id=session_id
    )

    runner = Runner(
        agent=product_research_agent, app_name=app_name, session_service=session_service
    )

    content = types.Content(role="user", parts=[types.Part(text=query)])
    events = [
        event
        async for event in runner.run_async(
            user_id=user_id, session_id=session_id, new_message=content
        )
    ]

    return parse_adk_output_to_dictionary(events)

And you’d parse the output as follows:

def parse_adk_output_to_dictionary(events: list[Event], *, as_json: bool = False):
    """
    Parse ADK event output into a structured dictionary format,
    optionally dumping the predicted trajectory as a JSON string (as_json=True).
    """

    final_response = ""
    trajectory = []

    for event in events:
        if not getattr(event, "content", None) or not getattr(event.content, "parts", None):
            continue
        for part in event.content.parts:
            if getattr(part, "function_call", None):
                info = {
                    "tool_name": part.function_call.name,
                    "tool_input": dict(part.function_call.args),
                }
                if info not in trajectory:
                    trajectory.append(info)
            if event.content.role == "model" and getattr(part, "text", None):
                final_response = part.text.strip()

    if as_json:
        trajectory_out = json.dumps(trajectory)
    else:
        trajectory_out = trajectory

    return {"response": final_response, "predicted_trajectory": trajectory_out}

See trajectory_and_response_runnable.py for details and the evaluating_adk_agent.ipynb notebook for ADK-specific evaluation.

Conclusion

In this post, we covered agent metrics and showed how to evaluate an agent’s response and trajectory, using either saved outputs or live agent runs.

In this series so far, we’ve focused on evaluating text-based responses. However, models can produce multimodal outputs (e.g., images, video) as well. How do you evaluate those? That’s the topic of our next blog post.
