Gen AI Evaluation Service - Tool-Use Metrics



I’m continuing my Vertex AI Gen AI Evaluation Service blog post series. In today’s fourth post, I’ll talk about tool-use metrics.

What is tool use?

Tool use, also known as function calling, provides the LLM with definitions of external tools (for example, a get_current_weather function). When processing a prompt, the model determines if a tool is needed and, if so, outputs structured data specifying the tool to call and its parameters (for example, get_current_weather(location='London')).

Your application (or the underlying model library in the case of Gemini) then executes this tool and feeds the result back to the model, allowing it to complete its response with dynamic, real-world information or the outcome of an action. This effectively bridges the LLM to real-world systems and extends its capabilities and knowledge beyond its training data.
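
To make that concrete, here is a minimal sketch of the flow using the google-genai SDK (the get_current_weather function, its dummy return value, and the model name are placeholders for illustration, not part of the samples in this series):

from google import genai
from google.genai import types

client = genai.Client()  # assumes an API key or Vertex AI credentials are configured

def get_current_weather(location: str) -> dict:
    """Dummy weather lookup used only for illustration."""
    return {"location": location, "temperature_c": 14, "condition": "cloudy"}

# Disable automatic execution so we can inspect the structured tool call itself.
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What's the weather like in London?",
    config=types.GenerateContentConfig(
        tools=[get_current_weather],
        automatic_function_calling=types.AutomaticFunctionCallingConfig(disable=True),
    ),
)

# The model's structured tool-call request, e.g. get_current_weather {'location': 'London'}
for call in response.function_calls or []:
    print(call.name, call.args)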

This is great, but how do you know whether the LLM requested the right tool with the right parameters? You don’t until you check. Thankfully, you can rely on tool-use metrics.

List of metrics

These are the out-of-the-box tool-use metrics in the Gen AI evaluation service:

  • tool_call_valid determines if the model’s output contains a valid tool call.
  • tool_name_match determines if the tool call’s name matches the reference.
  • tool_parameter_key_match determines if the tool call’s parameter names match the reference.
  • tool_parameter_kv_match determines if the tool call’s parameter names and values match the reference.

Each metric produces a score of either 0 (fail) or 1 (pass). Let’s take a look at some detailed examples.

Tool-use metrics with saved responses

Imagine you have a function named location_to_lat_long that takes a location argument and returns the latitude and longitude of that location. You want to evaluate whether the LLM calls this function with the value London. How can you set up an evaluation for this scenario?

Reference tool call

First, you’d define a reference tool call as follows:

references = [
    {
        "content": "",
        "tool_calls": [
            {
                "name": "location_to_lat_long",
                "arguments": {
                    "location": "London"
                }
            }
        ]
    }
] * 7

Note that we’re defining the same reference multiple times, as we’ll be comparing different responses against it to show how each tool-use metric works.

No or wrongly formatted tool call

Now, let’s assume we get the following responses from the LLM. The first one has no tool call, and the second one has an incorrectly formatted tool call:

responses = [
    {
        "content": "",
        "tool_calls": [
            # no tool call - fails Metric.TOOL_CALL_VALID
        ]
    },
    {
        "content": "",
        "tool_calls": [
            # wrongly formatted (no name) tool call - fails Metric.TOOL_CALL_VALID
            {
                "foo": "some_function",
                "arguments": {
                    "some_arg": "some_value"
                }
            }
        ]
    },
]

Both of these will be caught by the tool_call_valid metric, which is useful as a basic sanity check.

Wrong tool name

Once the basic tool call is validated, you can check further details, such as the tool name and the parameter keys and values.

For example, if you get a response with the wrong tool name, it’ll be caught by the tool_name_match metric:

{
    "content": "",
    "tool_calls": [
        {
            # name (loc_to_lat_long) does not match (location_to_lat_long) - fails Metric.TOOL_NAME_MATCH
            "name": "loc_to_lat_long",
            "arguments": {
                "location": "London"
            }
        }
    ]
}

Wrong parameter name

Of course, you can have the right tool name but not the right parameter name. This one will be caught by the tool_parameter_key_match metric:

{
    "content": "",
    "tool_calls": [
        {
            "name": "location_to_lat_long",
            "arguments": {
                # key (city) does not match (location) - fails Metric.TOOL_PARAMETER_KEY_MATCH
                "city": "London"
            }
        }
    ]
}

Wrong parameter name or value

Finally, if you need to check both the key and the value of a parameter, you can use the tool_parameter_kv_match metric. Both of the following responses will be caught by it:

[
    {
        "content": "",
        "tool_calls": [
            {
                "name": "location_to_lat_long",
                "arguments": {
                    # key (city) does not match (location) - fails Metric.TOOL_PARAMETER_KV_MATCH
                    "city": "London"
                }
            }
        ]
    },
    {
        "content": "",
        "tool_calls": [
            {
                "name": "location_to_lat_long",
                "arguments": {
                    # value (Paris) does not match (London) - fails Metric.TOOL_PARAMETER_KV_MATCH
                    "location": "Paris"
                }
            }
        ]
    }
]

See tool_use.py for the full sample code.
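
To give you a rough idea of how the pieces fit together, here is a sketch of pairing those responses with the references and computing the four metrics. It is not necessarily identical to tool_use.py: it assumes you’ve collected all of the example responses above into a single responses list, and depending on your SDK version EvalTask may live under vertexai.preview.evaluation instead.

import json

import pandas
from vertexai.evaluation import EvalTask

# Assumes vertexai.init(project=..., location=...) has already been called.
# Each row pairs one response with one reference, both serialized as JSON strings.
eval_dataset = pandas.DataFrame(
    {
        "response": [json.dumps(response) for response in responses],
        "reference": [json.dumps(reference) for reference in references],
    }
)

# The computation-based tool metrics can be referenced by their string names
# (the full samples refer to them via Metric constants instead).
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "tool_call_valid",
        "tool_name_match",
        "tool_parameter_key_match",
        "tool_parameter_kv_match",
    ],
)

eval_result = eval_task.evaluate()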

Tool-use metrics with Gemini function calling

So far, we have looked at samples with saved responses already in the right format. In a real-world evaluation scenario, you’d get function call responses from the model and need to transform them into the format that the Gen AI evaluation service expects. Moreover, models like Gemini support automatic function calling, where the library executes the function calls for you. How can you set up an evaluation for those scenarios?

You can take a look at tool_use_gemini.py for an end-to-end sample, but let me walk you through the important parts.

First, you generate content with a prompt (What's the temperature, wind, humidity like in London, Paris?) and two functions as tools (location_to_lat_long, lat_long_to_weather). You can see the details in tool_use_gemini.py. In Gemini, function calling is automatic, so you don’t necessarily get the function call requests back in the response. However, the final response includes the automatic function calling history via the response.automatic_function_calling_history property, which is very useful for evaluation.
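
For context, here is a rough, self-contained sketch of that generation step (the function bodies below are dummy stand-ins for the real implementations in tool_use_gemini.py, and the model name is just an example):

from google import genai
from google.genai import types

client = genai.Client()

def location_to_lat_long(location: str) -> dict:
    """Dummy stand-in: returns hard-coded coordinates for a location."""
    coordinates = {
        "London": {"latitude": "51.50853", "longitude": "-0.12574"},
        "Paris": {"latitude": "48.85341", "longitude": "2.3488"},
    }
    return coordinates[location]

def lat_long_to_weather(latitude: str, longitude: str) -> dict:
    """Dummy stand-in: returns a fixed weather report for the coordinates."""
    return {"temperature_c": 14, "wind_kph": 10, "humidity": 80}

# Automatic function calling: the SDK executes the tools and feeds results back.
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="What's the temperature, wind, humidity like in London, Paris?",
    config=types.GenerateContentConfig(tools=[location_to_lat_long, lat_long_to_weather]),
)

# Each history entry is a Content object; model turns contain the function_call parts.
for content in response.automatic_function_calling_history:
    print(content.role, [part.function_call for part in content.parts or []])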

You pass the function calling history to a function to convert it to the Gen AI evaluation service format:

def convert_function_calling_history_for_eval(automatic_function_calling_history):
    """Converts Gemini's automatic function calling history into the tool-call
    format expected by the Gen AI evaluation service."""
    function_calls = []
    for item in automatic_function_calling_history:
        # Only model turns contain function call requests.
        if item.role == 'model':
            for part in item.parts:
                if part.function_call:
                    output_format = {
                        "content": "",
                        "tool_calls": [
                            {
                                "name": part.function_call.name,
                                "arguments": part.function_call.args
                            }
                        ]
                    }
                    function_calls.append(output_format)
    return function_calls
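
You then call this with the history from the response, which gives us the converted_function_calling_history list used below:

converted_function_calling_history = convert_function_calling_history_for_eval(
    response.automatic_function_calling_history
)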

At this point, we have the function call responses from the model in the format we need. Now we can define our reference sequence of function calls:

references = [
    {
        "content": "",
        "tool_calls": [
            {
                "name": "location_to_lat_long",
                "arguments": {
                    "location": "London"
                }
            }
        ]
    },
    {
        "content": "",
        "tool_calls": [
            {
                "name": "location_to_lat_long",
                "arguments": {
                    "location": "Paris"
                }
            }
        ]
    },
    {
        "content": "",
        "tool_calls": [
            {
                "name": "lat_long_to_weather",
                "arguments": {
                    "longitude": "-0.12574",
                    "latitude": "51.50853"
                }
            }
        ]
    },
    {
        "content": "",
        "tool_calls": [
            {
                "name": "lat_long_to_weather",
                "arguments": {
                    "longitude": "2.3488",
                    "latitude": "48.85341"
                }
            }
        ]
    },
]

Create an evaluation dataset with the references and the responses:

eval_dataset = pandas.DataFrame(
    {
        "response": [json.dumps(history) for history in converted_function_calling_history],
        "reference": [json.dumps(reference) for reference in references],
    }
)

Finally, run the evaluation task with the metrics we care about:

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        Metric.TOOL_CALL_VALID,
        Metric.TOOL_NAME_MATCH,
        Metric.TOOL_PARAMETER_KEY_MATCH,
        Metric.TOOL_PARAMETER_KV_MATCH
    ],
    experiment=get_experiment_name(__file__)
)

eval_result = eval_task.evaluate()
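
To inspect the results, you can print the row-by-row metrics table returned by the evaluation (eval_result.metrics_table is a pandas DataFrame; the full sample presumably prints something along these lines):

print("==Metrics table==")
print(eval_result.metrics_table)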

In the end, we should get the results back; in this case, each response perfectly matches its reference:

==Metrics table==
                                                                                                                           response  \
0                            {"content": "", "tool_calls": [{"name": "location_to_lat_long", "arguments": {"location": "London"}}]}
1                             {"content": "", "tool_calls": [{"name": "location_to_lat_long", "arguments": {"location": "Paris"}}]}
2  {"content": "", "tool_calls": [{"name": "lat_long_to_weather", "arguments": {"longitude": "-0.12574", "latitude": "51.50853"}}]}
3    {"content": "", "tool_calls": [{"name": "lat_long_to_weather", "arguments": {"latitude": "48.85341", "longitude": "2.3488"}}]}

                                                                                                                          reference  \
0                            {"content": "", "tool_calls": [{"name": "location_to_lat_long", "arguments": {"location": "London"}}]}
1                             {"content": "", "tool_calls": [{"name": "location_to_lat_long", "arguments": {"location": "Paris"}}]}
2  {"content": "", "tool_calls": [{"name": "lat_long_to_weather", "arguments": {"longitude": "-0.12574", "latitude": "51.50853"}}]}
3    {"content": "", "tool_calls": [{"name": "lat_long_to_weather", "arguments": {"longitude": "2.3488", "latitude": "48.85341"}}]}

   tool_call_valid/score  tool_name_match/score  \
0                    1.0                    1.0
1                    1.0                    1.0
2                    1.0                    1.0
3                    1.0                    1.0

   tool_parameter_key_match/score  tool_parameter_kv_match/score
0                             1.0                            1.0
1                             1.0                            1.0
2                             1.0                            1.0
3                             1.0                            1.0

Conclusion

In this post, we covered tool-use metrics and walked through an end-to-end example of how to set them up with Gemini’s automatic function calling. In the next post, we’ll go beyond tools and show how to evaluate agents.
