I’m continuing my Vertex AI Gen AI Evaluation Service blog post series. In today’s fourth post, I’ll talk about tool-use metrics.
What is tool use?
Tool use, also known as function calling, provides the LLM with definitions of external tools (for example, a get_current_weather function). When processing a prompt, the model determines if a tool is needed and, if so, outputs structured data specifying the tool to call and its parameters (for example, get_current_weather(location='London')).
Your application (or the underlying model library in the case of Gemini) then executes this tool and feeds the result back to the model, allowing it to complete its response with dynamic, real-world information or the outcome of an action. This effectively bridges the LLM with real-world systems and extends its capabilities and knowledge beyond its training data.
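To make the round trip concrete, here’s a minimal sketch of that flow in plain Python. The get_current_weather stub, the hard-coded model output, and the return values are all placeholders for illustration, not any particular SDK’s API:

import json

def get_current_weather(location: str) -> dict:
    # Hypothetical tool: a real implementation would call a weather API.
    return {"location": location, "temperature_c": 15}

# 1. The model decides a tool is needed and emits a structured tool call.
model_tool_call = {"name": "get_current_weather", "arguments": {"location": "London"}}

# 2. The application executes the requested tool with the requested arguments.
tool_result = get_current_weather(**model_tool_call["arguments"])

# 3. The tool result is fed back to the model so it can finish its answer.
print(json.dumps(tool_result))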
This is great, but how do you know whether the LLM requested the right tool with the right parameters? You don’t until you check. Thankfully, you can rely on tool-use metrics.
List of metrics
These are the out-of-the-box tool-use metrics in Gen AI evaluation service:
- tool_call_valid: determines if the model’s output contains a valid tool call.
- tool_name_match: determines if the tool call has the correct name.
- tool_parameter_key_match: determines if the tool call has the correct parameter names.
- tool_parameter_kv_match: determines if the tool call has the correct parameter names and values.
Each metric ends up with either a 0 (invalid) or 1 (valid) value. Let’s take a look at some detailed examples.
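In code, you refer to these metrics by name when configuring an evaluation. Here’s a minimal sketch; the comments restate the descriptions above, and it’s an assumption on my part that the Metric.* constants used later in this post resolve to these same string names:

# The four tool-use metric names, as listed above.
TOOL_USE_METRICS = [
    "tool_call_valid",           # is there a well-formed tool call at all?
    "tool_name_match",           # does the tool name match the reference?
    "tool_parameter_key_match",  # do the parameter names match?
    "tool_parameter_kv_match",   # do the parameter names and values match?
]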
Tool-use metrics with saved responses
Imagine you have a function named location_to_lat_long that takes a location argument and returns the latitude and longitude of that location. You want to evaluate whether the LLM calls this function with the value London. How can you set up an evaluation for this scenario?
Reference tool call
First, you’d define a reference tool call as follows:
references = [
    {
        "content": "",
        "tool_calls": [
            {
                "name": "location_to_lat_long",
                "arguments": {
                    "location": "London"
                }
            }
        ]
    }
] * 7
Note that we’re defining the same reference multiple times, as we’ll be using it against different responses and tool metrics to show how they work.
No or wrongly formatted tool call
Now, let’s assume we get the following responses from the LLM. The first one has no tool call, and the second one has an incorrectly formatted tool call:
responses = [
    {
        "content": "",
        "tool_calls": [
            # no tool call - fails Metric.TOOL_CALL_VALID
        ]
    },
    {
        "content": "",
        "tool_calls": [
            # wrongly formatted (no name) tool call - fails Metric.TOOL_CALL_VALID
            {
                "foo": "some_function",
                "arguments": {
                    "some_arg": "some_value"
                }
            }
        ]
    },
]
Both of these will be caught by the tool_call_valid metric. This is useful as a basic sanity check.
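If you want a quick local pre-check before running the evaluation, a rough approximation of what this metric looks for might be the following. This is my own sketch, not the service’s actual implementation:

def looks_like_valid_tool_call(record: dict) -> bool:
    # Rough local pre-check (not the service's implementation): the record
    # has at least one tool call with both a name and arguments.
    return any(
        isinstance(call, dict) and "name" in call and "arguments" in call
        for call in record.get("tool_calls", [])
    )

# Both failing responses above return False here.
print([looks_like_valid_tool_call(r) for r in responses])  # [False, False]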
Wrong tool name
Once the basic tool call is validated, you’ll want to check more: the tool name and the parameter keys and values. For example, if you get a response with the wrong tool name, it’ll be caught by the tool_name_match metric:
{
    "content": "",
    "tool_calls": [
        {
            # name (loc_to_lat_long) does not match (location_to_lat_long) - fails Metric.TOOL_NAME_MATCH
            "name": "loc_to_lat_long",
            "arguments": {
                "location": "London"
            }
        }
    ]
}
Wrong parameter name
Of course, you can have the right tool name but not the right parameter name. This one will be caught by the tool_parameter_key_match metric:
{
    "content": "",
    "tool_calls": [
        {
            "name": "location_to_lat_long",
            "arguments": {
                # key (city) does not match (location) - fails Metric.TOOL_PARAMETER_KEY_MATCH
                "city": "London"
            }
        }
    ]
}
Wrong parameter name or value
Finally, if you need to check both the key and the value of a parameter, use the tool_parameter_kv_match metric. Both of the following responses will be caught by it:
[
    {
        "content": "",
        "tool_calls": [
            {
                "name": "location_to_lat_long",
                "arguments": {
                    # key (city) does not match (location) - fails Metric.TOOL_PARAMETER_KV_MATCH
                    "city": "London"
                }
            }
        ]
    },
    {
        "content": "",
        "tool_calls": [
            {
                "name": "location_to_lat_long",
                "arguments": {
                    # value (Paris) does not match (London) - fails Metric.TOOL_PARAMETER_KV_MATCH
                    "location": "Paris"
                }
            }
        ]
    }
]
See tool_use.py for the full sample code.
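To tie the saved-responses scenario together, here’s a rough sketch of how these responses and references might be wired into an evaluation. The import path, the string metric names, and the experiment name are my assumptions for illustration; the actual wiring is in tool_use.py:

import json

import pandas
from vertexai.preview.evaluation import EvalTask  # assumed import path

# Assumes vertexai.init(project=..., location=...) has been called earlier.
# Note: responses and references must have the same length (one row per pair).
eval_dataset = pandas.DataFrame(
    {
        "response": [json.dumps(r) for r in responses],
        "reference": [json.dumps(r) for r in references],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        "tool_call_valid",
        "tool_name_match",
        "tool_parameter_key_match",
        "tool_parameter_kv_match",
    ],
    experiment="tool-use-saved-responses",  # hypothetical experiment name
)
eval_result = eval_task.evaluate()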
Tool-use metrics with Gemini function calling
So far, we have looked at samples with saved responses in the right format. In a real-world evaluation scenario, you’d get the function call responses from the model and need to transform them into the format the Gen AI evaluation service expects. Moreover, models like Gemini support automatic function calling, where the function calls and their responses are handled for you. How can you set up an evaluation for those scenarios?
You can take a look at tool_use_gemini.py for an end-to-end sample, but let me walk you through the important parts.
First, you generate content with a prompt (What's the temperature, wind, humidity like in London, Paris?) and function calls (location_to_lat_long, lat_long_to_weather). You can see the details in tool_use_gemini.py. With Gemini, function calling is automatic, so you don’t necessarily get the function call requests in the responses. However, the final response includes the automatic function calling history via the response.automatic_function_calling_history property. This is very useful for evaluation.
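The generation step might look something like the following minimal sketch using the google-genai SDK. The model name and the stub tool bodies are placeholders for illustration; the real implementations are in tool_use_gemini.py:

from google import genai
from google.genai import types

# Assumes the google-genai SDK is configured (API key or Vertex AI project).
client = genai.Client()

def location_to_lat_long(location: str) -> dict:
    # Hypothetical tool stub: a real implementation would call a geocoding API.
    return {"latitude": "51.50853", "longitude": "-0.12574"}

def lat_long_to_weather(latitude: str, longitude: str) -> dict:
    # Hypothetical tool stub: a real implementation would call a weather API.
    return {"temperature": "15C", "wind": "10 km/h", "humidity": "70%"}

response = client.models.generate_content(
    model="gemini-2.0-flash",  # assumed model name
    contents="What's the temperature, wind, humidity like in London, Paris?",
    config=types.GenerateContentConfig(
        # Passing Python callables as tools enables the SDK's automatic
        # function calling: it executes the functions and records the history.
        tools=[location_to_lat_long, lat_long_to_weather],
    ),
)

print(response.automatic_function_calling_history)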
You pass the function calling history to a function to convert it to the Gen AI evaluation service format:
def convert_function_calling_history_for_eval(automatic_function_calling_history):
    """Convert Gemini's automatic function calling history into the
    response format expected by the Gen AI evaluation service."""
    function_calls = []
    for item in automatic_function_calling_history:
        if item.role == 'model':
            for part in item.parts:
                if part.function_call:
                    output_format = {
                        "content": "",
                        "tool_calls": [
                            {
                                "name": part.function_call.name,
                                "arguments": part.function_call.args
                            }
                        ]
                    }
                    function_calls.append(output_format)
    return function_calls
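The converted list used in the evaluation dataset below comes from calling this helper on the response’s history (assuming response is the result of the generation step above):

converted_function_calling_history = convert_function_calling_history_for_eval(
    response.automatic_function_calling_history
)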
At this point, we have the function call responses from the model in the format we need. Now, we can define our reference function calls sequence and values:
references = [
    {
        "content": "",
        "tool_calls": [
            {
                "name": "location_to_lat_long",
                "arguments": {
                    "location": "London"
                }
            }
        ]
    },
    {
        "content": "",
        "tool_calls": [
            {
                "name": "location_to_lat_long",
                "arguments": {
                    "location": "Paris"
                }
            }
        ]
    },
    {
        "content": "",
        "tool_calls": [
            {
                "name": "lat_long_to_weather",
                "arguments": {
                    "longitude": "-0.12574",
                    "latitude": "51.50853"
                }
            }
        ]
    },
    {
        "content": "",
        "tool_calls": [
            {
                "name": "lat_long_to_weather",
                "arguments": {
                    "longitude": "2.3488",
                    "latitude": "48.85341"
                }
            }
        ]
    },
]
Create an evaluation dataset with the references and the responses:
eval_dataset = pandas.DataFrame(
    {
        "response": [json.dumps(history) for history in converted_function_calling_history],
        "reference": [json.dumps(reference) for reference in references],
    }
)
Finally, run the evaluation task with the metrics we care about:
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        Metric.TOOL_CALL_VALID,
        Metric.TOOL_NAME_MATCH,
        Metric.TOOL_PARAMETER_KEY_MATCH,
        Metric.TOOL_PARAMETER_KV_MATCH
    ],
    experiment=get_experiment_name(__file__)
)
eval_result = eval_task.evaluate()
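You can then print the per-row results. This is a small sketch, assuming the result object exposes a pandas metrics_table (and a summary_metrics dict), which is where the output below comes from:

print("==Summary metrics==")
print(eval_result.summary_metrics)

print("==Metrics table==")
print(eval_result.metrics_table)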
In the end, we get the results back, and in this case every response perfectly matches its reference:
==Metrics table==

response:
0  {"content": "", "tool_calls": [{"name": "location_to_lat_long", "arguments": {"location": "London"}}]}
1  {"content": "", "tool_calls": [{"name": "location_to_lat_long", "arguments": {"location": "Paris"}}]}
2  {"content": "", "tool_calls": [{"name": "lat_long_to_weather", "arguments": {"longitude": "-0.12574", "latitude": "51.50853"}}]}
3  {"content": "", "tool_calls": [{"name": "lat_long_to_weather", "arguments": {"latitude": "48.85341", "longitude": "2.3488"}}]}

reference:
0  {"content": "", "tool_calls": [{"name": "location_to_lat_long", "arguments": {"location": "London"}}]}
1  {"content": "", "tool_calls": [{"name": "location_to_lat_long", "arguments": {"location": "Paris"}}]}
2  {"content": "", "tool_calls": [{"name": "lat_long_to_weather", "arguments": {"longitude": "-0.12574", "latitude": "51.50853"}}]}
3  {"content": "", "tool_calls": [{"name": "lat_long_to_weather", "arguments": {"longitude": "2.3488", "latitude": "48.85341"}}]}

scores:
   tool_call_valid/score  tool_name_match/score  tool_parameter_key_match/score  tool_parameter_kv_match/score
0  1.0                    1.0                    1.0                             1.0
1  1.0                    1.0                    1.0                             1.0
2  1.0                    1.0                    1.0                             1.0
3  1.0                    1.0                    1.0                             1.0
Conclusion
In this post, we covered tool-use metrics and walked through an end-to-end example of how to set them up with Gemini’s automatic function calling. In the next post, we’ll go beyond tools and show how to evaluate agents.
References:
- Part 1: Gen AI Evaluation Service - An Overview
- Part 2: Gen AI Evaluation Service - Computation-Based Metrics
- Part 3: Gen AI Evaluation Service - Model-Based Metrics
- Tutorial: Gen AI evaluation service - tool-use metrics
- Documentation: Gen AI evaluation service overview
- Notebooks: GenAI evaluation service samples