Mete Atamel

Gen AI Evaluation Service - Computation-Based Metrics

Posted on July 2, 2025

In my Gen AI Evaluation Service - An Overview post, I introduced Vertex AI’s Gen AI evaluation service and talked about the various classes of metrics it supports. In today’s post, I want to dive into computation-based metrics: what they provide and what their limitations are.

Computation-based metrics are metrics that can be calculated using a mathematical formula. They’re deterministic – the same input produces the same score, unlike model-based metrics where you might get slightly different scores for the same input.
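To make the "formula in, same score out" idea concrete, here is a minimal sketch of two formula-style metrics written in plain Python. The function names `exact_match` and `unigram_f1` are my own for illustration; this is not the implementation the Gen AI evaluation service uses, only an example of the kind of deterministic calculation that computation-based metrics rely on.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the two strings are identical after trimming whitespace, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def unigram_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a prediction and a reference (illustrative only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the cat sat"
    prediction = "the cat"
    print(exact_match(prediction, reference))
    print(unigram_f1(prediction, reference))
```

Calling either function twice with the same arguments always yields the same number, which is the property that sets these metrics apart from model-based ones.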

Gen AI Evaluation Service - An Overview

Posted on June 30, 2025

Generating content with Large Language Models (LLMs) is easy. Determining whether the generated content is good is hard. That’s why evaluating LLM outputs with metrics is crucial. Previously, I talked about DeepEval and Promptfoo as some of the tools you can use for LLM evaluation. I also talked about RAG triad metrics specifically for Retrieval Augmented Generation (RAG) evaluation for LLMs.

In the next few posts, I want to talk about a Google Cloud specific evaluation service: the Gen AI evaluation service in Vertex AI. The Gen AI evaluation service in Vertex AI lets you evaluate any generative model or application against a set of criteria or your own custom criteria.

GenAI VertexAI Gemini Google Cloud Platform

Evaluating RAG pipelines with the RAG triad

Posted on May 14, 2025

Retrieval-Augmented Generation (RAG) has emerged as a dominant framework for giving Large Language Models (LLMs) context beyond the scope of their training data, enabling them to respond with more grounded answers and fewer hallucinations based on that context.

However, designing an effective RAG pipeline can be challenging. You need to answer questions such as:

How should you parse and chunk text documents for vector embedding? What chunk size and overlap size should you use?
What vector embedding model should you use?
What retrieval method should you use to fetch the relevant context? How many documents should you retrieve by default? Does the retriever actually manage to retrieve the applicable documents?
Does the generator actually generate content that is in line with the retrieved context? What parameters (model, prompt template, temperature) work better?
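
To show what the chunk size and overlap question means in practice, here is a minimal sketch of character-based chunking with overlap. The `chunk_text` helper is hypothetical and written only for this illustration; real pipelines usually rely on a framework's text splitter and often split on tokens or sentences rather than raw characters.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so content near a chunk
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

A larger overlap keeps more context across chunk boundaries at the cost of more chunks to embed and store, which is why both values need to be measured rather than guessed.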

The only way to objectively answer these questions is to measure how well the RAG pipeline works, but what exactly do you measure, and how do you measure it? This is the topic I’ll cover here.

GenAI VertexAI Gemini Google Cloud Platform

DeepEval adds native support for Gemini as an LLM Judge

Posted on April 29, 2025

In my previous post on DeepEval and Vertex AI, I introduced DeepEval, an open-source evaluation framework for LLMs. I also demonstrated how to use Gemini (on Vertex AI) as an LLM Judge in DeepEval, replacing the default OpenAI judge to evaluate outputs from other LLMs. At that time, the Gemini integration with DeepEval wasn’t ideal and I had to implement my own integration.

Thanks to the excellent work by Roy Arsan in PR #1493, DeepEval now includes native Gemini integration. Since it’s built on the new unified Google GenAI SDK, DeepEval supports Gemini models running both on Vertex AI and Google AI. Nice!

GenAI VertexAI Gemini Google Cloud Platform

Much simplified function calling in Gemini 2.X models

Posted on April 8, 2025

Last year, in my Deep dive into function calling in Gemini post, I talked about how to do function calling in Gemini. More specifically, I showed how to call two functions (location_to_lat_long and lat_long_to_weather) to get the weather information for a location from Gemini. It wasn’t difficult but it involved a lot of steps for 2 simple function calls.

I’m pleased to see that the latest Gemini 2.X models and the unified Google Gen AI SDK (that I talked about in my Gemini on Vertex AI and Google AI now unified with the new Google Gen AI SDK post) have made function calling much simpler.

GenAI VertexAI Gemini Google Cloud Platform

RAG with a PDF using LlamaIndex and SimpleVectorStore on Vertex AI

Posted on March 24, 2025

Previously, I showed how to do RAG with a PDF using LangChain and Annoy Vector Store and RAG with a PDF using LangChain and Firestore Vector Store. Both used a PDF as the RAG backend and used LangChain as the LLM framework to orchestrate RAG ingestion and retrieval.

LlamaIndex is another popular LLM framework. I wondered how to set up the same PDF-based RAG pipeline with LlamaIndex and Vertex AI, but I didn’t find a good sample, so I put one together. In this short post, I show how to set up that pipeline with LlamaIndex.

GenAI GoogleAI VertexAI Gemini Google Cloud Platform

Ensuring AI Code Quality with SonarQube + Gemini Code Assist

Posted on March 4, 2025

In my previous Code Quality in the Age of AI-Assisted Development blog post, I talked about how generative AI is changing the way we code and its potential impact on code quality. I recommended using static code analysis tools to monitor AI-generated code, ensuring its security and quality.

In this blog post, I will explore one such static code analysis tool, SonarQube, and see how it improves the quality of AI-generated code.

GenAI GoogleAI VertexAI Gemini Google Cloud Platform

Code Quality in the Age of AI-Assisted Development

Posted on January 28, 2025

As developers transition from manual coding to AI-assisted coding, an increasing share of code is now being generated by AI. This shift has significantly boosted productivity and efficiency, but it raises an important question: how does AI-assisted development impact code quality? How can we ensure that AI-generated code maintains high quality, adheres to good style, and follows best practices? This question has been on my mind recently, and it is the topic of this blog post.

GenAI GoogleAI VertexAI Gemini Google Cloud Platform

Improve the RAG pipeline with RAG triad metrics

Posted on January 21, 2025

In my previous RAG Evaluation - A Step-by-Step Guide with DeepEval post, I showed how to evaluate a RAG pipeline with the RAG triad metrics using DeepEval and Vertex AI. As a recap, these were the results:

Answer relevancy and faithfulness metrics had perfect 1.0 scores whereas contextual relevancy was low at 0.29 because we retrieved a lot of irrelevant context:

The score is 0.29 because while the context mentions relevant information such as "The Cymbal Starlight 2024 has a cargo
capacity of 13.5 cubic feet", much of the retrieved context is irrelevant. For example, several statements discuss
towing capacity like "Your Cymbal Starlight 2024 is not equipped to tow a trailer", or describe how to access/load cargo
like "To access the cargo area, open the trunk lid using the trunk release lever located in the driver's footwell"
instead of focusing on the requested cargo capacity.
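
As a rough sketch of where a number like this comes from: as I understand it, DeepEval describes contextual relevancy as the share of statements in the retrieved context that are relevant to the input. The counts below are hypothetical, picked only because they reproduce a 0.29 score; the actual statement counts from that evaluation run are not shown here.

```python
def contextual_relevancy(relevant_statements: int, total_statements: int) -> float:
    """Fraction of retrieved-context statements judged relevant to the input."""
    if total_statements <= 0:
        raise ValueError("total_statements must be positive")
    return relevant_statements / total_statements


if __name__ == "__main__":
    # Hypothetical counts: 2 relevant statements out of 7 retrieved.
    score = contextual_relevancy(relevant_statements=2, total_statements=7)
    print(round(score, 2))
```

Under this view, there are two ways to raise the score: retrieve fewer irrelevant statements or retrieve more relevant ones.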

Can we improve this? Let’s take a look.

GenAI GoogleAI VertexAI Gemini Google Cloud Platform

RAG Evaluation - A Step-by-Step Guide with DeepEval

Posted on January 14, 2025

In my previous Evaluating RAG pipelines post, I introduced two approaches to evaluating RAG pipelines. In this post, I will show you how to implement these two approaches in detail. The implementation will naturally depend on the framework you use. In my case, I’ll be using DeepEval, an open-source evaluation framework.

Approach 1: Evaluating Retrieval and Generator separately

As a recap, in this approach, you evaluate the retriever and generator of the RAG pipeline separately, each with its own metrics. This approach allows you to pinpoint issues at the retriever and the generator level:

GenAI GoogleAI VertexAI Gemini Google Cloud Platform