RAG with a PDF using LlamaIndex and SimpleVectorStore on Vertex AI

LlamaIndex and Vertex AI

Previously, I showed how to do RAG with a PDF using LangChain and Annoy Vector Store and RAG with a PDF using LangChain and Firestore Vector Store. Both used a PDF as the RAG backend and LangChain as the LLM framework to orchestrate RAG ingestion and retrieval.

LlamaIndex is another popular LLM framework. I wondered how to set up the same PDF-based RAG pipeline with LlamaIndex and Vertex AI, but I couldn’t find a good sample. I put one together myself, and in this short post, I show how to build that pipeline with LlamaIndex.
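For context, a pipeline like this can be sketched in just a few lines. The snippet below is a minimal illustration rather than the exact code from the post; it assumes the llama-index-llms-vertex and llama-index-embeddings-vertex integration packages are installed, that the PDF sits in a local ./data folder, and that the project ID and model names are placeholders. The index is backed by LlamaIndex’s default in-memory SimpleVectorStore.

```python
# Minimal sketch: PDF-based RAG with LlamaIndex on Vertex AI.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.vertex import VertexTextEmbedding
from llama_index.llms.vertex import Vertex

# Use Vertex AI for both the LLM and the embedding model (placeholders below).
Settings.llm = Vertex(model="gemini-1.5-flash-002", project="your-project-id")
Settings.embed_model = VertexTextEmbedding(
    model_name="text-embedding-004", project="your-project-id", location="us-central1"
)

# Load the PDF, chunk and embed it into the default in-memory SimpleVectorStore.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieve relevant chunks and generate a grounded answer.
query_engine = index.as_query_engine()
print(query_engine.query("What is the cargo capacity of Cymbal Starlight?"))
```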

Read More →

Ensuring AI Code Quality with SonarQube + Gemini Code Assist

In my previous Code Quality in the Age of AI-Assisted Development blog post, I talked about how generative AI is changing the way we code and its potential impact on code quality. I recommended using static code analysis tools to monitor AI-generated code, ensuring its security and quality.

In this blog post, I will explore one such static code analysis tool, SonarQube, and see how it improves the quality of AI-generated code.

Read More →

Code Quality in the Age of AI-Assisted Development

As developers transition from manual coding to AI-assisted coding, an increasing share of code is now being generated by AI. This shift has significantly boosted productivity and efficiency, but it raises an important question: how does AI-assisted development impact code quality? How can we ensure that AI-generated code maintains high quality, adheres to good style, and follows best practices? This question has been on my mind recently, and it is the topic of this blog post.

Read More →

Improve the RAG pipeline with RAG triad metrics

In my previous RAG Evaluation - A Step-by-Step Guide with DeepEval post, I showed how to evaluate a RAG pipeline with the RAG triad metrics using DeepEval and Vertex AI. As a recap, these were the results:

RAG triad with DeepEval

Answer relevancy and faithfulness metrics had perfect 1.0 scores, whereas contextual relevancy was low at 0.29 because we retrieved a lot of irrelevant context:

The score is 0.29 because while the context mentions relevant information such as "The Cymbal Starlight 2024 has a cargo
capacity of 13.5 cubic feet", much of the retrieved context is irrelevant. For example, several statements discuss
towing capacity like "Your Cymbal Starlight 2024 is not equipped to tow a trailer", or describe how to access/load cargo
like "To access the cargo area, open the trunk lid using the trunk release lever located in the driver's footwell"
instead of focusing on the requested cargo capacity.

Can we improve this? Let’s take a look.

Read More →

RAG Evaluation - A Step-by-Step Guide with DeepEval

In my previous Evaluating RAG pipelines post, I introduced two approaches to evaluating RAG pipelines. In this post, I will show you how to implement these two approaches in detail. The implementation will naturally depend on the framework you use. In my case, I’ll be using DeepEval, an open-source evaluation framework.

Approach 1: Evaluating Retrieval and Generator separately

As a recap, in this approach you evaluate the retriever and the generator of the RAG pipeline separately, each with its own metrics. This allows you to pinpoint issues at the retriever level and at the generator level.
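To make that concrete, here is a minimal, hypothetical sketch of what the two sets of metrics can look like in DeepEval. The test case values are made up, and by default the metrics use an OpenAI judge unless you pass in a custom model (such as a Vertex AI one):

```python
# Sketch: evaluate the retriever and the generator with separate DeepEval metrics.
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the cargo capacity of Cymbal Starlight?",
    actual_output="The Cymbal Starlight 2024 has a cargo capacity of 13.5 cubic feet.",
    expected_output="The cargo capacity is 13.5 cubic feet.",
    retrieval_context=[
        "The Cymbal Starlight 2024 has a cargo capacity of 13.5 cubic feet.",
    ],
)

# Retriever metrics: did we retrieve the right context, ranked well, with little noise?
retriever_metrics = [
    ContextualPrecisionMetric(),
    ContextualRecallMetric(),
    ContextualRelevancyMetric(),
]

# Generator metrics: is the answer relevant and faithful to the retrieved context?
generator_metrics = [AnswerRelevancyMetric(), FaithfulnessMetric()]

evaluate(test_cases=[test_case], metrics=retriever_metrics + generator_metrics)
```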

Read More →

Evaluating RAG pipelines

Retrieval-Augmented Generation (RAG) has emerged as a dominant framework for feeding LLMs context beyond the scope of their training data, enabling them to respond with more grounded answers and fewer hallucinations based on that context.

However, designing an effective RAG pipeline can be challenging. You need to answer certain questions such as:

  1. How should you parse and chunk text documents for embedding? What chunk size and overlap should you use?
  2. What embedding model is best for your use case?
  3. What retrieval method works most effectively? How many documents should you retrieve by default? Does the retriever actually manage to retrieve the relevant documents?
  4. Does the generator actually generate content in line with the relevant context? What parameters (e.g. model, prompt template, temperature) work best?

The only way to objectively answer these questions is to measure how well the RAG pipeline works, but what exactly do you measure? This is the topic of this blog post.

Read More →

Gemini on Vertex AI and Google AI now unified with the new Google Gen AI SDK

If you’ve been working with Gemini, you’ve likely encountered the two separate client libraries for it: one for Google AI and one for Vertex AI in Google Cloud. Even though the two libraries are quite similar, there are slight differences that make them non-interchangeable.

I usually start my experiments in Google AI, and when it was time to switch to Vertex AI on Google Cloud, I couldn’t simply copy and paste my code. I had to go through and update my Google AI library calls to their Vertex AI equivalents. It wasn’t difficult, but it was quite annoying.
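The new Google Gen AI SDK addresses this: the same client and calls work against both backends, and you only change how the client is constructed. Here is a minimal sketch (the project, location, API key, and model values are placeholders):

```python
# Sketch: one SDK, two backends.
from google import genai

# Gemini Developer API (Google AI):
client = genai.Client(api_key="YOUR_API_KEY")

# ...or Vertex AI on Google Cloud, with no other code changes:
# client = genai.Client(vertexai=True, project="your-project-id", location="us-central1")

response = client.models.generate_content(
    model="gemini-2.0-flash-001",
    contents="Why is the sky blue?",
)
print(response.text)
```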

Read More →

Control LLM output with LangChain's structured and Pydantic output parsers

In my previous Control LLM output with response type and schema post, I talked about how you can define a JSON response schema and Vertex AI makes sure the output of the Large Language Model (LLM) conforms to that schema.

In this post, I show how you can implement a similar response schema using LangChain’s structured output parser with any model. You can go further and have the output parsed and populated into Python classes automatically with the Pydantic output parser. This helps you really narrow down and structure LLM outputs.
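As a preview, here is a minimal sketch of the Pydantic output parser in action. The Recipe schema and model name are hypothetical placeholders, and it assumes the langchain-google-vertexai package for the LLM:

```python
# Sketch: parse LLM output directly into a Pydantic class with LangChain.
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_google_vertexai import ChatVertexAI
from pydantic import BaseModel, Field

class Recipe(BaseModel):
    name: str = Field(description="Name of the recipe")
    ingredients: list[str] = Field(description="List of ingredients")

parser = PydanticOutputParser(pydantic_object=Recipe)

# The parser's format instructions tell the model what JSON shape to produce.
prompt = PromptTemplate(
    template="Answer the question.\n{format_instructions}\n{question}",
    input_variables=["question"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | ChatVertexAI(model_name="gemini-1.5-flash-002") | parser
recipe = chain.invoke({"question": "Give me a simple pancake recipe."})
print(recipe.name, recipe.ingredients)
```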

Read More →

Tracing with Langtrace and Gemini

Large Language Models (LLMs) feel like a totally new technology with totally new problems. That’s true to some extent, but at the same time, they also have the same old problems that we had to tackle in traditional technology.

For example, how do you figure out which LLM calls are taking too long or have failed? At the bare minimum, you need logging, but ideally, you use a full observability stack like OpenTelemetry with logging, tracing, metrics, and more. You need the good old software engineering practices, such as observability, applied to new technologies like LLMs.

Read More →

Batch prediction in Gemini

LLMs are great at generating content on demand, but if left unchecked, you can be left with a large bill at the end of the day. In my Control LLM costs with context caching post, I talked about how to limit costs by using context caching. Batch generation is another technique you can use to save both time and money.

What’s batch generation?

Batch generation in Gemini allows you to send multiple generative AI requests in batches rather than one by one and get the responses asynchronously, either in a Cloud Storage bucket or a BigQuery table. This not only simplifies the processing of large datasets, but it also saves time and money, as batch requests are processed in parallel and discounted 50% from standard requests.
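As a rough sketch (the bucket URIs, project ID, and model name below are placeholders, not from the post), a batch job on Vertex AI takes a JSONL file of requests and writes the responses to an output location once the asynchronous job finishes:

```python
# Sketch: submit and poll a Gemini batch prediction job on Vertex AI.
import time

import vertexai
from vertexai.batch_prediction import BatchPredictionJob

vertexai.init(project="your-project-id", location="us-central1")

job = BatchPredictionJob.submit(
    source_model="gemini-1.5-flash-002",
    input_dataset="gs://your-bucket/batch_requests.jsonl",
    output_uri_prefix="gs://your-bucket/batch_output/",
)

# Batch jobs run asynchronously; poll until the job ends.
while not job.has_ended:
    time.sleep(60)
    job.refresh()

print(job.state)
print(job.output_location)
```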

Read More →