Introduction
It’s no secret that LLMs sometimes lie, and they do so with a great deal of confidence. This might be OK for some applications, but it can be a real problem if your application requires high levels of accuracy.
I remember when the first LLMs emerged back in early 2023. I tried some of the early models and it felt like they were hallucinating half of the time. More recently, it started feeling like LLMs are getting better at giving more factual answers. But it’s just a feeling and you can’t base application decisions (or any decision?) on feelings, can you?
How can I tell if my LLM is lying to me and how much?
The rigorous answer probably requires using something like Vertex AI’s Generative AI evaluation service which lets you evaluate model performance across specific use cases. It’s something I’ll look into and report back on in a future article.
For now, I’m interested in a quick lie detector test for my LLM and for that, you need:
- A high quality, labeled dataset that you can trust
- A good test question that you can ask the LLM about the dataset
- A way to measure the accuracy of the LLM response
Let’s talk about these in more detail.
Dataset
For a dataset, you have a lot of options and you probably want to use something that’s close to your application domain to make it a realistic test. However, in this quick & dirty lie detector test, we can use any dataset that you trust to be correct.
I was introduced to the Open Trivia database by a coworker last year. It is a free-to-use, user-contributed trivia question database with 4,100+ verified questions and answers.
It also has an API that lets you request random questions and answers in different categories with a URL like this: https://opentdb.com/api.php?amount=10
And you get a JSON response like this:
{
  "response_code": 0,
  "results": [
    {
      "type": "multiple",
      "difficulty": "medium",
      "category": "Sports",
      "question": "How many games did Arsenal FC go unbeaten during the 2003-2004 season of the English Premier League",
      "correct_answer": "38",
      "incorrect_answers": [
        "51",
        "49",
        "22"
      ]
    },
    ...
  ]
}
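To make this concrete, here is a minimal sketch of how fetching and trimming these questions could look in Python. It assumes the requests library; the helper name matches the get_questions function described later, but the actual implementation in the repo may differ (for example, Open Trivia HTML-encodes some strings, which html.unescape can clean up):

```python
import requests

OPENTDB_URL = "https://opentdb.com/api.php"

def get_questions(amount=10):
    """Fetch random trivia questions and keep only the fields we need."""
    response = requests.get(OPENTDB_URL, params={"amount": amount}, timeout=30)
    response.raise_for_status()
    data = response.json()
    if data["response_code"] != 0:
        raise RuntimeError(f"Open Trivia error, response_code={data['response_code']}")
    # Drop type, difficulty, and category; keep the question and its answers.
    return [
        {
            "question": item["question"],
            "correct_answer": item["correct_answer"],
            "incorrect_answers": item["incorrect_answers"],
        }
        for item in data["results"]
    ]
```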
Test
We have the dataset but what should the test question be for the LLM?
Again, there are different approaches you can take here, but the simplest is to ask the LLM: given this question and answers, find the correct answer. This is a good test because it’s a very specific task, the LLM has all the info it needs, and the lie is easy to detect.
We need to be careful not to bias the LLM with the correct answer. We also don’t want to burden the LLM with more info than it needs. This means the previous JSON response can be transformed into this simplified JSON, with the correct and incorrect answers shuffled together in a single list:
[
  {
    "question": "How many games did Arsenal FC go unbeaten during the 2003-2004 season of the English Premier League",
    "answers": [
      "51",
      "49",
      "38",
      "22"
    ]
  },
  ...
]
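Here is a sketch of that transformation, assuming the trimmed question format from the fetch step above (the repo's transform_questions may differ in its details):

```python
import random

def transform_questions(questions):
    """Merge correct and incorrect answers into one shuffled list per question."""
    transformed = []
    for q in questions:
        answers = [q["correct_answer"]] + q["incorrect_answers"]
        random.shuffle(answers)  # shuffle so the correct answer isn't always in the same spot
        transformed.append({"question": q["question"], "answers": answers})
    return transformed
```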
Now, we can ask the LLM: Given these questions and answers, find the correct answer for each question.
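In code, the prompt can be as simple as that instruction followed by the simplified JSON. The wording and helper below are only illustrative; the repo's ask_llm likely phrases things differently:

```python
import json

def build_prompt(transformed_questions):
    """Combine the instruction with the simplified questions serialized as JSON."""
    instruction = (
        "Given these questions and answers, find the correct answer for each question. "
        "Return a JSON list of objects, each with a 'question' and a 'correct_answer' field."
    )
    return instruction + "\n\n" + json.dumps(transformed_questions, indent=2)
```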
Measure
Measuring accuracy is basically comparing the correct answers in the original dataset with the answers the LLM gives. If they all match, you get 100% accuracy, but we know this won’t be the case. The question is: how close will we get to 100% accuracy with different models?
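A sketch of that comparison, assuming both lists hold objects with question and correct_answer fields (the repo's compare_question_lists may be more involved):

```python
def compare_question_lists(original_questions, llm_questions):
    """Return the fraction of questions where the LLM picked the correct answer."""
    # Index the LLM's answers by question text so ordering doesn't matter.
    llm_answers = {q["question"]: q["correct_answer"] for q in llm_questions}
    correct = sum(
        1
        for q in original_questions
        if llm_answers.get(q["question"]) == q["correct_answer"]
    )
    return correct / len(original_questions)
```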
Code
The full code is in my opentrivia-llm-testing repo. You can check out main.py for details, but to give you an overview:
- get_questions retrieves questions from Open Trivia and filters out the unnecessary fields.
- transform_questions combines correct and incorrect answers into a single field.
- ask_llm asks the LLM to find the correct answer and return JSON in the same format as Open Trivia.
- compare_question_lists compares the lists and keeps track of how many correct answers the LLM returned.
- run_tests runs multiple tests in an iteration, keeping track of accuracy and also time taken.
A few notes about the code:
- When prompting the LLM, I set the temperature to 0 for more consistency. Temperature can be thought of as a measure of how random or creative you want the response to be. Using a value of zero means we want the most repeatable results possible.
- I turned off safety settings to avoid censored output.
- Despite specifying the desired JSON output in our prompt, I occasionally still got malformed JSON, so I had to add some post-processing to catch and correct that case (see the sketch after this list).
- I observed quite a bit of variance in the results when running these tests, so I ran multiple iterations and averaged the results.
- I tried adding the Google Search grounding option to see if it helps improve accuracy.
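To illustrate the temperature setting and the JSON cleanup, here is a sketch of the model call with temperature 0 and a tolerant JSON parser, using the Vertex AI Python SDK. The region, the parse_json_response helper, and the fence-stripping logic are my assumptions; main.py may do this differently, and it also disables safety settings, which I omit here:

```python
import json

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

def ask_llm(project_id, model_id, prompt):
    """Send the prompt with temperature 0 and parse the JSON reply."""
    vertexai.init(project=project_id, location="us-central1")  # assumed region
    model = GenerativeModel(model_id)
    response = model.generate_content(
        prompt,
        generation_config=GenerationConfig(temperature=0),
        # The repo also passes safety_settings to avoid censored output; omitted here.
    )
    return parse_json_response(response.text)

def parse_json_response(text):
    """Tolerate a common slip: Markdown code fences wrapped around the JSON."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.split("\n", 1)[1]   # drop the opening fence line
        cleaned = cleaned.rsplit("```", 1)[0]  # drop the closing fence
    return json.loads(cleaned)
```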
You can run the tests as follows, with the defaults of 4 test iterations, 25 questions per iteration, and no Google Search grounding:
python main.py your-project-id model-id
If you wanted to customize the test runs, you could do so as follows:
python main.py your-project-id model-id \
--num_iterations=4 --no_questions=25 --google_search_grounding
As a side note, I used Gemini Code Assist to help me write the Python code and it was amazing! It really sped up my development process and saved me quite a bit of time as a non-expert Python developer.
Results
Disclaimer: My tests are hardly scientific. For each model, I only ran 4 iterations with 25 questions in each iteration. 100 questions is not enough; I’d need to run more questions and more iterations, and use more sophisticated evaluation methods beyond accuracy and time. Also, I don’t know if correctness against Open Trivia generalizes to other use cases.
Let’s look at the results with the disclaimer in mind. You can see the detailed test runs in the runs folder, along with a results table.
Surprisingly, Google Search grounding did not help in my tests. It made execution times longer with a slightly lower correctness percentage. My guess is that the dataset is already public and the models have likely already been trained on it, and that’s why grounding with Google Search does not help.
Here are the results without grounding, with average correctness and average execution time in seconds:
Average results without grounding
| Model | Percentage Correct (Avg) | Execution Seconds (Avg) |
|---|---|---|
| gemini-1.0-pro | 91 | 14.35 |
| gemini-1.0-ultra | 94 | 44.62 |
| gemini-1.5-flash | 82 | 16.77 |
| gemini-1.5-pro | 93 | 39.98 |
Most models achieved more than 90% correctness, which was good to see. There was a big variation in execution times, ranging from about 14 seconds to almost 45 seconds. If speed is not important, I’d probably go with gemini-1.5-pro for its high accuracy. Otherwise, I’d probably choose gemini-1.0-pro, as it is quick but also has good accuracy. Another consideration is cost, but it’s not something I looked into.
Conclusion
In this blog post, I tried to see whether I could figure out how much LLMs lie to me and came up with a simple test. As I said, my tests are hardly scientific, and I’m not sure whether they generalize beyond the Open Trivia dataset. The main point is that if accuracy is important for your use case, you can come up with a test like this too. It was nice to see accuracy above 90% with most models in my case. It’s not 100%, but it’s a good start, and with other techniques like RAG and grounding, it can probably be improved depending on the dataset and the questions asked.
In a future post, I want to look into Vertex AI’s Generative AI evaluation service and see if I can do a more rigorous, scientific evaluation.