LLM Guard and Vertex AI

I’ve been focusing on evaluation frameworks lately because I believe that the hardest problem while using LLMs is to make sure they behave properly. Are you getting the right outputs grounded with your data? Are outputs free of harmful or PII data? When you make a change to your RAG pipeline or to your prompts, are outputs getting better or worse? How do you know? You don’t know unless you measure. What do you measure and how? These are the sort of questions you need to answer and that’s when evaluation frameworks come into the picture.

In DeepEval and Vertex AI post, I introduced DeepEval as an evaluation framework for your LLMs and in Promptfoo and Vertex AI post, I introduced Promptfoo as an evaluation and security framework.

In this post, I will introduce yet another security framework for LLMs: LLM Guard.

Introduction to LLM Guard

LLM Guard is a comprehensive security toolkit for LLMs. It has input and output scanners offering sanitization, harmful language detection, data leakage prevention, guard against prompt injections and more to make LLMs interactions safe and secure.

Installation is straightforward:

pip install llm-guard

Let’s now take a look at some input and output scanners and see what they can do for you. All the examples are in our LLM Guard and Vertex AI Tutorial.

Input scanners

LLM Guard has a long list of input scanners that lets you to sanitize what you send to LLMs.

For example, there’s Ban Code Scanner where you can detect if code is present in prompts and ban them for whatever reason (e.g. users trying to exploit vulnerabilities).

You can run it as follows:

def ban_code_input():
    prompt = "System.out.println('Hello World')"
    scanner = input_scanners.BanCode()
    prompt, is_valid, risk_score = scanner.scan(prompt)
    print_results(is_valid, prompt, risk_score)

It gives you back a valid flag with a risk score. As you see in this case, the input scanner deemed the prompt to be invalid with a 100% risk score:

python main.py ban_code_input

Detected code in the text      score=1.0 text=System.out.println('Hello World') threshold=0.97
Input/output: System.out.println('Hello World')
Valid? False
Risk score: 1.0

Or you might want to ban certain topics from prompts with Ban Topics Scanner.

For example, this is how you can ban politics:

def ban_topics_input():
    prompt = "Keir Starmer is the prime minister of the UK"
    scanner = input_scanners.BanTopics(topics=["politics"], threshold=0.5)
    prompt, is_valid, risk_score = scanner.scan(prompt)
    print_results(is_valid, prompt, risk_score)

Run it:

python main.py ban_topics_input

Topics detected for the prompt scores={'politics': 0.9804285764694214}
Input/output: Keir Starmer is the prime minister of the UK
Valid? False
Risk score: 1.0

Under the covers, LLM Guard uses models from HuggingFace to perform these scans. There are many more input scanners from banning competitors, substrings, code, gibberish to invisible text, detecting certain languages to prompt injections and sentiment.

Output scanners

Output scanners are similar to input scanners but they apply on the LLM outputs (rather than prompts).

For example, let’s say you want to prevent LLM talking nonsense, you can use Gibberish Scanner to detect non-sensical outputs like this:

def gibberish_output():
    from llm_guard.input_scanners.gibberish import MatchType

    model_output = "abcasd asdkhasd asdasd"
    scanner = output_scanners.Gibberish(match_type=MatchType.FULL)
    model_output, is_valid, risk_score = scanner.scan("", model_output)
    print_results(is_valid, model_output, risk_score)

python main.py gibberish_output

Detected gibberish text        score=1.0 threshold=0.7
Input/output: abcasd asdkhasd asdasd
Valid? False
Risk score: 1.0

Or maybe you want to make sure LLM only responds in a certain language. You can do that with Language Scanner.

Here, we only allow French:

def language_output():
    from llm_guard.input_scanners.gibberish import MatchType

    model_output = "This is some text in English"
    scanner = output_scanners.Language(valid_languages=["fr"], match_type=MatchType.FULL)
    model_output, is_valid, risk_score = scanner.scan("", model_output)
    print_results(is_valid, model_output, risk_score)

The output scanner detects that the language is English, not French:

python main.py language_output

Languages are found with high confidence languages=['en']
Input/output: This is some text in English
Valid? False
Risk score: 1.0

Similar to input scanners, there are many more output scanners.

Anonymize/Deanonymize scanner with Vertex AI

Now that we understand the basics of how input and output scanners work, let’s take a look at a more complete example with Vertex AI. Imagine you have a prompt with some personal data, somethign like this:

prompt = ("Make an SQL insert statement to add a new user to our database. Name is John Doe. Email is "
          "test@test.com but also possible to contact him with hello@test.com email. Phone number is 555-123-4567 "
          "and the IP address is 192.168.1.100. And credit card number is 4567-8901-2345-6789. " +
          "He works in Test LLC. " +
          "Only return the SQL statement and nothing else")

But you don’t want to send this personal information to the LLM. What can you do? You can use Anonymize Input Scanner:

vault = Vault()
input_scanners = [Anonymize(vault)]
sanitized_prompt, results_valid, results_score = scan_prompt(input_scanners, prompt)

When you run this, you can get back sanitized prompt:

Sanitized prompt: Make an SQL insert statement to add a new user to our database. Name is [REDACTED_PERSON_1] Doe. 
Email is [REDACTED_EMAIL_ADDRESS_1] but also possible to contact him with [REDACTED_EMAIL_ADDRESS_2] email. 
Phone number is [REDACTED_PHONE_NUMBER_1] and the IP address is [REDACTED_IP_ADDRESS_1]. 
And credit card number is [REDACTED_CREDIT_CARD_RE_1]. He works in Test LLC. Only return the SQL statement and nothing else

Then, you pass this sanitized prompt to Gemini:

model = GenerativeModel('gemini-1.5-flash-001')
response = model.generate_content(sanitized_prompt)

You get back a response similar to this:

Response text: ```sql
INSERT INTO users (name, email, alternative_email, phone, ip_address, credit_card, company) VALUES ('[REDACTED_PERSON_1] Doe', 
'[REDACTED_EMAIL_ADDRESS_1]', '[REDACTED_EMAIL_ADDRESS_2]', '[REDACTED_PHONE_NUMBER_1]', '[REDACTED_IP_ADDRESS_1]', 
'[REDACTED_CREDIT_CARD_RE_1]', 'Test LLC');

At this point, you can feed this back to the Deanonymize Output Scanner:

output_scanners = [Deanonymize(vault)]
sanitized_response_text, results_valid, results_score = scan_output(
    output_scanners, sanitized_prompt, response.text
)

Voila! You get back the output with the personal data:

Sanitized output: ```sql
INSERT INTO users (name, email, alternative_email, phone, ip_address, credit_card, company) 
VALUES ('John Doe', 'test@test.com', 'hello@test.com', '555-123-4567', '192.168.1.100', '4567-8901-2345-6789', 'Test LLC');

The important point is that you never sent the personal information to the LLM, nice

You can see the full example in anonymize_vertexai.py.

Multiple scanners with Vertex AI

Last but not least, you can also chain multiple input and output scanners:

input_scanners = [Anonymize(vault), Toxicity(), TokenLimit(), PromptInjection()]
output_scanners = [Deanonymize(vault), NoRefusal(), Relevance(), Sensitive()]

And apply them to prompts and outputs:

sanitized_prompt, results_valid, results_score = scan_prompt(input_scanners, prompt)
...
sanitized_output, results_valid, results_score = scan_output(output_scanners, sanitized_prompt, response.text)

You can see the full example in multiple_vertexai.py.

Conclusion

LLM Guard is quite useful to sanitize inputs and outputs from LLMs. Combined with other evaluation and security frameworks, it can help you to make sure LLMs are well behaved. Here are some links for more reading:

GenAI GoogleAI VertexAI Gemini Google Cloud Platform