A tour of Gemini 1.5 Pro samples


Introduction

Back in February, Google announced Gemini 1.5 Pro with its impressive 1 million token context window.

The larger context window means that Gemini 1.5 Pro can process vast amounts of information in one go: 1 hour of video, 11 hours of audio, 30,000 lines of code, or over 700,000 words. The good news is that there is also good support across programming languages.

In this blog post, I will point out some samples that use Gemini 1.5 Pro on Google Cloud's Vertex AI, covering different use cases and languages (Python, Node.js, Java, C#, Go).
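
Before diving in, if you're curious how much of that 1 million token window a given input actually consumes, the Vertex AI Python SDK has a count_tokens method you can call before sending a request. Here's a minimal sketch (not one of the official samples; the project ID is a placeholder and the video URI is the Pixel 8 clip used later in the Java sample):

# A minimal sketch: check how many tokens an input uses before sending it.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel(model_name="gemini-1.5-pro-preview-0409")

video = Part.from_uri(
    "gs://cloud-samples-data/generative-ai/video/pixel8.mp4",
    mime_type="video/mp4",
)

# count_tokens reports how much of the 1 million token window the input consumes.
response = model.count_tokens([video, "Describe this video."])
print(response.total_tokens)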

Audio

Gemini 1.5 Pro can understand audio. Take, for example, the Made by Google podcast episode (pixel.mp3) used in the samples. It's 10:28 long, but maybe you don't have the time or patience to listen to it in full.

You can use Gemini to summarize it with Python in gemini_audio.py:

def summarize_audio(project_id: str) -> str:

    import vertexai
    from vertexai.generative_models import GenerativeModel, Part

    vertexai.init(project=project_id, location="us-central1")

    model = GenerativeModel(model_name="gemini-1.5-pro-preview-0409")

    prompt = """
    Please provide a summary for the audio.
    Provide chapter titles with timestamps, be concise and short, no need to provide chapter summaries.
    Do not make up any information that is not part of the audio and do not be verbose.
    """

    audio_file_uri = "gs://cloud-samples-data/generative-ai/audio/pixel.mp3"
    audio_file = Part.from_uri(audio_file_uri, mime_type="audio/mpeg")

    contents = [audio_file, prompt]

    response = model.generate_content(contents)
    print(response.text)

    return response.text

You get a nice summary from Gemini:

This episode of the Made by Google podcast discusses the Pixel feature drops,
which are software updates that bring new features and improvements to Pixel
devices. The hosts, Aisha Sharif and DeCarlos Love, who are both product
managers for various Pixel devices, talk about the importance of feature drops
in keeping Pixel devices up-to-date and how they use user feedback to decide
which features to include in each drop. They also highlight some of their
favorite features from past feature drops, such as call screening, direct my
call, and clear calling.

Chapter Titles with Timestamps:

00:00 Intro
00:14 Made by Google Podcast Intro
00:35 Transformative Pixel Features
01:49 Why Feature Drops Are Important
02:28 January Feature Drop Highlights
02:58 March Feature Drop: Pixel Watch
03:41 March Feature Drop: Pixel Phone 
05:34 More Portfolio Updates
06:09 Pixel Superfans Question
07:32 Importance of User Feedback
08:07 Feature Drop Release Date
08:23 Favorite Feature Drop Features
10:17 Outro
10:18 Podcast Outro

Maybe you want to transcribe the whole audio file instead. Here’s how you can do it with Node.js in gemini-audio-transcription.js:

const {VertexAI} = require('@google-cloud/vertexai');

async function transcript_audio(projectId = 'PROJECT_ID') {
  const vertexAI = new VertexAI({project: projectId, location: 'us-central1'});

  const generativeModel = vertexAI.getGenerativeModel({
    model: 'gemini-1.5-pro-preview-0409',
  });

  const filePart = {
    file_data: {
      file_uri: 'gs://cloud-samples-data/generative-ai/audio/pixel.mp3',
      mime_type: 'audio/mpeg',
    },
  };
  const textPart = {
    text: `
    Can you transcribe this interview, in the format of timecode, speaker, caption?
    Use speaker A, speaker B, etc. to identify speakers.`,
  };

  const request = {
    contents: [{role: 'user', parts: [filePart, textPart]}],
  };

  const resp = await generativeModel.generateContent(request);
  const contentResponse = await resp.response;
  console.log(JSON.stringify(contentResponse));
}

transcript_audio(...process.argv.slice(2)).catch(err => {
  console.error(err.message);
  process.exitCode = 1;
});

You get a full transcription (cropped here to keep it short):

## Interview Transcription

**00:00** Speaker A: Your devices are getting better over time, and so we think
about it across the entire portfolio, from phones to watch, to buds, to tablet.
We get really excited about how we can tell a joint narrative across everything. 

**00:14** Speaker B: Welcome to the Made by Google Podcast, where we meet the
people who work on the Google products you love. Here's your host, Rasheed
Finch.

...

**10:19** Speaker C: Don’t miss out on new episodes. Subscribe now wherever you
get your podcasts to be the first to listen. 

Video with audio

So far so good, but how about videos?

Take, for example, the 57-second Pixel 8 video (pixel8.mp4) used in the samples.

You can get a description of the video, including everything people say in it, in Java with VideoInputWithAudio.java:

  public static String videoAudioInput(String projectId, String location, String modelName)
      throws IOException {
    try (VertexAI vertexAI = new VertexAI(projectId, location)) {
      String videoUri = "gs://cloud-samples-data/generative-ai/video/pixel8.mp4";

      GenerativeModel model = new GenerativeModel(modelName, vertexAI);
      GenerateContentResponse response = model.generateContent(
          ContentMaker.fromMultiModalData(
              "Provide a description of the video.\n The description should also "
                  + "contain anything important which people say in the video.",
              PartMaker.fromMimeTypeAndData("video/mp4", videoUri)
          ));

      String output = ResponseHandler.getText(response);
      System.out.println(output);

      return output;
    }
  }

You get a pretty impressive output:

The video is an advertisement for the new Google Pixel phone. It features a
photographer in Tokyo who is using the phone to take pictures and  videos of the
city at night. The video highlights the phone's "Night Sight" feature, which
allows users to take clear and bright pictures and videos in low-light
conditions. The photographer also mentions that the phone's "Video Boost"
feature helps to improve the quality of videos taken in low light. The video
shows the photographer taking pictures and videos of various scenes in Tokyo,
including the city streets, a bar, and a puddle. The video ends with the
photographer saying that the new Pixel phone is "amazing" and that she "loves
it."

All modalities

You can go even further and process images, video, audio, and text at the same time. Here’s how to do it in C# with MultimodalAllInput.cs:

    public async Task<string> AnswerFromMultimodalInput(
        string projectId = "your-project-id",
        string location = "us-central1",
        string publisher = "google",
        string model = "gemini-1.5-pro-preview-0409")
    {

        var predictionServiceClient = new PredictionServiceClientBuilder
        {
            Endpoint = $"{location}-aiplatform.googleapis.com"
        }.Build();

        string prompt = "Watch each frame in the video carefully and answer the questions.\n"
                  + "Only base your answers strictly on what information is available in "
                  + "the video attached. Do not make up any information that is not part "
                  + "of the video and do not be too verbose, be to the point.\n\n"
                  + "Questions:\n"
                  + "- When is the moment in the image happening in the video? "
                  + "Provide a timestamp.\n"
                  + "- What is the context of the moment and what does the narrator say about it?";

        var content = new Content
        {
            Role = "USER"
        };
        content.Parts.AddRange(new List<Part>()
        {
            new() {
                Text = prompt
            },
            new() {
                FileData = new() {
                    MimeType = "video/mp4",
                    FileUri = "gs://cloud-samples-data/generative-ai/video/behind_the_scenes_pixel.mp4"
                }
            },
            new() {
                FileData = new() {
                    MimeType = "image/png",
                    FileUri = "gs://cloud-samples-data/generative-ai/image/a-man-and-a-dog.png"
                }
            }
        });

        var generateContentRequest = new GenerateContentRequest
        {
            Model = $"projects/{projectId}/locations/{location}/publishers/{publisher}/models/{model}"
        };
        generateContentRequest.Contents.Add(content);

        GenerateContentResponse response = await predictionServiceClient.GenerateContentAsync(generateContentRequest);

        string responseText = response.Candidates[0].Content.Parts[0].Text;
        Console.WriteLine(responseText);

        return responseText;
    }

The output:

- The timestamp of the image is 00:49.
- The context is that the narrator, a blind filmmaker, is talking about the
  story of his film. The story is about a blind man and his girlfriend, and the
  film follows them on their journey together.

PDF files

Gemini 1.5 Pro can even handle PDF files. Here's a Go example in pdf.go that summarizes a given PDF with the help of Gemini:

import (
	"context"
	"errors"
	"fmt"
	"io"

	"cloud.google.com/go/vertexai/genai"
)

type pdfPrompt struct {
	pdfPath  string
	question string
}

func generateContentFromPDF(w io.Writer, prompt pdfPrompt, projectID, location, modelName string) error {
	// prompt := pdfPrompt{
	// 	pdfPath: "gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf",
	// 	question: `
	// 		You are a very professional document summarization specialist.
	// 		Please summarize the given document.
	// 	`,
	// }
	// location := "us-central1"
	// modelName := "gemini-1.5-pro-preview-0409"
	ctx := context.Background()

	client, err := genai.NewClient(ctx, projectID, location)
	if err != nil {
		return fmt.Errorf("unable to create client: %w", err)
	}
	defer client.Close()

	model := client.GenerativeModel(modelName)

	part := genai.FileData{
		MIMEType: "application/pdf",
		FileURI:  prompt.pdfPath,
	}

	res, err := model.GenerateContent(ctx, part, genai.Text(prompt.question))
	if err != nil {
		return fmt.Errorf("unable to generate contents: %w", err)
	}

	if len(res.Candidates) == 0 ||
		len(res.Candidates[0].Content.Parts) == 0 {
		return errors.New("empty response from model")
	}

	fmt.Fprintf(w, "generated response: %s\n", res.Candidates[0].Content.Parts[0])
	return nil
}

You get the summary back (cropped here to keep it short):

## Gemini 1.5 Pro: A Summary of its Multimodal, Long-Context Capabilities

**Gemini 1.5 Pro** is a cutting-edge multimodal, large language model (LLM)
developed by Google DeepMind. Its most significant advancement lies in its
ability to process and understand extremely long contexts of information,
spanning millions of tokens across various modalities like text, audio, and
video. This represents a substantial leap forward from previous LLMs, which were
typically limited to processing hundreds of thousands of tokens.

Here are the key takeaways from the document:
...

System instructions

Last but not least, Gemini 1.5 Pro supports system instructions. System instructions let you steer the behavior of the model based on your specific needs and use case; they act as additional context that applies to the full user interaction with the model.

Here's a Python example in gemini_system_instruction.py showing how to set system instructions:

def set_system_instruction(project_id: str) -> str:
    import vertexai

    from vertexai.generative_models import GenerativeModel

    vertexai.init(project=project_id, location="us-central1")

    model = GenerativeModel(
        model_name="gemini-1.5-pro-preview-0409",
        system_instruction=[
            "You are a helpful language translator.",
            "Your mission is to translate text in English to French.",
        ],
    )

    prompt = """
    User input: I like bagels.
    Answer:
    """

    contents = [prompt]

    response = model.generate_content(contents)
    print(response.text)

    return response.text

And with that instruction, the model answers in French:

J'aime les bagels.
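
Since the system instruction applies to the full interaction, it also persists across turns in a chat session. As a rough sketch (assuming the same model and instruction as above, with a placeholder project ID), you can start a chat and every message gets translated:

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel(
    model_name="gemini-1.5-pro-preview-0409",
    system_instruction=[
        "You are a helpful language translator.",
        "Your mission is to translate text in English to French.",
    ],
)

# The system instruction stays in effect for every turn of the chat.
chat = model.start_chat()
print(chat.send_message("I like bagels.").text)
print(chat.send_message("See you tomorrow.").text)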

Conclusion

Gemini 1.5 Pro is quite impressive with its multimodal capabilities and large context window. In this blog post, I pointed you to samples for different use cases in different languages. If you want to learn more, here's a list of further resources:


As always, for any questions or feedback, feel free to reach out to me on Twitter @meteatamel.

