Vertex AI Gemini generateContent (non-streaming) API


Introduction

In my recent blog post, I explored Vertex AI’s Gemini REST API and mainly talked about the streamGenerateContent method, which is a streaming API.

Recently, a new method appeared in the Vertex AI docs: generateContent, which is the non-streaming (unary) version of the API.

In this short blog post, I take a closer look at the new non-streaming generateContent API and explain why it makes sense as a simpler alternative when latency is not critical.

Recap: streamGenerateContent method

As a recap, this is how you can use the streamGenerateContent method:

PROJECT_ID="genai-atamel"
LOCATION="us-central1"
API_ENDPOINT=${LOCATION}-aiplatform.googleapis.com
MODEL_ID="gemini-pro"

curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --no-buffer -H "Content-Type: application/json"  \
    https://${API_ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}:streamGenerateContent -d \
    $'{
      "contents": {
        "role": "USER",
        "parts": { "text": "Why is the sky blue?" }
      },
      "generation_config":{
        "temperature": 0.4,
        "top_p": 1,
        "top_k": 32,
        "max_output_tokens": 2048
      }
  }'

And this is the sort of response you’d get:

[{
    "candidates": [
      {
        "content": {
          "role": "model",
          "parts": [
            {
              "text": "The sky appears blue due to a phenomenon called Rayleigh scattering. This occurs when sunlight, which is composed of all colors of the visible spectrum, passes through the Earth"
            }
          ]
        },
        "safetyRatings": [
          ...
        ]
      }
    ]
  }
  ,
  {
    "candidates": [
      {
        "content": {
          "role": "model",
          "parts": [
            {
              "text": "'s atmosphere. The atmosphere is made up of tiny particles, such as molecules and dust, that are much smaller than the wavelength of visible light.\n\nAs"
            }
          ]
        },
        "safetyRatings": [
          ...
        ]
      }
    ]
  }
  ,
  {
    "candidates": [
      {
        "content": {
          "role": "model",
          "parts": [
            {
              "text": " sunlight passes through the atmosphere, these particles scatter the light in all directions. However, the amount of scattering depends on the wavelength of the light. Shorter wavelengths, such as blue light, are scattered more than longer wavelengths, such as red light. This is because the shorter wavelengths have a higher frequency and therefore interact more with"
            }
          ]
        },
        "safetyRatings": [
          ...
        ]
      }
    ]
  }
  ,
  {
    "candidates": [
      {
        "content": {
          "role": "model",
          "parts": [
            {
              "text": " the particles in the atmosphere.\n\nAs a result, more blue light is scattered in all directions, which means that when we look up at the sky, we see more blue light than any other color. This is why the sky appears blue during the day.\n\nAt sunset and sunrise, the sunlight has to travel through"
            }
          ]
        },
        "safetyRatings": [
          ...
        ]
      }
    ]
  }
  ,
  {
    "candidates": [
      {
        "content": {
          "role": "model",
          "parts": [
            {
              "text": " more of the atmosphere to reach our eyes. This means that more of the blue light is scattered away, and we see more of the longer wavelengths, such as red and orange. This is why the sky appears red or orange at these times of day."
            }
          ]
        },
        "finishReason": "STOP",
        "safetyRatings": [
          ...
        ]
      }
    ],
    "usageMetadata": {
      ...
    }
  }
  ]

Notice how the text is split into multiple chunks and the last chunk has finishReason: STOP to indicate that it’s the last one. Since this is a streaming API, you’d receive these chunks as they become available.

This is useful if you have a latency-sensitive application such as a chat application. But it makes processing the response more complicated, as you need to combine the text from each chunk into the final text.
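To make that combining step concrete, here is a minimal Python sketch. It assumes the complete streamed body has already been saved and parsed as the JSON array shown above, and that you only care about the first candidate in each chunk. The helper name combine_stream_chunks is hypothetical and not part of any Google library, and the sample data is a shortened stand-in for the real response.

```python
import json


def combine_stream_chunks(chunks):
    """Join the text of the first candidate of every chunk, in order.

    Returns (full_text, finish_reason). finish_reason is taken from
    whichever chunk carries it (the last one in the response above).
    """
    pieces = []
    finish_reason = None
    for chunk in chunks:
        candidates = chunk.get("candidates", [])
        if not candidates:
            continue
        candidate = candidates[0]  # assumption: a single candidate per chunk
        for part in candidate.get("content", {}).get("parts", []):
            pieces.append(part.get("text", ""))
        finish_reason = candidate.get("finishReason", finish_reason)
    return "".join(pieces), finish_reason


# Shortened, hypothetical stand-in for the streamed JSON array shown above.
raw = """[
  {"candidates": [{"content": {"role": "model",
      "parts": [{"text": "The sky appears blue due to"}]}}]},
  {"candidates": [{"content": {"role": "model",
      "parts": [{"text": " Rayleigh scattering."}]},
      "finishReason": "STOP"}]}
]"""

text, finish_reason = combine_stream_chunks(json.loads(raw))
print(text)
print(finish_reason)
```

Note that this sketch waits for the whole array before joining, so it gives up the latency benefit of streaming; a real chat application would more likely append each chunk’s text to the display as it arrives.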

New: generateContent method

If you have an application where latency is not that important and you’d rather wait for the whole response before displaying anything to the user, then the generateContent method is more appropriate. It is the non-streaming (unary) version of the API.

Usage of generateContent is very similar to before. Only the method name in the URL changes, and curl’s --no-buffer flag is dropped:

PROJECT_ID="genai-atamel"
LOCATION="us-central1"
API_ENDPOINT=${LOCATION}-aiplatform.googleapis.com
MODEL_ID="gemini-pro"

curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json"  \
    https://${API_ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}:generateContent -d \
    $'{
      "contents": {
        "role": "USER",
        "parts": { "text": "Why is the sky blue?" }
      },
      "generation_config":{
        "temperature": 0.4,
        "top_p": 1,
        "top_k": 32,
        "max_output_tokens": 2048
      }
  }'

But the response is different:

{
    "candidates": [
      {
        "content": {
          "role": "model",
          "parts": [
            {
              "text": "The sky appears blue because of a phenomenon called Rayleigh scattering. This occurs when sunlight passes through the Earth's atmosphere and interacts with molecules of nitrogen and oxygen. These molecules are much smaller than the wavelength of visible light, so they scatter the light in all directions. However, blue light is scattered more than other colors because it has a shorter wavelength. This means that more blue light reaches our eyes from all directions, making the sky appear blue."
            }
          ]
        },
        "finishReason": "STOP",
        "safetyRatings": [
          ...
        ],
        "citationMetadata": {
          ...
        }
      }
    ],
    "usageMetadata": {
      ...
    }
  }

As you can see, we got back a single response object with the full text and finishReason: STOP. This is much simpler to process and a better choice for applications where you don’t need to display responses right away.
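For comparison, reading the text out of the unary response needs no loop over chunks. The sketch below assumes the response has the shape shown above and that only the first candidate matters. The helper name extract_text is hypothetical, and the sample JSON is a shortened stand-in for the real response.

```python
import json


def extract_text(response):
    """Return (text, finish_reason) for the first candidate of a response."""
    candidates = response.get("candidates", [])
    if not candidates:
        return "", None
    candidate = candidates[0]  # assumption: only the first candidate is used
    parts = candidate.get("content", {}).get("parts", [])
    answer_text = "".join(part.get("text", "") for part in parts)
    return answer_text, candidate.get("finishReason")


# Shortened, hypothetical stand-in for the generateContent response above.
raw_response = """{
  "candidates": [{
    "content": {"role": "model",
      "parts": [{"text": "The sky appears blue because of Rayleigh scattering."}]},
    "finishReason": "STOP"
  }]
}"""

answer, reason = extract_text(json.loads(raw_response))
print(answer)
print(reason)
```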

Summary

In this short blog post, I showed you how to use the non-streaming generateContent API and explained why it makes sense as a simpler alternative when latency is not critical.

If you want to run these samples yourself, you can check out my GenAI repo on GitHub.

As always, for any questions or feedback, feel free to reach out to me on Twitter @meteatamel.

