Batch prediction in Gemini


LLMs are great at generating content on demand, but if left unchecked, they can leave you with a large bill at the end of the day. In my Control LLM costs with context caching post, I talked about how to limit costs with context caching. Batch generation is another technique you can use to save both time and money.

What’s batch generation?

Batch generation in Gemini allows you to send multiple generative AI requests in a single batch rather than one by one, and to receive the responses asynchronously in a Cloud Storage bucket or a BigQuery table. This not only simplifies the processing of large datasets, it also saves time and money: batch requests are processed in parallel and priced at a 50% discount compared to standard requests.

Consider an online bookstore with thousands of books. Instead of generating descriptions for each book one by one, which would be time-consuming, Gemini batch generation can generate all the descriptions in parallel, reducing both the overall processing time and the cost.

Let’s take a look at how to use batch generation.

Create Cloud Storage buckets

You can use either Cloud Storage or BigQuery to store batch request inputs and to save batch job results. Here, we’ll use Cloud Storage.
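If you’d rather use BigQuery, the preparation step would instead be creating a dataset to hold the results. A minimal sketch with the bq CLI (the dataset name is just an example and isn’t used in the rest of this post):

bq mk --dataset your-project-id:gemini_batch_results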

First, let’s create a bucket to save batch request input files:

PROJECT_ID=your-project-id
INPUT_BUCKET_URI=gs://$PROJECT_ID-batch-processing-input
gsutil mb $INPUT_BUCKET_URI

You also need a bucket to save the batch request results:

PROJECT_ID=your-project-id
OUTPUT_BUCKET_URI=gs://$PROJECT_ID-batch-processing-output
gsutil mb $OUTPUT_BUCKET_URI

Prepare batch generation input files

Next, you need to prepare the batch prediction inputs as JSON Lines (.jsonl) files, with one request per line.

For example, take a look at batch_request_text_input.jsonl with text prompts to generate recipes for different cakes:

{"request": {"contents": [{"parts": {"text": "Give me a recipe for banana bread."}, "role": "user"}]}}
{"request": {"contents": [{"parts": {"text": "Give me a recipe for chocolate cake."}, "role": "user"}]}}
...
{"request": {"contents": [{"parts": {"text": "Give me a recipe for pound cake."}, "role": "user"}]}}

You can also use multimodal prompts with text, images, and videos, as shown in batch_request_multimodal_input.jsonl:

{"request": {"contents": [{"role": "user", "parts": [{"text": "List objects in this image."}, {"file_data": {"file_uri": "gs://cloud-samples-data/generative-ai/image/office-desk.jpeg", "mime_type": "image/jpeg"}}]}]}}
{"request": {"contents": [{"role": "user", "parts": [{"text": "List objects in this image."}, {"file_data": {"file_uri": "gs://cloud-samples-data/generative-ai/image/gardening-tools.jpeg", "mime_type": "image/jpeg"}}]}]}}
{"request": {"contents": [{"role": "user", "parts": [{"text": "What is the relation between the following video and image samples?"}, {"file_data": {"file_uri": "gs://cloud-samples-data/generative-ai/video/animals.mp4", "mime_type": "video/mp4"}}, {"file_data": {"file_uri": "gs://cloud-samples-data/generative-ai/image/cricket.jpeg", "mime_type": "image/jpeg"}}]}]}}

Upload both files to the input bucket:

gsutil cp batch_request_text_input.jsonl $INPUT_BUCKET_URI
gsutil cp batch_request_multimodal_input.jsonl $INPUT_BUCKET_URI
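To double-check that both files made it into the bucket:

gsutil ls $INPUT_BUCKET_URI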

Run batch generation

To run a batch generation job, you need to submit a BatchPredictionJob with an input file and an output bucket:

vertexai.init(project=args.project_id, location="us-central1")

# Submit a batch prediction job with the Gemini model
batch_prediction_job = BatchPredictionJob.submit(
    source_model="gemini-1.5-flash-002",
    input_dataset=args.input_dataset_uri,
    output_uri_prefix=args.output_bucket_uri,
)

Then, you need to wait until the batch generation is done:

while not batch_prediction_job.has_ended:
    print(f"Job state: {batch_prediction_job.state.name}")
    time.sleep(10)
    batch_prediction_job.refresh()
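Putting the snippets together, a minimal main.py could look roughly like the sketch below. It’s a sketch rather than the exact script: the argparse flags simply mirror the command-line arguments used next, and the final-state check at the end is an optional extra.

import argparse
import time

import vertexai
from vertexai.batch_prediction import BatchPredictionJob


def main(args):
    vertexai.init(project=args.project_id, location="us-central1")

    # Submit a batch prediction job with the Gemini model
    batch_prediction_job = BatchPredictionJob.submit(
        source_model="gemini-1.5-flash-002",
        input_dataset=args.input_dataset_uri,
        output_uri_prefix=args.output_bucket_uri,
    )

    # Poll until the job reaches a terminal state
    while not batch_prediction_job.has_ended:
        print(f"Job state: {batch_prediction_job.state.name}")
        time.sleep(10)
        batch_prediction_job.refresh()

    # Report the outcome and where the results were written
    if batch_prediction_job.has_succeeded:
        print(f"Job succeeded, results in: {batch_prediction_job.output_location}")
    else:
        print(f"Job failed: {batch_prediction_job.error}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--project_id", required=True)
    parser.add_argument("--input_dataset_uri", required=True)
    parser.add_argument("--output_bucket_uri", required=True)
    main(parser.parse_args())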

Run batch generation for text prompts:

python main.py --project_id $PROJECT_ID \
  --input_dataset_uri $INPUT_BUCKET_URI/batch_request_text_input.jsonl \
  --output_bucket_uri $OUTPUT_BUCKET_URI

As it’s running, you can see its status in the Cloud console:

(Screenshot: batch generation job in progress)
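If you prefer to check from code instead of the console, the underlying google-cloud-aiplatform SDK can also list batch prediction jobs and their states; a quick sketch (project and region are placeholders):

from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# Print the display name and state of recent batch prediction jobs
for job in aiplatform.BatchPredictionJob.list():
    print(job.display_name, job.state)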

Run batch generation for multimodal prompts:

python main.py --project_id $PROJECT_ID \
  --input_dataset_uri $INPUT_BUCKET_URI/batch_request_multimodal_input.jsonl \
  --output_bucket_uri $OUTPUT_BUCKET_URI

In the end, you’ll see both batch jobs are done:

(Screenshot: both batch generation jobs completed)

You’ll find the output files, containing both the prompts and the LLM responses, in the output bucket.
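To peek at the results from the command line, you can list the output bucket and print the generated .jsonl files. The job writes into a folder it names itself, so adjust the path to whatever gsutil ls shows:

# List everything the jobs wrote to the output bucket
gsutil ls -r $OUTPUT_BUCKET_URI

# Print the first few result lines (replace <generated-job-folder> with a folder listed above)
gsutil cat $OUTPUT_BUCKET_URI/<generated-job-folder>/*.jsonl | head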

Nice!

Conclusion

Batch generation is a robust way of applying generative AI to large datasets, and it saves both time and money. See the resources below for further reading.


See also