Generative AI Learning Path Notes – Part 2


If you’re looking to upskill in Generative AI, there’s a Generative AI Learning Path in Google Cloud Skills Boost. It currently consists of 10 courses and provides a good foundation on the theory behind Generative AI.

As I went through these courses myself, I took notes, as I learn best when I write things down. In part 1 of the blog series, I shared my notes for courses 1 to 6. In this part 2 of the blog series, I continue sharing my notes for courses 7 to 10.

GenAI Learning Path

Let’s continue with course 7.

7. Attention Mechanism

In this course, you learn about the attention mechanism behind transformer models, which is at the core of all LLMs.

Let’s say you want to translate text. You can use an encoder-decoder model, but sometimes the words in the source language do not align with the words in the target language. To improve translation, you can attach an attention mechanism to the encoder-decoder.

The attention mechanism is a technique that allows a neural network to focus on specific parts of an input sequence. It does this by assigning weights to different parts of the input sequence, with the most important parts receiving the highest weights.

In a traditional RNN-based encoder-decoder, the model takes one word at a time as input, updates the hidden state, and passes it on to the next time step. In the end, only the final hidden state is passed to the decoder. The decoder works with that final hidden state and translates it to the target language.

An attention model differs from a traditional sequence-to-sequence model in 2 ways:

  1. It passes more hidden states from the encoder to the decoder.
  2. It adds an extra attention step to the decoder before producing its output.

To focus only on the most relevant parts of the input, the decoder:

  1. Looks at the set of encoder hidden states (one for each word) that it received.
  2. Gives each hidden state a score.
  3. Multiplies each hidden state by its softmaxed score, thus amplifying hidden states with the highest scores (see the sketch below).
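
To make those steps concrete, here’s a minimal NumPy sketch of the attention step, assuming simple dot-product scoring (the course doesn’t prescribe a particular scoring function):

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    # 1. Score each encoder hidden state against the current decoder state.
    scores = encoder_states @ decoder_state           # shape: (seq_len,)
    # 2. Softmax the scores so they sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # 3. Weighted sum amplifies the hidden states with the highest scores.
    return weights @ encoder_states                   # shape: (hidden_dim,)

encoder_states = np.random.randn(5, 8)   # one hidden state per input word
decoder_state = np.random.randn(8)
context = attention_context(decoder_state, encoder_states)
```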

Note to the reader: The instructor goes through an example of how translation works with the attention layer, but I didn’t find it useful enough to include in the notes.

Summary: The attention mechanism is used to improve the performance of the encoder-decoder architecture.

8. Transformer Models and BERT Model

Note to the reader: This is probably the most technical course of the series.

This course is about transformer models and the BERT model.

Language modeling has evolved over the years. Recent breakthroughs include new ways of representing text, such as Word2vec and N-grams, in 2013. In 2014, sequence-to-sequence models built on RNNs and LSTMs helped improve ML models on tasks such as translation and text classification. In 2015, attention mechanisms came along, followed by the models built on them, such as Transformers and the BERT model.

Based on a 2017 paper (“Attention Is All You Need”), transformers capture the context and usage of words (e.g. a transformer can distinguish between “bank robber” and “river bank”).

A transformer is an encoder-decoder model that uses the attention mechanism.

The encoder encodes the input sequence and passes it to the decoder, and the decoder decodes the representation for the relevant task.

The encoding component is a stack of encoders. The encoders are all identical in structure but have different weights. Each encoder has 2 sublayers: self-attention and feedforward. The input first flows through a self-attention layer, which looks at relevant parts of the other words as it encodes a given word in the input. The output of self-attention is fed to a feedforward neural network; the exact same feedforward network is applied independently to each position.

The decoding component is also a stack of decoders. Each decoder has both the self-attention and feedforward layers, with an encoder-decoder attention layer in between that helps the decoder focus on relevant parts of the input.

In the self-attention layer, the input embedding is broken into query, key, and value vectors, which are computed using weights the transformer learns during training. The next step is to multiply each value vector by the softmaxed score in preparation for summing them up. The intention is to keep intact the values of the words you want to focus on and drown out irrelevant words by multiplying them with tiny numbers like 0.001. Next, you sum up the weighted value vectors, which produces the output of the self-attention layer at this position. You can then send the resulting vector along to the feedforward neural network.
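
Here’s a minimal NumPy sketch of scaled dot-product self-attention as described above; the weight matrices are random stand-ins for the weights a real transformer would learn during training:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # query, key, value vectors
    scores = q @ k.T / np.sqrt(k.shape[-1])      # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v                           # weighted sum of value vectors

seq_len, d_model = 4, 16
x = np.random.randn(seq_len, d_model)            # one embedding per token
w_q, w_k, w_v = (np.random.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)           # shape: (seq_len, d_model)
```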

There are multiple variations of transformers out there: encoder-decoder (e.g. BART), decoder-only (e.g. GPT-2, GPT-3), and encoder-only (e.g. BERT).

Bidirectional Encoder Representations from Transformers (BERT) is one of the trained transformer models:

  • Developed by Google in 2018 and powers Google Search.
  • Trained for 1 million steps.
  • Trained on different tasks, which means it has a multi-task objective.
  • It can handle long input contexts.
  • It was trained on the entire Wikipedia corpus and BooksCorpus.

BERT was trained in 2 variations:

  1. BERT base with 12 layers of transformers and ~110 million parameters.
  2. BERT large with 24 layers of transformers and ~340 million parameters.

BERT is trained on 2 different tasks:

  1. Masked language modeling: A portion of the input tokens is masked (typically 15%) and the model is trained to predict the masked words (illustrated below).
  2. Next sentence prediction: Given a pair of sentences, the model learns the relationship between them and predicts whether the second sentence follows the first (a binary classification task).
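
To make masked language modeling concrete, here’s a quick illustration using Hugging Face’s transformers library (my choice of tooling; the course doesn’t use it):

```python
from transformers import pipeline

# BERT fills in the [MASK] token with its most likely candidates.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The man went to the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```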

To train BERT, you need to feed it 3 kinds of embeddings for the input sequence: token, segment, and position embeddings (see the sketch after this list).

  • Token embedding is a representation of each token as an embedding in the input sentence (i.e. words are transformed into vector representations).
  • Segment embedding marks which of the two sentences in a pair each token belongs to; a special separator token ([SEP]) also sits between the two. This helps BERT distinguish the inputs in a given pair.
  • Position embedding allows BERT to learn a vector representation for each position of the word in the sentence.
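
Here’s a minimal sketch of the inputs BERT receives for a sentence pair, again using Hugging Face’s transformers library as an assumption (the course doesn’t prescribe one). Token ids drive the token embeddings, token type ids drive the segment embeddings, and position embeddings are added internally by the model:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer("The man went to the store.", "He bought milk.")

print(encoding["input_ids"])       # ids that index into the token embeddings
print(encoding["token_type_ids"])  # segment ids: 0 for sentence A, 1 for sentence B
# Position embeddings are looked up internally by the model, one per position.
```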

BERT can solve text classification problems as well (e.g. are these two sentences similar?).

Note to the reader: There’s also a hands-on Transformer Models and BERT Model: Lab Walkthrough that accompanies the course.

9. Create Image Captioning Models

Note to the reader: This was my least favorite course, with details that I didn’t absorb that well.

In this course, you learn how to create an image captioning model that can generate text captions from images using an encoder-decoder, the attention mechanism, and a bit of transformer.

We’ll use a dataset of images and text data. Our goal is to build and train a model that can generate text captions for images.

We pass images to the encoder to extract information from the images and create feature vectors and then the vectors are passed to the decoder to build captions.

For the encoder, we use the classic InceptionResNetV2 model from Keras.
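
In Keras, loading that encoder backbone looks roughly like this (a sketch assuming TensorFlow/Keras; the lab’s exact preprocessing may differ):

```python
import tensorflow as tf

# Load InceptionResNetV2 without its classification head so it outputs
# a feature map instead of class probabilities.
encoder_backbone = tf.keras.applications.InceptionResNetV2(
    include_top=False,
    weights="imagenet",
)

image = tf.random.uniform((1, 299, 299, 3))  # a dummy 299x299 RGB image
features = encoder_backbone(image)           # feature map of shape (1, 8, 8, 1536)
```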

The decoder receives words one by one, mixes the information from the words with the image features from the encoder, and tries to predict the next word.

The decoder has multiple steps, including an attention layer (sketched after the list):

  1. Embedding layer creates word embeddings.
  2. GRU layer (a variation of a recurrent neural network that keeps track of the sequence so far).
  3. Attention layer which mixes the information of text and image.
  4. Add layer + Layer Normalization layer.
  5. Dense layer.
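
A rough sketch of that decoder stack with the Keras functional API (layer sizes are illustrative assumptions, not the course’s exact values):

```python
import tensorflow as tf

vocab_size, embed_dim, units = 5000, 256, 512

word_input = tf.keras.Input(shape=(None,))        # caption tokens generated so far
img_features = tf.keras.Input(shape=(64, units))  # flattened image features

x = tf.keras.layers.Embedding(vocab_size, embed_dim)(word_input)  # 1. word embeddings
x = tf.keras.layers.GRU(units, return_sequences=True)(x)          # 2. GRU over the sequence
attn = tf.keras.layers.Attention()([x, img_features])             # 3. mix text and image info
x = tf.keras.layers.Add()([x, attn])                              # 4. residual add...
x = tf.keras.layers.LayerNormalization()(x)                       #    ...plus layer norm
logits = tf.keras.layers.Dense(vocab_size)(x)                     # 5. scores over the vocabulary

decoder = tf.keras.Model([word_input, img_features], logits)
```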

The inference loop overview is as follows (sketched after the list):

  1. Generate the GRU initial state and create a start token.
  2. Pass an input image to the encoder and extract a feature vector.
  3. Pass the vector to the decoder and generate caption words until it returns the end token or reaches max_caption_length.
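
In code, the loop might look like this sketch; `encoder`, `decoder`, `word_to_index`, and `index_to_word` are assumed to come from the trained model above (with the encoder’s feature map reshaped to whatever the decoder expects):

```python
import tensorflow as tf

def generate_caption(image, encoder, decoder, word_to_index, index_to_word,
                     max_caption_length=30):
    features = encoder(image)                    # step 2: extract image features
    tokens = [word_to_index["<start>"]]          # step 1: start with the start token
    for _ in range(max_caption_length):          # step 3: generate word by word
        logits = decoder([tf.constant([tokens]), features])
        next_id = int(tf.argmax(logits[0, -1]))  # greedy pick of the next word
        if index_to_word[next_id] == "<end>":
            break
        tokens.append(next_id)
    return " ".join(index_to_word[t] for t in tokens[1:])
```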

Note to the reader: There’s also a hands-on Create Image Captioning Models: Lab Walkthrough that accompanies the course.

10. Introduction to Generative AI Studio

In this course, you learn what Generative AI Studio does and its different options.

Generative AI is a type of AI that generates content for you such as text, images, audio, and video. It can also help you with document summarization, information extraction, code generation and more.

It learns from a massive amount of existing content (training) and creates a foundational model. A large language model (LLM) is the kind of foundational model that powers chatbots like Bard. The foundational model can then be used to generate content and solve general problems. It can also be trained further with new datasets, resulting in tailored models that solve specific problems in other fields.

Vertex AI is an ML development platform on Google Cloud that helps you build, deploy and manage ML models.

You can use Generative AI Studio in Vertex AI to quickly prototype and customize GenAI models with no code or low code, using prompt design and tuning.

You can use Model Garden to discover and interact with Google’s foundational and third-party open source models, and to build on them. It also has built-in MLOps tools to automate the ML pipeline.

Generative AI Studio currently supports language, vision, and speech models:

  • Language: Design a prompt to perform tasks and tune language models.
  • Vision: Generate and edit images based on a prompt.
  • Speech: Generate text from speech or speech from text.

In Generative AI Studio: Language, you can:

  1. Design prompts for tasks relevant to your use case.
  2. Create conversations by specifying the context that instructs how the model should respond.
  3. Tune a model for a use case and deploy it to an endpoint.

1. Design prompts

The answers you get (i.e. the response from the model) depend on the questions you ask (the prompt you designed). Prompt design is the process of designing the best input text to get the desired response back from the model.

In the freeform prompt, you simply tell the model what you want (e.g. generate a list of items for a camping trip).

Zero-shot, one-shot, and few-shot prompting: provide no examples, a single example, or a few examples to the LLM when prompting (e.g. a one-shot prompt: “English: cheese → French: fromage. English: bread → French:”).

You can use the structured mode to design few-shot prompts, providing context and examples.

Context instructs how the model should respond: words the model can or cannot use, topics to focus on or avoid, etc.

Best practices for prompt design:

  • Be concise.
  • Be specific and well-defined.
  • Ask one task at a time.
  • Ask to classify instead of generate.
  • Include examples.

Once you design a prompt, you can save it to the prompt gallery, which is a curated collection of prompts that show how GenAI models work for a variety of use cases.

There are a few model parameters you can experiment with (a sampling sketch follows the list):

  • Different models you can choose from.
  • Temperature: Low temperature means the model picks high-probability words, producing predictable output. High temperature lets lower-probability words through, producing more unusual and creative output.
  • Top K: The model picks a random word from the set of the top K most likely words, which gives words other than the single most likely one a chance to be selected.
  • Top P: The model picks a random word from the smallest set of words whose cumulative likelihood does not exceed P.
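
Here’s my own illustration of how these parameters interact during sampling (not Google’s implementation; real decoders differ in details such as exactly where the top-P cutoff lands):

```python
import numpy as np

def sample_next_word(logits, temperature=1.0, top_k=40, top_p=0.95):
    logits = np.asarray(logits, dtype=float) / temperature  # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1][:top_k]                 # keep the top K words
    kept = order[np.cumsum(probs[order]) <= top_p]          # cut at cumulative P
    if len(kept) == 0:
        kept = order[:1]                                    # always keep the best word
    kept_probs = probs[kept] / probs[kept].sum()            # renormalize survivors
    return np.random.choice(kept, p=kept_probs)             # random pick among them

vocab = ["tent", "sleeping bag", "laptop", "flamingo"]
logits = [2.0, 1.5, 0.1, -1.0]
print(vocab[sample_next_word(logits, temperature=0.7, top_k=3, top_p=0.9)])
```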

2. Create conversations

First, you need to specify the conversation context (words to use or not, topics to avoid etc.).

You can tune the same parameters (temperature, etc.) as when you design prompts.

Google provides APIs and SDKs so you can build your own applications: click on View Code to get a snippet and then use that code in your own application.
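
As a hedged sketch of what such a snippet can look like with the Vertex AI SDK for Python (the exact code from View Code may differ, and model names and parameters change over time):

```python
import vertexai
from vertexai.language_models import ChatModel

# Assumes a Google Cloud project with Vertex AI enabled.
vertexai.init(project="your-project-id", location="us-central1")

chat_model = ChatModel.from_pretrained("chat-bison@001")
chat = chat_model.start_chat(
    context="You are a friendly camping expert. Avoid discussing pricing.",
)
response = chat.send_message("What should I pack for a weekend trip?",
                             temperature=0.2)
print(response.text)
```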

3. Tune a model

If the quality of model responses is not great or consistent even after prompt design, you might need to tune the model.

In fine-tuning, you take a model that was pre-trained on a generic dataset, make a copy of it, and re-train it on a new domain-specific dataset, using the learned weights as a starting point.

LLM fine-tuning requires heavy computation, effort, and cost, so it might not be the right approach. Instead, you can use parameter-efficient tuning, which only tunes a subset of the parameters (existing or new).
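
The course doesn’t show code for this, but the underlying idea — freeze most parameters and train only a small subset of new ones — can be sketched in Keras like this (a toy illustration, not how LLM tuning is actually implemented in Vertex AI):

```python
import tensorflow as tf

# A toy base "model": its layers stand in for a pre-trained network.
base = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(128, activation="relu"),
])
base.trainable = False            # freeze the pre-trained weights

# Only the new head's (small) parameter set gets trained.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(4, activation="softmax"),  # new task-specific layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```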

Note to the reader: The instructor then shows how to tune a model from Vertex AI. There’s also a Get Started with Generative AI Studio lab that you need to complete on your own for the course, but it requires 5 credits.


This is the end of my notes for Generative AI Learning Path. If you have questions or feedback or if you come across other good GenAI courses, feel free to let me know on Twitter @meteatamel.

