05/19/25: ECS189G-L21
Embeddings Slides
Transformer Slides
Vocab
- Unit and Larger Context
Continuing embeddings from Bi-directional RNN
Embeddings:
Outline of Embeddings
• Natural Language Processing
• Word2vec and Word Prediction
• Word2vec: CBOW and Skip-gram
• Text Generation with RNN
• RNN Variants: Bi-directional RNN
• RNN Variants: Hierarchical RNN
Summary of Embeddings
Natural language processing history
Word embedding with word2vec
- CBOW: architecture and model training
- Skip-gram: architecture and difference from CBOW
Text generation with Recurrent Neural Network
- RNN architecture and training
- Different architecture choices
- Bi-directional RNN
- Hierarchical RNN
• What is Attention? To be introduced in the next class
Attention and Transformers
- What is Attention?
- Do we really need sequence models?
- Transformer with Attention
- Self Attention
- Multi-Head Self-Attention based Encoder
- A Deep Transformer
- Transfer Learning and BERT with Pre-Training
Scratch Notes
Embeddings
Bidirectional RNN
word2vec and RNN:
- Word2vec captures context patterns
- RNN only captures pattern from forward context
Bidirectional RNN (BRNN) - Similar to an RNN, but there is a backward context as well as a forward context:
- Forward hidden states h_f(t) as well as backward hidden states h_b(t), where h_b(t) is computed in the reverse direction (starting at the end of the sequence)
- Both the forward and backward hidden states are combined (e.g. concatenated) to create the output y_t
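To make the forward/backward combination concrete, here is a minimal NumPy sketch (my own illustration, not from the slides; the weight names W, U, b, V and all sizes are assumptions):

```python
# Minimal bidirectional RNN sketch (illustrative; weight names/shapes are assumptions).
import numpy as np

def rnn_pass(xs, W, U, b):
    """Run a simple tanh RNN over a list of input vectors, returning all hidden states."""
    h = np.zeros(W.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W @ h + U @ x + b)
        states.append(h)
    return states

def birnn(xs, params_f, params_b, V):
    # Forward pass reads the sequence left to right,
    # backward pass reads it right to left (starts at the end).
    h_f = rnn_pass(xs, *params_f)
    h_b = rnn_pass(xs[::-1], *params_b)[::-1]   # re-align backward states with positions
    # Combine both directions (here: concatenation) to produce each output y_t.
    return [V @ np.concatenate([hf, hb]) for hf, hb in zip(h_f, h_b)]

# Toy usage: sequence of five 8-dim embeddings, hidden size 4, output size 3.
rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 8, 4, 3, 5
xs = [rng.normal(size=d_in) for _ in range(T)]
params_f = (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), np.zeros(d_h))
params_b = (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in)), np.zeros(d_h))
V = rng.normal(size=(d_out, 2 * d_h))
ys = birnn(xs, params_f, params_b, V)
print(len(ys), ys[0].shape)   # 5 outputs, each 3-dimensional
```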
Hierarchical RNN
Document hierarchical structure
- Document contains multiple paragraphs
- Paragraph contains multiple sentences
- Sentence contains multiple words


- Learned importance factor: "word attention". Could be:
- A plain learned NN parameter
- Attention (covered in the next slides)
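A rough sketch of what a learned word-importance factor could look like as an attention-style weighted sum, in the spirit of hierarchical attention networks (the names W, b, u_w are hypothetical, not from the lecture):

```python
# Sketch of a learned word-importance ("word attention") layer.
# Parameter names (W, b, u_w) and sizes are assumptions for illustration.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def word_attention(H, W, b, u_w):
    """H: (T, d) word hidden states (e.g. from a BRNN) for one sentence.
    Returns a sentence vector as an importance-weighted sum of the word states."""
    U = np.tanh(H @ W + b)          # (T, d_a) projected word representations
    scores = U @ u_w                # (T,) similarity to a learned "context" vector
    alpha = softmax(scores)         # (T,) learned importance factor per word
    return alpha @ H, alpha         # (d,) sentence vector, plus the weights

# Toy usage: 6 words with 4-dim hidden states, attention dim 3.
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))
W, b, u_w = rng.normal(size=(4, 3)), np.zeros(3), rng.normal(size=3)
sent_vec, alpha = word_attention(H, W, b, u_w)
print(sent_vec.shape, alpha.round(2))
```

The resulting sentence vectors would then feed a sentence-level RNN, matching the document → paragraph → sentence → word hierarchy above.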
Transformer and Language Models
What is Attention?
People selectively attend to things:
- Images: Faces of people, and animals
- Text: Names, years, numbers
It is not fixed: different people attend to different things.
What is Attention in DL?
One component of a network's architecture in charge of managing and quantifying the interdependence between:
- Input and output elements (general attention)
- Input elements (self-attention)
- Different input sources (cross-attention)
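A hedged sketch of how these flavors differ only in where the queries, keys, and values come from (function and variable names are my own; the learned projections W_Q, W_K, W_V are omitted here for brevity):

```python
# Attention flavors differ in where queries/keys/values come from.
# Names and sizes are my own; projections are omitted to keep the sketch short.
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention: weight 'values' by query-key similarity."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # one input sequence (e.g. encoder states)
Y = rng.normal(size=(3, 8))   # another source (e.g. decoder states)

self_attn = attention(X, X, X)    # self-attention: Q, K, V all from the same input
cross_attn = attention(Y, X, X)   # cross-attention: queries from Y, keys/values from X
print(self_attn.shape, cross_attn.shape)   # (5, 8) and (3, 8)
```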
Transformer
Transformer architecture

- A number of stacked encoders and decoders
- The final encoder's output is fed to the decoders; the final decoder projects to the output
- Decoders feed into each other, along with the encoded input
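A small PyTorch sketch of this wiring, assuming the standard built-in layers (hyperparameters are arbitrary): every decoder layer attends to the same final-encoder output (the "memory"), while the decoder layers also feed into each other.

```python
# Stacked encoder/decoder wiring sketch using PyTorch's built-in Transformer layers.
# Hyperparameters (d_model, nhead, num_layers) are arbitrary, for illustration only.
import torch
import torch.nn as nn

d_model, nhead, num_layers = 64, 4, 3

enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers)   # stack of encoders
decoder = nn.TransformerDecoder(dec_layer, num_layers)   # stack of decoders

src = torch.randn(2, 10, d_model)   # batch of 2 source sequences, length 10
tgt = torch.randn(2, 7, d_model)    # batch of 2 target sequences, length 7

memory = encoder(src)               # output of the final encoder layer
out = decoder(tgt, memory)          # every decoder layer attends to that same memory
print(out.shape)                    # torch.Size([2, 7, 64])
```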
Encoder / Decoders


Self Attention:
x - embeddings
q - queries
k - keys
v - values
z - output (weighted sum of the value vectors)
How q, k, v are computed (the Q, K, V boxes in the top left of the slide): each embedding x_i is multiplied by learned weight matrices W_Q, W_K, W_V to get q_i, k_i, v_i.
- z1 is the output for position 1: a weighted sum of the values, where the weights capture how important x1 and x2 are to that position
- Going forward we will often see this written with an attention matrix A, which holds those normalized query-key scores
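A minimal NumPy walk-through of these steps for two tokens x1 and x2 (the matrices W_Q, W_K, W_V and all dimensions are made up for illustration):

```python
# Self-attention for two tokens x1, x2 (weights and sizes are made up for illustration).
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3

X = rng.normal(size=(2, d_model))          # rows: embeddings x1, x2
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # queries, keys, values (one row per token)

scores = Q @ K.T / np.sqrt(d_k)            # how strongly each token attends to each other token
A = np.exp(scores)
A = A / A.sum(axis=1, keepdims=True)       # the attention matrix A (softmax over each row)

Z = A @ V                                  # z1 = A[0,0]*v1 + A[0,1]*v2, etc.
print(A.round(2))                          # row i: importance of x1, x2 for position i
print(Z.shape)                             # (2, d_k): one output vector per token
```

Row i of A holds the normalized importance of x1 and x2 for position i, so z1 is exactly the A-weighted sum of the value vectors.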
REVIEW
Refresh the Info
Did you generally find the overall content understandable, compelling, or relevant (or not), and why? Which aspects of the reading were most novel or challenging for you, and which were most familiar or straightforward?
Did a specific aspect of the reading raise questions for you, or relate to other ideas and findings you've encountered? Are there other related issues you wish had been covered?
Links
Resources
- Put useful links here
Connections
- Link all related words