05/21/25: ECS189G-L22
Vocab
Unit and Larger Context
Small summary
Scratch Notes
Transformer walkthrough
Attention:
If we had
If we run
Multi-head (similar to the multi-channel idea used in CNNs)
Learn separate Q/K/V weight matrices in Attention Head #0, Attention Head #1, and so on; each head projects the input with its own learned weights (see the sketch below)
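A minimal NumPy sketch of one attention head with its own Q/K/V weight matrices; the shapes and names (d_model, d_k) are my own assumptions for illustration, not values from the slides:

```python
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """One head of scaled dot-product self-attention.
    X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k), learned per head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project input into Q/K/V
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # Z for this head: (seq_len, d_k)
```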
Multi-head embedding fusion:
- Concatenate all the attention heads into a vector
- Multiply by a weight matrix WO that was trained jointly with the model
- The result is the Z matrix that captures information from all the attention heads; we can send this forward to the FFNN (see the sketch after the formula below)
- For 32 heads, and 256
Z=concatenate(z0, z1, ..., z7) * WO
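A sketch of this fusion step, assuming attention_head() from the sketch above is defined; the head count, dimensions, and random weights are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, d_model = 8, 512
d_k = d_model // num_heads
X = rng.standard_normal((10, d_model))               # 10 token embeddings

# Run every head with its own Q/K/V weights, then concatenate and apply WO.
Z_heads = []
for _ in range(num_heads):
    Wq, Wk, Wv = (0.1 * rng.standard_normal((d_model, d_k)) for _ in range(3))
    Z_heads.append(attention_head(X, Wq, Wk, Wv))    # each Z_h: (10, d_k)

WO = 0.1 * rng.standard_normal((num_heads * d_k, d_model))
Z = np.concatenate(Z_heads, axis=-1) @ WO            # Z = concat(Z0..Z7) @ WO -> (10, d_model)
```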
Self-Attention-Based Encoder
- Input Sentence
- Embed each word
- Split into 8 heads; multiply the embedding X (or R) with each head's weight matrices
- In all encoders other than #0, we don't need the embedding step; we start directly with the output of the encoder right below this one
- Calculate attention using the Q/K/V matrices
- Concatenate the resulting Z matrices, then multiply by the weight matrix WO to produce the output of the layer
Residual connections for learning "deep" model architectures
Add the input x to the output z, then normalize: LayerNorm(x + z) (see the sketch below)
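A sketch of one encoder layer showing where the residual "add & norm" steps go; attn_fn stands in for the multi-head self-attention above, the FFN weights are hypothetical parameters, and the layer norm is the mean/STD normalization discussed below:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Mean/STD normalization over the features of each instance (each row)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_layer(X, attn_fn, W1, b1, W2, b2):
    Z = attn_fn(X)                                   # multi-head self-attention output
    X = layer_norm(X + Z)                            # residual: add input x to output z, then normalize
    F = np.maximum(0.0, X @ W1 + b1) @ W2 + b2       # position-wise FFNN (ReLU)
    return layer_norm(X + F)                         # second residual + norm
```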

Normalization
- Given a vector, if it contains a large value, we prefer to normalize the vector
- Different normalization methods (see the sketch after this list):
- Min-max normalization:
  - Given a vector, find the minimum and maximum values in the vector (two scalars)
  - Here min = 0.1, max = 100
  - Normalize x: x' = (x - min) / (max - min)
- Mean/STD normalization:
  - Find the mean and STD; they are both scalars
  - Normalize x: x' = (x - mean) / STD
- Layer norm does mean/STD normalization across the features of each instance (across a layer's outputs for one input)
  - Layer norm normalizes instances
- Batch norm does mean/STD normalization across columns (each feature, across the batch)
  - Batch norm normalizes features
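A quick NumPy sketch of the methods above; the example values and matrix shape are placeholders:

```python
import numpy as np

x = np.array([0.1, 2.0, 50.0, 100.0])                # example vector with min = 0.1, max = 100

minmax = (x - x.min()) / (x.max() - x.min())          # min-max: rescales to [0, 1]
zscore = (x - x.mean()) / x.std()                     # mean/STD normalization

A = np.random.default_rng(0).standard_normal((32, 256))   # (instances, features)
layer_normed = (A - A.mean(axis=1, keepdims=True)) / A.std(axis=1, keepdims=True)  # per instance (row)
batch_normed = (A - A.mean(axis=0, keepdims=True)) / A.std(axis=0, keepdims=True)  # per feature (column)
```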
A Deep Transformer with Encoder and Decoder
Pulling all of the above together

- The encoder and decoder are actually very similar; the decoder adds masked self-attention and an encoder-decoder (cross) attention sub-layer over the encoder outputs (minimal sketch below)
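A minimal PyTorch sketch of the full encoder-decoder stack using the built-in nn.Transformer; all sizes here are illustrative, not the lecture's:

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)
src = torch.rand(10, 2, 512)    # (source_len, batch, d_model)
tgt = torch.rand(7, 2, 512)     # (target_len, batch, d_model)
out = model(src, tgt)           # (7, 2, 512): decoder output per target position
```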
Refresh the Info
Did you generally find the overall content understandable, compelling, or relevant (or not), and why? Which aspects of the reading were most novel or challenging for you, and which were most familiar or straightforward?
Did a specific aspect of the reading raise questions for you, or relate to other ideas and findings you've encountered? Are there other related issues you wish had been covered?
Links
Resources
- Put useful links here
Connections
- Link all related words