πŸ“— -> 05/21/25: ECS189G-L22


Transformer Slides

🎀 Vocab

❗ Unit and Larger Context

Small summary

βœ’οΈ -> Scratch Notes

Transformer walkthrough

Attention:

If we had the same words placed at two different positions (say inputs x_i and x_j swapped), and we run both through a self-attention block, their outputs will be identical. The position does not matter to self-attention itself, which is why Transformers add positional encodings.
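A minimal NumPy sketch of this point (the 4Γ—8 shapes and random weights are assumed, not from the slides): swapping two input words just swaps the corresponding output rows, so each word's self-attention output does not depend on where it sits.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Standard scaled dot-product self-attention over rows of X
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # 4 words, 8-dim embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

Z = self_attention(X, Wq, Wk, Wv)
Z_swapped = self_attention(X[[1, 0, 2, 3]], Wq, Wk, Wv)   # swap words 0 and 1

print(np.allclose(Z[[1, 0, 2, 3]], Z_swapped))    # True: outputs move with the words, unchanged
```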

Multi-head attention (similar to the multi-channel idea used in CNNs)

Learn separate projection matrices for each head (sketch below):

  • W_0^Q, W_0^K, W_0^V in Attention Head Number 0
  • W_1^Q, W_1^K, W_1^V in Attention Head Number 1
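A tiny sketch of what β€œseparate matrices per head” looks like as parameters (d_model = 64, d_head = 8, and 8 heads are assumed values):

```python
import numpy as np

d_model, d_head, num_heads = 64, 8, 8
rng = np.random.default_rng(0)

# One independent set of Q/K/V projection matrices per head,
# analogous to separate filters/channels in a CNN.
heads = [
    {"Wq": rng.normal(size=(d_model, d_head)),
     "Wk": rng.normal(size=(d_model, d_head)),
     "Wv": rng.normal(size=(d_model, d_head))}
    for _ in range(num_heads)
]
# heads[0]["Wq"] plays the role of W_0^Q, heads[1]["Wq"] of W_1^Q, and so on.
```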

Multi head Embedding Fusion:

  1. Concatenate all the attention heads into a vector
  2. Multiply with a weight matrix that was trained jointly with the model
  3. The result would be the matrix that captures information from all the attention heads. We can send this forward to the FFNN
  • Example sizes: 32 heads and 256-dimensional vectors

Z = concatenate(z_0, z_1, ..., z_7) * W^O

  • concatenate: the per-head outputs z_0 ... z_7 stacked side by side along the feature dimension
  • W^O: the output weight matrix trained jointly with the model
  • Z: the fused matrix that captures information from all heads and is sent to the FFNN
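A small sketch of the fusion formula above (the sequence length and dimensions are assumed, not from the slides):

```python
import numpy as np

seq_len, d_head, num_heads, d_model = 5, 8, 8, 64
rng = np.random.default_rng(0)

z_heads = [rng.normal(size=(seq_len, d_head)) for _ in range(num_heads)]  # z_0 .. z_7
W_O = rng.normal(size=(num_heads * d_head, d_model))                      # trained jointly with the model

Z = np.concatenate(z_heads, axis=1) @ W_O   # (seq_len, d_model): fused output for the FFNN
print(Z.shape)                              # (5, 64)
```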

Self Attention Based Encoder

  1. Input Sentence
  2. Embed each word
  3. Split into 8 heads; multiply the embedding X (or R) with the per-head weight matrices
    • In all encoders other than #0, we don’t need the embedding step. We start directly with the output R of the encoder right below this one
  4. Calculate Attention using Q/K/V matrices
  5. Concatenate resulting Z matrices, then multiply with weight matrix WO to produce the output of the layer
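One way to sketch steps 1-5 with PyTorch built-ins; nn.MultiheadAttention handles the head split, Q/K/V attention, concatenation, and W^O internally (the vocab size and dimensions are assumed, not the lecture's):

```python
import torch
import torch.nn as nn

vocab_size, d_model, num_heads = 1000, 64, 8
embed = nn.Embedding(vocab_size, d_model)                           # step 2: embed each word
attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)  # steps 3-5 in one module

tokens = torch.tensor([[1, 7, 42, 3]])   # step 1: one input sentence as token ids
X = embed(tokens)                        # (1, 4, 64); deeper encoders would instead take
                                         # the output of the encoder right below
Z, _ = attn(X, X, X)                     # self-attention: query = key = value = X
print(Z.shape)                           # torch.Size([1, 4, 64]), ready for the FFNN
```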

Residual connections for learning β€œdeep” model architectures

Add input x to output z, then normalize

  • Normalization: see the Normalization notes below

Normalization

Given a vector x whose values range from small (0.1) to large (100):

  • If there is a large value, we prefer to normalize the vector
  • Different norm methods:
    • Min-max normalization:
      1. Given a vector, find the minimum and maximum values in the vector. Two scalars.
        • Here min = 0.1, max = 100
        • Normalize x: x' = (x βˆ’ min) / (max βˆ’ min), so every value lands in [0, 1]
    • Mean/STD normalization
      1. Find the mean and STD of the vector. They are both scalars
      2. Normalize x: x' = (x βˆ’ mean) / STD
    • Layer norm does mean/STD norm, finding the mean and STD across the features of each instance (each row)
      • Layer norm normalizes instances
    • Batch norm does mean/STD norm, but across the batch for each feature (each column)
      • Batch norm normalizes features
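A quick NumPy sketch contrasting the normalizations above, plus the residual add from the previous section (toy numbers; shapes assumed):

```python
import numpy as np

x = np.array([0.1, 5.0, 100.0])
minmax = (x - x.min()) / (x.max() - x.min())    # min-max: values land in [0, 1]
standard = (x - x.mean()) / x.std()             # mean/STD normalization

# Residual + layer norm vs batch norm on a (instances, features) matrix:
X = np.random.default_rng(0).normal(size=(4, 3))   # 4 instances, 3 features
Z = np.random.default_rng(1).normal(size=(4, 3))   # block output
H = X + Z                                          # residual: add input x to output z

layer_norm = (H - H.mean(axis=1, keepdims=True)) / H.std(axis=1, keepdims=True)  # per instance (row)
batch_norm = (H - H.mean(axis=0, keepdims=True)) / H.std(axis=0, keepdims=True)  # per feature (column)
```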

A Deep Transformer with Encoder and Decoder

Pulling all of the above together

  • The encoder and decoder are actually very similar
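For a rough sense of the full stack, PyTorch's built-in nn.Transformer wires up stacked encoder and decoder layers, each with multi-head attention, residual + norm, and an FFNN (the layer counts and sizes below are assumed, not the lecture's):

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(1, 10, 64)   # encoder input: 10 source "word" embeddings
tgt = torch.randn(1, 7, 64)    # decoder input: 7 target embeddings so far
out = model(src, tgt)          # decoder attends to itself and to the encoder output
print(out.shape)               # torch.Size([1, 7, 64])
```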

πŸ§ͺ -> Refresh the Info

Did you generally find the overall content understandable, compelling, or relevant (or not), and why? Which aspects of the reading were most novel or challenging for you, and which were most familiar or straightforward?

Did a specific aspect of the reading raise questions for you or relate to other ideas and findings you’ve encountered? Are there other related issues you wish had been covered?

Resources

  • Put useful links here

Connections

  • Link all related words