📗 -> ECS189G-L8
sec_4_stochastic_optimization.pdf
sec_5_deep_learning_basic
Filling in prior to midterm, did not attend lecture
Summary of Section 4:
Deep Learning model optimization
- Data perspective
- Input: decide whether to use the full batch, single instances, or mini-batches
- Output (real value, probability, etc.): decide on the loss function
- Design your model
- Initialize your variables to be learned
- Decide your optimizer
- SGD vs Momentum vs Adagrad vs Adam vs …
- Specify optimizer parameters
- Learning rate
- Other parameters
- Use the error back-propagation algorithm (with your GD-based optimizer) to learn the model variables until convergence (a rough sketch of this whole workflow follows this list)
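A minimal sketch of the workflow above, assuming a PyTorch-style setup; the toy data, the tiny two-layer classifier, the cross-entropy loss, Adam, and every hyperparameter here are illustrative placeholders, not from the slides:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Data perspective: mini-batches of (input, label) pairs (toy data, 3 classes)
X = torch.randn(256, 10)
y = torch.randint(0, 3, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # mini-batch input

# Design the model; its variables are initialized when the layers are constructed
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 3))

# Output is a class score/probability -> cross-entropy loss; optimizer is Adam with a chosen learning rate
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Error back-propagation with the GD-based optimizer (fixed number of epochs instead of a convergence check)
for epoch in range(10):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()    # back-propagate gradients of the loss
        optimizer.step()   # gradient-descent-style parameter update
```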
✒️ -> Scratch Notes
Gradient Descent Optimizers
Momentum, Adagrad, Adam
Pure GD:
θ_t = θ_{t−1} − η · ∇L(θ_{t−1})
- θ: current params / location
- η: learning rate
- ∇L(θ): current gradient ("acceleration", not velocity? In the usual physics analogy the gradient plays the role of acceleration, while the accumulated momentum term below plays the role of velocity)
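A tiny sketch of this update in Python; the names and the default learning rate are illustrative:

```python
def gd_step(theta, grad, eta=0.01):
    """One vanilla gradient-descent update: step against the current gradient."""
    return theta - eta * grad
```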
Momentum
Incorporate past gradients into the next update step (a running "velocity" term)
- Momentum term weight (usually 0.9)
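A tiny sketch of the momentum update, assuming the common formulation that folds the learning rate into the velocity accumulator; the names and the learning-rate default are illustrative (the 0.9 momentum weight is from the notes above):

```python
def momentum_step(theta, grad, velocity, eta=0.01, gamma=0.9):
    """One momentum update: velocity is a decaying accumulation of past gradient steps."""
    velocity = gamma * velocity + eta * grad   # blend the old direction with the new gradient
    theta = theta - velocity                   # step along the accumulated velocity
    return theta, velocity
```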
Adagrad
Learning rate adaptation
- Notice the learning rate is divided by the square root of the accumulated sum of squared gradients, plus a small epsilon for stability
- Different effective learning rate for different model variables
- Variables that have seen small gradients get a larger effective learning rate
- Variables that have seen large gradients get a smaller effective learning rate instead
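A tiny sketch of the Adagrad update, assuming the usual accumulated-squared-gradient form; names and defaults are illustrative:

```python
import numpy as np

def adagrad_step(theta, grad, grad_sq_sum, eta=0.01, eps=1e-8):
    """One Adagrad update: each variable's effective step shrinks as its squared gradients accumulate."""
    grad_sq_sum = grad_sq_sum + grad ** 2                       # per-variable history of squared gradients
    theta = theta - eta / (np.sqrt(grad_sq_sum) + eps) * grad   # small history -> bigger step, large history -> smaller step
    return theta, grad_sq_sum
```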
Adam (Adaptive Moment Estimation)
Incorporates Momentum + Adagrad
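A tiny sketch of the Adam update, assuming the standard bias-corrected formulation (β1 ≈ 0.9, β2 ≈ 0.999); names and defaults are illustrative:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (1-based)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment: decaying average of gradients (momentum part)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment: decaying average of squared gradients (adaptive part)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```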

Prediction Output and Loss Functions
Softmax
Notes:
- Basically, exponentiate each output and scale it by its proportion of the total: softmax(z)_i = exp(z_i) / Σ_j exp(z_j), i.e. a scaled exponential
- This has the property of turning the full vector output into a probability distribution, summing to 1
- Why the exponential? It is not that the particular ratio matters; the exponential has properties that make working with it in derivatives and general math much neater, and that is why it is chosen specifically

- Notice the distribution is scaled to sum to 1
- Notice the exponential scaling
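A tiny sketch of softmax; the max-subtraction trick for numerical stability is an implementation detail I added, not from the slides:

```python
import numpy as np

def softmax(z):
    """Turn a score vector into a probability distribution via scaled exponentials."""
    z = z - np.max(z)              # shift by the max for numerical stability; the result is unchanged
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)   # sums to 1
```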
Classification Loss
Mean Absolute Error (MAE) Loss
One-hot encode the label, then compare the prediction to the one-hot truth vector
Mean Square Error (MSE) Loss
Cross Entropy Loss
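Minimal sketches of these losses against a one-hot label, assuming the prediction is already a probability vector (e.g. a softmax output); the names and the epsilon are illustrative:

```python
import numpy as np

def one_hot(label, num_classes):
    """One-hot encode an integer class label."""
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

def mae_loss(y_true, y_pred):
    """Mean absolute error between the one-hot truth and the predicted distribution."""
    return np.mean(np.abs(y_true - y_pred))

def mse_loss(y_true, y_pred):
    """Mean squared error between the one-hot truth and the predicted distribution."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Cross entropy: negative log of the probability assigned to the true class."""
    return -np.sum(y_true * np.log(y_pred + eps))
```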

There are more loss functions we don't cover as well…
Slides 2
Deep Learning Basics
Why do we need Deep Learning? (DL)
- Great for dealing with complex unstructured data
- He gives the example of discriminating dogs from muffins: they blur the lines between simple features (number of eyes, nose color, fur color, etc.), so more complex analysis is needed
What is it?
- Broad family of ML algos based on Artificial Neural Networks (ANN)
History

- Brief history of artificial neural networks
Why is it working all of a sudden?
- GPUs, HPCs, Cloud Compute
- Big data
- New deep model architecture

On top of that, models are increasing exponentially in their number of parameters
🧪 -> Refresh the Info
Did you generally find the overall content understandable, compelling, or relevant (or not), and why? Which aspects of the reading were most novel or challenging for you, and which were most familiar or straightforward?
Did a specific aspect of the reading raise questions for you or relate to other ideas and findings you've encountered? Are there other related issues you wish had been covered?
🔗 -> Links
Resources
- Put useful links here
Connections
- Link all related words