πŸ“— -> 04/14/25: Deep Learning Optimization


sec_4_stochastic_optimization.pdf - Google Drive

🎀 Vocab

❗ Unit and Larger Context

Small summary

βœ’οΈ -> Scratch Notes

DL Model Optimization Key Factors

  • Data and Task
    • Inputs: Full batch vs instance vs mini-batch
    • Outputs: Loss functions
  • Model architecture, variables to be learned
    • Model variable initialization methods
    • Hyper-parameter searching
  • Optimizer and optimizer’s hyper-parameters
    • SGD vs Momentum vs Adagrad vs Adam vs …
    • Optimizer setting hyper-parameters
  • Specific algorithm for deep model optimization
    • Error backprop algo

Optimizing a Single Variable

Given: a function $f(x)$ of a single variable to minimize

One method: the Lagrange multiplier method

The minimum can be obtained analytically with the Lagrange multiplier method (for an unconstrained single-variable function this reduces to solving $f'(x) = 0$).
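A minimal worked example (my own, not from the slides): minimizing a simple convex quadratic by solving the first-order condition.

$$
f(x) = (x - 2)^2 + 1, \qquad f'(x) = 2(x - 2) = 0 \;\Rightarrow\; x^* = 2, \quad f(x^*) = 1
$$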

Advantages vs Disadvantages

  • Straightforward and can obtain the precise minimum in theory
  • Works for simple functions (convex) with one variable but cannot handle more complex cases (non-convex cases) with multiple variables
    • e.g. loss functions for deep learning models
Another Method: Gradient Descent

Start at a random point $x_0$ and take steps downhill. Repeat until convergence.

  1. Calculate the current gradient: $f'(x_t)$
  2. Update the x value ($\alpha$ = learning rate): $x_{t+1} = x_t - \alpha \, f'(x_t)$
  3. Keep updating until convergence to $x^*$
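
A minimal Python sketch of these three steps (my own illustration; the quadratic $f(x) = (x-2)^2$, the tolerance, and the function/variable names are assumptions, not from the slides):

```python
def gradient_descent_1d(grad, x0, alpha=0.1, tol=1e-6, max_steps=10_000):
    """Minimize a single-variable function, given its derivative grad(x)."""
    x = x0
    for _ in range(max_steps):
        g = grad(x)                 # step 1: current gradient at x_t
        x_new = x - alpha * g       # step 2: x_{t+1} = x_t - alpha * f'(x_t)
        if abs(x_new - x) < tol:    # step 3: stop once the updates become tiny
            return x_new
        x = x_new
    return x

# Example: f(x) = (x - 2)^2 has derivative f'(x) = 2(x - 2); minimum at x = 2.
x_star = gradient_descent_1d(lambda x: 2 * (x - 2), x0=10.0)
print(round(x_star, 4))  # ~2.0
```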

Gradient Descent (GD)

GD also works for more complex cases!

Vector of Model Parameters - $\theta = (\theta_1, \dots, \theta_d)$
Loss Function - $L(\theta)$
Initialization - start from a (random) initial $\theta_0$

  1. Calculate the current gradient: $\nabla_\theta L(\theta_t)$
  2. Update with learning rate $\alpha$: $\theta_{t+1} = \theta_t - \alpha \, \nabla_\theta L(\theta_t)$
  3. Continue until convergence to $\theta^*$
    Return $\theta^*$ as the locally optimal learned model variables
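
The same loop for a vector of parameters, sketched with NumPy (the linear-regression loss and synthetic data are my own assumptions for illustration; in deep learning the gradient would come from backpropagation rather than a closed form):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, alpha=0.01, tol=1e-6, max_steps=10_000):
    """Full-batch gradient descent over a parameter vector theta."""
    theta = theta0
    for _ in range(max_steps):
        g = grad_fn(theta)                    # gradient of the loss at theta_t
        theta_new = theta - alpha * g         # theta_{t+1} = theta_t - alpha * grad
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

# Illustration: mean-squared-error loss for linear regression y ~ X @ theta.
rng = np.random.default_rng(0)                # the random seed controls initialization/data
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
grad_fn = lambda theta: 2 / len(X) * X.T @ (X @ theta - y)   # gradient of the MSE loss
theta_hat = gradient_descent(grad_fn, theta0=rng.normal(size=3))
print(np.round(theta_hat, 3))                 # ~ [ 1.  -2.   0.5]
```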

Problems

GD will converge to a local optimum, can’t guarantee global optimum

  • Initialization matters: different starting points can converge to different local optima
    • The starting point is controlled by the random seed

Learning Rate:

  • A large LR can lead to large/drastic updates and divergent behavior (overshooting the minimum)
  • A small LR requires many updates before reaching the minimum
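
A quick toy demo of both failure modes (my own assumption, not from the notes): on $f(x) = x^2$ the update is $x \leftarrow (1 - 2\alpha)x$, which diverges once $\alpha > 1$.

```python
# f(x) = x^2, f'(x) = 2x, so each step multiplies x by (1 - 2*alpha).
def run_gd(alpha, steps=50, x=5.0):
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x

print(run_gd(alpha=1.5))    # large LR: |x| blows up (divergence)
print(run_gd(alpha=0.001))  # small LR: still far from the minimum at 0 after 50 steps
print(run_gd(alpha=0.3))    # moderate LR: very close to 0
```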

Batch Size

Full Batch GD:

  • i.e., for every instance in the dataset, calculate its loss, sum/average them, and use that as the loss for one update

Stochastic GD:

  • Mini-batch SGD: update the model after computing the average loss over a small batch of data points (see the fuller sketch below)
  • Instance-wise SGD: update the model after computing the loss on a single data point

for epoch in range(epochs):
    for batch in batches:        # one gradient update per mini-batch
        ...                      # compute the batch loss, gradient, and parameter update

Within a batch, the per-instance losses are summed and averaged (mean) to form the batch loss.
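
A fuller, runnable mini-batch SGD sketch (my own illustration; the linear-regression loss, batch size of 32, and 20 epochs are assumptions). The loss within each batch is averaged, and the parameters are updated once per batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                 # toy dataset (assumption)
y = X @ np.array([1.0, -2.0, 0.5])

theta = rng.normal(size=3)
alpha, batch_size, epochs = 0.05, 32, 20

for epoch in range(epochs):
    for start in range(0, len(X), batch_size):
        Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        # batch loss = mean over the instances in the batch (sum, then average)
        grad = 2 / len(Xb) * Xb.T @ (Xb @ theta - yb)
        theta = theta - alpha * grad          # one parameter update per mini-batch

print(np.round(theta, 3))                     # ~ [ 1.  -2.   0.5]
```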

Batch Performance analysis:

Full Batch GD:

  • Advantage: robust, converges in relatively few updates
  • Disadvantage: requires large memory, each update is slow to compute

SGD:

  • Advantage: fast computation, low memory
  • Disadvantage: not robust; the gradient from a single instance can be misleading, which can slow down convergence

Mini-batch SGD:

  • Compromise between Full Batch GD and SGD

Data ordering can poison gradient updates

Say the data is ordered: 100 dogs, 100 cats, 100 cows, …
If you use batches of 100, you will get VERY biased loss updates
Shuffle Data!
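
One common way to do this, sketched below (my own toy example; the array names and sizes are assumptions): draw a fresh random permutation at the start of every epoch before slicing mini-batches.

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(12).reshape(6, 2)            # tiny toy dataset (assumption)
y = np.array([0, 0, 0, 1, 1, 1])           # ordered labels: all 0s first, then all 1s
batch_size = 3

for epoch in range(2):
    perm = rng.permutation(len(X))         # fresh random order every epoch
    X_shuf, y_shuf = X[perm], y[perm]
    for start in range(0, len(X_shuf), batch_size):
        yb = y_shuf[start:start + batch_size]
        print(epoch, yb)                   # batches now typically mix both classes
```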

Optimizers

When updating, you can get fancier (with more hyperparameters) than a plain learning-rate update:

  • Momentum: accumulate a decaying average of past gradients (see the sketch below)
  • Change the learning rate adaptively (e.g., Adagrad and Adam adjust it per parameter)
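
A minimal sketch of an SGD-with-momentum update (my own; the decay factor $\beta = 0.9$ and step size are conventional defaults, not from the notes). The velocity accumulates an exponentially decaying average of past gradients, which smooths noisy SGD steps.

```python
def sgd_momentum_step(theta, grad, velocity, alpha=0.05, beta=0.9):
    """One SGD-with-momentum update; velocity is a running average of past gradients."""
    velocity = beta * velocity - alpha * grad   # accumulate (decayed) past gradients
    theta = theta + velocity                    # step in the smoothed direction
    return theta, velocity

# Usage on f(x) = x^2 (gradient 2x), starting at x = 5 with zero velocity:
x, v = 5.0, 0.0
for _ in range(300):
    x, v = sgd_momentum_step(x, 2 * x, v)
print(round(x, 3))  # close to the minimum at 0
```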

πŸ§ͺ -> Refresh the Info

Did you generally find the overall content understandable, compelling, or relevant (or not), and why? Which aspects of the reading were most novel or challenging for you, and which were most familiar or straightforward?

Did a specific aspect of the reading raise questions for you or relate to other ideas and findings you’ve encountered? Are there other related issues you wish had been covered?

Resources

  • Put useful links here

Connections

  • Link all related words