Summary of topics

Lecture 1 - Introduction

  • Supervised
  • Unsupervised
  • Reinforcement Learning
  • Parametric vs. Non-parametric:
    • A model with a fixed number of parameters is called parametric
    • If the number of parameters grows with the size of the data, it's called non-parametric

Lecture 2 - Linear Regression

  • Linear regression
    • Ordinary Least Squares (OLS)
    • Assumes linear function of input
  • Residual Sum of Squares
    • How to minimize error?
      • OLS or GD
    • Learn how to implement OLS or GD
  • Least Mean Squares (LMS)

Lecture 3 - Linear Regression Part 2

  • Overfitting, polynomial order, and sample size
  • Test/Train curve
  • Regularization - Ridge/Lasso
  • Gaussian/Gamma distributions
  • Maximum Likelihood Estimator (MLE)
    • How to derive it, and the equivalent formulations (NLL)
  • RSS/SSE/MSE/OLS

Lecture 4 - Logistic Regression and Classification

  • Regression vs. Classification
  • Sigmoid/Logistic Function
  • Solving for optimal parameters (see the sketch after this list):
    • MLE and GD
  • Could also use Newton’s method
  • Perceptron learning algo
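
A minimal sketch of the MLE-plus-gradient-descent approach from the list above, assuming a design matrix X (with a bias column already appended) and binary labels y in {0, 1}; all names and hyperparameters are illustrative:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, lr=0.1, epochs=1000):
    # X: (n_samples, n_features) design matrix; y: (n_samples,) labels in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)              # predicted probabilities
        grad = X.T @ (p - y) / len(y)   # gradient of the mean negative log-likelihood
        w -= lr * grad                  # gradient-descent step
    return w
```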

Lecture 5 - Intro to ANN

  • Perceptron learning algorithm
  • FFNN, MLP
  • Activation functions (ReLU, logistic, step; sketched after this list)
  • When and how to change architecture
  • ANN classifier
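
For reference, minimal NumPy versions of the activation functions named above (illustrative sketches, not the course's exact definitions):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # ReLU: max(0, z)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))      # logistic/sigmoid: squashes to (0, 1)

def step(z):
    return (z >= 0).astype(float)        # step: hard threshold, as in the perceptron
```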

Lecture 6 - Backpropagation in ANN

  • The entire slide show?
  • Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood (the cost function)
  • To train:
    • Use an optimization method (gradient descent on the NLL, or equivalently gradient ascent on the log-likelihood)
    • Take derivatives with respect to w
  • Forward propagation
  • Calculate the errors for each layer
  • For each sample (see the sketch after this list):
    • set a^(1) = x_k
    • compute a^(l) for all layers l (forward prop)
    • compute the error in the final layer, then propagate it back through all hidden layers
    • compute the partial derivatives with respect to the weights
    • use the derivatives to update the weights with a heuristic optimization method
  • NN types:
    • FFNN if ANN graph is acyclic
    • Recurrent networks when it is cyclic
    • Radial Basis Function Networks, Hopfield Networks, Long Short-Term Memory (LSTM), etc.
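
A minimal NumPy sketch of the per-sample loop above, assuming one hidden layer, sigmoid activations, and a cross-entropy loss (so the output error is simply a - y); shapes and names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, W1, W2, lr=0.1):
    # Shapes (illustrative): x is (d,), y is a scalar in {0, 1}, W1 is (h, d), W2 is (1, h).
    a1 = x                                      # a^(1) = x_k
    a2 = sigmoid(W1 @ a1)                       # a^(2): hidden-layer activations (forward prop)
    a3 = sigmoid(W2 @ a2)                       # a^(3): output activation

    delta3 = a3 - y                             # error in the final layer
    delta2 = (W2.T @ delta3) * a2 * (1 - a2)    # error propagated back to the hidden layer

    dW2 = np.outer(delta3, a2)                  # partial derivatives of the cost w.r.t. W2
    dW1 = np.outer(delta2, a1)                  # partial derivatives of the cost w.r.t. W1

    W2 -= lr * dW2                              # gradient-descent update
    W1 -= lr * dW1
    return W1, W2
```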

Lecture 7 - Naive Bayes

  • Naive Bayes: a generative method
  • Concepts of
    • Prior, posterior, and likelihood
  • Assumption of Naive Bayes: features are conditionally independent of each other given the class (see the factorization after this list)
  • When it works well: features roughly uncorrelated, and only modest training data available
  • Understand the complexity explosion with the discriminant-function approach
  • Understand how assuming feature independence simplifies the calculation
  • Measuring classification performance
    • Cross-validation, binary classification errors, statistical measures (F1, etc.), ROC, AUC
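
The independence assumption above is what makes the method tractable: the class-conditional likelihood factorizes into per-feature terms (a sketch of the standard formulation, with $d$ features $x_1, \dots, x_d$):

$$P(y \mid x_1, \dots, x_d) \;\propto\; P(y)\,\prod_{j=1}^{d} P(x_j \mid y)$$

Without the assumption we would have to model the joint $P(x_1, \dots, x_d \mid y)$, whose parameter count blows up combinatorially with $d$; with it, we only estimate $d$ one-dimensional conditionals per class.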

Lecture 8 - Decision Trees and Random Forest

  • How to decide which rules to split on? (see the sketch after this list)
  • Measuring the purity/homogeneity of the final sets (the leaves):
    • Entropy
    • Gini index
  • Balance between complex rules and simple rules
    • Optimal?
  • How to prune back the tree
    • measure performance
    • cross-validation
    • minimum description length
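
A minimal NumPy sketch of the split criteria above: entropy and the Gini index measure a node's impurity, and the information gain compares a candidate split by the size-weighted drop in impurity (names are illustrative):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of the class labels in a node; 0 means a perfectly pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini index of a node; also 0 for a pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # Impurity of the parent minus the size-weighted impurity of the children.
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
```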

Lecture 9 - Autoencoders

  • General structure of an autoencoder (see the sketch after this list)
  • Desired characteristics:
    • Sparse representation
    • Lower dimensionality
    • Spatial and temporal info maintained
  • Activation functions
  • Loss functions
  • Other uses:
    • Speed of compute
    • Anomaly detection
    • Denoising
  • Convolutional autoencoder
  • Generation:
    • Variational autoencoder
    • Requirements: continuity and completeness of the latent space
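
A minimal PyTorch sketch of the general structure above (the layer sizes, activations, and 784-dimensional input are illustrative assumptions, not the course's architecture):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compresses the input to a lower-dimensional latent code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: reconstructs the input from the latent code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimizes a reconstruction loss, e.g. nn.MSELoss(), between x and model(x).
```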

Deeper Dive

Tab stack (references open while writing this):

lecture 4
sample midterm
sample midterm solutions

Lecture 1: Intro

Nothing too crazy, just the topics above

Lecture 2: Linear Regression

Linear Regression: Model depends linearly on unknown parameters, estimated from the data

  • Simple linear regression: one independent, one dependent
  • Multiple linear regression: multiple independent, one dependent
  • Multivariate LR: multiple independent, multiple dependent
    • (also called the general linear model)
OLS

OLS - a method for estimating the parameters of a linear regression by minimizing the sum of squared residuals

Ideal Linear Regression: $y_i = w^\top x_i$ (the data lie exactly on the fitted line/hyperplane)
Non-ideal: the deviation from the ideal leaves a residual, $\epsilon_i = y_i - w^\top x_i$
RSS: $\mathrm{RSS}(w) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2 = (y - Xw)^\top (y - Xw)$

  • To measure error

Minimizing the RSS is equivalent to maximizing the log-likelihood of the data given the model:

$\hat{w} = \arg\min_w \mathrm{RSS}(w) = \arg\max_w \log p(y \mid X, w)$


  • Solve this with either OLS or GD

For any vector $a$: $a^\top a = \sum_i a_i^2$, so $\mathrm{RSS}(w) = (y - Xw)^\top (y - Xw)$

  • Also, $X^\top X$ is symmetric
  • Expanding the product gives us: $\mathrm{RSS}(w) = y^\top y - 2\,w^\top X^\top y + w^\top X^\top X w$

Derivation: take the gradient with respect to $w$ and set it to zero,

$\nabla_w \mathrm{RSS}(w) = -2 X^\top y + 2 X^\top X w = 0 \;\Rightarrow\; \hat{w} = (X^\top X)^{-1} X^\top y$

Footnotes:

  • $y^\top y$ is a constant with respect to $w$, and gets zeroed out by the derivative
  • Since $w^\top X^\top y$ is a scalar, its transpose is equal to itself. That is, $w^\top X^\top y = y^\top X w$. Therefore, the two middle terms of the expansion are equal, which gives us the single $-2\,w^\top X^\top y$ term above
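
A minimal NumPy sketch of the closed-form solution just derived (the tiny dataset is made up for illustration):

```python
import numpy as np

def ols_fit(X, y):
    # Closed-form OLS: w = (X^T X)^{-1} X^T y.
    # np.linalg.solve avoids forming an explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative usage: fit y ~ w0 + w1 * x by prepending a bias column of ones.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 7.1])
X = np.column_stack([np.ones_like(x), x])
w = ols_fit(X, y)   # w[0] is the intercept, w[1] the slope
```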
Gradient Descent (GD)

3 main types:

  1. Stochastic GD (1) - One sample, update weights accordingly
  2. Mini-batch GD (1<m<n) - A batch of samples smaller than entire training set (think a handful conceptually), update weights
  3. Batch GD (n) - Whole training set

Central rule: $w := w - \alpha\,\nabla_w J(w)$ (step against the gradient of the cost, with learning rate $\alpha$)

LMS update rule (also called Widrow-Hoff): $w_j := w_j + \alpha\,(y_i - w^\top x_i)\,x_{ij}$

  • Repeat until convergence:
    • SGD: apply the update above one sample $(x_i, y_i)$ at a time
    • BGD: $w_j := w_j + \alpha \sum_{i=1}^{n} (y_i - w^\top x_i)\,x_{ij}$, summing over the whole training set before each update
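
A minimal NumPy sketch of the two update styles, assuming a linear model $\hat{y} = w^\top x$ (names are illustrative):

```python
import numpy as np

def lms_epoch_sgd(X, y, w, lr=0.01):
    # Stochastic (one-sample-at-a-time) LMS / Widrow-Hoff updates.
    for xi, yi in zip(X, y):
        w = w + lr * (yi - xi @ w) * xi     # w := w + alpha * (y_i - w^T x_i) * x_i
    return w

def bgd_step(X, y, w, lr=0.01):
    # Batch gradient-descent step: one update using the whole training set.
    return w + lr * X.T @ (y - X @ w)
```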

Lecture 3: Linear Regression Pt.2

Understand the tradeoff between model complexity and sample size, illustrated by the polynomial-fitting problem

Regularization:

Bound the size (norm) of the weights by adding a penalty term to the objective function

Ridge/L2 Regularization: $J(w) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \lVert w \rVert_2^2$

Lasso/L1 Regularization: $J(w) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \lVert w \rVert_1$
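
A minimal NumPy sketch of the ridge objective's closed-form minimizer (lasso has no closed form; it is usually solved with coordinate descent or proximal methods, e.g. sklearn.linear_model.Lasso):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Ridge / L2 regression closed form: w = (X^T X + lambda * I)^{-1} X^T y.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```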

Conditional Probs:

Reframe problem in a probabilistic way:

  • The error is Gaussian: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
  • The output is normally distributed, centered on the model's prediction, with the noise thrown in: $y_i \mid x_i \sim \mathcal{N}(w^\top x_i,\, \sigma^2)$
Maximum Likelihood Estimation (MLE)

We want to maximize the probability that our parameters produced the observed data:

$\hat{w} = \arg\max_w\, p(y \mid X; w)$

Equivalently, maximize the log-likelihood $\ell(w) = \log p(y \mid X; w)$:

  • Assuming the M training samples are i.i.d., the likelihood factorizes and the log turns the product into a sum: $\ell(w) = \sum_{k=1}^{M} \log p(y_k \mid x_k; w)$
      • This estimates the probability of observing the data given our assumptions; it gets smaller the more wrong the model is
      • “The log-likelihood increases as the residual sum of squares decreases — i.e., the model assigns higher probability to data that lies closer to the predicted line”
    • Maximizing this probability minimizes the RSS, which in the linear-regression case gives the OLS solution
  • Instead of maximizing the log-likelihood we can minimize the negative log-likelihood (NLL): $\mathrm{NLL}(w) = -\ell(w)$ (see the worked derivation after this list)
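
Putting the pieces together (a sketch under the Gaussian-noise assumption $y_k = w^\top x_k + \epsilon_k$, $\epsilon_k \sim \mathcal{N}(0, \sigma^2)$):

$$\ell(w) = \sum_{k=1}^{M} \log \mathcal{N}(y_k \mid w^\top x_k, \sigma^2) = -\frac{M}{2}\log(2\pi\sigma^2) \;-\; \frac{1}{2\sigma^2}\sum_{k=1}^{M}\bigl(y_k - w^\top x_k\bigr)^2$$

The first term is constant in $w$, so maximizing $\ell(w)$ (or minimizing the NLL) is exactly minimizing the RSS, and the maximizer is the OLS solution.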

Lecture 4 - Logistic Regression and Classification