Summary of topics
Lecture 1 - Introduction
- Supervised
- Unsupervised
- Reinforcement Learning
- Parametric vs. Non-parametric:
- A fixed number of parameters is called parametric
- If the number of parameters grows with the size of the data, it's called non-parametric (e.g., linear regression is parametric; k-nearest neighbors is non-parametric)
Lecture 2 - Linear Regression
- Linear regression
- Ordinary Least Squares (OLS)
- Assumes linear function of input
- Residual Sum of Squares
- How to minimize error?
- OLS or GD
- Learn how to implement OLS or GD
- Least Mean Squares (LMS)
Lecture 3 - Linear Regression Part 2
- Overfitting, polynomial order, and sample size
- Test/Train curve
- Regularization - Ridge/Lasso
- Gaussian/Gamma distributions
- Maximum Likelihood Estimator (MLE)
- How to compute it, and the equivalent formulations (NLL)
- RSS/SSE/MSE/OLS
Lecture 4 - Logistic Regression and Classification
- Regression vs. Classification
- Sigmoid/Logistic Function
- Solving for optimal parameters:
- MLE and GD
- Could also use Newton’s method
- Perceptron learning algo
Lecture 5 - Intro to ANN
- Perceptron learning algorithm
- FFNN, MLP
- Activation functions (ReLU, logistic, step)
- When and how to change architecture
- ANN classifier
Lecture 6 - Backpropagation in ANN
- The entire slide show?
- Maximize the log likelihood vs minimize the negative log likelihood (cost function)
- To train (see the code sketch after this lecture's topics):
- Use an optimization method (gradient descent on the NLL, or equivalently gradient ascent on the log likelihood)
- Take derivatives with respect to w
- Forward propagation
- Calculate the errors for each layer
- For each sample:
- set a(1) = xk
- compute a(l) for all layers l (forward prop)
- compute the error in the final layer, then in all hidden layers
- compute partial derivatives
- use the derivative to update with a heuristic optimization method
- NN types:
- FFNN if ANN graph is acyclic
- Recurrent networks when it is cyclic
- Radial Basis Function Networks, Hopfield Networks, long short-term memory (LSTM), etc.
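A minimal sketch of the training loop in the list above, assuming one hidden layer, sigmoid activations, and a squared-error cost (layer sizes, learning rate, and the toy XOR data are illustrative choices, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative architecture: 2 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))
alpha = 0.5  # learning rate (heuristic choice)

# Toy XOR-style dataset, one column per sample
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float)

for epoch in range(5000):
    for k in range(X.shape[1]):             # for each sample:
        a1 = X[:, [k]]                      # set a(1) = x_k
        z2 = W1 @ a1 + b1                   # forward prop: hidden layer
        a2 = sigmoid(z2)
        z3 = W2 @ a2 + b2                   # forward prop: output layer
        a3 = sigmoid(z3)

        # error (delta) in the final layer, then backpropagate to the hidden layer
        delta3 = (a3 - Y[:, [k]]) * a3 * (1 - a3)
        delta2 = (W2.T @ delta3) * a2 * (1 - a2)

        # partial derivatives of the cost w.r.t. each weight and bias
        dW2, db2 = delta3 @ a2.T, delta3
        dW1, db1 = delta2 @ a1.T, delta2

        # gradient-descent update
        W2 -= alpha * dW2; b2 -= alpha * db2
        W1 -= alpha * dW1; b1 -= alpha * db1

print(sigmoid(W2 @ sigmoid(W1 @ X + b1) + b2))  # typically close to Y after training
```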
Lecture 7 - Naive Bayes
- Naive Bayes: a generative method
- Concepts of
- Prior, posterior, and likelihood
- Assumptions of naive bayes (features independent of other features)
- When does it work well (features uncorrelated, and modest training data)
- Understand complexity explosion with the discriminant function approach
- Understand how assuming feature independence simplifies the calculation (see the factorization after this lecture's list)
- Measuring classification performance
- Cross-validation, binary classification errors, statistical measures (f1, etc), ROC, AUC
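The simplification bought by the independence assumption, written out (the standard naive Bayes factorization; notation is mine, not from the slides):

$$
P(y \mid x_1,\dots,x_d) \;=\; \frac{P(y)\,P(x_1,\dots,x_d \mid y)}{P(x_1,\dots,x_d)} \;\propto\; P(y)\prod_{j=1}^{d} P(x_j \mid y),
$$

so instead of modeling the joint distribution of all $d$ features per class, we only need $d$ one-dimensional class-conditional distributions.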
Lecture 8 - Decision Trees and Random Forest
- How to decide which rules to split on?
- What is the purity/homogeneity of the final sets (leaves)? (formulas sketched after this lecture's list)
- Entropy
- Gini index
- Balance between complex rules and simple rules
- Optimal?
- How to prune back the tree
- measure performance
- cross-validation
- minimum description length
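For reference, the two impurity measures named above, for a node with class proportions $p_1,\dots,p_C$ (standard definitions, not copied from the slides):

$$
H = -\sum_{c=1}^{C} p_c \log_2 p_c, \qquad \mathrm{Gini} = 1 - \sum_{c=1}^{C} p_c^2
$$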
Lecture 9 - Autoencoders
- General structure of autoencoder
- Desired characteristics:
- Sparse representation
- Lower dimensionality
- Spatial and temporal info maintained
- Activation functions
- Loss functions
- Other uses:
- Speed of compute
- Anomaly detection
- Denoising
- Convolutional autoencoder
- Generation:
- Variational autoencoder
- Reqs - continuity and completeness
Deeper Dive
Lecture 1: Intro
Nothing too crazy, just the topics above
Lecture 2: Linear Regression
Linear Regression: Model depends linearly on unknown parameters, estimated from the data
- Simple linear regression: one independent, one dependent
- Multiple linear regression: multiple independent, one dependent
- Multivariate LR: multiple independent, multiple dependent
- (general linear regression)
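Written out (standard forms; the notation is mine, not from the slides):

$$
\begin{aligned}
\text{Simple:}\quad & y = w_0 + w_1 x + \epsilon \\
\text{Multiple:}\quad & y = w_0 + w_1 x_1 + \dots + w_d x_d + \epsilon \\
\text{Multivariate (general):}\quad & Y = XW + E \quad (\text{one column of } Y, E \text{ per output})
\end{aligned}
$$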
OLS
OLS - Method for estimating parameters in linear regression
Ideal Linear Regression: $y_i = w^\top x_i$, i.e. $y = Xw$ in matrix form
Un-ideal: The deviation from ideal leaves the residual, $\epsilon_i = y_i - w^\top x_i$
RSS: $\mathrm{RSS}(w) = \sum_{i=1}^{n} (y_i - w^\top x_i)^2 = (y - Xw)^\top (y - Xw)$
- To measure error
Minimizing the RSS is equivalent to maximizing the log-likelihood of the data given the model.
Solve this with either OLS or GD
For any vector $z$: $z^\top z = \sum_i z_i^2$, which gives the matrix form of the RSS above
- Also, $X^\top X$ is symmetric - That gives us: $\frac{\partial}{\partial w}\big(w^\top X^\top X\,w\big) = 2\,X^\top X\,w$
Derivation now: expand the RSS and set its gradient to zero,
$\mathrm{RSS}(w) = y^\top y - 2\,w^\top X^\top y + w^\top X^\top X\,w$
$\nabla_w \mathrm{RSS}(w) = -2\,X^\top y + 2\,X^\top X\,w = 0 \;\Rightarrow\; \hat{w}_{\mathrm{OLS}} = (X^\top X)^{-1} X^\top y$
Footnotes:
- $y^\top y$ is a constant, and gets zeroed out by the derivative
- "Since $w^\top X^\top y$ is a scalar, its transpose is equal to itself. That is, $(w^\top X^\top y)^\top = y^\top X w$. Therefore, the two middle terms are equal", which gives us the single $-2\,w^\top X^\top y$ term in the expansion above
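A quick NumPy sketch of the closed-form solution just derived (the synthetic data and "true" weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # prepend a bias column
w_true = np.array([1.0, 2.0, -3.0])                         # illustrative "true" weights
y = X @ w_true + 0.1 * rng.normal(size=n)                   # targets with Gaussian noise

# OLS: w_hat = (X^T X)^{-1} X^T y  (solve() is preferred over forming the inverse)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```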
Gradient Descent (GD)
3 main types:
- Stochastic GD (1) - One sample, update weights accordingly
- Mini-batch GD (1<m<n) - A batch of samples smaller than the entire training set (think a handful, conceptually), update weights
- Batch GD (n) - Whole training set
Central rule: $w := w - \alpha\,\nabla_w J(w)$
LMS update rule - (also called Widrow-Hoff):
- Repeat until convergence:
- SGD (one sample $i$ per step): $w_j := w_j + \alpha\,\big(y^{(i)} - w^\top x^{(i)}\big)\,x_j^{(i)}$
- BGD (sum over all $n$ samples per step): $w_j := w_j + \alpha \sum_{i=1}^{n} \big(y^{(i)} - w^\top x^{(i)}\big)\,x_j^{(i)}$
- Mini-batch GD: the same summed update, taken over a batch of $m < n$ samples
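A sketch of all three variants using the LMS update, on synthetic data (learning rate, epoch counts, and batch size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.05 * rng.normal(size=n)
alpha = 0.002  # learning rate (heuristic)

def batch_gd(X, y, epochs=500):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w += alpha * X.T @ (y - X @ w)            # LMS update summed over all samples
    return w

def sgd(X, y, epochs=50):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):         # one sample at a time
            w += alpha * (y[i] - X[i] @ w) * X[i]
    return w

def minibatch_gd(X, y, m=20, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))
        for start in range(0, len(y), m):         # batches of m samples
            B = idx[start:start + m]
            w += alpha * X[B].T @ (y[B] - X[B] @ w)
    return w

print(batch_gd(X, y), sgd(X, y), minibatch_gd(X, y), sep="\n")  # each is close to w_true
```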
Lecture 3: Linear Regression Pt.2
Understand model complexity and data tradeoffs, the polynomial fitting problem
Regularization:
Bounding the size of the weights by adding a penalty term to the objective function
Ridge/L2 Regularization: $J(w) = \mathrm{RSS}(w) + \lambda \sum_j w_j^2 = \mathrm{RSS}(w) + \lambda \lVert w \rVert_2^2$
Lasso/L1 Regularization: $J(w) = \mathrm{RSS}(w) + \lambda \sum_j \lvert w_j \rvert = \mathrm{RSS}(w) + \lambda \lVert w \rVert_1$
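Ridge keeps a closed-form solution, which makes a nice companion to the OLS snippet above (a sketch; the data, the $\lambda$ values, and the choice not to add an unpenalized bias column are all illustrative):

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.0))   # lam = 0 recovers plain OLS
print(ridge_fit(X, y, lam=10.0))  # larger lam shrinks the weights toward zero
```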
Conditional Probs: $p(y \mid x, w)$
Reframe the problem in a probabilistic way:
- Error is Gaussian: $\epsilon \sim \mathcal{N}(0, \sigma^2)$
- Output is a normal distribution centered on the model's estimate, with noise thrown in: $p(y \mid x, w) = \mathcal{N}(y \mid w^\top x, \sigma^2)$
Maximum Likelihood Estimation (MLE)
We want to maximize the probability that our parameters assign to the observed data: $L(w) = p(y \mid X, w)$
Equivalently, work with the log-likelihood $\ell(w) = \log L(w)$:
- Assuming the M training samples are i.i.d., the likelihood factorizes and the log turns the product into a sum: $\ell(w) = \sum_{i=1}^{M} \log p(y_i \mid x_i, w)$
- This estimates the probability of observing the data given our assumptions; it is smaller the more wrong the model is
- "The log-likelihood increases as the residual sum of squares decreases — i.e., the model assigns higher probability to data that lies closer to the predicted line"
- Maximizing this probability minimizes the RSS, which in the linear regression case gives OLS
- Instead of maximizing $\ell(w)$, we can minimize the negative log likelihood (NLL): $\mathrm{NLL}(w) = -\ell(w)$
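Filling in the step that links the Gaussian noise model to the RSS (a standard derivation; $\sigma$ and $M$ as above):

$$
\mathrm{NLL}(w) = -\sum_{i=1}^{M} \log \mathcal{N}\!\big(y_i \mid w^\top x_i, \sigma^2\big)
= \frac{1}{2\sigma^2} \sum_{i=1}^{M} \big(y_i - w^\top x_i\big)^2 + \frac{M}{2}\log\big(2\pi\sigma^2\big)
= \frac{1}{2\sigma^2}\,\mathrm{RSS}(w) + \text{const},
$$

so minimizing the NLL in $w$ is exactly minimizing the RSS, i.e. OLS.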