Dataview:

list from [[]] and !outgoing([[]])

📗 -> Multicollinearity

🎤 Vocab

Multicollinearity

Multicollinearity occurs when independent variables in a linear regression model are correlated with one another. Multicollinear variables can degrade model predictions on unseen data. Diagnostic measures can detect multicollinearity, and regularization techniques can help address it.

Independent Variable (IV) - A variable that stands alone and isn’t changed by the other variables you’re trying to measure
Dependent Variable (DV) - A variable whose value varies with the input of the IV

Overfitting - Models with low training error and high generalization error

❗ Information

Multicollinearity vs Collinearity:

  • Collinearity is when two independent variables in a regression analysis are correlated

  • Multicollinearity is when more than two are correlated

  • Their opposite is orthogonality, which is when IVs are uncorrelated

It’s important to measure and address multicollinearity so that you don’t overcomplicate models or overfit

  • Multicollinear data fails to generalize well, because the fitted model absorbs too much of the shared variance and noise

Estimated coefficients for multicollinear variables have high variability, i.e. large standard errors

📄 -> Methodology

Regression Analysis - Regression

The standard regression equation is as follows:

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$

  • Y is the predicted output (DV)
  • X is the predictor (IV or explanatory variable)
  • β is the regression coefficient
  • ε is the error term

To estimate the coefficients, we use the standard OLS matrix estimator:

$\hat{\beta} = (X^{T} X)^{-1} X^{T} y$
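A minimal sketch of this estimator on synthetic data (NumPy only; the variable names, coefficients, and sample size are illustrative, not from any real dataset):

```python
import numpy as np

# Synthetic data: n observations, k predictors (hypothetical example)
rng = np.random.default_rng(0)
n, k = 100, 3
X = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), X])          # prepend an intercept column
true_beta = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# OLS matrix estimator: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)                                # should be close to true_beta
```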

Causes of Multicollinearity

Data Collection - Collinearity can be introduced when the collected data isn’t representative of the population

For instance, Montgomery et al. supply the example of a supply chain delivery dataset in which order distance and size are independent variables of a predictive model. In the data they provide, order inventory size appears to increase with delivery distance. The solution to this correlation is straightforward: collect and include data samples for short-distance deliveries with large inventories, or vice versa.

Model Constraints - The data being fed into the model might already contain collinearity, because of the nature of the data itself

Imagine we are creating a predictive model to measure employee satisfaction in the workplace, with hours worked per week and reported stress being two of several predictors. There may very well be a correlation between these predictors due to the nature of the data—i.e. people who work more will likely report higher stress.

Overdefined Model - This occurs when there are more model predictors than data observations.

This issue arises particularly in biostatistics and other biological studies. Resolving an overdefined model requires eliminating select predictors from the model altogether. But how do you determine which predictors to remove? One can conduct several preliminary studies using subsets of regressors (i.e. predictors) or use principal component analysis (PCA) to combine multicollinear variables.
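A minimal PCA sketch with scikit-learn, on synthetic data deliberately built so that eight predictors span only two underlying directions (all names and sizes here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic predictors: 8 columns that are all linear mixes of 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 2))
X = np.column_stack([latent, latent @ rng.normal(size=(2, 6))])

# Keep enough principal components to explain 95% of the variance;
# the correlated columns collapse into a much smaller orthogonal set
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_)
```

The resulting components are orthogonal by construction, so they can replace the original multicollinear predictors in the regression.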

Data Based Collinearity

Certain kinds of data are particularly susceptible, such as time series. Economic indicators, for example, tend to fluctuate together.

Detecting Multicollinearity

Relatively Simple Options:

  • Scatter Plot
  • Correlation Matrix - Elements of the matrix are the correlation coefficients between each pair of predictors in the model. The higher the absolute value (coefficients range from -1 to 1), the stronger the correlation (see the sketch below)
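A minimal sketch of both quick checks, assuming the predictors sit in a pandas DataFrame (the columns here are synthetic, with x2 deliberately built from x1):

```python
import numpy as np
import pandas as pd

# Synthetic predictors: x2 is constructed from x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(scale=0.2, size=200),   # strongly correlated with x1
    "x3": rng.normal(size=200),                          # roughly orthogonal
})

corr = df.corr()            # Pearson correlation matrix, values in [-1, 1]
print(corr.round(2))

# Quick visual check of the suspicious pair (requires matplotlib)
df.plot.scatter(x="x1", y="x2")
```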

Variance Inflation Factor (VIF) - The most common method for detecting multicollinearity. For predictor $i$, $\mathrm{VIF}_i = 1 / (1 - R_i^2)$, where $R_i^2$ is the coefficient of determination from regressing predictor $i$ on all the other predictors. The calculation is tedious by hand, so in practice use a statistics package.
The higher the VIF, the greater the degree of multicollinearity.

A widely repeated rule of thumb is that a VIF value greater than or equal to ten indicates severe multicollinearity
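A minimal VIF sketch using statsmodels (the DataFrame below is synthetic; x2 is built to be nearly collinear with x1):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic predictors with x2 nearly collinear with x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": 0.9 * x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})
X = sm.add_constant(X)      # include the intercept when computing VIFs

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.round(1))         # VIF >= 10 for x1 and x2 would flag severe multicollinearity
```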

Fixing Multicollinearity

Fixes for multicollinearity range from diversifying or enlarging the sample size of training data to removing parameters altogether

You can also employ regularization techniques to address this

  • Ridge regression - Penalizes large coefficients via an L2 penalty (squared magnitude)
  • Lasso regression - Penalizes large coefficients via an L1 penalty (absolute value)

The primary difference between these two is that ridge merely reduces coefficient values to near-zero while lasso can reduce coefficients to zero, effectively removing independent variables from the model altogether.
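A minimal scikit-learn sketch of both penalties on synthetic data where x2 nearly duplicates x1; the alpha values are illustrative and would normally be tuned by cross-validation:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: x2 nearly duplicates x1, and only x1 and x3 drive y
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 0.5 * x3 + rng.normal(scale=0.1, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can zero coefficients out entirely
print("ridge:", ridge.coef_.round(2))
print("lasso:", lasso.coef_.round(2))
```

Typically the ridge fit splits the weight between the duplicated predictors, while the lasso fit keeps one and drives the other toward zero.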

✒️ -> Usage

Both of the following fields feature high-dimensional data whose variables often correlate with one another

  • Finance
  • Criminal Justice

🧪 -> Example

  • You don’t want to duplicate data: two predictors that encode the same information (e.g. the same measurement recorded in different units) are perfectly correlated, which overcomplicates the model and invites overfitting (see the sketch below)
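A tiny sketch of why duplication is a problem: with an exact duplicate column, $X^T X$ is singular, so the OLS estimator above has no unique solution (synthetic data, NumPy only):

```python
import numpy as np

# Synthetic design matrix where the third column exactly duplicates the second
rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([np.ones(50), x, x])   # intercept, x, exact copy of x

# X^T X is rank-deficient, so (X^T X)^{-1} does not exist and the OLS
# coefficients are not uniquely determined
print(np.linalg.matrix_rank(X.T @ X))      # 2, not 3
print(np.linalg.cond(X.T @ X))             # enormous condition number
```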

Resources

Connections