Dataview:
list from [[]] and !outgoing([[]])
-> Missing Values in Dataset Preprocessing
Vocab To Know:
- MCAR (Missing Completely at Random): The probability that a value is missing is unrelated to both the observed and the unobserved data.
- MAR (Missing at Random): The probability that a value is missing depends on the observed data but not on the unobserved data.
- MNAR (Missing Not at Random): The probability that a value is missing depends on the unobserved data, often on the missing value itself.
- Imputation: Replacing missing values with estimated values.
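The three mechanisms above can be made concrete with a small simulation. This is a minimal sketch on synthetic data; the variables `x` and `y`, the 30% MCAR rate, and the logistic missingness probabilities are all illustrative assumptions, not part of the note.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)        # fully observed feature (hypothetical)
y = x + rng.normal(size=n)    # feature that will receive missing values

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: every value has the same 30% chance of being missing
mcar_mask = rng.random(n) < 0.3
# MAR: missingness depends only on the observed feature x
mar_mask = rng.random(n) < sigmoid(2 * x)
# MNAR: missingness depends on the value of y itself
mnar_mask = rng.random(n) < sigmoid(2 * y)

true_mean = y.mean()
mcar_mean = y[~mcar_mask].mean()
mar_mean = y[~mar_mask].mean()
mnar_mean = y[~mnar_mask].mean()

print(f"true mean of y:        {true_mean:.3f}")
print(f"observed mean (MCAR):  {mcar_mean:.3f}")
print(f"observed mean (MAR):   {mar_mean:.3f}")
print(f"observed mean (MNAR):  {mnar_mean:.3f}")
```

Under MCAR the observed mean stays close to the true mean, while under MAR and MNAR the naive observed mean drifts away from it. Under MAR that drift can be corrected using `x`; under MNAR it cannot be corrected from the observed data alone.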
Information
Nearly every real-world dataset has some missing values, so it's important to address them before modeling.
-> How to address each missing pattern
MCAR
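- Deletion: Very easy to implement, but discards information and reduces statistical power. It is unbiased only if the data really are MCAR; otherwise it can introduce bias.
- Mean Imputation: Replacing missing continuous values with the mean of that feature. This preserves the feature's mean but shrinks its variance and weakens its correlations with other features, which can bias downstream estimates.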
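The trade-off between the two MCAR options can be checked directly. The sketch below assumes a hypothetical single-column DataFrame (`height`) with 30% of its values removed at random; the column name and sizes are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"height": rng.normal(170, 10, size=1000)})

# Knock out roughly 30% of the values completely at random (MCAR)
df.loc[rng.random(len(df)) < 0.3, "height"] = np.nan

# Option 1: deletion keeps only the complete rows
deleted = df.dropna()

# Option 2: mean imputation keeps every row but fills gaps with the column mean
mean_imputed = df.fillna(df.mean(numeric_only=True))

print("rows kept:", len(deleted), "vs", len(mean_imputed))
print("variance :", deleted["height"].var(), "vs", mean_imputed["height"].var())
```

Deletion loses rows, while mean imputation keeps them at the cost of an artificially small variance, because every imputed value sits exactly at the mean.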
MAR
- Regression Imputation: Train a regression model on the complete samples, then use it to estimate the missing values from the other features.
Stochastic regression adds random noise to each prediction, so the imputed values do not all fall exactly on the regression line and the feature's variance is not understated. Multivariate regressions can model linear relationships between multiple features.
- Hot Deck Imputation: Replace missing values with observed values from similar records. Similar in spirit to KNN imputation.
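Regression imputation and its stochastic variant can be sketched with NumPy alone. The data here are synthetic, and the names `x`, `y_obs`, `y_reg`, and `y_stoch` are illustrative assumptions; a single predictor is used to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                       # fully observed predictor
y = 3.0 + 2.0 * x + rng.normal(size=n)       # target feature

# MAR: y is more likely to be missing when the observed x is large
missing = rng.random(n) < 1.0 / (1.0 + np.exp(-x))
y_obs = np.where(missing, np.nan, y)

# Fit a simple linear regression on the complete cases only
slope, intercept = np.polyfit(x[~missing], y_obs[~missing], deg=1)
pred = intercept + slope * x

# Deterministic regression imputation: use the fitted value directly
y_reg = np.where(missing, pred, y_obs)

# Stochastic regression imputation: add noise with the residual spread
resid_sd = np.std(y_obs[~missing] - pred[~missing], ddof=2)
noise = rng.normal(scale=resid_sd, size=n)
y_stoch = np.where(missing, pred + noise, y_obs)

print("std of imputed values (deterministic):", y_reg[missing].std())
print("std of imputed values (stochastic):   ", y_stoch[missing].std())
```

The deterministic version places every imputed point exactly on the fitted line, while the stochastic version restores the scatter around it.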
MNAR
- Multiple Imputation: Generate plausible values several times with randomness, and pool the results at the end.
Multiple imputation generates several imputed values for each missing entry. An imputation model is trained on the observed data and used to create multiple completed versions of the dataset. Model training and prediction are then run on each completed dataset, and the results are averaged to obtain the final predictions, which accounts for imputation uncertainty. Multiple imputation is a principled statistical approach, but it increases computational expense. Standard multiple imputation assumes MAR, so for MNAR data the imputation model must also account for the missingness mechanism.
- Expectation-Maximization (EM): An iterative algorithm that aims to maximize the likelihood of the observed data. It alternates between two steps:
  - Expectation Step: Compute the expected value (EV) of the missing data given the observed data and the current model estimates.
  - Maximization Step: Update the model estimates given the EV of the missing data.
EM can handle various types of data and models but may require careful initialization and convergence monitoring.
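The two EM steps can be sketched for one simple case: estimating the mean and covariance of two jointly normal features when one of them has missing values. The bivariate normal model, the synthetic data, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)                       # fully observed
y = 2.0 + 1.5 * x + rng.normal(size=n)       # true mean of y is 2.0

# MAR: y is unobserved whenever the observed x exceeds 0.5
miss = x > 0.5
y_obs = np.where(miss, np.nan, y)

# x is complete, so its mean and variance never change
mu_x, var_x = x.mean(), x.var()

# Initialise the remaining estimates from the complete cases
mu_y = np.nanmean(y_obs)
var_y = np.nanvar(y_obs)
cov_xy = np.cov(x[~miss], y_obs[~miss], bias=True)[0, 1]
complete_case_mean = mu_y

for _ in range(200):
    # E-step: expected y and y^2 for missing rows under current estimates
    beta = cov_xy / var_x
    cond_mean = mu_y + beta * (x - mu_x)
    cond_var = var_y - beta * cov_xy
    ey = np.where(miss, cond_mean, y_obs)
    ey2 = np.where(miss, cond_mean**2 + cond_var, y_obs**2)

    # M-step: re-estimate the parameters from the expected statistics
    new_mu_y = ey.mean()
    cov_xy = (x * ey).mean() - mu_x * new_mu_y
    var_y = ey2.mean() - new_mu_y**2

    converged = abs(new_mu_y - mu_y) < 1e-10
    mu_y = new_mu_y
    if converged:
        break

print("complete-case mean of y:", complete_case_mean)
print("EM estimate of mean of y:", mu_y)
```

The starting values come from the complete cases, which is one common but not mandatory initialization. The loop stops once the estimate of the mean stops changing, which is a simple form of the convergence monitoring mentioned above.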
-> Usage
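Mean Imputation:

    import pandas as pd
    # Fill missing values in numeric columns with that column's mean
    df = df.fillna(df.mean(numeric_only=True))

Hot-deck Imputation:

    # Random hot deck: fill each gap with a randomly drawn observed value from the same column
    for col in df.columns:
        donors = df[col].dropna()
        missing = df[col].isna()
        if not donors.empty and missing.any():
            df.loc[missing, col] = donors.sample(int(missing.sum()), replace=True).to_numpy()

-> Example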
- Define examples where it can be used
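As one worked example, multiple imputation can be sketched end to end on synthetic data. The sketch assumes the quantity of interest is the mean of `y`, uses bootstrap refits of a linear regression as the imputation model, and pools the results with Rubin's rules; the variable names and the choice of `m = 20` imputations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 2000, 20                               # sample size, number of imputations
x = rng.normal(size=n)
y = 3.0 + 2.0 * x + rng.normal(size=n)        # true mean of y is 3.0

# MAR: y is more likely to be missing when the observed x is large
miss = rng.random(n) < 1.0 / (1.0 + np.exp(-x))
y_obs = np.where(miss, np.nan, y)
obs_idx = np.flatnonzero(~miss)

estimates, variances = [], []
for _ in range(m):
    # Bootstrap the complete cases so each imputation model differs slightly
    boot = rng.choice(obs_idx, size=obs_idx.size, replace=True)
    slope, intercept = np.polyfit(x[boot], y_obs[boot], deg=1)
    resid_sd = np.std(y_obs[boot] - (intercept + slope * x[boot]), ddof=2)

    # Impute with prediction plus noise to create one completed dataset
    y_imp = y_obs.copy()
    y_imp[miss] = intercept + slope * x[miss] + rng.normal(
        scale=resid_sd, size=int(miss.sum())
    )

    # Analyse the completed dataset: estimate the mean and its variance
    estimates.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / n)

# Pool with Rubin's rules
pooled_mean = np.mean(estimates)
within = np.mean(variances)                   # average within-imputation variance
between = np.var(estimates, ddof=1)           # variance between imputations
total_var = within + (1 + 1 / m) * between

print("pooled mean:", pooled_mean)
print("pooled standard error:", np.sqrt(total_var))
```

The between-imputation term is what carries the imputation uncertainty into the final standard error; a single imputed dataset would report only the within-imputation part.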
-> Related Word
- Link all related words