3 Visualizations You Should Use With Multivariate Linear Regression
When creating a linear regression model with multiple predictors, there are several important considerations to factor into your analysis. I’ve outlined three major ones below, each of which concisely provides a lot of useful information.
1. Boxplots: do an initial examination for outliers
Boxplots display a lot of information: the minimum, first quartile, median, third quartile, and maximum. They also show visually how tightly your data are grouped. You can use this information to decide whether to log-transform your variables to normalize the data, or whether further examination of the outliers themselves is warranted (data entry errors, `NaN` values, etc.).
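As a minimal sketch (using seaborn on synthetic, right-skewed data standing in for a real feature), side-by-side boxplots of a raw and a log-transformed variable make the transformation decision easy to eyeball:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature, e.g. living area in square feet
sqft_living = rng.lognormal(mean=7.6, sigma=0.4, size=500)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x=sqft_living, ax=axes[0]).set_title("raw")
sns.boxplot(x=np.log(sqft_living), ax=axes[1]).set_title("log-transformed")
fig.savefig("boxplots.png")

# The five-number summary that the box itself draws:
q1, med, q3 = np.percentile(sqft_living, [25, 50, 75])
print(f"Q1={q1:.0f}, median={med:.0f}, Q3={q3:.0f}")
```

Points drawn beyond the whiskers are the candidates worth investigating as possible data entry errors before deciding to drop or transform.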
2. Jointplots: simple linear regression and distribution
These visualizations provide you with a great initial look at the distribution of each feature as well as a simple regression line to see whether there is a linear relationship with the target variable.
3. Heatmaps: address potentially correlated predictors
In linear regression, you want to understand the role of each independent variable, or predictor, keeping all other predictors constant (the premise of regression analysis). Therefore, you need to address multicollinearity as it violates this premise. Coefficients and p-values might not be reliable for correlated predictors and steps should be taken to remove or otherwise address the correlation.
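Pairwise correlations like the ones listed below are typically spotted with a heatmap of the feature correlation matrix. A minimal sketch with synthetic data and hypothetical housing-style column names:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(2)
sqft_living = rng.lognormal(mean=7.6, sigma=0.4, size=500)
df = pd.DataFrame({
    "sqft_living": sqft_living,
    # Deliberately constructed to correlate with sqft_living
    "sqft_above": sqft_living * 0.8 + rng.normal(0, 150, size=500),
    "bathrooms": (sqft_living / 800 + rng.normal(0, 0.5, size=500)).round(),
    "price": 200 * sqft_living + rng.normal(0, 50_000, size=500),
})

# Annotated correlation heatmap; strongly colored off-diagonal
# cells flag predictor pairs that need attention
corr = df.corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.savefig("heatmap.png")
```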
For example, we see the following highly correlated features:
- bathrooms and sqft_living (0.76)
- sqft_above and sqft_living (0.88)
- sqft_living15 and sqft_living (0.76)
Let’s look at each feature’s relationship to the target variable, `price`, using simple linear regression to determine which to keep for our model.
A good general rule is to drop the feature that is less strongly correlated with the target variable. Even though these variables are significant at a 0.05 α threshold, we see that `bathrooms`, `sqft_above`, and `sqft_living15` each capture a relatively smaller proportion of the variation in the target variable, `price`.
Dropping `sqft_above` makes sense as it can represent the same space as `sqft_living`.
`bathrooms` could be turned into a dichotomous variable (yes/no), but since every property has at least one bathroom, that variable would be uninformative. Alternatively, if we combined it with `bedrooms` as a new feature, we would be sacrificing interpretability for accuracy, so we chose to drop it as well.
We decided to drop the `sqft_living15` feature as well, rather than create a combined feature with `sqft_lot15` for the same reasons of balancing interpretability and accuracy.
Conclusion
There are many visualizations to help preprocess your data for model fitting — starting with a boxplot, jointplot, and heatmap will give you a quick and efficient start in the process.