3 Visualizations You Should Use With Multivariate Linear Regression

Eric
Oct 1, 2020

When creating a linear regression model with multiple predictors, there are important considerations to factor into your analysis. I've outlined three visualizations below that each concisely convey a lot of useful information.

1. Boxplots: do an initial examination for outliers

Boxplots display a lot of information: the minimum, first quartile, median, third quartile, and maximum. They also show visually how tightly your data is grouped. You can use this information to decide whether to log-transform a variable to normalize its distribution, or whether the outliers themselves warrant further examination (data entry errors, `NaN` values, etc.).

Boxplot: look for data points outside of the “box and whiskers”
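As a minimal sketch, assuming the housing data from this example is loaded into a pandas DataFrame `df` (the column names below follow the features discussed later in this post):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumed: df is a pandas DataFrame holding the housing data.
# One boxplot per feature makes outliers easy to spot.
features = ["price", "sqft_living", "sqft_above", "sqft_living15"]
fig, axes = plt.subplots(1, len(features), figsize=(14, 4))
for ax, col in zip(axes, features):
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```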

2. Jointplots: simple linear regression and distribution

These visualizations provide you with a great initial look at the distribution of each feature as well as a simple regression line to see whether there is a linear relationship with the target variable.

Jointplot: distributions of variables and plot of simple regression line
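This is a one-liner with seaborn, using the same assumed `df`:

```python
import seaborn as sns

# Marginal histograms show each variable's distribution;
# kind="reg" overlays a simple linear regression fit.
sns.jointplot(data=df, x="sqft_living", y="price", kind="reg")
```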

3. Multicollinearity: address potentially correlated predictors

In linear regression, you want to understand the role of each independent variable, or predictor, while holding all other predictors constant (the premise of regression analysis). Multicollinearity violates this premise, so it needs to be addressed: coefficients and p-values may not be reliable for correlated predictors, and steps should be taken to remove or otherwise address the correlation.

Heatmap: higher numbers signal high correlation between predictors
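One way to produce such a heatmap with seaborn, again as a sketch against the assumed `df`:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between the numeric columns;
# annot=True prints each coefficient in its cell.
corr = df.select_dtypes("number").corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```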

For example, we see the following highly correlated features:
- bathrooms and sqft_living (0.76)
- sqft_above and sqft_living (0.88)
- sqft_living15 and sqft_living (0.76)

Let’s look at each feature’s relationship to the target variable, `price`, using simple linear regression to determine which to keep for our model.

Simple OLS Regression: higher R² means more of the variation in `price` is explained by that independent variable
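A sketch of this comparison with statsmodels, again assuming the housing data lives in the hypothetical DataFrame `df`:

```python
import statsmodels.api as sm

# Regress price on each candidate predictor separately and
# compare R-squared (and p-values) across the one-feature fits.
for col in ["sqft_living", "bathrooms", "sqft_above", "sqft_living15"]:
    X = sm.add_constant(df[[col]])  # add intercept term
    model = sm.OLS(df["price"], X).fit()
    print(f"{col}: R^2 = {model.rsquared:.3f}, p = {model.pvalues[col]:.3g}")
```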

A good general rule is to drop the feature that is less strongly correlated with the target variable. Even though these variables are significant at a 0.05 α threshold, `bathrooms`, `sqft_above`, and `sqft_living15` each explain a relatively smaller proportion of the variation in the target variable, `price`.

Dropping `sqft_above` makes sense as it can represent the same space as `sqft_living`.

`bathrooms` could be turned into a dichotomous variable (yes/no), but that would be uninformative since every property has at least one bathroom. Alternatively, combining it with `bedrooms` into a new feature would sacrifice interpretability for accuracy, so we chose to drop it as well.

We also dropped the `sqft_living15` feature rather than creating a combined feature with `sqft_lot15`, for the same reason of balancing interpretability and accuracy.
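Putting those decisions together, a minimal sketch against the assumed `df` (`df_model` is a hypothetical name for the reduced feature set):

```python
# Drop the correlated predictors identified above,
# keeping sqft_living as the strongest single predictor of price.
df_model = df.drop(columns=["sqft_above", "bathrooms", "sqft_living15"])
```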

Conclusion

There are many visualizations that can help you preprocess your data for model fitting. Starting with a boxplot, jointplot, and heatmap gives you a quick and efficient start to the process.
