Exploratory Data Analysis Walk-Through: Part 1

Business Understanding

A major company has decided to create a new movie studio, but needs more data to better understand the movie industry. I decided to utilize a financial metric, return on production costs (“ROI”), based on worldwide box office sales. This is a good indicator of the success of a movie, while taking into consideration budgeting concerns. With this metric, I analyzed the following questions:

  1. How does a movie’s return on production costs (ROI) relate to its popularity? (covered in this post)

These questions will provide the company with actionable data on what type of films the team should be creating — from the genre, to who should be cast in the main roles, to which movie studio should distribute the film, to which month to release it. In short, this data analysis will provide a comprehensive picture of the next possible steps to explore.


We will start by importing the libraries needed for exploratory data analysis.

To get a feel for the data, we will begin by examining the first five rows of each file.
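As a minimal sketch of these two steps, the block below imports pandas and previews a small table with `head()`. The name `df_example` and its contents are made-up placeholders, not one of the project's actual files.

```python
import pandas as pd

# Hypothetical stand-in for one of the project files (all values are made up).
df_example = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E", "Movie F"],
    "year": [2010, 2011, 2012, 2013, 2014, 2015],
})

# head() returns the first five rows by default.
preview = df_example.head()
print(preview)
```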

It looks like we have a lot of data across multiple files to examine! Let’s start off with the first business question we want to answer.

1. How does a movie’s return on production costs (ROI) relate to its popularity?

This question is primarily designed to set up the DataFrames needed for the subsequent questions, but it is also worthwhile to examine whether there is any relationship between a movie’s return on production costs and how popular the movie was among fans and critics.

Import the relevant files and assign each to a variable:
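A hedged sketch of this step with `pd.read_csv` might look like the following. The variable names `df_bom` and `df_tn` come from this post, but the file contents, column names, and values below are assumptions for illustration. In-memory text stands in for the real files so the block runs on its own.

```python
import io
import pandas as pd

# In-memory text stands in for the real CSV files; every column name and
# value here is an assumption, not the actual schema.
bom_csv = io.StringIO(
    "title,studio,year\n"
    "Movie A,Studio X,2010\n"
    "Movie B,Studio Y,2011\n"
)
tn_csv = io.StringIO(
    "movie,production_budget,worldwide_gross\n"
    'Movie A,"$1,000,000","$3,000,000"\n'
    'Movie B,"$2,000,000","$1,000,000"\n'
)

# With real files you would pass a path string instead of a StringIO object.
df_bom = pd.read_csv(bom_csv)
df_tn = pd.read_csv(tn_csv)
```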

Outline of Section:

I. Create ROI DataFrame
II. Create Popularity DataFrame
III. ROI vs. Popularity
IV. Data Visualization
V. Interpretation

I. ROI DataFrame

We start by calculating ROI for each movie and storing the result in a DataFrame. This involves joining the df_bom and df_tn DataFrames. Let’s look at the column names (and refer back to the table previews above) to see how we can join the two files.
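One way to inspect the column names is sketched below. The toy frames and their columns (`title`, `movie`, and so on) are assumptions chosen for illustration. The real schemas may differ.

```python
import pandas as pd

# Toy frames; the column names are assumptions, not the real schemas.
df_bom = pd.DataFrame({"title": ["Movie A"], "studio": ["Studio X"], "year": [2010]})
df_tn = pd.DataFrame({"movie": ["Movie A"],
                      "production_budget": ["$1,000,000"],
                      "worldwide_gross": ["$3,000,000"]})

# List each frame's columns to spot candidate join keys.
bom_columns = df_bom.columns.tolist()
tn_columns = df_tn.columns.tolist()
print(bom_columns)
print(tn_columns)
```

In this toy setup the movie name lives under a different column name in each frame, so the join has to name both columns explicitly.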

We’ll need to join the two DataFrames and clean up the data:
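A minimal sketch of a join-and-clean step follows. It assumes the join key is the movie title and that the money columns arrive as strings such as "$1,000,000". Both are assumptions about the data, and the inner join is one possible choice rather than a confirmed detail of this analysis.

```python
import pandas as pd

# Toy data; column names and values are assumptions.
df_bom = pd.DataFrame({"title": ["Movie A", "Movie B", "Movie C"],
                       "studio": ["Studio X", "Studio Y", "Studio Z"]})
df_tn = pd.DataFrame({"movie": ["Movie A", "Movie B", "Movie D"],
                      "production_budget": ["$1,000,000", "$2,000,000", "$500,000"],
                      "worldwide_gross": ["$3,000,000", "$1,000,000", "$0"]})

# Inner join keeps only titles present in both frames.
df_budget_roi = df_bom.merge(df_tn, left_on="title", right_on="movie", how="inner")
df_budget_roi = df_budget_roi.drop(columns="movie")

# Strip "$" and "," so the money columns become integers.
for col in ["production_budget", "worldwide_gross"]:
    df_budget_roi[col] = (
        df_budget_roi[col].str.replace(r"[$,]", "", regex=True).astype("int64")
    )
```

Matching on title alone can pair up different films that share a name, so a stricter key (for example title plus year) may be worth considering when the data allows it.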

Next, we’ll add a column to df_budget_roi for ROI calculation.
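The exact formula is not spelled out here, so the sketch below assumes a common definition: ROI = (worldwide gross - production budget) / production budget. The column names, including `ROI` itself, are placeholders.

```python
import pandas as pd

# Toy data with already-numeric money columns (values are made up).
df_budget_roi = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "production_budget": [1_000_000, 2_000_000],
    "worldwide_gross": [3_000_000, 1_000_000],
})

# Assumed definition: (worldwide gross - production budget) / production budget.
df_budget_roi["ROI"] = (
    (df_budget_roi["worldwide_gross"] - df_budget_roi["production_budget"])
    / df_budget_roi["production_budget"]
)
print(df_budget_roi)
```

Under this definition, a value of 2.0 means the movie earned back its budget plus twice that amount, and a negative value means it grossed less than it cost.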

Now that we’ve calculated the return on production costs for the movies, let’s look at creating a popularity DataFrame.

II. Popularity DataFrame

As above, I examined the column names and determined that we’ll join these two tables on the tconst column.
Note: we will use averagerating as a proxy for popularity.

We will join the tables and clean the data:
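A rough sketch of this join is shown below. The frame names `df_titles` and `df_ratings`, the `primary_title` column, and the spelling `tconst` for the key column are assumptions. Only `averagerating` and `df_popularity` are taken from this post.

```python
import pandas as pd

# Toy frames; names df_titles/df_ratings and all values are hypothetical.
df_titles = pd.DataFrame({"tconst": ["tt001", "tt002", "tt003"],
                          "primary_title": ["Movie A", "Movie B", "Movie C"]})
df_ratings = pd.DataFrame({"tconst": ["tt001", "tt002", "tt004"],
                           "averagerating": [7.5, None, 6.0]})

# Join on the shared identifier, then drop rows with no rating.
df_popularity = df_titles.merge(df_ratings, on="tconst", how="inner")
df_popularity = df_popularity.dropna(subset=["averagerating"]).reset_index(drop=True)
print(df_popularity)
```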

Great! We now have two DataFrames: one with an ROI calculation (df_budget_roi) and one with average ratings (df_popularity).

III. ROI vs. Popularity

Let’s return to the original question of examining how a movie’s return on production costs is related to the popularity of a movie.
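To compare the two, the ROI and popularity frames need to be combined into one. The sketch below assumes they are matched on movie title. The key column names (`title`, `primary_title`) are placeholders, and the values are made up.

```python
import pandas as pd

# Toy versions of the two frames built earlier (values are made up).
df_budget_roi = pd.DataFrame({"title": ["Movie A", "Movie B"], "ROI": [2.0, -0.5]})
df_popularity = pd.DataFrame({"primary_title": ["Movie A", "Movie C"],
                              "averagerating": [7.5, 6.0]})

# Match on title; the key column names are assumptions.
df_roi_pop = df_budget_roi.merge(
    df_popularity, left_on="title", right_on="primary_title", how="inner"
).drop(columns="primary_title")
print(df_roi_pop)
```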

Let’s take a look to see if there are any duplicate titles in this DataFrame we just created, df_roi_pop.
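A duplicate check along these lines can be sketched with `DataFrame.duplicated`. The toy `df_roi_pop` below, including its `title` column, is an assumption for illustration.

```python
import pandas as pd

# Toy frame with one repeated title (values are made up).
df_roi_pop = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie A"],
    "ROI": [2.0, -0.5, 2.0],
    "averagerating": [7.5, 6.0, 5.1],
})

# keep=False marks every row whose title appears more than once.
duplicate_rows = df_roi_pop[df_roi_pop.duplicated(subset="title", keep=False)]
n_duplicate_titles = duplicate_rows["title"].nunique()
print(duplicate_rows)
```

Repeated titles are not always errors (remakes can share a name), so it is usually worth looking at the flagged rows before dropping anything.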

IV. Data Visualization
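One possible way to draw these comparisons is a row of scatter plots of average rating against production budget, worldwide gross, and ROI. The sketch below uses matplotlib with made-up data and assumed column names. It is an illustration, not necessarily the plotting approach used for the original figures.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; column names and values are assumptions.
df_roi_pop = pd.DataFrame({
    "production_budget": [1e6, 2e6, 5e6, 1e7],
    "worldwide_gross": [3e6, 1e6, 2e7, 1.5e7],
    "ROI": [2.0, -0.5, 3.0, 0.5],
    "averagerating": [7.5, 6.0, 6.8, 5.9],
})

# One scatter plot per financial measure, sharing the rating axis.
x_columns = ["production_budget", "worldwide_gross", "ROI"]
fig, axes = plt.subplots(1, len(x_columns), figsize=(15, 4), sharey=True)
for ax, col in zip(axes, x_columns):
    ax.scatter(df_roi_pop[col], df_roi_pop["averagerating"], alpha=0.6)
    ax.set_xlabel(col)
    ax.set_title(f"averagerating vs. {col}")
axes[0].set_ylabel("averagerating")
fig.tight_layout()
```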

V. Interpretation

Examining the visualizations, we don’t see a strong pattern in any of the graphs: average rating shows no clear positive relationship with production budget, worldwide gross, or ROI.

In other words, neither the amount of money spent on a movie nor the amount of money it makes has a clear positive relationship with the movie’s popularity or how well it is received.
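A visual impression like this can also be checked numerically. As a hedged sketch, the block below computes the Pearson correlation of each financial column with average rating using `DataFrame.corrwith`. This is an added illustration on made-up data with assumed column names, so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy data; column names and values are assumptions.
df_roi_pop = pd.DataFrame({
    "production_budget": [1e6, 2e6, 5e6, 1e7],
    "worldwide_gross": [3e6, 1e6, 2e7, 1.5e7],
    "ROI": [2.0, -0.5, 3.0, 0.5],
    "averagerating": [7.5, 6.0, 6.8, 5.9],
})

# Pearson correlation of each financial measure with average rating.
financial_columns = ["production_budget", "worldwide_gross", "ROI"]
correlations = df_roi_pop[financial_columns].corrwith(df_roi_pop["averagerating"])
print(correlations)
```

Values near 0 would be consistent with the lack of a clear linear pattern in the plots, while values near 1 or -1 would suggest a strong linear relationship.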

A note about outliers

We see in the visualizations above a couple of outliers, as noted below.
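One common way to flag points like these programmatically is the 1.5 × IQR rule. The sketch below is a swapped-in illustration on made-up data and is not necessarily how the outliers in this analysis were identified.

```python
import pandas as pd

# Toy data with one extreme ROI value (values are made up).
df_roi_pop = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "ROI": [0.5, 1.0, 1.5, 2.0, 40.0],
})

# Flag rows falling more than 1.5 * IQR outside the middle 50% of ROI values.
q1 = df_roi_pop["ROI"].quantile(0.25)
q3 = df_roi_pop["ROI"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df_roi_pop["ROI"] < q1 - 1.5 * iqr) | (df_roi_pop["ROI"] > q3 + 1.5 * iqr)
outliers = df_roi_pop[is_outlier]
print(outliers)
```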

In the next blog post, we go more in-depth with the data exploration and examine other parameters — take a look!
