Business Understanding
A major company has decided to create a new movie studio, but needs more data to better understand the movie industry. I decided to utilize a financial metric, return on production costs (“ROI”), based on worldwide box office sales. This is a good indicator of the success of a movie, while taking into consideration budgeting concerns. With this metric, I analyzed the following questions:
- How does a movie’s return on production costs (ROI) relate to its popularity? (covered in this post)
- Which movie studio has the highest ROI on production costs? (covered in Part 2)
- Which actor/actress has the highest ROI on production costs? (covered in Part 2)
- Which genre has the highest ROI on production costs and box office sales? (covered in Part 3)
- Which month experienced the most box office success in terms of gross sales? (covered in Part 3)
These questions will provide the company with actionable data on what type of films the team should be creating — from the genre, to who should be cast in the main roles, to which movie studio should distribute the film, to which month to release it. In short, this data analysis will provide a comprehensive picture of the next possible steps to explore.
Libraries
We will start by importing the libraries needed to perform exploratory data analysis.
To get a better feel for the data, we will begin by examining the first 5 rows of each file.
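As a minimal sketch of these first two steps, the snippet below imports pandas and previews a small made-up table. The inline CSV text, its column names, and the name `df_example` are placeholders for illustration, not the project's actual files.

```python
import io

import pandas as pd

# Made-up CSV content standing in for one of the project's data files.
csv_text = """title,studio,year
Movie A,Studio X,2010
Movie B,Studio Y,2011
Movie C,Studio X,2012
Movie D,Studio Z,2013
Movie E,Studio Y,2014
Movie F,Studio Z,2015
"""

df_example = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default.
preview = df_example.head()
print(preview)
```

With real files, the same `head()` call would be repeated for each DataFrame that gets loaded.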
It looks like we have a lot of data across multiple files to examine! Let’s start off with the first business question we want to answer.
1. How does a movie’s return on production costs (ROI) relate to its popularity?
This question is primarily designed to set up the DataFrames needed for the subsequent questions, but it is also worthwhile to examine whether a movie’s return on production costs is related to how popular the movie was among fans and critics.
Import the relevant files and assign each one to a variable:
Outline of Section:
I. Create ROI DataFrame
II. Create Popularity DataFrame
III. ROI vs. Popularity
IV. Data Visualization
V. Interpretation
I. ROI DataFrame
We want to start off by obtaining the ROI calculation for the movies and creating a DataFrame to visualize this. This will involve joining the df_bom and df_tn DataFrames. Let’s look at the column names (and reference the above table previews) to see how we can join the two files.
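A quick way to compare column names is sketched below. The two tiny DataFrames reuse the names df_bom and df_tn, but their columns and values are made up for illustration and may not match the real files.

```python
import pandas as pd

# Toy stand-ins; the column names here are assumptions, not the real schema.
df_bom = pd.DataFrame({"title": ["Movie A"], "studio": ["Studio X"]})
df_tn = pd.DataFrame({
    "movie": ["Movie A"],
    "production_budget": ["$1,000,000"],
    "worldwide_gross": ["$3,000,000"],
})

print(df_bom.columns.tolist())
print(df_tn.columns.tolist())

# Columns that appear in both tables are natural join-key candidates.
shared_columns = sorted(set(df_bom.columns) & set(df_tn.columns))
print(shared_columns)
```

In this toy case no column name is shared, which is a hint that the join key has to be named separately for each table (for example, with `left_on` and `right_on`).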
We’ll need to join the two DataFrames and clean up the data:
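Here is a hedged sketch of what that join and cleanup could look like. The column names (`title`, `movie`, `production_budget`, `worldwide_gross`), the dollar-string format, and the inner join are all assumptions for illustration.

```python
import pandas as pd

# Toy stand-ins for the two tables; columns and values are assumptions.
df_bom = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "studio": ["Studio X", "Studio Y", "Studio X"],
})
df_tn = pd.DataFrame({
    "movie": ["Movie A", "Movie B", "Movie D"],
    "production_budget": ["$1,000,000", "$2,500,000", "$500,000"],
    "worldwide_gross": ["$3,000,000", "$2,000,000", "$900,000"],
})

# Keep only movies that appear in both tables, matching on the title text.
df_budget_roi = df_bom.merge(df_tn, left_on="title", right_on="movie", how="inner")
df_budget_roi = df_budget_roi.drop(columns="movie")  # redundant after the join

# Strip "$" and "," so the money columns can be treated as integers.
for col in ["production_budget", "worldwide_gross"]:
    df_budget_roi[col] = (
        df_budget_roi[col].str.replace(r"[$,]", "", regex=True).astype("int64")
    )

print(df_budget_roi)
```

An inner join silently drops movies that are missing from either table, so it is worth comparing row counts before and after the merge.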
Next, we’ll add a column to df_budget_roi for the ROI calculation.
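As a sketch, return on production costs can be computed as (worldwide gross minus production budget) divided by production budget. That exact formula, the column name `roi`, and the toy numbers below are assumptions for illustration.

```python
import pandas as pd

# Toy data with already-cleaned numeric columns (values are made up).
df_budget_roi = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "production_budget": [1_000_000, 2_500_000],
    "worldwide_gross": [3_000_000, 2_000_000],
})

# ROI as a ratio: profit over production budget (assumed definition).
df_budget_roi["roi"] = (
    df_budget_roi["worldwide_gross"] - df_budget_roi["production_budget"]
) / df_budget_roi["production_budget"]

print(df_budget_roi)
```

Here a value of 2.0 means the movie earned back its budget plus twice that amount again, while a negative value means it did not recoup its production costs.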
Now that we’ve calculated the return on production costs for the movies, let’s look at creating a popularity DataFrame.
II. Popularity DataFrame
Much like above, I examined the column names and determined that we’ll join these two tables on the tconst column.
Note: we will use averagerating as a proxy for determining popularity.
We will join the tables and clean the data:
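A rough sketch of that join follows. The input names `df_basics` and `df_ratings`, the `primary_title` and `numvotes` columns, and all values are hypothetical; the shared identifier column is spelled `tconst` in this sketch.

```python
import pandas as pd

# Toy stand-ins for a title table and a ratings table (all values made up).
df_basics = pd.DataFrame({
    "tconst": ["tt001", "tt002", "tt003"],
    "primary_title": ["Movie A", "Movie B", "Movie C"],
})
df_ratings = pd.DataFrame({
    "tconst": ["tt001", "tt003", "tt004"],
    "averagerating": [7.1, 5.4, 8.0],
    "numvotes": [1200, 300, 50],
})

# Join on the shared identifier, then keep only the columns we need.
df_popularity = df_basics.merge(df_ratings, on="tconst", how="inner")
df_popularity = df_popularity[["tconst", "primary_title", "averagerating"]]

# Drop any rows that ended up without a rating.
df_popularity = df_popularity.dropna(subset=["averagerating"])

print(df_popularity)
```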
Great! We now have two DataFrames: one with an ROI calculation (df_budget_roi) and one with average ratings (df_popularity).
III. ROI vs. Popularity
Let’s return to the original question of examining how a movie’s return on production costs is related to the popularity of a movie.
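One way to bring the two DataFrames together is to merge them on the movie title, as in the sketch below. The title column names, the `roi` column, and the toy values are assumptions; matching on title text alone can also pair up different movies that share a name.

```python
import pandas as pd

# Toy versions of the two DataFrames built earlier (values are made up).
df_budget_roi = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "production_budget": [1_000_000, 2_500_000, 400_000],
    "worldwide_gross": [3_000_000, 2_000_000, 1_000_000],
    "roi": [2.0, -0.2, 1.5],
})
df_popularity = pd.DataFrame({
    "primary_title": ["Movie A", "Movie C", "Movie Z"],
    "averagerating": [7.1, 5.4, 8.0],
})

# Keep only movies that have both an ROI and a rating.
df_roi_pop = df_budget_roi.merge(
    df_popularity, left_on="title", right_on="primary_title", how="inner"
)
df_roi_pop = df_roi_pop[
    ["title", "production_budget", "worldwide_gross", "roi", "averagerating"]
]

print(df_roi_pop)
```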
Let’s check whether there are any duplicate titles in the DataFrame we just created, df_roi_pop.
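A duplicate check along these lines can be sketched with `duplicated()`. The toy rows and the `title` column name are assumptions for illustration.

```python
import pandas as pd

# Toy data with one repeated title (values are made up).
df_roi_pop = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie A"],
    "roi": [2.0, -0.2, 0.5],
    "averagerating": [7.1, 6.3, 5.0],
})

# keep=False marks every row whose title appears more than once.
duplicate_rows = df_roi_pop[df_roi_pop["title"].duplicated(keep=False)]

# The default keep="first" counts only the repeats after the first occurrence.
n_repeats = int(df_roi_pop["title"].duplicated().sum())

print(duplicate_rows)
print(n_repeats)
```

Whether to drop duplicates depends on the cause: a repeated title may be a genuine remake or a join artifact, so it helps to inspect the rows before calling `drop_duplicates()`.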
IV. Visualize the Data
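As a rough sketch, the three comparisons discussed in the interpretation (average rating against production budget, worldwide gross, and ROI) could be drawn as side-by-side scatter plots. The data, column names, and styling choices below are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data standing in for df_roi_pop (values are made up).
df_roi_pop = pd.DataFrame({
    "production_budget": [1_000_000, 2_500_000, 400_000, 8_000_000],
    "worldwide_gross": [3_000_000, 2_000_000, 1_000_000, 20_000_000],
    "roi": [2.0, -0.2, 1.5, 1.5],
    "averagerating": [7.1, 6.3, 5.4, 6.8],
})

x_columns = ["production_budget", "worldwide_gross", "roi"]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# One scatter plot per financial column, always against average rating.
for ax, col in zip(axes, x_columns):
    ax.scatter(df_roi_pop[col], df_roi_pop["averagerating"], alpha=0.6)
    ax.set_xlabel(col)
    ax.set_ylabel("averagerating")
    ax.set_title(f"averagerating vs. {col}")

fig.tight_layout()
```

In a notebook, the figure renders automatically; in a script, `plt.show()` or `fig.savefig(...)` would be needed to see it.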

V. Interpretation
Examining the visualizations, we don’t see a strong pattern in any of the graphs, indicating that average rating has no clear relationship with production budget, worldwide gross, or ROI.
In other words, neither the amount of money spent on a movie nor the amount of money the movie makes appears to be related to the movie’s popularity or how well it is received.
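A visual read like this can be backed up with correlation coefficients. The sketch below is one possible check, not part of the original analysis, and its toy values say nothing about the real data.

```python
import pandas as pd

# Toy data standing in for df_roi_pop (values are made up).
df_roi_pop = pd.DataFrame({
    "production_budget": [1_000_000, 2_500_000, 400_000, 8_000_000, 3_000_000],
    "worldwide_gross": [3_000_000, 2_000_000, 1_000_000, 20_000_000, 2_500_000],
    "roi": [2.0, -0.2, 1.5, 1.5, -0.17],
    "averagerating": [7.1, 6.3, 5.4, 6.8, 7.5],
})

# Pearson correlation of each financial column with average rating.
correlations = (
    df_roi_pop[["production_budget", "worldwide_gross", "roi", "averagerating"]]
    .corr()["averagerating"]
    .drop("averagerating")
)

print(correlations)
```

Values near 0 would support the "no clear relationship" reading, while values near 1 or -1 would contradict it.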
A note about outliers
We see in the visualizations above a couple of outliers, as noted below.
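One simple way to pull out the most extreme movies is to sort by ROI and look at the top rows, as sketched below. This is an assumption about how outliers might be surfaced, and the data is made up.

```python
import pandas as pd

# Toy data with one very large ROI value (values are made up).
df_roi_pop = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "roi": [2.0, -0.2, 45.0, 1.1, 0.4],
    "averagerating": [7.1, 6.3, 5.9, 6.8, 7.5],
})

# The rows with the largest ROI are the first candidates to inspect.
top_roi = df_roi_pop.nlargest(2, "roi")

print(top_roi)
```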
In the next blog post, we go more in-depth with the data exploration and examine other parameters — take a look!