Exploratory Data Analysis Walk-Through: Part 2

Aug 24, 2020


We started with the initial premise that a big company is looking to start a new movie studio, and we have been tasked with providing data analysis to help decide the next steps. (View Part 1 here.)

In this post, we will continue the analysis and answer the next two questions:

  1. Which movie studio has the highest ROI on production costs?
  2. Which actor/actress has the highest ROI on production costs?

To answer these questions, we can examine the same DataFrame that was introduced in the “ROI vs. Popularity” section, df_roi_pop.

1. Which movie studio has the highest ROI on production costs?

I. Group data by studio
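
As a minimal sketch of this step, the grouping could look like the snippet below. The toy rows and the column names `studio` and `roi` are assumptions made for illustration; the real df_roi_pop from Part 1 may use different names.

```python
import pandas as pd

# Toy stand-in for df_roi_pop. Column names and values are hypothetical.
df_roi_pop = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "studio": ["Studio X", "Studio X", "Studio Y", "Studio Z", "Studio Z"],
    "roi": [4.0, 6.0, 2.0, 1.0, 2.0],
})

# Average ROI per studio, highest first.
studio_roi = (
    df_roi_pop.groupby("studio")["roi"]
    .mean()
    .sort_values(ascending=False)
)
print(studio_roi.head(10))
```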

II. Visualize the data
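
One way to plot the grouped result is a horizontal bar chart. The sketch below assumes matplotlib and uses made-up values in a hypothetical `studio_roi` Series; it is not the chart from the original analysis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical average ROI per studio (made-up values).
studio_roi = pd.Series(
    {"Studio X": 5.0, "Studio Y": 2.0, "Studio Z": 1.5}, name="roi"
)

fig, ax = plt.subplots(figsize=(8, 4))
# Sort ascending so the largest bar ends up at the top of the chart.
studio_roi.sort_values().plot.barh(ax=ax)
ax.set_xlabel("Average ROI (multiple of production budget)")
ax.set_ylabel("Studio")
ax.set_title("Average ROI by studio")
fig.tight_layout()
# In a notebook the figure renders automatically; call plt.show() in a script.
```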

III. Interpretation

We can see that WB (NL), which is New Line Cinema, a label of the Warner Bros. Pictures Group, has the highest average return on production costs. Second is UTV, which is UTV Motion Pictures, an Indian motion picture company that is a subsidiary of UTV Software Communications, which is in turn owned by The Walt Disney Company India. Based on this data, these two studios have historically earned the highest average returns on the movies they produced and distributed.

Breaking the results down by title, WB (NL) does extremely well with movies in the horror genre, while UTV owes its high ROI to a single film, "Dangal". Because one hit can dominate a small sample, further analysis can set a threshold on the minimum number of movies a studio needs to be considered for this calculation.
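
A possible version of that threshold is sketched here. The cutoff `MIN_MOVIES`, the column names, and the data are all hypothetical; the right cutoff depends on how many films each studio has in the real data.

```python
import pandas as pd

# Toy data: Studio Y has a single, very high-ROI film (hypothetical values).
df_roi_pop = pd.DataFrame({
    "studio": ["Studio X"] * 3 + ["Studio Y"] + ["Studio Z"] * 2,
    "roi": [4.0, 5.0, 6.0, 20.0, 1.0, 3.0],
})

MIN_MOVIES = 2  # assumed cutoff, chosen only for illustration

# Compute the average ROI and the number of films per studio together.
studio_stats = df_roi_pop.groupby("studio")["roi"].agg(
    avg_roi="mean", n_movies="count"
)

# Keep only studios with enough films, then rank by average ROI.
studio_filtered = (
    studio_stats[studio_stats["n_movies"] >= MIN_MOVIES]
    .sort_values("avg_roi", ascending=False)
)
print(studio_filtered)
```

With the cutoff applied, the one-film studio drops out of the ranking even though its raw average is the highest.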

2. Which actor/actress has the highest ROI on production costs?

I. Joining DataFrames

We will join a couple more DataFrames to obtain the final DataFrame that allows us to answer the question.
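
The post does not name the tables being joined, so the sketch below assumes two hypothetical ones: `df_principals` (who worked on which movie, and in what role) and `df_names` (person IDs mapped to names). The key columns `movie_id` and `person_id` are also assumptions.

```python
import pandas as pd

# Hypothetical inputs. Table names, columns, and values are illustrative only.
df_roi_pop = pd.DataFrame({
    "movie_id": ["m1", "m2", "m3"],
    "roi": [5.0, 2.0, 3.0],
})
df_principals = pd.DataFrame({
    "movie_id": ["m1", "m1", "m2", "m3", "m3"],
    "person_id": ["p1", "p2", "p1", "p2", "p3"],
    "category": ["actor", "actress", "actor", "actress", "director"],
})
df_names = pd.DataFrame({
    "person_id": ["p1", "p2", "p3"],
    "name": ["Person One", "Person Two", "Person Three"],
})

# Keep only acting credits before joining.
cast = df_principals[df_principals["category"].isin(["actor", "actress"])]

# One row per (movie, actor/actress) pair, carrying the movie's ROI.
df_actors = (
    df_roi_pop.merge(cast, on="movie_id", how="inner")
    .merge(df_names, on="person_id", how="inner")
)
print(df_actors[["movie_id", "name", "roi"]])
```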

II. Actor/Actress Optimization

We also want to remove any actors/actresses who appeared in only a couple of films, since so few appearances are not strong evidence that they drive high-grossing films. Actors/actresses who have starred in multiple films with large box office totals, however, can be a possible indicator of future film success.
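
A rough sketch of this filter follows. The cutoff `MIN_FILMS` and the column names are assumptions; the original threshold is not stated in the post.

```python
import pandas as pd

# Hypothetical joined table: one row per (actor/actress, movie) pair.
df_actors = pd.DataFrame({
    "name": ["Person One", "Person One", "Person One",
             "Person Two", "Person Three", "Person Three"],
    "movie_id": ["m1", "m2", "m3", "m1", "m2", "m3"],
    "roi": [5.0, 2.0, 3.0, 5.0, 2.0, 3.0],
})

MIN_FILMS = 3  # assumed cutoff for "more than a couple of films"

# Number of distinct films per person, aligned to every row of df_actors.
film_counts = df_actors.groupby("name")["movie_id"].transform("nunique")

# Keep only rows for people who meet the cutoff.
df_frequent = df_actors[film_counts >= MIN_FILMS]
print(df_frequent)
```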

III. Calculating the average ROI of movies for these actors/actresses
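
Assuming a filtered table like the hypothetical `df_frequent` below, with made-up values, the average could be computed and cut to the top 10 in one chain:

```python
import pandas as pd

# Hypothetical filtered table of frequent actors/actresses.
df_frequent = pd.DataFrame({
    "name": ["Person One"] * 3 + ["Person Two"] * 3,
    "roi": [6.0, 5.0, 4.0, 2.0, 3.0, 1.0],
})

# Mean ROI across each person's films, keeping the 10 highest averages.
actor_roi = df_frequent.groupby("name")["roi"].mean().nlargest(10)
print(actor_roi)
```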

IV. Visualize the data

V. Interpretation

Looking at the data, we see that James McAvoy has the highest average return on production costs: the movies he appears in average a 5.74 multiple on the production budget. Close behind is Jennifer Lawrence, whose films average a 5.56 multiple on the production budget.

Extra: Calculating the total box office sales for actors/actresses

Let’s examine the top 10 actors/actresses whose films had the highest total box office sales.
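
Here the aggregation is a sum rather than a mean. The sketch assumes a hypothetical `worldwide_gross` column in dollars; both the column name and the figures are invented for illustration.

```python
import pandas as pd

# Hypothetical per-appearance box office figures, in dollars.
df_actors = pd.DataFrame({
    "name": ["Person One", "Person One", "Person Two", "Person Three"],
    "worldwide_gross": [300_000_000, 200_000_000, 450_000_000, 100_000_000],
})

# Total box office across each person's films, keeping the 10 largest totals.
total_gross = (
    df_actors.groupby("name")["worldwide_gross"].sum().nlargest(10)
)
print(total_gross)
```

A sum rewards people who appear in many films, so this list can differ noticeably from the average-ROI ranking.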

Visualize the data

Note: Comparing the two lists of actors/actresses for overlap, casting people like Jennifer Lawrence, Kristen Wiig, and Mark Ruffalo might be a good opportunity to secure both a higher multiple on production costs and higher box office sales.

We’ll continue our analysis in Part 3 by exploring genres and the most popular months for movie releases!




Written by Eric

How do you put out a fire in your office wastebasket? First, set fire to more wastebaskets to get a larger sample size. Setting wastebasket fires since 2020.
