Movies constitute a vast global industry and a significant part of the international entertainment sector. Given the industry’s scale, one might ask: how can you create a successful movie? By analyzing data from the CMU Movie Summary Corpus and The Movie Database (TMDB), we utilize various tools such as linear regression and natural language processing to explore how different factors influence a movie’s box office and rating success. Follow along to discover some interesting dynamics which influence the movie world.

Dataset description

We use two data sources: the CMU Movie Summary Corpus and TMDB. Before analysing the data, it is important to understand what each column represents and why it matters. The table below gives an overview of the columns.

We only list the most significant columns and those that need some explanation; the full list is given further below under “All Dataset Columns”.

Column Name	Data Type	Description
Movie runtime	Float	Duration of the movie in minutes
Movie languages	String	Languages the movie has been released in
Movie countries	String	Countries where the movie was produced
TMDB_original_language	String	Original language of the movie as per TMDb
TMDB_vote_average	Float	Average rating for the movie from TMDb
TMDB_vote_count	Float	Number of votes the movie received on TMDb
Movie box office revenue	Float	Box office revenue of the movie

All Dataset Columns

Column Name	Data Type	Description
Wikipedia Movie ID	Integer	Wikipedia’s unique identifier for the movie
Freebase Movie ID	String	Freebase’s unique identifier for the movie
Movie release date	String	The release date of the movie
Movie runtime	Float	Duration of the movie in minutes
Movie languages	String	Languages the movie has been released in
Movie countries	String	Countries where the movie was produced
Movie genres	String	Genres associated with the movie
TMDB_id	Float	The Movie Database (TMDb) unique identifier
TMDB_original_language	String	Original language of the movie as per TMDb
TMDB_original_title	String	Original title of the movie as per TMDb
TMDB_overview	String	Overview or summary of the movie as per TMDb
TMDB_popularity	Float	Popularity score of the movie as per TMDb
TMDB_release_date	String	Release date of the movie as per TMDb
TMDB_title	String	Title of the movie as per TMDb
TMDB_vote_average	Float	Average rating for the movie from TMDb
TMDB_vote_count	Float	Number of votes the movie received on TMDb
TMDB_runtime	Float	Runtime of the movie as per TMDb
TMDB_budget	Float	Budget of the movie as per TMDb
TMDB_IMDB_id	String	IMDb’s unique identifier for the movie
TMDB_genres	String	Genres of the movie according to TMDb
Movie box office revenue	Float	Box office revenue of the movie
Movie release year	Float	Year of the movie’s release
log Movie box office revenue	Float	Logarithm of the movie’s box office revenue
log TMDB_vote_count	Float	Logarithm of the number of TMDb votes
Male_actor_percentage	Float	Percentage of male actors in the movie
Mean_actor_age_at_movie_release	Float	Average age of actors at the time of movie release
balanced Movie box office revenue	Float	Balanced box office revenue of the movie
log balanced Movie box office revenue	Float	Logarithm of balanced box office revenue

Taking a first look at the data set, we see that only around 10% of the movies have a box office revenue entry. Therefore, we enrich the data set with data from TMDB, which gives us new attributes such as revenue, original movie language and movie rating that might be interesting for our analysis. After enriching the data, 12% of the movies have a box office revenue entry.

Data cleaning

Before diving into the data, we need to clean it. Some movies have a runtime of 10 hours, and some actors are listed with a height of 3 meters. All such unrealistic values are removed before we continue. After removing outliers, the distributions are as shown below.
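The filtering step can be sketched as follows. This is only a minimal illustration: the cut-off values and the `Actor height` column name are assumptions made for the example, not the thresholds used in the analysis.

```python
import pandas as pd

# Toy data; "Movie runtime" follows the column table, "Actor height" is an assumed name.
movies = pd.DataFrame({"Movie runtime": [95.0, 120.0, 600.0, 88.0]})    # minutes
characters = pd.DataFrame({"Actor height": [1.75, 3.00, 1.62]})         # metres

# Hypothetical plausibility bounds, chosen only for this example.
MAX_RUNTIME_MIN = 300
MAX_HEIGHT_M = 2.5

# Keep only rows whose values fall inside the plausible range (bounds inclusive).
movies_clean = movies[movies["Movie runtime"].between(1, MAX_RUNTIME_MIN)]
characters_clean = characters[characters["Actor height"].between(0.5, MAX_HEIGHT_M)]

print(len(movies_clean), "movies and", len(characters_clean), "characters kept")
```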

Character meta dataset before cleaning

The figures below show the distribution of the numerical variables in the two datasets. Some of the variables look very heavy-tailed and might need to be transformed. The two attributes Movie box office revenue and TMDB_vote_count (number of votes) have very skewed distributions. Therefore, we log-transform these variables to bring them closer to a normal distribution, as shown on the right.
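A log transformation of this kind might look like the sketch below. The base of the logarithm and the handling of zero or missing values are assumptions here (base 10, with non-positive values turned into missing values), since the text does not specify them.

```python
import numpy as np
import pandas as pd

# Toy data using the column names from the table above.
df = pd.DataFrame({
    "Movie box office revenue": [1.2e4, 3.5e6, 8.0e8, np.nan],
    "TMDB_vote_count": [0.0, 12.0, 35000.0, 150.0],
})

for col in ["Movie box office revenue", "TMDB_vote_count"]:
    # The logarithm is undefined for zero, so non-positive entries become NaN (assumption).
    positive = df[col].where(df[col] > 0)
    df["log " + col] = np.log10(positive)

print(df[["log Movie box office revenue", "log TMDB_vote_count"]])
```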

Movie box office revenue and TMDB_vote_count before and after log transformation

But what about inflation?

However, the value of the US dollar is not the same in 2023 as it was over two hundred years ago. Therefore, we have to take inflation into account. The plot below shows the increase of revenue and value of the US dollar throughout the years. (“Inflation rate in the US from 1800 to 2023” OfficialData)

Inflation of the US dollar's effect on revenue

The log Movie box office revenue before and after balancing can be seen in the plot below.

Inflation of the US dollar's effect on revenue

After balancing the revenue attribute, we can now continue our analysis.
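One common way to do such a balancing is to scale each revenue by the ratio of a price index in a base year to the index in the release year. The sketch below assumes this approach; the index values are made up for illustration and are not real inflation figures.

```python
import pandas as pd

# Hypothetical price-index values (NOT real inflation data), for illustration only.
price_index = {1950: 10.0, 1980: 35.0, 2023: 100.0}
BASE_YEAR = 2023

movies = pd.DataFrame({
    "Movie release year": [1950, 1980, 2023],
    "Movie box office revenue": [1.0e6, 1.0e6, 1.0e6],
})

# Express every revenue in base-year dollars.
factor = price_index[BASE_YEAR] / movies["Movie release year"].map(price_index)
movies["balanced Movie box office revenue"] = movies["Movie box office revenue"] * factor

print(movies)
```

With these toy numbers, one million dollars earned in the earliest year counts for more base-year dollars than one million earned in the base year itself.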

Taking a peek at the different attributes

In this section we are going to use a more ‘traditional’ method, where we take a look at the different features and attributes of the dataset and test the correlation between the rating and revenue.

In the end we’re going to combine these attributes and try to create a linear model that can possibly answer what a movie needs to obtain a high rating or revenue.

How does gender representation influence movie revenue?

Let us first take a look at the representation of male and female actors in movies over time.

Male and female actor count for each movie release year

Across all years, there are more male actors than female actors in movies. But has there been no change in the last century? Let us dive into that.

Development of percentage of female actors over time

With a p-value of 0.10 for the linear regression, there has been no significant change in the share of female actors over the past century. But looking at the plot, there seems to be a trend from 1980 onward. Let us look at that.
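A trend test of this kind can be sketched with `scipy.stats.linregress`, which reports the p-value for the null hypothesis that the slope is zero. The data below is synthetic, and the 5% significance threshold is an assumption, as the text does not state one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic yearly share of female actors (in percent): flat level plus noise.
years = np.arange(1920, 2013)
female_pct = 33 + rng.normal(0, 2, size=years.size)

# Fit percentage ~ year; result.pvalue tests whether the slope differs from zero.
result = stats.linregress(years, female_pct)
significant = result.pvalue < 0.05  # conventional threshold (assumption)

print(f"slope={result.slope:.4f}, p-value={result.pvalue:.3f}, significant={significant}")
```

Restricting `years` and `female_pct` to a sub-period (for example 1980 onward) and refitting gives the test for that period alone.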

Development of percentage of female actors over time

With a p-value of 0.00 for the linear regression, there has been a significant increase in the share of female actors since 1980. But does gender even affect the box office revenue and rating of a movie?

Movie box office revenue given as a function of percentage of female actors in a given movie

With a p-value of 0.00, the linear regression for revenue shows a significant decrease in box office revenue as the share of female actors increases. However, with a p-value of 0.06, there is no significant effect on rating.

Difference between male and female actors

We divide the character dataset into a male and a female character dataset. Looking at the attributes, the only relevant one to explore is Actor age at movie release.

Male and female distribution

The plot shows a significant difference in the age of male and female actors: female actors are generally younger than male actors in movies.

How about the original language of a movie?

It comes as no surprise that most movies are in English. But does that mean the movie will get a higher revenue or rating? We divide the dataset into movies with English as the original language and movies with any other original language.

When performing linear regression, we see that the original language has a significant effect on both revenue and rating, but in opposite directions!

Revenue for English and non-English movies

The plot above shows that English-language movies tend to have a higher revenue but a lower rating.

Is release year the magic key?

Let’s now take a look at the release year and how it might be able to help us!

Movie Release Year and Revenue and Rating

The trend line shows a minor decline in TMDB vote average over time with an R-squared of 0.022, indicating that the year of release explains only 2.2% of the variance in ratings. For revenue, the upward trend line suggests a slight increase in log-transformed movie box office revenue over the years, with an R-squared of 0.026, meaning the release year accounts for 2.6% of the revenue variance.
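To make the link between R-squared and "variance explained" concrete, here is a small sketch on synthetic data. For a simple linear regression with one predictor, R-squared equals the squared Pearson correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a weak downward trend in rating over release year plus noise.
year = rng.integers(1930, 2013, size=500).astype(float)
rating = 6.5 - 0.005 * (year - 1930) + rng.normal(0, 1.0, size=500)

# Fit a straight line and compute the share of variance it explains.
slope, intercept = np.polyfit(year, rating, deg=1)
predicted = slope * year + intercept

ss_res = np.sum((rating - predicted) ** 2)       # variance left unexplained
ss_tot = np.sum((rating - rating.mean()) ** 2)   # total variance
r_squared = 1 - ss_res / ss_tot

r = np.corrcoef(year, rating)[0, 1]
print(f"R-squared = {r_squared:.3f}, r squared = {r ** 2:.3f}")
```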

All in all the release year is not the best predictor - let’s keep going!

Movie runtime: is it all about the length?

Let’s start by looking at whether movie runtime has any effect on the revenue or rating!

Movie Runtime and Log Balanced Movie Box Office Revenue

Looking at the plot above, most movies are around 100 minutes long and have a revenue of around 1 million US dollars.

Movie Runtime and

Rating: The regression line on the scatter plot reveals a modest positive correlation between movie runtime and rating, although the low R-squared value of 0.10 from the linear regression suggests that runtime is a weak predictor of a movie’s rating.

Revenue: Again the regression line suggests a modest positive correlation between runtime and revenue, but the R-squared value of 0.05 suggests that it is a weak predictor.

Is success actually just explained by popularity?

We now want to look at the effect the number of votes has on vote average and revenue. The vote counts of movies follow a power law: some movies in our dataset have zero or very few votes, while others have around 35 000 votes. Because of this we log-transform the vote count variable and plot it against the vote average and the log-transformed revenue. As we can see in the two plots below, the correlations between the independent variable and the dependent variables are positive and quite high. This suggests that as the number of votes increases, the average vote increases as well. Likewise, the plot showing log vote count vs log revenue shows a positive correlation. Here the correlation coefficient (r) and the coefficient of determination (R²) are higher, indicating a stronger relationship.

Movie Vote count

Does age matter?

As previously seen, female actors tend to be younger than male actors. Let us see if actor age has an effect on revenue and rating.

Actor Age vs Revenue and Rating

In the rating graph, actor age doesn’t correlate with how audiences rate a film. The relationship is negligibly negative, meaning actor age doesn’t really affect the average voting score of a movie.

Moreover, the revenue graph shows that there’s practically no connection between the ages of actors and how much money a movie makes. The correlation is very small, so age isn’t a good predictor of a film’s financial success.

Connecting the dots - can we make one big linear model?

We now want to investigate how the different attributes in our dataset contribute to a high revenue and rating.

We do this by creating two linear regression models predicting revenue and rating based on basic attributes of the movies. We were able to include the following numerical attributes in our models: Movie runtime, Movie release year, Vote count, Male actor percentage, and Mean actor age. We did this in order to only include relevant predictor variables.

In order to get an overview of the pairwise correlation between the different numerical variables of our model, we created the correlation and pairs plot displayed below. The upper right side of the plot displays the correlation coefficient of every variable against all other variables. The diagonal shows a histogram of each variable. Lastly, the lower left half shows scatter plots of the different variables against each other together with the linear regression line of the data. With this plot the relevant predictor variables can be inspected in terms of their pairwise correlations and distributions.

We want to avoid multicollinearity in our model, so it is crucial that the independent variables are not strongly correlated with each other. Otherwise the integrity of our model’s predictions can be questioned. As observed in the plot, certain variables show a tendency to correlate, such as ‘TMDB vote count’ and ‘Movie revenue’ with a correlation coefficient of 0.57, suggesting a moderate positive relationship, while other variables demonstrate weaker correlations. This can be seen from the close to horizontal trend lines and small correlation coefficients.
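The numbers behind such a plot are a pairwise correlation matrix, which can be computed as sketched below. The data is synthetic and the flagging threshold is a hypothetical value chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300

# Synthetic stand-ins for some of the numerical variables.
log_votes = rng.normal(3, 1, n)
df = pd.DataFrame({
    "log TMDB_vote_count": log_votes,
    "log Movie box office revenue": 0.6 * log_votes + rng.normal(0, 1, n),
    "Movie runtime": rng.normal(100, 15, n),
})

corr = df.corr()  # pairwise Pearson correlation coefficients

# List variable pairs whose absolute correlation exceeds a hypothetical threshold.
THRESHOLD = 0.5
columns = list(corr.columns)
strong_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for i, a in enumerate(columns)
    for b in columns[i + 1:]
    if abs(corr.loc[a, b]) > THRESHOLD
]

print(corr.round(2))
print(strong_pairs)
```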

Correlation and pairs plot of the numerical variables

We also want to add some categorical attributes to our model: Movie country, Movie language, and Movie genre. These categorical variables have a lot of different entries and it would therefore be very computationally expensive to include all countries, languages and genres of our dataset. Therefore, we one-hot-encode only the 10 most frequently occurring values of each variable.
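A top-N one-hot encoding can be sketched as below. The helper `one_hot_top_n` is a hypothetical name, and the sketch assumes one category per movie, which simplifies columns that may hold several genres or countries per movie.

```python
import pandas as pd

def one_hot_top_n(series: pd.Series, n: int = 10, prefix: str = "") -> pd.DataFrame:
    """One-hot-encode only the n most frequent values of a categorical column."""
    top = series.value_counts().nlargest(n).index
    # Values outside the top n become NaN and end up as an all-zero row.
    kept = series.where(series.isin(top))
    return pd.get_dummies(kept, prefix=prefix, dtype=int)

# Toy example with n=2 so the effect is easy to see.
genres = pd.Series(["Drama", "Comedy", "Drama", "Horror", "Comedy", "Drama", "Western"])
encoded = one_hot_top_n(genres, n=2, prefix="genre")
print(encoded)
```

Movies whose category falls outside the top values get zeros in every indicator column, so they act as the baseline group.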

Having defined all the independent variables, we can now formulate our linear regression models. The two models describing the vote average and revenue have the same predictor variables but different target variables as seen below:

Predictor variables: C(Original_language) + Movie_runtime + Movie_release_year + log_vote_count + Male_actor_percentage + Mean_actor_age + C(Movie_countries) + C(Movie_genres)

Target variables: Model 1: Log_revenue; Model 2: Avg_rating

The R-squared value is 0.452 for the revenue model and 0.402 for the average rating model. This means that approximately 45.2% of the variability in movie revenues and 40.2% of the variability in TMDB vote average can be explained by the models’ variables. This is a moderate level of explanatory power. The models also have significant F-statistics, indicating that the variables, as a group, have a statistically significant effect on the respective target variable.
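A model with this kind of formula can be fitted with the formula interface of `statsmodels`, where `C(...)` marks a categorical predictor. The sketch below uses synthetic data and only a subset of the predictors, so the numbers it prints say nothing about the real models.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 400

# Synthetic data with a few of the predictor names from the formula above.
df = pd.DataFrame({
    "Original_language": rng.choice(["en", "fr", "other"], size=n),
    "Movie_runtime": rng.normal(100, 15, n),
    "log_vote_count": rng.normal(3, 1, n),
})
df["Log_revenue"] = (
    0.8 * df["log_vote_count"]
    + 0.5 * (df["Original_language"] == "en")
    + rng.normal(0, 1, n)
)

# Ordinary least squares; C(...) expands the categorical column into dummy variables.
model = smf.ols(
    "Log_revenue ~ C(Original_language) + Movie_runtime + log_vote_count",
    data=df,
).fit()

print(f"R-squared: {model.rsquared:.3f}, F-test p-value: {model.f_pvalue:.3g}")

# Coefficients that are significant at the 5% level, sorted by value.
significant = model.params[model.pvalues < 0.05].sort_values()
print(significant)
```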

Can we still draw some insights from this model?

Even though the overall fit of our model is not great, we can still look at which variables show the highest impact on our target variables. The coefficient plots below show the significant coefficients in our two models, sorted by coefficient value. Variables with bars extending right have a positive effect on the target variable, while those extending left have a negative effect. The length of each bar indicates the strength of the variable’s impact; only coefficients with a p-value < 0.05 are shown.

Coefficient plot for the rating model

The plot above shows the coefficients for the linear regression model on movie ratings. We can see that certain genres such as Family and Action positively influence ratings. The same is evident for certain production countries and languages.

Coefficient plot for the revenue model

The revenue coefficient plot above has a number of positive coefficients. Several of the top 10 genres and countries positively impact revenue. Family films and movies from the United States show the strongest positive relationships. In contrast, South Korean productions and the Adventure genre show a negative relationship with revenue. Non-categorical variables like ‘log_TMDB_vote_count’ significantly influence revenue, indicating a robust relationship between a film’s popularity and its financial success.

Finally, we can conclude that the categorical variables for the 10 most frequent genres, countries and languages show a higher influence on the model than the non-categorical variables.

How can we move to better models?

The attributes associated with our dataset do not explain a large percentage of the variance of our data. One reason this somewhat simple linear regression cannot easily predict movie success in terms of revenue or rating could be that these models only use general information about the movies and not specific details of their storylines. We therefore want to investigate whether knowing the specific storyline of each movie can help us in predicting movie success. As plot summaries are more complex and contain more specific details about the movies, we may be able to extract more distinct features from our dataset. This leads us to the next part of our analysis, where we will try to cluster movies into distinct categories based on their respective plot summaries.

Natural Language Processing - a possible solution?

We have explored the impact of genres on revenue and rating, but as the genres only capture a small part of the movie, we are going to use the plot summaries provided in the dataset. We will now go ahead and use the plot summaries to divide the movies into different clusters and try to find some correlation between the plot of the movies and the revenue and rating.

Bag-of-words | Text Data Magic: From words to cool clusters

We cluster the movies based on the similarity between the words in their plot summaries. To prepare the plot summaries for analysis, we apply stemming, lemmatization, and stopword removal to our text data. Next, we use TF-IDF for text representation. For dimensionality reduction, we keep as many components as needed to retain 95% of the variance. We then employ K-means clustering, choosing k=30 based on the silhouette score and performance considerations, which results in 30 clusters. After filtering out clusters with 100 samples or fewer, we end up with 19 clusters.
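The pipeline above can be sketched with scikit-learn as shown below. This is a simplified illustration on a tiny made-up corpus: stemming and lemmatization are omitted, PCA is assumed as the reduction method (the text does not name one), and the candidate range for k is chosen only to fit the toy data.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Tiny made-up "plot summaries", already lower-cased and cleaned.
summaries = [
    "german soldier attack kill war army",
    "soldier war battle army attack german",
    "army war soldier kill battle",
    "love wedding family marry romance",
    "romance love marry wedding couple",
    "family love couple wedding romance",
    "detective murder police case killer",
    "police detective killer murder investigate",
]

# TF-IDF representation with English stopwords removed.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(summaries)

# Keep as many principal components as needed to retain 95% of the variance.
reduced = PCA(n_components=0.95, svd_solver="full").fit_transform(tfidf.toarray())

# Try several values of k and keep the one with the best silhouette score.
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
    scores[k] = silhouette_score(reduced, labels)

best_k = max(scores, key=scores.get)
print("silhouette scores:", scores, "best k:", best_k)
```

On a corpus of realistic size the dense `toarray()` conversion can become expensive, which is one reason performance may influence the final choice of k and of the reduction method.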

The clusters are formed by similarity between words in the movies’ plot summaries. Therefore, each cluster contains a list of words ranked from most important to least important. A way to represent the cluster, and what words characterize them, is by the word clouds seen below. Each word in the word cloud is represented in the cluster, and the bigger the word, the more significant it is for that cluster. The plot below shows the word clouds for cluster five to eight.

Example of word clouds for clusters 5, 6, 7 and 8

We see that the four clusters shown here have some clear tendencies. Looking at cluster 5, the most significant (biggest) words seem to fall in the same category: german, soldier, attack and kill.

Rating & revenue

But let us circle back to our goal of this analysis: to create a movie with highest possible revenue and best possible rating. Let’s see how these plot summaries can help us with that!

Each cluster contains a number of movies, each with a given revenue and rating. To check if there is a significant difference in revenue and rating between the clusters, we randomly choose 100 movies from each cluster and calculate the mean revenue and rating. We do this 100 times and calculate the average of all the means from all the samples.

The average mean of each cluster for both rating and revenue is plotted below with error bars indicating the 95% confidence interval.
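The resampling procedure for a single cluster can be sketched as follows. Sampling with replacement and a percentile-based 95% interval are assumptions made for this example, since the text does not state how the samples are drawn or how the interval is computed; `bootstrap_cluster_mean` is a hypothetical helper name.

```python
import numpy as np

def bootstrap_cluster_mean(values, sample_size=100, n_rounds=100, seed=0):
    """Average of repeated sample means, with a percentile-based 95% interval."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Draw sample_size values n_rounds times and record each sample mean.
    means = np.array([
        rng.choice(values, size=sample_size, replace=True).mean()
        for _ in range(n_rounds)
    ])
    low, high = np.percentile(means, [2.5, 97.5])
    return means.mean(), (low, high)

# Synthetic ratings for one cluster.
rng = np.random.default_rng(42)
ratings = rng.normal(6.2, 1.0, size=500)

mean_of_means, (ci_low, ci_high) = bootstrap_cluster_mean(ratings)
print(f"mean of means: {mean_of_means:.2f}, 95% interval: [{ci_low:.2f}, {ci_high:.2f}]")
```

Running the helper once per cluster and per target (rating and revenue) gives the points and error bars for a comparison plot; clusters whose intervals do not overlap are candidates for a real difference.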