
Movies constitute a vast global industry and a significant part of the international entertainment sector. Given the industry’s scale, one might ask: how can you create a successful movie? By analyzing data from the CMU Movie Summary Corpus and The Movie Database (TMDB), we utilize various tools such as linear regression and natural language processing to explore how different factors influence a movie’s box office and rating success. Follow along to discover some interesting dynamics which influence the movie world.

Dataset description

The datasets we are using are the CMU Movie Summary Corpus and TMDB. Before starting to analyse the data, it is important to understand what each column represents and why it matters. The table below gives an overview of the column names.

Importantly we are only going to mention the significant ones and the ones that need some explanation - the full list can be found below by toggling the collapsible section “Details”.

| Column Name | Data Type | Description |
| --- | --- | --- |
| Movie runtime | Float | Duration of the movie in minutes |
| Movie languages | String | Languages the movie has been released in |
| Movie countries | String | Countries where the movie was produced |
| TMDB_original_language | String | Original language of the movie as per TMDb |
| TMDB_vote_average | Float | Average rating for the movie from TMDb |
| TMDB_vote_count | Float | Number of votes the movie received on TMDb |
| Movie box office revenue | Float | Box office revenue of the movie |

All Dataset Columns

| Column Name | Data Type | Description |
| --- | --- | --- |
| Wikipedia Movie ID | Integer | Wikipedia’s unique identifier for the movie |
| Freebase Movie ID | String | Freebase’s unique identifier for the movie |
| Movie release date | String | The release date of the movie |
| Movie runtime | Float | Duration of the movie in minutes |
| Movie languages | String | Languages the movie has been released in |
| Movie countries | String | Countries where the movie was produced |
| Movie genres | String | Genres associated with the movie |
| TMDB_id | Float | The Movie Database (TMDb) unique identifier |
| TMDB_original_language | String | Original language of the movie as per TMDb |
| TMDB_original_title | String | Original title of the movie as per TMDb |
| TMDB_overview | String | Overview or summary of the movie as per TMDb |
| TMDB_popularity | Float | Popularity score of the movie as per TMDb |
| TMDB_release_date | String | Release date of the movie as per TMDb |
| TMDB_title | String | Title of the movie as per TMDb |
| TMDB_vote_average | Float | Average rating for the movie from TMDb |
| TMDB_vote_count | Float | Number of votes the movie received on TMDb |
| TMDB_runtime | Float | Runtime of the movie as per TMDb |
| TMDB_budget | Float | Budget of the movie as per TMDb |
| TMDB_IMDB_id | String | IMDb’s unique identifier for the movie |
| TMDB_genres | String | Genres of the movie according to TMDb |
| Movie box office revenue | Float | Box office revenue of the movie |
| Movie release year | Float | Year of the movie’s release |
| log Movie box office revenue | Float | Logarithm of the movie’s box office revenue |
| log TMDB_vote_count | Float | Logarithm of the number of TMDb votes |
| Male_actor_percentage | Float | Percentage of male actors in the movie |
| Mean_actor_age_at_movie_release | Float | Average age of actors at the time of movie release |
| balanced Movie box office revenue | Float | Balanced box office revenue of the movie |
| log balanced Movie box office revenue | Float | Logarithm of balanced box office revenue |

Taking a first look at the data set, we see that only around 10% of the movies have a box office revenue entry. Therefore, we enrich the data set with data from TMDB. The TMDB data gives us new attributes, such as revenue, original movie language and movie rating, that might be interesting for our analysis. After enriching the data, 12% of the movies have a box office revenue entry.

Data cleaning

Before diving into the data, it needs to be cleaned. Some of the movies have a runtime of 10 hours, or an actor with a height of 3 meters. All entries with such unrealistic attributes are removed before we continue. After removing outliers, the distributions are as shown below.

Character meta dataset before cleaning

Character meta dataset before cleaning

The figures below show the distribution of the numerical variables in the two datasets. Some of the variables look very heavy-tailed and might need to be transformed. The two attributes Movie box office revenue and TMDB_vote_count (the number of votes) have very skewed distributions. Therefore, we log-transform these variables to bring them closer to a normal distribution, as shown on the right.
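A log transformation of this kind could be sketched as follows. The toy values are made up for illustration, and turning zeros and missing entries into NaN (rather than, say, using log(1 + x)) is an assumption, not something the analysis states.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names follow the dataset description.
movies = pd.DataFrame({
    "Movie box office revenue": [1.2e4, 3.5e6, 8.0e7, 1.1e9, np.nan],
    "TMDB_vote_count": [3.0, 120.0, 4500.0, 35000.0, 0.0],
})


def log_transform(series: pd.Series) -> pd.Series:
    """Natural log of strictly positive values; zeros and NaN become NaN."""
    positive = series.where(series > 0)  # assumption: non-positive values are treated as missing
    return np.log(positive)


movies["log Movie box office revenue"] = log_transform(movies["Movie box office revenue"])
movies["log TMDB_vote_count"] = log_transform(movies["TMDB_vote_count"])
print(movies[["log Movie box office revenue", "log TMDB_vote_count"]])
```

Treating zero votes as missing avoids taking the logarithm of zero, at the cost of dropping those movies from any analysis that uses the transformed column.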

Movie box office revenue and TMDB_vote_count before and after log transformation

But what about inflation?

However, the value of the US dollar in 2023 is not what it was decades or even centuries ago. Therefore, we have to take inflation into account. The plot below shows the increase in revenue and the change in the value of the US dollar over the years. (“Inflation rate in the US from 1800 to 2023” OfficialData)

Inflation of the US dollar's effect on revenue

The log Movie box office revenue before and after balancing can be seen in the plot below.

Inflation of the US dollar's effect on revenue

After balancing the revenue attribute, we can now continue our analysis.
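The exact balancing formula is not spelled out here, so the following is only a sketch of one common approach: scaling each revenue by the ratio between a price index in a reference year and the price index in the release year. The function name `balance_revenue`, the reference year of 2023, and the index values are all assumptions; the index numbers are placeholders, not real inflation figures.

```python
import pandas as pd

# Placeholder price index per year (NOT real inflation data).
price_index = pd.Series({1950: 10.0, 1980: 40.0, 2023: 100.0}, name="price_index")

movies = pd.DataFrame({
    "Movie release year": [1950, 1980, 2023],
    "Movie box office revenue": [1_000_000.0, 1_000_000.0, 1_000_000.0],
})


def balance_revenue(df: pd.DataFrame, index: pd.Series, reference_year: int = 2023) -> pd.Series:
    """Express revenue in reference-year dollars (hypothetical helper)."""
    # Ratio of reference-year prices to release-year prices.
    factor = index.loc[reference_year] / df["Movie release year"].map(index)
    return df["Movie box office revenue"] * factor


movies["balanced Movie box office revenue"] = balance_revenue(movies, price_index)
print(movies)
```

With these placeholder numbers, one million dollars earned in 1950 counts as ten million reference-year dollars, which is the kind of correction that keeps older movies comparable with recent ones.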

Taking a peek at the different attributes

In this section we are going to use a more ‘traditional’ method, where we take a look at the different features and attributes of the dataset and test how each of them correlates with rating and revenue.

In the end we’re going to combine these attributes and try to create a linear model that can possibly answer what a movie needs to obtain a high rating or revenue.

How does gender representation influence movie revenue?

Let us first take a look at the representation of male and female actors in movies over time.

Male and female actor count for each movie release year

Across all years, there are more male actors than female actors in movies. But has there been no change in the last century? Let us dive into that.

Development of percentage of female actors over time

With a linear regression p-value of 0.10, there has been no significant change in the percentage of female actors over the past century. But looking at the plot, there seems to be a trend from 1980 onward. Let us look at that.

Development of percentage of female actors over time

With a linear regression p-value that rounds to 0.00, there has been a significant increase in the percentage of female actors since 1980. But does gender even affect the box office revenue and rating of the movie?

Movie box office revenue given as a function of percentage of female actors in a given movie

With a p-value that rounds to 0.00, the linear regression for revenue shows that box office revenue decreases significantly when a movie has more female actors. However, with a p-value of 0.06, there is no significant effect on rating.
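The p-values quoted in this section come from simple linear regressions. A minimal sketch of such a test, using `scipy.stats.linregress` on synthetic data, is shown here. The generated numbers are purely illustrative and do not reproduce the results above; the 0.05 threshold is the conventional one and is assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real columns (illustrative only).
female_pct = rng.uniform(0.0, 100.0, size=300)
log_revenue = 17.0 - 0.02 * female_pct + rng.normal(0.0, 1.0, size=300)

# Ordinary least-squares fit of log revenue on the percentage of female actors.
result = stats.linregress(female_pct, log_revenue)

print(f"slope   = {result.slope:.4f}")
print(f"p-value = {result.pvalue:.3g}")
print(f"R^2     = {result.rvalue ** 2:.3f}")

# The slope is called significant if the p-value is below the chosen threshold.
is_significant = result.pvalue < 0.05
```

The p-value tests the null hypothesis that the slope is zero, so a small value says the trend is unlikely to be noise, not that the effect is large. The R-squared value is what tells us how much of the variance the predictor explains.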

Difference between male and female actors

We divide the character dataset into a male and a female character dataset. Looking at the attributes, the only relevant one to explore is Actor age at movie release.

Male and female distribution

The plot shows that there is a significant difference in the age of male and female actors. It indicates that female actors are generally younger than male actors in movies.

How about the original language of a movie?

It comes as no surprise that most movies are in English. But does that mean the movie will get a higher revenue or rating? We divide the dataset into movies with English as the original language, and movies with another original language.

When performing linear regression, we see that the original language has a significant effect on both revenue and rating - but in opposite directions!

Revenue for English and non-English movies

The plot above shows that English-language movies tend to have a higher revenue, but a lower rating.

Is release year the magic key?

Let’s now take a look at the release year and how it might be able to help us!

Movie Release Year and Revenue and Rating

The trend line shows a minor decline in TMDB vote average over time with an R-squared of 0.022, indicating that the year of release explains only 2.2% of the variance in ratings. For revenue, the upward trend line suggests a slight increase in log-transformed movie box office revenue over the years, with an R-squared of 0.026, meaning the release year accounts for 2.6% of the revenue variance.

All in all the release year is not the best predictor - let’s keep going!

Movie runtime: is it all about the length?

Let’s start by looking at whether movie runtime has any effect on the revenue or rating!

Movie Runtime and Log Balanced Movie Box Office Revenue

Looking at the plot above, most movies are around 100 minutes long and have a revenue of around 1 million US dollars.

Movie Runtime and

Rating: The regression line on the scatter plot reveals a modest positive correlation between movie runtime and rating, although the low R-squared value of 0.10 from the linear regression results suggests that runtime is a weak predictor of a movie’s rating.

Revenue: Again the regression line suggests a modest positive correlation between runtime and revenue, but the R-squared value of 0.05 suggests that it is a weak predictor.

Is success actually just explained by popularity?

We now want to look at the effect the number of votes has on vote average and revenue. The vote counts follow a power-law distribution: some movies in our dataset have zero or very few votes, while others have around 35 000 votes. Because of this we log-transform the vote count variable and plot it against the vote average and the log-transformed revenue. As we can see in the two plots below, the correlations between the independent variable and the dependent variables are positive and quite high. This suggests that as the number of votes increases, the average vote increases as well. Likewise, the plot of log vote count against log revenue shows a positive correlation. Here the correlation coefficient (r) and the coefficient of determination (R²) are higher, indicating a stronger relationship.

Movie Vote count

Does age matter?

As previously seen, female actors tend to be younger than male actors. Let us see if actor age has an effect on revenue and rating.

Actor Age vs Revenue and Rating

In the rating graph, actor age doesn’t correlate with how audiences rate a film. The relationship is negligibly negative, meaning actor age doesn’t really affect the average voting score of a movie.

Moreover, the revenue graph shows that there’s practically no connection between the ages of actors and how much money a movie makes. The correlation is very small, so age isn’t a good predictor of a film’s financial success.

Connecting the dots - can we make one big linear model?

We now want to investigate how the different attributes in our dataset contribute to a high revenue and rating.

We do this by creating two linear regression models predicting revenue and rating based on basic attributes of the movies. We were able to include the following numerical attributes in our models: Movie runtime, Movie release year, Vote count, Male actor percentage, and Mean actor age. We did this in order to only include relevant predictor variables.

In order to get an overview of the pairwise correlation between the different numerical variables of our model, we created the correlation and pairs plot displayed below. The upper right side of the plot displays the correlation coefficient of every variable against all other variables. The diagonal shows a histogram of each variable. Lastly, the lower left half shows scatter plots of the variables against each other together with the linear regression line of the data. With this plot the relevant predictor variables can be inspected in terms of their pairwise correlations and distributions.

We want to avoid multicollinearity in our model, so the independent variables should not be strongly correlated with each other. Otherwise the coefficient estimates become unstable and the integrity of our model can be questioned. As observed in the plot, certain variables show a tendency to correlate, such as ‘TMDB vote count’ and ‘Movie revenue’ with a correlation coefficient of 0.57, suggesting a moderate positive relationship, while other variables demonstrate weaker correlations. This can be seen from the close to horizontal trend lines and small correlation coefficients.
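A quick numerical companion to the pairs plot is the Pearson correlation matrix, from which strongly correlated predictor pairs can be listed. The sketch below uses synthetic data; the helper `correlated_pairs` and the cut-off of 0.5 are assumptions chosen for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
runtime = rng.normal(100.0, 15.0, n)

# Synthetic predictors; TMDB_runtime is built to nearly duplicate Movie runtime.
data = pd.DataFrame({
    "Movie runtime": runtime,
    "TMDB_runtime": runtime + rng.normal(0.0, 2.0, n),
    "log TMDB_vote_count": rng.normal(5.0, 1.5, n),
    "Mean_actor_age_at_movie_release": rng.normal(38.0, 6.0, n),
})

corr = data.corr(method="pearson")


def correlated_pairs(corr: pd.DataFrame, threshold: float = 0.5):
    """Return (column_a, column_b, r) for each pair with |r| >= threshold."""
    columns = list(corr.columns)
    pairs = []
    for i, a in enumerate(columns):
        for b in columns[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


flagged = correlated_pairs(corr)
print(corr.round(2))
print(flagged)
```

A pair flagged this way is a candidate for dropping one of the two variables, since keeping both makes their individual coefficients hard to interpret.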

Correlation and pairs plot of the numerical variables

We also want to add some categorical attributes to our model: Movie country, Movie language, and Movie genre. These categorical variables have a lot of different entries, and it would therefore be very computationally expensive to include all countries, languages and genres of our dataset. Therefore, we one-hot-encode only the 10 most frequently occurring values of each variable.

Having defined all the independent variables, we can now formulate our linear regression models. The two models describing the vote average and revenue have the same predictor variables but different target variables, as seen below:
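Restricting a categorical variable to its most frequent values before one-hot encoding could look like the sketch below. The helper `one_hot_top_n` is hypothetical, and it assumes one value per movie; columns that hold several values per movie (such as multiple genres or countries) would need to be split first.

```python
import pandas as pd


def one_hot_top_n(series: pd.Series, n: int = 10, prefix: str = "") -> pd.DataFrame:
    """One-hot encode only the n most frequent values (hypothetical helper).

    Rows holding any other value get zeros in every indicator column.
    """
    top_values = series.value_counts().head(n).index
    kept = series.where(series.isin(top_values))  # rarer values become NaN
    return pd.get_dummies(kept, prefix=prefix or series.name, dtype=int)


# Toy example with n=2 so that the effect is easy to see.
languages = pd.Series(["en", "en", "en", "fr", "fr", "de"], name="TMDB_original_language")
encoded = one_hot_top_n(languages, n=2)
print(encoded)
```

In this toy example the single German movie falls outside the top two languages, so its row is all zeros and it effectively joins the baseline category.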

Predictor variables: C(Original_language) + Movie_runtime + Movie_release_year + log_vote_count + Male_actor_percentage + Mean_actor_age + C(Movie_countries) + C(Movie_genres)

Target variables: Model1: Log_revenue Model2: Avg_rating

The R-squared value is 0.452 for the revenue model and 0.402 for the average rating model. This means that approximately 45.2% of the variability in movie revenues and 40.2% of the variability in TMDB vote average can be explained by the models’ variables. This is a moderate level of explanatory power. Both models also have significant F-statistics, indicating that the variables, as a group, have a statistically significant effect on the target variables.
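The `C(...)` terms in the predictor list follow the formula notation understood by `statsmodels`, so a model of this shape could be fitted roughly as sketched below. This is an assumption about tooling, shown on synthetic data with a reduced set of predictors; it does not reproduce the R-squared values reported above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300

# Synthetic stand-in data with a subset of the predictors (illustrative only).
df = pd.DataFrame({
    "Original_language": rng.choice(["en", "fr", "ja"], size=n),
    "Movie_runtime": rng.normal(100.0, 15.0, n),
    "log_vote_count": rng.normal(5.0, 1.5, n),
})
df["Log_revenue"] = (
    10.0 + 0.8 * df["log_vote_count"] + 0.01 * df["Movie_runtime"] + rng.normal(0.0, 1.0, n)
)

# C(...) tells the formula parser to treat the column as categorical.
model = smf.ols(
    "Log_revenue ~ C(Original_language) + Movie_runtime + log_vote_count", data=df
).fit()

print(f"R-squared:           {model.rsquared:.3f}")
print(f"F-statistic p-value: {model.f_pvalue:.3g}")

# Coefficients that are significant at the 5% level.
significant = model.params[model.pvalues < 0.05]
print(significant)
```

Swapping the left-hand side of the formula for the rating column would give the second model with the same predictors.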

Can we still draw some insights from this model?

Even though the overall fit of our model is not great, we can still look at which variables show the highest impact on our target variables. The coefficient plots below show the significant coefficients in our two models, sorted by coefficient value. Variables with bars extending right have a positive effect on the target variable, while those extending left have a negative effect. The length of each bar indicates the strength of the variable’s impact, and only coefficients with a p-value < 0.05 are shown.

Coefficients of the rating model

The plot above shows the coefficients for the linear regression model on movie ratings. We can see that certain genres such as Family and Action positively influence ratings. The same is evident for certain production countries and languages.

Coefficients of the revenue model

The revenue coefficient plot above has a number of positive coefficients. A number of the top 10 genres and countries positively impact revenue. Family films and movies from the United States show the strongest positive relationships. In contrast, South Korean productions and the Adventure genre show a negative correlation with revenue. Non-categorical variables like ‘log_TMDB_vote_count’ significantly influence revenue, indicating a robust relationship between a film’s popularity and its financial success.

Finally, we can conclude that the categorical variables for the 10 most frequent genres, countries and languages show a higher influence on the model than the non-categorical variables.

How can we move to better models?

The attributes associated with our dataset do not explain a large percentage of the variance of our data. One reason that this somewhat simple linear regression cannot easily predict movie success in terms of revenue or rating could be that these models only use general information about the movies and not specific details of their storylines. We therefore want to investigate whether knowing the specific storyline of each movie can help us in predicting movie success. As plot summaries are more complex and contain more specific details about the movies, we may be able to extract more distinct features from our dataset. This leads us to the next part of our analysis, where we will try to cluster movies into distinct categories based on their respective plot summaries.

Natural Language Processing - a possible solution?

We have explored the impact of genres on revenue and rating, but as genres only capture a small part of a movie, we now turn to the plot summaries provided in the dataset. We will use them to divide the movies into different clusters and try to find some correlation between the plot of the movies and the revenue and rating.

Bag-of-words | Text Data Magic: From words to 26 cool clusters

We cluster the movies based on the similarity of the words in their plot summaries. To prepare the plot summaries for analysis, we apply stemming, lemmatization, and stopword removal to our text data. Next, we use TF-IDF for text representation. For dimensionality reduction, we keep enough components to retain 95% of the variance. We then employ K-means clustering with k=30, chosen based on the silhouette score and performance considerations, resulting in 30 clusters. After filtering out clusters with 100 samples or fewer, we end up with 19 clusters.

The clusters are formed by similarity between words in the movies’ plot summaries. Therefore, each cluster contains a list of words ranked from most important to least important. One way to represent the clusters, and the words that characterize them, is with the word clouds seen below. Each word in a word cloud is represented in the cluster, and the bigger the word, the more significant it is for that cluster. The plot below shows the word clouds for clusters 5 to 8.

Example of word clouds for clusters 5, 6, 7 and 8

We see that the four clusters shown here have some clear tendencies as well. Looking at cluster 5, the most significant (biggest) words seem to fall in the same category: german, soldier, attack & kill.
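The pipeline described above can be sketched with scikit-learn on a handful of made-up plot summaries. Several choices here are assumptions: `TruncatedSVD` stands in for the unspecified dimensionality reduction, stemming and lemmatization are skipped, the candidate values of k are tiny, and `min_size = 2` replaces the real threshold of 100 so that the toy example still has clusters left.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up plot summaries (illustrative only).
plots = [
    "german soldiers attack the village and kill the resistance",
    "a soldier survives the attack on the german front",
    "the army plans an attack while soldiers guard the bridge",
    "an alien ship invades earth and scientists fight the aliens",
    "aliens land on earth and a scientist builds a weapon",
    "scientists discover an alien signal from deep space",
    "a wedding brings the family together for a weekend",
    "the bride and her family prepare the wedding",
    "a family argues before the wedding of the youngest daughter",
]

# TF-IDF representation with English stopwords removed.
X = TfidfVectorizer(stop_words="english").fit_transform(plots)

# Keep the smallest number of components reaching 95% explained variance.
max_components = min(X.shape) - 1
svd = TruncatedSVD(n_components=max_components, random_state=0)
X_full = svd.fit_transform(X)
cumulative = np.cumsum(svd.explained_variance_ratio_)
n_keep = min(int(np.searchsorted(cumulative, 0.95)) + 1, max_components)
X_reduced = X_full[:, :n_keep]

# Pick k by silhouette score over a small candidate range.
scores = {}
for k in range(2, 5):
    candidate = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    scores[k] = silhouette_score(X_reduced, candidate)
best_k = max(scores, key=scores.get)

# Final clustering, then drop clusters that are too small.
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_reduced)
sizes = np.bincount(labels, minlength=best_k)
min_size = 2  # toy stand-in for the threshold of 100 samples used in the text
kept_clusters = [c for c, size in enumerate(sizes) if size > min_size]
print(best_k, sizes, kept_clusters)
```

On a real corpus the range of k would be much wider, and the silhouette score would be traded off against runtime, as the text describes.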

Rating & revenue

But let us circle back to the goal of this analysis: to create a movie with the highest possible revenue and the best possible rating. Let’s see how these plot summaries can help us with that!

Each cluster contains a number of movies, each with a given revenue and rating. To check whether revenue and rating differ significantly between the clusters, we randomly choose 100 movies from each cluster and calculate their mean revenue and rating. We do this 100 times and calculate the average of the means across all samples.

The average mean of each cluster for both rating and revenue is plotted below with error bars indicating the 95% confidence interval.
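The resampling procedure for a single cluster can be sketched as follows. Whether the draws are made with or without replacement, and how the 95% interval is derived, is not stated above, so this sketch assumes sampling with replacement and a percentile interval over the 100 sample means. The function name `resampled_mean` and the synthetic ratings are hypothetical.

```python
import numpy as np


def resampled_mean(values, n_draws=100, sample_size=100, seed=0):
    """Average of repeated sample means plus a 95% percentile interval.

    Assumptions: draws are made with replacement, and the interval is taken
    from the 2.5th and 97.5th percentiles of the sample means.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.array([
        rng.choice(values, size=sample_size, replace=True).mean()
        for _ in range(n_draws)
    ])
    lower, upper = np.percentile(means, [2.5, 97.5])
    return means.mean(), lower, upper


# Synthetic ratings for one cluster (illustrative only).
cluster_ratings = np.random.default_rng(42).normal(6.5, 1.0, size=500)
estimate, lower, upper = resampled_mean(cluster_ratings)
print(f"mean rating = {estimate:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```

Running the same function once per cluster and per target (rating and revenue) gives the points and error bars for a plot of this kind.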

Average rating and revenue for each cluster

The plots reveal that the confidence intervals for several of the clusters do not overlap, indicating a statistically significant difference in both revenues and ratings among the clusters. In the plot above, the clusters have been arranged in increasing order of both rating and revenue. However, if we arrange the clusters by increasing revenue only, we see that the two attributes don’t go hand in hand. In fact, one of the movie clusters with the highest revenue has the worst ratings (cluster 12).

Average rating and revenue for each cluster

To get a clearer picture of which plot summaries yield a high rating and revenue, we have chosen the top 3 and bottom 3 clusters for both rating and revenue. We then had ChatGPT-4 (OpenAI) create a movie poster for each of the top and bottom clusters based on their most significant words. This visualization helps us understand what kind of movies get a high rating and revenue, and which movies fail to do so.

Top-rated movies - The clusters that yield high quality

Maybe not so surprisingly, we find that the top-rated movies look like classics that most people could put a real movie title to. At the very top we find the war movies, which interestingly seem to be mostly about World War II. A classic western is found in second place, and in third we see the crime dramas that often rely on heavier and more complex plots.

Average rating and revenue for each cluster

Lowest rated movies - The clusters that reviewers slaughter

At this end of the scale we find the movies that did not do well with reviewers. The worst and third worst rated clusters look, respectively, like the classic evil doctor/virus/epidemic that turns into a monster and the alien invasion movies that we have all seen. These plots might be too far-fetched to appeal to the typical reviewer. In second to last place we find movies that seem as if they could be based on video games. One common view of these movies is that they just do not live up to the expectations of the enthusiastic “gamers”.

Average rating and revenue for each cluster

Highest grossing movies - The clusters that bring back money

Again not very surprisingly we see here the posters that most of us probably could imagine hanging in front of the cinemas. These are what we can call the blockbusters. You might get the feeling that you have already seen these films from looking at the posters.

Looking at the highest grossing cluster we recognize the typical action movie, involving some government, a range of armed men and lots of explosions.

In second place we find our typical Sci-Fi films with some kind of threat to the human race, lots of CGI and a very thin plot about saving the world. What is very interesting about this cluster is that it falls into both the top 3 highest grossing clusters and the bottom 3 when looking at ratings. We could imagine this to be caused by high budgets and an over-promising trailer, followed by a story that does not deliver on quality. In third place, it turns out that a classic drama surrounding a wedding apparently draws a lot of viewers.

Average rating and revenue for each cluster

Lowest grossing movies - The worst nightmare for studios

Here we find some of the softer plots. The storylines do not look as obvious as those of the best performing movies, though we can still get the general gist from these posters. Interestingly, we see a lot of women represented in the posters. The plots seem to revolve around families and more wholesome plot-lines than the action and “explosive” themes we saw in the top grossing movies.

Average rating and revenue for each cluster

Combined score - The golden standard

By normalizing the rating and the revenue we have made a combined score to determine the clusters that perform best overall.
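How the two normalized values are combined is not specified, so the sketch below shows just one plausible reading: min-max scaling of each cluster's mean rating and mean log revenue, followed by an equal-weight average. The cluster names and numbers are made up, and both the scaling method and the equal weights are assumptions.

```python
import pandas as pd

# Made-up per-cluster averages (illustrative only).
clusters = pd.DataFrame(
    {"mean_rating": [5.0, 6.0, 7.0], "mean_log_revenue": [17.0, 15.0, 16.0]},
    index=["cluster_a", "cluster_b", "cluster_c"],
)


def min_max(series: pd.Series) -> pd.Series:
    """Scale values linearly to the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())


# Assumption: rating and revenue contribute equally to the combined score.
clusters["combined_score"] = (
    min_max(clusters["mean_rating"]) + min_max(clusters["mean_log_revenue"])
) / 2

best_cluster = clusters["combined_score"].idxmax()
print(clusters)
print(best_cluster)
```

Here the cluster with the top rating and a middling revenue wins, which shows how a combined score can favour balanced clusters over those that excel on only one measure.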

Average rating and revenue for each cluster

From the graph we can see that the confidence intervals overlap a lot. But our ANOVA test shows that the differences between the clusters’ overall scores are indeed significant!
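A one-way ANOVA like the one mentioned above could be run with `scipy.stats.f_oneway`. The sketch uses three synthetic groups of combined scores with made-up means and spreads, so it illustrates the test and not the actual result.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Synthetic combined scores for three clusters (illustrative only).
cluster_scores = {
    "cluster_a": rng.normal(0.40, 0.15, size=150),
    "cluster_b": rng.normal(0.50, 0.15, size=150),
    "cluster_c": rng.normal(0.60, 0.15, size=150),
}

# Null hypothesis: all clusters share the same mean score.
f_stat, p_value = stats.f_oneway(*cluster_scores.values())
print(f"F = {f_stat:.2f}, p-value = {p_value:.3g}")
```

ANOVA tests whether all cluster means are equal at once, so it can be significant even when many individual intervals overlap. It does not say which clusters differ; that would need pairwise follow-up tests.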

Average rating and revenue for each cluster

From this we get these three films that we have already touched upon. According to our research, their clusters deliver the best combined performance when considering both rating and revenue. Not too bad, huh?!

Now - How to create the perfect movie?

Two things to think about: Do you want to make a movie that has a high rating or a high revenue - or perhaps the golden middle ground?

Based on our analysis, to create a movie with high rating you need to focus on these key aspects:

Based on our analysis, to create a movie with high revenue, these are the focus points:

Now for the final reveal - a movie which captures both elements, here are the takeaways!

Our exploration into the data behind successful movies reveals valuable insights into what makes a great movie. The project has dived into how attributes like characters, genre, and plot summaries contribute to a movie’s success. We found that the genre of the movie was the most impactful factor when it came to the success of a movie. With the tools from this analysis, we are ready to become successful movie makers!