Sentiment Analysis on Movie Reviews

Contents: Overview | Data | Scraping Implementation | Results | Data Cleaning | Descriptive Analysis | Model | Evaluation | IMDb Review | Rotten Tomatoes Test Performance | Insightful Story - Divergence of Critics and Audience | Output Product - Time Series Sentiment Monitor Index
Big Picture:
Predict both the users' and the critics' sentiment attitudes towards a movie from their movie reviews.
Assumption
People's sentiment can be captured from their rating points.
Intuitively, a five-star rating reflects a very positive attitude, while a one-star rating indicates a negative preference.
Data Source
Professional voice: Rotten Tomatoes' critic reviews
Mass voice: IMDb's user reviews
Model
GloVe - word vector representation
Bidirectional LSTM neural network - deeply learning the text
Output
obtain the reviews and rating of each movie via web scraping
Selenium is a web-browser automation tool: it allows you to open a browser and perform tasks as a human would, such as clicking buttons.
For this task, we apply Selenium to expand truncated reviews via the load-more button, open up reviews hidden behind spoiler warnings, and turn to the next page.
random sleep
sleep for a random interval between requests to avoid bot detection
```python
from selenium import webdriver
import random
import time

driver = webdriver.Chrome('./chromedriver')
# sleep for a random interval to avoid bot detection
time.sleep(0.5 + random.random())
```
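The expand-and-click loop described above can be sketched as follows; the CSS selector, function name, and click cap are illustrative assumptions, not the project's actual values:

```python
import random
import time

def expand_all_reviews(driver, selector="button.ipl-load-more__button",
                       max_clicks=50, delay=(0.5, 1.5)):
    """Click the 'Load More' button until it disappears.

    `driver` is a Selenium WebDriver; the CSS selector here is a
    hypothetical example of the load-more button on a review page.
    Returns the number of clicks performed.
    """
    clicks = 0
    for _ in range(max_clicks):
        buttons = driver.find_elements("css selector", selector)
        if not buttons:
            break  # no load-more button left: all reviews are expanded
        buttons[0].click()
        clicks += 1
        # random sleep to avoid bot detection
        time.sleep(random.uniform(*delay))
    return clicks
```

The locator string `"css selector"` is the same value as Selenium's `By.CSS_SELECTOR` constant, so this is equivalent to the `driver.find_elements(By.CSS_SELECTOR, ...)` style.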
scrape the IMDb and Rotten Tomatoes webpages:
IMDb
general information on each film (rating, cast, Metascore, etc.)
user reviews with corresponding ratings, sorted by helpfulness to keep the effective ones
Rotten Tomatoes
critic reviews with their ratings
We choose English-language films released from January 2014 to June 2018 with at least 1,000 votes on IMDb. In total, we collect 2,846 films with 99,994 user reviews from IMDb (JSON, 118.3 MB) and 1,860 films with 67,888 critic reviews from Rotten Tomatoes (JSON, 22.8 MB).
Eliminate outlier observations containing None-type values
Text process
apply regex to normalize words (e.g., "I'm" to "I am") and drop irrelevant parts (e.g., "[Spoiler Warning]")
```python
# use re for regex-based text cleaning
import re
```
Note: because word embeddings and the RNN already handle word forms in context, traditional word-reduction techniques such as stemming or lemmatization are not needed for the text conversion.
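A minimal sketch of the regex cleaning step; the contraction map and the bracket pattern are illustrative, not the project's full rule set:

```python
import re

# illustrative contraction map; the real project may use a longer list
CONTRACTIONS = {
    r"\bI'm\b": "I am",
    r"\bdon't\b": "do not",
    r"\bit's\b": "it is",
}

def clean_review(text):
    """Expand contractions and drop irrelevant markers like spoiler tags."""
    for pattern, repl in CONTRACTIONS.items():
        text = re.sub(pattern, repl, text)
    # drop bracketed markers such as "[Spoiler Warning]"
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # collapse repeated whitespace left over from the substitutions
    return re.sub(r"\s+", " ", text).strip()
```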
Reviews
Rotten Tomatoes critic reviews are typically short summary comments of two or three sentences.
IMDb user reviews are longer paragraphs, often several hundred words.
Film Rating
for IMDb, divide the 1-10 rating into five types,
transferring the task into a multi-class classification machine learning problem
for Rotten Tomatoes, pre-training showed
that the LSTM could not learn the 5-type classification well...
so we simplify the task: relaxed to a 3-type classification, the model performs well
thus the rating is divided into three types: negative, neutral, and positive
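The bucketing can be sketched as below; the exact bin edges are assumptions, since the write-up only states five types for IMDb's 1-10 scale and three types for Rotten Tomatoes:

```python
def imdb_class(rating):
    """Map an IMDb 1-10 rating onto five sentiment classes
    (0 = most negative, 4 = most positive).
    Pairing adjacent rating points is an assumed binning."""
    return min((int(rating) - 1) // 2, 4)

def rt_class(score, lo=2.5, hi=3.5):
    """Map a normalized 0-5 Rotten Tomatoes score onto three types;
    the thresholds are illustrative."""
    if score < lo:
        return 'negative'
    if score > hi:
        return 'positive'
    return 'neutral'
```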
GloVe
GloVe is an unsupervised learning algorithm for obtaining vector representations for words.
It originated from the Stanford NLP Lab. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the model captures linear substructures of the word vector space.
In a nutshell, it transfers high-dimensional word indices into a compressed lower-dimensional vector space while preserving word-word relationships.
In practice, we use the Common Crawl pretrained model (840B tokens, 2.2M vocab, cased, 300-d vectors, 2.03 GB download).
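Loading the pretrained vectors amounts to parsing GloVe's whitespace-separated text format, one token per line followed by its components. A sketch (the file name comes from the Common Crawl download):

```python
import numpy as np

def load_glove(lines):
    """Parse GloVe's text format into a {token: vector} dict."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings

# typical usage:
# with open('glove.840B.300d.txt', encoding='utf-8') as f:
#     embeddings = load_glove(f)
```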
Bidirectional LSTM
Recurrent Neural Network - RNN
An RNN captures the dependency structure of sequential data: the current state is determined not only by the current input but also by its historical state information.
Long Short-Term Memory - LSTM RNN
LSTM is a modification of the traditional RNN, theoretically capable of modeling long-term dependencies and empirically more effective on sequential data in practice.
It largely solves the RNN's vanishing/exploding gradient problem, in which the current state stops depending on very early states during training, so the network fails to capture long-term dependency (e.g., when "reading" a paragraph, the model captures only the last sentence but forgets the first).
Bidirectional LSTM
Addresses the direction issue: it also learns representations from future time steps, which better captures the context of the sentence and eliminates ambiguity.
Training Procedure
Dataset Split
divide the data into three components: training set, validation set, and test set
use the training and validation sets to train the model coefficients and tune hyperparameters; use the test set to evaluate the model's generalization ability
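A minimal sketch of the three-way split; the 80/10/10 proportions and the seed are assumptions, not values stated in the write-up:

```python
import random

def train_val_test_split(data, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle, then slice into training / validation / test sets."""
    data = list(data)
    random.Random(seed).shuffle(data)  # deterministic for a fixed seed
    n = len(data)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test
```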
Model Architecture
An LSTM is much more complex than classical machine-learning models and has more hyperparameters to tune.
We pre-train the model and choose a good parameter set based on validation performance, specifically validation-set accuracy.
The best model for IMDb reviews is a two-layer LSTM (300, 150), while for Rotten Tomatoes a one-layer bidirectional LSTM (128) is better at hedging against overfitting.
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 160, 300)          6000300
_________________________________________________________________
batch_normalization_1 (Batch (None, 160, 300)          1200
_________________________________________________________________
bi_lstm_1 (Bidirectional)    (None, 160, 600)          1442400
_________________________________________________________________
bi_lstm_2 (Bidirectional)    (None, 300)               901200
_________________________________________________________________
batch_normalization_2 (Batch (None, 300)               1200
_________________________________________________________________
dropout_1 (Dropout)          (None, 300)               0
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 1505
=================================================================
Total params: 8,347,805
Trainable params: 2,346,305
Non-trainable params: 6,001,500
_________________________________________________________________
```
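The printed summary can be reproduced with a model along these lines. The vocabulary size of 20,001 is implied by the embedding parameter count (20,001 × 300 = 6,000,300), and freezing the GloVe embedding yields the non-trainable total; the dropout rate, optimizer, and loss are assumptions:

```python
from keras import Input, Model
from keras.layers import (Embedding, BatchNormalization, Bidirectional,
                          LSTM, Dropout, Dense)

inputs = Input(shape=(160,), dtype='int32')         # reviews padded to 160 tokens
x = Embedding(20001, 300, trainable=False)(inputs)  # frozen GloVe vectors
x = BatchNormalization()(x)
x = Bidirectional(LSTM(300, return_sequences=True))(x)
x = Bidirectional(LSTM(150))(x)
x = BatchNormalization()(x)
x = Dropout(0.5)(x)                                 # rate is an assumption
outputs = Dense(5, activation='softmax')(x)         # five rating classes

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

Counting parameters of this model recovers the totals above: 8,347,805 overall, of which the frozen embedding plus the batch-norm moving statistics account for the 6,001,500 non-trainable ones.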
Regularization
Deep learning is prone to overfitting, so during training a set of regularization methods (dropout and batch normalization in the architecture above, plus stopping at the best validation epoch) is implemented to secure out-of-sample performance.
Implementation
The model was implemented with Keras, a high-level deep learning API written in Python.
```python
from keras.preprocessing.text import Tokenizer  # text parsing
from keras.models import Sequential
from keras.layers import (Embedding, BatchNormalization, Bidirectional,
                          LSTM, Dropout, Dense)
```
The training procedure was monitored with TensorBoard.
For the IMDb model, the best checkpoint is reached at epoch 7 (index 6), with validation accuracy around 65%, F1 score 0.63, precision 0.63, and recall 0.64.
For the Rotten Tomatoes model, the best checkpoint is likewise reached at epoch 7 (index 6), with validation-set accuracy 0.6269.
| | IMDb User | Rotten Tomatoes Critic |
|---|---|---|
| Accuracy | 0.6213 | 0.6216 |
| F1 | 0.5851 | 0.6010 |
| Precision | 0.5907 | 0.6049 |
| Recall | 0.5846 | 0.6045 |
The F1, precision, and recall metrics use the macro-average adjustment adapted for multi-class classification evaluation.
The IMDb model's out-of-sample performance is slightly worse than on the validation set, while the Rotten Tomatoes model maintains the same prediction level. Both models show good generalization ability and should be robust in use.
The multi-class prediction accuracy is at the 60% level, far better than random guessing (20% and 33%, respectively).
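Macro averaging computes precision, recall, and F1 per class and then takes an unweighted mean across classes; a plain-Python sketch:

```python
def macro_metrics(y_true, y_pred):
    """Per-class precision/recall/F1, averaged with equal class weight."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

This matches scikit-learn's `precision_score` / `recall_score` / `f1_score` with `average='macro'`.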
Sentiment tracks the rating well and captures additional nuance.
(In the scatter plot, circle radius reflects the value of gross revenue.)
Consider the two points with a high IMDb general rating and high critic sentiment but low mass sentiment:
```python
df_general[(df_general['predicted_sentiment'] <= 0.5) &
           (df_general['critics_sentiment'] >= 1.5) &
           (df_general['rate'] >= 7)]
```
| | name | certificate | description | genre | link | meta_score | rate | Budget | Open | Studio | ... | Western | famous_director | famous_actor | Release_date | Release_Year | Release_Month | Release_Day | famous_studio | predicted_sentiment | critics_sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42 | Star Wars: The Last Jedi | PG-13 | Rey develops her newly discovered abilities wi... | Action, Adventure, Fantasy | https://www.imdb.com/title/tt2527336/?ref_=adv... | 85.0 | 7.2 | NaN | 220009584.0 | Walt Disney Pictures,Lucasfilm,Ram Bergman Pro... | ... | 0 | 0.0 | 0.0 | 15-Dec-17 | 15.0 | 12.0 | 2017.0 | 1.0 | 0.358974 | 1.748768 |
| 160 | Star Wars: The Force Awakens | PG-13 | Three decades after the Empire's defeat, a new... | Action, Adventure, Fantasy | https://www.imdb.com/title/tt2488496/?ref_=adv... | 81.0 | 8.0 | 20.416667 | 247966675.0 | Lucasfilm,Bad Robot,Truenorth Productions | ... | 0 | 0.0 | 0.0 | 18-Dec-15 | 18.0 | 12.0 | 2015.0 | 0.0 | 0.073171 | 1.773684 |
It turns out to be the new Star Wars series!
Scrape all IMDb reviews (with or without a rating) at daily frequency, compute the individual sentiment of each review, and average into an integrated index ranging from 1 to 5.
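The monitor index can be sketched as a per-day average of each review's predicted class, shifted by one to map classes 0-4 onto the 1-5 star range; the input format here is a hypothetical simplification:

```python
from collections import defaultdict

def daily_sentiment_index(reviews):
    """Average per-review predicted classes (0-4) into a daily index on
    the 1-5 scale. `reviews` is an iterable of (date, predicted_class)."""
    by_day = defaultdict(list)
    for date, pred in reviews:
        by_day[date].append(pred + 1)  # shift classes 0-4 to stars 1-5
    return {day: sum(v) / len(v) for day, v in sorted(by_day.items())}
```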