Sentiment Analysis on Movie Reviews

Contents: Overview | Data | Scraping Implementation | Results | Data Cleaning | Descriptive Analysis | Model | Evaluation | IMDb Review | Rotten Tomatoes Test Performance | Insightful Story - Divergence of Critics and Audience | Output Product - Time Series Sentiment Monitor Index
Big Picture:
Predict both the users' and the critics' sentiment attitudes towards a movie from their movie reviews.
Assumption
People's sentiment can be captured from their rating points.
Intuitively, a five-star rating reflects a very positive attitude, while a one-star rating indicates a negative preference.
Data Source
Professional voice: Rotten Tomatoes' critic reviews
Mass voice: IMDb's user reviews
Model
GloVe - word vector representation
Bidirectional LSTM neural network - deeply learning the text
Output
obtain the reviews and rating of each movie via web scraping
Selenium is a web-browser automation tool: it allows you to open a browser and perform tasks as a human would, such as clicking buttons.
For this task, we apply Selenium to expand truncated reviews via the load-more button, open up reviews hidden behind spoiler warnings, and turn to the next page.
random sleep
sleep for a random interval between requests to avoid bot detection
```python
from selenium import webdriver
import random
import time

driver = webdriver.Chrome('./chromedriver')
# sleep for a random interval to avoid bot detection
time.sleep(0.5 + random.random())
```
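The expand-and-click loop described above can be sketched as follows; the CSS selector, function name, and click cap are illustrative assumptions, not the project's actual values:

```python
import random
import time

def expand_all_reviews(driver, selector="button.ipl-load-more__button",
                       max_clicks=50, delay=(0.5, 1.5)):
    """Click the 'Load More' button until it disappears.

    `driver` is a Selenium WebDriver; the CSS selector here is a
    hypothetical example of the load-more button on a review page.
    Returns the number of clicks performed.
    """
    clicks = 0
    for _ in range(max_clicks):
        buttons = driver.find_elements("css selector", selector)
        if not buttons:
            break  # no load-more button left: all reviews are expanded
        buttons[0].click()
        clicks += 1
        # random sleep to avoid bot detection
        time.sleep(random.uniform(*delay))
    return clicks
```

The locator string `"css selector"` is the same value as Selenium's `By.CSS_SELECTOR` constant, so this is equivalent to the `driver.find_elements(By.CSS_SELECTOR, ...)` style.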
scrape the IMDb and Rotten Tomatoes webpages:
IMDb
general information on each film (rating, cast, Metascore, etc.)
user reviews with corresponding ratings, sorted by helpfulness to keep the effective ones
Rotten Tomatoes
critic reviews with their ratings
We choose English-language films released from January 2014 to June 2018 with at least 1,000 votes on IMDb. In total, we collect 2,846 films with 99,994 user reviews from IMDb (JSON, 118.3 MB) and 1,860 films with 67,888 critic reviews from Rotten Tomatoes (JSON, 22.8 MB).
Eliminate outlier observations containing None-type values
Text process
apply regex to normalize words (e.g., "I'm" to "I am") and drop irrelevant parts (e.g., "[Spoiler Warning]")
```python
# use re for regex-based text cleaning
import re
```
Note: because word embeddings and the RNN already handle word forms in context, traditional word-reduction techniques such as stemming or lemmatization are not needed for the text conversion.
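A minimal sketch of the regex cleaning step; the contraction map and the bracket pattern are illustrative, not the project's full rule set:

```python
import re

# illustrative contraction map; the real project may use a longer list
CONTRACTIONS = {
    r"\bI'm\b": "I am",
    r"\bdon't\b": "do not",
    r"\bit's\b": "it is",
}

def clean_review(text):
    """Expand contractions and drop irrelevant markers like spoiler tags."""
    for pattern, repl in CONTRACTIONS.items():
        text = re.sub(pattern, repl, text)
    # drop bracketed markers such as "[Spoiler Warning]"
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # collapse repeated whitespace left over from the substitutions
    return re.sub(r"\s+", " ", text).strip()
```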
Reviews
Rotten Tomatoes critic reviews are typically short summary comments of two or three sentences.
IMDb user reviews are longer paragraphs, often several hundred words.
Film Rating
for IMDb, divide the 1-10 rating into five types,
transferring the task into a multi-class classification machine learning problem
for Rotten Tomatoes, pre-training showed
that the LSTM could not learn the 5-type classification well...
so we simplify the task: relaxed to a 3-type classification, the model performs well
thus the rating is divided into three types: negative, neutral, and positive
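The bucketing can be sketched as below; the exact bin edges are assumptions, since the write-up only states five types for IMDb's 1-10 scale and three types for Rotten Tomatoes:

```python
def imdb_class(rating):
    """Map an IMDb 1-10 rating onto five sentiment classes
    (0 = most negative, 4 = most positive).
    Pairing adjacent rating points is an assumed binning."""
    return min((int(rating) - 1) // 2, 4)

def rt_class(score, lo=2.5, hi=3.5):
    """Map a normalized 0-5 Rotten Tomatoes score onto three types;
    the thresholds are illustrative."""
    if score < lo:
        return 'negative'
    if score > hi:
        return 'positive'
    return 'neutral'
```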
GloVe
GloVe is an unsupervised learning algorithm for obtaining vector representations for words.
It originated from the Stanford NLP Lab. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the model captures linear substructures of the word vector space.
In a nutshell, it transfers high-dimensional word indices into a compressed lower-dimensional vector space while preserving word-word relationships.
In practice, we use the Common Crawl pretrained model (840B tokens, 2.2M vocab, cased, 300-d vectors, 2.03 GB download).
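Loading the pretrained vectors amounts to parsing GloVe's whitespace-separated text format, one token per line followed by its components. A sketch (the file name comes from the Common Crawl download):

```python
import numpy as np

def load_glove(lines):
    """Parse GloVe's text format into a {token: vector} dict."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings

# typical usage:
# with open('glove.840B.300d.txt', encoding='utf-8') as f:
#     embeddings = load_glove(f)
```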
Bidirectional LSTM
Recurrent Neural Network - RNN
An RNN captures the dependency structure of sequential data: the current state is determined not only by the current input but also by its historical state information.
Long Short-Term Memory - LSTM RNN
LSTM is a modification of the traditional RNN, theoretically capable of modeling long-term dependencies and empirically more effective on sequential data in practice.
It largely solves the RNN's vanishing/exploding gradient problem, in which the current state stops depending on very early states during training, so the network fails to capture long-term dependency (e.g., when "reading" a paragraph, the model captures only the last sentence but forgets the first).
Bidirectional LSTM
Addresses the direction issue: it also learns representations from future time steps, which better captures the context of the sentence and eliminates ambiguity.
Training Procedure
Dataset Split
divide the data into three components: training set, validation set, and test set
use the training and validation sets to train the model coefficients and tune hyperparameters; use the test set to evaluate the model's generalization ability
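A minimal sketch of the three-way split; the 80/10/10 proportions and the seed are assumptions, not values stated in the write-up:

```python
import random

def train_val_test_split(data, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle, then slice into training / validation / test sets."""
    data = list(data)
    random.Random(seed).shuffle(data)  # deterministic for a fixed seed
    n = len(data)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test
```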
Model Architecture
An LSTM is much more complex than classical machine-learning models and has more hyperparameters to tune.
We pre-train the model and choose a good parameter set based on validation performance, specifically validation-set accuracy.
The best model for IMDb reviews is a two-layer LSTM (300, 150), while for Rotten Tomatoes a one-layer bidirectional LSTM (128) is better at hedging against overfitting.
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 160, 300)          6000300
_________________________________________________________________
batch_normalization_1 (Batch (None, 160, 300)          1200
_________________________________________________________________
bi_lstm_1 (Bidirectional)    (None, 160, 600)          1442400
_________________________________________________________________
bi_lstm_2 (Bidirectional)    (None, 300)               901200
_________________________________________________________________
batch_normalization_2 (Batch (None, 300)               1200
_________________________________________________________________
dropout_1 (Dropout)          (None, 300)               0
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 1505
=================================================================
Total params: 8,347,805
Trainable params: 2,346,305
Non-trainable params: 6,001,500
_________________________________________________________________
```
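The printed summary can be reproduced with a model along these lines. The vocabulary size of 20,001 is implied by the embedding parameter count (20,001 × 300 = 6,000,300), and freezing the GloVe embedding yields the non-trainable total; the dropout rate, optimizer, and loss are assumptions:

```python
from keras import Input, Model
from keras.layers import (Embedding, BatchNormalization, Bidirectional,
                          LSTM, Dropout, Dense)

inputs = Input(shape=(160,), dtype='int32')         # reviews padded to 160 tokens
x = Embedding(20001, 300, trainable=False)(inputs)  # frozen GloVe vectors
x = BatchNormalization()(x)
x = Bidirectional(LSTM(300, return_sequences=True))(x)
x = Bidirectional(LSTM(150))(x)
x = BatchNormalization()(x)
x = Dropout(0.5)(x)                                 # rate is an assumption
outputs = Dense(5, activation='softmax')(x)         # five rating classes

model = Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

Counting parameters of this model recovers the totals above: 8,347,805 overall, of which the frozen embedding plus the batch-norm moving statistics account for the 6,001,500 non-trainable ones.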
Regularization
Deep learning is prone to overfitting, so during training a set of regularization methods (dropout and batch normalization in the architecture above, plus stopping at the best validation epoch) is implemented to secure out-of-sample performance.
Implementation
The model was implemented with Keras, a high-level deep learning API written in Python.
```python
from keras.preprocessing.text import Tokenizer  # text parsing
from keras.models import Sequential
from keras.layers import (Embedding, BatchNormalization, Bidirectional,
                          LSTM, Dropout, Dense)
```
The training procedure was monitored with TensorBoard.
For the IMDb model, the best checkpoint is reached at epoch 7 (index 6), with validation accuracy around 65%, F1 score 0.63, precision 0.63, and recall 0.64.
For the Rotten Tomatoes model, the best checkpoint is likewise reached at epoch 7 (index 6), with validation-set accuracy 0.6269.
| | IMDb User | Rotten Tomatoes Critic |
|---|---|---|
| Accuracy | 0.6213 | 0.6216 |
| F1 | 0.5851 | 0.6010 |
| Precision | 0.5907 | 0.6049 |
| Recall | 0.5846 | 0.6045 |
The F1, precision, and recall metrics use the macro-average adjustment adapted for multi-class classification evaluation.
The IMDb model's out-of-sample performance is slightly worse than on the validation set, while the Rotten Tomatoes model maintains the same prediction level. Both models show good generalization ability and should be robust in use.
The multi-class prediction accuracy is at the 60% level, far better than random guessing (20% and 33%, respectively).
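Macro averaging computes precision, recall, and F1 per class and then takes an unweighted mean across classes; a plain-Python sketch:

```python
def macro_metrics(y_true, y_pred):
    """Per-class precision/recall/F1, averaged with equal class weight."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

This matches scikit-learn's `precision_score` / `recall_score` / `f1_score` with `average='macro'`.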
Sentiment tracks the rating well and captures additional nuance.
(In the scatter plot, circle radius reflects the value of gross revenue.)
Consider the two points with a high IMDb general rating and high critic sentiment but low mass sentiment:
```python
df_general[(df_general['predicted_sentiment'] <= 0.5) &
           (df_general['critics_sentiment'] >= 1.5) &
           (df_general['rate'] >= 7)]
```
| | name | certificate | description | genre | link | meta_score | rate | Budget | Open | Studio | ... | Western | famous_director | famous_actor | Release_date | Release_Year | Release_Month | Release_Day | famous_studio | predicted_sentiment | critics_sentiment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 42 | Star Wars: The Last Jedi | PG-13 | Rey develops her newly discovered abilities wi... | Action, Adventure, Fantasy | https://www.imdb.com/title/tt2527336/?ref_=adv... | 85.0 | 7.2 | NaN | 220009584.0 | Walt Disney Pictures,Lucasfilm,Ram Bergman Pro... | ... | 0 | 0.0 | 0.0 | 15-Dec-17 | 15.0 | 12.0 | 2017.0 | 1.0 | 0.358974 | 1.748768 |
| 160 | Star Wars: The Force Awakens | PG-13 | Three decades after the Empire's defeat, a new... | Action, Adventure, Fantasy | https://www.imdb.com/title/tt2488496/?ref_=adv... | 81.0 | 8.0 | 20.416667 | 247966675.0 | Lucasfilm,Bad Robot,Truenorth Productions | ... | 0 | 0.0 | 0.0 | 18-Dec-15 | 18.0 | 12.0 | 2015.0 | 0.0 | 0.073171 | 1.773684 |
It turns out to be the new Star Wars series!
Scrape all IMDb reviews (with or without a rating) at daily frequency, compute the individual sentiment of each review, and average into an integrated index ranging from 1 to 5.
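The monitor index can be sketched as a per-day average of each review's predicted class, shifted by one to map classes 0-4 onto the 1-5 star range; the input format here is a hypothetical simplification:

```python
from collections import defaultdict

def daily_sentiment_index(reviews):
    """Average per-review predicted classes (0-4) into a daily index on
    the 1-5 scale. `reviews` is an iterable of (date, predicted_class)."""
    by_day = defaultdict(list)
    for date, pred in reviews:
        by_day[date].append(pred + 1)  # shift classes 0-4 to stars 1-5
    return {day: sum(v) / len(v) for day, v in sorted(by_day.items())}
```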