Sentiment Analysis on Movie Review


 

Sentiment Analysis on Movie Review
Overview
Data
Scraping Implementation
Results
Data Cleaning
Descriptive Analysis
Model
Evaluation
IMDb Review
Rotten Tomatoes
Test Performance
Insightful Story - Divergence of Critics and Audience
Output Product - Time Series Sentiment Monitor Index

 

Overview

 

Data

Reviews and ratings for each movie are obtained by web scraping.

Scraping Implementation

 

Results

Both the IMDb and Rotten Tomatoes web pages are scraped.

Selected films were released between January 2014 and June 2018, are in English, and have at least 1,000 votes on IMDb. In total, we collected 2,846 films with 99,994 user reviews from IMDb (JSON, 118.3 MB) and 1,860 films with 67,888 critic reviews from Rotten Tomatoes (JSON, 22.8 MB).
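The selection criteria can be expressed as a simple filter. The sketch below is only illustrative: the record layout and the field names (`release_date`, `language`, `imdb_votes`) are assumptions, not the scraper's actual schema.

```python
from datetime import date

# Selection window and vote threshold taken from the criteria above.
START, END = date(2014, 1, 1), date(2018, 6, 30)
MIN_VOTES = 1000

def keep_film(film: dict) -> bool:
    """Return True if a film record meets all three selection criteria."""
    return (
        START <= film["release_date"] <= END
        and film["language"] == "English"
        and film["imdb_votes"] >= MIN_VOTES
    )

# Hypothetical records, for illustration only.
films = [
    {"title": "A", "release_date": date(2015, 3, 1), "language": "English", "imdb_votes": 2500},
    {"title": "B", "release_date": date(2013, 12, 31), "language": "English", "imdb_votes": 9000},
    {"title": "C", "release_date": date(2017, 7, 4), "language": "English", "imdb_votes": 999},
]
selected = [f["title"] for f in films if keep_film(f)]
```

Here only film "A" passes: "B" falls before the window and "C" has too few votes.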

Data Cleaning

  1. Drop observations with missing (None) values.

  2. Text processing

    Apply regular expressions to expand contractions (e.g. "I'm" to "I am") and to drop irrelevant parts (e.g. "[Spoiler Warning]").

    Note: given the word embeddings and the capacity of the RNN, traditional word-reduction techniques such as stemming or lemmatization are not needed when converting the text.
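The two cleaning steps above might look roughly like the following. This is a minimal sketch, not the project's actual code: the contraction map is a small illustrative subset, and the spoiler-tag pattern assumes the marker always appears in square brackets.

```python
import re

# Illustrative contraction rules (assumed subset); order matters, so the
# specific "can't" rule runs before the generic "n't" rule.
CONTRACTIONS = {
    r"\bI'm\b": "I am",
    r"\bcan't\b": "cannot",
    r"n't\b": " not",
    r"'re\b": " are",
}
# Bracketed markers such as "[Spoiler Warning]" (assumed format).
SPOILER_TAG = re.compile(r"\[\s*spoiler[^\]]*\]", flags=re.IGNORECASE)

def clean_review(text):
    """Return the cleaned review text, or None when the review is missing."""
    if text is None:
        return None
    text = SPOILER_TAG.sub(" ", text)
    for pattern, replacement in CONTRACTIONS.items():
        text = re.sub(pattern, replacement, text)
    # Collapse the whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()

raw = [
    "[Spoiler Warning] I'm sure you can't guess the ending.",
    None,
    "They're great and it isn't long.",
]
# Step 1: drop missing observations. Step 2: clean the remaining text.
cleaned = [c for c in (clean_review(r) for r in raw) if c is not None]
```

No stemming or lemmatization is applied, in line with the note above.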

Descriptive Analysis

Model

 

Evaluation

IMDb Review

The best model is reached at epoch 7 (index 6), with a validation accuracy of about 65%, an F1 score of 0.63, precision of 0.63, and recall of 0.64.

[Figure: lstm_loss]

[Figure: lstm_acc]

Rotten Tomatoes

The best model is also reached at epoch 7 (index 6), with a validation accuracy of 0.6269.

[Figure: critic_acc]

Test Performance

             IMDb User    Rotten Tomatoes User
Accuracy     0.6213       0.6216
F1           0.5851       0.6010
Precision    0.5907       0.6049
Recall       0.5846       0.6045

The F1, precision, and recall figures are macro-averaged, as is appropriate for multi-class classification.
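For reference, macro-averaging computes each metric per class and then takes the unweighted mean across classes. The sketch below shows one way to do this in plain Python; the function name `macro_scores` and the toy labels are illustrative and not taken from the project. scikit-learn's `average="macro"` option follows the same definition.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) for multi-class labels."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Undefined ratios (empty denominators) are treated as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy three-class example (made-up labels, not project data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall, f1 = macro_scores(y_true, y_pred)
```

Because every class carries equal weight, a rare class that is predicted poorly pulls the macro scores down as much as a common one would.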

The IMDb model's out-of-sample accuracy is slightly below its validation accuracy (0.6213 versus about 65%), while the Rotten Tomatoes model holds essentially the same level (0.6216 versus 0.6269). Both models generalize well and should be robust in use.

Multi-class prediction accuracy is around 60%, far better than random guessing (20% and 33%, respectively).

 

Insightful Story - Divergence of Critics and Audience

Divergence - Positive Critic Reviews yet Negative Audience Response!

 

Output Product - Time Series Sentiment Monitor Index

Scrape all IMDb reviews (with or without a rating) at daily frequency, compute the sentiment of each review, and average the results into an integrated index ranging from 1 to 5.
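The aggregation step can be sketched as a group-by-day average. The example below assumes each review has already been scored by the model on the 1 to 5 scale; the `(date, score)` input format and the function name `daily_sentiment_index` are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def daily_sentiment_index(scored_reviews):
    """Average per-review sentiment scores (1 to 5) by calendar day."""
    by_day = defaultdict(list)
    for day, score in scored_reviews:
        by_day[day].append(score)
    # Sort by date so the result reads as a time series.
    return {day: mean(scores) for day, scores in sorted(by_day.items())}

# Hypothetical model outputs: (review date, predicted sentiment from 1 to 5).
scored = [
    (date(2018, 6, 1), 4),
    (date(2018, 6, 1), 2),
    (date(2018, 6, 2), 5),
]
index = daily_sentiment_index(scored)
```

A plain mean keeps the index on the same 1 to 5 scale as the individual predictions; weighting or smoothing across days would be a separate design choice.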