Description
I used the Sentiment Analyzer model to perform binary classification on the Amazon Reviews Dataset. Before running inference, I performed the following pre-processing steps:
- truncate input at 500 chars
- strip stopwords
- strip corrupt utf8 chars (iso-8859-1 chars)
- stemming to root words
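
For anyone wanting to reproduce the steps above, here is a minimal Python sketch. It is not the code I actually ran: the stopword list is a tiny hypothetical one, `crude_stem` is a simple suffix stripper standing in for a real stemmer (e.g. Porter), "corrupt chars" is assumed to mean anything outside ASCII, and the order of the steps is also an assumption.

```python
import re

MAX_CHARS = 500
# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "it", "this"}
# Crude suffix list standing in for a real stemmer such as Porter.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII (assumed meaning of 'corrupt')."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def crude_stem(word):
    """Remove the first matching suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(review):
    """Truncate, strip non-ASCII chars, lowercase and tokenize, drop stopwords, stem."""
    text = strip_non_ascii(review[:MAX_CHARS])
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]


tokens = preprocess("This movie was amazing and the actors played well")
```

Truncating first keeps the later steps cheap, at the cost of possibly cutting a word in half at the 500-character boundary.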
The following are inference results:
Accuracy: 49.64325%
Precision: 0.497469903015904
Recall: 0.701445
F1 Score: 0.5821059947510917
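
As a sanity check on these numbers, the sketch below shows how the four metrics relate to each other. The function names and the toy confusion-matrix counts are hypothetical and only for illustration; the only values taken from this issue are the reported precision and recall.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1) from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy counts, purely hypothetical, to show the call shape.
toy_accuracy, toy_precision, toy_recall, toy_f1 = classification_metrics(tp=7, fp=3, tn=6, fn=4)

# Recompute F1 from the precision and recall reported above.
reported_f1 = f1_from(0.497469903015904, 0.701445)
print(round(reported_f1, 4))  # 0.5821
```

The recomputed value agrees with the F1 score listed above, so the precision, recall and F1 figures are at least internally consistent.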
I also compared the accuracy of TextAnalysis' model (pretrained on the IMDB dataset) with that of a logistic regression model in sklearn, trained on 12000 reviews from the Amazon Reviews training set. The sklearn model scored 46.47175% accuracy.
To improve the Sentiment Analyzer's accuracy, I think part-of-speech tagging could be implemented. However, it is currently very time-consuming to run, taking up to 24 hours to pre-process 10000 reviews (the entire test set has 400000 samples), which made it infeasible to test during Google Code-in!