Machine Learning

Sentiment Analysis Series 1 (15-min reading)


What is Sentiment Analysis?

Sentiment analysis (also known as opinion mining or emotion AI) is essentially the process of determining the emotional tone behind a series of words, used to gain an understanding of the attitudes, opinions, and emotions expressed within an online mention.

Why should I care about it?

The applications of sentiment analysis are endless and extremely powerful!

The Obama administration used sentiment analysis to gauge public opinion on policy announcements and campaign messages ahead of the 2012 presidential election.

Companies monitor social media to track customer reviews, survey responses, and competitors. The finance industry uses it to predict stock prices by understanding customer sentiment towards certain brands.

Sentiment analysis is in demand because of its efficiency. Thousands of text documents can be processed for sentiment (and other features including named entities, topics, themes, etc.) in seconds, compared to the hours it would take a team of people to manually complete.

What is the magic behind it?

There is no magic. It’s math and statistics. In this case, it’s the Naive Bayes classifier.

What is Naive Bayes Classification?

Naive Bayes classification is a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.

Why is it called “Naive”?

As explained above, all naive Bayes classifiers are based on the independent-feature assumption. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.

Naive Bayes is efficient and highly scalable.

It requires a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by expensive iterative approximation as used for many other types of classifiers.

Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector x = (x1, …, xn) of n features (independent variables), it assigns to this instance the conditional probability p(Ck | x1, …, xn) for each of K possible outcomes or classes Ck. Using Bayes’ theorem, this probability decomposes as

p(Ck | x) = p(Ck) p(x | Ck) / p(x)

which can be translated into English as

posterior = (prior × likelihood) / evidence

Multinomial Naive Bayes

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial (p1, p2, …, pn), where pi is the probability that event i occurs (or K such multinomials in the multiclass case).
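Concretely, the standard likelihood of observing a count vector x = (x1, …, xn) under this event model, where pki is the probability of event i under class Ck, is:

```latex
p(\mathbf{x} \mid C_k) = \frac{\left(\sum_{i=1}^{n} x_i\right)!}{\prod_{i=1}^{n} x_i!} \prod_{i=1}^{n} p_{ki}^{\,x_i}
```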


Bernoulli Naive Bayes

In the multivariate Bernoulli event model, features are independent booleans (binary variables) describing inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term-occurrence features are used rather than term frequencies.
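The corresponding standard Bernoulli likelihood, where each xi ∈ {0, 1} marks whether term i is present, is:

```latex
p(\mathbf{x} \mid C_k) = \prod_{i=1}^{n} p_{ki}^{\,x_i} \,(1 - p_{ki})^{\,1 - x_i}
```

Unlike the multinomial model, absent terms (xi = 0) explicitly contribute the factor 1 − pki, so this model penalizes the non-occurrence of terms.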


My model is based on a simple model from this blog on sentiment analysis as a starting point. In the model, I use a movie review corpus from NLTK with reviews categorized into two categories: positive reviews and negative reviews. I simply started with three simple naive Bayes classifiers as a baseline, with boolean word presence as the extracted features. I then evaluated the models based on their accuracy, recall, and precision.


Word Feature Extraction

I extracted word features to build the training set. For this problem, I used a simplified bag-of-words model and mapped each individual word (feature name) to a boolean value (feature value).
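A sketch of this extraction (the helper name `word_features` is mine, not necessarily the original code's):

```python
def word_features(words):
    """Map each word in a tokenized document to the boolean True.

    Words absent from the dict are simply missing features, which is how
    the simplified bag-of-words model represents non-occurrence here.
    """
    return {word: True for word in words}

# Example: a tokenized movie review
print(word_features(["a", "great", "movie"]))
# {'a': True, 'great': True, 'movie': True}
```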


3/4 of the total records, or 1,500 records, were used as the training set, and the remaining 500 records were used as the testing set to evaluate the models.
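The split can be sketched like this (here `labeled_docs` is placeholder data standing in for the 2,000 labeled feature sets built from the NLTK movie review corpus):

```python
# Placeholder for the 2,000 (featureset, label) pairs from the corpus
labeled_docs = [({"word%d" % i: True}, "pos" if i % 2 == 0 else "neg")
                for i in range(2000)]

split = len(labeled_docs) * 3 // 4   # 3/4 of 2,000 = 1,500
train_set = labeled_docs[:split]     # 1,500 training records
test_set = labeled_docs[split:]      # 500 testing records

print(len(train_set), len(test_set))  # 1500 500
```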

Naive Bayes Classifier 
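A minimal sketch of the baseline classifier using NLTK's `NaiveBayesClassifier` (the toy training data below is mine; the actual post trains on the 1,500 movie review feature sets):

```python
from nltk.classify import NaiveBayesClassifier

# Toy stand-in for the movie review training set
train_set = [
    ({"great": True, "movie": True}, "pos"),
    ({"awful": True, "movie": True}, "neg"),
    ({"great": True, "acting": True}, "pos"),
    ({"boring": True, "plot": True}, "neg"),
]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify({"great": True}))   # 'pos'
classifier.show_most_informative_features(3)
```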

Multinomial Naive Bayes 
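A plausible sketch, assuming NLTK's `SklearnClassifier` wrapper around scikit-learn's `MultinomialNB` (the toy training data is mine):

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

train_set = [
    ({"great": True, "movie": True}, "pos"),
    ({"awful": True, "movie": True}, "neg"),
    ({"great": True, "acting": True}, "pos"),
    ({"boring": True, "plot": True}, "neg"),
]

# SklearnClassifier converts NLTK-style feature dicts into sklearn's format
mnb_classifier = SklearnClassifier(MultinomialNB())
mnb_classifier.train(train_set)
print(mnb_classifier.classify({"great": True, "movie": True}))   # 'pos'
```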

Bernoulli Naive Bayes Classifier 
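The Bernoulli variant can be sketched the same way, swapping in scikit-learn's `BernoulliNB` (again, toy data of my own):

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB

train_set = [
    ({"great": True, "movie": True}, "pos"),
    ({"awful": True, "movie": True}, "neg"),
    ({"great": True, "acting": True}, "pos"),
    ({"boring": True, "plot": True}, "neg"),
]

# BernoulliNB binarizes feature values and also models term absence
bnb_classifier = SklearnClassifier(BernoulliNB())
bnb_classifier.train(train_set)
print(bnb_classifier.classify({"great": True, "movie": True}))   # 'pos'
```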

Evaluation And Conclusion

For evaluation, I used accuracy as the core metric for the model. In addition, precision and recall are also used, as they provide great insight into biases.
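For reference, the three metrics can be computed from prediction counts like this (a generic sketch, not the post's actual evaluation code):

```python
def evaluate(predicted, actual, positive_label="pos"):
    """Return (accuracy, precision, recall) for the positive class."""
    tp = sum(1 for p, a in zip(predicted, actual)
             if p == positive_label and a == positive_label)
    fp = sum(1 for p, a in zip(predicted, actual)
             if p == positive_label and a != positive_label)
    fn = sum(1 for p, a in zip(predicted, actual)
             if p != positive_label and a == positive_label)
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)

    accuracy = correct / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of selected, how many correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual, how many found
    return accuracy, precision, recall

acc, prec, rec = evaluate(["pos", "pos", "neg", "neg"],
                          ["pos", "neg", "neg", "pos"])
print(acc, prec, rec)   # 0.5 0.5 0.5
```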

The results are shown below.

As seen, multinomial naive Bayes and Bernoulli naive Bayes performed significantly better than the regular naive Bayes, with an accuracy around 80 percent. I am quite amazed by this result, considering human sentiment analysis accuracy is around 80 percent too. It’s also surprising that “avoids” is one of the most informative features.


According to the precision and recall, 98% of positive reviews are identified by the model. On the other hand, 96% of the reviews selected as negative are correct.

Flaws and Next Steps

  1. I didn’t filter out stop words for this model, and it’s good practice, in general, to remove noise like stop words.
  2. This model uses a bag-of-words model, which treats each word as an individual object and ignores the context of words. It can fail badly in some specific cases. For example, “not bad” does not equal “bad”. I will apply an N-gram model in the next version.
  3. For the word features, I used boolean values as the feature values. I am interested in using term frequencies along with TF-IDF to see how the model performs.
  4. In addition, I plan to create a hybrid model that consists of all the models, having each model vote and take the majority votes as the result, which might improve the accuracy as well.
  5. Eventually, I plan to extract social media newsfeeds (Twitter, for example) and apply this model to get an idea of people’s opinions.


This is my first blog post. Would you please give me feedback on one of the following topics:

  1. Am I communicating my thoughts clearly? 
  2. Are there any better ways to improve the model?
  3. How do you like the length of this article?  

Stay tuned for the second blog post, and let’s see how I can improve the model. 🙂







