Sentiment analysis

Python enjoys a thriving ecosystem, particularly for machine learning and natural language processing (NLP): nltk, textblob and pattern make a good starter toolkit for experimenting.

  • nltk provides interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.
  • TextBlob provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification and translation, and is built on top of nltk and pattern.
  • pattern provides tools for data mining (Google, Twitter and Wikipedia API, a web crawler, an HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization.

I have settled on PyCharm, where the library packages first have to be installed with pip (File → Settings → Project → Project Interpreter → pip).

from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    # Simplified bag of words: every word in the list becomes a feature set to True.
    return {word: True for word in words}

positive_vocabulary = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)', 'love']
negative_vocabulary = ['bad', 'terrible', 'useless', 'hate', ':(', 'horrible']
neutral_vocabulary = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']

# Wrap each vocabulary word in a list: passing the bare string would yield one feature per character.
positive_features = [(word_feats([pos]), 'pos') for pos in positive_vocabulary]
negative_features = [(word_feats([neg]), 'neg') for neg in negative_vocabulary]
neutral_features = [(word_feats([neu]), 'neu') for neu in neutral_vocabulary]

training_set = negative_features + positive_features + neutral_features

classifier = NaiveBayesClassifier.train(training_set)

# Predict: classify every word of the sentence and count the votes.
# Words outside the vocabularies ('movie,', 'i', 'loved', 'it') fall back to the most frequent class, 'neu'.
neg = 0
pos = 0
sentence = "Great movie, I loved it"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    classResult = classifier.classify(word_feats([word]))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1

print('Positive: ' + str(float(pos) / len(words)))
print('Negative: ' + str(float(neg) / len(words)))


Positive: 0.2
Negative: 0.0

Process finished with exit code 0

Fixed dictionaries of positive, negative and neutral words like these are rarely used outside programming exercises. In practice, a naive Bayes classifier learns a score for each word from labeled training data: words with negative scores signal negative sentiment, words with positive scores signal positive sentiment, and words with scores near zero can be thought of as neutral. But it suffices to “grok” the basics. In its simplest form, classification takes two steps: training and prediction. The training phase needs labeled training data; the classifier then uses what it learned from that data to make predictions. In the example above:

  • Define three classes: positive, negative and neutral, where each class is defined by a vocabulary.
  • Convert every word into a feature using a simplified bag of words model.
  • Concatenate the three feature lists into the training set: training_set = negative_features + positive_features + neutral_features
  • Train the classifier: classifier = NaiveBayesClassifier.train(training_set)
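
To make the idea of per-word scores more concrete, here is a minimal sketch in plain Python. It assumes a tiny, hand-made set of labeled sentences (hypothetical, not from any real corpus) and scores each word by its smoothed log-likelihood ratio between the two classes, which is roughly the quantity a two-class naive Bayes classifier adds up per word. The name word_score is made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, hand-made training sentences; a real corpus would be far larger.
labeled_sentences = [
    ("great movie with fantastic actors", "pos"),
    ("i love the sound", "pos"),
    ("terrible movie with horrible actors", "neg"),
    ("i hate the sound", "neg"),
]

# Count how often each word occurs under each label.
counts = {"pos": Counter(), "neg": Counter()}
for sentence, label in labeled_sentences:
    counts[label].update(sentence.split())

vocabulary = set(counts["pos"]) | set(counts["neg"])


def word_score(word):
    """Return log(P(word | pos) / P(word | neg)) with add-one smoothing."""
    p_pos = (counts["pos"][word] + 1) / (sum(counts["pos"].values()) + len(vocabulary))
    p_neg = (counts["neg"][word] + 1) / (sum(counts["neg"].values()) + len(vocabulary))
    return math.log(p_pos / p_neg)


# Print the words from most negative to most positive score.
for word in sorted(vocabulary, key=word_score):
    print(f"{word:10s} {word_score(word):+.2f}")
```

Words that occur only in positive sentences (such as "great") end up with a score above zero, words that occur only in negative sentences (such as "hate") end up below zero, and words that are equally common in both (such as "movie") score zero.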

TextBlob comes with a ready-made sentiment analyzer:

from textblob import TextBlob
simple_text = TextBlob("And as it turns out, the Cambridge Analytica case was just the tip of an iceberg. Cambridge Analytica was part of a much bigger company, SCL, which had worked as a defence contractor for governments and militaries around the world, then branched into elections in developing countries, and, only in its final iteration, entered western politics. More than 100 election campaigns in over 30 countries spanning five continents have been influenced by SCL. Quite the track record.")
# The sentiment property holds the polarity and subjectivity scores.
print(simple_text.sentiment)


Sentiment(polarity=0.1, subjectivity=0.6)

Process finished with exit code 0

Polarity rates the tone of the input text on a scale from -1 to +1, with -1 being the most negative and +1 the most positive. Subjectivity rates how subjective the statement is on a scale from 0 to 1, with 0 being fully objective and 1 highly subjective.
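
As a rough illustration of how to read these two numbers, the sketch below turns a polarity/subjectivity pair into coarse labels. The function name describe_sentiment, the 0.1 polarity threshold and the 0.5 subjectivity cut-off are arbitrary assumptions for this example, not part of TextBlob; with TextBlob the inputs would come from blob.sentiment.polarity and blob.sentiment.subjectivity.

```python
def describe_sentiment(polarity, subjectivity, threshold=0.1):
    """Map TextBlob-style scores to coarse labels (thresholds are arbitrary)."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie between -1 and +1")
    if not 0.0 <= subjectivity <= 1.0:
        raise ValueError("subjectivity must lie between 0 and 1")

    # Scores close to zero are treated as neutral.
    if polarity > threshold:
        tone = "positive"
    elif polarity < -threshold:
        tone = "negative"
    else:
        tone = "neutral"

    # 0.5 is used here as a simple midpoint between objective and subjective.
    stance = "subjective" if subjectivity >= 0.5 else "objective"
    return tone, stance


# Scores reported for the Cambridge Analytica text above.
print(describe_sentiment(0.1, 0.6))
# Hypothetical scores for a strongly negative, opinionated text.
print(describe_sentiment(-0.8, 0.9))
```

With these assumed thresholds, the Cambridge Analytica text counts as neutral in tone but subjective in stance. Where exactly to draw the line depends on the application.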

The scores may be off. For example, sarcasm is not only hard to detect, but may throw off the results enormously:

from textblob import TextBlob
simple_text = TextBlob("If you find me offensive, I suggest you quit finding me.")
print(simple_text.sentiment)

Results in:

Sentiment(polarity=0.0, subjectivity=0.0)

Process finished with exit code 0