R, Text Mining

Sentiment Analysis in R

The sentiment analysis uses a fairly simple algorithm that I did not develop myself; credit goes to Jeffrey Breen.

The algorithm assesses the sentiment of a piece of text by counting the positive and negative words it contains, according to the lexicons it is fed. We will use Hu and Liu’s sentiment lexicons, which are two lists of positive and negative terms such as ‘good’ and ‘bad’. I generally add a couple of terms depending on the context, possible bias, etc. Here is why it matters.

One may say ‘The ending of this movie was unpredictable’; here ‘unpredictable’ is positive. However, I cannot include it in my rudimentary list of positive words because its meaning depends on the context: ‘unpredictable’ is negative in the sentence ‘This car is unpredictable’.
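To make the idea concrete, here is a minimal sketch of lexicon-based scoring in base R. It is an illustration only, not Jeffrey Breen's actual code: the function name score_text() and the tiny lexicons are made up for this example. The score is simply the number of positive matches minus the number of negative matches.

```r
# Minimal sketch of lexicon-based scoring (an illustration, not Breen's exact code).
# score_text() and the toy lexicons below are hypothetical names and data.
score_text <- function(sentences, pos_words, neg_words) {
  vapply(sentences, function(sentence) {
    # Strip punctuation, control characters and digits, then lower-case
    clean <- gsub("[[:punct:]]", "", sentence)
    clean <- gsub("[[:cntrl:]]", "", clean)
    clean <- gsub("[[:digit:]]+", "", clean)
    clean <- tolower(clean)
    # Split into words on whitespace
    words <- unlist(strsplit(clean, "\\s+"))
    # Score = number of positive matches minus number of negative matches
    sum(words %in% pos_words) - sum(words %in% neg_words)
  }, numeric(1), USE.NAMES = FALSE)
}

# Toy lexicons: 'unpredictable' is deliberately listed as positive
pos_toy <- c("good", "great", "unpredictable")
neg_toy <- c("bad", "awful")

toy_scores <- score_text(
  c("The ending of this movie was unpredictable, great!",
    "This car is unpredictable and the brakes are bad."),
  pos_toy, neg_toy)
toy_scores  # 2 0
```

In the second sentence ‘unpredictable’ still earns a point even though it reads as a complaint, which is exactly why context-dependent words are risky additions to either list.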

Let’s crack on with the code!

First we will need some text. We will pull some tweets, as we often do on Lobster Heaven. If you do not know how to, you can look at my previous blog post on geo-located tweets.



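#Pull 2000 tweets in English mentioning the term "lobster"
#(searchTwitter() comes from the twitteR package and needs an authenticated session)
library(twitteR)
tweets <- searchTwitter("lobster", n=2000, lang="en")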

This returns a list; use the twListToDF() function to turn the list of tweets into a data frame.

 > class(tweets)
 [1] "list"
 > df <- twListToDF(tweets)
 > class(df)
 [1] "data.frame"

Now that we have some text we can run Jeffrey Breen’s sentiment analysis algorithm, which you can find here or download here. Once you have it, save it as an .R file so that you can source() it in future sessions.

You will also need to load the lexicons of positive and negative terms, which you can find here.

#Load Lexicons & Algorithm
pos <- readLines('D:/positive-words.txt')
neg <- readLines('D:/negative-words.txt')
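One thing worth checking: the Hu and Liu files, as commonly distributed, start with a header block whose lines begin with ';'. If your copies include it, readLines() will keep those header lines (and any blank lines) as terms. The sketch below, which assumes that layout and uses a made-up temporary file in place of the real lexicon, shows how scan() with comment.char can skip them.

```r
# Stand-in for a lexicon file; the contents here are invented for illustration
lexicon_file <- tempfile(fileext = ".txt")
writeLines(c("; header line that should not become a term",
             ";",
             "",
             "good",
             "great"), lexicon_file)

# readLines() keeps every line, header and blank lines included
raw_terms <- readLines(lexicon_file)

# scan() with comment.char = ";" drops the header; blank lines are skipped too
clean_terms <- scan(lexicon_file, what = "character",
                    comment.char = ";", quiet = TRUE)

length(raw_terms)  # 5
clean_terms        # "good" "great"
```

The stray header lines rarely match a real word, so readLines() usually works in practice, but the cleaner vectors make it easier to inspect and extend the lexicons.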

#Source Algorithm

Once everything is in place, you can run the score.sentiment() function:

scores <- score.sentiment(df$text, pos, neg, .progress='text')

Now look at the actual scores
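A quick way to inspect the result is to tabulate it. The sketch below uses an invented data frame (scores_toy) as a stand-in for the output of score.sentiment(), assuming only that it has a score column as used in this post; swap in your own scores object.

```r
# Toy stand-in for the data frame returned by score.sentiment(); values are invented
scores_toy <- data.frame(score = c(-1, 0, 0, 1, 2, 0),
                         text  = c("a", "b", "c", "d", "e", "f"),
                         stringsAsFactors = FALSE)

# Frequency of each score value
score_counts <- table(scores_toy$score)
print(score_counts)

# Share of texts scored negative (< 0), neutral (= 0) and positive (> 0)
polarity <- cut(scores_toy$score, breaks = c(-Inf, -1, 0, Inf),
                labels = c("negative", "neutral", "positive"))
print(prop.table(table(polarity)))
```

Calling hist() on the score column gives the same picture graphically.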



Finally I would like to illustrate the importance of tweaking the lexicons of positive and negative words.

Here are the results obtained by the initial sentiment analysis.

 > summary(scores$score)
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 -4.00000  0.00000  0.00000  0.02725  1.00000  4.00000

Now I will tweak the lexicon of positive words to intentionally skew (screw) the results.

 pos <- readLines('D:/positive-words.txt')
 pos <- c(pos, "lobster")

 #re-run the scoring function
 scores <- score.sentiment(df$text, pos, neg, .progress='text')
 > summary(scores$score)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -3.0000  0.0000  1.0000  0.9586  2.0000  6.0000
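The jump in the mean is what one would expect: every tweet was retrieved by searching for "lobster", so adding that term to the positive list hands nearly every tweet an extra point. Here is a toy sketch of the effect, using a hypothetical compact scorer (count_score) and invented sentences rather than real tweets.

```r
# Hypothetical compact scorer: positive matches minus negative matches per text
count_score <- function(texts, pos_words, neg_words) {
  word_lists <- strsplit(tolower(texts), "[^a-z]+")
  sapply(word_lists, function(w) sum(w %in% pos_words) - sum(w %in% neg_words))
}

# Invented example texts, all mentioning the search term
texts_toy <- c("Lobster for dinner", "Bad lobster roll", "Great lobster bisque")
pos_toy <- c("good", "great")
neg_toy <- c("bad")

before <- count_score(texts_toy, pos_toy, neg_toy)
after  <- count_score(texts_toy, c(pos_toy, "lobster"), neg_toy)

before          # 0 -1  1
after           # 1  0  2
after - before  # 1  1  1
```

Each text gains exactly one point per mention of the added term, regardless of what the text actually says.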

You can find the code on my git.



