The sentiment analysis is done using a fairly simple algorithm that I haven’t developed myself; credit goes to Jeffrey Breen.
The algorithm basically assesses the sentiment of a piece of text based on the frequency of positive and negative words according to the lexicons it is fed. We will use Hu and Liu’s sentiment lexicons, which are two lists of positive and negative terms such as ‘good’ and ‘bad’. I generally add a couple of terms depending on the context, bias, etc. Here is why it matters.
One may say ‘The ending of this movie was unpredictable’; here ‘unpredictable’ is positive. However, I cannot include it in my rudimentary list of positive words because its meaning differs depending on the context. ‘Unpredictable’ is negative in the sentence ‘This car is unpredictable’.
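To make the problem concrete, here is a minimal sketch of a word-count scorer. It is not Jeffrey Breen's function, just my simplified reading of the idea, assuming the score is the number of positive matches minus the number of negative matches. The function name simple_score and the tiny lexicons are made up for illustration.

```r
# Simplified word-count scorer: NOT Jeffrey Breen's score.sentiment(),
# only an illustration of the "positives minus negatives" idea.
# `simple_score` and the tiny lexicons below are hypothetical.
simple_score <- function(sentence, pos_words, neg_words) {
  # lower-case, strip punctuation, split on whitespace
  clean <- tolower(gsub("[[:punct:]]", "", sentence))
  words <- unlist(strsplit(clean, "\\s+"))
  # count lexicon hits on each side and take the difference
  sum(words %in% pos_words) - sum(words %in% neg_words)
}

pos_words <- c("good", "great")
neg_words <- c("bad", "awful")

movie <- "The ending of this movie was unpredictable"
car   <- "This car is unpredictable"

# With the plain lexicons, 'unpredictable' is ignored in both sentences
simple_score(movie, pos_words, neg_words)  # 0
simple_score(car, pos_words, neg_words)    # 0

# Adding 'unpredictable' as a positive term helps the movie review
# but wrongly rewards the car review as well
pos_biased <- c(pos_words, "unpredictable")
simple_score(movie, pos_biased, neg_words)  # 1
simple_score(car, pos_biased, neg_words)    # 1
```

A scorer like this has no notion of context, so any term added to a lexicon shifts every sentence that contains it, whether or not that is what you meant.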
Let’s crack on with the code!
First we will need some text. We will pull some tweets as we often do on Lobster Heaven. If you do not know how, you can look at my previous blog post on geo-located tweets.
library("twitteR")
#Authenticate
source("D:/Twitter_OAuth_1_1.R")
#Pull 2000 tweets in English mentioning the term "lobster"
tweets <- searchTwitter("lobster", n=2000, lang="en")
This returns a list; use the twListToDF() function to turn the list of tweets into a data frame.
class(tweets)
[1] "list"
df <- twListToDF(tweets)
class(df)
[1] "data.frame"
Now that we have some text we can run the sentiment analysis algorithm from Jeffrey Breen, which you can find here, or download here. Once you have it, save it as an .R file so that you can source() it in future sessions.
You will also need to load the lexicons of positive and negative terms, which you can find here.
#Load Lexicons & Algorithm
pos <- readLines('D:/positive-words.txt')
neg <- readLines('D:/negative-words.txt')
#Source Algorithm
source('D:/score_sentiment.R')
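One thing to watch for: as far as I know, Hu and Liu's files open with a header block whose lines start with ';'. readLines() keeps those lines (and any blank ones) as if they were terms. They are unlikely to match a word in a tweet, but if you want clean vectors, scan() can skip them. Below is a small sketch that uses a temporary file with made-up contents in place of the real lexicon, so the header format is an assumption on my part.

```r
# Stand-in for a lexicon file: header lines start with ';'
# (the file contents here are made up for illustration)
lexicon_file <- tempfile(fileext = ".txt")
writeLines(c("; Opinion lexicon header", ";", "", "good", "great"), lexicon_file)

# readLines() keeps the header and blank lines as if they were terms
raw_terms <- readLines(lexicon_file)
length(raw_terms)  # 5

# scan() with comment.char = ";" drops the header; blank lines are skipped too
clean_terms <- scan(lexicon_file, what = "character",
                    comment.char = ";", quiet = TRUE)
clean_terms  # "good" "great"
```

If your copies of the files have no such header, readLines() as used above is perfectly fine.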
Once everything is in place you can actually run the score.sentiment() function.
scores <- score.sentiment(df$text, pos, neg, .progress='text')
Now look at the actual scores
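A few base R calls are enough for a first look. The sketch below uses a small made-up data frame (scores_demo) instead of the real output, assuming the result has a numeric score column next to the tweet text; swap in your own scores object.

```r
# Hypothetical stand-in for the data frame returned by score.sentiment()
scores_demo <- data.frame(
  score = c(-2, 0, 0, 1, 3),
  text  = c("tweet a", "tweet b", "tweet c", "tweet d", "tweet e"),
  stringsAsFactors = FALSE
)

# First few rows
head(scores_demo)

# How many tweets fall on each score
table(scores_demo$score)

# Share of negative, neutral and positive tweets
# (scores are whole numbers, so <= -1 is negative, 0 neutral, >= 1 positive)
polarity <- cut(scores_demo$score, breaks = c(-Inf, -1, 0, Inf),
                labels = c("negative", "neutral", "positive"))
prop.table(table(polarity))
```

With real tweets, expect most scores to sit at 0, since many tweets contain no lexicon term at all.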
Finally I would like to illustrate the importance of tweaking the lexicons of positive and negative words.
Here are the results obtained by the initial sentiment analysis.
> summary(scores$score)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-4.00000  0.00000  0.00000  0.02725  1.00000  4.00000
Now I will tweak the lexicon of positive words to intentionally skew (screw) the results.
pos <- readLines('D:/positive-words.txt')
pos <- c(pos, "lobster")
#re-run the scoring function on the same data frame of tweets
scores <- score.sentiment(df$text, pos, neg, .progress='text')
> summary(scores$score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-3.0000  0.0000  1.0000  0.9586  2.0000  6.0000
You can find the code on my git.